How do we go about breaking up a string into tokens? We need to come up with a
way to compare some characters in a string against a set of rules. But "rules"
is a very general term - we could, for instance, define a particular
token that is a Fibonacci number - 1, 2, 3, 5, and so on would be marked
as a "Fibonacci number", while all other numbers would be marked as just
a regular number. To support that, our rules would get pretty complex, and
checking those rules against particular strings would become equally complex.
Fortunately, we're not insane. We observe that the rules for tokens in practice
are fairly simple - one or more digits is an integer, a few letters together
are a variable name. In order to be able to efficiently break text up into
such tokens, we restrict ourselves to __regular languages__. A language
is defined as a set of strings (potentially infinite), and a regular
language is one for which we can write a __regular expression__ to check if
a string is in the set. Regular expressions are a way of representing
patterns that a string has to match. We define regular expressions
as follows:
* Any character is a regular expression that matches that character. Thus,
\\(a\\) is a regular expression (from now on shortened to regex) that matches
the character 'a', and nothing else.
* \\(r_1r_2\\), or the concatenation of \\(r_1\\) and \\(r_2\\), is
a regular expression that matches anything matched by \\(r_1\\), followed
by anything that matches \\(r_2\\). For instance, \\(ab\\) matches
the character 'a' followed by the character 'b' (thus matching "ab").
* \\(r_1|r_2\\) matches anything that is either matched by \\(r_1\\) or
\\(r_2\\). Thus, \\(a|b\\) matches the character 'a' or the character 'b'.
* \\(r_1?\\) matches either an empty string, or anything matched by \\(r_1\\).
* \\(r_1+\\) matches one or more things matched by \\(r_1\\). So,
\\(a+\\) matches "a", "aa", "aaa", and so on.
* \\((r_1)\\) matches anything that matches \\(r_1\\). This is mostly used
to group things together in more complicated expressions.
* \\(.\\) matches any character.
More powerful variations of regex also include an "any of" operator, \\([c_1c_2c_3]\\),
which is equivalent to \\(c_1|c_2|c_3\\), and a "range" operator, \\([c_1-c_n]\\), which
matches all characters in the range between \\(c_1\\) and \\(c_n\\), inclusive.
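To get a feel for these rules, here is a minimal sketch that tries one pattern per rule using C++'s `std::regex`. This is only a stand-in for the notation above: `std::regex` uses the ECMAScript flavor by default, and the names `Example`, `matches_whole`, and `operator_examples` are made up for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// One pattern, one input, and whether we expect the whole input to match.
struct Example {
    std::string pattern;
    std::string text;
    bool expected;
};

// Returns true if the entire string `text` is matched by `pattern`.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// A positive and a negative example for each rule in the list above.
std::vector<Example> operator_examples() {
    return {
        {"a", "a", true},        {"a", "b", false},       // single character
        {"ab", "ab", true},      {"ab", "a", false},      // concatenation
        {"a|b", "b", true},      {"a|b", "c", false},     // alternation
        {"a?", "", true},        {"a?", "aa", false},     // optional
        {"a+", "aaa", true},     {"a+", "", false},       // one or more
        {"(ab)+", "abab", true}, {"(ab)+", "aba", false}, // grouping
        {".", "x", true},        {".", "", false},        // any character
        {"[abc]", "c", true},    {"[abc]", "d", false},   // "any of"
        {"[a-c]", "b", true},    {"[a-c]", "d", false},   // range
    };
}
```

Calling `matches_whole(e.pattern, e.text)` for every entry `e` of `operator_examples()` should agree with `e.expected`. One caveat of this particular flavor: `.` does not match line terminators such as `'\n'`.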
Let's see some examples. An integer, such as 326, can be represented with \\([0-9]+\\).
This means one or more characters between 0 and 9. Most regex implementations
have a special symbol for \\([0-9]\\), written as \\(\\setminus d\\). A variable,
starting with a lowercase letter and containing lowercase or uppercase letters after it,
can be written as \\(\[a-z\]([a-zA-Z]+)?\\). Again, most regex implementations provide
a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
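The two examples can be checked with a short sketch, again using `std::regex` as a stand-in. The variable pattern follows the prose description (a lowercase letter, then any mix of lowercase and uppercase letters), and the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// An integer: one or more digits.
bool is_integer(const std::string& s) {
    static const std::regex pattern("[0-9]+");
    return std::regex_match(s, pattern);
}

// The same rule written with the \d shorthand for [0-9].
bool is_integer_shorthand(const std::string& s) {
    static const std::regex pattern("\\d+");
    return std::regex_match(s, pattern);
}

// A variable: a lowercase letter, optionally followed by more letters.
bool is_variable(const std::string& s) {
    static const std::regex pattern("[a-z]([a-zA-Z]+)?");
    return std::regex_match(s, pattern);
}

// The same rule written with * in place of (r+)?.
bool is_variable_star(const std::string& s) {
    static const std::regex pattern("[a-z][a-zA-Z]*");
    return std::regex_match(s, pattern);
}
```

With these, `is_integer("326")` and `is_variable("fooBar")` hold, while `is_integer("3a")` and `is_variable("Foo")` do not. Each shorthand version accepts and rejects the same inputs as its longer counterpart.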
So how does one go about checking if a regular expression matches a string? An efficient way is to
first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
by literally translating each part of it to a series of states, one-to-one. This machine is called
a __Nondeterministic Finite Automaton__, or NFA for short. The "Finite" means that the number of
states in the state machine is, well, finite. For us, this means that we can store such
a machine on disk. The "Nondeterministic" part, though, is more complex: given a particular character
and a particular state, it's possible that an NFA has the option of transitioning into more
than one other state. Well, which state __should__ it pick? No easy way to tell. Each time
we can transition to more than one state, we multiply the number of paths through the machine
that we have to consider, and with enough such choices that number grows exponentially.
This isn't good - we were going for efficiency, remember?
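To make the nondeterminism concrete, here is a small sketch of a hand-built NFA for the regex \\((a|b)*ab\\). Both the regex and the names in the code are made up for illustration. In state 0, the character 'a' allows two different moves, so the naive simulation below has to try one choice and back up if it turns out to be wrong.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using State = int;

// Hypothetical NFA for (a|b)*ab. State 0 is the start state and state 2 is
// the only accepting state. Note the entry for (0, 'a'): two possible targets.
const std::map<std::pair<State, char>, std::vector<State>> kTransitions = {
    {{0, 'a'}, {0, 1}},
    {{0, 'b'}, {0}},
    {{1, 'b'}, {2}},
};
const State kAccepting = 2;

// Naive simulation: at every step, try each allowed transition in turn.
bool nfa_accepts_from(const std::string& input, std::size_t pos, State state) {
    if (pos == input.size()) {
        return state == kAccepting;
    }
    const auto it = kTransitions.find({state, input[pos]});
    if (it == kTransitions.end()) {
        return false;  // no transition for this character: dead end
    }
    for (State next : it->second) {
        if (nfa_accepts_from(input, pos + 1, next)) {
            return true;  // this choice led to an accepting state
        }
    }
    return false;  // every choice failed
}

bool nfa_accepts(const std::string& input) {
    return nfa_accepts_from(input, 0, 0);
}
```

For example, `nfa_accepts("aab")` is true, but only after the simulation first follows the wrong choice for the second 'a' and backs up. `nfa_accepts("abb")` is false.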
What we can do is convert our NFA into another kind of state machine, in which, for every state and character,
only one state transition is possible. This machine is called a __Deterministic Finite Automaton__,
or DFA for short. There's an algorithm to convert an NFA into a DFA, which I won't explain here.
Since both the conversion of a regex into an NFA and the conversion of an NFA into a DFA are done
by following an algorithm, we're always going to get the same DFA for the same regex we put in.
If we come up with the rules for our tokens once, we don't want to be building a DFA each time
our tokenizer is run - the result will always be the same! Even worse, translating a regular
expression all the way into a DFA is the inefficient part of the whole process. The solution is to
generate the state machine once, and convert it into code that simulates it. Then, we include
that code as part of our compiler. This way, we have a state machine "hardcoded" into our tokenizer,
and no conversion of regex to DFAs needs to be done at runtime.
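As a rough idea of what a "hardcoded" state machine might look like, here is a hand-written sketch of a DFA for two rules, `[0-9]+` and `[a-z][a-zA-Z]*`. The names are invented for this example, the character comparisons assume an ASCII-compatible encoding, and code produced by a real generator looks quite different.

```cpp
#include <string>

enum class TokenKind { Integer, LowercaseId, Invalid };

// Classifies a whole string by running it through a hand-written DFA.
// For each (state, character) pair there is exactly one next state.
TokenKind classify(const std::string& text) {
    enum class State { Start, InInteger, InIdentifier, Dead };
    State state = State::Start;

    for (char c : text) {
        const bool digit = c >= '0' && c <= '9';
        const bool lower = c >= 'a' && c <= 'z';
        const bool upper = c >= 'A' && c <= 'Z';

        switch (state) {
            case State::Start:
                state = digit ? State::InInteger
                      : lower ? State::InIdentifier
                              : State::Dead;
                break;
            case State::InInteger:
                state = digit ? State::InInteger : State::Dead;
                break;
            case State::InIdentifier:
                state = (lower || upper) ? State::InIdentifier : State::Dead;
                break;
            case State::Dead:
                break;  // once dead, always dead
        }
    }

    // Only two states are accepting; everything else is a failed match.
    switch (state) {
        case State::InInteger:    return TokenKind::Integer;
        case State::InIdentifier: return TokenKind::LowercaseId;
        default:                  return TokenKind::Invalid;
    }
}
```

Here `classify("326")` gives `TokenKind::Integer`, `classify("fooBar")` gives `TokenKind::LowercaseId`, and `classify("3a")` gives `TokenKind::Invalid`. No regex is parsed at runtime: the rules live entirely in the `switch`.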
#### The Practice
Creating an NFA, and then a DFA, and then generating C++ code are all cumbersome. If we had to
write code to do this every time we made a compiler, it would get very repetitive, very fast.
Fortunately, there exists a tool that does exactly this for us - it's called `flex`. Flex
takes regular expressions, and generates code that matches a string against those regular expressions.
It does one more thing in addition to that - for each regular expression it matches, flex
runs a user-defined action (which we write in C++). We can use this to convert strings that
represent numbers directly into numbers, and do other small tasks.
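To illustrate the kind of work such an action does, here is a small sketch of the number conversion as a standalone C++ function. In a flex action the matched text is available as `yytext`; the `IntToken` type and the `make_int_token` name are hypothetical and only serve this example.

```cpp
#include <string>

// Hypothetical token type carrying the numeric value of an integer literal.
struct IntToken {
    int value;
};

// Roughly what an action for the rule [0-9]+ could do with the matched text.
// std::stoi throws std::out_of_range if the number does not fit in an int.
IntToken make_int_token(const char* matched_text) {
    return IntToken{std::stoi(matched_text)};
}
```

For instance, `make_int_token("326").value` is the integer `326` rather than the three-character string.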
So, what tokens do we have? From our arithmetic definition, we see that we have integers.
Let's use the regex `[0-9]+` for those. We also have the operators `+`, `-`, `*`, and `/`.
`-` is simple enough: the corresponding regex is `-`. We need to
preface our `/`, `+` and `*` with a backslash, though, since they happen to also be modifiers
in flex's regular expressions: `\/`, `\+`, `\*`.
Let's also represent some reserved keywords. We'll say that `defn`, `data`, `case`, and `of`
are reserved. Their regular expressions are just their names. We also want to tokenize
`=`, `->`, `{`, `}`, `,`, `(` and `)`. Finally, we want to represent identifiers, like `f`,
`x`, `Nil`, and `Cons`. We will actually make a distinction between lowercase identifiers
and uppercase identifiers, as we will follow Haskell's convention of representing
data type constructors with uppercase letters, and functions and variables with lowercase ones.
So, our two regular expressions will be `[a-z][a-zA-Z]*` for the lowercase identifiers, and
`[A-Z][a-zA-Z]*` for the uppercase ones. Let's make a tokenizer in flex with all this. To do
this, we create a new file, `scanner.l`, in which we write a mix of regular expressions