---
title: Compiling a Functional Language Using C++, Part 1 - Tokenizing
date: 2019-08-03T01:02:30-07:00
tags: ["C++", "Functional Languages", "Compilers"]
series: "Compiling a Functional Language using C++"
description: "In this post, we tackle the first component of our compiler: tokenizing."
---
It makes sense to build a compiler bit by bit, following the stages we outlined in
the first post of the series. This is because these stages are essentially a pipeline,
with program text coming in one end, and the final program coming out of the other.
So as we build up our pipeline, we'll be able to push program text further and further,
until eventually we get something that we can run on our machine.

This is how most tutorials go about building a compiler, too. The result is that
there are a __lot__ of tutorials covering tokenizing and parsing.
Nonetheless, I will cover the steps required to tokenize and parse our little functional
language. Before we start, it might help to refresh your memory about
the syntax of the language, which we outlined in the
[previous post]({{< relref "00_compiler_intro.md" >}}).

When we first get our program text, it's in a representation difficult for us to make
sense of. If we look at how it's represented in C++, we see that it's just an array
of characters (potentially hundreds, thousands, or millions in length). We __could__
jump straight to parsing the text (which involves creating a tree structure, known
as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
in fact, in functional languages, tokenizing is frequently skipped. However,
in our closer-to-metal language (C++), it happens to be more convenient to first break the
input text into a bunch of distinct segments (tokens).

For example, consider the string `"320+6"`. If we skip tokenizing and go straight
into parsing, we'd feed our parser the sequence of characters `['3', '2', '0', '+', '6', '\0']`.
On the other hand, if we run a tokenizing step on the string first, we'd be feeding our
parser three tokens, `("320", NUMBER)`, `("+", OPERATOR)`, and `("6", NUMBER)`.
To us, this is a bit clearer - we've partitioned the string into logical segments.
Our parser, then, won't have to care about recognizing a number - it will just know
that a number is next in the string, and do with that information what it needs.

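
To make the difference concrete, here is a minimal hand-written sketch of such a tokenizing step in C++. The names `TokenKind`, `Token`, and `tokenize` are made up for illustration and are not part of the compiler we build in this series. The sketch only understands digits and the four arithmetic operators, and it silently skips everything else.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this illustration only.
enum class TokenKind { Number, Operator };

struct Token {
    std::string text;
    TokenKind kind;
};

// Splits a string of digits and single-character operators into tokens.
// Any other character (such as whitespace) is skipped in this sketch.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isdigit(c)) {
            // Consume the longest run of digits as a single NUMBER token.
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({input.substr(start, i - start), TokenKind::Number});
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            tokens.push_back({std::string(1, input[i]), TokenKind::Operator});
            ++i;
        } else {
            ++i;
        }
    }
    return tokens;
}
```

Calling `tokenize("320+6")` produces three tokens: `"320"` as a number, `"+"` as an operator, and `"6"` as a number, which is exactly the partitioning described above.
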
### The Theory
How do we go about breaking up a string into tokens? We need to come up with a
way to compare some characters in a string against a set of rules. But "rules"
is a very general term - we could, for instance, define a particular
token that is a Fibonacci number - 1, 2, 3, 5, and so on would be marked
as a "Fibonacci number", while the other numbers would be marked as just
a regular number. To support that, our rules would get pretty complex, and
checking those rules against particular strings would become equally complex.

Fortunately, we're not insane. We observe that the rules for tokens in practice
are fairly simple - one or more digits is an integer, a few letters together
are a variable name. In order to be able to efficiently break text up into
such tokens, we restrict ourselves to __regular languages__. A language
is defined as a set of strings (potentially infinite), and a regular
language is one for which we can write a __regular expression__ to check if
a string is in the set. Regular expressions are a way of representing
patterns that a string has to match. We define regular expressions
as follows:

* Any character is a regular expression that matches that character. Thus,
\(a\) is a regular expression (from now on shortened to regex) that matches
the character 'a', and nothing else.
* \(r_1r_2\), or the concatenation of \(r_1\) and \(r_2\), is
a regular expression that matches anything matched by \(r_1\), followed
by anything that matches \(r_2\). For instance, \(ab\) matches
the character 'a' followed by the character 'b' (thus matching "ab").
* \(r_1|r_2\) matches anything that is either matched by \(r_1\) or
\(r_2\). Thus, \(a|b\) matches the character 'a' or the character 'b'.
* \(r_1?\) matches either an empty string, or anything matched by \(r_1\).
* \(r_1+\) matches one or more things matched by \(r_1\). So,
\(a+\) matches "a", "aa", "aaa", and so on.
* \((r_1)\) matches anything that matches \(r_1\). This is mostly used
to group things together in more complicated expressions.
* \(.\) matches any character.

More convenient variations of regex also include an "any of" operator, \([c_1c_2c_3]\),
which is equivalent to \(c_1|c_2|c_3\), and a "range" operator, \([c_1-c_n]\), which
matches all characters in the range between \(c_1\) and \(c_n\), inclusive.

Let's see some examples. An integer, such as 326, can be represented with \([0-9]+\).
This means one or more characters between 0 and 9. Some (most) regex implementations
have a special symbol for \([0-9]\), written as \(\setminus d\). A variable,
starting with a lowercase letter and containing lowercase or uppercase letters after it,
can be written as \([a-z]([a-zA-Z]+)?\). Again, most regex implementations provide
a special operator for \((r_1+)?\), written as \(r_1*\).

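
If you'd like to try these patterns out before we get to any tooling, the C++ standard library's `<regex>` header is enough for a quick check. The sketch below is only an experiment, not how our tokenizer will work; the function names are hypothetical. `std::regex_match` succeeds only when the entire string matches the pattern.

```cpp
#include <regex>
#include <string>

// True if the whole string is one or more digits.
bool is_integer(const std::string& s) {
    static const std::regex pattern("[0-9]+");
    return std::regex_match(s, pattern);
}

// True if the whole string is a lowercase letter followed by
// zero or more letters, written with the (r+)? form.
bool is_variable(const std::string& s) {
    static const std::regex pattern("[a-z]([a-zA-Z]+)?");
    return std::regex_match(s, pattern);
}

// The same language as is_variable, written with the * shorthand.
bool is_variable_star(const std::string& s) {
    static const std::regex pattern("[a-z][a-zA-Z]*");
    return std::regex_match(s, pattern);
}
```

Here `is_integer("326")` is true while `is_integer("32a")` is false, and `is_variable` and `is_variable_star` agree on every input, since \((r_1+)?\) and \(r_1*\) describe the same strings.
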
So how does one go about checking if a regular expression matches a string? An efficient way is to
first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
by literally translating each part of it to a series of states, one-to-one. This machine is called
a __Nondeterministic Finite Automaton__, or NFA for short. The "Finite" means that the number of
states in the state machine is, well, finite. For us, this means that we can store such
a machine on disk. The "Nondeterministic" part, though, is more complex: given a particular character
and a particular state, it's possible that an NFA has the option of transitioning into more
than one other state. Well, which state __should__ it pick? No easy way to tell. Each time
we can transition to more than one state, the number of paths we might have to try multiplies, and
exploring them one at a time can take exponential time in the worst case. This isn't good - we were going for efficiency, remember?

What we can do is convert our NFA into another kind of state machine, in which, for every state and
every character, there is exactly one possible transition. This machine is called a __Deterministic Finite Automaton__,
or DFA for short. There's an algorithm to convert an NFA into a DFA (the subset construction), which I won't explain here.

Since both the conversion of a regex into an NFA and the conversion of an NFA into a DFA are done
by following an algorithm, we're always going to get the same DFA for the same regex we put in.
If we come up with the rules for our tokens once, we don't want to be building a DFA each time
our tokenizer is run - the result will always be the same! Even worse, translating a regular
expression all the way into a DFA is the expensive part of the whole process. The solution is to
generate a state machine once, and convert it into code that simulates that state machine. Then, we include
that code as part of our compiler. This way, we have a state machine "hardcoded" into our tokenizer,
and no conversion of regex to DFAs needs to be done at runtime.

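
To give a feel for what a "hardcoded" state machine looks like, here is a small hand-written DFA for the regex `[0-9]+`. This is only a sketch under my own naming (`State`, `step`, and `matches_integer` are hypothetical), and the code a real generator emits is typically table-driven and far less readable. The key property is visible in `step`: for each state and character there is exactly one next state.

```cpp
#include <string>

// States of a DFA for [0-9]+ (illustrative names).
enum class State { Start, InNumber, Reject };

// One deterministic step: a single next state per (state, character) pair.
State step(State s, char c) {
    bool digit = (c >= '0' && c <= '9');
    switch (s) {
        case State::Start:    return digit ? State::InNumber : State::Reject;
        case State::InNumber: return digit ? State::InNumber : State::Reject;
        case State::Reject:   return State::Reject;
    }
    return State::Reject;
}

// Runs the DFA over the whole input; InNumber is the only accepting state.
bool matches_integer(const std::string& input) {
    State s = State::Start;
    for (char c : input) {
        s = step(s, c);
    }
    return s == State::InNumber;
}
```

Each character is examined exactly once and no regex is parsed while matching, which is the efficiency we were after.
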
### The Practice
Creating an NFA, and then a DFA, and then generating C++ code are all cumbersome. If we had to
write code to do this every time we made a compiler, it would get very repetitive, very fast.
Fortunately, there exists a tool that does exactly this for us - it's called `flex`. Flex
takes regular expressions, and generates code that matches a string against those regular expressions.
It does one more thing in addition to that - for each regular expression it matches, flex
runs a user-defined action (which we write in C++). We can use this to convert strings that
represent numbers directly into numbers, and do other small tasks.

So, what tokens do we have? From our arithmetic definition, we see that we have integers.
Let's use the regex `[0-9]+` for those. We also have the operators `+`, `-`, `*`, and `/`.
The regex for `-` is simple enough: it's just `-`. However, we need to
preface our `/`, `+` and `*` with a backslash, since they happen to also be special characters
in flex's regular expressions: `\/`, `\+`, `\*`.

Let's also represent some reserved keywords. We'll say that `defn`, `data`, `case`, and `of`
are reserved. Their regular expressions are just their names. We also want to tokenize
`=`, `->`, `{`, `}`, `,`, `(` and `)`. Finally, we want to represent identifiers, like `f`,
`x`, `Nil`, and `Cons`. We will actually make a distinction between lowercase identifiers
and uppercase identifiers, as we will follow Haskell's convention of representing
data type constructors with uppercase letters, and functions and variables with lowercase ones.
So, our two regular expressions will be `[a-z][a-zA-Z]*` for the lowercase identifiers, and
`[A-Z][a-zA-Z]*` for the uppercase identifiers. Let's make a tokenizer in flex with all this. To do
this, we create a new file, `scanner.l`, in which we write a mix of regular expressions
and C++ code. Here's the whole thing:

{{< rawblock "compiler/01/scanner.l" >}}

A flex file starts with options. I set the `noyywrap` option, which disables a particular
feature of flex that we won't use, and which would otherwise cause linker errors. Next up,
flex allows us to put some C++ code that we want at the top of our generated code.
I simply include `iostream`, so that we can use `cout` to print out our tokens.
Next, `%%`, and after that, the meat of our tokenizer: regular expressions, followed by
C++ code that should be executed when the regular expression is matched.

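
One detail worth knowing about such rules: a word like `defn` matches both the keyword regex and the lowercase identifier regex. Flex resolves this by taking the rule that matches the most text, and, when several rules match the same amount, the one listed first in the file. The sketch below imitates that policy with `std::regex`. It is not how flex is implemented, and `Rule`, `rules`, and `classify_prefix` are hypothetical names of my own, covering only a handful of the tokens discussed above.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Rule {
    std::string name;
    std::regex pattern;
};

// Rules in priority order: keywords are listed before identifiers.
const std::vector<Rule>& rules() {
    static const std::vector<Rule> r = {
        {"KEYWORD", std::regex("defn|data|case|of")},
        {"LOWERCASE_ID", std::regex("[a-z][a-zA-Z]*")},
        {"UPPERCASE_ID", std::regex("[A-Z][a-zA-Z]*")},
        {"NUMBER", std::regex("[0-9]+")},
    };
    return r;
}

// Returns the name of the rule matching the longest prefix of `input`.
// Ties go to the earlier rule; an empty string means nothing matched.
std::string classify_prefix(const std::string& input) {
    std::string best_name;
    std::size_t best_length = 0;
    for (const Rule& rule : rules()) {
        std::smatch m;
        // match_continuous forces the match to begin at the first character.
        if (std::regex_search(input, m, rule.pattern,
                              std::regex_constants::match_continuous)) {
            std::size_t length = static_cast<std::size_t>(m.length(0));
            if (length > best_length) {  // strict '>' keeps the earlier rule on ties
                best_length = length;
                best_name = rule.name;
            }
        }
    }
    return best_name;
}
```

With this ordering, `classify_prefix("defn")` gives `"KEYWORD"`, while `classify_prefix("define")` gives `"LOWERCASE_ID"`, because the identifier rule matches more text. In a flex file, this means keyword rules should appear before the identifier rules.
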
The first token: whitespace. This includes the space character
and the newline character. We ignore it, so its rule is empty. After that,
we have the regular expressions for the tokens we've talked about. For each, I just
print a description of the token that matched. This will change when we hook this up to
a parser, but for now, this works fine. Notice that the variable `yytext` contains
the string matched by our regular expression. This variable is set by the code flex
generates, and we can use it to get the exact text that matched a regex. This is
useful, for instance, to print the variable name that we matched. After
all of our tokens, another `%%`, and more C++ code. For this simple example,
I declare a `main` function, which just calls `yylex`, a function flex
generates for us. Let's generate the C++ code, and compile it:

```
flex -o scanner.cpp scanner.l
g++ -o scanner scanner.cpp
```

Now, let's feed it an expression:
```
echo "3+2*6" | ./scanner
```

We get the output:
```
NUMBER: 3
PLUS
NUMBER: 2
TIMES
NUMBER: 6
```
Hooray! We have tokenizing.

With our text neatly divided into meaningful chunks, we
can continue on to [Part 2 - Parsing]({{< relref "02_compiler_parsing.md" >}}).