title: Compiling a Functional Language Using C++, Part 1 - Tokenizing
date: 2019-08-03T01:02:30-07:00
tags: C and C++, Functional Languages, Compilers
description: In this post, we tackle the first component of our compiler: tokenizing.

It makes sense to build a compiler bit by bit, following the stages we outlined in the first post of the series. This is because these stages are essentially a pipeline, with program text coming in one end, and the final program coming out of the other. So as we build up our pipeline, we'll be able to push program text further and further, until eventually we get something that we can run on our machine.

This is how most tutorials go about building a compiler, too. The result is that there are a lot of tutorials covering tokenizing and parsing. Nonetheless, I will cover the steps required to tokenize and parse our little functional language. Before we start, it might help to refresh your memory about the syntax of the language, which we outlined in the [previous post]({{< relref "00_compiler_intro.md" >}}).

When we first get our program text, it's in a representation difficult for us to make sense of. If we look at how it's represented in C++, we see that it's just an array of characters (potentially hundreds, thousands, or millions in length). We could jump straight to parsing the text (which involves creating a tree structure, known as an abstract syntax tree; more on that later). There's nothing wrong with this approach - in fact, in functional languages, tokenizing is frequently skipped. However, in our closer-to-metal language (C++), it happens to be more convenient to first break the input text into a bunch of distinct segments (tokens).

For example, consider the string "320+6". If we skip tokenizing and go straight into parsing, we'd feed our parser the sequence of characters ['3', '2', '0', '+', '6', '\0']. On the other hand, if we run a tokenizing step on the string first, we'd be feeding our parser three tokens, ("320", NUMBER), ("+", OPERATOR), and ("6", NUMBER). To us, this is a bit clearer - we've partitioned the string into logical segments. Our parser, then, won't have to care about recognizing a number - it will just know that a number is next in the string, and do with that information what it needs.
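To make the "320+6" example concrete, here is a minimal hand-written sketch of such a tokenizing step. The names `TokenKind`, `Token`, and `tokenize` are made up for this illustration and are not part of the compiler we build in this series; the sketch only knows about digits and the four arithmetic operators.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds, for this illustration only.
enum class TokenKind { Number, Operator };

struct Token {
    std::string text;
    TokenKind kind;
};

// Split an arithmetic string into NUMBER and OPERATOR tokens.
// Characters that are neither digits nor operators are skipped.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        char c = input[i];
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // Consume the whole run of digits as a single token.
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({input.substr(start, i - start), TokenKind::Number});
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            tokens.push_back({std::string(1, c), TokenKind::Operator});
            ++i;
        } else {
            ++i; // Ignore anything else (e.g. whitespace).
        }
    }
    return tokens;
}
```

Calling `tokenize("320+6")` yields the three tokens from the example: "320" as a number, "+" as an operator, and "6" as a number. Writing this by hand works for a toy, but it gets tedious as the set of tokens grows, which is why we'll lean on theory and tooling instead.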

The Theory

How do we go about breaking up a string into tokens? We need to come up with a way to compare some characters in a string against a set of rules. But "rules" is a very general term - we could, for instance, define a particular token that is a Fibonacci number - 1, 2, 3, 5, and so on would be marked as a "Fibonacci number", while the other numbers would be marked as just a regular number. To support that, our rules would get pretty complex, and checking them against particular strings would become equally complex.

Fortunately, we're not insane. We observe that the rules for tokens in practice are fairly simple - one or more digits is an integer, a few letters together are a variable name. In order to be able to efficiently break text up into such tokens, we restrict ourselves to regular languages. A language is defined as a set of strings (potentially infinite), and a regular language is one for which we can write a regular expression to check if a string is in the set. Regular expressions are a way of representing patterns that a string has to match. We define regular expressions as follows:

  • Any character is a regular expression that matches that character. Thus, \(a\) is a regular expression (from now shortened to regex) that matches the character 'a', and nothing else.
  • \(r_1r_2\), or the concatenation of \(r_1\) and \(r_2\), is a regular expression that matches anything matched by \(r_1\), followed by anything that matches \(r_2\). For instance, \(ab\) matches the character 'a' followed by the character 'b' (thus matching "ab").
  • \(r_1|r_2\) matches anything that is either matched by \(r_1\) or \(r_2\). Thus, \(a|b\) matches the character 'a' or the character 'b'.
  • \(r_1?\) matches either an empty string, or anything matched by \(r_1\).
  • \(r_1+\) matches one or more things matched by \(r_1\). So, \(a+\) matches "a", "aa", "aaa", and so on.
  • \((r_1)\) matches anything that matches \(r_1\). This is mostly used to group things together in more complicated expressions.
  • \(.\) matches any character.

More powerful variations of regex also include an "any of" operator, \([c_1c_2c_3]\), which is equivalent to \(c_1|c_2|c_3\), and a "range" operator, \([c_1-c_n]\), which matches all characters in the range between \(c_1\) and \(c_n\), inclusive.

Let's see some examples. An integer, such as 326, can be represented with \([0-9]+\). This means one or more characters between 0 and 9. Some (most) regex implementations have a special symbol for \([0-9]\), written as \(\setminus d\). A variable, starting with a lowercase letter and containing lowercase or uppercase letters after it, can be written as \([a-z]([a-zA-Z]+)?\). Again, most regex implementations provide a special operator for \((r_1+)?\), written as \(r_1*\).
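If you'd like to try these two patterns out, the C++ standard library's `<regex>` header can check them directly. This is only a sketch for experimenting: the helper names `is_integer` and `is_variable` are hypothetical, and this is not how our tokenizer will end up matching text.

```cpp
#include <regex>
#include <string>

// True if the whole string is one or more digits, i.e. matches [0-9]+.
bool is_integer(const std::string& s) {
    static const std::regex pattern("[0-9]+");
    return std::regex_match(s, pattern);
}

// True if the whole string matches [a-z]([a-zA-Z]+)?:
// a lowercase letter, optionally followed by more letters.
bool is_variable(const std::string& s) {
    static const std::regex pattern("[a-z]([a-zA-Z]+)?");
    return std::regex_match(s, pattern);
}
```

Note that `std::regex_match` only succeeds when the entire string matches, so `is_integer("326")` is true while `is_integer("32a")` is false, and `is_variable("xY")` is true while `is_variable("Xy")` is false.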

So how does one go about checking if a regular expression matches a string? An efficient way is to first construct a state machine. A type of state machine can be constructed from a regular expression by literally translating each part of it to a series of states, one-to-one. This machine is called a Nondeterministic Finite Automaton, or NFA for short. The "Finite" means that the number of states in the state machine is, well, finite. For us, this means that we can store such a machine on disk. The "Nondeterministic" part, though, is more complex: given a particular character and a particular state, it's possible that an NFA has the option of transitioning into more than one other state. Well, which state should it pick? No easy way to tell. Each time we can transition to more than one state, the number of possible paths through the machine multiplies, so trying them one at a time can take exponentially long. This isn't good - we were going for efficiency, remember?
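To see the nondeterminism in action, here is a small sketch of an NFA for \((a|b)*ab\) (strings of 'a's and 'b's that end in "ab"). The machine, its state numbering, and the names `kTransitions` and `nfa_accepts` are all assumptions made for this illustration. Rather than guessing which transition to take, the sketch uses one common workaround: it keeps track of every state the machine could be in at once.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical NFA for (a|b)*ab: state 0 is the start, state 2 accepts.
// In state 0, reading 'a' may lead to state 0 or to state 1.
using State = int;
using Transitions = std::map<std::pair<State, char>, std::set<State>>;

const Transitions kTransitions = {
    {{0, 'a'}, {0, 1}},
    {{0, 'b'}, {0}},
    {{1, 'b'}, {2}},
};

// Follow every possible transition at once by tracking the set of
// states the NFA could currently be in.
bool nfa_accepts(const std::string& input) {
    std::set<State> current = {0};
    for (char c : input) {
        std::set<State> next;
        for (State s : current) {
            auto it = kTransitions.find(std::make_pair(s, c));
            if (it != kTransitions.end()) {
                next.insert(it->second.begin(), it->second.end());
            }
        }
        current = std::move(next);
    }
    // Accept if at least one possible path ended in the accepting state.
    return current.count(2) > 0;
}
```

This avoids exploring paths one by one, but it still does set bookkeeping for every character it reads. That per-character overhead is exactly what we'd like to get rid of.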

What we can do is convert our NFA into another kind of state machine, in which, for every state and every character, only one transition is possible. This machine is called a Deterministic Finite Automaton, or DFA for short. There's an algorithm to convert an NFA into a DFA, which I won't explain here.

Since both the conversion of a regex into an NFA and the conversion of an NFA into a DFA are done by following an algorithm, we're always going to get the same DFA for the same regex we put in. If we come up with the rules for our tokens once, we don't want to be building a DFA each time our tokenizer is run - the result will always be the same! Even worse, translating a regular expression all the way into a DFA is the inefficient part of the whole process. The solution is to generate a state machine, and convert it into code to simulate that state machine. Then, we include that code as part of our compiler. This way, we have a state machine "hardcoded" into our tokenizer, and no conversion of regex to DFAs needs to be done at runtime.
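To give a feel for what a "hardcoded" state machine might look like, here is a hand-written sketch of a DFA for \([a-z]([a-zA-Z]+)?\). The state names and the functions `step` and `dfa_accepts` are invented for this example; code produced by a real generator will look quite different, but the idea of a fixed set of states with no regex processing at runtime is the same.

```cpp
#include <string>

// Hypothetical hand-written DFA for [a-z]([a-zA-Z]+)?.
// Start:  nothing has been read yet.
// InName: a lowercase letter followed by zero or more letters (accepting).
// Dead:   the input can no longer match.
enum class DfaState { Start, InName, Dead };

bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
bool is_upper(char c) { return c >= 'A' && c <= 'Z'; }

// Exactly one next state for every (state, character) pair.
DfaState step(DfaState state, char c) {
    switch (state) {
        case DfaState::Start:
            return is_lower(c) ? DfaState::InName : DfaState::Dead;
        case DfaState::InName:
            return (is_lower(c) || is_upper(c)) ? DfaState::InName
                                                : DfaState::Dead;
        case DfaState::Dead:
            return DfaState::Dead;
    }
    return DfaState::Dead; // Not reached; keeps compilers quiet.
}

// Run the machine over the whole input, one character at a time.
bool dfa_accepts(const std::string& input) {
    DfaState state = DfaState::Start;
    for (char c : input) {
        state = step(state, c);
    }
    return state == DfaState::InName;
}
```

Each character costs a single call to `step`, with no sets of states to maintain, so `dfa_accepts("xY")` is decided in two steps and `dfa_accepts("Xy")` is rejected as soon as the first character is read.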

The Practice

Creating an NFA, and then a DFA, and then generating C++ code are all cumbersome. If we had to write code to do this every time we made a compiler, it would get very repetitive, very fast. Fortunately, there exists a tool that does exactly this for us - it's called flex. Flex takes regular expressions, and generates code that matches a string against those regular expressions. It does one more thing in addition to that - for each regular expression it matches, flex runs a user-defined action (which we write in C++). We can use this to convert strings that represent numbers directly into numbers, and do other small tasks.
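The "pattern plus action" idea can be imitated in plain C++ to get some intuition for it. The sketch below is not how flex works internally: it uses `std::regex` at runtime instead of a generated DFA, and it runs the first rule that matches rather than preferring the longest match as flex does. The names `Rule`, `scan`, and `collect_numbers` are hypothetical.

```cpp
#include <functional>
#include <regex>
#include <string>
#include <vector>

// A rule pairs a pattern with an action to run on the matched text.
struct Rule {
    std::regex pattern;
    std::function<void(const std::string&)> action;
};

// Repeatedly try each rule at the current position and run the action of
// the first one that matches. Returns false if no rule matches somewhere.
bool scan(const std::string& input, const std::vector<Rule>& rules) {
    auto pos = input.cbegin();
    while (pos != input.cend()) {
        bool matched = false;
        for (const Rule& rule : rules) {
            std::smatch m;
            // match_continuous only allows matches that begin at `pos`.
            if (std::regex_search(pos, input.cend(), m, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                rule.action(m.str(0));
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) return false;
    }
    return true;
}

// Example action: convert matched digit strings directly into ints.
std::vector<int> collect_numbers(const std::string& input) {
    std::vector<int> numbers;
    std::vector<Rule> rules = {
        {std::regex("[0-9]+"),
         [&numbers](const std::string& text) {
             numbers.push_back(std::stoi(text));
         }},
        {std::regex("[-+*/]"), [](const std::string&) { /* ignore operators */ }},
    };
    scan(input, rules);
    return numbers;
}
```

With these two rules, `collect_numbers("3+2*6")` gathers the integers 3, 2, and 6. The action attached to the number rule is where the string-to-number conversion happens, which is the same role our C++ snippets will play inside flex rules.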

So, what tokens do we have? From our arithmetic definition, we see that we have integers. Let's use the regex [0-9]+ for those. We also have the operators +, -, *, and /. The regex for - is simple enough: it's just -. However, we need to preface our /, + and * with a backslash, since they happen to also be modifiers in flex's regular expressions: \/, \+, \*.

Let's also represent some reserved keywords. We'll say that defn, data, case, and of are reserved. Their regular expressions are just their names. We also want to tokenize =, ->, {, }, ,, ( and ). Finally, we want to represent identifiers, like f, x, Nil, and Cons. We will actually make a distinction between lowercase identifiers and uppercase identifiers, as we will follow Haskell's convention of representing data type constructors with uppercase letters, and functions and variables with lowercase ones. So, our two regular expressions will be [a-z][a-zA-Z]* for the lowercase variables, and [A-Z][a-zA-Z]* for uppercase variables. Let's make a tokenizer in flex with all this. To do this, we create a new file, scanner.l, in which we write a mix of regular expressions and C++ code. Here's the whole thing:

{{< rawblock "compiler/01/scanner.l" >}}

A flex file starts with options. I set the noyywrap option, which disables a particular feature of flex that we won't use, and which would otherwise cause linker errors. Next up, flex allows us to put some C++ code that we want at the top of our generated code. I simply include iostream, so that we can use cout to print out our tokens. Next, %%, and after that, the meat of our tokenizer: regular expressions, followed by C++ code that should be executed when the regular expression is matched.

The first token: whitespace. This includes the space character, and the newline character. We ignore it, so its rule is empty. After that, we have the regular expressions for the tokens we've talked about. For each, I just print a description of the token that matched. This will change when we hook this up to a parser, but for now, this works fine. Notice that the variable yytext contains the string matched by our regular expression. This variable is set by the code flex generates, and we can use it to get the exact text that matched a regex. This is useful, for instance, to print the variable name that we matched. After all of our tokens, another %%, and more C++ code. For this simple example, I declare a main function, which just calls yylex, a function flex generates for us. Let's generate the C++ code, and compile it:

flex -o scanner.cpp scanner.l
g++ -o scanner scanner.cpp

Now, let's feed it an expression:

echo "3+2*6" | ./scanner

We get the output:

NUMBER: 3
PLUS
NUMBER: 2
TIMES
NUMBER: 6

Hooray! We have tokenizing.

With our text neatly divided into meaningful chunks, we can continue on to [Part 2 - Parsing]({{< relref "02_compiler_parsing.md" >}}).