---
title: Compiling a Functional Language Using C++, Part 2 - Parsing
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---

In the previous post, we covered tokenizing. We learned how to convert an input string into logical segments, and even wrote up a tokenizer to do it according to the rules of our language. Now, it's time to make sense of the tokens, and parse our language.

### The Theory

The rules to parse a language are more complicated than the rules for
recognizing tokens. For instance, consider a simple language of a matching
number of open and closed parentheses, like `()` and `((()))`. You can't
write a regular expression for it! Instead, we resort to a wider class of languages, called
__context free languages__. These are the languages that are matched by __context free grammars__.
A context free grammar is a list of rules in the form of \\(S \\rightarrow \\alpha\\), where
\\(S\\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
\\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
expand into other things; for us, this is a token).

Let's write a context free grammar (CFG from now on) to match our parenthesis language:

$$
\\begin{align}
S & \\rightarrow ( S ) \\\\\\
S & \\rightarrow ()
\\end{align}
$$

So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
have to do a little more work:

$$
S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((()))
$$
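
Before moving on, it might help to see this "expanding" process as something executable. Here's a tiny hand-written recognizer for just the parenthesis language — a throwaway sketch for illustration, not part of the compiler we're building:

```
#include <cassert>
#include <cstddef>
#include <string>

// Returns the number of characters matched by S starting at `pos`,
// or 0 if S doesn't match there. Grammar: S -> (S) | ()
std::size_t match_s(const std::string& s, std::size_t pos) {
    if (pos + 1 < s.size() && s[pos] == '(' && s[pos + 1] == ')')
        return 2;                           // S -> ()
    if (pos < s.size() && s[pos] == '(') {
        std::size_t inner = match_s(s, pos + 1); // try S -> (S)
        if (inner != 0 && pos + 1 + inner < s.size() && s[pos + 1 + inner] == ')')
            return inner + 2;
    }
    return 0;
}

int main() {
    assert(match_s("((()))", 0) == 6); // the whole string matches
    assert(match_s("(()", 0) == 0);    // unbalanced, so no match
}
```

This little function works top-down — it works here only because this grammar isn't left-recursive, a property we're about to lose.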

In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
of context free languages. For instance, top-down parsers follow nearly exactly the structure that we just described. They try to parse
a nonterminal by trying to match each symbol in its body. Given a rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), a top-down parser will
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
that nonterminal following the same strategy. However, this approach has a flaw. For instance, consider the grammar

$$
\\begin{align}
S & \\rightarrow Sa \\\\\\
S & \\rightarrow a
\\end{align}
$$

A top-down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear as the first symbol in one of its rules
is called __left recursive__, and top-down parsers aren't able to handle such grammars.

We __could__ rewrite our grammar to avoid left recursion, but we don't want to. Instead, we'll use a __bottom-up__ parser,
specifically one that follows the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our
goal string, and a "dot" indicating where we are. At first, the dot is behind all the characters:

$$
.aaa
$$

We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character:

$$
a.aa
$$

Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):

$$
S.aa
$$

There's nothing else we can do with the left side, so we shift again:

$$
Sa.a
$$

Great, we see another body on the left of the dot. We reduce it:

$$
S.a
$$

Just like before, we shift over the dot, and again, we reduce. We end up with our
start symbol, and nothing on the right of the dot, so we're done!
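
To see the shift and reduce steps as code, here's a toy recognizer for the grammar \\(S \\rightarrow Sa \\mid a\\) — a deliberately naive sketch, nothing like the table-driven parsers Bison generates:

```
#include <cstddef>
#include <iostream>
#include <string>

// A toy shift-reduce recognizer for S -> Sa | a. The string `stack`
// holds the symbols to the left of the dot; `pos` is the dot itself.
bool recognize(const std::string& input) {
    std::string stack;
    std::size_t pos = 0;
    while (true) {
        if (stack == "a") stack = "S";        // reduce by S -> a
        else if (stack == "Sa") stack = "S";  // reduce by S -> Sa
        else if (pos < input.size()) stack.push_back(input[pos++]); // shift
        else break;                           // nothing to shift or reduce
    }
    return stack == "S";
}

int main() {
    std::cout << recognize("aaa") << "\n"; // 1: accepted
    std::cout << recognize("ab") << "\n";  // 0: rejected
}
```

A real LALR(1) parser decides between shifting and reducing by consulting precomputed tables and one token of lookahead (that's the "1"); this greedy version only gets away without them because the grammar is so small.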

### The Practice

In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language".
Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree
captures the structure of our language, and is easier to work with than its textual representation. The structure
of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\)
will result in a tree named "S", with two children corresponding to the trees built for A and B. Since
an AST captures the structure of the language, we'll be able to toss away some punctuation
like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally,
we will write our grammar ignoring whitespace, since our tokenizer conveniently throws that into the trash.

The grammar for arithmetic actually involves more effort than it would appear at first. We want to make sure that our
parser respects the order of operations. This way, when we have our tree, it will immediately have the structure in
which multiplication is done before addition. We do this by creating separate "levels" in our grammar, with one
nonterminal matching addition and subtraction, and another nonterminal matching multiplication and division.
We want the operation with the least precedence to be __higher__ in our tree than one of higher precedence.
For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the left child, and the tree for `2*6` as the right child.
Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have
a tree representing "the multiplication of the result of adding 3 to 2, and 6", which is __not__ what our expression means.
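
If it helps to see this numerically, here's a throwaway snippet (separate from our compiler) that builds both candidate trees for `3+2*6` and evaluates them:

```
#include <iostream>
#include <memory>

// A throwaway expression tree: a leaf holds a number; an inner node
// holds an operator and two children.
struct expr {
    char op = 0; // 0 means "leaf"
    int value = 0;
    std::unique_ptr<expr> left, right;
};

std::unique_ptr<expr> leaf(int v) {
    auto e = std::make_unique<expr>();
    e->value = v;
    return e;
}

std::unique_ptr<expr> node(char op, std::unique_ptr<expr> l, std::unique_ptr<expr> r) {
    auto e = std::make_unique<expr>();
    e->op = op;
    e->left = std::move(l);
    e->right = std::move(r);
    return e;
}

int eval(const expr& e) {
    if (e.op == 0) return e.value;
    return e.op == '+' ? eval(*e.left) + eval(*e.right)
                       : eval(*e.left) * eval(*e.right);
}

int main() {
    // `+` at the root: 3 + (2 * 6)
    auto good = node('+', leaf(3), node('*', leaf(2), leaf(6)));
    // `*` at the root: (3 + 2) * 6
    auto bad = node('*', node('+', leaf(3), leaf(2)), leaf(6));
    std::cout << eval(*good) << "\n"; // 15, what 3+2*6 should mean
    std::cout << eval(*bad) << "\n";  // 30, which is wrong
}
```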

So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:

$$
\\begin{align}
A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{mult}
\\end{align}
$$

The first rule matches an addition expression, followed by `+` and a multiplication expression. We use an addition in the body
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,

> You define addition in terms of addition; how will parsing end? What if there's no addition at all, like `2*6`?

This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication,
without any plusses or minuses." Our rules for multiplication are very similar:

$$
\\begin{align}
A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\
A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\
A\_{mult} & \\rightarrow P
\\end{align}
$$

\\(P\\), in this case, is an a__p__plication (remember, application has higher precedence than any binary operator).
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.

Application is refreshingly simple:

$$
\\begin{align}
P & \\rightarrow P B \\\\\\
P & \\rightarrow B
\\end{align}
$$

An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier,
or another application followed by a thing.

We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses:

$$
\\begin{align}
B & \\rightarrow \\text{num} \\\\\\
B & \\rightarrow \\text{lowerVar} \\\\\\
B & \\rightarrow \\text{upperVar} \\\\\\
B & \\rightarrow ( A\_{add} ) \\\\\\
B & \\rightarrow C
\\end{align}
$$

What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:

$$
\\begin{align}
C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B \\} \\\\\\
L\_B & \\rightarrow R \\; , \\; L\_B \\\\\\
L\_B & \\rightarrow R \\\\\\
R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\
N & \\rightarrow \\text{lowerVar} \\\\\\
N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\epsilon
\\end{align}
$$

\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the
form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name
(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be
lowercase names, and a list of those arguments will be represented with \\(L\_L\\). One of the bodies
of this nonterminal is just the character \\(\\epsilon\\), which means "nothing".
We use this because a constructor can have no arguments (like `Nil`).
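
Putting these rules together — and assuming the \\(\\text{arrow}\\) token is spelled `->`, which these rules don't themselves pin down — a case expression in our language would look something like this:

```
case l of {
    Nil -> { 0 },
    Cons x xs -> { 1 }
}
```

Here `Nil` matches a constructor with no arguments, while `Cons x xs` binds the constructor's two arguments to the lowercase names `x` and `xs`.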

We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
where `multiply` is a function that, well, multiplies. We start with \\(A\_{add}\\):

$$
\\begin{align}
& A\_{add} \\\\\\
& \\rightarrow A\_{add} + A\_{mult} \\\\\\
& \\rightarrow A\_{mult} + A\_{mult} \\\\\\
& \\rightarrow P + A\_{mult} \\\\\\
& \\rightarrow B + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + P \\\\\\
& \\rightarrow \\text{num(3)} + B \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\
& \\rightarrow \\text{num(3)} + (P) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)})
\\end{align}
$$

We're almost there. We now want a rule for "something that can appear at the top level of a program", like
a function or data type declaration. We make a new set of rules:

$$
\\begin{align}
T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P = \\{ A\_{add} \\} \\\\\\
T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\
L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\
L\_D & \\rightarrow D \\\\\\
L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\
L\_P & \\rightarrow \\epsilon \\\\\\
D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\epsilon
\\end{align}
$$

That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either
a function or a data definition. A function definition consists of the keyword "defn",
followed by a function name (starting with a lowercase letter), followed by a list of
parameters, represented by \\(L\_P\\), and finally the function's body: an expression
surrounded by curly braces.

A data type definition consists of the name of the data type (starting with an uppercase letter),
and a list \\(L\_D\\) of data constructors \\(D\\). There must be at least one data constructor in this list,
so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing
a constructor of the data type, followed by a list \\(L\_U\\) of zero or more uppercase variables (representing
the types of the arguments of the constructor).

Finally, we want one or more of these declarations in a valid program:

$$
\\begin{align}
G & \\rightarrow T \\; G \\\\\\
G & \\rightarrow T
\\end{align}
$$

Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar.
It's called Bison, and it is frequently used together with Flex. Before we get to Bison, though, we need to pay a debt we've already
incurred - the implementation of our AST. Such a tree is language-specific, so Bison doesn't generate it for us. Here's what
I came up with:

{{< codeblock "C++" "compiler/02/ast.hpp" >}}

We create a base class for an expression tree, which we call `ast`. Then, for each possible syntactic construct in our language
(a number, a variable, a binary operation, a case expression) we create a subclass of `ast`. The `ast_case` subclass
is the most complex, since it must contain a list of case expression branches, which are a combination of a `pattern` and
another expression.
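
For a sense of the shape of this hierarchy without opening the header, here's a compressed sketch. Only `ast`, `ast_case`, and `pattern` are names taken from the description above; the other names and members are illustrative of the idea, not a copy of the real `ast.hpp`:

```
// A compressed sketch of the hierarchy's shape; see ast.hpp above
// for the real definitions.
#include <memory>
#include <vector>

struct ast {
    virtual ~ast() = default;
};
using ast_ptr = std::unique_ptr<ast>;

struct pattern { virtual ~pattern() = default; };
using pattern_ptr = std::unique_ptr<pattern>;

struct ast_num : ast {
    int value;
    explicit ast_num(int v) : value(v) {}
};

struct ast_binop : ast {
    char op; // '+', '-', '*' or '/'
    ast_ptr left, right;
    ast_binop(char o, ast_ptr l, ast_ptr r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

// A branch pairs a pattern with the expression it guards.
struct branch {
    pattern_ptr pat;
    ast_ptr expr;
};

struct ast_case : ast {
    ast_ptr of;                   // the expression being matched
    std::vector<branch> branches; // pattern -> expression pairs
    ast_case(ast_ptr o, std::vector<branch> b)
        : of(std::move(o)), branches(std::move(b)) {}
};
```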

Finally, we get to writing our Bison file, `parser.y`. Here's what I came up with:

{{< rawblock "compiler/02/parser.y" >}}

There are a few things to note here. First of all, the __parser__ is the "source of truth" regarding what tokens exist in our language.
We have a list of `%token` declarations, each of which corresponds to a regular expression in our scanner.

Next, observe that there's a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression
with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present
in the parser, too. Each rule can produce a value, which we call a __semantic value__. For type safety, we allow
each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \\(A\_{add}\\) must
produce an expression. We specify the type of each nonterminal using `%type` directives, while the types of terminals
are specified when they're declared.
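
For instance, such declarations might look something like this — the token and nonterminal names here are illustrative, not necessarily the exact ones in `parser.y`, and `ast_ptr` stands for whatever pointer-to-`ast` type we settle on:

```
/* Terminals: the semantic value type is given in the declaration. */
%token <int> INT
%token <std::string> LID UID
%token PLUS MINUS TIMES DIVIDE

/* Nonterminals: the type is given by a %type directive. */
%type <ast_ptr> add mul app base
```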

Next, we must recognize that Bison was originally made for C, rather than C++. In order to allow the parser
to store and operate on semantic values of various types, the canonical solution back in those days was to
use a C `union`. Unions are great, but for C++, they're more trouble than they're worth: unions don't
allow for members with non-trivial constructors! This means that stuff like `std::unique_ptr` and `std::string` is off limits as
a semantic value. But we'd really much rather use them! The solution is to:

1. Specify the language to be C++, rather than C.
2. Enable the `variant` API feature, which uses a lightweight `std::variant` alternative in place of a union.
3. Enable the creation of token constructors, which we will use in Flex. The directives for all three are shown below.
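
In Bison's directive language, these three changes look like this (the directives themselves are standard Bison; their exact placement within `parser.y` may vary):

```
%language "c++"                /* 1: generate a C++ parser */
%define api.value.type variant /* 2: variant-based semantic values */
%define api.token.constructor  /* 3: generate make_<TOKEN> functions */
```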

In order to be able to use the variant-based API, we also need to change the Flex `yylex` function
to return `yy::parser::symbol_type`. You can see this in our forward declaration of `yylex`.
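
A conventional way to do this — a sketch of the idea rather than a copy of our exact files — is to redefine `YY_DECL` in a header visible to both the parser and the scanner:

```
// Make yylex return Bison's variant-based symbol type instead of an int.
#define YY_DECL yy::parser::symbol_type yylex()
YY_DECL; // the forward declaration mentioned above
```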

Now that we've made these changes, it's time to hook Flex up to all this. Here's a new version
of the Flex scanner, with all the necessary modifications:

{{< rawblock "compiler/02/scanner.l" >}}

The two key ideas are that we overrode the default signature of `yylex` by changing the
`YY_DECL` preprocessor variable, and used the `yy::parser::make_<TOKEN>` functions
to return `symbol_type` values rather than plain `int`s.
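
In practice, scanner actions built on these functions look roughly like this (the token names `INT` and `PLUS` are placeholders for whatever `parser.y` declares; `make_INT` and `make_PLUS` are the constructors Bison generates from those declarations):

```
[0-9]+ { return yy::parser::make_INT(atoi(yytext)); }
\+     { return yy::parser::make_PLUS(); }
```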

Finally, let's get a main function so that we can at least check for segmentation faults
and other obvious mistakes:

{{< codeblock "C++" "compiler/02/main.cpp" >}}

Now, we can compile and run the code:

{{< codeblock "Bash" "compiler/02/compile.sh" >}}

We used the `-d` option for Bison to generate the `compiler_parser.hpp` header file,
which exports our token declarations and token creation functions, allowing
us to use them in Flex.

At last, we can feed some code to the parser from `stdin`. Let's try it:

```
./a.out
defn main = { add 320 6 }
defn add x y = { x + y }
```

The program prints `2`, indicating that two declarations were parsed. Let's try something obviously
wrong:

```
./a.out
}{
```

We are told an error occurred. Excellent! We're not really sure what our tree looks like, though.
We just know there's __stuff__ in the list of definitions. Having read our source code into
a structure we're better equipped to handle, we can now try to verify that the code
makes sense in [Part 3 - Type Checking]({{< relref "03_compiler_typechecking.md" >}}).
|