title: Compiling a Functional Language Using C++, Part 2 - Parsing
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
In the previous post, we covered tokenizing. We learned how to convert an input string into logical segments, and even wrote up a tokenizer to do it according to the rules of our language. Now, it's time to make sense of the tokens, and parse our language.
### The Theory
The rules to parse a language are more complicated than the rules for
recognizing tokens. For instance, consider a simple language of balanced
parentheses, like `()` and `((()))`. You can't
write a regular expression for it! Instead, we resort to a wider class of languages, called
__context free languages__: the languages that are matched by __context free grammars__.
A context free grammar is a list of rules in the form of \\(S \\rightarrow \\alpha\\), where
\\(S\\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
\\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
expand into other things; for us, this is a token).
Let's write a context free grammar (CFG from now on) to match our parenthesis language:
$$
\\begin{align}
S & \\rightarrow ( S ) \\\\\\
S & \\rightarrow ()
\\end{align}
$$
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
have to do a little more work:
$$
S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((()))
$$
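To build some intuition, here's a minimal recursive recognizer for this parenthesis language, sketched in C++ (a standalone illustration, not part of our compiler). The function recurses exactly where the grammar uses \\(S\\), which is what lets it track arbitrary nesting where a regular expression can't:

```cpp
#include <string>

// Recognize S -> (S) | (), recursing where the grammar uses S.
bool matchS(const std::string& s, size_t& pos) {
    if (pos >= s.size() || s[pos] != '(') return false;
    ++pos;  // consume '('
    // If another '(' follows, the body must be a nested S (first rule);
    // otherwise the body is empty (second rule).
    if (pos < s.size() && s[pos] == '(' && !matchS(s, pos)) return false;
    if (pos >= s.size() || s[pos] != ')') return false;
    ++pos;  // consume ')'
    return true;
}

bool matches(const std::string& s) {
    size_t pos = 0;
    return matchS(s, pos) && pos == s.size();  // must consume the whole string
}
```

With this, `matches("((()))")` is `true`, while `matches("(()")` is `false`.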
In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
of context free languages. For instance, top down parsers follow nearly exactly the derivation structure we just saw. They try to parse
a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar
$$
\\begin{align}
S & \\rightarrow Sa \\\\\\
S & \\rightarrow a
\\end{align}
$$
A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear at the beginning of one of its own rules
is called __left recursive__, and top-down parsers aren't able to handle such grammars.
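To see the problem concretely, here's what a hand-written top down parser (a __recursive descent__ parser) for this grammar would look like, sketched in C++. This is purely illustrative; note the immediate self-call:

```cpp
#include <string>

// A (broken) recursive descent parser for S -> Sa | a.
struct Parser {
    std::string input;
    size_t pos = 0;

    bool expect(char c) {
        if (pos < input.size() && input[pos] == c) { ++pos; return true; }
        return false;
    }

    bool parseS() {
        // First rule: S -> Sa. Its body starts with S, so we recurse
        // before consuming any input: infinite recursion!
        if (parseS() && expect('a')) return true;
        // Second rule: S -> a. We never actually reach this line.
        return expect('a');
    }
};
```

Calling `parseS()` overflows the stack without ever consuming a single character.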
We __could__ rewrite our grammar without using left-recursion, but we don't want to. Instead, we'll use a __bottom up__ parser,
specifically one using the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our
goal string, and a "dot" indicating where we are. At first, the dot is behind all the characters:
$$
.aaa
$$
We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character:
$$
a.aa
$$
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):
$$
S.aa
$$
There's nothing else we can do with the left side, so we shift again:
$$
Sa.a
$$
Great, we see another body on the left of the dot. We reduce it:
$$
S.a
$$
Just like before, we shift over the dot, and again, we reduce. We end up with our
start symbol, and nothing on the right of the dot, so we're done!
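Here's a toy version of this shift/reduce loop, sketched in C++. A real LALR(1) parser decides between shifting and reducing using automatically generated tables and one token of lookahead (that's the "1" in LALR(1)); here, the reductions for our two rules are simply hard-coded:

```cpp
#include <iostream>
#include <string>

// Toy shift-reduce recognizer for S -> Sa | a.
bool recognize(const std::string& input) {
    std::string left;  // the symbols on the left of the "dot"
    std::size_t pos = 0;
    while (true) {
        if (left == "a") left = "S";          // reduce by S -> a
        else if (left == "Sa") left = "S";    // reduce by S -> Sa
        else if (pos < input.size()) left += input[pos++];  // shift
        else break;  // nothing to reduce, nothing left to shift
    }
    return left == "S";  // accept if everything reduced to the start symbol
}

int main() {
    std::cout << recognize("aaa") << std::endl;  // 1 (accepted)
    std::cout << recognize("ab") << std::endl;   // 0 (rejected)
}
```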
### The Practice
In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language".
Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree
captures the structure of our language, and is easier to work with than its textual representation. The structure
of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\)
will result in a tree named "S", with two children corresponding to the trees built for A and B. Since
an AST captures the structure of the language, we'll be able to toss away some punctuation
like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally,
we will write our grammar ignoring whitespace, since our tokenizer conveniently throws that into the trash.
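To give a taste of what an AST might look like in code, here's a C++ sketch of some node types for arithmetic. The names here are hypothetical; we'll define the real thing when we get to the implementation:

```cpp
#include <memory>

struct ast {
    virtual ~ast() = default;
};
using ast_ptr = std::unique_ptr<ast>;

// A number literal, like 3.
struct ast_num : ast {
    int value;
    explicit ast_num(int v) : value(v) {}
};

// A binary operation, like 3+2*6. Notice that the operator token is
// reduced to an enum, and punctuation like parentheses is gone entirely:
// the tree structure already captures the grouping.
struct ast_binop : ast {
    enum class op { plus, minus, times, divide };
    op operation;
    ast_ptr left, right;
    ast_binop(op o, ast_ptr l, ast_ptr r)
        : operation(o), left(std::move(l)), right(std::move(r)) {}
};
```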
The grammar for arithmetic actually requires more effort than it might at first appear. We want to make sure that our
parser respects the order of operations. This way, when we have our tree, it will immediately have the structure in
which multiplication is done before addition. We do this by creating separate "levels" in our grammar, with one
nonterminal matching addition and subtraction, while another nonterminal matches multiplication and division.
We want an operation of lower precedence to be __higher__ in our tree than an operation of higher precedence.
For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the left child, and the tree for `2*6` as the right child.
Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have
a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means.
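Using the hypothetical node types sketched above, the tree we want for `3+2*6` would be built like this:

```cpp
// + is the root; the multiplication is its right child, so the
// multiplication sits lower in the tree and happens "first".
ast_ptr tree = std::make_unique<ast_binop>(
    ast_binop::op::plus,
    std::make_unique<ast_num>(3),
    std::make_unique<ast_binop>(
        ast_binop::op::times,
        std::make_unique<ast_num>(2),
        std::make_unique<ast_num>(6)));
```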
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:
$$
\\begin{align}
A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{mult}
\\end{align}
$$
The first rule matches an addition expression, followed by `+` and a multiplication expression. We use \\(A\_{add}\\) itself in the body
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,
> You define addition in terms of addition; how will parsing end? What if there's no addition at all, like `2*6`?
This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication,
without any plusses or minuses." Our rules for multiplication are very similar:
$$
\\begin{align}
A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\
A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\
A\_{mult} & \\rightarrow P
\\end{align}
$$
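To see the two levels cooperate, here's the derivation of `1+2*3` using these rules (collapsing each number's derivation through the \\(P\\) and \\(B\\) rules we're about to define into a single step):

$$
A\_{add} \\rightarrow A\_{add}+A\_{mult} \\rightarrow A\_{mult}+A\_{mult} \\rightarrow 1+A\_{mult} \\rightarrow 1+A\_{mult}*P \\rightarrow 1+2*3
$$

The `+` is expanded first, so it ends up at the root of the tree, exactly as we wanted.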
P, in this case, is an a__p__plication (remember, application has higher precedence than any binary operator).
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.
Application is refreshingly simple:
$$
\\begin{align}
P & \\rightarrow P B \\\\\\
P & \\rightarrow B
\\end{align}
$$
An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier,
or another application followed by a thing.
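This makes application left-associative, just like addition. For instance, `f x y` is derived as

$$
P \\rightarrow PB \\rightarrow PBB \\rightarrow BBB
$$

where the leftmost \\(B\\) is `f`, the middle one is `x`, and the rightmost one is `y`; the grouping is `(f x) y`, meaning `f` is applied to `x`, and the result is applied to `y`.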
We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses:
$$
\\begin{align}
B & \\rightarrow \\text{num} \\\\\\
B & \\rightarrow \\text{lowerVar} \\\\\\
B & \\rightarrow \\text{upperVar} \\\\\\
B & \\rightarrow ( A\_{add} ) \\\\\\
B & \\rightarrow C
\\end{align}
$$
What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that: