---
title: Compiling a Functional Language Using C++, Part 2 - Parsing
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---

In the previous post, we covered tokenizing. We learned how to convert an input string into logical segments, and even wrote up a tokenizer to do it according to the rules of our language. Now, it's time to make sense of the tokens, and parse our language.

### The Theory

The rules to parse a language are more complicated than the rules for recognizing tokens. For instance, consider a simple language of a matching number of open and closed parentheses, like `()` and `((()))`. You can't write a regular expression for it! We resort to a wider class of languages, called __context free languages__. These languages are ones that are matched by __context free grammars__. A context free grammar is a list of rules in the form \\(S \\rightarrow \\alpha\\), where \\(S\\) is a __nonterminal__ (conceptually, a thing that expands into other things), and \\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't expand into other things; for us, this is a token).

Let's write a context free grammar (CFG from now on) to match our parenthesis language:

$$
\\begin{align}
S & \\rightarrow ( S ) \\\\\\
S & \\rightarrow ()
\\end{align}
$$

So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string, we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`, we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we have to do a little more work:

$$
S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((()))
$$

In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse a nonterminal by trying to match each symbol in its body. Given the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), a top down parser will first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse that nonterminal following the same strategy. However, this approach has a flaw. For instance, consider the grammar

$$
\\begin{align}
S & \\rightarrow Sa \\\\\\
S & \\rightarrow a
\\end{align}
$$

A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)... This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules is called __left recursive__, and top-down parsers aren't able to handle such grammars.
We __could__ rewrite our grammar without using left recursion, but we don't want to. Instead, we'll use a __bottom up__ parser, specifically one using the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our goal string, and a "dot" indicating where we are. At first, the dot is behind all the characters:

$$
.aaa
$$

We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character:

$$
a.aa
$$

Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one). So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):

$$
S.aa
$$

There's nothing else we can do with the left side, so we shift again:

$$
Sa.a
$$

Great, we see another rule body (\\(Sa\\), from the first rule) on the left of the dot. We reduce it:

$$
S.a
$$

Just like before, we shift over the dot, and again, we reduce. We end up with our start symbol, and nothing on the right of the dot, so we're done!

### The Practice

In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language". Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree captures the structure of our language, and is easier to work with than its textual representation. The structure of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\) will result in a tree named "S", with two children corresponding to the trees built for \\(A\\) and \\(B\\). Since an AST captures the structure of the language, we'll be able to toss away some punctuation like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally, we will write our grammar ignoring whitespace, since our tokenizer conveniently throws that into the trash.

The grammar for arithmetic actually involves more effort than it would appear at first. We want to make sure that our parser respects the order of operations. This way, when we have our tree, it will immediately have the structure in which multiplication is done before addition. We do this by creating separate "levels" in our grammar, with one nonterminal matching addition and subtraction, and another nonterminal matching multiplication and division. We want the operation with the lowest precedence to be __higher__ in our tree than one of higher precedence. For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the left child, and the tree for `2*6` as the right child. Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means.

So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:

$$
\\begin{align}
A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{mult}
\\end{align}
$$

The first rule matches another addition, added to the result of a multiplication. We use the addition in the body because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,

> You define addition in terms of addition; how will parsing end? What if there's no addition at all, like `2*6`?

This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication, without any plusses or minuses." Our rules for multiplication are very similar:

$$
\\begin{align}
A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\
A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\
A\_{mult} & \\rightarrow P
\\end{align}
$$

\\(P\\), in this case, is an a__pp__lication (remember, application has higher precedence than any binary operator). Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.

Application is refreshingly simple:

$$
\\begin{align}
P & \\rightarrow P B \\\\\\
P & \\rightarrow B
\\end{align}
$$

An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier, or another application followed by a thing.
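
Because the recursive \\(P\\) sits on the left of the first rule's body, applications group to the left. An application of three "things" derives like this:

$$
\\begin{align}
& P \\\\\\
& \\rightarrow P \\; B \\\\\\
& \\rightarrow P \\; B \\; B \\\\\\
& \\rightarrow B \\; B \\; B
\\end{align}
$$

So an expression like `f x y` means `(f x) y`: the function is applied to `x`, and the result is applied to `y`.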

We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses:

$$
\\begin{align}
B & \\rightarrow \\text{num} \\\\\\
B & \\rightarrow \\text{lowerVar} \\\\\\
B & \\rightarrow \\text{upperVar} \\\\\\
B & \\rightarrow ( A\_{add} ) \\\\\\
B & \\rightarrow C
\\end{align}
$$

What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:

$$
\\begin{align}
C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B \\} \\\\\\
L\_B & \\rightarrow R \\; , \\; L\_B \\\\\\
L\_B & \\rightarrow R \\\\\\
R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\
N & \\rightarrow \\text{lowerVar} \\\\\\
N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\epsilon
\\end{align}
$$

\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name (\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be lowercase names, and a list of those arguments will be represented with \\(L\_L\\). One of the bodies of this nonterminal is just the character \\(\\epsilon\\), which means "nothing". We use this because a constructor can have no arguments (like `Nil`).
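
To make this concrete, here's an example of a case expression these rules accept (assuming `->` is the \\(\\text{arrow}\\) token from our tokenizer, and a hypothetical list type with constructors `Nil` and `Cons`):

```
case l of {
    Nil -> { 0 },
    Cons x xs -> { 1 }
}
```

The `Nil` branch is a constructor pattern with no arguments (the \\(\\epsilon\\) case of \\(L\_L\\)), while `Cons x xs` binds the constructor's two arguments to the lowercase variables `x` and `xs`.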

We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`, where `multiply` is a function that, well, multiplies. We start with \\(A\_{add}\\):

$$
\\begin{align}
& A\_{add} \\\\\\
& \\rightarrow A\_{add} + A\_{mult} \\\\\\
& \\rightarrow A\_{mult} + A\_{mult} \\\\\\
& \\rightarrow P + A\_{mult} \\\\\\
& \\rightarrow B + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + P \\\\\\
& \\rightarrow \\text{num(3)} + B \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\
& \\rightarrow \\text{num(3)} + (P) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)})
\\end{align}
$$

We're almost there. We now want a rule for something that can appear at the top level of a program, like a function or data type declaration. We make a new set of rules:

$$
\\begin{align}
T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P = \\{ A\_{add} \\} \\\\\\
T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\
L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\
L\_D & \\rightarrow D \\\\\\
L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\
L\_P & \\rightarrow \\epsilon \\\\\\
D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\epsilon
\\end{align}
$$

That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either a function or a data definition. A function definition consists of the keyword "defn", followed by a function name (starting with a lowercase letter), followed by a list of parameters (represented by \\(L\_P\\)), an equals sign, and the function's body in curly braces.

A data type definition consists of the keyword "data", followed by the name of the data type (starting with an uppercase letter), an equals sign, and a list \\(L\_D\\) of data constructors \\(D\\) in curly braces. There must be at least one data constructor in this list, so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing a constructor of the data type, followed by a list \\(L\_U\\) of zero or more uppercase variables (representing the types of the arguments of the constructor).
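
Putting the two kinds of declaration together, here's a small program these rules accept (again assuming `->` for the arrow token, and a hypothetical `List` data type):

```
data List = { Nil, Cons Int List }

defn length l = {
    case l of {
        Nil -> { 0 },
        Cons x xs -> { 1 + length xs }
    }
}
```

The `data` declaration lists two constructors: `Nil` with no argument types, and `Cons` with an `Int` and a `List`. The `defn` declares a function with one parameter, `l`, whose body is a case expression.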

Finally, we want one or more of these declarations in a valid program:

$$
\\begin{align}
G & \\rightarrow T \\; G \\\\\\
G & \\rightarrow T
\\end{align}
$$

Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar. It's called Bison, and it is frequently used with Flex. Before we get to Bison, though, we need to pay a debt we've already incurred - the implementation of our AST. Such a tree is language-specific, so Bison doesn't generate it for us. Here's what I came up with:

{{< codeblock "C++" "compiler_ast.hpp" >}}

We create a base class for an expression tree, which we call `ast`. Then, for each possible syntactic construct in our language (a number, a variable, a binary operation, a case expression) we create a subclass of `ast`. The `ast_case` subclass is the most complex, since it must contain a list of case expression branches, which are a combination of a `pattern` and another expression.
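
If you'd like the gist without reading the full listing, a hierarchy like this can take roughly the following shape. This is a heavily trimmed, illustrative sketch; the names and members below are stand-ins, not the exact contents of `compiler_ast.hpp`:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Base class for all expression tree nodes. The virtual destructor
// lets us delete (and dynamic_cast) through base pointers.
struct ast {
    virtual ~ast() = default;
};
using ast_ptr = std::unique_ptr<ast>;

struct ast_int : ast { // number literal
    int value;
    explicit ast_int(int v) : value(v) {}
};

struct ast_lid : ast { // lowercase identifier
    std::string id;
    explicit ast_lid(std::string i) : id(std::move(i)) {}
};

struct ast_binop : ast { // binary operation, e.g. '+', '*'
    char op;
    ast_ptr left, right;
    ast_binop(char o, ast_ptr l, ast_ptr r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

struct pattern { // a case branch's pattern
    std::string constructor;         // empty for a plain variable pattern
    std::vector<std::string> params; // bound variable names
};

struct ast_case : ast { // case expression
    ast_ptr of;                                        // scrutinee
    std::vector<std::pair<pattern, ast_ptr>> branches; // pattern -> body
};
```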

Finally, we get to writing our Bison file, `parser.y`. Here's what I came up with:

{{< rawblock "compiler_parser.y" >}}