diff --git a/content/blog/02_compiler_parsing.md b/content/blog/02_compiler_parsing.md index 6b09da2..1898e94 100644 --- a/content/blog/02_compiler_parsing.md +++ b/content/blog/02_compiler_parsing.md @@ -18,33 +18,35 @@ expand into other things; for us, this is a token). Let's write a context free grammar (CFG from now on) to match our parenthesis language: -$$ -\\begin{align} -S & \\rightarrow ( S ) \\\\\\ -S & \\rightarrow () -\\end{align} -$$ +{{< latex >}} +\begin{aligned} +S & \rightarrow ( S ) \\ +S & \rightarrow () +\end{aligned} +{{< /latex >}} So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string, we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`, we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we have to do a little more work: -$$ -S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((())) -$$ +{{< latex >}} +S \rightarrow (S) \rightarrow ((S)) \rightarrow ((())) +{{< /latex >}} In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse that nonterminal following the same strategy. 
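To make the top down strategy concrete, here is a minimal sketch of a hand-written recognizer for the parenthesis grammar above. It is only an illustration, not the parser we'll end up using: the names `parse_s` and `matches` are made up for this example, and we assume the input is a plain string of `(` and `)` characters rather than a token stream.

```python
def parse_s(text: str, pos: int = 0) -> int:
    """Try to match the nonterminal S starting at text[pos].

    Returns the position just past the matched S, or -1 on failure.
    (Hypothetical helper, written only for this illustration.)
    """
    # Both rules for S begin with the terminal '('.
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    # Rule S -> (): the next symbol is the closing parenthesis.
    if pos < len(text) and text[pos] == ")":
        return pos + 1
    # Rule S -> ( S ): parse the inner S using the same strategy,
    # then expect the closing parenthesis.
    pos = parse_s(text, pos)
    if pos == -1 or pos >= len(text) or text[pos] != ")":
        return -1
    return pos + 1


def matches(text: str) -> bool:
    """A string is in the language if a single S consumes all of it."""
    return parse_s(text) == len(text)


print(matches("((()))"))
print(matches("(()"))
```

Each symbol in a rule body becomes one step in the function, and the nonterminal \\(S\\) in the body becomes a recursive call. Note that the recursive call only happens after a `(` has been consumed, so the function always makes progress.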
However, this leaves a flaw - For instance, consider the grammar -$$ -\\begin{align} -S & \\rightarrow Sa \\\\\\ -S & \\rightarrow a -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +S & \rightarrow Sa \\ +S & \rightarrow a +\end{aligned} +{{< /latex >}} + A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)... This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules @@ -53,26 +55,36 @@ __left recursive__, and top-down parsers aren't able to handle those grammars. We __could__ rewrite our grammar without using left-recursion, but we don't want to. Instead, we'll use a __bottom up__ parser, using specifically the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our goal string, and a "dot" indicating where we are. At first, the dot is behind all the characters: -$$ + +{{< latex >}} .aaa -$$ +{{< /latex >}} + We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character: -$$ + +{{< latex >}} a.aa -$$ +{{< /latex >}} + Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one). So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)): -$$ + +{{< latex >}} S.aa -$$ +{{< /latex >}} + There's nothing else we can do with the left side, so we shift again: -$$ + +{{< latex >}} Sa.a -$$ +{{< /latex >}} + Great, we see another body on the left of the dot. We reduce it: -$$ + +{{< latex >}} S.a -$$ +{{< /latex >}} + Just like before, we shift over the dot, and again, we reduce. 
We end up with our start symbol, and nothing on the right of the dot, so we're done! @@ -97,13 +109,15 @@ a tree representing "the multiplication of the result of adding 3 to 2 and 6", w So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\), to be matched first, and for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules: -$$ -\\begin{align} -A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\ -A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\ -A\_{add} & \\rightarrow A\_{mult} -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +A_{add} & \rightarrow A_{add}+A_{mult} \\ +A_{add} & \rightarrow A_{add}-A_{mult} \\ +A_{add} & \rightarrow A_{mult} +\end{aligned} +{{< /latex >}} + The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level @@ -113,51 +127,58 @@ of the tree to be the rightmost `+` or `-`, since that means it will be the "las This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication, without any plusses or minuses." 
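To see why the left-recursive shape of these rules gives us the grouping we want, here is a small sketch that builds the same left-nested trees by hand. It makes several simplifying assumptions: the operands are plain integers standing in for \\(A\_{mult}\\), the input is well-formed, and a simple loop replaces the LALR(1) parser we'll actually use. The names `parse_add`, `show` and `evaluate` are made up for this example.

```python
import re


def parse_add(text: str):
    """Build a left-nested tree for integers joined by '+' and '-'.

    A tree is either an int or a tuple (operator, left, right).
    Assumes well-formed input such as "1+2+3" (illustration only).
    """
    tokens = re.findall(r"\d+|[+-]", text)
    # Rule A_add -> A_mult: the first operand is a tree on its own.
    tree = int(tokens[0])
    # Rules A_add -> A_add + A_mult and A_add -> A_add - A_mult:
    # everything parsed so far becomes the LEFT child of the new node.
    for i in range(1, len(tokens), 2):
        operator, operand = tokens[i], int(tokens[i + 1])
        tree = (operator, tree, operand)
    return tree


def show(tree) -> str:
    """Render a tree with explicit parentheses around every operation."""
    if isinstance(tree, int):
        return str(tree)
    operator, left, right = tree
    return f"({show(left)}{operator}{show(right)})"


def evaluate(tree) -> int:
    """Compute the value of a tree, children first."""
    if isinstance(tree, int):
        return tree
    operator, left, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if operator == "+" else lhs - rhs


print(show(parse_add("1+2+3+4")))
print(evaluate(parse_add("10-4-3")))
```

The rightmost operator ends up at the top of the tree, so it is applied last. For `10-4-3` this grouping evaluates `(10-4)-3`, which is 3, whereas grouping to the right would compute `10-(4-3)`, which is 9.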
Our rules for multiplication are very similar: -$$ -\\begin{align} -A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\ -A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\ -A\_{mult} & \\rightarrow P -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +A_{mult} & \rightarrow A_{mult}*P \\ +A_{mult} & \rightarrow A_{mult}/P \\ +A_{mult} & \rightarrow P +\end{aligned} +{{< /latex >}} P, in this case, is an application (remember, application has higher precedence than any binary operator). Once again, if there's no `*` or `\`, we simply fall through to a \\(P\\) nonterminal, representing application. Application is refreshingly simple: -$$ -\\begin{align} -P & \\rightarrow P B \\\\\\ -P & \\rightarrow B -\\end{align} -$$ -An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier, + +{{< latex >}} +\begin{aligned} +P & \rightarrow P B \\ +P & \rightarrow B +\end{aligned} +{{< /latex >}} + +An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier, or another application followed by a thing. We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses: -$$ -\\begin{align} -B & \\rightarrow \text{num} \\\\\\ -B & \\rightarrow \text{lowerVar} \\\\\\ -B & \\rightarrow \text{upperVar} \\\\\\ -B & \\rightarrow ( A\_{add} ) \\\\\\ -B & \\rightarrow C -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +B & \rightarrow \text{num} \\ +B & \rightarrow \text{lowerVar} \\ +B & \rightarrow \text{upperVar} \\ +B & \rightarrow ( A_{add} ) \\ +B & \rightarrow C +\end{aligned} +{{< /latex >}} + What's the last \\(C\\)? We also want a "thing" to be a case expression. 
Here are the rules for that: -$$ -\\begin{align} -C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B\\} \\\\\\ -L\_B & \\rightarrow R \\; L\_B \\\\\\ -L\_B & \\rightarrow R \\\\\\ -R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\ -N & \\rightarrow \\text{lowerVar} \\\\\\ -N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\ -L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\ -L\_L & \\rightarrow \\epsilon -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +C & \rightarrow \text{case} \; A_{add} \; \text{of} \; \{ L_B\} \\ +L_B & \rightarrow R \; L_B \\ +L_B & \rightarrow R \\ +R & \rightarrow N \; \text{arrow} \; \{ A_{add} \} \\ +N & \rightarrow \text{lowerVar} \\ +N & \rightarrow \text{upperVar} \; L_L \\ +L_L & \rightarrow \text{lowerVar} \; L_L \\ +L_L & \rightarrow \epsilon +\end{aligned} +{{< /latex >}} + \\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name (\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be @@ -167,40 +188,43 @@ We use this because a constructor can have no arguments (like Nil). We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`, where multiply is a function that, well, multiplies. 
We start with \\(A_{add}\\): -$$ -\\begin{align} -& A\_{add} \\\\\\ -& \\rightarrow A\_{add} + A\_{mult} \\\\\\ -& \\rightarrow A\_{mult} + A\_{mult} \\\\\\ -& \\rightarrow P + A\_{mult} \\\\\\ -& \\rightarrow B + A\_{mult} \\\\\\ -& \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\ -& \\rightarrow \\text{num(3)} + P \\\\\\ -& \\rightarrow \\text{num(3)} + B \\\\\\ -& \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\ -& \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\ -& \\rightarrow \\text{num(3)} + (P) \\\\\\ -& \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\ -& \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\ -& \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\ -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +& A_{add} \\ +& \rightarrow A_{add} + A_{mult} \\ +& \rightarrow A_{mult} + A_{mult} \\ +& \rightarrow P + A_{mult} \\ +& \rightarrow B + A_{mult} \\ +& \rightarrow \text{num(3)} + A_{mult} \\ +& \rightarrow \text{num(3)} + P \\ +& \rightarrow \text{num(3)} + B \\ +& \rightarrow \text{num(3)} + (A_{add}) \\ +& \rightarrow \text{num(3)} + (A_{mult}) \\ +& \rightarrow \text{num(3)} + (P) \\ +& \rightarrow \text{num(3)} + (P \; \text{num(6)}) \\ +& \rightarrow \text{num(3)} + (P \; \text{num(2)} \; \text{num(6)}) \\ +& \rightarrow \text{num(3)} + (\text{lowerVar(multiply)} \; \text{num(2)} \; \text{num(6)}) \\ +\end{aligned} +{{< /latex >}} We're almost there. We now want a rule for a "something that can appear at the top level of a program", like a function or data type declaration. 
We make a new set of rules: -$$ -\\begin{align} -T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P =\\{ A\_{add} \\} \\\\\\ -T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\ -L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\ -L\_D & \\rightarrow D \\\\\\ -L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\ -L\_P & \\rightarrow \\epsilon \\\\\\ -D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\ -L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\ -L\_U & \\rightarrow \\epsilon -\\end{align} -$$ + +{{< latex >}} +\begin{aligned} +T & \rightarrow \text{defn} \; \text{lowerVar} \; L_P =\{ A_{add} \} \\ +T & \rightarrow \text{data} \; \text{upperVar} = \{ L_D \} \\ +L_D & \rightarrow D \; , \; L_D \\ +L_D & \rightarrow D \\ +L_P & \rightarrow \text{lowerVar} \; L_P \\ +L_P & \rightarrow \epsilon \\ +D & \rightarrow \text{upperVar} \; L_U \\ +L_U & \rightarrow \text{upperVar} \; L_U \\ +L_U & \rightarrow \epsilon +\end{aligned} +{{< /latex >}} + That's a lot of rules! \\(T\\) is the "top-level declaration rule. It matches either a function or a data definition. A function definition consists of the keyword "defn", followed by a function name (starting with a lowercase letter), followed by a list of @@ -213,12 +237,12 @@ a constructor of the data type, followed by a list \\(L\_U\\) of zero or more up the types of the arguments of the constructor). Finally, we want one or more of these declarations in a valid program: -$$ -\\begin{align} -G & \\rightarrow T \\; G \\\\\\ -G & \\rightarrow T -\\end{align} -$$ +{{< latex >}} +\begin{aligned} +G & \rightarrow T \; G \\ +G & \rightarrow T +\end{aligned} +{{< /latex >}} Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar. It's called Bison, and it is frequently used with Flex. Before we get to bison, though, we need to pay a debt we've already