diff --git a/content/blog/02_compiler_parsing.md b/content/blog/02_compiler_parsing.md index 9d13599..31efd67 100644 --- a/content/blog/02_compiler_parsing.md +++ b/content/blog/02_compiler_parsing.md @@ -13,9 +13,9 @@ recognizing tokens. For instance, consider a simple language of a matching number of open and closed parentheses, like `()` and `((()))`. You can't write a regular expression for it! We resort to a wider class of languages, called __context free languages__. These languages are ones that are matched by __context free grammars__. -A context free grammar is a list of rules in the form of \\(S \\rightarrow \\alpha\\), where -\\(S\\) is a __nonterminal__ (conceptualy, a thing that expands into other things), and -\\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't +A context free grammar is a list of rules in the form of \(S \rightarrow \alpha\), where +\(S\) is a __nonterminal__ (conceptually, a thing that expands into other things), and +\(\alpha\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't expand into other things; for us, this is a token). Let's write a context free grammar (CFG from now on) to match our parenthesis language: @@ -27,9 +27,9 @@ S & \rightarrow () \end{aligned} {{< /latex >}} -So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string, +So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \(S\). Then, to get a desired string, we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`, -we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we +we start with \(S\) and replace it with the body of the second one of its rules. This gives us `()` right away.
To get `((()))`, we have to do a little more work: {{< latex >}} @@ -38,8 +38,8 @@ S \rightarrow (S) \rightarrow ((S)) \rightarrow ((())) In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse -a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will -first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse +a nonterminal by trying to match each symbol in its body. In the rule \(S \rightarrow \alpha \beta \gamma\), it will +first try to match \(\alpha\), then \(\beta\), and so on. If one of the three contains a nonterminal, it will attempt to parse that nonterminal following the same strategy. However, this leaves a flaw - For instance, consider the grammar {{< latex >}} @@ -49,8 +49,8 @@ S & \rightarrow a \end{aligned} {{< /latex >}} -A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will -try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)... +A top down parser will start with \(S\). It will then try the first rule, which starts with \(S\). So, dutifully, it will +try to parse __that__ \(S\). And to do that, it will once again try the first rule, and find that it starts with another \(S\)... This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules __left recursive__, and top-down parsers aren't able to handle those grammars. @@ -68,8 +68,8 @@ We see nothing interesting on the left side of the dot, so we move (or __shift__ a.aa {{< /latex >}} -Now, on the left side of the dot, we see something! 
In particular, we see the body of one of the rules for \\(S\\) (the second one). -So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)): +Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \(S\) (the second one). +So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\(S\)): {{< latex >}} S.aa @@ -94,7 +94,7 @@ start symbol, and nothing on the right of the dot, so we're done! In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language". Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree captures the structure of our language, and is easier to work with than its textual representation. The structure -of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\) +of the tree we build will often mimic the structure of our grammar: a rule in the form \(S \rightarrow A B\) will result in a tree named "S", with two children corresponding the trees built for A and B. Since an AST captures the structure of the language, we'll be able to toss away some punctuation like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally, @@ -109,8 +109,8 @@ For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means. -So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\), to be matched first, and -for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). 
So we write the following rules: +So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \(A_{add}\)) to be matched first, and +for its children to be trees created by the multiplication rule, \(A_{mult}\). So we write the following rules: {{< latex >}} \begin{aligned} @@ -120,7 +120,7 @@ A_{add} & \rightarrow A_{mult} \end{aligned} {{< /latex >}} -The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body +The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \(A_{add}\) on the left side of \(+\) and \(-\) in the body because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking, @@ -139,7 +139,7 @@ A_{mult} & \rightarrow P {{< /latex >}} P, in this case, is an application (remember, application has higher precedence than any binary operator). -Once again, if there's no `*` or `\`, we simply fall through to a \\(P\\) nonterminal, representing application. +Once again, if there's no `*` or `/`, we simply fall through to a \(P\) nonterminal, representing application.
Application is refreshingly simple: @@ -150,7 +150,7 @@ P & \rightarrow B \end{aligned} {{< /latex >}} -An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier, +An application is either only one "thing" (represented with \(B\), for base), such as a number or an identifier, or another application followed by a thing. We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized @@ -166,7 +166,7 @@ B & \rightarrow C \end{aligned} {{< /latex >}} -What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that: +What's the last \(C\)? We also want a "thing" to be a case expression. Here are the rules for that: {{< latex >}} \begin{aligned} @@ -181,15 +181,15 @@ L_L & \rightarrow \epsilon \end{aligned} {{< /latex >}} -\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the -form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name -(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be -lowercase names, and a list of those arguments will be represented with \\(L\_L\\). One of the bodies -of this nonterminal is just the character \\(\\epsilon\\), which just means "nothing". +\(L_B\) is the list of branches in our case expression. \(R\) is a single branch, which is in the +form `Pattern -> Expression`. \(N\) is a pattern, which we will for now define to be either a variable name +(\(\text{lowerVar}\)), or a constructor with some arguments. The arguments of a constructor will be +lowercase names, and a list of those arguments will be represented with \(L_L\). One of the bodies +of this nonterminal is just the character \(\epsilon\), which just means "nothing". We use this because a constructor can have no arguments (like Nil). 
We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`, -where multiply is a function that, well, multiplies. We start with \\(A_{add}\\): +where multiply is a function that, well, multiplies. We start with \(A_{add}\): {{< latex >}} \begin{aligned} @@ -227,15 +227,15 @@ L_U & \rightarrow \epsilon \end{aligned} {{< /latex >}} -That's a lot of rules! \\(T\\) is the "top-level declaration rule. It matches either +That's a lot of rules! \(T\) is the "top-level declaration" rule. It matches either a function or a data definition. A function definition consists of the keyword "defn", followed by a function name (starting with a lowercase letter), followed by a list of -parameters, represented by \\(L\_P\\). +parameters, represented by \(L_P\). A data type definition consists of the name of the data type (starting with an uppercase letter), -and a list \\(L\_D\\) of data constructors \\(D\\). There must be at least one data constructor in this list, +and a list \(L_D\) of data constructors \(D\). There must be at least one data constructor in this list, so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing -a constructor of the data type, followed by a list \\(L\_U\\) of zero or more uppercase variables (representing +a constructor of the data type, followed by a list \(L_U\) of zero or more uppercase variables (representing the types of the arguments of the constructor). Finally, we want one or more of these declarations in a valid program: @@ -266,7 +266,7 @@ Next, observe that there's a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present in the parser, too. Each rule can produce a value, which we call a __semantic value__.
For type safety, we allow -each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \\(A_{add}\\) must +each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \(A_{add}\) must produce an expression. We specify the type of each nonterminal and using `%type` directives. The types of terminals are specified when they're declared.
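
To sanity-check the precedence and associativity layering described in the post (\(A_{add}\) above \(A_{mult}\), both left-associative), here is a minimal sketch in Python. It is not the post's parser: the post uses a generated bottom-up parser with C++ actions, while this sketch is hand-written and replaces each left-recursive rule with a loop, which is a common substitute when writing a parser by hand. The names `tokenize`, `parse_base`, `parse_mult`, `parse_add` and `parse` are hypothetical, the "semantic value" of every rule is simply a nested tuple, and application, identifiers, parentheses and case expressions are left out.

```python
import re

# Hypothetical token pattern: integer literals and the four binary operators.
TOKEN = re.compile(r"\s*(\d+|[-+*/])")


def tokenize(text):
    """Split a string such as '3+2*6' into a list of token strings."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_base(tokens, pos):
    """Stand-in for B: only integer literals are accepted in this sketch."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError("expected a number")
    return int(tokens[pos]), pos + 1


def parse_mult(tokens, pos):
    """A_mult: operands joined by * or /; the left recursion becomes a loop."""
    tree, pos = parse_base(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_base(tokens, pos + 1)
        tree = (op, tree, right)  # the tree built so far becomes the left child
    return tree, pos


def parse_add(tokens, pos):
    """A_add: A_mult operands joined by + or -, again built left to right."""
    tree, pos = parse_mult(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_mult(tokens, pos + 1)
        tree = (op, tree, right)
    return tree, pos


def parse(text):
    """Parse a whole string and return its tree as nested tuples."""
    tokens = tokenize(text)
    tree, pos = parse_add(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return tree
```

With these definitions, `parse("3+2*6")` returns `('+', 3, ('*', 2, 6))`, so `+` ends up at the root as the post requires, and `parse("1+2+3+4")` nests to the left, matching `((1+2)+3)+4`.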