Compare commits: 31e9e58304...80d722568e (2 commits)

	| Author | SHA1 | Date | |
|---|---|---|---|
| 80d722568e | |||
| b9fcac974d | 
| @ -18,33 +18,35 @@ expand into other things; for us, this is a token). | ||||
| 
 | ||||
| Let's write a context free grammar (CFG from now on) to match our parenthesis language: | ||||
| 
 | ||||
| $$ | ||||
| \\begin{align} | ||||
| S & \\rightarrow ( S ) \\\\\\ | ||||
| S & \\rightarrow () | ||||
| \\end{align} | ||||
| $$ | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| S & \rightarrow ( S ) \\ | ||||
| S & \rightarrow () | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string, | ||||
| we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`, | ||||
| we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we | ||||
| have to do a little more work: | ||||
| 
 | ||||
| $$ | ||||
| S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((())) | ||||
| $$ | ||||
| {{< latex >}} | ||||
| S \rightarrow (S) \rightarrow ((S)) \rightarrow ((())) | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets | ||||
| of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse | ||||
| a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will | ||||
| first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse | ||||
| that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar | ||||
| $$ | ||||
| \\begin{align} | ||||
| S & \\rightarrow Sa \\\\\\ | ||||
| S & \\rightarrow a | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| S & \rightarrow Sa \\ | ||||
| S & \rightarrow a | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will | ||||
| try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)... | ||||
| This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules | ||||
| @ -53,26 +55,36 @@ __left recursive__, and top-down parsers aren't able to handle those grammars. | ||||
| We __could__ rewrite our grammar without using left-recursion, but we don't want to. Instead, we'll use a __bottom up__ parser,  | ||||
| using specifically the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our | ||||
| goal string, and a "dot" indicating where we are. At first, the dot is before all the characters: | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| .aaa | ||||
| $$ | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character: | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| a.aa | ||||
| $$ | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one). | ||||
| So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)): | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| S.aa | ||||
| $$ | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| There's nothing else we can do with the left side, so we shift again: | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| Sa.a | ||||
| $$ | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| Great, we see another body on the left of the dot. We reduce it: | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| S.a | ||||
| $$ | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| Just like before, we shift over the dot, and again, we reduce. We end up with our | ||||
| start symbol, and nothing on the right of the dot, so we're done! | ||||
| 
 | ||||
| @ -97,13 +109,15 @@ a tree representing "the multiplication of the result of adding 3 to 2 and 6", w | ||||
| 
 | ||||
| So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and | ||||
| for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules: | ||||
| $$ | ||||
| \\begin{align} | ||||
| A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\ | ||||
| A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\ | ||||
| A\_{add} & \\rightarrow A\_{mult} | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| A_{add} & \rightarrow A_{add}+A_{mult} \\ | ||||
| A_{add} & \rightarrow A_{add}-A_{mult} \\ | ||||
| A_{add} & \rightarrow A_{mult} | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body | ||||
| because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because | ||||
| subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level | ||||
| @ -113,51 +127,58 @@ of the tree to be the rightmost `+` or `-`, since that means it will be the "las | ||||
| 
 | ||||
| This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication, | ||||
| without any plusses or minuses." Our rules for multiplication are very similar: | ||||
| $$ | ||||
| \\begin{align} | ||||
| A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\ | ||||
| A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\ | ||||
| A\_{mult} & \\rightarrow P | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| A_{mult} & \rightarrow A_{mult}*P \\ | ||||
| A_{mult} & \rightarrow A_{mult}/P \\ | ||||
| A_{mult} & \rightarrow P | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| P, in this case, is an application (remember, application has higher precedence than any binary operator). | ||||
| Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application. | ||||
| 
 | ||||
| Application is refreshingly simple: | ||||
| $$ | ||||
| \\begin{align} | ||||
| P & \\rightarrow P B \\\\\\ | ||||
| P & \\rightarrow B | ||||
| \\end{align} | ||||
| $$ | ||||
| An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier, | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| P & \rightarrow P B \\ | ||||
| P & \rightarrow B | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier, | ||||
| or another application followed by a thing. | ||||
| 
 | ||||
| We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized | ||||
| arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses: | ||||
| $$ | ||||
| \\begin{align} | ||||
| B & \\rightarrow \text{num} \\\\\\ | ||||
| B & \\rightarrow \text{lowerVar} \\\\\\ | ||||
| B & \\rightarrow \text{upperVar} \\\\\\ | ||||
| B & \\rightarrow ( A\_{add} ) \\\\\\ | ||||
| B & \\rightarrow C | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| B & \rightarrow \text{num} \\ | ||||
| B & \rightarrow \text{lowerVar} \\ | ||||
| B & \rightarrow \text{upperVar} \\ | ||||
| B & \rightarrow ( A_{add} ) \\ | ||||
| B & \rightarrow C | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that: | ||||
| $$ | ||||
| \\begin{align} | ||||
| C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B\\} \\\\\\ | ||||
| L\_B & \\rightarrow R \\; L\_B \\\\\\ | ||||
| L\_B & \\rightarrow R \\\\\\ | ||||
| R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\ | ||||
| N & \\rightarrow \\text{lowerVar} \\\\\\ | ||||
| N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\ | ||||
| L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\ | ||||
| L\_L & \\rightarrow \\epsilon | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| C & \rightarrow \text{case} \; A_{add} \; \text{of} \; \{ L_B\} \\ | ||||
| L_B & \rightarrow R \; L_B \\ | ||||
| L_B & \rightarrow R \\ | ||||
| R & \rightarrow N \; \text{arrow} \; \{ A_{add} \} \\ | ||||
| N & \rightarrow \text{lowerVar} \\ | ||||
| N & \rightarrow \text{upperVar} \; L_L \\ | ||||
| L_L & \rightarrow \text{lowerVar} \; L_L \\ | ||||
| L_L & \rightarrow \epsilon | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| \\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the | ||||
| form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name | ||||
| (\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be | ||||
| @ -167,40 +188,43 @@ We use this because a constructor can have no arguments (like Nil). | ||||
| 
 | ||||
| We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`, | ||||
| where multiply is a function that, well, multiplies. We start with \\(A_{add}\\): | ||||
| $$ | ||||
| \\begin{align} | ||||
| & A\_{add} \\\\\\ | ||||
| & \\rightarrow A\_{add} + A\_{mult} \\\\\\ | ||||
| & \\rightarrow A\_{mult} + A\_{mult} \\\\\\ | ||||
| & \\rightarrow P + A\_{mult} \\\\\\ | ||||
| & \\rightarrow B + A\_{mult} \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + P \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + B \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (P) \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\ | ||||
| & \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\ | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| & A_{add} \\ | ||||
| & \rightarrow A_{add} + A_{mult} \\ | ||||
| & \rightarrow A_{mult} + A_{mult} \\ | ||||
| & \rightarrow P + A_{mult} \\ | ||||
| & \rightarrow B + A_{mult} \\ | ||||
| & \rightarrow \text{num(3)} + A_{mult} \\ | ||||
| & \rightarrow \text{num(3)} + P \\ | ||||
| & \rightarrow \text{num(3)} + B \\ | ||||
| & \rightarrow \text{num(3)} + (A_{add}) \\ | ||||
| & \rightarrow \text{num(3)} + (A_{mult}) \\ | ||||
| & \rightarrow \text{num(3)} + (P) \\ | ||||
| & \rightarrow \text{num(3)} + (P \; \text{num(6)}) \\ | ||||
| & \rightarrow \text{num(3)} + (P \; \text{num(2)} \; \text{num(6)}) \\ | ||||
| & \rightarrow \text{num(3)} + (\text{lowerVar(multiply)} \; \text{num(2)} \; \text{num(6)}) \\ | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| We're almost there. We now want a rule for "something that can appear at the top level of a program", like | ||||
| a function or data type declaration. We make a new set of rules: | ||||
| $$ | ||||
| \\begin{align} | ||||
| T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P =\\{ A\_{add} \\} \\\\\\ | ||||
| T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\ | ||||
| L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\ | ||||
| L\_D & \\rightarrow D \\\\\\ | ||||
| L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\ | ||||
| L\_P & \\rightarrow \\epsilon \\\\\\ | ||||
| D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\ | ||||
| L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\ | ||||
| L\_U & \\rightarrow \\epsilon | ||||
| \\end{align} | ||||
| $$ | ||||
| 
 | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| T & \rightarrow \text{defn} \; \text{lowerVar} \; L_P =\{ A_{add} \} \\ | ||||
| T & \rightarrow \text{data} \; \text{upperVar} = \{ L_D \} \\ | ||||
| L_D & \rightarrow D \; , \; L_D \\ | ||||
| L_D & \rightarrow D \\ | ||||
| L_P & \rightarrow \text{lowerVar} \; L_P \\ | ||||
| L_P & \rightarrow \epsilon \\ | ||||
| D & \rightarrow \text{upperVar} \; L_U \\ | ||||
| L_U & \rightarrow \text{upperVar} \; L_U \\ | ||||
| L_U & \rightarrow \epsilon | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either | ||||
| a function or a data definition. A function definition consists of the keyword "defn", | ||||
| followed by a function name (starting with a lowercase letter), followed by a list of | ||||
| @ -213,12 +237,12 @@ a constructor of the data type, followed by a list \\(L\_U\\) of zero or more up | ||||
| the types of the arguments of the constructor). | ||||
| 
 | ||||
| Finally, we want one or more of these declarations in a valid program: | ||||
| $$ | ||||
| \\begin{align} | ||||
| G & \\rightarrow T \\; G \\\\\\ | ||||
| G & \\rightarrow T | ||||
| \\end{align} | ||||
| $$ | ||||
| {{< latex >}} | ||||
| \begin{aligned} | ||||
| G & \rightarrow T \; G \\ | ||||
| G & \rightarrow T | ||||
| \end{aligned} | ||||
| {{< /latex >}} | ||||
| 
 | ||||
| Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar. | ||||
| It's called Bison, and it is frequently used with Flex. Before we get to Bison, though, we need to pay a debt we've already | ||||
|  | ||||
| @ -13,7 +13,9 @@ | ||||
|     <link rel="stylesheet" href="{{ $sidenotes.Permalink }}"> | ||||
|     <link rel="icon" type="image/png" href="{{ $icon.Permalink }}"> | ||||
| 
 | ||||
|     <script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-MML-AM_CHTML' async></script> | ||||
|     <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.css" integrity="sha384-zB1R0rpPzHqg7Kpt0Aljp8JPLqbXI3bhnPWROx27a9N0Ll6ZP/+DiW/UqRcLbRjq" crossorigin="anonymous"> | ||||
|     <script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/katex.min.js" integrity="sha384-y23I5Q6l+B6vatafAwxRu/0oK/79VlbSz7Q9aiSZUvyWYIYsd+qj+o24G5ZU2zJz" crossorigin="anonymous"></script> | ||||
|     <script defer src="https://cdn.jsdelivr.net/npm/katex@0.11.1/dist/contrib/auto-render.min.js" integrity="sha384-kWPLUVMOks5AQFrykwIup5lo0m3iMkkHrD0uJ4H5cjeGihAutqP0yW0J6dpFiVkI" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script> | ||||
|     {{ template "_internal/google_analytics.html" . }} | ||||
| 
 | ||||
|     <title>{{ .Title }}</title> | ||||
|  | ||||
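
Reviewer note: the shift/reduce walkthrough in the changed post (`.aaa`, `a.aa`, `S.aa`, ...) can be reproduced with a small script, which may help when checking that the converted `{{< latex >}}` blocks still show the right steps. This is only a minimal sketch for the two-rule grammar S → Sa and S → a. It is a hand-written greedy recognizer, not the table-driven LALR(1) algorithm the post refers to, and the function name `shift_reduce` is made up for illustration.

```python
def shift_reduce(tokens):
    """Recognize strings of the grammar S -> S a | a.

    Returns (accepted, trace), where trace lists the "dot" positions
    in the same notation the post uses (stack, dot, remaining input).
    """
    stack = []            # symbols to the left of the dot
    rest = list(tokens)   # symbols to the right of the dot

    def snapshot():
        return "".join(stack) + "." + "".join(rest)

    trace = [snapshot()]
    while True:
        if stack[-2:] == ["S", "a"]:
            # Reduce with the first rule: S -> S a
            stack[-2:] = ["S"]
        elif stack == ["a"]:
            # Reduce with the second rule: S -> a
            stack[:] = ["S"]
        elif rest:
            # Nothing to reduce, so shift the dot one symbol forward
            stack.append(rest.pop(0))
        else:
            break
        trace.append(snapshot())

    # Accept only if everything was reduced to the start symbol
    return stack == ["S"], trace


if __name__ == "__main__":
    accepted, steps = shift_reduce("aaa")
    for step in steps:
        print(step)
    print("accepted" if accepted else "rejected")
```

Running it on `aaa` prints each intermediate state, so the sequence can be compared against the states shown in the post. Reducing eagerly before shifting is enough for this grammar; a real LALR(1) parser decides between shifting and reducing using a lookahead token and a generated table.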