Use the new latex shortcode to remove backslashes
parent 31e9e58304
commit b9fcac974d
@ -18,33 +18,35 @@ expand into other things; for us, this is a token).

Let's write a context free grammar (CFG from now on) to match our parenthesis language:

$$
\\begin{align}
S & \\rightarrow ( S ) \\\\\\
S & \\rightarrow ()
\\end{align}
$$
{{< latex >}}
\begin{aligned}
S & \rightarrow ( S ) \\
S & \rightarrow ()
\end{aligned}
{{< /latex >}}

So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
have to do a little more work:

$$
S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((()))
$$
{{< latex >}}
S \rightarrow (S) \rightarrow ((S)) \rightarrow ((()))
{{< /latex >}}
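
To make the replacement process concrete, here is a minimal Python sketch that replays such a derivation for the parenthesis grammar. The function name `derive` and its `depth` parameter are made up for illustration; the sketch only covers this one two-rule grammar.

```python
def derive(depth):
    """List the derivation steps for `depth` nested pairs of parentheses."""
    if depth < 1:
        raise ValueError("the grammar cannot produce the empty string")
    current = "S"
    steps = [current]
    for _ in range(depth - 1):
        # First rule: S -> ( S )
        current = current.replace("S", "(S)")
        steps.append(current)
    # Second rule: S -> ()
    current = current.replace("S", "()")
    steps.append(current)
    return steps


print(" -> ".join(derive(3)))
```

Calling `derive(3)` applies the first rule twice and the second rule once, which is the same sequence of steps as the derivation of `((()))` above.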

In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar
$$
\\begin{align}
S & \\rightarrow Sa \\\\\\
S & \\rightarrow a
\\end{align}
$$

{{< latex >}}
\begin{aligned}
S & \rightarrow Sa \\
S & \rightarrow a
\end{aligned}
{{< /latex >}}

A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules
@ -53,26 +55,36 @@ __left recursive__, and top-down parsers aren't able to handle those grammars.
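
The problem is easy to reproduce. Below is a small Python sketch of a naive top-down parser for the left-recursive grammar. The names `parse_s` and `gets_stuck` are hypothetical, and real top-down parsers are more sophisticated, but the failure is the same: the first rule calls itself before consuming any input.

```python
def parse_s(tokens, pos=0):
    """Naive top-down attempt at S -> S a | a.

    Returns the position just past a parsed S, or None on failure.
    """
    # First rule: S -> S a. The body starts with S, so we recurse
    # immediately, without having consumed a single token.
    after_s = parse_s(tokens, pos)
    if after_s is not None and after_s < len(tokens) and tokens[after_s] == "a":
        return after_s + 1
    # Second rule: S -> a. Never reached, since the call above never returns.
    if pos < len(tokens) and tokens[pos] == "a":
        return pos + 1
    return None


def gets_stuck(text):
    """Report whether the naive parser recursed until Python gave up."""
    try:
        parse_s(list(text))
    except RecursionError:
        return True
    return False


print(gets_stuck("aaa"))
```

Python cuts the recursion off with a `RecursionError` instead of looping forever, which is what `gets_stuck` detects.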
We __could__ rewrite our grammar without using left-recursion, but we don't want to. Instead, we'll use a __bottom up__ parser,
using specifically the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our
goal string, and a "dot" indicating where we are. At first, the dot is behind all the characters:
$$

{{< latex >}}
.aaa
$$
{{< /latex >}}

We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character:
$$

{{< latex >}}
a.aa
$$
{{< /latex >}}

Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):
$$

{{< latex >}}
S.aa
$$
{{< /latex >}}

There's nothing else we can do with the left side, so we shift again:
$$

{{< latex >}}
Sa.a
$$
{{< /latex >}}

Great, we see another body on the left of the dot. We reduce it:
$$

{{< latex >}}
S.a
$$
{{< /latex >}}

Just like before, we shift over the dot, and again, we reduce. We end up with our
start symbol, and nothing on the right of the dot, so we're done!
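
The walkthrough above can be replayed with a short Python sketch. This is a hand-written loop that always reduces when it can and shifts otherwise, hard-coded for this one grammar; it is not the table-driven LALR(1) algorithm, and the name `shift_reduce` is made up for illustration.

```python
def shift_reduce(tokens):
    """Replay the shift/reduce steps for S -> S a | a.

    Returns (accepted, trace), where trace lists the dot positions.
    """
    stack = []            # everything to the left of the dot
    rest = list(tokens)   # everything to the right of the dot
    trace = ["." + "".join(rest)]
    while True:
        if stack[-2:] == ["S", "a"]:
            stack[-2:] = ["S"]           # reduce with S -> S a
        elif stack == ["a"]:
            stack[:] = ["S"]             # reduce with S -> a
        elif rest:
            stack.append(rest.pop(0))    # shift the dot forward
        else:
            break                        # nothing left to do
        trace.append("".join(stack) + "." + "".join(rest))
    return stack == ["S"], trace


accepted, trace = shift_reduce("aaa")
print(accepted, trace)
```

The input is accepted when only the start symbol is left of the dot and nothing remains to its right.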

@ -97,13 +109,15 @@ a tree representing "the multiplication of the result of adding 3 to 2 and 6", w

So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)), to be matched first, and
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:
$$
\\begin{align}
A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
A\_{add} & \\rightarrow A\_{mult}
\\end{align}
$$

{{< latex >}}
\begin{aligned}
A_{add} & \rightarrow A_{add}+A_{mult} \\
A_{add} & \rightarrow A_{add}-A_{mult} \\
A_{add} & \rightarrow A_{mult}
\end{aligned}
{{< /latex >}}

The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
@ -113,51 +127,58 @@ of the tree to be the rightmost `+` or `-`, since that means it will be the "las

This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication,
without any plusses or minuses." Our rules for multiplication are very similar:
$$
\\begin{align}
A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\
A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\
A\_{mult} & \\rightarrow P
\\end{align}
$$

{{< latex >}}
\begin{aligned}
A_{mult} & \rightarrow A_{mult}*P \\
A_{mult} & \rightarrow A_{mult}/P \\
A_{mult} & \rightarrow P
\end{aligned}
{{< /latex >}}

P, in this case, is an application (remember, application has higher precedence than any binary operator).
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.
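
To see the effect of these rules on grouping, here is a rough Python sketch that builds trees for the addition and multiplication rules. It makes two simplifying assumptions: \\(P\\) is reduced to a single number, and the left recursion is unrolled into loops, so this is a hand-written stand-in and not what a generated bottom-up parser does. All function names are hypothetical.

```python
import re


def tokenize(text):
    """Split the input into numbers and the four operators."""
    return re.findall(r"\d+|[-+*/]", text)


def parse_add(tokens):
    """A_add -> A_add + A_mult | A_add - A_mult | A_mult."""
    tree = parse_mult(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        # The tree built so far becomes the left child: left-associativity.
        tree = (op, tree, parse_mult(tokens))
    return tree


def parse_mult(tokens):
    """A_mult -> A_mult * P | A_mult / P | P."""
    tree = parse_p(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        tree = (op, tree, parse_p(tokens))
    return tree


def parse_p(tokens):
    """Simplification for this sketch: P is just a number."""
    return int(tokens.pop(0))


def show(tree):
    """Render a tree with explicit parentheses around every operation."""
    if isinstance(tree, int):
        return str(tree)
    op, left, right = tree
    return f"({show(left)}{op}{show(right)})"


print(show(parse_add(tokenize("1+2+3+4"))))
print(show(parse_add(tokenize("1+2*3"))))
```

Because `parse_add` only ever asks `parse_mult` for its operands, multiplications end up deeper in the tree than additions, which is how the grammar encodes precedence.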

Application is refreshingly simple:
$$
\\begin{align}
P & \\rightarrow P B \\\\\\
P & \\rightarrow B
\\end{align}
$$
An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier,

{{< latex >}}
\begin{aligned}
P & \rightarrow P B \\
P & \rightarrow B
\end{aligned}
{{< /latex >}}

An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier,
or another application followed by a thing.
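
As a small illustration of how this rule groups its operands, the Python sketch below folds a list of already-parsed "things" into nested applications. The tuple shape `("app", function, argument)` and the name `parse_application` are assumptions made for this example only.

```python
from functools import reduce


def parse_application(things):
    """Group a non-empty list of "things" according to P -> P B | B."""
    if not things:
        raise ValueError("an application needs at least one thing")
    # Each new thing is applied to everything parsed so far.
    return reduce(lambda func, arg: ("app", func, arg), things[1:], things[0])


print(parse_application(["multiply", "2", "6"]))
```

A single thing is returned unchanged, which corresponds to the second rule; anything longer nests to the left, so `multiply 2 6` reads as `(multiply 2) 6`.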

We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses:
$$
\\begin{align}
B & \\rightarrow \text{num} \\\\\\
B & \\rightarrow \text{lowerVar} \\\\\\
B & \\rightarrow \text{upperVar} \\\\\\
B & \\rightarrow ( A\_{add} ) \\\\\\
B & \\rightarrow C
\\end{align}
$$

{{< latex >}}
\begin{aligned}
B & \rightarrow \text{num} \\
B & \rightarrow \text{lowerVar} \\
B & \rightarrow \text{upperVar} \\
B & \rightarrow ( A_{add} ) \\
B & \rightarrow C
\end{aligned}
{{< /latex >}}

What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:
$$
\\begin{align}
C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B\\} \\\\\\
L\_B & \\rightarrow R \\; L\_B \\\\\\
L\_B & \\rightarrow R \\\\\\
R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\
N & \\rightarrow \\text{lowerVar} \\\\\\
N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\
L\_L & \\rightarrow \\epsilon
\\end{align}
$$

{{< latex >}}
\begin{aligned}
C & \rightarrow \text{case} \; A_{add} \; \text{of} \; \{ L_B\} \\
L_B & \rightarrow R \; L_B \\
L_B & \rightarrow R \\
R & \rightarrow N \; \text{arrow} \; \{ A_{add} \} \\
N & \rightarrow \text{lowerVar} \\
N & \rightarrow \text{upperVar} \; L_L \\
L_L & \rightarrow \text{lowerVar} \; L_L \\
L_L & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}

\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the
form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name
(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be
@ -167,40 +188,43 @@ We use this because a constructor can have no arguments (like Nil).

We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
where multiply is a function that, well, multiplies. We start with \\(A_{add}\\):
$$
\\begin{align}
& A\_{add} \\\\\\
& \\rightarrow A\_{add} + A\_{mult} \\\\\\
& \\rightarrow A\_{mult} + A\_{mult} \\\\\\
& \\rightarrow P + A\_{mult} \\\\\\
& \\rightarrow B + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\
& \\rightarrow \\text{num(3)} + P \\\\\\
& \\rightarrow \\text{num(3)} + B \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\
& \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\
& \\rightarrow \\text{num(3)} + (P) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
& \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
\\end{align}
$$

{{< latex >}}
\begin{aligned}
& A_{add} \\
& \rightarrow A_{add} + A_{mult} \\
& \rightarrow A_{mult} + A_{mult} \\
& \rightarrow P + A_{mult} \\
& \rightarrow B + A_{mult} \\
& \rightarrow \text{num(3)} + A_{mult} \\
& \rightarrow \text{num(3)} + P \\
& \rightarrow \text{num(3)} + B \\
& \rightarrow \text{num(3)} + (A_{add}) \\
& \rightarrow \text{num(3)} + (A_{mult}) \\
& \rightarrow \text{num(3)} + (P) \\
& \rightarrow \text{num(3)} + (P \; \text{num(6)}) \\
& \rightarrow \text{num(3)} + (P \; \text{num(2)} \; \text{num(6)}) \\
& \rightarrow \text{num(3)} + (\text{lowerVar(multiply)} \; \text{num(2)} \; \text{num(6)}) \\
\end{aligned}
{{< /latex >}}

We're almost there. We now want a rule for a "something that can appear at the top level of a program", like
a function or data type declaration. We make a new set of rules:
$$
\\begin{align}
T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P =\\{ A\_{add} \\} \\\\\\
T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\
L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\
L\_D & \\rightarrow D \\\\\\
L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\
L\_P & \\rightarrow \\epsilon \\\\\\
D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
L\_U & \\rightarrow \\epsilon
\\end{align}
$$

{{< latex >}}
\begin{aligned}
T & \rightarrow \text{defn} \; \text{lowerVar} \; L_P =\{ A_{add} \} \\
T & \rightarrow \text{data} \; \text{upperVar} = \{ L_D \} \\
L_D & \rightarrow D \; , \; L_D \\
L_D & \rightarrow D \\
L_P & \rightarrow \text{lowerVar} \; L_P \\
L_P & \rightarrow \epsilon \\
D & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}

That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either
a function or a data definition. A function definition consists of the keyword "defn",
followed by a function name (starting with a lowercase letter), followed by a list of
@ -213,12 +237,12 @@ a constructor of the data type, followed by a list \\(L\_U\\) of zero or more up
the types of the arguments of the constructor).

Finally, we want one or more of these declarations in a valid program:
$$
\\begin{align}
G & \\rightarrow T \\; G \\\\\\
G & \\rightarrow T
\\end{align}
$$
{{< latex >}}
\begin{aligned}
G & \rightarrow T \; G \\
G & \rightarrow T
\end{aligned}
{{< /latex >}}

Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar.
It's called Bison, and it is frequently used with Flex. Before we get to Bison, though, we need to pay a debt we've already