Use the new latex shortcode to remove backslashes
parent
31e9e58304
commit
b9fcac974d
|
@ -18,33 +18,35 @@ expand into other things; for us, this is a token).
|
|||
|
||||
Let's write a context free grammar (CFG from now on) to match our parenthesis language:
|
||||
|
||||
$$
|
||||
\\begin{align}
|
||||
S & \\rightarrow ( S ) \\\\\\
|
||||
S & \\rightarrow ()
|
||||
\\end{align}
|
||||
$$
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
S & \rightarrow ( S ) \\
|
||||
S & \rightarrow ()
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
|
||||
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
|
||||
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
|
||||
have to do a little more work:
|
||||
|
||||
$$
|
||||
S \\rightarrow (S) \\rightarrow ((S)) \\rightarrow ((()))
|
||||
$$
|
||||
{{< latex >}}
|
||||
S \rightarrow (S) \rightarrow ((S)) \rightarrow ((()))
|
||||
{{< /latex >}}
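
The derivation above can be reproduced mechanically. The following is a minimal sketch, not part of the original post; the function name `derive` and the choice to represent the sentential form as a plain string are assumptions made for illustration. It applies the first rule until the nesting is deep enough, then finishes with the second rule.

```python
def derive(depth):
    """List the derivation steps for `depth` nested pairs of parentheses.

    Uses the grammar S -> ( S ) | ( ). `derive` is a hypothetical helper,
    not something defined elsewhere in the post.
    """
    if depth < 1:
        raise ValueError("the grammar cannot produce the empty string")
    current = "S"
    steps = [current]
    for _ in range(depth - 1):
        # Rule 1: replace the nonterminal S with ( S ).
        current = current.replace("S", "(S)", 1)
        steps.append(current)
    # Rule 2: replace the last remaining S with ().
    current = current.replace("S", "()", 1)
    steps.append(current)
    return steps


print(" -> ".join(derive(3)))
```

Calling `derive(3)` walks through the same four sentential forms as the derivation of `((()))` shown above.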
|
||||
|
||||
In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
|
||||
of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
|
||||
a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), such a parser will
|
||||
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
|
||||
that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar
|
||||
$$
|
||||
\\begin{align}
|
||||
S & \\rightarrow Sa \\\\\\
|
||||
S & \\rightarrow a
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
S & \rightarrow Sa \\
|
||||
S & \rightarrow a
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
|
||||
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
|
||||
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules
|
||||
|
@ -53,26 +55,36 @@ __left recursive__, and top-down parsers aren't able to handle those grammars.
|
|||
We __could__ rewrite our grammar without using left-recursion, but we don't want to. Instead, we'll use a __bottom up__ parser,
|
||||
using specifically the LALR(1) parsing algorithm. Here's an example of how it works, using our left-recursive grammar. We start with our
|
||||
goal string, and a "dot" indicating where we are. At first, the dot is to the left of all the characters:
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
.aaa
|
||||
$$
|
||||
{{< /latex >}}
|
||||
|
||||
We see nothing interesting on the left side of the dot, so we move (or __shift__) the dot forward by one character:
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
a.aa
|
||||
$$
|
||||
{{< /latex >}}
|
||||
|
||||
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
|
||||
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
S.aa
|
||||
$$
|
||||
{{< /latex >}}
|
||||
|
||||
There's nothing else we can do with the left side, so we shift again:
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
Sa.a
|
||||
$$
|
||||
{{< /latex >}}
|
||||
|
||||
Great, we see another body on the left of the dot. We reduce it:
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
S.a
|
||||
$$
|
||||
{{< /latex >}}
|
||||
|
||||
Just like before, we shift over the dot, and again, we reduce. We end up with our
|
||||
start symbol, and nothing on the right of the dot, so we're done!
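
The shift and reduce steps above can be sketched in a few lines. This is a deliberately naive illustration, not the LALR(1) algorithm itself: it has no parse tables or lookahead, and simply reduces whenever a rule body appears to the left of the dot, trying the longer body first. The names `RULES` and `shift_reduce` are assumptions made for this example.

```python
# Grammar S -> S a | a, with the longer body listed first so that
# "Sa" is reduced as a whole instead of reducing its trailing "a".
RULES = [("S", ["S", "a"]), ("S", ["a"])]


def shift_reduce(text):
    """Return (accepted, trace) for a naive shift-reduce pass over `text`."""
    stack = []          # symbols on the left side of the dot
    rest = list(text)   # characters on the right side of the dot
    trace = ["." + text]
    while True:
        for head, body in RULES:
            if stack[-len(body):] == body:
                # Reduce: replace the rule body with the rule's left hand side.
                stack[-len(body):] = [head]
                break
        else:
            if not rest:
                break   # nothing to reduce and nothing left to shift
            # Shift: move the dot forward by one character.
            stack.append(rest.pop(0))
        trace.append("".join(stack) + "." + "".join(rest))
    return stack == ["S"], trace


accepted, trace = shift_reduce("aaa")
print(accepted)
print(" ".join(trace))
```

For the input `aaa`, the trace passes through the same positions as the walkthrough: `.aaa`, `a.aa`, `S.aa`, `Sa.a`, `S.a`, and finally `Sa.` and `S.`. A real LALR(1) parser decides between shifting and reducing using a state machine and one token of lookahead, which this sketch does not attempt.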
|
||||
|
||||
|
@ -97,13 +109,15 @@ a tree representing "the multiplication of the result of adding 3 to 2 and 6", w
|
|||
|
||||
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and
|
||||
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:
|
||||
$$
|
||||
\\begin{align}
|
||||
A\_{add} & \\rightarrow A\_{add}+A\_{mult} \\\\\\
|
||||
A\_{add} & \\rightarrow A\_{add}-A\_{mult} \\\\\\
|
||||
A\_{add} & \\rightarrow A\_{mult}
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
A_{add} & \rightarrow A_{add}+A_{mult} \\
|
||||
A_{add} & \rightarrow A_{add}-A_{mult} \\
|
||||
A_{add} & \rightarrow A_{mult}
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body
|
||||
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
|
||||
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
|
||||
|
@ -113,51 +127,58 @@ of the tree to be the rightmost `+` or `-`, since that means it will be the "las
|
|||
|
||||
This is the purpose of the third rule, which serves to say "an addition expression can just be a multiplication,
|
||||
without any plusses or minuses." Our rules for multiplication are very similar:
|
||||
$$
|
||||
\\begin{align}
|
||||
A\_{mult} & \\rightarrow A\_{mult}*P \\\\\\
|
||||
A\_{mult} & \\rightarrow A\_{mult}/P \\\\\\
|
||||
A\_{mult} & \\rightarrow P
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
A_{mult} & \rightarrow A_{mult}*P \\
|
||||
A_{mult} & \rightarrow A_{mult}/P \\
|
||||
A_{mult} & \rightarrow P
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
P, in this case, is an application (remember, application has higher precedence than any binary operator).
|
||||
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.
|
||||
|
||||
Application is refreshingly simple:
|
||||
$$
|
||||
\\begin{align}
|
||||
P & \\rightarrow P B \\\\\\
|
||||
P & \\rightarrow B
|
||||
\\end{align}
|
||||
$$
|
||||
An application is either only one "thing" (represented with \\(B\\), for __b__ase), such as a number or an identifier,
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
P & \rightarrow P B \\
|
||||
P & \rightarrow B
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier,
|
||||
or another application followed by a thing.
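
Because the application rule is left-recursive, a sequence of things nests to the left: `f x y` is read as `(f x) y`. The sketch below illustrates only that nesting, assuming the things have already been recognized; the function name `parse_application` and the tuple-based tree are hypothetical, not the representation used later in the series.

```python
def parse_application(things):
    """Fold a non-empty list of base "things" into a left-nested tree."""
    if not things:
        raise ValueError("an application needs at least one thing")
    tree = things[0]                  # P -> B
    for thing in things[1:]:
        tree = ("app", tree, thing)   # P -> P B
    return tree


print(parse_application(["multiply", "2", "6"]))
```

Here `multiply 2 6` becomes the application of `multiply 2` to `6`, with a single thing on its own left unchanged.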
|
||||
|
||||
We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
|
||||
arithmetic expression a "thing", allowing us to wrap right back around and recognize anything inside parentheses:
|
||||
$$
|
||||
\\begin{align}
|
||||
B & \\rightarrow \text{num} \\\\\\
|
||||
B & \\rightarrow \text{lowerVar} \\\\\\
|
||||
B & \\rightarrow \text{upperVar} \\\\\\
|
||||
B & \\rightarrow ( A\_{add} ) \\\\\\
|
||||
B & \\rightarrow C
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
B & \rightarrow \text{num} \\
|
||||
B & \rightarrow \text{lowerVar} \\
|
||||
B & \rightarrow \text{upperVar} \\
|
||||
B & \rightarrow ( A_{add} ) \\
|
||||
B & \rightarrow C
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:
|
||||
$$
|
||||
\\begin{align}
|
||||
C & \\rightarrow \\text{case} \\; A\_{add} \\; \\text{of} \\; \\{ L\_B\\} \\\\\\
|
||||
L\_B & \\rightarrow R \\; L\_B \\\\\\
|
||||
L\_B & \\rightarrow R \\\\\\
|
||||
R & \\rightarrow N \\; \\text{arrow} \\; \\{ A\_{add} \\} \\\\\\
|
||||
N & \\rightarrow \\text{lowerVar} \\\\\\
|
||||
N & \\rightarrow \\text{upperVar} \\; L\_L \\\\\\
|
||||
L\_L & \\rightarrow \\text{lowerVar} \\; L\_L \\\\\\
|
||||
L\_L & \\rightarrow \\epsilon
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
C & \rightarrow \text{case} \; A_{add} \; \text{of} \; \{ L_B\} \\
|
||||
L_B & \rightarrow R \; L_B \\
|
||||
L_B & \rightarrow R \\
|
||||
R & \rightarrow N \; \text{arrow} \; \{ A_{add} \} \\
|
||||
N & \rightarrow \text{lowerVar} \\
|
||||
N & \rightarrow \text{upperVar} \; L_L \\
|
||||
L_L & \rightarrow \text{lowerVar} \; L_L \\
|
||||
L_L & \rightarrow \epsilon
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the
|
||||
form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name
|
||||
(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be
|
||||
|
@ -167,40 +188,43 @@ We use this because a constructor can have no arguments (like Nil).
|
|||
|
||||
We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
|
||||
where multiply is a function that, well, multiplies. We start with \\(A_{add}\\):
|
||||
$$
|
||||
\\begin{align}
|
||||
& A\_{add} \\\\\\
|
||||
& \\rightarrow A\_{add} + A\_{mult} \\\\\\
|
||||
& \\rightarrow A\_{mult} + A\_{mult} \\\\\\
|
||||
& \\rightarrow P + A\_{mult} \\\\\\
|
||||
& \\rightarrow B + A\_{mult} \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + A\_{mult} \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + P \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + B \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (A\_{add}) \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (A\_{mult}) \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (P) \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(6)}) \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (P \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
|
||||
& \\rightarrow \\text{num(3)} + (\\text{lowerVar(multiply)} \\; \\text{num(2)} \\; \\text{num(6)}) \\\\\\
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
& A_{add} \\
|
||||
& \rightarrow A_{add} + A_{mult} \\
|
||||
& \rightarrow A_{mult} + A_{mult} \\
|
||||
& \rightarrow P + A_{mult} \\
|
||||
& \rightarrow B + A_{mult} \\
|
||||
& \rightarrow \text{num(3)} + A_{mult} \\
|
||||
& \rightarrow \text{num(3)} + P \\
|
||||
& \rightarrow \text{num(3)} + B \\
|
||||
& \rightarrow \text{num(3)} + (A_{add}) \\
|
||||
& \rightarrow \text{num(3)} + (A_{mult}) \\
|
||||
& \rightarrow \text{num(3)} + (P) \\
|
||||
& \rightarrow \text{num(3)} + (P \; \text{num(6)}) \\
|
||||
& \rightarrow \text{num(3)} + (P \; \text{num(2)} \; \text{num(6)}) \\
|
||||
& \rightarrow \text{num(3)} + (\text{lowerVar(multiply)} \; \text{num(2)} \; \text{num(6)}) \\
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
We're almost there. We now want a rule for a "something that can appear at the top level of a program", like
|
||||
a function or data type declaration. We make a new set of rules:
|
||||
$$
|
||||
\\begin{align}
|
||||
T & \\rightarrow \\text{defn} \\; \\text{lowerVar} \\; L\_P =\\{ A\_{add} \\} \\\\\\
|
||||
T & \\rightarrow \\text{data} \\; \\text{upperVar} = \\{ L\_D \\} \\\\\\
|
||||
L\_D & \\rightarrow D \\; , \\; L\_D \\\\\\
|
||||
L\_D & \\rightarrow D \\\\\\
|
||||
L\_P & \\rightarrow \\text{lowerVar} \\; L\_P \\\\\\
|
||||
L\_P & \\rightarrow \\epsilon \\\\\\
|
||||
D & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
|
||||
L\_U & \\rightarrow \\text{upperVar} \\; L\_U \\\\\\
|
||||
L\_U & \\rightarrow \\epsilon
|
||||
\\end{align}
|
||||
$$
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
T & \rightarrow \text{defn} \; \text{lowerVar} \; L_P =\{ A_{add} \} \\
|
||||
T & \rightarrow \text{data} \; \text{upperVar} = \{ L_D \} \\
|
||||
L_D & \rightarrow D \; , \; L_D \\
|
||||
L_D & \rightarrow D \\
|
||||
L_P & \rightarrow \text{lowerVar} \; L_P \\
|
||||
L_P & \rightarrow \epsilon \\
|
||||
D & \rightarrow \text{upperVar} \; L_U \\
|
||||
L_U & \rightarrow \text{upperVar} \; L_U \\
|
||||
L_U & \rightarrow \epsilon
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either
|
||||
a function or a data definition. A function definition consists of the keyword "defn",
|
||||
followed by a function name (starting with a lowercase letter), followed by a list of
|
||||
|
@ -213,12 +237,12 @@ a constructor of the data type, followed by a list \\(L\_U\\) of zero or more up
|
|||
the types of the arguments of the constructor).
|
||||
|
||||
Finally, we want one or more of these declarations in a valid program:
|
||||
$$
|
||||
\\begin{align}
|
||||
G & \\rightarrow T \\; G \\\\\\
|
||||
G & \\rightarrow T
|
||||
\\end{align}
|
||||
$$
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
G & \rightarrow T \; G \\
|
||||
G & \rightarrow T
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
Just like with tokenizing, there exists a piece of software that will generate a bottom-up parser for us, given our grammar.
|
||||
It's called Bison, and it is frequently used with Flex. Before we get to Bison, though, we need to pay a debt we've already
|
||||
|
|