Update 'compiler: parsing' article to new string literals
Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
parent
d9d5c8bf14
commit
05a31dd4d4
@ -13,9 +13,9 @@ recognizing tokens. For instance, consider a simple language of a matching
|
||||
number of open and closed parentheses, like `()` and `((()))`. You can't
|
||||
write a regular expression for it! We resort to a wider class of languages, called
|
||||
__context free languages__. These languages are ones that are matched by __context free grammars__.
|
||||
A context free grammar is a list of rules in the form of \\(S \\rightarrow \\alpha\\), where
|
||||
\\(S\\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
|
||||
\\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
|
||||
A context free grammar is a list of rules in the form of \(S \rightarrow \alpha\), where
|
||||
\(S\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
|
||||
\(\alpha\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
|
||||
expand into other things; for us, this is a token).
|
||||
|
||||
Let's write a context free grammar (CFG from now on) to match our parenthesis language:
|
||||
@ -27,9 +27,9 @@ S & \rightarrow ()
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
|
||||
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \(S\). Then, to get a desired string,
|
||||
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
|
||||
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
|
||||
we start with \(S\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
|
||||
have to do a little more work:
|
||||
|
||||
{{< latex >}}
|
||||
@ -38,8 +38,8 @@ S \rightarrow (S) \rightarrow ((S)) \rightarrow ((()))
|
||||
|
||||
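The derivation above can be mirrored in code. Here is a minimal sketch of a hand-written recognizer for the parenthesis language, assuming the two rules are \(S \rightarrow (S)\) and \(S \rightarrow ()\). The names `parse_s` and `matches` are made up for this illustration and are not part of the compiler built in this series.

```cpp
#include <cstddef>
#include <string>

// Try to match one S starting at position pos.
// On success, advance pos past the matched text and return true.
bool parse_s(const std::string& input, std::size_t& pos) {
    // Both rules for S begin with '('.
    if (pos >= input.size() || input[pos] != '(') return false;
    ++pos;
    // Rule S -> ( ): the very next character closes the pair.
    if (pos < input.size() && input[pos] == ')') {
        ++pos;
        return true;
    }
    // Rule S -> ( S ): parse a nested S, then expect the closing ')'.
    if (!parse_s(input, pos)) return false;
    if (pos >= input.size() || input[pos] != ')') return false;
    ++pos;
    return true;
}

// A string is in the language if a single S consumes all of it.
bool matches(const std::string& input) {
    std::size_t pos = 0;
    return parse_s(input, pos) && pos == input.size();
}
```

Each recursive call to `parse_s` corresponds to one replacement of \(S\) in the derivation, so `matches("((()))")` makes three calls, one per pair of parentheses.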
In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
|
||||
of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
|
||||
a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will
|
||||
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
|
||||
a nonterminal by trying to match each symbol in its body. In the rule \(S \rightarrow \alpha \beta \gamma\), it will
|
||||
first try to match \(\alpha\), then \(\beta\), and so on. If one of the three contains a nonterminal, it will attempt to parse
|
||||
that nonterminal following the same strategy. However, this strategy has a flaw. Consider the grammar
|
||||
|
||||
{{< latex >}}
|
||||
@ -49,8 +49,8 @@ S & \rightarrow a
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
|
||||
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
|
||||
A top down parser will start with \(S\). It will then try the first rule, which starts with \(S\). So, dutifully, it will
|
||||
try to parse __that__ \(S\). And to do that, it will once again try the first rule, and find that it starts with another \(S\)...
|
||||
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules
|
||||
is called __left recursive__, and top-down parsers aren't able to handle those grammars.
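To see the problem concretely, here is a small sketch of a naive top-down parser, assuming the grammar's rules are \(S \rightarrow S a\) and \(S \rightarrow a\). The function names and the depth limit of 1000 are inventions for this demonstration; the limit is only there so that the example stops instead of overflowing the stack.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Naive top-down parse of S, assuming the rules S -> S a and S -> a.
// The depth guard exists only so this demonstration terminates; a real
// parser would keep recursing until the call stack overflows.
bool parse_s(const std::string& input, std::size_t& pos, int depth) {
    if (depth > 1000) {
        throw std::runtime_error("left recursion: no input was consumed");
    }
    // First rule, S -> S a: the body starts with S, so we must parse an S
    // before reading any character. Note that pos has not moved.
    const std::size_t start = pos;
    if (parse_s(input, pos, depth + 1) &&
        pos < input.size() && input[pos] == 'a') {
        ++pos;
        return true;
    }
    // Second rule, S -> a: never reached, since the call above never returns.
    pos = start;
    if (pos < input.size() && input[pos] == 'a') {
        ++pos;
        return true;
    }
    return false;
}

// Returns true if the parser above got stuck on the given input.
bool gets_stuck(const std::string& input) {
    std::size_t pos = 0;
    try {
        parse_s(input, pos, 0);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

The parser gets stuck regardless of the input, because it recurses before looking at a single character.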
|
||||
|
||||
@ -68,8 +68,8 @@ We see nothing interesting on the left side of the dot, so we move (or __shift__
|
||||
a.aa
|
||||
{{< /latex >}}
|
||||
|
||||
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
|
||||
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):
|
||||
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \(S\) (the second one).
|
||||
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\(S\)):
|
||||
|
||||
{{< latex >}}
|
||||
S.aa
|
||||
@ -94,7 +94,7 @@ start symbol, and nothing on the right of the dot, so we're done!
|
||||
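The shift and reduce steps traced above can be sketched as a loop. This is only an illustration for the toy grammar, assuming its rules are \(S \rightarrow S a\) and \(S \rightarrow a\); the names `Sym` and `shift_reduce_matches` are hypothetical, and a generated parser decides when to reduce using tables rather than hand-written checks like these.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Symbols that can appear to the left of the dot.
enum class Sym { a, S, other };

// Shift-reduce recognizer for the assumed grammar S -> S a | a.
bool shift_reduce_matches(const std::string& input) {
    std::vector<Sym> left;  // everything to the left of the dot
    std::size_t dot = 0;    // index of the next unread character
    while (true) {
        const std::size_t n = left.size();
        if (n >= 2 && left[n - 2] == Sym::S && left[n - 1] == Sym::a) {
            // Reduce by S -> S a: replace "S a" with "S".
            left.resize(n - 2);
            left.push_back(Sym::S);
        } else if (n == 1 && left[0] == Sym::a) {
            // Reduce by S -> a: replace the lone "a" with "S".
            left[0] = Sym::S;
        } else if (dot < input.size()) {
            // Nothing to reduce, so shift one character past the dot.
            left.push_back(input[dot] == 'a' ? Sym::a : Sym::other);
            ++dot;
        } else {
            break;  // nothing to reduce and nothing left to shift
        }
    }
    // Done if only the start symbol remains to the left of the dot.
    return left.size() == 1 && left[0] == Sym::S;
}
```

Unlike the top-down approach, this loop always reads a character before it looks for a rule to apply, so the left recursive rule causes no trouble.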
In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language".
|
||||
Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree
|
||||
captures the structure of our language, and is easier to work with than its textual representation. The structure
|
||||
of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\)
|
||||
of the tree we build will often mimic the structure of our grammar: a rule in the form \(S \rightarrow A B\)
|
||||
will result in a tree named "S", with two children corresponding to the trees built for A and B. Since
|
||||
an AST captures the structure of the language, we'll be able to toss away some punctuation
|
||||
like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally,
|
||||
@ -109,8 +109,8 @@ For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the
|
||||
Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have
|
||||
a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means.
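To make the difference between the two trees concrete, here is a small sketch that builds both and evaluates them. The `Ast` struct and its helper functions are made up for this example and are not the tree representation used later in the series.

```cpp
#include <memory>

// A tiny expression tree: either a number leaf or a binary operation.
struct Ast {
    char op = 'n';  // '+', '*', or 'n' for a number leaf
    int value = 0;  // used only when op == 'n'
    std::unique_ptr<Ast> left;
    std::unique_ptr<Ast> right;
};

std::unique_ptr<Ast> num(int v) {
    auto node = std::make_unique<Ast>();
    node->value = v;
    return node;
}

std::unique_ptr<Ast> bin(char op, std::unique_ptr<Ast> l, std::unique_ptr<Ast> r) {
    auto node = std::make_unique<Ast>();
    node->op = op;
    node->left = std::move(l);
    node->right = std::move(r);
    return node;
}

// Evaluate a tree bottom-up: children first, then the operation at the root.
int eval(const Ast& t) {
    if (t.op == 'n') return t.value;
    const int l = eval(*t.left);
    const int r = eval(*t.right);
    return t.op == '+' ? l + r : l * r;
}

// '+' at the root: "the addition of 3 and the result of multiplying 2 by 6".
std::unique_ptr<Ast> plus_at_root() {
    return bin('+', num(3), bin('*', num(2), num(6)));
}

// '*' at the root: "the multiplication of the result of adding 3 to 2 and 6".
std::unique_ptr<Ast> times_at_root() {
    return bin('*', bin('+', num(3), num(2)), num(6));
}
```

Evaluating `plus_at_root()` gives 15, the value we expect from `3+2*6`, while `times_at_root()` gives 30.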
|
||||
|
||||
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and
|
||||
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:
|
||||
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \(A_{add}\)) to be matched first, and
|
||||
for its children to be trees created by the multiplication rule, \(A_{mult}\). So we write the following rules:
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
@ -120,7 +120,7 @@ A_{add} & \rightarrow A_{mult}
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body
|
||||
The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \(A_{add}\) on the left side of \(+\) and \(-\) in the body
|
||||
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
|
||||
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
|
||||
of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,
|
||||
@ -139,7 +139,7 @@ A_{mult} & \rightarrow P
|
||||
{{< /latex >}}
|
||||
|
||||
P, in this case, is an application (remember, application has higher precedence than any binary operator).
|
||||
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.
|
||||
Once again, if there's no `*` or `/`, we simply fall through to a \(P\) nonterminal, representing application.
|
||||
|
||||
Application is refreshingly simple:
|
||||
|
||||
@ -150,7 +150,7 @@ P & \rightarrow B
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier,
|
||||
An application is either only one "thing" (represented with \(B\), for base), such as a number or an identifier,
|
||||
or another application followed by a thing.
|
||||
|
||||
We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
|
||||
@ -166,7 +166,7 @@ B & \rightarrow C
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:
|
||||
What's the last \(C\)? We also want a "thing" to be a case expression. Here are the rules for that:
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
@ -181,15 +181,15 @@ L_L & \rightarrow \epsilon
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the
|
||||
form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name
|
||||
(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be
|
||||
lowercase names, and a list of those arguments will be represented with \\(L\_L\\). One of the bodies
|
||||
of this nonterminal is just the character \\(\\epsilon\\), which just means "nothing".
|
||||
\(L_B\) is the list of branches in our case expression. \(R\) is a single branch, which is in the
|
||||
form `Pattern -> Expression`. \(N\) is a pattern, which we will for now define to be either a variable name
|
||||
(\(\text{lowerVar}\)), or a constructor with some arguments. The arguments of a constructor will be
|
||||
lowercase names, and a list of those arguments will be represented with \(L_L\). One of the bodies
|
||||
of this nonterminal is just the character \(\epsilon\), which just means "nothing".
|
||||
We use this because a constructor can have no arguments (like Nil).
|
||||
|
||||
We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
|
||||
where multiply is a function that, well, multiplies. We start with \\(A_{add}\\):
|
||||
where multiply is a function that, well, multiplies. We start with \(A_{add}\):
|
||||
|
||||
{{< latex >}}
|
||||
\begin{aligned}
|
||||
@ -227,15 +227,15 @@ L_U & \rightarrow \epsilon
|
||||
\end{aligned}
|
||||
{{< /latex >}}
|
||||
|
||||
That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either
|
||||
That's a lot of rules! \(T\) is the "top-level declaration" rule. It matches either
|
||||
a function or a data definition. A function definition consists of the keyword "defn",
|
||||
followed by a function name (starting with a lowercase letter), followed by a list of
|
||||
parameters, represented by \\(L\_P\\).
|
||||
parameters, represented by \(L_P\).
|
||||
|
||||
A data type definition consists of the name of the data type (starting with an uppercase letter),
|
||||
and a list \\(L\_D\\) of data constructors \\(D\\). There must be at least one data constructor in this list,
|
||||
and a list \(L_D\) of data constructors \(D\). There must be at least one data constructor in this list,
|
||||
so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing
|
||||
a constructor of the data type, followed by a list \\(L\_U\\) of zero or more uppercase variables (representing
|
||||
a constructor of the data type, followed by a list \(L_U\) of zero or more uppercase variables (representing
|
||||
the types of the arguments of the constructor).
|
||||
|
||||
Finally, we want one or more of these declarations in a valid program:
|
||||
@ -266,7 +266,7 @@ Next, observe that there's
|
||||
a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression
|
||||
with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present
|
||||
in the parser, too. Each rule can produce a value, which we call a __semantic value__. For type safety, we allow
|
||||
each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \\(A_{add}\\) must
|
||||
each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \(A_{add}\) must
|
||||
produce an expression. We specify the type of each nonterminal using `%type` directives. The types of terminals
|
||||
are specified when they're declared.
|
||||
|
||||