Update 'compiler: parsing' article to new string literals
Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
parent
d9d5c8bf14
commit
05a31dd4d4
|
@@ -13,9 +13,9 @@ recognizing tokens. For instance, consider a simple language of a matching
|
||||||
number of open and closed parentheses, like `()` and `((()))`. You can't
|
number of open and closed parentheses, like `()` and `((()))`. You can't
|
||||||
write a regular expression for it! We resort to a wider class of languages, called
|
write a regular expression for it! We resort to a wider class of languages, called
|
||||||
__context free languages__. These languages are ones that are matched by __context free grammars__.
|
__context free languages__. These languages are ones that are matched by __context free grammars__.
|
||||||
A context free grammar is a list of rules in the form of \\(S \\rightarrow \\alpha\\), where
|
A context free grammar is a list of rules in the form of \(S \rightarrow \alpha\), where
|
||||||
\\(S\\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
|
\(S\) is a __nonterminal__ (conceptually, a thing that expands into other things), and
|
||||||
\\(\\alpha\\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
|
\(\alpha\) is a sequence of nonterminals and terminals (a terminal is a thing that doesn't
|
||||||
expand into other things; for us, this is a token).
|
expand into other things; for us, this is a token).
|
||||||
|
|
||||||
Let's write a context free grammar (CFG from now on) to match our parenthesis language:
|
Let's write a context free grammar (CFG from now on) to match our parenthesis language:
|
||||||
|
@@ -27,9 +27,9 @@ S & \rightarrow ()
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \\(S\\). Then, to get a desired string,
|
So, how does this work? We start with a "start symbol" nonterminal, which we usually denote as \(S\). Then, to get a desired string,
|
||||||
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
|
we replace a nonterminal with the sequence of terminals and nonterminals on the right of one of its rules. For instance, to get `()`,
|
||||||
we start with \\(S\\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
|
we start with \(S\) and replace it with the body of the second one of its rules. This gives us `()` right away. To get `((()))`, we
|
||||||
have to do a little more work:
|
have to do a little more work:
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
|
@@ -38,8 +38,8 @@ S \rightarrow (S) \rightarrow ((S)) \rightarrow ((()))
|
||||||
|
|
||||||
In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
|
In practice, there are many ways of using a CFG to parse a programming language. Various parsing algorithms support various subsets
|
||||||
of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
|
of context free languages. For instance, top down parsers follow nearly exactly the structure that we had. They try to parse
|
||||||
a nonterminal by trying to match each symbol in its body. In the rule \\(S \\rightarrow \\alpha \\beta \\gamma\\), it will
|
a nonterminal by trying to match each symbol in its body. In the rule \(S \rightarrow \alpha \beta \gamma\), it will
|
||||||
first try to match \\(\\alpha\\), then \\(\\beta\\), and so on. If one of the three contains a nonterminal, it will attempt to parse
|
first try to match \(\alpha\), then \(\beta\), and so on. If one of the three contains a nonterminal, it will attempt to parse
|
||||||
that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar
|
that nonterminal following the same strategy. However, this strategy has a flaw. For instance, consider the grammar
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
|
@@ -49,8 +49,8 @@ S & \rightarrow a
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
A top down parser will start with \\(S\\). It will then try the first rule, which starts with \\(S\\). So, dutifully, it will
|
A top down parser will start with \(S\). It will then try the first rule, which starts with \(S\). So, dutifully, it will
|
||||||
try to parse __that__ \\(S\\). And to do that, it will once again try the first rule, and find that it starts with another \\(S\\)...
|
try to parse __that__ \(S\). And to do that, it will once again try the first rule, and find that it starts with another \(S\)...
|
||||||
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules
|
This will never end, and the parser will get stuck. A grammar in which a nonterminal can appear in the beginning of one of its rules
|
||||||
is __left recursive__, and top-down parsers aren't able to handle those grammars.
|
is __left recursive__, and top-down parsers aren't able to handle those grammars.
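
To make the top-down strategy concrete, here is a minimal sketch of a recursive-descent recognizer for the parenthesis language from earlier, assuming its two rules are \(S \rightarrow (S)\) and \(S \rightarrow ()\). The names `parseS` and `matchesParens` are made up for this illustration and are not part of the compiler. Neither rule starts with \(S\), so the function always consumes a `(` before calling itself. With a left-recursive rule, the recursive call would happen at the same position, and the function would never return.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Try to match the nonterminal S starting at position pos.
// On success, advance pos past the matched text and return true.
// On failure, leave pos unchanged and return false.
bool parseS(const std::string& input, std::size_t& pos) {
    assert(pos <= input.size());
    // Both assumed rules, S -> ( S ) and S -> ( ), begin with '('.
    if (pos >= input.size() || input[pos] != '(') return false;
    std::size_t next = pos + 1;
    // Rule S -> ( S ): try to match a nested S. If that fails,
    // next is unchanged, which is exactly the S -> ( ) case.
    if (!parseS(input, next)) {
        // Nothing nested; fall through and expect ')' right away.
    }
    if (next >= input.size() || input[next] != ')') return false;
    pos = next + 1;
    return true;
}

// The whole input must be exactly one S, with nothing left over.
bool matchesParens(const std::string& input) {
    std::size_t pos = 0;
    return parseS(input, pos) && pos == input.size();
}
```

Calling `matchesParens("((()))")` follows the same steps as the derivation of `((()))` shown earlier, just driven by the input instead of by hand.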
|
||||||
|
|
||||||
|
@@ -68,8 +68,8 @@ We see nothing interesting on the left side of the dot, so we move (or __shift__
|
||||||
a.aa
|
a.aa
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \\(S\\) (the second one).
|
Now, on the left side of the dot, we see something! In particular, we see the body of one of the rules for \(S\) (the second one).
|
||||||
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\\(S\\)):
|
So we __reduce__ the thing on the left side of the dot, by replacing it with the left hand side of the rule (\(S\)):
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
S.aa
|
S.aa
|
||||||
|
@@ -94,7 +94,7 @@ start symbol, and nothing on the right of the dot, so we're done!
|
||||||
In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language".
|
In practice, we don't want to just match a grammar. That would be like saying "yup, this is our language".
|
||||||
Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree
|
Instead, we want to create something called an __abstract syntax tree__, or AST for short. This tree
|
||||||
captures the structure of our language, and is easier to work with than its textual representation. The structure
|
captures the structure of our language, and is easier to work with than its textual representation. The structure
|
||||||
of the tree we build will often mimic the structure of our grammar: a rule in the form \\(S \\rightarrow A B\\)
|
of the tree we build will often mimic the structure of our grammar: a rule in the form \(S \rightarrow A B\)
|
||||||
will result in a tree named "S", with two children corresponding to the trees built for A and B. Since
|
will result in a tree named "S", with two children corresponding to the trees built for A and B. Since
|
||||||
an AST captures the structure of the language, we'll be able to toss away some punctuation
|
an AST captures the structure of the language, we'll be able to toss away some punctuation
|
||||||
like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally,
|
like `,` and `(`. These tokens will appear in our grammar, but we will tweak our parser to simply throw them away. Additionally,
|
||||||
|
@@ -109,8 +109,8 @@ For instance, for `3+2*6`, we want our tree to have `+` as the root, `3` as the
|
||||||
Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have
|
Why? Because this tree represents "the addition of 3 and the result of multiplying 2 by 6". If we had `*` be the root, we'd have
|
||||||
a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means.
|
a tree representing "the multiplication of the result of adding 3 to 2 and 6", which is __not__ what our expression means.
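
The difference between the two trees is easy to check by evaluating them. The following is only a sketch: the `Expr` struct and the helpers `num`, `bin` and `eval` are hypothetical stand-ins invented for this example, not the AST classes the compiler will actually use.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// A tiny expression tree: a node is either a number (a leaf)
// or a binary operator with two children.
struct Expr {
    char op;                      // '+', '*', or 'n' for a number
    int value;                    // only meaningful when op == 'n'
    std::unique_ptr<Expr> left;
    std::unique_ptr<Expr> right;
};

// Build a leaf holding a number.
std::unique_ptr<Expr> num(int v) {
    auto e = std::make_unique<Expr>();
    e->op = 'n';
    e->value = v;
    return e;
}

// Build an operator node that owns its two subtrees.
std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->op = op;
    e->value = 0;
    e->left = std::move(l);
    e->right = std::move(r);
    return e;
}

// Evaluate children first, then combine them with the node's operator.
int eval(const Expr& e) {
    if (e.op == 'n') return e.value;
    assert(e.left && e.right);  // operators always have two children
    int l = eval(*e.left);
    int r = eval(*e.right);
    return e.op == '+' ? l + r : l * r;
}

// '+' at the root: 3 + (2 * 6)
int plusAtRoot() { return eval(*bin('+', num(3), bin('*', num(2), num(6)))); }

// '*' at the root: (3 + 2) * 6
int timesAtRoot() { return eval(*bin('*', bin('+', num(3), num(2)), num(6))); }
```

Here `plusAtRoot()` gives 15 and `timesAtRoot()` gives 30, so the shape of the tree alone decides what the expression means.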
|
||||||
|
|
||||||
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \\(A\_{add}\\)) to be matched first, and
|
So, with this in mind, we want our rule for __addition__ (represented with the nonterminal \(A_{add}\)) to be matched first, and
|
||||||
for its children to be trees created by the multiplication rule, \\(A\_{mult}\\). So we write the following rules:
|
for its children to be trees created by the multiplication rule, \(A_{mult}\). So we write the following rules:
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
\begin{aligned}
|
\begin{aligned}
|
||||||
|
@@ -120,7 +120,7 @@ A_{add} & \rightarrow A_{mult}
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \\(A\_{add}\\) on the left side of \\(+\\) and \\(-\\) in the body
|
The first rule matches another addition, added to the result of a multiplication. Similarly, the second rule matches another addition, from which the result of a multiplication is then subtracted. We use the \(A_{add}\) on the left side of \(+\) and \(-\) in the body
|
||||||
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
|
because we want to be able to parse strings like `1+2+3+4`, which we want to view as `((1+2)+3)+4` (mostly because
|
||||||
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
|
subtraction is [left-associative](https://en.wikipedia.org/wiki/Operator_associativity)). So, we want the top level
|
||||||
of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,
|
of the tree to be the rightmost `+` or `-`, since that means it will be the "last" operation. You may be asking,
|
||||||
|
@@ -139,7 +139,7 @@ A_{mult} & \rightarrow P
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
P, in this case, is an application (remember, application has higher precedence than any binary operator).
|
P, in this case, is an application (remember, application has higher precedence than any binary operator).
|
||||||
Once again, if there's no `*` or `/`, we simply fall through to a \\(P\\) nonterminal, representing application.
|
Once again, if there's no `*` or `/`, we simply fall through to a \(P\) nonterminal, representing application.
|
||||||
|
|
||||||
Application is refreshingly simple:
|
Application is refreshingly simple:
|
||||||
|
|
||||||
|
@@ -150,7 +150,7 @@ P & \rightarrow B
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
An application is either only one "thing" (represented with \\(B\\), for base), such as a number or an identifier,
|
An application is either only one "thing" (represented with \(B\), for base), such as a number or an identifier,
|
||||||
or another application followed by a thing.
|
or another application followed by a thing.
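
Read as code, "another application followed by a thing" means that applications nest to the left. Below is a small sketch of that idea, assuming the recursive rule has the shape \(P \rightarrow P B\). The function `applyAll` is hypothetical, and it renders the nesting as a parenthesized string rather than building a real tree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fold a non-empty sequence of "things" into a left-nested application,
// mirroring the rule P -> B followed by repeated uses of P -> P B.
std::string applyAll(const std::vector<std::string>& things) {
    assert(!things.empty());  // an application needs at least one thing
    // P -> B: a single thing is already an application.
    std::string result = things[0];
    // P -> P B: wrap what we have so far and append the next thing.
    for (std::size_t i = 1; i < things.size(); ++i) {
        result = "(" + result + " " + things[i] + ")";
    }
    return result;
}
```

For the things `multiply`, `2` and `6`, this produces `((multiply 2) 6)`: `multiply` is applied to `2` first, and the result is then applied to `6`.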
|
||||||
|
|
||||||
We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
|
We now need to define what a "thing" is. As we said before, it's a number, or an identifier. We also make a parenthesized
|
||||||
|
@@ -166,7 +166,7 @@ B & \rightarrow C
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
What's the last \\(C\\)? We also want a "thing" to be a case expression. Here are the rules for that:
|
What's the last \(C\)? We also want a "thing" to be a case expression. Here are the rules for that:
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
\begin{aligned}
|
\begin{aligned}
|
||||||
|
@@ -181,15 +181,15 @@ L_L & \rightarrow \epsilon
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
\\(L\_B\\) is the list of branches in our case expression. \\(R\\) is a single branch, which is in the
|
\(L_B\) is the list of branches in our case expression. \(R\) is a single branch, which is in the
|
||||||
form `Pattern -> Expression`. \\(N\\) is a pattern, which we will for now define to be either a variable name
|
form `Pattern -> Expression`. \(N\) is a pattern, which we will for now define to be either a variable name
|
||||||
(\\(\\text{lowerVar}\\)), or a constructor with some arguments. The arguments of a constructor will be
|
(\(\text{lowerVar}\)), or a constructor with some arguments. The arguments of a constructor will be
|
||||||
lowercase names, and a list of those arguments will be represented with \\(L\_L\\). One of the bodies
|
lowercase names, and a list of those arguments will be represented with \(L_L\). One of the bodies
|
||||||
of this nonterminal is just the character \\(\\epsilon\\), which just means "nothing".
|
of this nonterminal is just the character \(\epsilon\), which just means "nothing".
|
||||||
We use this because a constructor can have no arguments (like Nil).
|
We use this because a constructor can have no arguments (like Nil).
|
||||||
|
|
||||||
We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
|
We can use these grammar rules to represent any expression we want. For instance, let's try `3+(multiply 2 6)`,
|
||||||
where multiply is a function that, well, multiplies. We start with \\(A_{add}\\):
|
where multiply is a function that, well, multiplies. We start with \(A_{add}\):
|
||||||
|
|
||||||
{{< latex >}}
|
{{< latex >}}
|
||||||
\begin{aligned}
|
\begin{aligned}
|
||||||
|
@@ -227,15 +227,15 @@ L_U & \rightarrow \epsilon
|
||||||
\end{aligned}
|
\end{aligned}
|
||||||
{{< /latex >}}
|
{{< /latex >}}
|
||||||
|
|
||||||
That's a lot of rules! \\(T\\) is the "top-level declaration" rule. It matches either
|
That's a lot of rules! \(T\) is the "top-level declaration" rule. It matches either
|
||||||
a function or a data definition. A function definition consists of the keyword "defn",
|
a function or a data definition. A function definition consists of the keyword "defn",
|
||||||
followed by a function name (starting with a lowercase letter), followed by a list of
|
followed by a function name (starting with a lowercase letter), followed by a list of
|
||||||
parameters, represented by \\(L\_P\\).
|
parameters, represented by \(L_P\).
|
||||||
|
|
||||||
A data type definition consists of the name of the data type (starting with an uppercase letter),
|
A data type definition consists of the name of the data type (starting with an uppercase letter),
|
||||||
and a list \\(L\_D\\) of data constructors \\(D\\). There must be at least one data constructor in this list,
|
and a list \(L_D\) of data constructors \(D\). There must be at least one data constructor in this list,
|
||||||
so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing
|
so we don't use the empty string rule here. A data constructor is simply an uppercase variable representing
|
||||||
a constructor of the data type, followed by a list \\(L\_U\\) of zero or more uppercase variables (representing
|
a constructor of the data type, followed by a list \(L_U\) of zero or more uppercase variables (representing
|
||||||
the types of the arguments of the constructor).
|
the types of the arguments of the constructor).
|
||||||
|
|
||||||
Finally, we want one or more of these declarations in a valid program:
|
Finally, we want one or more of these declarations in a valid program:
|
||||||
|
@@ -266,7 +266,7 @@ Next, observe that there's
|
||||||
a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression
|
a certain symmetry between our parser and our scanner. In our scanner, we mixed the theoretical idea of a regular expression
|
||||||
with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present
|
with an __action__, a C++ code snippet to be executed when a regular expression is matched. This same idea is present
|
||||||
in the parser, too. Each rule can produce a value, which we call a __semantic value__. For type safety, we allow
|
in the parser, too. Each rule can produce a value, which we call a __semantic value__. For type safety, we allow
|
||||||
each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \\(A_{add}\\) must
|
each nonterminal and terminal to produce only one type of semantic value. For instance, all rules for \(A_{add}\) must
|
||||||
produce an expression. We specify the type of each nonterminal using `%type` directives. The types of terminals
|
produce an expression. We specify the type of each nonterminal using `%type` directives. The types of terminals
|
||||||
are specified when they're declared.
|
are specified when they're declared.
|
||||||
|
|
||||||
|
|