---
title: Compiling a Functional Language Using C++, Part 6 - Compilation
date: 2019-08-06T14:26:38-07:00
tags: ["C++", "Functional Languages", "Compilers"]
series: "Compiling a Functional Language using C++"
description: "In this post, we enable our compiler to convert programs written in our functional language to G-machine instructions."
---

In the previous post, we defined a machine for graph reduction,
called a G-machine. However, this machine is still not particularly
connected to __our__ language. In this post, we will give
meanings to programs in our language in the context of
this G-machine. We will define a __compilation scheme__,
which will be a set of rules that tell us how to
translate programs in our language into G-machine instructions.
To mirror _Implementing Functional Languages: a tutorial_, we'll
call this compilation scheme \(\mathcal{C}\), and write it
as \(\mathcal{C} ⟦e⟧ = i\), meaning "the expression \(e\)
compiles to the instructions \(i\)".

To follow the same route we took with typechecking, let's start
with compiling expressions that are numbers. It's pretty easy:

{{< latex >}}
\mathcal{C} ⟦n⟧ = [\text{PushInt} \; n]
{{< /latex >}}

Here, we compiled a number expression to a list of
instructions with only one element - PushInt.

Just like when we did typechecking, let's
move on to compiling function applications. As
we informally stated in the previous post, since
the thing we're applying has to be on top of the stack,
we want to compile it last:

{{< latex >}}
\mathcal{C} ⟦e_1 \; e_2⟧ = \mathcal{C} ⟦e_2⟧ ⧺ \mathcal{C} ⟦e_1⟧ ⧺ [\text{MkApp}]
{{< /latex >}}

Here, we used the \(⧺\) operator to represent the concatenation of two
lists. Otherwise, this should be pretty intuitive - we first run the instructions
to create the parameter, then we run the instructions to create the function,
and finally, we combine them using MkApp.

It's variables that once again force us to adjust our strategy. If our
program is well-typed, we know our variable will be on the stack:
our definition of Unwind makes it so for functions, and we will
define our case expression compilation scheme to match. However,
we still need to know __where__ on the stack each variable is,
and this changes as the stack is modified.

To accommodate this, we define an environment, \(\rho\),
to be a partial function mapping variable names to their
offsets on the stack. We write \(\rho = [x \rightarrow n, y \rightarrow m]\)
to say "the environment \(\rho\) maps variable \(x\) to stack offset \(n\),
and variable \(y\) to stack offset \(m\)". We also write \(\rho \; x\) to
say "look up \(x\) in \(\rho\)", since \(\rho\) is a function. Finally,
to help with the ever-changing stack, we define an augmented environment
\(\rho^{+n}\), such that \(\rho^{+n} \; x = \rho \; x + n\). In words,
this basically means "\(\rho^{+n}\) has all the variables from \(\rho\),
but their offsets are incremented by \(n\)". We now pass \(\rho\)
into \(\mathcal{C}\) together with the expression \(e\). Let's
rewrite our first two rules. For numbers:

{{< latex >}}
\mathcal{C} ⟦n⟧ \; \rho = [\text{PushInt} \; n]
{{< /latex >}}

For function application:

{{< latex >}}
\mathcal{C} ⟦e_1 \; e_2⟧ \; \rho = \mathcal{C} ⟦e_2⟧ \; \rho \; ⧺ \;\mathcal{C} ⟦e_1⟧ \; \rho^{+1} \; ⧺ \; [\text{MkApp}]
{{< /latex >}}

Notice how in that last rule, we passed in \(\rho^{+1}\) when compiling the function's expression. This is because
running the instructions for \(e_2\) will have left the function's parameter on the stack. Whatever
was at the top of the stack (and thus, had offset 0), is now the second element from the top (offset 1). The
same is true for all other things that were on the stack. So, we increment the environment accordingly.

With the environment, the variable rule is simple:

{{< latex >}}
\mathcal{C} ⟦x⟧ \; \rho = [\text{Push} \; (\rho \; x)]
{{< /latex >}}

One more thing. If we run across a function name, we want to
use PushGlobal rather than Push. Defining \(f\) to be a name
of a global function, we capture this using the following rule:

{{< latex >}}
\mathcal{C} ⟦f⟧ \; \rho = [\text{PushGlobal} \; f]
{{< /latex >}}
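
As a quick check of the rules so far, here is a worked derivation for the application \(f \; x \; y\). It assumes, purely for illustration, that \(f\) is a global function, that the environment is \(\rho = [x \rightarrow 0, y \rightarrow 1]\), and that application associates to the left, so the expression reads as \((f \; x) \; y\):

{{< latex >}}
\begin{aligned}
\mathcal{C} ⟦(f \; x) \; y⟧ \; \rho
&= \mathcal{C} ⟦y⟧ \; \rho \; ⧺ \; \mathcal{C} ⟦f \; x⟧ \; \rho^{+1} \; ⧺ \; [\text{MkApp}] \\
&= [\text{Push} \; 1] \; ⧺ \; \mathcal{C} ⟦x⟧ \; \rho^{+1} \; ⧺ \; \mathcal{C} ⟦f⟧ \; \rho^{+2} \; ⧺ \; [\text{MkApp}] \; ⧺ \; [\text{MkApp}] \\
&= [\text{Push} \; 1, \text{Push} \; 1, \text{PushGlobal} \; f, \text{MkApp}, \text{MkApp}]
\end{aligned}
{{< /latex >}}

Both pushes use offset 1. The first one fetches \(y\); since that leaves a new value on top of the stack, \(x\) has moved from offset 0 to offset 1 by the time it is fetched, which is exactly what \(\rho^{+1}\) records.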

Now it's time for us to compile case expressions, but there's a bit of
an issue - our case expression branches don't map one-to-one with
the \(t \rightarrow i_t\) format of the Jump instruction.
This is because we allow for name patterns in the form \(x\),
which can possibly match more than one tag. Consider this
rather useless example:

```
data Bool = { True, False }
defn weird b = { case b of { b -> { False } } }
```

We only have one branch, but we have two tags that should
lead to it! Not only that, but variable patterns are
location-dependent: if a variable pattern comes
before a constructor pattern, then the constructor
pattern will never be reached. On the other hand,
if a constructor pattern comes before a variable
pattern, it will be tried before the variable pattern,
and thus is reachable.

We will ignore this problem for now - we will define our semantics
as though each case expression branch can match exactly one tag.
In our C++ code, we will write a conversion function that will
figure out which tag goes to which sequence of instructions.
Effectively, we'll be performing [desugaring](https://en.wikipedia.org/wiki/Syntactic_sugar).

Now, on to defining the compilation rules for case expressions.
It's helpful to define compiling a single branch of a case expression
separately. For a branch in the form \(t \; x_1 \; x_2 \; ... \; x_n \rightarrow \text{body}\),
we define a compilation scheme \(\mathcal{A}\) as follows:

{{< latex >}}
\begin{aligned}
\mathcal{A} ⟦t \; x_1 \; ... \; x_n \rightarrow \text{body}⟧ \; \rho & =
t \rightarrow [\text{Split} \; n] \; ⧺ \; \mathcal{C}⟦\text{body}⟧ \; \rho' \; ⧺ \; [\text{Slide} \; n] \\
\text{where} \; \rho' &= \rho^{+n}[x_1 \rightarrow 0, ..., x_n \rightarrow n - 1]
\end{aligned}
{{< /latex >}}

First, we run Split - the node on the top of the stack is a packed constructor,
and we want access to its member variables, since they can be referenced by
the branch's body via \(x_i\). For the same reason, we must make sure to include
\(x_1\) through \(x_n\) in our environment. Furthermore, since the split values now occupy the stack,
we have to offset our environment by \(n\) before adding bindings to our new variables.
Doing all these things gives us \(\rho'\), which we use to compile the body, placing
the resulting instructions after Split. This leaves us with the desired graph on top of
the stack - the only thing left to do is to clean up the stack of the unpacked values,
which we do using Slide.

Notice that we didn't just create instructions - we created a mapping from the tag \(t\)
to the instructions that correspond to it.
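
To make \(\rho'\) concrete, consider a hypothetical branch \(\text{Cons} \; x \; xs \rightarrow \text{body}\), compiled in the environment \(\rho = [l \rightarrow 0]\). The names \(\text{Cons}\), \(x\), \(xs\) and \(l\) are chosen only for illustration. Here \(n = 2\), so the rule for \(\mathcal{A}\) gives:

{{< latex >}}
\begin{aligned}
\rho^{+2} &= [l \rightarrow 2] \\
\rho' &= \rho^{+2}[x \rightarrow 0, xs \rightarrow 1] = [x \rightarrow 0, xs \rightarrow 1, l \rightarrow 2] \\
\mathcal{A} ⟦\text{Cons} \; x \; xs \rightarrow \text{body}⟧ \; \rho &=
\text{Cons} \rightarrow [\text{Split} \; 2] \; ⧺ \; \mathcal{C}⟦\text{body}⟧ \; \rho' \; ⧺ \; [\text{Slide} \; 2]
\end{aligned}
{{< /latex >}}

So at the start of the body, a reference to \(xs\) compiles to \(\text{Push} \; 1\), and anything that was already bound in \(\rho\) sits two slots deeper than before.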

Now, it's time for compiling the whole case expression. We first want
to construct the graph for the expression we want to perform case analysis on.
Next, we want to evaluate it (since we need a packed value, not a graph,
to read the tag). Finally, we perform a jump depending on the tag. This
is captured by the following rule:

{{< latex >}}
\mathcal{C} ⟦\text{case} \; e \; \text{of} \; \text{alt}_1 ... \text{alt}_n⟧ \; \rho =
\mathcal{C} ⟦e⟧ \; \rho \; ⧺ [\text{Eval}, \text{Jump} \; [\mathcal{A} ⟦\text{alt}_1⟧ \; \rho, ..., \mathcal{A} ⟦\text{alt}_n⟧ \; \rho]]
{{< /latex >}}

This works because \(\mathcal{A}\) creates not only instructions,
but also a tag mapping. We simply populate our Jump instruction with the mappings
resulting from compiling each branch.

You may have noticed that we didn't add rules for binary operators. Just like
with type checking, we treat them as function calls. However, rather than constructing
graphs when we have to instantiate those functions, we simply
evaluate the arguments and perform the relevant arithmetic operation using BinOp.
We will do a similar thing for constructors.

### Implementation

With that out of the way, we can get around to writing some code. Let's
first define C++ structs for the instructions of the G-machine:

{{< codeblock "C++" "compiler/06/instruction.hpp" >}}

I omit the implementation of the various (trivial) `print` methods in this post;
as always, you can look at the full project source code, which is
freely available for each post in the series.

We can now envision a method on the `ast` struct that takes an environment
(just like our compilation scheme takes the environment \(\rho\)),
and compiles the `ast`. Rather than returning a vector
of instructions (which involves copying, unless we get some optimization kicking in),
we'll pass a reference to a vector to our method. The method will then place the generated
instructions into the vector.

There's one more thing to be considered. How do we tell apart a "global"
from a variable? A naive solution would be to take a list or map of
global functions as a third parameter to our `compile` method.
But there's an easier way! We know that the program passed type checking.
This means that every referenced variable exists. From there, the situation is easy -
if actual variable names are kept in the environment, \(\rho\), then whenever
we see a variable that __isn't__ in the current environment, it must be a function name.

Having finished contemplating our method, it's time to define a signature:
```C++
virtual void compile(const env_ptr& env, std::vector<instruction_ptr>& into) const;
```

Ah, but now we have to define "environment". Let's do that. Here's our header:

{{< codeblock "C++" "compiler/06/env.hpp" >}}

And here's the source file:

{{< codeblock "C++" "compiler/06/env.cpp" >}}

There's not that much to see here, but let's go through it anyway.
We define an environment as a linked list, kind of like
we did with the type environment. This time, though,
we use shared pointers instead of raw pointers to reference the parent.
I decided on this because we will need to be using virtual methods
(since we have two subclasses of `env`), and thus will need to
be passing the `env` by pointer. At that point, we might as well
use the "proper" way!

I implemented the environment as a linked list because it is, in essence,
a stack. However, not every "offset" in a stack is introduced by
binding variables - for instance, when we create an application node,
we first build the argument value on the stack, and then,
with that value still on the stack, build the left hand side of the application.
Thus, all the variable positions are offset by the presence of the argument
on the stack, and we must account for that. Similarly, in cases when we will
allocate space on the stack (we will run into these cases later), we will
need to account for that change. Thus, since we can increment
the offset in two ways (binding a variable and building something on the stack),
we allow for two types of nodes in our `env` stack.

During recursion we will be tweaking the return value of `get_offset` to
calculate the final location of a variable on the stack (if the
parent of a node returned offset `1`, but the node itself is a variable
node and thus introduces another offset, we need to return `2`). Because
of this, we cannot reasonably signal a missing variable with a constant like `-1` (it will quickly
be made positive on a long list), and thus we throw an exception. To
allow for a safe way to check for an offset, without try-catch,
we also add a `has_variable` method which checks if the lookup will succeed.
A better approach would be to use `std::optional`, but it's C++17, so
we'll shy away from it.
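
To make the two kinds of nodes concrete, here is a minimal, self-contained sketch of such a linked-list environment. It is not the contents of `env.hpp`: it reuses the names mentioned above (`env`, `env_var`, `env_offset`, `get_offset`, `has_variable`), but the constructors, field names and error messages are assumptions made for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch of a stack-offset environment (not the project's env.hpp).
struct env {
    virtual ~env() = default;
    // Returns the stack offset of `name`; throws if the variable is not bound.
    virtual int get_offset(const std::string& name) const = 0;
    // Reports whether get_offset(name) would succeed.
    virtual bool has_variable(const std::string& name) const = 0;
};

using env_ptr = std::shared_ptr<env>;

// A node that binds one variable, which occupies one stack slot.
struct env_var : env {
    std::string name;
    env_ptr parent;

    env_var(std::string n, env_ptr p)
        : name(std::move(n)), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (name == n) return 0;
        // Everything bound further down sits one slot deeper.
        if (parent) return parent->get_offset(n) + 1;
        throw std::runtime_error("unknown variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        if (name == n) return true;
        return parent && parent->has_variable(n);
    }
};

// A node that records `offset` anonymous slots (e.g. an argument that was
// built on the stack), without binding any name.
struct env_offset : env {
    int offset;
    env_ptr parent;

    env_offset(int o, env_ptr p) : offset(o), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (parent) return parent->get_offset(n) + offset;
        throw std::runtime_error("unknown variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        return parent && parent->has_variable(n);
    }
};
```

With `x` bound on top of `y`, looking up `y` goes through one variable node and yields 1; wrapping that environment in an `env_offset` of 1 shifts it to 2, which mirrors \(\rho^{+1}\).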

It will also help to move some of the functions on the `binop` enum
into a separate file. The new header is pretty small:

{{< codeblock "C++" "compiler/06/binop.hpp" >}}

The new source file is not much longer:

{{< codeblock "C++" "compiler/06/binop.cpp" >}}

And now, we begin our implementation. Let's start with the easy ones:
`ast_int`, `ast_lid` and `ast_uid`. The code for `ast_int` involves just pushing
the integer onto the stack:

{{< codelines "C++" "compiler/06/ast.cpp" 36 38 >}}

The code for `ast_lid` needs to check if the variable is global or local,
just like we discussed:

{{< codelines "C++" "compiler/06/ast.cpp" 53 58 >}}

We do not have to do this for `ast_uid`:

{{< codelines "C++" "compiler/06/ast.cpp" 73 75 >}}

On to `ast_binop`! This is the first time we have to change our environment.
As we said earlier, once we build the right operand on the stack, every offset that we counted
from the top of the stack will have been shifted by 1 (we see this
in our compilation scheme for function application). So,
we create a new environment with `env_offset`, and use that
when we compile the left child:

{{< codelines "C++" "compiler/06/ast.cpp" 103 110 >}}

`ast_binop` performs two applications: `(+) lhs rhs`.
We push `rhs`, then `lhs`, then `(+)`, and then use MkApp
twice. In `ast_app`, we only need to perform one application,
`lhs rhs`:

{{< codelines "C++" "compiler/06/ast.cpp" 134 138 >}}

Note that we also extend our environment in this one,
for the exact same reason as before.
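
The rules for numbers, variables and applications can be sketched in a few dozen lines. The block below is a simplified stand-in rather than the code from `ast.cpp`: the names `expr`, `expr_int`, `expr_var`, `expr_app`, `environment` and `shifted` are hypothetical, instructions are represented as plain strings instead of instruction structs, and the environment is a `std::map` that gets copied for \(\rho^{+1}\) instead of a linked list.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// rho: maps each local variable name to its offset from the top of the stack.
using environment = std::map<std::string, int>;

// Computes rho^{+n}: the same variables, with every offset increased by n.
environment shifted(const environment& rho, int n) {
    environment result;
    for (const auto& entry : rho) {
        result[entry.first] = entry.second + n;
    }
    return result;
}

struct expr {
    virtual ~expr() = default;
    // Appends the instructions for this expression to `into`.
    virtual void compile(const environment& rho,
                         std::vector<std::string>& into) const = 0;
};
using expr_ptr = std::shared_ptr<expr>;

// C[[n]] rho = [PushInt n]
struct expr_int : expr {
    int value;
    explicit expr_int(int v) : value(v) {}
    void compile(const environment&,
                 std::vector<std::string>& into) const override {
        into.push_back("PushInt(" + std::to_string(value) + ")");
    }
};

// C[[x]] rho = [Push (rho x)] for locals, [PushGlobal f] for anything else.
struct expr_var : expr {
    std::string name;
    explicit expr_var(std::string n) : name(std::move(n)) {}
    void compile(const environment& rho,
                 std::vector<std::string>& into) const override {
        auto found = rho.find(name);
        if (found != rho.end()) {
            into.push_back("Push(" + std::to_string(found->second) + ")");
        } else {
            // Not in the environment, so it must name a global function.
            into.push_back("PushGlobal(" + name + ")");
        }
    }
};

// C[[e1 e2]] rho = C[[e2]] rho ++ C[[e1]] rho^{+1} ++ [MkApp]
struct expr_app : expr {
    expr_ptr left;
    expr_ptr right;
    expr_app(expr_ptr l, expr_ptr r)
        : left(std::move(l)), right(std::move(r)) {}
    void compile(const environment& rho,
                 std::vector<std::string>& into) const override {
        right->compile(rho, into);
        left->compile(shifted(rho, 1), into);
        into.push_back("MkApp()");
    }
};
```

Compiling `(plus x) y` with `x` at offset 0 and `y` at offset 1 appends `Push(1)`, `Push(1)`, `PushGlobal(plus)`, `MkApp()`, `MkApp()`, in that order. Copying the map on every application is wasteful compared to prepending an offset node to a linked list, but it keeps the sketch short.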

Case expressions are the only thing left on the agenda. This
is the time during which we have to perform desugaring. Here,
though, we run into an issue: we don't have tags assigned to constructors!
We need to adjust our code to keep track of the tags of the various
constructors of a type. To do this, we add a subclass of the `type_base`
struct, called `type_data`:

{{< codelines "C++" "compiler/06/type.hpp" 33 42 >}}

When we create types from `definition_data`, we tag the corresponding constructors:

{{< codelines "C++" "compiler/06/definition.cpp" 54 71 >}}

Ah, but adding constructor info to the type doesn't solve the problem.
Once we've performed type checking, we don't keep
the types that we computed for an AST node in the node. And obviously, we don't want
to go looking for them again. Furthermore, we can't just look up a constructor
in the environment, since we may well have patterns that don't have __any__ constructors:

```
match l {
    l -> { 0 }
}
```

So, we want each `ast` node to store its type (well, in practice we only need this for
`ast_case`, but we might as well store it for all nodes). We can add it, no problem.
To add to that, we can add another, non-virtual `typecheck` method (let's call it `typecheck_common`,
since naming is hard). This method will call `typecheck`, and store the output into
the `node_type` field.

The signature is identical to `typecheck`, except it's neither virtual nor const:
```
type_ptr typecheck_common(type_mgr& mgr, const type_env& env);
```

And the implementation is as simple as you'd think:

{{< codelines "C++" "compiler/06/ast.cpp" 9 12 >}}

In client code (`definition_defn::typecheck_first` for instance), we should now
use `typecheck_common` instead of `typecheck`. With that done, we're almost there.
However, we're still missing something: most likely, the initial type assigned to any
node is a `type_var`, or a type variable. In this case, `type_var` __needs__ the information
from `type_mgr`, which we will not be keeping around. Besides, it's cleaner to keep the actual type
as a member of the node, not a type variable that references it. In order
to address this, we write two conversion functions that call `resolve` on all
types in an AST, given a type manager. After this is done, the type manager can be thrown away.
The signatures of the functions are as follows:

```
void resolve_common(const type_mgr& mgr);
virtual void resolve(const type_mgr& mgr) const = 0;
```

We also add the `resolve` method to `definition`, so that we can call it
without having to run `dynamic_cast`. The implementation for `ast::resolve_common`
just resolves the type:

{{< codelines "C++" "compiler/06/ast.cpp" 14 21 >}}

The virtual `ast::resolve` just calls `ast::resolve_common` on all `ast` children
of a node. Here's a sample implementation from `ast_binop`:

{{< codelines "C++" "compiler/06/ast.cpp" 98 101 >}}

And here's the implementation of `definition::resolve` on `definition_defn`:

{{< codelines "C++" "compiler/06/definition.cpp" 32 42 >}}

Finally, we call `resolve` at the end of `typecheck_program` in `main.cpp`:

{{< codelines "C++" "compiler/06/main.cpp" 40 42 >}}

At last, we're ready to implement the code for compiling `ast_case`.
Here it is, in all its glory:

{{< codelines "C++" "compiler/06/ast.cpp" 178 230 >}}

There's a lot to unpack here. First of all, just like we said in the compilation
scheme, we want to build and evaluate the expression that's being analyzed.
Once that's done, however, things get more tricky. We know that each
branch of a case expression will correspond to a vector of instructions -
in fact, our jump instruction contains a mapping from tags to instructions.
As we also discussed above, each list of instructions can be mapped to
by multiple tags. We don't want to recompile the same sequence of instructions
multiple times (or indeed, generate machine code for it). So, we keep
a mapping of tags to their corresponding sequences of instructions. We implement
this by having a vector of vectors of instructions (in which each inner vector
represents the code for a branch), and a map of tag number to index
in the vector containing all the branches. This way, multiple tags
can point to the same instruction sequence without duplicating information.

We also don't allow a tag to be mapped to more than one sequence of instructions.
This is handled differently depending on whether a variable pattern or a
constructor pattern is encountered. Variable patterns map all
tags that haven't been mapped yet, so no error can occur. Constructor patterns,
though, can explicitly try to map the same tag twice, and we don't want that.

I hinted at the implementation of our case expression
compilation algorithm in the previous paragraphs, but let's go through it. Once we've compiled
the expression to be analyzed, and evaluated it (just like in our definitions
above), we proceed to look at all the branches specified in the case expression.

If a branch has a variable pattern, we must map all the remaining, unmapped tags
to the result of the compilation. We also aren't going to be taking apart
our value, so we don't need to use Split, but we do need to add 1 to the
environment offset to account for the presence of that value. So,
we compile the branch body with that offset, and iterate through
all the constructors of our data type. We skip a constructor
if it's been mapped, and if it hasn't been, we map it to the index
that this branch body will have in our list. Finally,
we push the newly compiled instruction sequence into the list of branch
bodies.

If a branch has a constructor pattern, on the other hand, we lead our compilation
output with a Split. This takes the value off the stack, but pushes on
all the parameters of the constructor. We account for this by incrementing the
environment with the offset given by the number of arguments (just like we did
in our definitions of our compilation scheme). Before we map the tag,
we ensure that it hasn't already been mapped (and throw an exception, currently
in the form of a type error due to the growing length of this post),
and finally map it and insert the new branch code into the list of branches.

After we're done with all the branches, we also check for non-exhaustive patterns,
since otherwise we could run into runtime errors. With this, the case expression,
and the last of the AST nodes, can be compiled.
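
The tag bookkeeping described above can be sketched separately from the rest of the compiler. The block below is an illustration under several assumptions, not the code from `ast.cpp`: the names `branch`, `jump_table` and `build_jump_table` are hypothetical, branch bodies are taken as already-compiled lists of instruction strings, a variable pattern is marked by an empty constructor name, and errors are reported with `std::runtime_error` rather than the type error the post's code currently throws.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// One branch of a case expression, with its body already compiled.
struct branch {
    std::string constructor;        // empty string marks a variable pattern
    std::vector<std::string> body;  // instructions, represented as strings
};

struct jump_table {
    // Each branch body is stored exactly once...
    std::vector<std::vector<std::string>> branches;
    // ...and each tag maps to the index of the body it should run.
    std::map<int, std::size_t> tag_to_branch;
};

jump_table build_jump_table(const std::map<std::string, int>& constructor_tags,
                            const std::vector<branch>& alts) {
    jump_table table;
    for (const branch& alt : alts) {
        // The index this branch's body will have once it is pushed below.
        std::size_t index = table.branches.size();
        if (alt.constructor.empty()) {
            // Variable pattern: claim every tag that is still unmapped.
            for (const auto& entry : constructor_tags) {
                if (table.tag_to_branch.count(entry.second) == 0) {
                    table.tag_to_branch[entry.second] = index;
                }
            }
        } else {
            auto found = constructor_tags.find(alt.constructor);
            if (found == constructor_tags.end()) {
                throw std::runtime_error("unknown constructor: " + alt.constructor);
            }
            // Constructor pattern: the tag must not have been mapped already.
            if (table.tag_to_branch.count(found->second) != 0) {
                throw std::runtime_error("duplicate pattern: " + alt.constructor);
            }
            table.tag_to_branch[found->second] = index;
        }
        table.branches.push_back(alt.body);
    }
    // Every constructor of the type must lead somewhere.
    for (const auto& entry : constructor_tags) {
        if (table.tag_to_branch.count(entry.second) == 0) {
            throw std::runtime_error("non-exhaustive patterns: " + entry.first);
        }
    }
    return table;
}
```

For a type with three constructors, a `Nil` branch followed by a variable branch produces two stored bodies: the tag for `Nil` maps to index 0, and the two remaining tags both map to index 1.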

We also add a `compile` method to definitions, since they contain
our AST nodes. The method is empty for `definition_data`, and
looks as follows for `definition_defn`:

{{< codelines "C++" "compiler/06/definition.cpp" 44 52 >}}

Notice that we terminate the function with Update and Pop. Update
will turn the application node that served as the "root"
of the application into an indirection to the value that we have computed.
After this, Pop will remove all "scratch work" from the stack.
In essence, this is how we can lazily evaluate expressions.
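
The overall shape of compiling a definition can be sketched as follows. This is a guess at the structure rather than the code in `definition.cpp`: the names `compile_definition`, `body_compiler`, `environment` and `instructions` are hypothetical, instructions are plain strings, and two details are inferred from the sample output later in this post rather than stated outright: the first parameter is assumed to sit at offset 0, and Update and Pop are assumed to take the number of parameters as their operand.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

using environment = std::map<std::string, int>;
using instructions = std::vector<std::string>;
// Stand-in for "compile the body's AST in this environment".
using body_compiler = std::function<void(const environment&, instructions&)>;

instructions compile_definition(const std::vector<std::string>& params,
                                const body_compiler& compile_body) {
    // Assumption: the first parameter is on top of the stack (offset 0),
    // the second one just below it (offset 1), and so on.
    environment rho;
    for (std::size_t i = 0; i < params.size(); i++) {
        rho[params[i]] = static_cast<int>(i);
    }

    instructions out;
    compile_body(rho, out);

    // Overwrite the root of the application with the result, then discard
    // the parameters that are still sitting on the stack.
    std::string count = std::to_string(params.size());
    out.push_back("Update(" + count + ")");
    out.push_back("Pop(" + count + ")");
    return out;
}
```

Calling `compile_definition` with parameters `x` and `y` and a body that only pushes `y` yields `Push(1)`, `Update(2)`, `Pop(2)`.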

Finally, we make a function in our `main.cpp` file to compile
all the definitions:

{{< codelines "C++" "compiler/06/main.cpp" 45 56 >}}

In this function, we also include some extra
output to help us see the result of our compilation. Since
at the moment, only `definition_defn` has to
be compiled, we try to cast all definitions to it, and if
we succeed, we print them out.

Let's try it all out! For the below sample program:

{{< rawblock "compiler/06/examples/works1.txt" >}}

Our compiler produces the following new output:
```
PushInt(6)
PushInt(320)
PushGlobal(plus)
MkApp()
MkApp()
Update(0)
Pop(0)

Push(1)
Push(1)
PushGlobal(plus)
MkApp()
MkApp()
Update(2)
Pop(2)
```

The first sequence of instructions is clearly `main`. It creates
an application of `plus` to `320`, and then applies that to
`6`, which results in `plus 320 6`, which is correct. The
second sequence of instructions pushes the parameter that
sits at offset 1 from the top of the stack (`y`). It then
pushes a parameter from the same offset again, but this time,
since `y` was previously pushed on the stack, `x` is now
in that position, so `x` is pushed onto the stack.
Finally, `+` is pushed, and the application
`(+) x y` is created, which is equivalent to `x+y`.
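
The claim that the two `Push(1)` instructions fetch different variables is easy to check with a toy stack. The sketch below is not part of the compiler or the runtime: `stack`, `push`, `push_global`, `mk_app` and `run_add_body` are hypothetical names, graph nodes are represented by strings, and MkApp is assumed to take the function from the top of the stack and the argument from just below it.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// A stack of node labels; back() is the top of the stack.
using stack = std::vector<std::string>;

// Push(n): copy the element n slots below the top onto the top.
void push(stack& s, std::size_t offset) {
    if (offset >= s.size()) {
        throw std::out_of_range("Push offset beyond the stack");
    }
    s.push_back(s[s.size() - 1 - offset]);
}

// PushGlobal(f): push a reference to the global function f.
void push_global(stack& s, const std::string& name) {
    s.push_back(name);
}

// MkApp(): pop the function, then the argument, and push their application.
void mk_app(stack& s) {
    if (s.size() < 2) {
        throw std::out_of_range("MkApp needs two elements");
    }
    std::string function = s.back();
    s.pop_back();
    std::string argument = s.back();
    s.pop_back();
    s.push_back("(" + function + " " + argument + ")");
}

// Runs the body of the two-parameter function from the output above,
// starting with x on top of the stack and y just below it.
std::string run_add_body() {
    stack s = {"y", "x"};
    push(s, 1);              // fetches y
    push(s, 1);              // fetches x, which has moved down by one slot
    push_global(s, "plus");
    mk_app(s);               // (plus x)
    mk_app(s);               // ((plus x) y)
    return s.back();
}
```

`run_add_body()` returns `((plus x) y)`, so the operands end up in the right order even though both were fetched with the same offset.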

Let's also take a look at a case expression program:

{{< rawblock "compiler/06/examples/works3.txt" >}}

The result of the compilation is as follows:

```
Push(0)
Eval()
Jump(
Split()
PushInt(0)
Slide(0)

Split()
Push(1)
PushGlobal(length)
MkApp()
PushInt(1)
PushGlobal(plus)
MkApp()
MkApp()
Slide(2)

)
Update(1)
Pop(1)
```

We push the first (and only) parameter onto the stack. We then make
sure it's evaluated, and perform case analysis: if the list
is `Nil`, we simply push the number 0 onto the stack. If it's
a concatenation of some `x` and another list `xs`, we
push `xs` and `length` onto the stack, make the application
(`length xs`), push the 1, and finally apply `+` to the result.
This all makes sense!

With this, we've been able to compile our expressions and functions
into G-machine code. We're not done, however - our computers
aren't G-machines. We'll need to compile our G-machine code to
__machine code__ (we will use LLVM for this), implement the
__runtime__, and develop a __garbage collector__. We'll
tackle the first of these in the next post - [Part 7 - Runtime]({{< relref "07_compiler_runtime.md" >}}).