---
title: Compiling a Functional Language Using C++, Part 6 - Compilation
date: 2019-08-06T14:26:38-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
---
In the previous post, we defined a machine for graph reduction,
called a G-machine. However, this machine is still not particularly
connected to __our__ language. In this post, we will give
meanings to programs in our language in the context of
this G-machine. We will define a __compilation scheme__,
which will be a set of rules that tell us how to
translate programs in our language into G-machine instructions.
To mirror _Implementing Functional Languages: a tutorial_, we'll
call this compilation scheme \\(\\mathcal{C}\\), and write it
as \\(\\mathcal{C} ⟦e⟧ = i\\), meaning "the expression \\(e\\)
compiles to the instructions \\(i\\)".
To follow our route from the typechecking, let's start
with compiling expressions that are numbers. It's pretty easy:
$$
\\mathcal{C} ⟦n⟧ = [\\text{PushInt} \\; n]
$$
Here, we compiled a number expression to a list of
instructions with only one element - PushInt.
Just like when we did typechecking, let's
move on to compiling function applications. As
we informally stated in the previous post, since
the thing we're applying has to be on top,
we want to compile it last:
$$
\\mathcal{C} ⟦e\_1 \\; e\_2⟧ = \\mathcal{C} ⟦e\_2⟧ ⧺ \\mathcal{C} ⟦e\_1⟧ ⧺ [\\text{MkApp}]
$$
Here, we used the \\(⧺\\) operator to represent the concatenation of two
lists. Otherwise, this should be pretty intuitive - we first run the instructions
to create the parameter, then we run the instructions to create the function,
and finally, we combine them using MkApp.
It's variables that once again force us to adjust our strategy. If our
program is well-typed, we know our variable will be on the stack:
our definition of Unwind makes it so for functions, and we will
define our case expression compilation scheme to match. However,
we still need to know __where__ on the stack each variable is,
and this changes as the stack is modified.
To accommodate this, we define an environment, \\(\\rho\\),
to be a partial function mapping variable names to their
offsets on the stack. We write \\(\\rho = [x \\rightarrow n, y \\rightarrow m]\\)
to say "the environment \\(\\rho\\) maps variable \\(x\\) to stack offset \\(n\\),
and variable \\(y\\) to stack offset \\(m\\)". We also write \\(\\rho \\; x\\) to
say "look up \\(x\\) in \\(\\rho\\)", since \\(\\rho\\) is a function. Finally,
to help with the ever-changing stack, we define an augmented environment
\\(\\rho^{+n}\\), such that \\(\\rho^{+n} \\; x = \\rho \\; x + n\\). In words,
this basically means "\\(\\rho^{+n}\\) has all the variables from \\(\\rho\\),
but their addresses are incremented by \\(n\\)". We now pass \\(\\rho\\)
in to \\(\\mathcal{C}\\) together with the expression \\(e\\). Let's
rewrite our first two rules. For numbers:
$$
\\mathcal{C} ⟦n⟧ \\; \\rho = [\\text{PushInt} \\; n]
$$
For function application:
$$
\\mathcal{C} ⟦e\_1 \\; e\_2⟧ \\; \\rho = \\mathcal{C} ⟦e\_2⟧ \\; \\rho ⧺ \\mathcal{C} ⟦e\_1⟧ \\; \\rho^{+1} ⧺ [\\text{MkApp}]
$$
Notice how in that last rule, we passed in \\(\\rho^{+1}\\) when compiling the function's expression. This is because
the result of running the instructions for \\(e\_2\\) will have left on the stack the function's parameter. Whatever
was at the top of the stack (and thus, had index 0), is now the second element from the top (address 1). The
same is true for all other things that were on the stack. So, we increment the environment accordingly.
With the environment, the variable rule is simple:
$$
\\mathcal{C} ⟦x⟧ \\; \\rho = [\\text{Push} \\; (\\rho \\; x)]
$$
One more thing. If we run across a function name, we want to
use PushGlobal rather than Push. Defining \\(f\\) to be a name
of a global function, we capture this using the following rule:
$$
\\mathcal{C} ⟦f⟧ \\; \\rho = [\\text{PushGlobal} \\; f]
$$
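To see these rules working together, here is a short worked derivation. It is only an illustration: we assume that \\(f\\) is a global function and that \\(\\rho = [x \\rightarrow 0, y \\rightarrow 1]\\). Since application associates to the left, \\(f \\; x \\; y\\) is really \\((f \\; x) \\; y\\):

$$
\\begin{align}
\\mathcal{C} ⟦f \\; x \\; y⟧ \\; \\rho & = \\mathcal{C} ⟦y⟧ \\; \\rho ⧺ \\mathcal{C} ⟦f \\; x⟧ \\; \\rho^{+1} ⧺ [\\text{MkApp}] \\\\\\
& = [\\text{Push} \\; 1] ⧺ \\mathcal{C} ⟦x⟧ \\; \\rho^{+1} ⧺ \\mathcal{C} ⟦f⟧ \\; \\rho^{+2} ⧺ [\\text{MkApp}, \\text{MkApp}] \\\\\\
& = [\\text{Push} \\; 1, \\text{Push} \\; 1, \\text{PushGlobal} \\; f, \\text{MkApp}, \\text{MkApp}]
\\end{align}
$$

Both pushes use offset 1, but they refer to different variables: the first is \\(\\rho \\; y = 1\\), while the second is \\(\\rho^{+1} \\; x = 0 + 1\\), because \\(y\\) is already sitting on top of the stack by then.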
Now it's time for us to compile case expressions, but there's a bit of
an issue - our case expression branches don't map one-to-one with
the \\(t \\rightarrow i\_t\\) format of the Jump instruction.
This is because we allow for name patterns in the form \\(x\\),
which can possibly match more than one tag. Consider this
rather useless example:
```
data Bool = { True, False }
defn weird b = { case b of { b -> { False } } }
```
We only have one branch, but we have two tags that should
lead to it! Not only that, but variable patterns are
location-dependent: if a variable pattern comes
before a constructor pattern, then the constructor
pattern will never be reached. On the other hand,
if a constructor pattern comes before a variable
pattern, it will be tried before the variable pattern,
and thus is reachable.
We will ignore this problem for now - we will define our semantics
as though each case expression branch can match exactly one tag.
In our C++ code, we will write a conversion function that will
figure out which tag goes to which sequence of instructions.
Effectively, we'll be performing [desugaring](https://en.wikipedia.org/wiki/Syntactic_sugar).
Now, on to defining the compilation rules for case expressions.
It's helpful to define compiling a single branch of a case expression
separately. For a branch in the form \\(t \\; x\_1 \\; x\_2 \\; ... \\; x\_n \\rightarrow \text{body}\\),
we define a compilation scheme \\(\\mathcal{A}\\) as follows:
$$
\\begin{align}
\\mathcal{A} ⟦t \\; x\_1 \\; ... \\; x\_n \\rightarrow \text{body}⟧ \\; \\rho & =
t \\rightarrow [\\text{Split} \\; n] \\; ⧺ \\; \\mathcal{C}⟦\\text{body}⟧ \\; \\rho' \\; ⧺ \\; [\\text{Slide} \\; n] \\\\\\
\text{where} \\; \\rho' &= \\rho^{+n}[x\_1 \\rightarrow 0, ..., x\_n \\rightarrow n - 1]
\\end{align}
$$
First, we run Split - the node on the top of the stack is a packed constructor,
and we want access to its member variables, since they can be referenced by
the branch's body via \\(x\_i\\). For the same reason, we must make sure to include
\\(x\_1\\) through \\(x\_n\\) in our environment. Furthermore, since the split values now occupy the stack,
we have to offset our environment by \\(n\\) before adding bindings to our new variables.
Doing all these things gives us \\(\\rho'\\), which we use to compile the body, placing
the resulting instructions after Split. This leaves us with the desired graph on top of
the stack - the only thing left to do is to clean up the stack of the unpacked values,
which we do using Slide.
Notice that we didn't just create instructions - we created a mapping from the tag \\(t\\)
to the instructions that correspond to it.
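As a small worked example of the environment bookkeeping, suppose (purely for illustration) that the enclosing environment is \\(\\rho = [l \\rightarrow 0]\\), and that the branch has a two-argument pattern \\(t \\; x \\; y\\). Then \\(n = 2\\), and the definitions above give:

$$
\\begin{align}
\\rho^{+2} & = [l \\rightarrow 2] \\\\\\
\\rho' & = [x \\rightarrow 0, y \\rightarrow 1, l \\rightarrow 2] \\\\\\
\\mathcal{A} ⟦t \\; x \\; y \\rightarrow \\text{body}⟧ \\; \\rho & = t \\rightarrow [\\text{Split} \\; 2] \\; ⧺ \\; \\mathcal{C}⟦\\text{body}⟧ \\; \\rho' \\; ⧺ \\; [\\text{Slide} \\; 2]
\\end{align}
$$

The two unpacked fields take the top two slots, and everything that was already in the environment moves two slots further down.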
Now, it's time for compiling the whole case expression. We first want
to construct the graph for the expression we want to perform case analysis on.
Next, we want to evaluate it (since we need a packed value, not a graph,
to read the tag). Finally, we perform a jump depending on the tag. This
is captured by the following rule:
$$
\\mathcal{C} ⟦\\text{case} \\; e \\; \\text{of} \\; \\text{alt}\_1 ... \\text{alt}\_n⟧ \\; \\rho =
\\mathcal{C} ⟦e⟧ \\; \\rho \\; ⧺ [\\text{Eval}, \\text{Jump} \\; [\\mathcal{A} ⟦\\text{alt}\_1⟧ \\; \\rho, ..., \\mathcal{A} ⟦\\text{alt}\_n⟧ \\; \\rho]]
$$
This works because \\(\\mathcal{A}\\) creates not only instructions,
but also a tag mapping. We simply populate our Jump instruction with the mappings
resulting from compiling each branch.
You may have noticed that we didn't add rules for binary operators. Just like
with type checking, we treat them as function calls. However, rather than constructing
graphs when we have to instantiate those functions, we simply
evaluate the arguments and perform the relevant arithmetic operation using BinOp.
We will do a similar thing for constructors.
### Implementation
With that out of the way, we can get around to writing some code. Let's
first define C++ structs for the instructions of the G-machine:
{{< codeblock "C++" "compiler/06/instruction.hpp" >}}
I omit the implementation of the various (trivial) `print` methods in this post;
as always, you can look at the full project source code, which is
freely available for each post in the series.
We can now envision a method on the `ast` struct that takes an environment
(just like our compilation scheme takes the environment \\(\\rho\\)),
and compiles the `ast`. Rather than returning a vector
of instructions (which involves copying, unless we get some optimization kicking in),
we'll pass a reference to a vector to our method. The method will then place the generated
instructions into the vector.
There's one more thing to be considered. How do we tell apart a "global"
from a variable? A naive solution would be to take a list or map of
global functions as a third parameter to our `compile` method.
But there's an easier way! We know that the program passed type checking.
This means that every referenced variable exists. From there, the situation is easy -
if actual variable names are kept in the environment, \\(\\rho\\), then whenever
we see a variable that __isn't__ in the current environment, it must be a function name.
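To make the idea concrete, here is a minimal, self-contained sketch of the rules so far, including the "not in the environment means global" trick. It is not the code from this series: `mini_expr`, `offsets`, `shifted` and `compile_sketch` are hypothetical names, instructions are plain strings rather than instruction structs, and the environment is a plain `std::map` instead of the linked list we are about to define.

```C++
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical miniature AST: integer literals, names, and applications.
struct mini_expr {
    enum kind_t { LIT, NAME, APP };
    kind_t kind = LIT;
    int value = 0;                          // used by LIT
    std::string name;                       // used by NAME
    std::shared_ptr<mini_expr> left, right; // used by APP
};
using mini_ptr = std::shared_ptr<mini_expr>;

mini_ptr mk_int(int v) {
    auto e = std::make_shared<mini_expr>();
    e->kind = mini_expr::LIT;
    e->value = v;
    return e;
}

mini_ptr mk_name(const std::string& n) {
    auto e = std::make_shared<mini_expr>();
    e->kind = mini_expr::NAME;
    e->name = n;
    return e;
}

mini_ptr mk_app(mini_ptr l, mini_ptr r) {
    auto e = std::make_shared<mini_expr>();
    e->kind = mini_expr::APP;
    e->left = l;
    e->right = r;
    return e;
}

// The environment rho: variable name -> offset from the top of the stack.
using offsets = std::map<std::string, int>;

// rho^{+n}: the same variables, with every offset incremented by n.
offsets shifted(const offsets& env, int n) {
    offsets out;
    for (const auto& entry : env) out[entry.first] = entry.second + n;
    return out;
}

void compile_sketch(const mini_ptr& e, const offsets& env,
                    std::vector<std::string>& into) {
    switch (e->kind) {
    case mini_expr::LIT:
        into.push_back("PushInt(" + std::to_string(e->value) + ")");
        break;
    case mini_expr::NAME: {
        auto it = env.find(e->name);
        if (it != env.end()) {
            // A local variable: push the node found at its stack offset.
            into.push_back("Push(" + std::to_string(it->second) + ")");
        } else {
            // Not in the environment, so (after type checking) it must be global.
            into.push_back("PushGlobal(" + e->name + ")");
        }
        break;
    }
    case mini_expr::APP:
        assert(e->left && e->right);
        compile_sketch(e->right, env, into);            // argument first
        compile_sketch(e->left, shifted(env, 1), into); // function, under rho^{+1}
        into.push_back("MkApp()");
        break;
    }
}
```

Compiling `mk_app(mk_app(mk_name("plus"), mk_name("x")), mk_name("y"))` with the environment `{{"x", 0}, {"y", 1}}` pushes `y`, then `x` at its shifted offset, then the global `plus`, followed by two `MkApp` instructions. Copying the map on every shift is wasteful, which is one reason to prefer a linked structure in real code.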
Having finished contemplating our method, it's time to define a signature:
```C++
virtual void compile(const env_ptr& env, std::vector<instruction_ptr>& into) const;
```
Ah, but now we have to define "environment". Let's do that. Here's our header:
{{< codeblock "C++" "compiler/06/env.hpp" >}}
And here's the source file:
{{< codeblock "C++" "compiler/06/env.cpp" >}}
There's not that much to see here, but let's go through it anyway.
We define an environment as a linked list, kind of like
we did with the type environment. This time, though,
we use shared pointers instead of raw pointers to reference the parent.
I decided on this because we will need to be using virtual methods
(since we have two subclasses of `env`), and thus will need to
be passing the `env` by pointer. At that point, we might as well
use the "proper" way!
I implemented the environment as a linked list because it is, in essence,
a stack. However, not every "offset" in a stack is introduced by
binding variables - for instance, when we create an application node,
we first build the argument value on the stack, and then,
with that value still on the stack, build the left hand side of the application.
Thus, all the variable positions are offset by the presence of the argument
on the stack, and we must account for that. Similarly, in cases when we will
allocate space on the stack (we will run into these cases later), we will
need to account for that change. Thus, since we can increment
the offset by two ways (binding a variable and building something on the stack),
we allow for two types of nodes in our `env` stack.
During recursion we will be tweaking the return value of `get_offset` to
calculate the final location of a variable on the stack (if the
parent of a node returned offset `1`, but the node itself is a variable
node and thus introduces another offset, we need to return `2`). Because
of this, we cannot reasonably return a constant like `-1` (it will quickly
be made positive on a long list), and thus we throw an exception. To
allow for a safe way to check for an offset, without try-catch,
we also add a `has_variable` method which checks if the lookup will succeed.
A better approach would be to use `std::optional`, but it's C++17, so
we'll shy away from it.
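For readers who want the shape of this design without opening the files, the following is a rough, self-contained sketch of such a two-node environment. It should not be read as the contents of `env.hpp`: the name `env_var` and the exception type are assumptions, and only `env`, `env_ptr`, `env_offset`, `get_offset` and `has_variable` are names that appear in this post.

```C++
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct env {
    virtual ~env() = default;
    // Offset of `name` from the top of the stack; throws if it is unbound.
    virtual int get_offset(const std::string& name) const = 0;
    // Safe check that get_offset would succeed.
    virtual bool has_variable(const std::string& name) const = 0;
};
using env_ptr = std::shared_ptr<env>;

// A node that binds one variable, which occupies one stack slot.
// (`env_var` is an assumed name for this sketch.)
struct env_var : env {
    std::string name;
    env_ptr parent;

    env_var(std::string n, env_ptr p)
        : name(std::move(n)), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (n == name) return 0;
        // Our own slot sits above everything the parent knows about.
        if (parent) return parent->get_offset(n) + 1;
        throw std::out_of_range("unbound variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        return n == name || (parent && parent->has_variable(n));
    }
};

// A node that binds nothing, but records `offset` extra values on the stack.
struct env_offset : env {
    int offset;
    env_ptr parent;

    env_offset(int o, env_ptr p) : offset(o), parent(std::move(p)) {
        assert(offset >= 0);
    }

    int get_offset(const std::string& n) const override {
        if (parent) return parent->get_offset(n) + offset;
        throw std::out_of_range("unbound variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        return parent && parent->has_variable(n);
    }
};
```

With an `env_var` for `x` whose parent is an `env_var` for `y`, looking up `x` yields 0 and `y` yields 1. Wrapping that chain in `env_offset(1, ...)` moves them to 1 and 2, which is exactly the \\(\\rho^{+1}\\) from our compilation scheme.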
It will also help to move some of the functions on the `binop` enum
into a separate file. The new header is pretty small:
{{< codeblock "C++" "compiler/06/binop.hpp" >}}
The new source file is not much longer:
{{< codeblock "C++" "compiler/06/binop.cpp" >}}
And now, we begin our implementation. Let's start with the easy ones:
`ast_int`, `ast_lid` and `ast_uid`. The code for `ast_int` involves just pushing
the integer onto the stack:
{{< codelines "C++" "compiler/06/ast.cpp" 36 38 >}}
The code for `ast_lid` needs to check if the variable is global or local,
just like we discussed:
{{< codelines "C++" "compiler/06/ast.cpp" 53 58 >}}
We do not have to do this for `ast_uid`:
{{< codelines "C++" "compiler/06/ast.cpp" 73 75 >}}
On to `ast_binop`! This is the first time we have to change our environment.
As we said earlier, once we build the right operand on the stack, every offset that we counted
from the top of the stack will have been shifted by 1 (we see this
in our compilation scheme for function application). So,
we create a new environment with `env_offset`, and use that
when we compile the left child:
{{< codelines "C++" "compiler/06/ast.cpp" 103 110 >}}
`ast_binop` performs two applications: `(+) lhs rhs`.
We push `rhs`, then `lhs`, then `(+)`, and then use MkApp
twice. In `ast_app`, we only need to perform one application,
`lhs rhs`:
{{< codelines "C++" "compiler/06/ast.cpp" 134 138 >}}
Note that we also extend our environment in this one,
for the exact same reason as before.
Case expressions are the only thing left on the agenda. This
is the time during which we have to perform desugaring. Here,
though, we run into an issue: we don't have tags assigned to constructors!
We need to adjust our code to keep track of the tags of the various
constructors of a type. To do this, we add a subclass for the `type_base`
struct, called `type_data`:
{{< codelines "C++" "compiler/06/type.hpp" 33 42 >}}
When we create types from `definition_data`, we tag the corresponding constructors:
{{< codelines "C++" "compiler/06/definition.cpp" 54 71 >}}
Ah, but adding constructor info to the type doesn't solve the problem.
Once we performed type checking, we don't keep
the types that we computed for an AST node, in the node. And obviously, we don't want
to go looking for them again. Furthermore, we can't just look up a constructor
in the environment, since we can well have patterns that don't have __any__ constructors:
```
case l of {
l -> { 0 }
}
```
So, we want each `ast` node to store its type (well, in practice we only need this for
`ast_case`, but we might as well store it for all nodes). We can add it, no problem.
To add to that, we can add another, non-virtual `typecheck` method (let's call it `typecheck_common`,
since naming is hard). This method will call `typecheck`, and store the output into
the `node_type` field.
The signature is identical to `typecheck`, except it's neither virtual nor const:
```C++
type_ptr typecheck_common(type_mgr& mgr, const type_env& env);
```
And the implementation is as simple as you think:
{{< codelines "C++" "compiler/06/ast.cpp" 9 12 >}}
In client code (`definition_defn::typecheck_first` for instance), we should now
use `typecheck_common` instead of `typecheck`. With that done, we're almost there.
However, we're still missing something: most likely, the initial type assigned to any
node is a `type_var`, or a type variable. In this case, `type_var` __needs__ the information
from `type_mgr`, which we will not be keeping around. Besides, it's cleaner to keep the actual type
as a member of the node, not a variable type that references it. In order
to address this, we write two conversion functions that call `resolve` on all
types in an AST, given a type manager. After this is done, the type manager can be thrown away.
The signatures of the functions are as follows:
```C++
void resolve_common(const type_mgr& mgr);
virtual void resolve(const type_mgr& mgr) const = 0;
```
We also add the `resolve` method to `definition`, so that we can call it
without having to run `dynamic_cast`. The implementation for `ast::resolve_common`
just resolves the type:
{{< codelines "C++" "compiler/06/ast.cpp" 14 21 >}}
The virtual `ast::resolve` just calls `ast::resolve_common` on all `ast` children
of a node. Here's a sample implementation from `ast_binop`:
{{< codelines "C++" "compiler/06/ast.cpp" 98 101 >}}
And here's the implementation of `definition::resolve` on `definition_defn`:
{{< codelines "C++" "compiler/06/definition.cpp" 32 42 >}}
Finally, we call `resolve` at the end of `typecheck_program` in `main.cpp`:
{{< codelines "C++" "compiler/06/main.cpp" 40 42 >}}
At last, we're ready to implement the code for compiling `ast_case`.
Here it is, in all its glory:
{{< codelines "C++" "compiler/06/ast.cpp" 178 230 >}}
There's a lot to unpack here. First of all, just like we said in the compilation
scheme, we want to build and evaluate the expression that's being analyzed.
Once that's done, however, things get more tricky. We know that each
branch of a case expression will correspond to a vector of instructions -
in fact, our jump instruction contains a mapping from tags to instructions.
As we also discussed above, each list of instructions can be mapped to
by multiple tags. We don't want to recompile the same sequence of instructions
multiple times (or indeed, generate machine code for it). So, we keep
a mapping of tags to their corresponding sequences of instructions. We implement
this by having a vector of vectors of instructions (in which each inner vector
represents the code for a branch), and a map of tag number to index
in the vector containing all the branches. This way, multiple tags
can point to the same instruction set without duplicating information.
We also don't allow a tag to be mapped to more than one sequence of instructions.
This is handled differently depending on whether a variable pattern or a
constructor pattern is encountered. Variable patterns map all
tags that haven't been mapped yet, so no error can occur. Constructor patterns,
though, can explicitly try to map the same tag twice, and we don't want that.
I implied in the previous paragraph the implementation of our case expression
compilation algorithm, but let's go through it. Once we've compiled
the expression to be analyzed, and evaluated it (just like in our definitions
above), we proceed to look at all the branches specified in the case expression.
If a branch has a variable pattern, we must map to the result of the compilation
all the remaining, unmapped tags. We also aren't going to be taking apart
our value, so we don't need to use Split, but we do need to add 1 to the
environment offset to account for the presence of that value. So,
we compile the branch body with that offset, and iterate through
all the constructors of our data type. We skip a constructor
if it's been mapped, and if it hasn't been, we map it to the index
that this branch body will have in our list. Finally,
we push the newly compiled instruction sequence into the list of branch
bodies.
If a branch is a constructor pattern, on the other hand, we lead our compilation
output with a Split. This takes off the value from the stack, but pushes on
all the parameters of the constructor. We account for this by incrementing the
environment with the offset given by the number of arguments (just like we did
in our definitions of our compilation scheme). Before we map the tag,
we ensure that it hasn't already been mapped (and throw an exception, currently
in the form of a type error due to the growing length of this post),
and finally map it and insert the new branch code into the list of branches.
After we're done with all the branches, we also check for non-exhaustive patterns,
since otherwise we could run into runtime errors. With this, the case expression,
and the last of the AST nodes, can be compiled.
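The tag bookkeeping on its own is easier to see in isolation. Below is a simplified sketch of just that part, under several assumptions: `branch_sketch`, `jump_sketch` and `build_jump` are hypothetical names, a branch body is a single string standing in for its compiled instruction sequence, an empty constructor name stands for a variable pattern, and plain `std::runtime_error` is used for every failure.

```C++
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified branch of a case expression.
struct branch_sketch {
    std::string constructor; // empty string marks a variable pattern
    std::string body;        // stand-in for the compiled instructions
};

struct jump_sketch {
    std::vector<std::string> branches; // one entry per compiled branch body
    std::map<int, int> tag_mappings;   // constructor tag -> index into `branches`
};

jump_sketch build_jump(const std::map<std::string, int>& constructor_tags,
                       const std::vector<branch_sketch>& branches) {
    jump_sketch jump;
    for (const auto& branch : branches) {
        // The index this branch body will have once it is pushed below.
        int index = static_cast<int>(jump.branches.size());
        if (branch.constructor.empty()) {
            // Variable pattern: claim every tag that is still unmapped.
            for (const auto& ctor : constructor_tags) {
                if (jump.tag_mappings.count(ctor.second) == 0)
                    jump.tag_mappings[ctor.second] = index;
            }
        } else {
            // Constructor pattern: its tag must not have been claimed yet.
            auto it = constructor_tags.find(branch.constructor);
            if (it == constructor_tags.end())
                throw std::runtime_error("unknown constructor");
            if (jump.tag_mappings.count(it->second) != 0)
                throw std::runtime_error("tag mapped twice");
            jump.tag_mappings[it->second] = index;
        }
        jump.branches.push_back(branch.body);
    }
    // Non-exhaustive patterns would otherwise fail at runtime.
    for (const auto& ctor : constructor_tags) {
        if (jump.tag_mappings.count(ctor.second) == 0)
            throw std::runtime_error("non-exhaustive patterns");
    }
    // Sanity check: every mapping points at a branch we actually stored.
    for (const auto& mapping : jump.tag_mappings)
        assert(mapping.second >= 0 &&
               mapping.second < static_cast<int>(jump.branches.size()));
    return jump;
}
```

For a type with constructors `True` (tag 0) and `False` (tag 1), a single variable-pattern branch produces one stored body with both tags pointing at index 0. Note that in this sketch a constructor pattern placed after a variable pattern is reported as a duplicate, since the variable pattern has already claimed its tag.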
We also add a `compile` method to definitions, since they contain
our AST nodes. The method is empty for `definition_data`, and
looks as follows for `definition_defn`:
{{< codelines "C++" "compiler/06/definition.cpp" 44 52 >}}
Notice that we terminate the function with Update and Pop. Update
will turn the `ast_app` node that served as the "root"
of the application into an indirection to the value that we have computed.
After this, Pop will remove all "scratch work" from the stack.
In essence, this is how we can lazily evaluate expressions.
Finally, we make a function in our `main.cpp` file to compile
all the definitions:
{{< codelines "C++" "compiler/06/main.cpp" 45 56 >}}
In this method, we also include some extra
output to help us see the result of our compilation. Since
at the moment, only `definition_defn` definitions have to
be compiled, we try to cast all definitions to it, and if
we succeed, we print them out.
Let's try it all out! For the below sample program:
{{< rawblock "compiler/06/examples/works1.txt" >}}
Our compiler produces the following new output:
```
PushInt(6)
PushInt(320)
PushGlobal(plus)
MkApp()
MkApp()
Update(0)
Pop(0)
Push(1)
Push(1)
PushGlobal(plus)
MkApp()
MkApp()
Update(2)
Pop(2)
```
The first sequence of instructions is clearly `main`. It creates
an application of `plus` to `320`, and then applies that to
`6`, which results in `plus 320 6`, which is correct. The
second sequence of instructions pushes the parameter that
sits on offset 1 from the top of the stack (`y`). It then
pushes a parameter from the same offset again, but this time,
since `y` was previously pushed on the stack, `x` is now
in that position, so `x` is pushed onto the stack.
Finally, `+` is pushed, and the application
`(+) x y` is created, which is equivalent to `x+y`.
Let's also take a look at a case expression program:
{{< rawblock "compiler/06/examples/works3.txt" >}}
The result of the compilation is as follows:
```
Push(0)
Eval()
Jump(
Split()
PushInt(0)
Slide(0)
Split()
Push(1)
PushGlobal(length)
MkApp()
PushInt(1)
PushGlobal(plus)
MkApp()
MkApp()
Slide(2)
)
Update(1)
Pop(1)
```
We push the first (and only) parameter onto the stack. We then make
sure it's evaluated, and perform case analysis: if the list
is `Nil`, we simply push the number 0 onto the stack. If it's
a concatenation of some `x` and another list `xs`, we
push `xs` and `length` onto the stack, make the application
(`length xs`), push the 1, and finally apply `+` to the result.
This all makes sense!
With this, we've been able to compile our expressions and functions
into G-machine code. We're not done, however - our computers
aren't G-machines. We'll need to compile our G-machine code to
__machine code__ (we will use LLVM for this), implement the
__runtime__, and develop a __garbage collector__. We'll
tackle the first of these in the next post - [Part 7 - Runtime]({{< relref "07_compiler_runtime.md" >}}).