title | date | draft | tags | |||
---|---|---|---|---|---|---|
Compiling a Functional Language Using C++, Part 6 - Compilation | 2019-08-06T14:26:38-07:00 | true |
|
In the previous post, we defined a machine for graph reduction, called a G-machine. However, this machine is still not particularly connected to our language. In this post, we will give meanings to programs in our language in the context of this G-machine. We will define a compilation scheme, which will be a set of rules that tell us how to translate programs in our language into G-machine instructions. To mirror Implementing Functional Languages: a tutorial, we'll call this compilation scheme \(\mathcal{C}\), and write it as \(\mathcal{C} ⟦e⟧ = i\), meaning "the expression \(e\) compiles to the instructions \(i\)".
Following the route we took with typechecking, let's start with compiling expressions that are numbers. It's pretty easy:
\[ \mathcal{C} ⟦n⟧ = [\text{PushInt} \; n] \]
Here, we compiled a number expression to a list of instructions with only one element - PushInt.
Just like when we did typechecking, let's move on to compiling function applications. As we informally stated in the previous chapter, since the thing we're applying has to be on top, we want to compile it last:
\[ \mathcal{C} ⟦e_1 \; e_2⟧ = \mathcal{C} ⟦e_2⟧ ⧺ \mathcal{C} ⟦e_1⟧ ⧺ [\text{MkApp}] \]
Here, we used the \(⧺\) operator to represent the concatenation of two lists. Otherwise, this should be pretty intuitive - we first run the instructions to create the parameter, then we run the instructions to create the function, and finally, we combine them using MkApp.
It's variables that once again force us to adjust our strategy. If our program is well-typed, we know our variable will be on the stack: our definition of Unwind makes it so for functions, and we will define our case expression compilation scheme to match. However, we still need to know where on the stack each variable is, and this changes as the stack is modified.
To accommodate this, we define an environment, \(\rho\), to be a partial function mapping variable names to their offsets on the stack. We write \(\rho = [x \rightarrow n, y \rightarrow m]\) to say "the environment \(\rho\) maps variable \(x\) to stack offset \(n\), and variable \(y\) to stack offset \(m\)". We also write \(\rho \; x\) to say "look up \(x\) in \(\rho\)", since \(\rho\) is a function. Finally, to help with the ever-changing stack, we define an augmented environment \(\rho^{+n}\), such that \(\rho^{+n} \; x = \rho \; x + n\). In words, this basically means "\(\rho^{+n}\) has all the variables from \(\rho\), but their offsets are incremented by \(n\)". We now pass \(\rho\) to \(\mathcal{C}\) together with the expression \(e\). Let's rewrite our first two rules. For numbers:
\[ \mathcal{C} ⟦n⟧ \; \rho = [\text{PushInt} \; n] \]
For function application:
\[ \mathcal{C} ⟦e_1 \; e_2⟧ \; \rho = \mathcal{C} ⟦e_2⟧ \; \rho ⧺ \mathcal{C} ⟦e_1⟧ \; \rho^{+1} ⧺ [\text{MkApp}] \]
Notice how in that last rule, we passed in \(\rho^{+1}\) when compiling the function's expression. This is because the result of running the instructions for \(e_2\) will have left the function's parameter on the stack. Whatever was at the top of the stack (and thus had offset 0) is now the second element from the top (offset 1). The same is true for all other things that were on the stack. So, we increment the environment accordingly.
With the environment, the variable rule is simple:
\[ \mathcal{C} ⟦x⟧ \; \rho = [\text{Push} \; (\rho \; x)] \]
One more thing. If we run across a function name, we want to use PushGlobal rather than Push. Defining \(f\) to be a name of a global function, we capture this using the following rule:
\[ \mathcal{C} ⟦f⟧ \; \rho = [\text{PushGlobal} \; f] \]
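As a quick check of how these four rules interact, consider compiling the application \(f \; x \; y\). Assume, purely for illustration, that \(f\) is a global function and that the environment is \(\rho = [x \rightarrow 0, y \rightarrow 1]\). Application associates to the left, so the expression is \((f \; x) \; y\), and the two environment shifts compose into \(\rho^{+2}\):

\begin{align}
\mathcal{C} ⟦(f \; x) \; y⟧ \; \rho
&= \mathcal{C} ⟦y⟧ \; \rho ⧺ \mathcal{C} ⟦f \; x⟧ \; \rho^{+1} ⧺ [\text{MkApp}] \\
&= [\text{Push} \; 1] ⧺ \mathcal{C} ⟦x⟧ \; \rho^{+1} ⧺ \mathcal{C} ⟦f⟧ \; \rho^{+2} ⧺ [\text{MkApp}] ⧺ [\text{MkApp}] \\
&= [\text{Push} \; 1, \text{Push} \; 1, \text{PushGlobal} \; f, \text{MkApp}, \text{MkApp}]
\end{align}

The second \(\text{Push} \; 1\) refers to \(x\): its offset was 0 in \(\rho\), but it moved to 1 once \(y\) was pushed on top of it.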
Now it's time for us to compile case expressions, but there's a bit of an issue - the branches of our case expressions don't map one-to-one to the \(t \rightarrow i_t\) format of the Jump instruction. This is because we allow for name patterns in the form \(x\), which can possibly match more than one tag. Consider this rather useless example:
data Bool = { True, False }
defn weird b = { case b of { b -> { False } } }
We only have one branch, but we have two tags that should lead to it! Not only that, but variable patterns are location-dependent: if a variable pattern comes before a constructor pattern, then the constructor pattern will never be reached. On the other hand, if a constructor pattern comes before a variable pattern, it will be tried before the variable pattern, and thus is reachable.
We will ignore this problem for now - we will define our semantics as though each case expression branch can match exactly one tag. In our C++ code, we will write a conversion function that will figure out which tag goes to which sequence of instructions. Effectively, we'll be performing desugaring.
Now, on to defining the compilation rules for case expressions. It's helpful to define compiling a single branch of a case expression separately. For a branch in the form \(t \; x_1 \; x_2 \; ... \; x_n \rightarrow \text{body}\), we define a compilation scheme \(\mathcal{A}\) as follows:
\begin{align}
\mathcal{A} ⟦t \; x_1 \; ... \; x_n \rightarrow \text{body}⟧ \; \rho & =
t \rightarrow [\text{Split} \; n] \; ⧺ \; \mathcal{C}⟦\text{body}⟧ \; \rho' \; ⧺ \; [\text{Slide} \; n] \\
\text{where} \; \rho' &= \rho^{+n}[x_1 \rightarrow 0, ..., x_n \rightarrow n - 1]
\end{align}
First, we run Split - the node on the top of the stack is a packed constructor, and we want access to its member variables, since they can be referenced by the branch's body via \(x_i\). For the same reason, we must make sure to include \(x_1\) through \(x_n\) in our environment. Furthermore, since the split values now occupy the stack, we have to offset our environment by \(n\) before adding bindings to our new variables. Doing all these things gives us \(\rho'\), which we use to compile the body, placing the resulting instructions after Split. This leaves us with the desired graph on top of the stack - the only thing left to do is to clean up the stack of the unpacked values, which we do using Slide.
Notice that we didn't just create instructions - we created a mapping from the tag \(t\) to the instructions that correspond to it.
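To make the rule concrete, here is a small worked instance. It assumes a hypothetical two-field constructor \(\text{Cons}\) (not a type defined in this post) and an outer environment \(\rho = [l \rightarrow 0]\), and it compiles the branch \(\text{Cons} \; x \; xs \rightarrow x\), for which \(n = 2\):

\begin{align}
\rho' &= \rho^{+2}[x \rightarrow 0, xs \rightarrow 1] = [l \rightarrow 2, x \rightarrow 0, xs \rightarrow 1] \\
\mathcal{A} ⟦\text{Cons} \; x \; xs \rightarrow x⟧ \; \rho
&= \text{Cons} \rightarrow [\text{Split} \; 2] ⧺ \mathcal{C} ⟦x⟧ \; \rho' ⧺ [\text{Slide} \; 2] \\
&= \text{Cons} \rightarrow [\text{Split} \; 2, \text{Push} \; 0, \text{Slide} \; 2]
\end{align}

The outer variable \(l\) is still reachable inside the body, just two slots deeper than before.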
Now, it's time for compiling the whole case expression. We first want to construct the graph for the expression we want to perform case analysis on. Next, we want to evaluate it (since we need a packed value, not a graph, to read the tag). Finally, we perform a jump depending on the tag. This is captured by the following rule:
\[
\mathcal{C} ⟦\text{case} \; e \; \text{of} \; \text{alt}_1 ... \text{alt}_n⟧ \; \rho =
\mathcal{C} ⟦e⟧ \; \rho \; ⧺ [\text{Eval}, \text{Jump} \; [\mathcal{A} ⟦\text{alt}_1⟧ \; \rho, ..., \mathcal{A} ⟦\text{alt}_n⟧ \; \rho]]
\]
This works because \(\mathcal{A}\) creates not only instructions, but also a tag mapping. We simply populate our Jump instruction with the mappings that result from compiling each branch.
You may have noticed that we didn't add rules for binary operators. Just like with type checking, we treat them as function calls. However, rather than constructing graphs when we have to instantiate those functions, we simply evaluate the arguments and perform the relevant arithmetic operation using BinOp. We will do a similar thing for constructors.
Implementation
With that out of the way, we can get around to writing some code. Let's first define C++ structs for the instructions of the G-machine:
{{< codeblock "C++" "compiler/06/instruction.hpp" >}}
We can now envision a method on the `ast` struct that takes an environment (just like our compilation scheme takes the environment \(\rho\)), and compiles the `ast`. Rather than returning a `vector` of instructions (which involves copying, unless we get some optimization kicking in), we'll pass a reference to a vector to our method. The method will then place the generated instructions into the vector.
There's one more thing to be considered. How do we tell apart a "global" from a variable? A naive solution would be to take a list or map of global functions as a third parameter to our `compile` method.
But there's an easier way! We know that the program passed type checking. This means that every referenced variable exists. From there, the situation is easy - if actual variable names are kept in the environment, \(\rho\), then whenever we see a variable that isn't in the current environment, it must be a function name.
Having finished contemplating our method, it's time to define a signature:
virtual void compile(const env_ptr& env, std::vector<instruction_ptr>& into) const;
Ah, but now we have to define "environment". Let's do that. Here's our header:
{{< codeblock "C++" "compiler/06/env.hpp" >}}
And here's the source file:
{{< codeblock "C++" "compiler/06/env.cpp" >}}
There's not that much to see here, but let's go through it anyway. We define an environment as a linked list, kind of like we did with the type environment. This time, though, we use shared pointers instead of raw pointers to reference the parent. I decided on this because we will need to be using virtual methods (since we have two subclasses of `env`), and thus will need to be passing the `env` by pointer. At that point, we might as well use the "proper" way!
I implemented the environment as a linked list because it is, in essence, a stack. However, not every "offset" in a stack is introduced by binding variables - for instance, when we create an application node, we first build the argument value on the stack, and then, with that value still on the stack, build the left-hand side of the application. Thus, all the variable positions are offset by the presence of the argument on the stack, and we must account for that. Similarly, in cases when we will allocate space on the stack (we will run into these cases later), we will need to account for that change. Thus, since we can increment the offset in two ways (binding a variable and building something on the stack), we allow for two types of nodes in our `env` stack.
During recursion we will be tweaking the return value of `get_offset` to calculate the final location of a variable on the stack (if the parent of a node returned offset `1`, but the node itself is a variable node and thus introduces another offset, we need to return `2`). Because of this, we cannot reasonably return a constant like `-1` (it will quickly be made positive on a long list), and thus we throw an exception. To allow for a safe way to check for an offset, without try-catch, we also add a `has_variable` method which checks if the lookup will succeed. A better approach would be to use `std::optional`, but it's C++17, so we'll shy away from it.
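To make the recursion concrete, here is a minimal sketch of such a two-node environment. It is not the contents of `env.hpp` or `env.cpp`: the names `env`, `env_offset`, `get_offset` and `has_variable` come from the text above, while `env_var`, the constructor signatures, and the choice of `std::out_of_range` as the exception are assumptions made for illustration.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Illustrative sketch only; member layout and exception type are assumptions.
struct env {
    virtual ~env() = default;
    // Offset of `name` from the top of the stack; throws if it is unbound.
    virtual int get_offset(const std::string& name) const = 0;
    // True if get_offset(name) would succeed.
    virtual bool has_variable(const std::string& name) const = 0;
};

using env_ptr = std::shared_ptr<env>;

// A node that binds one variable, which occupies one stack slot.
struct env_var : env {
    std::string name;
    env_ptr parent;

    env_var(std::string n, env_ptr p)
        : name(std::move(n)), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (n == name) return 0;  // this node's own slot
        // Otherwise the variable lives below this node's slot.
        if (parent) return parent->get_offset(n) + 1;
        throw std::out_of_range("unbound variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        if (n == name) return true;
        return parent && parent->has_variable(n);
    }
};

// A node that records `offset` anonymous slots (values built on the stack).
struct env_offset : env {
    int offset;
    env_ptr parent;

    env_offset(int o, env_ptr p) : offset(o), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (parent) return parent->get_offset(n) + offset;
        throw std::out_of_range("unbound variable: " + n);
    }

    bool has_variable(const std::string& n) const override {
        return parent && parent->has_variable(n);
    }
};
```

In this sketch, wrapping an environment in `env_offset` with an offset of 1 plays the role of \(\rho^{+1}\): every lookup that succeeds in the parent succeeds in the wrapper, with a result that is one larger.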
It will also help to move some of the functions on the `binop` enum into a separate file. The new header is pretty small:
{{< codeblock "C++" "compiler/06/binop.hpp" >}}
The new source file is not much longer:
{{< codeblock "C++" "compiler/06/binop.cpp" >}}
And now, we begin our implementation. Let's start with the easy ones: `ast_int`, `ast_lid` and `ast_uid`. The code for `ast_int` involves just pushing the integer onto the stack:
{{< codelines "C++" "compiler/06/ast.cpp" 18 20 >}}
The code for `ast_lid` needs to check if the variable is global or local, just like we discussed:
{{< codelines "C++" "compiler/06/ast.cpp" 31 36 >}}
We do not have to do this for `ast_uid`:
{{< codelines "C++" "compiler/06/ast.cpp" 47 49 >}}
On to `ast_binop`! This is the first time we have to change our environment. Once we build the right operand on the stack, every offset that we counted from the top of the stack will have been shifted by 1 (we see this in our compilation scheme for function application). So, we create a new environment with `env_offset`, and use that when we compile the left child:
{{< codelines "C++" "compiler/06/ast.cpp" 72 79 >}}
`ast_binop` performs two applications: `(+) lhs rhs`. We push `rhs`, then `lhs`, then `(+)`, and then use `MkApp` twice. In `ast_app`, we only need to perform one application, `lhs rhs`:
{{< codelines "C++" "compiler/06/ast.cpp" 98 102 >}}
Note that we also extend our environment in this one, for the exact same reason as before.
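To see the offset bookkeeping end to end, here is a self-contained sketch of the number, variable, and application rules. It is deliberately simplified and is not the code from `ast.cpp`: the `expr_*` types, `stack_env`, and `find_offset` are hypothetical names, instructions are plain strings rather than instruction structs, and the environment is a flat vector (index 0 is the top of the stack) instead of a linked list.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the environment (an assumption of this sketch):
// index 0 is the top of the stack, and an empty string marks an anonymous
// slot, i.e. a value that was built on the stack rather than bound to a name.
using stack_env = std::vector<std::string>;

// Returns the stack offset of `name`, or -1 if it is not a local variable.
int find_offset(const stack_env& env, const std::string& name) {
    for (std::size_t i = 0; i < env.size(); i++) {
        if (env[i] == name) return static_cast<int>(i);
    }
    return -1;
}

struct expr {
    virtual ~expr() = default;
    virtual void compile(const stack_env& env,
                         std::vector<std::string>& into) const = 0;
};

using expr_ptr = std::unique_ptr<expr>;

struct expr_int : expr {
    int value;
    explicit expr_int(int v) : value(v) {}

    void compile(const stack_env&,
                 std::vector<std::string>& into) const override {
        into.push_back("PushInt " + std::to_string(value));
    }
};

struct expr_var : expr {
    std::string name;
    explicit expr_var(std::string n) : name(std::move(n)) {}

    void compile(const stack_env& env,
                 std::vector<std::string>& into) const override {
        int offset = find_offset(env, name);
        // Anything that is not in the environment is treated as a global.
        if (offset >= 0) into.push_back("Push " + std::to_string(offset));
        else into.push_back("PushGlobal " + name);
    }
};

struct expr_app : expr {
    expr_ptr left;
    expr_ptr right;

    expr_app(expr_ptr l, expr_ptr r)
        : left(std::move(l)), right(std::move(r)) {}

    void compile(const stack_env& env,
                 std::vector<std::string>& into) const override {
        // Argument first, so that the function ends up on top.
        right->compile(env, into);
        // The argument now occupies one slot: shift every offset by one.
        stack_env shifted = env;
        shifted.insert(shifted.begin(), std::string());
        left->compile(shifted, into);
        into.push_back("MkApp");
    }
};
```

Compiling `(f x) y` with this sketch, in an environment where `x` is at offset 0 and `y` at offset 1, yields `Push 1`, `Push 1`, `PushGlobal f`, `MkApp`, `MkApp`. Copying the vector on every application is wasteful, which is one reason a linked list of shared nodes is a better fit for a real implementation.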
Case expressions are the only thing left on the agenda. This is the time during which we have to perform desugaring. Here, though, we run into an issue: we don't have tags assigned to constructors!