Rename page and add pop instruction to part 5 of compiler series

2019-11-14 11:05:17 -08:00
parent 58c9d5f982
commit c309ac4c14
3 changed files with 23 additions and 3 deletions
--- a/content/blog/06_compiler_compilation.md
+++ b/content/blog/06_compiler_compilation.md
@@ -0,0 +1,504 @@
+---
+title: Compiling a Functional Language Using C++, Part 6 - Compilation
+date: 2019-08-06T14:26:38-07:00
+tags: ["C and C++", "Functional Languages", "Compilers"]
+---
+In the previous post, we defined a machine for graph reduction,
+called a G-machine. However, this machine is still not particularly
+connected to __our__ language. In this post, we will give
+meanings to programs in our language in the context of
+this G-machine. We will define a __compilation scheme__,
+which will be a set of rules that tell us how to
+translate programs in our language into G-machine instructions.
+To mirror _Implementing Functional Languages: a tutorial_, we'll
+call this compilation scheme \\(\\mathcal{C}\\), and write it
+as \\(\\mathcal{C} ⟦e⟧ = i\\), meaning "the expression \\(e\\)
+compiles to the instructions \\(i\\)".
+
+To follow our route from the typechecking, let's start
+with compiling expressions that are numbers. It's pretty easy:
+$$
+\\mathcal{C} ⟦n⟧ = [\\text{PushInt} \\; n]
+$$
+
+Here, we compiled a number expression to a list of
+instructions with only one element - PushInt.
+
+Just like when we did typechecking, let's
+move on to compiling function applications. As
+we informally stated in the previous chapter, since
+the thing we're applying has to be on top,
+we want to compile it last:
+
+$$
+\\mathcal{C} ⟦e\_1 \\; e\_2⟧ = \\mathcal{C} ⟦e\_2⟧ ⧺ \\mathcal{C} ⟦e\_1⟧ ⧺ [\\text{MkApp}]
+$$
+
+Here, we used the \\(⧺\\) operator to represent the concatenation of two
+lists. Otherwise, this should be pretty intuitive - we first run the instructions
+to create the parameter, then we run the instructions to create the function,
+and finally, we combine them using MkApp.
+
+It's variables that once again force us to adjust our strategy. If our
+program is well-typed, we know our variable will be on the stack:
+our definition of Unwind makes it so for functions, and we will
+define our case expression compilation scheme to match. However,
+we still need to know __where__ on the stack each variable is,
+and this changes as the stack is modified.
+
+To account for this, we define an environment, \\(\\rho\\),
+to be a partial function mapping variable names to their
+offsets on the stack. We write \\(\\rho = [x \\rightarrow n, y \\rightarrow m]\\)
+to say "the environment \\(\\rho\\) maps variable \\(x\\) to stack offset \\(n\\),
+and variable \\(y\\) to stack offset \\(m\\)". We also write \\(\\rho \\; x\\) to
+say "look up \\(x\\) in \\(\\rho\\)", since \\(\\rho\\) is a function. Finally,
+to help with the ever-changing stack, we define an augmented environment
+\\(\\rho^{+n}\\), such that \\(\\rho^{+n} \\; x = \\rho \\; x + n\\). In words,
+this basically means "\\(\\rho^{+n}\\) has all the variables from \\(\\rho\\),
+but their addresses are incremented by \\(n\\)". We now pass \\(\\rho\\)
+in to \\(\\mathcal{C}\\) together with the expression \\(e\\). Let's
+rewrite our first two rules. For numbers:
+
+$$
+\\mathcal{C} ⟦n⟧ \\; \\rho = [\\text{PushInt} \\; n]
+$$
+
+For function application:
+$$
+\\mathcal{C} ⟦e\_1 \\; e\_2⟧ \\; \\rho = \\mathcal{C} ⟦e\_2⟧ \\; \\rho ⧺ \\mathcal{C} ⟦e\_1⟧ \\; \\rho^{+1} ⧺ [\\text{MkApp}]
+$$
+
+Notice how in that last rule, we passed in \\(\\rho^{+1}\\) when compiling the function's expression. This is because
+the result of running the instructions for \\(e\_2\\) will have left on the stack the function's parameter. Whatever
+was at the top of the stack (and thus, had index 0), is now the second element from the top (address 1). The
+same is true for all other things that were on the stack. So, we increment the environment accordingly.
+
+With the environment, the variable rule is simple:
+$$
+\\mathcal{C} ⟦x⟧ \\; \\rho = [\\text{Push} \\; (\\rho \\; x)]
+$$
+
+One more thing. If we run across a function name, we want to
+use PushGlobal rather than Push. Defining \\(f\\) to be a name
+of a global function, we capture this using the following rule:
+
+$$
+\\mathcal{C} ⟦f⟧ \\; \\rho = [\\text{PushGlobal} \\; f]
+$$
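
To see the four rules working together, here is a small worked example of my own (it is not taken from the book). Assume a hypothetical global function \\(f\\) and an environment \\(\\rho = [x \\rightarrow 0, y \\rightarrow 1]\\), and read \\(f \\; x \\; y\\) as \\((f \\; x) \\; y\\):

$$
\\begin{align}
\\mathcal{C} ⟦f \\; x \\; y⟧ \\; \\rho
& = \\mathcal{C} ⟦y⟧ \\; \\rho ⧺ \\mathcal{C} ⟦f \\; x⟧ \\; \\rho^{+1} ⧺ [\\text{MkApp}] \\\\\\
& = [\\text{Push} \\; 1] ⧺ \\mathcal{C} ⟦x⟧ \\; \\rho^{+1} ⧺ \\mathcal{C} ⟦f⟧ \\; \\rho^{+2} ⧺ [\\text{MkApp}] ⧺ [\\text{MkApp}] \\\\\\
& = [\\text{Push} \\; 1, \\text{Push} \\; 1, \\text{PushGlobal} \\; f, \\text{MkApp}, \\text{MkApp}]
\\end{align}
$$

Both variables compile to Push 1: \\(y\\) starts at offset 1, and by the time we compile \\(x\\), the freshly built \\(y\\) has shifted \\(x\\) from offset 0 to offset 1.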
+
+Now it's time for us to compile case expressions, but there's a bit of
+an issue - our case expression branches don't map one-to-one with
+the \\(t \\rightarrow i\_t\\) format of the Jump instruction.
+This is because we allow for name patterns in the form \\(x\\),
+which can possibly match more than one tag. Consider this
+rather useless example:
+
+```
+data Bool = { True, False }
+defn weird b = { case b of { b -> { False } } }
+```
+
+We only have one branch, but we have two tags that should
+lead to it! Not only that, but variable patterns are
+location-dependent: if a variable pattern comes
+before a constructor pattern, then the constructor
+pattern will never be reached. On the other hand,
+if a constructor pattern comes before a variable
+pattern, it will be tried before the variable pattern,
+and thus is reachable.
+
+We will ignore this problem for now - we will define our semantics
+as though each case expression branch can match exactly one tag.
+In our C++ code, we will write a conversion function that will
+figure out which tag goes to which sequence of instructions.
+Effectively, we'll be performing [desugaring](https://en.wikipedia.org/wiki/Syntactic_sugar).
+
+Now, on to defining the compilation rules for case expressions.
+It's helpful to define compiling a single branch of a case expression
+separately. For a branch in the form \\(t \\; x\_1 \\; x\_2 \\; ... \\; x\_n \\rightarrow \\text{body}\\),
+we define a compilation scheme \\(\\mathcal{A}\\) as follows:
+
+$$
+\\begin{align}
+\\mathcal{A} ⟦t \\; x\_1 \\; ... \\; x\_n \\rightarrow \\text{body}⟧ \\; \\rho & =
+t \\rightarrow [\\text{Split} \\; n] \\; ⧺ \\; \\mathcal{C}⟦\\text{body}⟧ \\; \\rho' \\; ⧺ \\; [\\text{Slide} \\; n] \\\\\\
+\\text{where} \\; \\rho' &= \\rho^{+n}[x\_1 \\rightarrow 0, ..., x\_n \\rightarrow n - 1]
+\\end{align}
+$$
+
+First, we run Split - the node on the top of the stack is a packed constructor,
+and we want access to its member variables, since they can be referenced by
+the branch's body via \\(x\_i\\). For the same reason, we must make sure to include
+\\(x\_1\\) through \\(x\_n\\) in our environment. Furthermore, since the split values now occupy the stack,
+we have to offset our environment by \\(n\\) before adding bindings to our new variables.
+Doing all these things gives us \\(\\rho'\\), which we use to compile the body, placing
+the resulting instructions after Split. This leaves us with the desired graph on top of
+the stack - the only thing left to do is to clean up the stack of the unpacked values,
+which we do using Slide.
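
As a concrete instance (my own example, with a hypothetical two-argument constructor \\(\\text{Cons}\\) and a branch body that is just the variable \\(xs\\)), take \\(\\rho = [l \\rightarrow 0]\\). Then \\(n = 2\\), and:

$$
\\begin{align}
\\rho' & = \\rho^{+2}[x \\rightarrow 0, xs \\rightarrow 1] = [x \\rightarrow 0, xs \\rightarrow 1, l \\rightarrow 2] \\\\\\
\\mathcal{A} ⟦\\text{Cons} \\; x \\; xs \\rightarrow xs⟧ \\; \\rho & = \\text{Cons} \\rightarrow [\\text{Split} \\; 2, \\text{Push} \\; 1, \\text{Slide} \\; 2]
\\end{align}
$$

The two unpacked fields take offsets 0 and 1, and everything that was already in the environment moves down by 2.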
+
+Notice that we didn't just create instructions - we created a mapping from the tag \\(t\\)
+to the instructions that correspond to it.
+
+Now, it's time for compiling the whole case expression. We first want
+to construct the graph for the expression we want to perform case analysis on.
+Next, we want to evaluate it (since we need a packed value, not a graph,
+to read the tag). Finally, we perform a jump depending on the tag. This
+is captured by the following rule:
+
+$$
+\\mathcal{C} ⟦\\text{case} \\; e \\; \\text{of} \\; \\text{alt}\_1 ... \\text{alt}\_n⟧ \\; \\rho =
+\\mathcal{C} ⟦e⟧ \\; \\rho \\; ⧺ [\\text{Eval}, \\text{Jump} \\; [\\mathcal{A} ⟦\\text{alt}\_1⟧ \\; \\rho, ..., \\mathcal{A} ⟦\\text{alt}\_n⟧ \\; \\rho]]
+$$
+
+This works because \\(\\mathcal{A}\\) creates not only instructions,
+but also a tag mapping. We simply populate our Jump instruction with the mappings
+resulting from compiling each branch.
+
+You may have noticed that we didn't add rules for binary operators. Just like
+with type checking, we treat them as function calls. However, rather than constructing
+graphs when we have to instantiate those functions, we simply
+evaluate the arguments and perform the relevant arithmetic operation using BinOp.
+We will do a similar thing for constructors.
+
+### Implementation
+
+With that out of the way, we can get around to writing some code. Let's
+first define C++ structs for the instructions of the G-machine:
+
+{{< codeblock "C++" "compiler/06/instruction.hpp" >}}
+
+I omit the implementation of the various (trivial) `print` methods in this post;
+as always, you can look at the full project source code, which is 
+freely available for each post in the series.
+
+We can now envision a method on the `ast` struct that takes an environment 
+(just like our compilation scheme takes the environment \\(\\rho\\)),
+and compiles the `ast`. Rather than returning a vector
+of instructions (which involves copying, unless we get some optimization kicking in),
+we'll pass a reference to a vector to our method. The method will then place the generated
+instructions into the vector. 
+
+There's one more thing to be considered. How do we tell apart a "global"
+from a variable? A naive solution would be to take a list or map of
+global functions as a third parameter to our `compile` method.
+But there's an easier way! We know that the program passed type checking.
+This means that every referenced variable exists. From there, the situation is easy -
+if actual variable names are kept in the environment, \\(\\rho\\), then whenever
+we see a variable that __isn't__ in the current environment, it must be a function name.
+
+Having finished contemplating our method, it's time to define a signature:
+```C++
+virtual void compile(const env_ptr& env, std::vector<instruction_ptr>& into) const;
+```
+
+Ah, but now we have to define "environment". Let's do that. Here's our header:
+
+{{< codeblock "C++" "compiler/06/env.hpp" >}}
+
+And here's the source file:
+
+{{< codeblock "C++" "compiler/06/env.cpp" >}}
+
+There's not that much to see here, but let's go through it anyway.
+We define an environment as a linked list, kind of like
+we did with the type environment. This time, though,
+we use shared pointers instead of raw pointers to reference the parent.
+I decided on this because we will need to be using virtual methods
+(since we have two subclasses of `env`), and thus will need to
+be passing the `env` by pointer. At that point, we might as well
+use the "proper" way!
+
+I implemented the environment as a linked list because it is, in essence,
+a stack. However, not every "offset" in a stack is introduced by
+binding variables - for instance, when we create an application node,
+we first build the argument value on the stack, and then,
+with that value still on the stack, build the left hand side of the application.
+Thus, all the variable positions are offset by the presence of the argument
+on the stack, and we must account for that. Similarly, in cases when we will
+allocate space on the stack (we will run into these cases later), we will
+need to account for that change. Thus, since we can increment
+the offset by two ways (binding a variable and building something on the stack),
+we allow for two types of nodes in our `env` stack.
+
+During recursion we will be tweaking the return value of `get_offset` to
+calculate the final location of a variable on the stack (if the
+parent of a node returned offset `1`, but the node itself is a variable
+node and thus introduces another offset, we need to return `2`). Because
+of this, we cannot reasonably return a constant like `-1` (it will quickly
+be made positive on a long list), and thus we throw an exception. To
+allow for a safe way to check for an offset, without try-catch,
+we also add a `has_variable` method which checks if the lookup will succeed.
+A better approach would be to use `std::optional`, but it's C++17, so
+we'll shy away from it.
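
To make the two kinds of nodes concrete, here is a minimal sketch of how such a linked-list environment could look. It is not the project's `env.hpp`: the subclass name `env_var` and all of the method bodies are my assumptions, and only `env`, `env_offset`, `env_ptr`, `get_offset` and `has_variable` are names used in this post.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch; the real definitions live in compiler/06/env.hpp.
struct env {
    virtual ~env() = default;
    virtual int get_offset(const std::string& name) const = 0;
    virtual bool has_variable(const std::string& name) const = 0;
};
using env_ptr = std::shared_ptr<env>;

// A node that binds one variable, which occupies one stack slot.
struct env_var : env {
    std::string name;
    env_ptr parent;
    env_var(std::string n, env_ptr p)
        : name(std::move(n)), parent(std::move(p)) {}

    int get_offset(const std::string& n) const override {
        if (n == name) return 0;
        // Our own slot sits above everything the parent knows about.
        if (parent) return parent->get_offset(n) + 1;
        throw std::out_of_range("unknown variable " + n);
    }
    bool has_variable(const std::string& n) const override {
        return n == name || (parent && parent->has_variable(n));
    }
};

// A node that binds nothing, but records `offset` extra stack slots.
struct env_offset : env {
    int offset;
    env_ptr parent;
    env_offset(int o, env_ptr p) : offset(o), parent(std::move(p)) {
        assert(o >= 0); // the stack only grows while we compile
    }

    int get_offset(const std::string& n) const override {
        if (parent) return parent->get_offset(n) + offset;
        throw std::out_of_range("unknown variable " + n);
    }
    bool has_variable(const std::string& n) const override {
        return parent && parent->has_variable(n);
    }
};
```

With `x` bound on top of `y`, this sketch reports offset 0 for `x` and 1 for `y`. Wrapping that environment in `env_offset(1, ...)` moves them to 1 and 2, and `has_variable("z")` stays false, so no exception has to be caught.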
+
+It will also help to move some of the functions on the `binop` enum
+into a separate file. The new header is pretty small:
+
+{{< codeblock "C++" "compiler/06/binop.hpp" >}}
+
+The new source file is not much longer:
+
+{{< codeblock "C++" "compiler/06/binop.cpp" >}}
+
+And now, we begin our implementation. Let's start with the easy ones:
+`ast_int`, `ast_lid` and `ast_uid`. The code for `ast_int` involves just pushing
+the integer onto the stack:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 36 38 >}}
+
+The code for `ast_lid` needs to check if the variable is global or local,
+just like we discussed:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 53 58 >}}
+
+We do not have to do this for `ast_uid`:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 73 75 >}}
+
+On to `ast_binop`! This is the first time we have to change our environment.
+As we said earlier, once we build the right operand on the stack, every offset that we counted
+from the top of the stack will have been shifted by 1 (we see this
+in our compilation scheme for function application). So,
+we create a new environment with `env_offset`, and use that
+when we compile the left child:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 103 110 >}}
+
+`ast_binop` performs two applications: `(+) lhs rhs`. 
+We push `rhs`, then `lhs`, then `(+)`, and then use MkApp
+twice. In `ast_app`, we only need to perform one application,
+`lhs rhs`:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 134 138 >}}
+
+Note that we also extend our environment in this one,
+for the exact same reason as before.
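
If you'd like to see the number, name and application cases side by side without opening the project, here is a stripped-down sketch. Everything in it is a stand-in of mine: the `expr_*` structs are not the project's `ast_*` classes, instructions are plain strings, and a `std::map` replaces the linked-list environment to keep things short.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the environment: variable name -> stack offset.
using offsets = std::map<std::string, int>;

// rho^{+n}: the same variables, with every offset increased by n.
offsets shifted(const offsets& rho, int n) {
    assert(n >= 0);
    offsets result;
    for (const auto& entry : rho) result[entry.first] = entry.second + n;
    return result;
}

struct expr {
    virtual ~expr() = default;
    virtual void compile(const offsets& rho, std::vector<std::string>& into) const = 0;
};
using expr_ptr = std::unique_ptr<expr>;

struct expr_int : expr {
    int value;
    explicit expr_int(int v) : value(v) {}
    void compile(const offsets&, std::vector<std::string>& into) const override {
        into.push_back("PushInt(" + std::to_string(value) + ")");
    }
};

struct expr_name : expr {
    std::string name;
    explicit expr_name(std::string n) : name(std::move(n)) {}
    void compile(const offsets& rho, std::vector<std::string>& into) const override {
        auto found = rho.find(name);
        // Not in the environment: after type checking, it must be a global.
        if (found == rho.end()) into.push_back("PushGlobal(" + name + ")");
        else into.push_back("Push(" + std::to_string(found->second) + ")");
    }
};

struct expr_app : expr {
    expr_ptr left, right;
    expr_app(expr_ptr l, expr_ptr r) : left(std::move(l)), right(std::move(r)) {}
    void compile(const offsets& rho, std::vector<std::string>& into) const override {
        right->compile(rho, into);            // build the argument first
        left->compile(shifted(rho, 1), into); // the argument now takes one slot
        into.push_back("MkApp()");
    }
};

// Compiles `plus x y`, assuming x is at offset 0 and y is at offset 1.
std::vector<std::string> compile_plus_x_y() {
    expr_ptr plus_x(new expr_app(expr_ptr(new expr_name("plus")),
                                 expr_ptr(new expr_name("x"))));
    expr_app whole(std::move(plus_x), expr_ptr(new expr_name("y")));
    std::vector<std::string> out;
    whole.compile({{"x", 0}, {"y", 1}}, out);
    return out;
}
```

Calling `compile_plus_x_y()` yields `Push(1)`, `Push(1)`, `PushGlobal(plus)`, `MkApp()`, `MkApp()`. Copying the map on every shift is wasteful, which is one reason to prefer a linked list of offset nodes in real code.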
+
+Case expressions are the only thing left on the agenda. This
+is the time during which we have to perform desugaring. Here,
+though, we run into an issue: we don't have tags assigned to constructors!
+We need to adjust our code to keep track of the tags of the various
+constructors of a type. To do this, we add a subclass for the `type_base`
+struct, called `type_data`:
+
+{{< codelines "C++" "compiler/06/type.hpp" 33 42 >}}
+
+When we create types from `definition_data`, we tag the corresponding constructors:
+
+{{< codelines "C++" "compiler/06/definition.cpp" 54 71 >}}
+
+Ah, but adding constructor info to the type doesn't solve the problem.
+Once we performed type checking, we don't keep
+the types that we computed for an AST node, in the node. And obviously, we don't want
+to go looking for them again. Furthermore, we can't just look up a constructor
+in the environment, since we can well have patterns that don't have __any__ constructors:
+
+```
+case l of {
+    l -> { 0 }
+}
+```
+
+So, we want each `ast` node to store its type (well, in practice we only need this for
+`ast_case`, but we might as well store it for all nodes). We can add it, no problem.
+To add to that, we can add another, non-virtual `typecheck` method (let's call it `typecheck_common`,
+since naming is hard). This method will call `typecheck`, and store the output into
+the `node_type` field.
+
+The signature is identical to `typecheck`, except it's neither virtual nor const:
+```C++
+type_ptr typecheck_common(type_mgr& mgr, const type_env& env);
+```
+
+And the implementation is as simple as you think:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 9 12 >}}
+
+In client code (`definition_defn::typecheck_first` for instance), we should now
+use `typecheck_common` instead of `typecheck`. With that done, we're almost there.
+However, we're still missing something: most likely, the initial type assigned to any
+node is a `type_var`, or a type variable. In this case, `type_var` __needs__ the information 
+from `type_mgr`, which we will not be keeping around. Besides, it's cleaner to keep the actual type
+as a member of the node, not a variable type that references it. In order
+to address this, we write two conversion functions that call `resolve` on all
+types in an AST, given a type manager. After this is done, the type manager can be thrown away.
+The signatures of the functions are as follows:
+
+```C++
+void resolve_common(const type_mgr& mgr);
+virtual void resolve(const type_mgr& mgr) const = 0;
+```
+
+We also add the `resolve` method to `definition`, so that we can call it
+without having to run `dynamic_cast`. The implementation for `ast::resolve_common`
+just resolves the type:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 14 21 >}}
+
+The virtual `ast::resolve` just calls `ast::resolve_common` on all `ast` children
+of a node. Here's a sample implementation from `ast_binop`:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 98 101 >}}
+
+And here's the implementation of `definition::resolve` on `definition_defn`:
+
+{{< codelines "C++" "compiler/06/definition.cpp" 32 42 >}}
+
+Finally, we call `resolve` at the end of `typecheck_program` in `main.cpp`:
+
+{{< codelines "C++" "compiler/06/main.cpp" 40 42 >}}
+
+At last, we're ready to implement the code for compiling `ast_case`.
+Here it is, in all its glory:
+
+{{< codelines "C++" "compiler/06/ast.cpp" 178 230 >}}
+
+There's a lot to unpack here. First of all, just like we said in the compilation
+scheme, we want to build and evaluate the expression that's being analyzed.
+Once that's done, however, things get more tricky. We know that each
+branch of a case expression will correspond to a vector of instructions -
+in fact, our jump instruction contains a mapping from tags to instructions.
+As we also discussed above, each list of instructions can be mapped to
+by multiple tags. We don't want to recompile the same sequence of instructions
+multiple times (or indeed, generate machine code for it). So, we keep
+a mapping of tags to their corresponding sequences of instructions. We implement
+this by having a vector of vectors of instructions (in which each inner vector
+represents the code for a branch), and a map of tag number to index
+in the vector containing all the branches. This way, multiple tags
+can point to the same instruction set without duplicating information.
+
+We also don't allow a tag to be mapped to more than one sequence of instructions.
+This is handled differently depending on whether a variable pattern or a
+constructor pattern is encountered. Variable patterns map all
+tags that haven't been mapped yet, so no error can occur. Constructor patterns,
+though, can explicitly try to map the same tag twice, and we don't want that.
+
+I implied in the previous paragraph the implementation of our case expression
+compilation algorithm, but let's go through it. Once we've compiled
+the expression to be analyzed, and evaluated it (just like in our definitions
+above), we proceed to look at all the branches specified in the case expression.
+
+If a branch has a variable pattern, we must map to the result of the compilation
+all the remaining, unmapped tags. We also aren't going to be taking apart
+our value, so we don't need to use Split, but we do need to add 1 to the
+environment offset to account for the presence of that value. So,
+we compile the branch body with that offset, and iterate through
+all the constructors of our data type. We skip a constructor
+if it's been mapped, and if it hasn't been, we map it to the index
+that this branch body will have in our list. Finally,
+we push the newly compiled instruction sequence into the list of branch
+bodies.
+
+If a branch is a constructor pattern, on the other hand, we lead our compilation
+output with a Split. This takes off the value from the stack, but pushes on
+all the parameters of the constructor. We account for this by incrementing the
+environment with the offset given by the number of arguments (just like we did
+in our definitions of our compilation scheme). Before we map the tag,
+we ensure that it hasn't already been mapped (and throw an exception, currently
+in the form of a type error due to the growing length of this post),
+and finally map it and insert the new branch code into the list of branches.
+
+After we're done with all the branches, we also check for non-exhaustive patterns,
+since otherwise we could run into runtime errors. With this, the case expression,
+and the last of the AST nodes, can be compiled.
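
The bookkeeping described above (one instruction list per branch, plus a map from tags to branch indices) can be sketched in isolation. This is a simplified illustration and not the code from `ast.cpp`: the `branch` and `jump_table` structs and the `build_jump_table` function are hypothetical, branch bodies are assumed to be compiled already, and an empty constructor name stands for a variable pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using code = std::vector<std::string>;

// Hypothetical simplified branch of a case expression.
struct branch {
    std::string constructor; // empty for a variable pattern
    code body;               // already-compiled instructions
};

struct jump_table {
    std::vector<code> branches;             // one entry per source branch
    std::map<int, std::size_t> tag_mapping; // tag -> index into `branches`
};

// `tags` maps each constructor of the analyzed type to its tag.
jump_table build_jump_table(const std::map<std::string, int>& tags,
                            const std::vector<branch>& alts) {
    jump_table table;
    for (const branch& alt : alts) {
        assert(!alt.body.empty()); // compiled code is never empty
        std::size_t index = table.branches.size();
        if (alt.constructor.empty()) {
            // Variable pattern: claim every tag that is still unmapped.
            // emplace leaves tags that are already mapped untouched.
            for (const auto& entry : tags)
                table.tag_mapping.emplace(entry.second, index);
        } else {
            auto found = tags.find(alt.constructor);
            if (found == tags.end())
                throw std::invalid_argument("unknown constructor " + alt.constructor);
            // Constructor pattern: mapping the same tag twice is an error.
            if (!table.tag_mapping.emplace(found->second, index).second)
                throw std::invalid_argument("duplicate pattern " + alt.constructor);
        }
        table.branches.push_back(alt.body);
    }
    // Every tag must lead somewhere, or the jump could fail at runtime.
    if (table.tag_mapping.size() != tags.size())
        throw std::invalid_argument("non-exhaustive patterns");
    return table;
}
```

For the `weird` function from earlier, the single variable branch would end up at index 0, with both the `True` and `False` tags pointing to it. The branch code is stored once, no matter how many tags reach it.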
+
+We also add a `compile` method to definitions, since they contain
+our AST nodes. The method is empty for `definition_data`, and
+looks as follows for `definition_defn`:
+
+{{< codelines "C++" "compiler/06/definition.cpp" 44 52 >}}
+
+Notice that we terminate the function with Update and Pop. This
+will turn the application node that served as the "root"
+of the application into an indirection to the value that we have computed.
+Doing so will also remove all "scratch work" from the stack.
+In essence, this is how we can lazily evaluate expressions.
+
+Finally, we make a function in our `main.cpp` file to compile
+all the definitions:
+
+{{< codelines "C++" "compiler/06/main.cpp" 45 56 >}}
+
+In this method, we also include some extra
+output to help us see the result of our compilation. Since
+at the moment, only `definition_defn` has to
+be compiled, we try to cast all definitions to it, and if
+we succeed, we print them out.
+
+Let's try it all out! For the below sample program:
+
+{{< rawblock "compiler/06/examples/works1.txt" >}}
+
+Our compiler produces the following new output:
+```
+PushInt(6)
+PushInt(320)
+PushGlobal(plus)
+MkApp()
+MkApp()
+Update(0)
+Pop(0)
+
+Push(1)
+Push(1)
+PushGlobal(plus)
+MkApp()
+MkApp()
+Update(2)
+Pop(2)
+```
+
+The first sequence of instructions is clearly `main`. It creates
+an application of `plus` to `320`, and then applies that to
+`6`, which results in `plus 320 6`, which is correct. The
+second sequence of instructions pushes the parameter that
+sits on offset 1 from the top of the stack (`y`). It then
+pushes a parameter from the same offset again, but this time,
+since `y` was previously pushed on the stack, `x` is now
+in that position, so `x` is pushed onto the stack.
+Finally, `+` is pushed, and the application
+`(+) x y` is created, which is equivalent to `x+y`.
+
+Let's also take a look at a case expression program:
+
+{{< rawblock "compiler/06/examples/works3.txt" >}}
+
+The result of the compilation is as follows:
+
+```
+Push(0)
+Eval()
+Jump(
+    Split()
+    PushInt(0)
+    Slide(0)
+
+    Split()
+    Push(1)
+    PushGlobal(length)
+    MkApp()
+    PushInt(1)
+    PushGlobal(plus)
+    MkApp()
+    MkApp()
+    Slide(2)
+
+)
+Update(1)
+Pop(1)
+```
+
+We push the first (and only) parameter onto the stack. We then make
+sure it's evaluated, and perform case analysis: if the list
+is `Nil`, we simply push the number 0 onto the stack. If it's
+a concatenation of some `x` and another list `xs`, we
+push `xs` and `length` onto the stack, make the application
+(`length xs`), push the 1, and finally apply `+` to the result.
+This all makes sense!
+
+With this, we've been able to compile our expressions and functions
+into G-machine code. We're not done, however - our computers
+aren't G-machines. We'll need to compile our G-machine code to
+__machine code__ (we will use LLVM for this), implement the
+__runtime__, and develop a __garbage collector__. We'll
+tackle the first of these in the next post - [Part 7 - Runtime]({{< relref "07_compiler_runtime.md" >}}).