From f6ca13d6dc9e73d45e442810f3f70bb62ab64ce9 Mon Sep 17 00:00:00 2001
From: Danila Fedorin
Date: Thu, 18 Jun 2020 22:29:38 -0700
Subject: [PATCH] Add more implementation content to part 12.

---
 .../blog/12_compiler_let_in_lambda/index.md   | 184 +++++++++++++++++-
 1 file changed, 180 insertions(+), 4 deletions(-)

diff --git a/content/blog/12_compiler_let_in_lambda/index.md b/content/blog/12_compiler_let_in_lambda/index.md
index 2dbcc5a..1a50fc9 100644
--- a/content/blog/12_compiler_let_in_lambda/index.md
+++ b/content/blog/12_compiler_let_in_lambda/index.md
@@ -106,6 +106,43 @@ Wait a moment, didn't we just talk about nested polymorphic definitions, and how
 This is true, but why should we perform transformations on a malformed program? Typechecking before pulling functions to the global scope will help us save the work, and breaking down one dependency-searching problem (which is \(O(n^3)\) thanks to Warshall's) into smaller, independent problems may even lead to better performance. Furthermore, typechecking before program transformations will help us come up with more helpful error messages.
 {{< /sidenote >}} and can be transformed into a sequence of instructions just like any other global function. It has been pulled from its `where` (which, by the way, is pretty much equivalent to a `let/in`) to the top level.
 
+Now, see how `addSingle` became `(addSingle n)`? If we chose to rewrite the
+program this way, we'd have to find-and-replace every instance of `addSingle`
+in the function body, which would be tedious and require us to keep
+track of shadowed variables and the like. Also, what if we used a local
+definition twice in the original piece of code? How about something like this:
+
+```Haskell {linenos=table}
+fourthPower x = square * square
+    where
+        square = x * x
+```
+
+Applying the strategy we saw above, we get:
+
+```Haskell {linenos=table}
+fourthPower x = (square x) * (square x)
+square x = x * x
+```
+
+This is valid, except that in our evaluation model, the two instances
+of `(square x)` will be built independently of one another, and thus,
+will not be shared. This, in turn, will mean that `square` will be called
+twice, which is not what we would expect from looking at the original program.
+This isn't good. Instead, why don't we keep the `where`, but modify it
+as follows:
+
+```Haskell {linenos=table}
+fourthPower x = square * square
+    where square = square' x
+square' x = x * x
+```
+
+This time, assuming we can properly implement `where`, the call to
+`square' x` should only occur once. Though I've been using `where`,
+which leads to less clutter in Haskell code, the exact same approach applies
+to `let/in`, and that's what we'll be using in our language.
+
 This technique of replacing captured variables with arguments, and pulling closures into the global scope to aid compilation, is called [Lambda Lifting](https://en.wikipedia.org/wiki/Lambda_lifting). Its name is no coincidence - lambda functions need to undergo the same kind of transformation as our nested definitions (unlike nested definitions, though, lambda functions need to be named). This is why they are included in this post together with `let/in`!
 
 ### Implementation
@@ -285,10 +322,149 @@ compiling them. The main thing that we still have to address is the addition of
 new definitions to the global scope. Let's take a look at that next.
 
 #### Global Definitions
-We want every function, regardless of whether or not it was declared in the global
-scope, to be processed and converted to LLVM code. The LLVM code conversion
-takes several steps. First, the function's AST is translated into G-machine
+We want every function (and even non-function definitions that capture surrounding
+variables), regardless of whether or not it was declared in the global scope,
+to be processed and converted to LLVM code. The LLVM code conversion takes
+several steps. First, the function's AST is translated into G-machine
 instructions, which we covered in [part 5]({{< relref "05_compiler_execution.md" >}}), by a
 process we covered in [part 6]({{< relref "06_compiler_compilation.md" >}}).
 Then, an LLVM function is created for every function, and registered globally.
-Finally, the G-machine instructions are converted
+Finally, the G-machine instructions are converted into LLVM IR, which is
+inserted into the previously created functions. These things
+can't be done in a single pass: at the very least, we can't start translating
+G-machine instructions into LLVM IR until functions are globally declared,
+because we would otherwise have no means of referencing other functions. It
+makes sense to me, then, to pull out all the 'global' definitions into
+a single top-level list (perhaps somewhere in `main.cpp`).
+
+Let's start implementing this with a new `global_scope` struct. This
+struct will contain all of the global function and constructor definitions:
+
+{{< codelines "C++" "compiler/12/global_scope.hpp" 42 55 >}}
+
+This struct will allow us to keep track of all the global definitions,
+emitting them as we go, and then coming back to them as necessary.
+There are also signs of another piece of functionality: `occurence_count`
+and `mangle_name`. These two will be used to handle duplicate names.
+
+We cannot have two global functions named the same thing, but we can
+easily imagine a situation in which two separate `let/in` expressions define
+a variable like `x`, which then needs to be lifted to the global scope. We
+resolve such conflicts by slightly changing - "mangling" - the name of
+one of the resulting global definitions. We allow the first global definition
+to be named the same as it was originally (in our example, this would be `x`).
+However, if we detect that a global definition `x` already exists (we
+track this using `occurence_count`), we rename it to `x_1`. Subsequent
+global definitions will end up being named `x_2`, `x_3`, and so on.
+
+Alright, let's take a look at `global_function` and `global_constructor`.
+Here's the former:
+
+{{< codelines "C++" "compiler/12/global_scope.hpp" 11 27 >}}
+
+There's nothing really surprising here: all of the fields
+are reminiscent of `definition_defn`, though some type-related variables
+are missing. We also include the three compilation-related methods,
+`compile`, `declare_llvm`, and `generate_llvm`, which were previously in `definition_defn`. Let's look at `global_constructor` now:
+
+{{< codelines "C++" "compiler/12/global_scope.hpp" 29 40 >}}
+
+This maps pretty closely to a single `definition_data::constructor`.
+There's a difference here that is not clear at a glance, though. Whereas
+the `name` in a `definition_defn` or `definition_data` refers to the
+name as given by the user in the code, the `name` of a `global_function`
+or `global_constructor` has gone through mangling, and thus, should be
+unique.
+
+Let's now look at the implementation of these structs' methods. The methods
+`add_function` and `add_constructor` are pretty straightforward. Here's
+the former:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 39 43 >}}
+
+And here's the latter:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 45 49 >}}
+
+In both of these functions, we return a reference to the new global
+definition we created. This helps us access the mangled `name` field,
+and, in the case of `global_function`, inspect the `ast_ptr` that represents
+its body.
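
To make the renaming scheme described above concrete (`x`, then `x_1`, `x_2`, and so on), here is a minimal sketch of how such a counter could work. It is not the code from `global_scope.cpp`: the `name_mangler` struct is a hypothetical stand-in that only borrows the `occurence_count` and `mangle_name` names from the text, and it assumes the counter is a plain map from the original name to the number of times that name has been seen.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the name-mangling part of `global_scope`;
// the real struct also stores the global functions and constructors.
struct name_mangler {
    // Number of times each original (user-written) name has been seen.
    std::map<std::string, int> occurence_count;

    // The first request for a name returns it unchanged; later requests
    // for the same name return name_1, name_2, and so on.
    std::string mangle_name(const std::string& name) {
        // operator[] value-initializes a missing entry to 0, so `seen`
        // is 0 the first time a name is requested.
        int seen = occurence_count[name]++;
        if (seen == 0) return name;
        return name + "_" + std::to_string(seen);
    }
};
```

With this sketch, three calls to `mangle_name("x")` would yield `x`, `x_1`, and `x_2`, while a first call with `y` would still yield `y`. Note that the sketch makes no attempt to handle a user who writes a definition literally named `x_1`; whether that case needs guarding depends on which identifiers the language accepts.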
+
+Next, we have `global_scope::compile` and `global_scope::generate_llvm`,
+which encapsulate these operations on all global definitions. Their
+implementations are very straightforward, and are similar to the
+`gen_llvm` function we used to have in our `main.cpp`:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 51 67 >}}
+
+Finally, we have `mangle_name`, which takes care of potentially duplicate
+variable names:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 69 83 >}}
+
+Let's move on to the global definition structs.
+The `compile`, `declare_llvm`, and `generate_llvm` methods for
+`global_function` are pretty much the same as those that we used to have
+in `definition_defn`:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 4 24 >}}
+
+The same is true for `global_constructor` and its method `generate_llvm`:
+
+{{< codelines "C++" "compiler/12/global_scope.cpp" 26 37 >}}
+
+Recall that in this case, we need not have two methods for declaring
+and generating LLVM, since constructors don't reference other constructors,
+and are always generated before any function definitions.
+
+#### Translation
+While collecting all of the definitions into a global list, we can
+also do some program transformations. Let's return to our earlier example:
+
+```Haskell {linenos=table}
+fourthPower x = square * square
+    where
+        square = x * x
+```
+
+We said it should be translated into something like:
+
+```Haskell {linenos=table}
+fourthPower x = square * square
+    where square = square' x
+square' x = x * x
+```
+
+In our language, the original program above would be:
+
+```text {linenos=table}
+defn fourthPower x = {
+    let {
+        defn square = { x * x }
+    } in {
+        square * square
+    }
+}
+```
+
+And the translated version would be:
+
+```text {linenos=table}
+defn fourthPower x = {
+    let {
+        defn square = { square' x }
+    } in {
+        square * square
+    }
+}
+defn square' x = { x * x }
+```
+
+Setting aside for the moment the naming of `square'` and `square`, we observe
+that we want to perform the following operations:
+
+1. Move the body of the original definition of `square` into its own
+global definition, adding all the captured variables as arguments.
+2. Replace the right-hand side of the `let/in` expression with an application
+of the global definition to the variables it requires.
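
The two operations above can be sketched on a toy expression tree. Everything in this block is hypothetical and exists only for illustration: `expr`, `lifted`, `collect_free`, `lift`, and `show` are not the compiler's actual AST or API. The toy tree also has no binders (no `let/in`, lambda, or `case`), so the free-variable search here is far simpler than what a real implementation would need.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature AST: variables, binary operators, applications.
struct expr;
using expr_ptr = std::shared_ptr<expr>;

struct expr {
    enum kind_t { VAR, BINOP, APP } kind;
    std::string name;      // VAR: variable name; BINOP: operator symbol
    expr_ptr left, right;  // BINOP: operands; APP: function and argument
};

expr_ptr make(expr::kind_t k, std::string n, expr_ptr l, expr_ptr r) {
    return std::make_shared<expr>(expr{k, std::move(n), std::move(l), std::move(r)});
}
expr_ptr var(std::string n) { return make(expr::VAR, std::move(n), nullptr, nullptr); }
expr_ptr binop(std::string op, expr_ptr l, expr_ptr r) {
    return make(expr::BINOP, std::move(op), std::move(l), std::move(r));
}
expr_ptr app(expr_ptr f, expr_ptr x) { return make(expr::APP, "", std::move(f), std::move(x)); }

// Collect every variable that is not a known global. Since the toy AST
// has no binders, any such variable is captured from the enclosing scope.
void collect_free(const expr_ptr& e, const std::set<std::string>& globals,
                  std::set<std::string>& out) {
    if (!e) return;
    if (e->kind == expr::VAR) {
        if (!globals.count(e->name)) out.insert(e->name);
        return;
    }
    collect_free(e->left, globals, out);
    collect_free(e->right, globals, out);
}

struct lifted {
    std::string global_name;          // name of the new global definition
    std::vector<std::string> params;  // captured variables, now parameters
    expr_ptr global_body;             // step 1: the original body, moved out
    expr_ptr replacement;             // step 2: global applied to captures
};

lifted lift(const std::string& global_name, const expr_ptr& body,
            const std::set<std::string>& globals) {
    std::set<std::string> captured;
    collect_free(body, globals, captured);

    lifted result{global_name, {}, body, var(global_name)};
    // std::set iterates in sorted order, so the parameter list and the
    // argument list are guaranteed to agree with each other.
    for (const auto& name : captured) {
        result.params.push_back(name);
        result.replacement = app(result.replacement, var(name));
    }
    return result;
}

// Render an expression as text; performs no parenthesization.
std::string show(const expr_ptr& e) {
    switch (e->kind) {
        case expr::VAR: return e->name;
        case expr::BINOP: return show(e->left) + " " + e->name + " " + show(e->right);
        case expr::APP: return show(e->left) + " " + show(e->right);
    }
    return "";
}
```

Lifting the body `x * x` under the name `square'` with no known globals gives a parameter list containing only `x`, leaves the global body as `x * x`, and produces the replacement `square' x`, which matches the translated program shown earlier. Any consistent ordering of the captured variables would work; sorted order is used here only because it falls out of `std::set`.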