From 21851e3a9c552383ee8c4bc878ea06e7d28c333e Mon Sep 17 00:00:00 2001
From: Danila Fedorin
Date: Fri, 19 Jun 2020 02:22:08 -0700
Subject: [PATCH] Add more content to part 12.

---
 .../blog/12_compiler_let_in_lambda/index.md | 172 ++++++++++++++++++
 1 file changed, 172 insertions(+)

diff --git a/content/blog/12_compiler_let_in_lambda/index.md b/content/blog/12_compiler_let_in_lambda/index.md
index 1a50fc9..72f3218 100644
--- a/content/blog/12_compiler_let_in_lambda/index.md
+++ b/content/blog/12_compiler_let_in_lambda/index.md
@@ -418,6 +418,178 @@ Recall that in this case, we need not have two methods for declaring and
 generating LLVM, since constructors don't reference other constructors,
 and are always generated before any function definitions.

+#### Visibility
+Should we really be turning _all_ free variables in a function definition
+into arguments? Consider the following piece of Haskell code:
+
+```Haskell {linenos=table}
+add x y = x + y
+mul x y = x * y
+something = mul (add 1 3) 3
+```
+
+In the definition of `something`, `mul` and `add` occur free.
+A very naive lifting algorithm might be tempted to rewrite such a program
+as follows:
+
+```Haskell {linenos=table}
+add x y = x + y
+mul x y = x * y
+something' add mul = mul (add 1 3) 3
+something = something' add mul
+```
+
+But that's absurd! Not only are `add` and `mul` available globally,
+but such a rewrite generates another definition with free variables,
+which means we didn't really improve our program in any way. From this
+example, we can see that we don't want to be turning references to global
+variables into function parameters. But how can we tell if a variable
+we're trying to operate on is global or not? I propose a flag in our
+`type_env`, which we'll augment to be used as a symbol table.
+To do this, we update the implementation of `type_env` to map variables to
+values of a struct `variable_data`:
+
+{{< codelines "C++" "compiler/12/type_env.hpp" 13 22 >}}
+
+The `visibility` enum is defined as follows:
+
+{{< codelines "C++" "compiler/12/type_env.hpp" 10 10 >}}
+
+As you can see from the `variable_data` snippet above, we also added a
+`mangled_name` field to the new struct. We will be using this field shortly. We
+also add a few methods to our `type_env`, and end up with the following:
+
+{{< codelines "C++" "compiler/12/type_env.hpp" 31 44 >}}
+
+We will come back to `find_free` and `find_free_except`, as well as
+`set_mangled_name` and `get_mangled_name`. For now, we just adjust `bind` to
+take a visibility parameter that defaults to `local`, and implement
+`is_global`:
+
+{{< codelines "C++" "compiler/12/type_env.cpp" 27 32 >}}
+
+Remember the `visibility::global` in `parser.y`? This is where that comes in.
+Specifically, we recall that `definition_defn::insert_types` is responsible
+for placing function types into the environment, making them accessible
+during typechecking later. At this time, we already need to know whether
+the definitions are global or local (so that we can create the binding).
+Thus, we add `visibility` as a parameter to `insert_types`:
+
+{{< codelines "C++" "compiler/12/definition.hpp" 44 44 >}}
+
+Since we are now moving from manually wrangling definitions towards using
+`definition_group`, we make it so that the group itself provides this
+argument. To do this, we add the `visibility` field from before to it,
+and set it in the parser. One more thing: since constructors never
+capture variables, we can always move them straight to the global
+scope, and thus, we'll always mark them with `visibility::global`.
+
+#### Managing Mangled Names
+Just mangling names is not enough.
+Consider the following program:
+
+```text {linenos=table}
+defn packOne x = {
+    let {
+        data Packed a = { Pack a }
+    } in {
+        Pack x
+    }
+}
+defn packTwo x = {
+    let {
+        data Packed a = { Pack a }
+    } in {
+        Pack x
+    }
+}
+```
+
+{{< sidenote "right" "lifting-types-note" "Lifting the data type declarations" >}}
+We are actually not quite doing something like the following snippet.
+The reason for this is that we don't mangle the names for types. I pointed
+out this potential issue in a sidenote in the previous post. Since the size
+of this post is already ballooning, I will not deal with this issue here.
+Even at the end of this post, our compiler will not be able to distinguish
+between the two Packed types. We will hopefully get to it later.
+{{< /sidenote >}} and their constructors into the global
+scope gives us something like:
+
+``` {linenos=table}
+data Packed a = { Pack a }
+data Packed_1 a = { Pack_1 a }
+defn packOne x = { Pack x }
+defn packTwo x = { Pack_1 x }
+```
+
+Notice that we had to rename one of the calls to `Pack` to be a call to
+`Pack_1`. To actually change our AST to reference `Pack_1`, we'd have
+to traverse the whole tree, and make sure to keep track of definitions
+that could shadow `Pack` further down. This is cumbersome. Instead, we
+can mark a variable as referring to a mangled version of itself, and
+access this information when needed. To do this, we add the `mangled_name`
+field to the `variable_data` struct as we've seen above, and implement
+the `set_mangled_name` and `get_mangled_name` methods. The former:
+
+{{< codelines "C++" "compiler/12/type_env.cpp" 34 37 >}}
+
+And the latter:
+
+{{< codelines "C++" "compiler/12/type_env.cpp" 39 45 >}}
+
+We don't allow `set_mangled_name` to affect variables that are declared
+above the receiving `type_env`, and use the empty string as a 'none' value.
+Now, when lifting data type constructors, we'll be able to use
+`set_mangled_name` to make sure constructor calls are made correctly. We
+will also be able to use this in other cases, like the translation
+of local function definitions.
+
+#### New AST Nodes
+Finally, it's time for us to add new AST nodes to our language.
+Specifically, these nodes are `ast_let` (for `let/in` expressions)
+and `ast_lambda` (for lambda functions). We declare them as follows:
+
+{{< codelines "C++" "compiler/12/ast.hpp" 131 166 >}}
+
+In `ast_let`, the `definitions` field corresponds to the original definitions
+given by the user in the program, and the `in` field corresponds to the
+expression which uses these definitions. In the process of lifting, though,
+we eventually transfer each of the definitions to the global scope, replacing
+their right-hand sides with partial applications. After this transformation,
+all the data type definitions are effectively gone, and all the function
+definitions are converted into the simple form `x = f a1 ... an`. We hold
+these post-transformation equations in the `translated_definitions` field,
+and these are what we compile in this node's `compile` method.
+
+In `ast_lambda`, we allow multiple parameters (like Haskell's `\x y -> x + y`).
+We store these parameters in the `params` field, and we store the lambda's
+expression in the `body` field. Just like `definition_defn`,
+the `ast_lambda` node maintains a separate environment in which its children
+have been bound, and a list of variables that occur freely in its body. The
+former is used for typechecking, while the latter is used for lifting.
+Finally, the `translated` field holds the lambda function's form
+after its body has been transformed into a global function. Similarly to
+`ast_let`, this node will be in the form `f a1 ... an`.
+
+The observant reader will have noticed that we have a new method: `translate`.
+This is a new method for all `ast` descendants, and will implement the
+steps of moving definitions to the global scope and transforming the
+program. Before we get to it, though, let's quickly see the parsing
+rules for `ast_let` and `ast_lambda`:
+
+{{< codelines "text" "compiler/12/parser.y" 107 115 >}}
+
+This is pretty similar to the rest of the grammar, so I will give this no
+further explanation.
+
+{{< todo >}}
+Explain typechecking for lambda functions and let/in expressions.
+{{< /todo >}}
+
+{{< todo >}}
+Explain free variable detection for lambda functions and let/in expressions.
+{{< /todo >}}
+
 #### Translation
 While collecting all of the definitions into a global list, we can also
 do some program transformations. Let's return to our earlier example: