
---
title: "Compiling a Functional Language Using C++, Part 12 - Let/In and Lambdas"
date: 2020-04-20T20:15:16-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
description: "In this post, we extend our language with let/in expressions and lambda functions."
draft: true
---

Now that our language's type system is more fleshed out and pleasant to use, it's time to shift our focus to the ergonomics of the language itself. I've been mentioning let/in expressions and lambda expressions for a while now. The former will let us create names for expressions that are limited to a certain scope (without having to create global variable bindings), while the latter will allow us to create functions without giving them any name at all.

Let's take a look at let/in expressions first, to make sure we're all on the same page about what it is we're trying to implement. Let's start with some rather basic examples, and then move on to more complex examples. The most basic use of a let/in expression is, in Haskell:

```Haskell
let x = 5 in x + x
```

In the above example, we bind the variable x to the value 5, and then refer to x twice in the expression after the in. The whole snippet is one expression, evaluating to what the in part evaluates to. Additionally, the variable x does not escape the expression - {{< sidenote "right" "used-note" "it cannot be used anywhere else." >}} Unless, of course, you bind it elsewhere; naturally, using x here does not forbid you from re-using the variable. {{< /sidenote >}}
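For readers more at home in C++, here is a rough analogy (purely illustrative; the function name is made up): a let/in expression behaves like a block-scoped binding whose enclosing construct evaluates to a value.

```cpp
#include <cassert>

// Rough C++ analogy for `let x = 5 in x + x`: the binding is scoped,
// and the whole construct evaluates to the body's result.
int let_in_example() {
    int x = 5;    // let x = 5
    return x + x; // in x + x
}
// As with let/in, `x` is not visible outside the scope that binds it.
```

The analogy is loose, since a let/in is an expression rather than a statement, but the scoping behavior is the same.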

Now, consider a slightly more complicated example:

```Haskell
let sum xs = foldl (+) 0 xs in sum [1,2,3]
```

Here, we're defining a function sum, {{< sidenote "right" "eta-note" "which takes a single argument:" >}} Those who favor the point-free programming style may be slightly twitching right now, the words eta reduction swirling in their mind. What do you know, fold-based sum is even one of the examples on the Wikipedia page! I assure you, I left the code as you see it deliberately, to demonstrate a principle. {{< /sidenote >}} the list to be summed. We will want this to be valid in our language, as well. We will soon see how this particular feature is related to lambda functions, and why I'm covering these two features in the same post.
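Since the compiler in this series is written in C++, a hedged C++ analogue of the snippet above may help: `foldl (+) 0` corresponds closely to `std::accumulate` with an initial value of 0 (the name `sum_list` is invented for illustration).

```cpp
#include <numeric>
#include <vector>

// C++ analogue of `sum xs = foldl (+) 0 xs`: a left fold over the
// list, starting from the accumulator value 0.
int sum_list(const std::vector<int>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0);
}
```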

Let's step up the difficulty a bit more, with an example that, {{< sidenote "left" "translate-note" "though it does not immediately translate to our language," >}} The part that doesn't translate well is the whole deal with patterns in function arguments, as well as the notion of having more than one equation for a single function, as is the case with safeTail.

It's not that these things are impossible to translate; it's just that translating them may be worthy of a post in and of itself, and would only serve to bloat and complicate this part. What can be implemented with pattern arguments can just as well be implemented using regular case expressions; I dare say most "big" functional languages actually just convert from the former to the latter as part of the compilation process. {{< /sidenote >}} illustrates another important principle:

```Haskell
let safeTail [] = Nothing
    safeTail [x] = Just x
    safeTail (_:xs) = safeTail xs
    myTail = safeTail [1,2,3,4]
in
    myTail
```

The principle here is that definitions in let/in can be recursive and polymorphic. Remember the note in [part 10]({{< relref "10_compiler_polymorphism.md" >}}) about let-polymorphism? This is it: we're allowing polymorphic variable bindings, but only when they're bound in a let/in expression (or at the top level).
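As a loose C++ analogy for let-polymorphism (not the compiled language itself), a generic lambda bound to a name can be applied at several different types, much like a binding generalized by a let/in expression. The helper name below is made up:

```cpp
#include <string>
#include <utility>

// A generic lambda is (roughly) like `let id' x = x in ...`:
// one binding, usable at several different types.
inline std::pair<int, std::string> use_identity_twice() {
    auto identity = [](auto x) { return x; };               // the "let" binding
    return { identity(5), identity(std::string("five")) };  // two instantiations
}
```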

The principles demonstrated by the last two snippets mean that compiling let/in expressions, at least with the power we want to give them, will require the same kind of dependency analysis we had to go through when we implemented polymorphically typed functions. That is, we will need to analyze which functions call which other functions, and typecheck the callees before the callers. We will continue to represent callee-caller relationships using a dependency graph, in which nodes represent functions, and an edge from one function node to another means that the former function calls the latter. Below is an image of one such graph:

{{< figure src="fig_graph.png" caption="Example dependency graph without let/in expressions." >}}

Since we want to typecheck callees first, we effectively want to traverse the graph in reverse topological order. However, there's a slight issue: a topological order is only defined for acyclic graphs, and it is very possible for functions in our language to mutually call each other. To deal with this, we have to find groups of mutually recursive functions and treat them as a single unit, thereby eliminating cycles. In the above graph, there are two groups, as follows:

{{< figure src="fig_colored_ordered.png" caption="Previous dependency graph with mutually recursive groups highlighted." >}}

As seen in the second image, according to the reverse topological order of the given graph, we will typecheck the blue group containing three functions first, since the sole function in the orange group calls one of the blue functions.
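To make the ordering concrete, here is a small, self-contained sketch of this dependency analysis. It is not the series' actual code; the names `dependency_graph` and `compute_order` are invented. Tarjan's algorithm is convenient here because it emits strongly connected components, i.e. groups of mutually recursive functions, in exactly the callees-first order we want:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Sketch of the dependency analysis described above. Nodes are function
// names; an edge means the first function calls the second.
struct dependency_graph {
    std::map<std::string, std::set<std::string>> edges;

    void add_edge(const std::string& from, const std::string& to) {
        edges[from].insert(to);
        edges.try_emplace(to); // ensure the callee exists as a node, too
    }

    // Tarjan's algorithm emits each strongly connected component (a group
    // of mutually recursive functions) only after every group it calls
    // into, i.e. in the reverse topological order we need.
    std::vector<std::vector<std::string>> compute_order() {
        std::map<std::string, int> index, lowlink;
        std::set<std::string> on_stack;
        std::vector<std::string> stack;
        std::vector<std::vector<std::string>> groups;
        int next_index = 0;

        std::function<void(const std::string&)> strong_connect =
            [&](const std::string& v) {
                index[v] = lowlink[v] = next_index++;
                stack.push_back(v);
                on_stack.insert(v);
                for (const auto& w : edges[v]) {
                    if (!index.count(w)) {
                        strong_connect(w);
                        lowlink[v] = std::min(lowlink[v], lowlink[w]);
                    } else if (on_stack.count(w)) {
                        lowlink[v] = std::min(lowlink[v], index[w]);
                    }
                }
                if (lowlink[v] == index[v]) { // v roots a group; pop it off
                    std::vector<std::string> group;
                    std::string w;
                    do {
                        w = stack.back();
                        stack.pop_back();
                        on_stack.erase(w);
                        group.push_back(w);
                    } while (w != v);
                    groups.push_back(group);
                }
            };

        for (const auto& node : edges)
            if (!index.count(node.first)) strong_connect(node.first);
        return groups;
    }
};
```

Typechecking the resulting groups in the order they are produced guarantees that a group's callees have already been typechecked (and generalized) by the time the group itself is processed.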

Things are more complicated now that let/in expressions are able to introduce their own, polymorphic and recursive declarations. However, there is a single invariant we can establish: function definitions can only depend on functions defined at the same time as them. That is, for our purposes, functions declared in the global scope can only depend on other functions declared in the global scope, and functions declared in a let/in expression can only depend on other functions declared in that same expression. That's not to say that a function declared in a let/in block inside some function f can't call another globally declared function g - rather, we allow this, but treat the situation as though f depends on g. In contrast, it's not at all possible for a global function to depend on a local function, because bindings created in a let/in expression do not escape the expression itself. This invariant tells us that in the presence of nested function definitions, the situation looks like this:

{{< figure src="fig_subgraphs.png" caption="Previous dependency graph augmented with let/in subgraphs." >}}

In the above image, some of the original nodes in our graph now contain other, smaller graphs. Those subgraphs are the graphs created by function declarations in let/in expressions. Just like our top-level nodes, the nodes of these smaller graphs can depend on other nodes, and even form cycles. Within each subgraph, we will have to perform the same kind of cycle detection, resulting in something like this:

{{< figure src="fig_subgraphs_colored_all.png" caption="Augmented dependency graph with mutually recursive groups highlighted." >}}

When typechecking a function, we must be ready to perform dependency analysis at any point. What's more, the free variable analysis we used to perform must now be extended to differentiate between free variables that refer to "nearby" definitions (i.e. within the same let/in expression) and "far away" definitions (i.e. outside of the let/in expression). And speaking of free variables...

What do we do about variables that are captured by a local definition? Consider the following snippet:

```Haskell
addToAll n xs = map addSingle xs
    where
        addSingle x = n + x
```

In the code above, the variable n, bound on line 1, is used by addSingle on line 3. When a function refers to variables bound outside of itself (as addSingle does), it is said to be capturing these variables, and the function is called a closure. Why does this matter? On the machine level, functions are represented as sequences of instructions, and there's a finite number of them (as there is finite space on the machine). But there is an infinite number of addSingle functions! When we write addToAll 5 [1,2,3], addSingle becomes 5+x. When, on the other hand, we write addToAll 6 [1,2,3], addSingle becomes 6+x. There are certain ways to work around this - we could, for instance, dynamically create machine code in memory, and then execute it (this is called just-in-time compilation). We would end up with a collection of runtime-defined functions that can be represented as follows:

```Haskell
-- Version of addSingle when n = 5
addSingle5 x = 5 + x

-- Version of addSingle when n = 6
addSingle6 x = 6 + x

-- ... and so on ...
```

But now, we end up creating several functions with almost identical bodies, with the exception of the free variables themselves. Wouldn't it be better to perform the well-known strategy of reducing code duplication by factoring out parameters, and leaving only one instance of the repeated code? We would end up with:

```Haskell
addToAll n xs = map (addSingle n) xs
addSingle n x = n + x
```

Observe that we no longer have the "infinite" number of functions - the infinitude of possible behaviors is created via currying. Also note that addSingle {{< sidenote "right" "global-note" "is now declared at the global scope," >}} Wait a moment, didn't we just talk about nested polymorphic definitions, and how they change our typechecking model? If we transform our program into a bunch of global definitions, we don't need to make adjustments to our typechecking.

This is true, but why should we perform transformations on a malformed program? Typechecking before pulling functions into the global scope spares us that work, and breaking down one dependency-searching problem (which is O(n^3) thanks to Warshall's algorithm) into smaller, independent problems may even lead to better performance. Furthermore, typechecking before program transformations helps us produce more helpful error messages. {{< /sidenote >}} and can be transformed into a sequence of instructions just like any other global function. It has been pulled from its where (which, by the way, is pretty much equivalent to a let/in) to the top level.

This technique of replacing captured variables with arguments, and pulling closures into the global scope to aid compilation, is called Lambda Lifting. Its name is no coincidence - lambda functions need to undergo the same kind of transformation as our nested definitions (unlike nested definitions, though, lambda functions need to be named). This is why they are included in this post together with let/in!
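To ground the idea in C++ terms (an analogy, not the compiler's own code): a capturing lambda plays the role of the closure, and lifting turns it into a single global function whose formerly captured variable is an explicit parameter. The function names below are invented for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Before lifting: add_single is a closure capturing n, so each value
// of n conceptually yields a different function.
std::vector<int> add_to_all_closure(int n, std::vector<int> xs) {
    auto add_single = [n](int x) { return n + x; }; // captures n
    std::transform(xs.begin(), xs.end(), xs.begin(), add_single);
    return xs;
}

// After lifting: one global function, with the captured variable as an
// explicit first parameter.
int add_single_lifted(int n, int x) { return n + x; }

std::vector<int> add_to_all_lifted(int n, std::vector<int> xs) {
    // In the functional language, this is the curried `addSingle n`;
    // C++ has no currying, so we rebuild it with a trivial wrapper.
    auto partial = [n](int x) { return add_single_lifted(n, x); };
    std::transform(xs.begin(), xs.end(), xs.begin(), partial);
    return xs;
}
```

Both versions compute the same result; the difference is that the lifted version's logic lives in a single, globally compilable function.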

Implementation

Now that we understand what we have to do, it's time to jump straight into doing it. First, we need to refactor our current code to allow for the changes we're going to make; then, we can implement let/in expressions; finally, we'll tackle lambda functions.

Infrastructure Changes

When finding captured variables, the notion of free variables once again becomes important. Recall that a free variable in an expression is a variable that is defined outside of that expression. Consider, for example, the expression:

```Haskell
let x = 5 in x + y
```

In this expression, x is not a free variable, since it's defined in the let/in expression. On the other hand, y is a free variable, since it's not defined locally.

The algorithm that we used for computing free variables was rather biased. Previously, we only cared about the difference between a local variable (defined somewhere in a function's body, or referring to one of the function's parameters) and a global variable (referring to a function name). This shows in our code for find_free. Consider, for example, this segment:

{{< codelines "C++" "compiler/11/ast.cpp" 33 36 >}}

We created bindings in our type environment whenever we saw a new variable being introduced, which led us to only count variables that we did not bind anywhere as 'free'. This approach is no longer sufficient. Consider, for example, the following Haskell code:

```Haskell
someFunction x =
    let
        y = x + 5
    in
        x*y
```

We can see that the variable x is introduced on line 1. Thus, our current algorithm will happily store x in an environment, and not count it as free. But clearly, the definition of y on line 3 captures x! If we were to lift y into global scope, we would need to pass x to it as an argument. To fix this, we have to separate the creation and assignment of type environments from free variable detection. Why don't we start with ast and its descendants? Our signatures become:

```C++
void ast::find_free(std::set<std::string>& into);
type_ptr ast::typecheck(type_mgr& mgr, type_env_ptr& env);
```

For the most part, the code remains unchanged. We avoid using env (and this->env), and default to marking any variable as a free variable:

{{< codelines "C++" "compiler/12/ast.cpp" 39 41 >}}

Since we no longer use the environment, we resort to an alternative method of removing bound variables. Here's ast_case::find_free:

{{< codelines "C++" "compiler/12/ast.cpp" 169 181 >}}

For each branch, we find the free variables. However, we want to avoid marking variables that were introduced through pattern matching as free (they are not). Thus, we use pattern::find_variables to see which of the variables were bound by that pattern, and remove them from the list of free variables. We can then safely add the list of free variables in the pattern to the overall list of free variables. Other ast descendants experience largely cosmetic changes (such as the removal of the env parameter).
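The same "mark everything free, then erase what's bound" strategy can be shown on a toy AST. This is an illustrative sketch, not the compiler's actual classes; `expr_var`, `expr_add`, and `expr_let` are made-up names, and the let here is non-recursive for brevity:

```cpp
#include <memory>
#include <set>
#include <string>

// Toy sketch of the strategy described above: a variable reference marks
// itself free by default, and each binder erases the names it binds from
// its body's free set.
struct expr {
    virtual ~expr() = default;
    virtual void find_free(std::set<std::string>& into) const = 0;
};
using expr_ptr = std::shared_ptr<expr>;

struct expr_var : expr {
    std::string name;
    explicit expr_var(std::string n) : name(std::move(n)) {}
    void find_free(std::set<std::string>& into) const override {
        into.insert(name); // by default, every variable counts as free
    }
};

struct expr_add : expr {
    expr_ptr lhs, rhs;
    expr_add(expr_ptr l, expr_ptr r) : lhs(std::move(l)), rhs(std::move(r)) {}
    void find_free(std::set<std::string>& into) const override {
        lhs->find_free(into);
        rhs->find_free(into);
    }
};

// A non-recursive `let name = value in body`, for simplicity.
struct expr_let : expr {
    std::string name;
    expr_ptr value, body;
    expr_let(std::string n, expr_ptr v, expr_ptr b)
        : name(std::move(n)), value(std::move(v)), body(std::move(b)) {}
    void find_free(std::set<std::string>& into) const override {
        value->find_free(into);
        std::set<std::string> body_free;
        body->find_free(body_free);
        body_free.erase(name); // the bound name is not free in the body
        into.insert(body_free.begin(), body_free.end());
    }
};
```

On `let x = y in x + z`, this correctly reports `y` and `z` as free while excluding the bound `x`, mirroring how `ast_case` erases pattern-bound variables.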

Of course, we must implement find_variables for each of our pattern subclasses. Here's what I got for pattern_var:

{{< codelines "C++" "compiler/12/ast.cpp" 402 404 >}}

And here's an equally terse implementation for pattern_constr:

{{< codelines "C++" "compiler/12/ast.cpp" 417 419 >}}

We also want to update definition_defn with this change. Our signatures become:

```C++
void definition_defn::find_free();
void definition_defn::insert_types(type_mgr& mgr, type_env_ptr& env, visibility v);
```

We'll get to the visibility parameter later. The implementations are fairly simple. Just like ast_case, we want to erase each function's parameters from its list of free variables:

{{< codelines "C++" "compiler/12/definition.cpp" 13 18 >}}

Since find_free no longer creates any type bindings or environments, this functionality is shouldered by insert_types:

{{< codelines "C++" "compiler/12/definition.cpp" 20 32 >}}

Now that free variables are properly computed, we are able to move on to bigger and better things.

Nested Definitions

At present, our code for typechecking the whole program is located in main.cpp:

{{< codelines "C++" "compiler/11/main.cpp" 43 61 >}}

This piece of code goes on for a while, and we now want to make it more general. Soon, let/in expressions will bring with them definitions that are nested inside other definitions, and thus not reachable from the top level. The fundamental topological sorting algorithm, though, will remain the same. We can abstract a series of definitions that need to be ordered and then typechecked into a new struct, definition_group:

{{< codelines "C++" "compiler/12/definition.hpp" 73 83 >}}

This will be exactly like a list of defn/data definitions we have at the top level, except now, it can also occur in other places, like let/in expressions. Once again, ignore for the moment the visibility field.

The way we defined function ordering requires some extra work from definition_group. Recall that conceptually, functions can only depend on other functions defined in the same let/in expression, or, more generally, in the same definition_group. This means that we now classify free variables in definitions into two categories: free variables that refer to "nearby" definitions (i.e. definitions in the same group) and free variables that refer to "far away" definitions. The "nearby" variables will be used to do topological ordering, while the "far away" variables can be passed along further up, perhaps into an enclosing let/in expression (for which "nearby" variables aren't actually free, since they are bound in the let). We implement this partitioning of variables in definition_group::find_free:

{{< codelines "C++" "compiler/12/definition.cpp" 94 105 >}}

Notice that we have added a new nearby_variables field to definition_defn. This is used on line 101, and will be once again used in definition_group::typecheck. Speaking of typecheck, let's look at its definition:

{{< codelines "C++" "compiler/12/definition.cpp" 107 145 >}}

This function is a little long, but conceptually, each for loop contains a step of the process:

  • The first loop declares all data types, so that constructors can be verified to properly reference them.
  • The second loop creates all the data type constructors.
  • The third loop adds edges to our dependency graph.
  • The fourth loop performs typechecking on the now-ordered groups of mutually recursive functions.
    • The first inner loop inserts the types of all the functions into the environment.
    • The second inner loop actually performs typechecking.
    • The third inner loop makes as many things polymorphic as possible.

We can now adjust our parser.y to use a definition_group instead of two global vectors. First, we declare a global definition_group:

{{< codelines "C++" "compiler/12/parser.y" 10 10 >}}

Then, we adjust definitions to create definition_groups:

{{< codelines "text" "compiler/12/parser.y" 59 68 >}}

We can now adjust main.cpp to use the global definition_group. Among other changes (such as removing extern references to vectors, and updating function signatures) we also update the typecheck_program function:

{{< codelines "C++" "compiler/12/main.cpp" 41 49 >}}

Now, our code is ready for typechecking nested definitions, but not for compiling them. The main thing that we still have to address is the addition of new definitions to the global scope. Let's take a look at that next.

Global Definitions

We want every function, regardless of whether or not it was declared in the global scope, to be processed and converted to LLVM code. The LLVM code conversion takes several steps. First, the function's AST is translated into G-machine instructions, which we covered in [part 5]({{< relref "05_compiler_execution.md" >}}), by a process we covered in [part 6]({{< relref "06_compiler_compilation.md" >}}). Then, an LLVM function is created for every function, and registered globally. Finally, the G-machine instructions are converted