--- title: Compiling a Functional Language Using C++, Part 9 - Garbage Collection date: 2020-01-27T20:35:15-08:00 tags: ["C and C++", "Functional Languages", "Compilers"] draft: true --- > "When will you learn? When will you learn that __your actions have consequences?__" So far, we've entirely ignored the problem of memory management. Every time that we need a new node for our growing graph, we simply ask for more memory from the runtime with `malloc`. But selfishly, even when we no longer require the memory allocated for a particular node, when that node is no longer in use, we do not `free` it. In fact, our runtime currently has no idea about which nodes are needed and which ones are ready to be discarded. To convince ourselves that this is a problem, let's first assess the extent of the damage. Consider the program from `works3.txt`: {{< rawblock "compiler/09/examples/works3.txt" >}} Compiling and running this program through `valgrind`, we get the following output: ``` ==XXXX== LEAK SUMMARY: ==XXXX== definitely lost: 288 bytes in 12 blocks ==XXXX== indirectly lost: 768 bytes in 34 blocks ==XXXX== possibly lost: 0 bytes in 0 blocks ==XXXX== still reachable: 0 bytes in 0 blocks ==XXXX== suppressed: 0 bytes in 0 blocks ``` We lost 1056 bytes of memory, just to return the length of a list with 3 elements. The problem of leaking memory is very real. How do we solve this issue? We can't embed memory management into our language; We want to keep it pure, and managing memory is typically pretty far from that goal. Instead, we will make our runtime do the work of freeing memory. Even then, this is a nontrivial goal: our runtime manipulates graphs, each of which can be combined with others in arbitrary ways. In general, there will not always be a _single_ node that, when freed, will guarantee that another node can be freed as well. Instead, it's very possible in our graphs that two parent nodes both refer to a third, and only when both parents are freed can we free that third node itself. Consider, for instance, the function `square` as follows: ``` defn square x = { x * x } ``` This function will receive, on top of the stack, a single graph representing `x`. It will then create two applications of a global `(+)` function, each time to the graph of `x`. Thus, it will construct a tree with two `App` nodes, both of which {{< sidenote "right" "lazy-note" "must keep track of a reference to x.">}} We later take advantage of this, by replacing the graph of x with the result of evaluating it. Since both App nodes point to the same graph, when we evaluate it once, each node observes this update, and is not required to evaluate x again. With this, we achieve lazy evaluation. {{< /sidenote >}} The runtime will have to wait until both `App` nodes are freed before it can free the graph of `x`. This seems simple enough! If there are multiple things that may reference a node in the graph, why don't we just keep track of how many there are? Once we know that no more things are still referencing a node, we can free it. This is called [reference counting](https://en.wikipedia.org/wiki/Reference_counting). Reference counting is a valid technique, but unfortunately, it will not suit us. The reason for this is that our language may produce [cyclic graphs](https://en.wikipedia.org/wiki/Cycle_(graph_theory)). Consider, for example, this definition of an infinite list of the number 1: ``` defn ones = { Cons 1 ones } ``` Envisioning the graph of the tree, we can see `ones` as an application of the constructor `Cons` to two arguments, one of which is `ones` again. {{< sidenote "right" "recursive-note" "It refers to itself!" >}} Things are actually more complicated than this. In our current language, recursive definitions are only possible in function definitions (like ones). In our runtime, each time there is a reference to a function, this is done through a new node, which means that functions with recursive definitions are not represented cyclically. Therefore, reference counting would work. However, in the future, our language will have more ways of creating circular definitions, some of which will indeed create cycles in our graphs. So, to prepare for this, we will avoid the use of reference counting. {{< /sidenote >}} In this case, when we compute the number of nodes that require `ones`, we will always find the number to be at least 1: `ones` needs `ones`, which needs `ones`, and so on. It will not be possible for us to free `ones`, then, by simply counting the number of references to it. There's a more powerful technique than reference counting for freeing unused memory: __mark-and-sweep garbage collection__. This technique is conceptually pretty simple to grasp, yet will allow us to handle cycles in our graphs. Unsurprisingly, we implement this type of garbage collection in two stages: 1. __Mark__: We go through every node that is still needed by the runtime, and recursively mark is, its children, and so on as "to keep". 2. __Sweep__: We go through every node we haven't yet freed, and, if it hasn't been marked as "to keep", we free it. This also seems simple enough. There are two main things for us to figure out: 1. For __Mark__, what are the "nodes still needed by the runtime"? These are just the nodes on the various G-machine stacks. If a node is not on the stack, nor is it a child of a node that is on the stack, why should we keep it around? 2. For __Sweep__, how do we keep track of all the nodes we haven't yet freed? In our case, the solution is a global list of allocated nodes, which is updated every time that a node is allocated. Wait a minute, though. Inside of `unwind` in C, we only have a reference to the most recent stack. Our execution model allows for an arbitrary number of stacks: we can keep using `Eval`, placing the current stack on the dump, and starting a new stack from scratch to evaluate a node. How can we traverse these stacks from inside unwind? One solution could be to have each stack point to the "parent" stack. To find all the nodes on the stack, then, we'd start with the current stack, mark all the nodes on it as "required", then move on to the parent stack, rinse and repeat. This is plausible and pretty simple, but there's another way. We clean up after ourselves. ### Towards a Cleaner Stack Simon Peyton Jones wrote his G-machine semantics in a particular way. Every time that a function is called, all it leaves behind on the stack is the graph node that represents the function's output. Our own internal functions, however, have been less careful. Consider, for instance, the "binary operator" function I showed you. Its body is given by the following G-machine instructions: ```C++ instructions.push_back(instruction_ptr(new instruction_push(1))); instructions.push_back(instruction_ptr(new instruction_eval())); instructions.push_back(instruction_ptr(new instruction_push(1))); instructions.push_back(instruction_ptr(new instruction_eval())); instructions.push_back(instruction_ptr(new instruction_binop(op))); ``` When the function is called, there are at least 3 things on the stack: 1. The "outermost" application node, to be replaced with an indirection (to enable laziness). 2. The second argument to the binary operator. 3. The first argument to the binary operator. Then, __Push__ adds another node to the stack, an __Eval__ forces its evaluation (and leaves it on the stack). This happens again with the second argument. Finally, we call __BinOp__, popping two values off the stack and combining them according to the binary operator. This leaves the stack with 4 things: the 3 I described above, and thew newly computed value. This is fine as far as `eval` is concerned: its implementation only asks for the top value on the stack after `unwind` finishes. But for anything more complicated, this is a very bad side effect. We want to leave the stack as clean as we found it - with one node and no garbage. Fortunately, the way we compile functions is a good guide for how we should compile internal operators and constructors. The idea is captured by the two instructions we insert at the end of a user-defined function: {{< codelines "C++" "compiler/09/definition.cpp" 56 57 >}} Once a result is computed, we turn the node that represented the application into an indirection, and point it to the computed result (as I said before, this enables lazy evaluation). We also pop the arguments given to the function off the stack. Let's add these two things to the `gen_llvm_internal_op` function: {{< codelines "C++" "compiler/09/main.cpp" 70 85 >}} Notice, in particular, the `instruction_update(2)` and `instruction_pop(2)` instructions that were recently added. A similar thing has to be done for data type constructors. The difference, though, is that __Pack__ removes the data it packs from the stack, and thus, __Pop__ is not needed: {{< codelines "C++" "compiler/09/definition.cpp" 102 117 >}} With this done, let's run a quick test: let's print the number of things on the stack at the end of an `eval` call (before the stack is freed, of course). We can compare the output of runtime without the fix (`old`) and with the fix (`current`): ``` current old Current stack size is 0 | Current stack size: 1 Current stack size is 0 | Current stack size: 1 Current stack size is 0 | Current stack size: 1 Current stack size is 0 | Current stack size: 1 Current stack size is 0 | Current stack size: 0 Current stack size is 0 | Current stack size: 0 Current stack size is 0 | Current stack size: 3 Current stack size is 0 | Current stack size: 0 Current stack size is 0 | Current stack size: 3 Current stack size is 0 | Current stack size: 0 Current stack size is 0 | Current stack size: 3 Result: 3 | Result: 3 ``` The stack is now much cleaner! Every time `eval` is called, it starts with one node, and ends with one node (which is then popped). ### One Stack to Rule Them All Wait a minute. If the stack is really always empty at the end, do we really need to construct a new stack every time? {{< sidenote "right" "arity-note" "I think not" >}} There's some nuance to this. While it is true that for the most part, we can get rid of the new stacks in favor of a single one, our runtime will experience a change. The change lies in the Unwind-Global rule, which requires that the stack has as many children as the function needs arguments. Until now, there was no way for this condition to be accidentally satisfied: the function we were unwinding was the only thing on the stack. Now, though, things are different: the function being unwound may share a stack with something else, and just checking the stack size will not be sufficient. I believe that this is not a problem for us, since the compiler will only emit Eval instructions for things it knows are data types or numbers, meaning their type is not a partially applied function that is missing arguments. However, this is a nontrivial observation. {{< /sidenote >}}, and Simon Peyton Jones seems to agree. In _Implementing Functional Languages: a tutorial_, he mentions that the dump does not need to be implemented as a real stack of stacks. So let's try this out: instead of starting a new stack using `eval`, let's use an existing one, by just calling `unwind` again. To do so, all we have to do is change our `instruction_eval` instruction. When the G-machine wants something evaluated now, it should just call `unwind` directly! To make this change, we have to make `unwind` available to the compiler. We thus declare it in the `llvm_context.cpp` file: {{< codelines "C++" "compiler/09/llvm_context.cpp" 158 163 >}} And even create a function to construct a call to `unwind` with the following signature: {{< codelines "C++" "compiler/09/llvm_context.hpp" 58 58 >}} We implement it like so: {{< codelines "C++" "compiler/09/llvm_context.cpp" 217 220 >}} Finally, the `instruction_eval::gen_llvm` method simply calls `unwind`: {{< codelines "C++" "compiler/09/instruction.cpp" 157 159 >}} After this change, we only call `eval` from `main`. Furthermore, since `eval` releases all the resources it allocates before returning, we won't be able to {{< sidenote "right" "retrieve-note" "easily retrieve" >}} We were able to do this before, but that's because our runtime didn't free the nodes, ever. Now that it does, returning a node violates that node's lifetime. {{< /sidenote >}}the result of the evaluation from it. Thus, we simply merge `eval` with `main` - combining the printing and the initialization / freeing code. With this, only one stack will be allocated for the entirety of program execution. This doesn't just help us save on memory allocations, but also __solves the problem of marking valid nodes during garbage collection__! Instead of traversing a dump of stacks, we can now simply traverse a single stack; all that we need is in one place. So this takes care, more or less, of the "mark" portion of mark-and-sweep. Using the stack, we can recursively mark the nodes that we need. But what about "sweeping"? How can we possibly know of every node that we've allocated? There's some more bookkeping for us to do. ### It's All Connected There exists a simple technique I've previously seen (and used) for keeping track of all the allocated memory. The technique is to __turn all the allocated nodes into elements of a linked list__. The general process of implementing this proceeds as follows: 1. To each node, add a "next" pointer. 2. Keep a handle to the whole node chain somewhere. 3. Add each newly allocated node to the front of the whole chain. This "somewhere" could be a global variable. However, since we already pass a stack to almost all of our functions, it makes more sense to make the list handle a part of some data structure that will also contain the stack, and pass that around, instead. This keeps all of the G-machine data in one place, and in principle could allow for concurrent execution of more than one G-machine in a single program. Let's call our new data structure `gmachine`: {{< codelines "C++" "compiler/09/runtime.h" 69 74 >}} Here, the `stack` field holds the G-machine stack, and the `gc_nodes` is the handle to the list of all the nodes we've allocated and not yet freed. Don't worry about the `gc_node_count` and `gc_threshold` fields - we'll get to them a little later. This is going to be a significant change. First of all, since the handle won't be global, it can't be accessed from inside the `alloc_*` functions. Instead, we have to make sure to add nodes allocated through `alloc_*` to a G-machine somewhere wherever we call the allocators. To make it easier to add nodes to a G-machine GC handle, let's make a new function, `track`: ```C struct node_base* gmachine_track(struct gmachine*, struct node_base*); ``` This function will add the given node to the G-machine's handle, and return that same node. This way, we can wrap nodes in a call to `gmachine_track`. We will talk about this function's implementation later in the post. This doesn't get us all the way to a working runtime, though: right now, we still pass around `struct stack*` instead of `struct gmachine*` everywhere. However, the whole point of adding the `gmachine` struct was to store more data in it! Surely we need that new data somewhere, and thus, we need to use the `gmachine` struct for _some_ functions. What functions _do_ need a whole `gmachine*`, and which ones only need a `stack*`? 1. {{< sidenote "right" "ownership-note" "Clearly," >}} This might not be clear. Maybe pushing onto a stack will add a node to our GC handle, and so, we need to have access to the handle in stack_push. The underlying question is that of ownership: when we allocate a node, which part of the program does it "belong" to? The "owner" of the node should do the work of managing when to free it or keep it. Since we already agreed to create a gmachine struct to house the GC handle, it makes sense that nodes are owned by the G-machine. Thus, the assumption in functions like stack_push is that the "owner" of the node already took care of allocating and tracking it, and stack_push itself shouldn't bother. {{< /sidenote >}} `stack_push`, `stack_pop`, and similar functions do not require a G-machine. 2. `stack_alloc` and `stack_pack` __do__ need a G-machine, because they must allocate new nodes. Where the nodes are allocated, we should add them to the GC handle. 3. Since they use `stack_alloc` and `stack_pack`, generated functions also need a G-machine. 4. Since `unwind` calls the generated functions, it must also receive a G-machine. As far as stack functions go, we only _need_ to update `stack_alloc` and `stack_pack`. Everything else doesn't require new node allocations, and thus, does not require the GC handle. However, this makes our code rather ugly: we have a set of mostly `stack_*` functions, followed suddenly by two `gmachine_*` functions. In the interest of cleanliness, let's instead do the following: 1. Make all functions associated with G-machine rules (like __Alloc__, __Update__, and so on) require a `gmachine*`. This way, theres a correspondence between our code and the theory. 2. Leave the rest of the functions (`stack_push`, `stack_pop`, etc.) as is. They are not G-machine specific, and don't require a GC handle, so there's no need to touch them. Let's make this change. We end up with the following functions: {{< codelines "C" "compiler/09/runtime.h" 56 84 >}} For the majority of the changed functions, the updates are {{< sidenote "right" "cosmetic-note" "cosmetic." >}} We must also update the LLVM/C++ declarations of the affected functions: many of them now take a gmachine_ptr_type instead of stack_ptr_type. This change is not shown explicitly here (it is hard to do with our growing code base), but it is nonetheless significant. {{< /sidenote >}} The functions that require more significant modifications are `gmachine_alloc` and `gmachine_pack`. In both, we must now make a call to `gmachine_track` to ensure that a newly allocated node will be garbage collected in the future. The updated code for `gmachine_alloc` is: {{< codelines "C" "compiler/09/runtime.c" 140 145 >}} Correspondingly, the updated code for `gmachine_pack` is: {{< codelines "C" "compiler/09/runtime.c" 147 162 >}} Note that we've secretly made one more change. Instead of allocating `sizeof(*data) * n` bytes of memory for the array of packed nodes, we allocate `sizeof(*data) * (n + 1)`, and set the last element to `NULL`. This will allow other functions (which we will soon write) to know how many elements are packed inside a `node_data` (effectively, we've added a `NULL` terminator). We must change our compiler to keep it up to date with this change. Importantly, it must know that a G-machine struct exists. To give it this information, we add a new `llvm::StructType*` called `gmachine_type` to the `llvm_context` class, initialize it in the constructor, and set its body as follows: {{< codelines "C++" "compiler/09/llvm_context.cpp" 21 26 >}} The compiler must also know that generated functions now use the G-machine struct rather than a stack struct: {{< codelines "C++" "compiler/09/llvm_context.cpp" 19 19 >}} Since we still use some functions that require a stack and not a G-machine, we must have a way to get the stack from a G-machine. To do this, we create a new `unwrap` function, which uses LLVM's GEP instruction to get a pointer to the G-machine's stack field: {{< codelines "C++" "compiler/09/llvm_context.cpp" 222 225 >}} We use this function elsewhere, such `llvm_context::create_pop`: {{< codelines "C++" "compiler/09/llvm_context.cpp" 176 179 >}} Finally, we want to make sure our generated functions don't allocate nodes without tracking them with the G-machine. To do so, we modify all the `create_*` methods to require the G-machine function argument, and update the functions themselves to call `gmachine_track`. For example, here's `llvm_context::create_num`: {{< codelines "C++" "compiler/09/llvm_context.cpp" 235 239 >}} Of course, this requires us to add a new `create_track` method to the `llvm_context`: {{< codelines "C++" "compiler/09/llvm_context.cpp" 212 215 >}} This is good. Let's now implement the actual mark-and-sweep algorithm in `gmachine_gc`: {{< codelines "C" "compiler/09/runtime.c" 186 204 >}} In the code above, we first iterate through the stack, calling `gc_visit_node` on every node that we encounter. The assumption is that once `gc_visit_node` is done, every node that _can_ be reached has its `gc_reachable` field set to 1, and all the others have it set to 0. Once we reach the end of the stack, we continue to the "sweep" phase, iterating through the linked list of nodes (held in the G-machine GC handle `gc_nodes`). For each node, if its `gc_reachable` flag is not set, we remove it from the linked list, and call `free_node_direct` on it. Otherwise (that is, if the flag __is__ set), we clear it, so that the node can potentially be garbage collected in a future invocation of `gmachine_gc`. `gc_visit_node` recursively marks a node and its children as reachable: {{< codelines "C" "compiler/09/runtime.c" 51 70 >}} This is possible with the `node_data` nodes because of the change we made to the `gmachine_pack` instruction earlier: now, the last element of the "packed" array is `NULL`, telling `gc_visit_node` that it has reached the end of the list of children. `free_node_direct` performs a non-recursive deallocation of all the resources held by a particular node. So far, this is only needed for `node_data` nodes, since the arrays holding their children are dynamically allocated. Thus, the code for the function is pretty simple: {{< codelines "C" "compiler/09/runtime.c" 45 49 >}} ### When to Collect When should we run garbage collection? Initially, I tried running it after every call to `unwind`. However, this quickly proved impractical: the performance of all the programs in the language decreased by a spectacular amount. Programs like `works1.txt` and `works2.txt` would take tens of seconds to complete. Instead of this madness, let's settle for an approach common to many garbage collectors. Let's __perform garbage collection every time the amount of memory we've allocated doubles__. Tracking when the amount of allocated memory doubles is the purpose of the `gc_node_count` and `gc_threshold` fields in the `gmachine` struct. The former field tracks how many nodes have been tracked by the garbage collector, and the latter holds the number of nodes the G-machine must reach before triggering garbage collection. Since the G-machine is made aware of allocations by a call to the `gmachine_track` function, this is where we will attempt to perform garbage collection. We end up with the following code: {{< codelines "C++" "compiler/09/runtime.c" 171 184 >}} When a node is added to the GC handle, we increment the `gc_node_count` field. If the new value of this field exceeds the threshold, we perform garbage collection. There are cases in which this is fairly dangerous: for instance, `gmachine_pack` first moves all packed nodes into an array, then allocates a `node_data` node. This means that for a brief moment, the nodes stored into the new data node are inaccessible from the stack, and thus susceptible to garbage collection. To prevent situations like this, we run `gc_visit_node` on the node being tracked, marking it and its children as "reachable". Finally, we set the next "free" threshold to double the number of currently allocated nodes. This is about as much as we need to do. The change in this post was a major one, and required updating multiple files. As always, you're welcome to check out [the compiler source code for this post](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/09). To wrap up, let's evaluate our change. To especially stress the compiler, I came up with a prime number generator. Since booleans are not in the standard library, and since it isn't possible to pattern match on numbers, my only option was the use Peano encoding. This effectively means that numbers are represented as linked lists, which makes garbage collection all the more important. The program is quite long, but you can [find the entire code here](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/09/examples/primes.txt). When I ran the `primes` program compiled using the previous version of the compiler using `time`, I got the following output: ``` Maximum resident set size (kbytes): 935764 Minor (reclaiming a frame) page faults: 233642 ``` In contrast, here is the output of `time` when running the same program compiled with the new version of the compiler: ``` Maximum resident set size (kbytes): 7448 Minor (reclaiming a frame) page faults: 1577 ``` We have reduced maximum memory usage by a factor of 125, and the number of page faults by a factor of 148. That seems pretty good! With this success, we end today's post. As I mentioned before, we're not done. The language is still clunky to use, and can benefit from `let/in` expressions and __lambda functions__. Furthermore, our language is currently monomorphic, and would be much better with __polymorphism__. Finally, to make our language capable of more-than-trivial work, we may want to implement __Input/Output__ and __strings__. I hope to see you in future posts, where we will implement these features!