diff --git a/content/blog/08_compiler_llvm.md b/content/blog/08_compiler_llvm.md index 11f6b64..4d38a54 100644 --- a/content/blog/08_compiler_llvm.md +++ b/content/blog/08_compiler_llvm.md @@ -54,7 +54,6 @@ a `Module` object, which represents some collection of code and declarations {{< codeblock "C++" "compiler/08/llvm_context.hpp" >}} -{{< todo >}} Consistently name context / state.{{< /todo >}} {{< todo >}} Explain creation functions. {{< /todo >}} We include the LLVM context, builder, and module as members @@ -82,35 +81,58 @@ an `llvm::LinkageType`, the name of the function, and the module in which the function is declared. Since we only have one module (the one we initialized in the constructor) that's the module we pass in. The name of the function is the same -as its name in the runtime, and the linkage type is always -external. The only remaining parameter is -the `llvm::FunctionType`, which is created using code like: +as its name in the runtime. The linkage type is a little +more complicated - it tells LLVM the "visibility" of a function. +"Private" or "Internal" would hide this function from the linker +(like `static` functions in C). However, we want to do the opposite: our +generated functions should be accessible from other code. +Thus, our linkage type is "External". -{{< todo >}} Why external? {{< /todo >}} +The only remaining parameter is the `llvm::FunctionType`, which +is created using code like: ```C++ llvm::FunctionType::get(return_type, {param_type_1, param_type_2, ...}, is_variadic) ``` Declaring all the functions and types in our runtime is mostly -just tedious. Here are a few lines from `create_types()`, from +just tedious. 
Here are a few lines from `create_functions()`, which +give a very good idea of the rest of that method: + +{{< codelines "C++" "compiler/08/llvm_context.cpp" 47 60 >}} + +Similarly, here are a few lines from `create_types()`, from which you can extrapolate the rest: {{< codelines "C++" "compiler/08/llvm_context.cpp" 7 11 >}} -{{< todo >}} Also show struct body setters. {{< /todo >}} +We also tell LLVM the contents of our structs, so that +we may later reference specific fields. This is just like +forward declaration - we can forward declare a struct +in C/C++, but unless we also declare its contents, +we can't access what's inside. Below is the code +for specifying the body of `node_base` and `node_app`. -Similarly, here are a few lines from `create_functions()`, which -give a very good idea of the rest of that method: +{{< codelines "C++" "compiler/08/llvm_context.cpp" 19 26 >}} -{{< codelines "C++" "compiler/08/llvm_context.cpp" 20 27 >}} +There's still more functionality packed into `llvm_context`. +Let's next take a look into `custom_function`, and +the `create_custom_function` method. Why do we need +these? -This completes our implementation of the context. +This isn't the end of our `llvm_context` class: it also +has a variety of `create_*` methods! Let's take a look +at their signatures. Most return either `void`, +`llvm::ConstantInt*`, or `llvm::Value*`. Since +`llvm::ConstantInt` is a subclass of `llvm::Value`, let's +just treat it as simply an `llvm::Value*` while trying +to understand these methods. + +So, what is `llvm::Value`? To answer this question, let's +first understand how the LLVM IR works. ### LLVM IR -It's now time to look at generating actual code for each G-machine instruction. -Before we do this, we need to get a little bit of an understanding of what LLVM -IR is like. An important property of LLVM IR is that it is in __Single Static Assignment__ +An important property of LLVM IR is that it is in __Static Single Assignment__ (SSA) form. 
This means that each variable can only be assigned to once. For instance, if we use `<-` to represent assignment, the following program is valid: @@ -140,13 +162,26 @@ x2 <- x1 + 1 In practice, LLVM's C++ API can take care of versioning variables on its own, by auto-incrementing numbers associated with each variable we use. -We need not get too deep into the specifics of LLVM IR's textual -representation, since we will largely be working with the C++ -API to interact with it. We do, however, need to understand one more -concept from the world of compiler design: __basic blocks__. A basic -block is a sequence of instructions that are guaranteed to be executed -one after another. This means that a basic block cannot have -an if/else, jump, or any other type of control flow anywhere +Assigned to each variable is an `llvm::Value`. The LLVM documentation states: + +> It is the base class of all values computed by a program that may be used as operands to other values. + +It's important to understand that `llvm::Value` __does not store the result of the computation__. +It rather represents how something may be computed. 1 is a value because it is computed by +just returning. `x + 1` is a value because it is computed by adding the value inside of +`x` to 1. Since we cannot modify a variable once we've declared it, we will +keep assigning intermediate results to new variables, constructing new values +out of values that we've already specified. + +This somewhat elucidates what the `create_*` functions do: `create_i8` creates an 8-bit integer +value, and `create_pop` creates a value that is computed by calling +our runtime `stack_pop` function. + +Before we move on to look at the implementations of these functions, +we need to understand another concept from the world of compiler design: +__basic blocks__. A basic block is a sequence of instructions that +are guaranteed to be executed one after another. 
This means that a +basic block cannot have an if/else, jump, or any other type of control flow anywhere except at the end. If control flow could appear inside the basic block, there would be opporunity for execution of some, but not all, instructions in the block, violating the definition. Every time @@ -155,7 +190,74 @@ Writing control flow involves creating several blocks, with each block serving as the destination of a potential jump. We will see this used to compile the Jump instruction. -### Generating LLVM +### Generating LLVM IR +Now that we understand what `llvm::Value` is, and have a vague +understanding of how LLVM is structured, let's take a look at +the implementations of the `create_*` functions. The simplest +is `create_i8`: + +{{< codelines "C++" "compiler/08/llvm_context.cpp" 150 152 >}} + +Not much to see here. We create an instance of the `llvm::ConstantInt` class, +from the actual integer given to the method. As we said before, +`llvm::ConstantInt` is a subclass of `llvm::Value`. Next up, let's look +at `create_pop`: + +{{< codelines "C++" "compiler/08/llvm_context.cpp" 160 163 >}} + +We first retrieve an `llvm::Function` associated with `stack_pop` +from our map, and then use `llvm::IRBuilder::CreateCall` to insert +a value that represents a function call into the currently +selected basic block (the builder's state is what +dictates what the "selected basic block" is). `CreateCall` +takes as parameters the function we want to call (`stack_pop`, +which we store into the `pop_f` variable), as well as the arguments +to the function (for which we pass `f->arg_begin()`). + +Hold on. What the heck is `arg_begin()`? Why do we take a function +as a parameter to this method? The answer is fairly simple: this +method is used when we are +generating a function with signature `void f_(struct stack* s)` +(we discussed the signature in the previous post). 
The +parameter that we give to `create_pop` is this function we're +generating, and `arg_begin()` gets the value that represents +the first parameter to our function - `s`! Since `stack_pop` +takes a stack, we need to give it the stack we're working on, +and so we use `f->arg_begin()` to access it. + +Most of the other functions follow this exact pattern, with small +deviations. However, another function uses a more complicated LLVM +instruction: + +{{< codelines "C++" "compiler/08/llvm_context.cpp" 202 209 >}} + +`unwrap_num` is used to cast a given node pointer to a pointer +to a number node, and then return the integer value from +that number node. It starts fairly innocently: we ask +LLVM for the type of a pointer to a `node_num` struct, +and then use `CreatePointerCast` to create a value +that is the same node pointer we're given, but now interpreted +as a number node pointer. We now have to access +the `value` field of our node. `CreateGEP` helps us with +this: given a pointer to a node, and two offsets +`n` and `k`, it effectively performs the following: + +```C++ +&(num_pointer[n].kth_field) +``` + +The first offset, then, gives an index into the "array" +represented by the pointer, while the second offset +gives the index of the field we want to access. We +want to dereference the pointer (`num_pointer[0]`), +and we want the second field (`1`, when counting from 0). +Thus, we call CreateGEP with these offsets and our pointer. + +This still leaves us with a pointer to a number, rather +than the number itself. To dereference the pointer, we use +`CreateLoad`. This gives us the value of the number node, +which we promptly return. + Let's envision a `gen_llvm` method on the `instruction` struct. We need access to all the other functions from our runtime, such as `stack_init`, and functions from our program such @@ -187,5 +289,3 @@ virtual void gen_llvm(llvm_context&, llvm::Function*) const; ``` {{< todo >}} Fix pointer type inconsistencies. 
{{< /todo >}} -{{< todo >}} Create + backport Pop instruction {{< /todo >}} -{{< todo >}} Explain forcing normal evaluation in binary operator {{< /todo >}}
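
The cast, `CreateGEP`, and `CreateLoad` steps described for `unwrap_num` have a close analogue in plain C++. The sketch below is not the code from `llvm_context.cpp`, and it does not use the LLVM API at all. It assumes a hypothetical layout in which `node_num` holds a `node_base` as field 0 and a 32-bit `value` as field 1, and it only illustrates what the offsets `0` and `1` select.

```C++
#include <cassert>
#include <cstdint>

// Hypothetical struct layouts, assumed for illustration only;
// the real runtime definitions may differ.
struct node_base {
    int32_t kind;
};

struct node_num {
    node_base base;  // field 0
    int32_t value;   // field 1
};

// Plain C++ analogue of the three steps described for unwrap_num.
int32_t unwrap_num(node_base* node) {
    assert(node != nullptr);
    // Analogue of CreatePointerCast: same address, now viewed as a node_num*.
    node_num* num_pointer = reinterpret_cast<node_num*>(node);
    // Analogue of CreateGEP with offsets (0, 1): element 0 of the "array",
    // then field 1 of the struct. The result is still a pointer.
    int32_t* value_pointer = &(num_pointer[0].value);
    // Analogue of CreateLoad: read the integer behind the pointer.
    return *value_pointer;
}
```

Calling `unwrap_num(&n.base)` on a `node_num n` returns `n.value`. The point to notice is that the GEP-like step only computes an address, and the load is a separate operation.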