
---
title: "Compiling a Functional Language Using C++, Part 8 - LLVM"
date: 2019-10-30T22:16:22-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---

We don't want a compiler that can only generate code for a single platform. Our language should work on macOS, Windows, and Linux, on x86_64, ARM, and maybe some other architectures. We also don't want to manually implement the compiler for each platform, dealing with the specifics of each architecture and operating system.

This is where LLVM comes in. LLVM (originally short for Low Level Virtual Machine) is a project that presents us with a kind of generic assembly language, an Intermediate Representation (IR). It also provides tooling to compile the IR into platform-specific instructions, as well as to apply a host of optimizations. We can thus translate our G-machine instructions into LLVM IR, and then use LLVM to generate machine code, which gets us to our ultimate goal of compiling our language.

We start by adding LLVM to our CMake project:

{{< codelines "CMake" "compiler/08/CMakeLists.txt" 7 7 >}}

LLVM is a huge project with many components, most of which we don't need. We do need the core libraries, the x86 assembly generator, and the x86 assembly parser. I'm not sure why we need the last of these, but I ran into linking errors without it. We find the required link targets for these components using the following CMake command:

{{< codelines "CMake" "compiler/08/CMakeLists.txt" 19 20 >}}

Finally, we add the new include directories, link targets, and definitions to our compiler executable:

{{< codelines "CMake" "compiler/08/CMakeLists.txt" 39 41 >}}

Great, our build infrastructure is now set up to work with LLVM. It's time to start using the LLVM API to compile our G-machine instructions into assembly. We start with LLVMContext. The LLVM documentation states:

This is an important class for using LLVM in a threaded context. It (opaquely) owns and manages the core "global" data of LLVM's core infrastructure, including the type and constant uniquing tables.

We will have exactly one instance of such a class in our program.

Additionally, we want an IRBuilder, which will help us generate IR instructions, placing them into basic blocks (more on that in a bit). We also want a Module object, which represents a collection of code and declarations (somewhat like a C++ source file). Let's keep these things in our own llvm_context class. Here's what that looks like:

{{< codeblock "C++" "compiler/08/llvm_context.hpp" >}}

{{< todo >}} Consistently name context / state.{{< /todo >}}

We include the LLVM context, builder, and module as members of the context struct. Since the builder and the module need the context, we initialize them in the constructor, where they can safely reference it.

Besides these fields, we've added a few others: the functions and struct_types maps, and the various llvm::Type subclasses such as stack_type. We did this because we want to be able to call our runtime functions (and use our runtime structs) from LLVM. To generate a function call with LLVM, we need access to an llvm::Function object, so we want one for each runtime function we intend to call. We could declare a member variable in our llvm_context for each runtime function, but it's easier to keep this an implementation detail and maintain a dynamically created map from runtime function names to their corresponding llvm::Function objects.
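
To make this concrete, here is a rough sketch of how such a context could be laid out. This is not the project's actual header (that's the file linked above); the member names and the module name are just illustrative:

```cpp
#include <map>
#include <memory>
#include <string>

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

// Sketch of a context struct bundling LLVM's core objects with lookup
// tables for the runtime's functions and struct types.
struct llvm_context {
    // Declared first, so it is initialized before the members that use it.
    llvm::LLVMContext ctx;
    llvm::IRBuilder<> builder;
    std::unique_ptr<llvm::Module> module;

    // Maps from runtime names (e.g. "stack_push") to their LLVM declarations.
    std::map<std::string, llvm::Function*> functions;
    std::map<std::string, llvm::StructType*> struct_types;

    llvm_context()
        : builder(ctx),
          module(std::make_unique<llvm::Module>("program", ctx)) {}
};
```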

We populate the maps and other type-related variables in two methods, create_functions() and create_types(). To create an llvm::Function, we must provide an llvm::FunctionType, a linkage type, the name of the function, and the module in which the function is declared. Since we only have one module (the one we initialized in the constructor), that's the module we pass in. The name of the function is the same as its name in the runtime, and the linkage type is always external. The only remaining parameter is the llvm::FunctionType, which is created using code like:

{{< todo >}} Why external? {{< /todo >}}

llvm::FunctionType::get(return_type, {param_type_1, param_type_2, ...}, is_variadic)
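
Putting this together with llvm::Function::Create, declaring a single runtime function might look roughly like the following. This is only a sketch, using the member names from the sketch earlier and assuming the struct types have already been registered; the real code is excerpted below:

```cpp
// Sketch: declare void stack_push(struct stack*, struct node_base*) and
// register it, where ctx is our llvm_context and the structs already exist.
auto* void_type = llvm::Type::getVoidTy(ctx.ctx);
auto* stack_ptr_type = llvm::PointerType::getUnqual(ctx.struct_types.at("stack"));
auto* node_ptr_type = llvm::PointerType::getUnqual(ctx.struct_types.at("node_base"));

auto* stack_push_type = llvm::FunctionType::get(
    void_type, { stack_ptr_type, node_ptr_type }, /* is_variadic = */ false);

ctx.functions["stack_push"] = llvm::Function::Create(
    stack_push_type,
    llvm::Function::ExternalLinkage, // defined in the runtime, visible to the linker
    "stack_push",
    ctx.module.get());
```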

Declaring all the functions and types in our runtime is mostly just tedious. Here are a few lines from create_types(), from which you can extrapolate the rest:

{{< codelines "C++" "compiler/08/llvm_context.cpp" 7 11 >}}

Similarly, here are a few lines from create_functions(), which give a very good idea of the rest of that method:

{{< codelines "C++" "compiler/08/llvm_context.cpp" 20 27 >}}

This completes our implementation of the context.

LLVM IR

It's now time to look at generating actual code for each G-machine instruction. Before we do this, we need to get a bit of an understanding of what LLVM IR is like. An important property of LLVM IR is that it is in Static Single Assignment (SSA) form. This means that each variable can only be assigned to once. For instance, if we use <- to represent assignment, the following program is valid:

x <- 1
y <- 2
z <- x + y

However, the following program is not valid:

x <- 1
x <- x + 1

But what if we do want to modify a variable x? We can declare another "version" of x every time we modify it. For instance, if we wanted to increment x twice, we'd do this:

x <- 1
x1 <- x + 1
x2 <- x1 + 1

In practice, LLVM's C++ API can take care of versioning variables on its own, by auto-incrementing numbers associated with each variable we use.
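
Here's a small sketch of what this looks like through the API, assuming context is the llvm::LLVMContext, builder is an llvm::IRBuilder positioned inside a block, and x is some existing llvm::Value* (for example, a function argument):

```cpp
// Each CreateAdd produces a fresh SSA value; nothing is ever overwritten.
// Reusing the name "x" is fine: LLVM uniques it for us (%x, %x1, %x2, ...).
llvm::Value* one = llvm::ConstantInt::get(llvm::Type::getInt32Ty(context), 1);
llvm::Value* x1 = builder.CreateAdd(x, one, "x");
llvm::Value* x2 = builder.CreateAdd(x1, one, "x");
```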

We need not get too deep into the specifics of LLVM IR's textual representation, since we will largely be working with the C++ API to interact with it. We do, however, need to understand one more concept from the world of compiler design: basic blocks. A basic block is a sequence of instructions that are guaranteed to be executed one after another. This means that a basic block cannot contain an if/else, jump, or any other type of control flow anywhere except at the end; if control flow could appear inside the basic block, some, but not all, of its instructions might be executed, violating the definition. Every time we add an IR instruction in LLVM, we add it to a basic block. Writing control flow involves creating several blocks, with each block serving as the destination of a potential jump. We will see this used to compile the Jump instruction.
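
In the C++ API, this amounts to creating llvm::BasicBlocks inside the function being compiled and pointing the builder at them. A rough sketch, where context, builder, and function are as before and the block names are arbitrary:

```cpp
// Create two blocks in the current function and connect them with a branch.
llvm::BasicBlock* entry = llvm::BasicBlock::Create(context, "entry", function);
llvm::BasicBlock* next = llvm::BasicBlock::Create(context, "next", function);

builder.SetInsertPoint(entry); // subsequent instructions go into "entry"
builder.CreateBr(next);        // control flow only appears at the end of the block

builder.SetInsertPoint(next);  // continue emitting into "next"
```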

Generating LLVM

Let's envision a gen_llvm method on the instruction struct. We need access to all the other functions from our runtime, such as stack_init, and functions from our program such as f_custom_function. Thus, we need access to our llvm_context. The current basic block is part of the builder, which is part of the context, so that's also taken care of. There's only one more thing that we will need, and that's access to the llvm::Function that's currently being compiled. To understand why, consider the signature of f_main from the previous post:

void f_main(struct stack*);

The function takes a stack as a parameter. What if we want to use this stack in a call like stack_push(s, node)? We need access to the LLVM representation of the stack parameter. The easiest way to get it is llvm::Function::arg_begin(), which gives us the function's first argument. We thus carry the function pointer throughout our code generation methods.
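
A sketch of that pattern, assuming builder and functions are the builder and function map from our llvm_context, and node_value is some llvm::Value* we want to push (the names are illustrative):

```cpp
// The stack is the function's first (and only) parameter.
llvm::Value* stack_arg = &*function->arg_begin();

// Call the runtime's stack_push through the llvm::Function we declared earlier.
builder.CreateCall(functions.at("stack_push"), { stack_arg, node_value });
```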

With these things in mind, here's the signature for gen_llvm:

virtual void gen_llvm(const llvm_context&, llvm::Function*) const;