title | date | draft | tags | |||
---|---|---|---|---|---|---|
Compiling a Functional Language Using C++, Part 8 - LLVM | 2019-10-30T22:16:22-07:00 | true |
|
We don't want a compiler that can only generate code for a single platform. Our language should work on macOS, Windows, and Linux, on x86_64, ARM, and maybe some other architectures. We also don't want to manually implement the compiler for each platform, dealing with the specifics of each architecture and operating system.
This is where LLVM comes in. LLVM (the name originally stood for Low Level Virtual Machine) is a project that gives us a kind of generic assembly language, an Intermediate Representation (IR). It also provides tooling to compile the IR into platform-specific instructions and to apply a host of optimizations. We can thus translate our G-machine instructions to LLVM IR, and then use LLVM to generate machine code, which gets us to our ultimate goal of compiling our language.
We start by adding LLVM to our CMake project:
{{< codelines "CMake" "compiler/08/CMakeLists.txt" 7 7 >}}
LLVM is a huge project with many components, and we don't need most of them. We do need the core libraries, the x86 assembly generator, and the x86 assembly parser. I'm not sure why we need the last one, but I ran into linking errors without it. We find the required link targets for these components using this CMake command:
{{< codelines "CMake" "compiler/08/CMakeLists.txt" 19 20 >}}
Finally, we add the new include directories, link targets, and definitions to our compiler executable:
{{< codelines "CMake" "compiler/08/CMakeLists.txt" 39 41 >}}
Great, we have the infrastructure updated to work with LLVM. It's now time to start using the LLVM API to compile our G-machine instructions into assembly. We start with `LLVMContext`. The LLVM documentation states:
This is an important class for using LLVM in a threaded context. It (opaquely) owns and manages the core "global" data of LLVM's core infrastructure, including the type and constant uniquing tables.
We will have exactly one instance of such a class in our program.
Additionally, we want an `IRBuilder`, which will help us generate IR instructions, placing them into basic blocks (more on that in a bit). Also, we want a `Module` object, which represents some collection of code and declarations (perhaps like a C++ source file). Let's keep these things in our own `llvm_context` class. Here's what that looks like:
{{< codeblock "C++" "compiler/08/llvm_context.hpp" >}}
{{< todo >}} Explain creation functions. {{< /todo >}}
We include the LLVM context, builder, and module as members of the context struct. Since the builder and the module need the context, we initialize them in the constructor, where they can safely reference it.
Besides these fields, we added a few others, namely the `functions` and `struct_types` maps, and several members whose types are `llvm::Type` subclasses, such as `stack_type`. We did this because we want to be able to call our runtime functions (and use our runtime structs) from LLVM. To generate a function call from LLVM, we need to have access to an `llvm::Function` object. We thus want to have an `llvm::Function` object for each runtime function we want to call. We could declare a member variable in our `llvm_context` for each runtime function, but it's easier to leave this as an implementation detail, and only have a dynamically created map between runtime function names and their corresponding `llvm::Function` objects.
We populate the maps and other type-related variables in two methods, `create_functions()` and `create_types()`. To create an `llvm::Function`, we must provide an `llvm::FunctionType`, a linkage type (a value of the `llvm::GlobalValue::LinkageTypes` enum), the name of the function, and the module in which the function is declared. Since we only have one module (the one we initialized in the constructor), that's the module we pass in. The name of the function is the same as its name in the runtime. The linkage type is a little more complicated: it tells LLVM the "visibility" of a function. "Private" or "Internal" would hide this function from the linker (like `static` functions in C). However, we want to do the opposite: our generated functions should be accessible from other code. Thus, our linkage type is "External".
The only remaining parameter is the `llvm::FunctionType`, which is created using code like:
llvm::FunctionType::get(return_type, {param_type_1, param_type_2, ...}, is_variadic)
Declaring all the functions and types in our runtime is mostly just tedious. Here are a few lines from `create_functions()`, which give a very good idea of the rest of that method:
{{< codelines "C++" "compiler/08/llvm_context.cpp" 47 60 >}}
Similarly, here are a few lines from `create_types()`, from which you can extrapolate the rest:
{{< codelines "C++" "compiler/08/llvm_context.cpp" 7 11 >}}
We also tell LLVM the contents of our structs, so that we may later reference specific fields. This is just like forward declaration: we can forward declare a struct in C/C++, but unless we also declare its contents, we can't access what's inside. Below is the code for specifying the body of `node_base` and `node_app`.
{{< codelines "C++" "compiler/08/llvm_context.cpp" 19 26 >}}
There's still more functionality packed into `llvm_context`. Let's next take a look into `custom_function`, and the `create_custom_function` method. Why do we need these?
This isn't the end of our `llvm_context` class: it also has a variety of `create_*` methods! Let's take a look at their signatures. Most return either `void`, `llvm::ConstantInt*`, or `llvm::Value*`. Since `llvm::ConstantInt` is a subclass of `llvm::Value`, let's just treat it as simply an `llvm::Value*` while trying to understand these methods.
So, what is `llvm::Value`? To answer this question, let's first understand how the LLVM IR works.
LLVM IR
An important property of LLVM IR is that it is in Static Single Assignment (SSA) form. This means that each variable can only be assigned to once. For instance, if we use `<-` to represent assignment, the following program is valid:
x <- 1
y <- 2
z <- x + y
However, the following program is not valid:
x <- 1
x <- x + 1
But what if we do want to modify a variable `x`? We can declare another "version" of `x` every time we modify it. For instance, if we wanted to increment `x` twice, we'd do this:
x <- 1
x1 <- x + 1
x2 <- x1 + 1
In practice, LLVM's C++ API can take care of versioning variables on its own, by auto-incrementing numbers associated with each variable we use.
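The versioning idea can be sketched in plain C++, without LLVM. This is only an illustration of the naming scheme from the example above, not LLVM's actual implementation; `name_versioner` and `fresh` are made-up names.

```cpp
#include <map>
#include <string>

// Hypothetical helper (not part of LLVM): hands out a new "version" of a
// variable name each time it is asked, so no name is ever assigned twice.
class name_versioner {
    std::map<std::string, int> counts;
public:
    std::string fresh(const std::string& base) {
        int n = counts[base]++;
        // The first use keeps the plain name; later uses get a numeric suffix.
        return n == 0 ? base : base + std::to_string(n);
    }
};
```

Asking for `x` three times yields `x`, `x1`, and `x2`, matching the hand-written program above.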
Assigned to each variable is an `llvm::Value`. The LLVM documentation states:
It is the base class of all values computed by a program that may be used as operands to other values.
It's important to understand that `llvm::Value` does not store the result of the computation. Rather, it represents how something may be computed. `1` is a value because it is computed by just returning the constant. `x + 1` is a value because it is computed by adding the value inside of `x` to 1. Since we cannot modify a variable once we've declared it, we will keep assigning intermediate results to new variables, constructing new values out of values that we've already specified.
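To make the "description of a computation" idea concrete, here is a small model in plain C++. It is a sketch that does not use LLVM at all; the `value`, `constant`, and `add` types are invented for illustration and are far simpler than the real `llvm::Value` hierarchy.

```cpp
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for llvm::Value: it records *how* something is computed,
// but never holds the computed result.
struct value {
    virtual ~value() = default;
    virtual std::string describe() const = 0;
};
using value_ptr = std::shared_ptr<value>;

// A constant is "computed" by just returning it.
struct constant : value {
    int n;
    explicit constant(int n) : n(n) {}
    std::string describe() const override { return std::to_string(n); }
};

// An addition is built out of two values that already exist.
struct add : value {
    value_ptr lhs, rhs;
    add(value_ptr l, value_ptr r) : lhs(std::move(l)), rhs(std::move(r)) {}
    std::string describe() const override {
        return "(" + lhs->describe() + " + " + rhs->describe() + ")";
    }
};
```

Incrementing `x` twice then means building two `add` nodes, each referring to the previous one. Nothing is added at construction time; we only end up with a tree that describes the additions.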
This somewhat elucidates what the `create_*` functions do: `create_i8` creates an 8-bit integer value, and `create_pop` creates a value that is computed by calling our runtime `stack_pop` function.
Before we move on to look at the implementations of these functions, we need to understand another concept from the world of compiler design: basic blocks. A basic block is a sequence of instructions that are guaranteed to be executed one after another. This means that a basic block cannot have an if/else, jump, or any other type of control flow anywhere except at the end. If control flow could appear inside the basic block, there would be an opportunity for some, but not all, of the instructions in the block to execute, violating the definition. Every time we add an IR instruction in LLVM, we add it to a basic block. Writing control flow involves creating several blocks, with each block serving as the destination of a potential jump. We will see this used to compile the `Jump` instruction.
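The rule that control flow may only appear at the end of a block can be modeled in a few lines of plain C++. This is a toy under my own naming (`basic_block`, `add`, `terminate`), with instructions stored as strings; LLVM's `llvm::BasicBlock` is much richer.

```cpp
#include <string>
#include <vector>

// Toy basic block: a label plus straight-line instructions, closed off by
// exactly one control-flow instruction at the end.
struct basic_block {
    std::string label;
    std::vector<std::string> instructions;
    bool terminated = false;

    // Append an ordinary instruction; refused once the block has ended.
    bool add(const std::string& inst) {
        if (terminated) return false;
        instructions.push_back(inst);
        return true;
    }

    // Append the jump or branch that ends the block.
    bool terminate(const std::string& inst) {
        if (!add(inst)) return false;
        terminated = true;
        return true;
    }
};
```

Once `terminate` has been called, further `add` calls fail, so a jump can never end up in the middle of a block. Any code that should run after the jump has to go into a new block.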
Generating LLVM IR
Now that we understand what `llvm::Value` is, and have a vague understanding of how LLVM is structured, let's take a look at the implementations of the `create_*` functions. The simplest is `create_i8`:
{{< codelines "C++" "compiler/08/llvm_context.cpp" 150 152 >}}
Not much to see here. We create an instance of the `llvm::ConstantInt` class from the actual integer given to the method. As we said before, `llvm::ConstantInt` is a subclass of `llvm::Value`. Next up, let's look at `create_pop`:
{{< codelines "C++" "compiler/08/llvm_context.cpp" 160 163 >}}
We first retrieve the `llvm::Function` associated with `stack_pop` from our map, and then use `llvm::IRBuilder::CreateCall` to insert a value that represents a function call into the currently selected basic block (the builder's state is what dictates what the "selected basic block" is). `CreateCall` takes as parameters the function we want to call (`stack_pop`, which we store into the `pop_f` variable), as well as the arguments to the function (for which we pass `f->arg_begin()`).
Hold on. What the heck is `arg_begin()`? Why do we take a function as a parameter to this method? The answer is fairly simple: this method is used when we are generating a function with signature `void f_(struct stack* s)` (we discussed the signature in the previous post). The parameter that we give to `create_pop` is this function we're generating, and `arg_begin()` gets the value that represents the first parameter to our function: `s`! Since `stack_pop` takes a stack, we need to give it the stack we're working on, and so we use `f->arg_begin()` to access it.
Most of the other functions follow this exact pattern, with small deviations. However, another function uses a more complicated LLVM instruction:
{{< codelines "C++" "compiler/08/llvm_context.cpp" 202 209 >}}
`unwrap_num` is used to cast a given node pointer to a pointer to a number node, and then return the integer value from that number node. It starts fairly innocently: we ask LLVM for the type of a pointer to a `node_num` struct, and then use `CreatePointerCast` to create a value that is the same node pointer we're given, but now interpreted as a number node pointer. We now have to access the `value` field of our node. `CreateGEP` helps us with this: given a pointer to a node, and two offsets `n` and `k`, it effectively performs the following:
&(num_pointer[n].kth_field)
The first offset, then, gives an index into the "array" represented by the pointer, while the second offset gives the index of the field we want to access. We want to dereference the pointer (`num_pointer[0]`), and we want the second field (`1`, when counting from 0). Thus, we call `CreateGEP` with these offsets and our pointer.
This still leaves us with a pointer to a number, rather than the number itself. To dereference the pointer, we use `CreateLoad`. This gives us the value of the number node, which we promptly return.
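For comparison, here is roughly what the three LLVM instructions do, written as ordinary C++ rather than IR. The struct layouts are an assumption modeled on the runtime from the earlier posts (a tag in `node_base`, and `value` as the second field of `node_num`); the real runtime may differ.

```cpp
#include <cstdint>

// Assumed layouts: node_num starts with a node_base, followed by the value.
enum node_tag { NODE_APP, NODE_NUM };
struct node_base { node_tag tag; };
struct node_num { node_base base; std::int32_t value; };

std::int32_t unwrap_num(node_base* node) {
    // CreatePointerCast: same address, now viewed as a number node pointer.
    node_num* num_pointer = reinterpret_cast<node_num*>(node);
    // CreateGEP with offsets (0, 1): the address of field 1 of element 0.
    std::int32_t* value_pointer = &num_pointer[0].value;
    // CreateLoad: read the integer stored at that address.
    return *value_pointer;
}
```

The GEP step only computes an address. No memory is read until the load, which is why the two are separate instructions.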
Let's envision a `gen_llvm` method on the `instruction` struct. We need access to all the other functions from our runtime, such as `stack_init`, and functions from our program such as `f_custom_function`. Thus, we need access to our `llvm_context`. The current basic block is part of the builder, which is part of the context, so that's also taken care of. There's only one more thing that we will need, and that's access to the `llvm::Function` that's currently being compiled. To understand why, consider the signature of `f_main` from the previous post:
void f_main(struct stack*);
The function takes a stack as a parameter. What if we want to use this stack in a function call, like `stack_push(s, node)`? We need to have access to the LLVM representation of the stack parameter. The easiest way to do this is to use `llvm::Function::arg_begin()`, which gives the first argument of the function. We thus carry the function pointer throughout our code generation methods.
With these things in mind, here's the signature for `gen_llvm`:
virtual void gen_llvm(llvm_context&, llvm::Function*) const;
{{< todo >}} Fix pointer type inconsistencies. {{< /todo >}}