---
title: Compiling a Functional Language Using C++, Part 8 - LLVM
date: 2019-10-30T22:16:22-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---

We don't want a compiler that can only generate code for a single
platform. Our language should work on macOS, Windows, and Linux,
on x86\_64, ARM, and maybe some other architectures. We also
don't want to manually implement the compiler for each platform,
dealing with the specifics of each architecture and operating
system.

This is where LLVM comes in. LLVM (the name began as an acronym for __Low Level Virtual Machine__)
is a project that presents us with a kind of generic assembly language,
an __Intermediate Representation__ (IR). It also provides tooling to compile the
IR into platform-specific instructions, as well as to apply a host of
optimizations. We can thus translate our G-machine instructions to LLVM IR,
and then use LLVM to generate machine code, which gets us to our ultimate
goal of compiling our language.

We start by adding LLVM to our CMake project:
{{< codelines "CMake" "compiler/08/CMakeLists.txt" 7 7 >}}

LLVM is a huge project with many components, and we don't need
most of them. We do need the core libraries, the x86 assembly
generator, and the x86 assembly parser. I'm
not sure why we need the last one, but I ran into linking
errors without it. We find the required link targets
for these components using this CMake command:

{{< codelines "CMake" "compiler/08/CMakeLists.txt" 19 20 >}}

Finally, we add the new include directories, link targets,
and definitions to our compiler executable:

{{< codelines "CMake" "compiler/08/CMakeLists.txt" 39 41 >}}

Great, we have the infrastructure updated to work with LLVM. It's
now time to start using the LLVM API to compile our G-machine instructions
into assembly. We start with `LLVMContext`. The LLVM documentation states:

> This is an important class for using LLVM in a threaded context.
> It (opaquely) owns and manages the core "global" data of LLVM's core infrastructure, including the type and constant uniquing tables.

We will have exactly one instance of such a class in our program.

Additionally, we want an `IRBuilder`, which will help us generate IR instructions,
placing them into basic blocks (more on that in a bit). Also, we want
a `Module` object, which represents some collection of code and declarations
(perhaps like a C++ source file). Let's keep these things in our own
`llvm_context` class. Here's what that looks like:

{{< codeblock "C++" "compiler/08/llvm_context.hpp" >}}

{{< todo >}} Consistently name context / state.{{< /todo >}}
{{< todo >}} Explain creation functions. {{< /todo >}}

We include the LLVM context, builder, and module as members
of the context struct. Since the builder and the module need
the context, we initialize them in the constructor, where they
can safely reference it.

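To see why this ordering matters, here is a minimal sketch of the same pattern
using stand-in types rather than the real LLVM classes. The names `fake_context`,
`fake_builder`, `fake_module`, and `context_holder` are hypothetical and exist only
for illustration. The point is that C++ initializes members in declaration order,
so the context has to be declared before anything that refers to it.

```C++
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for llvm::LLVMContext.
struct fake_context {};

// Hypothetical stand-in for llvm::IRBuilder: it keeps a reference to the context.
struct fake_builder {
    fake_context& ctx;
    explicit fake_builder(fake_context& c) : ctx(c) {}
};

// Hypothetical stand-in for llvm::Module: a name plus a reference to the context.
struct fake_module {
    std::string name;
    fake_context& ctx;
    fake_module(std::string n, fake_context& c) : name(std::move(n)), ctx(c) {}
};

struct context_holder {
    // Members are initialized in declaration order, so `ctx` must come first.
    fake_context ctx;
    fake_builder builder;
    fake_module mod;

    context_holder() : builder(ctx), mod("example_module", ctx) {
        // Both members refer to the one context owned by this object.
        assert(&builder.ctx == &ctx);
        assert(&mod.ctx == &ctx);
    }
};
```

If `builder` were declared above `ctx`, its constructor would receive a reference
to a context that has not been constructed yet, which is only safe if the
constructor never touches it.
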
Besides these fields, we added
a few others, namely the `functions` and `struct_types` maps,
and various fields whose types are `llvm::Type` subclasses, such as `stack_type`.
We did this because we want to be able to call our runtime
functions (and use our runtime structs) from LLVM. To generate
a function call from LLVM, we need to have access to an
`llvm::Function` object. We thus want to have an `llvm::Function`
object for each runtime function we want to call. We could declare
a member variable in our `llvm_context` for each runtime function,
but it's easier to leave this as an implementation
detail, and only have a dynamically created map between runtime
function names and their corresponding `llvm::Function` objects.

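The name-keyed map can be sketched without LLVM at all. In the snippet below,
`fake_function` is a hypothetical stand-in for `llvm::Function`, `runtime_registry`
is a made-up name, and the type names are plain strings chosen for illustration.
It only shows the shape of the idea: declare each runtime function once, then look
it up by name wherever a call needs to be generated.

```C++
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for llvm::Function: just enough to describe a declaration.
struct fake_function {
    std::string name;
    std::string return_type;
    std::vector<std::string> param_types;
};

struct runtime_registry {
    std::map<std::string, fake_function> functions;

    // Registers a runtime function under its own name.
    void declare(const std::string& name, const std::string& return_type,
                 std::vector<std::string> param_types) {
        bool inserted = functions
            .emplace(name, fake_function{name, return_type, std::move(param_types)})
            .second;
        assert(inserted && "runtime function declared twice");
        (void) inserted; // avoids an unused-variable warning when NDEBUG is set
    }

    // Looks a function up by name, failing loudly on a typo.
    const fake_function& get(const std::string& name) const {
        auto it = functions.find(name);
        if (it == functions.end())
            throw std::out_of_range("unknown runtime function: " + name);
        return it->second;
    }
};
```

Adding a new runtime function then means one more `declare` call, rather than a
new member variable plus its initialization.
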
We populate the maps and other type-related variables in
two methods, `create_functions()` and `create_types()`. To
create an `llvm::Function`, we must provide an `llvm::FunctionType`,
a linkage type (one of the `llvm::GlobalValue::LinkageTypes` values),
the name of the function, and the module
in which the function is declared. Since we only have one
module (the one we initialized in the constructor), that's
the module we pass in. The name of the function is the same
as its name in the runtime, and the linkage type is always
external. The only remaining parameter is
the `llvm::FunctionType`, which is created using code like:

{{< todo >}} Why external? {{< /todo >}}

```C++
llvm::FunctionType::get(return_type, {param_type_1, param_type_2, ...}, is_variadic)
```

Declaring all the functions and types in our runtime is mostly
just tedious. Here are a few lines from `create_types()`, from
which you can extrapolate the rest:

{{< codelines "C++" "compiler/08/llvm_context.cpp" 7 11 >}}

{{< todo >}} Also show struct body setters. {{< /todo >}}

Similarly, here are a few lines from `create_functions()`, which
give a very good idea of the rest of that method:

{{< codelines "C++" "compiler/08/llvm_context.cpp" 20 27 >}}

This completes our implementation of the context.

### LLVM IR
It's now time to look at generating actual code for each G-machine instruction.
Before we do this, we need to get a little bit of an understanding of what LLVM
IR is like. An important property of LLVM IR is that it is in __Static Single Assignment__
(SSA) form. This means that each variable can only be assigned to once. For instance,
if we use `<-` to represent assignment, the following program is valid:

```
x <- 1
y <- 2
z <- x + y
```

However, the following program is __not__ valid:

```
x <- 1
x <- x + 1
```

But what if we __do__ want to modify a variable `x`?
We can declare another "version" of `x` every time we modify it.
For instance, if we wanted to increment `x` twice, we'd do this:

```
x <- 1
x1 <- x + 1
x2 <- x1 + 1
```

In practice, LLVM's C++ API can take care of versioning variables on its own, by
auto-incrementing numbers associated with each variable we use.

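The renaming idea can be sketched in a few lines of plain C++. This is a toy
for straight-line code only, not how LLVM works internally; the `assignment`
struct and the `to_ssa` function are hypothetical names made up for this example.
Each time a name is assigned again, we mint a fresh version of it and make later
reads refer to that version.

```C++
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One straight-line assignment: target <- lhs + rhs.
// An empty `rhs` means a plain copy or constant: target <- lhs.
struct assignment {
    std::string target;
    std::string lhs;
    std::string rhs;
};

// Rewrites a sequence of assignments so that every name is assigned exactly once.
// Assumes the input never uses a name like "x1" directly, or versions could collide.
std::vector<assignment> to_ssa(const std::vector<assignment>& program) {
    std::map<std::string, int> versions;         // times each name was assigned so far
    std::map<std::string, std::string> current;  // latest version of each source name

    // Operands that were never assigned (constants like "1") pass through unchanged.
    auto lookup = [&](const std::string& name) -> std::string {
        auto it = current.find(name);
        return it == current.end() ? name : it->second;
    };

    std::vector<assignment> result;
    for (const auto& a : program) {
        assert(!a.target.empty() && "every assignment needs a target");

        // Resolve operands before renaming the target, so `x <- x + 1` reads the old x.
        std::string lhs = lookup(a.lhs);
        std::string rhs = a.rhs.empty() ? std::string() : lookup(a.rhs);

        int version = versions[a.target]++;
        std::string fresh =
            version == 0 ? a.target : a.target + std::to_string(version);
        current[a.target] = fresh;
        result.push_back({fresh, lhs, rhs});
    }
    return result;
}
```

Feeding it `x <- 1`, `x <- x + 1`, `x <- x + 1` yields the targets `x`, `x1`, and
`x2`, matching the hand-written version above.
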
We need not get too deep into the specifics of LLVM IR's textual
representation, since we will largely be working with the C++
API to interact with it. We do, however, need to understand one more
concept from the world of compiler design: __basic blocks__. A basic
block is a sequence of instructions that are guaranteed to be executed
one after another. This means that a basic block cannot have
an if/else, jump, or any other type of control flow anywhere
except at the end. If control flow could appear inside the basic block,
there would be an opportunity for execution of some, but not all,
instructions in the block, violating the definition. Every time
we add an IR instruction in LLVM, we add it to a basic block.
Writing control flow involves creating several blocks, with each
block serving as the destination of a potential jump. We will
see this used to compile the Jump instruction.

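To make the definition concrete, here is a small sketch that cuts a flat list of
instructions into basic blocks. It does not use LLVM; the `instr` struct and the
`split_blocks` and `well_formed` functions are hypothetical names for this example,
and the flags on each instruction are assumed to be known in advance. A new block
starts after every branch and at every instruction that a branch may jump to.

```C++
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct instr {
    std::string text;
    bool is_branch;  // jump, conditional branch, or return
    bool is_target;  // some branch elsewhere may jump to this instruction
};

using block = std::vector<instr>;

// Checks the defining property: control flow appears only at the end of a block,
// and control can only enter a block at its first instruction.
bool well_formed(const std::vector<block>& blocks) {
    for (const auto& b : blocks) {
        if (b.empty()) return false;
        for (std::size_t i = 0; i < b.size(); ++i) {
            if (b[i].is_branch && i + 1 != b.size()) return false;
            if (b[i].is_target && i != 0) return false;
        }
    }
    return true;
}

std::vector<block> split_blocks(const std::vector<instr>& code) {
    std::vector<block> blocks;
    for (const auto& i : code) {
        // Start a new block at the very beginning, at a jump target,
        // or right after the previous instruction branched away.
        if (blocks.empty() || i.is_target || blocks.back().back().is_branch)
            blocks.emplace_back();
        blocks.back().push_back(i);
    }
    assert(well_formed(blocks));
    return blocks;
}
```

Note that this sketch lets a block simply fall through into the next one. LLVM is
stricter: every basic block there must end in an explicit terminator instruction,
such as a branch or a return.
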
### Generating LLVM
Let's envision a `gen_llvm` method on the `instruction` struct.
We need access to all the other functions from our runtime,
such as `stack_init`, and functions from our program such
as `f_custom_function`. Thus, we need access to our
`llvm_context`. The current basic block is part
of the builder, which is part of the context, so that's
also taken care of. There's only one more thing that we will
need, and that's access to the `llvm::Function` that's
currently being compiled. To understand why, consider
the signature of `f_main` from the previous post:

```C
void f_main(struct stack*);
```

The function takes a stack as a parameter. What if
we want to use this stack in a function call, like
`stack_push(s, node)`? We need to have access to the
LLVM representation of the stack parameter. The easiest
way to do this is to use `llvm::Function::arg_begin()`,
which returns an iterator pointing to the first argument
of the function. We thus
carry the function pointer throughout our code generation
methods.

With these things in mind, here's the signature for `gen_llvm`:

```C++
virtual void gen_llvm(llvm_context&, llvm::Function*) const;
```

{{< todo >}} Fix pointer type inconsistencies. {{< /todo >}}