blog-static/content/blog/00_compiler_intro.md
Danila Fedorin 035b98a602
2020-05-09 17:29:57 -07:00


---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
description: "In this first post of a larger series, we embark on a journey of developing a compiler for a lazily evaluated functional language."
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.
Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class required the use of tools based on C++,
that's what I used for my compiler. It was a neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.
### Motivation
There are two main motivating factors for this series.
First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting enough, and rarely covered.
Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source files. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.
### Overview
Let's go over some preliminary information before we embark on this journey.
#### The "classic" stages of a compiler
Let's take a look at a high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code generation
There are many variations on this structure. Some compilers don't optimize
at all; others translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
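
To make the separation of stages a little more concrete, here is a minimal C++ sketch of a pipeline for sums of integers like `1 + 2 + 3`. It is not the compiler we'll build in this series: the names `Token`, `Expr`, `tokenize`, `parse`, `generate`, and `compile`, as well as the `push` / `add` stack instructions, are all made up for this illustration, and the analysis / optimization stage is skipped entirely.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stage 1 output: a flat list of tokens (hypothetical representation).
struct Token {
    enum class Kind { Number, Plus };
    Kind kind;
    int value; // used only by Number tokens
};

// Stage 2 output: a tree. A node is either a number or a sum of two subtrees.
struct Expr {
    int value = 0;
    std::unique_ptr<Expr> left, right; // both set for a sum, both null for a number
};

// Tokenizing: turn characters into tokens.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) {
            i++;
        } else if (c == '+') {
            tokens.push_back({Token::Kind::Plus, 0});
            i++;
        } else if (std::isdigit(c)) {
            int value = 0;
            while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) {
                value = value * 10 + (text[i++] - '0');
            }
            tokens.push_back({Token::Kind::Number, value});
        } else {
            throw std::runtime_error("unexpected character");
        }
    }
    return tokens;
}

// Parsing: turn tokens into a left-leaning tree of sums.
std::unique_ptr<Expr> parse(const std::vector<Token>& tokens) {
    std::size_t pos = 0;
    auto number = [&]() -> std::unique_ptr<Expr> {
        if (pos >= tokens.size() || tokens[pos].kind != Token::Kind::Number) {
            throw std::runtime_error("expected a number");
        }
        std::unique_ptr<Expr> leaf(new Expr());
        leaf->value = tokens[pos++].value;
        return leaf;
    };
    std::unique_ptr<Expr> tree = number();
    while (pos < tokens.size()) {
        if (tokens[pos].kind != Token::Kind::Plus) {
            throw std::runtime_error("expected '+'");
        }
        pos++;
        std::unique_ptr<Expr> sum(new Expr());
        sum->left = std::move(tree);
        sum->right = number();
        tree = std::move(sum);
    }
    return tree;
}

// Code generation: walk the tree and emit made-up stack machine instructions.
void generate(const Expr& expr, std::vector<std::string>& out) {
    if (expr.left && expr.right) {
        generate(*expr.left, out);
        generate(*expr.right, out);
        out.push_back("add");
    } else {
        out.push_back("push " + std::to_string(expr.value));
    }
}

// The whole pipeline: each stage consumes only the previous stage's output.
std::vector<std::string> compile(const std::string& text) {
    std::vector<std::string> code;
    generate(*parse(tokenize(text)), code);
    return code;
}
```

Under these assumptions, `compile("1 + 2 + 3")` produces `push 1`, `push 2`, `add`, `push 3`, `add`. Each stage only sees the output of the one before it, which is what makes them easy to separate.
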
#### What we will cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:
* Tokenizing using regular expressions and Flex.
* Parsing using context-free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.
We'll be creating a __lazily evaluated__, __functional__ language.
#### What we won't cover
Surely a guide written by one person can't be comprehensive. Not only
do I have a finite amount of time to share this information with you,
but I also don't want to dilute the content of the posts in this series. Furthermore,
many things that we'll be using for this tutorial are taught
by numerous other sources, and those sources do a better job than I would.
So, here are some things that you might want to know for this series,
which won't be covered by the series itself:
* [Theory of computation](https://en.wikipedia.org/wiki/Theory_of_computation),
or, more specifically, [automata theory](https://en.wikipedia.org/wiki/Automata_theory).
Deterministic and nondeterministic finite automata are briefly mentioned
during tokenizing, and context-free grammars are used in our parser. However,
I don't provide a real explanation for either of those things.
* [Functional programming](https://en.wikipedia.org/wiki/Functional_programming),
with a touch of [lambda calculus](https://en.wikipedia.org/wiki/Lambda_calculus).
We jump straight into implementing the concepts from these.
* C++. I do my best to use correct and modern C++, but I'm not an expert. I will
not explain the syntax / semantics of C++ code included in these posts, but I will
explain what my code does in the context of compilers.
#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means "sum the results of applying f to x and g to x". That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.
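
One way to see how these precedence levels fit together is a tiny hand-written evaluator with one function per level. This is only a sketch, not the parser from the later posts (that one uses Flex and Bison): the `toy_evaluator` class and the `evaluate` helper are hypothetical, function arguments are limited to number literals, and `f` and `g` are hard-coded as "double" and "add one" purely for the demonstration.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy evaluator: one parsing function per precedence level.
class toy_evaluator {
public:
    using function_map = std::map<char, std::function<int(int)>>;

    toy_evaluator(std::string text, function_map fns)
        : src(std::move(text)), functions(std::move(fns)) {}

    int run() {
        int result = parse_sum();
        skip_spaces();
        if (pos != src.size()) throw std::runtime_error("unexpected trailing input");
        return result;
    }

private:
    std::string src;
    function_map functions;
    std::size_t pos = 0;

    void skip_spaces() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) pos++;
    }

    // Skip whitespace, then consume `c` if it is the next character.
    bool accept(char c) {
        skip_spaces();
        if (pos < src.size() && src[pos] == c) { pos++; return true; }
        return false;
    }

    // Lowest precedence: sum := product ('+' product)*
    int parse_sum() {
        int value = parse_product();
        while (accept('+')) value += parse_product();
        return value;
    }

    // Middle precedence: product := application ('*' application)*
    int parse_product() {
        int value = parse_application();
        while (accept('*')) value *= parse_application();
        return value;
    }

    // Highest precedence: application := name number | number
    int parse_application() {
        skip_spaces();
        if (pos < src.size() && std::isalpha(static_cast<unsigned char>(src[pos]))) {
            auto it = functions.find(src[pos++]);
            if (it == functions.end()) throw std::runtime_error("unknown function");
            return it->second(parse_number());
        }
        return parse_number();
    }

    int parse_number() {
        skip_spaces();
        if (pos >= src.size() || !std::isdigit(static_cast<unsigned char>(src[pos]))) {
            throw std::runtime_error("expected a number");
        }
        int value = 0;
        while (pos < src.size() && std::isdigit(static_cast<unsigned char>(src[pos]))) {
            value = value * 10 + (src[pos++] - '0');
        }
        return value;
    }
};

// Evaluate `text` with two made-up functions: f doubles, g adds one.
int evaluate(const std::string& text) {
    toy_evaluator::function_map fns;
    fns['f'] = [](int x) { return x * 2; };
    fns['g'] = [](int x) { return x + 1; };
    return toy_evaluator(text, fns).run();
}
```

With these definitions, `evaluate("3+2*6")` gives 15, and `evaluate("g 3 * 2")` gives 8 rather than 7, because `g 3` is evaluated before the multiplication.
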
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
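
If you're more at home in C++ than in functional languages, it may help to see roughly what the `data` and `case` examples correspond to. The sketch below is only an analogy, and not necessarily how the compiler in this series represents data: the `List` struct, its `Tag` enum, and the helpers `nil`, `cons`, and `head_or_zero` are all hypothetical names.

```cpp
#include <memory>
#include <utility>

// Hypothetical C++ model of `data List = { Nil, Cons Int List }`.
struct List {
    enum class Tag { Nil, Cons };
    Tag tag;
    int head;                         // meaningful only when tag == Tag::Cons
    std::shared_ptr<const List> tail; // meaningful only when tag == Tag::Cons
};

using ListPtr = std::shared_ptr<const List>;

// The two constructors from the `data` declaration.
ListPtr nil() {
    return std::make_shared<List>(List{List::Tag::Nil, 0, nullptr});
}

ListPtr cons(int head, ListPtr tail) {
    return std::make_shared<List>(List{List::Tag::Cons, head, std::move(tail)});
}

// Rough equivalent of:
//   case l of { Nil -> { 0 }  Cons x xs -> { x } }
int head_or_zero(const ListPtr& l) {
    switch (l->tag) {
    case List::Tag::Nil:
        return 0;
    case List::Tag::Cons:
        return l->head;
    }
    return 0; // not reached for well-formed lists
}
```

Here, `head_or_zero(cons(7, cons(8, nil())))` returns 7, and `head_or_zero(nil())` returns 0, matching the two branches of the `case` expression.
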
That's it for the introduction! In the next post, we'll cover tokenizing, which is
the first step in converting source code into an executable program.
### Navigation
Here are the posts that I've written so far for this series:
* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})
* [Small Improvements]({{< relref "04_compiler_improvements.md" >}})
* [Execution]({{< relref "05_compiler_execution.md" >}})
* [Compilation]({{< relref "06_compiler_compilation.md" >}})
* [Runtime]({{< relref "07_compiler_runtime.md" >}})
* [LLVM]({{< relref "08_compiler_llvm.md" >}})
* [Garbage Collection]({{< relref "09_compiler_garbage_collection.md" >}})
* [Polymorphism]({{< relref "10_compiler_polymorphism.md" >}})
* [Polymorphic Data Types]({{< relref "11_compiler_polymorphic_data_types.md" >}})