blog-static/content/blog/00_compiler_intro.md

118 lines
4.9 KiB
Markdown
Raw Normal View History

---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzles
pieces together than building a real project.
Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class was requiring the use of tools based on C++,
that's what I used for my compiler. It was neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.
### Motivation
There are two main motivating factors for this series.
First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting enough, and rarely covered.
Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source file. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.
### Overview
Let's go over some preliminary information before we embark on this journey.
#### The "classic" stages of a compiler
Let's take a look at the high level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as gollows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
5. Code Generation
There are many variations on this structure. Some compilers don't optimize
at all, some translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
#### What we'll cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:
* Tokenizing using regular expressions and Flex.
* Parsing using context free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.
We'll be creating a __lazily evaluated__, __functional__ language.
#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or __binds tighter__ than binary operators like plus.
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
Nil -> { 0 }
Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
That's it for the introduction! In the next post, we'll cover tokenizng, which is
the first step in coverting source code into an executable program.
### Navigation
Here are the posts that I've written so far for this series:
* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})