149 lines
6.9 KiB
Markdown
149 lines
6.9 KiB
Markdown
---
|
|
title: Compiling a Functional Language Using C++, Part 0 - Intro
|
|
date: 2019-08-03T01:02:30-07:00
|
|
tags: ["C and C++", "Functional Languages", "Compilers"]
|
|
description: "In this first post of a larger series, we embark on a journey of developing a compiler for a lazily evaluated functional language."
|
|
---
|
|
During my last academic term, I was enrolled in a compilers course.
|
|
We had a final project - develop a compiler for a basic Python subset,
|
|
using LLVM. It was a little boring - virtually nothing about the compiler
|
|
was __not__ covered in class, and it felt more like putting two puzzle
|
|
pieces together than building a real project.
|
|
|
|
Instead, I chose to implement a compiler for a functional programming language,
|
|
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
|
|
a tutorial_. Since the class was requiring the use of tools based on C++,
|
|
that's what I used for my compiler. It was neat little project, and I
|
|
wanted to share with everyone else how one might go about writing their
|
|
own functional language.
|
|
|
|
### Motivation
|
|
There are two main motivating factors for this series.
|
|
|
|
First, whenever I stumble on a compiler implementation tutorial,
|
|
the language created is always imperative, inspired by C, C++, JavaScript,
|
|
or Python. There are many interesting things about compiling such a language.
|
|
However, I also think that the compilation of a functional language (including
|
|
features like lazy evaluation) is interesting enough, and rarely covered.
|
|
|
|
Second, I'm inspired by books such as _Software Foundations_ that use
|
|
source code as their text. The entire content of _Software Foundations_,
|
|
for instance, is written as comments in Coq source file. This means
|
|
that you can not only read the book, but also run the code and interact with it.
|
|
This makes it very engaging to read. Because of this, I want to provide for
|
|
each post a "snapshot" of the project code. All the code in the posts
|
|
will directly mirror that snapshot. The code you'll be reading will be
|
|
runnable and open.
|
|
|
|
### Overview
|
|
Let's go over some preliminary information before we embark on this journey.
|
|
|
|
#### The "classic" stages of a compiler
|
|
Let's take a look at the high level overview of what a compiler does.
|
|
Conceptually, the components of a compiler are pretty cleanly separated.
|
|
They are as follows:
|
|
|
|
1. Tokenizing / lexical analysis
|
|
2. Parsing
|
|
3. Analysis / optimization
|
|
5. Code Generation
|
|
|
|
There are many variations on this structure. Some compilers don't optimize
|
|
at all, some translate the program text into an intermediate representation,
|
|
an alternative way of representing the program that isn't machine code.
|
|
In some compilers, the stages of parsing and analysis can overlap.
|
|
In short, just like the pirate's code, it's more of a guideline than a rule.
|
|
|
|
#### What we will cover
|
|
We'll go through the stages of a compiler, starting from scratch
|
|
and building up our project. We'll cover:
|
|
|
|
* Tokenizing using regular expressions and Flex.
|
|
* Parsing using context free grammars and Bison.
|
|
* Monomorphic type checking (including typing rules).
|
|
* Evaluation using graph reduction and the G-Machine.
|
|
* Compiling G-Machine instructions to machine code using LLVM.
|
|
|
|
We'll be creating a __lazily evaluated__, __functional__ language.
|
|
|
|
#### What we won't cover
|
|
Surely a guide written by one person can't be comprehensive. Not only
|
|
do I have a finite amount of time to share this informatiom with you,
|
|
but I also don't want to dilute the content of the posts in this series. Furthermore,
|
|
many things that we'll be using for this tutorial are taught
|
|
by numerous other sources, and those sources do a better job than I would.
|
|
So, here are some things that you might want to know for this series,
|
|
which won't be covered by the series itself:
|
|
|
|
* [Theory of computation](https://en.wikipedia.org/wiki/Theory_of_computation),
|
|
or, more specifically, [automata theory](https://en.wikipedia.org/wiki/Automata_theory).
|
|
Deterministic and nondeterministic finite automata are briefly mentioned
|
|
during tokenizing, and context free grammars are used in our parser. However,
|
|
I don't provide a real explanation for either of those things.
|
|
* [Functional programming](https://en.wikipedia.org/wiki/Functional_programming),
|
|
with a touch of [lambda calculus](https://en.wikipedia.org/wiki/Lambda_calculus).
|
|
We jump straight into implementing the concepts from these.
|
|
* C++. I do my best to use correct and modern C++, but I'm not an expert. I will
|
|
not explain the syntax / semantics of C++ code included in these posts, but I will
|
|
explain what my code does in the context of compilers.
|
|
|
|
#### The syntax of our language
|
|
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
|
|
that most functional languages are very similar, and vary largely in syntax. That's
|
|
our main degree of freedom. We want to represent the following things, for sure:
|
|
|
|
* Defining functions
|
|
* Applying functions
|
|
* Arithmetic
|
|
* Algebraic data types (to represent lists, pairs, and the like)
|
|
* Pattern matching (to operate on data types)
|
|
|
|
We can additionally support anonymous (lambda) functions, but compiling those
|
|
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
|
|
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
|
|
Function application isn't much more difficult - `f x` means "apply f to x", and
|
|
`f x + g x` means sum the result of applying f to x and g to x. That is, function
|
|
application has higher precedence, or __binds tighter__ than binary operators like plus.
|
|
|
|
Next, let's define the syntax for declaring a function. Why not:
|
|
```
|
|
defn f x = { x + x }
|
|
```
|
|
|
|
As for declaring data types:
|
|
```
|
|
data List = { Nil, Cons Int List }
|
|
```
|
|
Notice that we are avoiding polymorphism here.
|
|
|
|
Let's also define a syntax for pattern matching:
|
|
```
|
|
case l of {
|
|
Nil -> { 0 }
|
|
Cons x xs -> { x }
|
|
}
|
|
```
|
|
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
|
|
constructed from an integer and another list (as defined in our `data` example),
|
|
return the integer".
|
|
|
|
That's it for the introduction! In the next post, we'll cover tokenizng, which is
|
|
the first step in converting source code into an executable program.
|
|
|
|
### Navigation
|
|
Here are the posts that I've written so far for this series:
|
|
|
|
* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
|
|
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
|
|
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})
|
|
* [Small Improvements]({{< relref "04_compiler_improvements.md" >}})
|
|
* [Execution]({{< relref "05_compiler_execution.md" >}})
|
|
* [Compilation]({{< relref "06_compiler_compilation.md" >}})
|
|
* [Runtime]({{< relref "07_compiler_runtime.md" >}})
|
|
* [LLVM]({{< relref "08_compiler_llvm.md" >}})
|
|
* [Garbage Collection]({{< relref "09_compiler_garbage_collection.md" >}})
|
|
* [Polymorphism]({{< relref "10_compiler_polymorphism.md" >}})
|
|
* [Polymorphic Data Types]({{< relref "11_compiler_polymorphic_data_types.md" >}})
|
|
* [Let/In and Lambdas]({{< relref "12_compiler_let_in_lambda/index.md" >}})
|
|
|