Add an introduction post, and update other posts to match
parent 619c346897
commit 94d242414f
117
content/blog/00_compiler_intro.md
Normal file
@@ -0,0 +1,117 @@
---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.

Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class required the use of tools based on C++,
that's what I used for my compiler. It was a neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.

### Motivation
There are two main motivating factors for this series.

First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting too, yet rarely covered.

Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source files. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.

### Overview
Let's go over some preliminary information before we embark on this journey.

#### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:

1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation

There are many variations on this structure. Some compilers don't optimize
at all; others translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.

#### What we'll cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:

* Tokenizing using regular expressions and Flex.
* Parsing using context free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.

We'll be creating a __lazily evaluated__, __functional__ language.

#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and differ mainly in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:

* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)

We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means "sum the results of applying f to x and g to x". That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.
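
To make "binds tighter" concrete, here is a minimal C++ sketch of how precedence can fall out of the shape of a grammar. It is not the parser this series builds (that one uses Bison); the function names are made up for illustration, and it only handles non-negative integers, `+`, and `*`, with no whitespace. Because a sum is defined in terms of products, multiplication is grouped first.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: look at the current character, or '\0' at the end.
char peek(const std::string& text, std::size_t pos) {
    return pos < text.size() ? text[pos] : '\0';
}

// number := digit+
int parse_number(const std::string& text, std::size_t& pos) {
    assert(std::isdigit(static_cast<unsigned char>(peek(text, pos))) && "expected a digit");
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(peek(text, pos)))) {
        value = value * 10 + (text[pos] - '0');
        ++pos;
    }
    return value;
}

// product := number ('*' number)*
int parse_product(const std::string& text, std::size_t& pos) {
    int result = parse_number(text, pos);
    while (peek(text, pos) == '*') {
        ++pos;  // skip '*'
        result *= parse_number(text, pos);
    }
    return result;
}

// sum := product ('+' product)*
// A sum is made of products, so '*' binds tighter than '+'.
int parse_sum(const std::string& text, std::size_t& pos) {
    int result = parse_product(text, pos);
    while (peek(text, pos) == '+') {
        ++pos;  // skip '+'
        result += parse_product(text, pos);
    }
    return result;
}

// Evaluate a whole expression such as "3+2*6".
int evaluate(const std::string& text) {
    std::size_t pos = 0;
    int result = parse_sum(text, pos);
    assert(pos == text.size() && "unexpected trailing input");
    return result;
}
```

Calling `evaluate("3+2*6")` groups the input as `3+(2*6)`. Function application could be given an even tighter level in the same way, by adding a rule that sits below `product`.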

Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```

As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
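
For readers more at home in C++, it may help to see roughly what the `List` declaration describes. The sketch below is only an analogy, not how our compiler will represent data at runtime: the struct layout and the helper names `make_nil`, `make_cons`, and `length` are assumptions made up for this example. A value is either `Nil`, or a `Cons` holding an integer and another list.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// One node of the list: the tag says which constructor built it.
struct List {
    enum class Tag { Nil, Cons };
    Tag tag = Tag::Nil;
    int head = 0;                // meaningful only when tag == Tag::Cons
    std::unique_ptr<List> tail;  // meaningful only when tag == Tag::Cons
};

using ListPtr = std::unique_ptr<List>;

// Corresponds to the constructor `Nil`.
ListPtr make_nil() {
    return std::make_unique<List>();
}

// Corresponds to the constructor `Cons Int List`.
ListPtr make_cons(int head, ListPtr tail) {
    assert(tail != nullptr && "the tail must be a list; use make_nil() for the empty one");
    auto node = std::make_unique<List>();
    node->tag = List::Tag::Cons;
    node->head = head;
    node->tail = std::move(tail);
    return node;
}

// Count the Cons cells before reaching Nil.
int length(const List& list) {
    int count = 0;
    for (const List* node = &list; node->tag == List::Tag::Cons; node = node->tail.get()) {
        ++count;
    }
    return count;
}
```

With these helpers, `make_cons(1, make_cons(2, make_nil()))` builds the two-element list that our language would write as `Cons 1 (Cons 2 Nil)`, assuming parentheses are used for grouping.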

Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0; otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
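
As a rough C++ analogy (again a sketch under assumptions, not the code our compiler will generate), a `case` expression behaves like a `switch` on which constructor built the value. The `List` struct here is a hypothetical tagged encoding chosen for illustration, and `head_or_zero` is a made-up name for the `case` expression from the example.

```cpp
#include <cassert>
#include <memory>

// Hypothetical encoding of `data List = { Nil, Cons Int List }`.
struct List {
    enum class Tag { Nil, Cons };
    Tag tag = Tag::Nil;
    int head = 0;                // the `x` in `Cons x xs`
    std::unique_ptr<List> tail;  // the `xs` in `Cons x xs`
};

// Mirrors: case l of { Nil -> { 0 }  Cons x xs -> { x } }
int head_or_zero(const List& l) {
    switch (l.tag) {
    case List::Tag::Nil:
        return 0;
    case List::Tag::Cons:
        // A Cons cell always carries a tail, even though this branch ignores it.
        assert(l.tail != nullptr);
        return l.head;
    }
    return 0;  // not reached for valid tags; keeps compilers from warning
}
```

Each branch of the `switch` plays the role of one pattern, and the fields read inside a branch correspond to the variables that the pattern binds.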

That's it for the introduction! In the next post, we'll cover tokenizing, which is
the first step in converting source code into an executable program.

### Navigation
Here are the posts that I've written so far for this series:

* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})
@@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
 tags: ["C and C++", "Functional Languages", "Compilers"]
 draft: true
 ---
-During my last academic term, I was enrolled in a compilers course.
-We had a final project - develop a compiler for a basic Python subset,
-using LLVM. It was a little boring - virtually nothing about the compiler
-was __not__ covered in class, and it felt more like putting two puzzles
-pieces together than building a real project.
+It makes sense to build a compiler bit by bit, following the stages we outlined in
+the first post of the series. This is because these stages are essentially a pipeline,
+with program text coming in one end, and the final program coming out of the other.
+So as we build up our pipeline, we'll be able to push program text further and further,
+until eventually we get something that we can run on our machine.
 
-Being involved of the Programming Language Theory (PLT) research group at my
-university, I decided to do something different for the final project -
-a compiler for a functional language. In a series of posts, starting with
-thise one, I will explain what I did so that those interested in the subject
-are able to replicate my steps, and maybe learn something for themselves.
-
-### The "classic" stages of a compiler
-Let's take a look at the high level overview of what a compiler does.
-Conceptually, the components of a compiler are pretty cleanly separated.
-They are as gollows:
-
-1. Tokenizing / lexical analysis
-2. Parsing
-3. Analysis / optimization
-5. Code Generation
-
-There are many variations on this structure. Some compilers don't optimize
-at all, some translate the program text into an intermediate representation,
-an alternative way of representing the program that isn't machine code.
-In some compilers, the stages of parsing and analysis can overlap.
-In short, just like the pirate's code, it's more of a guideline than a rule.
-
-### Tokenizing and Parsing (the "boring stuff")
-It makes sense to build a compiler bit by bit, following the stages we outlined above.
-This is because these stages are essentially a pipeline, with program text
-coming in one end, and the final program coming out of the other. So as we build
-up our pipeline, we'll be able to push program text further and further, until
-eventually we get something that we can run on our machine.
-
 This is how most tutorials go about building a compiler, too. The result is that
-there are a __lot__ of tutorials covering tokenizing and parsing. This is why
-I refer to this part of the process as "boring". Nonetheless, I will cover the steps
-required to tokenize and parse our little functional language. But before we do that,
-we first need to have an idea of what our language looks like.
+there are a __lot__ of tutorials covering tokenizing and parsing.
+Nonetheless, I will cover the steps required to tokenize and parse our little functional
+language. Before we start, it might help to refresh your memory about
+the syntax of the language, which we outlined in the
+[previous post]({{< relref "00_compiler_intro.md" >}}).
 
-### The Grammar
-Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
-that most functional languages are very similar, and vary largely in syntax. That's
-our main degree of freedom. We want to represent the following things, for sure:
-
-* Defining functions
-* Applying functions
-* Arithmetic
-* Algebraic data types (to represent lists, pairs, and the like)
-* Pattern matching (to operate on data types)
-
-We can additionally support anonymous (lambda) functions, but compiling those
-is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
-define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
-Function application isn't much more difficult - `f x` means "apply f to x", and
-`f x + g x` means sum the result of applying f to x and g to x. That is, function
-application has higher precedence, or __binds tighter__ than binary operators like plus.
-
-Next, let's define the syntax for declaring a function. Why not:
-```
-defn f x = { x + x }
-```
-
-As for declaring data types:
-```
-data List = { Nil, Cons Int List }
-```
-Notice that we are avoiding polymorphism here.
-
-Let's also define a syntax for pattern matching:
-```
-case l of {
-Nil -> { 0 }
-Cons x xs -> { x }
-}
-```
-The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
-constructed from an integer and another list (as defined in our `data` example),
-return the integer".
-
-That's it for now! Let's take a look at tokenizing.
-
-### Tokenizing
 When we first get our program text, it's in a representation difficult for us to make
 sense of. If we look at how it's represented in C++, we see that it's just an array
 of characters (potentially hundreds, thousands, or millions in length). We __could__
 jump straight to parsing the text (which involves creating a tree structure, known
 as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
 in fact, in functional languages, tokenizing is frequently skipped. However,
-in our closer-to-metal language, it happens to be more convenient to first break the
+in our closer-to-metal language (C++), it happens to be more convenient to first break the
 input text into a bunch of distinct segments (tokens).
 
 For example, consider the string "320+6". If we skip tokenizing and go straight
@@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
 Our parser, then, won't have to care about recognizing a number - it will just know
 that a number is next in the string, and do with that information what it needs.
 
+### The Theory
 How do we go about breaking up a string into tokens? We need to come up with a
 way to compare some characters in a string against a set of rules. But "rules"
 is a very general term - we could, for instance, define a particular
@@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
 can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
 a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
 
-#### The Theory
 So how does one go about checking if a regular expression matches a string? An efficient way is to
 first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
 by literally translating each part of it to a series of states, one-to-one. This machine is called
@@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
 draft: true
 tags: ["C and C++", "Functional Languages", "Compilers"]
 ---
-I called tokenizing and parsing boring, but I think I failed to articulate
-the real reason that I feel this way. The thing is, looking at syntax
+I think tokenizing and parsing are boring. The thing is, looking at syntax
 is a pretty shallow measure of how interesting a language is. It's like
 the cover of a book. Every language has one, and it so happens that to make
 our "book", we need to start with making the cover. But the content of the book