Add an introduction post, and update other posts to match
parent 619c346897
commit 94d242414f

content/blog/00_compiler_intro.md (Normal file, 117 changes)
@@ -0,0 +1,117 @@
---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.

Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class required the use of tools based on C++,
that's what I used for my compiler. It was a neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.

### Motivation
There are two main motivating factors for this series.

First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting enough, and rarely covered.

Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source files. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.

### Overview
Let's go over some preliminary information before we embark on this journey.

#### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:

1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation

There are many variations on this structure. Some compilers don't optimize
at all; others translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.

#### What we'll cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:

* Tokenizing using regular expressions and Flex.
* Parsing using context-free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.

We'll be creating a __lazily evaluated__, __functional__ language.

#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:

* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)

We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.
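
To make the arithmetic rule concrete, here is a minimal sketch of a hand-written evaluator for the arithmetic fragment only. It is not how this series builds its parser (that job goes to Bison), and it does not handle function application. The names `ArithEvaluator` and `evaluate` are hypothetical, and the sketch assumes well-formed input made of non-negative integers, `+`, and `*`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: evaluates non-negative integers combined with '+' and '*'.
// Assumes well-formed input; there is no error handling in this sketch.
class ArithEvaluator {
public:
    explicit ArithEvaluator(const std::string& text) : text_(text), pos_(0) {}

    // sum := product ('+' product)*
    int parse_sum() {
        int value = parse_product();
        while (peek() == '+') {
            ++pos_;
            value += parse_product();
        }
        return value;
    }

private:
    // product := number ('*' number)*
    // Products are parsed below sums, so '*' binds tighter than '+'.
    int parse_product() {
        int value = parse_number();
        while (peek() == '*') {
            ++pos_;
            value *= parse_number();
        }
        return value;
    }

    // number := digit+
    int parse_number() {
        skip_spaces();
        int value = 0;
        while (pos_ < text_.size() &&
               std::isdigit(static_cast<unsigned char>(text_[pos_]))) {
            value = value * 10 + (text_[pos_] - '0');
            ++pos_;
        }
        return value;
    }

    // Returns the next non-space character, or '\0' at the end of input.
    char peek() {
        skip_spaces();
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }

    void skip_spaces() {
        while (pos_ < text_.size() && text_[pos_] == ' ') ++pos_;
    }

    std::string text_;
    std::size_t pos_;
};

int evaluate(const std::string& text) {
    return ArithEvaluator(text).parse_sum();
}
```

Under these assumptions, `evaluate("3+2*6")` gives 15, because `parse_product` consumes `2*6` before the addition happens. Function application would sit one level below products in the same scheme, which is what "binds tighter" amounts to.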

Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```

As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.

Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
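
For readers more at home in C++, the `data` and `case` examples above can be read roughly like the tagged structure below. This is only an analogy, not the representation the compiler in this series will use. The names `List`, `make_nil`, `make_cons`, and `head_or_zero` are made up for illustration.

```cpp
#include <memory>
#include <utility>

// Hypothetical C++ mirror of `data List = { Nil, Cons Int List }`.
struct List {
    enum class Tag { Nil, Cons };
    Tag tag;
    int head;                          // meaningful only when tag == Tag::Cons
    std::shared_ptr<const List> tail;  // meaningful only when tag == Tag::Cons
};

using ListPtr = std::shared_ptr<const List>;

ListPtr make_nil() {
    return std::make_shared<List>(List{List::Tag::Nil, 0, nullptr});
}

ListPtr make_cons(int head, ListPtr tail) {
    return std::make_shared<List>(List{List::Tag::Cons, head, std::move(tail)});
}

// Mirrors: case l of { Nil -> { 0 }  Cons x xs -> { x } }
int head_or_zero(const ListPtr& l) {
    switch (l->tag) {
    case List::Tag::Nil:
        return 0;
    case List::Tag::Cons:
        return l->head;  // `x` is the integer; `xs` (l->tail) goes unused here
    }
    return 0;  // not reached for well-formed values
}
```

Each branch of the `case` corresponds to one constructor, and the names in a pattern (`x`, `xs`) correspond to the fields stored by that constructor.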

That's it for the introduction! In the next post, we'll cover tokenizing, which is
the first step in converting source code into an executable program.

### Navigation
Here are the posts that I've written so far for this series:

* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})

@@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.

Being involved in the Programming Language Theory (PLT) research group at my
university, I decided to do something different for the final project -
a compiler for a functional language. In a series of posts, starting with
this one, I will explain what I did so that those interested in the subject
are able to replicate my steps, and maybe learn something for themselves.

### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:

1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation

There are many variations on this structure. Some compilers don't optimize
at all; others translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.

### Tokenizing and Parsing (the "boring stuff")
It makes sense to build a compiler bit by bit, following the stages we outlined above.
This is because these stages are essentially a pipeline, with program text
coming in one end, and the final program coming out of the other. So as we build
up our pipeline, we'll be able to push program text further and further, until
eventually we get something that we can run on our machine.
It makes sense to build a compiler bit by bit, following the stages we outlined in
the first post of the series. This is because these stages are essentially a pipeline,
with program text coming in one end, and the final program coming out of the other.
So as we build up our pipeline, we'll be able to push program text further and further,
until eventually we get something that we can run on our machine.

This is how most tutorials go about building a compiler, too. The result is that
there are a __lot__ of tutorials covering tokenizing and parsing. This is why
I refer to this part of the process as "boring". Nonetheless, I will cover the steps
required to tokenize and parse our little functional language. But before we do that,
we first need to have an idea of what our language looks like.
there are a __lot__ of tutorials covering tokenizing and parsing.
Nonetheless, I will cover the steps required to tokenize and parse our little functional
language. Before we start, it might help to refresh your memory about
the syntax of the language, which we outlined in the
[previous post]({{< relref "00_compiler_intro.md" >}}).

### The Grammar
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:

* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)

We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.

Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```

As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.

Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".

That's it for now! Let's take a look at tokenizing.

### Tokenizing
When we first get our program text, it's in a representation difficult for us to make
sense of. If we look at how it's represented in C++, we see that it's just an array
of characters (potentially hundreds, thousands, or millions in length). We __could__
jump straight to parsing the text (which involves creating a tree structure, known
as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
in fact, in functional languages, tokenizing is frequently skipped. However,
in our closer-to-metal language, it happens to be more convenient to first break the
in our closer-to-metal language (C++), it happens to be more convenient to first break the
input text into a bunch of distinct segments (tokens).

For example, consider the string "320+6". If we skip tokenizing and go straight
@@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
Our parser, then, won't have to care about recognizing a number - it will just know
that a number is next in the string, and do with that information what it needs.
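
As a rough illustration of that partitioning, here is a hand-written sketch that splits a string like "320+6" into tokens. The series itself relies on Flex for this step, so treat the code as an assumption-laden stand-in: `TokenKind`, `Token`, and `tokenize` are hypothetical names, and the sketch only understands digits and `+`, silently skipping anything else.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token types for this sketch only.
enum class TokenKind { Number, Plus };

struct Token {
    TokenKind kind;
    std::string text;  // the exact characters that made up the token
};

// Groups runs of digits into Number tokens and turns each '+' into a Plus token.
// Any other character is skipped; a real tokenizer would report an error instead.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (std::isdigit(static_cast<unsigned char>(input[i]))) {
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, input.substr(start, i - start)});
        } else if (input[i] == '+') {
            tokens.push_back({TokenKind::Plus, "+"});
            ++i;
        } else {
            ++i;
        }
    }
    return tokens;
}
```

With this sketch, `tokenize("320+6")` produces three tokens: a number with text `320`, a plus, and a number with text `6`. A parser working on that list never has to look at individual digits.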

### The Theory
How do we go about breaking up a string into tokens? We need to come up with a
way to compare some characters in a string against a set of rules. But "rules"
is a very general term - we could, for instance, define a particular
@@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
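
To check the claim that \\((r_1+)?\\) and \\(r_1*\\) describe the same strings, one option is to try both spellings with the C++ standard library. This is a small sketch, assuming the default ECMAScript syntax of `std::regex`; the helper names `full_match` and `spellings_agree` are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns true when the whole of `text` matches `pattern`
// (std::regex uses ECMAScript syntax by default).
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical check: do the "optional plus" and "star" spellings
// give the same answer for this particular input?
bool spellings_agree(const std::string& text) {
    return full_match("[a-z]([a-z]+)?", text) == full_match("[a-z][a-z]*", text);
}
```

Both patterns accept `a` and `hello`, and both reject the empty string and `Hello`. Agreement on a handful of inputs is not a proof, of course, only a sanity check.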

#### The Theory
So how does one go about checking if a regular expression matches a string? An efficient way is to
first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
by literally translating each part of it to a series of states, one-to-one. This machine is called

@@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
I called tokenizing and parsing boring, but I think I failed to articulate
the real reason that I feel this way. The thing is, looking at syntax
I think tokenizing and parsing are boring. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book