Add an introduction post, and update other posts to match

This commit is contained in:
Danila Fedorin 2019-08-26 17:43:45 -07:00
parent 619c346897
commit 94d242414f
3 changed files with 130 additions and 85 deletions

View File

@ -0,0 +1,117 @@
---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.
Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class required the use of tools based on C++,
that's what I used for my compiler. It was a neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.
### Motivation
There are two main motivating factors for this series.
First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting enough, and rarely covered.
Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source files. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.
### Overview
Let's go over some preliminary information before we embark on this journey.
#### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation
There are many variations on this structure. Some compilers don't optimize
at all; some translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
#### What we'll cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:
* Tokenizing using regular expressions and Flex.
* Parsing using context free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.
We'll be creating a __lazily evaluated__, __functional__ language.
#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the results of applying f to x and applying g to x. That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.
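To see this in practice, the second line below shows how the first one groups (the parentheses are only there to make the grouping visible; they aren't required):
```
f x + g x
(f x) + (g x)
```
If we wanted to apply `f` to the whole sum instead, we'd have to write `f (x + g x)` explicitly.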
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
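Putting the three constructs together, here's a sketch of what a small program might look like in this syntax. I'm assuming here that functions can be recursive and that the program's entry point is a definition like `main`; we'll pin down the real grammar in the parsing post, so treat this purely as an illustration:
```
data List = { Nil, Cons Int List }

defn length l = {
    case l of {
        Nil -> { 0 }
        Cons x xs -> { 1 + length xs }
    }
}

defn main = { length (Cons 1 (Cons 2 (Cons 3 Nil))) }
```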
That's it for the introduction! In the next post, we'll cover tokenizing, which is
the first step in converting source code into an executable program.
### Navigation
Here are the posts that I've written so far for this series:
* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})

View File

@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.
Being involved in the Programming Language Theory (PLT) research group at my
university, I decided to do something different for the final project -
a compiler for a functional language. In a series of posts, starting with
this one, I will explain what I did so that those interested in the subject
are able to replicate my steps, and maybe learn something for themselves.
### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation
There are many variations on this structure. Some compilers don't optimize
at all; some translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
### Tokenizing and Parsing (the "boring stuff")
It makes sense to build a compiler bit by bit, following the stages we outlined above.
This is because these stages are essentially a pipeline, with program text
coming in one end, and the final program coming out of the other. So as we build
up our pipeline, we'll be able to push program text further and further, until
eventually we get something that we can run on our machine.
It makes sense to build a compiler bit by bit, following the stages we outlined in
the first post of the series. This is because these stages are essentially a pipeline,
with program text coming in one end, and the final program coming out of the other.
So as we build up our pipeline, we'll be able to push program text further and further,
until eventually we get something that we can run on our machine.
This is how most tutorials go about building a compiler, too. The result is that
there are a __lot__ of tutorials covering tokenizing and parsing. This is why
I refer to this part of the process as "boring". Nonetheless, I will cover the steps
required to tokenize and parse our little functional language. But before we do that,
we first need to have an idea of what our language looks like.
there are a __lot__ of tutorials covering tokenizing and parsing.
Nonetheless, I will cover the steps required to tokenize and parse our little functional
language. Before we start, it might help to refresh your memory about
the syntax of the language, which we outlined in the
[previous post]({{< relref "00_compiler_intro.md" >}}).
### The Grammar
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or __binds tighter__ than binary operators like plus.
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
That's it for now! Let's take a look at tokenizing.
### Tokenizing
When we first get our program text, it's in a representation that's difficult for us to make
sense of. If we look at how it's represented in C++, we see that it's just an array
of characters (potentially hundreds, thousands, or millions in length). We __could__
jump straight to parsing the text (which involves creating a tree structure, known
as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
in fact, in functional languages, tokenizing is frequently skipped. However,
in our closer-to-metal language, it happens to be more convenient to first break the
in our closer-to-metal language (C++), it happens to be more convenient to first break the
input text into a bunch of distinct segments (tokens).
For example, consider the string "320+6". If we skip tokenizing and go straight
@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
Our parser, then, won't have to care about recognizing a number - it will just know
that a number is next in the string, and do with that information what it needs.
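Concretely, you can picture a token as a tagged slice of the input. The sketch below is my own illustration (with hypothetical names), not the representation we'll actually use - Flex will be producing our tokens for us:
```
#include <iostream>
#include <string>
#include <vector>

// A hypothetical token representation: a tag plus the characters it covers.
enum class token_type { number, plus, times, identifier };

struct token {
    token_type type;
    std::string text; // e.g. "320"
};

int main() {
    // "320+6", broken into its logical segments.
    std::vector<token> tokens = {
        {token_type::number, "320"},
        {token_type::plus, "+"},
        {token_type::number, "6"},
    };
    for (const auto& tok : tokens)
        std::cout << tok.text << '\n';
}
```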
### The Theory
How do we go about breaking up a string into tokens? We need to come up with a
way to compare some characters in a string against a set of rules. But "rules"
is a very general term - we could, for instance, define a particular
@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
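Flex will do all of this matching for us, but if you want to convince yourself that the identifier pattern above behaves as described, here's a quick standalone check using C++'s `std::regex`. This is purely an aside on my part - the compiler itself won't use `std::regex`:
```
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The identifier pattern from above: a lowercase letter,
    // optionally followed by more lowercase letters.
    std::regex identifier("[a-z]([a-z]+)?");

    for (std::string word : {"length", "Cons", "x"}) {
        std::cout << word << ": "
                  << (std::regex_match(word, identifier) ? "matches" : "doesn't match")
                  << std::endl;
    }
}
```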
#### The Theory
So how does one go about checking if a regular expression matches a string? An efficient way is to
first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
by literally translating each part of it to a series of states, one-to-one. This machine is called

View File

@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
I called tokenizing and parsing boring, but I think I failed to articulate
the real reason that I feel this way. The thing is, looking at syntax
I think tokenizing and parsing are boring. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book