Add an introduction post, and update other posts to match

This commit is contained in:
Danila Fedorin 2019-08-26 17:43:45 -07:00
parent 619c346897
commit 94d242414f
3 changed files with 130 additions and 85 deletions

View File

@ -0,0 +1,117 @@
---
title: Compiling a Functional Language Using C++, Part 0 - Intro
date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.
Instead, I chose to implement a compiler for a functional programming language,
based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
a tutorial_. Since the class required the use of tools based on C++,
that's what I used for my compiler. It was a neat little project, and I
wanted to share with everyone else how one might go about writing their
own functional language.
### Motivation
There are two main motivating factors for this series.
First, whenever I stumble on a compiler implementation tutorial,
the language created is always imperative, inspired by C, C++, JavaScript,
or Python. There are many interesting things about compiling such a language.
However, I also think that the compilation of a functional language (including
features like lazy evaluation) is interesting enough, and rarely covered.
Second, I'm inspired by books such as _Software Foundations_ that use
source code as their text. The entire content of _Software Foundations_,
for instance, is written as comments in Coq source files. This means
that you can not only read the book, but also run the code and interact with it.
This makes it very engaging to read. Because of this, I want to provide for
each post a "snapshot" of the project code. All the code in the posts
will directly mirror that snapshot. The code you'll be reading will be
runnable and open.
### Overview
Let's go over some preliminary information before we embark on this journey.
#### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation
There are many variations on this structure. Some compilers don't optimize
at all; some translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
#### What we'll cover
We'll go through the stages of a compiler, starting from scratch
and building up our project. We'll cover:
* Tokenizing using regular expressions and Flex.
* Parsing using context free grammars and Bison.
* Monomorphic type checking (including typing rules).
* Evaluation using graph reduction and the G-Machine.
* Compiling G-Machine instructions to machine code using LLVM.
We'll be creating a __lazily evaluated__, __functional__ language.
#### The syntax of our language
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the results of applying f to x and applying g to x. That is, function
application has higher precedence, or __binds tighter__, than binary operators like plus.
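To see this in practice, the second line below shows how the first one groups (the parentheses are only there to make the grouping visible; they aren't required):
```
f x + g x
(f x) + (g x)
```
If we wanted to apply `f` to the whole sum instead, we'd have to write `f (x + g x)` explicitly.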
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
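Putting the three constructs together, here's a sketch of what a small program might look like in this syntax. I'm assuming here that functions can be recursive and that the program's entry point is a definition like `main`; we'll pin down the real grammar in the parsing post, so treat this purely as an illustration:
```
data List = { Nil, Cons Int List }

defn length l = {
    case l of {
        Nil -> { 0 }
        Cons x xs -> { 1 + length xs }
    }
}

defn main = { length (Cons 1 (Cons 2 (Cons 3 Nil))) }
```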
That's it for the introduction! In the next post, we'll cover tokenizing, which is
the first step in converting source code into an executable program.
### Navigation
Here are the posts that I've written so far for this series:
* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
* [Parsing]({{< relref "02_compiler_parsing.md" >}})
* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})

View File

@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
During my last academic term, I was enrolled in a compilers course.
We had a final project - develop a compiler for a basic Python subset,
using LLVM. It was a little boring - virtually nothing about the compiler
was __not__ covered in class, and it felt more like putting two puzzle
pieces together than building a real project.
Being involved in the Programming Language Theory (PLT) research group at my
university, I decided to do something different for the final project -
a compiler for a functional language. In a series of posts, starting with
this one, I will explain what I did so that those interested in the subject
are able to replicate my steps, and maybe learn something for themselves.
### The "classic" stages of a compiler
Let's take a look at the high-level overview of what a compiler does.
Conceptually, the components of a compiler are pretty cleanly separated.
They are as follows:
1. Tokenizing / lexical analysis
2. Parsing
3. Analysis / optimization
4. Code Generation
There are many variations on this structure. Some compilers don't optimize
at all; some translate the program text into an intermediate representation,
an alternative way of representing the program that isn't machine code.
In some compilers, the stages of parsing and analysis can overlap.
In short, just like the pirate's code, it's more of a guideline than a rule.
### Tokenizing and Parsing (the "boring stuff")
It makes sense to build a compiler bit by bit, following the stages we outlined above.
This is because these stages are essentially a pipeline, with program text
coming in one end, and the final program coming out of the other. So as we build
up our pipeline, we'll be able to push program text further and further, until
eventually we get something that we can run on our machine.
It makes sense to build a compiler bit by bit, following the stages we outlined in
the first post of the series. This is because these stages are essentially a pipeline,
with program text coming in one end, and the final program coming out of the other.
So as we build up our pipeline, we'll be able to push program text further and further,
until eventually we get something that we can run on our machine.
This is how most tutorials go about building a compiler, too. The result is that
there are a __lot__ of tutorials covering tokenizing and parsing. This is why
I refer to this part of the process as "boring". Nonetheless, I will cover the steps
required to tokenize and parse our little functional language. But before we do that,
we first need to have an idea of what our language looks like.
there are a __lot__ of tutorials covering tokenizing and parsing.
Nonetheless, I will cover the steps required to tokenize and parse our little functional
language. Before we start, it might help to refresh your memory about
the syntax of the language, which we outlined in the
[previous post]({{< relref "00_compiler_intro.md" >}}).
### The Grammar
Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
that most functional languages are very similar, and vary largely in syntax. That's
our main degree of freedom. We want to represent the following things, for sure:
* Defining functions
* Applying functions
* Arithmetic
* Algebraic data types (to represent lists, pairs, and the like)
* Pattern matching (to operate on data types)
We can additionally support anonymous (lambda) functions, but compiling those
is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
Function application isn't much more difficult - `f x` means "apply f to x", and
`f x + g x` means sum the result of applying f to x and g to x. That is, function
application has higher precedence, or __binds tighter__ than binary operators like plus.
Next, let's define the syntax for declaring a function. Why not:
```
defn f x = { x + x }
```
As for declaring data types:
```
data List = { Nil, Cons Int List }
```
Notice that we are avoiding polymorphism here.
Let's also define a syntax for pattern matching:
```
case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
}
```
The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
constructed from an integer and another list (as defined in our `data` example),
return the integer".
That's it for now! Let's take a look at tokenizing.
### Tokenizing
When we first get our program text, it's in a representation that's difficult for us to make
sense of. If we look at how it's represented in C++, we see that it's just an array
of characters (potentially hundreds, thousands, or millions in length). We __could__
jump straight to parsing the text (which involves creating a tree structure, known
as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
in fact, in functional languages, tokenizing is frequently skipped. However,
in our closer-to-metal language, it happens to be more convenient to first break the
in our closer-to-metal language (C++), it happens to be more convenient to first break the
input text into a bunch of distinct segments (tokens).
For example, consider the string "320+6". If we skip tokenizing and go straight
@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
Our parser, then, won't have to care about recognizing a number - it will just know
that a number is next in the string, and do with that information what it needs.
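Concretely, you can picture a token as a tagged slice of the input. The sketch below is my own illustration (with hypothetical names), not the representation we'll actually use - Flex will be producing our tokens for us:
```
#include <iostream>
#include <string>
#include <vector>

// A hypothetical token representation: a tag plus the characters it covers.
enum class token_type { number, plus, times, identifier };

struct token {
    token_type type;
    std::string text; // e.g. "320"
};

int main() {
    // "320+6", broken into its logical segments.
    std::vector<token> tokens = {
        {token_type::number, "320"},
        {token_type::plus, "+"},
        {token_type::number, "6"},
    };
    for (const auto& tok : tokens)
        std::cout << tok.text << '\n';
}
```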
### The Theory
How do we go about breaking up a string into tokens? We need to come up with a
way to compare some characters in a string against a set of rules. But "rules"
is a very general term - we could, for instance, define a particular
@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
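Flex will do all of this matching for us, but if you want to convince yourself that the identifier pattern above behaves as described, here's a quick standalone check using C++'s `std::regex`. This is purely an aside on my part - the compiler itself won't use `std::regex`:
```
#include <iostream>
#include <regex>
#include <string>

int main() {
    // The identifier pattern from above: a lowercase letter,
    // optionally followed by more lowercase letters.
    std::regex identifier("[a-z]([a-z]+)?");

    for (std::string word : {"length", "Cons", "x"}) {
        std::cout << word << ": "
                  << (std::regex_match(word, identifier) ? "matches" : "doesn't match")
                  << std::endl;
    }
}
```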
#### The Theory
So how does one go about checking if a regular expression matches a string? An efficient way is to
first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
by literally translating each part of it to a series of states, one-to-one. This machine is called

View File

@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
I called tokenizing and parsing boring, but I think I failed to articulate
the real reason that I feel this way. The thing is, looking at syntax
I think tokenizing and parsing are boring. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book