Add an introduction post, and update other posts to match

2019-08-26 17:43:45 -07:00 · 2019-08-26 17:43:45 -07:00 · 94d242414f
commit 94d242414f
parent 619c346897
3 changed files with 130 additions and 85 deletions
--- a/content/blog/00_compiler_intro.md
+++ b/content/blog/00_compiler_intro.md
@@ -0,0 +1,117 @@
 ---
 title: Compiling a Functional Language Using C++, Part 0 - Intro
 date: 2019-08-03T01:02:30-07:00
 tags: ["C and C++", "Functional Languages", "Compilers"]
 draft: true
 ---
 During my last academic term, I was enrolled in a compilers course.
 We had a final project - develop a compiler for a basic Python subset,
 using LLVM. It was a little boring - virtually nothing about the compiler
 was __not__ covered in class, and it felt more like putting two puzzle
 pieces together than building a real project.
 Instead, I chose to implement a compiler for a functional programming language,
 based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
 a tutorial_. Since the class was requiring the use of tools based on C++,
 that's what I used for my compiler. It was a neat little project, and I
 wanted to share with everyone else how one might go about writing their
 own functional language.
 ### Motivation
 There are two main motivating factors for this series.
 First, whenever I stumble on a compiler implementation tutorial,
 the language created is always imperative, inspired by C, C++, JavaScript,
 or Python. There are many interesting things about compiling such a language.
 However, I also think that the compilation of a functional language (including
 features like lazy evaluation) is interesting enough, and rarely covered.
 Second, I'm inspired by books such as _Software Foundations_ that use
 source code as their text. The entire content of _Software Foundations_,
 for instance, is written as comments in Coq source files. This means
 that you can not only read the book, but also run the code and interact with it.
 This makes it very engaging to read. Because of this, I want to provide for
 each post a "snapshot" of the project code. All the code in the posts
 will directly mirror that snapshot. The code you'll be reading will be
 runnable and open.
 ### Overview
 Let's go over some preliminary information before we embark on this journey.
 #### The "classic" stages of a compiler
 Let's take a look at the high-level overview of what a compiler does.
 Conceptually, the components of a compiler are pretty cleanly separated.
 They are as follows:
 1. Tokenizing / lexical analysis
 2. Parsing
 3. Analysis / optimization
 4. Code Generation
 There are many variations on this structure. Some compilers don't optimize
 at all; others translate the program text into an intermediate representation,
 an alternative way of representing the program that isn't machine code.
 In some compilers, the stages of parsing and analysis can overlap.
 In short, just like the pirate's code, it's more of a guideline than a rule.
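To make the pipeline idea concrete, here is a minimal sketch of the four stages as plain C++ functions chained together. Every name here (`Token`, `Tree`, `tokenize`, `parse`, `analyze`, `generate`, `compile`) is a hypothetical placeholder, and each stage is a stub; it is not the code this series builds, only an illustration of how the output of one stage feeds the next.

```cpp
#include <sstream>
#include <string>
#include <vector>

// All names below are hypothetical placeholders, not the series' real types.
struct Token { std::string text; };
struct Tree { std::vector<Token> leaves; };  // stand-in for a syntax tree

// Stage 1: tokenizing. This stub just splits the source on whitespace.
std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    std::istringstream in(source);
    std::string word;
    while (in >> word) tokens.push_back(Token{word});
    return tokens;
}

// Stage 2: parsing. A real parser builds a tree; this stub only stores the tokens.
Tree parse(const std::vector<Token>& tokens) { return Tree{tokens}; }

// Stage 3: analysis / optimization. This stub returns its input unchanged.
Tree analyze(Tree tree) { return tree; }

// Stage 4: code generation. This stub emits one made-up "instruction" per token.
std::string generate(const Tree& tree) {
    std::string out;
    for (const Token& t : tree.leaves) out += "emit " + t.text + "\n";
    return out;
}

// The whole compiler is the composition of the stages.
std::string compile(const std::string& source) {
    return generate(analyze(parse(tokenize(source))));
}
```

Calling `compile("f x")` pushes the text through all four stubs in order. A compiler that skips optimization would simply drop the `analyze` call from the chain.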
 #### What we'll cover
 We'll go through the stages of a compiler, starting from scratch
 and building up our project. We'll cover:
 * Tokenizing using regular expressions and Flex.
 * Parsing using context-free grammars and Bison.
 * Monomorphic type checking (including typing rules).
 * Evaluation using graph reduction and the G-Machine.
 * Compiling G-Machine instructions to machine code using LLVM.
 We'll be creating a __lazily evaluated__, __functional__ language.
 #### The syntax of our language
 Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
 that most functional languages are very similar, and vary largely in syntax. That's
 our main degree of freedom. We want to represent the following things, for sure:
 * Defining functions
 * Applying functions
 * Arithmetic
 * Algebraic data types (to represent lists, pairs, and the like)
 * Pattern matching (to operate on data types)
 We can additionally support anonymous (lambda) functions, but compiling those
 is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
 define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
 Function application isn't much more difficult - `f x` means "apply f to x", and
 `f x + g x` means sum the result of applying f to x and g to x. That is, function
 application has higher precedence, or __binds tighter__ than binary operators like plus.
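To see what "binds tighter" means in practice, here is a small sketch of an evaluator for digits, `+` and `*` only. The names `Evaluator` and `evaluate` are made up for this illustration, and the code handles neither whitespace nor errors. Each precedence level gets its own function, and the tighter-binding level is the one called from inside the looser one.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration: digits, '+' and '*' only.
struct Evaluator {
    std::string text;
    std::size_t pos = 0;

    // number := one or more digits
    int number() {
        int value = 0;
        while (pos < text.size() &&
               std::isdigit(static_cast<unsigned char>(text[pos]))) {
            value = value * 10 + (text[pos] - '0');
            ++pos;
        }
        return value;
    }

    // product := number ('*' number)*
    int product() {
        int value = number();
        while (pos < text.size() && text[pos] == '*') {
            ++pos;
            value *= number();
        }
        return value;
    }

    // sum := product ('+' product)*
    // A sum is made of whole products, so '*' binds tighter than '+'.
    int sum() {
        int value = product();
        while (pos < text.size() && text[pos] == '+') {
            ++pos;
            value += product();
        }
        return value;
    }
};

int evaluate(const std::string& text) {
    Evaluator e;
    e.text = text;
    return e.sum();
}
```

With this layering, `evaluate("3+2*6")` computes `2*6` first and then adds 3. Function application would sit one level below `product` in the same scheme, which is why `f x + g x` groups as `(f x) + (g x)`.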
 Next, let's define the syntax for declaring a function. Why not:
 ```
 defn f x = { x + x }
 ```
 As for declaring data types:
 ```
 data List = { Nil, Cons Int List }
 ```
 Notice that we are avoiding polymorphism here.
 Let's also define a syntax for pattern matching:
 ```
 case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
 }
 ```
 The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
 constructed from an integer and another list (as defined in our `data` example),
 return the integer".
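For readers more at home in C++, here is a rough sketch of what the `data` and `case` examples express. The names `List`, `nil`, `cons` and `first_or_zero` are hypothetical, and this is only an analogy: it is evaluated eagerly, unlike the language we are defining, and it is not how the compiler will represent data.

```cpp
#include <memory>
#include <utility>

// Hypothetical C++ mirror of: data List = { Nil, Cons Int List }
struct List {
    enum class Tag { Nil, Cons } tag;  // which constructor built this value
    int head;                          // only meaningful for Cons
    std::shared_ptr<List> tail;        // only meaningful for Cons
};

std::shared_ptr<List> nil() {
    return std::make_shared<List>(List{List::Tag::Nil, 0, nullptr});
}

std::shared_ptr<List> cons(int x, std::shared_ptr<List> xs) {
    return std::make_shared<List>(List{List::Tag::Cons, x, std::move(xs)});
}

// Mirrors: case l of { Nil -> { 0 }  Cons x xs -> { x } }
int first_or_zero(const std::shared_ptr<List>& l) {
    switch (l->tag) {
        case List::Tag::Nil:  return 0;
        case List::Tag::Cons: return l->head;  // binds x to the head, ignores xs
    }
    return 0;  // not reached; keeps compilers from warning
}
```

The `switch` on the tag plays the role of the `case` expression: each branch corresponds to one constructor, and the fields of `Cons` become the variables `x` and `xs`.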
 That's it for the introduction! In the next post, we'll cover tokenizing, which is
 the first step in converting source code into an executable program.
 ### Navigation
 Here are the posts that I've written so far for this series:
 * [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
 * [Parsing]({{< relref "02_compiler_parsing.md" >}})
 * [Typechecking]({{< relref "03_compiler_typechecking.md" >}})
--- a/content/blog/01_compiler_tokenizing.md
+++ b/content/blog/01_compiler_tokenizing.md
@@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
 tags: ["C and C++", "Functional Languages", "Compilers"]
 draft: true
 ---
-During my last academic term, I was enrolled in a compilers course.
+It makes sense to build a compiler bit by bit, following the stages we outlined in
-We had a final project - develop a compiler for a basic Python subset,
+the first post of the series. This is because these stages are essentially a pipeline,
-using LLVM. It was a little boring - virtually nothing about the compiler
+with program text coming in one end, and the final program coming out of the other.
-was __not__ covered in class, and it felt more like putting two puzzles
+So as we build up our pipeline, we'll be able to push program text further and further,
-pieces together than building a real project.
+until eventually we get something that we can run on our machine.
 Being involved in the Programming Language Theory (PLT) research group at my
 university, I decided to do something different for the final project -
 a compiler for a functional language. In a series of posts, starting with
 this one, I will explain what I did so that those interested in the subject
 are able to replicate my steps, and maybe learn something for themselves.
 ### The "classic" stages of a compiler
 Let's take a look at the high level overview of what a compiler does.
 Conceptually, the components of a compiler are pretty cleanly separated.
 They are as follows:
 1. Tokenizing / lexical analysis
 2. Parsing
 3. Analysis / optimization
 4. Code Generation
 There are many variations on this structure. Some compilers don't optimize
 at all, some translate the program text into an intermediate representation,
 an alternative way of representing the program that isn't machine code.
 In some compilers, the stages of parsing and analysis can overlap.
 In short, just like the pirate's code, it's more of a guideline than a rule.
 ### Tokenizing and Parsing (the "boring stuff")
 It makes sense to build a compiler bit by bit, following the stages we outlined above.
 This is because these stages are essentially a pipeline, with program text
 coming in one end, and the final program coming out of the other. So as we build
 up our pipeline, we'll be able to push program text further and further, until
 eventually we get something that we can run on our machine.
 This is how most tutorials go about building a compiler, too. The result is that
-there are a __lot__ of tutorials covering tokenizing and parsing. This is why
+there are a __lot__ of tutorials covering tokenizing and parsing.
-I refer to this part of the process as "boring". Nonetheless, I will cover the steps
+Nonetheless, I will cover the steps required to tokenize and parse our little functional
-required to tokenize and parse our little functional language. But before we do that,
+language. Before we start, it might help to refresh your memory about
-we first need to have an idea of what our language looks like.
+the syntax of the language, which we outlined in the
 [previous post]({{< relref "00_compiler_intro.md" >}}).
 ### The Grammar
 Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
 that most functional languages are very similar, and vary largely in syntax. That's
 our main degree of freedom. We want to represent the following things, for sure:
 * Defining functions
 * Applying functions
 * Arithmetic
 * Algebraic data types (to represent lists, pairs, and the like)
 * Pattern matching (to operate on data types)
 We can additionally support anonymous (lambda) functions, but compiling those
 is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
 define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
 Function application isn't much more difficult - `f x` means "apply f to x", and
 `f x + g x` means sum the result of applying f to x and g to x. That is, function
 application has higher precedence, or __binds tighter__ than binary operators like plus.
 Next, let's define the syntax for declaring a function. Why not:
 ```
 defn f x = { x + x }
 ```
 As for declaring data types:
 ```
 data List = { Nil, Cons Int List }
 ```
 Notice that we are avoiding polymorphism here.
 Let's also define a syntax for pattern matching:
 ```
 case l of {
    Nil -> { 0 }
    Cons x xs -> { x }
 }
 ```
 The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
 constructed from an integer and another list (as defined in our `data` example),
 return the integer".
 That's it for now! Let's take a look at tokenizing.
 ### Tokenizing
 When we first get our program text, it's in a representation difficult for us to make
 sense of. If we look at how it's represented in C++, we see that it's just an array
 of characters (potentially hundreds, thousands, or millions in length). We __could__
 jump straight to parsing the text (which involves creating a tree structure, known
 as an __abstract syntax tree__; more on that later). There's nothing wrong with this approach -
 in fact, in functional languages, tokenizing is frequently skipped. However,
-in our closer-to-metal language, it happens to be more convenient to first break the
+in our closer-to-metal language (C++), it happens to be more convenient to first break the
 input text into a bunch of distinct segments (tokens).
 For example, consider the string "320+6". If we skip tokenizing and go straight
@@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
 Our parser, then, won't have to care about recognizing a number - it will just know
 that a number is next in the string, and do with that information what it needs.
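As a rough illustration of that partitioning, here is a hand-written tokenizer sketch that splits a string like "320+6" into a number, a plus, and another number. The `Token` type and `tokenize` function are hypothetical names for this example only; the series itself uses Flex for this job, so treat the code as a stand-in for the idea rather than the actual implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: only numbers and '+' are recognized in this sketch.
struct Token {
    enum class Kind { Number, Plus } kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (std::isdigit(static_cast<unsigned char>(input[i]))) {
            // Consume the longest run of digits as a single Number token.
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back(Token{Token::Kind::Number, input.substr(start, i - start)});
        } else if (input[i] == '+') {
            tokens.push_back(Token{Token::Kind::Plus, "+"});
            ++i;
        } else {
            ++i;  // this sketch silently skips anything it does not recognize
        }
    }
    return tokens;
}
```

A parser working on the result of `tokenize("320+6")` sees three items and never has to inspect individual digits.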
 ### The Theory
 How do we go about breaking up a string into tokens? We need to come up with a
 way to compare some characters in a string against a set of rules. But "rules"
 is a very general term - we could, for instance, define a particular
@@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
 can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
 a special operator for \\((r_1+)?\\), written as \\(r_1*\\).
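We can check the equivalence of the two notations with the C++ standard library's `std::regex`. This is only a quick experiment, not part of the compiler: the function names are made up, and the pattern is copied as written above, so it accepts lowercase letters only.

```cpp
#include <regex>
#include <string>

// Long form: one lowercase letter, then an optional "one or more" group.
bool matches_long_form(const std::string& s) {
    static const std::regex pattern("[a-z]([a-z]+)?");
    return std::regex_match(s, pattern);  // true only if the whole string matches
}

// Short form: the same language, written with the star operator.
bool matches_star_form(const std::string& s) {
    static const std::regex pattern("[a-z][a-z]*");
    return std::regex_match(s, pattern);
}
```

Both functions should agree on every input: "zero or more" and "optionally, one or more" describe the same set of strings.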
 #### The Theory
 So how does one go about checking if a regular expression matches a string? An efficient way is to
 first construct a [state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of state machine can be constructed from a regular expression
 by literally translating each part of it to a series of states, one-to-one. This machine is called
--- a/content/blog/03_compiler_typechecking.md
+++ b/content/blog/03_compiler_typechecking.md
@@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
 draft: true
 tags: ["C and C++", "Functional Languages", "Compilers"]
 ---
-I called tokenizing and parsing boring, but I think I failed to articulate
+I think tokenizing and parsing are boring. The thing is, looking at syntax
 the real reason that I feel this way. The thing is, looking at syntax
 is a pretty shallow measure of how interesting a language is. It's like
 the cover of a book. Every language has one, and it so happens that to make
 our "book", we need to start with making the cover. But the content of the book