From 94d242414f15f4b42a6b528054c0cc31a7a705d0 Mon Sep 17 00:00:00 2001
From: Danila Fedorin
Date: Mon, 26 Aug 2019 17:43:45 -0700
Subject: [PATCH] Add an introduction post, and update other posts to match

---
 content/blog/00_compiler_intro.md             | 117 ++++++++++++++++++
 content/blog/01_compiler_tokenizing.md        |  95 ++------------
 ...r_trees.md => 03_compiler_typechecking.md} |   3 +-
 3 files changed, 130 insertions(+), 85 deletions(-)
 create mode 100644 content/blog/00_compiler_intro.md
 rename content/blog/{03_compiler_trees.md => 03_compiler_typechecking.md} (99%)

diff --git a/content/blog/00_compiler_intro.md b/content/blog/00_compiler_intro.md
new file mode 100644
index 0000000..a4fd8a3
--- /dev/null
+++ b/content/blog/00_compiler_intro.md
@@ -0,0 +1,117 @@
+---
+title: Compiling a Functional Language Using C++, Part 0 - Intro
+date: 2019-08-03T01:02:30-07:00
+tags: ["C and C++", "Functional Languages", "Compilers"]
+draft: true
+---
+During my last academic term, I was enrolled in a compilers course.
+We had a final project - develop a compiler for a basic Python subset,
+using LLVM. It was a little boring - virtually nothing about the compiler
+was __not__ covered in class, and it felt more like putting two puzzle
+pieces together than building a real project.
+
+Instead, I chose to implement a compiler for a functional programming language,
+based on a wonderful book by Simon Peyton Jones, _Implementing functional languages:
+a tutorial_. Since the class required the use of tools based on C++,
+that's what I used for my compiler. It was a neat little project, and I
+wanted to share with everyone else how one might go about writing their
+own functional language.
+
+### Motivation
+There are two main motivating factors for this series.
+
+First, whenever I stumble on a compiler implementation tutorial,
+the language created is always imperative, inspired by C, C++, JavaScript,
+or Python. There are many interesting things about compiling such a language.
+However, I also think that the compilation of a functional language (including
+features like lazy evaluation) is interesting in its own right, and rarely covered.
+
+Second, I'm inspired by books such as _Software Foundations_ that use
+source code as their text. The entire content of _Software Foundations_,
+for instance, is written as comments in a Coq source file. This means
+that you can not only read the book, but also run the code and interact with it.
+This makes it very engaging to read. Because of this, I want to provide for
+each post a "snapshot" of the project code. All the code in the posts
+will directly mirror that snapshot. The code you'll be reading will be
+runnable and open.
+
+### Overview
+Let's go over some preliminary information before we embark on this journey.
+
+#### The "classic" stages of a compiler
+Let's take a look at the high-level overview of what a compiler does.
+Conceptually, the components of a compiler are pretty cleanly separated.
+They are as follows:
+
+1. Tokenizing / lexical analysis
+2. Parsing
+3. Analysis / optimization
+4. Code Generation
+
+There are many variations on this structure. Some compilers don't optimize
+at all, some translate the program text into an intermediate representation,
+an alternative way of representing the program that isn't machine code.
+In some compilers, the stages of parsing and analysis can overlap.
+In short, just like the pirate's code, it's more of a guideline than a rule.
+
+#### What we'll cover
+We'll go through the stages of a compiler, starting from scratch
+and building up our project. We'll cover:
+
+* Tokenizing using regular expressions and Flex.
+* Parsing using context-free grammars and Bison.
+* Monomorphic type checking (including typing rules).
+* Evaluation using graph reduction and the G-Machine.
+* Compiling G-Machine instructions to machine code using LLVM.
+
+We'll be creating a __lazily evaluated__, __functional__ language.
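To get a feel for what __lazy evaluation__ buys us, here's a small sketch written in the language we'll be building, using the `defn` syntax introduced in the next section. The zero-argument `main` entry point and the recursive definition are assumptions made purely for illustration:

```
defn loop x = { loop x }
defn ignoreArg x = { 3 }
defn main = { ignoreArg (loop 0) }
```

Under eager (strict) evaluation, the argument `loop 0` would be evaluated before the call, and the program would never terminate. Under lazy evaluation, `ignoreArg` never needs its argument, so `loop 0` is never evaluated and `main` simply yields 3.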
+
+#### The syntax of our language
+Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
+that most functional languages are very similar, and vary largely in syntax. That's
+our main degree of freedom. We want to represent the following things, for sure:
+
+* Defining functions
+* Applying functions
+* Arithmetic
+* Algebraic data types (to represent lists, pairs, and the like)
+* Pattern matching (to operate on data types)
+
+We can additionally support anonymous (lambda) functions, but compiling those
+is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
+define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
+Function application isn't much more difficult - `f x` means "apply f to x", and
+`f x + g x` means "sum the results of applying f to x and g to x". That is, function
+application has higher precedence, or __binds tighter__ than binary operators like plus.
+
+Next, let's define the syntax for declaring a function. Why not:
+```
+defn f x = { x + x }
+```
+
+As for declaring data types:
+```
+data List = { Nil, Cons Int List }
+```
+Notice that we are avoiding polymorphism here.
+
+Let's also define a syntax for pattern matching:
+```
+case l of {
+    Nil -> { 0 }
+    Cons x xs -> { x }
+}
+```
+The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
+constructed from an integer and another list (as defined in our `data` example),
+return the integer".
+
+That's it for the introduction! In the next post, we'll cover tokenizing, which is
+the first step in converting source code into an executable program.
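Putting the pieces above together, here is what a complete (hypothetical) program in our language might look like. It assumes that top-level definitions may be recursive and that `main` serves as the entry point, neither of which we've formally pinned down yet:

```
data List = { Nil, Cons Int List }

defn length l = {
  case l of {
    Nil -> { 0 }
    Cons x xs -> { 1 + length xs }
  }
}

defn main = { length (Cons 1 (Cons 2 (Cons 3 Nil))) }
```

Because application binds tighter than binary operators, `1 + length xs` parses as `1 + (length xs)`, while the argument to `length` in `main` needs its parentheses.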
+
+### Navigation
+Here are the posts that I've written so far for this series:
+
+* [Tokenizing]({{< relref "01_compiler_tokenizing.md" >}})
+* [Parsing]({{< relref "02_compiler_parsing.md" >}})
+* [Typechecking]({{< relref "03_compiler_typechecking.md" >}})
diff --git a/content/blog/01_compiler_tokenizing.md b/content/blog/01_compiler_tokenizing.md
index 5b1f123..462aa34 100644
--- a/content/blog/01_compiler_tokenizing.md
+++ b/content/blog/01_compiler_tokenizing.md
@@ -4,97 +4,26 @@ date: 2019-08-03T01:02:30-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
draft: true
---
-During my last academic term, I was enrolled in a compilers course.
-We had a final project - develop a compiler for a basic Python subset,
-using LLVM. It was a little boring - virtually nothing about the compiler
-was __not__ covered in class, and it felt more like putting two puzzles
-pieces together than building a real project.
-
-Being involved of the Programming Language Theory (PLT) research group at my
-university, I decided to do something different for the final project -
-a compiler for a functional language. In a series of posts, starting with
-thise one, I will explain what I did so that those interested in the subject
-are able to replicate my steps, and maybe learn something for themselves.
-
-### The "classic" stages of a compiler
-Let's take a look at the high level overview of what a compiler does.
-Conceptually, the components of a compiler are pretty cleanly separated.
-They are as gollows:
-
-1. Tokenizing / lexical analysis
-2. Parsing
-3. Analysis / optimization
-5. Code Generation
-
-There are many variations on this structure. Some compilers don't optimize
-at all, some translate the program text into an intermediate representation,
-an alternative way of representing the program that isn't machine code.
-In some compilers, the stages of parsing and analysis can overlap.
-In short, just like the pirate's code, it's more of a guideline than a rule.
-
-### Tokenizing and Parsing (the "boring stuff")
-It makes sense to build a compiler bit by bit, following the stages we outlined above.
-This is because these stages are essentially a pipeline, with program text
-coming in one end, and the final program coming out of the other. So as we build
-up our pipeline, we'll be able to push program text further and further, until
-eventually we get something that we can run on our machine.
+It makes sense to build a compiler bit by bit, following the stages we outlined in
+the first post of the series. This is because these stages are essentially a pipeline,
+with program text coming in one end, and the final program coming out of the other.
+So as we build up our pipeline, we'll be able to push program text further and further,
+until eventually we get something that we can run on our machine.

This is how most tutorials go about building a compiler, too. The result is that
-there are a __lot__ of tutorials covering tokenizing and parsing. This is why
-I refer to this part of the process as "boring". Nonetheless, I will cover the steps
-required to tokenize and parse our little functional language. But before we do that,
-we first need to have an idea of what our language looks like.
+there are a __lot__ of tutorials covering tokenizing and parsing.
+Nonetheless, I will cover the steps required to tokenize and parse our little functional
+language. Before we start, it might help to refresh your memory about
+the syntax of the language, which we outlined in the
+[previous post]({{< relref "00_compiler_intro.md" >}}).

-### The Grammar
-Simon Peyton Jones, in his two works regarding compiling functional languages, remarks
-that most functional languages are very similar, and vary largely in syntax. That's
-our main degree of freedom.
We want to represent the following things, for sure:
-
-* Defining functions
-* Applying functions
-* Arithmetic
-* Algebraic data types (to represent lists, pairs, and the like)
-* Pattern matching (to operate on data types)
-
-We can additionally support anonymous (lambda) functions, but compiling those
-is actually a bit trickier, so we will skip those for now. Arithmetic is the simplest to
-define - let's define it as we would expect: `3` is a number, `3+2*6` evaluates to 15.
-Function application isn't much more difficult - `f x` means "apply f to x", and
-`f x + g x` means sum the result of applying f to x and g to x. That is, function
-application has higher precedence, or __binds tighter__ than binary operators like plus.
-
-Next, let's define the syntax for declaring a function. Why not:
-```
-defn f x = { x + x }
-```
-
-As for declaring data types:
-```
-data List = { Nil, Cons Int List }
-```
-Notice that we are avoiding polymorphism here.
-
-Let's also define a syntax for pattern matching:
-```
-case l of {
-    Nil -> { 0 }
-    Cons x xs -> { x }
-}
-```
-The above means "if the list `l` is `Nil`, then return 0, otherwise, if it's
-constructed from an integer and another list (as defined in our `data` example),
-return the integer".
-
-That's it for now! Let's take a look at tokenizing.
-
-### Tokenizing
When we first get our program text, it's in a representation difficult
for us to make sense of. If we look at how it's represented in C++, we
see that it's just an array of characters (potentially hundreds, thousands, or
millions in length). We __could__ jump straight to parsing the text (which
involves creating a tree structure, known as an __abstract syntax tree__; more
on that later). There's nothing wrong with this approach - in fact,
in functional languages, tokenizing is frequently skipped.
However,
-in our closer-to-metal language, it happens to be more convenient to first break the
+in our closer-to-metal language (C++), it happens to be more convenient to first break the
input text into a bunch of distinct segments (tokens).

For example, consider the string "320+6". If we skip tokenizing and go straight
@@ -105,6 +34,7 @@ To us, this is a bit more clear - we've partitioned the string into logical segm
Our parser, then, won't have to care about recognizing a number - it will just
know that a number is next in the string, and do with that information what it needs.

+### The Theory
How do we go about breaking up a string into tokens? We need to come up
with a way to compare some characters in a string against a set of rules.
But "rules" is a very general term - we could, for instance, define a particular
@@ -150,7 +80,6 @@ starting with a lowercase letter and containing lowercase or uppercase letters a
can be written as \\(\[a-z\]([a-z]+)?\\). Again, most regex implementations provide
a special operator for \\((r_1+)?\\), written as \\(r_1*\\).

-#### The Theory
So how does one go about checking if a regular expression matches a string?
An efficient way is to first construct a
[state machine](https://en.wikipedia.org/wiki/Finite-state_machine). A type of
state machine can be constructed from a regular expression by literally translating
each part of it to a series of states, one-to-one.
This machine is called
diff --git a/content/blog/03_compiler_trees.md b/content/blog/03_compiler_typechecking.md
similarity index 99%
rename from content/blog/03_compiler_trees.md
rename to content/blog/03_compiler_typechecking.md
index 5435588..e43f2af 100644
--- a/content/blog/03_compiler_trees.md
+++ b/content/blog/03_compiler_typechecking.md
@@ -4,8 +4,7 @@ date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
-I called tokenizing and parsing boring, but I think I failed to articulate
-the real reason that I feel this way. The thing is, looking at syntax
+I think tokenizing and parsing are boring. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book