blog-static/content/blog/03_compiler_trees.md

246 lines
11 KiB
Markdown

---
title: Compiling a Functional Language Using C++, Part 3 - Operations On Trees
date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
I called tokenizing and parsing boring, but I think I failed to articulate
the real reason that I feel this way. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book
is what matters, and that's where we've arrived now. We must make decisions
about our language, and give meaning to programs written in it. But before
we can give our programs meaning, we need to make sense of the current domain
of programs that we receive from our parser. Let's consider a few wonderful examples.
```
defn main = { plus 320 6 }
defn plus x y = { x + y }
```
This is a valid program, as far as we're concerned. But are __all__
programs we get from the parser valid? See for yourself:
```
data Bool = { True, False }
defn main { 3 + True }
```
Obviously, that's not right. The parser accepts it - this matches our grammar.
But giving meaning to this program is not easy, since we have no clear
way of adding 3 and some data type. Similarly:
```
defn main { 1 2 3 4 5 }
```
What is this? It's a sequence of applications, starting with `1 2`. Numbers
are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type `1`
would need to have for this program to be valid.
Before we give meaning to programs in our language, we'll need to toss away the ones
that don't make sense. To do so, we will use type checking. During the process of type
checking, we will collect information about various parts of our abstract syntax trees,
classifying them by the types of values they create. Using this information, we'll be able
to throw away blatantly incorrect programs.
### Basic Type Checking
You may have noticed in the very first post that I have chosen to avoid polymorphism.
This will significantly simplify our type checking algorithm. If a more robust
algorithm is desired, take a look at the
[Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system).
Personally, I enjoyed [this](http://dev.stephendiehl.com/fun/006_hindley_milner.html) section
of _Write You a Haskell_.
Let's start with the types of constants - those are pretty obvious. The constant `3` is an integer,
and we shall mark it as such: `3 :: Int`. Variables seem like the natural
next step, but they're fairly different. Without outside knowledge, all we can do is say
that a variable has __some__ type. If we stick with Haskell's notation (used in polymorphic types),
we can say a variable has a type `a`, where `a` can be replaced with any other type. If we
know more information, like the fact that `x` was declared to be an integer, we can instead
say that. This tells us that throughout type checking we'll have to keep some kind
of record of names and their associated types.
Next, let's take a look at functions, which are admittedly more interesting
than the previous two examples. I'm not talking about the case of seeing something
like a function name `f`. This is the same as the variable case - we don't even know
it's a function unless there is context, and if there __is__ context, then that
context is probably the most useful information we have. I'm talking about
something like the application of a function to another value, in the form
`f x`. In this case, we know that `f :: a -> b`, a function from something
to something. However, we know even more. For this program to be correct,
the `a` in `f :: a -> b`, and the type of `x` (let's call it `c`), must
be compatible. In order to do that, we will use what can be considered
simplified [unification](https://en.wikipedia.org/wiki/Unification_(computer_science)).
Conceptually, what this means is that we will attempt to perform substitutions
in various equations in search of a solution.
#### Basic Examples
Let's try an example. We'll try to determine the type of the following
expression, and extract any other information from this expression
that we might use later.
```
foo 320 6
```
In out parse tree, this will be represented as `(foo 320) 6`, meaning
that the outermost application will be at the top. Let's assume
we know __nothing__ about `foo`.
To figure out the type of the application, we will need to know the types
of the thing being applied, and the thing that serves as the argument.
The latter is easy: the type of `6` is `Int`. Before we proceed
into the left child of the application, there's one more
piece of information we can deduce: since `foo 320` is applied to
an argument, it has to be of type `a -> b`.
Let's proceed to the left child. It's another application, this time
of `foo` to `320`. Again, the right child is simple: the type of
`320` is `Int`. Again, we know that `foo` has to have a type
`c -> d` (we're using different variable names to avoid ambiguity).
Now, we need to combine the pieces of information that we have. Since
`foo :: c -> d`, we know that its first parameter __must__ be of
type `c`. We also know that its first parameter is of type `Int`.
The only way for both of these to be tree is for `c = Int`.
This also tells us that `foo :: Int -> d`. Finally,
since `foo` has now been applied to its first argument,
we know that the `foo 320 :: d`.
We've done what we can from this innermost application; it's time to return
to the outermost one. We now know that the left child is of type `d`, and
that it also has to be of type `a -> b`. The only way for this to be true
is for `d = a -> b`. So, `foo 320` is a function from `a` to `b`.
Again, we can conclude the first parameter has to be of type
`a`. We also know that the first parameter is of type `Int`. Clearly,
this means that `a = Int`. After the application, we know
that the whole expression has some type `d`.
Let's revisit what we know about `foo`. Last time we checked in on it,
`foo` was of type `Int -> d`. But since we know that `d = a -> b`,
and that `a = Int`, we can now say that `foo :: Int -> Int -> b`.
We haven't found any issues with the expression, and we learned
some new information about the type of `foo`. Awesome!
Let's apply this to a simplified example from the beginning of this post.
Let's check the type of:
```
1 2
```
Once again, the application is what we see first. The right child
of the application, just like in the previous example, is `Int`.
We also kno that since `1` is being applied as a function,
its type must be `a -> b`. However, we also know that the left
child, being a number, is also of type `Int`! There's no
way to combine `a -> b` with `Int`, and thus, there is no solution
we can find for the type of `1 2`. This means our program is invalid.
We can toss it away, give an error, and exit.
#### Some Notation
If you go to the Wikipedia page on the Hindley-Milner type system,
you will see quite a lot of symbols and greek letters. This is a __good thing__,
because the way that I presented to you the rules for figuring out
types of expressions is very verbose. You have to read several
paragraphs of text, and that's only for the first three logical
rules! If you're anything like me, you want to be able to read just
the important parts, and with some notation, I'll be able to show you
these important parts concisely, while continuing to explain
the rules in detail in paragraphs of text.
Let's start with inference rules. An inference rule is an expression in
the form:
$$
\\frac{A\_1 \\ldots A\_n} {B\_1 \\ldots B\_m}
$$
This reads, "given that the premises \\(A\_1\\) through \\(A\_n\\) are true,
it holds that the conclusions \\(B\_1\\) through \\(B\_n\\) are true".
For example, we can have the following inference rule:
$$
\\frac
{\\text{if it's cold, I wear a jacket} \\quad \\text{it's cold}}
{\\text{I wear a jacket}}
$$
Since you wear a jacket when it's cold, and it's cold, we can conclude
that you will wear a jacket.
When talking about type systems, it's common to represent a type with \\(\\tau\\).
The letter, which is the greek character "tau", is used as a placeholder for
some __concrete type__. It's kind of like a template, to be filled in
with an actual value. When we plug in an actual value into a rule containing
\\(\\tau\\), we say we are __instantiating__ it. Similarly, we will use
\\(e\\) to serve as a placeholder for an expression (matched by our
\\(A\_{add}\\) grammar rule from part 2). Next, we have the typing relation,
written as \\(e:\\tau\\). This says that "expression \\(e\\) has the type
\\(\\tau\\)".
Alright, this is enough to get us started with some typing rules.
Let's start with one for numbers. If we define \\(n\\) to mean
"any expression that is a just a number, like 3, 2, 6, etc.",
we can write the typing rule as follows:
$$
\\frac{}{n : \\text{Int}}
$$
There's nothing above the line -
there are no premises that are needed for a number to
have the type `Int`.
Now, let's move on to the rule for function application:
$$
\\frac
{e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad e_2 : \\tau_1}
{e_1 \\; e_2 : \\tau\_2}
$$
It's the variable rule that forces us to adjust our notation.
Our rules don't take into account the context that we've
already discussed. Let's fix that! It's convention
to use the symbol \\(\\Gamma\\) for the context. We then
add notation to say, "using the context \\(\\Gamma\\),
we can deduce that \\(e\\) has type \\(\\tau\\)". We will
write this as \\(\\Gamma \\vdash e : \\tau\\).
But what __is__ our context? It's just a set of pairs
in the form \\(x : \\tau\\), where \\(x\\) represents
a variable name. Each pair tells us that the variable
\\(x\\) is known to have a type \\(\\tau\\).
Since \\(\\Gamma\\) is just a regular set, we can
write \\(x : \\tau \\in \\Gamma\\), meaning that
in the current context, it is known that \\(x\\)
has the type \\(\\tau\\).
Let's update our rules with this new addition.
The integer rule just needs a slight adjustment:
$$
\frac{}{\\Gamma \\vdash n : \\text{Int}}
$$
The same is true for the application rule:
$$
\frac
{\\Gamma \\vdash e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad \\Gamma \\vdash e_2 : \\tau_1}
{\\Gamma \\vdash e_1 \\; e_2 : \\tau\_2}
$$
And finally, we can represent the variable rule:
$$
\\frac{x : \\tau \\in \\Gamma}{\\Gamma \\vdash x : \\tau}
$$
In these three expressions, we've captured all of the rules
that we've seen so far. It's important to know
that These rules leave out the process
of unification altogether: we use unification to find
types that satisfy the rules.
#### Checking More Expressions
So far, we've only checked types of numbers, applications, and variables.
Our language has more than that, though!