blog-static/content/blog/03_compiler_trees.md

608 lines
26 KiB
Markdown

---
title: Compiling a Functional Language Using C++, Part 3 - Type Checking
date: 2019-08-06T14:26:38-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
I called tokenizing and parsing boring, but I think I failed to articulate
the real reason that I feel this way. The thing is, looking at syntax
is a pretty shallow measure of how interesting a language is. It's like
the cover of a book. Every language has one, and it so happens that to make
our "book", we need to start with making the cover. But the content of the book
is what matters, and that's where we've arrived now. We must make decisions
about our language, and give meaning to programs written in it. But before
we can give our programs meaning, we need to make sense of the current domain
of programs that we receive from our parser. Let's consider a few wonderful examples.
```
defn main = { plus 320 6 }
defn plus x y = { x + y }
```
This is a valid program, as far as we're concerned. But are __all__
programs we get from the parser valid? See for yourself:
```
data Bool = { True, False }
defn main = { 3 + True }
```
Obviously, that's not right. The parser accepts it - this matches our grammar.
But giving meaning to this program is not easy, since we have no clear
way of adding 3 and some data type. Similarly:
```
defn main = { 1 2 3 4 5 }
```
What is this? It's a sequence of applications, starting with `1 2`. Numbers
are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type numbers
would need to have for this program to be valid (though perhaps one could come up with one).
Before we give meaning to programs in our language, we'll need to toss away the ones
that don't make sense. To do so, we will use type checking. During the process of type
checking, we will collect information about various parts of our abstract syntax trees,
classifying them by the types of values they create. Using this information, we'll be able
to throw away blatantly incorrect programs.
### Basic Type Checking
You may have noticed in the very first post that I have chosen to avoid polymorphism.
This will simplify our type checking algorithm. If a more robust
algorithm is desired, take a look at the
[Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system).
Personally, I enjoyed [this](http://dev.stephendiehl.com/fun/006_hindley_milner.html) section
of _Write You a Haskell_, which I used to sanity check this very post.
Let's start with the types of constants - those are pretty obvious. The constant `3` is an integer,
and we shall mark it as such: `3 :: Int`. Variables seem like the natural
next step, but they're fairly different. Without outside knowledge, we can't
tell what type a variable `x` is. If we know more information, like the fact that `x`
was declared to be an integer, we can instead say that. This tells us that throughout
type checking we'll have to keep some kind of record of names and their associated types.
Next, let's take a look at functions, which are admittedly more interesting
than the previous two examples. I'm not talking about the case of seeing something
like a function name `f`. This is the same as the variable case - we don't even know
it's a function unless there is context, and if there __is__ context, then that
context is probably the most useful information we have. I'm talking about
something like the application of a function to another value, in the form
`f x`. In this case, we know that for the program to be valid,
`f` must have the type `a -> b`, a function from something
to something. Furthermore, since `f` is being applied to `x`,
the type of `x` (let's call it `c`) must be the same as the type `a`.
Our rules are getting more complicated. In order to check that they hold, we will use
[unification](https://en.wikipedia.org/wiki/Unification_(computer_science)).
This means that we'll be creating various equations (such as "the type of `f` is equal to `a -> b`"),
and finding substitutions to solve these equations. If we're unable to
find a solution to our equations, the program is invalid and we throw it away.
#### Basic Examples
Let's try an example. We'll try to determine the type of the following
expression, and extract any other information from this expression
that we might use later.
```
foo 320 6
```
In out parse tree, this will be represented as `(foo 320) 6`, meaning
that the outermost application will be at the top. Let's assume
we know only that `foo` is defined.
To figure out the type of the application, we will need to know the types
of the thing being applied, and the thing that serves as the argument.
The latter is easy: the type of `6` is `Int`. Before we proceed
into the left child of the application, there's one more
piece of information we can deduce: since `foo 320` is applied to
an argument, it has to be of type `a -> b`.
Let's proceed to the left child. It's another application, this time
of `foo` to `320`. Again, the right child is simple: the type of
`320` is `Int`. Again, we know that `foo` has to have a type
`c -> d` (we're using different variable names since the types
of `foo` and `foo 320` are not the same).
Now, we need to combine the pieces of information that we have. Since
`foo :: c -> d`, we know that its first parameter __must__ be of
type `c`. We also know that its first parameter is of type `Int`.
The only way for both of these to be true is for `c = Int`.
This also tells us that `foo :: Int -> d`. Finally,
since `foo` has now been applied to its first argument,
we know that the `foo 320 :: d`.
We've done what we can from this innermost application; it's time to return
to the outermost one. We now know that the left child is of type `d`, and
that it also has to be of type `a -> b`. The only way for this to be true
is for `d = a -> b`. So, `foo 320` is a function from `a` to `b`.
Again, we can conclude the first parameter has to be of type
`a`. We also know that the first parameter is of type `Int`. Clearly,
this means that `a = Int`. After the application, we know
that the whole expression has some type `b`.
Let's revisit what we know about `foo`. Last time we checked in on it,
`foo` was of type `Int -> d`. But since we know that `d = a -> b`,
and that `a = Int`, we can now say that `foo :: Int -> Int -> b`.
We haven't found any issues with the expression, and we learned
some new information about the type of `foo`. Awesome!
Let's apply this to a simplified example from the beginning of this post.
Let's check the type of:
```
1 2
```
Once again, the application is what we see first. The right child
of the application, just like in the previous example, is `Int`.
We also kno that since `1` is being applied as a function,
its type must be `a -> b`. However, we also know that the left
child, being a number, is also of type `Int`! There's no
way to combine `a -> b` with `Int`, and thus, there is no solution
we can find for the type of `1 2`. This means our program is invalid.
We can toss it away, give an error, and exit.
#### Some Notation
If you go to the Wikipedia page on the Hindley-Milner type system,
you will see quite a lot of symbols and greek letters. This is a __good thing__,
because the way that I presented to you the rules for figuring out
types of expressions is very verbose. You have to read several
paragraphs of text, and that's only for the first three logical
rules! If you're anything like me, you want to be able to read just
the important parts, and with some notation, I'll be able to show you
these important parts concisely, while continuing to explain
the rules in detail in paragraphs of text.
Let's start with inference rules. An inference rule is an expression in
the form:
$$
\\frac{A\_1 \\ldots A\_n} {B\_1 \\ldots B\_m}
$$
This reads, "given that the premises \\(A\_1\\) through \\(A\_n\\) are true,
it holds that the conclusions \\(B\_1\\) through \\(B\_m\\) are true".
For example, we can have the following inference rule:
$$
\\frac
{\\text{if it's cold, I wear a jacket} \\quad \\text{it's cold}}
{\\text{I wear a jacket}}
$$
Since you wear a jacket when it's cold, and it's cold, we can conclude
that you will wear a jacket.
When talking about type systems, it's common to represent a type with \\(\\tau\\).
The letter, which is the greek character "tau", is used as a placeholder for
some __concrete type__. It's kind of like a template, to be filled in
with an actual value. When we plug in an actual value into a rule containing
\\(\\tau\\), we say we are __instantiating__ it. Similarly, we will use
\\(e\\) to serve as a placeholder for an expression (matched by our
\\(A\_{add}\\) grammar rule from part 2). Next, we have the typing relation,
written as \\(e:\\tau\\). This says that "expression \\(e\\) has the type
\\(\\tau\\)".
Alright, this is enough to get us started with some typing rules.
Let's start with one for numbers. If we define \\(n\\) to mean
"any expression that is a just a number, like 3, 2, 6, etc.",
we can write the typing rule as follows:
$$
\\frac{}{n : \\text{Int}}
$$
There's nothing above the line -
there are no premises that are needed for a number to
have the type `Int`.
Now, let's move on to the rule for function application:
$$
\\frac
{e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad e_2 : \\tau_1}
{e_1 \\; e_2 : \\tau\_2}
$$
This rule includes everything we've said before:
the thing being applied has to have a function type
(\\(\\tau\_1 \\rightarrow \\tau\_2\\)), and
the expression the function is applied to
has to have the same type \\(\\tau\_1\\) as the
left type of the function.
It's the variable rule that forces us to adjust our notation.
Our rules don't take into account the context that we've
already discussed, and thus, we can't bring
in any outside information. Let's fix that! It's convention
to use the symbol \\(\\Gamma\\) for the context. We then
add notation to say, "using the context \\(\\Gamma\\),
we can deduce that \\(e\\) has type \\(\\tau\\)". We will
write this as \\(\\Gamma \\vdash e : \\tau\\).
But what __is__ our context? We can think of it
as a mapping from variable names to their known types. We
can represent such a mapping using a set of pairs
in the form \\(x : \\tau\\), where \\(x\\) represents
a variable name.
Since \\(\\Gamma\\) is just a regular set, we can
write \\(x : \\tau \\in \\Gamma\\), meaning that
in the current context, it is known that \\(x\\)
has the type \\(\\tau\\).
Let's update our rules with this new addition.
The integer rule just needs a slight adjustment:
$$
\frac{}{\\Gamma \\vdash n : \\text{Int}}
$$
The same is true for the application rule:
$$
\frac
{\\Gamma \\vdash e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad \\Gamma \\vdash e_2 : \\tau_1}
{\\Gamma \\vdash e_1 \\; e_2 : \\tau\_2}
$$
And finally, we can represent the variable rule:
$$
\\frac{x : \\tau \\in \\Gamma}{\\Gamma \\vdash x : \\tau}
$$
In these three expressions, we've captured all of the rules
that we've seen so far. It's important to know
that these rules leave out the process
of unification altogether: we use unification to find
types that satisfy the rules.
#### Checking Case Expressions
So far, we've only checked types of numbers, applications, and variables.
Our language has more than that, though!
Binary operators are by far the simplest to extend our language with;
We can simply say that `(+)` is a function, `Int -> Int -> Int`, and
`x+y` is the same as `(+) x y`. This way, we simply translate
operators into function application, and the same typing rules apply.
Next up, we have case expressions. This is one of the two places
where we will introduce new variables into the context, and
also a place where we will need several rules.
Let's first take a look at the whole case expression rule:
$$
\\frac
{\\Gamma \\vdash e : \\tau \\quad \\text{matcht}(\\tau, p\_i) = b\_i \\quad \\Gamma,b\_i \\vdash e\_i : \\tau\_c}
{\\Gamma \\vdash \\text{case} \\; e \\; \\text{of} \; \\\{ (p\_1,e\_1) \\ldots (p\_n, e\_n) \\\} : \\tau\_c }
$$
This is a lot more complicated than the other rules we've seen, and we've used some notation
that we haven't seen before. Let's take this step by step:
1. \\(e : \\tau\\), in this case, means that the expression between `case` and `of`, is of type \\(\\tau\\).
2. \\(\\text{matcht}(\\tau, p\_i) = b\_i\\) means that the pattern \\(p\_i\\) can match a value of type
\\(\\tau\\), producing additional type pairs \\(b\_i\\). We need \\(b\_i\\) because a pattern
such as `Cons x xs` will introduce new type information, namely \\(\text{x} : \text{Int}\\) and \\(\text{xs} : \text{List}\\).
3. \\(\\Gamma,b\_i \\vdash e\_i : \\tau\_c\\) means that each individual branch can be deduced to have the type
\\(\\tau\_c\\), using the previously existing context \\(\\Gamma\\), with the addition of \\(b\_i\\), the new type information.
4. Finally, the conclusion is that the case expression, if all the premises are met, is of type \\(\\tau\_c\\).
For completeness, let's add rules for \\(\\text{matcht}(\\tau, p\_i) = b\_i\\). We'll need two: one for
the "basic" pattern, which always matches the value and binds a variable to it, and one
for a constructor pattern, that matches a constructor and its parameters.
Let's define \\(v\\) to be a variable name in the context of a pattern. For the basic pattern:
$$
\\frac
{}
{\\text{matcht}(\\tau, v) = \\\{v : \\tau \\\}}
$$
For the next rule, let's define \\(c\\) to be a constructor name. The rule for the constructor pattern, then, is:
$$
\\frac
{\\Gamma \\vdash c : \\tau\_1 \\rightarrow ... \\rightarrow \\tau\_n \\rightarrow \\tau}
{\\text{matcht}(\\tau, c \; v\_1 ... v\_n) = \\{ v\_1 : \\tau\_1, ..., v\_n : \\tau\_n \\}}
$$
This rule means that whenever we have a pattern in the form of a constructor applied to
\\(n\\) variable names, if the constructor takes \\(n\\) arguments of types \\(\\tau\_1\\)
through \\(\\tau\_n\\), then the each variable will have a corresponding type.
We didn't include lambda expressions in our syntax, and thus we won't need typing rules for them,
so it actually seems like we're done with the first draft of our type rules.
#### Implementation
Let's work towards some code. Before we write anything down though, let's
get a definition of what a "type" is, in the context of our type checker.
Let's say a type is one of 3 things:
1. A "base type", like `Int`, `Bool`, or `List`.
2. A type that's a function from one type to another.
3. A placeholder / type variable (like the kind we used for type inference).
We represent a plceholder type with a unique string, such as "a", or "b",
and this makes our placeholder type class very similar to the base type class.
{{< codeblock "C++" "compiler/03/type.hpp" >}}
As you can see, we also declared a `type_mgr`, or type manager class.
This class will keep the state used for generating more placeholder
type names, as well as the information about which
placeholder type is mapped to what. We gave it 3 methods:
* `unify`, to perform unification. It will take two types and
find values for placeholder variables such that they can
equal.
* `resolve`, to get to the "bottom" of a chain of equations.
For instance, we have placeholder `a` be mapped to a placeholder
`b`, an finally, the placeholder `b` to be mapped to `Int`.
`resolve` will return for us `Int`, and, if the "bottom"
of the chain is a placeholder, it will set `var` to be a pointer
to that placeholder.
* `bind`, inspired by [this post](http://dev.stephendiehl.com/fun/006_hindley_milner.html),
will map a type variable of some name to a type. This function will also check if
we're binding a type variable to itself, and do nothing in that case,
since `a = a` is not a very useful equation to have.
To fit its original purpose, we also give the manager class the methods
`new_type_name`, and two convenience methods to create placeholder types,
`new_type` (in the form `a`) and `new_arrow_type` (in the form `a->b`).
Let's take a look at the implementation now:
{{< codeblock "C++" "compiler/03/type.cpp" >}}
Here, `new_type_name` is actually pretty boring. My goal
was to generate type names like `a`, then `b`, eventually getting
to `z`, and then move on to `aa`. This provides is with an
endless stream of placeholder types.
Time for the interesting functions. `resolve` keeps
trying `dynamic_cast` to a type variable, and if that succeeds,
then either:
1. It's a type variable that's already been set
to something, in which case we try resolve the thing it was
set to (`t = it->second`)
2. It's a type variable that hasn't been set to something.
We set `var` to it (the caller will use this info),
and stop our resolution loop (`break`).
In `unify`, we start by calling `resolve` - we don't want
to accidentally compare `a` and `b` (and try to bind `a` to
`b`) when `a` is already bound to something else (like `Int`).
From resolve, we will have `lvar` and `rvar` set to
something not NULL if `l` or `r` were type variables
that haven't yet been bound (we defined `resolve` to behave this way).
So, if one of the variables is not NULL, we try to bind it.
Next, `unify` checks if both types are either base types or
arrow types. If they're base types, it compares their names,
and if they're arrow types, it recursively unifies their children.
We return in all cases when unification succeeds, and then throw
an exception (currently 0) if all the cases fell thorugh, and thus,
unification failed.
Finally, `bind` places the type we're binding to into
the `types` map, but not before it checks that the type
we're binding is the same as the string we're binding it to
(since, again, `a=a` is not a useful equation).
We now have a unification algorithm, but we still
need to implement our rules. Our rules
usually include three things: an environment
\\(\\Gamma\\), an expression \\(e\\),
and a type \\(\\tau\\). We will
represent this as a method on `ast`, our struct
for an expression tree. This
method will take an environment and return
a type.
##### Environment
How should we implement our environment?
Conceptually, an environment maps a name string
to a type. So naively, we can implement this simply
using an `std::map`. But observe
that we only extend the environment in one case so far:
a case expression. In a case expression, we have the base
envrionment \\(\\Gamma\\), and for each branch,
we extend it with the bindings produced by
the pattern match. Each branch receives a modified
copy of the original environment, one that
doesn't see the effects of the other branches.
Using our naive approach, we'd create a new `std::map` for
each branch that's a copy of the original environment,
and place into it the new pairs. But this means we'll
need to copy a map for each branch of the pattern!
There's a better way. We structure our environment like
a linked list. Each node in the linked list
contains an `std::map`. When we encounter a new
scope (such as in a case branch), we create a new such node, and add all
variables introduced in this scope to that node's map. We make
it point to our current environment. Then, we pass around the new node
as the environment.
When we look up a variable name, we first look in this node we created.
If we don't find the variable we're looking for, we move on to the next
node. The benefit of this is that we won't be re-creating a map
for each branch, and just creating a node with the changes.
Let's implement exactly that. the header:
{{< codeblock "C++" "compiler/03/env.hpp" >}}
And the source file:
{{< codeblock "C++" "compiler/03/env.cpp" >}}
Nothing should seem too surprising. Of note is the fact
that we're not using smart pointers for `scope`,
and that the child we create during the call
would be invalid if the parent goes out of scope
/ is released. We're gearing this towards
creating new environments on the stack, and we'll
take care not to let a parent go out of scope
before the child.
##### Typechecking Expressions
At last, it's time to declare a new type checking method.
We start with with a signature inside `ast`:
```
virtual type_ptr typecheck(type_mgr& mgr, const type_env& env) const;
```
We also implement the \\(\\text{matchp}\\) function
as a method `match` on `pattern` with the following signature:
```
virtual void match(type_ptr t, type_mgr& mgr, type_env& env) const;
```
We declare this in every subclass of `ast`. Let's take a look
at the implementation now:
{{< codeblock "C++" "compiler/03/ast.cpp" >}}
The `typecheck` implementation for `ast_int`, `ast_lid`, and `ast_uid`
is quite intuitive. The type of a number is always `Int`, and varible
names we simply look up in the environment.
For `ast_binop` and `ast_app`, we check the types of the children first,
and store their resulting types. In the case of `binop`, we assume
that we have the type of the binary operator (such as `+`) in the
environment, and throw if it isn't. Then, we know that
the operator has to be a function of at least two arguments,
with the types of the left and right children of the application.
We also know its actual type, as it's in the environment. We unify
these two types, and then return.
In the case of `app`, we know that the left side (the thing
being applied), has to be a function from the type of the right
child. We also know its computed type. Once again, we unify the two,
and then return.
The type checking for `case` offloads most of the work onto the
`match` method on patterns. We start by computing the type
of the expression we're matching. Then, for each branch,
we create a new scope, and populate it with the new bindings
created by the pattern match (`match` does this). Once
we have a new environment, we typecheck the body
of the branch. Finally, we unify the type of the branch's
body with the previous body types (since branches of the
case expression have to have the same type).
Let's take a look at `match` now. The `match` method
for a variable pattern is very simple: it always
introduces a new variable bindings, and stops there.
The `match` method for a constructor pattern is more
interesting. First, it looks up the constructor
that the pattern is trying to match. This needs to exist,
so if we don't find a type for it, we throw. Next,
for each variable name in the pattern, we know
that there has to be a corresponding parameter in the
constructor. Because of this, we cast the constructor type
to an function type (throwing if it isn't one),
and create a new mapping from the variable name to the
left side (parameter) of the function type. We then
move on to examine the right side of the function
type, and the remaining variables of the pattern.
Once we have gone through the parameters,
what remains __should__ be the type of the thing
the constructor creates. Not only that, but
the remaining type should match the type of the value
the pattern is being matched against. To ensure
this, we unify the two types.
There's only one thing that can still go wrong.
The value __and__ the pattern can __both__ be
a partial application. For ease of implementation,
we will not allow this case. Thus, we make sure
the final type is a base type, and throw otherwise.
##### Typechecking Definitions
This looks good, but we're not done yet. We can type
check expressions, but our program ins't
made up of expressions. Rather, it's made up of
definitions. Further, we can't look at the definitions
in isolation. Consider these two functions:
```
defn double x = { x + x }
defn quadruple x = { double (double x) }
```
Assuming we have an environment containing `x` when we typecheck the body
of `double`, our algorithm will work out fine. But what about
`quadruple`? It needs to know what `double` is, or at least that it exists.
We could also envision two mutually recursive functions. Let's
assume we have the functions `eq` and `if` in global scope. We can write
two functions, `even` and `odd`:
```
defn even x = { if (eq x 0) True (odd (x-1)) }
defn odd x = { if (eq x 0) False (even (n-1)) }
```
`odd` needs to know about `even`, and `even` needs
to know about `odd`. Thus, before we do any checking,
we need to populate a global environment with __some__
type for each function we declare. We will
use what we know about the function for our
initial declaration: if the function
takes two parameters, its type will be `a -> b -> c`.
If it takes one parameter, its type will be `a -> b`.
What's more, though, is that we need to make sure
that the function's parameters are passed in the environment
when checking its body, and that these parameters' types
are the same as the placeholder types in the function's
"declaration".
We'll typecheck the program in two passes. First,
we'll go through each definition, and add any
functions it declares to the global scope. Then,
we will go through each definition again, and,
if it's a function, typecheck its body using
the previously fleshed out global scope.
We'll add two functions, `typecheck_first`
and `typecheck_second` corresponding to
these two stages. Their signatures:
```
virtual void typecheck_first(type_mgr& mgr, type_env& env);
virtual void typecheck_second(type_mgr& mgr, const type_env& env) const;
```
Furthermore, in the `definition_defn`, we will keep an
`std::vector` of `type_ptr`, in which the first element is the
type of the __last__ parameter, and so on. We switch around
the order of arguments because we build up the `a -> b -> ...`
type signature from the right (`->` is right associative), and
thus we'll be creating the types right-to-left, too. We also
add a `type_ptr` field which holds the function's return type.
We keep these two things in the `definition_defn` so
that they persist between the two typechecking stages: we want to use
the types from the first stage to aid in checking the body in the second stage.
Here's the code for the implementation:
{{< codeblock "C++" "compiler/03/definition.cpp" >}}
And finally, our updated main:
{{< codeblock "C++" "compiler/03/main.cpp" >}}
Notice that we manually add the types for our binary operators to the environment.
Let's run our project on a few examples. On our two "bad" examples, we get
the very eloquent error:
```
terminate called after throwing an instance of 'int'
[2] 9776 abort (core dumped) ./a.out < bad2.txt
```
That's what we get for throwing 0.
So far, our program has thrown in 100% of cases. Let's verify it actually
accepts valid programs! We'll try our very first example from today,
as well as these two:
{{< rawblock "compiler/03/works2.txt" >}}
{{< rawblock "compiler/03/works3.txt" >}}
All of our examples print the number of declarations in the program,
which means they don't throw 0. And so, we have typechecking!