Add more content for the type checking post
This commit is contained in:
parent
d1a84b6f30
commit
fce8e75dde
245
content/blog/03_compiler_trees.md
Normal file
245
content/blog/03_compiler_trees.md
Normal file
@ -0,0 +1,245 @@
|
||||
---
|
||||
title: Compiling a Functional Language Using C++, Part 3 - Operations On Trees
|
||||
date: 2019-08-06T14:26:38-07:00
|
||||
draft: true
|
||||
tags: ["C and C++", "Functional Languages", "Compilers"]
|
||||
---
|
||||
I called tokenizing and parsing boring, but I think I failed to articulate
|
||||
the real reason that I feel this way. The thing is, looking at syntax
|
||||
is a pretty shallow measure of how interesting a language is. It's like
|
||||
the cover of a book. Every language has one, and it so happens that to make
|
||||
our "book", we need to start with making the cover. But the content of the book
|
||||
is what matters, and that's where we've arrived now. We must make decisions
|
||||
about our language, and give meaning to programs written in it. But before
|
||||
we can give our programs meaning, we need to make sense of the current domain
|
||||
of programs that we receive from our parser. Let's consider a few wonderful examples.
|
||||
|
||||
```
|
||||
defn main = { plus 320 6 }
|
||||
defn plus x y = { x + y }
|
||||
```
|
||||
|
||||
This is a valid program, as far as we're concerned. But are __all__
|
||||
programs we get from the parser valid? See for yourself:
|
||||
|
||||
```
|
||||
data Bool = { True, False }
|
||||
defn main { 3 + True }
|
||||
```
|
||||
|
||||
Obviously, that's not right. The parser accepts it - this matches our grammar.
|
||||
But giving meaning to this program is not easy, since we have no clear
|
||||
way of adding 3 and some data type. Similarly:
|
||||
|
||||
```
|
||||
defn main { 1 2 3 4 5 }
|
||||
```
|
||||
|
||||
What is this? It's a sequence of applications, starting with `1 2`. Numbers
|
||||
are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type `1`
|
||||
would need to have for this program to be valid.
|
||||
|
||||
Before we give meaning to programs in our language, we'll need to toss away the ones
|
||||
that don't make sense. To do so, we will use type checking. During the process of type
|
||||
checking, we will collect information about various parts of our abstract syntax trees,
|
||||
classifying them by the types of values they create. Using this information, we'll be able
|
||||
to throw away blatantly incorrect programs.
|
||||
|
||||
### Basic Type Checking
|
||||
You may have noticed in the very first post that I have chosen to avoid polymorphism.
|
||||
This will significantly simplify our type checking algorithm. If a more robust
|
||||
algorithm is desired, take a look at the
|
||||
[Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system).
|
||||
Personally, I enjoyed [this](http://dev.stephendiehl.com/fun/006_hindley_milner.html) section
|
||||
of _Write You a Haskell_.
|
||||
|
||||
Let's start with the types of constants - those are pretty obvious. The constant `3` is an integer,
|
||||
and we shall mark it as such: `3 :: Int`. Variables seem like the natural
|
||||
next step, but they're fairly different. Without outside knowledge, all we can do is say
|
||||
that a variable has __some__ type. If we stick with Haskell's notation (used in polymorphic types),
|
||||
we can say a variable has a type `a`, where `a` can be replaced with any other type. If we
|
||||
know more information, like the fact that `x` was declared to be an integer, we can instead
|
||||
say that. This tells us that throughout type checking we'll have to keep some kind
|
||||
of record of names and their associated types.
|
||||
|
||||
Next, let's take a look at functions, which are admittedly more interesting
|
||||
than the previous two examples. I'm not talking about the case of seeing something
|
||||
like a function name `f`. This is the same as the variable case - we don't even know
|
||||
it's a function unless there is context, and if there __is__ context, then that
|
||||
context is probably the most useful information we have. I'm talking about
|
||||
something like the application of a function to another value, in the form
|
||||
`f x`. In this case, we know that `f :: a -> b`, a function from something
|
||||
to something. However, we know even more. For this program to be correct,
|
||||
the `a` in `f :: a -> b`, and the type of `x` (let's call it `c`), must
|
||||
be compatible. In order to do that, we will use what can be considered
|
||||
simplified [unification](https://en.wikipedia.org/wiki/Unification_(computer_science)).
|
||||
Conceptually, what this means is that we will attempt to perform substitutions
|
||||
in various equations in search of a solution.
|
||||
|
||||
#### Basic Examples
|
||||
Let's try an example. We'll try to determine the type of the following
|
||||
expression, and extract any other information from this expression
|
||||
that we might use later.
|
||||
|
||||
```
|
||||
foo 320 6
|
||||
```
|
||||
|
||||
In out parse tree, this will be represented as `(foo 320) 6`, meaning
|
||||
that the outermost application will be at the top. Let's assume
|
||||
we know __nothing__ about `foo`.
|
||||
|
||||
To figure out the type of the application, we will need to know the types
|
||||
of the thing being applied, and the thing that serves as the argument.
|
||||
The latter is easy: the type of `6` is `Int`. Before we proceed
|
||||
into the left child of the application, there's one more
|
||||
piece of information we can deduce: since `foo 320` is applied to
|
||||
an argument, it has to be of type `a -> b`.
|
||||
|
||||
Let's proceed to the left child. It's another application, this time
|
||||
of `foo` to `320`. Again, the right child is simple: the type of
|
||||
`320` is `Int`. Again, we know that `foo` has to have a type
|
||||
`c -> d` (we're using different variable names to avoid ambiguity).
|
||||
|
||||
Now, we need to combine the pieces of information that we have. Since
|
||||
`foo :: c -> d`, we know that its first parameter __must__ be of
|
||||
type `c`. We also know that its first parameter is of type `Int`.
|
||||
The only way for both of these to be tree is for `c = Int`.
|
||||
This also tells us that `foo :: Int -> d`. Finally,
|
||||
since `foo` has now been applied to its first argument,
|
||||
we know that the `foo 320 :: d`.
|
||||
|
||||
We've done what we can from this innermost application; it's time to return
|
||||
to the outermost one. We now know that the left child is of type `d`, and
|
||||
that it also has to be of type `a -> b`. The only way for this to be true
|
||||
is for `d = a -> b`. So, `foo 320` is a function from `a` to `b`.
|
||||
Again, we can conclude the first parameter has to be of type
|
||||
`a`. We also know that the first parameter is of type `Int`. Clearly,
|
||||
this means that `a = Int`. After the application, we know
|
||||
that the whole expression has some type `d`.
|
||||
|
||||
Let's revisit what we know about `foo`. Last time we checked in on it,
|
||||
`foo` was of type `Int -> d`. But since we know that `d = a -> b`,
|
||||
and that `a = Int`, we can now say that `foo :: Int -> Int -> b`.
|
||||
|
||||
We haven't found any issues with the expression, and we learned
|
||||
some new information about the type of `foo`. Awesome!
|
||||
|
||||
Let's apply this to a simplified example from the beginning of this post.
|
||||
Let's check the type of:
|
||||
```
|
||||
1 2
|
||||
```
|
||||
Once again, the application is what we see first. The right child
|
||||
of the application, just like in the previous example, is `Int`.
|
||||
We also kno that since `1` is being applied as a function,
|
||||
its type must be `a -> b`. However, we also know that the left
|
||||
child, being a number, is also of type `Int`! There's no
|
||||
way to combine `a -> b` with `Int`, and thus, there is no solution
|
||||
we can find for the type of `1 2`. This means our program is invalid.
|
||||
We can toss it away, give an error, and exit.
|
||||
|
||||
#### Some Notation
|
||||
If you go to the Wikipedia page on the Hindley-Milner type system,
|
||||
you will see quite a lot of symbols and greek letters. This is a __good thing__,
|
||||
because the way that I presented to you the rules for figuring out
|
||||
types of expressions is very verbose. You have to read several
|
||||
paragraphs of text, and that's only for the first three logical
|
||||
rules! If you're anything like me, you want to be able to read just
|
||||
the important parts, and with some notation, I'll be able to show you
|
||||
these important parts concisely, while continuing to explain
|
||||
the rules in detail in paragraphs of text.
|
||||
|
||||
Let's start with inference rules. An inference rule is an expression in
|
||||
the form:
|
||||
$$
|
||||
\\frac{A\_1 \\ldots A\_n} {B\_1 \\ldots B\_m}
|
||||
$$
|
||||
This reads, "given that the premises \\(A\_1\\) through \\(A\_n\\) are true,
|
||||
it holds that the conclusions \\(B\_1\\) through \\(B\_n\\) are true".
|
||||
|
||||
For example, we can have the following inference rule:
|
||||
$$
|
||||
\\frac
|
||||
{\\text{if it's cold, I wear a jacket} \\quad \\text{it's cold}}
|
||||
{\\text{I wear a jacket}}
|
||||
$$
|
||||
|
||||
Since you wear a jacket when it's cold, and it's cold, we can conclude
|
||||
that you will wear a jacket.
|
||||
|
||||
When talking about type systems, it's common to represent a type with \\(\\tau\\).
|
||||
The letter, which is the greek character "tau", is used as a placeholder for
|
||||
some __concrete type__. It's kind of like a template, to be filled in
|
||||
with an actual value. When we plug in an actual value into a rule containing
|
||||
\\(\\tau\\), we say we are __instantiating__ it. Similarly, we will use
|
||||
\\(e\\) to serve as a placeholder for an expression (matched by our
|
||||
\\(A\_{add}\\) grammar rule from part 2). Next, we have the typing relation,
|
||||
written as \\(e:\\tau\\). This says that "expression \\(e\\) has the type
|
||||
\\(\\tau\\)".
|
||||
|
||||
Alright, this is enough to get us started with some typing rules.
|
||||
Let's start with one for numbers. If we define \\(n\\) to mean
|
||||
"any expression that is a just a number, like 3, 2, 6, etc.",
|
||||
we can write the typing rule as follows:
|
||||
$$
|
||||
\\frac{}{n : \\text{Int}}
|
||||
$$
|
||||
|
||||
There's nothing above the line -
|
||||
there are no premises that are needed for a number to
|
||||
have the type `Int`.
|
||||
|
||||
Now, let's move on to the rule for function application:
|
||||
$$
|
||||
\\frac
|
||||
{e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad e_2 : \\tau_1}
|
||||
{e_1 \\; e_2 : \\tau\_2}
|
||||
$$
|
||||
|
||||
It's the variable rule that forces us to adjust our notation.
|
||||
Our rules don't take into account the context that we've
|
||||
already discussed. Let's fix that! It's convention
|
||||
to use the symbol \\(\\Gamma\\) for the context. We then
|
||||
add notation to say, "using the context \\(\\Gamma\\),
|
||||
we can deduce that \\(e\\) has type \\(\\tau\\)". We will
|
||||
write this as \\(\\Gamma \\vdash e : \\tau\\).
|
||||
|
||||
But what __is__ our context? It's just a set of pairs
|
||||
in the form \\(x : \\tau\\), where \\(x\\) represents
|
||||
a variable name. Each pair tells us that the variable
|
||||
\\(x\\) is known to have a type \\(\\tau\\).
|
||||
|
||||
Since \\(\\Gamma\\) is just a regular set, we can
|
||||
write \\(x : \\tau \\in \\Gamma\\), meaning that
|
||||
in the current context, it is known that \\(x\\)
|
||||
has the type \\(\\tau\\).
|
||||
|
||||
Let's update our rules with this new addition.
|
||||
|
||||
The integer rule just needs a slight adjustment:
|
||||
$$
|
||||
\frac{}{\\Gamma \\vdash n : \\text{Int}}
|
||||
$$
|
||||
|
||||
The same is true for the application rule:
|
||||
$$
|
||||
\frac
|
||||
{\\Gamma \\vdash e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad \\Gamma \\vdash e_2 : \\tau_1}
|
||||
{\\Gamma \\vdash e_1 \\; e_2 : \\tau\_2}
|
||||
$$
|
||||
|
||||
And finally, we can represent the variable rule:
|
||||
$$
|
||||
\\frac{x : \\tau \\in \\Gamma}{\\Gamma \\vdash x : \\tau}
|
||||
$$
|
||||
|
||||
In these three expressions, we've captured all of the rules
|
||||
that we've seen so far. It's important to know
|
||||
that These rules leave out the process
|
||||
of unification altogether: we use unification to find
|
||||
types that satisfy the rules.
|
||||
|
||||
#### Checking More Expressions
|
||||
So far, we've only checked types of numbers, applications, and variables.
|
||||
Our language has more than that, though!
|
Loading…
Reference in New Issue
Block a user