2019-08-13 21:31:10 -07:00
|
|
|
---
|
2019-08-26 00:13:10 -07:00
|
|
|
title: Compiling a Functional Language Using C++, Part 3 - Type Checking
|
2019-08-13 21:31:10 -07:00
|
|
|
date: 2019-08-06T14:26:38-07:00
|
|
|
|
draft: true
|
|
|
|
tags: ["C and C++", "Functional Languages", "Compilers"]
|
|
|
|
---
|
|
|
|
I called tokenizing and parsing boring, but I think I failed to articulate
|
|
|
|
the real reason that I feel this way. The thing is, looking at syntax
|
|
|
|
is a pretty shallow measure of how interesting a language is. It's like
|
|
|
|
the cover of a book. Every language has one, and it so happens that to make
|
|
|
|
our "book", we need to start with making the cover. But the content of the book
|
|
|
|
is what matters, and that's where we've arrived now. We must make decisions
|
|
|
|
about our language, and give meaning to programs written in it. But before
|
|
|
|
we can give our programs meaning, we need to make sense of the current domain
|
|
|
|
of programs that we receive from our parser. Let's consider a few wonderful examples.
|
|
|
|
|
|
|
|
```
|
|
|
|
defn main = { plus 320 6 }
|
|
|
|
defn plus x y = { x + y }
|
|
|
|
```
|
|
|
|
|
|
|
|
This is a valid program, as far as we're concerned. But are __all__
|
|
|
|
programs we get from the parser valid? See for yourself:
|
|
|
|
|
|
|
|
```
|
|
|
|
data Bool = { True, False }
|
2019-08-26 00:13:10 -07:00
|
|
|
defn main = { 3 + True }
|
2019-08-13 21:31:10 -07:00
|
|
|
```
|
|
|
|
|
|
|
|
Obviously, that's not right. The parser accepts it - this matches our grammar.
|
|
|
|
But giving meaning to this program is not easy, since we have no clear
|
|
|
|
way of adding 3 and some data type. Similarly:
|
|
|
|
|
|
|
|
```
|
2019-08-26 00:13:10 -07:00
|
|
|
defn main = { 1 2 3 4 5 }
|
2019-08-13 21:31:10 -07:00
|
|
|
```
|
|
|
|
|
|
|
|
What is this? It's a sequence of applications, starting with `1 2`. Numbers
|
|
|
|
are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type `1`
|
|
|
|
would need to have for this program to be valid.
|
|
|
|
|
|
|
|
Before we give meaning to programs in our language, we'll need to toss away the ones
|
|
|
|
that don't make sense. To do so, we will use type checking. During the process of type
|
|
|
|
checking, we will collect information about various parts of our abstract syntax trees,
|
|
|
|
classifying them by the types of values they create. Using this information, we'll be able
|
|
|
|
to throw away blatantly incorrect programs.
|
|
|
|
|
|
|
|
### Basic Type Checking
|
|
|
|
You may have noticed in the very first post that I have chosen to avoid polymorphism.
|
|
|
|
This will significantly simplify our type checking algorithm. If a more robust
|
|
|
|
algorithm is desired, take a look at the
|
|
|
|
[Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system).
|
|
|
|
Personally, I enjoyed [this](http://dev.stephendiehl.com/fun/006_hindley_milner.html) section
|
|
|
|
of _Write You a Haskell_.
|
|
|
|
|
|
|
|
Let's start with the types of constants - those are pretty obvious. The constant `3` is an integer,
|
|
|
|
and we shall mark it as such: `3 :: Int`. Variables seem like the natural
|
|
|
|
next step, but they're fairly different. Without outside knowledge, all we can do is say
|
|
|
|
that a variable has __some__ type. If we stick with Haskell's notation (used in polymorphic types),
|
|
|
|
we can say a variable has a type `a`, where `a` can be replaced with any other type. If we
|
|
|
|
know more information, like the fact that `x` was declared to be an integer, we can instead
|
|
|
|
say that. This tells us that throughout type checking we'll have to keep some kind
|
|
|
|
of record of names and their associated types.
|
|
|
|
|
|
|
|
Next, let's take a look at functions, which are admittedly more interesting
|
|
|
|
than the previous two examples. I'm not talking about the case of seeing something
|
|
|
|
like a function name `f`. This is the same as the variable case - we don't even know
|
|
|
|
it's a function unless there is context, and if there __is__ context, then that
|
|
|
|
context is probably the most useful information we have. I'm talking about
|
|
|
|
something like the application of a function to another value, in the form
|
|
|
|
`f x`. In this case, we know that `f :: a -> b`, a function from something
|
|
|
|
to something. However, we know even more. For this program to be correct,
|
|
|
|
the `a` in `f :: a -> b`, and the type of `x` (let's call it `c`), must
|
|
|
|
be compatible. In order to do that, we will use what can be considered
|
|
|
|
simplified [unification](https://en.wikipedia.org/wiki/Unification_(computer_science)).
|
|
|
|
Conceptually, what this means is that we will attempt to perform substitutions
|
|
|
|
in various equations in search of a solution.
|
|
|
|
|
|
|
|
#### Basic Examples
|
|
|
|
Let's try an example. We'll try to determine the type of the following
|
|
|
|
expression, and extract any other information from this expression
|
|
|
|
that we might use later.
|
|
|
|
|
|
|
|
```
|
|
|
|
foo 320 6
|
|
|
|
```
|
|
|
|
|
|
|
|
In out parse tree, this will be represented as `(foo 320) 6`, meaning
|
|
|
|
that the outermost application will be at the top. Let's assume
|
|
|
|
we know __nothing__ about `foo`.
|
|
|
|
|
|
|
|
To figure out the type of the application, we will need to know the types
|
|
|
|
of the thing being applied, and the thing that serves as the argument.
|
|
|
|
The latter is easy: the type of `6` is `Int`. Before we proceed
|
|
|
|
into the left child of the application, there's one more
|
|
|
|
piece of information we can deduce: since `foo 320` is applied to
|
|
|
|
an argument, it has to be of type `a -> b`.
|
|
|
|
|
|
|
|
Let's proceed to the left child. It's another application, this time
|
|
|
|
of `foo` to `320`. Again, the right child is simple: the type of
|
|
|
|
`320` is `Int`. Again, we know that `foo` has to have a type
|
|
|
|
`c -> d` (we're using different variable names to avoid ambiguity).
|
|
|
|
|
|
|
|
Now, we need to combine the pieces of information that we have. Since
|
|
|
|
`foo :: c -> d`, we know that its first parameter __must__ be of
|
|
|
|
type `c`. We also know that its first parameter is of type `Int`.
|
|
|
|
The only way for both of these to be tree is for `c = Int`.
|
|
|
|
This also tells us that `foo :: Int -> d`. Finally,
|
|
|
|
since `foo` has now been applied to its first argument,
|
|
|
|
we know that the `foo 320 :: d`.
|
|
|
|
|
|
|
|
We've done what we can from this innermost application; it's time to return
|
|
|
|
to the outermost one. We now know that the left child is of type `d`, and
|
|
|
|
that it also has to be of type `a -> b`. The only way for this to be true
|
|
|
|
is for `d = a -> b`. So, `foo 320` is a function from `a` to `b`.
|
|
|
|
Again, we can conclude the first parameter has to be of type
|
|
|
|
`a`. We also know that the first parameter is of type `Int`. Clearly,
|
|
|
|
this means that `a = Int`. After the application, we know
|
|
|
|
that the whole expression has some type `d`.
|
|
|
|
|
|
|
|
Let's revisit what we know about `foo`. Last time we checked in on it,
|
|
|
|
`foo` was of type `Int -> d`. But since we know that `d = a -> b`,
|
|
|
|
and that `a = Int`, we can now say that `foo :: Int -> Int -> b`.
|
|
|
|
|
|
|
|
We haven't found any issues with the expression, and we learned
|
|
|
|
some new information about the type of `foo`. Awesome!
|
|
|
|
|
|
|
|
Let's apply this to a simplified example from the beginning of this post.
|
|
|
|
Let's check the type of:
|
|
|
|
```
|
|
|
|
1 2
|
|
|
|
```
|
|
|
|
Once again, the application is what we see first. The right child
|
|
|
|
of the application, just like in the previous example, is `Int`.
|
|
|
|
We also kno that since `1` is being applied as a function,
|
|
|
|
its type must be `a -> b`. However, we also know that the left
|
|
|
|
child, being a number, is also of type `Int`! There's no
|
|
|
|
way to combine `a -> b` with `Int`, and thus, there is no solution
|
|
|
|
we can find for the type of `1 2`. This means our program is invalid.
|
|
|
|
We can toss it away, give an error, and exit.
|
|
|
|
|
|
|
|
#### Some Notation
|
|
|
|
If you go to the Wikipedia page on the Hindley-Milner type system,
|
|
|
|
you will see quite a lot of symbols and greek letters. This is a __good thing__,
|
|
|
|
because the way that I presented to you the rules for figuring out
|
|
|
|
types of expressions is very verbose. You have to read several
|
|
|
|
paragraphs of text, and that's only for the first three logical
|
|
|
|
rules! If you're anything like me, you want to be able to read just
|
|
|
|
the important parts, and with some notation, I'll be able to show you
|
|
|
|
these important parts concisely, while continuing to explain
|
|
|
|
the rules in detail in paragraphs of text.
|
|
|
|
|
|
|
|
Let's start with inference rules. An inference rule is an expression in
|
|
|
|
the form:
|
|
|
|
$$
|
|
|
|
\\frac{A\_1 \\ldots A\_n} {B\_1 \\ldots B\_m}
|
|
|
|
$$
|
|
|
|
This reads, "given that the premises \\(A\_1\\) through \\(A\_n\\) are true,
|
|
|
|
it holds that the conclusions \\(B\_1\\) through \\(B\_n\\) are true".
|
|
|
|
|
|
|
|
For example, we can have the following inference rule:
|
|
|
|
$$
|
|
|
|
\\frac
|
|
|
|
{\\text{if it's cold, I wear a jacket} \\quad \\text{it's cold}}
|
|
|
|
{\\text{I wear a jacket}}
|
|
|
|
$$
|
|
|
|
|
|
|
|
Since you wear a jacket when it's cold, and it's cold, we can conclude
|
|
|
|
that you will wear a jacket.
|
|
|
|
|
|
|
|
When talking about type systems, it's common to represent a type with \\(\\tau\\).
|
|
|
|
The letter, which is the greek character "tau", is used as a placeholder for
|
|
|
|
some __concrete type__. It's kind of like a template, to be filled in
|
|
|
|
with an actual value. When we plug in an actual value into a rule containing
|
|
|
|
\\(\\tau\\), we say we are __instantiating__ it. Similarly, we will use
|
|
|
|
\\(e\\) to serve as a placeholder for an expression (matched by our
|
|
|
|
\\(A\_{add}\\) grammar rule from part 2). Next, we have the typing relation,
|
|
|
|
written as \\(e:\\tau\\). This says that "expression \\(e\\) has the type
|
|
|
|
\\(\\tau\\)".
|
|
|
|
|
|
|
|
Alright, this is enough to get us started with some typing rules.
|
|
|
|
Let's start with one for numbers. If we define \\(n\\) to mean
|
|
|
|
"any expression that is a just a number, like 3, 2, 6, etc.",
|
|
|
|
we can write the typing rule as follows:
|
|
|
|
$$
|
|
|
|
\\frac{}{n : \\text{Int}}
|
|
|
|
$$
|
|
|
|
|
|
|
|
There's nothing above the line -
|
|
|
|
there are no premises that are needed for a number to
|
|
|
|
have the type `Int`.
|
|
|
|
|
|
|
|
Now, let's move on to the rule for function application:
|
|
|
|
$$
|
|
|
|
\\frac
|
|
|
|
{e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad e_2 : \\tau_1}
|
|
|
|
{e_1 \\; e_2 : \\tau\_2}
|
|
|
|
$$
|
|
|
|
|
|
|
|
It's the variable rule that forces us to adjust our notation.
|
|
|
|
Our rules don't take into account the context that we've
|
|
|
|
already discussed. Let's fix that! It's convention
|
|
|
|
to use the symbol \\(\\Gamma\\) for the context. We then
|
|
|
|
add notation to say, "using the context \\(\\Gamma\\),
|
|
|
|
we can deduce that \\(e\\) has type \\(\\tau\\)". We will
|
|
|
|
write this as \\(\\Gamma \\vdash e : \\tau\\).
|
|
|
|
|
|
|
|
But what __is__ our context? It's just a set of pairs
|
|
|
|
in the form \\(x : \\tau\\), where \\(x\\) represents
|
|
|
|
a variable name. Each pair tells us that the variable
|
|
|
|
\\(x\\) is known to have a type \\(\\tau\\).
|
|
|
|
|
|
|
|
Since \\(\\Gamma\\) is just a regular set, we can
|
|
|
|
write \\(x : \\tau \\in \\Gamma\\), meaning that
|
|
|
|
in the current context, it is known that \\(x\\)
|
|
|
|
has the type \\(\\tau\\).
|
|
|
|
|
|
|
|
Let's update our rules with this new addition.
|
|
|
|
|
|
|
|
The integer rule just needs a slight adjustment:
|
|
|
|
$$
|
|
|
|
\frac{}{\\Gamma \\vdash n : \\text{Int}}
|
|
|
|
$$
|
|
|
|
|
|
|
|
The same is true for the application rule:
|
|
|
|
$$
|
|
|
|
\frac
|
|
|
|
{\\Gamma \\vdash e_1 : \\tau\_1 \\rightarrow \\tau\_2 \\quad \\Gamma \\vdash e_2 : \\tau_1}
|
|
|
|
{\\Gamma \\vdash e_1 \\; e_2 : \\tau\_2}
|
|
|
|
$$
|
|
|
|
|
|
|
|
And finally, we can represent the variable rule:
|
|
|
|
$$
|
|
|
|
\\frac{x : \\tau \\in \\Gamma}{\\Gamma \\vdash x : \\tau}
|
|
|
|
$$
|
|
|
|
|
|
|
|
In these three expressions, we've captured all of the rules
|
|
|
|
that we've seen so far. It's important to know
|
|
|
|
that These rules leave out the process
|
|
|
|
of unification altogether: we use unification to find
|
|
|
|
types that satisfy the rules.
|
|
|
|
|
2019-08-25 01:37:24 -07:00
|
|
|
#### Checking Case Expressions
|
2019-08-13 21:31:10 -07:00
|
|
|
So far, we've only checked types of numbers, applications, and variables.
|
|
|
|
Our language has more than that, though!
|
2019-08-13 22:54:16 -07:00
|
|
|
|
|
|
|
Binary operators are by far the simplest to extend our language with;
|
|
|
|
We can simply say that `(+)` is a function, `Int -> Int -> Int`, and
|
|
|
|
`x+y` is the same as `(+) x y`. This way, we simply translate
|
|
|
|
operators into function application, and the same typing rules apply.
|
|
|
|
|
|
|
|
Next up, we have case expressions. This is one of the two places
|
|
|
|
where we will introduce new variables into the context, and
|
|
|
|
also a place where we will need several rules.
|
|
|
|
|
|
|
|
Let's first take a look at the whole case expression rule:
|
|
|
|
$$
|
|
|
|
\\frac
|
2019-08-25 01:37:24 -07:00
|
|
|
{\\Gamma \\vdash e : \\tau \\quad \\text{matcht}(\\tau, p\_i) = b\_i \\quad \\Gamma,b\_i \\vdash e\_i : \\tau\_c}
|
2019-08-13 22:54:16 -07:00
|
|
|
{\\Gamma \\vdash \\text{case} \\; e \\; \\text{of} \; \\\{ (p\_1,e\_1) \\ldots (p\_n, e\_n) \\\} : \\tau\_c }
|
|
|
|
$$
|
|
|
|
This is a lot more complicated than the other rules we've seen, and we've used some notation
|
|
|
|
that we haven't seen before. Let's take this step by step:
|
|
|
|
|
|
|
|
1. \\(e : \\tau\\), in this case, means that the expression between `case` and `of`, is of type \\(\\tau\\).
|
2019-08-25 01:37:24 -07:00
|
|
|
2. \\(\\text{matcht}(\\tau, p\_i) = b\_i\\) means that the pattern \\(p\_i\\) can match a value of type
|
2019-08-13 22:54:16 -07:00
|
|
|
\\(\\tau\\), producing additional type pairs \\(b\_i\\). We need \\(b\_i\\) because a pattern
|
|
|
|
such as `Cons x xs` will introduce new type information, namely \\(\text{x} : \text{Int}\\) and \\(\text{xs} : \text{List}\\).
|
|
|
|
3. \\(\\Gamma,b\_i \\vdash e\_i : \\tau\_c\\) means that each individual branch can be deduced to have the type
|
|
|
|
\\(\\tau\_c\\), using the previously existing context \\(\\Gamma\\), with the addition of \\(b\_i\\), the new type information.
|
|
|
|
4. Finally, the conclusion is that the case expression, if all the premises are met, is of type \\(\\tau\_c\\).
|
|
|
|
|
2019-08-25 01:37:24 -07:00
|
|
|
For completeness, let's add rules for \\(\\text{matcht}(\\tau, p\_i) = b\_i\\). We'll need two: one for
|
2019-08-13 22:54:16 -07:00
|
|
|
the "basic" pattern, which always matches the value and binds it to the variable, and one
|
|
|
|
for a constructor pattern, that matches a constructor and its parameters.
|
|
|
|
|
|
|
|
Let's define \\(v\\) to be a variable name in the context of a pattern. For the basic pattern:
|
|
|
|
$$
|
|
|
|
\\frac
|
|
|
|
{}
|
2019-08-25 01:37:24 -07:00
|
|
|
{\\text{matcht}(\\tau, v) = \\\{v : \\tau \\\}}
|
2019-08-13 22:54:16 -07:00
|
|
|
$$
|
|
|
|
|
2019-08-25 01:37:24 -07:00
|
|
|
For the next rule, let's define \\(c\\) to be a constructor name. The rule for the constructor pattern, then, is:
|
2019-08-13 22:54:16 -07:00
|
|
|
$$
|
|
|
|
\\frac
|
2019-08-25 01:37:24 -07:00
|
|
|
{\\Gamma \\vdash c : \\tau\_1 \\rightarrow ... \\rightarrow \\tau\_n \\rightarrow \\tau}
|
|
|
|
{\\text{matcht}(\\tau, c \; v\_1 ... v\_n) = \\{ v\_1 : \\tau\_1, ..., v\_n : \\tau\_n \\}}
|
2019-08-13 22:54:16 -07:00
|
|
|
$$
|
2019-08-25 01:37:24 -07:00
|
|
|
|
|
|
|
This rule means that whenever we have a pattern in the form of a constructor applied to
|
|
|
|
\\(n\\) variable names, if the constructor takes \\(n\\) arguments of types \\(\\tau\_1\\)
|
|
|
|
through \\(\\tau\_n\\), then the each variable will have a corresponding type.
|
|
|
|
|
|
|
|
We didn't include lambda expressions in our syntax, and thus we won't need typing rules for them,
|
|
|
|
so it actually seems like we're done with the first draft of our type rules.
|
|
|
|
|
2019-08-25 16:42:23 -07:00
|
|
|
#### Implementation
|
|
|
|
Let's work towards some code. Before we write anything down though, let's
|
|
|
|
get a definition of what a "type" is, in the context of our type checker.
|
|
|
|
Let's say a type is one of 3 things:
|
|
|
|
|
|
|
|
1. A "base type", like `Int`, `Bool`, or `List`.
|
|
|
|
2. A type that's a function from one type to another.
|
|
|
|
3. A placeholder / type variable (like the kind we used for type inference).
|
|
|
|
|
|
|
|
We represent a plceholder type with a unique string, such as "a", or "b",
|
|
|
|
and this makes our placeholder type class very similar to the base type class.
|
|
|
|
|
|
|
|
{{< codeblock "C++" "compiler/03/type.hpp" >}}
|
|
|
|
|
|
|
|
As you can see, we also declared a `type_mgr`, or type manager class.
|
|
|
|
This class will keep the state used for generating more placeholder
|
|
|
|
type names, as well as the information about which
|
|
|
|
placeholder type is mapped to what. We gave it 3 methods:
|
|
|
|
|
|
|
|
* `unify`, to perform unification. It will take two types and
|
|
|
|
find values for placeholder variables such that they can
|
|
|
|
equal.
|
|
|
|
* `resolve`, to get to the "bottom" of a chain of equations.
|
|
|
|
For instance, we have placeholder `a` be mapped to a placeholder
|
|
|
|
`b`, an finally, the placeholder `b` to be mapped to `Int`.
|
|
|
|
`resolve` will return for us `Int`, and, if the "bottom"
|
|
|
|
of the chain is a placeholder, it will set `var` to be a pointer
|
|
|
|
to that placeholder.
|
|
|
|
* `bind`, inspired by [this post](http://dev.stephendiehl.com/fun/006_hindley_milner.html),
|
|
|
|
will map a type variable of some name to a type. This function will also check if
|
|
|
|
the thing we're binding to is the same type variable and not do anything in that case,
|
|
|
|
since `a = a` is not a very useful equation to have.
|
|
|
|
|
|
|
|
To fit its original purpose, we also give the manager class the methods
|
|
|
|
`new_type_name`, and two convenience methods to create placeholder types,
|
|
|
|
`new_type` (in the form `a`) and `new_arrow_type` (in the form `a->b`).
|
|
|
|
|
|
|
|
Let's take a look at the implementation now:
|
|
|
|
|
|
|
|
{{< codeblock "C++" "compiler/03/type.cpp" >}}
|
|
|
|
|
|
|
|
Here, `new_type_name` is actually pretty boring. My goal
|
|
|
|
was to generate type names like `a`, then `b`, eventually getting
|
|
|
|
to `z`, and then move on to `aa`. This provides is with an
|
|
|
|
endless stream of placeholder types.
|
|
|
|
|
|
|
|
Time for the interesting functions. `resolve` keeps
|
|
|
|
trying `dynamic_cast` to a type variable, and if that succeeds,
|
|
|
|
then either:
|
|
|
|
|
|
|
|
1. It's a type variable that's already been set
|
|
|
|
to something, in which case we try resolve the thing it was
|
|
|
|
set to (`t = it->second`)
|
|
|
|
2. It's a type variable that hasn't been set to something.
|
|
|
|
We set `var` to it (the caller will use this info),
|
|
|
|
and stop our resolution loop (`break`).
|
|
|
|
|
|
|
|
In `unify`, we start by calling `resolve` - we don't want
|
|
|
|
to accidentally compare `a` and `b` (and try to bind `a` to
|
|
|
|
`b`) when `a` is already bound to something else (like `Int`).
|
|
|
|
|
|
|
|
From resolve, we will have `lvar` and `rvar` set to
|
|
|
|
something not NULL if `l` or `r` were type variables
|
|
|
|
that haven't been yet mapped (we defined `resolve` to behave this way).
|
|
|
|
So, if one of the variables is not NULL, we try to bind it.
|
|
|
|
|
|
|
|
Next, `unify` checks if both types are either base types or
|
|
|
|
arrow types. If they're base types, it compares their names,
|
|
|
|
and if they're arrow types, it recursively unifies their children.
|
|
|
|
We return in all cases when unification succeeds, and then throw
|
|
|
|
an exception (currently 0) if all the cases fell thorugh, and thus,
|
|
|
|
unification failed.
|
|
|
|
|
|
|
|
Finally, `bind` places the type we're binding to into
|
|
|
|
the `types` map, but not before it checks that the type
|
|
|
|
we're binding is the same as the string we're binding it to
|
|
|
|
(since, again, `a=a` is not a useful equation).
|
|
|
|
|
|
|
|
We now have a unification algorithm, but we still
|
|
|
|
need to implement our rules. Our rules
|
|
|
|
usually include three things: an environment
|
|
|
|
\\(\\Gamma\\), an expression \\(e\\),
|
|
|
|
and a type \\(\\tau\\). We will
|
|
|
|
represent this as a method on `ast`, which is
|
|
|
|
our representation of an expression \\(e\\). This
|
|
|
|
method will take an environment and return
|
|
|
|
a type.
|
|
|
|
|
|
|
|
But first, how should we implement our environment?
|
|
|
|
Conceptually, an environment maps a name string
|
|
|
|
to a type. So naively, we can implement this simply
|
|
|
|
using an `std::map`. But observe
|
|
|
|
that we only extend the environment in one case so far:
|
|
|
|
a case expression. In a case expression, we have the base
|
|
|
|
envrionment \\(\\Gamma\\), and for each branch,
|
|
|
|
we extend it with the bindings produced by
|
|
|
|
the pattern match. Each branch receives a modified
|
|
|
|
copy of the original environment, one that
|
|
|
|
doesn't see the effects of the other branches.
|
|
|
|
|
|
|
|
Using our naive approach, we'd create a new `std::map` for
|
|
|
|
each branch that's a copy of the original environment,
|
|
|
|
and place into it the new pairs. But this means we'll
|
|
|
|
need to copy a map for each branch of the pattern!
|
|
|
|
|
|
|
|
There's a better way. We structure our environment like
|
|
|
|
a linked list. Each node in the linked list
|
|
|
|
contains an `std::map`. When we encounter a new
|
|
|
|
scope (such as in a case branch), we create a new such node, and add all
|
|
|
|
variables introduced in this scope to that node's map. We make
|
|
|
|
it point to our current environment. Then, we pass around the new node
|
|
|
|
as the environment.
|
|
|
|
|
|
|
|
When we look up a variable name, we first look in this node we created.
|
|
|
|
If we don't find the variable we're looking for, we move on to the next
|
|
|
|
node. The benefit of this is that we won't be re-creating a map
|
|
|
|
for each branch, and just creating a node with the changes.
|
2019-08-26 00:13:10 -07:00
|
|
|
Let's implement exactly that. the header:
|
|
|
|
|
|
|
|
{{< codeblock "C++" "compiler/03/env.hpp" >}}
|
|
|
|
|
|
|
|
And the source file:
|
|
|
|
|
|
|
|
{{< codeblock "C++" "compiler/03/env.cpp" >}}
|
|
|
|
|
|
|
|
Nothing should seem too surprising. Of note is the fact
|
|
|
|
that we're not using smart pointers for `scope`,
|
|
|
|
and that the child we create during the call
|
|
|
|
would be invalid if the parent goes out of scope
|
|
|
|
/ is released. We're gearing this towards
|
|
|
|
creating new environments on the stack, and we'll
|
|
|
|
take care not to let a parent go out of scope
|
|
|
|
before the child.
|
|
|
|
|
|
|
|
At least, it's time to declare a new type checking method.
|
|
|
|
We start with with a signature inside `ast`:
|
|
|
|
```
|
|
|
|
virtual type_ptr typecheck(type_mgr& mgr, const type_env& env) const;
|
|
|
|
```
|
|
|
|
We also implement the \\(\\text{matchp}\\) function
|
|
|
|
as a method `match` on `pattern` with the following signature:
|
|
|
|
```
|
|
|
|
virtual void match(type_ptr t, type_mgr& mgr, type_env& env) const;
|
|
|
|
```
|
|
|
|
|
|
|
|
We declare this in every subclass of `ast`. Let's take a look
|
|
|
|
at the implementation now:
|
|
|
|
|
|
|
|
{{< codeblock "C++" "compiler/03/ast.cpp" >}}
|
|
|
|
|
|
|
|
This looks good, but we're not done yet. We can type
|
|
|
|
check expressions, but our program ins't
|
|
|
|
made up of expressions. Rather, it's made up of
|
|
|
|
declarations. Further, we can't look at the declarations
|
|
|
|
in isolation. Consider these two functions:
|
|
|
|
|
|
|
|
```
|
|
|
|
defn double x = { x + x }
|
|
|
|
defn quadruple x = { double (double x) }
|
|
|
|
```
|
|
|
|
|
|
|
|
Assuming we have an environment containing `x` when we typecheck the body
|
|
|
|
of `double`, our algorithm will work out fine. But what about
|
|
|
|
`quadruple`? It needs to know what `double` is, or at least that it exists.
|
|
|
|
|
|
|
|
We could also envision two mutually recursive functions. Let's
|
|
|
|
assume we have the functions `eq` and `if` in global scope. We can write
|
|
|
|
two functions, `even` and `odd`:
|
|
|
|
|
|
|
|
```
|
|
|
|
defn even x = { if (eq x 0) True (odd (x-1)) }
|
|
|
|
defn odd x = { if (eq x 0) False (even (n-1)) }
|
|
|
|
```
|
|
|
|
|
|
|
|
`odd` needs to know about `even`, and `even` needs
|
|
|
|
to know about `odd`. Thus, before we do any checking,
|
|
|
|
we need to populate a global environment with __some__
|
|
|
|
type for each function we declare. We will
|
|
|
|
use what we know about the function for our
|
|
|
|
initial declaration: if the function
|
|
|
|
takes two parameters, its type will be `a -> b -> c`.
|
|
|
|
If it takes one parameter, its type will be `a -> b`.
|
|
|
|
What's more, though, is that we need to make sure
|
|
|
|
that the function's parameters are passed in the environment
|
|
|
|
when checking its body, and that these parameters' types
|
|
|
|
are the same as the placeholder types in the function's
|
|
|
|
"declaration".
|
|
|
|
|
|
|
|
We'll typecheck the program in two passes. First,
|
|
|
|
we'll go through each definition, and add any
|
|
|
|
functions it declares to the global scope. Then,
|
|
|
|
we will go through each definition again, and,
|
|
|
|
if it's a function, typecheck its body using
|
|
|
|
the previously fleshed out global scope.
|
|
|
|
|
|
|
|
We'll add two functions, `typecheck_first`
|
|
|
|
and `typecheck_second` corresponding to
|
|
|
|
these two stages. Their signatures:
|
|
|
|
|
|
|
|
```
|
|
|
|
virtual void typecheck_first(type_mgr& mgr, type_env& env);
|
|
|
|
virtual void typecheck_second(type_mgr& mgr, const type_env& env) const;
|
|
|
|
```
|
|
|
|
|
|
|
|
Furthermore, in the `definition_defn`, we will keep an
|
|
|
|
`std::vector` of `type_ptr`, in which the first element is the
|
|
|
|
type of the __last__ parameter, and so on. We switch around
|
|
|
|
the order of arguments because we build up the `a -> b -> ... -> x`
|
|
|
|
type signature from the right (`->` is right associative), and
|
|
|
|
thus we'll be creating the types right-to-left, too. We also
|
|
|
|
add a `type_ptr` field which holds the type for the function's
|
|
|
|
return value. We keep these two things in the `definition_defn` so
|
|
|
|
that they persist between the two typechecking stages: we want to use
|
|
|
|
the types from the first stage to aid in checking the body in the second stage.
|
|
|
|
|
|
|
|
Here's the code for the implementation:
|
|
|
|
{{< codeblock "C++" "compiler/03/definition.cpp" >}}
|
|
|
|
|
|
|
|
And finally, our updated main:
|
|
|
|
{{< codeblock "C++" "compiler/03/main.cpp" >}}
|
|
|
|
|
|
|
|
Notice that we manually add the types for our binary operators to the environment.
|
|
|
|
|
|
|
|
Let's run our project on a few examples. On our two "bad" examples, we get
|
|
|
|
the very eloquent error:
|
|
|
|
```
|
|
|
|
terminate called after throwing an instance of 'int'
|
|
|
|
[2] 9776 abort (core dumped) ./a.out < bad2.txt
|
|
|
|
```
|
|
|
|
That's what we get for throwing 0.
|
|
|
|
|
|
|
|
So far, our program has thrown in 100% of cases. Let's verify it actually
|
|
|
|
accepts valid programs! We'll try our very first example from today,
|
|
|
|
as well as these two:
|
|
|
|
{{< rawblock "compiler/03/works2.txt" >}}
|
|
|
|
{{< rawblock "compiler/03/works3.txt" >}}
|
|
|
|
|
|
|
|
All of our examples print the number of declarations in the program,
|
|
|
|
which means they don't throw 0. And so, we have typechecking!
|