From 619c346897273e22c6b232f85665e33206868249 Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Mon, 26 Aug 2019 17:08:05 -0700 Subject: [PATCH] Do the first round of revisions on part 3 --- code/compiler/03/ast.cpp | 20 ++-- content/blog/03_compiler_trees.md | 148 ++++++++++++++++++++++-------- 2 files changed, 116 insertions(+), 52 deletions(-) diff --git a/code/compiler/03/ast.cpp b/code/compiler/03/ast.cpp index f244ffa..bdee157 100644 --- a/code/compiler/03/ast.cpp +++ b/code/compiler/03/ast.cpp @@ -28,28 +28,22 @@ type_ptr ast_binop::typecheck(type_mgr& mgr, const type_env& env) const { type_ptr ftype = env.lookup(op_name(op)); if(!ftype) throw 0; - type_ptr place_a = mgr.new_type(); - type_ptr place_b = mgr.new_type(); - type_ptr place_c = mgr.new_type(); - type_ptr arrow_one = type_ptr(new type_arr(place_b, place_c)); - type_ptr arrow_two = type_ptr(new type_arr(place_a, arrow_one)); + type_ptr return_type = mgr.new_type(); + type_ptr arrow_one = type_ptr(new type_arr(rtype, return_type)); + type_ptr arrow_two = type_ptr(new type_arr(ltype, arrow_one)); mgr.unify(arrow_two, ftype); - mgr.unify(place_a, ltype); - mgr.unify(place_b, rtype); - return place_c; + return return_type; } type_ptr ast_app::typecheck(type_mgr& mgr, const type_env& env) const { type_ptr ltype = left->typecheck(mgr, env); type_ptr rtype = right->typecheck(mgr, env); - type_ptr place_a = mgr.new_type(); - type_ptr place_b = mgr.new_type(); - type_ptr arrow = type_ptr(new type_arr(place_a, place_b)); + type_ptr return_type = mgr.new_type(); + type_ptr arrow = type_ptr(new type_arr(rtype, return_type)); mgr.unify(arrow, ltype); - mgr.unify(place_a, rtype); - return place_b; + return return_type; } type_ptr ast_case::typecheck(type_mgr& mgr, const type_env& env) const { diff --git a/content/blog/03_compiler_trees.md b/content/blog/03_compiler_trees.md index 1e5e4ed..5435588 100644 --- a/content/blog/03_compiler_trees.md +++ b/content/blog/03_compiler_trees.md @@ -36,8 +36,8 @@ defn main = { 1 2 3 4 5 } ``` What is this? It's a sequence of applications, starting with `1 2`. Numbers -are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type `1` -would need to have for this program to be valid. +are not functions. Their type is `Int`, not `Int -> a`. I can't even think of a type numbers +would need to have for this program to be valid (though perhaps one could come up with one). Before we give meaning to programs in our language, we'll need to toss away the ones that don't make sense. To do so, we will use type checking. During the process of type @@ -47,20 +47,18 @@ to throw away blatantly incorrect programs. ### Basic Type Checking You may have noticed in the very first post that I have chosen to avoid polymorphism. -This will significantly simplify our type checking algorithm. If a more robust +This will simplify our type checking algorithm. If a more robust algorithm is desired, take a look at the [Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system). Personally, I enjoyed [this](http://dev.stephendiehl.com/fun/006_hindley_milner.html) section -of _Write You a Haskell_. +of _Write You a Haskell_, which I used to sanity check this very post. Let's start with the types of constants - those are pretty obvious. The constant `3` is an integer, and we shall mark it as such: `3 :: Int`. Variables seem like the natural -next step, but they're fairly different. Without outside knowledge, all we can do is say -that a variable has __some__ type. If we stick with Haskell's notation (used in polymorphic types), -we can say a variable has a type `a`, where `a` can be replaced with any other type. If we -know more information, like the fact that `x` was declared to be an integer, we can instead -say that. This tells us that throughout type checking we'll have to keep some kind -of record of names and their associated types. +next step, but they're fairly different. Without outside knowledge, we can't +tell what type a variable `x` is. If we know more information, like the fact that `x` +was declared to be an integer, we can instead say that. This tells us that throughout +type checking we'll have to keep some kind of record of names and their associated types. Next, let's take a look at functions, which are admittedly more interesting than the previous two examples. I'm not talking about the case of seeing something @@ -68,13 +66,16 @@ like a function name `f`. This is the same as the variable case - we don't even it's a function unless there is context, and if there __is__ context, then that context is probably the most useful information we have. I'm talking about something like the application of a function to another value, in the form -`f x`. In this case, we know that `f :: a -> b`, a function from something -to something. However, we know even more. For this program to be correct, -the `a` in `f :: a -> b`, and the type of `x` (let's call it `c`), must -be compatible. In order to do that, we will use what can be considered -simplified [unification](https://en.wikipedia.org/wiki/Unification_(computer_science)). -Conceptually, what this means is that we will attempt to perform substitutions -in various equations in search of a solution. +`f x`. In this case, we know that for the program to be valid, +`f` must have the type `a -> b`, a function from something +to something. Furthermore, since `f` is being applied to `x`, +the type of `x` (let's call it `c`) must be the same as the type `a`. + +Our rules are getting more complicated. In order to check that they hold, we will use +[unification](https://en.wikipedia.org/wiki/Unification_(computer_science)). +This means that we'll be creating various equations (such as "the type of `f` is equal to `a -> b`"), +and finding substitutions to solve these equations. If we're unable to +find a solution to our equations, the program is invalid and we throw it away. #### Basic Examples Let's try an example. We'll try to determine the type of the following @@ -87,7 +88,7 @@ foo 320 6 In out parse tree, this will be represented as `(foo 320) 6`, meaning that the outermost application will be at the top. Let's assume -we know __nothing__ about `foo`. +we know only that `foo` is defined. To figure out the type of the application, we will need to know the types of the thing being applied, and the thing that serves as the argument. @@ -99,12 +100,13 @@ an argument, it has to be of type `a -> b`. Let's proceed to the left child. It's another application, this time of `foo` to `320`. Again, the right child is simple: the type of `320` is `Int`. Again, we know that `foo` has to have a type -`c -> d` (we're using different variable names to avoid ambiguity). +`c -> d` (we're using different variable names since the types +of `foo` and `foo 320` are not the same). Now, we need to combine the pieces of information that we have. Since `foo :: c -> d`, we know that its first parameter __must__ be of type `c`. We also know that its first parameter is of type `Int`. -The only way for both of these to be tree is for `c = Int`. +The only way for both of these to be true is for `c = Int`. This also tells us that `foo :: Int -> d`. Finally, since `foo` has now been applied to its first argument, we know that the `foo 320 :: d`. @@ -116,7 +118,7 @@ is for `d = a -> b`. So, `foo 320` is a function from `a` to `b`. Again, we can conclude the first parameter has to be of type `a`. We also know that the first parameter is of type `Int`. Clearly, this means that `a = Int`. After the application, we know -that the whole expression has some type `d`. +that the whole expression has some type `b`. Let's revisit what we know about `foo`. Last time we checked in on it, `foo` was of type `Int -> d`. But since we know that `d = a -> b`, @@ -156,7 +158,7 @@ $$ \\frac{A\_1 \\ldots A\_n} {B\_1 \\ldots B\_m} $$ This reads, "given that the premises \\(A\_1\\) through \\(A\_n\\) are true, -it holds that the conclusions \\(B\_1\\) through \\(B\_n\\) are true". +it holds that the conclusions \\(B\_1\\) through \\(B\_m\\) are true". For example, we can have the following inference rule: $$ @@ -197,18 +199,27 @@ $$ {e_1 \\; e_2 : \\tau\_2} $$ +This rule includes everything we've said before: +the thing being applied has to have a function type +(\\(\\tau\_1 \\rightarrow \\tau\_2\\)), and +the expression the function is applied to +has to have the same type \\(\\tau\_1\\) as the +left type of the function. + It's the variable rule that forces us to adjust our notation. Our rules don't take into account the context that we've -already discussed. Let's fix that! It's convention +already discussed, and thus, we can't bring +in any outside information. Let's fix that! It's convention to use the symbol \\(\\Gamma\\) for the context. We then add notation to say, "using the context \\(\\Gamma\\), we can deduce that \\(e\\) has type \\(\\tau\\)". We will write this as \\(\\Gamma \\vdash e : \\tau\\). -But what __is__ our context? It's just a set of pairs +But what __is__ our context? We can think of it +as a mapping from variable names to their known types. We +can represent such a mapping using a set of pairs in the form \\(x : \\tau\\), where \\(x\\) represents -a variable name. Each pair tells us that the variable -\\(x\\) is known to have a type \\(\\tau\\). +a variable name. Since \\(\\Gamma\\) is just a regular set, we can write \\(x : \\tau \\in \\Gamma\\), meaning that @@ -236,7 +247,7 @@ $$ In these three expressions, we've captured all of the rules that we've seen so far. It's important to know -that These rules leave out the process +that these rules leave out the process of unification altogether: we use unification to find types that satisfy the rules. @@ -271,7 +282,7 @@ such as `Cons x xs` will introduce new type information, namely \\(\text{x} : \t 4. Finally, the conclusion is that the case expression, if all the premises are met, is of type \\(\\tau\_c\\). For completeness, let's add rules for \\(\\text{matcht}(\\tau, p\_i) = b\_i\\). We'll need two: one for -the "basic" pattern, which always matches the value and binds it to the variable, and one +the "basic" pattern, which always matches the value and binds a variable to it, and one for a constructor pattern, that matches a constructor and its parameters. Let's define \\(v\\) to be a variable name in the context of a pattern. For the basic pattern: @@ -325,7 +336,7 @@ of the chain is a placeholder, it will set `var` to be a pointer to that placeholder. * `bind`, inspired by [this post](http://dev.stephendiehl.com/fun/006_hindley_milner.html), will map a type variable of some name to a type. This function will also check if -the thing we're binding to is the same type variable and not do anything in that case, +we're binding a type variable to itself, and do nothing in that case, since `a = a` is not a very useful equation to have. To fit its original purpose, we also give the manager class the methods @@ -358,7 +369,7 @@ to accidentally compare `a` and `b` (and try to bind `a` to From resolve, we will have `lvar` and `rvar` set to something not NULL if `l` or `r` were type variables -that haven't been yet mapped (we defined `resolve` to behave this way). +that haven't yet been bound (we defined `resolve` to behave this way). So, if one of the variables is not NULL, we try to bind it. Next, `unify` checks if both types are either base types or @@ -378,12 +389,13 @@ need to implement our rules. Our rules usually include three things: an environment \\(\\Gamma\\), an expression \\(e\\), and a type \\(\\tau\\). We will -represent this as a method on `ast`, which is -our representation of an expression \\(e\\). This +represent this as a method on `ast`, our struct +for an expression tree. This method will take an environment and return a type. -But first, how should we implement our environment? +##### Environment +How should we implement our environment? Conceptually, an environment maps a name string to a type. So naively, we can implement this simply using an `std::map`. But observe @@ -429,7 +441,8 @@ creating new environments on the stack, and we'll take care not to let a parent go out of scope before the child. -At least, it's time to declare a new type checking method. +##### Typechecking Expressions +At last, it's time to declare a new type checking method. We start with with a signature inside `ast`: ``` virtual type_ptr typecheck(type_mgr& mgr, const type_env& env) const; @@ -445,11 +458,68 @@ at the implementation now: {{< codeblock "C++" "compiler/03/ast.cpp" >}} -{{< todo >}}Explain implementation, at least a little.{{< /todo >}} +The `typecheck` implementation for `ast_int`, `ast_lid`, and `ast_uid` +is quite intuitive. The type of a number is always `Int`, and varible +names we simply look up in the environment. + +For `ast_binop` and `ast_app`, we check the types of the children first, +and store their resulting types. In the case of `binop`, we assume +that we have the type of the binary operator (such as `+`) in the +environment, and throw if it isn't. Then, we know that +the operator has to be a function of at least two arguments, +with the types of the left and right children of the application. +We also know its actual type, as it's in the environment. We unify +these two types, and then return. + +In the case of `app`, we know that the left side (the thing +being applied), has to be a function from the type of the right +child. We also know its computed type. Once again, we unify the two, +and then return. + +The type checking for `case` offloads most of the work onto the +`match` method on patterns. We start by computing the type +of the expression we're matching. Then, for each branch, +we create a new scope, and populate it with the new bindings +created by the pattern match (`match` does this). Once +we have a new environment, we typecheck the body +of the branch. Finally, we unify the type of the branch's +body with the previous body types (since branches of the +case expression have to have the same type). + +Let's take a look at `match` now. The `match` method +for a variable pattern is very simple: it always +introduces a new variable bindings, and stops there. +The `match` method for a constructor pattern is more +interesting. First, it looks up the constructor +that the pattern is trying to match. This needs to exist, +so if we don't find a type for it, we throw. Next, +for each variable name in the pattern, we know +that there has to be a corresponding parameter in the +constructor. Because of this, we cast the constructor type +to an function type (throwing if it isn't one), +and create a new mapping from the variable name to the +left side (parameter) of the function type. We then +move on to examine the right side of the function +type, and the remaining variables of the pattern. + +Once we have gone through the parameters, +what remains __should__ be the type of the thing +the constructor creates. Not only that, but +the remaining type should match the type of the value +the pattern is being matched against. To ensure +this, we unify the two types. + +There's only one thing that can still go wrong. +The value __and__ the pattern can __both__ be +a partial application. For ease of implementation, +we will not allow this case. Thus, we make sure +the final type is a base type, and throw otherwise. + +##### Typechecking Definitions This looks good, but we're not done yet. We can type check expressions, but our program ins't made up of expressions. Rather, it's made up of -declarations. Further, we can't look at the declarations +definitions. Further, we can't look at the definitions in isolation. Consider these two functions: ``` @@ -503,11 +573,11 @@ virtual void typecheck_second(type_mgr& mgr, const type_env& env) const; Furthermore, in the `definition_defn`, we will keep an `std::vector` of `type_ptr`, in which the first element is the type of the __last__ parameter, and so on. We switch around -the order of arguments because we build up the `a -> b -> ... -> x` +the order of arguments because we build up the `a -> b -> ...` type signature from the right (`->` is right associative), and thus we'll be creating the types right-to-left, too. We also -add a `type_ptr` field which holds the type for the function's -return value. We keep these two things in the `definition_defn` so +add a `type_ptr` field which holds the function's return type. +We keep these two things in the `definition_defn` so that they persist between the two typechecking stages: we want to use the types from the first stage to aid in checking the body in the second stage.