blog-static/content/blog/11_compiler_polymorphic_dat...

401 lines
19 KiB
Markdown
Raw Normal View History

---
title: Compiling a Functional Language Using C++, Part 11 - Polymorphic Data Types
2020-04-14 19:06:29 -07:00
date: 2020-04-14T19:05:42-07:00
tags: ["C and C++", "Functional Languages", "Compilers"]
2020-05-09 17:29:37 -07:00
description: "In this post, we enable our compiler to understand polymorphic data types."
---
[In part 10]({{< relref "10_compiler_polymorphism.md" >}}), we managed to get our
compiler to accept functions that were polymorphically typed. However, a piece
of the puzzle is still missing: while our _functions_ can handle values
of different types, the same cannot be said for our _data types_. This means
that we cannot construct data structures that can contain arbitrary types.
While we can define and use a list of integers, if we want to also have a
list of booleans, we must copy all of our constructors and define a new data type.
Worse, not only do we have to duplicate the constructors, but also all the functions
that operate on the list. As far as our compiler is concerned, a list of
integers and a list of booleans are entirely different beasts, and cannot
be operated on by the same code.
To make polymorphic data types possible, we must extend our language (and type
system) a little. We will now allow for something like this:
```
data List a = { Nil, Cons a List }
```
In the above snippet, we are no longer declaring a single type, but a collection
of related types, __parameterized__ by a type `a`. Any type can take the place
of `a` to get a list containing that type of element.
Then, `List Int` is a type,
as is `List Bool` and `List (List Int)`. The constructors in the snippet also
get polymorphic types:
{{< latex >}}
\text{Nil} : \forall a \; . \; \text{List} \; a \\
\text{Cons} : \forall a \; . \; a \rightarrow \text{List} \; a \rightarrow \text{List} \; a
{{< /latex >}}
When you call `Cons`, the type of the resulting list varies with the type of element
you pass in. The empty list `Nil` is a valid list of any type, since, well, it's
empty.
Let's talk about `List` itself, now. I suggest that we ponder the following table:
\\(\\text{List}\\)|\\(\\text{Cons}\\)
----|----
\\(\\text{List}\\) is not a type; it must be followed up with arguments, like \\(\\text{List} \\; \\text{Int}\\).|\\(\\text{Cons}\\) is not a list; it must be followed up with arguments, like \\(\\text{Cons} \\; 3 \\; \\text{Nil}\\).
\\(\\text{List} \\; \\text{Int}\\) is in its simplest form.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is in its simplest form.
\\(\\text{List} \\; \\text{Int}\\) is a type.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is a value of type \\(\\text{List} \\; \\text{Int}\\).
I hope that the similarities are quite striking. I claim that
`List` is quite similar to a constructor `Cons`, except that it occurs
in a different context: whereas `Cons` is a way to create values,
`List` is a way to create types. Indeed, while we call `Cons` a constructor,
it's typical to call `List` a __type constructor__.
We know that `Cons` is a function which
assigns to values (like `3` and `Nil`) other values (like `Cons 3 Nil`, or `[3]` for
short). In a similar manner, `List` can be thought of as a function
that assigns to types (like `Int`) other types (like `List Int`). We can
even claim that it has a type:
{{< latex >}}
\text{List} : \text{Type} \rightarrow \text{Type}
{{< /latex >}}
{{< sidenote "right" "dependent-types-note" "Unless we get really wacky," >}}
When your type constructors take as input not only other types but also values
such as <code>3</code>, you've ventured into the territory of
<a href="https://en.wikipedia.org/wiki/Dependent_type">dependent types</a>.
This is a significant step up in complexity from what we'll be doing in this
series. If you're interested, check out
<a href="https://www.idris-lang.org/">Idris</a> (if you want to know about dependent types
for functional programming) or <a href="https://coq.inria.fr/">Coq</a> (to see how
propositions and proofs can be encoded in a dependently typed language).
{{< /sidenote >}}
our type constructors will only take zero or more types as input, and produce
a type as output. In this case, writing \\(\\text{Type}\\) becomes quite repetitive,
and we will adopt the convention of writing \\(*\\) instead. The types of such
constructors are called [kinds](https://en.wikipedia.org/wiki/Kind_(type_theory)).
Let's look at a few examples, just to make sure we're on the same page:
* The kind of \\(\\text{Bool}\\) is \\(*\\): it does not accept any
type arguments, and is a type in its own right.
* The kind of \\(\\text{List}\\) is \\(*\\rightarrow *\\): it takes
one argument (the type of the things inside the list), and creates
a type from it.
* If we define a pair as `data Pair a b = { MkPair a b }`, then its
kind is \\(* \\rightarrow * \\rightarrow *\\), because it requires
two parameters.
As one final observation, we note that effectively, all we're doing is
tracking the [arity](https://en.wikipedia.org/wiki/Arity) of the constructor
type.
Let's now enumerate all the possible forms that (mono)types can take in our system:
1. A type can be a placeholder, like \\(a\\), \\(b\\), etc.
2. A type can be a type constructor, applied to
{{< sidenote "right" "zero-more-note" "zero ore more arguments," >}}
It is convenient to treat regular types (like \(\text{Bool}\)) as
type constructors of arity 0 (that is, type constructors with kind \(*\)).
In effect, they take zero arguments and produce types (themselves).
2020-04-14 19:06:29 -07:00
{{< /sidenote >}} such as \\(\\text{List} \\; \\text{Int}\\) or \\(\\text{Bool}\\).
3. A function from one type to another, like \\(\\text{List} \\; a \\rightarrow \\text{Int}\\).
Polytypes (type schemes) in our system can be all of the above, but may also include a "forall"
2020-04-09 22:55:02 -07:00
quantifier at the front, generalizing the type (like \\(\\forall a \\; . \\; \\text{List} \\; a \\rightarrow \\text{Int}\\)).
Let's start implementing all of this. Why don't we start with the change to the syntax of our language?
We have complicated the situation quite a bit. Let's take a look at the _old_ grammar
for data type declarations (this is going back as far as [part 2]({{< relref "02_compiler_parsing.md" >}})).
Here, \\(L\_D\\) is the nonterminal for the things that go between the curly braces of a data type
declaration, \\(D\\) is the nonterminal representing a single constructor definition,
and \\(L\_U\\) is a list of zero or more uppercase variable names:
{{< latex >}}
\begin{aligned}
L_D & \rightarrow D \; , \; L_D \\
L_D & \rightarrow D \\
D & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}
This grammar was actually too simple even for our monomorphically typed language!
Since functions are not represented using a single uppercase variable, it wasn't possible for us
to define constructors that accept as arguments anything other than integers and user-defined
2020-04-14 19:06:29 -07:00
data types. Now, we also need to modify this grammar to allow for constructor applications (which can be nested).
To do all of these things, we will define a new nonterminal, \\(Y\\), for types:
{{< latex >}}
\begin{aligned}
Y & \rightarrow N \; ``\rightarrow" Y \\
Y & \rightarrow N
\end{aligned}
{{< /latex >}}
We make it right-recursive (because the \\(\\rightarrow\\) operator is right-associative). Next, we define
a nonterminal for all types _except_ those constructed with the arrow, \\(N\\).
{{< latex >}}
\begin{aligned}
N & \rightarrow \text{upperVar} \; L_Y \\
N & \rightarrow \text{typeVar} \\
N & \rightarrow ( Y )
\end{aligned}
{{< /latex >}}
The first of the above rules allows a type to be a constructor applied to zero or more arguments
(generated by \\(L\_Y\\)). The second rule allows a type to be a placeholder type variable. Finally,
the third rule allows for any type (including functions, again) to occur between parentheses.
This is so that higher-order functions, like \\((a \rightarrow b) \rightarrow a \rightarrow a \\),
can be represented.
Unfortunately, the definition of \\(L\_Y\\) is not as straightforward as we imagine. We could define
it as just a list of \\(Y\\) nonterminals, but this would make the grammar ambigous: something
like `List Maybe Int` could be interpreted as "`List`, applied to types `Maybe` and `Int`", and
"`List`, applied to type `Maybe Int`". To avoid this, we define a "type list element" \\(Y'\\),
which does not take arguments:
{{< latex >}}
\begin{aligned}
Y' & \rightarrow \text{upperVar} \\
Y' & \rightarrow \text{lowerVar} \\
Y' & \rightarrow ( Y )
\end{aligned}
{{< /latex >}}
We then make \\(L\_Y\\) a list of \\(Y'\\):
{{< latex >}}
\begin{aligned}
L_Y & \rightarrow Y' \; L_Y \\
L_Y & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}
Finally, we update the rules for the data type declaration, as well as for a single
2020-04-14 19:06:29 -07:00
constructor. In these new rules, we use \\(L\_T\\) to mean a list of type variables.
The rules are as follows:
{{< latex >}}
\begin{aligned}
T & \rightarrow \text{data} \; \text{upperVar} \; L_T = \{ L_D \} \\
D & \rightarrow \text{upperVar} \; L_Y \\
\end{aligned}
{{< /latex >}}
Those are all the changes we have to make to our grammar. Let's now move on to implementing
the corresponding data structures. We define a new family of structs, which represent types as they are
received from the parser. These differ from regular types in that they
do not necessarily represent valid types; validating types requires two passes, whereas parsing is
done in a single pass. We can define our parsed types as follows:
{{< codeblock "C++" "compiler/11/parsed_type.hpp" >}}
We define the conversion method `to_type`, which requires
a set of type variables that are allowed to occur within a parsed
type (which are the variables specified on the left of the `=`
in the data type declaration syntax), and the environment in which to
look up the arities of any type constructors. The implementation is as follows:
{{< codeblock "C++" "compiler/11/parsed_type.cpp" >}}
Note that this definition requires a new `type` subclass, `type_app`, which
represents type application. Unlike `parsed_type_app`, it stores a pointer
to the type constructor being applied, rather than its name. This
helps validate the type (by making sure the parsed type's name refers to
an existing type constructor), and lets us gather information like
which constructors the resulting type has. We define this new type as follows:
{{< codelines "C++" "compiler/11/type.hpp" 70 78 >}}
With our new data structures in hand, we can now update the grammar in our Bison file.
First things first, we'll add the type parameters to the data type definition:
{{< codelines "plaintext" "compiler/11/parser.y" 127 130 >}}
Next, we add the new grammar rules we came up with:
{{< codelines "plaintext" "compiler/11/parser.y" 138 163 >}}
Note in the above rules that even for `typeListElement`, which
can never be applied to any arguments, we still attach a `parsed_type_app`
as the semantic value. This is for consistency; it's easier to view
all types in our system as applications to zero or more arguments,
than to write coercions from non-applied types to types applied to zero
arguments.
Finally, we define the types for these new rules at the top of the file:
{{< codelines "plaintext" "compiler/11/parser.y" 43 44 >}}
This concludes our work on the parser, but opens up a whole can of worms
elsewhere. First of all, now that we introduced a new `type` subclass, we must
ensure that type unification still works as intended. We therefore have
to adjust the `type_mgr::unify` method:
{{< codelines "C++" "compiler/11/type.cpp" 95 132 >}}
In the above snippet, we add a new if-statement that checks whether or
not both types being unified are type applications, and if so, unifies
their constructors and arguments. We also extend our type equality check
to ensure that both the names _and_ arities of types match
{{< sidenote "right" "type-equality-note" "when they are compared for equality." >}}
This is actually a pretty silly measure. Consider the following three
propositions:
1) types are only declared at the top-level scope.
2) if a type is introduced, and another type with that name already exists, we throw an error.
3) for name equality to be insufficient, we need to have two declared types
with the same name. Given these propositions, it will not be possible for us to
declare two types that would confuse the name equality check. However,
in the near future, these propositions may not all hold: if we allow
<code>let/in</code> expressions to contain data type definitions,
it will be possible to declare two types with the same name and arity
(in different scopes), which would <em>still</em> confuse the check.
In the future, if this becomes an issue, we will likely move to unique
type identifiers.
{{< /sidenote >}} Note also the more basic fact that we added arity
to our `type_base`,
2020-04-14 19:06:29 -07:00
{{< sidenote "left" "base-arity-note" "since it may now be a type constructor instead of a plain type." >}}
You may be wondering, why did we add arity to base types, rather than data types?
Although so far, our language can only create type constructors from data type definitions,
it's possible (or even likely) that we will have
polymorphic built-in types, such as
<a href="https://www.haskell.org/tutorial/io.html">the IO monad</a>.
To prepare for this, we will allow our base types to be type constructors too.
{{< /sidenote >}}
Jut as we change `type_mgr::unify`, we need to change `type_mgr::find_free`
to include the new case of `type_app`. The adjusted function looks as follows:
{{< codelines "C++" "compiler/11/type.cpp" 174 187 >}}
2020-04-14 19:06:29 -07:00
There is another adjustment that we have to make to our type code. Recall
that we had code that implemented substitutions: replacing free variables
with other types to properly implement our type schemes. There
was a bug in that code, which becomes much more apparent when the substitution
system is put under more pressure. Specifically, the bug was in how type
variables were handled.
The old substitution code, when it found that a type
variable had been bound to another type, always moved on to perform
a substitution in that other type. This wasn't really a problem then, since
any type variables that needed to be substituted were guaranteed to be
free (that's why they were put into the "forall" quantifier). However, with our
new system, we are using user-provided type variables (usually `a`, `b`, and so on),
which have likely already been used by our compiler internally, and thus have
been bound to something. That something is irrelevant to us: when we
perform a substitution on a user-defined data type, we _know_ that _our_ `a` is
free, and should be substitited. In short, precedence should be given to
substituting type variables, rather than resolving them to what they are bound to.
To make this adjustment possible, we need to make `substitute` a method of `type_manager`,
since it will now require an awareness of existing type bindings. Additionally,
this method will now perform its own type resolution, checking if a type variable
needs to be substitited between each step. The whole code is as follows:
{{< codelines "C++" "compiler/11/type.cpp" 134 165 >}}
That's all for types. Definitions, though, need some work. First of all,
we've changed our parser to feed our `constructor` type a vector of
`parsed_type_ptr`, rather than `std::string`. We therefore have to update
`constructor` to receive and store this new vector:
{{< codelines "C++" "compiler/11/definition.hpp" 13 20 >}}
Similarly, `definition_data` itself needs to accept the list of type
variables it has:
{{< codelines "C++" "compiler/11/definition.hpp" 54 70 >}}
We then look at `definition_data::insert_constructors`, which converts
`constructor` instances to actual constructor functions. The code,
which is getting pretty complciated, is as follows:
{{< codelines "C++" "compiler/11/definition.cpp" 64 92 >}}
In the above snippet, we do the following things:
1. We first create a set of type variables that can occur
in this type's constructors (the same set that's used
by the `to_type` method we saw earlier). While doing this, we ensure
a type variable is not used twice (this is not allowed), and add each
type variable to the final return type (which is something like `List a`),
in the order they occur.
2. When the variables have been gathered into a set, we iterate
over all constructors, and convert them into types by calling `to_type`
2020-04-14 19:06:29 -07:00
on their arguments, then assembling the resulting argument types into a function.
This is not enough, however,
{{< sidenote "right" "type-variables-note" "since constructors of types that accept type variables are polymorphic," >}}
This is also not enough because without generalization using "forall", we are risking using type variables
that have already been bound, or that will be bound. Even if <code>a</code> has not yet been used by the typechecker,
it will be once the type manager generates its first type variable, and things will go south. If we, for some reason,
wanted type constructors to be monomorphic (but generic, with type variables) we'd need to internally
instnatiate fresh type variables for every user-defined type variable, and substitute them appropriately.
{{< /sidenote >}}
as we have discussed above with \\(\\text{Nil}\\) and \\(\\text{Cons}\\).
2020-04-14 19:06:29 -07:00
To accomodate for this, we also add all type variables to the "forall" quantifier
of a new type scheme, whose monotype is our newly assembled function type. This
type scheme is what we store as the type of the constructor.
2020-04-14 19:06:29 -07:00
This was the last major change we have to perform. The rest is cleanup: we have switched
our system to dealing with type applications (sometimes with zero arguments), and we must
bring the rest of the compiler up to speed with this change. For instance, we update
`ast_int` to create a reference to an existing integer type during typechecking:
{{< codelines "C++" "compiler/11/ast.cpp" 20 22 >}}
Similarly, we update our code in `typecheck_program` to use
type applications in the type for binary operations:
{{< codelines "C++" "compiler/11/main.cpp" 31 37 >}}
Finally, we update `ast_case` to unwrap type applications to get the needed constructor
data from `type_data`. This has to be done in `ast_case::typecheck`, as follows:
{{< codelines "C++" "compiler/11/ast.cpp" 163 168 >}}
Additionally, a similar change needs to be made in `ast_case::compile`:
{{< codelines "C++" "compiler/11/ast.cpp" 174 175 >}}
That should be all! Let's try an example:
{{< rawblock "compiler/11/examples/works3.txt" >}}
The output:
```
Result: 6
```
Yay! Not only were we able to define a list of any type, but our `length` function correctly
determined the lengths of two lists of different types! Let's try an example with the
classic [`fold` functions](http://learnyouahaskell.com/higher-order-functions#folds):
{{< rawblock "compiler/11/examples/list.txt" >}}
We expect the sum of the list `[1,2,3,4]` to be `10`, and its length to be `4`, so the sum
of the two should be `14`. And indeed, our program agrees:
```
Result: 14
```
Let's do one more example, to test types that take more than one type parameter:
{{< rawblock "compiler/11/examples/pair.txt" >}}
Once again, the compiled program gives the expected result:
```
Result: 4
```
This looks good! We have added support for polymorphic data types to our compiler.
We are now free to move on to `let/in` expressions, __lambda functions__, and __Input/Output__,
as promised, starting with [part 12]({{< relref "12_compiler_let_in_lambda/index.md" >}}) - `let/in`
and lambdas!