401 lines
19 KiB
Markdown
401 lines
19 KiB
Markdown
---
|
|
title: Compiling a Functional Language Using C++, Part 11 - Polymorphic Data Types
|
|
date: 2020-04-14T19:05:42-07:00
|
|
tags: ["C and C++", "Functional Languages", "Compilers"]
|
|
description: "In this post, we enable our compiler to understand polymorphic data types."
|
|
---
|
|
[In part 10]({{< relref "10_compiler_polymorphism.md" >}}), we managed to get our
|
|
compiler to accept functions that were polymorphically typed. However, a piece
|
|
of the puzzle is still missing: while our _functions_ can handle values
|
|
of different types, the same cannot be said for our _data types_. This means
|
|
that we cannot construct data structures that can contain arbitrary types.
|
|
While we can define and use a list of integers, if we want to also have a
|
|
list of booleans, we must copy all of our constructors and define a new data type.
|
|
Worse, not only do we have to duplicate the constructors, but also all the functions
|
|
that operate on the list. As far as our compiler is concerned, a list of
|
|
integers and a list of booleans are entirely different beasts, and cannot
|
|
be operated on by the same code.
|
|
|
|
To make polymorphic data types possible, we must extend our language (and type
|
|
system) a little. We will now allow for something like this:
|
|
|
|
```
|
|
data List a = { Nil, Cons a List }
|
|
```
|
|
|
|
In the above snippet, we are no longer declaring a single type, but a collection
|
|
of related types, __parameterized__ by a type `a`. Any type can take the place
|
|
of `a` to get a list containing that type of element.
|
|
Then, `List Int` is a type,
|
|
as is `List Bool` and `List (List Int)`. The constructors in the snippet also
|
|
get polymorphic types:
|
|
|
|
{{< latex >}}
|
|
\text{Nil} : \forall a \; . \; \text{List} \; a \\
|
|
\text{Cons} : \forall a \; . \; a \rightarrow \text{List} \; a \rightarrow \text{List} \; a
|
|
{{< /latex >}}
|
|
|
|
When you call `Cons`, the type of the resulting list varies with the type of element
|
|
you pass in. The empty list `Nil` is a valid list of any type, since, well, it's
|
|
empty.
|
|
|
|
Let's talk about `List` itself, now. I suggest that we ponder the following table:
|
|
|
|
\\(\\text{List}\\)|\\(\\text{Cons}\\)
|
|
----|----
|
|
\\(\\text{List}\\) is not a type; it must be followed up with arguments, like \\(\\text{List} \\; \\text{Int}\\).|\\(\\text{Cons}\\) is not a list; it must be followed up with arguments, like \\(\\text{Cons} \\; 3 \\; \\text{Nil}\\).
|
|
\\(\\text{List} \\; \\text{Int}\\) is in its simplest form.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is in its simplest form.
|
|
\\(\\text{List} \\; \\text{Int}\\) is a type.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is a value of type \\(\\text{List} \\; \\text{Int}\\).
|
|
|
|
I hope that the similarities are quite striking. I claim that
|
|
`List` is quite similar to a constructor `Cons`, except that it occurs
|
|
in a different context: whereas `Cons` is a way to create values,
|
|
`List` is a way to create types. Indeed, while we call `Cons` a constructor,
|
|
it's typical to call `List` a __type constructor__.
|
|
We know that `Cons` is a function which
|
|
assigns to values (like `3` and `Nil`) other values (like `Cons 3 Nil`, or `[3]` for
|
|
short). In a similar manner, `List` can be thought of as a function
|
|
that assigns to types (like `Int`) other types (like `List Int`). We can
|
|
even claim that it has a type:
|
|
|
|
{{< latex >}}
|
|
\text{List} : \text{Type} \rightarrow \text{Type}
|
|
{{< /latex >}}
|
|
|
|
{{< sidenote "right" "dependent-types-note" "Unless we get really wacky," >}}
|
|
When your type constructors take as input not only other types but also values
|
|
such as <code>3</code>, you've ventured into the territory of
|
|
<a href="https://en.wikipedia.org/wiki/Dependent_type">dependent types</a>.
|
|
This is a significant step up in complexity from what we'll be doing in this
|
|
series. If you're interested, check out
|
|
<a href="https://www.idris-lang.org/">Idris</a> (if you want to know about dependent types
|
|
for functional programming) or <a href="https://coq.inria.fr/">Coq</a> (to see how
|
|
propositions and proofs can be encoded in a dependently typed language).
|
|
{{< /sidenote >}}
|
|
our type constructors will only take zero or more types as input, and produce
|
|
a type as output. In this case, writing \\(\\text{Type}\\) becomes quite repetitive,
|
|
and we will adopt the convention of writing \\(*\\) instead. The types of such
|
|
constructors are called [kinds](https://en.wikipedia.org/wiki/Kind_(type_theory)).
|
|
Let's look at a few examples, just to make sure we're on the same page:
|
|
|
|
* The kind of \\(\\text{Bool}\\) is \\(*\\): it does not accept any
|
|
type arguments, and is a type in its own right.
|
|
* The kind of \\(\\text{List}\\) is \\(*\\rightarrow *\\): it takes
|
|
one argument (the type of the things inside the list), and creates
|
|
a type from it.
|
|
* If we define a pair as `data Pair a b = { MkPair a b }`, then its
|
|
kind is \\(* \\rightarrow * \\rightarrow *\\), because it requires
|
|
two parameters.
|
|
|
|
As one final observation, we note that effectively, all we're doing is
|
|
tracking the [arity](https://en.wikipedia.org/wiki/Arity) of the constructor
|
|
type.
|
|
|
|
Let's now enumerate all the possible forms that (mono)types can take in our system:
|
|
|
|
1. A type can be a placeholder, like \\(a\\), \\(b\\), etc.
|
|
2. A type can be a type constructor, applied to
|
|
{{< sidenote "right" "zero-more-note" "zero ore more arguments," >}}
|
|
It is convenient to treat regular types (like \(\text{Bool}\)) as
|
|
type constructors of arity 0 (that is, type constructors with kind \(*\)).
|
|
In effect, they take zero arguments and produce types (themselves).
|
|
{{< /sidenote >}} such as \\(\\text{List} \\; \\text{Int}\\) or \\(\\text{Bool}\\).
|
|
3. A function from one type to another, like \\(\\text{List} \\; a \\rightarrow \\text{Int}\\).
|
|
|
|
Polytypes (type schemes) in our system can be all of the above, but may also include a "forall"
|
|
quantifier at the front, generalizing the type (like \\(\\forall a \\; . \\; \\text{List} \\; a \\rightarrow \\text{Int}\\)).
|
|
|
|
Let's start implementing all of this. Why don't we start with the change to the syntax of our language?
|
|
We have complicated the situation quite a bit. Let's take a look at the _old_ grammar
|
|
for data type declarations (this is going back as far as [part 2]({{< relref "02_compiler_parsing.md" >}})).
|
|
Here, \\(L\_D\\) is the nonterminal for the things that go between the curly braces of a data type
|
|
declaration, \\(D\\) is the nonterminal representing a single constructor definition,
|
|
and \\(L\_U\\) is a list of zero or more uppercase variable names:
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
L_D & \rightarrow D \; , \; L_D \\
|
|
L_D & \rightarrow D \\
|
|
D & \rightarrow \text{upperVar} \; L_U \\
|
|
L_U & \rightarrow \text{upperVar} \; L_U \\
|
|
L_U & \rightarrow \epsilon
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
This grammar was actually too simple even for our monomorphically typed language!
|
|
Since functions are not represented using a single uppercase variable, it wasn't possible for us
|
|
to define constructors that accept as arguments anything other than integers and user-defined
|
|
data types. Now, we also need to modify this grammar to allow for constructor applications (which can be nested).
|
|
To do all of these things, we will define a new nonterminal, \\(Y\\), for types:
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
Y & \rightarrow N \; ``\rightarrow" Y \\
|
|
Y & \rightarrow N
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
We make it right-recursive (because the \\(\\rightarrow\\) operator is right-associative). Next, we define
|
|
a nonterminal for all types _except_ those constructed with the arrow, \\(N\\).
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
N & \rightarrow \text{upperVar} \; L_Y \\
|
|
N & \rightarrow \text{typeVar} \\
|
|
N & \rightarrow ( Y )
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
The first of the above rules allows a type to be a constructor applied to zero or more arguments
|
|
(generated by \\(L\_Y\\)). The second rule allows a type to be a placeholder type variable. Finally,
|
|
the third rule allows for any type (including functions, again) to occur between parentheses.
|
|
This is so that higher-order functions, like \\((a \rightarrow b) \rightarrow a \rightarrow a \\),
|
|
can be represented.
|
|
|
|
Unfortunately, the definition of \\(L\_Y\\) is not as straightforward as we imagine. We could define
|
|
it as just a list of \\(Y\\) nonterminals, but this would make the grammar ambigous: something
|
|
like `List Maybe Int` could be interpreted as "`List`, applied to types `Maybe` and `Int`", and
|
|
"`List`, applied to type `Maybe Int`". To avoid this, we define a "type list element" \\(Y'\\),
|
|
which does not take arguments:
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
Y' & \rightarrow \text{upperVar} \\
|
|
Y' & \rightarrow \text{lowerVar} \\
|
|
Y' & \rightarrow ( Y )
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
We then make \\(L\_Y\\) a list of \\(Y'\\):
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
L_Y & \rightarrow Y' \; L_Y \\
|
|
L_Y & \rightarrow \epsilon
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
Finally, we update the rules for the data type declaration, as well as for a single
|
|
constructor. In these new rules, we use \\(L\_T\\) to mean a list of type variables.
|
|
The rules are as follows:
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
T & \rightarrow \text{data} \; \text{upperVar} \; L_T = \{ L_D \} \\
|
|
D & \rightarrow \text{upperVar} \; L_Y \\
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
Those are all the changes we have to make to our grammar. Let's now move on to implementing
|
|
the corresponding data structures. We define a new family of structs, which represent types as they are
|
|
received from the parser. These differ from regular types in that they
|
|
do not necessarily represent valid types; validating types requires two passes, whereas parsing is
|
|
done in a single pass. We can define our parsed types as follows:
|
|
|
|
{{< codeblock "C++" "compiler/11/parsed_type.hpp" >}}
|
|
|
|
We define the conversion method `to_type`, which requires
|
|
a set of type variables that are allowed to occur within a parsed
|
|
type (which are the variables specified on the left of the `=`
|
|
in the data type declaration syntax), and the environment in which to
|
|
look up the arities of any type constructors. The implementation is as follows:
|
|
|
|
{{< codeblock "C++" "compiler/11/parsed_type.cpp" >}}
|
|
|
|
Note that this definition requires a new `type` subclass, `type_app`, which
|
|
represents type application. Unlike `parsed_type_app`, it stores a pointer
|
|
to the type constructor being applied, rather than its name. This
|
|
helps validate the type (by making sure the parsed type's name refers to
|
|
an existing type constructor), and lets us gather information like
|
|
which constructors the resulting type has. We define this new type as follows:
|
|
|
|
{{< codelines "C++" "compiler/11/type.hpp" 70 78 >}}
|
|
|
|
With our new data structures in hand, we can now update the grammar in our Bison file.
|
|
First things first, we'll add the type parameters to the data type definition:
|
|
|
|
{{< codelines "plaintext" "compiler/11/parser.y" 127 130 >}}
|
|
|
|
Next, we add the new grammar rules we came up with:
|
|
|
|
{{< codelines "plaintext" "compiler/11/parser.y" 138 163 >}}
|
|
|
|
Note in the above rules that even for `typeListElement`, which
|
|
can never be applied to any arguments, we still attach a `parsed_type_app`
|
|
as the semantic value. This is for consistency; it's easier to view
|
|
all types in our system as applications to zero or more arguments,
|
|
than to write coercions from non-applied types to types applied to zero
|
|
arguments.
|
|
|
|
Finally, we define the types for these new rules at the top of the file:
|
|
|
|
{{< codelines "plaintext" "compiler/11/parser.y" 43 44 >}}
|
|
|
|
This concludes our work on the parser, but opens up a whole can of worms
|
|
elsewhere. First of all, now that we introduced a new `type` subclass, we must
|
|
ensure that type unification still works as intended. We therefore have
|
|
to adjust the `type_mgr::unify` method:
|
|
|
|
{{< codelines "C++" "compiler/11/type.cpp" 95 132 >}}
|
|
|
|
In the above snippet, we add a new if-statement that checks whether or
|
|
not both types being unified are type applications, and if so, unifies
|
|
their constructors and arguments. We also extend our type equality check
|
|
to ensure that both the names _and_ arities of types match
|
|
{{< sidenote "right" "type-equality-note" "when they are compared for equality." >}}
|
|
This is actually a pretty silly measure. Consider the following three
|
|
propositions:
|
|
1) types are only declared at the top-level scope.
|
|
2) if a type is introduced, and another type with that name already exists, we throw an error.
|
|
3) for name equality to be insufficient, we need to have two declared types
|
|
with the same name. Given these propositions, it will not be possible for us to
|
|
declare two types that would confuse the name equality check. However,
|
|
in the near future, these propositions may not all hold: if we allow
|
|
<code>let/in</code> expressions to contain data type definitions,
|
|
it will be possible to declare two types with the same name and arity
|
|
(in different scopes), which would <em>still</em> confuse the check.
|
|
In the future, if this becomes an issue, we will likely move to unique
|
|
type identifiers.
|
|
{{< /sidenote >}} Note also the more basic fact that we added arity
|
|
to our `type_base`,
|
|
{{< sidenote "left" "base-arity-note" "since it may now be a type constructor instead of a plain type." >}}
|
|
You may be wondering, why did we add arity to base types, rather than data types?
|
|
Although so far, our language can only create type constructors from data type definitions,
|
|
it's possible (or even likely) that we will have
|
|
polymorphic built-in types, such as
|
|
<a href="https://www.haskell.org/tutorial/io.html">the IO monad</a>.
|
|
To prepare for this, we will allow our base types to be type constructors too.
|
|
{{< /sidenote >}}
|
|
|
|
Jut as we change `type_mgr::unify`, we need to change `type_mgr::find_free`
|
|
to include the new case of `type_app`. The adjusted function looks as follows:
|
|
|
|
{{< codelines "C++" "compiler/11/type.cpp" 174 187 >}}
|
|
|
|
There is another adjustment that we have to make to our type code. Recall
|
|
that we had code that implemented substitutions: replacing free variables
|
|
with other types to properly implement our type schemes. There
|
|
was a bug in that code, which becomes much more apparent when the substitution
|
|
system is put under more pressure. Specifically, the bug was in how type
|
|
variables were handled.
|
|
|
|
The old substitution code, when it found that a type
|
|
variable had been bound to another type, always moved on to perform
|
|
a substitution in that other type. This wasn't really a problem then, since
|
|
any type variables that needed to be substituted were guaranteed to be
|
|
free (that's why they were put into the "forall" quantifier). However, with our
|
|
new system, we are using user-provided type variables (usually `a`, `b`, and so on),
|
|
which have likely already been used by our compiler internally, and thus have
|
|
been bound to something. That something is irrelevant to us: when we
|
|
perform a substitution on a user-defined data type, we _know_ that _our_ `a` is
|
|
free, and should be substitited. In short, precedence should be given to
|
|
substituting type variables, rather than resolving them to what they are bound to.
|
|
|
|
To make this adjustment possible, we need to make `substitute` a method of `type_manager`,
|
|
since it will now require an awareness of existing type bindings. Additionally,
|
|
this method will now perform its own type resolution, checking if a type variable
|
|
needs to be substitited between each step. The whole code is as follows:
|
|
|
|
{{< codelines "C++" "compiler/11/type.cpp" 134 165 >}}
|
|
|
|
That's all for types. Definitions, though, need some work. First of all,
|
|
we've changed our parser to feed our `constructor` type a vector of
|
|
`parsed_type_ptr`, rather than `std::string`. We therefore have to update
|
|
`constructor` to receive and store this new vector:
|
|
|
|
{{< codelines "C++" "compiler/11/definition.hpp" 13 20 >}}
|
|
|
|
Similarly, `definition_data` itself needs to accept the list of type
|
|
variables it has:
|
|
|
|
{{< codelines "C++" "compiler/11/definition.hpp" 54 70 >}}
|
|
|
|
We then look at `definition_data::insert_constructors`, which converts
|
|
`constructor` instances to actual constructor functions. The code,
|
|
which is getting pretty complciated, is as follows:
|
|
|
|
{{< codelines "C++" "compiler/11/definition.cpp" 64 92 >}}
|
|
|
|
In the above snippet, we do the following things:
|
|
|
|
1. We first create a set of type variables that can occur
|
|
in this type's constructors (the same set that's used
|
|
by the `to_type` method we saw earlier). While doing this, we ensure
|
|
a type variable is not used twice (this is not allowed), and add each
|
|
type variable to the final return type (which is something like `List a`),
|
|
in the order they occur.
|
|
2. When the variables have been gathered into a set, we iterate
|
|
over all constructors, and convert them into types by calling `to_type`
|
|
on their arguments, then assembling the resulting argument types into a function.
|
|
This is not enough, however,
|
|
{{< sidenote "right" "type-variables-note" "since constructors of types that accept type variables are polymorphic," >}}
|
|
This is also not enough because without generalization using "forall", we are risking using type variables
|
|
that have already been bound, or that will be bound. Even if <code>a</code> has not yet been used by the typechecker,
|
|
it will be once the type manager generates its first type variable, and things will go south. If we, for some reason,
|
|
wanted type constructors to be monomorphic (but generic, with type variables) we'd need to internally
|
|
instnatiate fresh type variables for every user-defined type variable, and substitute them appropriately.
|
|
{{< /sidenote >}}
|
|
as we have discussed above with \\(\\text{Nil}\\) and \\(\\text{Cons}\\).
|
|
To accomodate for this, we also add all type variables to the "forall" quantifier
|
|
of a new type scheme, whose monotype is our newly assembled function type. This
|
|
type scheme is what we store as the type of the constructor.
|
|
|
|
This was the last major change we have to perform. The rest is cleanup: we have switched
|
|
our system to dealing with type applications (sometimes with zero arguments), and we must
|
|
bring the rest of the compiler up to speed with this change. For instance, we update
|
|
`ast_int` to create a reference to an existing integer type during typechecking:
|
|
|
|
{{< codelines "C++" "compiler/11/ast.cpp" 20 22 >}}
|
|
|
|
Similarly, we update our code in `typecheck_program` to use
|
|
type applications in the type for binary operations:
|
|
|
|
{{< codelines "C++" "compiler/11/main.cpp" 31 37 >}}
|
|
|
|
Finally, we update `ast_case` to unwrap type applications to get the needed constructor
|
|
data from `type_data`. This has to be done in `ast_case::typecheck`, as follows:
|
|
|
|
{{< codelines "C++" "compiler/11/ast.cpp" 163 168 >}}
|
|
|
|
Additionally, a similar change needs to be made in `ast_case::compile`:
|
|
|
|
{{< codelines "C++" "compiler/11/ast.cpp" 174 175 >}}
|
|
|
|
That should be all! Let's try an example:
|
|
|
|
{{< rawblock "compiler/11/examples/works3.txt" >}}
|
|
|
|
The output:
|
|
|
|
```
|
|
Result: 6
|
|
```
|
|
|
|
Yay! Not only were we able to define a list of any type, but our `length` function correctly
|
|
determined the lengths of two lists of different types! Let's try an example with the
|
|
classic [`fold` functions](http://learnyouahaskell.com/higher-order-functions#folds):
|
|
|
|
{{< rawblock "compiler/11/examples/list.txt" >}}
|
|
|
|
We expect the sum of the list `[1,2,3,4]` to be `10`, and its length to be `4`, so the sum
|
|
of the two should be `14`. And indeed, our program agrees:
|
|
|
|
```
|
|
Result: 14
|
|
```
|
|
|
|
Let's do one more example, to test types that take more than one type parameter:
|
|
|
|
{{< rawblock "compiler/11/examples/pair.txt" >}}
|
|
|
|
Once again, the compiled program gives the expected result:
|
|
|
|
```
|
|
Result: 4
|
|
```
|
|
|
|
This looks good! We have added support for polymorphic data types to our compiler.
|
|
We are now free to move on to `let/in` expressions, __lambda functions__, and __Input/Output__,
|
|
as promised, starting with [part 12]({{< relref "12_compiler_let_in_lambda/index.md" >}}) - `let/in`
|
|
and lambdas!
|