---
title: Compiling a Functional Language Using C++, Part 11 - Polymorphic Data Types
date: 2020-03-28T20:10:35-07:00
draft: true
tags: ["C and C++", "Functional Languages", "Compilers"]
---
[In part 10]({{< relref "10_compiler_polymorphism.md" >}}), we managed to get our
compiler to accept functions that were polymorphically typed. However, a piece
of the puzzle is still missing: while our _functions_ can handle values
of different types, the same cannot be said for our _data types_. This means
that we cannot construct data structures that can contain arbitrary types.
While we can define and use a list of integers, if we want to also have a
list of booleans, we must copy all of our constructors and define a new data type.
Worse, not only do we have to duplicate the constructors, but also all the functions
that operate on the list. As far as our compiler is concerned, a list of
integers and a list of booleans are entirely different beasts, and cannot
be operated on by the same code.
To make polymorphic data types possible, we must extend our language (and type
system) a little. We will now allow for something like this:
```
data List a = { Nil, Cons a List }
```
In the above snippet, we are no longer declaring a single type, but a collection
of related types, __parameterized__ by a type `a`. Any type can take the place
of `a` to get a list containing that type of element.
Then, `List Int` is a type,
as is `List Bool` and `List (List Int)`. The constructors in the snippet also
get polymorphic types:
{{< latex >}}
\text{Nil} : \forall a \; . \; \text{List} \; a \\
\text{Cons} : \forall a \; . \; a \rightarrow \text{List} \; a \rightarrow \text{List} \; a
{{< /latex >}}
When you call `Cons`, the type of the resulting list varies with the type of element
you pass in. The empty list `Nil` is a valid list of any type, since, well, it's
empty.
Let's talk about `List` itself, now. I suggest that we ponder the following table:
\\(\\text{List}\\)|\\(\\text{Cons}\\)
----|----
\\(\\text{List}\\) is not a type; it must be followed up with arguments, like \\(\\text{List} \\; \\text{Int}\\).|\\(\\text{Cons}\\) is not a list; it must be followed up with arguments, like \\(\\text{Cons} \\; 3 \\; \\text{Nil}\\).
\\(\\text{List} \\; \\text{Int}\\) is in its simplest form.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is in its simplest form.
\\(\\text{List} \\; \\text{Int}\\) is a type.|\\(\\text{Cons} \\; 3 \\; \\text{Nil}\\) is a value of type \\(\\text{List} \\; \\text{Int}\\).
I hope that the similarities are quite striking. I claim that
`List` is quite similar to a constructor `Cons`, except that it occurs
in a different context: whereas `Cons` is a way to create values,
`List` is a way to create types. Indeed, while we call `Cons` a constructor,
it's typical to call `List` a __type constructor__.
We know that `Cons` is a function which
assigns to values (like `3` and `Nil`) other values (like `Cons 3 Nil`, or `[3]` for
short). In a similar manner, `List` can be thought of as a function
that assigns to types (like `Int`) other types (like `List Int`). We can
even claim that it has a type:
{{< latex >}}
\text{List} : \text{Type} \rightarrow \text{Type}
{{< /latex >}}
{{< sidenote "right" "dependent-types-note" "Unless we get really wacky," >}}
When your type constructors take as input not only other types but also values
such as 3
, you've ventured into the territory of
dependent types.
This is a significant step up in complexity from what we'll be doing in this
series. If you're interested, check out
Idris (if you want to know about dependent types
for functional programming) or Coq (to see how
propositions and proofs can be encoded in a dependently typed language).
{{< /sidenote >}}
our type constructors will only take zero or more types as input, and produce
a type as output. In this case, writing \\(\\text{Type}\\) becomes quite repetitive,
and we will adopt the convention of writing \\(*\\) instead. The types of such
constructors are called [kinds](https://en.wikipedia.org/wiki/Kind_(type_theory)).
Let's look at a few examples, just to make sure we're on the same page:
* The kind of \\(\\text{Bool}\\) is \\(*\\): it does not accept any
type arguments, and is a type in its own right.
* The kind of \\(\\text{List}\\) is \\(*\\rightarrow *\\): it takes
one argument (the type of the things inside the list), and creates
a type from it.
* If we define a pair as `data Pair a b = { MkPair a b }`, then its
kind is \\(* \\rightarrow * \\rightarrow *\\), because it requires
two parameters.
As one final observation, we note that effectively, all we're doing is
tracking the [arity](https://en.wikipedia.org/wiki/Arity) of the constructor
type.
Let's now enumerate all the possible forms that (mono)types can take in our system:
1. A type can be a placeholder, like \\(a\\), \\(b\\), etc.
2. A type can be a type constructor, applied to
{{< sidenote "right" "zero-more-note" "zero ore more arguments," >}}
It is convenient to treat regular types (like \(\text{Bool}\)) as
type constructors of arity 0 (that is, type constructors with kind \(*\)).
In effect, they take zero arguments and produce types (themselves).
{{< /sidenote >}} such as \\(\\text{List} \; \\text{Int}\\) or \\(\\text{Bool}\\).
3. A function from one type to another, like \\(\\text{List} \\; a \\rightarrow \\text{Int}\\).
Polytypes (type schemes) in our system can be all of the above, but may also include a "forall"
quantifier at the front, generalizing the type (like \\(\\forall a \\; . \\; \\text{List} \\; a \\rightarrow \\text{Int}\\)).
Let's start implementing all of this. Why don't we start with the change to the syntax of our language?
We have complicated the situation quite a bit. Let's take a look at the _old_ grammar
for data type declarations (this is going back as far as [part 2]({{< relref "02_compiler_parsing.md" >}})).
Here, \\(L\_D\\) is the nonterminal for the things that go between the curly braces of a data type
declaration, \\(D\\) is the nonterminal representing a single constructor definition,
and \\(L\_U\\) is a list of zero or more uppercase variable names:
{{< latex >}}
\begin{aligned}
L_D & \rightarrow D \; , \; L_D \\
L_D & \rightarrow D \\
D & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \text{upperVar} \; L_U \\
L_U & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}
This grammar was actually too simple even for our monomorphically typed language!
Since functions are not represented using a single uppercase variable, it wasn't possible for us
to define constructors that accept as arguments anything other than integers and user-defined
data types. Now, we also need to modify this grammar to allow for constructor applications (which can be nested!)
To do so, we will define a new nonterminal, \\(Y\\), for types:
{{< latex >}}
\begin{aligned}
Y & \rightarrow N \; ``\rightarrow" Y \\
Y & \rightarrow N
\end{aligned}
{{< /latex >}}
We make it right-recursive (because the \\(\\rightarrow\\) operator is right-associative). Next, we define
a nonterminal for all types _except_ those constructed with the arrow, \\(N\\).
{{< latex >}}
\begin{aligned}
N & \rightarrow \text{upperVar} \; L_Y \\
N & \rightarrow \text{typeVar} \\
N & \rightarrow ( Y )
\end{aligned}
{{< /latex >}}
The first of the above rules allows a type to be a constructor applied to zero or more arguments
(generated by \\(L\_Y\\)). The second rule allows a type to be a placeholder type variable. Finally,
the third rule allows for any type (including functions, again) to occur between parentheses.
This is so that higher-order functions, like \\((a \rightarrow b) \rightarrow a \rightarrow a \\),
can be represented.
Unfortunately, the definition of \\(L\_Y\\) is not as straightforward as we imagine. We could define
it as just a list of \\(Y\\) nonterminals, but this would make the grammar ambigous: something
like `List Maybe Int` could be interpreted as "`List`, applied to types `Maybe` and `Int`", and
"`List`, applied to type `Maybe Int`". To avoid this, we define a "type list element" \\(Y'\\),
which does not take arguments:
{{< latex >}}
\begin{aligned}
Y' & \rightarrow \text{upperVar} \\
Y' & \rightarrow \text{lowerVar} \\
Y' & \rightarrow ( Y )
\end{aligned}
{{< /latex >}}
We then make \\(L\_Y\\) a list of \\(Y'\\):
{{< latex >}}
\begin{aligned}
L_Y & \rightarrow Y' \; L_Y \\
L_Y & \rightarrow \epsilon
\end{aligned}
{{< /latex >}}
Finally, we update the rules for the data type declaration, as well as for a single
constructor:
{{< latex >}}
\begin{aligned}
T & \rightarrow \text{data} \; \text{upperVar} \; L_T = \{ L_D \} \\
D & \rightarrow \text{upperVar} \; L_Y \\
\end{aligned}
{{< /latex >}}
Now that we have a grammar for all these things, we have to implement
the corresponding data structures. We define a new family of structs,
extending `parsed_type`, which represent types as they are
received from the parser. These differ from regular types in that they
do not require that the types they represent are valid; validating
types requires two passes, which is a luxury we do not have when
parsing. We can define them as follows:
{{< codeblock "C++" "compiler/11/parsed_type.hpp" >}}
We define the conversion function `to_type`, which requires
a set of type variables quantified in the given type, and
the environment in which to look up the arities of various
type constructors. The implementation is as follows:
{{< codeblock "C++" "compiler/11/parsed_type.cpp" >}}
With this definition in hand, we can now update the grammar in our Bison file.
First things first, we'll add the type parameters to the data type definition:
{{< codelines "plaintext" "compiler/11/parser.y" 127 130 >}}
Next, we add the new grammar rules we came up with:
{{< codelines "plaintext" "compiler/11/parser.y" 138 163 >}}
Finally, we define the types for these new rules at the top of the file:
{{< codelines "plaintext" "compiler/11/parser.y" 43 44 >}}
{{< todo >}}
Nullary is not the right word.
{{< /todo >}}