blog-static/content/blog/10_compiler_polymorphism.md
Danila Fedorin 2fde7e5cf8 Update "compiler: polymorphism" to new math delimiters
Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
2024-05-13 18:48:34 -07:00

787 lines
41 KiB
Markdown

---
title: Compiling a Functional Language Using C++, Part 10 - Polymorphism
date: 2020-03-25T17:14:20-07:00
tags: ["C++", "Functional Languages", "Compilers"]
series: "Compiling a Functional Language using C++"
description: "In this post, we extend our compiler's typechecking algorithm to implement the Hindley-Milner type system, allowing for polymorphic functions."
favorite: true
---
[In part 8]({{< relref "08_compiler_llvm.md" >}}), we wrote some pretty interesting programs in our little language.
We successfully expressed arithmetic and recursion. But there's one thing
that we cannot express in our language without further changes: an `if` statement.
Suppose we didn't want to add a special `if/else` expression into our language.
Thanks to lazy evaluation, we can express it using a function:
```
defn if c t e = {
case c of {
True -> { t }
False -> { e }
}
}
```
But an issue still remains: so far, our compiler remains __monomorphic__. That
is, a particular function can only have one possible type for each one of its
arguments. With our current setup, something like this
{{< sidenote "right" "if-note" "would not work:" >}}
In a polymorphically typed language, the inner <code>if</code> would just evaluate to
<code>False</code>, and the whole expression to 3.
{{< /sidenote >}}
```
if (if True False True) 11 3
```
This is because, for this to work, both of the following would need to hold (borrowing
some of our notation from the [typechecking]({{< relref "03_compiler_typechecking.md" >}}) post):
{{< latex >}}
\text{if} : \text{Bool} \rightarrow \text{Int} \rightarrow \text{Int} \rightarrow \text{Int}
{{< /latex >}}
{{< latex >}}
\text{if} : \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool}
{{< /latex >}}
But using our rules so far, such a thing is impossible, since there is no way for
\(\text{Int}\) to be unified with \(\text{Bool}\). We need a more powerful
set of rules to describe our program's types.
### Hindley-Milner Type System
One such powerful set of rules is the [Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system),
which we have previously alluded to. In fact, the rules we came up
with were already very close to Hindley-Milner, with the exception of two:
__generalization__ and __instantiation__. It's been quite a while since the last time we worked on typechecking, so I'm going
to present a table with these new rules, as well as all of the ones that we
{{< sidenote "right" "rules-note" "previously used." >}}
The rules aren't quite the same as the ones we used earlier;
note that \(\sigma\) is used in place of \(\tau\) in the first rule,
for instance. These changes are slight, and we'll talk about how the
rules work together below.
{{< /sidenote >}} I will also give a quick
summary of each of these rules.
{{< foldtable >}}
Rule|Name and Description
-----|-------
{{< latex >}}
\frac
{x:\sigma \in \Gamma}
{\Gamma \vdash x:\sigma}
{{< /latex >}}| __Var__: If the variable \(x\) is known to have some polymorphic type \(\sigma\) then an expression consisting only of that variable is of that type.
{{< latex >}}
\frac
{\Gamma \vdash e_1 : \tau_1 \rightarrow \tau_2 \quad \Gamma \vdash e_2 : \tau_1}
{\Gamma \vdash e_1 \; e_2 : \tau_2}
{{< /latex >}}| __App__: If an expression \(e_1\), which is a function from monomorphic type \(\tau_1\) to another monomorphic type \(\tau_2\), is applied to an argument \(e_2\) of type \(\tau_1\), then the result is of type \(\tau_2\).
{{< latex >}}
\frac
{\Gamma \vdash e : \tau \quad \text{matcht}(\tau, p_i) = b_i
\quad \Gamma,b_i \vdash e_i : \tau_c}
{\Gamma \vdash \text{case} \; e \; \text{of} \;
\{ (p_1,e_1) \ldots (p_n, e_n) \} : \tau_c }
{{< /latex >}}| __Case__: This rule is not part of Hindley-Milner, and is specific to our language. If the expression being case-analyzed is of type \(\tau\) and each branch \((p_i, e_i)\) is of the same type \(\tau_c\) when the pattern \(p_i\) works with type \(\tau\) producing extra bindings \(b_i\), the whole case expression is of type \(\tau_c\).
{{< latex >}}
\frac{\Gamma \vdash e : \sigma' \quad \sigma' \sqsubseteq \sigma}
{\Gamma \vdash e : \sigma}
{{< /latex >}}| __Inst (New)__: If type \(\sigma\) is an instantiation (or specialization) of type \(\sigma'\) then an expression of type \(\sigma'\) is also an expression of type \(\sigma\).
{{< latex >}}
\frac
{\Gamma \vdash e : \sigma \quad \alpha \not \in \text{free}(\Gamma)}
{\Gamma \vdash e : \forall \alpha . \sigma}
{{< /latex >}}| __Gen (New)__: If an expression has a type with free variables, this rule allows us generalize it to allow all possible types to be used for these free variables.
Here, there is a distinction between different forms of types. First, there are
monomorphic types, or __monotypes__, \(\tau\), which are types such as \(\text{Int}\),
\(\text{Int} \rightarrow \text{Bool}\), \(a \rightarrow b\)
and so on. These types are what we've been working with so far. Each of them
represents one (hence, "mono-"), concrete type. This is obvious in the case
of \(\text{Int}\) and \(\text{Int} \rightarrow \text{Bool}\), but
for \(a \rightarrow b\) things are slightly less clear. Does it really
represent a single type, if we can put an arbtirary thing for \(a\)?
The answer is "yes"! Although \(a\) is not currently known, it stands
in place of another monotype, which is yet to be determined.
So, how do we represent polymorphic types, like that of \(\text{if}\)?
We borrow some more notation from mathematics, and use the "forall" quantifier:
{{< latex >}}
\text{if} : \forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \rightarrow a
{{< /latex >}}
This basically says, "the type of \(\text{if}\) is \(\text{Bool} \rightarrow a \rightarrow a \rightarrow a\)
for all possible \(a\)". This new expression using "forall" is what we call a type scheme, or a polytype \(\sigma\).
For simplicity, we only allow "forall" to be at the front of a polytype. That is, expressions like
\(a \rightarrow \forall b \; . \; b \rightarrow b\) are not valid polytypes as far as we're concerned.
It's key to observe that only some of the typing rules in the above table use polytypes (\(\sigma\)). Whereas expressions
consisting of a single variable can be polymorphically typed, this is not true for function applications and case expressions.
In fact, according to our rules, there is no way to introduce a polytype anywhere into our system!
The reason for this is that we only allow polymorphism at certain locations. In the Hindley-Milner type system,
this is called __Let-Polymorphism__, which means that only in `let`/`in` expressions can variables or expressions
be given a polymorphic type. We, on the other hand, do not (yet) have `let`/`in` expressions, so our polymorphism
is further limited: only functions (and data type constructors) can be polymorphically typed.
Let's talk about what __Inst__ does, and what "\(\sqsubseteq\)" means.
First, why don't we go ahead and write the formal inference rule for \(\sqsubseteq\):
{{< latex >}}
\frac
{\tau'=\{\alpha_i \mapsto \tau_i \}\tau \quad \beta_i \not \in \text{free}(\forall \alpha_1...\forall \alpha_n \; . \; \tau)}
{\forall \alpha_1 ... \forall \alpha_n \; . \; \tau \sqsubseteq \forall \beta_1 ... \forall \beta_m \; . \; \tau'}
{{< /latex >}}
In my opinion, this is one of the more confusing inference rules we have to deal with in Hindley-Milner.
In principle, though, it's not too hard to understand. \(\sigma' \sqsubseteq \sigma\) says "\(\sigma'\)
is more general than \(\sigma\)". Alternatively, we can interpret it as "\(\sigma\) is an instance of \(\sigma'\)".
What does it mean for one polytype to be more general than another? Intuitively, we might say that \(\forall a \; . \; a \rightarrow a\) is
more general than \(\text{Int} \rightarrow \text{Int}\), because the former type can represent the latter, and more. Formally,
we define this in terms of __substitutions__. A substitution \(\{\alpha \mapsto \tau \}\) replaces a variable
\(\alpha\) with a monotype \(\tau\). If we can use a substitution to convert one type into another, then the first
type (the one on which the substitution was performed) is more general than the resulting type. In our previous example,
since we can apply the substitution \(\{a \mapsto \text{Int}\}\) to get \(\text{Int} \rightarrow \text{Int}\)
from \(\forall a \; . \; a \rightarrow a\), the latter type is more general; using our notation,
\(\forall a \; . \; a \rightarrow a \sqsubseteq \text{Int} \rightarrow \text{Int}\).
That's pretty much all that the rule says, really. But what about the whole thing with \(\beta\) and \(\text{free}\)? The reason
for that part of the rule is that, in principle, we can substitute polytypes into polytypes. However, we can't
do this using \(\{ \alpha \mapsto \sigma \}\). Consider, for example:
{{< latex >}}
\{ a \mapsto \forall b \; . \; b \rightarrow b \} \; \text{Bool} \rightarrow a \rightarrow a \stackrel{?}{=}
\text{Bool} \rightarrow (\forall b \; . \; b \rightarrow b) \rightarrow \forall b \; . \; b \rightarrow b
{{< /latex >}}
Hmm, this is not good. Didn't we agree that we only want quantifiers at the front of our types? Instead, to make that substitution,
we only substitute the monotype \(b \rightarrow b\), and then add the \(\forall b\) at the front. But
to do this, we must make sure that \(b\) doesn't occur anywhere in the original type
\(\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a\) (otherwise we can accidentally generalize
too much). So then, our concrete inference rule is as follows:
{{< latex >}}
\frac
{
\begin{gathered}
\text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b =\{a \mapsto (b \rightarrow b) \} \; \text{Bool} \rightarrow a \rightarrow a \\
b \not \in \text{free}(\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a) = \varnothing
\end{gathered}
}
{\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \sqsubseteq \forall b \; . \; \text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b}
{{< /latex >}}
In the above rule we:
1. Replaced \(a\) with \(b \rightarrow b\), getting rid of \(a\) in the quantifier.
2. Observed that \(b\) is not a free variable in the original type, and thus can be generalized.
3. Added the generalization for \(b\) to the front of the resulting type.
Now, __Inst__ just allows us to perform specialization / substitution as many times
as we need to get to the type we want.
### A New Typechecking Algorithm
Alright, now we have all these rules. How does this change our typechecking algorithm?
How about the following:
1. To every declared function, assign the type \(a \rightarrow ... \rightarrow y \rightarrow z\),
where
{{< sidenote "right" "arguments-note" "\(a\) through \(y\) are the types of the arguments to the function," >}}
Of course, there can be more or less than 25 arguments to any function. This is just a generalization;
we use as many input types as are needed.
{{< /sidenote >}} and \(z\) is the function's
return type.
2. We typecheck each declared function, using the __Var__, __Case__, __App__, and __Inst__ rules.
3. Whatever type variables we don't fill in, we assume can be filled in with any type,
and use the __Gen__ rule to sprinkle polymorphism where it is needed.
Maybe this is enough. Let's go through an example. Suppose we have three functions:
```
defn if c t e = {
case c of {
True -> { t }
False -> { e }
}
}
defn testOne = { if True False True }
defn testTwo = { if True 0 1 }
```
If we go through and typecheck them top-to-bottom, the following happens:
1. We start by assuming \(\text{if} : a \rightarrow b \rightarrow c \rightarrow d\),
\(\text{testOne} : e\) and \(\text{testTwo} : f\).
2. We look at `if`. We infer the type of `c` to be \(\text{Bool}\), and thus, \(a = \text{Bool}\).
We also infer that \(b = c\), since they occur in two branches of the same case expression.
Finally, we infer that \(c = d\), since whatever the case expression returns becomes the return
value of the function. Thus, we come out knowing that \(\text{if} : \text{Bool} \rightarrow b
\rightarrow b \rightarrow b\).
3. Now, since we never figured out \(b\), we use __Gen__: \(\text{if} : \forall b \; . \;
\text{Bool} \rightarrow b \rightarrow b \rightarrow b\). Like we'd want, `if` works with
all types, as long as both its inputs are of the same type.
4. When we typecheck the body of `testOne`, we use __Inst__ to turn the above type for `if`
into a single, monomorphic instance. Then, type inference proceeds as before, and all is well.
5. When we typecheck the body of `testTwo`, we use __Inst__ again, instantiating a new monotype,
and all is well again.
So far, so good. But what if we started from the bottom, and went to the top?
1. We start by assuming \(\text{if} : a \rightarrow b \rightarrow c \rightarrow d\),
\(\text{testOne} : e\) and \(\text{testTwo} : f\).
2. We look at `testTwo`. We infer that \(a = \text{Bool}\) (since
we pass in `True` to `if`). We also infer that \(b = \text{Int}\), and that \(c = \text{Int}\).
Not yet sure of the return type of `if`, this is where we stop. We are left with
the knowledge that \(f = d\) (because the return type of `if` is the return type of `testTwo`),
but that's about it. Since \(f\) is no longer free, we don't generalize, and conclude that \(\text{testTwo} : f\).
3. We look at `testOne`. We infer that \(a = \text{Bool}\) (already known). We also infer
that \(b = \text{Bool}\), and that \(c = \text{Bool}\). But wait a minute! This is not right.
We are back to where we started, with a unification error!
What went wrong? I claim that we typechecked the functions that _used_ `if` before we typechecked `if` itself,
which led us to infer a less-than-general type for `if`. This less-than-general type was insufficient to
correctly check the whole program.
To address this, we enforce a particular order of type inference on our declaration, guided by dependencies
between functions. Haskell, which has to deal with a similar issue, has
[a section in the 2010 report on this](https://www.haskell.org/onlinereport/haskell2010/haskellch4.html#x10-880004.5).
In short:
1. We find the
{{< sidenote "right" "transitive-closure-note" "transitive closure" >}}
A transitive closure of a vertex in a graph is the list of all vertices reachable
from that original vertex. Check out the <a href="https://en.wikipedia.org/wiki/Transitive_closure#In_graph_theory">
Wikipedia page on this</a>.
{{< /sidenote >}}
of the function dependencies. We define a function \(f\) to be dependent on another function \(g\)
if \(f\) calls \(g\). The transitive closure will help us find functions that are related indirectly.
For instance, if \(g\) also depends on \(h\), then the transitive closure of \(f\) will
include \(h\), even if \(f\) directly doesn't use \(h\).
2. We isolate groups of mutually dependent functions. If \(f\) depends on \(g\) and \(g\) depends \(f\),
they are placed in one group. We then construct a dependency graph __of these groups__.
3. We compute a topological order of the group graph. This helps us typecheck the dependencies
of functions before checking the functions themselves. In our specific case, this would ensure
we check `if` first, and only then move on to `testOne` and `testTwo`. The order of typechecking
within a group does not matter, as long as we generalize only after typechecking all functions
in a group.
4. We typecheck the function groups, and functions within them, following the above topological order.
To find the transitive closure of a graph, we can use [Warshall's Algorithm](https://cs.winona.edu/lin/cs440/ch08-2.pdf).
This algorithm, with complexity \(O(|V|^3)\), goes as follows:
{{< latex >}}
\begin{aligned}
& A, R^{(i)} \in \mathbb{B}^{n \times n} \\
& \\
& R^{(0)} \leftarrow A \\
& \textbf{for} \; k \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
& \quad \textbf{for} \; i \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
& \quad \quad \textbf{for} \; j \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
& \quad \quad \quad R^{(k)}[i,j] \leftarrow R^{(k-1)}[i,j] \; \textbf{or} \; R^{(k-1)}[i,k] \; \textbf{and} \; R^{(k-1)}[k,j] \\
& \textbf{return} \; R^{(n)}
\end{aligned}
{{< /latex >}}
In the above notation, \(R^{(i)}\) is the \(i\)th matrix \(R\), and \(A\) is the adjacency
matrix of the graph in question. All matrices in the algorithm are from \(\mathbb{B}^{n \times n}\),
the set of \(n\) by \(n\) boolean matrices. Once this algorithm is complete, we get as output a
transitive closure adjacency matrix \(R^{(n)}\). Mutually dependent functions will be pretty easy to
isolate from this matrix. If \(R^{(n)}[i,j]\) and \(R^{(n)}[j,i]\), then the functions represented by vertices
\(i\) and \(j\) depend on each other.
Once we've identified the groups, and
{{< sidenote "right" "group-graph-note" "constructed a group graph," >}}
This might seem like a "draw the rest of the owl" situation, but it really isn't.
We'll follow a naive algorithm for findings groups, and for translating function dependencies
into group dependencies. This algorithm, in C++, will be presented later on.
{{< /sidenote >}} it is time to compute the topological order. For this, we will use
[Kahn's Algorithm](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm).
The algorithm goes as follows:
{{< latex >}}
\begin{aligned}
& L \leftarrow \text{empty list} \\
& S \leftarrow \text{set of all nodes with no incoming edges} \\
& \\
& \textbf{while} \; S \; \text{is non-empty} \; \textbf{do} \\
& \quad \text{remove a node} \; n \; \text{from} \; S \\
& \quad \text{add} \; n \; \text{to the end of} \; L \\
& \quad \textbf{for each} \; \text{node} \; m \; \text{with edge} \;
e \; \text{from} \; n \; \text{to} \; m \; \textbf{do} \\
& \quad \quad \text{remove edge} \; e \; \text{from the graph} \\
& \quad \quad \textbf{if} \; m \; \text{has no other incoming edges} \; \textbf{then} \\
& \quad \quad \quad \text{insert} \; m \; \text{into} \; S \\
& \\
& \textbf{if} \; \text{the graph has edges} \; \textbf{then} \\
& \quad \textbf{return} \; \text{error} \quad \textit{(graph has at least once cycle)} \\
& \textbf{else} \\
& \quad \textbf{return} \; L \quad \textit{(a topologically sorted order)}
\end{aligned}
{{< /latex >}}
Note that since we've already isolated all mutually dependent functions into
groups, our graph will never have cycles, and this algorithm will always succeed.
Also note that since we start with nodes with no incoming edges, our list will
__begin with the groups that should be checked last__. This is because a node
with no incoming edges might (and probably does) still have outgoing edges,
and thus depends on other functions / groups. Like in our successful example,
we want to __typecheck functions that are depended on first__.
### Implementation
Let's start working on a C++ implementation of all of this now. First,
I think that we should create a C++ class that will represent our function
dependency graph. Let's call it `function_graph`. I propose the following
definition:
{{< codelines "C++" "compiler/10/graph.hpp" 12 52 >}}
There's a lot to unpack here. First of all, we create a type alias `function` that
represents the label of a function in our graph. It is probably most convenient
to work with `std::string` instances, so we settle for that. Next, we define
a struct that will represent a single group of mutually dependent functions.
Passing this struct by value seems wrong, so we'll settle for a C++ `unique_pt`
to help carry instances around.
Finally, we arrive at the definition of `function_graph`. Inside this class,
we define a helper struct, `group_data`, which holds information
about an individual group as it is being constructed. This information
includes the group's adjacency list and
[indegree](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree)
(both used for Kahn's topological sorting algorithm), as well as the set
of functions in the group (which we will eventually return).
The `adjacency_lists` and `edges` fields are the meat of the graph representation.
Both of the variables provide a different view of the same graph: `adjacency_lists`
associates with every function a list of functions it depends on, while
`edges` holds a set of tuples describing edges in the graph. Having
more than one representation makes it more convenient for us to perform
different operations on our graphs.
Next up are some internal methods that perform the various steps we described
above:
* `compute_transitive_edges` applies Warshall's algorithm to find the graph's
transitive closure.
* `create_groups` creates two mappings, one from functions to their respective
groups' IDs, and one from group IDs to information about the corresponding groups.
This step is largely used to determine which functions belong to the same group,
and as such, uses the set of transitive edges generated by `compute_transitive_edges`.
* `create_edges` creates edges __between groups__. During this step, the indegrees
of each group are computed, as well as their adjacency lists.
* `generate_order` uses the indegrees and adjacency lists produced in the prior step
to establish a topological order.
Following these, we have three public function definitions:
* `add_function` adds a vertex to the graph. Sometimes, a function does not
reference any other functions, and would not appear in the list of edges.
We will call `add_function` to make sure that the function graph is aware
of such independent functions. For convenience, `add_function` returns the adjacency list
of the added function.
* `add_edge` adds a new dependency between two functions.
* `compute_order` method uses the internal methods described above to convert
the function dependency graph into a properly ordered list of groups.
Let's start by looking at how to implement the internal methods. `compute_transitive_edges`
is a very straightforward implementation of Warshall's:
{{< codelines "C++" "compiler/10/graph.cpp" 3 21 >}}
Next is `create_groups`. For each function, we iterate over all other functions.
If the other function is mutually dependent with the first function, we add
it to the same group. In the outer loop, we skip over functions that have
already been added to the group. This is because
{{< sidenote "right" "equivalence-note" "mutual dependence" >}}
There is actually a slight difference between "mutual dependence"
the way we defined it and "being in the same group", and
it lies in the symmetric property of an equivalence relation.
We defined a function to depend on another function if it calls
that other function. Then, a recursive function depends on itself,
but a non-recursive function does not, and therefore does not
satisfy the symmetric property. However, as far as we're concerned,
a function should be in a group with itself even if it's not recursive. Thus, the
real equivalence relation we use is "in the same group as", and
consists of "mutual dependence" extended with symmetry.
{{< /sidenote >}}
is an [equivalence relation](https://en.wikipedia.org/wiki/Equivalence_relation),
which means that if we already added a function to a group, all its
group members were also already visited and added.
{{< codelines "C++" "compiler/10/graph.cpp" 23 44 >}}
Once groups have been created, we use their functions' edges
to create edges for the groups themselves, using `create_edges`.
We avoid creating edges from a group to itself, to prevent
unnecessary cycles. While constructing the edges, we also
increment the relevant indegree counter.
{{< codelines "C++" "compiler/10/graph.cpp" 46 63 >}}
Finally, we apply Kahn's algorithm to create a topological order
in `generate_order`:
{{< codelines "C++" "compiler/10/graph.cpp" 65 90 >}}
These four steps are used in `compute_order`:
{{< codelines "C++" "compiler/10/graph.cpp" 106 114 >}}
Let's now look at the remaining two public definitions.
First comes `add_function`, which creates an adjacency list for the
function to be inserted (if one does not already exist),
and returns a reference to the resulting list:
{{< codelines "C++" "compiler/10/graph.cpp" 92 99 >}}
We use this in `add_edge`, which straightforwardly creates an edge
between two functions:
{{< codelines "C++" "compiler/10/graph.cpp" 101 104 >}}
With this, we can now properly order our typechecking.
However, we are just getting started: there are still
numerous changes we need to make to get our compiler
to behave as we desire.
The first change is the least relevant, but will help clean
up our code base in the presence of polymorphism: we will
get rid of `resolve`, in both definitions and AST nodes.
The reasons for this are twofold. First,
{{< sidenote "right" "case-type-note" "only the case expression node actually uses the type it stores." >}}
Recall that <code>ast_case</code> needs this information to properly
account for the changes to the stack from when data is unpacked.
{{< /sidenote >}} This means that
all the rest of the infrastructure we've written around
preserving types is somewhat pointless. Second, when
we call `resolve`, we'd now have to distinguish
between type variables captured by "forall" and actual,
undefined variables. That's a lot of wasted work!
To replace the now-removed `type` field,
we make `ast_case` include a new member, `input_type`,
which stores the type of the thing between `case` and `of`.
Since `ast_case` requires its type to be a data type
at the time of typechecking, we no longer need to resolve anything.
Next, we need to work in a step geared towards finding function calls
(to determine dependencies). As we have noted in [part 6]({{< relref "06_compiler_compilation.md" >}}),
it's pretty easy to tell apart calls to global functions from "local" ones. If
we see that a variable was previously bound (perhaps as a function argument,
or by a pattern in a case expression), we know for sure that it is not a global
function call. Otherwise, if the variable isn't bound anywhere in the function
definition (it's a __free variable__), it must refer to a global function. Then,
we can traverse the function body, storing variables that are bound (but only within
their scope), and noting references to variables we haven't yet seen. To
implement this, we can use a linked list, where each node refers to a particular
scope, points to the scope enclosing it, and contains a list of variables...
Wait a minute, this is identical to `type_env`! There's no reason to reimplement all
this. But then, another question arises: do we throw away the `type_env` generated
by the dependency-searching step? It seems wasteful, since we will eventually
repeat this same work. Rather, we'll re-use the same `type_env` instances
in both this new step and `typecheck`. To do this, we will now store a pointer
to a `type_env` in every AST node, and set this pointer during our first traversal
of the tree. Indeed, this makes our `type_env` more like a
[symbol table](https://en.wikipedia.org/wiki/Symbol_table). With this change,
our new dependency-finding step will be implemented by the `find_free` function
with the following signature:
```C++
void ast::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into);
```
Let's take a look at how this will be implemented. The simplest case (as usual)
is `ast_int`:
{{< codelines "C++" "compiler/10/ast.cpp" 16 18 >}}
In this case, we associate the `type_env` with the node, but don't do anything
else: a number is not a variable. A more interesting case is `ast_lid`:
{{< codelines "C++" "compiler/10/ast.cpp" 33 36 >}}
If a lowercase variable has not yet been bound to something, it's free,
and we store it. Somewhat counterintuitively, `ast_uid` behaves
differently:
{{< codelines "C++" "compiler/10/ast.cpp" 54 56 >}}
We don't allow uppercase variables to be bound to anything outside of data type
declarations, so we don't care about uppercase free variables. Next up is
`ast_binop`:
{{< codelines "C++" "compiler/10/ast.cpp" 73 77 >}}
A binary operator can have free variables in the subexpressions on the left and on the right, and
the above implementation reflects that. This is identical to the implementation of
`ast_app`:
{{< codelines "C++" "compiler/10/ast.cpp" 109 113 >}}
Finally, `ast_case` requires the most complicated function (as usual):
{{< codelines "C++" "compiler/10/ast.cpp" 142 150 >}}
The `type_scope` function replaces the `type_env::scope` method,
which cannot (without significant effort) operate on smart pointers.
Importantly, we are using a new `pattern` method here, `insert_bindings`. This
is because we split "introducing variables" and "typechecking variables"
into two steps for patterns, as well. The implementation of `insert_bindings`
for `pattern_var` is as follows:
{{< codelines "C++" "compiler/10/ast.cpp" 230 232 >}}
A variable pattern always introduces the variable it is made up of.
On the other hand, the implementation for `pattern_constr` is as follows:
{{< codelines "C++" "compiler/10/ast.cpp" 245 249 >}}
All the variables of the pattern are placed into the environment. For now, we don't worry
about arity; this is the job of typechecking.
These changes are reflected in all instances of our `typecheck` function. First of
all, `typecheck` no longer needs to receive a `type_env` parameter, since each
tree node has a `type_env_ptr`. Furthermore, `typecheck` should no longer call
`bind`, since this was already done by `find_free`. For example,
`ast_lid::typecheck` will now use `env::lookup`:
{{< codelines "C++" "compiler/10/ast.cpp" 38 40 >}}
Don't worry about `instantiate` for now; that's coming up. Similarly to
`ast_lid`, `ast_case::typecheck` will no longer introduce new bindings,
but unify existing types via the `pattern`:
{{< codelines "C++" "compiler/10/ast.cpp" 152 169 >}}
The above implementation uses another new `pattern` method, `typecheck`.
This method inherits the type checking functionality previously
contained in `pattern::match`. Here's the implementation for `pattern_var`:
{{< codelines "C++" "compiler/10/ast.cpp" 234 236 >}}
And here's the implementation for `pattern_constr`:
{{< codelines "C++" "compiler/10/ast.cpp" 251 266 >}}
So far, so good. However, for all of this to reach the main typechecking
code, not only `ast` subclasses need to be updated, but also
the `definition`s. Here things get more complicated, because
`definition_data` and `definition_defn` are growing more and more apart.
Previously, we had two typechecking steps: `typecheck_first` (which registered
function names into the environment) and `typecheck_second` (which performed
the actual typechecking). However, not only are these names not informative,
but the algorithms for typechecking the two types of definition will soon
have different numbers of "major" steps.
Let's take a look at how we would typecheck data types. I propose the following
steps:
1. Iterate all declared data types, storing them into some kind of "known" list.
2. Iterate again, and for each constructor of a type, verify that
it refers to "known" types. Add valid constructors to the global environment as functions.
We don't currently verify that types are "known"; A user could declare a list of `Floobs`,
and never say what a `Floob` is.
{{< sidenote "right" "known-type-note" "This isn't too big of an issue" >}}
Curiously, this flaw did lead to some valid programs being rejected. Since
we had no notion of a "known" type, whenever data type constructors
were created, every argument type was marked a "base" type;
see <a href="https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/09/definition.cpp#L82">
this line</a> if you're curious.
This would cause pattern matching to fail on the tail of a list, with
the "attempt to pattern match on non-data argument" error.
{{< /sidenote >}}(good luck constructing
a value of a non-existent type), but a mature compiler should prevent this from happening.
On the other hand, here are the steps for function definitions:
1. Find the free variables of each function to create the ordered list of groups as described above.
2. Within each group, insert a general function type (like \(a \rightarrow b \rightarrow c\))
into the environment for each function.
3. Within each group (in the same pass) run typechecking
(including polymorphism, using the rules as described above).
The two types of definitions further diverge when generating LLVM and compiling to G-machine instructions:
data types immediately construct and insert their functions, and do not emit G-machine instructions,
while functions generate G-machine instructions, declare prototypes, and emit LLVM in three distinct phases.
Overall, there are virtually no similarities between the two data type declarations, and any inheritance
of common functions starts to appear somewhat forced. To address this, we remove the `definition` class
altogether, and sever the relationship between `definition_data` and `definition_defn`. The
two now look as follows:
{{< codelines "C++" "compiler/10/definition.hpp" 23 67 >}}
In `definition_defn`, the functions are arranged as follows:
* `find_free` locates the free variables in the definition, populating
the `free_variables` field and thereby finding edges for the function graph.
* `insert_types` stores the type of the function into the global environment
(a pointer to which is now stored as a field).
* `typecheck` runs the standard typechecking steps.
* `compile` generates G-machine instructions.
* `declare_llvm` inserts LLVM function prototypes into the `llvm_context`.
* `generate_llvm` converts G-machine instructions into LLVM IR.
In `definition_data`, the steps are significantly simpler:
* `insert_types` registers the type being declared as a "known" type.
* `insert_constructors` inserts constructors (which are verified to
refer to "known" types) into the global environment.
* `generate_llvm` creates the LLVM functions (and their IR).
While the last three methods of `definition_defn` remain unchanged save
for the name, the implementations of the first three see some updates.
First is `find_free`:
{{< codelines "C++" "compiler/10/definition.cpp" 12 26 >}}
First, to make sure we don't pollute the global scope
with function parameters, `find_free` creates a new environment
`var_env`. Then, it stores into this new environment the function parameters,
ensuring that the parameters of a function aren't marked "free".
Concurrently, `find_free` constructs the "general" function
type (used by `insert_types`). Once all the arguments have been bound, `definition_defn::find_free`
makes a call to `ast::find_free`, which does the work of actually
finding free variables.
Since the function type is created by `find_free`, `insert_types` has very little to do:
{{< codelines "C++" "compiler/10/definition.cpp" 28 30 >}}
Finally, `typecheck`, which no longer has to bind the function
arguments to new types, is also fairly simple:
{{< codelines "C++" "compiler/10/definition.cpp" 32 35 >}}
Let's move on to data types. In order to implement `definition_data::insert_types`,
we need to store somewhere a list of all the valid type names. We do this
by adding a new `type_names` field to `type_env`, and implementing the
corresponding methods `lookup_type`:
{{< codelines "C++" "compiler/10/type_env.cpp" 11 16 >}}
And `bind_type`:
{{< codelines "C++" "compiler/10/type_env.cpp" 26 29 >}}
Note in the above snippets that we disallow redeclaring type names;
declaring two data types (or other types) with the same name in
our language will not be valid. In `insert_types`, we create a new
data type and store it in the environment:
{{< codelines "C++" "compiler/10/definition.cpp" 59 62 >}}
We then update `insert_constructors` to query the environment
when creating constructor types, rather than blindly using `new type_base(...)`
like before:
{{< codelines "C++" "compiler/10/definition.cpp" 64 82 >}}
The separation of data and function definitions must be reconciled with code
going back as far as the parser. While previously, we populated a single, global
vector of definitions called `program`, we can no longer do that. Instead, we'll
split our program into two maps, one for data types and one for functions. We
use maps for convenience: the groups generated by our function graph refer
to functions by name, and it would be nice to quickly look up the data
the names refer to. Rather than returning such maps, we change our semantic
actions to simply insert new data into one of two global maps. Below
is a snippet that includes all the changes:
{{< codelines "plaintext" "compiler/10/parser.y" 39 65 >}}
Note that `program` and `definitions` no longer have a type, and that `data` and `defn`
have been changed to return `definition_data_ptr` and `definition_defn_ptr`, respectively.
This necessitates changes to our main file. First of all, we declare the two new maps
we hope to receive from Bison:
{{< codelines "C++" "compiler/10/main.cpp" 24 25 >}}
We then change all affected functions, which in many cases amounts to splitting the `program` parameter
into `defs_data` and `defs_defn` parameters. We also make other, largely mechanical changes: code iterating
over definitions now requires the use of `second` to refer to the value stored in the map, and LLVM
generation now needs to separately process the two different types of definitions. The biggest change
occurs in `typecheck_program`, which not only undergoes all the aforementioned modifications, but
is also updated to use topological ordering:
{{< codelines "C++" "compiler/10/main.cpp" 27 84 >}}
The above code uses the yet-unexplained `generalize` method. What's going on?
Observe that the __Var__ rule of the Hindley-Milner type system says that a variable \(x\)
can have a __polytype__ in the environment \(\Gamma\). Our `type_ptr` can only represent monotypes,
so we must change what `type_env` associates with names to a new struct for representing polytypes,
which we will call `type_scheme`. The `type_scheme` struct, just like the formal definition of
a polytype, contains zero or more "forall"-quantified type variables, followed by a monotype which
may use these variables:
{{< codelines "C++" "compiler/10/type.hpp" 17 27 >}}
The `type_scheme::instantiate` method is effectively an implementation of the special
case of the __Inst__ rule, in which a polytype is specialized to a monotype. Since
the __App__ and __Case__ rules only use monotypes, we'll be using this special case a lot.
We implement this method as follows:
{{< codelines "C++" "compiler/10/type.cpp" 34 41 >}}
In the above code, if the type scheme represents a monotype (i.e., it has no quantified variables),
we simply return that monotype. Otherwise, we must perform a substitution, replacing "forall"-quantified
variables with fresh type parameters to be determined (we will never determine a single type for any of
the quantified variables, since they are specifically meant to represent any type).
We build a substitution map, which assigns to each quantified type variable a corresponding
"fresh" type, and then create a new type with with the substitution applied using `substitute`,
which is implemented as follows:
{{< codelines "C++" "compiler/10/type.cpp" 18 32 >}}
In principle, the function is fairly simple: if the current type is equivalent to a
quantified type, we return the corresponding "fresh" type. If, on the other hand,
the type represents a function, we perform a substitution in the function's input
and output types. This method avoids creating new types where possible; a new type
is only created if a function's input or output type is changed by a substitution
(in which case, the function itself is changed by the substitution). In all
other cases, substitution won't do anything, so we just return the original type.
Now it is a bit more clear why we saw `instantiate` in a code snippet some time ago;
to compute a monotype for a variable reference, we must take into account the
possibility that the variable has a polymorphic type, which needs to be specialized
(potentially differently in every occurrence of the variable).
When talking about our new typechecking algorithm, we mentioned using __Gen__ to sprinkle
polymorphism into our program. If it can, __Gen__ will add free variables
in a type to the "forall" quantifier at the front, making that type polymorphic.
We implement this using a new `generalize` method added to the `type_env`, which (as per
convention) generalizes the type of a given variable as much as possible:
{{< codelines "C++" "compiler/10/type_env.cpp" 31 41 >}}
For now, we disallow types to be generalized twice, and we naturally disallow generalizing
types of nonexistent variables. If neither of those things occurs, we find all the free
variables in the variable's current type using a new method called `type_mgr::find_free`,
and put them into the "forall" quantifier. `type_mgr::find_free` is implemented as follows:
{{< codelines "C++" "compiler/10/type.cpp" 138 148 >}}
The above code is fairly straightforward; if a type is a variable that is not yet bound to anything,
it is free; if the type is a function, we search for free variables in its input and output types;
otherwise, the type has no free variables.
Finally, we have made the necessary changes. Let's test it out with the example from the beginning:
{{< rawblock "compiler/10/examples/if.txt" >}}
Running it, we get the output:
```
3
```
Hooray!
While this is a major success, we are not yet done. Although our functions can now
have polymorphic types, the same cannot be said for our data types! We want to
have lists of integers __and__ lists of booleans, without having to duplicate any code!
While this also falls into the category of polymorphism, this post has already gotten very long,
and we will return to it in [part 11]({{< relref "11_compiler_polymorphic_data_types.md" >}}). Once we're done with that, I still intend
to go over `let/in` expressions, __lambda functions__, and __Input/Output__ together with
__strings__.