784 lines
42 KiB
Markdown
784 lines
42 KiB
Markdown
---
|
|
title: Compiling a Functional Language Using C++, Part 10 - Polymorphism
|
|
date: 2020-03-25T17:14:20-07:00
|
|
tags: ["C and C++", "Functional Languages", "Compilers"]
|
|
description: "In this post, we extend our compiler's typechecking algorithm to implement the Hindley-Milner type system, allowing for polymorphic functions."
|
|
---
|
|
|
|
[In part 8]({{< relref "08_compiler_llvm.md" >}}), we wrote some pretty interesting programs in our little language.
|
|
We successfully expressed arithmetic and recursion. But there's one thing
|
|
that we cannot express in our language without further changes: an `if` statement.
|
|
|
|
Suppose we didn't want to add a special `if/else` expression into our language.
|
|
Thanks to lazy evaluation, we can express it using a function:
|
|
|
|
```
|
|
defn if c t e = {
|
|
case c of {
|
|
True -> { t }
|
|
False -> { e }
|
|
}
|
|
}
|
|
```
|
|
|
|
But an issue still remains: so far, our compiler remains __monomorphic__. That
|
|
is, a particular function can only have one possible type for each one of its
|
|
arguments. With our current setup, something like this
|
|
{{< sidenote "right" "if-note" "would not work:" >}}
|
|
In a polymorphically typed language, the inner <code>if</code> would just evaluate to
|
|
<code>False</code>, and the whole expression to 3.
|
|
{{< /sidenote >}}
|
|
|
|
```
|
|
if (if True False True) 11 3
|
|
```
|
|
|
|
This is because, for this to work, both of the following would need to hold (borrowing
|
|
some of our notation from the [typechecking]({{< relref "03_compiler_typechecking.md" >}}) post):
|
|
|
|
{{< latex >}}
|
|
\text{if} : \text{Bool} \rightarrow \text{Int} \rightarrow \text{Int} \rightarrow \text{Int}
|
|
{{< /latex >}}
|
|
|
|
{{< latex >}}
|
|
\text{if} : \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool} \rightarrow \text{Bool}
|
|
{{< /latex >}}
|
|
|
|
But using our rules so far, such a thing is impossible, since there is no way for
|
|
\\(\text{Int}\\) to be unified with \\(\text{Bool}\\). We need a more powerful
|
|
set of rules to describe our program's types.
|
|
|
|
|
|
### Hindley-Milner Type System
|
|
One such powerful set of rules is the [Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system),
|
|
which we have previously alluded to. In fact, the rules we came up
|
|
with were already very close to Hindley-Milner, with the exception of two:
|
|
__generalization__ and __instantiation__. It's been quite a while since the last time we worked on typechecking, so I'm going
|
|
to present a table with these new rules, as well as all of the ones that we
|
|
{{< sidenote "right" "rules-note" "previously used." >}}
|
|
The rules aren't quite the same as the ones we used earlier;
|
|
note that \(\sigma\) is used in place of \(\tau\) in the first rule,
|
|
for instance. These changes are slight, and we'll talk about how the
|
|
rules work together below.
|
|
{{< /sidenote >}} I will also give a quick
|
|
summary of each of these rules.
|
|
|
|
Rule|Name and Description
|
|
-----|-------
|
|
{{< latex >}}
|
|
\frac
|
|
{x:\sigma \in \Gamma}
|
|
{\Gamma \vdash x:\sigma}
|
|
{{< /latex >}}| __Var__: If the variable \\(x\\) is known to have some polymorphic type \\(\\sigma\\) then an expression consisting only of that variable is of that type.
|
|
{{< latex >}}
|
|
\frac
|
|
{\Gamma \vdash e_1 : \tau_1 \rightarrow \tau_2 \quad \Gamma \vdash e_2 : \tau_1}
|
|
{\Gamma \vdash e_1 \; e_2 : \tau_2}
|
|
{{< /latex >}}| __App__: If an expression \\(e\_1\\), which is a function from monomorphic type \\(\\tau\_1\\) to another monomorphic type \\(\\tau\_2\\), is applied to an argument \\(e\_2\\) of type \\(\\tau\_1\\), then the result is of type \\(\\tau\_2\\).
|
|
{{< latex >}}
|
|
\frac
|
|
{\Gamma \vdash e : \tau \quad \text{matcht}(\tau, p_i) = b_i
|
|
\quad \Gamma,b_i \vdash e_i : \tau_c}
|
|
{\Gamma \vdash \text{case} \; e \; \text{of} \;
|
|
\{ (p_1,e_1) \ldots (p_n, e_n) \} : \tau_c }
|
|
{{< /latex >}}| __Case__: This rule is not part of Hindley-Milner, and is specific to our language. If the expression being case-analyzed is of type \\(\\tau\\) and each branch \\((p\_i, e\_i)\\) is of the same type \\(\\tau\_c\\) when the pattern \\(p\_i\\) works with type \\(\\tau\\) producing extra bindings \\(b\_i\\), the whole case expression is of type \\(\\tau\_c\\).
|
|
{{< latex >}}
|
|
\frac{\Gamma \vdash e : \sigma' \quad \sigma' \sqsubseteq \sigma}
|
|
{\Gamma \vdash e : \sigma}
|
|
{{< /latex >}}| __Inst (New)__: If type \\(\\sigma\\) is an instantiation (or specialization) of type \\(\\sigma\'\\) then an expression of type \\(\\sigma\'\\) is also an expression of type \\(\\sigma\\).
|
|
{{< latex >}}
|
|
\frac
|
|
{\Gamma \vdash e : \sigma \quad \alpha \not \in \text{free}(\Gamma)}
|
|
{\Gamma \vdash e : \forall \alpha . \sigma}
|
|
{{< /latex >}}| __Gen (New)__: If an expression has a type with free variables, this rule allows us generalize it to allow all possible types to be used for these free variables.
|
|
|
|
Here, there is a distinction between different forms of types. First, there are
|
|
monomorphic types, or __monotypes__, \\(\\tau\\), which are types such as \\(\\text{Int}\\),
|
|
\\(\\text{Int} \\rightarrow \\text{Bool}\\), \\(a \\rightarrow b\\)
|
|
and so on. These types are what we've been working with so far. Each of them
|
|
represents one (hence, "mono-"), concrete type. This is obvious in the case
|
|
of \\(\\text{Int}\\) and \\(\\text{Int} \\rightarrow \\text{Bool}\\), but
|
|
for \\(a \\rightarrow b\\) things are slightly less clear. Does it really
|
|
represent a single type, if we can put an arbtirary thing for \\(a\\)?
|
|
The answer is "yes"! Although \\(a\\) is not currently known, it stands
|
|
in place of another monotype, which is yet to be determined.
|
|
|
|
So, how do we represent polymorphic types, like that of \\(\\text{if}\\)?
|
|
We borrow some more notation from mathematics, and use the "forall" quantifier:
|
|
|
|
{{< latex >}}
|
|
\text{if} : \forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \rightarrow a
|
|
{{< /latex >}}
|
|
|
|
This basically says, "the type of \\(\\text{if}\\) is \\(\\text{Bool} \\rightarrow a \\rightarrow a \\rightarrow a\\)
|
|
for all possible \\(a\\)". This new expression using "forall" is what we call a type scheme, or a polytype \\(\\sigma\\).
|
|
For simplicity, we only allow "forall" to be at the front of a polytype. That is, expressions like
|
|
\\(a \\rightarrow \\forall b \\; . \\; b \\rightarrow b\\) are not valid polytypes as far as we're concerned.
|
|
|
|
It's key to observe that only some of the typing rules in the above table use polytypes (\\(\\sigma\\)). Whereas expressions
|
|
consisting of a single variable can be polymorphically typed, this is not true for function applications and case expressions.
|
|
In fact, according to our rules, there is no way to introduce a polytype anywhere into our system!
|
|
|
|
The reason for this is that we only allow polymorphism at certain locations. In the Hindley-Milner type system,
|
|
this is called __Let-Polymorphism__, which means that only in `let`/`in` expressions can variables or expressions
|
|
be given a polymorphic type. We, on the other hand, do not (yet) have `let`/`in` expressions, so our polymorphism
|
|
is further limited: only functions (and data type constructors) can be polymorphically typed.
|
|
|
|
Let's talk about what __Inst__ does, and what "\\(\\sqsubseteq\\)" means.
|
|
First, why don't we go ahead and write the formal inference rule for \\(\\sqsubseteq\\):
|
|
|
|
{{< latex >}}
|
|
\frac
|
|
{\tau'=\{\alpha_i \mapsto \tau_i \}\tau \quad \beta_i \not \in \text{free}(\forall \alpha_1...\forall \alpha_n \; . \; \tau)}
|
|
{\forall \alpha_1 ... \forall \alpha_n \; . \; \tau \sqsubseteq \forall \beta_1 ... \forall \beta_m \; . \; \tau'}
|
|
{{< /latex >}}
|
|
|
|
In my opinion, this is one of the more confusing inference rules we have to deal with in Hindley-Milner.
|
|
In principle, though, it's not too hard to understand. \\(\\sigma\' \\sqsubseteq \\sigma\\) says "\\(\\sigma\'\\)
|
|
is more general than \\(\\sigma\\)". Alternatively, we can interpret it as "\\(\\sigma\\) is an instance of \\(\\sigma\'\\)".
|
|
|
|
What does it mean for one polytype to be more general than another? Intuitively, we might say that \\(\forall a \\; . \\; a \\rightarrow a\\) is
|
|
more general than \\(\\text{Int} \\rightarrow \\text{Int}\\), because the former type can represent the latter, and more. Formally,
|
|
we define this in terms of __substitutions__. A substitution \\(\\{\\alpha \\mapsto \\tau \\}\\) replaces a variable
|
|
\\(\\alpha\\) with a monotype \\(\\tau\\). If we can use a substitution to convert one type into another, then the first
|
|
type (the one on which the substitution was performed) is more general than the resulting type. In our previous example,
|
|
since we can apply the substitution \\(\\{a \\mapsto \\text{Int}\\}\\) to get \\(\\text{Int} \\rightarrow \\text{Int}\\)
|
|
from \\(\\forall a \\; . \\; a \\rightarrow a\\), the latter type is more general; using our notation,
|
|
\\(\\forall a \\; . \\; a \\rightarrow a \\sqsubseteq \\text{Int} \\rightarrow \\text{Int}\\).
|
|
|
|
That's pretty much all that the rule says, really. But what about the whole thing with \\(\\beta\\) and \\(\\text{free}\\)? The reason
|
|
for that part of the rule is that, in principle, we can substitute polytypes into polytypes. However, we can't
|
|
do this using \\(\\{ \\alpha \\mapsto \\sigma \\}\\). Consider, for example:
|
|
|
|
{{< latex >}}
|
|
\{ a \mapsto \forall b \; . \; b \rightarrow b \} \; \text{Bool} \rightarrow a \rightarrow a \stackrel{?}{=}
|
|
\text{Bool} \rightarrow (\forall b \; . \; b \rightarrow b) \rightarrow \forall b \; . \; b \rightarrow b
|
|
{{< /latex >}}
|
|
|
|
Hmm, this is not good. Didn't we agree that we only want quantifiers at the front of our types? Instead, to make that substitution,
|
|
we only substitute the monotype \\(b \\rightarrow b\\), and then add the \\(\\forall b\\) at the front. But
|
|
to do this, we must make sure that \\(b\\) doesn't occur anywhere in the original type
|
|
\\(\forall a \\; . \\; \\text{Bool} \\rightarrow a \\rightarrow a\\) (otherwise we can accidentally generalize
|
|
too much). So then, our concrete inference rule is as follows:
|
|
|
|
{{< latex >}}
|
|
\frac
|
|
{
|
|
\begin{gathered}
|
|
\text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b =\{a \mapsto (b \rightarrow b) \} \; \text{Bool} \rightarrow a \rightarrow a \\
|
|
b \not \in \text{free}(\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a) = \varnothing
|
|
\end{gathered}
|
|
}
|
|
{\forall a \; . \; \text{Bool} \rightarrow a \rightarrow a \sqsubseteq \forall b \; . \; \text{Bool} \rightarrow (b \rightarrow b) \rightarrow b \rightarrow b}
|
|
{{< /latex >}}
|
|
|
|
In the above rule we:
|
|
1. Replaced \\(a\\) with \\(b \\rightarrow b\\), getting rid of \\(a\\) in the quantifier.
|
|
2. Observed that \\(b\\) is not a free variable in the original type, and thus can be generalized.
|
|
3. Added the generalization for \\(b\\) to the front of the resulting type.
|
|
|
|
Now, __Inst__ just allows us to perform specialization / substitution as many times
|
|
as we need to get to the type we want.
|
|
|
|
### A New Typechecking Algorithm
|
|
|
|
Alright, now we have all these rules. How does this change our typechecking algorithm?
|
|
How about the following:
|
|
|
|
1. To every declared function, assign the type \\(a \\rightarrow ... \\rightarrow y \\rightarrow z\\),
|
|
where
|
|
{{< sidenote "right" "arguments-note" "\(a\) through \(y\) are the types of the arguments to the function," >}}
|
|
Of course, there can be more or less than 25 arguments to any function. This is just a generalization;
|
|
we use as many input types as are needed.
|
|
{{< /sidenote >}} and \\(z\\) is the function's
|
|
return type.
|
|
2. We typecheck each declared function, using the __Var__, __Case__, __App__, and __Inst__ rules.
|
|
3. Whatever type variables we don't fill in, we assume can be filled in with any type,
|
|
and use the __Gen__ rule to sprinkle polymorphism where it is needed.
|
|
|
|
Maybe this is enough. Let's go through an example. Suppose we have three functions:
|
|
|
|
```
|
|
defn if c t e = {
|
|
case c of {
|
|
True -> { t }
|
|
False -> { e }
|
|
}
|
|
}
|
|
defn testOne = { if True False True }
|
|
defn testTwo = { if True 0 1 }
|
|
```
|
|
|
|
If we go through and typecheck them top-to-bottom, the following happens:
|
|
|
|
1. We start by assuming \\(\\text{if} : a \\rightarrow b \\rightarrow c \\rightarrow d\\),
|
|
\\(\\text{testOne} : e\\) and \\(\\text{testTwo} : f\\).
|
|
2. We look at `if`. We infer the type of `c` to be \\(\\text{Bool}\\), and thus, \\(a = \\text{Bool}\\).
|
|
We also infer that \\(b = c\\), since they occur in two branches of the same case expression.
|
|
Finally, we infer that \\(c = d\\), since whatever the case expression returns becomes the return
|
|
value of the function. Thus, we come out knowing that \\(\\text{if} : \\text{Bool} \\rightarrow b
|
|
\\rightarrow b \\rightarrow b\\).
|
|
3. Now, since we never figured out \\(b\\), we use __Gen__: \\(\\text{if} : \\forall b \\; . \\;
|
|
\\text{Bool} \\rightarrow b \\rightarrow b \\rightarrow b\\). Like we'd want, `if` works with
|
|
all types, as long as both its inputs are of the same type.
|
|
4. When we typecheck the body of `testOne`, we use __Inst__ to turn the above type for `if`
|
|
into a single, monomorphic instance. Then, type inference proceeds as before, and all is well.
|
|
5. When we typecheck the body of `testTwo`, we use __Inst__ again, instantiating a new monotype,
|
|
and all is well again.
|
|
|
|
So far, so good. But what if we started from the bottom, and went to the top?
|
|
|
|
1. We start by assuming \\(\\text{if} : a \\rightarrow b \\rightarrow c \\rightarrow d\\),
|
|
\\(\\text{testOne} : e\\) and \\(\\text{testTwo} : f\\).
|
|
2. We look at `testTwo`. We infer that \\(a = \\text{Bool}\\) (since
|
|
we pass in `True` to `if`). We also infer that \\(b = \\text{Int}\\), and that \\(c = \\text{Int}\\).
|
|
Not yet sure of the return type of `if`, this is where we stop. We are left with
|
|
the knowledge that \\(f = d\\) (because the return type of `if` is the return type of `testTwo`),
|
|
but that's about it. Since \\(f\\) is no longer free, we don't generalize, and conclude that \\(\text{testTwo} : f\\).
|
|
3. We look at `testOne`. We infer that \\(a = \\text{Bool}\\) (already known). We also infer
|
|
that \\(b = \\text{Bool}\\), and that \\(c = \\text{Bool}\\). But wait a minute! This is not right.
|
|
We are back to where we started, with a unification error!
|
|
|
|
What went wrong? I claim that we typechecked the functions that _used_ `if` before we typechecked `if` itself,
|
|
which led us to infer a less-than-general type for `if`. This less-than-general type was insufficient to
|
|
correctly check the whole program.
|
|
|
|
To address this, we enforce a particular order of type inference on our declaration, guided by dependencies
|
|
between functions. Haskell, which has to deal with a similar issue, has
|
|
[a section in the 2010 report on this](https://www.haskell.org/onlinereport/haskell2010/haskellch4.html#x10-880004.5).
|
|
In short:
|
|
|
|
1. We find the
|
|
{{< sidenote "right" "transitive-closure-note" "transitive closure" >}}
|
|
A transitive closure of a vertex in a graph is the list of all vertices reachable
|
|
from that original vertex. Check out the <a href="https://en.wikipedia.org/wiki/Transitive_closure#In_graph_theory">
|
|
Wikipedia page on this</a>.
|
|
{{< /sidenote >}}
|
|
of the function dependencies. We define a function \\(f\\) to be dependent on another function \\(g\\)
|
|
if \\(f\\) calls \\(g\\). The transitive closure will help us find functions that are related indirectly.
|
|
For instance, if \\(g\\) also depends on \\(h\\), then the transitive closure of \\(f\\) will
|
|
include \\(h\\), even if \\(f\\) directly doesn't use \\(h\\).
|
|
2. We isolate groups of mutually dependent functions. If \\(f\\) depends on \\(g\\) and \\(g\\) depends \\(f\\),
|
|
they are placed in one group. We then construct a dependency graph __of these groups__.
|
|
3. We compute a topological order of the group graph. This helps us typecheck the dependencies
|
|
of functions before checking the functions themselves. In our specific case, this would ensure
|
|
we check `if` first, and only then move on to `testOne` and `testTwo`. The order of typechecking
|
|
within a group does not matter, as long as we generalize only after typechecking all functions
|
|
in a group.
|
|
4. We typecheck the function groups, and functions within them, following the above topological order.
|
|
|
|
To find the transitive closure of a graph, we can use [Warshall's Algorithm](https://cs.winona.edu/lin/cs440/ch08-2.pdf).
|
|
This algorithm, with complexity \\(O(|V|^3)\\), goes as follows:
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
& A, R^{(i)} \in \mathbb{B}^{n \times n} \\
|
|
& \\
|
|
& R^{(0)} \leftarrow A \\
|
|
& \textbf{for} \; k \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
|
|
& \quad \textbf{for} \; i \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
|
|
& \quad \quad \textbf{for} \; j \leftarrow 1 \; \textbf{to} \; n \; \textbf{do} \\
|
|
& \quad \quad \quad R^{(k)}[i,j] \leftarrow R^{(k-1)}[i,j] \; \textbf{or} \; R^{(k-1)}[i,k] \; \textbf{and} \; R^{(k-1)}[k,j] \\
|
|
& \textbf{return} \; R^{(n)}
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
In the above notation, \\(R^{(i)}\\) is the \\(i\\)th matrix \\(R\\), and \\(A\\) is the adjacency
|
|
matrix of the graph in question. All matrices in the algorithm are from \\(\\mathbb{B}^{n \times n}\\),
|
|
the set of \\(n\\) by \\(n\\) boolean matrices. Once this algorithm is complete, we get as output a
|
|
transitive closure adjacency matrix \\(R^{(n)}\\). Mutually dependent functions will be pretty easy to
|
|
isolate from this matrix. If \\(R^{(n)}[i,j]\\) and \\(R^{(n)}[j,i]\\), then the functions represented by vertices
|
|
\\(i\\) and \\(j\\) depend on each other.
|
|
|
|
Once we've identified the groups, and
|
|
{{< sidenote "right" "group-graph-note" "constructed a group graph," >}}
|
|
This might seem like a "draw the rest of the owl" situation, but it really isn't.
|
|
We'll follow a naive algorithm for findings groups, and for translating function dependencies
|
|
into group dependencies. This algorithm, in C++, will be presented later on.
|
|
{{< /sidenote >}} it is time to compute the topological order. For this, we will use
|
|
[Kahn's Algorithm](https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm).
|
|
The algorithm goes as follows:
|
|
|
|
{{< latex >}}
|
|
\begin{aligned}
|
|
& L \leftarrow \text{empty list} \\
|
|
& S \leftarrow \text{set of all nodes with no incoming edges} \\
|
|
& \\
|
|
& \textbf{while} \; S \; \text{is non-empty} \; \textbf{do} \\
|
|
& \quad \text{remove a node} \; n \; \text{from} \; S \\
|
|
& \quad \text{add} \; n \; \text{to the end of} \; L \\
|
|
& \quad \textbf{for each} \; \text{node} \; m \; \text{with edge} \;
|
|
e \; \text{from} \; n \; \text{to} \; m \; \textbf{do} \\
|
|
& \quad \quad \text{remove edge} \; e \; \text{from the graph} \\
|
|
& \quad \quad \textbf{if} \; m \; \text{has no other incoming edges} \; \textbf{then} \\
|
|
& \quad \quad \quad \text{insert} \; m \; \text{into} \; S \\
|
|
& \\
|
|
& \textbf{if} \; \text{the graph has edges} \; \textbf{then} \\
|
|
& \quad \textbf{return} \; \text{error} \quad \textit{(graph has at least once cycle)} \\
|
|
& \textbf{else} \\
|
|
& \quad \textbf{return} \; L \quad \textit{(a topologically sorted order)}
|
|
\end{aligned}
|
|
{{< /latex >}}
|
|
|
|
Note that since we've already isolated all mutually dependent functions into
|
|
groups, our graph will never have cycles, and this algorithm will always succeed.
|
|
Also note that since we start with nodes with no incoming edges, our list will
|
|
__begin with the groups that should be checked last__. This is because a node
|
|
with no incoming edges might (and probably does) still have outgoing edges,
|
|
and thus depends on other functions / groups. Like in our successful example,
|
|
we want to __typecheck functions that are depended on first__.
|
|
|
|
### Implementation
|
|
Let's start working on a C++ implementation of all of this now. First,
|
|
I think that we should create a C++ class that will represent our function
|
|
dependency graph. Let's call it `function_graph`. I propose the following
|
|
definition:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.hpp" 12 52 >}}
|
|
|
|
There's a lot to unpack here. First of all, we create a type alias `function` that
|
|
represents the label of a function in our graph. It is probably most convenient
|
|
to work with `std::string` instances, so we settle for that. Next, we define
|
|
a struct that will represent a single group of mutually dependent functions.
|
|
Passing this struct by value seems wrong, so we'll settle for a C++ `unique_pt`
|
|
to help carry instances around.
|
|
|
|
Finally, we arrive at the definition of `function_graph`. Inside this class,
|
|
we define a helper struct, `group_data`, which holds information
|
|
about an individual group as it is being constructed. This information
|
|
includes the group's adjacency list and
|
|
[indegree](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree)
|
|
(both used for Kahn's topological sorting algorithm), as well as the set
|
|
of functions in the group (which we will eventually return).
|
|
|
|
The `adjacency_lists` and `edges` fields are the meat of the graph representation.
|
|
Both of the variables provide a different view of the same graph: `adjacency_lists`
|
|
associates with every function a list of functions it depends on, while
|
|
`edges` holds a set of tuples describing edges in the graph. Having
|
|
more than one representation makes it more convenient for us to perform
|
|
different operations on our graphs.
|
|
|
|
Next up are some internal methods that perform the various steps we described
|
|
above:
|
|
* `compute_transitive_edges` applies Warshall's algorithm to find the graph's
|
|
transitive closure.
|
|
* `create_groups` creates two mappings, one from functions to their respective
|
|
groups' IDs, and one from group IDs to information about the corresponding groups.
|
|
This step is largely used to determine which functions belong to the same group,
|
|
and as such, uses the set of transitive edges generated by `compute_transitive_edges`.
|
|
* `create_edges` creates edges __between groups__. During this step, the indegrees
|
|
of each group are computed, as well as their adjacency lists.
|
|
* `generate_order` uses the indegrees and adjacency lists produced in the prior step
|
|
to establish a topological order.
|
|
|
|
Following these, we have three public function definitions:
|
|
* `add_function` adds a vertex to the graph. Sometimes, a function does not
|
|
reference any other functions, and would not appear in the list of edges.
|
|
We will call `add_function` to make sure that the function graph is aware
|
|
of such independent functions. For convenience, `add_function` returns the adjacency list
|
|
of the added function.
|
|
* `add_edge` adds a new dependency between two functions.
|
|
* `compute_order` method uses the internal methods described above to convert
|
|
the function dependency graph into a properly ordered list of groups.
|
|
|
|
Let's start by looking at how to implement the internal methods. `compute_transitive_edges`
|
|
is a very straightforward implementation of Warshall's:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 3 21 >}}
|
|
|
|
Next is `create_groups`. For each function, we iterate over all other functions.
|
|
If the other function is mutually dependent with the first function, we add
|
|
it to the same group. In the outer loop, we skip over functions that have
|
|
already been added to the group. This is because
|
|
{{< sidenote "right" "equivalence-note" "mutual dependence" >}}
|
|
There is actually a slight difference between "mutual dependence"
|
|
the way we defined it and "being in the same group", and
|
|
it lies in the symmetric property of an equivalence relation.
|
|
We defined a function to depend on another function if it calls
|
|
that other function. Then, a recursive function depends on itself,
|
|
but a non-recursive function does not, and therefore does not
|
|
satisfy the symmetric property. However, as far as we're concerned,
|
|
a function should be in a group with itself even if it's not recursive. Thus, the
|
|
real equivalence relation we use is "in the same group as", and
|
|
consists of "mutual dependence" extended with symmetry.
|
|
{{< /sidenote >}}
|
|
is an [equivalence relation](https://en.wikipedia.org/wiki/Equivalence_relation),
|
|
which means that if we already added a function to a group, all its
|
|
group members were also already visited and added.
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 23 44 >}}
|
|
|
|
Once groups have been created, we use their functions' edges
|
|
to create edges for the groups themselves, using `create_edges`.
|
|
We avoid creating edges from a group to itself, to prevent
|
|
unnecessary cycles. While constructing the edges, we also
|
|
increment the relevant indegree counter.
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 46 63 >}}
|
|
|
|
Finally, we apply Kahn's algorithm to create a topological order
|
|
in `generate_order`:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 65 90 >}}
|
|
|
|
These four steps are used in `compute_order`:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 106 114 >}}
|
|
|
|
Let's now look at the remaining two public definitions.
|
|
First comes `add_function`, which creates an adjacency list for the
|
|
function to be inserted (if one does not already exist),
|
|
and returns a reference to the resulting list:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 92 99 >}}
|
|
|
|
We use this in `add_edge`, which straightforwardly creates an edge
|
|
between two functions:
|
|
|
|
{{< codelines "C++" "compiler/10/graph.cpp" 101 104 >}}
|
|
|
|
With this, we can now properly order our typechecking.
|
|
However, we are just getting started: there are still
|
|
numerous changes we need to make to get our compiler
|
|
to behave as we desire.
|
|
|
|
The first change is the least relevant, but will help clean
|
|
up our code base in the presence of polymorphism: we will
|
|
get rid of `resolve`, in both definitions and AST nodes.
|
|
The reasons for this are twofold. First,
|
|
{{< sidenote "right" "case-type-note" "only the case expression node actually uses the type it stores." >}}
|
|
Recall that <code>ast_case</code> needs this information to properly
|
|
account for the changes to the stack from when data is unpacked.
|
|
{{< /sidenote >}} This means that
|
|
all the rest of the infrastructure we've written around
|
|
preserving types is somewhat pointless. Second, when
|
|
we call `resolve`, we'd now have to distinguish
|
|
between type variables captured by "forall" and actual,
|
|
undefined variables. That's a lot of wasted work!
|
|
To replace the now-removed `type` field,
|
|
we make `ast_case` include a new member, `input_type`,
|
|
which stores the type of the thing between `case` and `of`.
|
|
Since `ast_case` requires its type to be a data type
|
|
at the time of typechecking, we no longer need to resolve anything.
|
|
|
|
Next, we need to work in a step geared towards finding function calls
|
|
(to determine dependencies). As we have noted in [part 6]({{< relref "06_compiler_compilation.md" >}}),
|
|
it's pretty easy to tell apart calls to global functions from "local" ones. If
|
|
we see that a variable was previously bound (perhaps as a function argument,
|
|
or by a pattern in a case expression), we know for sure that it is not a global
|
|
function call. Otherwise, if the variable isn't bound anywhere in the function
|
|
definition (it's a __free variable__), it must refer to a global function. Then,
|
|
we can traverse the function body, storing variables that are bound (but only within
|
|
their scope), and noting references to variables we haven't yet seen. To
|
|
implement this, we can use a linked list, where each node refers to a particular
|
|
scope, points to the scope enclosing it, and contains a list of variables...
|
|
|
|
Wait a minute, this is identical to `type_env`! There's no reason to reimplement all
|
|
this. But then, another question arises: do we throw away the `type_env` generated
|
|
by the dependency-searching step? It seems wasteful, since we will eventually
|
|
repeat this same work. Rather, we'll re-use the same `type_env` instances
|
|
in both this new step and `typecheck`. To do this, we will now store a pointer
|
|
to a `type_env` in every AST node, and set this pointer during our first traversal
|
|
of the tree. Indeed, this makes our `type_env` more like a
|
|
[symbol table](https://en.wikipedia.org/wiki/Symbol_table). With this change,
|
|
our new dependency-finding step will be implemented by the `find_free` function
|
|
with the following signature:
|
|
|
|
```C++
|
|
void ast::find_free(type_mgr& mgr, type_env_ptr& env, std::set<std::string>& into);
|
|
```
|
|
|
|
Let's take a look at how this will be implemented. The simplest case (as usual)
|
|
is `ast_int`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 16 18 >}}
|
|
|
|
In this case, we associate the `type_env` with the node, but don't do anything
|
|
else: a number is not a variable. A more interesting case is `ast_lid`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 33 36 >}}
|
|
|
|
If a lowercase variable has not yet been bound to something, it's free,
|
|
and we store it. Somewhat counterintuitively, `ast_uid` behaves
|
|
differently:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 54 56 >}}
|
|
|
|
We don't allow uppercase variables to be bound to anything outside of data type
|
|
declarations, so we don't care about uppercase free variables. Next up is
|
|
`ast_binop`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 73 77 >}}
|
|
|
|
A binary operator can have free variables in the subexpressions on the left and on the right, and
|
|
the above implementation reflects that. This is identical to the implementation of
|
|
`ast_app`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 109 113 >}}
|
|
|
|
Finally, `ast_case` requires the most complicated function (as usual):
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 142 150 >}}
|
|
|
|
The `type_scope` function replaces the `type_env::scope` method,
|
|
which cannot (without significant effort) operate on smart pointers.
|
|
Importantly, we are using a new `pattern` method here, `insert_bindings`. This
|
|
is because we split "introducing variables" and "typechecking variables"
|
|
into two steps for patterns, as well. The implementation of `insert_bindings`
|
|
for `pattern_var` is as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 230 232 >}}
|
|
|
|
A variable pattern always introduces the variable it is made up of.
|
|
On the other hand, the implementation for `pattern_constr` is as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 245 249 >}}
|
|
|
|
All the variables of the pattern are placed into the environment. For now, we don't worry
|
|
about arity; this is the job of typechecking.
|
|
|
|
These changes are reflected in all instances of our `typecheck` function. First of
|
|
all, `typecheck` no longer needs to receive a `type_env` parameter, since each
|
|
tree node has a `type_env_ptr`. Furthermore, `typecheck` should no longer call
|
|
`bind`, since this was already done by `find_free`. For example,
|
|
`ast_lid::typecheck` will now use `env::lookup`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 38 40 >}}
|
|
|
|
Don't worry about `instantiate` for now; that's coming up. Similarly to
|
|
`ast_lid`, `ast_case::typecheck` will no longer introduce new bindings,
|
|
but unify existing types via the `pattern`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 152 169 >}}
|
|
|
|
The above implementation uses another new `pattern` method, `typecheck`.
|
|
This method inherits the type checking functionality previously
|
|
contained in `pattern::match`. Here's the implementation for `pattern_var`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 234 236 >}}
|
|
|
|
And here's the implementation for `pattern_constr`:
|
|
|
|
{{< codelines "C++" "compiler/10/ast.cpp" 251 266 >}}
|
|
|
|
So far, so good. However, for all of this to reach the main typechecking
|
|
code, not only `ast` subclasses need to be updated, but also
|
|
the `definition`s. Here things get more complicated, because
|
|
`definition_data` and `definition_defn` are growing more and more apart.
|
|
Previously, we had two typechecking steps: `typecheck_first` (which registered
|
|
function names into the environment) and `typecheck_second` (which performed
|
|
the actual typechecking). However, not only are these names not informative,
|
|
but the algorithms for typechecking the two types of definition will soon
|
|
have different numbers of "major" steps.
|
|
|
|
Let's take a look at how we would typecheck data types. I propose the following
|
|
steps:
|
|
|
|
1. Iterate all declared data types, storing them into some kind of "known" list.
|
|
2. Iterate again, and for each constructor of a type, verify that
|
|
it refers to "known" types. Add valid constructors to the global environment as functions.
|
|
|
|
We don't currently verify that types are "known"; A user could declare a list of `Floobs`,
|
|
and never say what a `Floob` is.
|
|
{{< sidenote "right" "known-type-note" "This isn't too big of an issue" >}}
|
|
Curiously, this flaw did lead to some valid programs being rejected. Since
|
|
we had no notion of a "known" type, whenever data type constructors
|
|
were created, every argument type was marked a "base" type;
|
|
see <a href="https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/code/compiler/09/definition.cpp#L82">
|
|
this line</a> if you're curious.
|
|
This would cause pattern matching to fail on the tail of a list, with
|
|
the "attempt to pattern match on non-data argument" error.
|
|
{{< /sidenote >}}(good luck constructing
|
|
a value of a non-existent type), but a mature compiler should prevent this from happening.
|
|
|
|
On the other hand, here are the steps for function definitions:
|
|
|
|
1. Find the free variables of each function to create the ordered list of groups as described above.
|
|
2. Within each group, insert a general function type (like \\(a \\rightarrow b \\rightarrow c\\))
|
|
into the environment for each function.
|
|
3. Within each group (in the same pass) run typechecking
|
|
(including polymorphism, using the rules as described above).
|
|
|
|
The two types of definitions further diverge when generating LLVM and compiling to G-machine instructions:
|
|
data types immediately construct and insert their functions, and do not emit G-machine instructions,
|
|
while functions generate G-machine instructions, declare prototypes, and emit LLVM in three distinct phases.
|
|
Overall, there are virtually no similarities between the two data type declarations, and any inheritance
|
|
of common functions starts to appear somewhat forced. To address this, we remove the `definition` class
|
|
altogether, and sever the relationship between `definition_data` and `definition_defn`. The
|
|
two now look as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.hpp" 23 67 >}}
|
|
|
|
In `definition_defn`, the functions are arranged as follows:
|
|
|
|
* `find_free` locates the free variables in the definition, populating
|
|
the `free_variables` field and thereby finding edges for the function graph.
|
|
* `insert_types` stores the type of the function into the global environment
|
|
(a pointer to which is now stored as a field).
|
|
* `typecheck` runs the standard typechecking steps.
|
|
* `compile` generates G-machine instructions.
|
|
* `declare_llvm` inserts LLVM function prototypes into the `llvm_context`.
|
|
* `generate_llvm` converts G-machine instructions into LLVM IR.
|
|
|
|
In `definition_data`, the steps are significantly simpler:
|
|
|
|
* `insert_types` registers the type being declared as a "known" type.
|
|
* `insert_constructors` inserts constructors (which are verified to
|
|
refer to "known" types) into the global environment.
|
|
* `generate_llvm` creates the LLVM functions (and their IR).
|
|
|
|
While the last three methods of `definition_defn` remain unchanged save
|
|
for the name, the implementations of the first three see some updates.
|
|
First is `find_free`:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.cpp" 12 26 >}}
|
|
|
|
First, to make sure we don't pollute the global scope
|
|
with function parameters, `find_free` creates a new environment
|
|
`var_env`. Then, it stores into this new environment the function parameters,
|
|
ensuring that the parameters of a function aren't marked "free".
|
|
Concurrently, `find_free` constructs the "general" function
|
|
type (used by `insert_types`). Once all the arguments have been bound, `definition_defn::find_free`
|
|
makes a call to `ast::find_free`, which does the work of actually
|
|
finding free variables.
|
|
|
|
Since the function type is created by `find_free`, `insert_types` has very little to do:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.cpp" 28 30 >}}
|
|
|
|
Finally, `typecheck`, which no longer has to bind the function
|
|
arguments to new types, is also fairly simple:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.cpp" 32 35 >}}
|
|
|
|
Let's move on to data types. In order to implement `definition_data::insert_types`,
|
|
we need to store somewhere a list of all the valid type names. We do this
|
|
by adding a new `type_names` field to `type_env`, and implementing the
|
|
corresponding methods `lookup_type`:
|
|
|
|
{{< codelines "C++" "compiler/10/type_env.cpp" 11 16 >}}
|
|
|
|
And `bind_type`:
|
|
|
|
{{< codelines "C++" "compiler/10/type_env.cpp" 26 29 >}}
|
|
|
|
Note in the above snippets that we disallow redeclaring type names;
|
|
declaring two data types (or other types) with the same name in
|
|
our language will not be valid. In `insert_types`, we create a new
|
|
data type and store it in the environment:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.cpp" 59 62 >}}
|
|
|
|
We then update `insert_constructors` to query the environment
|
|
when creating constructor types, rather than blindly using `new type_base(...)`
|
|
like before:
|
|
|
|
{{< codelines "C++" "compiler/10/definition.cpp" 64 82 >}}
|
|
|
|
The separation of data and function definitions must be reconciled with code
|
|
going back as far as the parser. While previously, we populated a single, global
|
|
vector of definitions called `program`, we can no longer do that. Instead, we'll
|
|
split our program into two maps, one for data types and one for functions. We
|
|
use maps for convenience: the groups generated by our function graph refer
|
|
to functions by name, and it would be nice to quickly look up the data
|
|
the names refer to. Rather than returning such maps, we change our semantic
|
|
actions to simply insert new data into one of two global maps. Below
|
|
is a snippet that includes all the changes:
|
|
|
|
{{< codelines "plaintext" "compiler/10/parser.y" 39 65 >}}
|
|
|
|
Note that `program` and `definitions` no longer have a type, and that `data` and `defn`
|
|
have been changed to return `definition_data_ptr` and `definition_defn_ptr`, respectively.
|
|
This necessitates changes to our main file. First of all, we declare the two new maps
|
|
we hope to receive from Bison:
|
|
|
|
{{< codelines "C++" "compiler/10/main.cpp" 24 25 >}}
|
|
|
|
We then change all affected functions, which in many cases amounts to splitting the `program` parameter
|
|
into `defs_data` and `defs_defn` parameters. We also make other, largely mechanical changes: code iterating
|
|
over definitions now requires the use of `second` to refer to the value stored in the map, and LLVM
|
|
generation now needs to separately process the two different types of definitions. The biggest change
|
|
occurs in `typecheck_program`, which not only undergoes all the aforementioned modifications, but
|
|
is also updated to use topological ordering:
|
|
|
|
{{< codelines "C++" "compiler/10/main.cpp" 27 84 >}}
|
|
|
|
The above code uses the yet-unexplained `generalize` method. What's going on?
|
|
|
|
Observe that the __Var__ rule of the Hindley-Milner type system says that a variable \\(x\\)
|
|
can have a __polytype__ in the environment \\(\\Gamma\\). Our `type_ptr` can only represent monotypes,
|
|
so we must change what `type_env` associates with names to a new struct for representing polytypes,
|
|
which we will call `type_scheme`. The `type_scheme` struct, just like the formal definition of
|
|
a polytype, contains zero or more "forall"-quantified type variables, followed by a monotype which
|
|
may use these variables:
|
|
|
|
{{< codelines "C++" "compiler/10/type.hpp" 17 27 >}}
|
|
|
|
The `type_scheme::instantiate` method is effectively an implementation of the special
|
|
case of the __Inst__ rule, in which a polytype is specialized to a monotype. Since
|
|
the __App__ and __Case__ rules only use monotypes, we'll be using this special case a lot.
|
|
We implement this method as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/type.cpp" 34 41 >}}
|
|
|
|
In the above code, if the type scheme represents a monotype (i.e., it has no quantified variables),
|
|
we simply return that monotype. Otherwise, we must perform a substitution, replacing "forall"-quantified
|
|
variables with fresh type parameters to be determined (we will never determine a single type for any of
|
|
the quantified variables, since they are specifically meant to represent any type).
|
|
We build a substitution map, which assigns to each quantified type variable a corresponding
|
|
"fresh" type, and then create a new type with with the substitution applied using `substitute`,
|
|
which is implemented as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/type.cpp" 18 32 >}}
|
|
|
|
In principle, the function is fairly simple: if the current type is equivalent to a
|
|
quantified type, we return the corresponding "fresh" type. If, on the other hand,
|
|
the type represents a function, we perform a substitution in the function's input
|
|
and output types. This method avoids creating new types where possible; a new type
|
|
is only created if a function's input or output type is changed by a substitution
|
|
(in which case, the function itself is changed by the substitution). In all
|
|
other cases, substitution won't do anything, so we just return the original type.
|
|
|
|
Now it is a bit more clear why we saw `instantiate` in a code snippet some time ago;
|
|
to compute a monotype for a variable reference, we must take into account the
|
|
possibility that the variable has a polymorphic type, which needs to be specialized
|
|
(potentially differently in every occurrence of the variable).
|
|
|
|
When talking about our new typechecking algorithm, we mentioned using __Gen__ to sprinkle
|
|
polymorphism into our program. If it can, __Gen__ will add free variables
|
|
in a type to the "forall" quantifier at the front, making that type polymorphic.
|
|
We implement this using a new `generalize` method added to the `type_env`, which (as per
|
|
convention) generalizes the type of a given variable as much as possible:
|
|
|
|
{{< codelines "C++" "compiler/10/type_env.cpp" 31 41 >}}
|
|
|
|
For now, we disallow types to be generalized twice, and we naturally disallow generalizing
|
|
types of nonexistent variables. If neither of those things occurs, we find all the free
|
|
variables in the variable's current type using a new method called `type_mgr::find_free`,
|
|
and put them into the "forall" quantifier. `type_mgr::find_free` is implemented as follows:
|
|
|
|
{{< codelines "C++" "compiler/10/type.cpp" 138 148 >}}
|
|
|
|
The above code is fairly straightforward; if a type is a variable that is not yet bound to anything,
|
|
it is free; if the type is a function, we search for free variables in its input and output types;
|
|
otherwise, the type has no free variables.
|
|
|
|
Finally, we have made the necessary changes. Let's test it out with the example from the beginning:
|
|
|
|
{{< rawblock "compiler/10/examples/if.txt" >}}
|
|
|
|
Running it, we get the output:
|
|
|
|
```
|
|
3
|
|
```
|
|
|
|
Hooray!
|
|
|
|
While this is a major success, we are not yet done. Although our functions can now
|
|
have polymorphic types, the same cannot be said for our data types! We want to
|
|
have lists of integers __and__ lists of booleans, without having to duplicate any code!
|
|
While this also falls into the category of polymorphism, this post has already gotten very long,
|
|
and we will return to it in [part 11]({{< relref "11_compiler_polymorphic_data_types.md" >}}). Once we're done with that, I still intend
|
|
to go over `let/in` expressions, __lambda functions__, and __Input/Output__ together with
|
|
__strings__.
|