477 lines
20 KiB
Markdown
477 lines
20 KiB
Markdown
---
|
|
title: "A Typesafe Representation of an Imperative Language"
|
|
date: 2020-11-02T01:07:21-08:00
|
|
tags: ["Idris", "Programming Languages"]
|
|
---
|
|
|
|
A recent homework assignment for my university's programming languages
|
|
course was to encode the abstract syntax for a small imperative language
|
|
into Haskell data types. The language consisted of very few constructs, and was very much a "toy".
|
|
On the expression side of things, it had three registers (`A`, `B`, and `R`),
|
|
numbers, addition, comparison using "less than", and logical negation. It also
|
|
included a statement for storing the result of an expression into
|
|
a register, `if/else`, and an infinite loop construct with an associated `break` operation.
|
|
A sample program in the language which computes the product of two
|
|
numbers is as follows:
|
|
|
|
```
|
|
A := 7
|
|
B := 9
|
|
R := 0
|
|
do
|
|
if A <= 0 then
|
|
break
|
|
else
|
|
R := R + B;
|
|
A := A + -1;
|
|
end
|
|
end
|
|
```
|
|
|
|
The homework notes that type errors may arise in the little imperative language.
|
|
We could, for instance, try to add a boolean to a number: `3 + (1 < 2)`. Alternatively,
|
|
we could try use a number in the condition of an `if/else` expression. A "naive"
|
|
encoding of the abstract syntax would allow for such errors.
|
|
|
|
However, assuming that registers could only store integers and not booleans, it is fairly easy to
|
|
separate the expression grammar into two nonterminals, yielding boolean
|
|
and integer expressions respectively. Since registers can only store integers,
|
|
the `(:=)` operation will always require an integer expression, and an `if/else`
|
|
statement will always require a boolean expression. A matching Haskell encoding
|
|
would not allow "invalid" programs to compile. That is, the programs would be
|
|
type-correct by construction.
|
|
|
|
Then, a question arose in the ensuing discussion: what if registers _could_
|
|
contain booleans? It would be impossible to create such a "correct-by-construction"
|
|
representation then, wouldn't it?
|
|
{{< sidenote "right" "haskell-note" "Although I don't know about Haskell," >}}
|
|
I am pretty certain that a similar encoding in Haskell is possible. However,
|
|
Haskell wasn't originally created for that kind of abuse of its type system,
|
|
so it would probably not look very good.
|
|
{{< /sidenote >}} I am sure that it _is_ possible to do this
|
|
in Idris, a dependently typed programming language. In this post I will
|
|
talk about how to do that.
|
|
|
|
### Registers and Expressions
|
|
Let's start by encoding registers. Since we only have three registers, we
|
|
can encode them using a simple data type declaration, much the same as we
|
|
would in Haskell:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 1 1 >}}
|
|
|
|
Now that registers can store either integers or booleans (and only those two),
|
|
we need to know which one is which. For this purpose, we can declare another
|
|
data type:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 3 3 >}}
|
|
|
|
At any point in the (hypothetical) execution of our program, each
|
|
of the registers will have a type, either boolean or integer. The
|
|
combined state of the three registers would then be the combination
|
|
of these three states; we can represent this using a 3-tuple:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 5 6 >}}
|
|
|
|
Let's say that the first element of the tuple will be the type of the register
|
|
`A`, the second the type of `B`, and the third the type of `R`. Then,
|
|
we can define two helper functions, one for retrieving the type of
|
|
a register, and one for changing it:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 8 16 >}}
|
|
|
|
Now, it's time to talk about expressions. We know now that an expression
|
|
can evaluate to either a boolean or an integer value (because a register
|
|
can contain either of those types of values). Perhaps we can specify
|
|
the type that an expression evaluates to in the expression's own type:
|
|
`Expr IntTy` would evaluate to integers, and `Expr BoolTy` would evaluate
|
|
to booleans. Then, we could have constructors as follows:
|
|
|
|
```Idris
|
|
Lit : Int -> Expr IntTy
|
|
Not : Expr BoolTy -> Expr BoolTy
|
|
```
|
|
|
|
Sounds good! But what about loading a register?
|
|
|
|
```Idris
|
|
Load : Reg -> Expr IntTy -- no; what if the register is a boolean?
|
|
Load : Reg -> Expr BoolTy -- no; what if the register is an integer?
|
|
Load : Reg -> Expr a -- no; a register access can't be either!
|
|
```
|
|
|
|
The type of an expression that loads a register depends on the current
|
|
state of the program! If we last stored an integer into a register,
|
|
then loading from that register would give us an integer. But if we
|
|
last stored a boolean into a register, then reading from it would
|
|
give us a boolean. Our expressions need to be aware of the current
|
|
types of each register. To do so, we add the state as a parameter to
|
|
our `Expr` type. This would lead to types like the following:
|
|
|
|
```Idris
|
|
-- An expression that produces a boolean when all the registers
|
|
-- are integers.
|
|
Expr (IntTy, IntTy, IntTy) BoolTy
|
|
|
|
-- An expression that produces an integer when A and B are integers,
|
|
-- and R is a boolean.
|
|
Expr (IntTy, IntTy, BoolTy) IntTy
|
|
```
|
|
|
|
In Idris, the whole definition becomes:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 18 23 >}}
|
|
|
|
The only "interesting" constructor is `Load`, which, given a register `r`,
|
|
creates an expression that has `r`'s type in the current state `s`.
|
|
|
|
### Statements
|
|
Statements are a bit different. Unlike expressions, they don't evaluate to
|
|
anything; rather, they do something. That "something" may very well be changing
|
|
the current state. We could, for instance, set `A` to be a boolean, while it was
|
|
previously an integer. This suggests equipping our `Stmt` type with two
|
|
arguments: the initial state (before the statement's execution), and the final
|
|
state (after the statement's execution). This would lead to types like this:
|
|
|
|
```Idris
|
|
-- Statement that, when run while all registers contain integers,
|
|
-- terminates with registers B and R having been assigned boolean values.
|
|
Stmt (IntTy, IntTy, IntTy) (IntTy, BoolTy, BoolTy)
|
|
```
|
|
|
|
However, there's a problem with `loop` and `break`. When we run a loop,
|
|
we will require that the state at the end of one iteration is the
|
|
same as the state at its beginning. Otherwise, it would be possible
|
|
for a loop to keep changing the types of registers every iteration,
|
|
and it would become impossible for us to infer the final state
|
|
without actually running the program. In itself, this restriction
|
|
isn't a problem; most static type systems require both branches
|
|
of an `if/else` expression to be of the same type for a similar
|
|
reason. The problem comes from the interaction with `break`.
|
|
|
|
By itself, the would-be type of `break` seems innocent enough. It
|
|
doesn't change any registers, so we could call it `Stmt s s`.
|
|
But consider the following program:
|
|
|
|
```
|
|
A := 0
|
|
B := 0
|
|
R := 0
|
|
do
|
|
if 5 <= A then
|
|
B := 1 <= 1
|
|
break
|
|
B := 0
|
|
else
|
|
A := A + 1
|
|
end
|
|
end
|
|
```
|
|
|
|
The loop starts with all registers having integer values.
|
|
As per our aforementioned loop requirement, the body
|
|
of the loop must terminate with all registers _still_ having
|
|
integer values. For the first five iterations that's exactly
|
|
what will happen. However, after we increment `A` the fifth time,
|
|
we will set `B` to a boolean value -- using a valid statement --
|
|
and then `break`. The `break` statement will be accepted by
|
|
the typechecker, and so will the whole `then` branch. After all,
|
|
it seems as though we reset `B` back to an integer value.
|
|
But that won't be the case. We will have jumped to the end
|
|
of the loop, where we are expected to have an all-integer type,
|
|
which we will not have.
|
|
|
|
The solution I came up with to address this issue was to
|
|
add a _third_ argument to `Stmt`, which contains the "context"
|
|
type. That is, it contains the type of the innermost loop surrounding
|
|
the statement. A `break` statement would only be permissible
|
|
if the current type matches the loop type. With this, we finally
|
|
write down a definition of `Stmt`:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 26 30 >}}
|
|
|
|
The `Store` constructor takes a register `r` and an expression producing some type `t` in state `s`.
|
|
From these, it creates a statement that starts in `s`, and finishes
|
|
in a state similar to `s`, but with `r` now having type `t`. The loop
|
|
type `l` remains unaffected and unused; we are free to assign any register
|
|
any value.
|
|
|
|
The `If` constructor takes a condition `Expr`, which starts in state `s` and _must_ produce
|
|
a boolean. It also takes two programs (sequences of statements), each of which
|
|
starts in `s` and finishes in another state `n`. This results in
|
|
a statement that starts in state `s`, and finishes in state `n`. Conceptually,
|
|
each branch of the `if/else` statement must result in the same final state (in terms of types);
|
|
otherwise, we wouldn't know which of the states to pick when deciding the final
|
|
state of the `If` itself. As with `Store`, the loop type `l` is untouched here.
|
|
Individual statements are free to modify the state however they wish.
|
|
|
|
The `Loop` constructor is very restrictive. It takes a single program (the sequence
|
|
of instructions that it will be repeating). As we discussed above, this program
|
|
must start _and_ end in the same state `s`. Furthermore, this program's loop
|
|
type must also be `s`, since the loop we're constructing will be surrounding the
|
|
program. The resulting loop itself still has an arbitrary loop type `l`, since
|
|
it doesn't surround itself.
|
|
|
|
Finally, `Break` can only be constructed when the loop state matches the current
|
|
state. Since we'll be jumping to the end of the innermost loop, the final state
|
|
is also the same as the loop state.
|
|
|
|
These are all the constructors we'll be needing. It's time to move on to
|
|
whole programs!
|
|
|
|
### Programs
|
|
A program is simply a list of statements. However, we can't use a regular Idris list,
|
|
because a regular list wouldn't be able to represent the relationship between
|
|
each successive statement. In our program, we want the final state of one
|
|
statement to be the initial state of the following one, since they'll
|
|
be executed in sequence. To represent this, we have to define our own
|
|
list-like GADT. The definition of the type turns out fairly straightforward:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 32 34 >}}
|
|
|
|
The `Nil` constructor represents an empty program (much like the built-in `Nil` represents an empty list).
|
|
Since no actions are done, it creates a `Prog` that starts and ends in the same state: `s`.
|
|
The `(::)` constructor, much like the built-in `(::)` constructor, takes a statement
|
|
and another program. The statement begins in state `s` and ends in state `n`; the program after
|
|
that statement must then start in state `n`, and end in some other state `m`.
|
|
The combination of the statement and the program starts in state `s`, and finishes in state `m`.
|
|
Thus, `(::)` yields `Prog s m`. None of the constructors affect the loop type `l`: we
|
|
are free to sequence any statements that we want, and it is impossible for us
|
|
to construct statements using `l` that cause runtime errors.
|
|
|
|
This should be all! Let's try out some programs.
|
|
|
|
### Trying it Out
|
|
The following (type-correct) program compiles just fine:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 36 47 >}}
|
|
|
|
First, it loads a boolean into register `A`; then,
|
|
inside the `if/else` statement, it stores an integer into `A`. Finally,
|
|
it stores another integer into `B`, and adds them into `R`. Even though
|
|
`A` was a boolean at first, the type checker can deduce that it
|
|
was reset back to an integer after the `if/else`, and the program is accepted.
|
|
On the other hand, had we forgotten to set `A` to a boolean first:
|
|
|
|
```Idris
|
|
[ If (Load A)
|
|
[ Store A (Lit 1) ]
|
|
[ Store A (Lit 2) ]
|
|
, Store B (Lit 2)
|
|
, Store R (Add (Load A) (Load B))
|
|
]
|
|
```
|
|
|
|
We would get a type error:
|
|
|
|
```
|
|
Type mismatch between getRegTy A (IntTy, IntTy, IntTy) and BoolTy
|
|
```
|
|
|
|
The type of register `A` (that is, `IntTy`) is incompatible
|
|
with `BoolTy`. Our `initialState` says that `A` starts out as
|
|
an integer, so it can't be used in an `if/else` right away!
|
|
Similar errors occur if we make one of the branches of
|
|
the `if/else` empty, or if we set `B` to a boolean.
|
|
|
|
We can also encode the example program from the beginning
|
|
of this post:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 49 61 >}}
|
|
|
|
This program compiles just fine, too! It is a little reminiscent of
|
|
the program we used to demonstrate how `break` could break things
|
|
if we weren't careful. So, let's go ahead and try `break` in an invalid
|
|
state:
|
|
|
|
```Idris
|
|
[ Store A (Lit 7)
|
|
, Store B (Lit 9)
|
|
, Store R (Lit 0)
|
|
, Loop
|
|
[ If (Load A `Leq` Lit 0)
|
|
[ Store B (Lit 1 `Leq` Lit 1), Break, Store B (Lit 0) ]
|
|
[ Store R (Load R `Add` Load B)
|
|
, Store A (Load A `Add` Lit (-1))
|
|
]
|
|
]
|
|
]
|
|
```
|
|
|
|
Again, the type checker complains:
|
|
|
|
```
|
|
Type mismatch between IntTy and BoolTy
|
|
```
|
|
|
|
And so, we have an encoding of our language that allows registers to
|
|
be either integers or booleans, while still preventing
|
|
type-incorrect programs!
|
|
|
|
### Building an Interpreter
|
|
A good test of such an encoding is the implementation
|
|
of an interpreter. It should be possible to convince the
|
|
typechecker that our interpreter doesn't need to
|
|
handle type errors in the toy language, since they
|
|
cannot occur.
|
|
|
|
Let's start with something simple. First of all, suppose
|
|
we have an expression of type `Expr InTy`. In our toy
|
|
language, it produces an integer. Our interpreter, then,
|
|
will probably want to use Idris' type `Int`. Similarly,
|
|
an expression of type `Expr BoolTy` will produce a boolean
|
|
in our toy language, which in Idris we can represent using
|
|
the built-in `Bool` type. Despite the similar naming, though,
|
|
there's no connection between Idris' `Bool` and our own `BoolTy`.
|
|
We need to define a conversion from our own types -- which are
|
|
values of type `Ty` -- into Idris types that result from evaluating
|
|
expressions. We do so as follows:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 63 65 >}}
|
|
|
|
Similarly, we want to convert our `TypeState` (a tuple describing the _types_
|
|
of our registers) into a tuple that actually holds the values of each
|
|
register, which we will call `State`. The value of each register at
|
|
any point depends on its type. My first thought was to define
|
|
`State` as a function from `TypeState` to an Idris `Type`:
|
|
|
|
```Idris
|
|
State : TypeState -> Type
|
|
State (a, b, c) = (repr a, repr b, repr c)
|
|
```
|
|
|
|
Unfortunately, this doesn't quite cut it. The problem is that this
|
|
function technically doesn't give Idris any guarantees that `State`
|
|
will be a tuple. The most Idris knows is that `State` will be some
|
|
`Type`, which could be `Int`, `Bool`, or anything else! This
|
|
becomes a problem when we try to pattern match on states to get
|
|
the contents of a particular register. Instead, we have to define
|
|
a new data type:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 67 68 >}}
|
|
|
|
In this snippet, `State` is still a (type level) function from `TypeState` to `Type`.
|
|
However, by using a GADT, we guarantee that there's only one way to construct
|
|
a `State (a,b,c)`: using a corresponding tuple. Now, Idris will accept our
|
|
pattern matching:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 70 78 >}}
|
|
|
|
The `getReg` function will take out the value of the corresponding
|
|
register, returning `Int` or `Bool` depending on the `TypeState`.
|
|
What's important is that if the `TypeState` is known, then so
|
|
is the type of `getReg`: no `Either` is involved here, and we
|
|
can directly use the integer or boolean stored in the
|
|
register. This is exactly what we do:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 80 85 >}}
|
|
|
|
This is pretty concise. Idris knows that `Lit i` is of type `Expr IntTy`,
|
|
and it knows that `repr IntTy = Int`, so it also knows that
|
|
`eval (Lit i)` produces an `Int`. Similarly, we wrote
|
|
`Reg r` to have type `Expr s (getRegTy r s)`. Since `getReg`
|
|
returns `repr (getRegTy r s)`, things check out here, too.
|
|
A similar logic applies to the rest of the cases.
|
|
|
|
The situation with statements is somewhat different. As we said, a statement
|
|
doesn't return a value; it changes state. A good initial guess would
|
|
be that to evaluate a statement that starts in state `s` and terminates in state `n`,
|
|
we would take as input `State s` and return `State n`. However, things are not
|
|
quite as simple, thanks to `Break`. Not only does `Break` require
|
|
special case logic to return control to the end of the `Loop`, but
|
|
it also requires some additional consideration: in a statement
|
|
of type `Stmt l s n`, evaluating `Break` can return `State l`.
|
|
|
|
To implement this, we'll use the `Either` type. The `Left` constructor
|
|
will be contain the state at the time of evaluating a `Break`,
|
|
and will indicate to the interpreter that we're breaking out of a loop.
|
|
On the other hand, the `Right` constructor will contain the state
|
|
as produced by all other statements, which would be considered
|
|
{{< sidenote "right" "left-right-note" "the \"normal\" case." >}}
|
|
We use <code>Left</code> for the "abnormal" case because of
|
|
Idris' (and Haskell's) convention to use it as such. For
|
|
instance, the two languages define a <code>Monad</code>
|
|
instance for <code>Either a</code> where <code>(>>=)</code>
|
|
behaves very much like it does for <code>Maybe</code>, with
|
|
<code>Left</code> being treated as <code>Nothing</code>,
|
|
and <code>Right</code> as <code>Just</code>. We will
|
|
use this instance to clean up some of our computations.
|
|
{{< /sidenote >}} Note that this doesn't indicate an error:
|
|
we need to represent the two states (breaking out of a loop
|
|
and normal execution) to define our language's semantics.
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 88 95 >}}
|
|
|
|
First, note the type. We return an `Either` value, which will
|
|
contain `State l` (in the `Left` constructor) if a `Break`
|
|
was evaluated, and `State n` (in the `Right` constructor)
|
|
if execution went on without breaking.
|
|
|
|
The `Store` case is rather simple. We use `setReg` to update the result
|
|
of the register `r` with the result of evaluating `e`. Because
|
|
a store doesn't cause us to start breaking out of a loop,
|
|
we use `Right` to wrap the new state.
|
|
|
|
The `If` case is also rather simple. Its condition is guaranteed
|
|
to evaluate to a boolean, so it's sufficient for us to use
|
|
Idris' `if` expression. We use the `prog` function here, which
|
|
implements the evaluation of a whole program. We'll get to it
|
|
momentarily.
|
|
|
|
`Loop` is the most interesting case. We start by evaluating
|
|
the program `p` serving as the loop body. The result of this
|
|
computation will be either a state after a break,
|
|
held in `Left`, or as the normal execution state, held
|
|
in `Right`. The `(>>=)` operation will do nothing in
|
|
the first case, and feed the resulting (normal) state
|
|
to `stmt (Loop p)` in the second case. This is exactly
|
|
what we want: if we broke out of the current iteration
|
|
of the loop, we shouldn't continue on to the next iteration.
|
|
At the end of evaluating both `p` and the recursive call to
|
|
`stmt`, we'll either have exited normally, or via breaking
|
|
out. We don't want to continue breaking out further,
|
|
so we return the final state wrapped in `Right` in both cases.
|
|
Finally, `Break` returns the current state wrapped in `Left`,
|
|
beginning the process of breaking out.
|
|
|
|
The task of `prog` is simply to sequence several statements
|
|
together. The monadic bind operator, `(>>=)`, is again perfect
|
|
for this task, since it "stops" when it hits a `Left`, but
|
|
continues otherwise. This is the implementation:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 97 99 >}}
|
|
|
|
Awesome! Let's try it out, shall we? I defined a quick `run` function
|
|
as follows:
|
|
|
|
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 101 102 >}}
|
|
|
|
We then have:
|
|
|
|
```
|
|
*TypesafeImp> run prodProg (MkState (0,0,0))
|
|
MkState (0, 9, 63) : State (IntTy, IntTy, IntTy)
|
|
```
|
|
|
|
This seems correct! The program multiplies seven by nine,
|
|
and stops when register `A` reaches zero. Our test program
|
|
runs, too:
|
|
|
|
```
|
|
*TypesafeImp> run testProg (MkState (0,0,0))
|
|
MkState (1, 2, 3) : State (IntTy, IntTy, IntTy)
|
|
```
|
|
|
|
This is the correct answer: `A` ends up being set to
|
|
`1` in the `then` branch of the conditional statement,
|
|
`B` is set to 2 right after, and `R`, the sum of `A`
|
|
and `B`, is rightly `3`.
|
|
|
|
As you can see, we didn't have to write any error handling
|
|
code! This is because the typechecker _knows_ that type errors
|
|
aren't possible: our programs are guaranteed to be
|
|
{{< sidenote "right" "termination-note" "type correct." >}}
|
|
Our programs <em>aren't</em> guaranteed to terminate:
|
|
we're lucky that Idris' totality checker is turned off by default.
|
|
{{< /sidenote >}} This was a fun exercise, and I hope you enjoyed reading along!
|
|
I hope to see you in my future posts.
|