blog-static/content/blog/typesafe_imperative_lang.md

477 lines
20 KiB
Markdown

---
title: "A Typesafe Representation of an Imperative Language"
date: 2020-11-02T01:07:21-08:00
tags: ["Idris", "Programming Languages"]
---
A recent homework assignment for my university's programming languages
course was to encode the abstract syntax for a small imperative language
into Haskell data types. The language consisted of very few constructs, and was very much a "toy".
On the expression side of things, it had three registers (`A`, `B`, and `R`),
numbers, addition, comparison using "less than", and logical negation. It also
included a statement for storing the result of an expression into
a register, `if/else`, and an infinite loop construct with an associated `break` operation.
A sample program in the language which computes the product of two
numbers is as follows:
```
A := 7
B := 9
R := 0
do
if A <= 0 then
break
else
R := R + B;
A := A + -1;
end
end
```
The homework notes that type errors may arise in the little imperative language.
We could, for instance, try to add a boolean to a number: `3 + (1 < 2)`. Alternatively,
we could try use a number in the condition of an `if/else` expression. A "naive"
encoding of the abstract syntax would allow for such errors.
However, assuming that registers could only store integers and not booleans, it is fairly easy to
separate the expression grammar into two nonterminals, yielding boolean
and integer expressions respectively. Since registers can only store integers,
the `(:=)` operation will always require an integer expression, and an `if/else`
statement will always require a boolean expression. A matching Haskell encoding
would not allow "invalid" programs to compile. That is, the programs would be
type-correct by construction.
Then, a question arose in the ensuing discussion: what if registers _could_
contain booleans? It would be impossible to create such a "correct-by-construction"
representation then, wouldn't it?
{{< sidenote "right" "haskell-note" "Although I don't know about Haskell," >}}
I am pretty certain that a similar encoding in Haskell is possible. However,
Haskell wasn't originally created for that kind of abuse of its type system,
so it would probably not look very good.
{{< /sidenote >}} I am sure that it _is_ possible to do this
in Idris, a dependently typed programming language. In this post I will
talk about how to do that.
### Registers and Expressions
Let's start by encoding registers. Since we only have three registers, we
can encode them using a simple data type declaration, much the same as we
would in Haskell:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 1 1 >}}
Now that registers can store either integers or booleans (and only those two),
we need to know which one is which. For this purpose, we can declare another
data type:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 3 3 >}}
At any point in the (hypothetical) execution of our program, each
of the registers will have a type, either boolean or integer. The
combined state of the three registers would then be the combination
of these three states; we can represent this using a 3-tuple:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 5 6 >}}
Let's say that the first element of the tuple will be the type of the register
`A`, the second the type of `B`, and the third the type of `R`. Then,
we can define two helper functions, one for retrieving the type of
a register, and one for changing it:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 8 16 >}}
Now, it's time to talk about expressions. We know now that an expression
can evaluate to either a boolean or an integer value (because a register
can contain either of those types of values). Perhaps we can specify
the type that an expression evaluates to in the expression's own type:
`Expr IntTy` would evaluate to integers, and `Expr BoolTy` would evaluate
to booleans. Then, we could have constructors as follows:
```Idris
Lit : Int -> Expr IntTy
Not : Expr BoolTy -> Expr BoolTy
```
Sounds good! But what about loading a register?
```Idris
Load : Reg -> Expr IntTy -- no; what if the register is a boolean?
Load : Reg -> Expr BoolTy -- no; what if the register is an integer?
Load : Reg -> Expr a -- no; a register access can't be either!
```
The type of an expression that loads a register depends on the current
state of the program! If we last stored an integer into a register,
then loading from that register would give us an integer. But if we
last stored a boolean into a register, then reading from it would
give us a boolean. Our expressions need to be aware of the current
types of each register. To do so, we add the state as a parameter to
our `Expr` type. This would lead to types like the following:
```Idris
-- An expression that produces a boolean when all the registers
-- are integers.
Expr (IntTy, IntTy, IntTy) BoolTy
-- An expression that produces an integer when A and B are integers,
-- and R is a boolean.
Expr (IntTy, IntTy, BoolTy) IntTy
```
In Idris, the whole definition becomes:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 18 23 >}}
The only "interesting" constructor is `Load`, which, given a register `r`,
creates an expression that has `r`'s type in the current state `s`.
### Statements
Statements are a bit different. Unlike expressions, they don't evaluate to
anything; rather, they do something. That "something" may very well be changing
the current state. We could, for instance, set `A` to be a boolean, while it was
previously an integer. This suggests equipping our `Stmt` type with two
arguments: the initial state (before the statement's execution), and the final
state (after the statement's execution). This would lead to types like this:
```Idris
-- Statement that, when run while all registers contain integers,
-- terminates with registers B and R having been assigned boolean values.
Stmt (IntTy, IntTy, IntTy) (IntTy, BoolTy, BoolTy)
```
However, there's a problem with `loop` and `break`. When we run a loop,
we will require that the state at the end of one iteration is the
same as the state at its beginning. Otherwise, it would be possible
for a loop to keep changing the types of registers every iteration,
and it would become impossible for us to infer the final state
without actually running the program. In itself, this restriction
isn't a problem; most static type systems require both branches
of an `if/else` expression to be of the same type for a similar
reason. The problem comes from the interaction with `break`.
By itself, the would-be type of `break` seems innocent enough. It
doesn't change any registers, so we could call it `Stmt s s`.
But consider the following program:
```
A := 0
B := 0
R := 0
do
if 5 <= A then
B := 1 <= 1
break
B := 0
else
A := A + 1
end
end
```
The loop starts with all registers having integer values.
As per our aforementioned loop requirement, the body
of the loop must terminate with all registers _still_ having
integer values. For the first five iterations that's exactly
what will happen. However, after we increment `A` the fifth time,
we will set `B` to a boolean value -- using a valid statement --
and then `break`. The `break` statement will be accepted by
the typechecker, and so will the whole `then` branch. After all,
it seems as though we reset `B` back to an integer value.
But that won't be the case. We will have jumped to the end
of the loop, where we are expected to have an all-integer type,
which we will not have.
The solution I came up with to address this issue was to
add a _third_ argument to `Stmt`, which contains the "context"
type. That is, it contains the type of the innermost loop surrounding
the statement. A `break` statement would only be permissible
if the current type matches the loop type. With this, we finally
write down a definition of `Stmt`:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 26 30 >}}
The `Store` constructor takes a register `r` and an expression producing some type `t` in state `s`.
From these, it creates a statement that starts in `s`, and finishes
in a state similar to `s`, but with `r` now having type `t`. The loop
type `l` remains unaffected and unused; we are free to assign any register
any value.
The `If` constructor takes a condition `Expr`, which starts in state `s` and _must_ produce
a boolean. It also takes two programs (sequences of statements), each of which
starts in `s` and finishes in another state `n`. This results in
a statement that starts in state `s`, and finishes in state `n`. Conceptually,
each branch of the `if/else` statement must result in the same final state (in terms of types);
otherwise, we wouldn't know which of the states to pick when deciding the final
state of the `If` itself. As with `Store`, the loop type `l` is untouched here.
Individual statements are free to modify the state however they wish.
The `Loop` constructor is very restrictive. It takes a single program (the sequence
of instructions that it will be repeating). As we discussed above, this program
must start _and_ end in the same state `s`. Furthermore, this program's loop
type must also be `s`, since the loop we're constructing will be surrounding the
program. The resulting loop itself still has an arbitrary loop type `l`, since
it doesn't surround itself.
Finally, `Break` can only be constructed when the loop state matches the current
state. Since we'll be jumping to the end of the innermost loop, the final state
is also the same as the loop state.
These are all the constructors we'll be needing. It's time to move on to
whole programs!
### Programs
A program is simply a list of statements. However, we can't use a regular Idris list,
because a regular list wouldn't be able to represent the relationship between
each successive statement. In our program, we want the final state of one
statement to be the initial state of the following one, since they'll
be executed in sequence. To represent this, we have to define our own
list-like GADT. The definition of the type turns out fairly straightforward:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 32 34 >}}
The `Nil` constructor represents an empty program (much like the built-in `Nil` represents an empty list).
Since no actions are done, it creates a `Prog` that starts and ends in the same state: `s`.
The `(::)` constructor, much like the built-in `(::)` constructor, takes a statement
and another program. The statement begins in state `s` and ends in state `n`; the program after
that statement must then start in state `n`, and end in some other state `m`.
The combination of the statement and the program starts in state `s`, and finishes in state `m`.
Thus, `(::)` yields `Prog s m`. None of the constructors affect the loop type `l`: we
are free to sequence any statements that we want, and it is impossible for us
to construct statements using `l` that cause runtime errors.
This should be all! Let's try out some programs.
### Trying it Out
The following (type-correct) program compiles just fine:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 36 47 >}}
First, it loads a boolean into register `A`; then,
inside the `if/else` statement, it stores an integer into `A`. Finally,
it stores another integer into `B`, and adds them into `R`. Even though
`A` was a boolean at first, the type checker can deduce that it
was reset back to an integer after the `if/else`, and the program is accepted.
On the other hand, had we forgotten to set `A` to a boolean first:
```Idris
[ If (Load A)
[ Store A (Lit 1) ]
[ Store A (Lit 2) ]
, Store B (Lit 2)
, Store R (Add (Load A) (Load B))
]
```
We would get a type error:
```
Type mismatch between getRegTy A (IntTy, IntTy, IntTy) and BoolTy
```
The type of register `A` (that is, `IntTy`) is incompatible
with `BoolTy`. Our `initialState` says that `A` starts out as
an integer, so it can't be used in an `if/else` right away!
Similar errors occur if we make one of the branches of
the `if/else` empty, or if we set `B` to a boolean.
We can also encode the example program from the beginning
of this post:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 49 61 >}}
This program compiles just fine, too! It is a little reminiscent of
the program we used to demonstrate how `break` could break things
if we weren't careful. So, let's go ahead and try `break` in an invalid
state:
```Idris
[ Store A (Lit 7)
, Store B (Lit 9)
, Store R (Lit 0)
, Loop
[ If (Load A `Leq` Lit 0)
[ Store B (Lit 1 `Leq` Lit 1), Break, Store B (Lit 0) ]
[ Store R (Load R `Add` Load B)
, Store A (Load A `Add` Lit (-1))
]
]
]
```
Again, the type checker complains:
```
Type mismatch between IntTy and BoolTy
```
And so, we have an encoding of our language that allows registers to
be either integers or booleans, while still preventing
type-incorrect programs!
### Building an Interpreter
A good test of such an encoding is the implementation
of an interpreter. It should be possible to convince the
typechecker that our interpreter doesn't need to
handle type errors in the toy language, since they
cannot occur.
Let's start with something simple. First of all, suppose
we have an expression of type `Expr InTy`. In our toy
language, it produces an integer. Our interpreter, then,
will probably want to use Idris' type `Int`. Similarly,
an expression of type `Expr BoolTy` will produce a boolean
in our toy language, which in Idris we can represent using
the built-in `Bool` type. Despite the similar naming, though,
there's no connection between Idris' `Bool` and our own `BoolTy`.
We need to define a conversion from our own types -- which are
values of type `Ty` -- into Idris types that result from evaluating
expressions. We do so as follows:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 63 65 >}}
Similarly, we want to convert our `TypeState` (a tuple describing the _types_
of our registers) into a tuple that actually holds the values of each
register, which we will call `State`. The value of each register at
any point depends on its type. My first thought was to define
`State` as a function from `TypeState` to an Idris `Type`:
```Idris
State : TypeState -> Type
State (a, b, c) = (repr a, repr b, repr c)
```
Unfortunately, this doesn't quite cut it. The problem is that this
function technically doesn't give Idris any guarantees that `State`
will be a tuple. The most Idris knows is that `State` will be some
`Type`, which could be `Int`, `Bool`, or anything else! This
becomes a problem when we try to pattern match on states to get
the contents of a particular register. Instead, we have to define
a new data type:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 67 68 >}}
In this snippet, `State` is still a (type level) function from `TypeState` to `Type`.
However, by using a GADT, we guarantee that there's only one way to construct
a `State (a,b,c)`: using a corresponding tuple. Now, Idris will accept our
pattern matching:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 70 78 >}}
The `getReg` function will take out the value of the corresponding
register, returning `Int` or `Bool` depending on the `TypeState`.
What's important is that if the `TypeState` is known, then so
is the type of `getReg`: no `Either` is involved here, and we
can directly use the integer or boolean stored in the
register. This is exactly what we do:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 80 85 >}}
This is pretty concise. Idris knows that `Lit i` is of type `Expr IntTy`,
and it knows that `repr IntTy = Int`, so it also knows that
`eval (Lit i)` produces an `Int`. Similarly, we wrote
`Reg r` to have type `Expr s (getRegTy r s)`. Since `getReg`
returns `repr (getRegTy r s)`, things check out here, too.
A similar logic applies to the rest of the cases.
The situation with statements is somewhat different. As we said, a statement
doesn't return a value; it changes state. A good initial guess would
be that to evaluate a statement that starts in state `s` and terminates in state `n`,
we would take as input `State s` and return `State n`. However, things are not
quite as simple, thanks to `Break`. Not only does `Break` require
special case logic to return control to the end of the `Loop`, but
it also requires some additional consideration: in a statement
of type `Stmt l s n`, evaluating `Break` can return `State l`.
To implement this, we'll use the `Either` type. The `Left` constructor
will be contain the state at the time of evaluating a `Break`,
and will indicate to the interpreter that we're breaking out of a loop.
On the other hand, the `Right` constructor will contain the state
as produced by all other statements, which would be considered
{{< sidenote "right" "left-right-note" "the \"normal\" case." >}}
We use <code>Left</code> for the "abnormal" case because of
Idris' (and Haskell's) convention to use it as such. For
instance, the two languages define a <code>Monad</code>
instance for <code>Either a</code> where <code>(>>=)</code>
behaves very much like it does for <code>Maybe</code>, with
<code>Left</code> being treated as <code>Nothing</code>,
and <code>Right</code> as <code>Just</code>. We will
use this instance to clean up some of our computations.
{{< /sidenote >}} Note that this doesn't indicate an error:
we need to represent the two states (breaking out of a loop
and normal execution) to define our language's semantics.
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 88 95 >}}
First, note the type. We return an `Either` value, which will
contain `State l` (in the `Left` constructor) if a `Break`
was evaluated, and `State n` (in the `Right` constructor)
if execution went on without breaking.
The `Store` case is rather simple. We use `setReg` to update the result
of the register `r` with the result of evaluating `e`. Because
a store doesn't cause us to start breaking out of a loop,
we use `Right` to wrap the new state.
The `If` case is also rather simple. Its condition is guaranteed
to evaluate to a boolean, so it's sufficient for us to use
Idris' `if` expression. We use the `prog` function here, which
implements the evaluation of a whole program. We'll get to it
momentarily.
`Loop` is the most interesting case. We start by evaluating
the program `p` serving as the loop body. The result of this
computation will be either a state after a break,
held in `Left`, or as the normal execution state, held
in `Right`. The `(>>=)` operation will do nothing in
the first case, and feed the resulting (normal) state
to `stmt (Loop p)` in the second case. This is exactly
what we want: if we broke out of the current iteration
of the loop, we shouldn't continue on to the next iteration.
At the end of evaluating both `p` and the recursive call to
`stmt`, we'll either have exited normally, or via breaking
out. We don't want to continue breaking out further,
so we return the final state wrapped in `Right` in both cases.
Finally, `Break` returns the current state wrapped in `Left`,
beginning the process of breaking out.
The task of `prog` is simply to sequence several statements
together. The monadic bind operator, `(>>=)`, is again perfect
for this task, since it "stops" when it hits a `Left`, but
continues otherwise. This is the implementation:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 97 99 >}}
Awesome! Let's try it out, shall we? I defined a quick `run` function
as follows:
{{< codelines "Idris" "typesafe-imperative/TypesafeImp.idr" 101 102 >}}
We then have:
```
*TypesafeImp> run prodProg (MkState (0,0,0))
MkState (0, 9, 63) : State (IntTy, IntTy, IntTy)
```
This seems correct! The program multiplies seven by nine,
and stops when register `A` reaches zero. Our test program
runs, too:
```
*TypesafeImp> run testProg (MkState (0,0,0))
MkState (1, 2, 3) : State (IntTy, IntTy, IntTy)
```
This is the correct answer: `A` ends up being set to
`1` in the `then` branch of the conditional statement,
`B` is set to 2 right after, and `R`, the sum of `A`
and `B`, is rightly `3`.
As you can see, we didn't have to write any error handling
code! This is because the typechecker _knows_ that type errors
aren't possible: our programs are guaranteed to be
{{< sidenote "right" "termination-note" "type correct." >}}
Our programs <em>aren't</em> guaranteed to terminate:
we're lucky that Idris' totality checker is turned off by default.
{{< /sidenote >}} This was a fun exercise, and I hope you enjoyed reading along!
I hope to see you in my future posts.