diff --git a/content/blog/coq_dawn_eval.md b/content/blog/coq_dawn_eval.md index 58890a0..6153692 100644 --- a/content/blog/coq_dawn_eval.md +++ b/content/blog/coq_dawn_eval.md @@ -124,7 +124,7 @@ Coq, why don't we use another language? Coq's extraction mechanism allows us to I added the following code to the end of my file: -{{< codelines "Coq" "dawn/DawnEval.v" 219 223 >}} +{{< codelines "Coq" "dawn/DawnEval.v" 231 235 >}} Coq happily produces a new file, `UccGen.hs` with a lot of code. It's not exactly the most aesthetic; here's a quick peek: @@ -172,11 +172,12 @@ Okay, so `true false or` evaluates to `true`, much like our semantics claims. However, does our evaluator _always_ match the semantics? So far, we have not claimed or verified that it does. Let's try giving it a shot. +#### First Steps and Evaluation Chains The first thing we can do is show that if we have a proof that `e` takes initial stack `vs` to final stack `vs'`, then each `eval_step` "makes progress" towards `vs'` (it doesn't simply _return_ `vs'`, since it only takes a single step and doesn't always complete -the evaluation). We also want to show that if the semanics dictates +the evaluation). We also want to show that if the semantics dictates `e` finishes in stack `vs'`, then `eval_step` will never return an error. Thus, we have two possibilities: @@ -236,9 +237,9 @@ exists ei1 ei2 vsi1 vsi2, This is awful! Not only is this a lot of writing, but it also makes various sequences of steps have a different "shape". This way, we can't make -proofs about evalutions of an _arbitrary_ number of steps. Not all is lost, though: if we squint +proofs about evaluations of an _arbitrary_ number of steps. Not all is lost, though: if we squint a little at the last example (three steps), a pattern starts to emerge. -First, let's re-arrange the `exists` qualifiers: +First, let's re-arrange the `exists` quantifiers: ```Coq exists ei1 vsi1, eval_step vs e = middle ei1 vsi1 /\ (* Cons *) @@ -274,7 +275,7 @@ eval_chain vs e vs' ``` Evaluation chains have a couple of interesting properties. First and foremost, -they can be "concatenated". This is analagous to the `Sem_e_comp` rule +they can be "concatenated". This is analogous to the `Sem_e_comp` rule in our original semantics: if an expression `e1` starts in stack `vs` and finishes in stack `vs'`, and another expression starts in stack `vs'` and finishes in stack `vs''`, then we can compose these two expressions, @@ -283,7 +284,7 @@ and the result will start in `vs` and finish in `vs''`. {{< codelines "Coq" "dawn/DawnEval.v" 69 75 >}} The proof is very short. We go -{{< sidenote "right" "concat-note" "by induction on the left evaluation" >}} +{{< sidenote "right" "concat-note" "by induction on the left evaluation chain" >}} It's not a coincidence that defining something like (++) (list concatenation) in Haskell typically starts by pattern matching on the left list. In fact, proofs by induction @@ -294,8 +295,8 @@ in sidenotes, but if you're playing with the the fixed point combinator used to implement recursion. {{< /sidenote >}} -chain (the one for `e1`). Coq takes care of most of the rest with `auto`. -The key to this proof is that whatever `P` is cotained within a "node" +(the one for `e1`). Coq takes care of most of the rest with `auto`. +The key to this proof is that whatever `P` is contained within a "node" in the left chain determines how `eval_step (e_comp e1 e2)` behaves. Whether `e1` evaluates to `final` or `middle`, the composition evaluates to `middle` (a composition of two expressions cannot be done in one step), so we always @@ -307,7 +308,7 @@ create new nodes when appending to an empty list. * If the step was `final`, then the rest of the steps use only `e2`, and good news, we already have a chain for that! -* Otherwise, the step was `middle`, an we stll have a chain for some +* Otherwise, the step was `middle`, an we still have a chain for some intermediate program `ei` that starts in some stack `vsi` and ends in `vs'`. By induction, we know that _this_ chain can be concatenated with the one for `e2`, resulting in a chain for `e_comp ei e2`. All that remains is to @@ -323,3 +324,297 @@ one for each composed expression: While interesting, this particular fact isn't used anywhere else in this post, and I will omit the proof here. + +#### The Forward Direction +Let's try rewording our very first proof, `eval_step_correct`, using +chains. The _direction_ remains the same: given that an expression +produces a final stack under the formal semantics, we need to +prove that this expression _evaluates_ to the same final stack using +a sequence of `eval_step`. + +{{< codelines "Coq" "dawn/DawnEval.v" 93 96 >}} + +The power of this theorem is roughly the same as that of +the original one: we can use `eval_step_correct` to build up a chain +by applying it over and over, and we can take the "first element" +of a chain to serve as a witness for `eval_step_correct`. However, +the formulation is arguably significantly cleaner, and contains +a proof for _all_ steps right away, rather than a single one. + +Before we go through the proof, notice that there's actually _two_ +theorems being stated here, which depend on each other. This +is not surprising given that our semantics are given using two +data types, `Sem_expr` and `Sem_int`, each of which contains the other. +Regular proofs by induction, which work on only one of the data types, +break down because we can't make claims "by induction" about the _other_ +type. For example, while going by induction on `Sem_expr`, we run into +issues in the `e_int` case while handling \\(\\text{apply}\\). We know +a single step -- popping the value being run from the stack. But what then? +The rule for \\(\\text{apply}\\) in `Sem_int` contains another `Sem_expr` +which confirms that the quoted value properly evaluates. But this other +`Sem_expr` isn't directly a child of the "bigger" `Sem_expr`, and so we +don't get an inductive hypothesis about it. We know nothing about it; we're stuck. + +We therefore have two theorems declared together using `with` (just like we +used `with` to declare `Sem_expr` and `Sem_int`). We have to prove both, +which is once again a surprisingly easy task thanks to Coq's `auto`. Let's +start with the first theorem, the one for expression semantics. + +{{< codelines "Coq" "dawn/DawnEval.v" 98 107 >}} + +We go by induction on the semantics data type. There are therefore three cases: +intrinsics, quotes, and composition. The intrinsic case is handed right +off to the second theorem (which we'll see momentarily). The quote case +is very simple since quoted expressions are simply pushed onto the stack in +a single step (we thus use `chain_final`). Finally, in the composition +case, we have two sub-proofs, one for each expression being evaluated. +By induction, we know that each of these sub-proofs can be turned into +a chain, and we use `eval_chain_merge` to combine these two chains into one. +That's it. + +Now, let's try the second theorem. The code is even shorter: + +{{< codelines "Coq" "dawn/DawnEval.v" 108 115 >}} + +The first command, `inversion Hsem`, lets us go case-by-case on the various +ways an intrinsic can be evaluated. Most intrinsics are quite boring; +in our evaluator, they only need a single step, and their semantics +rules guarantee that the stack has the exact kind of data that the evaluator +expects. We dismiss such cases with `apply chain_final; auto`. The only +time this doesn't work is when we encounter the \\(\\text{apply}\\) intrinsic; +in that case, however, we can simply use the first theorem we proved. + +#### A Quick Aside: Automation Using `Ltac2` +Going in the other direction will involve lots of situations in which we _know_ +that `eval_step` evaluated to something. Here's a toy proof built around +one such case: + +{{< codelines "Coq" "dawn/DawnEval.v" 237 246 "hl_lines=5-8">}} + +The proof claims that if the `swap` instruction was evaluated to something, +then the initial stack must contain two values, and the final stack must +have those two values on top but flipped. This is very easy to prove, since +that's the exact behavior of `eval_step` for the `swap` intrinsic. However, +notice how much boilerplate we're writing: lines 241 through 244 deal entirely +with ensuring that our stack does, indeed, have two values. This is trivially true: +if the stack _didn't_ have two values, it wouldn't evaluate to `final`, but to `error`. +However, it takes some effort to convince Coq of this. + +This was just for a single intrinsic. What if we're trying to prove something +for _every_ intrinsic? Things will get messy very quickly. We can't even re-use +our `destruct vs` code, because different intrinsics need stacks with different numbers +of values (they have a different _arity_); if we try to handle all cases with at most +2 values on the stack, we'd have to prove the same thing twice for simpler intrinsics. +In short, proofs will look like a mess. + +Things don't have to be this way, though! The boilerplate code is very repetitive, and +this makes it a good candidate for automation. For this, Coq has `Ltac2`. + +In short, `Ltac2` is another mini-language contained within Coq. We can use it to put +into code the decision making that we'd be otherwise doing on our own. For example, +here's a tactic that checks if the current proof has a hypothesis in which +an intrinsic is evaluated to something: + +{{< codelines "Coq" "dawn/DawnEval.v" 141 148 >}} + +`Ltac2` has a pattern matching construct much like Coq itself does, but it can +pattern match on Coq expressions and proof goals. When pattern matching +on goals, we use the following notation: `[ h1 : t1, ..., hn : tn |- g ]`, which reads: + +> Given hypotheses `h1` through `hn`, of types `t1` through `tn` respectively, + we need to prove `g`. + +In our pattern match, then, we're ignoring the actual thing we're supposed to prove +(the tactic we're writing won't be that smart; its user -- ourselves -- will need +to know when to use it). We are, however, searching for a hypothesis in the form +`eval_step ?a (e_int ?b) = ?c`. The three variables, `?a`, `?b`, and `?c` are placeholders +which can be matched with anything. Thus, we expect any hypothesis in which +an intrinsic is evaluated to anything else (by the way, Coq checks all hypotheses one-by-one, +starting at the most recent and finishing with the oldest). The code on lines 144 through 146 actually +performs the `destruct` calls, as well as the `inversion` attempts that complete +any proofs with an inconsistent assumption (like `err = final vs'`). + +So how do these lines work? Well, first we need to get a handle on the hypothesis we found. +Various tactics can manipulate the proof state, and cause hypotheses to disappear; +by calling `Control.hyp` on `h`, we secure a more permanent hold on our evaluation assumption. +In the next line, we call _another_ `Ltac2` function, `destruct_int_stack`, which unpacks +the stack exactly as many times as necessary. We'll look at this function in a moment. +Finally, we use regular old tactics. Specifically, we use `inversion` (as mentioned before). +I use `fail` after `inversion` to avoid cluttering the proof in case there's _not_ +a contradiction. + +On to `destruct_int_stack`. The definition is very simple: + +{{< codelines "Coq" "dawn/DawnEval.v" 139 139 >}} + +The main novelty is that this function takes two arguments, both of which are Coq expressions: +`int`, or the intrinsic being evaluated, and `va`, the stack that's being analyzed. The `constr` +type in Coq holds terms. Let's look at `int_arity` next: + +{{< codelines "Coq" "dawn/DawnEval.v" 128 137 >}} + +This is a pretty standard pattern match that assigns to each expression its arity. However, we're +case analyzing over _literally any possible Coq expression_, so we need to handle the case in which +the expression isn't actually a specific intrinsic. For this, we use `Control.throw` with +the `Not_intrinsic` exception. + +Wait a moment, does Coq just happen to include a `Not_intrinsic` exception? No, it does not. +We have to register that one ourselves. In `Ltac2`, the type of exception (`exn`) is _extensible_, +which means we can add on to it. We add just a single constructor with no arguments: + +{{< codelines "Coq" "dawn/DawnEval.v" 118 118 >}} + +Finally, unpacking the value stack. Here's the code for that: + +{{< codelines "Coq" "dawn/DawnEval.v" 120 126 >}} + +This function accepts the arity of an operation, and unpacks the stack that many times. +`Ltac2` doesn't have a way to pattern match on numbers, so we resort to good old "less than or equal", +and and `if`/`else`. If the arity is zero (or less than zero, since it's an integer), we don't +need to unpack anymore. This is our base case. Otherwise, we generate two new free variables +(we can't just hardcode `v1` and `v2`, since they may be in use elsewhere in the proof). To this +end we use `Fresh.in_goal` and give it a "base symbol" to build on. We then use +`destruct`, passing it our "scrutinee" (the expression being destructed) and the names +we generated for its components. This generates multiple goals; the `Control.enter` tactic +is used to run code for each one of these goals. In the non-empty list case, we try to break +apart its tail as necessary by recursively calling `destruct_n`. + +That's pretty much it! We can now use our tactic from Coq like any other. Rewriting +our earlier proof about \\(\\text{swap}\\), we now only need to handle the valid case: + +{{< codelines "Coq" "dawn/DawnEval.v" 248 254 >}} + +We'll be using this new `ensure_valid_stack` tactic in our subsequent proofs. + +#### The Backward Direction +After our last proof before the `Ltac2` diversion, we know that our evaluator matches our semantics, but only when a `Sem_expr` +object exists for our program. However, invalid programs don't have a `Sem_expr` object +(there's no way to prove that they evaluate to anything). Does our evaluator behave +properly on "bad" programs, too? We've not ruled out that our evaluator produces junk +`final` or `middle` outputs whenever it encounters an error. We need another theorem: + +{{< codelines "Coq" "dawn/DawnEval.v" 215 216 >}} + +That is, if our evalutor reaches a final state, this state matches our semantics. +This proof is most conveniently split into two pieces. The first piece says that +if our evaluator completes in a single step, then this step matches our semantics. + +{{< codelines "Coq" "dawn/DawnEval.v" 158 175 >}} + +The statement for a non-`final` step is more involved; the one step by itself need +not match the semantics, since it only carries out a part of a computation. However, +we _can_ say that if the "rest of the program" matches the semantics, then so does +the whole expression. + +{{< codelines "Coq" "dawn/DawnEval.v" 177 213 >}} + +Finally, we snap these two pieces together in `eval_step_sem_back`: + +{{< codelines "Coq" "dawn/DawnEval.v" 215 222 >}} + +We have now proved complete equivalence: our evaluator completes in a final state +if and only if our semantics lead to this same final state. As a corollary of this, +we can see that if a program is invalid (it doesn't evaluate to anything under our semantics), +then our evaluator won't finish in a `final` state: + +{{< codelines "Coq" "dawn/DawnEval.v" 224 229 >}} + +### Making a Mini-REPL +We can now be pretty confident about our evaluator. However, this whole business with using GHCi +to run our programs is rather inconvenient; I'd rather write `[drop]` than `E_quote (E_int Drop)`. +I'd also rather _read_ the former instead of the latter. We can do some work in Haskell to make +playing with our interpreter more convenient. + +#### Pretty Printing Code +Coq generated Haskell data types that correspond precisely to the data types we defined for our +proofs. We can use the standard type class mechanism in Haskell to define how they should be +printed: + +{{< codelines "Haskell" "dawn/Ucc.hs" 8 22 >}} + +Now our expressions and values print a lot nicer: + +``` +ghci> true +[swap drop] +ghci> false +[drop] +ghci> UccGen.or +clone apply +``` + +#### Reading Code In +The Parsec library in Haskell can be used to convert text back into data structures. +It's not too hard to create a parser for UCC: + +{{< codelines "Haskell" "dawn/Ucc.hs" 24 44 >}} + +Now, `parseExpression` can be used to read code: + +``` +ghci> parseExpression "[drop] [drop]" +Right [drop] [drop] +``` + +#### The REPL +We now literally have the three pieces of a read-evaluate-print loop. All that's left +is putting them together: + +{{< codelines "Haskell" "dawn/Ucc.hs" 53 64 >}} + +We now have our own little verified evaluator: + +``` +$ ghci Ucc +$ ghc -main-is Ucc Ucc +$ ./Ucc +> [drop] [drop] clone apply +[[drop]] +> [drop] [swap drop] clone apply +[[swap drop]] +> [swap drop] [drop] clone apply +[[swap drop]] +> [swap drop] [swap drop] clone apply +[[swap drop]] +``` + +### Potential Future Work +We now have a verified UCC evaluator in Haskell. What now? + +1. We only verified the evaluation component of our REPL. In fact, + the whole thing is technically a bit buggy due to the parser: + ``` + > [drop] drop coq is a weird name to say out loud in interviews + [] + ``` + Shouldn't this be an error? Do you see why it's not? There's work in + writing formally verified parsers; some time ago I saw a Galois + talk about a [formally verified parser generator](https://galois.com/blog/2019/09/tech-talk-a-verified-ll1-parser-generator/). + We could use this formally verified parser generator to create our parsers, and then be + sure that our grammar is precisely followed. +2. We could try make our evaluator a bit smarter. One thing we could definitely do + is maker our REPL support variables. Then, we would be able to write: + ``` + > true = [swap drop] + > false = [drop] + > or = clone apply + > true false or + [swap drop] + ``` + There are different ways to go about this. One way is to extend our `expr` data type + with a variable constructor. This would complicate the semantics (a _lot_), but it would + allow us to prove facts about variables. Alternatively, we could implement expressions + as syntax sugar in our parser. Using a variable would be the same as simply + pasting in the variable's definition. This is pretty much what the Dawn article + seems to be doing. +3. We could prove more things. Can we confirm, once and for all, the correctness of \\(\\text{quote}_n\\), + for _any_ \\(n\\)? Is there is a generalized way of converting inductive data types into a UCC encoding? + Or could we perhaps formally verify the following comment from Lobsters: + + > with the encoding of natural numbers given, n+1 contains the definition of n duplicated two times. + This means that the encoding of n has size exponential in n, which seems extremely impractical. + + The world's our oyster! + +Despite all of these exciting possibilities, this is where I stop, for now. I hope you enjoyed this article, +and thank you for reading!