blog-static/content/blog/05_spa_agda_semantics/index.md

454 lines
20 KiB
Markdown
Raw Normal View History

---
title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 5: Our Programming Language"
series: "Static Program Analysis in Agda"
description: "In this post, I define the language that well serve as the object of our vartious analyses"
date: 2024-08-10T17:37:43-07:00
tags: ["Agda", "Programming Languages"]
draft: true
custom_js: ["parser.js"]
bergamot:
render_presets:
default: "bergamot/rendering/imp.bergamot"
input_modes:
- name: "Expression"
fn: "window.parseExpr"
- name: "Basic Statement"
fn: "window.parseBasicStmt"
- name: "Statement"
fn: "window.parseStmt"
---
In the previous several posts, I've formalized the notion of lattices, which
are an essential ingredient to formalizing the analyses in Anders Møller's
lecture notes. However, there can be no program analysis without a program
to analyze! In this post, I will define the (very simple) language that we
will be analyzing. An essential aspect of the language is its
[semantics](https://en.wikipedia.org/wiki/Semantics_(computer_science)), which
simply speaking explains what each feature of the language does. At the end
of the previous article, I gave the following _inference rule_ which defined
(partially) how the `if`-`else` statement in the language works.
{{< latex >}}
\frac{\rho_1, e \Downarrow z \quad \neg (z = 0) \quad \rho_1,s_1 \Downarrow \rho_2}
{\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2}
{{< /latex >}}
Like I mentioned then, this rule reads as follows:
> If the condition of an `if`-`else` statement evaluates to a nonzero value,
> then to evaluate the statement, you evaluate its `then` branch.
Another similar --- but crucially, not equivlalent -- rule is the following:
{{< latex >}}
\frac{\rho_1, e \Downarrow z \quad z = 1 \quad \rho_1,s_1 \Downarrow \rho_2}
{\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2}
{{< /latex >}}
This time, the English interpretation of the rule is as follows:
> If the condition of an `if`-`else` statement evaluates to one,
> then to evaluate the statement, you evaluate its `then` branch.
These rules are certainly not equivalent. For instance, the former allows
the "then" branch to be executed when the condition is `2`; however, in
the latter, the value of the conditional must be `1`. If our analysis were
intelligent (our first few will not be), then this difference would change
its output when determining the signs of the following program:
```
x = 2
if x {
y = - 1
} else {
y = 1
}
```
Using the first, more "relaxed" rule, the condition would be considered "true",
and the sign of `y` would be `-`. On the other hand, using the second,
"stricter" rule, the sign of `y` would be `+`. I stress that in this case,
I am showing a flow-sensitive analysis (one that can understand control flow
and make more specific predictions); for our simplest analyses, we will not
be aiming for flow-sensitivity. There is plenty of work to do even then.
The point of showing these two distinct rules is that we need to be very precise
about how the language will behave, because our analyses depend on that behavior.
Let's not get ahead of ourselves, though. I've motivated the need for semantics,
but there is much groundwork to be laid before we delve into the precise rules
of our language. After all, to define the language's semantics, we need to
have a language.
### The Syntax of Our Simple Language
I've shown a couple of examples our our language now, and there won't be that
much more to it. We can start with _expressions_: things that evaluate to
something. Some examples of expressions are `1`, `x`, and `2-(x+y)`. For our
specific language, the precise set of possible expressions can be given
by the following [Context-Free Grammar](https://en.wikipedia.org/wiki/Context-free_grammar):
{{< latex >}}
\begin{array}{rcll}
e & ::= & x & \text{(variables)} \\
& | & z & \text{(integer literals)} \\
& | & e + e & \text{(addition)} \\
& | & e - e & \text{(subtraction)}
\end{array}
{{< /latex >}}
The above can be read as follows:
> An expression \(e\) is one of the following things:
> 1. Some variable \(x\) [importantly \(x\) is a placeholder for _any_ variable,
> which could be `x` or `y` in our program code; specifically, \(x\) is
> a [_metavariable_](https://en.wikipedia.org/wiki/Metavariable).]
> 2. Some integer \(z\) [once again, \(z\) can be any integer, like 1, -42, etc.].
> 3. The addition of two other expressions [which could themselves be additions etc.].
> 4. The subtraction of two other expressions [which could also themselves be additions, subtractions, etc.].
Since expressions can be nested within other expressions --- which is necessary
to allow complicated code like `2-(x+y)` above --- they form a tree. Each node
is one of the elements of the grammar above (variable, addition, etc.). If
a node contains sub-expressions (like addition and subtraction do), then
these sub-expressions form sub-trees of the given node. This data structure
is called an [Abstract Syntax Tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
Notably, though `2-(x+y)` has parentheses, our grammar above does not include
include them as a case. The reason for this is that the structure of an
abstract syntax tree is sufficient to encode the order in which the operations
should be evaluated.
{{< todo >}}
Probably two drawings of differently-associated ASTs here.
{{< /todo >}}
To an Agda programmer, the one-of-four-things definition above should read
quite similarly to the definition of an algebraic data type. Indeed, this
is how we can encode the abstract syntax tree of expressions:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 12 16 >}}
The only departure from the grammar above is that I had to invent constructors
for the variable and integer cases, since Agda doesn't support implicit coercions.
This adds a little bit of extra overhead, requiring, for example, that we write
numbers as `# 42` instead of `42`.
Having defined expressions, the next thing on the menu is _statements_. Unlike
expressions, which just produce values, statements "do something"; an example
of a statement might be the following Python line:
```Python
print("Hello, world!")
```
The `print` function doesn't produce any value, but it does perform an action;
it prints its argument to the console!
For the formalization, it turns out to be convenient to separate "simple"
statements from "complex" ones. Pragmatically speaking, the difference is that
between the "simple" and the "complex" is control flow; simple statements
will be guaranteed to always execute without any decisions or jumps.
The reason for this will become clearer in subsequent posts; I will foreshadow
a bit by saying that consecutive simple statements can be placed into a single
[basic block](https://en.wikipedia.org/wiki/Basic_block).
The following is a group of three simple statements:
```
x = 1
y = x + 2
noop
```
These will always be executed in the same order, exactly once. Here, `noop`
is a convenient type of statement that simply does nothing.
On the other hand, the following statement is not simple:
```
while x {
x = x - 1
}
```
It's not simple because it makes decisions about how the code should be executed;
if `x` is nonzero, it will try executing the statement in the body of the loop
(`x = x - 1`). Otherwise, it would skip evaluating that statement, and carry on
with subsequent code.
I first define simple statements using the `BasicStmt` type:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 18 20 >}}
Complex statements are just called `Stmt`; they include loops, conditionals and
sequences ---
{{< sidenote "right" "then-note" "\(s_1\ \text{then}\ s_2\)" >}}
The standard notation for sequencing in imperative languages is
\(s_1; s_2\). However, Agda gives special meaning to the semicolon,
and I couldn't find any passable symbolic alternatives.
{{< /sidenote >}} is a sequence where \(s_2\) is evaluated after \(s_1\).
Complex statements subsume simple statements, which I model using the constructor
`⟨_⟩`.
{{< codelines "Agda" "agda-spa/Language/Base.agda" 25 29 >}}
For an example of using this encoding, take the following simple program:
```
var = 1
if var {
x = 1
}
```
The Agda version is:
{{< codelines "Agda" "agda-spa/Main.agda" 27 34 >}}
Notice how we used `noop` to express the fact that the `else` branch of the
conditional does nothing.
### The Semantics of Our Language
We now have all the language constructs that I'll be showing off --- because
those are all the concepts that I've formalized. What's left is to define
how they behave. We will do this using a logical tool called
[_inference rules_](https://en.wikipedia.org/wiki/Rule_of_inference). I've
written about them a number of times; they're ubiquitous, particularly in the
sorts of things I like explore on this site. The [section on inference rules]({{< relref "01_aoc_coq#inference-rules" >}})
from my Advent of Code series is pretty relevant, and [the notation section from
a post in my compiler series]({{< relref "03_compiler_typechecking#some-notation" >}}) says
much the same thing; I won't be re-describing them here.
There are three pieces which demand semantics: expressions, simple statements,
and non-simple statements. The semantics of each of the three requires the
semantics of the items that precede it. We will therefore start with expressions.
#### Expressions
The trickiest thing about expression is that the value of an expression depends
on the "context": `x+1` can evaluate to `43` if `x` is `42`, or it can evaluate
to `0` if `x` is `-1`. To evaluate an expression, we will therefore need to
assign values to all of the variables in that expression. A mapping that
assigns values to variables is typically called an _environment_. We will write
\(\varnothing\) for "empty environment", and \(\{\texttt{x} \mapsto 42, \texttt{y} \mapsto -1 \}\) for
an environment that maps the variable \(\texttt{x}\) to 42, and the variable \(\texttt{y}\) to -1.
Now, a bit of notation. We will use the letter \(\rho\) to represent environments
(and if several environments are involved, we will occasionally number them
as \(\rho_1\), \(\rho_2\), etc.) We will use the letter \(e\) to stand for
expressions, and the letter \(v\) to stand for values. Finally, we'll write
\(\rho, e \Downarrow v\) to say that "in an environment \(\rho\), expression \(e\)
evaluates to value \(v\)". Our two previous examples of evaluating `x+1` can
thus be written as follows:
{{< latex >}}
\{ \texttt{x} \mapsto 42 \}, \texttt{x}+1 \Downarrow 43 \\
\{ \texttt{x} \mapsto -1 \}, \texttt{x}+1 \Downarrow 0 \\
{{< /latex >}}
Now, on to the actual rules for how to evaluate expressions. Most simply,
integer literals `1` just evaluate to themselves.
{{< latex >}}
\frac{n \in \text{Int}}{\rho, n \Downarrow n}
{{< /latex >}}
Note that the letter \(\rho\) is completely unused in the above rule. That's
because no matter what values _variables_ have, a number still evaluates to
the same value. As we've already established, the same is not true for a
variable like \(\texttt{x}\). To evaluate such a variable, we need to retrieve
the value it's mapped to in the current environment, which we will write as
\(\rho(\texttt{x})\). This gives the following inference rule:
{{< latex >}}
\frac{\rho(x) = v}{\rho, x \Downarrow v}
{{< /latex >}}
All that's left is to define addition and subtraction. For an expression in the
form \(e_1+e_2\), we first need to evaluate the two subexpressions \(e_1\)
and \(e_2\), and then add the two resulting numbers. As a result, the addition
rule includes two additional premises, one for evaluating each summand.
{{< latex >}}
\frac
{\rho, e_1 \Downarrow v_1 \quad \rho, e_2 \Downarrow v_2 \quad v_1 + v_2 = v}
{\rho, e_1+e_2 \Downarrow v}
{{< /latex >}}
The subtraction rule is similar. Below, I've configured an instance of
[Bergamot]({{< relref "bergamot" >}}) to interpret these exact rules. Try
typing various expressions like `1`, `1+1`, etc. into the input box below
to see them evaluate. If you click the "Full Proof Tree" button, you can also view
the exact rules that were used in computing a particular value. The variables
`x`, `y`, and `z` are pre-defined for your convenience.
{{< bergamot_widget id="expr-widget" query="" prompt="eval(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?v)" modes="Expression:Expression" >}}
section "" {
EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n);
EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho);
}
section "" {
EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v);
EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v);
}
hidden section "" {
EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-;
EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(symeq(?x, ?y));
}
{{< /bergamot_widget >}}
The Agda equivalent of this looks very similar to the rules themselves. I use
`⇒ᵉ` instead of \(\Downarrow\), and there's a little bit of tedium with
wrapping integers into a new `Value` type. I also used a (partial) relation
`(x, v) ∈ ρ` instead of explicitly defining accessing an environment, since
it is conceivable for a user to attempt accessing a variable that has not
been assigned to. Aside from these notational changes, the structure of each
of the constructors of the evaluation data type matches the inference rules
I showed above.
{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 27 35 >}}
#### Simple Statements
The main difference between formalizing (simple and "normal") statements is
that they modify the environment. If `x` has one value, writing `x = x + 1` will
certainly change that value. On the other hands, statements don't produce values.
So, we will be writing claims like \(\rho_1 , \textit{bs} \Rightarrow \rho_2\)
to say that the basic statement \(\textit{bs}\), when starting in environment
\(\rho_1\), will produce environment \(\rho_2\). Here's an example:
{{< latex >}}
\{ \texttt{x} \mapsto 42, \texttt{y} \mapsto 17 \}, \ \texttt{x = x - \text{1}} \Rightarrow \{ \texttt{x} \mapsto 41, \texttt{y} \mapsto 17 \}
{{< /latex >}}
Here, we subtracted one from a variable with value `42`, leaving it with a new
value of `41`.
There are two basic statements, and one of them quite literally does nothing.
The inference rule for `noop` is very simple:
{{< latex >}}
\rho,\ \texttt{noop} \Rightarrow \rho
{{< /latex >}}
For the assignment rule, we need to know how to evaluate the expression on the
right side of the equal sign. This is why we needed to define the semantics
of expressions first. Given those, the evaluation rule for assignment is as
follows, with \(\rho[x \mapsto v]\) meaning "the environment \(\rho\) but
mapping the variable \(x\) to value \(v\)".
{{< latex >}}
\frac
{\rho, e \Downarrow v}
{\rho, x = e \Rightarrow \rho[x \mapsto v]}
{{< /latex >}}
Those are actually all the rules we need, and below, I am once again configuring
a Bergamot instance, this time with simple statements. Try out `noop` or some
sort of variable assignment, like `x = x + 1`.
{{< bergamot_widget id="basic-stmt-widget" query="" prompt="stepbasic(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Basic Statement:Basic Statement" >}}
hidden section "" {
EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n);
EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho);
}
hidden section "" {
EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v);
EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v);
}
section "" {
StepNoop @ stepbasic(?rho, noop, ?rho) <-;
StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v);
}
hidden section "" {
EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-;
EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(symeq(?x, ?y));
}
{{< /bergamot_widget >}}
The Agda implementation is once again just a data type with constructors-for-rules.
This time they also look quite similar to the rules I've shown up until now,
though I continue to explicitly quantify over variables like `ρ`.
{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 37 40 >}}
#### Statements
Let's work on non-simple statements next. The easiest rule to define is probably
sequencing. When we use `then` (or `;`) to combine two statements, what we
actually want is to execute the first statement, which may change variables,
and then execute the second statement while keeping the changes from the first.
This means there are three environments: \(\rho_1\) for the initial state before
either statement is executed, \(\rho_2\) for the state between executing the
first and second statement, and \(\rho_3\) for the final state after both
are done executing. This leads to the following rule:
{{< latex >}}
\frac
{ \rho_1, s_1 \Rightarrow \rho_2 \quad \rho_2, s_2 \Rightarrow \rho_3 }
{ \rho_1, s_1; s_2 \Rightarrow \rho_3 }
{{< /latex >}}
We will actually need two rules to evaluate the conditional statement: one
for when the condition evaluates to "true", and one for when the condition
evaluates to "false". Only, I never specified booleans as being part of
the language, which means that we will need to come up what "false" and "true"
are. I will take my cue from C++ and use zero as "false", and any other number
as "true".
If the condition of an `if`-`else` statement is "true" (nonzero), then the
effect of executing the `if`-`else` should be the same as executing the "then"
part of the statement, while completely ignoring the "else" part.
{{< latex >}}
\frac
{ \rho_1 , e \Downarrow v \quad v \neq 0 \quad \rho_1, s_1 \Rightarrow \rho_2}
{ \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 }
{{< /latex >}}
Notice that in the above rule, we used the evaluation judgement \(\rho_1, e \Downarrow v\)
to evaluate the _expression_ that serves as the condition. We then had an
additional premise that requires the truthiness of the resulting value \(v\).
The rule for evaluating a conditional with a "false" branch is very similar.
{{< latex >}}
\frac
{ \rho_1 , e \Downarrow v \quad v = 0 \quad \rho_1, s_2 \Rightarrow \rho_2}
{ \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 }
{{< /latex >}}
{{< bergamot_widget id="stmt-widget" query="" prompt="step(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Statement:Statement" >}}
hidden section "" {
EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n);
EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho);
}
hidden section "" {
EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v);
EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v);
}
hidden section "" {
StepNoop @ stepbasic(?rho, noop, ?rho) <-;
StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v);
}
hidden section "" {
StepNoop @ stepbasic(?rho, noop, ?rho) <-;
StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v);
}
hidden section "" {
StepLiftBasic @ step(?rho_1, ?s, ?rho_2) <- stepbasic(?rho_1, ?s, ?rho_2);
}
section "" {
StepIfTrue @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), not(numeq(?v, 0)), step(?rho_1, ?s_1, ?rho_2);
StepIfFalse @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), numeq(?v, 0), step(?rho_1, ?s_2, ?rho_2);
StepWhileTrue @ step(?rho_1, while(?e, ?s), ?rho_3) <- eval(?rho_1, ?e, ?v), not(numeq(?v, 0)), step(?rho_1, ?s, ?rho_2), step(?rho_2, while(?e, ?s), ?rho_3);
StepWhileFalse @ step(?rho_1, while(?e, ?s), ?rho_1) <- eval(?rho_1, ?e, ?v), numeq(?v, 0);
StepSeq @ step(?rho_1, seq(?s_1, ?s_2), ?rho_3) <- step(?rho_1, ?s_1, ?rho_2), step(?rho_2, ?s_2, ?rho_3);
}
hidden section "" {
EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-;
EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(symeq(?x, ?y));
}
{{< /bergamot_widget >}}