---
title: 'Implementing and Verifying "Static Program Analysis" in Agda, Part 5: Our Programming Language'
series: "Static Program Analysis in Agda"
description: "In this post, I define the language that will serve as the object of our various analyses"
date: 2024-08-10T17:37:43-07:00
draft: true
---
In the previous several posts, I've formalized the notion of lattices, which
are an essential ingredient to formalizing the analyses in Anders Møller's
lecture notes. However, there can be no program analysis without a program
to analyze! In this post, I will define the (very simple) language that we
will be analyzing. An essential aspect of the language is its semantics, which,
simply speaking, explains what each feature of the language does. At the end
of the previous article, I gave the following inference rule, which (partially)
defined how the `if`-`else` statement in the language works.
{{< latex >}} \frac{\rho_1, e \Downarrow z \quad \neg (z = 0) \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}
Like I mentioned then, this rule reads as follows:

> If the condition of an `if`-`else` statement evaluates to a nonzero value, then to evaluate the statement, you evaluate its `then` branch.
Another similar --- but crucially, not equivalent --- rule is the following:
{{< latex >}} \frac{\rho_1, e \Downarrow z \quad z = 1 \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}
This time, the English interpretation of the rule is as follows:

> If the condition of an `if`-`else` statement evaluates to one, then to evaluate the statement, you evaluate its `then` branch.
These rules are certainly not equivalent. For instance, the former allows
the "then" branch to be executed when the condition is `2`; however, in
the latter, the value of the conditional must be `1`. If our analysis were
intelligent (our first few will not be), then this difference would change
its output when determining the signs of the following program:
```
x = 2
if x {
    y = -1
} else {
    y = 1
}
```
Using the first, more "relaxed" rule, the condition would be considered "true",
and the sign of `y` would be `-`. On the other hand, using the second,
"stricter" rule, the sign of `y` would be `+`. I stress that in this case,
I am showing a flow-sensitive analysis (one that can understand control flow
and make more specific predictions); for our simplest analyses, we will not
be aiming for flow-sensitivity. There is plenty of work to do even then.
The point of showing these two distinct rules is that we need to be very precise about how the language will behave, because our analyses depend on that behavior.
Let's not get ahead of ourselves, though. I've motivated the need for semantics, but there is much groundwork to be laid before we delve into the precise rules of our language. After all, to define the language's semantics, we need to have a language.
### The Syntax of Our Simple Language
I've shown a couple of examples of our language now, and there won't be that
much more to it. We can start with expressions: things that evaluate to
something. Some examples of expressions are `1`, `x`, and `2-(x+y)`. For our
specific language, the precise set of possible expressions can be given
by the following Context-Free Grammar:
{{< latex >}}
\begin{array}{rcll}
e & ::= & x & \text{(variables)} \\
  & |   & z & \text{(integer literals)} \\
  & |   & e + e & \text{(addition)} \\
  & |   & e - e & \text{(subtraction)}
\end{array}
{{< /latex >}}
The above can be read as follows: an expression \(e\) is one of the following things:

- Some variable \(x\) [importantly, \(x\) is a placeholder for any variable, which could be `x` or `y` in our program code; specifically, \(x\) is a metavariable].
- Some integer \(z\) [once again, \(z\) can be any integer, like 1, -42, etc.].
- The addition of two other expressions [which could themselves be additions, etc.].
- The subtraction of two other expressions [which could also themselves be additions, subtractions, etc.].
Since expressions can be nested within other expressions --- which is necessary
to allow complicated code like `2-(x+y)` above --- they form a tree. Each node
is one of the elements of the grammar above (variable, addition, etc.). If
a node contains sub-expressions (like addition and subtraction do), then
these sub-expressions form sub-trees of the given node. This data structure
is called an Abstract Syntax Tree.

Notably, though `2-(x+y)` has parentheses, our grammar above does not include
them as a case. The reason for this is that the structure of an
abstract syntax tree is sufficient to encode the order in which the operations
should be evaluated.
{{< todo >}} Probably two drawings of differently-associated ASTs here. {{< /todo >}}
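In the meantime, we can make the tree structure concrete with a small sketch. Below, I use nested Python tuples as a stand-in for the Agda data type we are about to define; the `lit`/`var`/`plus`/`minus` tags are invented for this illustration, not the formalization's actual constructors.

```python
# 2-(x+y) as an abstract syntax tree, encoded as tagged tuples.
# The tags ("lit", "var", "plus", "minus") are invented for this sketch.
expr = ("minus", ("lit", 2), ("plus", ("var", "x"), ("var", "y")))

# The parentheses from the source text are gone: grouping is encoded by
# the nesting itself. The "plus" node is a subtree of the "minus" node,
# so the addition is performed before the subtraction.
outer_op, left, right = expr
print(outer_op)   # prints: minus
print(right[0])   # prints: plus
```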
To an Agda programmer, the one-of-four-things definition above should read quite similarly to the definition of an algebraic data type. Indeed, this is how we can encode the abstract syntax tree of expressions:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 12 16 >}}
The only departure from the grammar above is that I had to invent constructors
for the variable and integer cases, since Agda doesn't support implicit coercions.
This adds a little bit of extra overhead, requiring, for example, that we write
numbers as `# 42` instead of `42`.
Having defined expressions, the next thing on the menu is statements. Unlike expressions, which just produce values, statements "do something"; an example of a statement might be the following Python line:
```
print("Hello, world!")
```

The `print` function doesn't produce any value, but it does perform an action;
it prints its argument to the console!
For the formalization, it turns out to be convenient to separate "simple" statements from "complex" ones. Pragmatically speaking, the difference between the "simple" and the "complex" is control flow; simple statements will be guaranteed to always execute without any decisions or jumps. The reason for this will become clearer in subsequent posts; I will foreshadow a bit by saying that consecutive simple statements can be placed into a single basic block.
The following is a group of three simple statements:
```
x = 1
y = x + 2
noop
```

These will always be executed in the same order, exactly once. Here, `noop`
is a convenient type of statement that simply does nothing.
On the other hand, the following statement is not simple:
```
while x {
    x = x - 1
}
```
It's not simple because it makes decisions about how the code should be executed;
if `x` is nonzero, it will try executing the statement in the body of the loop
(`x = x - 1`). Otherwise, it will skip evaluating that statement, and carry on
with subsequent code.
I first define simple statements using the `BasicStmt` type:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 18 20 >}}
Complex statements are just called `Stmt`; they include loops, conditionals, and
sequences ---
{{< sidenote "right" "then-note" "\(s_1\ \text{then}\ s_2\)" >}}
The standard notation for sequencing in imperative languages is
\(s_1; s_2\). However, Agda gives special meaning to the semicolon,
and I couldn't find any passable symbolic alternatives.
{{< /sidenote >}} is a sequence where \(s_2\) is evaluated after \(s_1\).
Complex statements subsume simple statements, which I model using the constructor
`⟨_⟩`.
{{< codelines "Agda" "agda-spa/Language/Base.agda" 25 29 >}}
For an example of using this encoding, take the following simple program:
```
var = 1
if var {
    x = 1
}
```
The Agda version is:
{{< codelines "Agda" "agda-spa/Main.agda" 27 34 >}}
Notice how we used `noop` to express the fact that the `else` branch of the
conditional does nothing.
### The Semantics of Our Language
We now have all the language constructs that I'll be showing off --- because those are all the concepts that I've formalized. What's left is to define how they behave. We will do this using a logical tool called inference rules. I've written about them a number of times; they're ubiquitous, particularly in the sorts of things I like to explore on this site. The [section on inference rules]({{< relref "01_aoc_coq#inference-rules" >}}) from my Advent of Code series is pretty relevant, and [the notation section from a post in my compiler series]({{< relref "03_compiler_typechecking#some-notation" >}}) says much the same thing; I won't be re-describing them here.
There are three pieces which demand semantics: expressions, simple statements, and non-simple statements. The semantics of each of the three requires the semantics of the items that precede it. We will therefore start with expressions.
#### Expressions
The trickiest thing about expressions is that the value of an expression depends
on the "context": `x+1` can evaluate to `43` if `x` is `42`, or it can evaluate
to `0` if `x` is `-1`. To evaluate an expression, we will therefore need to
assign values to all of the variables in that expression. A mapping that
assigns values to variables is typically called an environment. We will write
\(\varnothing\) for "empty environment", and \(\{\texttt{x} \mapsto 42, \texttt{y} \mapsto -1 \}\) for
an environment that maps the variable \(\texttt{x}\) to 42, and the variable \(\texttt{y}\)
to -1.
Now, a bit of notation. We will use the letter \(\rho\) to represent environments
(and if several environments are involved, we will occasionally number them
as \(\rho_1\), \(\rho_2\), etc.). We will use the letter \(e\) to stand for
expressions, and the letter \(v\) to stand for values. Finally, we'll write
\(\rho, e \Downarrow v\) to say that "in an environment \(\rho\), expression \(e\)
evaluates to value \(v\)". Our two previous examples of evaluating `x+1` can
thus be written as follows:
{{< latex >}}
\{ \texttt{x} \mapsto 42 \}, \texttt{x}+1 \Downarrow 43 \\
\{ \texttt{x} \mapsto -1 \}, \texttt{x}+1 \Downarrow 0
{{< /latex >}}
Now, on to the actual rules for how to evaluate expressions. Most simply,
integer literals like `1` just evaluate to themselves.
{{< latex >}} \frac{n \in \text{Int}}{\rho, n \Downarrow n} {{< /latex >}}
Note that the letter \(\rho\) is completely unused in the above rule. That's
because no matter what values variables have, a number still evaluates to
the same value. As we've already established, the same is not true for a
variable like \(\texttt{x}\). To evaluate such a variable, we need to retrieve
the value it's mapped to in the current environment, which we will write as
\(\rho(\texttt{x})\). This gives the following inference rule:
{{< latex >}} \frac{\rho(x) = v}{\rho, x \Downarrow v} {{< /latex >}}
All that's left is to define addition and subtraction. For an expression in the
form \(e_1+e_2\), we first need to evaluate the two subexpressions \(e_1\)
and \(e_2\), and then add the two resulting numbers. As a result, the addition
rule includes two additional premises, one for evaluating each summand.
{{< latex >}} \frac {\rho, e_1 \Downarrow v_1 \quad \rho, e_2 \Downarrow v_2 \quad v_1 + v_2 = v} {\rho, e_1+e_2 \Downarrow v} {{< /latex >}}
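To see how these rules fit together, here is the complete derivation tree for one of our earlier examples, evaluating `x+1` in the environment \(\{\texttt{x} \mapsto 42\}\): the variable rule and the literal rule discharge the two sub-evaluations, and the addition rule combines their results.

{{< latex >}}
\frac
    {\dfrac{\rho(\texttt{x}) = 42}{\rho, \texttt{x} \Downarrow 42} \quad
     \dfrac{1 \in \text{Int}}{\rho, 1 \Downarrow 1} \quad
     42 + 1 = 43}
    {\rho, \texttt{x}+1 \Downarrow 43}
\qquad \text{where}\ \rho = \{ \texttt{x} \mapsto 42 \}
{{< /latex >}}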
The subtraction rule is similar. Below, I've configured an instance of
[Bergamot]({{< relref "bergamot" >}}) to interpret these exact rules. Try
typing various expressions like `1`, `1+1`, etc. into the input box below
to see them evaluate. If you click the "Full Proof Tree" button, you can also view
the exact rules that were used in computing a particular value. The variables
`x`, `y`, and `z` are pre-defined for your convenience.
{{< bergamot_widget id="expr-widget" query="" prompt="eval(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?v)" modes="Expression:Expression" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}
The Agda equivalent of this looks very similar to the rules themselves. I use
`⇒ᵉ` instead of \(\Downarrow\), and there's a little bit of tedium with
wrapping integers into a new `Value` type. I also used a (partial) relation
`(x, v) ∈ ρ` instead of explicitly defining accessing an environment, since
it is conceivable for a user to attempt accessing a variable that has not
been assigned to. Aside from these notational changes, the structure of each
of the constructors of the evaluation data type matches the inference rules
I showed above.
{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 27 35 >}}
#### Simple Statements
The main difference between formalizing (simple and "normal") statements and
formalizing expressions is that statements modify the environment. If `x`
has one value, writing `x = x + 1` will certainly change that value. On the
other hand, statements don't produce values.
So, we will be writing claims like \(\rho_1 , \textit{bs} \Rightarrow \rho_2\)
to say that the basic statement \(\textit{bs}\), when starting in environment
\(\rho_1\), will produce environment \(\rho_2\). Here's an example:
{{< latex >}}
\{ \texttt{x} \mapsto 42, \texttt{y} \mapsto 17 \},\ \texttt{x = x - 1} \Rightarrow \{ \texttt{x} \mapsto 41, \texttt{y} \mapsto 17 \}
{{< /latex >}}
Here, we subtracted one from a variable with value `42`, leaving it with a new
value of `41`.
There are two basic statements, and one of them quite literally does nothing.
The inference rule for `noop` is very simple:
{{< latex >}} \rho,\ \texttt{noop} \Rightarrow \rho {{< /latex >}}
For the assignment rule, we need to know how to evaluate the expression on the
right side of the equal sign. This is why we needed to define the semantics
of expressions first. Given those, the evaluation rule for assignment is as
follows, with \(\rho[x \mapsto v]\) meaning "the environment \(\rho\), but
mapping the variable \(x\) to value \(v\)".
{{< latex >}} \frac {\rho, e \Downarrow v} {\rho, x = e \Rightarrow \rho[x \mapsto v]} {{< /latex >}}
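As a quick sanity check, the example from the beginning of this section is exactly an instance of this rule. With \(\rho = \{\texttt{x} \mapsto 42, \texttt{y} \mapsto 17\}\), the subtraction rule gives us the premise \(\rho, \texttt{x} - 1 \Downarrow 41\), and the conclusion's environment \(\rho[\texttt{x} \mapsto 41]\) is precisely \(\{\texttt{x} \mapsto 41, \texttt{y} \mapsto 17\}\):

{{< latex >}}
\frac
    {\rho, \texttt{x} - 1 \Downarrow 41}
    {\rho, \texttt{x = x - 1} \Rightarrow \rho[\texttt{x} \mapsto 41]}
\qquad \text{where}\ \rho = \{ \texttt{x} \mapsto 42, \texttt{y} \mapsto 17 \}
{{< /latex >}}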
Those are actually all the rules we need, and below, I am once again configuring
a Bergamot instance, this time with simple statements. Try out `noop` or some
sort of variable assignment, like `x = x + 1`.
{{< bergamot_widget id="basic-stmt-widget" query="" prompt="stepbasic(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Basic Statement:Basic Statement" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } hidden section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } hidden section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}
The Agda implementation is once again just a data type with constructors-for-rules.
This time they also look quite similar to the rules I've shown up until now,
though I continue to explicitly quantify over variables like `ρ`.
{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 37 40 >}}
#### Statements
Let's work on non-simple statements next. The easiest rule to define is probably
sequencing. When we use `then` (or `;`) to combine two statements, what we
actually want is to execute the first statement, which may change variables,
and then execute the second statement while keeping the changes from the first.
This means there are three environments: \(\rho_1\) for the initial state before
either statement is executed, \(\rho_2\) for the state between executing the
first and second statement, and \(\rho_3\) for the final state after both
are done executing. This leads to the following rule:
{{< latex >}} \frac { \rho_1, s_1 \Rightarrow \rho_2 \quad \rho_2, s_2 \Rightarrow \rho_3 } { \rho_1, s_1; s_2 \Rightarrow \rho_3 } {{< /latex >}}
We will actually need two rules to evaluate the conditional statement: one for when the condition evaluates to "true", and one for when the condition evaluates to "false". Only, I never specified booleans as being part of the language, which means that we will need to come up with what "false" and "true" are. I will take my cue from C++ and use zero as "false", and any other number as "true".
If the condition of an `if`-`else` statement is "true" (nonzero), then the
effect of executing the `if`-`else` should be the same as executing the "then"
part of the statement, while completely ignoring the "else" part.
{{< latex >}}
\frac
    { \rho_1 , e \Downarrow v \quad v \neq 0 \quad \rho_1, s_1 \Rightarrow \rho_2 }
    { \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 }
{{< /latex >}}
Notice that in the above rule, we used the evaluation judgement \(\rho_1, e \Downarrow v\)
to evaluate the expression that serves as the condition. We then had an
additional premise that requires the truthiness of the resulting value \(v\).
The rule for evaluating a conditional with a "false" branch is very similar.
{{< latex >}}
\frac
    { \rho_1 , e \Downarrow v \quad v = 0 \quad \rho_1, s_2 \Rightarrow \rho_2 }
    { \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 }
{{< /latex >}}
Now that we have rules for conditional statements, it will be surprisingly easy
to define the rules for `while` loops. A `while` loop will also have two rules,
one for when its condition is truthy and one for when it's falsy. However,
unlike the "false" case of a conditional, a `while` loop whose condition is
falsy will do nothing, leaving the environment unchanged:
{{< latex >}} \frac { \rho_1 , e \Downarrow v \quad v = 0 } { \rho_1 , \textbf{while}\ e\ { s }\ \Rightarrow \rho_1 } {{< /latex >}}
The trickiest rule is for when the condition of a `while` loop is true.
We evaluate the body once, starting in environment \(\rho_1\) and finishing
in \(\rho_2\), but then we're not done. In fact, we have to go back to the top,
and check the condition again, starting over. As a result, we include another
premise, that tells us that evaluating the loop starting at \(\rho_2\), we
eventually end in state \(\rho_3\). This encodes the "rest of the iterations"
in addition to the one we just performed. The environment \(\rho_3\) is our
final state, so that's what we use in the rule's conclusion.
{{< latex >}}
\frac
    { \rho_1 , e \Downarrow v \quad v \neq 0 \quad \rho_1 , s \Rightarrow \rho_2 \quad \rho_2 , \textbf{while}\ e\ \{ s \}\ \Rightarrow \rho_3 }
    { \rho_1 , \textbf{while}\ e\ \{ s \}\ \Rightarrow \rho_3 }
{{< /latex >}}
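For intuition, consider running \(\textbf{while}\ \texttt{x}\ \{ \texttt{x = x - 1} \}\) starting from \(\{\texttt{x} \mapsto 1\}\). The "true" rule fires once (the body leaves \(\texttt{x}\) at \(0\)), and its last premise is discharged by the "false" rule, which terminates the loop:

{{< latex >}}
\frac
    {\rho_1, \texttt{x} \Downarrow 1 \quad 1 \neq 0 \quad
     \rho_1, \texttt{x = x - 1} \Rightarrow \rho_2 \quad
     \dfrac{\rho_2, \texttt{x} \Downarrow 0 \quad 0 = 0}
           {\rho_2, \textbf{while}\ \texttt{x}\ \{ \texttt{x = x - 1} \}\ \Rightarrow \rho_2}}
    {\rho_1, \textbf{while}\ \texttt{x}\ \{ \texttt{x = x - 1} \}\ \Rightarrow \rho_2}
{{< /latex >}}

Here \(\rho_1 = \{\texttt{x} \mapsto 1\}\) and \(\rho_2 = \{\texttt{x} \mapsto 0\}\).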
And that's it! We have now seen every rule that defines the little object language I've been using for my Agda work. As with all the other rules we've seen, the mathematical notation above can be directly translated into Agda:
{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 47 64 >}}
Below is a Bergamot widget that implements these rules. Try the following program,
which computes the `x`th power of two, and stores it in `y`:

```
x = 5; y = 1; while (x) { y = y + y; x = x - 1 }
```
{{< bergamot_widget id="stmt-widget" query="" prompt="step(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Statement:Statement" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } hidden section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } hidden section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } hidden section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { StepLiftBasic @ step(?rho_1, ?s, ?rho_2) <- stepbasic(?rho_1, ?s, ?rho_2); } section "" { StepIfTrue @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), not(eq(?v, 0)), step(?rho_1, ?s_1, ?rho_2); StepIfFalse @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), eq(?v, 0), step(?rho_1, ?s_2, ?rho_2); StepWhileTrue @ step(?rho_1, while(?e, ?s), ?rho_3) <- eval(?rho_1, ?e, ?v), not(eq(?v, 0)), step(?rho_1, ?s, ?rho_2), step(?rho_2, while(?e, ?s), ?rho_3); StepWhileFalse @ step(?rho_1, while(?e, ?s), ?rho_1) <- eval(?rho_1, ?e, ?v), eq(?v, 0); StepSeq @ step(?rho_1, seq(?s_1, ?s_2), ?rho_3) <- step(?rho_1, ?s_1, ?rho_2), step(?rho_2, ?s_2, ?rho_3); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}
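If you'd like to experiment with these rules outside of the widget, they transcribe almost mechanically into a throwaway interpreter. The sketch below is in Python rather than Agda, and its tuple-based AST encoding is invented for this example; it mirrors the inference rules above, but it is not part of the formalization.

```python
def eval_expr(rho, e):
    """Evaluate an expression under environment rho (a dict), per the ⇓ rules."""
    kind = e[0]
    if kind == "lit":                     # literals evaluate to themselves
        return e[1]
    if kind == "var":                     # look the variable up in rho
        return rho[e[1]]
    if kind == "plus":
        return eval_expr(rho, e[1]) + eval_expr(rho, e[2])
    if kind == "minus":
        return eval_expr(rho, e[1]) - eval_expr(rho, e[2])
    raise ValueError(kind)

def step(rho, s):
    """Execute a statement, returning the new environment, per the ⇒ rules."""
    kind = s[0]
    if kind == "noop":                    # noop leaves rho unchanged
        return rho
    if kind == "assign":                  # rho[x ↦ v]
        return {**rho, s[1]: eval_expr(rho, s[2])}
    if kind == "seq":                     # s1 then s2
        return step(step(rho, s[1]), s[2])
    if kind == "if":                      # nonzero condition means "true"
        branch = s[2] if eval_expr(rho, s[1]) != 0 else s[3]
        return step(rho, branch)
    if kind == "while":
        if eval_expr(rho, s[1]) == 0:
            return rho                    # false: do nothing
        rho2 = step(rho, s[2])            # true: run the body once...
        return step(rho2, s)              # ...then start over from rho2
    raise ValueError(kind)

# x = 5; y = 1; while (x) { y = y + y; x = x - 1 }
prog = ("seq", ("assign", "x", ("lit", 5)),
        ("seq", ("assign", "y", ("lit", 1)),
         ("while", ("var", "x"),
          ("seq", ("assign", "y", ("plus", ("var", "y"), ("var", "y"))),
           ("assign", "x", ("minus", ("var", "x"), ("lit", 1)))))))

print(step({}, prog))  # prints: {'x': 0, 'y': 32}
```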
### Semantics as Ground Truth
Prior to this post, we had been talking about using lattices and monotone functions for program analysis. The key problem with using this framework to define analyses is that there are many monotone functions that produce complete nonsense; their output is, at best, unrelated to the program they're supposed to analyze. We don't want to write such functions, since having incorrect information about the programs in question is unhelpful.
What does it mean for a function to produce correct information, though?
In the context of sign analysis, it would mean that if we say a variable `x`
is `+`, then evaluating the program will leave us in a state in which `x` is
positive. The semantics we defined in this post give us the "evaluating the
program" piece. They establish what the programs actually do, and we can use
this ground truth when checking that our analyses are correct. In subsequent
posts, I will prove the exact property I informally stated above: for the
program analyses we define, the things they "claim" about our program will
match what actually happens when executing the program using our semantics.
A piece of the puzzle still remains: how are we going to use the monotone functions we've been talking so much about? We need to figure out what to feed to our analyses before we can prove their correctness.
I have an answer to that question: we will be using control flow graphs (CFGs). These are another program representation, one that's more commonly found in compilers. I will show what they look like in the next post. I hope to see you there!