---
title: 'Implementing and Verifying "Static Program Analysis" in Agda, Part 5: Our Programming Language'
series: Static Program Analysis in Agda
description: In this post, I define the language that will serve as the object of our various analyses
date: 2024-08-10T17:37:43-07:00
tags:
  - Agda
  - Programming Languages
custom_js: ["parser.js"]
bergamot:
  render_presets:
    default: bergamot/rendering/imp.bergamot
  input_modes:
    - name: Expression
      fn: window.parseExpr
    - name: Basic Statement
      fn: window.parseBasicStmt
    - name: Statement
      fn: window.parseStmt
---

In the previous several posts, I've formalized the notion of lattices, which are an essential ingredient to formalizing the analyses in Anders Møller's lecture notes. However, there can be no program analysis without a program to analyze! In this post, I will define the (very simple) language that we will be analyzing. An essential aspect of the language is its semantics, which, simply speaking, explains what each feature of the language does. At the end of the previous article, I gave the following inference rule, which (partially) defines how the if-else statement in the language works.

{{< latex >}} \frac{\rho_1, e \Downarrow z \quad \neg (z = 0) \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}

Like I mentioned then, this rule reads as follows:

If the condition of an if-else statement evaluates to a nonzero value, then to evaluate the statement, you evaluate its then branch.

Another similar --- but crucially, not equivalent --- rule is the following:

{{< latex >}} \frac{\rho_1, e \Downarrow z \quad z = 1 \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}

This time, the English interpretation of the rule is as follows:

If the condition of an if-else statement evaluates to one, then to evaluate the statement, you evaluate its then branch.

These rules are certainly not equivalent. For instance, the former allows the "then" branch to be executed when the condition is 2; however, in the latter, the value of the conditional must be 1. If our analysis were intelligent (our first few will not be), then this difference would change its output when determining the signs of variables in the following program:

x = 2
if x {
  y = - 1
} else {
  y = 1
}

Using the first, more "relaxed" rule, the condition would be considered "true", and the sign of y would be -. On the other hand, using the second, "stricter" rule, the sign of y would be +. I stress that in this case, I am showing a flow-sensitive analysis (one that can understand control flow and make more specific predictions); for our simplest analyses, we will not be aiming for flow-sensitivity. There is plenty of work to do even then.

The point of showing these two distinct rules is that we need to be very precise about how the language will behave, because our analyses depend on that behavior.

Let's not get ahead of ourselves, though. I've motivated the need for semantics, but there is much groundwork to be laid before we delve into the precise rules of our language. After all, to define the language's semantics, we need to have a language.

The Syntax of Our Simple Language

I've shown a couple of examples of our language now, and there won't be that much more to it. We can start with expressions: things that evaluate to something. Some examples of expressions are 1, x, and 2-(x+y). For our specific language, the precise set of possible expressions can be given by the following Context-Free Grammar:

{{< latex >}} \begin{array}{rcll} e & ::= & x & \text{(variables)} \\ & | & z & \text{(integer literals)} \\ & | & e + e & \text{(addition)} \\ & | & e - e & \text{(subtraction)} \end{array} {{< /latex >}}

The above can be read as follows:

An expression e is one of the following things:

  1. Some variable x [importantly x is a placeholder for any variable, which could be x or y in our program code; specifically, x is a metavariable.]
  2. Some integer z [once again, z can be any integer, like 1, -42, etc.].
  3. The addition of two other expressions [which could themselves be additions etc.].
  4. The subtraction of two other expressions [which could also themselves be additions, subtractions, etc.].

Since expressions can be nested within other expressions --- which is necessary to allow complicated code like 2-(x+y) above --- they form a tree. Each node is one of the elements of the grammar above (variable, addition, etc.). If a node contains sub-expressions (like addition and subtraction do), then these sub-expressions form sub-trees of the given node. This data structure is called an Abstract Syntax Tree.

Notably, though 2-(x+y) has parentheses, our grammar above does not include them as a case. The reason for this is that the structure of an abstract syntax tree is sufficient to encode the order in which the operations should be evaluated. Since I lack a nice way of drawing ASTs, I will use an ASCII drawing to show an example.

Expression: 2 - (x+y)
    (-)
   /   \
  2    (+)
      /   \
     x     y


Expression: (2-x) + y
       (+)
      /   \
    (-)    y
   /   \
  2     x

Above, in the first AST, (+) is a child of the (-) node, which means that it's a sub-expression. As a result, that subexpression is evaluated first, before evaluating (-), and so, the AST represents 2-(x+y). In the other example, (-) is a child of (+), and is therefore evaluated first. The resulting association encoded by that AST is (2-x)+y.
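To make the tree structure concrete, here is a small Python sketch of the two ASTs above. (The class names are my own stand-ins; the post's actual encoding is the Agda data type shown below.)

```python
from dataclasses import dataclass

# Each case of the grammar becomes a node class; dataclasses give us
# structural equality for free.
@dataclass
class Var:
    name: str

@dataclass
class IntLit:
    value: int

@dataclass
class Plus:
    left: "object"
    right: "object"

@dataclass
class Minus:
    left: "object"
    right: "object"

# 2 - (x+y): the (+) node is a child of the (-) node.
tree1 = Minus(IntLit(2), Plus(Var("x"), Var("y")))

# (2-x) + y: the (-) node is a child of the (+) node.
tree2 = Plus(Minus(IntLit(2), Var("x")), Var("y"))

# The two trees are distinct even though neither stores any parentheses:
# the nesting alone records the evaluation order.
print(tree1 == tree2)  # False
```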

To an Agda programmer, the one-of-four-things definition above should read quite similarly to the definition of an algebraic data type. Indeed, this is how we can encode the abstract syntax tree of expressions:

{{< codelines "Agda" "agda-spa/Language/Base.agda" 12 16 >}}

The only departure from the grammar above is that I had to invent constructors for the variable and integer cases, since Agda doesn't support implicit coercions. This adds a little bit of extra overhead, requiring, for example, that we write numbers as # 42 instead of 42.

Having defined expressions, the next thing on the menu is statements. Unlike expressions, which just produce values, statements "do something"; an example of a statement might be the following Python line:

print("Hello, world!")

The print function doesn't produce any value, but it does perform an action; it prints its argument to the console!

For the formalization, it turns out to be convenient to separate "simple" statements from "complex" ones. Pragmatically speaking, the difference between the "simple" and the "complex" is control flow; simple statements will be guaranteed to always execute without any decisions or jumps. The reason for this will become clearer in subsequent posts; I will foreshadow a bit by saying that consecutive simple statements can be placed into a single basic block.

The following is a group of three simple statements:

x = 1
y = x + 2
noop

These will always be executed in the same order, exactly once. Here, noop is a convenient type of statement that simply does nothing.

On the other hand, the following statement is not simple:

while x {
  x = x - 1
}

It's not simple because it makes decisions about how the code should be executed; if x is nonzero, it will try executing the statement in the body of the loop (x = x - 1). Otherwise, it will skip that statement, and carry on with subsequent code.

I first define simple statements using the BasicStmt type:

{{< codelines "Agda" "agda-spa/Language/Base.agda" 18 20 >}}

Complex statements are just called Stmt; they include loops, conditionals and sequences --- {{< sidenote "right" "then-note" "(s_1\ \text{then}\ s_2)" >}} The standard notation for sequencing in imperative languages is s_1; s_2. However, Agda gives special meaning to the semicolon, and I couldn't find any passable symbolic alternatives. {{< /sidenote >}} is a sequence where s_2 is evaluated after s_1. Complex statements subsume simple statements, which I model using the constructor ⟨_⟩.

{{< codelines "Agda" "agda-spa/Language/Base.agda" 25 29 >}}

For an example of using this encoding, take the following simple program:

var = 1
if var {
  x = 1
}

The Agda version is:

{{< codelines "Agda" "agda-spa/Main.agda" 27 34 >}}

Notice how we used noop to express the fact that the else branch of the conditional does nothing.

The Semantics of Our Language

We now have all the language constructs that I'll be showing off --- because those are all the concepts that I've formalized. What's left is to define how they behave. We will do this using a logical tool called inference rules. I've written about them a number of times; they're ubiquitous, particularly in the sorts of things I like to explore on this site. The [section on inference rules]({{< relref "01_aoc_coq#inference-rules" >}}) from my Advent of Code series is pretty relevant, and [the notation section from a post in my compiler series]({{< relref "03_compiler_typechecking#some-notation" >}}) says much the same thing; I won't be re-describing them here.

There are three pieces which demand semantics: expressions, simple statements, and non-simple statements. The semantics of each of the three requires the semantics of the items that precede it. We will therefore start with expressions.

Expressions

The trickiest thing about expressions is that the value of an expression depends on the "context": x+1 can evaluate to 43 if x is 42, or it can evaluate to 0 if x is -1. To evaluate an expression, we will therefore need to assign values to all of the variables in that expression. A mapping that assigns values to variables is typically called an environment. We will write \varnothing for "empty environment", and \{\texttt{x} \mapsto 42, \texttt{y} \mapsto -1 \} for an environment that maps the variable \texttt{x} to 42, and the variable \texttt{y} to -1.

Now, a bit of notation. We will use the letter \rho to represent environments (and if several environments are involved, we will occasionally number them as \rho_1, \rho_2, etc.) We will use the letter e to stand for expressions, and the letter v to stand for values. Finally, we'll write \rho, e \Downarrow v to say that "in an environment \rho, the expression e evaluates to the value v". Our two previous examples of evaluating x+1 can thus be written as follows:

{{< latex >}} \{ \texttt{x} \mapsto 42 \}, \texttt{x}+1 \Downarrow 43 \\ \{ \texttt{x} \mapsto -1 \}, \texttt{x}+1 \Downarrow 0 {{< /latex >}}

Now, on to the actual rules for how to evaluate expressions. Most simply, integer literals like 1 just evaluate to themselves.

{{< latex >}} \frac{n \in \text{Int}}{\rho, n \Downarrow n} {{< /latex >}}

Note that the letter \rho is completely unused in the above rule. That's because no matter what values variables have, a number still evaluates to the same value. As we've already established, the same is not true for a variable like \texttt{x}. To evaluate such a variable, we need to retrieve the value it's mapped to in the current environment, which we will write as \rho(\texttt{x}). This gives the following inference rule:

{{< latex >}} \frac{\rho(x) = v}{\rho, x \Downarrow v} {{< /latex >}}

All that's left is to define addition and subtraction. For an expression in the form e_1+e_2, we first need to evaluate the two subexpressions e_1 and e_2, and then add the two resulting numbers. As a result, the addition rule includes two additional premises, one for evaluating each summand.

{{< latex >}} \frac {\rho, e_1 \Downarrow v_1 \quad \rho, e_2 \Downarrow v_2 \quad v_1 + v_2 = v} {\rho, e_1+e_2 \Downarrow v} {{< /latex >}}

The subtraction rule is similar. Below, I've configured an instance of [Bergamot]({{< relref "bergamot" >}}) to interpret these exact rules. Try typing various expressions like 1, 1+1, etc. into the input box below to see them evaluate. If you click the "Full Proof Tree" button, you can also view the exact rules that were used in computing a particular value. The variables x, y, and z are pre-defined for your convenience.

{{< bergamot_widget id="expr-widget" query="" prompt="eval(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?v)" modes="Expression:Expression" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}

The Agda equivalent of this looks very similar to the rules themselves. I use ⇒ᵉ instead of \Downarrow, and there's a little bit of tedium with wrapping integers into a new Value type. I also used a (partial) relation (x, v) ∈ ρ instead of explicitly defining accessing an environment, since it is conceivable for a user to attempt accessing a variable that has not been assigned to. Aside from these notational changes, the structure of each of the constructors of the evaluation data type matches the inference rules I showed above.

{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 27 35 >}}
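For readers who want a quick executable picture of these rules outside of Agda, here is a small Python sketch (my own stand-in, not the post's code): environments become dicts, and each branch of the function corresponds to one inference rule above.

```python
# Expressions are tagged tuples: ("lit", n), ("var", x),
# ("plus", e1, e2), ("minus", e1, e2). Environments are dicts.
def eval_expr(rho, e):
    kind = e[0]
    if kind == "lit":
        return e[1]                  # rho, n ⇓ n
    if kind == "var":
        return rho[e[1]]             # rho(x) = v  implies  rho, x ⇓ v
    if kind == "plus":               # evaluate both sides, then add
        return eval_expr(rho, e[1]) + eval_expr(rho, e[2])
    if kind == "minus":              # evaluate both sides, then subtract
        return eval_expr(rho, e[1]) - eval_expr(rho, e[2])
    raise ValueError(f"unknown expression: {e!r}")

# { x ↦ 42 }, x+1 ⇓ 43
print(eval_expr({"x": 42}, ("plus", ("var", "x"), ("lit", 1))))  # 43
```

Like the Agda relation, accessing an unbound variable has no result; here that surfaces as a `KeyError` rather than an unprovable judgement.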

Simple Statements

The main difference between formalizing statements (simple and "normal" alike) and formalizing expressions is that statements modify the environment. If x has one value, writing x = x + 1 will certainly change that value. On the other hand, statements don't produce values. So, we will be writing claims like \rho_1, \textit{bs} \Rightarrow \rho_2 to say that the basic statement \textit{bs}, when starting in environment \rho_1, will produce environment \rho_2. Here's an example:

{{< latex >}} \{ \texttt{x} \mapsto 42, \texttt{y} \mapsto 17 \},\ \texttt{x = x - 1} \Rightarrow \{ \texttt{x} \mapsto 41, \texttt{y} \mapsto 17 \} {{< /latex >}}

Here, we subtracted one from a variable with value 42, leaving it with a new value of 41.

There are two basic statements, and one of them quite literally does nothing. The inference rule for noop is very simple:

{{< latex >}} \rho,\ \texttt{noop} \Rightarrow \rho {{< /latex >}}

For the assignment rule, we need to know how to evaluate the expression on the right side of the equal sign. This is why we needed to define the semantics of expressions first. Given those, the evaluation rule for assignment is as follows, with \rho[x \mapsto v] meaning "the environment \rho, but mapping the variable x to the value v".

{{< latex >}} \frac {\rho, e \Downarrow v} {\rho, x = e \Rightarrow \rho[x \mapsto v]} {{< /latex >}}

Those are actually all the rules we need, and below, I am once again configuring a Bergamot instance, this time with simple statements. Try out noop or some sort of variable assignment, like x = x + 1.

{{< bergamot_widget id="basic-stmt-widget" query="" prompt="stepbasic(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Basic Statement:Basic Statement" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } hidden section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } hidden section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}

The Agda implementation is once again just a data type with constructors-for-rules. This time they also look quite similar to the rules I've shown up until now, though I continue to explicitly quantify over variables like ρ.

{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 37 40 >}}
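Continuing the Python sketch from before (again, a stand-in of my own rather than the post's Agda), the two rules translate into a function from an environment and a basic statement to a new environment:

```python
# Expressions as tagged tuples, environments as dicts.
def eval_expr(rho, e):
    k = e[0]
    if k == "lit": return e[1]
    if k == "var": return rho[e[1]]
    l, r = eval_expr(rho, e[1]), eval_expr(rho, e[2])
    return l + r if k == "plus" else l - r

def step_basic(rho, bs):
    if bs[0] == "noop":              # rho, noop => rho
        return rho
    if bs[0] == "assign":            # rho, x = e => rho[x -> v]
        _, x, e = bs
        return {**rho, x: eval_expr(rho, e)}
    raise ValueError(f"not a basic statement: {bs!r}")

rho = {"x": 42, "y": 17}
new_rho = step_basic(rho, ("assign", "x", ("minus", ("var", "x"), ("lit", 1))))
print(new_rho)  # {'x': 41, 'y': 17}
```

Note that `step_basic` builds a fresh dict rather than mutating `rho`, mirroring how the inference rules relate an old environment to a new one instead of updating anything in place.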

Statements

Let's work on non-simple statements next. The easiest rule to define is probably sequencing. When we use then (or ;) to combine two statements, what we actually want is to execute the first statement, which may change variables, and then execute the second statement while keeping the changes from the first. This means there are three environments: \rho_1 for the initial state before either statement is executed, \rho_2 for the state between executing the first and second statement, and \rho_3 for the final state after both are done executing. This leads to the following rule:

{{< latex >}} \frac { \rho_1, s_1 \Rightarrow \rho_2 \quad \rho_2, s_2 \Rightarrow \rho_3 } { \rho_1, s_1; s_2 \Rightarrow \rho_3 } {{< /latex >}}

We will actually need two rules to evaluate the conditional statement: one for when the condition evaluates to "true", and one for when it evaluates to "false". Only, I never specified booleans as being part of the language, which means that we will need to come up with what "false" and "true" are. I will take my cue from C++ and use zero as "false", and any other number as "true".

If the condition of an if-else statement is "true" (nonzero), then the effect of executing the if-else should be the same as executing the "then" part of the statement, while completely ignoring the "else" part.

{{< latex >}} \frac { \rho_1 , e \Downarrow v \quad v \neq 0 \quad \rho_1, s_1 \Rightarrow \rho_2} { \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 } {{< /latex >}}

Notice that in the above rule, we used the evaluation judgement (\rho_1, e \Downarrow v) to evaluate the expression that serves as the condition. We then had an additional premise that requires the truthiness of the resulting value v. The rule for evaluating a conditional with a "false" branch is very similar.

{{< latex >}} \frac { \rho_1 , e \Downarrow v \quad v = 0 \quad \rho_1, s_2 \Rightarrow \rho_2} { \rho_1, \textbf{if}\ e\ \{ s_1 \}\ \textbf{else}\ \{ s_2 \}\ \Rightarrow \rho_2 } {{< /latex >}}

Now that we have rules for conditional statements, it will be surprisingly easy to define the rules for while loops. A while loop will also have two rules, one for when its condition is truthy and one for when it's falsey. However, unlike an if-else statement, a while loop whose condition is "false" does nothing at all, leaving the environment unchanged:

{{< latex >}} \frac { \rho_1 , e \Downarrow v \quad v = 0 } { \rho_1 , \textbf{while}\ e\ \{ s \}\ \Rightarrow \rho_1 } {{< /latex >}}

The trickiest rule is for when the condition of a while loop is true. We evaluate the body once, starting in environment \rho_1 and finishing in \rho_2, but then we're not done. In fact, we have to go back to the top, and check the condition again, starting over. As a result, we include another premise, which tells us that evaluating the loop starting in \rho_2 eventually ends in state \rho_3. This encodes the "rest of the iterations" in addition to the one we just performed. The environment \rho_3 is our final state, so that's what we use in the rule's conclusion.

{{< latex >}} \frac { \rho_1 , e \Downarrow v \quad v \neq 0 \quad \rho_1 , s \Rightarrow \rho_2 \quad \rho_2 , \textbf{while}\ e\ \{ s \}\ \Rightarrow \rho_3 } { \rho_1 , \textbf{while}\ e\ \{ s \}\ \Rightarrow \rho_3 } {{< /latex >}}

And that's it! We have now seen every rule that defines the little object language I've been using for my Agda work. Below is a Bergamot widget that implements these rules. Try the following program, which computes the xth power of two, and stores it in y:

x = 5; y = 1; while (x) { y = y + y; x = x - 1 }

{{< bergamot_widget id="stmt-widget" query="" prompt="step(extend(extend(extend(empty, x, 17), y, 42), z, 0), TERM, ?env)" modes="Statement:Statement" >}} hidden section "" { Eq @ eq(?x, ?x) <-; } hidden section "" { EvalNum @ eval(?rho, lit(?n), ?n) <- int(?n); EvalVar @ eval(?rho, var(?x), ?v) <- inenv(?x, ?v, ?rho); } hidden section "" { EvalPlus @ eval(?rho, plus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), add(?v_1, ?v_2, ?v); EvalMinus @ eval(?rho, minus(?e_1, ?e_2), ?v) <- eval(?rho, ?e_1, ?v_1), eval(?rho, ?e_2, ?v_2), subtract(?v_1, ?v_2, ?v); } hidden section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { StepNoop @ stepbasic(?rho, noop, ?rho) <-; StepAssign @ stepbasic(?rho, assign(?x, ?e), extend(?rho, ?x, ?v)) <- eval(?rho, ?e, ?v); } hidden section "" { StepLiftBasic @ step(?rho_1, ?s, ?rho_2) <- stepbasic(?rho_1, ?s, ?rho_2); } section "" { StepIfTrue @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), not(eq(?v, 0)), step(?rho_1, ?s_1, ?rho_2); StepIfFalse @ step(?rho_1, if(?e, ?s_1, ?s_2), ?rho_2) <- eval(?rho_1, ?e, ?v), eq(?v, 0), step(?rho_1, ?s_2, ?rho_2); StepWhileTrue @ step(?rho_1, while(?e, ?s), ?rho_3) <- eval(?rho_1, ?e, ?v), not(eq(?v, 0)), step(?rho_1, ?s, ?rho_2), step(?rho_2, while(?e, ?s), ?rho_3); StepWhileFalse @ step(?rho_1, while(?e, ?s), ?rho_1) <- eval(?rho_1, ?e, ?v), eq(?v, 0); StepSeq @ step(?rho_1, seq(?s_1, ?s_2), ?rho_3) <- step(?rho_1, ?s_1, ?rho_2), step(?rho_2, ?s_2, ?rho_3); } hidden section "" { EnvTake @ inenv(?x, ?v, extend(?rho, ?x, ?v)) <-; EnvSkip @ inenv(?x, ?v_1, extend(?rho, ?y, ?v_2)) <- inenv(?x, ?v_1, ?rho), not(eq(?x, ?y)); } {{< /bergamot_widget >}}

As with all the other rules we've seen, the mathematical notation above can be directly translated into Agda:

{{< codelines "Agda" "agda-spa/Language/Semantics.agda" 47 64 >}}
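The full set of statement rules can likewise be sketched as one recursive Python function (again a stand-in of my own, not the post's Agda code); the `while` case directly mirrors the "evaluate the body once, then re-run the loop" premise above:

```python
# Expressions and statements as tagged tuples, environments as dicts.
def eval_expr(rho, e):
    k = e[0]
    if k == "lit": return e[1]
    if k == "var": return rho[e[1]]
    l, r = eval_expr(rho, e[1]), eval_expr(rho, e[2])
    return l + r if k == "plus" else l - r

def step(rho, s):
    k = s[0]
    if k == "noop":
        return rho
    if k == "assign":                          # rho[x -> v]
        return {**rho, s[1]: eval_expr(rho, s[2])}
    if k == "seq":                             # s1 then s2
        return step(step(rho, s[1]), s[2])
    if k == "if":                              # nonzero condition is "true"
        return step(rho, s[2] if eval_expr(rho, s[1]) != 0 else s[3])
    if k == "while":
        if eval_expr(rho, s[1]) == 0:          # falsey: leave rho unchanged
            return rho
        return step(step(rho, s[2]), s)        # one iteration, then loop again
    raise ValueError(f"unknown statement: {s!r}")

# x = 5; y = 1; while (x) { y = y + y; x = x - 1 }
prog = ("seq", ("assign", "x", ("lit", 5)),
        ("seq", ("assign", "y", ("lit", 1)),
         ("while", ("var", "x"),
          ("seq", ("assign", "y", ("plus", ("var", "y"), ("var", "y"))),
           ("assign", "x", ("minus", ("var", "x"), ("lit", 1)))))))

print(step({}, prog)["y"])  # 32, i.e. the 5th power of two
```

One caveat of this sketch: where the inference rules merely fail to derive a result for an infinite loop, the Python function would simply not terminate.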

Semantics as Ground Truth

Prior to this post, we had been talking about using lattices and monotone functions for program analysis. The key problem with using this framework to define analyses is that there are many monotone functions that produce complete nonsense; their output is, at best, unrelated to the program they're supposed to analyze. We don't want to write such functions, since having incorrect information about the programs in question is unhelpful.

What does it mean for a function to produce correct information, though? In the context of sign analysis, it would mean that if we say a variable x is +, then evaluating the program will leave us in a state in which x is positive. The semantics we defined in this post give us the "evaluating the program" piece. They establish what the programs actually do, and we can use this ground truth when checking that our analyses are correct. In subsequent posts, I will prove the exact property I informally stated above: for the program analyses we define, the things they "claim" about our program will match what actually happens when executing the program using our semantics.
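Stated a bit more formally (this is just my paraphrase of the informal property above; the precise Agda theorem comes in later posts), the soundness claim for sign analysis has roughly this shape:

{{< latex >}} (\text{the analysis claims}\ \texttt{x}\ \text{is}\ +) \wedge (\rho_1, s \Rightarrow \rho_2) \implies \rho_2(\texttt{x}) > 0 {{< /latex >}}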

A piece of the puzzle still remains: how are we going to use the monotone functions we've been talking so much about? We need to figure out what to feed to our analyses before we can prove their correctness.

I have an answer to that question: we will be using control flow graphs (CFGs). These are another program representation, one that's more commonly found in compilers. I will show what they look like in the next post. I hope to see you there!