blog-static/05_spa_agda_semantics.md at b854eade3c338cb4db0886e26eb7ff694fb60db1

Danila Fedorin e4101f1396 Fix missing paren

Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>

2024-09-01 15:19:47 -07:00


---
title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 5: Our Programming Language"
series: "Static Program Analysis in Agda"
description: "In this post, I define the language that will serve as the object of our various analyses"
date: 2024-08-10T17:37:43-07:00
tags: ["Agda", "Programming Languages"]
draft: true
---

In the previous several posts, I've formalized the notion of lattices, which are an essential ingredient to formalizing the analyses in Anders Møller's lecture notes. However, there can be no program analysis without a program to analyze! In this post, I will define the (very simple) language that we will be analyzing. An essential aspect of the language is its semantics, which, simply speaking, explains what each feature of the language does. At the end of the previous article, I gave the following inference rule, which (partially) defined how the if-else statement in the language works.

{{< latex >}} \frac{\rho_1, e \Downarrow z \quad \neg (z = 0) \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}

Like I mentioned then, this rule reads as follows:

If the condition of an if-else statement evaluates to a nonzero value, then to evaluate the statement, you evaluate its then branch.

Another similar --- but, crucially, not equivalent --- rule is the following:

{{< latex >}} \frac{\rho_1, e \Downarrow z \quad z = 1 \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}

This time, the English interpretation of the rule is as follows:

If the condition of an if-else statement evaluates to one, then to evaluate the statement, you evaluate its then branch.

These rules are certainly not equivalent. For instance, the former allows the "then" branch to be executed when the condition is 2; however, in the latter, the value of the conditional must be 1. If our analysis were intelligent (our first few will not be), then this difference would change its output when determining the signs of the following program:

```
x = 2
if x {
  y = - 1
} else {
  y = 1
}
```

Using the first, more "relaxed" rule, the condition would be considered "true", and the sign of `y` would be `-`. On the other hand, using the second, "stricter" rule, the sign of `y` would be `+`. I stress that in this case, I am showing a flow-sensitive analysis (one that can understand control flow and make more specific predictions); for our simplest analyses, we will not be aiming for flow-sensitivity. There is plenty of work to do even then.
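To make the difference concrete, here is a sketch of the first rule instantiated for this program. It assumes that the environment after `x = 2` is some \(\rho\) in which \(x\) maps to \(2\), and that executing an assignment updates the environment, which I write as \(\rho[y \mapsto -1]\); this notation is an assumption made for illustration and has not been defined yet.

{{< latex >}}
\frac{\rho, x \Downarrow 2 \quad \neg (2 = 0) \quad \rho, y = -1 \Downarrow \rho[y \mapsto -1]}
{\rho, \textbf{if}\ x\ \textbf{then}\ y = -1\ \textbf{else}\ y = 1\ \Downarrow\ \rho[y \mapsto -1]}
{{< /latex >}}

Under the second rule, the middle premise would become \(2 = 1\), which is false, so that rule cannot be used to conclude that the then branch runs.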

The point of showing these two distinct rules is that we need to be very precise about how the language will behave, because our analyses depend on that behavior.

Let's not get ahead of ourselves, though. I've motivated the need for semantics, but there is much groundwork to be laid before we delve into the precise rules of our language. After all, to define the language's semantics, we need to have a language.

The Syntax of Our Simple Language

I've shown a couple of examples of our language now, and there won't be that much more to it. We can start with expressions: things that evaluate to something. Some examples of expressions are `1`, `x`, and `2-(x+y)`. For our specific language, the precise set of possible expressions can be given by the following Context-Free Grammar:

{{< latex >}}
\begin{array}{rcll}
    e & ::= & x & \text{(variables)} \\
      & |   & z & \text{(integer literals)} \\
      & |   & e + e & \text{(addition)} \\
      & |   & e - e & \text{(subtraction)}
\end{array}
{{< /latex >}}

The above can be read as follows:

An expression \(e\) is one of the following things:

1. Some variable \(x\) [importantly, \(x\) is a placeholder for any variable, which could be `x` or `y` in our program code; specifically, \(x\) is a metavariable].
2. Some integer \(z\) [once again, \(z\) can be any integer, like 1, -42, etc.].
3. The addition of two other expressions [which could themselves be additions, etc.].
4. The subtraction of two other expressions [which could also themselves be additions, subtractions, etc.].

Since expressions can be nested within other expressions --- which is necessary to allow complicated code like 2-(x+y) above --- they form a tree. Each node is one of the elements of the grammar above (variable, addition, etc.). If a node contains sub-expressions (like addition and subtraction do), then these sub-expressions form sub-trees of the given node. This data structure is called an Abstract Syntax Tree.

Notably, though `2-(x+y)` has parentheses, our grammar above does not include them as a case. The reason for this is that the structure of an abstract syntax tree is sufficient to encode the order in which the operations should be evaluated.

{{< todo >}} Probably two drawings of differently-associated ASTs here. {{< /todo >}}

To an Agda programmer, the one-of-four-things definition above should read quite similarly to the definition of an algebraic data type. Indeed, this is how we can encode the abstract syntax tree of expressions:
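A minimal sketch of such a data type follows; it is not necessarily the exact definition used in the actual development. It assumes the standard library's `String` for variable names and `ℕ` for literals, and the variable constructor `` `_ `` is a hypothetical name. The two example terms show how the tree shape alone distinguishes `2-(x+y)` from `(2-x)+y`.

```agda
open import Data.Nat using (ℕ)
open import Data.String using (String)

-- One constructor per production of the grammar.
data Expr : Set where
  _+_ : Expr → Expr → Expr   -- addition
  _-_ : Expr → Expr → Expr   -- subtraction
  `_  : String → Expr        -- variables (constructor name is an assumption)
  #_  : ℕ → Expr             -- literals (ℕ rather than ℤ is an assumption)

-- 2-(x+y): the addition is a sub-tree of the subtraction.
ex₁ : Expr
ex₁ = (# 2) - ((` "x") + (` "y"))

-- (2-x)+y: the same leaves, arranged into a different tree.
ex₂ : Expr
ex₂ = ((# 2) - (` "x")) + (` "y")
```

The parentheses here belong to Agda's own syntax; they only tell Agda which tree to build, and do not appear as nodes in the resulting `Expr`.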

The only departure from the grammar above is that I had to invent constructors for the variable and integer cases, since Agda doesn't support implicit coercions. This adds a little bit of extra overhead, requiring, for example, that we write numbers as `# 42` instead of `42`.

Having defined expressions, the next thing on the menu is statements. Unlike expressions, which just produce values, statements "do something"; an example of a statement might be the following Python line:

```python
print("Hello, world!")
```

The print function doesn't produce any value, but it does perform an action; it prints its argument to the console!

For the formalization, it turns out to be convenient to separate "simple" statements from "complex" ones. Pragmatically speaking, the difference between the "simple" and the "complex" is control flow; simple statements will be guaranteed to always execute without any decisions or jumps. The reason for this will become clearer in subsequent posts; I will foreshadow a bit by saying that consecutive simple statements can be placed into a single basic block.

The following is a group of three simple statements:

```
x = 1
y = x + 2
noop
```

These will always be executed in the same order, exactly once. Here, noop is a convenient type of statement that simply does nothing.

On the other hand, the following statement is not simple:

```
while x {
  x = x - 1
}
```

It's not simple because it makes decisions about how the code should be executed; if `x` is nonzero, it will try executing the statement in the body of the loop (`x = x - 1`). Otherwise, it will skip evaluating that statement, and carry on with subsequent code.

I first define simple statements using the BasicStmt type:
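As a sketch (not necessarily the exact definition), such a type needs only two cases: assignment and `noop`. The assignment constructor name `_←_` is an assumption made for illustration, and `Expr` is postulated here only so that the snippet stands on its own; it stands in for the expression type described earlier.

```agda
open import Data.String using (String)

-- Stand-in for the expression type described earlier.
postulate Expr : Set

data BasicStmt : Set where
  _←_  : String → Expr → BasicStmt   -- assignment, e.g. x = 1 (name is an assumption)
  noop : BasicStmt                    -- the statement that does nothing
```

Neither constructor takes another statement as an argument, which is what keeps these statements free of control flow.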

Complex statements are just called `Stmt`; they include loops, conditionals and sequences --- {{< sidenote "right" "then-note" "\(s_1\ \text{then}\ s_2\)" >}} The standard notation for sequencing in imperative languages is \(s_1; s_2\). However, Agda gives special meaning to the semicolon, and I couldn't find any passable symbolic alternatives. {{< /sidenote >}} is a sequence where \(s_2\) is evaluated after \(s_1\). Complex statements subsume simple statements, which I model using the constructor `⟨_⟩`.
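A sketch of what this type could look like is below. The constructors `⟨_⟩` and `_then_` follow the notation above; the names `if_then_else_` and `while_repeat_` are assumptions, and `Expr` and `BasicStmt` are postulated only so that the snippet is self-contained.

```agda
-- Stand-ins for the types described earlier.
postulate
  Expr      : Set
  BasicStmt : Set

data Stmt : Set where
  ⟨_⟩           : BasicStmt → Stmt            -- embed a simple statement
  _then_        : Stmt → Stmt → Stmt          -- sequencing: s₁ then s₂
  if_then_else_ : Expr → Stmt → Stmt → Stmt   -- conditional (name is an assumption)
  while_repeat_ : Expr → Stmt → Stmt          -- loop (name is an assumption)
```

Unlike the simple statements, every constructor other than `⟨_⟩` contains further statements, so a `Stmt` is itself a tree.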

For an example of using this encoding, take the following simple program:

```
var = 1
if var {
  x = 1
}
```

The Agda version is:
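Here is a hedged, self-contained sketch of that encoding. It repeats compact versions of the three types so that it type-checks on its own; the constructor names `` `_ ``, `_←_`, `if_then_else_`, and `while_repeat_`, and the use of `ℕ` for literals, are assumptions made for illustration. No fixity declarations are given, so every sub-term is parenthesized explicitly.

```agda
open import Data.Nat using (ℕ)
open import Data.String using (String)

data Expr : Set where
  _+_ _-_ : Expr → Expr → Expr
  `_      : String → Expr
  #_      : ℕ → Expr

data BasicStmt : Set where
  _←_  : String → Expr → BasicStmt
  noop : BasicStmt

data Stmt : Set where
  ⟨_⟩           : BasicStmt → Stmt
  _then_        : Stmt → Stmt → Stmt
  if_then_else_ : Expr → Stmt → Stmt → Stmt
  while_repeat_ : Expr → Stmt → Stmt

-- var = 1, followed by a conditional whose else branch does nothing.
program : Stmt
program =
  ⟨ "var" ← (# 1) ⟩ then
  (if (` "var") then ⟨ "x" ← (# 1) ⟩ else ⟨ noop ⟩)
```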

Notice how we used noop to express the fact that the else branch of the conditional does nothing.
