title | series | description | date | tags | draft
---|---|---|---|---|---
Implementing and Verifying "Static Program Analysis" in Agda, Part 5: Our Programming Language | Static Program Analysis in Agda | In this post, I define the language that will serve as the object of our various analyses | 2024-08-10T17:37:43-07:00 | | true
In the previous several posts, I've formalized the notion of lattices, which
are an essential ingredient to formalizing the analyses in Anders Møller's
lecture notes. However, there can be no program analysis without a program
to analyze! In this post, I will define the (very simple) language that we
will be analyzing. An essential aspect of the language is its semantics, which,
simply speaking, explains what each feature of the language does. At the end
of the previous article, I gave the following inference rule, which defined
(partially) how the `if`-`else` statement in the language works.
{{< latex >}} \frac{\rho_1, e \Downarrow z \quad \neg (z = 0) \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}
As I mentioned then, this rule reads as follows:
If the condition of an `if`-`else` statement evaluates to a nonzero value, then to evaluate the statement, you evaluate its `then` branch.
Another similar --- but crucially, not equivalent --- rule is the following:
{{< latex >}} \frac{\rho_1, e \Downarrow z \quad z = 1 \quad \rho_1,s_1 \Downarrow \rho_2} {\rho_1, \textbf{if}\ e\ \textbf{then}\ s_1\ \textbf{else}\ s_2\ \Downarrow\ \rho_2} {{< /latex >}}
This time, the English interpretation of the rule is as follows:
If the condition of an `if`-`else` statement evaluates to one, then to evaluate the statement, you evaluate its `then` branch.
These rules are certainly not equivalent. For instance, the former allows
the "then" branch to be executed when the condition is `2`; however, in
the latter, the value of the conditional must be `1`. If our analysis were
intelligent (our first few will not be), then this difference would change
its output when determining the signs of the following program:
```
x = 2
if x {
  y = - 1
} else {
  y = 1
}
```
Using the first, more "relaxed" rule, the condition would be considered "true",
and the sign of `y` would be `-`. On the other hand, using the second,
"stricter" rule, the sign of `y` would be `+`. I stress that in this case,
I am showing a flow-sensitive analysis (one that can understand control flow
and make more specific predictions); for our simplest analyses, we will not
be aiming for flow-sensitivity. There is plenty of work to do even then.
The point of showing these two distinct rules is that we need to be very precise about how the language will behave, because our analyses depend on that behavior.
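To make the gap between the two rules concrete, here is a minimal Agda sketch of just their side conditions on the value of the condition. It is an illustration, not code from the accompanying project: the names `Relaxed` and `Strict` are made up for this example, and the standard library's `ℤ` stands in for the language's integers.

```agda
module IfPremises where

open import Data.Empty using (⊥)
open import Data.Integer using (ℤ; +_)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

-- Side condition of the first rule: ¬ (z = 0).
Relaxed : ℤ → Set
Relaxed z = z ≡ + 0 → ⊥

-- Side condition of the second rule: z = 1.
Strict : ℤ → Set
Strict z = z ≡ + 1

-- A condition that evaluates to 2 satisfies the relaxed premise ...
two-relaxed : Relaxed (+ 2)
two-relaxed ()

-- ... but can never satisfy the strict one.
two-not-strict : Strict (+ 2) → ⊥
two-not-strict ()

-- Anything the strict rule accepts, the relaxed rule accepts too.
strict⇒relaxed : ∀ {z} → Strict z → Relaxed z
strict⇒relaxed refl ()
```

The last lemma shows the relationship only goes one way: the strict premise implies the relaxed one, while the value `2` witnesses that the converse fails.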
Let's not get ahead of ourselves, though. I've motivated the need for semantics, but there is much groundwork to be laid before we delve into the precise rules of our language. After all, to define the language's semantics, we need to have a language.
The Syntax of Our Simple Language
I've shown a couple of examples of our language now, and there won't be that
much more to it. We can start with expressions: things that evaluate to
something. Some examples of expressions are `1`, `x`, and `2-(x+y)`. For our
specific language, the precise set of possible expressions can be given
by the following Context-Free Grammar:
{{< latex >}} \begin{array}{rcll} e & ::= & x & \text{(variables)} \\ & | & z & \text{(integer literals)} \\ & | & e + e & \text{(addition)} \\ & | & e - e & \text{(subtraction)} \end{array} {{< /latex >}}
The above can be read as follows:
An expression \(e\) is one of the following things:
- Some variable \(x\) [importantly, \(x\) is a placeholder for any variable, which could be `x` or `y` in our program code; specifically, \(x\) is a metavariable].
- Some integer \(z\) [once again, \(z\) can be any integer, like 1, -42, etc.].
- The addition of two other expressions [which could themselves be additions, etc.].
- The subtraction of two other expressions [which could also themselves be additions, subtractions, etc.].
Since expressions can be nested within other expressions --- which is necessary
to allow complicated code like `2-(x+y)` above --- they form a tree. Each node
is one of the elements of the grammar above (variable, addition, etc.). If
a node contains sub-expressions (like addition and subtraction do), then
these sub-expressions form sub-trees of the given node. This data structure
is called an Abstract Syntax Tree.
Notably, though `2-(x+y)` has parentheses, our grammar above does not
include them as a case. The reason for this is that the structure of an
abstract syntax tree is sufficient to encode the order in which the operations
should be evaluated.
{{< todo >}} Probably two drawings of differently-associated ASTs here. {{< /todo >}}
To an Agda programmer, the one-of-four-things definition above should read quite similarly to the definition of an algebraic data type. Indeed, this is how we can encode the abstract syntax tree of expressions:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 12 16 >}}
The only departure from the grammar above is that I had to invent constructors
for the variable and integer cases, since Agda doesn't support implicit coercions.
This adds a little bit of extra overhead, requiring, for example, that we write
numbers as `# 42` instead of `42`.
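As a rough illustration of how tree shape takes over the job of parentheses, the sketch below builds `2-(x+y)` and its regrouping `(2-x)+y` as two different trees. The `Expr` type here is a self-contained stand-in that follows the grammar, not the definition from the project: the constructor name `var` is an assumption, and literals are natural numbers rather than integers purely to keep the example short.

```agda
module TreeShapes where

open import Data.Empty using (⊥)
open import Data.Nat using (ℕ)
open import Data.String using (String)
open import Relation.Binary.PropositionalEquality using (_≡_)

infixl 6 _+_ _-_

-- Hypothetical stand-in for the expression type: one constructor per grammar case.
data Expr : Set where
  _+_ : Expr → Expr → Expr
  _-_ : Expr → Expr → Expr
  var : String → Expr   -- assumed constructor name for variables
  #_  : ℕ → Expr        -- naturals only, to keep the sketch small

-- 2-(x+y): subtraction at the root, with the addition as its right subtree.
nested : Expr
nested = # 2 - (var "x" + var "y")

-- (2-x)+y: addition at the root, with the subtraction as its left subtree.
regrouped : Expr
regrouped = (# 2 - var "x") + var "y"

-- The roots are different constructors, so the two trees cannot be equal.
nested≢regrouped : nested ≡ regrouped → ⊥
nested≢regrouped ()
```

The parentheses appear only in the Agda source, where they tell Agda's own parser how to group the constructors; the resulting values of type `Expr` contain no trace of them.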
Having defined expressions, the next thing on the menu is statements. Unlike expressions, which just produce values, statements "do something"; an example of a statement might be the following Python line:
```Python
print("Hello, world!")
```
The `print` function doesn't produce any value, but it does perform an action;
it prints its argument to the console!
For the formalization, it turns out to be convenient to separate "simple" statements from "complex" ones. Pragmatically speaking, the difference between the "simple" and the "complex" is control flow; simple statements are guaranteed to always execute without any decisions or jumps. The reason for this will become clearer in subsequent posts; I will foreshadow a bit by saying that consecutive simple statements can be placed into a single basic block.
The following is a group of three simple statements:
```
x = 1
y = x + 2
noop
```
These will always be executed in the same order, exactly once. Here, `noop`
is a convenient type of statement that simply does nothing.
On the other hand, the following statement is not simple:
```
while x {
  x = x - 1
}
```
It's not simple because it makes decisions about how the code should be executed;
if `x` is nonzero, it will try executing the statement in the body of the loop
(`x = x - 1`). Otherwise, it will skip evaluating that statement, and carry on
with subsequent code.
I first define simple statements using the `BasicStmt` type:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 18 20 >}}
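To connect this with the three-statement example above, here is a hedged sketch of how a run of simple statements might be written down. Everything in it is a stand-in rather than the project's code: the constructor names `var` and `_←_`, the natural-number literals, and the choice to model a run of simple statements as a plain `List` are all assumptions made for illustration.

```agda
module StraightLine where

open import Data.List using (List; []; _∷_)
open import Data.Nat using (ℕ)
open import Data.String using (String)

infix  4 _←_
infixl 6 _+_ _-_

-- Hypothetical expression type, following the grammar of expressions.
data Expr : Set where
  _+_ : Expr → Expr → Expr
  _-_ : Expr → Expr → Expr
  var : String → Expr
  #_  : ℕ → Expr

-- Hypothetical simple statements: assignment and the do-nothing statement.
data BasicStmt : Set where
  _←_  : String → Expr → BasicStmt
  noop : BasicStmt

-- x = 1, y = x + 2, noop: no decisions or jumps, so a list captures the order.
straightLine : List BasicStmt
straightLine = ("x" ← # 1) ∷ ("y" ← var "x" + # 2) ∷ noop ∷ []
```

Because neither constructor of `BasicStmt` contains another statement, a value of this type cannot branch or loop, which is what makes it safe to group such statements together.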
Complex statements are just called `Stmt`; they include loops, conditionals and
sequences ---
{{< sidenote "right" "then-note" "(s_1\ \text{then}\ s_2)" >}}
The standard notation for sequencing in imperative languages is
\(s_1; s_2\). However, Agda gives special meaning to the semicolon,
and I couldn't find any passable symbolic alternatives.
{{< /sidenote >}} is a sequence where \(s_2\) is evaluated after \(s_1\).
Complex statements subsume simple statements, which I model using the constructor
`⟨_⟩`.
{{< codelines "Agda" "agda-spa/Language/Base.agda" 25 29 >}}
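For a sense of how the earlier `while` loop and the three simple statements could look in an encoding of this shape, consider the following self-contained sketch. Only `⟨_⟩` and `then` are named in the text; the other constructor names (`var`, `_←_`, `if_then_else_`, `while_repeat_`), the fixity declarations, and the natural-number literals are assumptions made for this example.

```agda
module LoopEncoding where

open import Data.Nat using (ℕ)
open import Data.String using (String)

infixr 2 _then_
infix  4 _←_
infixl 6 _+_ _-_

-- Hypothetical expression type, following the grammar of expressions.
data Expr : Set where
  _+_ : Expr → Expr → Expr
  _-_ : Expr → Expr → Expr
  var : String → Expr
  #_  : ℕ → Expr

-- Hypothetical simple statements.
data BasicStmt : Set where
  _←_  : String → Expr → BasicStmt
  noop : BasicStmt

-- Hypothetical complex statements: wrapped simple statements, sequences,
-- conditionals, and loops.
data Stmt : Set where
  ⟨_⟩           : BasicStmt → Stmt
  _then_        : Stmt → Stmt → Stmt
  if_then_else_ : Expr → Stmt → Stmt → Stmt
  while_repeat_ : Expr → Stmt → Stmt

-- while x { x = x - 1 }
countdown : Stmt
countdown = while var "x" repeat ⟨ "x" ← var "x" - # 1 ⟩

-- x = 1, then y = x + 2, then noop, as a single sequenced statement.
threeSteps : Stmt
threeSteps = ⟨ "x" ← # 1 ⟩ then ⟨ "y" ← var "x" + # 2 ⟩ then ⟨ noop ⟩
```

Wrapping each simple statement in `⟨_⟩` is what lets it appear wherever a `Stmt` is expected, such as the body of the loop.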
For an example of using this encoding, take the following simple program:
```
var = 1
if var {
  x = 1
}
```
The Agda version is:
{{< codelines "Agda" "agda-spa/Main.agda" 27 34 >}}
Notice how we used `noop` to express the fact that the `else` branch of the
conditional does nothing.