Edit and publish post on control flow graphs

Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
Danila Fedorin 2024-11-27 16:27:26 -08:00
parent 5de2ae1203
commit 19be7eb1f5


@@ -2,7 +2,7 @@
title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 6: Control Flow Graphs"
series: "Static Program Analysis in Agda"
description: "In this post, I show an Agda definition of control flow graph building"
date: 2024-11-13T17:32:42-07:00
date: 2024-11-27T16:26:42-07:00
tags: ["Agda", "Programming Languages"]
draft: true
---
@@ -109,8 +109,10 @@ each number encoding the index of the node connected by the edge.
Here's where it gets a little complicated. I don't want to use plain natural
numbers for indices, because that means you can easily introduce "broken" edges.
For example, what if you have 4 nodes, and you have an edge `(5, 5)`? Therefore,
I picked the finite natural numbers represented by [`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154).
For example, what if you have 4 nodes, and you have an edge `(5, 5)`? To avoid
this, I picked the finite natural numbers represented by
[`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154)
as endpoints for edges.
```Agda
data Fin : ℕ → Set where
  zero : ∀ {n} → Fin (suc n)
  suc  : ∀ {n} → Fin n → Fin (suc n)
```
@@ -188,7 +190,7 @@ some of the indices of the elements within. For instance, in the lists
`[x]` and `[y]`, the indices of both `x` and `y` are `0`; however, in the
concatenated list `[x, y]`, the index of `x` is still `0`, but the index of `y`
is `1`. More generally, when we concatenate two lists `l1` and `l2`, the indices
into `l1` remain unchanged, whereas the indices `l2` are shifted `length l2`.
into `l1` remain unchanged, whereas the indices into `l2` are shifted by `length l1`.
Actually, that's not all there is to it. The _values_ of the indices into
the left list don't change, but their types do! They start as `Fin (length l1)`,
@@ -219,7 +221,7 @@ l2 [ idx2 ] ≡ (l1 ++ l2) [ n ↑ʳ idx2 ]
The operators used in the definition above are just versions of the same
re-indexing operators. The `↑ˡᵉ` operator applies `↑ˡ` to all the (__e__)dges
in a graph, and the `↑ˡi` applies it to all the (__i__)ndices in a list
in a graph, and the `↑ˡⁱ` operator applies it to all the (__i__)ndices in a list
(like `inputs` and `outputs`).
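For intuition, here's a small standalone example of the underlying `Fin` operators from the standard library; the graph- and list-level variants above are just these, mapped over edges and index lists:

```Agda
open import Data.Fin using (Fin; zero; suc; _↑ˡ_; _↑ʳ_)

-- An index into a 2-element prefix, weakened into the 5-element
-- concatenation: the value stays 1, only the type grows.
left : Fin 5
left = suc zero ↑ˡ 3

-- An index into a 3-element suffix, shifted past a 2-element prefix:
-- the value becomes 2 + 1 = 3.
right : Fin 5
right = 2 ↑ʳ suc zero
```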
Given these definitions, hopefully the intent with the rest of the definition
@@ -300,6 +302,76 @@ we may have duplicates in that list (why not?).
{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 165 166 >}}
{{< todo >}}
the rest
{{< /todo >}}
Above, `indices` is a list of all the node identifiers in the graph. Since the
graph has `size` nodes, the indices of all these nodes are simply the values
`0`, `1`, ..., `size - 1`. I defined a special function `finValues` to compute
this list, together with a proof that this list is unique.
{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 127 143 >}}
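To get a feel for what this computes, here is a minimal standalone sketch (with a hypothetical name, `allFins`, and without the uniqueness proof that the real `finValues` carries alongside the list):

```Agda
open import Data.Nat using (ℕ; zero; suc)
open import Data.Fin using (Fin; zero; suc)
open import Data.List using (List; []; _∷_; map)

-- List every element of Fin n in increasing order: for n = 3,
-- this produces 0 ∷ 1 ∷ 2 ∷ [].
allFins : (n : ℕ) → List (Fin n)
allFins zero    = []
allFins (suc n) = zero ∷ map suc (allFins n)
```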
Another important property of `finValues` is that each node identifier is
present in the list, so that computations written by traversing the node
list do not "miss" any nodes.
{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 145 147 >}}
We can specialize these definitions for a particular graph `g`:
{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 156 163 >}}
To recap, we now have:
* A way to build control flow graphs from programs
* A list (unique'd and complete) of all nodes in the control flow graph so that
we can iterate over them when the algorithm demands.
* A 'predecessors' function, which will be used by our static program analyses,
  implemented as an iteration over the list of nodes.
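As a rough sketch, using a hypothetical, stripped-down `Graph` record (the real one also stores the basic blocks and the `inputs` and `outputs` lists), such a 'predecessors' computation might look like this:

```Agda
open import Data.Nat using (ℕ)
open import Data.Fin using (Fin)
open import Data.Fin.Properties using (_≟_)
open import Data.List using (List; map; filter)
open import Data.Product using (_×_; proj₁; proj₂)

-- Hypothetical cut-down graph record: just a node count and edges.
record Graph : Set where
  field
    size  : ℕ
    edges : List (Fin size × Fin size)

-- A node's predecessors are the sources of all edges that target it.
predecessors : (g : Graph) → Fin (Graph.size g) → List (Fin (Graph.size g))
predecessors g idx =
  map proj₁ (filter (λ e → proj₂ e ≟ idx) (Graph.edges g))
```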
All that's left is to connect our `predecessors` function to edges in the graph.
The following definitions say that when an edge is in the graph, the starting
node is listed as a predecessor of the ending node, and vice versa.
{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 168 177 >}}
### Connecting Two Distinct Representations
I've described Control Flow Graphs as a compiler-centric representation of the
program. Unlike the formal semantics from the previous post, CFGs do not reason
about the dynamic behavior of the code. Instead, they capture the possible
paths that execution can take through the instructions. In that
sense, they are more of an approximation of what the program will do. This is
good: because of [Rice's theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem),
we can't do better than approximate without running the program.
However, an incorrect approximation is of no use at all. Since the CFGs we build
will be the core data type used by our program analyses, it's important that they
are an accurate, if incomplete, representation. Specifically, because most
of our analyses reason about possible outcomes --- we report what sign each
variable __could__ have, for instance --- it's important that we don't accidentally
omit cases that can happen in practice from our CFGs. Formally, this means
that for each possible execution of a program according to its semantics,
{{< sidenote "right" "converse-note" "there exists a corresponding path through the graph." >}}
The converse is desirable too: that the graph has only paths that correspond
to possible executions of the program. One graph that violates this property is
the strongly-connected graph of all basic blocks in a program. Analyzing
such a graph would give us an overly-conservative estimation; since anything
can happen, most of our answers will likely be too general to be of any use. If,
on the other hand, only the necessary graph connections exist, we can be more
precise.<br>
<br>
However, proving this converse property (or even stating it precisely) is much
harder, because our graphs are somewhat conservative already. There exist
programs in which the condition of an <code>if</code>-statement is always
evaluated to <code>false</code>, but our graphs always have edges for both
the "then" and "else" cases. Determining whether a condition is always false
(e.g.) is undecidable thanks to Rice's theorem (again), so we can't rule it out.
Instead, we could broaden "all possible executions"
to "all possible executions where branching conditions can produce arbitrary
results", but this is something else entirely.<br>
<br>
For the time being, I will leave this converse property aside. As a result,
our approximations might be "too careful". However, they will at the very least
be sound.
{{< /sidenote >}}
In the next post, I will prove that this property holds for the graphs shown
here and the formal semantics I defined earlier. I hope to see you there!