--- title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 6: Control Flow Graphs" series: "Static Program Analysis in Agda" description: "In this post, I show how I show an Agda definition of control flow graph building" date: 2024-11-13T17:32:42-07:00 tags: ["Agda", "Programming Languages"] draft: true --- In the previous section, I've given a formal definition of the programming language that I've been trying to analyze. This formal definition serves as the "ground truth" for how our little imperative programs are executed; however, program analyses (especially in practice) seldom use the formal semantics as their subject matter. Instead, they focus on more pragmatic program representations from the world of compilers. One such representation are _Control Flow Graphs (CFGs)_. Let's start by building some informal intuition. CFGs are pretty much what their name suggests. They are a type of [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)). Edges in CFGs represent how execution might jump from one piece of code to another (how control might flow). For example, take the below program. ``` x = ...; if x { x = 1; } else { x = 0; } y = x; ``` The CFG might look like this: {{< figure src="if-cfg.png" label="CFG for simple `if`-`else` code." class="small" >}} Here, the initialization of `x` with `...`, as well as the `if` condition (just `x`), are guaranteed to execute one after another, so they occupy a single node. From there, depending on the condition, the control flow can jump to one of the branches of the `if` statement: the "then" branch if the condition is true, and the "else" branch if the condition is false. As a result, there are two arrows coming out of the initial node. Once either branch is executed, control always jumps to the code right after the `if` statement (the `y = x`). Thus, both the `x = 1` and `x = 0` nodes have a single arrow to the `y = x` node. As another example, if you had a loop: ``` x = ...; while x { x = x - 1; } y = x; ``` The CFG would look like this: {{< figure src="while-cfg.png" label="CFG for simple `while` code." class="small" >}} Here, condition of the loop (`x`) is not always guaranteed to execute together with the code that initializes `x`. That's because the condition of the loop is checked after every iteration, whereas the code before the loop is executed only once. As a result, `x = ...` and `x` occupy distinct CFG nodes. From there, the control flow can proceed in two different ways, depending on the value of `x`. If `x` is truthy, the program will proceed to the loop body (decrementing `x`). If `x` is falsy, the program will skip the loop body altogether, and go to the code right after the loop (`y = x`). This is indicated by the two arrows going out of the `x` node. After executing the body, we return to the condition of the loop to see if we need to run another iteration. Because of this, the decrementing node has an arrow back to the loop condition. Now, let's be a bit more precise. Control Flow Graphs are defined as follows: * __The nodes__ are [_basic blocks_](https://en.wikipedia.org/wiki/Basic_block). Paraphrasing Wikipedia's definition, a basic block is a piece of code that has only one entry point and one exit point. The one-entry-point rule means that it's not possible to jump into the middle of the basic block, executing only half of its instructions. The execution of a basic block always begins at the top. Symmetrically, the one-exit-point rule means that you can't jump away to other code (even within the same block), skipping some instructions. The execution of a basic block always ends at the bottom. As a result of these constraints, when running a basic block, you are guaranteed to execute every instruction in exactly the order they occur in, and execute each instruction exactly once. * __The edges__ are jumps between basic blocks. We've already seen how `if` and `while` statements introduce these jumps. Basic blocks can only be made of code that doen't jump (otherwise, we violate the single-exit-point policy). In the previous post, we defined exactly this kind of code as [simple statements]({{< relref "05_spa_agda_semantics#introduce-simple-statements" >}}). So, in our control flow graph, nodes will be sequences of simple statements. {#list-basic-stmts} ### Control Flow Graphs in Agda #### Basic Definition At an abstract level, it's easy to say "it's just a graph where X is Y" about anything. It's much harder to give a precise definition of such a graph, particularly if you want to rule out invalid graphs (e.g., ones with edges pointing nowhere). In Agda, I chose the represent a two lists: one of nodes, and one of edges. Each node is simply a list of `BasicStmt`s, as I described in a preceding paragraph. An edge is simply a pair of numbers, each number encoding the index of the node connected by the edge. Here's where it gets a little complicated. I don't want to use plain natural numbers for indices, because that means you can easily introduce "broken" edge. For example, what if you have 4 nodes, and you have an edge `(5, 5)`? Therefore, I picked the finite natural numbers represented by [`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154). ```Agda data Fin : ℕ → Set where zero : Fin (suc n) suc : (i : Fin n) → Fin (suc n) ``` Specifically, `Fin n` is the type of natural numbers less than `n`. Following this definition, `Fin 3` represents the numbers `0`, `1` and `2`. These are represented using the same constructors as `Nat`: `zero` and `suc`. The type of `zero` is `Fin (suc n)` for any `n`; this makes sense because zero is less than any number plus one. For `suc,` the bound `n` of the input `i` is incremented by one, leading to another `suc n` in the final type. This makes sense because if `i < n`, then `i + 1 < n + 1`. I've previously explained this data type [in another post on this site]({{< relref "01_aoc_coq#aside-vectors-and-finite-mathbbn" >}}). Here's my definition of `Graph`s written using `Fin`: {{< codelines "Agda" "agda-spa/Language/Graphs.agda" 24 39 >}} I explicitly used a `size` field, which determines how many nodes are in the graph, and serves both as the upper bound the edge indices as well as the size `nodes` field. From there, an index `Index` into the node list is {{< sidenote "right" "size-note" "just a natural number less than `size`," >}} Ther are size natural numbers less than size:
0, 1, ..., size - 1. {{< /sidenote >}} and an edge is just a pair of indices. The graph then contains a vector (exact-length list) `nodes` of all the basic blocks, and then a list of edges `edges`. There are two fields here that I have not yet said anything about: `inputs` and `outputs`. When we have a complete CFG for our programs, these fields are totally unnecessary. However, as we are _building_ the CFG, these will come in handy, by telling us how to stitch together smaller sub-graphs that we've already built. Let's talk about that next. #### Combining Graphs Suppose you're building a CFG for a program in the following form: ``` code1; code2; ``` Where `code1` and `code2` are arbitrary pieces of code, which could include statements, loops, and pretty much anything else. Besides the fact that they occur one after another, these pieces of code are unrelated, and we can build CFGs for each one them independently. However, the fact that `code1` and `code2` are in sequence means that the full control flow graph for the above program should have edges going from the nodes in `code1` to the nodes in `code2`. Of course, not _every_ node in `code1` should have such edges: that would mean that after executing any "basic" sequence of instructions, you could suddenly decide to skip the rest of `code1` and move on to executing `code2`. Thus, we need to be more precise about what edges we need to insert; we want to insert edges between the "final" nodes in `code1` (where control ends up after `code1` is finished executing) and the "initial" nodes in `code2` (where control would begin once we started executing `code2`). Those are the `outputs` and `inputs`, respectively. When stitching together sequenced control graphs, we will connect each of the outputs of one to each of the inputs of the other. This is defined by the operation `_↦_`: {{< codelines "Agda" "agda-spa/Language/Graphs.agda" 72 83 >}}