diff --git a/content/blog/00_spa_agda_intro.md b/content/blog/00_spa_agda_intro.md index 1b9ff0e..175f258 100644 --- a/content/blog/00_spa_agda_intro.md +++ b/content/blog/00_spa_agda_intro.md @@ -104,5 +104,5 @@ Here are the posts that I’ve written so far for this series: * {{< draftlink "Lattices of Finite Height" "03_spa_agda_fixed_height" >}} * {{< draftlink "The Fixed-Point Algorithm" "04_spa_agda_fixedpoint" >}} * {{< draftlink "Our Programming Language" "05_spa_agda_semantics" >}} -* {{< draftlink "Control Flow Graphs" "06_spa_agda_cfg" >}}. -* {{< draftlink "Connecting Semantics and Control Flow Graphs" "07_spa_agda_semantics_and_cfg" >}}. +* {{< draftlink "Control Flow Graphs" "06_spa_agda_cfg" >}} +* {{< draftlink "Connecting Semantics and Control Flow Graphs" "07_spa_agda_semantics_and_cfg" >}} diff --git a/content/blog/05_spa_agda_semantics/index.md b/content/blog/05_spa_agda_semantics/index.md index cf0fdfc..b08ee5e 100644 --- a/content/blog/05_spa_agda_semantics/index.md +++ b/content/blog/05_spa_agda_semantics/index.md @@ -171,6 +171,7 @@ will be guaranteed to always execute without any decisions or jumps. The reason for this will become clearer in subsequent posts; I will foreshadow a bit by saying that consecutive simple statements can be placed into a single [basic block](https://en.wikipedia.org/wiki/Basic_block). +{#introduce-simple-statements} The following is a group of three simple statements: diff --git a/content/blog/06_spa_agda_cfg/if-cfg.dot b/content/blog/06_spa_agda_cfg/if-cfg.dot new file mode 100644 index 0000000..5a45973 --- /dev/null +++ b/content/blog/06_spa_agda_cfg/if-cfg.dot @@ -0,0 +1,15 @@ +digraph G { + graph[dpi=300 fontsize=14 fontname="Courier New"]; + node[shape=rectangle style="filled" fillcolor="#fafafa" penwidth=0.5 color="#aaaaaa"]; + edge[arrowsize=0.3 color="#444444"] + + node_begin [label="x = ...;\lx\l"] + node_then [label="x = 1\l"] + node_else [label="x = 0\l"] + node_end [label="y = x\l"] + + node_begin -> node_then + node_begin -> node_else + node_then -> node_end + node_else -> node_end +} diff --git a/content/blog/06_spa_agda_cfg/if-cfg.png b/content/blog/06_spa_agda_cfg/if-cfg.png new file mode 100644 index 0000000..5263208 Binary files /dev/null and b/content/blog/06_spa_agda_cfg/if-cfg.png differ diff --git a/content/blog/06_spa_agda_cfg/index.md b/content/blog/06_spa_agda_cfg/index.md new file mode 100644 index 0000000..3ce3dc0 --- /dev/null +++ b/content/blog/06_spa_agda_cfg/index.md @@ -0,0 +1,179 @@ +--- +title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 6: Control Flow Graphs" +series: "Static Program Analysis in Agda" +description: "In this post, I show how I show an Agda definition of control flow graph building" +date: 2024-11-13T17:32:42-07:00 +tags: ["Agda", "Programming Languages"] +draft: true +--- + +In the previous section, I've given a formal definition of the programming +language that I've been trying to analyze. This formal definition serves +as the "ground truth" for how our little imperative programs are executed; +however, program analyses (especially in practice) seldom use the formal semantics +as their subject matter. Instead, they focus on more pragmatic program +representations from the world of compilers. One such representation are +_Control Flow Graphs (CFGs)_. + +Let's start by building some informal intuition. CFGs are pretty much what +their name suggests. They are a type of [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)). +Edges in CFGs represent how execution might jump from one piece of code to +another (how control might flow). + +For example, take the below program. + + +``` +x = ...; +if x { + x = 1; +} else { + x = 0; +} +y = x; +``` + +The CFG might look like this: + +{{< figure src="if-cfg.png" label="CFG for simple `if`-`else` code." class="small" >}} + +Here, the initialization of `x` with `...`, as well as the `if` condition (just `x`), +are guaranteed to execute one after another, so they occupy a single node. From there, +depending on the condition, the control flow can jump to one of the +branches of the `if` statement: the "then" branch if the condition is true, +and the "else" branch if the condition is false. As a result, there are two +arrows coming out of the initial node. Once either branch is executed, control +always jumps to the code right after the `if` statement (the `y = x`). Thus, +both the `x = 1` and `x = 0` nodes have a single arrow to the `y = x` node. + +As another example, if you had a loop: + +``` +x = ...; +while x { + x = x - 1; +} +y = x; +``` + +The CFG would look like this: + +{{< figure src="while-cfg.png" label="CFG for simple `while` code." class="small" >}} + +Here, condition of the loop (`x`) is not always guaranteed to execute together +with the code that initializes `x`. That's because the condition of the loop +is checked after every iteration, whereas the code before the loop is executed +only once. As a result, `x = ...` and `x` occupy distinct CFG nodes. From there, +the control flow can proceed in two different ways, depending on the value +of `x`. If `x` is truthy, the program will proceed to the loop body (decrementing `x`). +If `x` is falsy, the program will skip the loop body altogether, and go to the +code right after the loop (`y = x`). This is indicated by the two arrows +going out of the `x` node. After executing the body, we return to the condition +of the loop to see if we need to run another iteration. Because of this, the +decrementing node has an arrow back to the loop condition. + +Now, let's be a bit more precise. Control Flow Graphs are defined as follows: + +* __The nodes__ are [_basic blocks_](https://en.wikipedia.org/wiki/Basic_block). + Paraphrasing Wikipedia's definition, a basic block is a piece of code that + has only one entry point and one exit point. + + The one-entry-point rule means that it's not possible to jump into the middle + of the basic block, executing only half of its instructions. The execution of + a basic block always begins at the top. Symmetrically, the one-exit-point + rule means that you can't jump away to other code (even within the same block), + skipping some instructions. The execution of a basic block always ends at + the bottom. + + As a result of these constraints, when running a basic block, you are + guaranteed to execute every instruction in exactly the order they occur in, + and execute each instruction exactly once. +* __The edges__ are jumps between basic blocks. We've already seen how + `if` and `while` statements introduce these jumps. + +Basic blocks can only be made of code that doen't jump (otherwise, +we violate the single-exit-point policy). In the previous post, +we defined exactly this kind of code as [simple statements]({{< relref "05_spa_agda_semantics#introduce-simple-statements" >}}). +So, in our control flow graph, nodes will be sequences of simple statements. +{#list-basic-stmts} + +### Control Flow Graphs in Agda + +#### Basic Definition +At an abstract level, it's easy to say "it's just a graph where X is Y" about +anything. It's much harder to give a precise definition of such a graph, +particularly if you want to rule out invalid graphs (e.g., ones with edges +pointing nowhere). In Agda, I chose the represent a two lists: one of nodes, +and one of edges. Each node is simply a list of `BasicStmt`s, as +I described in a preceding paragraph. An edge is simply a pair of numbers, +each number encoding the index of the node connected by the edge. + +Here's where it gets a little complicated. I don't want to use plain natural +numbers for indices, because that means you can easily introduce "broken" edge. +For example, what if you have 4 nodes, and you have an edge `(5, 5)`? Therefore, +I picked the finite natural numbers represented by [`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154). + +```Agda +data Fin : ℕ → Set where + zero : Fin (suc n) + suc : (i : Fin n) → Fin (suc n) +``` + +Specifically, `Fin n` is the type of natural numbers less than `n`. Following +this definition, `Fin 3` represents the numbers `0`, `1` and `2`. These are +represented using the same constructors as `Nat`: `zero` and `suc`. The type +of `zero` is `Fin (suc n)` for any `n`; this makes sense because zero is less +than any number plus one. For `suc,` the bound `n` of the input `i` is incremented +by one, leading to another `suc n` in the final type. This makes sense because if +`i < n`, then `i + 1 < n + 1`. I've previously explained this data type +[in another post on this site]({{< relref "01_aoc_coq#aside-vectors-and-finite-mathbbn" >}}). + +Here's my definition of `Graph`s written using `Fin`: + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 24 39 >}} + +I explicitly used a `size` field, which determines how many nodes are in the +graph, and serves both as the upper bound the edge indices as well as the +size `nodes` field. From there, an index `Index` into the node list is +{{< sidenote "right" "size-note" "just a natural number less than `size`," >}} +Ther are size natural numbers less than size:
+0, 1, ..., size - 1. +{{< /sidenote >}} +and an edge is just a pair of indices. The graph then contains a vector +(exact-length list) `nodes` of all the basic blocks, and then a list of +edges `edges`. + +There are two fields here that I have not yet said anything about: `inputs` +and `outputs`. When we have a complete CFG for our programs, these fields are +totally unnecessary. However, as we are _building_ the CFG, these will come +in handy, by telling us how to stitch together smaller sub-graphs that we've +already built. Let's talk about that next. + +#### Combining Graphs +Suppose you're building a CFG for a program in the following form: + +``` +code1; +code2; +``` + +Where `code1` and `code2` are arbitrary pieces of code, which could include +statements, loops, and pretty much anything else. Besides the fact that they +occur one after another, these pieces of code are unrelated, and we can +build CFGs for each one them independently. However, the fact that `code1` and +`code2` are in sequence means that the full control flow graph for the above +program should have edges going from the nodes in `code1` to the nodes in `code2`. +Of course, not _every_ node in `code1` should have such edges: that would +mean that after executing any "basic" sequence of instructions, you could suddenly +decide to skip the rest of `code1` and move on to executing `code2`. + +Thus, we need to be more precise about what edges we need to insert; we want +to insert edges between the "final" nodes in `code1` (where control ends up +after `code1` is finished executing) and the "initial" nodes in `code2` (where +control would begin once we started executing `code2`). Those are the `outputs` +and `inputs`, respectively. When stitching together sequenced control graphs, +we will connect each of the outputs of one to each of the inputs of the other. + +This is defined by the operation `_↦_`: + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 72 83 >}} diff --git a/content/blog/06_spa_agda_cfg/while-cfg.dot b/content/blog/06_spa_agda_cfg/while-cfg.dot new file mode 100644 index 0000000..ef37a6c --- /dev/null +++ b/content/blog/06_spa_agda_cfg/while-cfg.dot @@ -0,0 +1,15 @@ +digraph G { + graph[dpi=300 fontsize=14 fontname="Courier New"]; + node[shape=rectangle style="filled" fillcolor="#fafafa" penwidth=0.5 color="#aaaaaa"]; + edge[arrowsize=0.3 color="#444444"] + + node_begin [label="x = ...;\l"] + node_cond [label="x\l"] + node_body [label="x = x - 1\l"] + node_end [label="y = x\l"] + + node_begin -> node_cond + node_cond -> node_body + node_cond -> node_end + node_body -> node_cond +} diff --git a/content/blog/06_spa_agda_cfg/while-cfg.png b/content/blog/06_spa_agda_cfg/while-cfg.png new file mode 100644 index 0000000..278ce60 Binary files /dev/null and b/content/blog/06_spa_agda_cfg/while-cfg.png differ