From 19be7eb1f5a0dc6d99d19e24b99631bf6af2504a Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Wed, 27 Nov 2024 16:27:26 -0800 Subject: [PATCH] Edit and publish post on control flow graphs Signed-off-by: Danila Fedorin --- content/blog/06_spa_agda_cfg/index.md | 88 ++++++++++++++++++++++++--- 1 file changed, 80 insertions(+), 8 deletions(-) diff --git a/content/blog/06_spa_agda_cfg/index.md b/content/blog/06_spa_agda_cfg/index.md index 76ecff2..d904313 100644 --- a/content/blog/06_spa_agda_cfg/index.md +++ b/content/blog/06_spa_agda_cfg/index.md @@ -2,7 +2,7 @@ title: "Implementing and Verifying \"Static Program Analysis\" in Agda, Part 6: Control Flow Graphs" series: "Static Program Analysis in Agda" description: "In this post, I show how I show an Agda definition of control flow graph building" -date: 2024-11-13T17:32:42-07:00 +date: 2024-11-27T16:26:42-07:00 tags: ["Agda", "Programming Languages"] draft: true --- @@ -109,8 +109,10 @@ each number encoding the index of the node connected by the edge. Here's where it gets a little complicated. I don't want to use plain natural numbers for indices, because that means you can easily introduce "broken" edge. -For example, what if you have 4 nodes, and you have an edge `(5, 5)`? Therefore, -I picked the finite natural numbers represented by [`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154). +For example, what if you have 4 nodes, and you have an edge `(5, 5)`? To avoid +this, I picked the finite natural numbers represented by +[`Fin`](https://agda.github.io/agda-stdlib/v2.0/Data.Fin.Base.html#1154) +as endpoints for edges. ```Agda data Fin : ℕ → Set where @@ -188,7 +190,7 @@ some of the indices of the elements within. For instance, in the lists `[x]` and `[y]`, the indices of both `x` and `y` are `0`; however, in the concatenated list `[x, y]`, the index of `x` is still `0`, but the index of `y` is `1`. More generally, when we concatenate two lists `l1` and `l2`, the indices -into `l1` remain unchanged, whereas the indices `l2` are shifted `length l2`. +into `l1` remain unchanged, whereas the indices `l2` are shifted by `length l1`. Actually, that's not all there is to it. The _values_ of the indices into the left list don't change, but their types do! They start as `Fin (length l1)`, @@ -219,7 +221,7 @@ l2 [ idx2 ] ≡ (l1 ++ l2) [ n ↑ʳ idx2 ] The operators used in the definition above are just versions of the same re-indexing operators. The `↑ˡᵉ` operator applies `↑ˡ` to all the (__e__)dges -in a graph, and the `↑ˡi` applies it to all the (__i__)ndices in a list +in a graph, and the `↑ˡⁱ` applies it to all the (__i__)ndices in a list (like `inputs` and `outputs`). Given these definitions, hopefully the intent with the rest of the definition @@ -300,6 +302,76 @@ we may have duplicates in that list (why not?). {{< codelines "Agda" "agda-spa/Language/Graphs.agda" 165 166 >}} -{{< todo >}} -the rest -{{< /todo >}} +Above, `indices` is a list of all the node identifiers in the graph. Since the +graph has `size` nodes, the indices of all these nodes are simply the values +`0`, `1`, ..., `size - 1`. I defined a special function `finValues` to compute +this list, together with a proof that this list is unique. + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 127 143 >}} + +Another important property of `finValues` is that each node identifier is +present in the list, so that our computation written by traversing the node +list do not "miss" nodes. + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 145 147 >}} + +We can specialize these definitions for a particular graph `g`: + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 156 163 >}} + +To recap, we now have: +* A way to build control flow graphs from programs +* A list (unique'd and complete) of all nodes in the control flow graph so that + we can iterate over them when the algorithm demands. +* A 'predecessors' function, which will be used by our static program analyses, + implemented as an iteration over the list of nodes. + +All that's left is to connect our `predecessors` function to edges in the graph. +The following definitions say that when an edge is in the graph, the starting +node is listed as a predecessor of the ending node, and vise versa. + +{{< codelines "Agda" "agda-spa/Language/Graphs.agda" 168 177 >}} + +### Connecting Two Distinct Representations + +I've described Control Flow Graphs as a compiler-centric representation of the +program. Unlike the formal semantics from the previous post, CFGs do not reason +about the dynamic behavior of the code. Instead, they capture the possible +paths that execution can take through the instructions. In that +sense, they are more of an approximation of what the program will do. This is +good: because of [Rice's theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem), +we can't do anything other than approximating without running the program. + +However, an incorrect approximation is of no use at all. Since the CFGs we build +will be the core data type used by our program analyses, it's important that they +are an accurate, if incomplete, representation. Specifically, because most +of our analyses reason about possible outcomes --- we report what sign each +variable __could__ have, for instance --- it's important that we don't accidentally +omit cases that can happen in practice from our CFGs. Formally, this means +that for each possible execution of a program according to its semantics, +{{< sidenote "right" "converse-note" "there exists a corresponding path through the graph." >}} +The converse is desirable too: that the graph has only paths that correspond +to possible executions of the program. One graph that violates this property is +the strongly-connected graph of all basic blocks in a program. Analyzing +such a graph would give us an overly-conservative estimation; since anything +can happen, most of our answers will likely be too general to be of any use. If, +on the other hand, only the necessary graph connections exist, we can be more +precise.
+
+However, proving this converse property (or even stating it precisely) is much +harder, because our graphs are somewhat conservative already. There exist +programs in which the condition of an if-statement is always +evaluated to false, but our graphs always have edges for both +the "then" and "else" cases. Determining whether a condition is always false +(e.g.) is undecidable thanks to Rice's theorem (again), so we can't rule it out. +Instead, we could broaden "all possible executions" +to "all possible executions where branching conditions can produce arbitrary +results", but this is something else entirely.
+
+For the time being, I will leave this converse property aside. As a result, +our approximations might be "too careful". However, they will at the very least +be sound. +{{< /sidenote >}} + +In the next post, I will prove that this property holds for the graphs shown +here and the formal semantics I defined earlier. I hope to see you there!