From 13636a0d290054c52a4631cc3c078dfc5ffc674a Mon Sep 17 00:00:00 2001
From: Danila Fedorin <danila.fedorin@gmail.com>
Date: Thu, 30 May 2024 00:13:48 -0700
Subject: [PATCH] Write more about Agda+Hugo

Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
---
 content/blog/agda_hugo.md | 353 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 345 insertions(+), 8 deletions(-)
diff --git a/content/blog/agda_hugo.md b/content/blog/agda_hugo.md
index e1413c1..99bdcb6 100644
--- a/content/blog/agda_hugo.md
+++ b/content/blog/agda_hugo.md
@@ -2,7 +2,7 @@
 title: "Integrating Agda's HTML Output with Hugo"
 date: 2024-05-25T21:02:10-07:00
 draft: true
-tags: ["Agda", "Hugo", "Ruby", "Nix"]
+tags: ["Agda", "Hugo", "Ruby"]
 ---
 
 One of my favorite things about Agda are its clickable HTML pages. If you don't
@@ -37,13 +37,29 @@ And here's the denotational semantics for that expression:
 
 Notice that you can click `Expr`, `_∪_`, `⟦`, etc.! All of this integrates
 with my existing Hugo site, and only required a little bit of additional
-metadata to make it work.
+metadata to make it work. The conversion is implemented as
+[a Ruby script](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb);
+this script transfers the link structure from an Agda-generated documentation
+HTML file onto lightly-annotated Hugo code blocks.
 
-Now, the details. Right now, the solution is pretty tailored to my site
-and workflow, but the core of the script -- a piece that transfers links
-from an Agda HTML file into a syntax highlighted Hugo HTML block -- should
+To use the script, your Hugo theme (or your Markdown content) must
+annotate the code blocks with several properties:
+* `data-agda-block`, which marks code that needs to be processed.
+* `data-file-path`, which tells the script what Agda file provided the
+  code in the block, and therefore what Agda HTML file should be searched
+  for links.
+* `data-first-line` and `data-last-line`, which tell the script what
+  section of the Agda HTML file should be searched for said links.
+
+Given this -- and a couple of other assumptions, such as that all Agda
+projects are in a `code/<project>` folder, the script post-processes
+the HTML files automatically. Right now, the solution is pretty tailored to my
+site and workflow, but the core of the script -- the piece that transfers links
+from an Agda HTML file into a syntax-highlighted Hugo HTML block -- should
 be fairly reusable.
 
+Now, the details.
+
 ### The Constraints
 The goal was simple: to allow the code blocks on my Hugo-generated site to
 have links that take the user to the definition of a given symbol.
@@ -84,7 +100,7 @@ In summary:
 1. I want to create cross-links between symbols in Agda blocks in a blog post.
 2. These code blocks could include code from disjoint files, and be out of order.
 3. Code blocks among a whole series of posts should be cross-linked too.
-4. The code blocks should be syntax highlighted the same way as the rest of the
+4. The code blocks should be syntax highlighting the same way as the rest of the
    code on the site.
 5. Ideally, I should be able to use my regular method for referencing code.
 
@@ -92,6 +108,7 @@ I've hit all of these requirements; now it's time to dig into how I got there.
 
 ### Implementation
 
+#### Processing Agda's HTML Output
 It's pretty much a no-go to try to resolve Agda from Hugo, or perform some
 sort of "heuristic" to detect cross-links. Agda is a very complex programming
 language, and Hugo's templating engine, though powerful, is just not
@@ -135,6 +152,7 @@ What I ended up doing is translating Agda's HTML output into offsets and data
 about the code block's _plain text_ -- the source code being decorated.
 Both the Agda and Hugo HTML describe the same code; thus, the plain text
 is the common denominator between the two.
+{#plain-text}
 
 I wrote a Ruby script to extract the decorations from the Agda output; here
 it is in slightly abridged form. You can find the [original `agda.rb` file here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb).
@@ -147,7 +165,7 @@ def process_agda_html_file(file)
   document = Nokogiri::HTML.parse(File.open(file))
   pre_code = document.css("pre.Agda")[0]
 
-  # The traversal postorder; we always visit children before their
+  # The traversal is postorder; we always visit children before their
   # parents, and we visit leaves in sequence.
   line_infos = []
   offset = 0 # Column index within the current Agda source code line
@@ -205,6 +223,325 @@ example above, the script might produce:
 ]
 ```
 
+#### Modifying Hugo's HTML
+
+Given such line information, the next step is to transfer it onto existing
+Hugo HTML files. Within a file, I've made my `codelines` shortcode emit
+custom attributes that can be used to find syntax highlighting Agda code.
+The chief such attribute is `data-agda-block`; my script traverses all
+elements with this attribute.
+
+```Ruby
+def process_source_file(file, document)
+  # Process each highlight group that's been marked as an Agda file.
+  document.css('div[data-agda-block]').each do |t|
+    # ...
+```
+
+To figure out which Agda HTML file to use, and which lines to search for links,
+the script also expects some additional attributes.
+
+```Ruby
+    # ...
+    first_line, last_line = nil, nil
+
+    if first_line_attr = t.attribute("data-first-line")
+      first_line = first_line_attr.to_s.to_i
+    end
+    if last_line_attr = t.attribute("data-last-line")
+      last_line = last_line_attr.to_s.to_i
+    end
+
+    if first_line and last_line
+      line_range = first_line..last_line
+    else
+      # no line number attributes = the code block contains the whole file
+      line_range = 1..
+    end
+
+    full_path = t.attribute("data-file-path").to_s
+    # ...
+```
+
+At this point, the Agda file could be in some nested directory, like
+`A/B/C/File.agda`. However, the project root -- the place where Agda modules
+are compiled from -- could be any one of the folders `A`, `B`, or `C`.
+Thus, the fully qualified module name for `File.agda` could be `File`,
+`C.File`, `B.C.File`, or `A.B.C.File`. Since Agda's HTML output produces
+files named after the fully qualified module name, the script needs to guess
+what the module file is. This is where some conventions come in play:
+I keep my code in folders directly nested within a top-level `code` directory;
+thus, I'll have folders `project1` or `project2`. As a result,
+I guess that the first directory should be discarded, while the rest should be
+included in the path. The only exception to this is Git submodules: if an Agda
+file is included using a submodule, the root directory of the submodule is
+considered the Agda project root. My Hugo theme indicates the submodule using
+an additional `data-base-path` attribute; in all, that leads to the
+following logic:
+
+```Ruby
+    # ...
+    full_path_dirs = Pathname(full_path).each_filename.to_a
+    base_path = t.attribute("data-base-path").to_s
+    base_dir_depth = 0
+    if base_path.empty?
+      # No submodules were used. Assume code/<X> is the root.
+      # The path of the file is given relative to `code`, so need
+      # to strip only the one outermost directory.
+      base_dir_depth = 1
+      base_path = full_path_dirs[0]
+    else
+      # The code is in a submodule. Assume that the base path / submodule
+      # root is the Agda module root, ignore all folders before that.
+      base_path_dirs = Pathname(base_path).each_filename.to_a
+      base_dir_depth = base_path_dirs.length
+    end
+    # ...
+```
+
+With that, the script determines the actual HTML file path ---
+by assuming that there's an `html` folder in the same place as the Agda
+project root ---  and runs the above `process_agda_html_file`:
+
+```Ruby
+    # ...
+    dirs_in_base = full_path_dirs[base_dir_depth..-1]
+    html_file = dirs_in_base.join(".").gsub(/\.agda$/, ".html")
+    html_path = File.join(["code", base_path, "html", html_file])
+
+    agda_info = process_agda_html_file(html_path)
+    # ...
+```
+
+The next step is specific to the output of Hugo's syntax highlighter,
+[Chroma](https://github.com/alecthomas/chroma). When line numbers are enabled
+-- and they are on my site -- Chroma generates a table that, at some point,
+contains a bunch of `span` HTML nodes, each with the `line` class. Each
+such `span` corresponds to a single line of output; naturally, the first
+one contains the code from `first_line`, the second from `first_line + 1`,
+and so on until `last_line`. This is quite convenient, because it saves the
+headache of counting newlines the way that the Agda processing code above has to.
+
+For each line of syntax-highlighted code, the script retrieves the corresponding
+list of links that were collected from the Agda HTML file.
+
+```Ruby
+    # ...
+    lines = t.css("pre.chroma code[data-lang] .line")
+    lines.zip(line_range).each do |line, line_no|
+      line_info = agda_info[line_no]
+      next unless line_info
+
+      # ...
+```
+
+The subsequent traversal -- which picks out the plain text of the Agda file
+as [reasoned above](#plain-text) -- is very similar to the previous
+one. Here too there's an `offset` variable, which gets incremented with
+the length of a new plain text pieces. Since we know the lines match up
+to `span`s, there's no need to count newlines.
+
+```Ruby
+      # ...
+      offset = 0
+      line.traverse do |lt|
+        if lt.text?
+          content = lt.content
+          new_offset = offset + content.length
+
+          # ...
+```
+
+At this point, we have a line number, and an offset within that line number
+that describes the portion of the text under consideration. We can
+traverse all the links for the line, and find ones that mark a piece of
+text somewhere in this range. For the time being -- since inserting overlapping
+spans is quite complicated -- I require the links to lie entirely within a
+particular plain text region. As a result, if Chroma splits a single Agda
+identifier into several tokens, it will not be linked. For now, this seems
+like the most conservative and safe approach.
+
+```Ruby
+          # ...
+          matching_links = line_info.links.filter do |link|
+            link[:from] >= offset and link[:to] <= new_offset
+          end
+          # ...
+```
+
+All that's left is to slice up the plain text fragment into a bunch of HTML
+pieces: the substrings that are links will turn into `a` HTML nodes, while
+the substrings that are "in between" the links will be left over as plain
+text nodes. The code to do so is relatively verbose, but not all that complicated.
+
+```Ruby
+          replace_with = []
+          replace_offset = 0
+          matching_links.each do |match|
+            # The link's range is an offset from the beginning of the line,
+            # but the text piece we're splitting up might be partway into
+            # the line. Convert the link coordinates to piece-relative ones.
+            relative_from = match[:from] - offset
+            relative_to = match[:to] - offset
+
+            # If the previous link ended some time before the new link
+            # began (or if the current link is the first one, and is not
+            # at the beginning), ensure that the plain text "in between"
+            # is kept.
+            replace_with << content[replace_offset...relative_from]
+
+            tag = (match.include? :href) ? 'a' : 'span'
+            new_node = Nokogiri::XML::Node.new(tag, document)
+            if match.include? :href
+              # For nodes with links, note what they're referring to, so
+              # we can adjust their hrefs when we assign global IDs.
+              href = match[:href].to_s
+              new_node['href'] = note_used_href file, new_node, href
+            end
+            if match.include? :id
+              # For nodes with IDs visible in the current Hugo file, we'll
+              # want to redirect links that previously go to other Agda
+              # module HTML files. So, note the ID that we want to redirect,
+              # and pick a new unique ID to replace it with.
+              id = match[:id].to_s
+              new_node['id'] = note_defined_href file, "#{html_file}##{id}"
+            end
+            new_node.content = content[relative_from...relative_to]
+
+            replace_with << new_node
+            replace_offset = relative_to
+          end
+          replace_with << content[replace_offset..-1]
+```
+
+There's a little bit of a subtlety in the above code: specifically, I use
+the `note_used_href` and `note_defined_href` methods. These are important
+for rewriting links. Like I mentioned earlier, Agda's HTML output assumes
+that each source file should produce a single HTML file -- named after its
+qualified module -- and creates links accordingly. However, my blog posts
+interweave multiple source files. Some links that would've jumped to a different
+file must now point to an internal identifier within the page. Another
+important aspect of the transformation is that, since I'm pulling HTML files
+from distinct files, it's not guaranteed that each of them will have a unique
+`id` attribute. After all, Agda just assigns sequential numbers to each
+node that it generates; it would only take, e.g., including the first line
+from two distinct modules to end up with two nodes with `id="1"`. 
+
+The solution is then twofold:
+
+1. Track all the nodes referencing a particular `href` (made up of an HTML
+   file and a numerical identifier, like `File.html#123`). When we pick
+   new IDs -- thus guaranteeing their uniqueness -- we'll visit all the
+   nodes that refer to the old ID and HTML file, and update their `href`.
+2. Track all existing Agda HTML IDs that we're inserting. If we transfer
+   an `<a id="1234">` onto the Hugo content, we know we'll need to pick a new
+   ID for it (since `1234` need not be unique), and that we'll need to redirect
+   the other links to that new ID as the previous bullet describes.
+
+Here's how these two methods work:
+
+```Ruby
+def note_defined_href(file, href)
+  file_hrefs = @local_seen_hrefs.fetch(file) do
+    @local_seen_hrefs[file] = {}
+  end
+
+  uniq_id = file_hrefs.fetch(href) do
+    new_id = "agda-unique-ident-#{@id_counter}"
+    @id_counter += 1
+    file_hrefs[href] = new_id
+  end
+
+  unless @global_seen_hrefs.include? href
+    @global_seen_hrefs[href] = { :file => file, :id => uniq_id }
+  end
+
+  return uniq_id
+end
+
+def note_used_href(file, node, href)
+  ref_list = @nodes_referencing_href.fetch(href) { @nodes_referencing_href[href] = [] }
+  ref_list << { :file => file, :node => node }
+  return href
+end
+```
+
+Note that they use class variables: these are methods on a `FileGroup` class.
+I've omitted the various classes I've declared from the above code for brevity,
+but here it makes sense to show them. Like I mentioned earlier, you can
+view the [complete code here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/6a168f2fe144850ed3a81b796e07266cbf80f382/agda.rb).
+
+Interestingly, `note_defined_href` makes use of _two_ global maps:
+`@local_seen_hrefs` and `@global_seen_hrefs`. This helps satisfy the third
+constraint above, which is linking between code defined in the same series.
+The logic is as follows: when rewriting a link to a new HTML file and ID,
+if the code we're trying to link to exists on the current page, we should link
+to that. Otherwise, if the code we're trying to link to was presented in
+a different part of the series, then we should link to that other part.
+So, we consult the "local" map for `href`s that will be rewritten to HTML
+nodes in the current file, and as a fallback, consult the "global" map for
+`hrefs` that were introduced in other parts. The `note_defined_href` populates
+both maps, and is "biased" towards the first occurrence of a piece of code:
+if posts A and B define a function `f`, and post C only references `f`, then
+that link will go to post A's definition, which came earlier.
+
+The other method, `note_used_href`, is simpler. It just appends to a list
+of Nokogiri HTML nodes that reference a given `href`. We keep track of the file
+in which the reference occurred so we can be sure to consult the right sub-map
+of `@local_seen_hrefs` when checking for in-page rewrites.
+
+After running `process_source_file` on all Hugo HTML files within a particular
+series, the following holds true:
+* We have inserted `span` or `a` nodes wherever Agda's original output
+  had nodes with `id` or `href` elements. This is with the exception of the
+  case where Hugo's inline HTML doesn't "line up" with Agda's inline HTML,
+  which I've only found to happen when the leading character of an identifier is a digit.
+* We have picked new IDs for each HTML node we inserted that had an ID,
+  noting them both globally and for the current file. We noted their original
+  `href` value (in the form `File.html#123`) and that it should be transformed
+  into our globally-unique identifiers, in the form `agda-unique-ident-1234`.
+* For each HTML node we inserted that links to another, we noted the `href`
+  of the reference (also in the form `File.html#123`).
+
+Now, all that's left is to redirect the `href`s of the nodes we inserted
+from their old values to the new ones. I do this by iterating over `@nodes_referencing_href`,
+which contains every link we inserted.
+
+```Ruby
+def cross_link_files
+  @nodes_referencing_href.each do |href, references|
+    references.each do |reference|
+      file = reference[:file]
+      node = reference[:node]
+
+      local_targets = @local_seen_hrefs[file]
+      if local_targets.include? href
+        # A code block in this file provides this href, create a local link.
+        node['href'] = "##{local_targets[href]}"
+      elsif @global_seen_hrefs.include? href
+        # A code block in this series, but not in this file, defines
+        # this href. Create a cross-file link.
+        target = @global_seen_hrefs[href]
+        other_file = target[:file]
+        id = target[:id]
+
+        relpath = Pathname.new(other_file).dirname.relative_path_from(Pathname.new(file).dirname)
+        node['href'] = "#{relpath}##{id}"
+      else
+        # No definitions in any blog page. For now, just delete the anchor.
+        node.replace node.content
+      end
+    end
+  end
+end
+```
+
+Notice that for the time being, I simply remove links to Agda definitions that
+didn't occur in the Hugo post. Ideally, this would link to the plain, non-blog
+documentation page generated by Agda; however, this requires either hosting
+those documentation pages, or expecting the Agda standard library HTML pages
+to remain stable and hosted at a fixed URL. Neither was simple enough to do,
+so I opted for the conservative "just don't insert links" approach.
 
 {{< todo >}}This isn't as important probably, but might be worth talking about. {{< /todo >}}
 
@@ -253,4 +590,4 @@ end
 ```
 
 In short, it traverses all the folders in my `code` directory -- which is where
-I keep my code, looking for Agda source files. Once it finds them,
+I keep my code -- looking for Agda source files. Once it finds them,