From 13636a0d290054c52a4631cc3c078dfc5ffc674a Mon Sep 17 00:00:00 2001 From: Danila Fedorin Date: Thu, 30 May 2024 00:13:48 -0700 Subject: [PATCH] Write more about Agda+Hugo Signed-off-by: Danila Fedorin --- content/blog/agda_hugo.md | 353 +++++++++++++++++++++++++++++++++++++- 1 file changed, 345 insertions(+), 8 deletions(-) diff --git a/content/blog/agda_hugo.md b/content/blog/agda_hugo.md index e1413c1..99bdcb6 100644 --- a/content/blog/agda_hugo.md +++ b/content/blog/agda_hugo.md @@ -2,7 +2,7 @@ title: "Integrating Agda's HTML Output with Hugo" date: 2024-05-25T21:02:10-07:00 draft: true -tags: ["Agda", "Hugo", "Ruby", "Nix"] +tags: ["Agda", "Hugo", "Ruby"] --- One of my favorite things about Agda are its clickable HTML pages. If you don't @@ -37,13 +37,29 @@ And here's the denotational semantics for that expression: Notice that you can click `Expr`, `_∪_`, `⟦`, etc.! All of this integrates with my existing Hugo site, and only required a little bit of additional -metadata to make it work. +metadata to make it work. The conversion is implemented as +[a Ruby script](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb); +this script transfers the link structure from an Agda-generated documentation +HTML file onto lightly-annotated Hugo code blocks. -Now, the details. Right now, the solution is pretty tailored to my site -and workflow, but the core of the script -- a piece that transfers links -from an Agda HTML file into a syntax highlighted Hugo HTML block -- should +To use the script, your Hugo theme (or your Markdown content) must +annotate the code blocks with several properties: +* `data-agda-block`, which marks code that needs to be processed. +* `data-file-path`, which tells the script what Agda file provided the + code in the block, and therefore what Agda HTML file should be searched + for links. +* `data-first-line` and `data-last-line`, which tell the script what + section of the Agda HTML file should be searched for said links. + +Given this -- and a couple of other assumptions, such as that all Agda +projects are in a `code/` folder, the script post-processes +the HTML files automatically. Right now, the solution is pretty tailored to my +site and workflow, but the core of the script -- the piece that transfers links +from an Agda HTML file into a syntax-highlighted Hugo HTML block -- should be fairly reusable. +Now, the details. + ### The Constraints The goal was simple: to allow the code blocks on my Hugo-generated site to have links that take the user to the definition of a given symbol. @@ -84,7 +100,7 @@ In summary: 1. I want to create cross-links between symbols in Agda blocks in a blog post. 2. These code blocks could include code from disjoint files, and be out of order. 3. Code blocks among a whole series of posts should be cross-linked too. -4. The code blocks should be syntax highlighted the same way as the rest of the +4. The code blocks should be syntax highlighting the same way as the rest of the code on the site. 5. Ideally, I should be able to use my regular method for referencing code. @@ -92,6 +108,7 @@ I've hit all of these requirements; now it's time to dig into how I got there. ### Implementation +#### Processing Agda's HTML Output It's pretty much a no-go to try to resolve Agda from Hugo, or perform some sort of "heuristic" to detect cross-links. Agda is a very complex programming language, and Hugo's templating engine, though powerful, is just not @@ -135,6 +152,7 @@ What I ended up doing is translating Agda's HTML output into offsets and data about the code block's _plain text_ -- the source code being decorated. Both the Agda and Hugo HTML describe the same code; thus, the plain text is the common denominator between the two. +{#plain-text} I wrote a Ruby script to extract the decorations from the Agda output; here it is in slightly abridged form. You can find the [original `agda.rb` file here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb). @@ -147,7 +165,7 @@ def process_agda_html_file(file) document = Nokogiri::HTML.parse(File.open(file)) pre_code = document.css("pre.Agda")[0] - # The traversal postorder; we always visit children before their + # The traversal is postorder; we always visit children before their # parents, and we visit leaves in sequence. line_infos = [] offset = 0 # Column index within the current Agda source code line @@ -205,6 +223,325 @@ example above, the script might produce: ] ``` +#### Modifying Hugo's HTML + +Given such line information, the next step is to transfer it onto existing +Hugo HTML files. Within a file, I've made my `codelines` shortcode emit +custom attributes that can be used to find syntax highlighting Agda code. +The chief such attribute is `data-agda-block`; my script traverses all +elements with this attribute. + +```Ruby +def process_source_file(file, document) + # Process each highlight group that's been marked as an Agda file. + document.css('div[data-agda-block]').each do |t| + # ... +``` + +To figure out which Agda HTML file to use, and which lines to search for links, +the script also expects some additional attributes. + +```Ruby + # ... + first_line, last_line = nil, nil + + if first_line_attr = t.attribute("data-first-line") + first_line = first_line_attr.to_s.to_i + end + if last_line_attr = t.attribute("data-last-line") + last_line = last_line_attr.to_s.to_i + end + + if first_line and last_line + line_range = first_line..last_line + else + # no line number attributes = the code block contains the whole file + line_range = 1.. + end + + full_path = t.attribute("data-file-path").to_s + # ... +``` + +At this point, the Agda file could be in some nested directory, like +`A/B/C/File.agda`. However, the project root -- the place where Agda modules +are compiled from -- could be any one of the folders `A`, `B`, or `C`. +Thus, the fully qualified module name for `File.agda` could be `File`, +`C.File`, `B.C.File`, or `A.B.C.File`. Since Agda's HTML output produces +files named after the fully qualified module name, the script needs to guess +what the module file is. This is where some conventions come in play: +I keep my code in folders directly nested within a top-level `code` directory; +thus, I'll have folders `project1` or `project2`. As a result, +I guess that the first directory should be discarded, while the rest should be +included in the path. The only exception to this is Git submodules: if an Agda +file is included using a submodule, the root directory of the submodule is +considered the Agda project root. My Hugo theme indicates the submodule using +an additional `data-base-path` attribute; in all, that leads to the +following logic: + +```Ruby + # ... + full_path_dirs = Pathname(full_path).each_filename.to_a + base_path = t.attribute("data-base-path").to_s + base_dir_depth = 0 + if base_path.empty? + # No submodules were used. Assume code/ is the root. + # The path of the file is given relative to `code`, so need + # to strip only the one outermost directory. + base_dir_depth = 1 + base_path = full_path_dirs[0] + else + # The code is in a submodule. Assume that the base path / submodule + # root is the Agda module root, ignore all folders before that. + base_path_dirs = Pathname(base_path).each_filename.to_a + base_dir_depth = base_path_dirs.length + end + # ... +``` + +With that, the script determines the actual HTML file path --- +by assuming that there's an `html` folder in the same place as the Agda +project root --- and runs the above `process_agda_html_file`: + +```Ruby + # ... + dirs_in_base = full_path_dirs[base_dir_depth..-1] + html_file = dirs_in_base.join(".").gsub(/\.agda$/, ".html") + html_path = File.join(["code", base_path, "html", html_file]) + + agda_info = process_agda_html_file(html_path) + # ... +``` + +The next step is specific to the output of Hugo's syntax highlighter, +[Chroma](https://github.com/alecthomas/chroma). When line numbers are enabled +-- and they are on my site -- Chroma generates a table that, at some point, +contains a bunch of `span` HTML nodes, each with the `line` class. Each +such `span` corresponds to a single line of output; naturally, the first +one contains the code from `first_line`, the second from `first_line + 1`, +and so on until `last_line`. This is quite convenient, because it saves the +headache of counting newlines the way that the Agda processing code above has to. + +For each line of syntax-highlighted code, the script retrieves the corresponding +list of links that were collected from the Agda HTML file. + +```Ruby + # ... + lines = t.css("pre.chroma code[data-lang] .line") + lines.zip(line_range).each do |line, line_no| + line_info = agda_info[line_no] + next unless line_info + + # ... +``` + +The subsequent traversal -- which picks out the plain text of the Agda file +as [reasoned above](#plain-text) -- is very similar to the previous +one. Here too there's an `offset` variable, which gets incremented with +the length of a new plain text pieces. Since we know the lines match up +to `span`s, there's no need to count newlines. + +```Ruby + # ... + offset = 0 + line.traverse do |lt| + if lt.text? + content = lt.content + new_offset = offset + content.length + + # ... +``` + +At this point, we have a line number, and an offset within that line number +that describes the portion of the text under consideration. We can +traverse all the links for the line, and find ones that mark a piece of +text somewhere in this range. For the time being -- since inserting overlapping +spans is quite complicated -- I require the links to lie entirely within a +particular plain text region. As a result, if Chroma splits a single Agda +identifier into several tokens, it will not be linked. For now, this seems +like the most conservative and safe approach. + +```Ruby + # ... + matching_links = line_info.links.filter do |link| + link[:from] >= offset and link[:to] <= new_offset + end + # ... +``` + +All that's left is to slice up the plain text fragment into a bunch of HTML +pieces: the substrings that are links will turn into `a` HTML nodes, while +the substrings that are "in between" the links will be left over as plain +text nodes. The code to do so is relatively verbose, but not all that complicated. + +```Ruby + replace_with = [] + replace_offset = 0 + matching_links.each do |match| + # The link's range is an offset from the beginning of the line, + # but the text piece we're splitting up might be partway into + # the line. Convert the link coordinates to piece-relative ones. + relative_from = match[:from] - offset + relative_to = match[:to] - offset + + # If the previous link ended some time before the new link + # began (or if the current link is the first one, and is not + # at the beginning), ensure that the plain text "in between" + # is kept. + replace_with << content[replace_offset...relative_from] + + tag = (match.include? :href) ? 'a' : 'span' + new_node = Nokogiri::XML::Node.new(tag, document) + if match.include? :href + # For nodes with links, note what they're referring to, so + # we can adjust their hrefs when we assign global IDs. + href = match[:href].to_s + new_node['href'] = note_used_href file, new_node, href + end + if match.include? :id + # For nodes with IDs visible in the current Hugo file, we'll + # want to redirect links that previously go to other Agda + # module HTML files. So, note the ID that we want to redirect, + # and pick a new unique ID to replace it with. + id = match[:id].to_s + new_node['id'] = note_defined_href file, "#{html_file}##{id}" + end + new_node.content = content[relative_from...relative_to] + + replace_with << new_node + replace_offset = relative_to + end + replace_with << content[replace_offset..-1] +``` + +There's a little bit of a subtlety in the above code: specifically, I use +the `note_used_href` and `note_defined_href` methods. These are important +for rewriting links. Like I mentioned earlier, Agda's HTML output assumes +that each source file should produce a single HTML file -- named after its +qualified module -- and creates links accordingly. However, my blog posts +interweave multiple source files. Some links that would've jumped to a different +file must now point to an internal identifier within the page. Another +important aspect of the transformation is that, since I'm pulling HTML files +from distinct files, it's not guaranteed that each of them will have a unique +`id` attribute. After all, Agda just assigns sequential numbers to each +node that it generates; it would only take, e.g., including the first line +from two distinct modules to end up with two nodes with `id="1"`. + +The solution is then twofold: + +1. Track all the nodes referencing a particular `href` (made up of an HTML + file and a numerical identifier, like `File.html#123`). When we pick + new IDs -- thus guaranteeing their uniqueness -- we'll visit all the + nodes that refer to the old ID and HTML file, and update their `href`. +2. Track all existing Agda HTML IDs that we're inserting. If we transfer + an `` onto the Hugo content, we know we'll need to pick a new + ID for it (since `1234` need not be unique), and that we'll need to redirect + the other links to that new ID as the previous bullet describes. + +Here's how these two methods work: + +```Ruby +def note_defined_href(file, href) + file_hrefs = @local_seen_hrefs.fetch(file) do + @local_seen_hrefs[file] = {} + end + + uniq_id = file_hrefs.fetch(href) do + new_id = "agda-unique-ident-#{@id_counter}" + @id_counter += 1 + file_hrefs[href] = new_id + end + + unless @global_seen_hrefs.include? href + @global_seen_hrefs[href] = { :file => file, :id => uniq_id } + end + + return uniq_id +end + +def note_used_href(file, node, href) + ref_list = @nodes_referencing_href.fetch(href) { @nodes_referencing_href[href] = [] } + ref_list << { :file => file, :node => node } + return href +end +``` + +Note that they use class variables: these are methods on a `FileGroup` class. +I've omitted the various classes I've declared from the above code for brevity, +but here it makes sense to show them. Like I mentioned earlier, you can +view the [complete code here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/6a168f2fe144850ed3a81b796e07266cbf80f382/agda.rb). + +Interestingly, `note_defined_href` makes use of _two_ global maps: +`@local_seen_hrefs` and `@global_seen_hrefs`. This helps satisfy the third +constraint above, which is linking between code defined in the same series. +The logic is as follows: when rewriting a link to a new HTML file and ID, +if the code we're trying to link to exists on the current page, we should link +to that. Otherwise, if the code we're trying to link to was presented in +a different part of the series, then we should link to that other part. +So, we consult the "local" map for `href`s that will be rewritten to HTML +nodes in the current file, and as a fallback, consult the "global" map for +`hrefs` that were introduced in other parts. The `note_defined_href` populates +both maps, and is "biased" towards the first occurrence of a piece of code: +if posts A and B define a function `f`, and post C only references `f`, then +that link will go to post A's definition, which came earlier. + +The other method, `note_used_href`, is simpler. It just appends to a list +of Nokogiri HTML nodes that reference a given `href`. We keep track of the file +in which the reference occurred so we can be sure to consult the right sub-map +of `@local_seen_hrefs` when checking for in-page rewrites. + +After running `process_source_file` on all Hugo HTML files within a particular +series, the following holds true: +* We have inserted `span` or `a` nodes wherever Agda's original output + had nodes with `id` or `href` elements. This is with the exception of the + case where Hugo's inline HTML doesn't "line up" with Agda's inline HTML, + which I've only found to happen when the leading character of an identifier is a digit. +* We have picked new IDs for each HTML node we inserted that had an ID, + noting them both globally and for the current file. We noted their original + `href` value (in the form `File.html#123`) and that it should be transformed + into our globally-unique identifiers, in the form `agda-unique-ident-1234`. +* For each HTML node we inserted that links to another, we noted the `href` + of the reference (also in the form `File.html#123`). + +Now, all that's left is to redirect the `href`s of the nodes we inserted +from their old values to the new ones. I do this by iterating over `@nodes_referencing_href`, +which contains every link we inserted. + +```Ruby +def cross_link_files + @nodes_referencing_href.each do |href, references| + references.each do |reference| + file = reference[:file] + node = reference[:node] + + local_targets = @local_seen_hrefs[file] + if local_targets.include? href + # A code block in this file provides this href, create a local link. + node['href'] = "##{local_targets[href]}" + elsif @global_seen_hrefs.include? href + # A code block in this series, but not in this file, defines + # this href. Create a cross-file link. + target = @global_seen_hrefs[href] + other_file = target[:file] + id = target[:id] + + relpath = Pathname.new(other_file).dirname.relative_path_from(Pathname.new(file).dirname) + node['href'] = "#{relpath}##{id}" + else + # No definitions in any blog page. For now, just delete the anchor. + node.replace node.content + end + end + end +end +``` + +Notice that for the time being, I simply remove links to Agda definitions that +didn't occur in the Hugo post. Ideally, this would link to the plain, non-blog +documentation page generated by Agda; however, this requires either hosting +those documentation pages, or expecting the Agda standard library HTML pages +to remain stable and hosted at a fixed URL. Neither was simple enough to do, +so I opted for the conservative "just don't insert links" approach. {{< todo >}}This isn't as important probably, but might be worth talking about. {{< /todo >}} @@ -253,4 +590,4 @@ end ``` In short, it traverses all the folders in my `code` directory -- which is where -I keep my code, looking for Agda source files. Once it finds them, +I keep my code -- looking for Agda source files. Once it finds them,