blog-static/content/blog/agda_hugo.md

560 lines
25 KiB
Markdown
Raw Normal View History

---
title: "Integrating Agda's HTML Output with Hugo"
date: 2024-05-30T00:29:26-07:00
tags: ["Agda", "Hugo", "Ruby"]
---
One of my favorite things about Agda are its clickable HTML pages. If you don't
know what they are, that's pages like [`Data.List.Properties`](https://agda.github.io/agda-stdlib/master/Data.List.Properties.html);
they just give the code from a particular Agda file, but make every identifier
clickable. Then, if you see some variable or function that you don't know, you
can just click it and jump right to it! It makes exploring the documentation
a lot smoother. I've found that these HTML pages provide all the information
I need for writing proofs.
Recently, I've been writing a fair bit about Agda; mostly about the patterns
that I've learned about, such as the ["is something" pattern]({{< relref "agda_is_pattern" >}})
and the ["deeply embedded expression" trick]({{< relref "agda_expr_pattern" >}}).
I've found myself wanting to click on definitions in my own code blocks; recently,
I got this working, and I wanted to share how I did it, in case someone else
wants to integrate Agda into their own static website. Though my stack
is based on Hugo, the general idea should work with any other static site
generator.
### TL;DR and Demo
I wrote a script to transfer links from an Agda HTML file into Hugo's HTML
output, making it possible to embellish "plain" Hugo output with Agda's
'go-to-definition links'. It looks like this. Here's an Agda code block
defining an 'expression' data type, from a project of mine:
{{< codelines "Agda" "agda-spa/Lattice/Map.agda" 543 546 >}}
And here's the denotational semantics for that expression:
{{< codelines "Agda" "agda-spa/Lattice/Map.agda" 586 589 >}}
Notice that you can click `Expr`, `__`, `⟦`, etc.! All of this integrates
with my existing Hugo site, and only required a little bit of additional
metadata to make it work. The conversion is implemented as
[a Ruby script](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb);
this script transfers the link structure from an Agda-generated documentation
HTML file onto lightly-annotated Hugo code blocks.
To use the script, your Hugo theme (or your Markdown content) must
annotate the code blocks with several properties:
* `data-agda-block`, which marks code that needs to be processed.
* `data-file-path`, which tells the script what Agda file provided the
code in the block, and therefore what Agda HTML file should be searched
for links.
* `data-first-line` and `data-last-line`, which tell the script what
section of the Agda HTML file should be searched for said links.
Given this -- and a couple of other assumptions, such as that all Agda
projects are in a `code/<project>` folder, the script post-processes
the HTML files automatically. Right now, the solution is pretty tailored to my
site and workflow, but the core of the script -- the piece that transfers links
from an Agda HTML file into a syntax-highlighted Hugo HTML block -- should
be fairly reusable.
Now, the details.
### The Constraints
The goal was simple: to allow the code blocks on my Hugo-generated site to
have links that take the user to the definition of a given symbol.
Specifically, if the symbol occurs somewhere on the same blog page, the link
should take the user there (and not to a regular `Module.html` file). That
way, the reader can not only get to the code that they want to see, but also
have a chance to read the surrounding prose in properly-rendered Markdown.
Next, unlike standard "literate Agda" files, my blog posts are not single
`.agda` files with Markdown in comments. Rather, I use regular Hugo
Markdown, and present portions of an existing project, weaving together many
files, and showing the fragments out of order. So, my tool needs to support
links that come from distinct modules, in any order.
Additionally, I've recently been writing a whole series about an Agda project
of mine; in this series, I gradually build up to the final product, explaining
one or two modules at a time. I would expect that links on pages in this series
could jump to other pages in the same series: if I cover module `A` in part 1,
then write `A.f` in part 2, clicking on `A` -- and maybe `f` -- should take
the reader back to the first part's page; once again, this would help provide
them with the surrounding explanation.
Finally, I wanted the Agda code to appear exactly the same as any other code
on my site, including the Hugo-provided syntax highlighting and theme. This
ruled out just copy-pasting pieces of the Agda-generated HTML in place of
code blocks on my page (and redirecting the links). Thought it was not
a hard requirement, I also hoped to include Agda code in the same
manner that I include all other code: [my `codelines` shortcode]({{< relref "codelines" >}}).
In brief, the `codelines` shortcode creates a syntax-highlighted code block,
as well as a surrounding "context" that says what file the code is from,
which lines are listed, and where to find the full code (e.g., on my Git server).
It looks something like this:
{{< codelines "Agda" "agda-spa/Language/Base.agda" 12 20 >}}
In summary:
1. I want to create cross-links between symbols in Agda blocks in a blog post.
2. These code blocks could include code from disjoint files, and be out of order.
3. Code blocks among a whole series of posts should be cross-linked too.
4. The code blocks should be syntax highlighted the same way as the rest of the
code on the site.
5. Ideally, I should be able to use my regular method for referencing code.
I've hit all of these requirements; now it's time to dig into how I got there.
### Implementation
#### Processing Agda's HTML Output
It's pretty much a no-go to try to resolve Agda from Hugo, or perform some
sort of "heuristic" to detect cross-links. Agda is a very complex programming
language, and Hugo's templating engine, though powerful, is just not
up to this task. Fortunately, Agda has support for
[HTML output using the `--html` flag](https://agda.readthedocs.io/en/v2.6.4.3-r1/tools/generating-html.html).
As a build step, I can invoke Agda on files that are referenced by my blog,
and generate HTML. This would decidedly slow down the site build process,
but it would guarantee accurate link information.
On the other hand, to satisfy the 4th constraint, I need to somehow mimic --
or keep -- the format of Hugo's existing HTML output. The easiest way to
do this without worrying about breaking changes and version incompatibility
is to actually use the existing syntax-highlighted HTML, and annotate it
with links as I discover them. Effectively, what I need to do is a "link
transfer": I need to identify regions of code that are highlighted in Agda's HTML,
find those regions in Hugo's HTML output, and mark them with links. In addition,
I'll need to fix up the links themselves: the HTML output assumes that each
Agda file is its own HTML page, but this is ruled out by the second constraint of mine.
As a little visualization, the overall problems looks something like this:
````Agda {linenos=table}
-- Agda's HTML output (blocks of 't' are links):
-- |tttttt| |tttt| |t| |t| |ttttt|
module ModX ( x : T ) where
-- |tttttt| |tt|t| |t| |t| |ttttt|
-- Hugo's HTML output (blocks of 't' are syntax highlighting spans)
````
Both Agda and Hugo output a preformatted code block, decorated with various
inline HTMl that indicates information (token color for Hugo; symbol IDs and
links in Agda). However, Agda and Hugo do not use the same process to create
this decorated output; it's entirely possible -- and not uncommon -- for
Hugo and Agda to produce misaligned HTML nodes. In my diagram above,
this is reflected as `ModX` being considered a single token by Agda, but
two tokens (`Mod` and `X`) by the syntax highlighter. As a result, it's
difficult to naively iterate the two HTML formats in parallel.
What I ended up doing is translating Agda's HTML output into offsets and data
about the code block's _plain text_ -- the source code being decorated.
Both the Agda and Hugo HTML describe the same code; thus, the plain text
is the common denominator between the two.
{#plain-text}
I wrote a Ruby script to extract the decorations from the Agda output; here
it is in slightly abridged form. You can find the [original `agda.rb` file here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb).
```Ruby
# Traverse the preformatted Agda block in the given Agda HTML file
# and find which textual ranges have IDs and links to other ranges.
# Store this information in a hash, line => links[]
def process_agda_html_file(file)
document = Nokogiri::HTML.parse(File.open(file))
pre_code = document.css("pre.Agda")[0]
# The traversal is postorder; we always visit children before their
# parents, and we visit leaves in sequence.
line_infos = []
offset = 0 # Column index within the current Agda source code line
line = 1
pre_code.traverse do |at|
# Text nodes are always leaves; visiting a new leaf means we've advanced
# in the text by the length of that text. However, if there are newlines
# -- since this is a preformatted block -- we also advanced by a line.
# At this time, do not support links that span multiple lines, but
# Agda doesn't produce those either.
if at.text?
if at.content.include? "\n"
raise "no support for links with newlines inside" if at.parent.name != "pre"
# Increase the line and track the final offset. Written as a loop
# in case we eventually want to add some handling for the pieces
# sandwiched between newlines.
at.content.split("\n", -1).each_with_index do |bit, idx|
line += 1 unless idx == 0
offset = bit.length
end
else
# It's not a newline node. Just adjust the offset within the plain text.
offset += at.content.length
end
elsif at.name == "a"
# Agda emits both links and things-to-link-to as 'a' nodes.
line_info = line_infos.fetch(line) { line_infos[line] = [] }
href = at.attribute("href")
id = at.attribute("id")
if href or id
new_node = { :from => offset-at.content.length, :to => offset }
new_node[:href] = href if href
new_node[:id] = id if id
line_info << new_node
end
end
end
return line_infos
end
```
This script takes an Agda HTML file and returns a map in which each line
of the Agda source code is associated with a list of ranges; the ranges
indicate links or places that can be linked to. For example, for the `ModX`
example above, the script might produce:
```Ruby
3 => [
{ :from => 3, :to => 9, id => "..." }, # Agda creates <a> nodes even for keywords.
{ :from => 12, :to => 16, id => "ModX-id" }, # The IDs Agda generates aren't usually this nice.
{ :from => 20, :to => 21, id => "x-id" },
]
```
#### Modifying Hugo's HTML
Given such line information, the next step is to transfer it onto existing
Hugo HTML files. Within a file, I've made my `codelines` shortcode emit
custom attributes that can be used to find syntax-highlighted Agda code.
The chief such attribute is `data-agda-block`; my script traverses all
elements with this attribute.
```Ruby
def process_source_file(file, document)
# Process each highlight group that's been marked as an Agda file.
document.css('div[data-agda-block]').each do |t|
# ...
```
To figure out which Agda HTML file to use, and which lines to search for links,
the script also expects some additional attributes.
```Ruby
# ...
first_line, last_line = nil, nil
if first_line_attr = t.attribute("data-first-line")
first_line = first_line_attr.to_s.to_i
end
if last_line_attr = t.attribute("data-last-line")
last_line = last_line_attr.to_s.to_i
end
if first_line and last_line
line_range = first_line..last_line
else
# no line number attributes = the code block contains the whole file
line_range = 1..
end
full_path = t.attribute("data-file-path").to_s
# ...
```
At this point, the Agda file could be in some nested directory, like
`A/B/C/File.agda`. However, the project root -- the place where Agda modules
are compiled from -- could be any one of the folders `A`, `B`, or `C`.
Thus, the fully qualified module name for `File.agda` could be `File`,
`C.File`, `B.C.File`, or `A.B.C.File`. Since Agda's HTML output produces
files named after the fully qualified module name, the script needs to guess
what the module file is. This is where some conventions come in play:
I keep my code in folders directly nested within a top-level `code` directory;
thus, I'll have folders `project1` or `project2` inside `code`, and those
will always be project roots. As a result,
I guess that the first directory relative to `code` should be discarded,
while the rest should be included in the path. The only exception to this is
Git submodules: if an Agda file is included using a submodule, the root
directory of the submodule is considered the Agda project root. My Hugo theme
indicates the submodule using an additional `data-base-path` attribute; in all,
that leads to the following logic:
```Ruby
# ...
full_path_dirs = Pathname(full_path).each_filename.to_a
base_path = t.attribute("data-base-path").to_s
base_dir_depth = 0
if base_path.empty?
# No submodules were used. Assume code/<X> is the root.
# The path of the file is given relative to `code`, so need
# to strip only the one outermost directory.
base_dir_depth = 1
base_path = full_path_dirs[0]
else
# The code is in a submodule. Assume that the base path / submodule
# root is the Agda module root, ignore all folders before that.
base_path_dirs = Pathname(base_path).each_filename.to_a
base_dir_depth = base_path_dirs.length
end
# ...
```
With that, the script determines the actual HTML file path ---
by assuming that there's an `html` folder in the same place as the Agda
project root --- and runs the above `process_agda_html_file`:
```Ruby
# ...
dirs_in_base = full_path_dirs[base_dir_depth..-1]
html_file = dirs_in_base.join(".").gsub(/\.agda$/, ".html")
html_path = File.join(["code", base_path, "html", html_file])
agda_info = process_agda_html_file(html_path)
# ...
```
The next step is specific to the output of Hugo's syntax highlighter,
[Chroma](https://github.com/alecthomas/chroma). When line numbers are enabled
-- and they are on my site -- Chroma generates a table that, at some point,
contains a bunch of `span` HTML nodes, each with the `line` class. Each
such `span` corresponds to a single line of output; naturally, the first
one contains the code from `first_line`, the second from `first_line + 1`,
and so on until `last_line`. This is quite convenient, because it saves the
headache of counting newlines the way that the Agda processing code above has to.
For each line of syntax-highlighted code, the script retrieves the corresponding
list of links that were collected from the Agda HTML file.
```Ruby
# ...
lines = t.css("pre.chroma code[data-lang] .line")
lines.zip(line_range).each do |line, line_no|
line_info = agda_info[line_no]
next unless line_info
# ...
```
The subsequent traversal -- which picks out the plain text of the Agda file
as [reasoned above](#plain-text) -- is very similar to the previous
one. Here too there's an `offset` variable, which gets incremented with
the length of a new plain text pieces. Since we know the lines match up
to `span`s, there's no need to count newlines.
```Ruby
# ...
offset = 0
line.traverse do |lt|
if lt.text?
content = lt.content
new_offset = offset + content.length
# ...
```
At this point, we have a line number, and an offset within that line number
that describes the portion of the text under consideration. We can
traverse all the links for the line, and find ones that mark a piece of
text somewhere in this range. For the time being -- since inserting overlapping
spans is quite complicated -- I require the links to lie entirely within a
particular plain text region. As a result, if Chroma splits a single Agda
identifier into several tokens, it will not be linked. For now, this seems
like the most conservative and safe approach.
```Ruby
# ...
matching_links = line_info.links.filter do |link|
link[:from] >= offset and link[:to] <= new_offset
end
# ...
```
All that's left is to slice up the plain text fragment into a bunch of HTML
pieces: the substrings that are links will turn into `a` HTML nodes, while
the substrings that are "in between" the links will be left over as plain
text nodes. The code to do so is relatively verbose, but not all that complicated.
```Ruby
replace_with = []
replace_offset = 0
matching_links.each do |match|
# The link's range is an offset from the beginning of the line,
# but the text piece we're splitting up might be partway into
# the line. Convert the link coordinates to piece-relative ones.
relative_from = match[:from] - offset
relative_to = match[:to] - offset
# If the previous link ended some time before the new link
# began (or if the current link is the first one, and is not
# at the beginning), ensure that the plain text "in between"
# is kept.
replace_with << content[replace_offset...relative_from]
tag = (match.include? :href) ? 'a' : 'span'
new_node = Nokogiri::XML::Node.new(tag, document)
if match.include? :href
# For nodes with links, note what they're referring to, so
# we can adjust their hrefs when we assign global IDs.
href = match[:href].to_s
new_node['href'] = note_used_href file, new_node, href
end
if match.include? :id
# For nodes with IDs visible in the current Hugo file, we'll
# want to redirect links that previously go to other Agda
# module HTML files. So, note the ID that we want to redirect,
# and pick a new unique ID to replace it with.
id = match[:id].to_s
new_node['id'] = note_defined_href file, "#{html_file}##{id}"
end
new_node.content = content[relative_from...relative_to]
replace_with << new_node
replace_offset = relative_to
end
replace_with << content[replace_offset..-1]
```
There's a little bit of a subtlety in the above code: specifically, I use
the `note_used_href` and `note_defined_href` methods. These are important
for rewriting links. Like I mentioned earlier, Agda's HTML output assumes
that each source file should produce a single HTML file -- named after its
qualified module -- and creates links accordingly. However, my blog posts
interweave multiple source files. Some links that would've jumped to a different
file must now point to an internal identifier within the page. Another
important aspect of the transformation is that, since I'm pulling HTML files
from distinct files, it's not guaranteed that each of them will have a unique
`id` attribute. After all, Agda just assigns sequential numbers to each
node that it generates; it would only take, e.g., including the first line
from two distinct modules to end up with two nodes with `id="1"`.
The solution is then twofold:
1. Track all the nodes referencing a particular `href` (made up of an HTML
file and a numerical identifier, like `File.html#123`). When we pick
new IDs -- thus guaranteeing their uniqueness -- we'll visit all the
nodes that refer to the old ID and HTML file, and update their `href`.
2. Track all existing Agda HTML IDs that we're inserting. If we transfer
an `<a id="1234">` onto the Hugo content, we know we'll need to pick a new
ID for it (since `1234` need not be unique), and that we'll need to redirect
the other links to that new ID as the previous bullet describes.
Here's how these two methods work:
```Ruby
def note_defined_href(file, href)
file_hrefs = @local_seen_hrefs.fetch(file) do
@local_seen_hrefs[file] = {}
end
uniq_id = file_hrefs.fetch(href) do
new_id = "agda-unique-ident-#{@id_counter}"
@id_counter += 1
file_hrefs[href] = new_id
end
unless @global_seen_hrefs.include? href
@global_seen_hrefs[href] = { :file => file, :id => uniq_id }
end
return uniq_id
end
def note_used_href(file, node, href)
ref_list = @nodes_referencing_href.fetch(href) { @nodes_referencing_href[href] = [] }
ref_list << { :file => file, :node => node }
return href
end
```
Note that they use class variables: these are methods on a `FileGroup` class.
I've omitted the various classes I've declared from the above code for brevity,
but here it makes sense to show them. Like I mentioned earlier, you can
view the [complete code here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/6a168f2fe144850ed3a81b796e07266cbf80f382/agda.rb).
Interestingly, `note_defined_href` makes use of _two_ global maps:
`@local_seen_hrefs` and `@global_seen_hrefs`. This helps satisfy the third
constraint above, which is linking between code defined in the same series.
The logic is as follows: when rewriting a link to a new HTML file and ID,
if the code we're trying to link to exists on the current page, we should link
to that. Otherwise, if the code we're trying to link to was presented in
a different part of the series, then we should link to that other part.
So, we consult the "local" map for `href`s that will be rewritten to HTML
nodes in the current file, and as a fallback, consult the "global" map for
`hrefs` that were introduced in other parts. The `note_defined_href` populates
both maps, and is "biased" towards the first occurrence of a piece of code:
if posts A and B define a function `f`, and post C only references `f`, then
that link will go to post A's definition, which came earlier.
The other method, `note_used_href`, is simpler. It just appends to a list
of Nokogiri HTML nodes that reference a given `href`. We keep track of the file
in which the reference occurred so we can be sure to consult the right sub-map
of `@local_seen_hrefs` when checking for in-page rewrites.
After running `process_source_file` on all Hugo HTML files within a particular
series, the following holds true:
* We have inserted `span` or `a` nodes wherever Agda's original output
had nodes with `id` or `href` elements. This is with the exception of the
case where Hugo's inline HTML doesn't "line up" with Agda's inline HTML,
which I've only found to happen when the leading character of an identifier is a digit.
* We have picked new IDs for each HTML node we inserted that had an ID,
noting them both globally and for the current file. We noted their original
`href` value (in the form `File.html#123`) and that it should be transformed
into our globally-unique identifiers, in the form `agda-unique-ident-1234`.
* For each HTML node we inserted that links to another, we noted the `href`
of the reference (also in the form `File.html#123`).
Now, all that's left is to redirect the `href`s of the nodes we inserted
from their old values to the new ones. I do this by iterating over `@nodes_referencing_href`,
which contains every link we inserted.
```Ruby
def cross_link_files
@nodes_referencing_href.each do |href, references|
references.each do |reference|
file = reference[:file]
node = reference[:node]
local_targets = @local_seen_hrefs[file]
if local_targets.include? href
# A code block in this file provides this href, create a local link.
node['href'] = "##{local_targets[href]}"
elsif @global_seen_hrefs.include? href
# A code block in this series, but not in this file, defines
# this href. Create a cross-file link.
target = @global_seen_hrefs[href]
other_file = target[:file]
id = target[:id]
relpath = Pathname.new(other_file).dirname.relative_path_from(Pathname.new(file).dirname)
node['href'] = "#{relpath}##{id}"
else
# No definitions in any blog page. For now, just delete the anchor.
node.replace node.content
end
end
end
end
```
Notice that for the time being, I simply remove links to Agda definitions that
didn't occur in the Hugo post. Ideally, this would link to the plain, non-blog
documentation page generated by Agda; however, this requires either hosting
those documentation pages, or expecting the Agda standard library HTML pages
to remain stable and hosted at a fixed URL. Neither was simple enough to do,
so I opted for the conservative "just don't insert links" approach.
And that's all of the approach that I wanted to show off today! There
are other details, like finding posts in the same series (I achieve this
with a `meta` element) and invoking `agda --html` on the necessary source files
(my [`build-agda-html.rb`](https://dev.danilafe.com/Web-Projects/blog-static/src/branch/master/build-agda-html.rb)
script is how I personally do this), but I don't think it's all that valuable
to describe them here.
Unfortunately, the additional metadata I had my theme insert makes it
harder for others to use this approach out of the box. However, I hope that by
sharing my experience, others who write Agda and post about it might be able
to get a similar solution working. And of course, it's always fun to write
about a recent project or endeavor.
Happy (dependently typed) programming and blogging!