Write more about Agda+Hugo

Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>
Danila Fedorin 2024-05-30 00:13:48 -07:00
parent 5232f0a6e2
commit 13636a0d29


@ -2,7 +2,7 @@
title: "Integrating Agda's HTML Output with Hugo"
date: 2024-05-25T21:02:10-07:00
draft: true
tags: ["Agda", "Hugo", "Ruby", "Nix"]
tags: ["Agda", "Hugo", "Ruby"]
---
One of my favorite things about Agda is its clickable HTML pages. If you don't
@ -37,13 +37,29 @@ And here's the denotational semantics for that expression:
Notice that you can click `Expr`, `__`, `⟦`, etc.! All of this integrates
with my existing Hugo site, and only required a little bit of additional
metadata to make it work.
metadata to make it work. The conversion is implemented as
[a Ruby script](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb);
this script transfers the link structure from an Agda-generated documentation
HTML file onto lightly-annotated Hugo code blocks.
Now, the details. Right now, the solution is pretty tailored to my site
and workflow, but the core of the script -- a piece that transfers links
from an Agda HTML file into a syntax highlighted Hugo HTML block -- should
To use the script, your Hugo theme (or your Markdown content) must
annotate the code blocks with several properties:
* `data-agda-block`, which marks code that needs to be processed.
* `data-file-path`, which tells the script what Agda file provided the
code in the block, and therefore what Agda HTML file should be searched
for links.
* `data-first-line` and `data-last-line`, which tell the script what
section of the Agda HTML file should be searched for said links.
Given this -- and a couple of other assumptions, such as that all Agda
projects are in a `code/<project>` folder -- the script post-processes
the HTML files automatically. Right now, the solution is pretty tailored to my
site and workflow, but the core of the script -- the piece that transfers links
from an Agda HTML file into a syntax-highlighted Hugo HTML block -- should
be fairly reusable.
Now, the details.
### The Constraints
The goal was simple: to allow the code blocks on my Hugo-generated site to
have links that take the user to the definition of a given symbol.
@ -84,7 +100,7 @@ In summary:
1. I want to create cross-links between symbols in Agda blocks in a blog post.
2. These code blocks could include code from disjoint files, and be out of order.
3. Code blocks among a whole series of posts should be cross-linked too.
4. The code blocks should be syntax highlighted the same way as the rest of the
4. The code blocks should be syntax highlighting the same way as the rest of the
code on the site.
5. Ideally, I should be able to use my regular method for referencing code.
@ -92,6 +108,7 @@ I've hit all of these requirements; now it's time to dig into how I got there.
### Implementation
#### Processing Agda's HTML Output
It's pretty much a no-go to try to resolve Agda from Hugo, or perform some
sort of "heuristic" to detect cross-links. Agda is a very complex programming
language, and Hugo's templating engine, though powerful, is just not
@ -135,6 +152,7 @@ What I ended up doing is translating Agda's HTML output into offsets and data
about the code block's _plain text_ -- the source code being decorated.
Both the Agda and Hugo HTML describe the same code; thus, the plain text
is the common denominator between the two.
{#plain-text}
I wrote a Ruby script to extract the decorations from the Agda output; here
it is in slightly abridged form. You can find the [original `agda.rb` file here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/04f12b545d5692a78b1a2f13ef968417c749e295/agda.rb).
@ -147,7 +165,7 @@ def process_agda_html_file(file)
document = Nokogiri::HTML.parse(File.open(file))
pre_code = document.css("pre.Agda")[0]
# The traversal postorder; we always visit children before their
# The traversal is postorder; we always visit children before their
# parents, and we visit leaves in sequence.
line_infos = []
offset = 0 # Column index within the current Agda source code line
@ -205,6 +223,325 @@ example above, the script might produce:
]
```
#### Modifying Hugo's HTML
Given such line information, the next step is to transfer it onto existing
Hugo HTML files. Within a file, I've made my `codelines` shortcode emit
custom attributes that can be used to find syntax-highlighted Agda code.
The chief such attribute is `data-agda-block`; my script traverses all
elements with this attribute.
```Ruby
def process_source_file(file, document)
# Process each highlight group that's been marked as an Agda file.
document.css('div[data-agda-block]').each do |t|
# ...
```
To figure out which Agda HTML file to use, and which lines to search for links,
the script also expects some additional attributes.
```Ruby
# ...
first_line, last_line = nil, nil
if first_line_attr = t.attribute("data-first-line")
first_line = first_line_attr.to_s.to_i
end
if last_line_attr = t.attribute("data-last-line")
last_line = last_line_attr.to_s.to_i
end
if first_line and last_line
line_range = first_line..last_line
else
# no line number attributes = the code block contains the whole file
line_range = 1..
end
full_path = t.attribute("data-file-path").to_s
# ...
```
At this point, the Agda file could be in some nested directory, like
`A/B/C/File.agda`. However, the project root -- the place where Agda modules
are compiled from -- could be any one of the folders `A`, `B`, or `C`.
Thus, the fully qualified module name for `File.agda` could be `File`,
`C.File`, `B.C.File`, or `A.B.C.File`. Since Agda's HTML output produces
files named after the fully qualified module name, the script needs to guess
what the module file is. This is where some conventions come into play:
I keep my code in folders directly nested within a top-level `code` directory;
thus, I'll have folders `project1` or `project2`. As a result,
I guess that the first directory should be discarded, while the rest should be
included in the path. The only exception to this is Git submodules: if an Agda
file is included using a submodule, the root directory of the submodule is
considered the Agda project root. My Hugo theme indicates the submodule using
an additional `data-base-path` attribute; in all, that leads to the
following logic:
```Ruby
# ...
full_path_dirs = Pathname(full_path).each_filename.to_a
base_path = t.attribute("data-base-path").to_s
base_dir_depth = 0
if base_path.empty?
# No submodules were used. Assume code/<X> is the root.
# The path of the file is given relative to `code`, so need
# to strip only the one outermost directory.
base_dir_depth = 1
base_path = full_path_dirs[0]
else
# The code is in a submodule. Assume that the base path / submodule
# root is the Agda module root, ignore all folders before that.
base_path_dirs = Pathname(base_path).each_filename.to_a
base_dir_depth = base_path_dirs.length
end
# ...
```
With that, the script determines the actual HTML file path ---
by assuming that there's an `html` folder in the same place as the Agda
project root --- and runs the above `process_agda_html_file`:
```Ruby
# ...
dirs_in_base = full_path_dirs[base_dir_depth..-1]
html_file = dirs_in_base.join(".").gsub(/\.agda$/, ".html")
html_path = File.join(["code", base_path, "html", html_file])
agda_info = process_agda_html_file(html_path)
# ...
```
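
To make the path guessing concrete, here is a minimal, self-contained sketch that folds the two snippets above into one function. The name `agda_html_path` and the example paths are hypothetical (they are not part of `agda.rb`); the logic is meant to mirror the code above, not replace it.

```Ruby
require "pathname"

# Hypothetical helper: maps an Agda source path (relative to `code`) to the
# HTML file that Agda is assumed to have generated for it.
def agda_html_path(full_path, base_path = "")
  dirs = Pathname(full_path).each_filename.to_a
  if base_path.empty?
    # No submodule: code/<X> is assumed to be the Agda project root.
    depth = 1
    base_path = dirs[0]
  else
    # Submodule: its root directory is assumed to be the project root.
    depth = Pathname(base_path).each_filename.to_a.length
  end
  # Directories below the root become the qualified module name.
  html_file = dirs[depth..-1].join(".").gsub(/\.agda$/, ".html")
  File.join("code", base_path, "html", html_file)
end

plain = agda_html_path("project1/Lang/Semantics.agda")
# plain == "code/project1/html/Lang.Semantics.html"

submodule = agda_html_path("sub/agda-lib/Data/Thing.agda", "sub/agda-lib")
# submodule == "code/sub/agda-lib/html/Data.Thing.html"
```

In both cases, everything below the assumed project root is joined with dots, which matches how Agda names its HTML files after fully qualified modules.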
The next step is specific to the output of Hugo's syntax highlighter,
[Chroma](https://github.com/alecthomas/chroma). When line numbers are enabled
-- and they are on my site -- Chroma generates a table that, at some point,
contains a bunch of `span` HTML nodes, each with the `line` class. Each
such `span` corresponds to a single line of output; naturally, the first
one contains the code from `first_line`, the second from `first_line + 1`,
and so on until `last_line`. This is quite convenient, because it saves the
headache of counting newlines the way that the Agda processing code above has to.
For each line of syntax-highlighted code, the script retrieves the corresponding
list of links that were collected from the Agda HTML file.
```Ruby
# ...
lines = t.css("pre.chroma code[data-lang] .line")
lines.zip(line_range).each do |line, line_no|
line_info = agda_info[line_no]
next unless line_info
# ...
```
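
As a small illustration of why `zip` is enough here, the sketch below pairs plain strings (stand-ins for Chroma's `.line` spans, which are Nokogiri nodes in the real script) with line numbers. The sample lines and ranges are made up for the example.

```Ruby
# Stand-ins for the `.line` spans of one code block.
lines = [
  "data Expr : Set where",
  "  num : Nat -> Expr",
  "  add : Expr -> Expr -> Expr"
]

# A block annotated with data-first-line=12 and data-last-line=14.
excerpt = lines.zip(12..14)

# A block with no line attributes: the endless range `1..` supplies
# as many line numbers as there are lines.
whole_file = lines.zip(1..)

# If the range is shorter than the block, `zip` pads with nil.
too_short = lines.zip(12..13)
```

Because `zip` only takes as many elements as the receiver has, the endless range never needs to be materialized.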
The subsequent traversal -- which picks out the plain text of the Agda file
as [reasoned above](#plain-text) -- is very similar to the previous
one. Here too there's an `offset` variable, which gets incremented with
the length of each new plain text piece. Since we know the lines match up
to `span`s, there's no need to count newlines.
```Ruby
# ...
offset = 0
line.traverse do |lt|
if lt.text?
content = lt.content
new_offset = offset + content.length
# ...
```
At this point, we have a line number, and an offset within that line number
that describes the portion of the text under consideration. We can
traverse all the links for the line, and find ones that mark a piece of
text somewhere in this range. For the time being -- since inserting overlapping
spans is quite complicated -- I require the links to lie entirely within a
particular plain text region. As a result, if Chroma splits a single Agda
identifier into several tokens, it will not be linked. For now, this seems
like the most conservative and safe approach.
```Ruby
# ...
matching_links = line_info.links.filter do |link|
link[:from] >= offset and link[:to] <= new_offset
end
# ...
```
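
Here is a tiny sketch of what that restriction means in practice. The helpers `contained?` and `linkable?` are hypothetical, and the way the identifier is split into tokens is invented for illustration; I'm not claiming Chroma tokenizes this particular name that way.

```Ruby
# Same containment test as the filter above, for a single text piece.
def contained?(link, offset, text)
  link[:from] >= offset && link[:to] <= offset + text.length
end

# Walk the text pieces of a line, tracking the running offset, and report
# whether any single piece fully contains the link.
def linkable?(link, tokens)
  offset = 0
  tokens.any? do |text|
    hit = contained?(link, offset, text)
    offset += text.length
    hit
  end
end

link = { from: 0, to: 6 } # covers the six characters of `+-comm`

kept = linkable?(link, ["+-comm", " : ", "Nat"])            # one token
dropped = linkable?(link, ["+", "-", "comm", " : ", "Nat"]) # split up
```

With the identifier in one piece, the link is kept; once it is split across several pieces, no single piece contains the whole range, so the link is skipped.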
All that's left is to slice up the plain text fragment into a bunch of HTML
pieces: the substrings that are links will turn into `a` HTML nodes, while
the substrings that are "in between" the links will be left over as plain
text nodes. The code to do so is relatively verbose, but not all that complicated.
```Ruby
replace_with = []
replace_offset = 0
matching_links.each do |match|
# The link's range is an offset from the beginning of the line,
# but the text piece we're splitting up might be partway into
# the line. Convert the link coordinates to piece-relative ones.
relative_from = match[:from] - offset
relative_to = match[:to] - offset
# If the previous link ended some time before the new link
# began (or if the current link is the first one, and is not
# at the beginning), ensure that the plain text "in between"
# is kept.
replace_with << content[replace_offset...relative_from]
tag = (match.include? :href) ? 'a' : 'span'
new_node = Nokogiri::XML::Node.new(tag, document)
if match.include? :href
# For nodes with links, note what they're referring to, so
# we can adjust their hrefs when we assign global IDs.
href = match[:href].to_s
new_node['href'] = note_used_href file, new_node, href
end
if match.include? :id
# For nodes with IDs visible in the current Hugo file, we'll
# want to redirect links that previously went to other Agda
# module HTML files. So, note the ID that we want to redirect,
# and pick a new unique ID to replace it with.
id = match[:id].to_s
new_node['id'] = note_defined_href file, "#{html_file}##{id}"
end
new_node.content = content[relative_from...relative_to]
replace_with << new_node
replace_offset = relative_to
end
replace_with << content[replace_offset..-1]
```
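
To see the slicing on concrete data, here is a Nokogiri-free sketch of the same idea. It is an illustration under a few assumptions, not the script's code: `split_fragment` is a hypothetical name, links are represented as hashes instead of HTML nodes, the links are sorted by their start, and (unlike the code above) empty in-between strings are skipped.

```Ruby
# Split one plain text piece into plain strings and link hashes.
# `offset` is where the piece starts within its line.
def split_fragment(content, offset, links)
  new_offset = offset + content.length
  matching = links.select { |l| l[:from] >= offset && l[:to] <= new_offset }

  pieces = []
  cursor = 0
  matching.sort_by { |l| l[:from] }.each do |link|
    # Convert line-relative coordinates to piece-relative ones.
    from = link[:from] - offset
    to = link[:to] - offset
    pieces << content[cursor...from] if from > cursor
    pieces << { text: content[from...to], href: link[:href] }
    cursor = to
  end
  pieces << content[cursor..-1] if cursor < content.length
  pieces
end

# The line is `eval : Expr -> Value`; this piece starts at column 7.
links = [
  { from: 0, to: 4, href: "Lang.html#1" },   # `eval`, outside this piece
  { from: 7, to: 11, href: "Lang.html#7" },  # `Expr`
  { from: 15, to: 20, href: "Lang.html#15" } # `Value`
]
pieces = split_fragment("Expr -> Value", 7, links)
```

The result alternates between link hashes and the plain text between them, which is the same shape as the `replace_with` list in the real code.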
There's a little bit of a subtlety in the above code: specifically, I use
the `note_used_href` and `note_defined_href` methods. These are important
for rewriting links. Like I mentioned earlier, Agda's HTML output assumes
that each source file should produce a single HTML file -- named after its
qualified module -- and creates links accordingly. However, my blog posts
interweave multiple source files. Some links that would've jumped to a different
file must now point to an internal identifier within the page. Another
important aspect of the transformation is that, since I'm pulling HTML nodes
from distinct files, it's not guaranteed that each of them will have a unique
`id` attribute. After all, Agda just assigns sequential numbers to each
node that it generates; it would only take, e.g., including the first line
from two distinct modules to end up with two nodes with `id="1"`.
The solution is then twofold:
1. Track all the nodes referencing a particular `href` (made up of an HTML
file and a numerical identifier, like `File.html#123`). When we pick
new IDs -- thus guaranteeing their uniqueness -- we'll visit all the
nodes that refer to the old ID and HTML file, and update their `href`.
2. Track all existing Agda HTML IDs that we're inserting. If we transfer
an `<a id="1234">` onto the Hugo content, we know we'll need to pick a new
ID for it (since `1234` need not be unique), and that we'll need to redirect
the other links to that new ID as the previous bullet describes.
Here's how these two methods work:
```Ruby
def note_defined_href(file, href)
file_hrefs = @local_seen_hrefs.fetch(file) do
@local_seen_hrefs[file] = {}
end
uniq_id = file_hrefs.fetch(href) do
new_id = "agda-unique-ident-#{@id_counter}"
@id_counter += 1
file_hrefs[href] = new_id
end
unless @global_seen_hrefs.include? href
@global_seen_hrefs[href] = { :file => file, :id => uniq_id }
end
return uniq_id
end
def note_used_href(file, node, href)
ref_list = @nodes_referencing_href.fetch(href) { @nodes_referencing_href[href] = [] }
ref_list << { :file => file, :node => node }
return href
end
```
Note that they use instance variables: these are methods on a `FileGroup` class.
I've omitted the various classes I've declared from the above code for brevity,
but here it makes sense to show them. Like I mentioned earlier, you can
view the [complete code here](https://dev.danilafe.com/Web-Projects/blog-static/src/commit/6a168f2fe144850ed3a81b796e07266cbf80f382/agda.rb).
Interestingly, `note_defined_href` makes use of _two_ global maps:
`@local_seen_hrefs` and `@global_seen_hrefs`. This helps satisfy the third
constraint above, which is linking between code defined in the same series.
The logic is as follows: when rewriting a link to a new HTML file and ID,
if the code we're trying to link to exists on the current page, we should link
to that. Otherwise, if the code we're trying to link to was presented in
a different part of the series, then we should link to that other part.
So, we consult the "local" map for `href`s that will be rewritten to HTML
nodes in the current file, and as a fallback, consult the "global" map for
`href`s that were introduced in other parts. The `note_defined_href` method populates
both maps, and is "biased" towards the first occurrence of a piece of code:
if posts A and B define a function `f`, and post C only references `f`, then
that link will go to post A's definition, which came earlier.
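
The local-then-global lookup can be sketched in isolation. The `HrefRegistry` class below is a hypothetical, stripped-down stand-in for the relevant part of `FileGroup` (the `resolve` method in particular does not exist in the script), and the file names are made up; it only demonstrates the "first occurrence wins" behavior described above.

```Ruby
class HrefRegistry
  def initialize
    @local = {}   # file => { original href => new unique ID }
    @global = {}  # original href => { file:, id: } of first definition
    @counter = 0
  end

  # Record that `file` contains the definition originally at `href`.
  def note_defined(file, href)
    file_hrefs = (@local[file] ||= {})
    unless file_hrefs.key?(href)
      file_hrefs[href] = "agda-unique-ident-#{@counter}"
      @counter += 1
    end
    id = file_hrefs[href]
    # Only the first file to define an href is recorded globally.
    @global[href] ||= { file: file, id: id }
    id
  end

  # Prefer a definition on the same page, then fall back to the series.
  def resolve(file, href)
    local = @local.fetch(file, {})
    return { file: file, id: local[href] } if local.key?(href)
    @global[href]
  end
end

registry = HrefRegistry.new
registry.note_defined("a/index.html", "Lang.html#12")
registry.note_defined("b/index.html", "Lang.html#12")

from_c = registry.resolve("c/index.html", "Lang.html#12")
from_b = registry.resolve("b/index.html", "Lang.html#12")
missing = registry.resolve("c/index.html", "Std.html#1")
```

Here a reference from post C resolves to post A's definition, a reference from post B stays on post B, and an `href` nobody defined resolves to `nil`.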
The other method, `note_used_href`, is simpler. It just appends to a list
of Nokogiri HTML nodes that reference a given `href`. We keep track of the file
in which the reference occurred so we can be sure to consult the right sub-map
of `@local_seen_hrefs` when checking for in-page rewrites.
After running `process_source_file` on all Hugo HTML files within a particular
series, the following holds true:
* We have inserted `span` or `a` nodes wherever Agda's original output
had nodes with `id` or `href` attributes. This is with the exception of the
case where Hugo's inline HTML doesn't "line up" with Agda's inline HTML,
which I've only found to happen when the leading character of an identifier is a digit.
* We have picked new IDs for each HTML node we inserted that had an ID,
noting them both globally and for the current file. We noted their original
`href` value (in the form `File.html#123`) and that it should be transformed
into our globally-unique identifiers, in the form `agda-unique-ident-1234`.
* For each HTML node we inserted that links to another, we noted the `href`
of the reference (also in the form `File.html#123`).
Now, all that's left is to redirect the `href`s of the nodes we inserted
from their old values to the new ones. I do this by iterating over `@nodes_referencing_href`,
which contains every link we inserted.
```Ruby
def cross_link_files
@nodes_referencing_href.each do |href, references|
references.each do |reference|
file = reference[:file]
node = reference[:node]
local_targets = @local_seen_hrefs[file]
if local_targets.include? href
# A code block in this file provides this href, create a local link.
node['href'] = "##{local_targets[href]}"
elsif @global_seen_hrefs.include? href
# A code block in this series, but not in this file, defines
# this href. Create a cross-file link.
target = @global_seen_hrefs[href]
other_file = target[:file]
id = target[:id]
relpath = Pathname.new(other_file).dirname.relative_path_from(Pathname.new(file).dirname)
node['href'] = "#{relpath}##{id}"
else
# No definitions in any blog page. For now, just delete the anchor.
node.replace node.content
end
end
end
end
```
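
The cross-file branch is the least obvious one, so here is a small sketch of just the relative path computation. The helper name `cross_file_href` and the file layout (one `index.html` per post directory) are assumptions made for the example.

```Ruby
require "pathname"

# Hypothetical helper mirroring the cross-file branch above: build a link
# from the page at `from_file` to an ID on the page at `to_file`.
def cross_file_href(from_file, to_file, id)
  from_dir = Pathname.new(from_file).dirname
  to_dir = Pathname.new(to_file).dirname
  relpath = to_dir.relative_path_from(from_dir)
  "#{relpath}##{id}"
end

href = cross_file_href(
  "blog/agda_part1/index.html",
  "blog/agda_part2/index.html",
  "agda-unique-ident-7"
)
# href == "../agda_part2#agda-unique-ident-7"
```

Since both paths are reduced to their directories first, the resulting link points at the other post's folder rather than at its `index.html`.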
Notice that for the time being, I simply remove links to Agda definitions that
didn't occur in the Hugo post. Ideally, this would link to the plain, non-blog
documentation page generated by Agda; however, this requires either hosting
those documentation pages, or expecting the Agda standard library HTML pages
to remain stable and hosted at a fixed URL. Neither was simple enough to do,
so I opted for the conservative "just don't insert links" approach.
{{< todo >}}This isn't as important probably, but might be worth talking about. {{< /todo >}}
@ -253,4 +590,4 @@ end
```
In short, it traverses all the folders in my `code` directory -- which is where
I keep my code, looking for Agda source files. Once it finds them,
I keep my code -- looking for Agda source files. Once it finds them,