blog-static/agda_hugo.md at master

Danila Fedorin 406c934b7a Update code in blog post to match new line numbers

Signed-off-by: Danila Fedorin <danila.fedorin@gmail.com>

2024-08-18 13:22:28 -10:00

25 KiB

Raw Permalink Blame History

title

date

TL;DR and Demo

I wrote a script to transfer links from an Agda HTML file into Hugo's HTML output, making it possible to embellish "plain" Hugo output with Agda's 'go-to-definition links'. It looks like this. Here's an Agda code block defining an 'expression' data type, from a project of mine:

And here's the denotational semantics for that expression:

Notice that you can click Expr, _∪_, ⟦, etc.! All of this integrates with my existing Hugo site, and only required a little bit of additional metadata to make it work. The conversion is implemented as a Ruby script; this script transfers the link structure from an Agda-generated documentation HTML file onto lightly-annotated Hugo code blocks.

To use the script, your Hugo theme (or your Markdown content) must annotate the code blocks with several properties:

data-agda-block, which marks code that needs to be processed.
data-file-path, which tells the script what Agda file provided the code in the block, and therefore what Agda HTML file should be searched for links.
data-first-line and data-last-line, which tell the script what section of the Agda HTML file should be searched for said links.

Given this -- and a couple of other assumptions, such as that all Agda projects are in a code/<project> folder, the script post-processes the HTML files automatically. Right now, the solution is pretty tailored to my site and workflow, but the core of the script -- the piece that transfers links from an Agda HTML file into a syntax-highlighted Hugo HTML block -- should be fairly reusable.

Now, the details.

The Constraints

The goal was simple: to allow the code blocks on my Hugo-generated site to have links that take the user to the definition of a given symbol. Specifically, if the symbol occurs somewhere on the same blog page, the link should take the user there (and not to a regular Module.html file). That way, the reader can not only get to the code that they want to see, but also have a chance to read the surrounding prose in properly-rendered Markdown.

Next, unlike standard "literate Agda" files, my blog posts are not single .agda files with Markdown in comments. Rather, I use regular Hugo Markdown, and present portions of an existing project, weaving together many files, and showing the fragments out of order. So, my tool needs to support links that come from distinct modules, in any order.

Additionally, I've recently been writing a whole series about an Agda project of mine; in this series, I gradually build up to the final product, explaining one or two modules at a time. I would expect that links on pages in this series could jump to other pages in the same series: if I cover module A in part 1, then write A.f in part 2, clicking on A -- and maybe f -- should take the reader back to the first part's page; once again, this would help provide them with the surrounding explanation.

Finally, I wanted the Agda code to appear exactly the same as any other code on my site, including the Hugo-provided syntax highlighting and theme. This ruled out just copy-pasting pieces of the Agda-generated HTML in place of code blocks on my page (and redirecting the links). Thought it was not a hard requirement, I also hoped to include Agda code in the same manner that I include all other code: [my codelines shortcode]({{< relref "codelines" >}}). In brief, the codelines shortcode creates a syntax-highlighted code block, as well as a surrounding "context" that says what file the code is from, which lines are listed, and where to find the full code (e.g., on my Git server). It looks something like this:

In summary:

I want to create cross-links between symbols in Agda blocks in a blog post.
These code blocks could include code from disjoint files, and be out of order.
Code blocks among a whole series of posts should be cross-linked too.
The code blocks should be syntax highlighted the same way as the rest of the code on the site.
Ideally, I should be able to use my regular method for referencing code.

I've hit all of these requirements; now it's time to dig into how I got there.

Implementation

Processing Agda's HTML Output

It's pretty much a no-go to try to resolve Agda from Hugo, or perform some sort of "heuristic" to detect cross-links. Agda is a very complex programming language, and Hugo's templating engine, though powerful, is just not up to this task. Fortunately, Agda has support for HTML output using the --html flag. As a build step, I can invoke Agda on files that are referenced by my blog, and generate HTML. This would decidedly slow down the site build process, but it would guarantee accurate link information.

On the other hand, to satisfy the 4th constraint, I need to somehow mimic -- or keep -- the format of Hugo's existing HTML output. The easiest way to do this without worrying about breaking changes and version incompatibility is to actually use the existing syntax-highlighted HTML, and annotate it with links as I discover them. Effectively, what I need to do is a "link transfer": I need to identify regions of code that are highlighted in Agda's HTML, find those regions in Hugo's HTML output, and mark them with links. In addition, I'll need to fix up the links themselves: the HTML output assumes that each Agda file is its own HTML page, but this is ruled out by the second constraint of mine.

As a little visualization, the overall problems looks something like this:



1
2
3
4
5


-- Agda's HTML output (blocks of 't' are links):
-- |tttttt| |tttt|  |t|  |t| |ttttt|
    module   ModX  ( x  : T ) where
-- |tttttt| |tt|t|  |t|  |t| |ttttt|
-- Hugo's HTML output (blocks of 't' are syntax highlighting spans)

Both Agda and Hugo output a preformatted code block, decorated with various inline HTMl that indicates information (token color for Hugo; symbol IDs and links in Agda). However, Agda and Hugo do not use the same process to create this decorated output; it's entirely possible -- and not uncommon -- for Hugo and Agda to produce misaligned HTML nodes. In my diagram above, this is reflected as ModX being considered a single token by Agda, but two tokens (Mod and X) by the syntax highlighter. As a result, it's difficult to naively iterate the two HTML formats in parallel.

What I ended up doing is translating Agda's HTML output into offsets and data about the code block's plain text -- the source code being decorated. Both the Agda and Hugo HTML describe the same code; thus, the plain text is the common denominator between the two. {#plain-text}

I wrote a Ruby script to extract the decorations from the Agda output; here it is in slightly abridged form. You can find the original agda.rb file here.

# Traverse the preformatted Agda block in the given Agda HTML file
# and find which textual ranges have IDs and links to other ranges.
# Store this information in a hash, line => links[]
def process_agda_html_file(file)
  document = Nokogiri::HTML.parse(File.open(file))
  pre_code = document.css("pre.Agda")[0]

  # The traversal is postorder; we always visit children before their
  # parents, and we visit leaves in sequence.
  line_infos = []
  offset = 0 # Column index within the current Agda source code line
  line = 1
  pre_code.traverse do |at|
    # Text nodes are always leaves; visiting a new leaf means we've advanced
    # in the text by the length of that text. However, if there are newlines
    # -- since this is a preformatted block -- we also advanced by a line.
    # At this time, do not support links that span multiple lines, but
    # Agda doesn't produce those either.
    if at.text?
      if at.content.include? "\n"
        raise "no support for links with newlines inside" if at.parent.name != "pre"

        # Increase the line and track the final offset. Written as a loop
        # in case we eventually want to add some handling for the pieces
        # sandwiched between newlines.
        at.content.split("\n", -1).each_with_index do |bit, idx|
          line += 1 unless idx == 0
          offset = bit.length
        end
      else
        # It's not a newline node. Just adjust the offset within the plain text.
        offset += at.content.length
      end
    elsif at.name == "a"
      # Agda emits both links and things-to-link-to as 'a' nodes.

      line_info = line_infos.fetch(line) { line_infos[line] = [] }
      href = at.attribute("href")
      id = at.attribute("id")
      if href or id
        new_node = { :from => offset-at.content.length, :to => offset }
        new_node[:href] = href if href
        new_node[:id] = id if id

        line_info << new_node
      end
    end
  end
  return line_infos
end

This script takes an Agda HTML file and returns a map in which each line of the Agda source code is associated with a list of ranges; the ranges indicate links or places that can be linked to. For example, for the ModX example above, the script might produce:

3 => [
  { :from => 3, :to => 9, id => "..." },       # Agda creates <a> nodes even for keywords.
  { :from => 12, :to => 16, id => "ModX-id" }, # The IDs Agda generates aren't usually this nice.
  { :from => 20, :to => 21, id => "x-id" },
]

Modifying Hugo's HTML

Given such line information, the next step is to transfer it onto existing Hugo HTML files. Within a file, I've made my codelines shortcode emit custom attributes that can be used to find syntax-highlighted Agda code. The chief such attribute is data-agda-block; my script traverses all elements with this attribute.

def process_source_file(file, document)
  # Process each highlight group that's been marked as an Agda file.
  document.css('div[data-agda-block]').each do |t|
    # ...

To figure out which Agda HTML file to use, and which lines to search for links, the script also expects some additional attributes.

    # ...
    first_line, last_line = nil, nil

    if first_line_attr = t.attribute("data-first-line")
      first_line = first_line_attr.to_s.to_i
    end
    if last_line_attr = t.attribute("data-last-line")
      last_line = last_line_attr.to_s.to_i
    end

    if first_line and last_line
      line_range = first_line..last_line
    else
      # no line number attributes = the code block contains the whole file
      line_range = 1..
    end

    full_path = t.attribute("data-file-path").to_s
    # ...

At this point, the Agda file could be in some nested directory, like A/B/C/File.agda. However, the project root -- the place where Agda modules are compiled from -- could be any one of the folders A, B, or C. Thus, the fully qualified module name for File.agda could be File, C.File, B.C.File, or A.B.C.File. Since Agda's HTML output produces files named after the fully qualified module name, the script needs to guess what the module file is. This is where some conventions come in play: I keep my code in folders directly nested within a top-level code directory; thus, I'll have folders project1 or project2 inside code, and those will always be project roots. As a result, I guess that the first directory relative to code should be discarded, while the rest should be included in the path. The only exception to this is Git submodules: if an Agda file is included using a submodule, the root directory of the submodule is considered the Agda project root. My Hugo theme indicates the submodule using an additional data-base-path attribute; in all, that leads to the following logic:

    # ...
    full_path_dirs = Pathname(full_path).each_filename.to_a
    base_path = t.attribute("data-base-path").to_s
    base_dir_depth = 0
    if base_path.empty?
      # No submodules were used. Assume code/<X> is the root.
      # The path of the file is given relative to `code`, so need
      # to strip only the one outermost directory.
      base_dir_depth = 1
      base_path = full_path_dirs[0]
    else
      # The code is in a submodule. Assume that the base path / submodule
      # root is the Agda module root, ignore all folders before that.
      base_path_dirs = Pathname(base_path).each_filename.to_a
      base_dir_depth = base_path_dirs.length
    end
    # ...

With that, the script determines the actual HTML file path --- by assuming that there's an html folder in the same place as the Agda project root --- and runs the above process_agda_html_file:

    # ...
    dirs_in_base = full_path_dirs[base_dir_depth..-1]
    html_file = dirs_in_base.join(".").gsub(/\.agda$/, ".html")
    html_path = File.join(["code", base_path, "html", html_file])

    agda_info = process_agda_html_file(html_path)
    # ...

The next step is specific to the output of Hugo's syntax highlighter, Chroma. When line numbers are enabled -- and they are on my site -- Chroma generates a table that, at some point, contains a bunch of span HTML nodes, each with the line class. Each such span corresponds to a single line of output; naturally, the first one contains the code from first_line, the second from first_line + 1, and so on until last_line. This is quite convenient, because it saves the headache of counting newlines the way that the Agda processing code above has to.

For each line of syntax-highlighted code, the script retrieves the corresponding list of links that were collected from the Agda HTML file.

    # ...
    lines = t.css("pre.chroma code[data-lang] .line")
    lines.zip(line_range).each do |line, line_no|
      line_info = agda_info[line_no]
      next unless line_info

      # ...

The subsequent traversal -- which picks out the plain text of the Agda file as reasoned above -- is very similar to the previous one. Here too there's an offset variable, which gets incremented with the length of a new plain text pieces. Since we know the lines match up to spans, there's no need to count newlines.

      # ...
      offset = 0
      line.traverse do |lt|
        if lt.text?
          content = lt.content
          new_offset = offset + content.length

          # ...

At this point, we have a line number, and an offset within that line number that describes the portion of the text under consideration. We can traverse all the links for the line, and find ones that mark a piece of text somewhere in this range. For the time being -- since inserting overlapping spans is quite complicated -- I require the links to lie entirely within a particular plain text region. As a result, if Chroma splits a single Agda identifier into several tokens, it will not be linked. For now, this seems like the most conservative and safe approach.

          # ...
          matching_links = line_info.links.filter do |link|
            link[:from] >= offset and link[:to] <= new_offset
          end
          # ...

All that's left is to slice up the plain text fragment into a bunch of HTML pieces: the substrings that are links will turn into a HTML nodes, while the substrings that are "in between" the links will be left over as plain text nodes. The code to do so is relatively verbose, but not all that complicated.

          replace_with = []
          replace_offset = 0
          matching_links.each do |match|
            # The link's range is an offset from the beginning of the line,
            # but the text piece we're splitting up might be partway into
            # the line. Convert the link coordinates to piece-relative ones.
            relative_from = match[:from] - offset
            relative_to = match[:to] - offset

            # If the previous link ended some time before the new link
            # began (or if the current link is the first one, and is not
            # at the beginning), ensure that the plain text "in between"
            # is kept.
            replace_with << content[replace_offset...relative_from]

            tag = (match.include? :href) ? 'a' : 'span'
            new_node = Nokogiri::XML::Node.new(tag, document)
            if match.include? :href
              # For nodes with links, note what they're referring to, so
              # we can adjust their hrefs when we assign global IDs.
              href = match[:href].to_s
              new_node['href'] = note_used_href file, new_node, href
            end
            if match.include? :id
              # For nodes with IDs visible in the current Hugo file, we'll
              # want to redirect links that previously go to other Agda
              # module HTML files. So, note the ID that we want to redirect,
              # and pick a new unique ID to replace it with.
              id = match[:id].to_s
              new_node['id'] = note_defined_href file, "#{html_file}##{id}"
            end
            new_node.content = content[relative_from...relative_to]

            replace_with << new_node
            replace_offset = relative_to
          end
          replace_with << content[replace_offset..-1]

There's a little bit of a subtlety in the above code: specifically, I use the note_used_href and note_defined_href methods. These are important for rewriting links. Like I mentioned earlier, Agda's HTML output assumes that each source file should produce a single HTML file -- named after its qualified module -- and creates links accordingly. However, my blog posts interweave multiple source files. Some links that would've jumped to a different file must now point to an internal identifier within the page. Another important aspect of the transformation is that, since I'm pulling HTML files from distinct files, it's not guaranteed that each of them will have a unique id attribute. After all, Agda just assigns sequential numbers to each node that it generates; it would only take, e.g., including the first line from two distinct modules to end up with two nodes with id="1".

The solution is then twofold:

Track all the nodes referencing a particular href (made up of an HTML file and a numerical identifier, like File.html#123). When we pick new IDs -- thus guaranteeing their uniqueness -- we'll visit all the nodes that refer to the old ID and HTML file, and update their href.
Track all existing Agda HTML IDs that we're inserting. If we transfer an <a id="1234"> onto the Hugo content, we know we'll need to pick a new ID for it (since 1234 need not be unique), and that we'll need to redirect the other links to that new ID as the previous bullet describes.

Here's how these two methods work:

def note_defined_href(file, href)
  file_hrefs = @local_seen_hrefs.fetch(file) do
    @local_seen_hrefs[file] = {}
  end

  uniq_id = file_hrefs.fetch(href) do
    new_id = "agda-unique-ident-#{@id_counter}"
    @id_counter += 1
    file_hrefs[href] = new_id
  end

  unless @global_seen_hrefs.include? href
    @global_seen_hrefs[href] = { :file => file, :id => uniq_id }
  end

  return uniq_id
end

def note_used_href(file, node, href)
  ref_list = @nodes_referencing_href.fetch(href) { @nodes_referencing_href[href] = [] }
  ref_list << { :file => file, :node => node }
  return href
end

Note that they use class variables: these are methods on a FileGroup class. I've omitted the various classes I've declared from the above code for brevity, but here it makes sense to show them. Like I mentioned earlier, you can view the complete code here.

Interestingly, note_defined_href makes use of two global maps: @local_seen_hrefs and @global_seen_hrefs. This helps satisfy the third constraint above, which is linking between code defined in the same series. The logic is as follows: when rewriting a link to a new HTML file and ID, if the code we're trying to link to exists on the current page, we should link to that. Otherwise, if the code we're trying to link to was presented in a different part of the series, then we should link to that other part. So, we consult the "local" map for hrefs that will be rewritten to HTML nodes in the current file, and as a fallback, consult the "global" map for hrefs that were introduced in other parts. The note_defined_href populates both maps, and is "biased" towards the first occurrence of a piece of code: if posts A and B define a function f, and post C only references f, then that link will go to post A's definition, which came earlier.

The other method, note_used_href, is simpler. It just appends to a list of Nokogiri HTML nodes that reference a given href. We keep track of the file in which the reference occurred so we can be sure to consult the right sub-map of @local_seen_hrefs when checking for in-page rewrites.

After running process_source_file on all Hugo HTML files within a particular series, the following holds true:

We have inserted span or a nodes wherever Agda's original output had nodes with id or href elements. This is with the exception of the case where Hugo's inline HTML doesn't "line up" with Agda's inline HTML, which I've only found to happen when the leading character of an identifier is a digit.
We have picked new IDs for each HTML node we inserted that had an ID, noting them both globally and for the current file. We noted their original href value (in the form File.html#123) and that it should be transformed into our globally-unique identifiers, in the form agda-unique-ident-1234.
For each HTML node we inserted that links to another, we noted the href of the reference (also in the form File.html#123).

Now, all that's left is to redirect the hrefs of the nodes we inserted from their old values to the new ones. I do this by iterating over @nodes_referencing_href, which contains every link we inserted.

def cross_link_files
  @nodes_referencing_href.each do |href, references|
    references.each do |reference|
      file = reference[:file]
      node = reference[:node]

      local_targets = @local_seen_hrefs[file]
      if local_targets.include? href
        # A code block in this file provides this href, create a local link.
        node['href'] = "##{local_targets[href]}"
      elsif @global_seen_hrefs.include? href
        # A code block in this series, but not in this file, defines
        # this href. Create a cross-file link.
        target = @global_seen_hrefs[href]
        other_file = target[:file]
        id = target[:id]

        relpath = Pathname.new(other_file).dirname.relative_path_from(Pathname.new(file).dirname)
        node['href'] = "#{relpath}##{id}"
      else
        # No definitions in any blog page. For now, just delete the anchor.
        node.replace node.content
      end
    end
  end
end

Notice that for the time being, I simply remove links to Agda definitions that didn't occur in the Hugo post. Ideally, this would link to the plain, non-blog documentation page generated by Agda; however, this requires either hosting those documentation pages, or expecting the Agda standard library HTML pages to remain stable and hosted at a fixed URL. Neither was simple enough to do, so I opted for the conservative "just don't insert links" approach.

And that's all of the approach that I wanted to show off today! There are other details, like finding posts in the same series (I achieve this with a meta element) and invoking agda --html on the necessary source files (my build-agda-html.rb script is how I personally do this), but I don't think it's all that valuable to describe them here.

Unfortunately, the additional metadata I had my theme insert makes it harder for others to use this approach out of the box. However, I hope that by sharing my experience, others who write Agda and post about it might be able to get a similar solution working. And of course, it's always fun to write about a recent project or endeavor.

Happy (dependently typed) programming and blogging!

25 KiB Raw Permalink Blame History Unescape Escape