---
title: Rendering Mathematics On The Back End
date: 2020-07-21T14:54:26-07:00
tags: ["Website", "Nix", "Ruby", "KaTeX"]
---

Due to something of a streak of bad luck when it came to computers, I spent a
significant amount of time using a Linux-based Chromebook, and then a
Pinebook Pro. It was, in some way, enlightening. The things that I used to take
for granted with a 'powerful' machine now became a rare luxury: StackOverflow,
and other relatively static websites, took upwards of ten seconds to finish
loading. On Slack, each of my keypresses could take longer than 500ms to
appear on the screen, and sometimes, it would take several seconds. Some
websites would present me with a white screen, and remain that way for much
longer than I had time to wait. It was awful.

At one point, I installed uMatrix, and made it the default policy to block
all JavaScript. For the most part, this worked well. Of course, I had to
enable JavaScript for applications that needed to be interactive, like
Slack and Discord. Otherwise, I was able to browse the majority
of the websites I normally browse. This went on until I started working
on the [compiler series]({{< relref "00_compiler_intro.md" >}}) again,
and discovered that the LaTeX math on my page, which was required
for displaying things like inference rules, didn't work without
JavaScript. I was left with two options:

* Allow JavaScript, and continue using MathJax to render my math.
* Make it so that the mathematics is rendered on the back end.

I've [previously written about math rendering]({{< relref "math_rendering_is_wrong.md" >}}),
and made the observation that MathJax's output for LaTeX is __identical__
on every computer. From the MathJax 2.6 change log:

> _Improved CommonHTML output_. The CommonHTML output now provides the same layout quality and MathML support as the HTML-CSS and SVG output. It is on average 40% faster than the other outputs and the markup it produces are identical on all browsers and thus can also be pre-generated on the server via MathJax-node.

It seems absurd, then, to offload this kind of work onto the users, to
be done over and over again. As should be clear from the title of
this post, this made me settle on the second option: it was
__obviously within reach__, especially for a statically-generated website
like mine, to render math on the back end.

I settled on the following architecture:

* As before, I would generate my pages using Hugo.
* I would use the KaTeX NPM package to render math.
* To build the website no matter what system I was on, I would use Nix.

It so happens that Nix isn't really required for using my approach in general.
I will give my setup here, but feel free to skip ahead.

### Setting Up A Nix Build
My `default.nix` file looks like this:

```Nix {linenos=table}
{ stdenv, hugo, fetchgit, pkgs, nodejs, ruby }:

let
  url = "https://dev.danilafe.com/Web-Projects/blog-static.git";
  rev = "<commit>";
  sha256 = "<hash>";
  requiredPackages = import ./required-packages.nix {
    inherit pkgs nodejs;
  };
in
  stdenv.mkDerivation {
    name = "blog-static";
    version = rev;
    src = fetchgit {
      inherit url rev sha256;
    };
    builder = ./builder.sh;
    converter = ./convert.rb;
    buildInputs = [
      hugo
      requiredPackages.katex
      (ruby.withPackages (ps: [ ps.nokogiri ]))
    ];
  }
```

I'm using `node2nix` to generate the `required-packages.nix` file, which allows me,
even from a sandboxed Nix build, to download and install `npm` packages. This is needed
so that I have access to the `katex` binary at build time. I fed the following JSON file
to `node2nix`:

```JSON {linenos=table}
[
  "katex"
]
```

The Ruby script I wrote for this (more on that soon) required the `nokogiri` gem, which
I used for traversing the HTML generated for my site. Hugo was obviously required to
generate the HTML.

### Converting LaTeX To HTML
After my first post complaining about the state of mathematics on the web, I received
the following email (which the author allowed me to share):

> Sorry for having a random stranger email you, but in your blog post
[(link)]({{< relref "math_rendering_is_wrong" >}}) you seem to focus on MathJax's
difficulty in rendering things server-side, while quietly ignoring that KaTeX's front
page advertises server-side rendering. Their documentation [(link)](https://katex.org/docs/options.html)
even shows (at least as of the time this email was sent) that it renders both HTML
(to be arranged nicely with their CSS) for visuals and MathML for accessibility.

The author of the email then kindly provided a link to a page they generated using KaTeX and
some Bash scripts. The math on this page was rendered at the time it was generated.

This is a great point, and KaTeX is indeed usable for server-side rendering. But I've
seen few people who do actually use it. Unfortunately, as I pointed out in my previous post on the subject,
few tools actually take your HTML page and replace LaTeX with rendered math.
Here's what I wrote about this last time:

> [In MathJax,] The bigger issue, though, was that the `page2html`
program, which rendered all the mathematics in a single HTML page,
was gone. I found `tex2html` and `text2htmlcss`, which could only
render equations without the surrounding HTML. I also found `mjpage`,
which replaced mathematical expressions in a page with their SVG forms.

This is still the case, in both MathJax and KaTeX. The ability
to render math in one step is the main selling point of front-end LaTeX renderers:
all you have to do is drop in a file from a CDN, and voila, you have your
math. There are no such easy answers for back-end rendering. In fact,
as we will soon see, it's not possible to just search-and-replace occurrences
of mathematics on your page, either. To actually get KaTeX working
on the back end, you need access to tools that handle the potential variety
of edge cases associated with HTML. Such tools, to my knowledge, do not
currently exist.

I decided to write my own Ruby script to get the job done. From this script, I
would call the `katex` command-line program, which would perform
the heavy lifting of rendering the mathematics.

There are two types of math on my website: inline math and display math.
On the command line ([here are the docs](https://katex.org/docs/cli.html)),
the distinction is made using the `--display-mode` argument (`-d` for short). So, the general algorithm
is to replace the code inside each `$$...$$` with its display-rendered version,
and the code inside each `\(...\)` with its inline-rendered version. I came up with
the following Ruby function:

```Ruby {linenos=table}
require 'open3'

# Render `string` by piping it to `command`, reusing `cache` for repeats.
def render_cached(cache, command, string, render_comment = nil)
  # Hash#fetch only runs the block (with the missing key) on a cache miss.
  cache.fetch(string) do |new|
    puts " Rendering #{render_comment || new}"
    cache[string] = Open3.popen3(command) do |i, o, e, t|
      i.write new
      i.close
      o.read.force_encoding(Encoding::UTF_8).strip
    end
  end
end
```

Here, the `cache` argument is used to prevent re-running the `katex` command
on an equation that was already rendered before (the output is the same, after all).
The `command` is the specific shell command that we want to invoke; this would
be either `katex` or `katex -d`. The `string` is the math equation to render,
and the `render_comment` is the string to print to the console instead of the equation
(so that long display math equations are not printed to standard output).

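The caching can be tried out without `katex` installed. The sketch below is not the script's own code: it uses `cat`, which simply echoes its input, as a stand-in for the renderer (so it assumes a Unix-like system), and `pipe_through` and `cached_pipe` are hypothetical helper names. It also uses `Open3.capture2`, a shorter way of writing to a command's standard input and reading its output.

```ruby
require 'open3'

# Pipe `input` to `command` on stdin and return its stripped stdout.
def pipe_through(command, input)
  output, _status = Open3.capture2(command, stdin_data: input)
  output.force_encoding(Encoding::UTF_8).strip
end

# Return the cached result for `input`, spawning `command` only on a miss.
def cached_pipe(cache, command, input)
  cache.fetch(input) { |key| cache[key] = pipe_through(command, key) }
end

cache = {}
puts cached_pipe(cache, "cat", "x^2")  # cache miss: spawns `cat`, prints x^2
puts cached_pipe(cache, "cat", "x^2")  # cache hit: no process is started
puts cache.size                        # prints 1
```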

Then, given a substring of the HTML file, we use regular expressions
to find the `\(...\)` and `$$...$$`s, and use the `render_cached` method
on the LaTeX code inside.

```Ruby {linenos=table}
def perform_katex_sub(inline_cache, display_cache, content)
  # Inline math: \( ... \), where the body may contain any escape except \).
  rendered = content.gsub /\\\(((?:[^\\]|\\[^\)])*)\\\)/ do |match|
    render_cached(inline_cache, "katex", $~[1])
  end
  # Display math: $$ ... $$, where the body may contain a lone $ but not $$.
  rendered = rendered.gsub /\$\$((?:[^\$]|\$[^\$])*)\$\$/ do |match|
    render_cached(display_cache, "katex -d", $~[1], "display")
  end
  return rendered
end
```

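To check what the two patterns capture without running `katex`, here is a small sketch that swaps the rendering step for a visible marker. `INLINE`, `DISPLAY` and `mark_math` are names made up for this illustration; the display pattern is written with the inner `$` escaped, so that a lone `$` is allowed inside display math.

```ruby
# Inline math: \( ... \); the capture group holds the LaTeX in between.
INLINE  = /\\\(((?:[^\\]|\\[^\)])*)\\\)/
# Display math: $$ ... $$; a single $ may appear inside, but $$ ends the match.
DISPLAY = /\$\$((?:[^\$]|\$[^\$])*)\$\$/

# Replace each piece of math with a marker showing its kind and contents.
def mark_math(text)
  text.gsub(INLINE) { "[inline:#{$~[1]}]" }
      .gsub(DISPLAY) { "[display:#{$~[1]}]" }
end

puts mark_math('Let \(x\) be such that $$x^2 = 4$$ holds.')
# prints: Let [inline:x] be such that [display:x^2 = 4] holds.
```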
There's a bit of a trick to the final layer of this script. We want to be
really careful about where we replace LaTeX, and where we don't. In
particular, we _don't_ want to go into the `code` tags. Otherwise,
it wouldn't be possible to talk about LaTeX code! I also suspect that
some captions, alt texts, and similar elements should be left alone.
However, I don't have those on my website (yet), and I won't worry about
them now. Either way, because of the code tags,
we can't just search-and-replace over the entire page; we need to be context
aware. This is where `nokogiri` comes in. We parse the HTML, and iterate
over all of the 'text' nodes, calling `perform_katex_sub` on all
of those that _aren't_ inside code tags.

Fortunately, this kind of iteration is pretty easy to specify thanks to something called XPath.
This was my first time encountering it, but it seems extremely useful: it's
a sort of language for selecting XML nodes. First, you specify where to look,
relative to the node you're starting from (for us, the root of the document).
XPath calls these 'axes', and the common ones have abbreviations. A leading `/` looks at the immediate children
(this would be the `html` tag in a properly formatted document, I would imagine).
A leading `//` looks at all the descendants. That is, it will look at the
children of the root, then their children, and so on. There's also the `self` axis,
which looks at the node itself.

After you've said where to look, you need to specify the type of node that you want to
select. We can write `code`, for instance, to pick only the `<code>....</code>` tags
from the axis we've chosen. We can also use `*` to select any element, and we can
use `text()` to select text nodes, such as the `Hello` inside of `<b>Hello</b>`.

We can also apply some more conditions to the nodes we pick using `[]`.
For us, the relevant feature here is `not(...)`, which allows us to
select nodes that do __not__ match a particular condition. This is all
we need to know.

We write:

* `//`, starting to search for nodes everywhere, not just the root of the document.
* `*`, to match _any_ element. We want to replace math inside of `div`s, `span`s, `nav`s,
  all of the `h`s, and so on.
* `[not(self::code)]`, cutting out all the `code` tags.
* `/`, now selecting the nodes that are immediate children of the elements we've selected.
* `text()`, giving us the text nodes directly inside those elements.

All in all:

```
//*[not(self::code)]/text()
```

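To get a feel for what this expression selects, here is a small sketch that runs it on a made-up fragment. It uses REXML, which ships with Ruby, rather than `nokogiri`, purely so that it has no dependencies; the fragment and the `texts_matching` helper are hypothetical. The second expression, `//text()[not(ancestor::code)]`, is a stricter variant and is not what the script in this post uses. The two differ when a `code` tag contains other elements (syntax highlighters tend to wrap code in `span`s): the first only skips text sitting directly inside `code`, while the second skips anything nested within it.

```ruby
require 'rexml/document'

# Hypothetical fragment: one text node outside of code, one directly inside
# a code tag, and one nested in a span inside a code tag.
SAMPLE = '<p>outside<code>direct</code><code><span>nested</span></code></p>'

# Return the contents of every text node selected by the XPath expression.
def texts_matching(xml, xpath)
  document = REXML::Document.new(xml)
  REXML::XPath.match(document, xpath).map(&:value)
end

# The expression from this post: text whose immediate parent is not `code`.
p texts_matching(SAMPLE, '//*[not(self::code)]/text()')

# A stricter variant: text with no `code` element anywhere above it.
p texts_matching(SAMPLE, '//text()[not(ancestor::code)]')
```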
Finally, we use this XPath from `nokogiri`:

```Ruby {linenos=table}
require 'nokogiri'

files = ARGV[0..-1]
# The caches are shared across files, so repeated equations render only once.
inline_cache, display_cache = {}, {}

files.each do |file|
  puts "Rendering file: #{file}"
  document = Nokogiri::HTML.parse(File.open(file))
  # Visit every text node whose immediate parent is not a code tag.
  document.search('//*[not(self::code)]/text()').each do |t|
    t.replace(perform_katex_sub(inline_cache, display_cache, t.content))
  end
  File.write(file, document.to_html)
end
```

I named this script `convert.rb`; it's used from inside of the Nix expression
and its builder, which we will cover below.

### Tying It All Together
Finally, I wanted an end-to-end script to generate HTML pages and render the LaTeX in them.
I used Nix for this, but the script below is largely compatible with a non-Nix system.
I came up with the following, commenting on Nix-specific commands:

```Bash {linenos=table}
# Nix-specific; set up paths.
source $stdenv/setup

# Build site with Hugo
# The cp is Nix-specific; it copies the blog source into the current directory.
cp -r $src/* .
hugo --baseUrl="https://danilafe.com"

# Render math in HTML files.
# $converter is Nix-specific; you can just use convert.rb.
find public/ -regex "public/.*\.html" | xargs ruby $converter

# Output result
# $out is Nix-specific; you can replace it with your destination folder.
mkdir $out
cp -r public/* $out/
```

This is it! Using the two scripts, `convert.rb` and `builder.sh`, I
was able to generate my blog with the math rendered on the back end.
Please note, though, that I had to add the KaTeX CSS to my website's
`<head>`.

### Caveats
The main caveat of my approach is performance. For every distinct piece of
mathematics that I render, I invoke the `katex` command. This incurs
the penalty of Node's startup time, every time, and makes my approach
take a few dozen seconds to run on my relatively small site. A
better approach would be to use a NodeJS script, rather than a Ruby one,
to perform the conversion. KaTeX also provides an API, so such a NodeJS
script can find the files, parse the HTML, and perform the substitutions.
I did quite like using `nokogiri` here, though, and I hope that an equivalently
pleasant solution exists in JavaScript.

Re-rendering the whole website is also pretty wasteful. I rarely change the
mathematics on more than one page at a time, but every time I do so, I have
to re-run the script, and therefore re-render every page. This makes sense
for me, since I use Nix, and my builds are pretty much always performed
from scratch. On the other hand, for others, this may not be the best solution.

### Alternatives
The same person who sent me the original email above also pointed out
[this `pandoc` filter for KaTeX](https://github.com/Zaharid/pandoc_static_katex).
I do not use Pandoc, but from what I can see, this filter relies on
Pandoc's `Math` AST nodes, and applies KaTeX to each of those. This
should work, but wasn't applicable in my case, since Hugo's shortcodes
don't mix well with Pandoc. However, it certainly seems like a workable
solution.

### Conclusion
With the removal of MathJax from my site, it is now completely JavaScript free,
and contains virtually the same HTML that it did beforehand. This, I hope,
makes it work better on devices where computational power is more limited.
I also hope that it illustrates a general principle: it's very possible,
and plausible, to render LaTeX on the back end for a static site.