diff --git a/content/blog/backend_math_rendering.md b/content/blog/backend_math_rendering.md
new file mode 100644
index 0000000..e8cf5e5
--- /dev/null
+++ b/content/blog/backend_math_rendering.md
@@ -0,0 +1,280 @@
+---
+title: Rendering Mathematics On The Back End
+date: 2020-07-15T15:27:19-07:00
+draft: true
+tags: ["Website", "Nix", "Ruby", "KaTeX", "Hugo"]
+---
+
+Due to something of a streak of bad luck when it came to computers, I spent a
+significant amount of time using a Linux-based Chromebook, and then a
+Pinebook Pro. It was, in some way, enlightening. The things that I used to take
+for granted with a 'powerful' machine now became a rare luxury: StackOverflow,
+and other relatively static websites, took upwards of ten seconds to finish
+loading. On Slack, each of my keypresses could take longer than 500ms to
+appear on the screen, and sometimes, it would take several seconds. Some
+websites would present me with a white screen, and remain that way for much
+longer than I had time to wait. It was awful.
+
+At one point, I installed uMatrix, and made it the default policy to block
+all JavaScript. For the most part, this worked well. Of course, I had to
+enable JavaScript for applications that needed to be interactive, like
+Slack, and Discord. But for the most part, I was able to browse the majority
+of the websites I normally browse. This went on until I started working
+on the [compiler series]({{< relref "00_compiler_intro.md" >}}) again,
+and discovered that the LaTeX math on my page, which was required
+for displaying things like inference rules, didn't work without
+JavaScript. I was left with two options:
+
+* Allow JavaScript, and continue using MathJax to render my math.
+* Make it so that the mathematics is rendered on the back end.
+
+I've [previously written about math rendering]({{< relref "math_rendering_is_wrong.md" >}}),
+and made the observation that MathJax's output for LaTeX is __identical__
+on every computer.
+From the MathJax 2.6 change log:
+
+> _Improved CommonHTML output_. The CommonHTML output now provides the same layout quality and MathML support as the HTML-CSS and SVG output. It is on average 40% faster than the other outputs and the markup it produces are identical on all browsers and thus can also be pre-generated on the server via MathJax-node.
+
+It seems absurd, then, to offload this kind of work onto the users, to
+be done over and over again. As should be clear from the title of
+this post, this made me choose the second option: it was
+__obviously within reach__, especially for a statically-generated website
+like mine, to render math on the back end.
+
+I settled on the following architecture:
+
+* As before, I would generate my pages using Hugo.
+* I would use the KaTeX NPM package to render math.
+* To build the website no matter what computer I was on, I would use Nix.
+
+It so happens that Nix isn't really required for using my approach in general.
+I will give my setup here, but feel free to skip ahead.
+
+### Setting Up A Nix Build
+My `default.nix` file looks like this:
+
+```Nix {linenos=table}
+{ stdenv, hugo, fetchgit, pkgs, nodejs, ruby }:
+
+let
+  url = "https://dev.danilafe.com/Web-Projects/blog-static.git";
+  rev = "";
+  sha256 = "";
+  requiredPackages = import ./required-packages.nix {
+    inherit pkgs nodejs;
+  };
+in
+  stdenv.mkDerivation {
+    name = "blog-static";
+    version = rev;
+    src = fetchgit {
+      inherit url rev sha256;
+    };
+    builder = ./builder.sh;
+    converter = ./convert.rb;
+    buildInputs = [
+      hugo
+      requiredPackages.katex
+      (ruby.withPackages (ps: [ ps.nokogiri ]))
+    ];
+  }
+```
+
+I'm using `node2nix` to generate the `required-packages.nix` file, which allows me,
+even from a sandboxed Nix build, to download and install `npm` packages. This is needed
+so that I have access to the `katex` binary at build time.
+I fed the following JSON file to `node2nix`:
+
+```JSON {linenos=table}
+[
+  "katex"
+]
+```
+
+The Ruby script I wrote for this (more on that soon) required the `nokogiri` gem, which
+I used for traversing the HTML generated for my site. Hugo was obviously required to
+generate the HTML.
+
+### Converting LaTeX To HTML
+After my first post complaining about the state of mathematics on the web, I received
+the following email (which the author allowed me to share):
+
+> Sorry for having a random stranger email you, but in your blog post
+[(link)](https://danilafe.com/blog/math_rendering_is_wrong) you seem to focus on MathJax's
+difficulty in rendering things server-side, while quietly ignoring that KaTeX's front
+page advertises server-side rendering. Their documentation [(link)](https://katex.org/docs/options.html)
+even shows (at least as of the time this email was sent) that it renders both HTML
+(to be arranged nicely with their CSS) for visuals and MathML for accessibility.
+
+This is a great point, and KaTeX is indeed usable for server-side rendering. But I've
+seen few people who actually use it. Unfortunately, as I pointed out in my previous post on the subject,
+few tools remain that actually take your HTML page and replace the LaTeX in it
+with rendered math.
+
+> [In MathJax,] The bigger issue, though, was that the `page2html`
+program, which rendered all the mathematics in a single HTML page,
+was gone. I found `tex2html` and `tex2htmlcss`, which could only
+render equations without the surrounding HTML. I also found `mjpage`,
+which replaced mathematical expressions in a page with their SVG forms.
+
+This is still the case, in both MathJax and KaTeX. The ability
+to render math in one step is the main selling point of front-end LaTeX renderers:
+all you have to do is drop in a file from a CDN, and voila, you have your
+math. There are no such easy answers for back-end rendering.
+
+So what _do_ I do?
+Well, there are two types of math on my website: inline math and display math.
+On the command line ([here are the docs](https://katex.org/docs/cli.html)),
+the distinction is made using the `--display-mode` argument. So, the general algorithm
+is to replace the code inside the `$$...$$` with its display-rendered version,
+and the code inside the `\(...\)` with its inline-rendered version. I came up with
+the following Ruby function:
+
+```Ruby {linenos=table}
+require "open3"
+
+# Render `string` by piping it into `command`, reusing any earlier result.
+def render_cached(cache, command, string, render_comment = nil)
+  # The block only runs on a cache miss; `new` is the missing key (the equation).
+  cache.fetch(string) do |new|
+    puts " Rendering #{render_comment || new}"
+    cache[string] = Open3.popen3(command) do |i, o, e, t|
+      i.write new
+      i.close # Send EOF so that the command knows the input is complete.
+      o.read.force_encoding(Encoding::UTF_8).strip
+    end
+  end
+end
+```
+
+Here, the `cache` argument is used to prevent re-running the `katex` command
+on an equation that was already rendered before (the output is the same, after all).
+The `command` is the specific shell command that we want to invoke; this would
+be either `katex` or `katex -d`. The `string` is the math equation to render,
+and the `render_comment` is the string to print to the console instead of the equation
+(so that long, display math equations are not printed to standard output).
+
+Then, given a substring of the HTML file, we use regular expressions
+to find the `\(...\)` and `$$...$$`s, and use the `render_cached` method
+on the LaTeX code inside.
+
+```Ruby {linenos=table}
+def perform_katex_sub(inline_cache, display_cache, content)
+  # Inline math: \( ... \), where the body may contain escapes other than \).
+  rendered = content.gsub(/\\\(((?:[^\\]|\\[^\)])*)\\\)/) do |match|
+    render_cached(inline_cache, "katex", $~[1])
+  end
+  # Display math: $$ ... $$, where the body may contain single dollar signs.
+  rendered = rendered.gsub(/\$\$((?:[^\$]|\$[^\$])*)\$\$/) do |match|
+    render_cached(display_cache, "katex -d", $~[1], "display")
+  end
+  return rendered
+end
+```
+
+There's a bit of a trick to the final layer of this script. We want to be
+really careful about where we replace LaTeX, and where we don't. In
+particular, we _don't_ want to go into the `code` tags.
+Otherwise, I wouldn't be able to talk about LaTeX code! Thus, we can't just
+search-and-replace over the entire HTML document; we need to be context
+aware. This is where `nokogiri` comes in. We parse the HTML, and iterate
+over all of the 'text' nodes, calling `perform_katex_sub` on all
+of those that _aren't_ inside code tags.
+
+Fortunately, this is pretty easy to specify thanks to something called XPath.
+This was my first time encountering it, but it seems extremely useful: it's
+a sort of language for selecting XML nodes. First, you provide an 'axis',
+which is used to specify the positions of the nodes you want to look at
+relative to the root node. The axis `/` looks at the immediate children
+(this would be the `html` tag in a properly formatted document, I would imagine).
+The axis `//` looks at all the transitive children. That is, it will look at the
+children of the root, then their children, and so on. There's also the `self` axis,
+which looks at the node itself.
+
+After you provide an axis, you need to specify the type of node that you want to
+select. We can write `code`, for instance, to pick only the `<code>....</code>` tags
+from the axis we've chosen. We can also use `*` to select any element, and we can
+use `text()` to select text nodes, such as the `Hello` inside of `<b>Hello</b>`.
+
+We can also apply some more conditions to the nodes we pick using `[]`.
+For us, the relevant feature here is `not(...)`, which allows us to
+select nodes that do __not__ match a particular condition. This is all
+we need to know.
+
+We write:
+
+* `//`, starting to search for nodes everywhere, not just the root of the document.
+* `*`, to match _any_ element. We want to replace math inside of `div`s, `span`s, `nav`s,
+all of the `h`s, and so on.
+* `[not(self::code)]`, cutting out all the `code` tags.
+* `/`, now selecting the nodes that are immediate children of the nodes we've selected.
+* `text()`, giving us the text contents of all the nodes we've selected.
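
To get a feel for how these pieces behave, here is a minimal sketch that tries each of them on a tiny document. It is an illustration rather than part of the build: it assumes REXML (shipped with standard Ruby installations) as a stand-in for `nokogiri`, so it only accepts well-formed XML, and the sample markup is made up for the example.

```ruby
require "rexml/document"

# A small, well-formed stand-in for a rendered page (hypothetical content).
xml = '<div><p>Math: \(x^2\)</p><pre><code>\(not math\)</code></pre></div>'
doc = REXML::Document.new(xml)

# `//` followed by a name: every `code` element, at any depth.
code_elements = REXML::XPath.match(doc, "//code")

# `//*` with a `not(...)` condition: every element except the `code` ones.
non_code_names = REXML::XPath.match(doc, "//*[not(self::code)]").map(&:name)

# Adding `/text()`: the text nodes that sit directly inside those elements.
texts = REXML::XPath.match(doc, "//*[not(self::code)]/text()").map(&:value)

puts "code elements: #{code_elements.length}"
puts "non-code elements: #{non_code_names.inspect}"
puts "selected text: #{texts.inspect}"
```

The text inside the `code` element is not selected, because its parent fails the `not(self::code)` condition, while the text inside the `p` element is.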
+
+All in all:
+
+```
+//*[not(self::code)]/text()
+```
+
+Finally, we use this XPath from `nokogiri`:
+
+```Ruby {linenos=table}
+require "nokogiri"
+
+files = ARGV
+inline_cache, display_cache = {}, {}
+
+files.each do |file|
+  puts "Rendering file: #{file}"
+  # The block form closes the file handle once parsing is done.
+  document = File.open(file) { |f| Nokogiri::HTML.parse(f) }
+  # Visit every text node whose parent is not a `code` element.
+  document.search('//*[not(self::code)]/text()').each do |t|
+    t.replace(perform_katex_sub(inline_cache, display_cache, t.content))
+  end
+  File.write(file, document.to_html)
+end
+```
+
+I named this script `convert.rb`; it's used from inside of the Nix expression
+and its builder, which we will cover below.
+
+### Tying It All Together
+Finally, I wanted an end-to-end script to generate HTML pages and render the LaTeX in them.
+I used Nix for this, but the script below will largely be compatible with a non-Nix system.
+I came up with the following, commenting on Nix-specific commands:
+
+```Bash {linenos=table}
+source "$stdenv/setup" # Nix-specific; set up paths.
+
+# Build site with Hugo
+# The cp is Nix-specific; it copies the blog source into the current directory.
+cp -r "$src"/* .
+hugo --baseUrl="https://danilafe.com"
+
+# Render math in the generated HTML files.
+# $converter is Nix-specific; you can just use convert.rb.
+find public/ -regex "public/.*\.html" | xargs ruby "$converter"
+
+# Output result
+# $out is Nix-specific; you can replace it with your destination folder.
+mkdir "$out"
+cp -r public/* "$out"/
+```
+
+This is it! Using the two scripts, `convert.rb` and `builder.sh`, I
+was able to generate my blog with the math rendered on the back end.
+Please note, though, that I had to add the KaTeX CSS to my website's
+`<head>`.
+
+### Caveats
+The main caveat of my approach is performance. For every piece of
+mathematics that I render, I invoke the `katex` command. This incurs
+the penalty of Node's startup time, every time, and makes my approach
+take a few dozen seconds to run on my relatively small site.
+A better approach would be to use a Node.js script, rather than a Ruby one,
+to perform the conversion. KaTeX also provides a JavaScript API, so such a Node.js
+script can find the files, parse the HTML, and perform the substitutions
+without starting a new process for every equation.
+I did quite like using `nokogiri` here, though, and I hope that an equivalently
+pleasant solution exists in JavaScript.
+
+Re-rendering the whole website is also pretty wasteful. I rarely change the
+mathematics on more than one page at a time, but every time I do so, I have
+to re-run the script, and therefore re-render every page. This makes sense
+for me, since I use Nix, and my builds are pretty much always performed
+from scratch. On the other hand, for others, this may not be the best solution.
+
+### Conclusion
+With the removal of MathJax from my site, it is now completely JavaScript-free,
+and contains virtually the same HTML that it did beforehand. This, I hope,
+makes it work better on devices where computational power is more limited.
+I also hope that it illustrates a general principle: it's entirely possible,
+and practical, to render LaTeX on the back end for a static site.