---
title: "Generating Flashcards from PDF Underlines"
date: 2026-04-04T12:25:14-07:00
tags: ["LLMs", "Python"]
draft: true
---

__TL;DR__: I, with the help of ChatGPT, wrote a program that helps me
extract vocabulary words from PDFs. Scroll just a bit further down
to see what it looks like.

Sometime in 2020 or 2021, during the COVID-19 pandemic, I overheard from some
source that Albert Camus, in his book _La Peste_ (The Plague), had quite
accurately described the experience that many of us were going through
at the time. Having studied French for several years, I decided that the
best way to see for myself what _La Peste_ is all about was to read it
in its original, untranslated form.

I made good progress, but I certainly did not know every word. On the surface,
I was faced with two choices: guess the words from context and read without
stopping, or interrupt my reading to look up unfamiliar terms. The former
seemed unfortunate since it stunted my ability to acquire new vocabulary;
the latter was unpleasant, making me constantly break from the prose
(and the e-ink screen of my tablet) to consult a dictionary.

In the end, I decided to underline the words, and come back to them later.
However, even then, the task was fairly arduous. For one, words I didn't
recognize weren't always in their canonical form (they can be conjugated,
plural, compound, and more): I had to spend some time deciphering what I
should add to a flashcard. For another, I had to bounce between a PDF of my
book (from which, fortunately, I could copy-paste) and my computer. Often, a
word out of context confused the translation software, so I had to copy more
of the surrounding text. Finally, I learned that, given these limitations,
the pace of my reading far exceeded the rate of my translation. This led me
to underline fewer words.

I thought,

> Perhaps I can just have some software automatically extract the underlined
> portions of the words, find the canonical forms, and generate flashcards?

Even thinking this thought was a mistake. From then on, as I read and went
about underlining my words, I thought about how much manual effort I would
be taking on that could be automated. However, I didn't know how to start
the automation. In the end, I switched to reading books in English.

Then, LLMs got good at writing code. With the help of
Codex, I finally got the tools that I was dreaming about. Here's what it looks
like.

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}
|
||||
|
||||
{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}
|
||||
|
||||
This was my first foray into LLM-driven development. My commentary about that
|
||||
experience (as if there isn't enough of such content out there!) will be
|
||||
interleaved with the technical details.
|
||||
|
||||
### The Core Solution
The core idea has always been:

1. Find things that look like underlines
2. See which words they correspond to
3. Perform {{< sidenote "right" "lemmatization-node" "lemmatization" >}}
Lemmatization (<a href="https://en.wikipedia.org/wiki/Lemmatization">wikipedia</a>) is the
process of turning non-canonical forms of words (like <code>am</code> (eng) /
<code>suis</code> (fr)) into their canonical form which might be found in the
dictionary (<code>to be</code> / <code>être</code>).
{{< /sidenote >}} and translate.

My initial direction was shaped by the impressive demonstrations of OCR
models, which could follow instructions at the same time as reading a document.
For these models, a prompt like "extract all the text in the red box"
constituted the entire targeted OCR pipeline. My hope was that a similar
prompt, "extract all underlined words", would be sufficient to accomplish
steps 1 and 2. However, I was never to find out: as it turns out,
OCR models are large and very expensive to run. In addition, the model
that I was looking at was specifically tailored for NVIDIA hardware which
I, with my MacBook, simply didn't have access to.

In the end, I came to the conclusion that a VLM was overkill for the problem
I was tackling. This took me down the route of analyzing the PDFs. The
problem, of course, was that I knew nothing of the Python landscape
of PDF analysis tools, and that I also knew nothing about the PDF format
itself. This is where Codex v1 came in. Codex opted (from its training
data, I presume) to use the [`PyMuPDF`](https://pymupdf.readthedocs.io) package.
It also guessed (correctly) that the PDFs exported by my tablet used
the 'drawings' part of the PDF spec to encode what I penned. I was instantly
able to see (on the console) the individual drawings.

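Extracting those drawings takes only a few lines of `PyMuPDF`. A minimal
sketch (the file name is hypothetical):

```python
import fitz  # PyMuPDF

doc = fitz.open("la_peste.pdf")  # hypothetical file name
for page in doc:
    for drawing in page.get_drawings():
        # Each drawing has a bounding rectangle and a list of primitive
        # items (line segments, curves, rectangles).
        print(drawing["rect"], [item[0] for item in drawing["items"]])
```
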
The LLM also chose to approach the problem by treating each drawing as just
a "cloud of points", discarding the individual line segment data. This
seemed like a nice enough simplification, and it worked well in the long run.

#### Iterating on the Heuristic
The trouble with the LLM agent was that it had no good way of verifying
whether the lines it detected (and indeed, the words it considered underlined)
were actually lines (and underlined words). Its initial algorithm missed
many words, and misidentified others. I had to resort to visual inspection
to see what was being missed, and to find the likely cause.

The exact process of the iteration is not particularly interesting. I'd
tweak a threshold, re-run the code, and see the new list of words.
I'd then cross-reference the list with the page in question, to see
if things were being over- or under-included. Rinse, repeat.

This got tedious fast. In some cases, letters or words I penned would get picked
up as underlines, and slightly diagonal strokes would get missed. I ended up
requesting Codex to generate a debugging utility that highlighted (in a box)
all the segments that it flagged, and the corresponding words. This
is the first picture I showed in the post. Here it is again:

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}

In the end, the rough algorithm was as follows (a code sketch follows the
list):

1. __Identify all "point clouds" that are not too tall__. Strokes that
   vertically span too many lines of text are likely not underlines.
   * The 'height threshold' ended up being larger than I anticipated:
     turns out I don't draw very straight horizontal lines.

     {{< figure src="tallmarks.png" caption="My angled underlines" label="My angled underlines" >}}
2. __Create a bounding box for the line,__ with some padding.
   I don't draw the lines _directly_ underneath the text, but a bit below.
   * Sometimes, I draw the line quite a bit below; the upward padding
     had to be sizeable.

     {{< figure src="lowmarks.png" caption="My too-low underlines" label="My too-low underlines" >}}
3. __Intersect `PyMuPDF` bounding boxes with the line__. Fortunately,
   `PyMuPDF` provides word rectangles out of the box.
   * I required the intersection to overlap with at least 60% of the word's
     horizontal width, so accidental overlaps don't count.

     {{< figure src="widemarks.png" caption="My too-wide underline hitting `Cela`" label="My too-wide underline hitting `Cela`" >}}
   * The smallest underlines are roughly the same size as the biggest strokes
     in my handwriting. The 60% requirement filtered out the latter, while
     keeping the former.

     {{< figure src="flaggedmarks.png" caption="Letters of a handwritten word detected as lines" label="Letters of a handwritten word detected as lines" >}}
4. __Reject underlines that overlap words from the top__, since, as I
   mentioned, my underlines are often so low that they touch the next line.

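Here's a rough sketch of that heuristic, assuming the point-cloud
representation from earlier; the threshold values and helper name are
illustrative, not the exact numbers from my code:

```python
import fitz  # PyMuPDF

MAX_HEIGHT = 10    # illustrative: max vertical extent of an "underline" stroke
PAD_UP = 12        # illustrative: how far above the stroke to look for words
MIN_OVERLAP = 0.6  # at least 60% of the word's width must be underlined

def underlined_words(page: fitz.Page) -> list[str]:
    words = page.get_text("words")  # tuples of (x0, y0, x1, y1, word, ...)
    found = []
    for drawing in page.get_drawings():
        rect = drawing["rect"]
        if rect.height > MAX_HEIGHT:
            continue  # step 1: too tall to be an underline
        # step 2: padded box reaching up toward the text above the stroke
        box = fitz.Rect(rect.x0, rect.y0 - PAD_UP, rect.x1, rect.y1)
        for x0, y0, x1, y1, word, *_ in words:
            if not box.intersects(fitz.Rect(x0, y0, x1, y1)):
                continue
            # step 3: the overlap must cover most of the word's width
            overlap = min(x1, box.x1) - max(x0, box.x0)
            if overlap / (x1 - x0) < MIN_OVERLAP:
                continue
            # step 4: reject words extending below the stroke, i.e. words
            # that a too-low underline touches from the top
            if y1 > rect.y1:
                continue
            found.append(word)
    return found
```
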
#### Lemmatization and Translation

I don't recall now how I arrived at [`spaCy`](https://github.com/explosion/spaCy),
but that's what I ended up using for my lemmatization. There was one main
catch: sometimes, instead of underlining words I didn't know,
I underlined whole phrases. Lemmatization did not work well in those
contexts; I had to specifically restrict my lemmatization to single-word
underlines, and to strip punctuation which occasionally got tacked on.

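The lemmatization itself is tiny. A minimal sketch, assuming the small
French pipeline `fr_core_news_sm` (the helper name is mine):

```python
import spacy

nlp = spacy.load("fr_core_news_sm")  # small French pipeline

def lemma_of(underlined: str) -> str | None:
    # Strip punctuation that occasionally gets captured with the word.
    tokens = [t for t in nlp(underlined) if not t.is_punct]
    # Restrict to single-word underlines; phrases don't lemmatize well.
    if len(tokens) != 1:
        return None
    return tokens[0].lemma_

print(lemma_of("fougueuse"))  # expected: "fougueux"
```
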
With lemmatization in hand, I moved on to the next step: translation.

I wanted my entire tool to work completely offline. As a result, I had to
search for "python offline translation", to learn about
[`argos-translate`](https://github.com/argosopentech/argos-translate).
Frankly, the translation piece is almost entirely uninteresting:
it boils down to invoking a single function. I might add that
`argos-translate` requires one to download language packages --- they
do not ship with the Python package. Codex knew to write a script to do
so, which saved a little bit of documentation-reading and typing.

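For reference, both halves look roughly like this --- the one-time package
download and the translation call itself. This is a sketch based on
`argos-translate`'s documented API, not the exact script Codex produced:

```python
import argostranslate.package
import argostranslate.translate

# One-time setup: fetch the index and install the fr -> en package.
argostranslate.package.update_package_index()
available = argostranslate.package.get_available_packages()
fr_to_en = next(
    p for p in available if p.from_code == "fr" and p.to_code == "en"
)
argostranslate.package.install_from_path(fr_to_en.download())

# After that, translation really is a single function call.
print(argostranslate.translate.translate("fougueux", "fr", "en"))
```
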
The net result is a program that could produce:

```
Page 95: fougueuse -> fougueux -> fiery
```

Pretty good!

### Manual Intervention
That "pretty good" breaks down very fast. There are several points of failure:
the lemmatization can often get confused, and the offline translation
fails for some of the more flowery Camus language.

In the end, for somewhere on the order of 70% of the words I underlined,
the automatic translation was insufficient, and required small tweaks
(changing the tense of the lemma, adding "to" to infinitive English verbs,
etc.).

I thought --- why not just make this interactive? Fortunately, there are
plenty of Flask applications in Codex's training dataset. In one shot,
it generated a little web application that enabled me to tweak the source word
and final translation. It also enabled me to throw away certain underlines.
This was useful when, across different sessions, I forgot and underlined
the same word, or when I underlined a word but later decided it was not worth
including in my studying. This application produced an Anki deck, using
the Python library [`genanki`](https://github.com/kerrickstaley/genanki).
Anki has a nice mechanism to de-duplicate decks, which meant that every
time I exported a new batch of words, I could add them to my running
collection.

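The deck-exporting step looks roughly like this (a sketch; the model and
deck IDs are arbitrary constants you generate once and keep, and keying each
note's `guid` on the source word is what lets Anki de-duplicate across
exports):

```python
import genanki

# IDs are arbitrary, but must stay fixed across exports.
MODEL = genanki.Model(
    1607392319,
    "Vocab Card",
    fields=[{"name": "French"}, {"name": "English"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{French}}",
        "afmt": "{{FrontSide}}<hr id=answer>{{English}}",
    }],
)

deck = genanki.Deck(2059400110, "La Peste Vocabulary")
for french, english in [("fougueux", "fiery")]:
    deck.add_note(genanki.Note(
        model=MODEL,
        fields=[french, english],
        # Stable guid: re-exporting the same word won't create a duplicate.
        guid=genanki.guid_for(french),
    ))

genanki.Package(deck).write_to_file("vocab.apkg")
```
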
Even then, however, cleaning up the auto-translation was not always easy.
The OCR copy of the book had strange idiosyncrasies: the letters 'fi' together
would OCR to '=' or '/'. Sometimes, I would underline a compound phrase
that spanned two lines; though I knew the individual words (and would be surprised
to find them in my list), I did not know their interaction.

In the end, I added (had Codex add) both a text-based context and a visual
capture of the word in question to the web application. This led to the final
version, whose screenshot I included above. Here it is again:

{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}

The net result was that, for many words, I could naively accept the
automatically-generated suggestion. For those where this was not possible,
in most cases I only had to tweak a few letters, which still saved me time.
Finally, I was able to automatically include the context of the word in
my flashcards, which often helps me reinforce the translation and remember
the exact sense in which the word was used.

To this day, I haven't found a single word that was underlined and missed,
nor one that was mis-identified as underlined.

### Future Direction

In many ways, this software is more than good enough for my needs.
I add a new batch of vocabulary roughly every two weeks, at which point
I manually export a PDF of _La Peste_ from my tablet and plug it into
my software.

In my ideal world, I wouldn't have to do that. I would just underline some
words, and come back to my laptop a few days later to find a set of draft
flashcards for me to review and edit. In an even more ideal world, words
I underline would get "magically" translated, and the translations would
appear somewhere in the margins of my text (while also being placed in my
list of flashcards).

I suspect LLMs --- local ones --- might be a decent alternative technology
to "conventional" translation. By automatically feeding them the context
and underlined portion, it might be possible to get a more
robust translation and flashcard. I experimented with this briefly
early on, but did not have much success. Perhaps better prompting or newer
models would improve the outcomes.

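If I revisit this, the experiment would look something like the sketch
below, using the `ollama` Python client as one possible local runtime; the
model name and prompt are placeholders, not something I've validated:

```python
import ollama  # assumes a local Ollama server with a model already pulled

def translate_with_context(word: str, context: str) -> str:
    # Placeholder prompt: hand the model the underlined word plus the
    # sentence it appeared in, and ask for a flashcard-ready translation.
    response = ollama.chat(
        model="llama3.1",  # placeholder local model
        messages=[{
            "role": "user",
            "content": (
                f'In the sentence: "{context}"\n'
                f'give the dictionary form of the French word "{word}" '
                "and a short English translation."
            ),
        }],
    )
    return response["message"]["content"]
```
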
That said, I think that those features are way beyond the 80:20 transition:
it would be much harder for me to get to that point, and the benefit would
be relatively small. Today, I'm happy to stick with what I already have.

### Personal Software with the Help of LLMs

Like I mentioned earlier, this was one of my earliest experiences with
LLM-driven development, and I think it shaped my outlook on the technology
quite a bit. For me, the bottom line is this: _with LLMs, I was able to
rapidly solve a problem that was holding me back in another area of my life_.
My goal was never to "produce software", but to "acquire vocabulary",
and, viewed from this perspective, I think the experience has been a
colossal success.

As someone who works on software, I am always reminded that end-users rarely
care about the technology as much as we technologists do; they care about
having their problems solved. I find taking that perspective to be challenging
(though valuable) because software is my craft, and because in thinking
about the solution, I have to think about the elements that bring it to life.

With LLMs, I was able --- allowed? --- to view things more from the
end-user's perspective. I didn't know, and didn't need to know, the API
for `PyMuPDF`, `argostranslate`, or `spaCy`. I didn't need to understand
the PDF format. I could move one step away from the nitty-gritty and focus
on the 'why' and the 'what'.

The boundary between 'manual' and 'automatic' was not always consistent.
Though I didn't touch any of the `PyMuPDF` code, I did need to look fairly
closely at the logic that classified my squiggles as "underlines" and found
associated words. In the end, though, I was able to focus on the core
challenge of what I wanted to accomplish (the inherent complexity) and
avoid altogether the unrelated difficulties that merely happened to be
there (downloading language modules; learning translation APIs; etc.).

This was true even when I was writing the code myself. Codex created the
word-highlighting utility in one shot in a matter of seconds, saving
me probably close to an hour of interpreting the algorithm's outputs
while I iterated on the proper heuristic.

By enabling me to _do_, the LLM let me make rapid progress, and produce
solutions to problems I would've previously deemed "too hard" or "too tedious".
This did, however, markedly reduce the care with which I was examining
the output. I don't think I've _ever_ read the code that produces the
pretty colored boxes in my program's debug output. This shift, I think,
has been a divisive element of AI discourse in technical communities.
I think that this has to do, at least in part, with different views
on code as a medium.

#### The Builders and the Craftsmen
AI discourse is nothing new; others before me have identified a distinction
between individuals that seems to color their perspective on LLMs. Those
that appreciate writing software as a craft, treating code as an end
in and of itself (at least in part), tend to be saddened and repulsed by
the advent of LLMs. LLMs produce "good enough" code, but so far their
output lacks elegance, organization, and perhaps, care. On the other hand,
those that treat software as a means to an end, who want to see their
vision brought to reality, view LLMs with enthusiasm. It has never been
easier to make something, especially if that something is of a shape
that's been made before.

My flashcard extractor can be viewed in vastly different ways
from these two perspectives. In terms of craft, I think that it is at best
mediocre; most of the code is generated, slightly verbose and somewhat
tedious. The codebase is far from inspiring, and if I had written it by hand,
I would not be particularly proud of it. In terms of product, though,
I think it tells an exciting story: here I am, reading Camus again, because
I was able to improve the workflow around said reading. In a day, I was able
to achieve what I couldn't muster in a year or two on my own.

The truth is, the "builder vs. craftsman" distinction is a simplifying one,
another in the long line of "us vs. them" classifications. Any one person is
capable of occupying any combination of these two camps at any given time.
Indeed, different sorts of software demand to be viewed through different
lenses. I will _still_ treat work on my long-term projects as craft, because
I will come back to it again and again, and because our craft has evolved
to engender stability and maintainability.

However, I am more than happy to settle for 'underwhelming' when it means an
individual need of mine can be addressed in record time. I think this
gives rise to a new sort of software: highly individual, explicitly
non-robust, and treated differently from software crafted with
deliberate thought and foresight.

#### Personal Software

I think as time goes on, I am becoming more and more convinced by the idea
of "personal software". One might argue that much of the complexity in many
pieces of software is driven by the need of that software to accommodate
the diverse needs of many users. Still, software remains somewhat inflexible
and unable to accommodate individual needs. Features or uses that demand
changes at the software level move at a slower pace: finite developer time
needs to be spent analyzing what users need, determining the costs of this
new functionality, and choosing which of the many possible requests to
fulfill. On the other hand, software that enables the users to build their
customizations for themselves, by exposing numerous configuration options
and abstractions, becomes, over time, very complicated to grasp.

Now, suppose that the complexity of such software scales superlinearly with
the number of features it provides. Suppose also that individual users
leverage only a small subset of the software's functionality. From these
assumptions it would follow that individual programs, made to serve a single
user's need, would be significantly less complicated than the "whole".
By definition, these programs would also be better tailored to their users'
needs. With LLMs, we're getting to a future where this might be possible.

I think that my flashcard generator is an early instance of such software.
It doesn't worry about various book formats, or various languages, or
various page layouts. The heuristic was tweaked to fit my use case, and
now works 100% of the time. I understand the software in its entirety.
I thought about sharing it --- and, in a way, I did, since it's
[open source](https://dev.danilafe.com/DanilaFe/vocab-builder) --- but realized
that outside of the constraints of my own problem, it likely will not be
of that much use. I _could_ experiment with more varied constraints, but
that would turn it back into the sort of software I discussed above:
general, robust, and complex.

Today, I think that there is a whole class of software that is amenable to
being "personal". My flashcard generator is one such piece of software;
I imagine file-organization (as served by many "bulk rename and move" pieces
of software out there), video wrangling (possible today with `ffmpeg`'s
myriad of flags and switches), and data visualization to be other
instances of problems in that class. I am merely intuiting here, but
if I had to give a rough heuristic, it would be problems that:

* __fulfill a low-frequency need__, because availability, deployment,
  etc. significantly raise the bar for quality.
  * e.g., I collect flashcards once every two weeks;
    I organize my filesystem once a month; I don't spend nearly enough money
    to want to re-generate cash flow charts very often
* __have an "answer" that's relatively easy to assess__, because
  LLMs are not perfect and iteration must be possible and easy.
  * e.g., I can see that all the underlined words are listed in my web app;
    I know that my files are in the right folders, named appropriately,
    by inspection; my charts seem to track with reality
* __have a relatively complex technical implementation__, because
  why would you bother invoking an LLM if you can "just" click a button
  somewhere?
  * e.g., extracting data from PDFs requires some wrangling;
    bulk-renaming files requires some tedious and possibly case-specific
    pattern matching; cash flow between N accounts requires some graph
    analysis
* __have relatively low stakes__, again, because LLMs are not perfect,
  and neither is (necessarily) one's understanding of the problem.
  * e.g., it's OK if I miss some words I underlined; my cash flow
    charts only give me an impression of my spending.
  * I recognize that moving files is a potentially destructive operation.

I dream of a world in which, to make use of my hardware, I just _ask_,
and don't worry much about languages, frameworks, or sharing my solution
with others --- that last one because they can just ask as well.

#### The Unfair Advantage of Being Technical
I recognize that the success described here did not come for free. There
were numerous parts of the process where my software background helped
me get the most out of Codex.

For one thing, writing software trains us to think precisely about problems.
We learn to state exactly what we want, to decompose tasks into steps,
and to intuit the exact size of these steps; to know what's hard and what's
easy for the machine. When working with an LLM, these skills make it possible
to hit the ground running, to know what to ask and to help pluck out a
particular solution from the space of various approaches. I think that this
greatly amplifies the effectiveness of LLMs compared to what non-technical
users can achieve.

Another advantage software folks have when leveraging LLMs is the established
rigor of software development. LLMs can and do make mistakes, but so do people.
Our field has been built around reducing these mistakes' impact and frequency.
Knowing to use version control helps turn the pathological downward spiral
of accumulating incorrect tweaks into monotonic, step-wise improvements.
Knowing how to construct a test suite and think about edge cases can
provide an LLM agent the grounding it needs to iterate rapidly and safely.

In this way, I think the dream of personal software is far from being realized
for the general public. Without the foundation of experience and rigor,
LLM-driven development can easily devolve into a frustrating and endless
back-and-forth, or worse, successfully build software that is subtly and
convincingly wrong.

#### The Shoulders of Giants

The only reason all of this was possible is that the authors of `PyMuPDF`,
`genanki`, `spaCy`, and `argos-translate` made them available for me to use from
my code. These libraries provided the bulk of the functionality that Codex and I
were able to glue into a final product. It would be a mistake to forget this,
and to confuse the sustained, thoughtful efforts of the people behind these
projects for the one-off, hyper-specific software I've been talking about.

We need these packages, and others like them, to provide a foundation for the
things we build. They bring stability, reuse, and the sort of cohesion that
is not possible through an amalgamation of home-grown personal scripts.
In my view, something like `spaCy` is to my flashcard script as a brick is to
grout. There is a fundamental difference.

I don't know how LLMs will integrate into the future of large-scale software
development. The discipline becomes something else entirely when the
constraints of "personal software" I floated above cease to apply. Though
LLMs can still enable doing what was previously too difficult, tedious,
or time-consuming (like my little 'underline visualizer'), it remains
to be seen how to integrate this new ease into the software lifecycle
without threatening its future.