---
title: "Generating Flashcards from PDF Underlines"
date: 2026-04-04T12:25:14-07:00
tags: ["LLMs", "Python"]
draft: true
---
__TL;DR__: I, with the help of ChatGPT, wrote a program that helps me
extract vocabulary words from PDFs. Scroll just a bit further down
to see what it looks like.
Sometime in 2020 or 2021, during the COVID-19 pandemic, I overheard from some
source that Albert Camus, in his book _La Peste_ (The Plague), had quite
accurately described the experience that many of us were going through
at the time. Having studied French for several years, I decided that the
best way to see for myself what _La Peste_ is all about was to read it
in its original, untranslated form.
I made good progress, but I certainly did not know every word. On the surface,
I was faced with two choices: guess the words from context and read without
stopping, or interrupt my reading to look up unfamiliar terms. The former
seemed unfortunate since it stunted my ability to acquire new vocabulary;
the latter was unpleasant, making me constantly break from the prose
(and the e-ink screen of my tablet) to consult a dictionary.
In the end, I decided to underline the words, and come back to them later.
However, even then, the task was fairly arduous. For one, words I don't recognize
aren't always in their canonical form (they can be conjugated, plural, compound,
and more): I had to spend some time deciphering what I should add to a
flashcard. For another, I had to bounce between a PDF of my book
(from which, fortunately, I can copy-paste) and my computer. Often, a word
taken out of context confused the translation software, so I had to copy more
of the surrounding text. Finally, I learned that, given these limitations,
the pace of my reading far exceeded the rate of my translation. This led me
to underline fewer words.
I thought,
> Perhaps I can just have some software automatically extract the underlined
> portions of the words, find the canonical forms, and generate flashcards?
Even thinking this thought was a mistake. From then on, as I read and went
about underlining my words, I thought about how much manual effort I would
be taking on that could be automated. However, I didn't know how to start
the automation. In the end, I switched to reading books in English.
Then, LLMs got good at writing code. With the help of
Codex, I finally got the tools that I was dreaming about. Here's what it looks
like.
{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}
{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}
This was my first foray into LLM-driven development. My commentary about that
experience (as if there isn't enough of such content out there!) will be
interleaved with the technical details.
### The Core Solution
The core idea has always been:
1. Find things that look like underlines
2. See which words they correspond to
3. Perform {{< sidenote "right" "lemmatization-node" "lemmatization" >}}
Lemmatization (<a href="https://en.wikipedia.org/wiki/Lemmatization">wikipedia</a>) is the
process of turning non-canonical forms of words (like <code>am</code> (eng) /
<code>suis</code> (fr)) into their canonical form which might be found in the
dictionary (<code>to be</code> / <code>être</code>).
{{< /sidenote >}} and translate.
My initial direction was shaped by the impressive demonstrations of OCR
models, which could follow instructions at the same time as reading a document.
For these models, a prompt like "extract all the text in the red box"
constituted the entire targeted OCR pipeline. My hope was that a similar
prompt, "extract all underlined words", would be sufficient to accomplish
steps 1 and 2. However, I was never to find out: as it turns out,
OCR models are large and very expensive to run. In addition, the model
that I was looking at was specifically tailored for NVIDIA hardware which
I, with my MacBook, simply didn't have access to.
In the end, I came to the conclusion that a VLM is overkill for the problem
I'm tackling. This took me down the route of analyzing the PDFs. The
problem, of course, is that I know nothing of the Python landscape
of PDF analysis tools, and that I also know nothing about the PDF format
itself. This is where Codex v1 came in. It opted (from its training
data, I presume) to use the [`PyMuPDF`](https://pymupdf.readthedocs.io) package.
It also guessed (correctly) that the PDFs exported by my tablet used
the 'drawings' part of the PDF spec to encode what I penned. I was instantly
able to see (on the console) the individual drawings.
The LLM also chose to approach the problem by treating each drawing as just
a "cloud of points", discarding the individual line segment data. This
seemed like a nice enough simplification, and it worked well in the long run.
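Assuming PyMuPDF's documented `page.get_drawings()` output, where each drawing's `"items"` list holds segments like `("l", p1, p2)` for lines, `("c", ...)` for Bézier curves, and `("re", rect, ...)` for rectangles, the flattening might look roughly like this (a sketch, not the tool's actual code):
```python
def drawing_points(drawing):
    """Flatten one PyMuPDF-style drawing dict into a list of (x, y) points.

    Segment structure is discarded: lines and curves contribute their
    endpoints/control points, rectangles their four corners.
    """
    points = []
    for item in drawing.get("items", []):
        kind = item[0]
        if kind in ("l", "c"):  # line or curve: points follow the tag
            points.extend((p[0], p[1]) for p in item[1:])
        elif kind == "re":      # rectangle: use its four corners
            x0, y0, x1, y1 = item[1]
            points.extend([(x0, y0), (x1, y0), (x1, y1), (x0, y1)])
    return points

# Typical use:
# import fitz  # PyMuPDF
# page = fitz.open("book.pdf")[0]
# clouds = [drawing_points(d) for d in page.get_drawings()]
```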
#### Iterating on the Heuristic
The trouble with the LLM agent was that it had no good way of verifying
whether the lines it detected (and indeed, the words it considered underlined)
were actually lines (and underlined words). Its initial algorithm missed
many words, and misidentified others. I had to resort to visual inspection
to see what was being missed, and for the likely cause.
The exact process of the iteration is not particularly interesting. I'd
tweak a threshold, re-run the code, and see the new list of words.
I'd then cross-reference the list with the page in question, to see
if things were being over- or under-included. Rinse, repeat.
This got tedious fast. In some cases, letters or words I penned would get picked
up as underlines, and slightly diagonal strokes would get missed. I ended up
requesting Codex to generate a debugging utility that highlighted (in a box)
all the segments that it flagged, and the corresponding words. This
is the first picture I showed in the post. Here it is again:
{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}
In the end, the rough algorithm was as follows:
1. __Identify all point clouds that are not too tall__. Drawings that
vertically span too many lines of text are likely not underlines.
* The 'height threshold' ended up being larger than I anticipated:
turns out I don't draw very straight horizontal lines.
{{< figure src="tallmarks.png" caption="My angled underlines" label="My angled underlines" >}}
2. __Create a bounding box for the line,__ with some padding.
I don't draw the lines _directly_ underneath the text, but a bit below.
* Sometimes, I draw the line quite a bit below; the upward padding
had to be sizeable.
{{< figure src="lowmarks.png" caption="My too-low underlines" label="My too-low underlines" >}}
3. __Intersect `PyMuPDF` bounding boxes with the line__. Fortunately,
`PyMuPDF` provides word rectangles out of the box.
* I required the intersection to overlap with at least 60% of the word's
horizontal width, so accidental overlaps don't count.
{{< figure src="widemarks.png" caption="My too-wide underline hitting `Cela`" label="My too-wide underline hitting `Cela`" >}}
* The smallest underlines are roughly the same size as the biggest strokes
in my handwriting. The 60% requirement filtered out the latter, while
keeping the former.
{{< figure src="flaggedmarks.png" caption="Letters of a handwritten word detected as lines" label="Letters of a handwritten word detected as lines" >}}
4. __Reject underlines that overlap a word from the top__. As I mentioned,
my underlines are often so low that they touch the next line of text.
#### Lemmatization and Translation
I don't recall now how I arrived at [`spaCy`](https://github.com/explosion/spaCy),
but that's what I ended up using for my lemmatization. There was only
one main catch: sometimes, instead of underlining words I didn't know,
I underlined whole phrases. Lemmatization did not work well in those
contexts; I had to specifically restrict my lemmatization to single-word
underlines, and to strip punctuation which occasionally got tacked on.
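The single-word restriction and punctuation stripping can be sketched as a thin wrapper around a spaCy pipeline. This is a hedged reconstruction, not the tool's actual code, and the `fr_core_news_sm` model name below is an assumption:
```python
def lemma_for_underline(nlp, raw):
    """Lemmatize a single underlined word; leave whole phrases untouched.

    `nlp` is any spaCy-style pipeline: calling it yields tokens that
    carry `.lemma_` and `.is_punct` attributes.
    """
    cleaned = raw.strip(" .,;:!?\"'«»()")  # punctuation tacked onto the underline
    if " " in cleaned:
        return cleaned  # multi-word phrase: lemmatization misbehaves, keep as-is
    tokens = [t for t in nlp(cleaned) if not t.is_punct]
    return tokens[0].lemma_ if tokens else cleaned

# Typical use (model name is an assumption; any French pipeline would do):
# import spacy
# nlp = spacy.load("fr_core_news_sm")
# lemma_for_underline(nlp, "fougueuse,")  # fougueuse -> fougueux
```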
With lemmatization in hand, I moved on to the next step: translation.
I wanted my entire tool to work completely offline, so I searched for
"python offline translation" and learned about
[`argos-translate`](https://github.com/argosopentech/argos-translate).
Frankly, the translation piece is almost entirely uninteresting:
it boils down to invoking a single function. I might add that
`argos-translate` requires one to download language packages --- they
do not ship with the Python package. Codex knew to write a script to do
so, which saved a little bit of documentation-reading and typing.
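The setup boils down to picking the right package from the index and installing it. Something like the following, based on `argos-translate`'s documented API; treat the commented calls as my reconstruction rather than the script Codex actually wrote:
```python
def pick_package(available, from_code="fr", to_code="en"):
    """Select the language package for one translation direction."""
    for pkg in available:
        if pkg.from_code == from_code and pkg.to_code == to_code:
            return pkg
    raise LookupError(f"no {from_code}->{to_code} package in the index")

# One-time setup, roughly what the generated script did:
# import argostranslate.package
# import argostranslate.translate
# argostranslate.package.update_package_index()
# pkg = pick_package(argostranslate.package.get_available_packages())
# argostranslate.package.install_from_path(pkg.download())
# argostranslate.translate.translate("fougueux", "fr", "en")  # offline from here on
```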
The net result is a program that could produce:
```
Page 95: fougueuse -> fougueux -> fiery
```
Pretty good!
### Manual Intervention
That "pretty good" breaks down very fast. There are several points of failure:
the lemmatization can often get confused, and the offline translation
fails for some of the more flowery Camus language.
In the end, for somewhere on the order of 70% of the words I underlined,
the automatic translation was insufficient, and required small tweaks
(changing the tense of the lemma, adding "to" to infinitive English verbs, etc.).
I thought --- why not just make this interactive? Fortunately, there are
plenty of Flask applications in Codex's training dataset. In one shot,
it generated a little web application that enabled me to tweak the source word
and final translation. It also enabled me to throw away certain underlines.
This was useful when, across different sessions, I forgot and underlined
the same word, or when I underlined a word but later decided it was not worth
including in my studying. This application produced an Anki deck, using
the Python library [`genanki`](https://github.com/kerrickstaley/genanki).
Anki has a nice mechanism to de-duplicate decks, which meant that every
time I exported a new batch of words, I could add them to my running
collection.
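The de-duplication hinges on giving each note a stable identifier so that re-imports of the same lemma collapse into one card. One way to sketch it; the hashing scheme and the `genanki` field names and IDs below are my own, not necessarily the tool's:
```python
import hashlib

def note_guid(lemma, source="la-peste"):
    """Stable per-lemma identifier, so re-exported notes de-duplicate in Anki."""
    return hashlib.sha1(f"{source}:{lemma}".encode("utf-8")).hexdigest()[:16]

# Sketch of the export itself (model/deck names and IDs are made up):
# import genanki
# model = genanki.Model(1607392319, "Vocab",
#     fields=[{"name": "French"}, {"name": "English"}],
#     templates=[{"name": "Card", "qfmt": "{{French}}",
#                 "afmt": "{{FrontSide}}<hr>{{English}}"}])
# deck = genanki.Deck(2059400110, "La Peste")
# deck.add_note(genanki.Note(model=model, fields=["fougueux", "fiery"],
#                            guid=note_guid("fougueux")))
# genanki.Package(deck).write_to_file("vocab.apkg")
```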
Even then, however, cleaning up the auto-translation was not always easy.
The OCR copy of the book had strange idiosyncrasies: the letters 'fi' together
would OCR to '=' or '/'. Sometimes, I would underline a compound phrase
that spanned two lines; though I knew the individual words (and would be surprised
to find them in my list), I did not know their interaction.
In the end, I added (had Codex add) both a text-based context and a visual
capture of the word in question to the web application. This led to the final
version, whose screenshot I included above. Here it is again:
{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}
The net result was that, for many words, I could naively accept the
automatically-generated suggestion. For those where this was not possible,
in most cases I only had to tweak a few letters, which still saved me time.
Finally, I was able to automatically include the context of the word in
my flashcards, which often helps reinforce the translation and remember
the exact sense in which the word was used.
To this day, I haven't found a single word that was underlined and missed,
nor one that was mis-identified as underlined.
### Future Direction
In many ways, this software is more than good enough for my needs.
I add a new batch of vocabulary roughly every two weeks, during which time
I manually export a PDF of _La Peste_ from my tablet and plug it into
my software.
In my ideal world, I wouldn't have to do that. I would just underline some
words, and come back to my laptop a few days later to find a set of draft
flashcards for me to review and edit. In an even more ideal world, words
I underline get "magically" translated, and the translations appear somewhere
in the margins of my text (while also being placed in my list of flashcards).
I suspect LLMs --- local ones --- might be a decent alternative technology
to "conventional" translation. By automatically feeding them the context
and underlined portion, it might be possible to automatically get a more
robust translation and flashcard. I experimented with this briefly
early on, but did not have much success. Perhaps better prompting or newer
models would improve the outcomes.
That said, I think that those features are way beyond the 80:20 transition:
it would be much harder for me to get to that point, and the benefit would
be relatively small. Today, I'm happy to stick with what I already have.
### Personal Software with the Help of LLMs
Like I mentioned earlier, this was one of my earliest experiences with
LLM-driven development, and I think it shaped my outlook on the technology
quite a bit. For me, the bottom line is this: _with LLMs, I was able to
rapidly solve a problem that was holding me back in another area of my life_.
My goal was never to "produce software", but to "acquire vocabulary",
and, viewed from this perspective, I think the experience has been a
colossal success.
As someone who works on software, I am always reminded that end-users rarely
care about the technology as much as us technologists; they care about
having their problems solved. I find taking that perspective to be challenging
(though valuable) because software is my craft, and because in thinking
about the solution, I have to think about the elements that bring it to life.
With LLMs, I was able --- allowed? --- to view things more so from the
end-user perspective. I didn't know, and didn't need to know, the API
for `PyMuPDF`, `argostranslate`, or `spaCy`. I didn't need to understand
the PDF format. I could move one step away from the nitty-gritty and focus
on the 'why' and the 'what', on the challenge of what I wanted to accomplish.
I wrestled with the inherent complexity and
avoided altogether the unrelated difficulties that merely happened to be
there (downloading language modules; learning translation APIs; etc.).
By enabling me to do this, the LLM let me make rapid progress, and to produce
solutions to problems I would've previously deemed "too hard" or "too tedious".
This did, however, markedly reduce the care with which I was examining
the output. I don't think I've _ever_ read the code that produces the
pretty colored boxes in my program's debug output. This shift, I think,
has been a divisive element of AI discourse in technical communities.
I think that this has to do, at least in part, with different views
on code as a medium.
#### The Builders and the Craftsmen
There are two perspectives through which one may view software:
as a craft in and of itself, and as a means to some end.
My flashcard extractor looks vastly different when viewed
from these two perspectives. In terms of craft, I think that it is at best
mediocre; most of the code is generated, slightly verbose and somewhat
tedious. The codebase is far from inspiring, and if I had written it by hand,
I would not be particularly proud of it. In terms of product, though,
I think it tells an exciting story: here I am, reading Camus again, because
I was able to improve the workflow around said reading. In a day, I was able
to achieve what I couldn't muster in a year or two on my own.
The truth is, the "builder vs. craftsman" distinction is a simplifying one,
another in the long line of "us vs. them" classifications. Any one person is
capable of occupying any combination of these two camps at any given time. Indeed,
different sorts of software demand to be viewed through different lenses.
I will _still_ treat work on my long-term projects as craft, because
I will come back to it again and again, and because our craft has evolved
to engender stability and maintainability.
However, I am more than happy to settle for 'underwhelming' when it means an
individual need of mine can be addressed in record time. I think this
gives rise to a new sort of software: highly individual, explicitly
non-robust, and treated differently from software crafted with
deliberate thought and foresight.
#### Personal Software
I think as time goes on, I am becoming more and more convinced by the idea
of "personal software". One might argue that much of the complexity in many
pieces of software is driven by the need of that software to accommodate
the diverse needs of many users. Still, software remains somewhat inflexible and
unable to accommodate individual needs. Features or uses that demand
changes at the software level move at a slower pace: finite developer time
needs to be spent analyzing what users need, determining the costs of this new
functionality, choosing which of the many possible requests to fulfill.
On the other hand, software that enables the users to build their customizations
for themselves, by exposing numerous configuration options and abstractions,
becomes, over time, very complicated to grasp.
Now, suppose that the complexity of such software scales superlinearly with
the number of features it provides. Suppose also that individual users
leverage only a small subset of the software's functionality. From these
assumptions it would follow that individual programs, made to serve a single
user's need, would be significantly less complicated than the "whole".
By definition, these programs would also be better tailored to the users'
needs. With LLMs, we're getting to a future where this might be possible.
I think that my flashcard generator is an early instance of such software.
It doesn't worry about various book formats, or various languages, or
various page layouts. The heuristic was tweaked to fit my use case, and
now works 100% of the time. I understand the software in its entirety.
I thought about sharing it --- and, in a way, I did, since it's
[open source](https://dev.danilafe.com/DanilaFe/vocab-builder) --- but realized
that outside of the constraints of my own problem, it likely will not be
of that much use. I _could_ experiment with more varied constraints, but
that would turn it back into the sort of software I discussed above:
general, robust, and complex.
Today, I think that there is a whole class of software that is amenable to
being "personal". My flashcard generator is one such piece of software;
I imagine file-organization (as served by many "bulk rename and move" pieces
of software out there), video wrangling (possible today with `ffmpeg`'s
myriad of flags and switches), and data visualization to be other
instances of problems in that class. I am merely intuiting here, but
if I had to give a rough heuristic, it would be problems that:
* __fulfill a low-frequency need__, because availability, deployment,
etc. significantly raise the bar for quality.
* e.g., I collect flashcards once every two weeks;
I organize my filesystem once a month; I don't spend nearly enough money
to want to re-generate cash flow charts very often
* __have an "answer" that's relatively easy to assess__, because
LLMs are not perfect and iteration must be possible and easy.
* e.g., I can see that all the underlined words are listed in my web app;
I know that my files are in the right folders, named appropriately,
by inspection; my charts seem to track with reality
* __have a relatively complex technical implementation__, because
why would you bother invoking an LLM if you can "just" click a button somewhere?
* e.g., extracting data from PDFs requires some wrangling;
bulk-renaming files requires some tedious and possibly case-specific
pattern matching; cash flow between N accounts requires some graph
analysis
* __have relatively low stakes__, again, because LLMs are not perfect,
and neither, necessarily, is one's understanding of the problem.
* e.g., it's OK if I miss some words I underlined; my cash flow
charts only give me an impression of my spending. (I recognize that
moving files is a potentially destructive operation.)
I dream of a world in which, to make use of my hardware, I just _ask_,
and don't worry much about languages, frameworks, or sharing my solution
with others --- that last one because they can just ask as well.
#### The Unfair Advantage of Being Technical
I recognize that my success described here did not come for free. There
were numerous parts of the process where my software background helped
get the most out of Codex.
For one thing, writing software trains us to think precisely about problems.
We learn to state exactly what we want, to decompose tasks into steps,
and to intuit the exact size of these steps; to know what's hard and what's
easy for the machine. When working with an LLM, these skills make it possible
to hit the ground running, to know what to ask and to help pluck out a particular
solution from the space of various approaches. I think that this greatly
accelerates the effectiveness of using LLMs compared to non-technical experts.
For another, the boundary between 'manual' and 'automatic' is not always clear-cut.
Though I didn't touch any of the `PyMuPDF` code, I did need to look fairly
closely at the logic that classified my squiggles as "underlines" and found
associated words. It was not enough to treat LLM-generated code as a black box.
Another advantage software folks have when leveraging LLMs is the established
rigor of software development. LLMs can and do make mistakes, but so do people.
Our field has been built around reducing these mistakes' impact and frequency.
Knowing to use version control helps turn the pathological downward spiral
of accumulating incorrect tweaks into monotonic, step-wise improvements.
Knowing how to construct a test suite and thinking about edge cases can
provide an agent LLM the grounding it needs to iterate rapidly and safely.
In this way, I think the dream of personal software is far from being realized
for the general public. Without the foundation of experience and rigor,
LLM-driven development can easily devolve into a frustrating and endless
back-and-forth, or worse, successfully build software that is subtly and
convincingly wrong.
#### The Shoulders of Giants
The only reason all of this was possible is that the authors of `PyMuPDF`,
`genanki`, `spaCy`, and `argos-translate` made them available for me to use from
my code. These libraries provided the bulk of the functionality that Codex and I
were able to glue into a final product. It would be a mistake to forget this,
and to confuse the sustained, thoughtful efforts of the people behind these
projects for the one-off, hyper-specific software I've been talking about.
We need these packages, and others like them, to provide a foundation for the
things we build. They bring stability, reuse, and the sort of cohesion that
is not possible through an amalgamation of home-grown personal scripts.
In my view, something like `spaCy` is to my flashcard script as a brick is to
grout. There is a fundamental difference.
I don't know how LLMs will integrate into the future of large-scale software
development. The discipline becomes something else entirely when the
constraints of "personal software" I floated above cease to apply. Though
LLMs can still enable doing what was previously too difficult, tedious,
or time consuming (like my little 'underline visualizer'), it remains
to be seen how to integrate this new ease into the software lifecycle
without threatening its future.