---
title: "Generating Flashcards from PDF Underlines"
date: 2026-04-05T16:02:00-07:00
tags: ["LLMs", "Python"]
series: ["LLM-Assisted Flashcard Generator"]
description: "In this post, I describe my personal PDF-to-vocab-flashcards pipeline"
---
__TL;DR__: I, with the help of ChatGPT, wrote [a program](https://dev.danilafe.com/DanilaFe/vocab-builder)
that helps me extract vocabulary words from PDFs. Scroll just a bit further down
to see what it looks like.

Sometime in 2020 or 2021, during the COVID-19 pandemic, I overheard from some
source that Albert Camus, in his book _La Peste_ (The Plague), had quite
accurately described the experience that many of us were going through
at the time. Having studied French for several years, I decided that the
best way to see for myself what _La Peste_ is all about was to read it
in its original, untranslated form.

I made good progress, but I certainly did not know every word. On the surface,
I was faced with two choices: guess the words from context and read without
stopping, or interrupt my reading to look up unfamiliar terms. The former
seemed unfortunate since it stunted my ability to acquire new vocabulary;
the latter was unpleasant, making me constantly break from the prose
(and the e-ink screen of my tablet) to consult a dictionary.

In the end, I decided to underline the words and come back to them later.
However, even then, the task was fairly arduous. For one, words I don't recognize
aren't always in their canonical form (they can be conjugated, plural, compound,
and more): I have to spend some time deciphering what I should add to a
flashcard. For another, I had to bounce between a PDF of my book
(from where, fortunately, I can copy-paste) and my computer. Often, a word
confused the translation software out of context, so I had to copy more of the
surrounding text. Finally, I learned that, given these limitations, the pace of
my reading far exceeded the rate of my translation. This led me to underline
fewer words.

I thought,
> Perhaps I can just have some software automatically extract the underlined
> portions of the words, find the canonical forms, and generate flashcards?

Even thinking this thought was a mistake. From then on, as I read and went
about underlining my words, I thought about how much manual effort I would
be taking on that could be automated. However, I didn't know how to start
the automation. In the end, I switched to reading books in English.

Then, LLMs got good at writing code. With the help of
Codex, I finally got the tools that I was dreaming about. Here's what it looks
like.

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}

{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}

This was my first foray into LLM-driven development. My commentary about that
experience (as if there isn't enough of such content out there!) will be
interleaved with the technical details.
### The Core Solution
The core idea has always been:
1. Find things that look like underlines
2. See which words they correspond to
3. Perform {{< sidenote "right" "lemmatization-node" "lemmatization" >}}
Lemmatization (<a href="https://en.wikipedia.org/wiki/Lemmatization">Wikipedia</a>) is the
process of turning non-canonical forms of words (like <code>am</code> (eng) /
<code>suis</code> (fr)) into their canonical form which might be found in the
dictionary (<code>to be</code> / <code>être</code>).
{{< /sidenote >}} and translate.

My initial direction was shaped by the impressive demonstrations of OCR
models, which could follow instructions at the same time as reading a document.
For these models, a prompt like "extract all the text in the red box"
constituted the entire targeted OCR pipeline. My hope was that a similar
prompt, "extract all underlined words", would be sufficient to accomplish
steps 1 and 2. However, I was never to find out: as it turns out,
OCR models are large and very expensive to run. In addition, the model
that I was looking at was specifically tailored for NVIDIA hardware which
I, with my MacBook, simply didn't have access to.

In the end, I came to the conclusion that a vision-language model (VLM) is
overkill for the problem I'm tackling. This took me down the route of
analyzing the PDFs. The
problem, of course, is that I know nothing of the Python landscape
of PDF analysis tools, and that I also know nothing about the PDF format
itself. This is where Codex came in. Codex opted (from its training
data, I presume) to use the [`PyMuPDF`](https://pymupdf.readthedocs.io) package.
It also guessed (correctly) that the PDFs exported by my tablet used
the 'drawings' part of the PDF spec to encode what I penned. I was instantly
able to see (on the console) the individual drawings.

The LLM also chose to approach the problem by treating each drawing as just
a "cloud of points", discarding the individual line segment data. This
seemed like a nice enough simplification, and it worked well in the long run.
#### Iterating on the Heuristic
The trouble with the LLM agent was that it had no good way of verifying
whether the lines it detected (and indeed, the words it considered underlined)
were actually lines (and underlined words). Its initial algorithm missed
many words, and misidentified others. I had to resort to visual inspection
to see what was being missed, and to guess at the likely cause.

The exact process of the iteration is not particularly interesting. I'd
tweak a threshold, re-run the code, and see the new list of words.
I'd then cross-reference the list with the page in question, to see
if things were being over- or under-included. Rinse, repeat.

This got tedious fast. In some cases, letters or words I penned would get picked
up as underlines, and slightly diagonal strokes would get missed. I ended up
requesting Codex to generate a debugging utility that highlighted (in a box)
all the segments that it flagged, and the corresponding words. This
is the first picture I showed in the post. Here it is again:

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}

In the end, the rough algorithm was as follows:
1. __Identify all point clouds that are not too tall__. Strokes that
vertically span too many lines of text are likely not underlines.
* The 'height threshold' ended up being larger than I anticipated:
turns out I don't draw very straight horizontal lines.
{{< figure src="tallmarks.png" caption="My angled underlines" label="My angled underlines" >}}
2. __Create a bounding box for the line,__ with some padding.
I don't draw the lines _directly_ underneath the text, but a bit below.
* Sometimes, I draw the line quite a bit below; the upward padding
had to be sizeable.
{{< figure src="lowmarks.png" caption="My too-low underlines" label="My too-low underlines" >}}
3. __Intersect `PyMuPDF` bounding boxes with the line__. Fortunately,
`PyMuPDF` provides word rectangles out of the box.
* I required the intersection to overlap with at least 60% of the word's
horizontal width, so accidental overlaps don't count.
{{< figure src="widemarks.png" caption="My too-wide underline hitting `Cela`" label="My too-wide underline hitting `Cela`" >}}
* The smallest underlines are roughly the same size as the biggest strokes
in my handwriting. The 60% requirement filtered out the latter, while
keeping the former.
{{< figure src="flaggedmarks.png" caption="Letters of a hand-written word detected as lines" label="Letters of a hand-written word detected as lines" >}}
4. __Reject underlines that overlap a word from the top__. As I mentioned,
my underlines are often so low that they touch the next line of text.
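The four steps above can be sketched as plain box arithmetic over `(x0, y0, x1, y1)` tuples. Only the word-tuple shape comes from PyMuPDF's `page.get_text("words")`; the function names and concrete thresholds here are illustrative stand-ins, not the exact values my code converged on:

```python
def is_short_enough(box, max_height):
    """Step 1: a stroke spanning too much height is not an underline."""
    x0, y0, x1, y1 = box
    return (y1 - y0) <= max_height

def padded_box(box, pad_up, pad_side=2.0):
    """Step 2: grow the stroke's box, mostly upward, to reach the text above."""
    x0, y0, x1, y1 = box
    return (x0 - pad_side, y0 - pad_up, x1 + pad_side, y1)

def horizontal_overlap_fraction(word_box, box):
    """Step 3: fraction of the word's width covered by the padded box."""
    wx0, _, wx1, _ = word_box
    bx0, _, bx1, _ = box
    width = wx1 - wx0
    covered = min(wx1, bx1) - max(wx0, bx0)
    return max(0.0, covered) / width if width > 0 else 0.0

def underlined_words(words, stroke_box, pad_up, min_overlap=0.6):
    """Words (PyMuPDF `page.get_text("words")` tuples) hit by one stroke."""
    box = padded_box(stroke_box, pad_up)
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        word_box = (x0, y0, x1, y1)
        # must intersect the padded box vertically at all
        if y1 < box[1] or y0 > box[3]:
            continue
        if horizontal_overlap_fraction(word_box, box) < min_overlap:
            continue
        # Step 4: reject words the stroke touches from the top; those
        # belong to the next line of text, not the underlined one.
        if y1 > stroke_box[3]:
            continue
        picked.append(text)
    return picked
```

The 60% requirement lives in `min_overlap`; loosening or tightening it was most of the "rinse, repeat" loop described above.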
#### Lemmatization and Translation
I don't recall now how I arrived at [`spaCy`](https://github.com/explosion/spaCy),
but that's what I ended up using for my lemmatization. There was one
main catch: sometimes, instead of underlining words I didn't know,
I underlined whole phrases. Lemmatization did not work well in those
contexts; I had to specifically restrict my lemmatization to single-word
underlines, and to strip punctuation which occasionally got tacked on.
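That restriction can be sketched as follows. The punctuation stripping is plain Python; the spaCy call assumes the standard small French pipeline (`fr_core_news_sm`), and the helper names are mine:

```python
import string

def clean_underline(text):
    """Strip tacked-on punctuation; return None for multi-word underlines,
    which are kept verbatim instead of being lemmatized."""
    word = text.strip(string.punctuation + string.whitespace)
    return word if word and " " not in word else None

def lemmatize(nlp, text):
    """Lemmatize a single underlined word with a loaded spaCy pipeline."""
    word = clean_underline(text)
    if word is None:
        return text  # a whole phrase: pass through untouched
    return nlp(word)[0].lemma_

# Usage (after `python -m spacy download fr_core_news_sm`):
#   import spacy
#   nlp = spacy.load("fr_core_news_sm")
#   lemmatize(nlp, "fougueuse,")
```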

With lemmatization in hand, I moved on to the next step: translation.
I wanted my entire tool to work completely offline, so I searched for
"python offline translation" and learned about
[`argos-translate`](https://github.com/argosopentech/argos-translate).
Frankly, the translation piece is almost entirely uninteresting:
it boils down to invoking a single function. I might add that
`argos-translate` requires one to download language packages --- they
do not ship with the Python package. Codex knew to write a script to do
so, which saved a little bit of documentation-reading and typing.
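That side of the pipeline can be sketched roughly like this. The package-index and `translate` calls follow `argos-translate`'s documented API; the helper names, and the output-line formatter, are mine:

```python
def ensure_fr_en_package():
    """Download and install the French->English model if it's missing.
    Language packages do not ship with the pip package."""
    import argostranslate.package as pkg
    pkg.update_package_index()
    for p in pkg.get_available_packages():
        if p.from_code == "fr" and p.to_code == "en":
            pkg.install_from_path(p.download())
            return

def translate_fr_en(text):
    """Offline French-to-English translation (requires an installed package)."""
    import argostranslate.translate as tr
    return tr.translate(text, "fr", "en")

def card_line(page, raw, lemma, translation):
    """Format one console line: underlined form -> lemma -> translation."""
    return f"Page {page}: {raw} -> {lemma} -> {translation}"
```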

The net result is a program that could produce:

```
Page 95: fougueuse -> fougueux -> fiery
```
Pretty good!
### Manual Intervention
That "pretty good" breaks down very fast. There are several points of failure:
the lemmatization can often get confused, and the offline translation
fails for some of the more flowery Camus language.
In the end, for somewhere on the order of 70% of the words I underlined,
the automatic translation was insufficient, and required small tweaks
(changing the tense of the lemma, adding "to" to infinitive English verbs, etc.).

I thought --- why not just make this interactive? Fortunately, there are
plenty of Flask applications in Codex's training dataset. In one shot,
it generated a little web application that enabled me to tweak the source word
and final translation. It also enabled me to throw away certain underlines.
This was useful when, across different sessions, I forgot and underlined
the same word, or when I underlined a word but later decided it was not worth
including in my studying. This application produced an Anki deck, using
the Python library [`genanki`](https://github.com/kerrickstaley/genanki).
Anki has a nice mechanism to de-duplicate decks, which meant that every
time I exported a new batch of words, I could add them to my running
collection.
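The export step can be sketched as follows. `Model`, `Deck`, `Note`, and `Package` are `genanki`'s real API; the numeric IDs are arbitrary-but-fixed as the library requires, and deriving the note guid from the lemma (so that re-exporting the same word de-duplicates rather than multiplies) is my own illustrative choice:

```python
import hashlib

def stable_guid(lemma):
    """Same lemma -> same guid, so Anki de-duplicates across exports."""
    return hashlib.sha256(lemma.encode("utf-8")).hexdigest()[:10]

def build_deck(cards, out_path="vocab.apkg"):
    """Write (lemma, translation, context) triples to an Anki package."""
    import genanki
    model = genanki.Model(
        1607392319,  # arbitrary, but must stay fixed between exports
        "Vocab",
        fields=[{"name": "Front"}, {"name": "Back"}, {"name": "Context"}],
        templates=[{
            "name": "Card 1",
            "qfmt": "{{Front}}<br><i>{{Context}}</i>",
            "afmt": "{{FrontSide}}<hr id=answer>{{Back}}",
        }],
    )
    deck = genanki.Deck(2059400110, "La Peste vocab")
    for lemma, translation, context in cards:
        deck.add_note(genanki.Note(
            model=model,
            fields=[lemma, translation, context],
            guid=stable_guid(lemma),
        ))
    genanki.Package(deck).write_to_file(out_path)
```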

Even then, however, cleaning up the auto-translation was not always easy.
The OCR copy of the book had strange idiosyncrasies: the letters 'fi' together
would OCR to '=' or '/'. Sometimes, I would underline a compound phrase
that spanned two lines; though I knew the individual words (and would be surprised
to find them in my list), I did not know their interaction.

In the end, I added (had Codex add) both a text-based context and a visual
capture of the word in question to the web application. This led to the final
version, whose screenshot I included above. Here it is again:

{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}

The net result was that, for many words, I could naively accept the
automatically-generated suggestion. For those where this was not possible,
in most cases I only had to tweak a few letters, which still saved me time.
Finally, I was able to automatically include the context of the word in
my flashcards, which often helps reinforce the translation and remember
the exact sense in which the word was used.

To this day, I haven't found a single word that was underlined and missed,
nor one that was mis-identified as underlined.
### Future Direction
In many ways, this software is more than good enough for my needs.
I add a new batch of vocabulary roughly every two weeks, during which time
I manually export a PDF of _La Peste_ from my tablet and plug it into
my software.

In my ideal world, I wouldn't have to do that. I would just underline some
words, and come back to my laptop a few days later to find a set of draft
flashcards for me to review and edit. In an even more ideal world, words
I underline get "magically" translated, and the translations appear somewhere
in the margins of my text (while also being placed in my list of flashcards).

I suspect LLMs --- local ones --- might be a decent alternative technology
to "conventional" translation. By automatically feeding them the context
and underlined portion, it might be possible to automatically get a more
robust translation and flashcard. I experimented with this briefly
early on, but did not have much success. Perhaps better prompting or newer
models would improve the outcomes.

That said, I think those features lie well beyond the 80:20 point:
it would be much harder for me to get to that point, and the benefit would
be relatively small. Today, I'm happy to stick with what I already have.

In the [next part of this series]({{< relref "llm_personal_software" >}}),
I will talk more about how this project influenced my views on LLMs.