---
title: "Generating Flashcards from PDF Underlines"
date: 2026-04-04T12:25:14-07:00
tags:
  - LLMs
  - Python
draft: true
---

TL;DR: I, with the help of ChatGPT, wrote a program that helps me extract vocabulary words from PDFs. Scroll just a bit further down to see what it looks like.

Sometime in 2020 or 2021, during the COVID-19 pandemic, I overheard from some source that Albert Camus, in his book La Peste (The Plague), had quite accurately described the experience that many of us were going through at the time. Having studied French for several years, I decided that the best way to see for myself what La Peste is all about was to read it in its original, untranslated form.

I made good progress, but I certainly did not know every word. On the surface, I was faced with two choices: guess the words from context and read without stopping, or interrupt my reading to look up unfamiliar terms. The former seemed unfortunate since it stunted my ability to acquire new vocabulary; the latter was unpleasant, making me constantly break from the prose (and the e-ink screen of my tablet) to consult a dictionary.

In the end, I decided to underline the words and come back to them later. However, even then, the task was fairly arduous. For one, words I don't recognize aren't always in their canonical form (they can be conjugated, plural, compound, and more): I had to spend some time deciphering what I should add to a flashcard. For another, I had to bounce between a PDF of my book (from which, fortunately, I can copy-paste) and my computer. Often, a word out of context confused the translation software, so I had to copy more of the surrounding text. Finally, I learned that given these limitations, the pace of my reading far exceeded the rate of my translation. This led me to underline fewer words.

I thought,

> Perhaps I can just have some software automatically extract the underlined portions of the words, find the canonical forms, and generate flashcards?

Even thinking this thought was a mistake. From then on, as I read and went about underlining my words, I thought about how much manual effort I would be taking on that could be automated. However, I didn't know how to start the automation. In the end, I switched to reading books in English.

Then, LLMs got good at writing code. With the help of Codex, I finally got the tools that I was dreaming about. Here's what it looks like.

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}

{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}

This was my first foray into LLM-driven development. My commentary about that experience (as if there isn't enough of such content out there!) will be interleaved with the technical details.

## The Core Solution

The core idea has always been:

  1. Find things that look like underlines
  2. See which words they correspond to
  3. Perform {{< sidenote "right" "lemmatization-node" "lemmatization" >}} Lemmatization (wikipedia) is the process of turning non-canonical forms of words (like am (eng) / suis (fr)) into their canonical form which might be found in the dictionary (to be / être). {{< /sidenote >}} and translate.

My initial direction was shaped by the impressive demonstrations of OCR models, which could follow instructions at the same time as reading a document. For these models, a prompt like "extract all the text in the red box" constituted the entire targeted OCR pipeline. My hope was that a similar prompt, "extract all underlined words", would be sufficient to accomplish steps 1 and 2. However, I never found out: OCR models, as it turns out, are large and very expensive to run. In addition, the model I was looking at was specifically tailored for NVIDIA hardware, which I, with my MacBook, simply didn't have access to.

In the end, I came to the conclusion that a VLM is overkill for the problem I'm tackling. This took me down the route of analyzing the PDFs themselves. The problem, of course, is that I knew nothing of the Python landscape of PDF analysis tools, and nothing about the PDF format itself. This is where Codex came in. Codex opted (from its training data, I presume) to use the PyMuPDF package. It also guessed (correctly) that the PDFs exported by my tablet used the 'drawings' part of the PDF spec to encode what I penned. I was instantly able to see (on the console) the individual drawings.

The LLM also chose to approach the problem by treating each drawing as just a "cloud of points", discarding the individual line segment data. This seemed like a nice enough simplification, and it worked well in the long run.
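As a rough sketch of what this looks like: PyMuPDF's `page.get_drawings()` returns a list of dicts whose `"items"` hold drawing operations such as `("l", p1, p2)` for line segments. The helper below flattens one such drawing into a point cloud; the sample `stroke` dict is illustrative, using plain `(x, y)` tuples where PyMuPDF actually yields `fitz.Point`/`fitz.Rect` objects.

```python
# Flatten one PyMuPDF "drawing" into a bag of (x, y) points, discarding
# the individual segment structure. In the real tool these dicts come
# from page.get_drawings(); this sample uses plain tuples.

def drawing_points(drawing):
    points = []
    for item in drawing["items"]:
        op = item[0]
        if op == "l":                 # line segment: ("l", p1, p2)
            points.extend(item[1:3])
        elif op == "re":              # rectangle: ("re", rect, ...)
            x0, y0, x1, y1 = item[1]
            points.extend([(x0, y0), (x1, y1)])
    return points

# A two-segment pen stroke, roughly horizontal:
stroke = {"items": [("l", (10, 100), (40, 101)), ("l", (40, 101), (80, 103))]}
print(drawing_points(stroke))  # [(10, 100), (40, 101), (40, 101), (80, 103)]
```

Treating the stroke as an unordered bag of endpoints like this is exactly what makes the later bounding-box heuristics cheap to compute.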

## Iterating on the Heuristic

The trouble with the LLM agent was that it had no good way of verifying whether the lines it detected (and indeed, the words it considered underlined) were actually lines (and underlined words). Its initial algorithm missed many words and misidentified others. I had to resort to visual inspection to see what was being missed, and to guess at the likely cause.

The exact process of the iteration is not particularly interesting. I'd tweak a threshold, re-run the code, and see the new list of words. I'd then cross-reference the list with the page in question, to see if things were being over- or under-included. Rinse, repeat.

This got tedious fast. In some cases, letters or words I penned would get picked up as underlines, and slightly diagonal strokes would get missed. I ended up requesting Codex to generate a debugging utility that highlighted (in a box) all the segments that it flagged, and the corresponding words. This is the first picture I showed in the post. Here it is again:

{{< figure src="./underlines.png" caption="Detected underlined words on a page" label="Detected underlined words on a page" >}}

In the end, the rough algorithm was as follows:

  1. Identify all point clouds that are not too tall. Strokes that vertically span multiple lines of text are likely not underlines.
     - The 'height threshold' ended up being larger than I anticipated: it turns out I don't draw very straight horizontal lines.

       {{< figure src="tallmarks.png" caption="My angled underlines" label="My angled underlines" >}}

  2. Create a bounding box for the line, with some padding. I don't draw my lines directly underneath the text, but a bit below.
     - Sometimes, I draw the line quite a bit below; the upward padding had to be sizeable.

       {{< figure src="lowmarks.png" caption="My too-low underlines" label="My too-low underlines" >}}

  3. Intersect PyMuPDF word bounding boxes with the line. Fortunately, PyMuPDF provides word rectangles out of the box.
     - I required the intersection to cover at least 60% of the word's horizontal width, so accidental overlaps don't count.

       {{< figure src="widemarks.png" caption="My too-wide underline hitting Cela" label="My too-wide underline hitting Cela" >}}

     - The smallest underlines are roughly the same size as the biggest strokes in my handwriting. The 60% requirement filtered out the latter while keeping the former.

       {{< figure src="flaggedmarks.png" caption="Letters of a handwritten word detected as lines" label="Letters of a handwritten word detected as lines" >}}

  4. Reject underlines that overlap a word from the top, since, as I mentioned, my underlines are often so low that they touch the next line.
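The steps above can be condensed into a small sketch. The thresholds here (`MAX_HEIGHT`, `PAD_UP`) are illustrative stand-ins for the tuned values; only the 60% overlap figure comes from the actual heuristic.

```python
MAX_HEIGHT = 8      # step 1: strokes taller than this are not underlines (illustrative)
PAD_UP = 10         # step 2: underlines sit a bit below the word they mark (illustrative)
MIN_OVERLAP = 0.6   # step 3: fraction of the word's width the line must cover

def stroke_bbox(points):
    """Bounding box (x0, y0, x1, y1) of a point cloud."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def looks_like_underline(points):
    x0, y0, x1, y1 = stroke_bbox(points)
    return (y1 - y0) <= MAX_HEIGHT

def search_box(points):
    # Pad upward (smaller y, since PyMuPDF page coordinates grow downward)
    # to catch words the line was drawn well below.
    x0, y0, x1, y1 = stroke_bbox(points)
    return (x0, y0 - PAD_UP, x1, y1)

def underlined(word_rect, box):
    """Does this word rectangle count as underlined by the search box?"""
    wx0, wy0, wx1, wy1 = word_rect
    bx0, by0, bx1, by1 = box
    if wy1 < by0 or wy0 > by1:          # no vertical contact at all
        return False
    overlap = min(wx1, bx1) - max(wx0, bx0)
    return overlap >= MIN_OVERLAP * (wx1 - wx0)

# In the real tool the word rectangles come from page.get_text("words");
# here is a made-up word sitting just above a near-horizontal stroke.
stroke = [(48, 300), (125, 301)]
word = (50, 285, 120, 298)
print(looks_like_underline(stroke) and underlined(word, search_box(stroke)))  # True
```

Step 4 (rejecting underlines that touch a word from above) would be one more vertical comparison on the same rectangles, omitted here for brevity.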

## Lemmatization and Translation

I don't recall now how I arrived at spaCy, but that's what I ended up using for lemmatization. There was one main catch: sometimes, instead of underlining words I didn't know, I underlined whole phrases. Lemmatization did not work well in those cases; I had to restrict it to single-word underlines, and to strip punctuation that occasionally got tacked on. With lemmatization in hand, I moved on to the next step: translation.
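The single-word restriction and punctuation stripping can be sketched as a small pre-lemmatization filter. The spaCy call in the comment is indicative; which French pipeline the tool actually loads is an assumption (`fr_core_news_sm` is one of spaCy's real French models).

```python
# Pre-lemmatization cleanup: strip stray punctuation, and only pass
# single words on to the lemmatizer; whole phrases confuse it.
import string

def lemma_candidate(underlined_text):
    """Return a cleaned single word, or None for multi-word phrases."""
    cleaned = underlined_text.strip(string.punctuation + string.whitespace)
    if not cleaned or " " in cleaned:
        return None
    return cleaned

# With spaCy (not run here; model choice is an assumption):
#   nlp = spacy.load("fr_core_news_sm")
#   lemma = nlp(lemma_candidate("fougueuse,"))[0].lemma_

print(lemma_candidate("fougueuse,"))     # fougueuse
print(lemma_candidate("mise en abyme"))  # None
```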

I wanted my entire tool to work completely offline. As a result, I had to search for "python offline translation", to learn about argos-translate. Frankly, the translation piece is almost entirely uninteresting: it boils down to invoking a single function. I might add that argos-translate requires one to download language packages --- they do not ship with the Python package. Codex knew to write a script to do so, which saved a little bit of documentation-reading and typing.

The net result was a program that could produce:

Page 95: fougueuse -> fougueux -> fiery

Pretty good!

## Manual Intervention

That "pretty good" breaks down very fast. There are several points of failure: the lemmatization can often get confused, and the offline translation fails for some of the more flowery Camus language.

In the end, for somewhere on the order of 70% of the words I underlined, the automatic translation was insufficient and required small tweaks (changing the tense of the lemma, adding "to" to infinitive English verbs, and so on).

I thought --- why not just make this interactive? Fortunately, there are plenty of Flask applications in Codex's training dataset. In one shot, it generated a little web application that let me tweak the source word and final translation. It also let me throw away certain underlines, which was useful when, across different sessions, I forgot and underlined the same word twice, or when I underlined a word but later decided it wasn't worth including in my studying. The application produced an Anki deck using the Python library genanki. Anki has a nice mechanism to de-duplicate decks, which meant that every time I exported a new batch of words, I could add them to my running collection.
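One way to make that de-duplication work across export batches is to give each genanki note a stable ID derived from the lemma, so a re-exported word collides with its earlier self instead of becoming a new note. The genanki calls in the comment reflect its real API (`Note`, `Deck`, `Package`); whether my tool derives guids exactly this way is an assumption for illustration.

```python
# Stable note IDs: same lemma -> same guid on every export, letting
# Anki recognize re-exported words as duplicates.
import hashlib

def note_guid(lemma):
    return hashlib.sha1(lemma.encode("utf-8")).hexdigest()[:10]

# With genanki (not run here):
#   note = genanki.Note(model=model, fields=[lemma, translation, context],
#                       guid=note_guid(lemma))
#   deck.add_note(note)
#   genanki.Package(deck).write_to_file("vocab.apkg")

print(note_guid("fougueux") == note_guid("fougueux"))  # True
```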

Even then, however, cleaning up the auto-translation was not always easy. The OCR copy of the book had strange idiosyncrasies: the letters 'fi' together would OCR to '=' or '/'. Sometimes, I would underline a compound phrase that spanned two lines; though I knew the individual words (and would be surprised to find them in my list), I did not know their interaction.
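A mechanical fix-up for quirks like the 'fi' artifact could look like the following; the mapping here is a guess for demonstration (in practice I corrected these by hand in the web application), not a cleanup pass the tool actually performs.

```python
# Illustrative repair table for OCR artifacts where the 'fi' ligature
# came through as '=' or '/'. Hypothetical; not the tool's actual logic.
OCR_FIXES = {"=": "fi", "/": "fi"}

def repair_ocr(word):
    return "".join(OCR_FIXES.get(ch, ch) for ch in word)

print(repair_ocr("con=ance"))  # confiance
```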

In the end, I added (had Codex add) both a text-based context and a visual capture of the word in question to the web application. This led to the final version, whose screenshot I included above. Here it is again:

{{< figure src="./result.png" caption="Auto-flashcard application" label="Auto-flashcard application" class="fullwide" >}}

The net result was that, for many words, I could naively accept the automatically-generated suggestion. For those where this was not possible, in most cases I only had to tweak a few letters, which still saved me time. Finally, I was able to automatically include the context of the word in my flashcards, which often helps reinforce the translation and remember the exact sense in which the word was used.

To this day, I haven't found a single word that was underlined and missed, nor one that was mis-identified as underlined.

## Future Direction

In many ways, this software is more than good enough for my needs. I add a new batch of vocabulary roughly every two weeks, during which time I manually export a PDF of La Peste from my tablet and plug it into my software.

In my ideal world, I wouldn't have to do that. I would just underline some words, and come back to my laptop a few days later to find a set of draft flashcards for me to review and edit. In an even more ideal world, words I underline get "magically" translated, and the translations appear somewhere in the margins of my text (while also being placed in my list of flashcards).

I suspect LLMs --- local ones --- might be a decent alternative technology to "conventional" translation. By automatically feeding them the context and underlined portion, it might be possible to automatically get a more robust translation and flashcard. I experimented with this briefly early on, but did not have much success. Perhaps better prompting or newer models would improve the outcomes.

That said, I think that those features are way beyond the 80:20 transition: it would be much harder for me to get to that point, and the benefit would be relatively small. Today, I'm happy to stick with what I already have.

## Personal Software with the Help of LLMs

Like I mentioned earlier, this was one of my earliest experiences with LLM-driven development, and I think it shaped my outlook on the technology quite a bit. For me, the bottom line is this: with LLMs, I was able to rapidly solve a problem that was holding me back in another area of my life. My goal was never to "produce software", but to "acquire vocabulary", and, viewed from this perspective, I think the experience has been a colossal success.

As someone who works on software, I am always reminded that end-users rarely care about the technology as much as we technologists do; they care about having their problems solved. I find taking that perspective challenging (though valuable) because software is my craft, and because in thinking about the solution, I have to think about the elements that bring it to life.

With LLMs, I was able --- allowed? --- to view things more from the end-user's perspective. I didn't know, and didn't need to know, the API for PyMuPDF, argostranslate, or spaCy. I didn't need to understand the PDF format. I could move one step away from the nitty-gritty and focus on the 'why' and the 'what', on the challenge of what I wanted to accomplish. I wrestled with the inherent complexity and avoided altogether the unrelated difficulties that merely happened to be there (downloading language modules, learning translation APIs, and so on).

By enabling me to do this, the LLM let me make rapid progress, and to produce solutions to problems I would've previously deemed "too hard" or "too tedious". This did, however, markedly reduce the care with which I was examining the output. I don't think I've ever read the code that produces the pretty colored boxes in my program's debug output. This shift, I think, has been a divisive element of AI discourse in technical communities. I think that this has to do, at least in part, with different views on code as a medium.

### The Builders and the Craftsmen

There are two perspectives through which one may view software: as a craft in and of itself, and as a means to some end. My flashcard extractor looks vastly different through these two lenses. In terms of craft, I think it is at best mediocre; most of the code is generated, slightly verbose, and somewhat tedious. The codebase is far from inspiring, and if I had written it by hand, I would not be particularly proud of it. In terms of product, though, I think it tells an exciting story: here I am, reading Camus again, because I was able to improve the workflow around said reading. In a day, I was able to achieve what I couldn't muster in a year or two on my own.

The truth is, the "builder vs. craftsman" distinction is a simplifying one, another in the long line of "us vs. them" classifications. Any one person is capable of occupying any combination of these two camps at any given time. Indeed, different sorts of software demand to be viewed through different lenses. I will still treat work on my long-term projects as craft, because I will come back to it again and again, and because our craft has evolved to engender stability and maintainability.

However, I am more than happy to settle for 'underwhelming' when it means an individual need of mine can be addressed in record time. I think this gives rise to a new sort of software: highly individual, explicitly non-robust, and treated differently from software crafted with deliberate thought and foresight.

### Personal Software

I think as time goes on, I am becoming more and more convinced by the idea of "personal software". One might argue that much of the complexity in many pieces of software is driven by the need to accommodate the diverse needs of many users. Still, software remains somewhat inflexible and unable to accommodate individual needs. Features or uses that demand changes at the software level move at a slower pace: finite developer time needs to be spent analyzing what users need, determining the costs of new functionality, and choosing which of the many possible requests to fulfill. On the other hand, software that enables users to build their own customizations, by exposing numerous configuration options and abstractions, becomes, over time, very complicated to grasp.

Now, suppose that the complexity of such software scales superlinearly with the number of features it provides. Suppose also that individual users leverage only a small subset of the software's functionality. From these assumptions it would follow that individual programs, made to serve a single user's need, would be significantly less complicated than the "whole". By definition, these programs would also be better tailored to the users' needs. With LLMs, we're getting to a future where this might be possible.

I think that my flashcard generator is an early instance of such software. It doesn't worry about various book formats, or various languages, or various page layouts. The heuristic was tweaked to fit my use case, and now works 100% of the time. I understand the software in its entirety. I thought about sharing it --- and, in a way, I did, since it's open source --- but realized that outside of the constraints of my own problem, it likely will not be of much use. I could experiment with more varied constraints, but that would turn it back into the sort of software I discussed above: general, robust, and complex.

Today, I think that there is a whole class of software that is amenable to being "personal". My flashcard generator is one such piece of software; I imagine file-organization (as served by many "bulk rename and move" pieces of software out there), video wrangling (possible today with ffmpeg's myriad of flags and switches), and data visualization to be other instances of problems in that class. I am merely intuiting here, but if I had to give a rough heuristic, it would be problems that:

- fulfill a low-frequency need, because availability, deployment, etc. significantly raise the bar for quality.
  - e.g., I collect flashcards once every two weeks; I organize my filesystem once a month; I don't spend nearly enough money to want to re-generate cash flow charts very often.
- have an "answer" that's relatively easy to assess, because LLMs are not perfect and iteration must be possible and easy.
  - e.g., I can see that all the underlined words are listed in my web app; I know by inspection that my files are in the right folders, named appropriately; my charts seem to track with reality.
- have a relatively complex technical implementation, because why would you bother invoking an LLM if you can "just" click a button somewhere?
  - e.g., extracting data from PDFs requires some wrangling; bulk-renaming files requires some tedious and possibly case-specific pattern matching; cash flow between N accounts requires some graph analysis.
- have relatively low stakes, again, because LLMs are not perfect, and nor is (necessarily) one's understanding of the problem.
  - e.g., it's OK if I miss some words I underlined; my cash flow charts only give me an impression of my spending.
  - That said, I recognize that moving files is a potentially destructive operation.

I dream of a world in which, to make use of my hardware, I just ask, and don't worry much about languages, frameworks, or sharing my solution with others --- that last one because they can just ask as well.

## The Unfair Advantage of Being Technical

I recognize that my success described here did not come for free. There were numerous parts of the process where my software background helped get the most out of Codex.

For one thing, writing software trains us to think precisely about problems. We learn to state exactly what we want, to decompose tasks into steps, and to intuit the exact size of those steps; to know what's hard and what's easy for the machine. When working with an LLM, these skills make it possible to hit the ground running, to know what to ask, and to pluck out a particular solution from the space of possible approaches. I think this greatly accelerates the effectiveness of using LLMs compared to non-technical users.

For another, the boundary between 'manual' and 'automatic' is not always clear-cut. Though I didn't touch any of the PyMuPDF code, I did need to look fairly closely at the logic that classified my squiggles as "underlines" and associated them with words. It was not enough to treat LLM-generated code as a black box.

Another advantage software folks have when leveraging LLMs is the established rigor of software development. LLMs can and do make mistakes, but so do people. Our field has been built around reducing these mistakes' impact and frequency. Knowing to use version control helps turn the pathological downward spiral of accumulating incorrect tweaks into monotonic, step-wise improvements. Knowing how to construct a test suite and thinking about edge cases can provide an agent LLM the grounding it needs to iterate rapidly and safely.

In this way, I think the dream of personal software is far from being realized for the general public. Without the foundation of experience and rigor, LLM-driven development can easily devolve into a frustrating and endless back-and-forth, or worse, successfully build software that is subtly and convincingly wrong.

## The Shoulders of Giants

The only reason all of this was possible is that the authors of PyMuPDF, genanki, spaCy, and argos-translate made them available for me to use from my code. These libraries provided the bulk of the functionality that Codex and I were able to glue into a final product. It would be a mistake to forget this, and to confuse the sustained, thoughtful efforts of the people behind these projects for the one-off, hyper-specific software I've been talking about.

We need these packages, and others like them, to provide a foundation for the things we build. They bring stability, reuse, and the sort of cohesion that is not possible through an amalgamation of home-grown personal scripts. In my view, something like spaCy is to my flashcard script as a brick is to grout. There is a fundamental difference.

I don't know how LLMs will integrate into the future of large-scale software development. The discipline becomes something else entirely when the constraints of "personal software" I floated above cease to apply. Though LLMs can still enable doing what was previously too difficult, tedious, or time consuming (like my little 'underline visualizer'), it remains to be seen how to integrate this new ease into the software lifecycle without threatening its future.