Get the document but not the elements around it
Decoder works on two kinds of input. PDFs the user has open, and web pages the user is reading. Each one has its own problem to solve.
For PDFs, Decoder uses PDFKit. The text comes out clean. The one thing it loses is tables. Spatial layout gets flattened into a single line of text. For most contracts that's fine, because the words carry the legal meaning, not the layout. Worth knowing about, though.
Web pages are harder. If you grab everything on the page, you also grab the menu, the header, the footer, the cookie banner, the ads, the related-content boxes. On a typical terms-of-service page, half of what you grab is page furniture, not the document.
The extractor does three things. First, it removes the obvious chrome: navigation bars, headers, footers, banners, anything tagged that way. Second, it looks for the main content area, preferring tags meant for it like main and article. If those aren't there, it scores blocks of text by how dense they are and how they fit into the page structure. Third, it keeps enough shape, headings and lists, that the model can see the document the way the reader sees it.
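A minimal sketch of those three passes, using only Python's standard library. The tag sets and the extract_main_text name are assumptions made for illustration, not Decoder's actual code. The density-scoring fallback is left out: when no main or article tag is found, this sketch simply keeps all non-chrome text.

```python
from html.parser import HTMLParser

# Assumption: these tag names stand in for "obvious chrome".
CHROME_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"main", "article"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class MainTextExtractor(HTMLParser):
    """Collects text outside chrome tags, split by whether it sits in main/article."""

    def __init__(self):
        super().__init__()
        self.chrome_depth = 0   # > 0 while inside any chrome tag
        self.main_depth = 0     # > 0 while inside main or article
        self.current_tag = None
        self.main_lines = []
        self.other_lines = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.chrome_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1
        self.current_tag = tag

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1
        self.current_tag = None

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.chrome_depth:
            return  # pass 1: drop whitespace and anything inside chrome
        # Pass 3: keep enough shape that headings and list items stay visible.
        if self.current_tag in HEADING_TAGS:
            text = "# " + text
        elif self.current_tag == "li":
            text = "- " + text
        target = self.main_lines if self.main_depth else self.other_lines
        target.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    # Pass 2: prefer text found inside main/article; otherwise fall back.
    lines = parser.main_lines or parser.other_lines
    return "\n".join(lines)
```

One known gap in this sketch: a header tag nested inside an article is treated as chrome, even though it may hold the document's title. A real extractor would need to tell the two apart.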
Every bit of leftover chrome costs money to send to the model and pulls its attention away from the document. Cleaning the input cuts the size by roughly a third to a half on most pages, and keeps the model focused on what's being read.
Pick the right model with two simple checks
This step decides which model handles the document. It uses two signals. How long the document is, and how varied its legal vocabulary is. Then it combines them with one rule.
Signal one: length. The document's size decides which models are even allowed to handle it. Short documents can go to any model. Longer ones rule out the smaller models. Very long ones can only go to the biggest one, because that's the only one that can fit them in.
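The length signal can be sketched as a lookup over sorted cut-offs. The tier names come from the ladder described later in this article. The cut-off values, and measuring length in characters, are placeholders invented for the example.

```python
from bisect import bisect_right

TIERS = ["light", "standard", "heavy", "premium"]

# Hypothetical cut-offs in characters. A document strictly below the first
# cut-off may use any tier; at or above the last, only the biggest model fits.
LENGTH_CUTOFFS = [20_000, 80_000, 200_000]


def length_floor(char_count: int) -> str:
    """Lowest tier that is allowed to handle a document of this length."""
    return TIERS[bisect_right(LENGTH_CUTOFFS, char_count)]


def allowed_tiers(char_count: int) -> list[str]:
    """Every tier at or above the floor."""
    return TIERS[TIERS.index(length_floor(char_count)):]
```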
Every document is different and should be treated as such. Matching the model to the document avoids two kinds of waste: paying for processing power a simple document doesn't need, and underperforming on a document that needs more.
Signal two: vocabulary. Decoder keeps a list of legal terms, each with a weight. It scans the document and adds up the weight of every distinct term it finds. A term only counts once, no matter how many times it appears.
Counting terms once was the right call. An earlier version multiplied weight by how often a term appeared. Documents that just repeated the same common terms got pushed up the ladder for no good reason. A short rental template kept getting routed to bigger models because it used a few common lease terms a lot. Counting unique terms measures what concepts the document actually touches, not how repetitive it is.
The 0.5 tier exists because basic agreement vocabulary was inflating scores on simple templates. Demoting those words to half-weight fixed the calibration without dropping them from the list. They still earn their keep in the glossary.
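A sketch of the two scoring approaches side by side. The lexicon entries and weights here are invented for the example, apart from the half-weight idea for basic agreement vocabulary. The function names are hypothetical.

```python
import re

# Hypothetical lexicon: term -> weight. Basic agreement words sit at 0.5.
LEXICON = {
    "agreement": 0.5,
    "party": 0.5,
    "indemnification": 2.0,
    "arbitration": 2.0,
    "liquidated damages": 3.0,
}


def _pattern(term: str) -> str:
    # Whole-word match so "party" does not fire inside "counterparty".
    return rf"\b{re.escape(term)}\b"


def vocabulary_score(text: str, lexicon: dict[str, float] = LEXICON) -> float:
    """Sum the weight of each distinct term found. Repeats add nothing."""
    lowered = text.lower()
    found = {term for term in lexicon if re.search(_pattern(term), lowered)}
    return sum(lexicon[term] for term in found)


def frequency_weighted_score(text: str, lexicon: dict[str, float] = LEXICON) -> float:
    """The earlier approach: weight times occurrence count."""
    lowered = text.lower()
    return sum(
        weight * len(re.findall(_pattern(term), lowered))
        for term, weight in lexicon.items()
    )
```

A template that says "agreement" forty times scores 0.5 under the first function and 20.0 under the second, which is the inflation described above.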
The combining rule. The final tier is whichever is higher: the lowest tier the document's length allows, or the tier its vocabulary suggests. Length sets the floor. Vocabulary can push it up. The bias goes one way: if in doubt, go higher. Over-classifying costs a bit more money. Under-classifying gives the user a worse answer, and they can't tell.
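The rule itself fits in one line. In this sketch the tier names follow the ladder described in this article, and final_tier is a hypothetical name.

```python
# Ordered from cheapest to most capable.
TIERS = ("light", "standard", "heavy", "premium")


def final_tier(length_floor: str, vocabulary_tier: str) -> str:
    """Take whichever tier is higher. Length sets the floor, vocabulary can only raise it."""
    return TIERS[max(TIERS.index(length_floor), TIERS.index(vocabulary_tier))]
```

Because the rule is a max, no signal can ever pull a document below what the other signal asked for. That is the one-way bias in code form.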
Four model tiers, one provider
Same SDK, same login, same caching. Different models and different reasoning settings. Each step up the ladder buys a real capability the one below can't do reliably.
Light to Standard is a step up in raw capability. Standard to Heavy adds reasoning, so the model thinks before answering. Heavy to Premium swaps in the strongest model. Each step exists because the one below it falls short on a real class of documents.
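As a sketch, the router is a table. The model names below are placeholders, not real model identifiers. Keeping reasoning switched on at the Premium tier is an assumption; the text only says Premium swaps in the strongest model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str        # placeholder name, not a real model identifier
    reasoning: bool   # whether the model is asked to think before answering


ROUTES = {
    "light": Route(model="small-model", reasoning=False),
    "standard": Route(model="mid-model", reasoning=False),    # more raw capability
    "heavy": Route(model="mid-model", reasoning=True),        # adds reasoning
    "premium": Route(model="strongest-model", reasoning=True),  # strongest model
}


def route(tier: str) -> Route:
    """Look up the model and reasoning setting for a tier."""
    return ROUTES[tier]
```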
Why one provider. Mixing providers means more SDKs, more validation calibration, more behaviour to chase. The savings at the bottom end aren't worth the work for an alpha. One provider keeps the system simple. Mixing providers to play to each one's strengths was considered early on, but it has been deferred.
Caching. The system prompt is cached. After the first call in a session, the next calls reuse the cached version and pay much less for it. Free savings without losing quality.
Drop anything that fails the rules
The model returns its answer as structured JSON. Before any of it gets shown, each finding has to pass a set of checks. If a check fails, the finding is dropped. Better to show the user fewer things than to show them something that's made up.
Every drop is logged with a reason. If validation drops everything, the result says so explicitly. That's a different state from the model deciding the document is out of scope, and a different state from the model finding nothing worth flagging. The user sees that the analysis ran but didn't produce anything usable, instead of getting a silent empty screen.
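A sketch of fail-closed validation with the three distinct empty states. The individual checks here are assumptions chosen for illustration: that each finding carries a quote which must appear verbatim in the document, and a category from a fixed set. The field names, category names, and state labels are all hypothetical.

```python
from dataclasses import dataclass, field

ALLOWED_CATEGORIES = {"payment", "termination", "liability"}  # hypothetical


@dataclass
class ValidationResult:
    state: str                                    # ok | out_of_scope | no_findings | all_dropped
    findings: list = field(default_factory=list)
    drops: list = field(default_factory=list)     # (finding, reason) pairs for the log


def check_finding(finding, document: str):
    """Return a drop reason, or None if the finding passes every check."""
    if not isinstance(finding, dict):
        return "finding is not an object"
    quote = finding.get("quote")
    if not isinstance(quote, str) or not quote.strip():
        return "missing quote"
    if quote not in document:
        return "quote not found in document"
    if finding.get("category") not in ALLOWED_CATEGORIES:
        return "unknown category"
    return None


def validate(response: dict, document: str) -> ValidationResult:
    if response.get("out_of_scope"):
        return ValidationResult("out_of_scope")   # the model's own decision
    raw = response.get("findings") or []
    if not raw:
        return ValidationResult("no_findings")    # the model found nothing to flag
    kept, drops = [], []
    for finding in raw:
        reason = check_finding(finding, document)
        if reason is None:
            kept.append(finding)
        else:
            drops.append((finding, reason))       # fail closed, but keep the reason
    return ValidationResult("ok" if kept else "all_dropped", kept, drops)
```

The all_dropped state is what lets the results page say the analysis ran but produced nothing usable, rather than rendering the same empty screen as no_findings.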
Show the result and what it worked through
The validated result gets written to shared storage with the decode's job ID. The results page polls that storage and shows the result when it appears.
Every decode gets a unique job ID at the start. Each step it goes through, from created to processing to completed to rendered, or to failed or abandoned, gets logged against that ID. When the result is ready, the results page checks that the ID in storage matches the one it's expecting. Two decodes can never get crossed in the user's view, because the system enforces it. Not because we hope the user is only running one at a time.
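A sketch of the job ID lifecycle and the match check. The state names come from the paragraph above. Which transitions are allowed from each state, the stored record's shape, and the names Decode and result_for are assumptions for illustration.

```python
import uuid

# Assumed transition table: any live state may also end in failed or abandoned.
TRANSITIONS = {
    "created": {"processing", "failed", "abandoned"},
    "processing": {"completed", "failed", "abandoned"},
    "completed": {"rendered", "abandoned"},
}


class Decode:
    def __init__(self):
        self.job_id = str(uuid.uuid4())         # unique per decode
        self.state = "created"
        self.log = [(self.job_id, "created")]   # every step is logged against the ID

    def advance(self, new_state: str) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.log.append((self.job_id, new_state))


def result_for(expected_job_id: str, stored):
    """Return the stored result only if it belongs to the decode being waited on."""
    if stored is None or stored.get("job_id") != expected_job_id:
        return None   # nothing yet, or a result from a different decode
    return stored["result"]
```

The results page would keep polling while result_for returns None, so a stale result from an earlier decode is never rendered in place of the current one.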
In the rendered findings, terms from the lexicon are shown with a dotted underline. Hover or tap to see the plain-language explanation. The same dotted underline you've been hovering on in this article.
SOUL: the rules behind the pipeline
Decoder has a SOUL document. It says what the product is, what it must never do, and how it behaves over time. The validator enforces the rules in code. The prompt asks the model to follow them. The categories themselves were designed around them.
The rules live in governance files: a SOUL document and agent scaffolding that work side by side across sessions. When the codebase changes, the change is checked against the SOUL. When agents work on the product, they read these files first so they inherit the rules.
One lexicon, two jobs
The same weighted list of legal terms powers the classifier in step 2 and the in-product glossary in step 5. One list. Two places it shows up.
This wasn't planned from the start. It came out of noticing that finding legal terms and explaining them is the same job, just for two audiences. Building one list for both is less work than maintaining two, and it keeps the two surfaces in sync. A term worth scoring is, by construction, a term worth explaining.
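One way to picture the shared list: each entry carries both a weight and an explanation, and each surface reads the field it needs. The entries, explanations, and function names below are invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    weight: float        # read by the classifier
    explanation: str     # read by the glossary


# Hypothetical entries.
LEXICON = {
    "arbitration": Term(2.0, "Settling a dispute privately instead of in court."),
    "agreement": Term(0.5, "A set of promises the parties intend to be binding."),
}


def score(text: str) -> float:
    """Classifier view: total weight of the distinct terms present."""
    lowered = text.lower()
    return sum(
        entry.weight
        for term, entry in LEXICON.items()
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    )


def glossary_spans(text: str) -> list[tuple[int, int, str, str]]:
    """Glossary view: (start, end, term, explanation) for every occurrence to underline."""
    spans = []
    for term, entry in LEXICON.items():
        for match in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            spans.append((match.start(), match.end(), term, entry.explanation))
    return sorted(spans)
```

Adding a term to the dictionary makes it count toward the score and show up underlined in the same change, which is how the two surfaces stay in sync.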
Decoder is one example of a class
Five steps in the pipeline. Two disciplines that run across all of them. None of the parts is clever on its own. The classifier is two signals and a rule. The router is a lookup. The validator is a few simple checks that fail closed. The lexicon is a list with weights.
What it adds up to is a system that puts effort where it pays off and refuses to waste it where it doesn't. Every finding the user sees can be traced back to a span they can read for themselves. The product's stance toward the user, that it translates and doesn't advocate, is enforced in code. Not just in good intentions.
Decoder is the first thing built this way. The shape of it generalises. A domain-specific lexicon. Length-and-vocabulary classification. Tiered routing inside one provider. Fail-closed validation. Governance kept in files agents read first.