Half of the engineering knowledge at any service company is stuck inside PDFs — contracts, datasheets, commissioning reports, supplier invoices, field notes, manuals. The machine that reads them all is cheaper than it used to be, smarter than it has any right to be, and never complains about the handwriting.
DEMO
Synthetic documents. Sample files reference the invented Cryovox compressor, Helix drive, and Torquon motor from the Diagnostics page. Layouts, field structures, and extraction patterns mirror real-world practice — but no confidential customer contracts, proprietary datasheets, or actual invoices are reproduced here.
01 Discipline
What document intelligence actually is
Structure from the unstructured.
A document is an argument the author made with layout and prose. A human reader reconstructs the argument in their head — this is the party name, this is the date, this clause modifies that one, this number is a total and that number is a line item. The reconstruction happens fast and unconsciously, which is why it feels like nothing until you have to do it on 200 invoices. Then it is everything.
Document intelligence is the discipline of getting a machine to do that reconstruction at volume, with enough ground-truth checks that a human only needs to audit the uncertain cases. Vision models do the reading. Structured-output schemas enforce the shape of the answer. Confidence scores tell you what to trust and what to double-check. The output is not a summary — it is a dataset.
Live demo · Watch the extraction happen
02 Toolkit
What's in the extraction bench
Model, schema, and a reviewer.
The extraction stack has three layers that have to work together. The vision-capable model does the actual reading — looking at the pixels, resolving the layout, parsing the prose. The structured-output schema enforces the shape of the answer, so the model can't hallucinate new fields or omit required ones. And a human reviewer audits the confidence-flagged cases before they enter the system of record. Skip any one layer and the output becomes unreliable; skip two and the whole pipeline is worse than doing the work by hand.
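Wired together, the three layers are only a few lines of glue. A minimal sketch, with a placeholder standing in for the vision-model call; the field names, confidence values, and threshold are illustrative:

```python
# Three-layer sketch: model reads, schema constrains, humans review.
# `read_document` is a placeholder for a real vision-model call.

REQUIRED = {"vendor", "invoice_number", "total"}
THRESHOLD = 0.85

def read_document(pdf_bytes):
    # Placeholder: a real pipeline sends the PDF to a vision model and
    # gets back {field: {"value": ..., "confidence": ...}}.
    return {
        "vendor": {"value": "Cryovox GmbH", "confidence": 0.97},
        "invoice_number": {"value": "CV-2291", "confidence": 0.93},
        "total": {"value": 1840.00, "confidence": 0.71},
    }

def run_pipeline(pdf_bytes):
    fields = read_document(pdf_bytes)
    missing = REQUIRED - fields.keys()
    if missing:  # schema layer: required fields must be present
        raise ValueError(f"schema violation, missing: {missing}")
    # reviewer layer: anything under threshold goes to a human
    review = {k: v for k, v in fields.items() if v["confidence"] < THRESHOLD}
    commit = {k: v for k, v in fields.items() if v["confidence"] >= THRESHOLD}
    return commit, review
```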
Vision
Claude · Sonnet 4.5
Reads PDFs, scanned images, and photographs natively. Understands tables, handles rotated pages, parses multilingual content without a separate OCR stage.
Schema
Structured JSON output
Every extraction job declares its required fields, optional fields, and validation constraints. The model fills the schema; it cannot invent new shape.
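One way such a schema might look, sketched as JSON Schema with illustrative field names: `required` stops the model omitting fields, and `"additionalProperties": false` stops it inventing new ones.

```json
{
  "type": "object",
  "additionalProperties": false,
  "required": ["vendor", "invoice_number", "total"],
  "properties": {
    "vendor": {"type": "string"},
    "invoice_number": {"type": "string"},
    "issue_date": {"type": "string", "format": "date"},
    "total": {"type": "number", "minimum": 0}
  }
}
```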
Confidence
Per-field scoring
Each extracted value carries a confidence score. Values below threshold get routed to human review; values above get written straight to the database.
Routing
Field-aware workflow
A total over €10k routes to approval; a new vendor routes to AP; a flagged lead-time routes to procurement. Extraction is the beginning, not the end.
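That routing logic can be sketched as a rule table; the predicates, field names, and queue names below are illustrative:

```python
# Field-aware routing: each rule pairs a predicate on the extracted
# record with the queue it routes to. Thresholds are illustrative.

RULES = [
    (lambda r: r.get("total_eur", 0) > 10_000, "approval"),
    (lambda r: r.get("vendor_is_new", False), "accounts_payable"),
    (lambda r: r.get("lead_time_weeks", 0) > 8, "procurement"),
]

def route(record):
    queues = [queue for predicate, queue in RULES if predicate(record)]
    return queues or ["auto_commit"]  # nothing flagged: straight through
```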
Provenance
Source-linked fields
Every extracted value remembers where on which page it came from. Audit trail, legal defensibility, and fast human verification all come from this one discipline.
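A provenance-carrying value might be modelled like this; the field names, the example filename, and the citation format are assumptions, not a fixed standard:

```python
from dataclasses import dataclass

# A value that remembers where it came from: file, page, pixel region.

@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: object
    confidence: float
    source_file: str   # archive identifier of the original document
    page: int          # 1-based page number
    bbox: tuple        # (x0, y0, x1, y1) pixel region on that page

    def citation(self) -> str:
        # One-click answer to "where does this number come from?"
        return f"{self.source_file}, page {self.page}, region {self.bbox}"
```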
Reviewer
The human in the loop
Rare but essential. For unfamiliar document types, low-confidence fields, and values that touch money or safety, a human reviews before commit.
03 Process
Six steps · end-to-end pipeline
From inbox to database row.
A good pipeline is a conveyor belt, not a crane lift. Each stage takes a standardised input and produces a standardised output, so the next stage can start without knowing what happened before. The whole thing runs on autopilot 95% of the time; the human only gets involved at the exceptions.
01
Ingest
File in
A PDF arrives from email, a supplier portal, a scanner, or a mobile photo. Normalise the format, compute a content hash, check for duplicates, store the original in the archive. Nothing else happens until the file is safely stored with a stable identifier.
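The ingest step fits in a few lines, with a dict standing in for real object storage:

```python
import hashlib

# Ingest sketch: stable identifier from content, duplicate check before
# anything else runs. ARCHIVE stands in for real object storage.

ARCHIVE = {}

def ingest(pdf_bytes: bytes) -> str:
    doc_id = hashlib.sha256(pdf_bytes).hexdigest()[:16]
    if doc_id in ARCHIVE:
        return doc_id            # duplicate: same content, same identifier
    ARCHIVE[doc_id] = pdf_bytes  # store the original before any processing
    return doc_id
```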
02
Segment
Page-aware
Split multi-page files into logical sections — a 23-page contract might have cover / parties / terms / annex / signatures. Models process segments better than whole books, and segmentation is where a tricky document reveals its structure.
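A toy segmenter, splitting wherever a page opens with a heading-like line; the heading list is illustrative, and real segmenters also use layout and font signals:

```python
import re

# Segment sketch: group 1-based page numbers into logical sections,
# starting a new section when a page's first line looks like a heading.

HEADING = re.compile(r"^(ANNEX|SIGNATURES|COMMERCIAL TERMS|PARTIES)", re.I)

def segment(pages):
    """pages: list of page texts; returns groups of 1-based page numbers."""
    if not pages:
        return []
    sections = [[1]]
    for i, text in enumerate(pages[1:], start=2):
        first_line = text.strip().splitlines()[0] if text.strip() else ""
        if HEADING.match(first_line):
            sections.append([i])       # heading opens a new section
        else:
            sections[-1].append(i)     # otherwise continue the current one
    return sections
```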
03
Extract
Schema-bound
Run the vision model against each segment with the target JSON schema in the prompt. The model returns structured values plus per-field confidence. For fields that span multiple pages (a total, a lead time buried on page 6), the extractor works across segments.
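The extract step, sketched with a canned model reply in place of a live API call; a real pipeline would get this JSON string back from the vision model, with the target schema included in the prompt:

```python
import json

# Extract sketch: parse the model's reply against the declared schema
# fields and reject anything the model invented. MODEL_REPLY simulates
# a vision-model response; values are illustrative.

SCHEMA_FIELDS = ["parties", "unit_price", "lead_time_weeks"]

MODEL_REPLY = json.dumps({
    "parties": {"value": ["Cryovox GmbH", "Buyer A/S"], "confidence": 0.96},
    "unit_price": {"value": 18400.0, "confidence": 0.94},
    "lead_time_weeks": {"value": 14, "confidence": 0.71},
})

def parse_extraction(reply: str) -> dict:
    data = json.loads(reply)
    unknown = set(data) - set(SCHEMA_FIELDS)
    if unknown:  # the schema is a contract, not a suggestion
        raise ValueError(f"model invented fields: {unknown}")
    return data
```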
04
Validate
Rules engine
Cross-check extracted values against business rules. Does the VAT math add up? Does the line-item total match the bottom-line total? Is the vendor on the approved list? Validation catches both model errors and source-document errors.
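The totals and VAT checks from this step, as a minimal sketch with an illustrative rounding tolerance:

```python
# Validate sketch: cross-checks that catch both model errors and
# source-document errors. The tolerance absorbs rounding on the source side.

def validate_invoice(lines, net, vat_rate, gross, tol=0.01):
    errors = []
    if abs(sum(lines) - net) > tol:
        errors.append("line items do not sum to net total")
    if abs(net * (1 + vat_rate) - gross) > tol:
        errors.append("VAT math does not add up")
    return errors
```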
05
Route
Field-aware
High-confidence, fully-validated extractions flow straight to the target system. Low-confidence or rule-flagged extractions route to a human reviewer with the source highlighted at the right page and span. No extraction is ever committed silently.
06
Store
Provenance preserved
The final record links back to the exact page and pixel region it came from. Six months later, when an auditor or a procurement manager asks "where does this number come from?", the answer is one click away. Provenance is what separates extraction from guessing.
04 Patterns
Six extraction jobs from the real field
The jobs that actually earn the stack.
These are the extraction patterns that come up again and again in an industrial service business. Each one has its own schema, its own validation rules, and its own failure modes. Together they cover most of what the office does with paper in any given week.
Contract term extraction
Purchase agreements, NDAs, supplier frameworks. Pull parties, pricing, lead times, warranty, payment terms, governing law, liability caps. Red-flags anything unusual — long lead times, uncapped liability, auto-renewal clauses.
Datasheet spec normalisation
Turn a 40-page manufacturer PDF into a comparable spec row. Capacity, voltage, current, dimensions, weight, IP rating, operating envelope, connection types. Normalises units across vendors so real comparison becomes possible.
Service-report structuring
The technician writes a few sentences, takes six photos, reads some parameters off the HMI. The model turns that into a structured service report: symptoms, measurements, root cause, parts replaced, recommendations, sign-off.
symptoms · measurements · root cause · parts · hours · sign-off
Parts-list reconciliation
Supplier invoice lines vs. BOM vs. project budget. Matches ambiguous part names ("MC-3 gasket set" = "Mycom MC-3 crankcase gasket kit"), flags price drift, catches missing items, catches duplicate billing.
part match · qty check · price drift · duplicates · missing
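The ambiguous-name matching can be approximated with the standard library's `difflib`; the similarity threshold here is illustrative, and a production matcher would add part-number normalisation and vendor aliases:

```python
from difflib import SequenceMatcher

# Parts-matching sketch: score an invoice line against BOM entries with
# a plain string-similarity ratio and keep the best match above threshold.

def best_match(invoice_line: str, bom: list, threshold: float = 0.5):
    def score(entry):
        return SequenceMatcher(None, invoice_line.lower(), entry.lower()).ratio()
    entry = max(bom, key=score)
    return entry if score(entry) >= threshold else None
```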
Multilingual invoice intake
Incoming invoices in Danish, German, Italian, Romanian. Language-agnostic extraction — vendor name, number, dates, VAT, totals — normalised to a single schema. The AP team sees one format regardless of source.
vendor · invoice # · dates · currency · vat · total
Handwritten field-note OCR
A phone photo of a technician's pocket-notebook page. Pulls out what was measured, what was replaced, what was recommended — and preserves ambiguity where the handwriting is genuinely unreadable, rather than guessing.
readings · actions · recommendations · uncertain
05 Try it
Pick a sample · simulated extraction
Drop in a document, watch it come apart.
This demo uses pre-computed synthetic extractions so the page works offline and preserves privacy — no API keys, no backend, no uploads sent anywhere. The structured outputs below are what a real Claude-powered pipeline produces on similar documents; the formatting, field shape, confidence scores, and red-flag logic all mirror production behaviour.
Sample document
3 samples · pre-loaded · live extraction disabled
Or drop a PDF here
Live API disabled on the public demo · upload is simulated
06 Featured case
The contract that hid a 14-week lead time
"Page 6 of 23 said everything."
A standard supplier purchase agreement arrives in the office queue. Twenty-three pages, three signatories, English throughout, nothing visually alarming. The procurement lead runs it through extraction because that's the routine now — she's not expecting anything. The output comes back with a single amber flag: lead time, 14 weeks, confidence 71%, flagged for review.
00:02
Ingest
PDF arrives attached to an email from the supplier. Hash computed, duplicate check clean, file stored. The contract is 23 pages, roughly 8,400 words.
00:12
Segment
Five logical sections detected — cover, parties & pricing, commercial terms, technical annex, signatures. The lead-time clause lives in the commercial terms block on page 6, buried between an invoicing paragraph and a force-majeure clause.
00:26
Extract
The extractor returns the expected fields: parties, product, unit price, quantity, warranty, payment, governing law. And one more: lead time · 14 weeks · confidence 71%. Confidence is low because the phrasing is unusual — the clause doesn't use the phrase "lead time" anywhere; it says "delivery shall occur within 98 calendar days of supplier's acceptance of order".
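Catching that phrasing is a normalisation problem: 98 calendar days is 14 weeks. A small sketch of a duration normaliser, with illustrative patterns rather than an exhaustive grammar:

```python
import re

# Normalisation sketch: map duration phrasings to weeks so the validator
# compares like with like, whatever wording the contract uses.

def to_weeks(text: str):
    m = re.search(r"(\d+)\s*(calendar\s+)?(day|week|month)s?", text, re.I)
    if not m:
        return None  # no duration found: leave the field for human review
    n, unit = int(m.group(1)), m.group(3).lower()
    factor = {"day": 1 / 7, "week": 1, "month": 30 / 7}[unit]
    return round(n * factor, 1)
```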
00:34
Validate
The validator flags the extracted 14-week value against the procurement business rule: any lead time longer than 8 weeks on this equipment class requires procurement-director sign-off. The flag routes the extraction to human review before the contract can be acknowledged.
00:41
Route & resolve
Procurement lead opens the extraction, clicks the lead-time field, jumps directly to page 6 with the exact clause highlighted. Reads it, confirms the interpretation, sees that this particular supplier has quietly extended lead times by six weeks since the last framework agreement. The deal gets re-negotiated before signature.
Why this is the AI leverage. A procurement lead reading 23 pages at the end of a long day skims the commercial terms block — the important stuff is supposed to be on page 1, everyone knows that. Extraction reads every page with equal attention and every paragraph with equal scepticism. It doesn't skim. It doesn't get tired. And when the phrasing is unusual, it flags the uncertainty rather than guessing confidently — which is exactly what a careful human reader would do if they had eight hours for every contract instead of eight minutes.