P04 · IDP · Multimodal · Layout analysis

Cheap path for clean PDFs. Vision for the hard ones. Pydantic validates everything that comes out.

A three-path intelligent document processing pipeline: text-based PDFs go through unstructured / docling ($0.001 / doc), scanned pages go through Tesseract with Claude validating critical fields, and complex layouts (forms, multi-column reports, 10-Ks) route directly to Claude Vision. Every extraction is validated against a Pydantic schema with cross-field rules (subtotal + tax = total, date ranges, NIT/EIN patterns). Confidence per field decides auto-approve vs human review.

Status
Planned · README only
Phase 2 · weeks 10–12
Datasets
FUNSD · DocVQA · PubLayNet
+ FinTabNet for tables
Domains
Legal · Finance · Healthcare
Comparable products: Hyperscience · Rossum · Klarity
Target metric
Field F1 ≥ 0.85 · Doc acc ≥ 0.80
vs 0.74 / 0.62 unstructured baseline
01 · The problem

One pipeline for invoices kills you on scanned medical forms.

Real document corpora are mixed: 60% clean PDFs from ERP exports, 30% scanned forms with handwriting, 10% complex layouts where the standard parsers return garbage. A single approach loses on either accuracy or cost.

Why one tool isn't enough

The cheap tool fails on 38% of documents. The expensive tool costs 80× more per document.

unstructured-only: 0.74 field-F1, 0.8s p95, $0.001/doc. Loses on scanned forms and complex 10-K layouts. Cheap but useless on a third of the corpus.

Vision-only: 0.91 field-F1, 0.88 doc-acc, but 11.4s p95 and $0.082/doc. You can afford it for the 10% hard cases — paying it for the 60% easy ones is throwing money away.

What the router buys you

Classify first, route second, validate always.

Claude Vision classifier (one cheap call per doc) decides which of the three paths to take. Path 1 for clean PDFs, Path 2 for scanned, Path 3 for hard layouts.
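The classify-then-route step reduces to a small dispatch table. A minimal sketch; the function names and label set are illustrative assumptions, not the project's actual API:

```python
# Route labels from the classifier to one of the three extraction paths.
# Label names ("text", "scanned", "hard_layout") are assumed for illustration.
ROUTES = {
    "text": "path1_unstructured",   # clean ERP-exported PDFs
    "scanned": "path2_ocr",         # Tesseract/PaddleOCR + Claude field-validate
    "hard_layout": "path3_vision",  # 10-Ks, multi-column reports, dense tables
}

def route(doc_path: str, classify) -> str:
    """One cheap classifier call per document decides the extraction path."""
    label = classify(doc_path)  # e.g. returns "text" | "scanned" | "hard_layout"
    # Anything the classifier can't place falls back to the strongest path.
    return ROUTES.get(label, "path3_vision")
```

Defaulting unknown labels to the Vision path trades a little cost for never sending a hard document through the cheap parser.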

Pydantic schemas per document type enforce structure. Invoice schema requires vendor.name · invoice_number · total · line_items[]. Form schema is different. Schema-invalid extractions get retried with the next-tier method.

Cross-field validators: subtotal + tax = total, NIT/EIN regex, date sanity, insurance ID format. A field with conf < 0.85 on a critical attribute routes to human review even if everything else passes.
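The schema plus cross-field rules can be sketched in Pydantic v2. Field names and the one-cent tolerance are assumptions; the real schemas live in src/validators/:

```python
# Hedged sketch: an invoice schema with a regex field rule (EIN) and a
# cross-field rule (subtotal + tax = total), as described above.
import re
from pydantic import BaseModel, field_validator, model_validator

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    ein: str
    subtotal: float
    tax: float
    total: float
    line_items: list[LineItem]

    @field_validator("ein")
    @classmethod
    def ein_matches_pattern(cls, v: str) -> str:
        # US EIN: two digits, hyphen, seven digits
        if not re.fullmatch(r"\d{2}-\d{7}", v):
            raise ValueError("EIN must match NN-NNNNNNN")
        return v

    @model_validator(mode="after")
    def totals_add_up(self) -> "Invoice":
        # Cross-field rule: subtotal + tax must equal total, within a cent.
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("subtotal + tax must equal total")
        return self
```

A schema-invalid extraction raises ValidationError, which the pipeline treats as the signal to retry with the next-tier method.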

Net: 0.89 field-F1 at $0.018/doc, within 0.02 F1 of Vision-only at roughly 5× lower cost.

02 · System diagram

Classify, route, extract, validate, confidence-gate.

// 3-path pipeline · classifier-driven routing · Pydantic-enforced output
PDF input · ≤ 20 MB · any layout
  ↓
Document Classifier (Claude Vision) → invoice | contract | form | report
  ↓
Path 1 · text-based · unstructured / docling + Camelot tables · $0.001 · 0.8s
Path 2 · scanned · Tesseract 5 / PaddleOCR + Claude field-validate · $0.006 · 1.8s
Path 3 · hard layout · Claude Vision direct + Table Transformer · $0.082 · 3.2s
  ↓
Field Extractor + Pydantic Validator · schema per type · cross-field rules · regex · LLM-as-judge
  ↓
score > 0.9 auto-approve · 0.7–0.9 human review · < 0.7 retry with Vision
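The final confidence gate is a small threshold function. A sketch; the handling of scores exactly at 0.9 and 0.7 is an assumption:

```python
def gate(score: float) -> str:
    """Map a per-document confidence score to a routing decision,
    per the thresholds in the diagram: > 0.9 auto, 0.7-0.9 review,
    below 0.7 retry with the Vision path."""
    if score > 0.9:
        return "auto-approve"
    if score >= 0.7:
        return "human-review"
    return "retry-with-vision"
```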
03 · Demo 1 of 2 · Install & benchmark

Six steps, three datasets, one comparative table.

Walks through the Docker stack (Postgres + Tesseract), dataset downloads (FUNSD + DocVQA + PubLayNet), the classifier run, three-path extraction over 939 docs, evaluation (field-level F1 on FUNSD, doc-level accuracy on DocVQA), and the Next.js demo launch.
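One simple way the field-level F1 could be computed over an extraction and its gold annotation; exact-match scoring per field is an assumption, not necessarily the benchmark's official matcher:

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Micro field-level F1: a field is a true positive only when both
    the key and the extracted value match the annotation exactly."""
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    fp = len(predicted) - tp                      # extracted but wrong/spurious
    fn = sum(1 for k in gold if predicted.get(k) != gold[k])  # missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```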

Demo 01
Install · extract · benchmark
6 steps · 56s · docling + tesseract + Claude Vision
04 · Demo 2 of 2 · Live extraction

Three documents, three paths — invoice, scanned medical form, 10-K.

Watch bounding boxes appear over the source page as fields are detected, then stream into the JSON extract panel with per-field confidence bars. Cross-field validators run after extraction completes. Final routing decision shows at the bottom.

Demo 02
3 documents end-to-end with validation
invoice (auto) · medical form (human review) · 10-K (auto)
05 · Stack

Mixed-tool pipeline pinned to one Docker stack.

Stack — pinned

Parsing
unstructured 0.16.0 · docling 2.4.2 · pdfplumber 0.11.4 · camelot-py 0.11.0
OCR & Vision
Tesseract 5 · PaddleOCR 2.8.1 · Claude Vision · Table Transformer · LayoutParser
Validation & serving
Pydantic v2 · Postgres 16 · Next.js + react-pdf

Routing & cost economics

60%
Of the corpus is "easy" (clean text-based PDFs from ERP exports). Path 1 handles these at $0.001/doc.
30%
Scanned documents (medical, intake forms, legal scans). Path 2 with OCR + Claude validation balances cost and accuracy.
10%
Hard layouts (10-K filings, multi-column reports, dense tables) go to Path 3. The premium per doc is worth it on the 10%.
Cheaper than Vision-only at comparable accuracy. The classifier overhead pays for itself within the first ~100 documents.
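The blended cost follows from quick arithmetic over the corpus mix and per-path prices above; attributing the gap to the quoted $0.018/doc to the classifier call is an inference, not a stated figure:

```python
# Corpus share and per-doc price for each path, from the numbers above.
mix = {
    "path1": (0.60, 0.001),  # clean text PDFs
    "path2": (0.30, 0.006),  # scanned + OCR validate
    "path3": (0.10, 0.082),  # hard layouts, Vision direct
}
blended = sum(share * price for share, price in mix.values())
# blended extraction cost ≈ $0.0106/doc; the remainder of the quoted
# $0.018/doc (≈ $0.007) would be the classifier call itself (inferred).
```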
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. FUNSD sample (5 forms with annotations) committed under data/funsd_sample/; DocVQA + PubLayNet loaders included
  2. Document type classifier (heuristic + Claude Vision fallback) in src/classifier/
  3. Three-path layout analyzer (unstructured · OCR · Vision) in src/extractors/ + src/parsers/
  4. Table extractor (Camelot for text PDFs, Table Transformer fallback for images) wired in src/extractors/tables.py
  5. Pydantic schemas per document type (invoice · contract · form · report) in src/validators/
  6. Regex + cross-field validators (test_validators.py exercises 14 rules)
  7. Confidence scorer + auto-vs-review routing in src/eval/scorer.py
  8. Eval runner reports field-level F1 (FUNSD), doc accuracy (DocVQA), table F1 — see data/eval/
  9. Side-by-side PDF + JSON output in /projects/04-document-intelligence.html
  10. Gallery of pre-processed documents wired into the demo (10 representative samples)
Next project →

P05 · Computer Use Agent

Autonomous agent operating an Ubuntu VM through screenshots and xdotool. RPA without APIs.