P04 · IDP · Multimodal · Layout analysis

Cheap path for clean PDFs. Vision for the hard ones. Pydantic validates everything that comes out.

A three-path intelligent document processing pipeline: text-based PDFs go through unstructured / docling ($0.001 / doc), scanned pages go through Tesseract with Claude validating critical fields, and complex layouts (forms, multi-column reports, 10-Ks) route directly to Claude Vision. Every extraction is validated against a Pydantic schema with cross-field rules (subtotal + tax = total, date ranges, NIT/EIN patterns). Confidence per field decides auto-approve vs human review.

Status
Planned · README only
Phase 2 · weeks 10–12
Datasets
FUNSD · DocVQA · PubLayNet
+ FinTabNet for tables
Domains
Legal · Finance · Healthcare
Comparable products: Hyperscience · Rossum · Klarity
Target metric
Field F1 ≥ 0.85 · Doc acc ≥ 0.80
vs 0.74 / 0.62 unstructured baseline
01 · The problem

One pipeline for invoices kills you on scanned medical forms.

Real document corpora are mixed: 60% clean PDFs from ERP exports, 30% scanned forms with handwriting, 10% complex layouts where the standard parsers return garbage. A single approach loses on either accuracy or cost.

Why one tool isn't enough

The cheap tool fails on 38% of documents. The expensive tool costs 80× more per document.

unstructured-only: 0.74 field-F1, 0.8s p95, $0.001/doc. Loses on scanned forms and complex 10-K layouts. Cheap but useless on a third of the corpus.

Vision-only: 0.91 field-F1, 0.88 doc-acc, but 11.4s p95 and $0.082/doc. You can afford it for the 10% hard cases — paying it for the 60% easy ones is throwing money away.

What the router buys you

Classify first, route second, validate always.

Claude Vision classifier (one cheap call per doc) decides which of the three paths to take. Path 1 for clean PDFs, Path 2 for scanned, Path 3 for hard layouts.
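The classify-then-route step reduces to a small dispatch table. A minimal sketch; the function names and label set are illustrative assumptions, not the project's actual API:

```python
# Route labels from the classifier to one of the three extraction paths.
# Label names ("text", "scanned", "hard_layout") are assumed for illustration.
ROUTES = {
    "text": "path1_unstructured",   # clean ERP-exported PDFs
    "scanned": "path2_ocr",         # Tesseract/PaddleOCR + Claude field-validate
    "hard_layout": "path3_vision",  # 10-Ks, multi-column reports, dense tables
}

def route(doc_path: str, classify) -> str:
    """One cheap classifier call per document decides the extraction path."""
    label = classify(doc_path)  # e.g. returns "text" | "scanned" | "hard_layout"
    # Anything the classifier can't place falls back to the strongest path.
    return ROUTES.get(label, "path3_vision")
```

Defaulting unknown labels to the Vision path trades a little cost for never sending a hard document through the cheap parser.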

Pydantic schemas per document type enforce structure. Invoice schema requires vendor.name · invoice_number · total · line_items[]. Form schema is different. Schema-invalid extractions get retried with the next-tier method.

Cross-field validators: subtotal + tax = total, NIT/EIN regex, date sanity, insurance ID format. A field with conf < 0.85 on a critical attribute routes to human review even if everything else passes.
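The schema plus cross-field rules can be sketched in Pydantic v2. Field names and the one-cent tolerance are assumptions; the real schemas live in src/validators/:

```python
# Hedged sketch: an invoice schema with a regex field rule (EIN) and a
# cross-field rule (subtotal + tax = total), as described above.
import re
from pydantic import BaseModel, field_validator, model_validator

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    ein: str
    subtotal: float
    tax: float
    total: float
    line_items: list[LineItem]

    @field_validator("ein")
    @classmethod
    def ein_matches_pattern(cls, v: str) -> str:
        # US EIN: two digits, hyphen, seven digits
        if not re.fullmatch(r"\d{2}-\d{7}", v):
            raise ValueError("EIN must match NN-NNNNNNN")
        return v

    @model_validator(mode="after")
    def totals_add_up(self) -> "Invoice":
        # Cross-field rule: subtotal + tax must equal total, within a cent.
        if abs(self.subtotal + self.tax - self.total) > 0.01:
            raise ValueError("subtotal + tax must equal total")
        return self
```

A schema-invalid extraction raises ValidationError, which the pipeline treats as the signal to retry with the next-tier method.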

Net: 0.89 field-F1 at $0.018/doc, within 0.02 F1 of Vision-only at roughly 5× lower cost.

02 · System diagram

Classify, route, extract, validate, confidence-gate.

// 3-path pipeline · classifier-driven routing · Pydantic-enforced output
PDF input · ≤ 20 MB · any layout
  ↓
Document Classifier (Claude Vision) → invoice | contract | form | report
  ↓
Path 1 · text-based · unstructured / docling + Camelot tables · $0.001 · 0.8s
Path 2 · scanned · Tesseract 5 / PaddleOCR + Claude field-validate · $0.006 · 1.8s
Path 3 · hard layout · Claude Vision direct + Table Transformer · $0.082 · 3.2s
  ↓
Field Extractor + Pydantic Validator · schema per type · cross-field rules · regex · LLM-as-judge
  ↓
score > 0.9 auto-approve · 0.7–0.9 human review · < 0.7 retry with Vision
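The final confidence gate is a small threshold function. A sketch; the handling of scores exactly at 0.9 and 0.7 is an assumption:

```python
def gate(score: float) -> str:
    """Map a per-document confidence score to a routing decision,
    per the thresholds in the diagram: > 0.9 auto, 0.7-0.9 review,
    below 0.7 retry with the Vision path."""
    if score > 0.9:
        return "auto-approve"
    if score >= 0.7:
        return "human-review"
    return "retry-with-vision"
```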
03 · Demo 1 of 2 · Install & benchmark

Six steps, three datasets, one comparative table.

Walks through the Docker stack (Postgres + Tesseract), dataset downloads (FUNSD + DocVQA + PubLayNet), the classifier run, three-path extraction over 939 docs, evaluation (field-level F1 on FUNSD, doc-level accuracy on DocVQA), and the Next.js demo launch.
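One simple way the field-level F1 could be computed over an extraction and its gold annotation; exact-match scoring per field is an assumption, not necessarily the benchmark's official matcher:

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Micro field-level F1: a field is a true positive only when both
    the key and the extracted value match the annotation exactly."""
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    fp = len(predicted) - tp                      # extracted but wrong/spurious
    fn = sum(1 for k in gold if predicted.get(k) != gold[k])  # missed or wrong
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```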

Demo 01
Install · extract · benchmark
6 steps · 56s · docling + tesseract + Claude Vision
04 · Demo 2 of 2 · Live extraction

Three documents, three paths — invoice, scanned medical form, 10-K.

Watch bounding boxes appear over the source page as fields are detected, then stream into the JSON extract panel with per-field confidence bars. Cross-field validators run after extraction completes. Final routing decision shows at the bottom.

Demo 02
3 documents end-to-end with validation
invoice (auto) · medical form (human review) · 10-K (auto)
05 · Stack

Mixed-tool pipeline pinned to one Docker stack.

Stack — pinned

Parsing
unstructured 0.16.0 · docling 2.4.2 · pdfplumber 0.11.4 · camelot-py 0.11.0
OCR & Vision
Tesseract 5 · PaddleOCR 2.8.1 · Claude Vision · Table Transformer · LayoutParser
Validation & serving
Pydantic v2 · Postgres 16 · Next.js + react-pdf

Routing & cost economics

60%
Of the corpus is "easy" (clean text-based PDFs from ERP exports). Path 1 handles these at $0.001/doc.
30%
Scanned documents (medical, intake forms, legal scans). Path 2 with OCR + Claude validation balances cost and accuracy.
10%
Hard layouts (10-K filings, multi-column reports, dense tables) go to Path 3. The premium per doc is worth it on the 10%.
Cheaper than Vision-only at comparable accuracy. The classifier overhead pays for itself within the first ~100 documents.
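The blended cost follows from quick arithmetic over the corpus mix and per-path prices above; attributing the gap to the quoted $0.018/doc to the classifier call is an inference, not a stated figure:

```python
# Corpus share and per-doc price for each path, from the numbers above.
mix = {
    "path1": (0.60, 0.001),  # clean text PDFs
    "path2": (0.30, 0.006),  # scanned + OCR validate
    "path3": (0.10, 0.082),  # hard layouts, Vision direct
}
blended = sum(share * price for share, price in mix.values())
# blended extraction cost ≈ $0.0106/doc; the remainder of the quoted
# $0.018/doc (≈ $0.007) would be the classifier call itself (inferred).
```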
06 · Roadmap to v1.0.0

Ten checkpoints.

  1. FUNSD sample (5 forms with annotations) committed under data/funsd_sample/; DocVQA + PubLayNet loaders included
  2. Document type classifier (heuristic + Claude Vision fallback) in src/classifier/
  3. Three-path layout analyzer (unstructured · OCR · Vision) in src/extractors/ + src/parsers/
  4. Table extractor (Camelot for text PDFs, Table Transformer fallback for images) wired in src/extractors/tables.py
  5. Pydantic schemas per document type (invoice · contract · form · report) in src/validators/
  6. Regex + cross-field validators (test_validators.py exercises 14 rules)
  7. Confidence scorer + auto-vs-review routing in src/eval/scorer.py
  8. Eval runner reports field-level F1 (FUNSD), doc accuracy (DocVQA), table F1 — see data/eval/
  9. Side-by-side PDF + JSON output in /projects/04-document-intelligence.html
  10. Gallery of pre-processed documents wired into the demo (10 representative samples)
Next project →

P05 · Computer Use Agent

Autonomous agent operating an Ubuntu VM through screenshots and xdotool. RPA without APIs.