Extracting documents
Upload a PDF or image, pick a template, and get structured data back with per-field source provenance.
From the dashboard
Go to /dashboard/documents. Pick a template, drop in a file, click Upload & extract.
- Supported file types: PDF, PNG, JPEG, TIFF, WebP. Max 50 MB.
- Born-digital PDFs (created from word processors) extract in milliseconds — no OCR needed.
- Scans and images run through the OCR cascade (Tier 0 → 1 → 3). Typical latency: 2–15 seconds, depending on scan quality and file size.
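The limits above can be checked on the client before uploading. The following is a minimal sketch, not the dashboard's actual validation: the function name, the accepted extension spellings, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are all assumptions.

```python
from pathlib import Path

# File types come from the list above; the extension spellings are an assumption.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}
MAX_BYTES = 50 * 1024 * 1024  # assumes "50 MB" means 50 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem with a file at once.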
The result, field by field
Each extracted field shows:
- Name: matches the field name in your template.
- Value: what the engine extracted. Empty when no value was found (the engine never invents a value).
- Source (verbatim): the exact text from the source document that this value came from. If a value can't be traced to a source, the engine drops it.
- Confidence: a 0–100% score combining OCR confidence, extraction-method confidence, and grounding strength. Color-coded green / amber / red.
- Method: which path produced this value:
  - regex: pattern-based (dates, currencies, numbers). Cheapest.
  - anchor: label-based lookup using the field's anchor string.
  - llm: LLM extraction using the field's description as a hint. Most expensive, most flexible.
- Needs review: flagged when confidence falls below the template's threshold. Flagged fields go to the Review queue.
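To make the grounding and review rules concrete, here is a small sketch of how a client might model one field. The `ExtractedField` class and both helper functions are hypothetical; the real response schema is defined in the API reference and may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ExtractedField:
    # Hypothetical shape mirroring the attributes listed above.
    name: str
    value: Optional[str]
    source: Optional[str]   # verbatim text the value was traced to
    confidence: float       # 0-100
    method: str             # "regex", "anchor", or "llm"


def apply_grounding(field: ExtractedField) -> ExtractedField:
    """Drop a value that cannot be traced to verbatim source text."""
    if field.value is not None and not field.source:
        return replace(field, value=None, source=None)
    return field


def needs_review(field: ExtractedField, threshold: float) -> bool:
    """Flag a field whose confidence falls below the template's threshold."""
    return field.confidence < threshold
```

For example, a field with confidence 91 is flagged under a threshold of 95 but passes under a threshold of 90.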
The OCR cascade
Cheap OCR runs first; the engine escalates to heavier tiers only when the cheap ones fail their quality gates. You don't configure this — it's automatic per document.
| Tier | Engine | When it runs | Credits/page |
|---|---|---|---|
| 0 | Born-digital extractor | PDFs with embedded text, DOCX, XLSX | 1 |
| 1 | Tesseract | Clean printed Latin scripts (7 languages) | 2 |
| 2A | PaddleOCR | Printed CJK (Chinese / Japanese / Korean) | 5 |
| 2B | Surya OCR | Printed Indic / Arabic / Thai / Khmer / Vietnamese | 5 |
| 3 | Vision-LLM | Handwriting, complex layouts, low-quality scans | 5–25 |
| 4 | Dual-pass + reconciliation | Precision mode (Enterprise) | 40–80 |
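The escalation idea (cheapest tier first, move up only when a quality gate fails) can be sketched as a simple loop. This is an illustration under stated assumptions, not the engine's implementation: the `Engine` callables, the single numeric `gate`, and the 0 to 1 quality score are hypothetical stand-ins for whatever gates the real tiers use.

```python
from typing import Callable, Optional

# An engine takes page bytes and returns (text, quality score in 0..1).
# Both the callable shape and the quality scale are assumptions.
Engine = Callable[[bytes], tuple[str, float]]


def run_cascade(
    page: bytes,
    tiers: list[tuple[str, Engine]],
    gate: float = 0.8,
) -> Optional[tuple[str, str]]:
    """Try tiers in order; return (tier label, text) from the first that passes the gate."""
    for label, engine in tiers:
        text, quality = engine(page)
        if quality >= gate:
            return label, text
    # No tier passed. How the real engine handles this case is not specified here.
    return None
```

Passing the tiers as an ordered list keeps the loop independent of which engines exist, so a routing step could choose a different list per document.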
Scan modes
Each template has a default scan mode that can be overridden per extraction. The mode drives routing decisions and credit cost:
- Quick Scan — born-digital only (1–2 credits/page)
- Standard (Auto) — auto-detect + cascade (2–5 credits)
- Multilingual (Printed) — CJK / Indic / Arabic printed (5 credits)
- Handwritten — English — Latin/Cyrillic/Greek (5 credits)
- Handwritten — Other language — CJK/Indic/Arabic/Hebrew (15–25 credits)
- Handwritten — Mixed languages — dual-engine pipeline (30 credits)
- Precision — highest-accuracy AI, dual-pass on Enterprise (40–80 credits)
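A rough budget for a document can be worked out from the ranges above. The sketch below assumes every figure is per page (only Quick Scan states "/page" explicitly), and the mode keys are hypothetical identifiers, not API values.

```python
# (low, high) credits for each mode, copied from the list above.
# Treating all of them as per-page costs is an assumption.
MODE_CREDITS = {
    "quick": (1, 2),
    "standard": (2, 5),
    "multilingual_printed": (5, 5),
    "handwritten_english": (5, 5),
    "handwritten_other": (15, 25),
    "handwritten_mixed": (30, 30),
    "precision": (40, 80),
}


def estimate_credits(mode: str, pages: int) -> tuple[int, int]:
    """Return the (minimum, maximum) credits for a document with `pages` pages."""
    if pages < 1:
        raise ValueError("pages must be at least 1")
    low, high = MODE_CREDITS[mode]
    return low * pages, high * pages
```

Under these assumptions, a 10-page document in Standard mode would cost between 20 and 50 credits.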
From the API
Programmatic extraction is documented in the API reference. The dashboard's upload flow is a wrapper around POST /v1/documents and POST /v1/extract, using the same endpoints and the same payloads.
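The two-call flow might look roughly like the sketch below, which only builds the requests and does not send them. Only the two endpoint paths come from this page. The host, the bearer-token header, the raw-bytes upload body, and the `document_id` / `template_id` field names are placeholders, so check the API reference for the real payloads.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real one


def build_upload_request(file_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Step 1: POST /v1/documents with the file (body format is assumed)."""
    return urllib.request.Request(
        f"{BASE_URL}/v1/documents",
        data=file_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is assumed
            "Content-Type": "application/octet-stream",
        },
    )


def build_extract_request(
    document_id: str, template_id: str, api_key: str
) -> urllib.request.Request:
    """Step 2: POST /v1/extract for an uploaded document (field names are assumed)."""
    body = json.dumps(
        {"document_id": document_id, "template_id": template_id}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/extract",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Each request could then be sent with `urllib.request.urlopen`, using the identifier from the first response as the input to the second call.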