DocuExtract

Upload a PDF or image, pick a template, and get structured data back with per-field source provenance.

From the dashboard

Go to /dashboard/documents. Pick a template, drop in a file, click Upload & extract.

Supported file types: PDF, PNG, JPEG, TIFF, WebP. Max 50 MB.
Born-digital PDFs (created from word processors) extract in milliseconds, with no OCR needed.
Scans and images run through the OCR cascade (Tier 0 → 1 → 3). Typical latency: 2–15 seconds, depending on quality and size.
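
The limits above can be checked on the client before uploading. The following is a minimal sketch, not part of DocuExtract itself: the function name `check_upload`, the accepted extension spellings, and the reading of "50 MB" as 50 × 1024 × 1024 bytes are all assumptions.

```python
from pathlib import Path

# Extensions for the supported types listed above (PDF, PNG, JPEG, TIFF, WebP).
# The exact spellings the service accepts are an assumption.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}

# Assumes "50 MB" means 50 * 1024 * 1024 bytes; the service may count differently.
MAX_BYTES = 50 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

For example, `check_upload("invoice.pdf", 1024)` returns an empty list, while a `.txt` file or an oversized scan returns one message per violated limit.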

The result, field by field

Each extracted field shows:

Name: matches the field name in your template.
Value: what the engine extracted. Empty when no value was found (the engine never invents values).
Source (verbatim): the exact text from the source document this value came from. If a value can't be traced to a source, the engine drops it.
Confidence: a 0–100% score combining OCR confidence, extraction-method confidence, and grounding strength. Color-coded green / amber / red.
Method: which path produced this value:
- regex: pattern-based (dates, currencies, numbers). Cheapest.
- anchor: label-based lookup using the field's anchor string.
- llm: LLM extraction using the field's description as a hint. Most expensive, most flexible.
Needs review: flagged when confidence falls below the template's threshold. Flagged fields go to the Review queue.
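
One way to picture a result is as a list of records, one per field. The sketch below is illustrative only: the class name `ExtractedField`, its attribute names, and the 0.0 to 1.0 confidence scale are hypothetical and do not describe the actual response format. It shows the one rule stated above, that a field is flagged when its confidence falls below the template's threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedField:
    # Attribute names are hypothetical; they mirror the items described above.
    name: str
    value: Optional[str]    # None when no value was found
    source: Optional[str]   # verbatim text the value was taken from
    confidence: float       # 0.0 to 1.0, shown in the dashboard as 0-100%
    method: str             # "regex", "anchor", or "llm"


def needs_review(field: ExtractedField, threshold: float) -> bool:
    """Flag a field whose confidence falls below the template's threshold."""
    return field.confidence < threshold


def review_queue(fields: List[ExtractedField], threshold: float) -> List[str]:
    """Names of the fields that would be sent to review at this threshold."""
    return [f.name for f in fields if needs_review(f, threshold)]


fields = [
    ExtractedField("invoice_date", "2024-03-01", "Date: 2024-03-01", 0.97, "regex"),
    ExtractedField("total", "118.40", "Total due 118.40", 0.62, "llm"),
]
flagged = review_queue(fields, threshold=0.80)
```

With a threshold of 0.80, only `total` ends up in `flagged`.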

The OCR cascade

Cheap OCR runs first; the engine escalates to heavier tiers only when the cheap ones fail their quality gates. You don't configure this — it's automatic per document.

Tier	Engine	When it runs	Credits/page
0	Born-digital extractor	PDFs with embedded text, DOCX, XLSX	1
1	Tesseract	Clean printed Latin scripts (7 languages)	2
2A	PaddleOCR	Printed CJK (Chinese / Japanese / Korean)	5
2B	Surya OCR	Printed Indic / Arabic / Thai / Khmer / Vietnamese	5
3	Vision-LLM	Handwriting, complex layouts, low-quality scans	5–25
4	Dual-pass + reconciliation	Precision mode (Enterprise)	40–80
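
The escalation idea can be sketched as a loop over tiers ordered by cost. This is a simplified illustration, not the engine's actual routing: the stub engines, the quality scores, and the gate thresholds are all hypothetical, and the script-specific tiers (2A, 2B) are left out so the chain matches the 0 → 1 → 3 path.

```python
from typing import Callable, List, Optional, Tuple

# An engine takes page bytes and returns (text, quality score in 0..1).
Engine = Callable[[bytes], Tuple[str, float]]
Tier = Tuple[str, Engine, float]  # (label, engine, minimum quality to accept)


def run_cascade(page: bytes, tiers: List[Tier]) -> Tuple[Optional[str], str]:
    """Try tiers in order, cheapest first; stop at the first one that passes its gate.

    Returns (tier label, text), or (None, "") if no tier passes.
    """
    for label, engine, min_quality in tiers:
        text, quality = engine(page)
        if quality >= min_quality:
            return label, text
    return None, ""


# Stub engines standing in for the real ones; outputs and scores are made up.
def embedded_text_stub(page: bytes) -> Tuple[str, float]:
    return "", 0.0  # a scan has no embedded text


def tesseract_stub(page: bytes) -> Tuple[str, float]:
    return "INVOICE 42", 0.91


def vision_llm_stub(page: bytes) -> Tuple[str, float]:
    return "INVOICE 42", 0.99


tiers = [
    ("0", embedded_text_stub, 0.5),
    ("1", tesseract_stub, 0.8),
    ("3", vision_llm_stub, 0.0),
]
tier, text = run_cascade(b"scanned page bytes", tiers)
```

Here Tier 0 fails its gate, Tier 1 passes, and the Vision-LLM stub is never called, which is the point of running the cheap engines first.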

Scan modes

Each template has a default scan mode, overridable per extraction. Modes drive routing decisions and credit cost:

Quick Scan: born-digital only (1–2 credits/page)
Standard (Auto): auto-detect + cascade (2–5 credits)
Multilingual (Printed): CJK / Indic / Arabic printed (5 credits)
Handwritten — English: Latin/Cyrillic/Greek (5 credits)
Handwritten — Other language: CJK/Indic/Arabic/Hebrew (15–25 credits)
Handwritten — Mixed languages: dual-engine pipeline (30 credits)
Precision: highest-accuracy AI, dual-pass on Enterprise (40–80 credits)
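
To budget a batch, the ranges above can be turned into a rough estimate. This sketch assumes every range is charged per page (the list states "/page" only for Quick Scan, though the cascade table is per page); the short mode keys and the function name `estimate_credits` are hypothetical.

```python
from typing import Tuple

# (low, high) credit ranges copied from the mode list above.
# Keys are hypothetical short names, not API identifiers.
MODE_CREDITS = {
    "quick": (1, 2),
    "standard": (2, 5),
    "multilingual": (5, 5),
    "handwritten_english": (5, 5),
    "handwritten_other": (15, 25),
    "handwritten_mixed": (30, 30),
    "precision": (40, 80),
}


def estimate_credits(mode: str, pages: int) -> Tuple[int, int]:
    """Return (lowest, highest) expected credits, assuming per-page pricing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = MODE_CREDITS[mode]
    return low * pages, high * pages
```

Under that assumption, a 10-page document in Standard (Auto) mode would cost between 20 and 50 credits, depending on which tiers the cascade ends up using.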

From the API

Programmatic extraction is documented in the API reference. The dashboard's upload flow is a wrapper around POST /v1/documents and POST /v1/extract: same endpoints, same payloads.
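
The two-step flow might look roughly like the sketch below, which only builds the requests and does not send them. Only the two endpoint paths come from this page; the host, the bearer-token header, the raw-bytes upload body, and the JSON field names `document_id` and `template_id` are placeholders, so check the API reference for the real payloads.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host, not the real one


def build_upload_request(file_bytes: bytes, api_key: str) -> Request:
    # Step 1: POST /v1/documents.
    # Sending raw bytes with a bearer token is an assumption.
    return Request(
        f"{BASE_URL}/v1/documents",
        data=file_bytes,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


def build_extract_request(document_id: str, template_id: str, api_key: str) -> Request:
    # Step 2: POST /v1/extract.
    # The JSON field names here are hypothetical.
    body = json.dumps(
        {"document_id": document_id, "template_id": template_id}
    ).encode("utf-8")
    return Request(
        f"{BASE_URL}/v1/extract",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Each request could then be sent with `urllib.request.urlopen`, with the document identifier returned by the first call passed into the second.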
