DocuExtract

Extracting documents

Upload a PDF or image, pick a template, and get structured data back with per-field source provenance.

From the dashboard

Go to /dashboard/documents. Pick a template, drop in a file, and click Upload & extract.

  • Supported file types: PDF, PNG, JPEG, TIFF, WebP. Max 50 MB.
  • Born-digital PDFs (for example, exported from a word processor) extract in milliseconds; no OCR is needed.
  • Scans and images run through the OCR cascade (Tier 0 → 1 → 3). Typical latency: 2–15 seconds, depending on quality and size.
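
The limits above can be checked on the client before a file is uploaded. The following is a minimal sketch, not part of DocuExtract: the function name check_upload, the extension spellings, and the reading of "50 MB" as 50 MiB are all assumptions made for illustration.

```python
from pathlib import Path
from typing import List

# File types from the list above; the exact extension spellings are assumptions.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".webp"}
# Assumption: "50 MB" is interpreted as 50 MiB.
MAX_BYTES = 50 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem at once, for example both a wrong file type and an oversized file.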

The result, field by field

Each extracted field shows:

  • Name: matches the field name in your template.
  • Value: what the engine extracted. Empty when no value was found (the engine never invents a value).
  • Source (verbatim): the exact text in the source document that this value came from. If a value can't be traced to a source, the engine drops it.
  • Confidence: a 0–100% score combining OCR confidence, extraction-method confidence, and grounding strength. Color-coded green / amber / red.
  • Method: the path that produced this value, one of:
    • regex: pattern-based (dates, currencies, numbers). Cheapest.
    • anchor: label-based lookup using the field's anchor string.
    • llm: LLM extraction using the field's description as a hint. Most expensive, most flexible.
  • Needs review: flagged when confidence falls below the template's threshold. Flagged fields go to the Review queue.
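
To make the per-field result concrete, here is a minimal sketch of how a client might model a field and apply the two rules described above: values without a source are dropped, and fields below the template's threshold are flagged for review. The class ExtractedField and the helpers enforce_grounding and review_queue are hypothetical names, not the actual response schema; the API reference remains the authority on field names and types.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class ExtractedField:
    """Hypothetical client-side model of one extracted field."""
    name: str               # matches the template's field name
    value: Optional[str]    # None when nothing was found
    source: Optional[str]   # verbatim text the value came from
    confidence: float       # 0-100
    method: str             # "regex", "anchor", or "llm"


def enforce_grounding(field: ExtractedField) -> ExtractedField:
    """Drop a value that cannot be traced to source text."""
    if field.value is not None and not field.source:
        return replace(field, value=None)
    return field


def review_queue(fields: List[ExtractedField], threshold: float) -> List[str]:
    """Names of the fields whose confidence is below the template threshold."""
    return [f.name for f in fields if f.confidence < threshold]
```

For example, with a threshold of 80, a field extracted by llm at 55% confidence would appear in the review list, while a regex-extracted date at 97% would not.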

The OCR cascade

Cheap OCR runs first; the engine escalates to heavier tiers only when the cheaper ones fail their quality gates. You don't configure this; it is decided automatically for each document.

Tier | Engine                     | When it runs                                       | Credits/page
0    | Born-digital extractor     | PDFs with embedded text, DOCX, XLSX                | 1
1    | Tesseract                  | Clean printed Latin scripts (7 languages)          | 2
2A   | PaddleOCR                  | Printed CJK (Chinese / Japanese / Korean)          | 5
2B   | Surya OCR                  | Printed Indic / Arabic / Thai / Khmer / Vietnamese | 5
3    | Vision-LLM                 | Handwriting, complex layouts, low-quality scans    | 5–25
4    | Dual-pass + reconciliation | Precision mode (Enterprise)                        | 40–80
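
The escalation idea (try the cheapest tier, move up only when a quality gate fails) can be sketched as follows. This illustrates the general pattern only and is not the engine's implementation: the function run_cascade, the single numeric gate, and the convention that each engine returns a (text, quality) pair are all assumptions.

```python
from typing import Callable, Sequence, Tuple

# Assumption: an engine takes page bytes and returns (text, quality in 0..1).
Engine = Callable[[bytes], Tuple[str, float]]


def run_cascade(
    page: bytes,
    tiers: Sequence[Tuple[str, Engine]],
    gate: float,
) -> Tuple[str, str]:
    """Run tiers in order, cheapest first.

    Returns (tier label, text) from the first tier whose quality passes the
    gate. If no tier passes, the last (heaviest) tier's output is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    label, text = tiers[0][0], ""
    for label, engine in tiers:
        text, quality = engine(page)
        if quality >= gate:
            break  # this tier was good enough; stop escalating
    return label, text
```

Because heavier tiers are only called after a cheaper one fails, a clean scan that passes at Tier 1 never incurs the cost of the later tiers.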

Scan modes

Each template has a default scan mode, which you can override per extraction. The mode determines both routing and credit cost:

  • Quick Scan — born-digital only (1–2 credits/page)
  • Standard (Auto) — auto-detect + cascade (2–5 credits)
  • Multilingual (Printed) — CJK / Indic / Arabic printed (5 credits)
  • Handwritten — English — Latin/Cyrillic/Greek (5 credits)
  • Handwritten — Other language — CJK/Indic/Arabic/Hebrew (15–25 credits)
  • Handwritten — Mixed languages — dual-engine pipeline (30 credits)
  • Precision — highest-accuracy AI, dual-pass on Enterprise (40–80 credits)
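
For budgeting, the ranges above can be turned into a rough per-document estimate. This is a sketch under two assumptions: that every figure in the list is per page (only the first entry says so explicitly), and that the dictionary keys below are hypothetical identifiers chosen for this example rather than the product's own mode names.

```python
from typing import Dict, Tuple

# (low, high) credits per page, copied from the list above.
# The keys are hypothetical identifiers, not official mode names.
CREDITS_PER_PAGE: Dict[str, Tuple[int, int]] = {
    "quick_scan": (1, 2),
    "standard": (2, 5),
    "multilingual_printed": (5, 5),
    "handwritten_english": (5, 5),
    "handwritten_other": (15, 25),
    "handwritten_mixed": (30, 30),
    "precision": (40, 80),
}


def estimate_credits(mode: str, pages: int) -> Tuple[int, int]:
    """Return the (lowest, highest) credit cost for a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = CREDITS_PER_PAGE[mode]
    return low * pages, high * pages
```

Under these assumptions, a 10-page document in the standard mode would cost between 20 and 50 credits.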

From the API

Programmatic extraction is documented in the API reference. The dashboard's upload flow is a wrapper around POST /v1/documents and POST /v1/extract, so it uses the same endpoints and the same payloads.