Features
Built for documents that matter.
Most extraction tools optimize for the demo. We optimize for the production requirement: every value traceable to source, uncertainty surfaced honestly, no third-party data exfiltration. Here's what that actually means.
Signature feature
Visual field-picker
Upload a sample of your document. Draw a bounding box around each field you want extracted. Name it, set its type (text, number, date, currency, enum, table), and optionally anchor it to a nearby label so the engine finds it even when the value's position drifts.
That set of definitions is a Template. Save it once and run thousands of similar documents through it — clean structured data out, consistent column names, consistent types, ready for downstream systems.
- Draw, don't code — no JSON schemas, no regex
- Anchor-based matching handles position drift between similar documents
- Six field types: text, number, date, currency, enum, table
- Templates are versioned — edits create v+1, existing batches stay reproducible
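To make the idea concrete, here is a minimal sketch of what a Template could look like as data. It is an illustration, not the product's actual schema: the class names `FieldDef` and `Template`, the `anchor_label` attribute, and the `(x, y, width, height)` reading of a bounding box are all assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# The six field types listed above.
FIELD_TYPES = {"text", "number", "date", "currency", "enum", "table"}


@dataclass(frozen=True)
class FieldDef:
    """One drawn field (hypothetical shape)."""
    name: str
    type: str
    bbox: Tuple[int, int, int, int]      # assumed (x, y, width, height) on the sample page
    anchor_label: Optional[str] = None   # nearby label text used to relocate the value

    def __post_init__(self):
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.type}")


@dataclass(frozen=True)
class Template:
    """An immutable, versioned set of field definitions (hypothetical shape)."""
    template_id: str
    version: int
    fields: Tuple[FieldDef, ...]

    def with_field(self, new_field: FieldDef) -> "Template":
        # An edit returns version + 1 and leaves the old version untouched,
        # so batches that ran against it stay reproducible.
        return replace(self, version=self.version + 1,
                       fields=self.fields + (new_field,))


v1 = Template("tpl_acme_invoices", 1, (
    FieldDef("invoice_number", "text", (412, 120, 140, 18), anchor_label="Invoice #"),
))
v2 = v1.with_field(FieldDef("total", "currency", (412, 891, 88, 18), anchor_label="Total"))
```

Freezing both classes is one simple way to get the "edits create v+1" behavior: nothing can mutate `v1` after a batch has used it.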
Template fields
- invoice_number · text · 1.00
- invoice_date · date · 0.99
- due_date · date · 0.96
- vendor_name · text · 1.00
- bill_to · text · 0.98
- subtotal · currency · 0.99
- tax · currency · 0.97
- total · currency · 0.97
Every value links back to a region in the source — and to the model + version that read it.
Anti-hallucination spine
Verbatim grounding.
Every value the engine extracts must point back to a source span in the original document. The LLM doesn't invent — it finds. Values that can't be grounded are dropped (default) or routed to the human-review queue (configurable per template).
This isn't a feature toggle. It's structural. The is_grounded column on every extraction defaults to false — only flipped to true after a real source span is located. A crashed extraction can never accidentally surface an ungrounded value.
- Direct verbatim match → highest confidence
- Fuzzy match (1–2 char edit distance) handles OCR noise like O↔0, 1↔l↔I
- Semantic match grounds typed values (date "Jan 5" → 2026-01-05 → source token)
- No match → drop or route to review. Never invent.
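The matching ladder above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the engine's implementation: it covers only the verbatim and fuzzy steps (semantic matching of typed values is left out), it compares against whole source tokens, and the names `Grounding` and `ground` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


@dataclass
class Grounding:
    value: str
    is_grounded: bool = False            # false until a source span is found
    method: Optional[str] = None
    source_token: Optional[str] = None


def ground(value: str, source_tokens: List[str], max_edits: int = 2) -> Grounding:
    result = Grounding(value=value)
    if value in source_tokens:
        result.is_grounded, result.method, result.source_token = True, "verbatim", value
        return result
    # Fuzzy step: accept the closest token if it is within max_edits edits.
    best = min(source_tokens, key=lambda t: edit_distance(value, t), default=None)
    if best is not None and edit_distance(value, best) <= max_edits:
        result.is_grounded, result.method, result.source_token = True, "fuzzy", best
    return result  # still ungrounded here means: drop or route to review


tokens = ["Invoice", "INV-1O01", "Total", "$2,967.30"]
exact = ground("$2,967.30", tokens)
noisy = ground("INV-1001", tokens)       # OCR read the digit 0 as the letter O
missing = ground("Globex Ltd", tokens)
```

Because `is_grounded` starts as `False` and is only set after a match, a code path that exits early leaves the value ungrounded. A real implementation would likely scale `max_edits` with value length, since two edits is a lot for a three-character field.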
Extracted field
confidence 0.97 · grounded ✓
Source span
page 1 · bbox (412, 891, 88, 18) · text matched verbatim
Quality boundary
Human-in-the-loop, by default
Confidence below the template's threshold? That field surfaces in a review queue with the source region highlighted on the original document image. The reviewer sees the field name, the engine's best guess, and exactly where it came from. One click to approve, one to correct.
Corrections feed back as exemplars that improve future extractions for that template. This is what separates "OCR with confidence scores" from a usable production workflow.
- Configurable confidence threshold per template (default 0.75)
- Source-region highlight on the page image, with text snippet context
- Per-field review — only the uncertain values, not whole documents
- Corrections logged to the audit trail (who, when, before / after)
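As a rough sketch of the per-field routing described above, assuming a field goes to review only when its confidence is strictly below the threshold. The names `ExtractedField` and `split_for_review` are hypothetical, and the sample values are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple

DEFAULT_THRESHOLD = 0.75  # default from the list above; configurable per template


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float


def split_for_review(
    fields: List[ExtractedField], threshold: float = DEFAULT_THRESHOLD
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Return (accepted, needs_review). Only uncertain fields are queued,
    never the whole document."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    ExtractedField("invoice_number", "INV-0001", 1.00),
    ExtractedField("invoice_date", "06/15/2026", 0.68),
    ExtractedField("tax", "$2,967.30", 0.71),
]
accepted, needs_review = split_for_review(fields)
```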
Review queue · 3 fields
- invoice_date · 0.68 · 06/15/2026 · date ambiguity (US/EU)
- tax · 0.71 · $2,967.30 · OCR uncertainty: 7 vs 1
- vendor_id · 0.62 · AC-2024-1109 · no clear anchor label
Languages we actually support
Honest multilingual tiering.
Every competitor markets "100+ languages." Most of those claims fall apart on real documents — bad accuracy, broken bounding boxes for right-to-left, no Indic support at all. We label by maturity instead.
Stable means production-ready, validated. Beta means clean documents work; bring yours. Experimental means active development, accuracy varies. You always know which is which.
- Stable: English, Spanish
- Beta: Chinese (Simplified + Traditional), French, Vietnamese, Korean, Tagalog, Portuguese, Arabic (RTL), Hindi
- Experimental: Punjabi, Tamil, Telugu, Bengali (Indic scripts most tools fail on)
- Right-to-left + complex-script handling is real engineering, not a language pack
Beyond clean printed text
Handwriting + degraded scans.
Form fields filled in by hand, signatures, dates scrawled in margins, faxed receipts with stamped overlays, photos taken with phone cameras at bad angles. The OCR cascade escalates these automatically to a self-hosted vision-LLM (Qwen 2.5-VL 7B by default) that reads what traditional OCR engines miss.
We're honest about the limits. Clean printed-form handwriting and short field values (signatures, names, dates, single-line answers) extract at 70–85% accuracy. Cursive freeform handwriting and degraded multi-line text often drop below the confidence threshold and route to the human-review queue rather than confidently guessing wrong.
- Signatures, hand-printed form fields, marginal notes, stamps
- Available on Premium tier (and bundled with Team and up)
- Same model handles handwriting and multilingual scripts — no separate setup
- Low-confidence handwriting routes to review with the source region highlighted
- Self-host: runs on your own GPU; hosted: Modal scale-to-zero GPU
Handwritten signature field
Signature
Extracted: J. Müller · confidence 0.78 · grounded ✓
confidence 0.78 < threshold 0.80
→ routed to review queue for confirmation
Forensic-grade provenance
Audit trail.
Every field carries its full lineage: which OCR tier ran, which model + version, the confidence score, the source page and bounding region, any human correction (who, when, before, after). All written to an append-only event log.
This is what makes the product auditable, not just accurate. Regulated workflows (legal, medical, financial) need to reconstruct exactly why any field reached its final state. The audit log is that reconstruction.
- Append-only — events are never updated or deleted
- Indexed by extraction, field, template, time, and event type
- 15+ canonical event types covering OCR, extraction, grounding, review
- Export-ready for compliance, SOC 2, HIPAA workflows (managed-deployment tier)
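One way to picture an append-only log is a store that exposes only append and read operations and hands out immutable events. The sketch below is in-memory and purely illustrative: the class names, the `for_field` helper, and the `ext_demo` id are hypothetical, and a production log would sit on durable storage with the indexes listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class AuditEvent:
    extraction_id: str
    event_type: str
    field: Optional[str]
    detail: Mapping[str, object]
    at: datetime


class AuditLog:
    """Append and read only: there is deliberately no update or delete method."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def append(self, extraction_id: str, event_type: str,
               field: Optional[str] = None, **detail: object) -> AuditEvent:
        event = AuditEvent(extraction_id, event_type, field,
                           MappingProxyType(dict(detail)),   # read-only view of the details
                           datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def for_field(self, extraction_id: str, field: str) -> List[AuditEvent]:
        """Reconstruct the lineage of one field, in the order events were written."""
        return [e for e in self._events
                if e.extraction_id == extraction_id and e.field == field]


log = AuditLog()
log.append("ext_demo", "document_uploaded", pages=3)
log.append("ext_demo", "extraction_field_extracted", field="total", method="llm", confidence=0.97)
log.append("ext_demo", "grounding_passed", field="total", edit_distance=0)
history = log.for_field("ext_demo", "total")
```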
Audit log · extraction d8f3...4a91
- 14:02:01.341 · document_uploaded · invoice_07.pdf · 3 pages
- 14:02:02.118 · language_detected · script=Latin · lang=en · 0.99
- 14:02:03.005 · ocr_tier_passed · tier=1 · confidence=0.94
- 14:02:08.622 · extraction_field_extracted · total · method=llm · 0.97
- 14:02:08.847 · grounding_passed · total · edit_distance=0
- 14:02:09.011 · extraction_completed · 7 fields · overall=0.95
What runs under the hood
Progressive OCR cascade.
Cheap, fast OCR runs first. Quality gates escalate to heavier engines only when needed. Born-digital PDFs never touch a model. Clean scans run Tesseract. Hard cases escalate to PaddleOCR. Handwriting and degraded scans go to a self-hosted vision-LLM. Failures route to human review — never to a paid third-party API by default.
On Premium tiers, two independent LLM passes vote on each field. Disagreements get a third pass or route to review. No silent guessing.
- Tier 0: embedded PDF text (born-digital, free, instant)
- Tier 1: Tesseract (Latin scripts, 8 supported languages)
- Tier 2: PaddleOCR (CJK, Indic, complex layouts)
- Tier 3: Vision-LLM (Qwen 2.5-VL 7B, self-hosted on scale-to-zero GPU)
- 2-pass agreement on Premium; 3rd-call tiebreaker on disagreement
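The two-pass agreement rule can be sketched as a small voting function. It assumes one plausible reading of the policy: run a third pass on disagreement, and route to review if no two passes agree. The function name `agree` and the pass-index callback are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def agree(extract_pass: Callable[[int], str]) -> Optional[str]:
    """Run two independent passes; call a third only if they disagree.
    Returns the agreed value, or None to signal 'route to review'."""
    first, second = extract_pass(0), extract_pass(1)
    if first == second:
        return first
    third = extract_pass(2)                        # tiebreaker pass
    value, votes = Counter([first, second, third]).most_common(1)[0]
    return value if votes >= 2 else None           # no majority: never guess


def fake_passes(answers):
    """Stand-in for real LLM calls: pass i returns answers[i]."""
    return lambda i: answers[i]


unanimous = agree(fake_passes(["$1,204.50", "$1,204.50"]))
tiebroken = agree(fake_passes(["$1,204.50", "$1,284.50", "$1,284.50"]))
unresolved = agree(fake_passes(["$1,204.50", "$1,284.50", "$7,204.50"]))
```

The third call is only paid for on disagreement, which keeps the common case at two passes.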
Cascade flow
- Tier 0 · free · Embedded text · pypdfium2
- Tier 1 · ~CPU sec · Tesseract · Latin scripts
- Tier 2 · ~CPU sec · PaddleOCR · CJK / Indic / layout
- Tier 3 · ~$0.003/page · Vision-LLM · Qwen 2.5-VL 7B
- Tier 4 · your labor · Human review · queue with source highlight
Quality gate at each tier decides whether to escalate or stop. Cheap tiers run first.
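The escalation logic is easy to picture as a loop over tiers with a gate between them. The sketch below uses stub functions in place of real OCR engines. The single confidence gate of 0.90 and the tier names are assumptions for illustration only; the actual gates are likely per-tier and richer than one number.

```python
from typing import Callable, List, Optional, Tuple

# A tier returns (text, confidence), or None when it cannot handle the page at all.
Tier = Callable[[bytes], Optional[Tuple[str, float]]]


def run_cascade(page: bytes, tiers: List[Tuple[str, Tier]],
                gate: float = 0.90) -> Tuple[str, Optional[str]]:
    """Try tiers cheapest-first and stop at the first result that clears the gate."""
    for name, tier in tiers:
        result = tier(page)
        if result is not None and result[1] >= gate:
            return name, result[0]
    # Nothing cleared the gate: hand off to people, not to a paid third-party API.
    return "human_review", None


# Stub tiers standing in for embedded-text extraction, Tesseract, and PaddleOCR.
tiers = [
    ("tier0_embedded_text", lambda page: None),             # a scan has no embedded text
    ("tier1_tesseract", lambda page: ("T0TAL 41.2O", 0.55)),
    ("tier2_paddleocr", lambda page: ("TOTAL 41.20", 0.94)),
]
stopped_at, text = run_cascade(b"<scanned page bytes>", tiers)
fallback = run_cascade(b"<scanned page bytes>", tiers[:2])
```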
For developers
Public API.
Every feature of the product is available over a documented REST API. Generate API keys in the dashboard, pick a tier (Standard / Premium / Premium + multi-pass), integrate from any stack.
Webhooks fire on batch completion. Rate limits scale with your plan. Idempotency headers on every endpoint. OpenAPI spec at /docs.
- REST + JSON, OpenAPI 3.1 spec, Swagger UI at /docs
- API keys with per-key scopes and rotation
- Webhooks for batch completion (HMAC-signed)
- Tier-aware rate limits (60 → 6,000 req/min by plan)
- Idempotency keys to make retries safe
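On the receiving side, checking an HMAC-signed webhook usually looks something like the sketch below. The specifics are assumptions, since this page does not state them: HMAC-SHA256, a hex-encoded signature, and signing over the raw request body. Check the API docs for the actual header name and scheme.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Recompute the signature over the raw body and compare in constant time.
    Assumes HMAC-SHA256 with a hex-encoded signature (not confirmed by this page)."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


# Simulate a sender signing a batch-completion payload with a shared secret.
secret = b"whsec_demo_only"
body = b'{"event": "batch.completed", "batch_id": "demo"}'
signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

genuine = verify_webhook(secret, body, signature)
tampered = verify_webhook(secret, body + b" ", signature)
```

Verify against the bytes exactly as received. Parsing and re-serializing the JSON first can change whitespace or key order and break the signature.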
Quick start
curl -X POST https://docuextract.ai/v1/extract \
-H "Authorization: Bearer $DOCUEXTRACT_API_KEY" \
-H "Content-Type: application/pdf" \
-H "X-Template: tpl_acme_invoices" \
-H "X-Tier: premium" \
--data-binary @invoice.pdf
# → { "fields": [...], "audit_id": "...", ... }
Bring your own template, or use a public one from the gallery.
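For comparison, the same request can be built from Python's standard library. This sketch mirrors the curl call above and adds an idempotency header so a retry can reuse the same key. The header name `Idempotency-Key` is an assumption (the page mentions idempotency headers but does not name one), and `build_extract_request` is a hypothetical helper. The request is constructed but not sent, so the example runs offline.

```python
import urllib.request
import uuid
from typing import Optional

API_URL = "https://docuextract.ai/v1/extract"


def build_extract_request(pdf_bytes: bytes, api_key: str, template: str,
                          tier: str = "premium",
                          idempotency_key: Optional[str] = None) -> urllib.request.Request:
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/pdf",
        "X-Template": template,
        "X-Tier": tier,
        # Assumed header name; reuse the same key when retrying the same upload.
        "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }
    return urllib.request.Request(API_URL, data=pdf_bytes, headers=headers, method="POST")


req = build_extract_request(b"%PDF-1.7 demo bytes", "sk_demo_only", "tpl_acme_invoices",
                            idempotency_key="demo-key-1")
# urllib.request.urlopen(req) would send it; omitted here on purpose.
```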
How we compare
The combination is the moat.
Almost every individual feature exists somewhere. No competitor combines them: OSS + self-host + visual picker + HITL + verbatim grounding + honest multilingual + no paid-API-by-default. That's the gap we're built into.
| Capability | DocuExtract | Typical competitor |
|---|---|---|
| Apache-2.0 open source | | |
| Full self-host (no contact-sales) | | |
| Polished visual bounding-box picker | | |
| Integrated HITL review queue | | |
| Verbatim grounding (no-hallucination as guarantee) | | |
| Honest multilingual tier labeling | | |
| Multi-pass LLM agreement | | |
| No paid third-party APIs by default | | |
| Custom-template visual editor | | |
| Public API + documented OpenAPI spec | | |
| Batch processing + webhooks | | |
"Typical competitor" abstracts across Nanonets, Sensible, Rossum, Docparser, Hyperscaler APIs. Individual competitors may match on a given row; none match on the full set.
Try it on your documents.
50 free documents per month covers a real evaluation. Or self-host for unlimited volume on your own hardware — same code, same features.