Open-source · Apache-2.0 · Self-hostable

Point. Pick. Extract.
No fabricated values.

Visual field-picker. Human-in-the-loop review. Every extracted value links back to its source — not hallucinated by an LLM. Self-host it on your own infrastructure or use the hosted version free.

Reads: printed text · handwriting · tables · multi-column layouts · 15 languages (incl. Arabic RTL + Indic scripts)

Start free · Self-host on GitHub · Talk to consulting →

50 documents/month on the free tier. No credit card. No data leaves your environment when you self-host.

invoice_template · v3

Acme Corporation · Invoice

Invoice no.: INV-2026-0042
Date: 2026-06-15
Due date: 2026-07-21
Bill to: Beta Industries LLC

Consulting (40 hrs): $15,000.00
Materials: $617.37
Subtotal: $15,617.37
VAT (19%): $2,967.30
Total: $18,584.67

Template fields (name · type · confidence)

invoice_number · text · 1.00
invoice_date · date · 0.99
due_date · date · 0.96
vendor_name · text · 1.00
bill_to · text · 0.98
subtotal · currency · 0.99
tax · currency · 0.97
total · currency · 0.97

Every value links back to a region in the source — and to the model + version that read it.
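
To make that concrete, here is a minimal sketch of what one such record could look like. It is an illustration only, not DocuExtract's actual schema: the class names `SourceRegion` and `ExtractedField`, every attribute name, the pixel coordinates, and the version string are assumptions made up for this example.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class SourceRegion:
    """Hypothetical bounding box on a page, in pixels."""
    page: int
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class ExtractedField:
    """Hypothetical record: a value plus where and how it was read."""
    name: str
    field_type: str
    value: str
    confidence: float
    region: SourceRegion
    model: str
    model_version: str


field = ExtractedField(
    name="invoice_number",
    field_type="text",
    value="INV-2026-0042",
    confidence=1.00,
    # Made-up coordinates, for illustration only.
    region=SourceRegion(page=1, x=420, y=96, width=180, height=24),
    model="tesseract",              # assumed engine for this field
    model_version="0.0.0-example",  # placeholder, not a real version
)

# asdict() converts nested dataclasses too, so the region travels with the value.
print(json.dumps(asdict(field), indent=2))
```

Keeping the region and model details on the same record as the value means an export never separates a number from the evidence for it.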

How it works

Three steps. One promise: nothing fabricated.

01
Define a template
Upload a sample document. Draw bounding boxes around the fields you want. Name them, set types (text, number, date, currency, table). Save the template.
02
Run a batch
Upload many similar documents — invoices, intake forms, receipts. The engine runs each through the OCR cascade, finds the values, grounds them to source regions.
03
Review uncertainties
Low-confidence fields surface in a review queue with the highlighted source region. Approve or correct. Export the rest as CSV or JSON — every value with its provenance.
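
The review step boils down to a confidence cut-off. The sketch below applies one to the confidences from the sample invoice template above. The threshold value and the names `REVIEW_THRESHOLD` and `split_for_review` are assumptions for illustration; the real product may choose its cut-off differently.

```python
REVIEW_THRESHOLD = 0.98  # assumed cut-off, not a documented default

# Confidences copied from the sample invoice template.
confidences = {
    "invoice_number": 1.00,
    "invoice_date": 0.99,
    "due_date": 0.96,
    "vendor_name": 1.00,
    "bill_to": 0.98,
    "subtotal": 0.99,
    "tax": 0.97,
    "total": 0.97,
}


def split_for_review(scores, threshold=REVIEW_THRESHOLD):
    """Return (accepted, needs_review) lists of field names."""
    accepted = [name for name, score in scores.items() if score >= threshold]
    needs_review = [name for name, score in scores.items() if score < threshold]
    return accepted, needs_review


accepted, needs_review = split_for_review(confidences)
print("review queue:", needs_review)
# prints: review queue: ['due_date', 'tax', 'total']
```

With this cut-off, five of the eight fields are exported directly and three wait for a reviewer.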

Why it's different

What the document actually says. Not what the model guessed.

Most extraction tools either hallucinate confident-but-wrong values, charge enterprise prices for what should be commodity work, or ship as a black box with no self-host option. DocuExtract is the opposite of all three.

Verbatim grounding
Every extracted value must trace back to a source span in the original document. Values that can’t be grounded are dropped or routed to review — never fabricated by the model.
Visual field-picker
Draw bounding boxes on a sample. No JSON schemas to write, no regex to maintain. Anchors handle position drift between similar documents.
Honest multilingual tiering
Stable: English, Spanish. Beta: 9 widely used languages, including Arabic (RTL) and Hindi. Experimental: four more Indic-script languages (Punjabi, Tamil, Telugu, Bengali). We label by maturity, not marketing.
Human-in-the-loop, by default
Low-confidence fields surface for review with the source region highlighted. Reviewer corrections feed back to improve future extractions on that template.
Self-hostable
docker compose up. Postgres + Redis + MinIO + Ollama, all in one stack. No paid API required. Apache-2.0. Run it on your own hardware, your own models, your own data.
Auditable
Every field carries: which OCR tier ran, which model + version, the confidence score, any human correction. Append-only audit log. Forensic-grade provenance.
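
The verbatim-grounding rule above can be sketched as a simple check: a candidate value is kept only if it appears, character for character, in the text read from the document. This is a simplified illustration that works on character spans in a string; the function name `ground` is hypothetical, and a real engine would ground to page regions rather than string offsets.

```python
from typing import Optional, Tuple


def ground(value: str, source_text: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of value in source_text, or None if absent."""
    if not value:
        return None  # an empty value cannot be grounded
    start = source_text.find(value)
    if start == -1:
        return None
    return start, start + len(value)


source_text = "Invoice no. INV-2026-0042\nDate 2026-06-15\nTotal $18,584.67"

# A value that is really on the page gets a span back.
span = ground("INV-2026-0042", source_text)
print(span, source_text[span[0]:span[1]])

# A value with two digits transposed is not in the source, so it is not grounded.
print(ground("$18,548.67", source_text))
```

A value that returns `None` here is the kind that would be dropped or sent to review rather than exported.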

What we read

Handles the documents others can't.

Printed text and born-digital PDFs are table stakes. We also read handwriting, multi-column tables, degraded scans, and 15 languages across three honesty tiers. We label by maturity instead of claiming "100+ languages" like most vendors do.

Document types

Printed text & PDFs

Born-digital PDFs are read directly from the embedded text layer. Scanned printed text runs through Tesseract or PaddleOCR, depending on script.

Handwriting

Signatures, short field values, and printed-form handwriting are handled by the Tier 3 vision-LLM (Qwen2.5-VL). 70–85% accuracy on clean handwriting; cursive freeform falls back to human review.

Tables & multi-column

Single-page tables, key/value forms, multi-column layouts. Complex multi-page tables are on the roadmap (tracked in KNOWN_ISSUES).

Degraded scans

Low-resolution photos, faxed documents, stained scans. The cascade escalates to the vision-LLM automatically; uncertain values route to the review queue.
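
The escalation described in this section can be sketched as a loop over tiers, cheapest first. Everything in the sketch is an assumption for illustration: the tier order (embedded text layer, then OCR, then vision-LLM), the `min_confidence` value, the function name `run_cascade`, and the three stub readers, which return canned results instead of calling any real engine.

```python
from typing import Callable, List, Optional, Tuple

Reading = Tuple[str, float]  # (text, confidence)
Reader = Callable[[bytes], Optional[Reading]]


def run_cascade(doc: bytes, tiers: List[Tuple[str, Reader]],
                min_confidence: float = 0.9) -> dict:
    """Try each tier in order; stop at the first confident reading."""
    best = None
    for name, reader in tiers:
        reading = reader(doc)
        if reading is None:
            continue  # this tier could not read the document at all
        text, confidence = reading
        if best is None or confidence > best[2]:
            best = (name, text, confidence)
        if confidence >= min_confidence:
            return {"tier": name, "text": text,
                    "confidence": confidence, "needs_review": False}
    # No tier was confident enough: keep the best attempt and flag it for review.
    if best is None:
        return {"tier": None, "text": None,
                "confidence": 0.0, "needs_review": True}
    return {"tier": best[0], "text": best[1],
            "confidence": best[2], "needs_review": True}


# Stub readers with canned results (no real OCR is called).
def text_layer(doc: bytes) -> Optional[Reading]:
    return None  # a scanned page has no embedded text layer


def ocr_engine(doc: bytes) -> Optional[Reading]:
    return ("T0tal $18,584.67", 0.62)  # degraded scan, low confidence


def vision_llm(doc: bytes) -> Optional[Reading]:
    return ("Total $18,584.67", 0.95)


tiers = [("text_layer", text_layer), ("ocr", ocr_engine), ("vision_llm", vision_llm)]
result = run_cascade(b"scanned-page-bytes", tiers)
print(result["tier"], result["needs_review"])
```

If the last tier is also unsure, the best attempt is returned with `needs_review` set, which mirrors the routing to the review queue.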

Languages (honestly tiered)

Stable

Production-ready. Validated against a comprehensive fixture set.

English
Spanish

Beta

Works on clean documents. Validate on yours.

Chinese (Simplified)
Chinese (Traditional)
French
Vietnamese
Korean
Tagalog
Portuguese
Arabic (RTL)
Hindi

Experimental

Active development. Accuracy varies.

Punjabi
Tamil
Telugu
Bengali

Full language-support matrix

Join the waitlist. Self-host anytime. Bring us in when it matters.

Hosted invites are rolling out in batches. Drop your email and we’ll let you know when you can sign up. Self-host the OSS version today for unlimited volume.

See pricing · Self-host on GitHub → · Talk to consulting →
