DocuExtract

Open-source · Apache-2.0 · Self-hostable

Point. Pick. Extract.
No fabricated values.

Visual field-picker. Human-in-the-loop review. Every extracted value links back to its source in the document, so nothing is hallucinated by an LLM. Self-host it on your own infrastructure or use the hosted version for free.

Reads: printed text · handwriting · tables · multi-column layouts · 15 languages (incl. Arabic RTL + Indic scripts)

50 documents/month on the free tier. No credit card. No data leaves your environment when you self-host.

invoice_template · v3
Acme Corporation · Invoice
Invoice no.: INV-2026-0042
Date: 2026-06-15
Due date: 2026-07-21
Bill to: Beta Industries LLC
Consulting (40 hrs): $15,000.00
Materials: $617.37
Subtotal: $15,617.37
VAT (19%): $2,967.30
Total: $18,584.67

Template fields

  • invoice_number · text · 1.00
  • invoice_date · date · 0.99
  • due_date · date · 0.96
  • vendor_name · text · 1.00
  • bill_to · text · 0.98
  • subtotal · currency · 0.99
  • tax · currency · 0.97
  • total · currency · 0.97

Every value links back to a region in the source — and to the model + version that read it.

How it works

Three steps. One promise: nothing fabricated.

  1. Define a template

    Upload a sample document. Draw bounding boxes around the fields you want. Name them and set their types (text, number, date, currency, table). Save the template.

  2. Run a batch

    Upload many similar documents: invoices, intake forms, receipts. The engine runs each one through the OCR cascade, finds the values, and grounds them to source regions.

  3. Review uncertainties

    Low-confidence fields surface in a review queue with the source region highlighted. Approve or correct them. Export the rest as CSV or JSON, every value with its provenance.
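
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not DocuExtract's actual API: the `TemplateField` and `Extraction` classes, the `split_for_review` helper, the box coordinates, and the 0.98 review threshold are all assumptions made for the example. The field names, values, and confidence scores are taken from the sample invoice template on this page.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical types for illustration only; not DocuExtract's real API.
Box = tuple[float, float, float, float]  # x, y, width, height as page fractions (assumed)

@dataclass
class TemplateField:
    name: str
    type: str   # one of: text, number, date, currency, table
    box: Box    # region drawn on the sample document

@dataclass
class Extraction:
    field: str
    value: str
    confidence: float
    source_box: Box  # where the value was found in this document

REVIEW_THRESHOLD = 0.98  # assumed cutoff; a real deployment would tune this

def split_for_review(extractions, threshold=REVIEW_THRESHOLD):
    """Return (accepted, needs_review), split on the confidence threshold."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    needs_review = [e for e in extractions if e.confidence < threshold]
    return accepted, needs_review

# Step 1: a template is a list of named, typed regions (coordinates are made up).
template = [
    TemplateField("invoice_number", "text", (0.60, 0.10, 0.30, 0.03)),
    TemplateField("due_date", "date", (0.60, 0.18, 0.30, 0.03)),
    TemplateField("total", "currency", (0.60, 0.80, 0.30, 0.03)),
]

# Step 2: pretend the engine returned these values for one document in the batch.
extractions = [
    Extraction("invoice_number", "INV-2026-0042", 1.00, (0.61, 0.10, 0.29, 0.03)),
    Extraction("due_date", "2026-07-21", 0.96, (0.61, 0.19, 0.29, 0.03)),
    Extraction("total", "$18,584.67", 0.97, (0.61, 0.81, 0.29, 0.03)),
]

# Step 3: low-confidence fields go to review; the rest are exported with their source box.
accepted, needs_review = split_for_review(extractions)
print(json.dumps([asdict(e) for e in accepted], indent=2))
print("review queue:", [e.field for e in needs_review])
```

With the assumed threshold, `invoice_number` is exported directly while `due_date` and `total` wait in the review queue until a person approves or corrects them.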

Why it's different

What the document actually says. Not what the model guessed.

Most extraction tools hallucinate confident-but-wrong values, charge enterprise prices for what should be commodity work, or ship as a black box with no self-host option. DocuExtract is the opposite of all three.

  • Verbatim grounding

    Every extracted value must trace back to a source span in the original document. Values that can’t be grounded are dropped or routed to review — never fabricated by the model.

  • Visual field-picker

    Draw bounding boxes on a sample. No JSON schemas to write, no regex to maintain. Anchors handle position drift between similar documents.

  • Honest multilingual tiering

    Stable: English, Spanish. Beta: 9 widely used languages, including Arabic RTL and Hindi. Experimental: four more Indic languages (Punjabi, Tamil, Telugu, Bengali). We label by maturity, not marketing.

  • Human-in-the-loop, by default

    Low-confidence fields surface for review with the source region highlighted. Reviewer corrections feed back to improve future extractions on that template.

  • Self-hostable

    docker compose up. Postgres + Redis + MinIO + Ollama, all in one stack. No paid API required. Apache-2.0. Run it on your own hardware, your own models, your own data.

  • Auditable

    Every field carries: which OCR tier ran, which model + version, the confidence score, any human correction. Append-only audit log. Forensic-grade provenance.
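
As a rough sketch of what verbatim grounding and per-field provenance could look like, the Python below accepts a candidate value only if it appears verbatim in the text read from the source region, and attaches an audit record either way. The `Provenance` fields mirror the list above (OCR tier, model and version, confidence, human correction). The class names, the `ground` function, the whitespace normalization, and the model name in the example are assumptions for illustration, not DocuExtract's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Provenance:
    ocr_tier: int
    model: str
    model_version: str
    confidence: float
    human_correction: Optional[str] = None

@dataclass(frozen=True)
class GroundedValue:
    field: str
    value: Optional[str]             # None when the candidate could not be grounded
    span: Optional[tuple[int, int]]  # (start, end) offsets in the normalized region text
    needs_review: bool
    provenance: Provenance

def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat the match."""
    return " ".join(text.split())

def ground(field: str, candidate: str, region_text: str, provenance: Provenance) -> GroundedValue:
    """Keep the candidate only if it occurs verbatim in the source region."""
    haystack = _normalize(region_text)
    needle = _normalize(candidate)
    start = haystack.find(needle) if needle else -1
    if start == -1:
        # Not found in the source: drop the value and route the field to review.
        return GroundedValue(field, None, None, True, provenance)
    return GroundedValue(field, needle, (start, start + len(needle)), False, provenance)

# Hypothetical model name and version, used only to fill in the record.
prov = Provenance(ocr_tier=2, model="example-ocr", model_version="0.0", confidence=0.99)
region = "Invoice no.\nINV-2026-0042"

ok = ground("invoice_number", "INV-2026-0042", region, prov)
bad = ground("invoice_number", "INV-2026-0043", region, prov)
print(ok.value, ok.span, ok.needs_review)
print(bad.value, bad.span, bad.needs_review)
```

The second call shows the point of the check: a plausible but wrong value is not stored, because nothing in the source region supports it.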

What we read

Handles the documents others can't.

Printed text and born-digital PDFs are table stakes. We also read handwriting, multi-column tables, degraded scans, and 15 languages across three honesty tiers. We label by maturity instead of claiming "100+ languages" like most vendors do.

Document types

Printed text & PDFs

Born-digital PDFs read directly from the embedded text layer. Scanned printed text runs through Tesseract / PaddleOCR depending on script.

Handwriting

Signatures, short field values, and handwriting on printed forms are handled by the Tier 3 vision-LLM (Qwen2.5-VL). Expect 70–85% accuracy on clean handwriting; freeform cursive falls back to human review.

Tables & multi-column

Single-page tables, key/value forms, multi-column layouts. Complex multi-page tables are on the roadmap (tracked in KNOWN_ISSUES).

Degraded scans

Low-resolution photos, faxed documents, stained scans. The cascade escalates to the vision-LLM automatically; uncertain fields route to the review queue.
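
The escalation described above can be pictured as trying readers in order and stopping at the first one that is confident enough. The sketch below uses stub readers in place of the real engines (embedded text layer, Tesseract / PaddleOCR, vision-LLM). The function names, the numbering of the first two tiers, the sample outputs, and the 0.90 threshold are assumptions, not DocuExtract's configuration.

```python
from typing import Callable, NamedTuple, Optional

class Reading(NamedTuple):
    text: str
    confidence: float
    tier: int

# A reader returns (text, confidence), or None if it cannot read the page at all.
Reader = Callable[[bytes], Optional[tuple[str, float]]]

def run_cascade(page: bytes, readers: list[Reader], min_confidence: float = 0.90):
    """Try each tier in order; return (best reading, needs_review)."""
    best: Optional[Reading] = None
    for tier, reader in enumerate(readers, start=1):
        result = reader(page)
        if result is None:
            continue  # this tier does not apply, escalate
        text, confidence = result
        reading = Reading(text, confidence, tier)
        if best is None or confidence > best.confidence:
            best = reading
        if confidence >= min_confidence:
            return reading, False  # confident enough, stop escalating
    return best, True  # no tier was confident: route to the review queue

# Stub readers standing in for the real engines (outputs are invented).
def text_layer(page: bytes):
    return None  # a scanned page has no embedded text layer

def classic_ocr(page: bytes):
    return ("Tota1 $18,584.67", 0.55)  # degraded scan, low confidence

def vision_llm(page: bytes):
    return ("Total $18,584.67", 0.93)

reading, needs_review = run_cascade(b"", [text_layer, classic_ocr, vision_llm])
print(reading, needs_review)

fallback, fallback_review = run_cascade(b"", [text_layer, classic_ocr])
print(fallback, fallback_review)
```

In the second call no tier clears the threshold, so the best available reading is kept but flagged for a human rather than exported as-is.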

Languages (honestly tiered)

Stable

Production-ready. Validated against a comprehensive fixture set.

  • English
  • Spanish
Beta

Works on clean documents. Validate on yours.

  • Chinese (Simplified)
  • Chinese (Traditional)
  • French
  • Vietnamese
  • Korean
  • Tagalog
  • Portuguese
  • Arabic (RTL)
  • Hindi
Experimental

Active development. Accuracy varies.

  • Punjabi
  • Tamil
  • Telugu
  • Bengali

Join the waitlist. Self-host anytime. Bring us in when it matters.

Hosted invites are rolling out in batches. Drop your email and we’ll let you know when you can sign up. Self-host the OSS version today for unlimited volume.

We’ll email you when invites open. No spam. Unsubscribe anytime.