About
Open-source document extraction, built by people who run AI on real infrastructure.
DocuExtract is built and maintained by Inspire AI Lab, a small AI consultancy. We help organizations run AI on their own hardware — private, self-hosted, audit-grade — instead of paying per-token API bills. Inspire Extract is the open-source companion to our Document Intelligence practice.
Why open source
You shouldn't need a sales call to read your own documents.
Document extraction is commodity infrastructure. The OCR engines are open. The model weights are open. The vector formats are open. Everything except the user-facing product has been openly available for years. Yet the tools that package it into a usable workflow are closed-source SaaS that charge enterprise prices and route your documents through their data centers.
We don't think that's the right shape for this market. Self-host should be a first-class option, not a contact-sales gate. Audit trails should be inspectable, not vendor-locked. Hallucination prevention should be a structural guarantee, not a marketing claim.
So we built DocuExtract under Apache-2.0. Anyone can run it, fork it, embed it in commercial products. The hosted version exists for users who don't want to run infrastructure. The consulting practice exists for users whose deployment requirements are deeper than a SaaS subscription can serve.
What we believe
Six principles. They're also commitments.
- 01
Never fabricate
Every value must trace back to a real source span. Ungrounded values are dropped or flagged for review — never returned as confident answers. This is structural, not optional.
- 02
Surface uncertainty
Below-threshold confidence routes to a human review queue with the source region highlighted. Customers know exactly what we know and don't know. Quality comes from honesty, not bravado.
- 03
Self-host first
The OSS codebase is the same code that runs the hosted tier. Self-host is not a stripped-down version — it's the full product, on your hardware, with your models, on your terms.
- 04
Honest labels
Multilingual support is tiered: stable, beta, experimental. We don't claim "100+ languages" because most of those would mangle real documents. We'd rather ship fifteen we trust.
- 05
Predictable cost
Flat infrastructure cost when idle. Per-second compute when active. No per-page surprise bills. Self-host removes the ceiling entirely. Paid plans pay for the GPU compute, not the software.
- 06
Auditable provenance
Every extracted field carries: which OCR tier, which model + version, the confidence score, the source region, any human correction. Append-only event log. Reconstruct any decision.
The team behind it
Inspire AI Lab.
A small AI consultancy based in New Jersey. We help organizations move from proof-of-concept to production AI on their own infrastructure — assessment, build, deploy, and ongoing operations.
Document Intelligence is one of our practice areas — multi-tier OCR for multilingual and degraded scans, audit-grade extraction, human-in-the-loop workflows. DocuExtract is the open-source distillation of that practice.
If your deployment requires more than a SaaS subscription — custom models on your hardware, regulated compliance, integration with existing systems, multilingual validation on a corpus of your own documents — that's the consulting conversation.
Try the product. Read the code. Talk to us if it matters.
50 free documents a month gets you a real evaluation. The repo is on GitHub. The consulting practice is one email away.