Languages
Every competitor markets "100+ languages." Most of those claims fall apart on real documents: poor accuracy, broken bounding boxes on right-to-left scripts, no Indic support at all. We label each language by maturity tier and state the real caveats for it. The full matrix follows.
Beyond language
Language coverage is only half the story. The other half is document shape and quality: handwriting, complex tables, degraded scans. Here's what the engine handles regardless of language.
Printed text & born-digital PDFs
Embedded text in born-digital PDFs is read directly from the file at 100% accuracy, with no model involved. Printed scans run through Tesseract for Latin scripts and PaddleOCR for CJK/Indic.
Handwriting
Signatures, hand-filled form fields, dates, short freeform answers. Routed to the Tier 3 vision-LLM (Qwen 2.5-VL 7B). Expect 70–85% accuracy on clean handwriting; cursive multi-line text often drops to the review queue. Available on the Premium tier (bundled with the Team plan and up).
Tables
Single-page tables are extracted with bounding-polygon awareness. Multi-page tables (continuation rows across pages) are on the roadmap and tracked in KNOWN_ISSUES.
Degraded scans
Low-resolution photos, faxed documents, stained pages. The cascade escalates to the vision-LLM automatically when traditional OCR confidence drops below 0.65.
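The routing and escalation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the engine's actual code: `OcrResult`, `pick_engine`, `extract`, and the two callables standing in for the OCR engines and the vision-LLM are all hypothetical names. Only the engine assignments and the 0.65 threshold come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Confidence cutoff quoted in the text; everything else here is assumed.
ESCALATION_THRESHOLD = 0.65

# Engine assignments as stated in the text. Other script families are
# not covered by the source, so they are deliberately left out.
ENGINE_BY_FAMILY = {"latin": "tesseract", "cjk": "paddleocr", "indic": "paddleocr"}


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str


def pick_engine(script_family: str) -> str:
    """Return the traditional OCR engine for a script family."""
    try:
        return ENGINE_BY_FAMILY[script_family]
    except KeyError:
        raise ValueError(f"no engine mapping for {script_family!r}") from None


def extract(
    embedded_text: Optional[str],
    script_family: str,
    run_ocr: Callable[[str], OcrResult],       # hypothetical: runs the named engine
    run_vision_llm: Callable[[], OcrResult],   # hypothetical: Tier 3 fallback
) -> OcrResult:
    # Born-digital PDF: use the embedded text, no model involved.
    if embedded_text:
        return OcrResult(embedded_text, 1.0, "embedded")
    # Scan: try traditional OCR first.
    first_pass = run_ocr(pick_engine(script_family))
    # Escalate only when confidence drops below the threshold.
    if first_pass.confidence < ESCALATION_THRESHOLD:
        return run_vision_llm()
    return first_pass
```

Passing the engines in as callables keeps the sketch independent of any particular OCR library; a real integration would wrap Tesseract, PaddleOCR, and the vision-LLM behind those two functions.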
Stable
Production-ready. Validated against a comprehensive fixture set. Use these in business-critical workflows with confidence.
Caveats
Degraded scans escalate to vision-LLM (Tier 3).
Caveats
Diacritic-handling validated; bilingual EN/ES docs use dominant-script routing.
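The text does not say how the dominant script of a mixed document is determined. One plausible approach, sketched here under that assumption, is to count letters by Unicode script prefix and route on the most frequent one. The function name `dominant_script` and the `"UNKNOWN"` fallback are hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in text.

    The script is approximated by the first word of each character's
    Unicode name, e.g. LATIN, CJK, HANGUL, ARABIC, DEVANAGARI.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]
```

Accented Latin letters such as ñ or é still count as LATIN under this scheme, so diacritics do not shift the routing decision.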
Beta
Works on clean documents. Real-world accuracy depends on your document quality. Validate on a sample before committing to volume.
Caveats
Handwritten or stylized fonts often need Tier 3 vision-LLM. Tables work; complex multi-column layouts can confuse the layout parser.
Caveats
Some traditional characters are less frequent in training data; vision-LLM fallback often improves accuracy.
Caveats
Diacritic-heavy text handled; right-to-left tables (currency placement) validated.
Caveats
Stacked tone marks can confuse low-resolution OCR. Validate on your scan quality.
Caveats
Mixed Hangul/Hanja documents (older formal text) escalate to vision-LLM.
Caveats
Spanish loanwords and abbreviations handled correctly.
Caveats
BR and PT variants both supported; currency/date formats locale-aware.
Caveats
Field-picker overlay coordinates need polish on rotated scans (tracked in KNOWN_ISSUES). Diacritical marks in classical Arabic may need Tier 3 vision-LLM.
Caveats
OCR struggles with conjunct-heavy handwriting and degraded scans. Vision-LLM fallback typically improves accuracy by 10–15 percentage points.
Experimental
Active development. Accuracy varies. These are the hardest scripts, the ones most tools fail on entirely; we ship them honestly labeled instead of overpromising.
Caveats
Active development. Bring real documents for evaluation. Government forms with stamped overlays remain difficult.
Caveats
Complex consonant clusters and Vatteluttu-influenced glyphs can confuse OCR. Vision-LLM helps but adds latency.
Caveats
Compound vowel signs sit above and below the base character, so small OCR noise causes character-level errors.
Caveats
Conjuncts and ligatures common in formal text; vision-LLM significantly improves over traditional OCR.
How we set the tiers
Stable means we have run thousands of synthetic and anonymized real-world documents through it, validated extraction accuracy against ground truth, and shipped it as production-ready. We'd use it ourselves in a regulated workflow.
Beta means the OCR engine and routing work, but the maturity isn't backed by exhaustive validation. We've tested it on clean documents. Your scan quality, document layout, and font choice might surface failures we haven't seen. Validate on a sample first.
Experimental means it works well enough to be useful, but accuracy varies meaningfully across documents. Indic scripts are the hardest cases in OCR: most tools refuse to ship them at all, or ship them with misleading "supported" labels. We ship them with honest expectations instead.
As accuracy improves, languages get promoted. Promotions are documented in the CHANGELOG with the validation work that supported them.
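For teams that want to mirror these tiers in their own tooling, the model above could be represented roughly as follows. This is a sketch under assumptions: the `Tier` enum, the `LanguageSupport` record, and the `promote` helper are hypothetical names, and the one-step promotion rule is a guess rather than a documented policy.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    EXPERIMENTAL = 1
    BETA = 2
    STABLE = 3


@dataclass
class LanguageSupport:
    name: str
    tier: Tier
    caveats: list[str] = field(default_factory=list)


def promote(entry: LanguageSupport, evidence: str) -> str:
    """Move a language up one tier and return a changelog-style line.

    The evidence string stands in for the validation work that
    justified the promotion.
    """
    if entry.tier is Tier.STABLE:
        raise ValueError(f"{entry.name} is already Stable")
    entry.tier = Tier(entry.tier.value + 1)
    return f"{entry.name}: promoted to {entry.tier.name.title()} ({evidence})"
```

Calling `promote(LanguageSupport("ExampleLang", Tier.EXPERIMENTAL), "validated on sample set")` would move the placeholder entry to Beta and return a line suitable for a changelog.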
Inspire AI Lab has run extraction at scale on a 230M-document multilingual legal corpus. If you have a custom language requirement — fine-tuning, new script support, dialect handling — we can scope a custom build against your real corpus.