Also known as: optical character recognition, text recognition, document AI
TL;DR
OCR converts image regions containing text into machine-readable strings. Classical pipelines (Tesseract, and cloud services such as Google Cloud Vision and AWS Textract) first detect text regions, then recognize the text in each one, classically with a CNN+LSTM recognizer.
OCR (Optical Character Recognition) extracts machine-readable text from images: scanned documents, signs, screenshots, receipts, handwritten notes. It is one of the oldest computer-vision problems, with a continuous research history going back to the 1950s, and it remains active even as frontier VLMs increasingly subsume it. The reason: at production scale, dedicated OCR is cheaper, faster, and more predictable.
OCR sits at the front of most document-AI pipelines. Its accuracy is a hard upper bound on everything downstream: the best RAG system cannot retrieve a fact that OCR mangled into gibberish.
Classical OCR architecture
A classical OCR pipeline has three stages.
Layout analysis — detect text regions (paragraphs, lines, tables) within the image. Computer-vision models (object detectors, segmentation networks) handle this.
Text-line detection — within each region, locate individual lines of text. Output is a set of axis-aligned or rotated bounding boxes.
Recognition — for each line, predict the character sequence. Historically a CNN feature extractor + LSTM decoder + CTC loss; modern systems often use transformer decoders.
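Tesseract (developed at HP starting in 1985, open-sourced in 2005, with an LSTM-based recognizer since version 4) follows this shape. Google Cloud Vision OCR, AWS Textract, and Azure Document Intelligence are cloud variants that typically offer better accuracy and richer layout outputs (tables, key-value pairs, form fields). For high-volume, clean English scans, these services can reach 99%+ character accuracy at roughly $1 to $2 per 1,000 pages for basic text extraction; structured features such as tables and forms are priced higher.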
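The recognition stage described above ends with a decoding step when the model is trained with CTC loss. A minimal sketch of greedy (best-path) CTC decoding follows, assuming the recognizer emits one score vector per horizontal frame and that class index 0 is the blank symbol. The function name, the blank index, and the toy alphabet are assumptions of this sketch, not the convention of any particular engine.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol in every score vector


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring class index for every time step (frame).
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]

    decoded: List[str] = []
    previous = BLANK
    for index in best_path:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != BLANK:
            decoded.append(alphabet[index - 1])  # class k maps to alphabet[k - 1]
        previous = index
    return "".join(decoded)


# Toy input: six frames over the classes (blank, "a", "b").
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
    [0.1, 0.1, 0.8],    # b (repeat, merged)
]
print(ctc_greedy_decode(frames, "ab"))  # aab
```

The blank frame is what lets the decoder output a doubled letter: without it, the two runs of "a" would be merged into one. Production systems often replace the greedy step with beam search, optionally combined with a language model.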
Transformer-based modern OCR
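A second wave of architectures replaces the CNN+LSTM recognizer with transformers and, in some cases (Donut, for example), collapses detection and recognition into one end-to-end model. Recognition-only models such as TrOCR still expect cropped text-line images as input.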
Modern OCR systems
TrOCR (Microsoft, 2021) — ViT image encoder + standard transformer decoder, autoregressively generates the text. Strong on printed and handwritten English.
Donut (NAVER, 2021) — OCR-free document understanding. Skips explicit OCR entirely; encodes the image with Swin Transformer and decodes structured output (JSON) directly.
TrOCR-Math, Nougat — specialized variants for scientific documents with math and tables.
PaddleOCR — open-source, Chinese-centric, fast on CPU.
Surya (datalab.to, 2024) — modern open-source OCR with strong layout analysis, competitive with cloud services on many benchmarks.
These models trade compute for accuracy and unify the pipeline. Donut’s design is particularly interesting: by going directly from image to JSON, it avoids propagating errors from a separate OCR stage (although the model itself can still misread text), and it treats documents as a structured-output problem rather than a text-extraction problem.
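To make the image-to-JSON idea concrete, here is a minimal sketch of the last step: turning a decoded tag sequence into a nested structure. The `<s_key>...</s_key>` format loosely mirrors the style of Donut's output sequences, but the sample string, the field names, and this parser are illustrative assumptions, not Donut's actual implementation.

```python
import json
import re
from typing import Union

# Matches one <s_key>body</s_key> field; the backreference pairs each
# opening tag with the closing tag of the same name.
_FIELD = re.compile(r"<s_(?P<key>\w+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tags_to_dict(sequence: str) -> Union[str, dict]:
    """Convert a nested <s_key>...</s_key> string into dicts; leaves stay strings."""
    fields = list(_FIELD.finditer(sequence))
    if not fields:
        return sequence.strip()  # leaf node: plain text value
    # Recurse into each field body so nested tags become nested dicts.
    return {m.group("key"): tags_to_dict(m.group("body")) for m in fields}


# Hypothetical decoder output for a receipt image.
sample = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price></s_menu>"
    "<s_total>4.50</s_total>"
)
print(json.dumps(tags_to_dict(sample), indent=2))
```

This sketch deliberately ignores repeated keys at the same level (later ones overwrite earlier ones) and a tag nested inside another tag of the same name; a real parser needs list handling for repeated items such as multiple line items.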
VLMs as implicit OCR
Modern VLMs — GPT-4V, Claude vision, Gemini, Qwen-VL — read text in images well enough to displace dedicated OCR for many use cases. They can ingest a printed invoice, a Slack screenshot, or a whiteboard photo and produce structured output without a separate OCR step.
Failure modes everyone hits
Skew and rotation. Photos taken at an angle confuse models trained on flat scans. Image preprocessing (deskew, perspective-correct) is load-bearing.
Low-resolution screenshots. Below ~100 DPI, character recognition degrades sharply.
Mixed scripts and languages. A document with English and Mandarin in the same line is much harder than either alone. Most systems require a language hint upfront.
Handwriting. Cursive, signatures, historical handwriting (pre-1900) are each their own subfield.
Tables. Detecting cell boundaries and deciding whether to read cells in row-major or column-major order both depend on layout heuristics that fail on complex tables. Specialized models (Microsoft’s Table Transformer, AWS Textract’s table mode) handle this; generic OCR doesn’t.
Math and code. Subscripts, special characters, and indentation break standard pipelines. LaTeX-aware models (Nougat, TrOCR-Math) exist but remain research-grade and are not yet proven at production volume.
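For the skew failure mode listed above, the deskew step can be sketched as: estimate the tilt of a text line, then rotate by the opposite angle. The version below fits a least-squares line through points sampled along one text baseline. This is a simplification chosen for readability; real preprocessing more commonly uses projection profiles or Hough transforms over the whole page, and the function names and sample points here are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def estimate_skew_degrees(baseline_points: List[Point]) -> float:
    """Angle (degrees) of the least-squares line through (x, y) baseline points."""
    n = len(baseline_points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    if sxx == 0:
        raise ValueError("all points share one x coordinate")
    # Slope is sxy / sxx; atan2 turns it into an angle.
    return math.degrees(math.atan2(sxy, sxx))


def rotate_point(point: Point, angle_degrees: float,
                 center: Point = (0.0, 0.0)) -> Point:
    """Rotate a point around a center by the given angle."""
    theta = math.radians(angle_degrees)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (
        center[0] + dx * math.cos(theta) - dy * math.sin(theta),
        center[1] + dx * math.sin(theta) + dy * math.cos(theta),
    )


# Hypothetical baseline samples from a line that drifts 5 px per 100 px.
points = [(0.0, 0.0), (100.0, 5.0), (200.0, 10.0), (300.0, 15.0)]
angle = estimate_skew_degrees(points)
# Rotating by the negative angle puts every point back on a horizontal line.
leveled = [rotate_point(p, -angle) for p in points]
```

In a full pipeline the same rotation would be applied to the image itself before text-line detection, so the detector sees roughly horizontal lines.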
Three standard metrics. Character error rate (CER): the character-level Levenshtein distance between the predicted and ground-truth strings, divided by the ground-truth length. Word error rate (WER): the same computation over words instead of characters. Field-level accuracy, for structured documents: the fraction of fields extracted exactly right.
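The first two metrics can be sketched in a few lines. This is a minimal reference implementation, assuming whitespace tokenization for WER and no text normalization (case, punctuation, Unicode forms), both of which evaluation suites usually specify explicitly. The function names are this sketch's own.

```python
from typing import Sequence


def levenshtein(reference: Sequence, hypothesis: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-split words / word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# One misread character ("1" read as "l") in a 13-character reference.
print(cer("total due 100", "total due l00"))  # 1/13, about 0.077
print(wer("total due 100", "total due l00"))  # 1/3, about 0.333
```

Note that both rates can exceed 1.0 when the prediction is much longer than the reference, because insertions count as errors.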
CER and WER are averages and can mislead: a model can have 5% CER and still miss every dollar amount on every invoice if its errors concentrate on numerals. For document AI, evaluate on the specific fields that matter for the downstream task. Public benchmarks such as FUNSD, SROIE, and IIIT-5K give comparable numbers for general-purpose evaluation.
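The gap between average CER and field-level accuracy is easy to reproduce on toy data. The sketch below uses two invented invoice strings (not taken from any benchmark) in which each OCR output has a single wrong character, located inside the total. The regex-based field extractor and all names are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        new = [i]
        for j, cb in enumerate(b, start=1):
            new.append(min(row[j] + 1, new[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new
    return row[-1]


def corpus_cer(pairs: List[Tuple[str, str]]) -> float:
    """Total edit distance divided by total ground-truth length."""
    errors = sum(edit_distance(truth, pred) for truth, pred in pairs)
    return errors / sum(len(truth) for truth, _ in pairs)


def extract_total(text: str) -> Optional[str]:
    """Hypothetical field extractor: the token following 'Total: $'."""
    match = re.search(r"Total: \$(\S+)", text)
    return match.group(1) if match else None


def field_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of documents whose extracted total matches exactly."""
    hits = sum(extract_total(t) == extract_total(p) for t, p in pairs)
    return hits / len(pairs)


# (ground truth, OCR output): invented examples, one wrong character each.
pairs = [
    ("Invoice 1001 from Acme Corp, payment due on receipt. Total: $1,250.00",
     "Invoice 1001 from Acme Corp, payment due on receipt. Total: $1,250.0O"),
    ("Invoice 1002 from Globex, net 30 days. Total: $87.40",
     "Invoice 1002 from Globex, net 30 days. Total: $81.40"),
]
print(f"CER: {corpus_cer(pairs):.3f}")
print(f"total-field accuracy: {field_accuracy(pairs):.2f}")
```

On this toy set the CER is well under 5% while the total-field accuracy is zero, which is the situation the paragraph above warns about.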
Where OCR fits in production
The standard document-AI pipeline is OCR → layout-aware chunking → embed → retrieve → answer. ColPali and similar late-interaction approaches over image patches skip OCR entirely — promising for layout-heavy content but immature. For most production workloads, the right answer is a strong dedicated OCR in front of a normal text-pipeline RAG. The OCR is the boundary; quality there sets the ceiling for everything else.
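The ceiling effect can be illustrated with a deliberately crude retriever. The sketch below stands in token overlap for the embed-and-retrieve stages, which is an assumption made to keep the example dependency-free; real embedding models tolerate some misspellings, but heavily mangled text degrades them too. The chunk texts are invented.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str]) -> List[str]:
    """Return chunks sharing at least one token with the query, best overlap first."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(chunk)), chunk) for chunk in chunks]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [chunk for score, chunk in ranked if score > 0]


# The same sentence as produced by a clean OCR pass and by a poor one.
clean_chunk = "Warranty period: 24 months from delivery."
mangled_chunk = "VVarranty perlod: 24 rnonths frorn dellvery."

print(retrieve("warranty months", [clean_chunk]))    # finds the chunk
print(retrieve("warranty months", [mangled_chunk]))  # finds nothing
```

No amount of tuning in the retrieval or answer stages recovers the second case, because the words the query needs never made it into the index.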
Go further
When should I use a dedicated OCR system vs a VLM?
Dedicated OCR (Tesseract, AWS Textract, Azure Document Intelligence) wins on pure text extraction at scale: lower latency, lower cost per page, predictable failure modes. VLMs win when you also need layout reasoning, table structure, contextual understanding, or selective extraction ('only the line items, ignore headers'). For high-volume invoice processing, use dedicated OCR. For variable, semi-structured document QA, use a VLM.
Three persistent failure modes. Handwriting (signatures, doctors' notes, historical documents) has limited training data and high variance. Low-quality scans (skew, shadow, photo of a screen) compound errors. And mixed-script or mixed-notation documents (Arabic + English in the same line, math notation, code) confuse most pipelines. Frontier VLMs handle the first two better than dedicated systems but still miss edge cases.
OCR is the front-end of any RAG system over scanned documents — invoices, contracts, research papers. The pipeline is OCR to text, then chunk, then embed, then retrieve. Quality of the OCR step caps the entire downstream retrieval. Modern document-aware approaches (ColPali) skip OCR entirely and embed image patches directly — promising for layout-heavy content but immature.