2026-05-13 · 9 min read
How to OCR a PDF: Extract Text from Scanned Files
You’ve been emailed a 60-page scanned contract. You try to search for “termination clause” — nothing. You try to copy a paragraph — you get a fuzzy image instead of text. The PDF is technically a stack of photographs of paper, not a digital document. What you need is OCR.
What OCR actually does
OCR (optical character recognition) is the process of reading pixels and guessing which characters they represent. A scanner produces a raster image — millions of black-and-white dots arranged into shapes you recognise as letters. OCR software walks across that image, segments it into glyph candidates, matches each glyph against a trained model, and emits a string of real Unicode characters. The output is then layered invisibly behind the original page image so the document still looks identical but is now searchable, selectable, and convertible.
Modern OCR is shockingly good. On clean, machine-printed English text, engines like Tesseract 5 and the cloud OCR APIs reach 99.5%+ character accuracy. On bad faxes, handwriting, or low-DPI scans of newsprint, that can drop below 90% — fine for searching, painful for a final document.
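To make the segment-then-match idea concrete, here is a deliberately tiny sketch. It is not how Tesseract works internally (Tesseract 4 and later use an LSTM neural network); the 3x5 bitmap "font", the function names, and the sample image are all invented for illustration. It splits an image at blank columns and picks the template with the fewest mismatched pixels.

```python
from typing import Dict, List

Glyph = List[str]  # rows of '#' (ink) and '.' (background)

# Hypothetical 3x5 bitmap templates, made up for this example.
TEMPLATES: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}


def segment(image: Glyph) -> List[Glyph]:
    """Split an image into glyph candidates at fully blank columns."""
    width = len(image[0])
    glyphs: List[Glyph] = []
    start = None
    for x in range(width + 1):
        # The position just past the right edge counts as blank.
        blank = x == width or all(row[x] == "." for row in image)
        if not blank and start is None:
            start = x
        elif blank and start is not None:
            glyphs.append([row[start:x] for row in image])
            start = None
    return glyphs


def classify(glyph: Glyph) -> str:
    """Return the template character with the fewest mismatched pixels."""
    def distance(template: Glyph) -> float:
        if len(template[0]) != len(glyph[0]):
            return float("inf")  # different width: not comparable here
        return sum(
            a != b
            for t_row, g_row in zip(template, glyph)
            for a, b in zip(t_row, g_row)
        )
    return min(TEMPLATES, key=lambda ch: distance(TEMPLATES[ch]))


def recognise(image: Glyph) -> str:
    """Segment the image and emit one character per glyph."""
    return "".join(classify(g) for g in segment(image))


# A sample "scan" of three glyphs separated by blank columns.
IMAGE = [
    "#...###.#..",
    "#...#.#.#..",
    "#...#.#.#..",
    "#...#.#.#..",
    "###.###.###",
]

if __name__ == "__main__":
    print(recognise(IMAGE))
```

Because matching is by nearest template rather than exact equality, a glyph with a flipped pixel or two still resolves to the closest character, which is the "guessing" part of OCR in miniature.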
Do you actually need OCR?
Quick test: open the PDF, hit Ctrl-F (or Cmd-F), search for any word you can see on the page. If the search finds it, the PDF already has a text layer — no OCR needed; just convert or extract directly. If the search returns nothing, you have a scan or image-only PDF and OCR is the right next step.
Another giveaway: try to select text with your mouse. If you get a rectangular “marquee” selection (like cropping a photo) rather than a text caret, the page is an image.
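If you have a folder of PDFs rather than one, the same Ctrl-F test can be scripted. This is a minimal sketch, assuming the third-party pypdf library is installed (pip install pypdf). The function names and the 20-character threshold are illustrative choices, not a standard.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars: int = 20) -> bool:
    """Return True when most pages carry almost no extractable text.

    min_chars is an arbitrary illustrative threshold: a page with fewer
    non-whitespace characters than this is treated as image-only.
    """
    texts = list(page_texts)
    if not texts:
        return True  # nothing extractable at all
    empty = sum(1 for t in texts if len("".join(t.split())) < min_chars)
    return empty > len(texts) / 2


def pdf_needs_ocr(path: str) -> bool:
    """Apply needs_ocr to a real file. Assumes pypdf is installed."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(path)
    return needs_ocr((page.extract_text() or "") for page in reader.pages)
```

Calling pdf_needs_ocr("contract.pdf") (a hypothetical file name) returns True for a pure scan and False for a PDF that already has a text layer. Mixed documents, where only some pages are scanned, are decided by the majority of pages.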
Method 1 — Your operating system already has OCR
Most people don’t realise this:
- macOS (Ventura+): Preview has built-in Live Text. Open the scanned PDF, hit the Edit Markup toolbar, choose “Recognize Text.” Or just Cmd-A to select — the text is already extractable.
- Windows 11: the Snipping Tool added a “Text Actions” button in 2023 that OCRs anything on screen. For full PDFs, OneNote (free) imports a PDF and runs OCR automatically: right-click the page and choose “Copy Text from Picture.”
- iOS / Android: the Files / Photos apps OCR images automatically. If your scan is a photo from your phone, the text is probably already extractable.
These built-in tools are fine for one-off pages. For bulk PDF processing, move on to method 2.
Method 2 — Tesseract (free, command line, very good)
Tesseract is the best-known free OCR engine. It was originally developed at HP, was sponsored by Google for many years, and is now maintained as a community open-source project. Pair it with ocrmypdf (a Python tool that drives Tesseract) and you have a one-line command that takes any image PDF and outputs a searchable PDF with the original visual layout preserved:
    # Install once
    brew install ocrmypdf   # or: pip install ocrmypdf

    # OCR a scanned PDF
    ocrmypdf input.pdf output.pdf

    # Multi-language (e.g. English + German)
    ocrmypdf -l eng+deu input.pdf output.pdf
Tesseract supports 100+ languages, each through its own traineddata file. Many installs ship with English only, so for other languages, including non-Latin scripts (Arabic, Chinese, Hindi), install the matching language pack or download the appropriate traineddata file. Performance scales with your CPU; a 50-page document typically takes 30–90 seconds on a modern laptop.
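For a whole folder of scans, the one-line command can be wrapped in a short script. This is a sketch under a few assumptions: ocrmypdf is on your PATH, the folder names in the usage note are hypothetical, and the helper names are made up. The --skip-text flag tells ocrmypdf to leave pages that already contain text alone rather than aborting on them.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_ocr_command(src: Path, dst: Path,
                      languages: Sequence[str] = ("eng",)) -> List[str]:
    """Build the ocrmypdf command line for one file."""
    return [
        "ocrmypdf",
        "-l", "+".join(languages),  # e.g. eng+deu
        "--skip-text",              # do not re-OCR pages that already have text
        str(src),
        str(dst),
    ]


def ocr_folder(in_dir: Path, out_dir: Path,
               languages: Sequence[str] = ("eng",)) -> None:
    """Run ocrmypdf on every PDF in in_dir, writing results to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.pdf")):
        cmd = build_ocr_command(src, out_dir / src.name, languages)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

For example, ocr_folder(Path("scans"), Path("searchable"), ["eng", "deu"]) would process every PDF in a scans folder with English and German models. Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.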
Method 3 — Browser-based OCR
For one-shot conversions without installing anything, in-browser OCR is the fastest path. Look for a tool that runs Tesseract or a similar engine compiled to WebAssembly so the document never leaves your machine. After OCR, you can extract plain text with a tool like our PDF to text extractor — also browser-side, also free.
Avoid services that require an upload for OCR if your document is sensitive. Insurance claims, medical records, and legal correspondence are all common OCR targets and all things you don’t want sitting on a third-party server.
Method 4 — Cloud OCR APIs (most accurate, paid)
For high-volume or high-stakes work — claims processing, archive digitisation, historical newspapers — the big cloud OCR APIs lead the accuracy charts:
| Provider | Approx. price | Notable strength |
|---|---|---|
| Google Document AI | $1.50 / 1,000 pages | Tables, forms, handwriting |
| AWS Textract | $1.50 / 1,000 pages | Form-field extraction |
| Azure Document Intelligence | $1.00 / 1,000 pages | Custom-trained models |
| ABBYY FineReader Cloud | $0.20 / page (premium) | Best for non-Latin scripts |
These also do layout reconstruction — preserving columns, tables, and reading order — which Tesseract is mediocre at. For 95% of consumer documents, free Tesseract is enough; for the other 5%, paying $1.50 per thousand pages buys real accuracy gains.
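To see what those prices mean for a real job, a few lines of arithmetic help. The sketch below copies the approximate prices from the table above (they may well have changed); the 25,000-page job size is just an example figure.

```python
from typing import Dict

# Approximate list prices from the table above, in US dollars per page.
PRICE_PER_PAGE: Dict[str, float] = {
    "Google Document AI": 1.50 / 1000,
    "AWS Textract": 1.50 / 1000,
    "Azure Document Intelligence": 1.00 / 1000,
    "ABBYY FineReader Cloud": 0.20,
}


def estimate_cost(pages: int) -> Dict[str, float]:
    """Return the estimated cost per provider, rounded to cents."""
    return {name: round(price * pages, 2)
            for name, price in PRICE_PER_PAGE.items()}


if __name__ == "__main__":
    for name, cost in estimate_cost(25_000).items():
        print(f"{name}: ${cost:,.2f}")
```

At 25,000 pages the per-thousand services come to $25 to $37.50, while a per-page rate of $0.20 comes to $5,000, so the pricing model matters far more than the small differences between the per-thousand providers.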
Tips for better OCR accuracy
- Scan at 300 DPI minimum. Below 200 DPI, accuracy falls off a cliff. 300 is the sweet spot; going to 600 rarely helps.
- Black-and-white, not colour. A bilevel scan removes colour noise that can confuse OCR engines. Most engines convert internally anyway.
- Deskew before OCR. Even 1° of rotation hurts accuracy. ocrmypdf --deskew straightens slightly crooked pages automatically, and ocrmypdf --rotate-pages fixes pages that were scanned sideways or upside down.
- Tell the engine the language. Default English models will guess wildly on French or Spanish accents.
- Spell-check the output. A 1% character error rate means about one character in every hundred is wrong, which is roughly one wrong word every sentence or two. Always proofread for anything important.
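The arithmetic behind the spell-check tip can be sketched in a few lines. It assumes character errors are independent of each other, and the averages of 5 letters per word and 17 words per sentence are illustrative guesses rather than measured values.

```python
def word_error_rate(char_accuracy: float, letters_per_word: float = 5.0) -> float:
    """Probability that a word contains at least one wrong character,
    assuming character errors are independent."""
    return 1.0 - char_accuracy ** letters_per_word


def wrong_words_per_sentence(char_accuracy: float,
                             letters_per_word: float = 5.0,
                             words_per_sentence: float = 17.0) -> float:
    """Expected number of wrong words in one sentence."""
    return word_error_rate(char_accuracy, letters_per_word) * words_per_sentence


if __name__ == "__main__":
    for acc in (0.995, 0.99, 0.90):
        print(f"{acc:.1%} chars correct -> "
              f"{word_error_rate(acc):.1%} of words wrong, "
              f"{wrong_words_per_sentence(acc):.2f} wrong words per sentence")
```

Under these assumptions, 99% character accuracy leaves roughly 5% of words with at least one error. The per-sentence figure depends heavily on the sentence length you assume, which is why proofreading matters even when the headline accuracy looks high.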
After OCR, what next?
Once your PDF has a real text layer, all the usual workflows are open again. Search works. Copy-paste works. You can convert it to an editable Word document with our guide on PDF to Word conversion, or extract the raw text directly using our PDF to text tool for use in a script, spreadsheet, or LLM prompt.
OCR is the bridge between paper and software. The free tools available in 2026 are easily good enough that nobody needs to retype a scanned document by hand again. If you’ve been doing it that way, today is a good day to stop. Start with our browser-based PDF tools and your scanned files become editable in under a minute.