PWPrivatePDFConvertPro

2026-05-13 · 9 min read

How to OCR a PDF: Extract Text from Scanned Files

You’ve been emailed a 60-page scanned contract. You try to search for “termination clause” — nothing. You try to copy a paragraph — you get a fuzzy image instead of text. The PDF is technically a stack of photographs of paper, not a digital document. What you need is OCR.

What OCR actually does

OCR (optical character recognition) is the process of reading pixels and guessing which characters they represent. A scanner produces a raster image — millions of black-and-white dots arranged into shapes you recognise as letters. OCR software walks across that image, segments it into glyph candidates, matches each glyph against a trained model, and emits a string of real Unicode characters. The output is then layered invisibly behind the original page image so the document still looks identical but is now searchable, selectable, and convertible.
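The steps above can be tried by hand on a single page. The sketch below is only an illustration, not what any particular product does internally. It assumes the open-source tesseract CLI and poppler's pdftoppm are installed; the function name ocr_page and its arguments are made up for this example.

```shell
# ocr_page <input.pdf> <page-number> <output-base>
# Hypothetical helper: rasterise one page, recognise it, write a searchable PDF.
ocr_page() {
  local pdf="$1" page="$2" base="$3"

  # 1. Rasterise: render the page to a 300 DPI PNG at <output-base>.png.
  pdftoppm -r 300 -png -singlefile -f "$page" -l "$page" "$pdf" "$base" || return 1

  # 2. Recognise: print the characters the engine thinks it sees.
  tesseract "$base.png" stdout || return 1

  # 3. Layer: write <output-base>.pdf, the page image with invisible text behind it.
  tesseract "$base.png" "$base" pdf
}
```

Calling ocr_page contract.pdf 1 /tmp/page1 (file names are placeholders) would print the recognised text of page 1 and leave /tmp/page1.pdf behind as a searchable single-page PDF.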

Modern OCR is remarkably good. On clean, machine-printed English text, engines like Tesseract 5 and the cloud OCR APIs can reach 99.5%+ character accuracy. On bad faxes, handwriting, or low-DPI scans of newsprint, that can drop below 90%, which is fine for searching but painful for a final document.
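Those percentages are easier to judge as error counts. Here is a quick sketch of the arithmetic, assuming a dense page holds roughly 3,000 characters (an illustrative figure, not one taken from a benchmark); expected_errors is a made-up helper name.

```shell
# expected_errors <characters> <accuracy-percent>
# Prints how many characters you would expect to be wrong, rounded.
expected_errors() {
  awk -v n="$1" -v acc="$2" 'BEGIN { printf "%.0f\n", n * (100 - acc) / 100 }'
}

expected_errors 3000 99.5   # prints 15
expected_errors 3000 90     # prints 300
```

Fifteen slips on a page rarely get in the way of a search; three hundred means nearly every line needs proofreading.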

Do you actually need OCR?

Quick test: open the PDF, hit Ctrl-F (or Cmd-F), search for any word you can see on the page. If the search finds it, the PDF already has a text layer — no OCR needed; just convert or extract directly. If the search returns nothing, you have a scan or image-only PDF and OCR is the right next step.

Another giveaway: try to select text with your mouse. If you get a rectangular “marquee” selection (like cropping a photo) rather than a text caret, the page is an image.
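The same check can be scripted, which helps when you have a whole folder of PDFs to triage. This is a rough sketch, assuming pdftotext from poppler-utils is installed; the function names and the 20-character threshold are arbitrary choices for illustration.

```shell
# Minimum number of non-whitespace characters that counts as "real text".
MIN_CHARS=20

# Succeeds when stdin holds at least MIN_CHARS non-whitespace characters.
has_enough_text() {
  local count
  count=$(tr -d '[:space:]' | wc -c)
  [ $((count)) -ge "$MIN_CHARS" ]
}

# Succeeds when the PDF's extractable text passes that threshold.
pdf_has_text_layer() {
  pdftotext "$1" - 2>/dev/null | has_enough_text
}

# Report on every PDF named on the command line (does nothing with no arguments).
for pdf in "$@"; do
  if pdf_has_text_layer "$pdf"; then
    echo "$pdf: has a text layer"
  else
    echo "$pdf: image-only, needs OCR"
  fi
done
```

A threshold is used rather than "any output at all" because an image-only PDF can still yield page breaks or stray whitespace when its text is extracted.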

Method 1 — Your operating system already has OCR

Most people don’t realise this:

These built-in tools are fine for one-off pages. For bulk PDF processing, move on to method 2.

Method 2 — Tesseract (free, command line, very good)

Tesseract is probably the best-known free OCR engine. It was originally developed at HP, was sponsored by Google for many years, and is now maintained as a community open-source project. Pair it with ocrmypdf (a Python command-line tool that drives Tesseract) and you have a one-line command that takes an image-only PDF and outputs a searchable PDF with the original visual layout preserved:

# Install once
brew install ocrmypdf  # or: pip install ocrmypdf (then install Tesseract and Ghostscript separately)

# OCR a scanned PDF
ocrmypdf input.pdf output.pdf

# Multi-language (e.g. English + German)
ocrmypdf -l eng+deu input.pdf output.pdf

Tesseract has trained models for 100+ languages, but a typical install ships with only a few (often just English). Every language you pass to -l, including Latin-script ones like German, needs its traineddata file installed; tesseract --list-langs shows what you have. Performance scales with your CPU; a 50-page document typically takes 30–90 seconds on a modern laptop.
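For the bulk case, the one-liner extends naturally to a loop. Here is a minimal sketch, assuming ocrmypdf is installed as above; ocr_folder and the folder names are hypothetical. The --skip-text option tells ocrmypdf to leave pages that already contain text alone instead of failing on them.

```shell
# ocr_folder <source-dir> <destination-dir>
# Runs ocrmypdf on every PDF in source-dir, writing results to destination-dir.
ocr_folder() {
  local src="$1" dst="$2" pdf failed=0
  mkdir -p "$dst"
  for pdf in "$src"/*.pdf; do
    [ -e "$pdf" ] || continue          # the glob matched nothing
    if ! ocrmypdf --skip-text "$pdf" "$dst/$(basename "$pdf")"; then
      echo "failed: $pdf" >&2          # keep going, but remember the failure
      failed=1
    fi
  done
  return "$failed"
}
```

Calling ocr_folder ./scans ./searchable leaves the originals untouched, writes one searchable copy per input file, and returns non-zero if any file failed.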

Method 3 — Browser-based OCR

For one-shot conversions without installing anything, in-browser OCR is the fastest path. Look for a tool that runs Tesseract or a similar engine compiled to WebAssembly so the document never leaves your machine. After OCR, you can extract plain text with a tool like our PDF to text extractor — also browser-side, also free.

Avoid services that require an upload for OCR if your document is sensitive. Insurance claims, medical records, and legal correspondence are all common OCR targets and all things you don’t want sitting on a third-party server.

Method 4 — Cloud OCR APIs (most accurate, paid)

For high-volume or high-stakes work — claims processing, archive digitisation, historical newspapers — the big cloud OCR APIs lead the accuracy charts:

Provider | Approx. price | Notable strength
Google Document AI | $1.50 / 1,000 pages | Tables, forms, handwriting
AWS Textract | $1.50 / 1,000 pages | Form-field extraction
Azure Document Intelligence | $1.00 / 1,000 pages | Custom-trained models
ABBYY FineReader Cloud | $0.20 / page (premium) | Best for non-Latin scripts

These also do layout reconstruction — preserving columns, tables, and reading order — which Tesseract is mediocre at. For 95% of consumer documents, free Tesseract is enough; for the other 5%, paying $1.50 per thousand pages buys real accuracy gains.
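To see where that trade-off falls for your own workload, the arithmetic is simple enough to script. This small sketch uses the approximate prices quoted above (real pricing varies by tier, region, and feature); ocr_cost is a made-up helper, and the 250,000-page archive is an illustrative figure.

```shell
# ocr_cost <pages> <price-in-dollars-per-1000-pages>
# Prints the estimated bill in dollars, to two decimal places.
ocr_cost() {
  awk -v pages="$1" -v price="$2" 'BEGIN { printf "%.2f\n", pages * price / 1000 }'
}

ocr_cost 60 1.50        # the 60-page contract: prints 0.09
ocr_cost 250000 1.50    # a 250,000-page archive: prints 375.00
ocr_cost 250000 200     # same archive at $0.20 per page: prints 50000.00
```

At per-thousand pricing a single contract costs pennies, so the real decision is usually about privacy and setup effort; per-page pricing only makes sense when the accuracy gain is worth it.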

Tips for better OCR accuracy

After OCR, what next?

Once your PDF has a real text layer, all the usual workflows are open again. Search works. Copy-paste works. You can convert it to an editable Word document with our guide on PDF to Word conversion, or extract the raw text directly using our PDF to text tool for use in a script, spreadsheet, or LLM prompt.

OCR is the bridge between paper and software. The free tools available in 2026 are easily good enough that nobody needs to retype a scanned document by hand again. If you’ve been doing it that way, today is a good day to stop. Start with our browser-based PDF tools and your scanned files become editable in under a minute.