How to OCR a scanned PDF to extract text (free, in-browser)

There are two kinds of PDFs that look identical on screen but behave completely differently under the hood. The first is a text PDF: real characters, selectable with the mouse, searchable in any PDF reader. Word, LaTeX, browsers, and modern authoring tools all produce text PDFs by default. The second is an image PDF: pages that are visually just pictures, with the "text" you see actually being pixels. Phone-scanned documents, old photocopier output, and PDFs assembled from screenshots all fall here. Image PDFs look fine but you can't select the text, you can't search them, you can't copy-paste sentences out. OCR is the operation that converts the second kind into the first kind.

Why this used to be hard

OCR (Optical Character Recognition) has been around for 60+ years, but until very recently the good engines were either expensive (Adobe's OCR ships in Acrobat Pro, at roughly $20/month) or required installing tooling (Tesseract is open-source and excellent, but the typical install path involves the command line). Online OCR services are fast, but they require uploading your document, which is the wrong tradeoff for the scanned contracts, tax forms, and personal correspondence that make up the bulk of "I need to OCR a PDF" cases.

The change in the last few years: Tesseract was ported to WebAssembly as Tesseract.js, which means the same Tesseract engine that ships in production OCR systems now runs entirely in a browser tab. The accuracy is the same. The privacy is dramatically better. The friction is one page-load.

Step-by-step

1. Open justfiletools.com/tools/pdf-ocr.
2. Drop a scanned PDF. If you're not sure whether your PDF is image-based, try PDF to Text first: empty pages in the output mean the PDF is image-based and you need OCR.
3. Pick the language. The dropdown has 10 common options (English, Spanish, French, German, Italian, Portuguese, Russian, Chinese Simplified, Japanese, Korean). For mixed-language documents, pick the dominant language.
4. Click Run OCR.
5. Wait. The first run downloads the Tesseract wasm binary (~2 MB) and the language data file (~10 MB per language) from a CDN. After that they cache for the rest of the session.
6. Watch the progress bar. Pages are processed sequentially. You can read partial output as pages complete; you don't need to wait for the full document.
7. Copy or download the result. The output is plain text with optional "--- Page N ---" markers between pages.
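
The sequential processing and page markers described in the last two steps can be sketched as a generator. This is only an illustration of the idea, not the tool's actual code: the names ocr_pages, recognize, and fake_recognize are hypothetical, and the recognizer is passed in as a plain function so the sketch runs without an OCR engine.

```python
from typing import Callable, Iterable, Iterator


def ocr_pages(
    pages: Iterable[bytes],
    recognize: Callable[[bytes], str],
    page_markers: bool = True,
) -> Iterator[str]:
    """Yield one chunk of output text per page, in page order."""
    for number, image in enumerate(pages, start=1):
        # Pages are handled one at a time; each result is yielded as soon
        # as it is ready, so a caller can show partial output immediately.
        text = recognize(image).strip()
        if page_markers:
            yield f"--- Page {number} ---\n{text}\n"
        else:
            yield f"{text}\n"


def fake_recognize(image: bytes) -> str:
    """Stand-in for a real OCR call (hypothetical, for demonstration only)."""
    return image.decode("utf-8").upper()


chunks = list(ocr_pages([b"first page", b"second page"], fake_recognize))
```

Because the function yields instead of returning one big string, a progress bar can advance once per chunk and the text for page 1 is readable while page 2 is still being recognized.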

Performance expectations

First run: 30–60 seconds for the wasm and language data to download, then 5–15 seconds per page for OCR itself. A 10-page document on first use takes roughly 1.5 to 3.5 minutes in total, depending on connection speed and page complexity.

Subsequent runs in the same session: 1–2 seconds startup (cached wasm), 5–15 seconds per page. A 10-page document takes roughly 1 to 2.5 minutes.
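
For other page counts, the same arithmetic is startup time plus pages times per-page time. A small sketch, using the ranges quoted above as defaults (the function name estimate_seconds is made up for this example):

```python
from typing import Tuple


def estimate_seconds(
    pages: int,
    startup: Tuple[int, int] = (30, 60),   # first-run download, in seconds
    per_page: Tuple[int, int] = (5, 15),   # OCR time per page, in seconds
) -> Tuple[int, int]:
    """Return the (fastest, slowest) total time in seconds."""
    low = startup[0] + pages * per_page[0]
    high = startup[1] + pages * per_page[1]
    return low, high


first_run = estimate_seconds(10)                    # (80, 210)
cached_run = estimate_seconds(10, startup=(1, 2))   # (51, 152)
```

Past a handful of pages the per-page cost dominates: the first-run download adds at most a minute, while each extra 10 pages adds between 50 and 150 seconds.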

OCR accuracy depends heavily on input quality. The best inputs are high-DPI (300+) scans of typed text in standard fonts (Times, Arial, Helvetica) on a high-contrast (dark on white) background. Phone photos work too, but with lower accuracy: they typically land at 200–400 effective DPI and often add perspective distortion and uneven lighting.
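
"Effective DPI" for a photo is just pixels divided by the physical size of the page. The sketch below is a rough back-of-the-envelope calculation, not something the tool computes; the function name, the fill parameter, and the example camera resolution (a 12-megapixel 3024 x 4032 photo) are assumptions for illustration.

```python
def effective_dpi(
    width_px: int,
    height_px: int,
    page_width_in: float,
    page_height_in: float,
    fill: float = 1.0,
) -> float:
    """Approximate pixels per inch on the page.

    fill is the fraction of the frame (per axis) that the page occupies;
    1.0 means the page exactly fills the image edge to edge.
    """
    # The axis with fewer pixels per inch is the limiting one.
    tightest = min(width_px / page_width_in, height_px / page_height_in)
    return tightest * fill


# US Letter (8.5 x 11 in) scanned on a flatbed at 2550 x 3300 pixels.
flatbed = effective_dpi(2550, 3300, 8.5, 11)

# The same page photographed at 3024 x 4032 pixels, filling the frame...
phone_tight = effective_dpi(3024, 4032, 8.5, 11)

# ...and with the page covering only 70% of the frame on each axis.
phone_loose = effective_dpi(3024, 4032, 8.5, 11, fill=0.7)
```

Under these assumptions the flatbed scan comes out at 300 DPI, the tightly framed photo at about 356, and the loosely framed one at about 249, which is why filling the frame with the page matters when photographing documents.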

What OCR gets right and wrong

Right: typeset documents (LaTeX, InDesign, Word) — 95–99% character accuracy. Standard print fonts. Reasonably-lit photos of clean text. Foreign-language text in the supported scripts. Standard punctuation and numerals.
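
To put the percentages in context: this guide doesn't say how its accuracy figures were measured, but a common definition of character accuracy is one minus the edit distance between the true text and the OCR output, divided by the length of the true text. A minimal sketch under that assumption (function names are illustrative):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """1.0 means a perfect match; clamped at 0.0 for very bad output."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


truth = "Total amount due: 1,250.00"
ocr = "Total arnount due: l,250.00"   # "m" read as "rn", "1" read as "l"
accuracy = character_accuracy(truth, ocr)
```

The example has 3 character errors in 26 characters, so it scores about 88%. At 99% character accuracy, a 2,000-character page still contains around 20 wrong characters, so proofreading names, amounts, and dates remains worthwhile.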

Imperfect: low-resolution scans. Photographed documents with perspective distortion. Documents with unusual fonts or stylized lettering. Multi-column layouts (column-jumping is a known Tesseract limitation — the output text may switch columns mid-paragraph). Tables (Tesseract flattens table cells into text rows without preserving column structure).

Wrong: handwriting. Tesseract was not designed for handwriting and the accuracy is poor (60–70%). For handwritten notes, dedicated handwriting-OCR tools (Microsoft OneNote, Google Lens) work much better. Heavily-stylized fonts (decorative typography, script fonts) also produce many errors.

Common pitfalls

Rotated scans. If your scan is upside-down or sideways, OCR results are unusable. Fix the orientation first with PDF Rotate, then run OCR.

Very low-quality scans. Old fax-quality scans (200 DPI, heavily compressed) push Tesseract beyond its sweet spot. Re-scan at higher DPI if you can.

PDFs that are partially text and partially image. Some PDFs mix digital text and scanned image regions, sometimes on the same page. OCR runs on the rendered page (which includes both), so the digital-text portions get re-OCR'd and may pick up minor errors where the original text was perfect. For these mixed PDFs, dual-pass processing (text extraction for pages with a text layer, OCR for image-only pages) is more reliable, but our tool doesn't do this auto-detection.
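
If you script this yourself, the dual-pass idea can be sketched as a per-page routing decision. This is a hypothetical outline, not a feature of the tool: dual_pass, ocr_page, and the 20-character threshold are assumptions chosen for the example, and the text extraction and OCR steps are passed in rather than tied to a specific library.

```python
from typing import Callable, List, Sequence


def dual_pass(
    embedded_text: Sequence[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[str]:
    """Keep embedded text where a page has it; OCR only the pages that don't.

    embedded_text[i] is whatever a text extractor returned for page i
    (empty or whitespace-only for image-based pages).
    ocr_page(i) runs OCR on page i and returns the recognized text.
    """
    results = []
    for index, text in enumerate(embedded_text):
        if len(text.strip()) >= min_chars:
            results.append(text)             # trust the original characters
        else:
            results.append(ocr_page(index))  # image-only page: fall back to OCR
    return results


# Demonstration with a stand-in OCR function that records which pages it saw.
ocr_calls: List[int] = []


def fake_ocr(index: int) -> str:
    ocr_calls.append(index)
    return f"[ocr page {index}]"


extracted = ["This page has a real text layer with plenty of characters.", "", "  \n"]
merged = dual_pass(extracted, fake_ocr)
```

Routing per page keeps perfect digital text untouched and avoids spending OCR time on it. It does not help when a single page contains both a text layer and a scanned region; that case needs region-level handling, which is beyond this sketch.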

Network failure during initial download. If your connection drops while the wasm or language data is being fetched on first use, the tool fails with a clear error. Retry; on the next attempt, partial downloads in the browser cache may speed things up.

Alternative approaches and when to use them

Adobe Acrobat Pro's OCR. Built-in feature, runs locally. Best quality available among consumer tools. Worth the subscription if you OCR frequently.
macOS Preview. macOS Monterey and later has built-in OCR via the "Live Text" feature: open a PDF in Preview, select text, and you can copy it even if the PDF is image-based. Free, but only on Mac.
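Tesseract at the command line. tesseract input.png output writes the recognized text to output.txt for an image (Tesseract appends the .txt extension itself). For PDFs, convert pages to images with pdftoppm and run tesseract on each one in a loop. Free, excellent for batch workflows.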
Google Drive's OCR-on-upload. Upload a PDF to Google Drive, right-click → Open with → Google Docs. Drive runs OCR and produces a Doc with the recognized text. Good quality, free, but your file goes to Google's servers.
Microsoft OneNote. Built-in OCR including handwriting. Right-click an inserted image to extract text. Great for handwriting; less convenient for multi-page PDFs.
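
For the command-line route, the pdftoppm + tesseract loop can be driven from a short script. This is a sketch that assumes both programs are installed and on your PATH; the helper names, the 300 DPI default, and the "page" filename prefix are choices made for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: Path, prefix: Path, dpi: int = 300) -> List[str]:
    # -r sets the render resolution, -png selects PNG output.
    # Output files are named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_command(image: Path, lang: str = "eng") -> List[str]:
    # The second argument is an output *base* name; Tesseract adds ".txt".
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def page_number(image: Path) -> int:
    # "page-07.png" -> 7, so pages sort numerically regardless of padding.
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "eng") -> str:
    """Render every page to PNG, OCR each one, and return the joined text."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)
    texts = []
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        subprocess.run(tesseract_command(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pdf(Path("scan.pdf"), Path("ocr_work")) would leave the page images and per-page text files in ocr_work and return the combined text. Everything stays on your machine, which makes this a good fit for batches of sensitive documents.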

Privacy considerations

Scanned documents are often exactly the documents you'd rather not upload: signed contracts, tax returns, medical records, ID copies, financial statements. Free online OCR services may retain copies of uploaded documents (for example, for service improvement) and may store recognized text for indexing. For sensitive material, the in-browser approach is the right answer.

Tesseract.js runs as WebAssembly entirely in your browser. The PDF pages are rendered to canvas via pdf.js (in-browser), the pixel data is fed to Tesseract (in-browser), and the recognized text is generated in your tab. The only network requests are the one-time fetches of the Tesseract wasm binary and language data files from a CDN; these are public model files that don't reveal anything about your document. Once they are cached, subsequent OCR jobs produce zero network traffic. You can verify this in the Network tab of your browser's DevTools.

Related tools and guides

PDF OCR — the tool this guide covers.
PDF to Text — for PDFs that already have text (no OCR needed).
PDF Rotate — fix scan orientation before OCR.
PDF to Image — extract pages as PNG/JPG.