Just File Tools

PDF OCR

Run OCR on a scanned PDF to extract text — runs Tesseract entirely in your browser

Drop a scanned PDF to OCR (image-based PDFs where text-extract returns empty).

Max file size: 50MB

Files are processed in your browser. Nothing is uploaded.

How to OCR a PDF Online

Run Optical Character Recognition (OCR) on a scanned PDF to extract text from image-based pages.

  1. Drop a PDF that's image-based (typically a scanned document). If you're not sure, try PDF to Text first — empty pages mean you need OCR.
  2. Pick the language. For documents in English, leave the default. For other languages, select from the dropdown; each language pack is a ~10 MB one-time download.
  3. Click Run OCR. The first run shows a 'loading Tesseract' status while the wasm binary and language data download (30–60 seconds). Subsequent runs in the same session skip this step.
  4. Each page is rendered to an image then OCR'd; the progress bar shows page N of total. Copy or download the result when complete.

About PDF OCR

Optical Character Recognition is the process of looking at an image of text and producing the actual text characters. It's a 60-year-old technology that's gotten dramatically better in the last decade thanks to neural-network approaches. Modern OCR engines achieve 95–99% accuracy on clean printed text, which makes them practical for everyday use: turning a phone photo of a recipe into editable text, extracting quotes from a scanned book, digitizing a stack of receipts.

This tool runs **Tesseract.js**, a JavaScript library that wraps a WebAssembly build of Tesseract, the Google-sponsored open-source OCR engine (the open-source standard since version 4 in 2018). It works entirely in your browser: no server, no upload, no cloud OCR service. The PDF is rendered page-by-page to a high-resolution image in your tab, then each image is fed to Tesseract, which returns the recognized text.

**The classic use case**: a scanned PDF where the text isn't really text. A 1970s contract that was photocopied so many times the original is lost. A receipt photographed with your phone. A whiteboard snapshot from a meeting. A book chapter you scanned for a class. In every case, the PDF visually shows text but the text isn't selectable, copyable, or searchable. OCR converts the pixels back into characters so the document becomes usable.

**Why a browser-based OCR tool?** Most online OCR services upload your PDF to their servers, run OCR there, and send back the text. For most users that's fine; for sensitive documents (private correspondence, medical records, legal documents, financial statements), it's a privacy hole. This tool runs the entire pipeline locally, so your PDF and its contents stay in your tab.

**The performance picture.** Tesseract.js needs to load two things on first use:

- **The Tesseract wasm binary** (~2 MB): the OCR engine itself.
- **A language data file** (~10 MB per language): trained models for character recognition in a specific script.

Both are loaded lazily on the first OCR run and cached by the browser for the rest of the session. So:

- **First run**: 30–60 seconds for downloads, then 5–15 seconds per page for OCR.
- **Subsequent runs (same session)**: 1–2 seconds startup, 5–15 seconds per page.
- **Subsequent runs (later sessions, after browser cache eviction)**: back to first-run cold-start times.

For a 10-page document on first use, those ranges work out to roughly 1.5 to 3.5 minutes in total, with 2–3 minutes typical. For the same document later in the same session, expect roughly 1 to 2.5 minutes.
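As a rough check on these figures, the startup and per-page ranges quoted above can be combined directly. The sketch below is plain arithmetic on those numbers; the function and constant names are illustrative and not part of the tool. The bounds are wider than a typical run, since it is unlikely that every page hits the fastest or the slowest case.

```typescript
// Timing ranges quoted in this section, in seconds. Names are illustrative.
interface Range { min: number; max: number; }

const COLD_START: Range = { min: 30, max: 60 }; // first run: wasm + language data download
const WARM_START: Range = { min: 1, max: 2 };   // later runs in the same session
const PER_PAGE: Range = { min: 5, max: 15 };    // OCR time per page

// Estimate total wall-clock time for a document with `pages` pages.
function estimateOcrSeconds(pages: number, firstRun: boolean): Range {
  const startup = firstRun ? COLD_START : WARM_START;
  return {
    min: startup.min + pages * PER_PAGE.min,
    max: startup.max + pages * PER_PAGE.max,
  };
}

const cold = estimateOcrSeconds(10, true);  // { min: 80, max: 210 }
const warm = estimateOcrSeconds(10, false); // { min: 51, max: 152 }
```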

**Accuracy depends heavily on input quality.** The best inputs are:

- **High DPI**: 300 dpi minimum, 600 dpi ideal. Phone photos are typically 1500–3000 px wide, which translates to roughly 175–350 dpi for letter-size documents. That is usable but not ideal.
- **Flat scan**: paper-against-glass scans beat handheld photos because there's no perspective distortion and the lighting is consistent.
- **High contrast**: dark text on a white background. Colored text or text on a textured background reduces accuracy.
- **Standard fonts**: Times, Arial, Helvetica, Calibri, Cambria, Georgia, or anything else Tesseract has seen a lot of. Decorative fonts, handwriting fonts, and stylized lettering all degrade accuracy.
- **Single-column layout**: as with the PDF to Text tool, output from multi-column pages can jump between columns or pull sidebar text inline.

For documents that don't meet these criteria, the output is still usable — you just have more cleanup work to do.
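The DPI figures for phone photos follow from dividing the image width in pixels by the page width in inches. The sketch below works this through for a US letter page (8.5 in wide), under the assumption that the photo frames the full page width with no margin; any margin around the page lowers the effective resolution further. The helper names are illustrative.

```typescript
// Effective resolution: pixels across the page divided by the page width in inches.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  return pixelWidth / pageWidthInches;
}

// Pixel width needed to reach a target DPI on a page of the given width.
function pixelsForDpi(targetDpi: number, pageWidthInches: number): number {
  return Math.ceil(targetDpi * pageWidthInches);
}

const LETTER_WIDTH_IN = 8.5; // US letter, portrait orientation

const low = Math.round(effectiveDpi(1500, LETTER_WIDTH_IN));  // 176
const high = Math.round(effectiveDpi(3000, LETTER_WIDTH_IN)); // 353
const needed300 = pixelsForDpi(300, LETTER_WIDTH_IN);         // 2550
const needed600 = pixelsForDpi(600, LETTER_WIDTH_IN);         // 5100
```

In other words, a photo needs to be at least about 2550 px across the page to reach the 300 dpi minimum.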

**Language support.** Ten languages are exposed in the UI: English, Spanish, French, German, Italian, Portuguese, Russian, Chinese Simplified, Japanese, Korean. Tesseract itself supports 100+ languages (Arabic, Hebrew, Thai, Vietnamese, Hindi, and many regional scripts); these aren't in the UI but could be added if there's demand. For mixed-language documents, pick the dominant language; Tesseract handles occasional foreign words reasonably well as long as they're in a script similar to the selected language.

**Progress and partial results.** Pages are processed sequentially, one at a time. The tool shows a progress bar (page 3 of 12) and updates the result textarea as each page completes, so you can read partial output while later pages are still being processed. If the first few pages give you what you need, you don't have to wait for the rest.
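The sequential, report-as-you-go pattern described here can be sketched as a simple loop. This is not the tool's actual code: `recognizePage` is a hypothetical stand-in for "render page N and run OCR on it", and `onProgress` is a hypothetical callback that would update the progress bar and the textarea.

```typescript
// Hypothetical callback types, introduced only for this illustration.
type RecognizePage = (pageNumber: number) => Promise<string>;
type OnProgress = (done: number, total: number, textSoFar: string[]) => void;

// Process pages one at a time, reporting partial results after each page.
async function ocrSequentially(
  pageCount: number,
  recognizePage: RecognizePage,
  onProgress: OnProgress,
): Promise<string[]> {
  const results: string[] = [];
  for (let page = 1; page <= pageCount; page++) {
    // Awaiting inside the loop keeps only one page in flight at a time.
    results.push(await recognizePage(page));
    // Pass a copy so the caller cannot mutate the accumulated results.
    onProgress(page, pageCount, results.slice());
  }
  return results;
}
```

Awaiting each page before starting the next trades throughput for a bounded memory footprint and a steady stream of partial output.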

**Output format.** Plain text in the textarea, with optional `--- Page N ---` separators between pages. Copy to clipboard or download as a `.txt` file. The output is UTF-8 — works for Cyrillic, CJK characters, accented Latin, anything Tesseract recognizes.
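A minimal sketch of how per-page text could be joined with the optional separators. The separator string comes from the description above; placing it on its own line before each page and leaving a blank line between pages are assumptions about the layout, and `formatOutput` is an illustrative name.

```typescript
// Join per-page text into one string, optionally prefixing each page
// with a `--- Page N ---` separator line (pages are numbered from 1).
function formatOutput(pages: string[], withSeparators: boolean): string {
  return pages
    .map((text, i) =>
      withSeparators ? "--- Page " + (i + 1) + " ---\n" + text : text,
    )
    .join("\n\n");
}

const sample = formatOutput(["First page text", "Second page text"], true);
// sample is:
// --- Page 1 ---
// First page text
//
// --- Page 2 ---
// Second page text
```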

**Privacy.** Tesseract.js runs entirely in your browser as WebAssembly. The PDF is rendered locally via pdf.js, the image data is fed to Tesseract locally, the text is generated locally. The only network requests are the initial CDN fetches of the Tesseract wasm and language data — these are cached after first load. Subsequent OCR jobs produce zero network traffic. Verify in DevTools.

**Edge cases handled:**

- **Network failure during initial download**: a clear error message indicates that the CDN fetch failed, and you can retry. Already-cached language packs don't re-download.
- **Pages that are partially scanned and partially digital text**: OCR runs on the rendered page image and recognizes whatever pixels are there. That includes digital text on a partially scanned page, because rendered digital text is just an image too.
- **Pages with mostly images and a small text caption**: OCR will pick up the caption text and ignore the image regions.
- **Rotated pages**: rendered with rotation applied, so OCR sees the upright orientation. (If your scan is upside-down, fix it first with PDF Rotate.)
- **Multi-column scans**: column-jumping is common in OCR output; this is a Tesseract limitation, not a tool bug.

**What it doesn't do**: form-field recognition, table-structure recognition, math-formula recognition, handwriting (Tesseract is poor at handwriting), or per-character confidence scoring. More specialized tools exist for all of those; Tesseract is the practical choice for general-purpose printed-text OCR.

Frequently Asked Questions

When do I need OCR vs the PDF to Text tool?

**PDF to Text** works on PDFs where the text is real text — selectable, copyable, native characters in the file. Most documents created in Word, LaTeX, Pages, or any modern authoring tool fall in this category. **PDF OCR** works on PDFs where the 'text' is actually pixels — scanned paper documents, photographed receipts, photographed whiteboards, any PDF that wraps an image. If you try PDF to Text and get empty pages, that's the signal — switch to OCR.
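The "empty pages mean OCR" rule can be expressed as a small check over the text extracted from each page. This is a sketch under assumptions: `pageTexts` is a hypothetical array holding one extracted string per page, and treating whitespace-only pages as empty is a choice made here, not something the tool documents.

```typescript
// True when every page came back empty (or whitespace only) from text extraction.
function needsOcr(pageTexts: string[]): boolean {
  // A document with no pages gives no signal either way; treat it as not needing OCR.
  if (pageTexts.length === 0) return false;
  return pageTexts.every((text) => text.trim().length === 0);
}

// Page numbers (starting at 1) that came back empty, useful for mixed documents.
function emptyPages(pageTexts: string[]): number[] {
  const result: number[] = [];
  pageTexts.forEach((text, i) => {
    if (text.trim().length === 0) result.push(i + 1);
  });
  return result;
}

const scanned = needsOcr(["", "  \n"]);          // true
const digital = needsOcr(["Hello world", ""]);   // false
const mixed = emptyPages(["Hello", "", "  "]);   // [2, 3]
```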

Why is the first run slow?

Because Tesseract.js (the OCR engine) needs to download two things on first use: the Tesseract wasm binary (~2 MB) and the language data file (~10 MB per language). After the first run, both are cached by the browser for the rest of the session — subsequent OCR jobs in the same tab skip the download. So the first PDF takes 30–60 seconds to start; subsequent PDFs start in 1–2 seconds.

How accurate is the OCR?

For clean, high-resolution scans of standard-language text, **95–99%** character accuracy. For low-resolution or noisy scans, accuracy drops to 80–90%. For handwriting, accuracy is poor — Tesseract isn't designed for handwriting. For old or low-contrast documents, expect significant cleanup work after OCR. The best quality input is a flat scan at 300 dpi or higher of typed text; phone photos are usable but less reliable due to perspective distortion, focus issues, and uneven lighting.
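To put these percentages in concrete terms, character accuracy converts directly into an expected number of wrong characters per page. The 2,000-character page below is an assumed, illustrative size, and `expectedErrors` is an illustrative helper.

```typescript
// Expected number of misrecognized characters for a text of `charCount`
// characters at a given character accuracy (a fraction between 0 and 1).
function expectedErrors(charCount: number, accuracy: number): number {
  return Math.round(charCount * (1 - accuracy));
}

const PAGE_CHARS = 2000; // assumed page size, for illustration only

const cleanBest = expectedErrors(PAGE_CHARS, 0.99);  // 20
const cleanWorst = expectedErrors(PAGE_CHARS, 0.95); // 100
const noisyWorst = expectedErrors(PAGE_CHARS, 0.8);  // 400
```

Even at 99% accuracy, a page of that size still carries around 20 errors, which is why some cleanup is expected on every document.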

Which languages does it support?

Ten common ones in the UI: English, Spanish, French, German, Italian, Portuguese, Russian, Chinese (Simplified), Japanese, Korean. Tesseract itself supports 100+ languages (Arabic, Hebrew, Thai, Vietnamese, Hindi, etc.) — they just aren't exposed in this UI to keep the interface manageable. For mixed-language documents, pick the dominant language; the OCR handles occasional foreign words and proper nouns reasonably well.

Is OCR good enough to replace re-typing?

For most use cases, yes. The output is plain text you can clean up in a few minutes — fix the inevitable typos, add punctuation that got dropped, restore tables that flattened. The time savings vs typing a 10-page document from scratch are dramatic. For perfect fidelity (legal contracts, scientific papers with equations, anything where errors matter), expect significant post-OCR cleanup; for general-purpose 'I need the text out of this scan', OCR is excellent.

Why process page by page instead of the whole PDF at once?

Two reasons. **Memory** — OCR of a large page at high DPI uses meaningful memory; doing all pages simultaneously could exhaust the tab. **Progress** — you can see results as they complete rather than waiting in silence for everything. The tool processes pages sequentially, showing a progress bar and accumulated results. You can stop early if you have what you need from the first few pages.
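A back-of-the-envelope calculation shows why memory matters. The sketch assumes a US letter page rendered at 300 dpi into an uncompressed RGBA bitmap (4 bytes per pixel); the resolution the tool actually renders at is not stated here, and the OCR engine's own working memory comes on top of this.

```typescript
// Bytes needed for an uncompressed RGBA bitmap of a page rendered at `dpi`.
function pageBitmapBytes(widthInches: number, heightInches: number, dpi: number): number {
  const widthPx = Math.round(widthInches * dpi);
  const heightPx = Math.round(heightInches * dpi);
  return widthPx * heightPx * 4; // 4 bytes per pixel: R, G, B, A
}

const onePage = pageBitmapBytes(8.5, 11, 300); // 33,660,000 bytes, about 32 MiB
const fiftyPages = onePage * 50;               // 1,683,000,000 bytes, about 1.6 GiB
```

Holding one page at a time keeps the bitmap cost near 32 MiB under these assumptions, whereas rendering a 50-page document all at once would need well over a gigabyte for the bitmaps alone.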

Does my PDF get sent to a server?

No. Tesseract.js runs the Tesseract C++ engine compiled to WebAssembly, entirely in your browser. The PDF pages are rendered to canvas via pdf.js (also in-browser), the pixel data is fed to Tesseract (in-browser), and the recognized text is displayed in the tab. The only network requests are the one-time fetches of the Tesseract wasm binary and the language data file from a CDN; once those are cached, there is no further network traffic. You can verify this in your browser's DevTools.