PDF to Text
Extract plain text from any text-based PDF — clean output ready to paste or process
Drop a PDF to extract its text content.
Max file size: 50MB
How to Convert PDF to Text Online
Pull all readable text out of a PDF in one click — gives you plain text you can paste, search, or feed to another tool.
- Drop a PDF. The tool extracts text from every page using pdf.js.
- Toggle 'Include page markers' to add '--- Page N ---' separators between pages, useful for downstream search-by-page or per-page processing.
- Copy to clipboard or download as a .txt file.
- If many pages come back empty (especially for old scans), the PDF is probably image-based. Use PDF OCR instead.
About PDF to Text
Most PDFs have real text inside them — the kind you can select with your mouse, copy, paste. The catch is that PDF stores text in a format optimized for printing, not for reading by machines. Extracting the text means parsing the PDF's drawing operations, walking the text-rendering commands, and decoding font character maps. This tool does that with **pdf.js** (Mozilla's reference PDF parser), runs entirely in your browser, and gives you plain text back.
The use cases are practical:
- **Quoting from a paper.** You want a sentence from page 7 of a 30-page PDF research paper. The PDF viewer's copy-paste sometimes inserts weird line breaks or misses characters; this tool gives you clean text for the whole document, easy to grep through.
- **Feeding a PDF to an LLM.** ChatGPT, Claude, Gemini all accept text input. Some accept PDF directly but with size limits and weird behavior. Extracting to text and pasting gives you exact control.
- **Searching a PDF that isn't full-text-indexed.** Some PDF readers don't support search; some search is broken for technical reasons. Plain text is greppable and copy-pasteable.
- **Reformatting.** You want the content of a marketing PDF as a blog post. Extract the text, edit in your editor of choice, ignore the PDF's layout entirely.
- **Accessibility.** Screen readers work better with plain text than with structured PDF. Extracting first can be a workaround.
The implementation uses pdf.js's `getTextContent` API, which returns a list of text items per page with their positions. The tool concatenates the items with single-space separators and collapses runs of whitespace — the result is plain prose that approximates the original document's reading flow.
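The joining step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the `TextItemLike` interface and the `itemsToPlainText` name are hypothetical, and only the `str` field of each pdf.js text item is assumed.

```typescript
// Minimal shape of a pdf.js text item; only `str` is used in this sketch.
interface TextItemLike {
  str: string;
}

// Join one page's text items with single spaces, then collapse any
// run of whitespace (including newlines and tabs) into one space.
function itemsToPlainText(items: TextItemLike[]): string {
  return items
    .map((item) => item.str)
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
}
```

In a browser, the input would typically be the `items` array from `await page.getTextContent()`, called once per page.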
**What it does well.** Documents typeset with LaTeX, InDesign, Word, Pages, Google Docs, and most modern PDF generators come out clean. Sequential single-column layouts (most reports, papers, books, contracts) preserve reading order. Non-Latin scripts (Chinese, Japanese, Korean, Arabic, Hebrew) decode correctly through the embedded font character maps. Special characters and ligatures (é, ñ, æ, fi, fl) come through as Unicode.
**Where it gets messy.** Multi-column layouts can jump between columns unexpectedly because PDF stores text in drawing order, not reading order. Tables flatten into rows of cell text without column structure. Sidebars, callout boxes, and footnotes interrupt the main reading flow at the position where the PDF chose to draw them. Math equations stored as embedded fonts may emit characters that look right but render in the wrong font in plain text. Scanned image-based PDFs (where the 'text' is actually pixels) return empty — you need OCR for those.
**The empty-page heuristic.** The tool counts pages that return zero extracted text and surfaces a warning when several pages are empty. This is almost always a signal that the PDF is scanned (image-based) rather than digitally typeset. Modern phones produce image-based PDFs when you "scan" a document with the camera app; old desktop scanners do the same. For image-based PDFs, the right tool is **PDF OCR**, which uses Tesseract.js to recognize the text from the pixels. Both tools are on this site; the empty-page warning links to the OCR tool.
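A heuristic of this kind might look like the sketch below. The function name, the report shape, and the default threshold of three empty pages are all assumptions made for illustration; the text above only says the warning appears when "several" pages are empty.

```typescript
interface EmptyPageReport {
  emptyPages: number[];   // 1-based page numbers with no extracted text
  likelyScanned: boolean; // true when the warning should be shown
}

// Count pages whose extracted text is empty or whitespace-only.
// `minEmptyForWarning` is a hypothetical threshold, not the tool's real value.
function findEmptyPages(pageTexts: string[], minEmptyForWarning = 3): EmptyPageReport {
  const emptyPages: number[] = [];
  pageTexts.forEach((text, index) => {
    if (text.trim().length === 0) {
      emptyPages.push(index + 1);
    }
  });
  return {
    emptyPages,
    likelyScanned: emptyPages.length >= minEmptyForWarning,
  };
}
```

Treating whitespace-only pages as empty is a design choice in this sketch: a page that yields only stray spaces is no more useful than one that yields nothing.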
**Page markers.** Optional `--- Page N ---` separators between pages, useful when you want to preserve the source-page information for downstream processing (e.g., quoting back 'from page 17'). Turn them off for documents you want as flat prose.
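As a rough sketch of how the marker toggle could work, assuming each page's text has already been extracted into an array: the `joinPages` name, the blank line between pages, and the choice to put a marker ahead of every page (including the first) are assumptions, while the `--- Page N ---` format comes from the description above.

```typescript
// Combine per-page text into one string, optionally prefixing each
// page with a "--- Page N ---" marker line (N is 1-based).
function joinPages(pageTexts: string[], includeMarkers: boolean): string {
  if (!includeMarkers) {
    return pageTexts.join("\n\n");
  }
  return pageTexts
    .map((text, index) => `--- Page ${index + 1} ---\n${text}`)
    .join("\n\n");
}
```

Downstream code can then split on the marker pattern to recover the text of a specific page.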
**Output options.** Copy to clipboard (instant) or download as a `.txt` file (with the original filename and `.txt` extension). Both produce the same text; the file option is useful for very large extractions that you'd rather save than hold in the clipboard.
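Deriving the download name can be sketched in a few lines. The `toTxtFilename` helper is hypothetical, and the handling of names that lack a `.pdf` extension (simply appending `.txt`) is an assumption.

```typescript
// Swap a trailing ".pdf" (any letter case) for ".txt".
// Names without a ".pdf" suffix just get ".txt" appended.
function toTxtFilename(pdfName: string): string {
  const base = pdfName.replace(/\.pdf$/i, "");
  return `${base}.txt`;
}
```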
**Performance.** Extraction time grows roughly linearly with PDF size. A 100-page document with ~500 words per page extracts in a couple of seconds. A 1000-page book takes 15–20 seconds in most browsers. The bottleneck is pdf.js's per-page parse; the cost of concatenating the text is negligible by comparison.
**Privacy.** pdf.js runs entirely in the browser. The PDF is read into memory and parsed locally, with no server-side processing, no upload, and no cloud. To verify, open DevTools and watch the network panel: the only request is the one-time fetch of `pdf.worker.min.mjs` from the same origin as this page (cached after first load). Subsequent extractions produce zero traffic.
**Edge cases handled:** Unicode text in every script (Chinese, Arabic, etc.); PDFs with embedded fonts; PDFs with characters spanning multiple Unicode code points (combining diacriticals, complex emoji); pages with no extractable text (correctly reported as empty); very large PDFs (streaming-friendly extraction); PDFs with missing fonts (text is still extracted via the font CMap fallback); rotated text (extracted in drawing order, position-aware).
**What this tool deliberately doesn't do**: preserve layout. The output is plain text, not a Word document or a re-flowed PDF. If you need the layout, you want a different operation: PDF to Word conversion is a heavier process that requires structural recognition (and is generally lossy because PDF doesn't preserve true semantic structure). If you just want the words, this is the right tool.
Frequently Asked Questions
Why is the extracted text in a different order than the PDF?
Because PDFs store text in **drawing order**, not reading order. A PDF page is technically a sequence of 'draw this text at coordinates (x, y)' operations — the renderer can put any text anywhere on the page in any order. For most PDFs (a single column of normal text), drawing order and reading order match. But for **multi-column layouts, sidebars, tables with text overflow, or scientific papers with figures**, the drawing order can be unpredictable. The extracted text might jump from column 1 to column 2 to the page footer back to column 1 in a way that doesn't match how a human reads. There's no perfect fix — the PDF format genuinely doesn't preserve reading order.
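To make the distinction concrete, here is a small sketch that reorders positioned items top-to-bottom and left-to-right. It is not something this tool is stated to do, and it only helps on single-column pages whose drawing order is scrambled; on a multi-column page it would interleave the columns. The `PlacedItem` shape, the function name, and the line tolerance are hypothetical; in pdf.js the coordinates would usually come from `transform[4]` (x) and `transform[5]` (y) of each text item.

```typescript
interface PlacedItem {
  str: string;
  x: number; // left edge of the item
  y: number; // baseline; PDF y coordinates grow upward from the bottom
}

// Group items into lines by similar y, then order each line by x.
// Returns one string per line, top of the page first.
function sortIntoReadingOrder(items: PlacedItem[], lineTolerance = 2): string[] {
  // Largest y first, because the top of the page has the largest y.
  const byY = items.slice().sort((a, b) => b.y - a.y);
  const lines: PlacedItem[][] = [];
  for (const item of byY) {
    const current = lines.length > 0 ? lines[lines.length - 1] : undefined;
    const first = current !== undefined ? current[0] : undefined;
    if (current !== undefined && first !== undefined &&
        Math.abs(first.y - item.y) <= lineTolerance) {
      current.push(item); // same line as the previous item
    } else {
      lines.push([item]); // start a new line
    }
  }
  return lines.map((line) =>
    line.slice().sort((a, b) => a.x - b.x).map((item) => item.str).join(" ")
  );
}
```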
Why are some pages empty?
Most likely because those pages are scanned images (photos of paper, not real text). PDFs can contain image-only pages where the visible text is actually pixels, not characters. Text extraction returns nothing because there's no text to extract — just pictures. For scanned PDFs, you need **OCR** (Optical Character Recognition), which converts the image of text back into characters. The PDF OCR tool on this site does exactly that, running Tesseract.js entirely in your browser.
What about tables?
Tables in PDF are stored as drawing operations for individual cells, not as table structure. The extracted text concatenates cell contents in drawing order, which usually but not always matches row-by-row reading. The result is text that looks like rows of values separated by spaces — often readable but not structured. For real table extraction (preserving row/column structure as CSV or JSON), you need a dedicated PDF table tool. For most uses, the flat text output is enough.
Why are some spaces missing between words?
Because PDF doesn't always store explicit spaces between words. Sometimes 'Hello world' is two text-draw operations 'Hello' at x=10 and 'world' at x=60 with no space character — the visual space is implied by the position. The extractor uses heuristics to insert spaces where the positions imply them, but the heuristics aren't perfect. For typeset documents (LaTeX, InDesign output) the result is usually clean; for some legacy authoring tools the spacing is unreliable.
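A position-based spacing heuristic of the kind described can be sketched as below. This is an illustration under stated assumptions, not the extractor's actual logic: the `PositionedItem` shape, the `joinLineWithGaps` name, and the `minGap` threshold are hypothetical, and the items are assumed to sit on one horizontal, unrotated line in left-to-right order.

```typescript
interface PositionedItem {
  str: string;
  x: number;     // left edge of the item
  width: number; // horizontal extent, in the same units as x
}

// Concatenate items on one line, inserting a space wherever the
// horizontal gap between consecutive items exceeds `minGap`.
function joinLineWithGaps(items: PositionedItem[], minGap = 1.0): string {
  let out = "";
  let prevEnd: number | null = null;
  for (const item of items) {
    const gapIsWide = prevEnd !== null && item.x - prevEnd > minGap;
    const alreadySpaced = out.slice(-1) === " " || item.str.charAt(0) === " ";
    if (gapIsWide && !alreadySpaced) {
      out += " ";
    }
    out += item.str;
    prevEnd = item.x + item.width;
  }
  return out;
}
```

With 'Hello' at x=10 and 'world' at x=60, and an assumed width of 30 units for 'Hello', the gap is 20 units and a space is inserted. A fixed threshold is the weak point: a real heuristic would more plausibly scale it with font size, which is where legacy documents with unusual spacing tend to go wrong.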
Does it handle Unicode (Chinese, Arabic, emoji)?
Yes. pdf.js decodes the text using the font's character map and emits Unicode strings. Chinese, Japanese, Korean, Arabic, Hebrew, and emoji all extract correctly as long as the PDF embeds the appropriate font information (which it almost always does). The textarea shows the Unicode text directly, and the downloaded .txt file is encoded as UTF-8.
Can it extract from password-protected PDFs?
Not in this tool. pdf.js itself supports password-protected PDFs, but exposing a password prompt here would complicate the no-config-needed experience. Decrypt the PDF externally first (Adobe Reader can save an unencrypted copy), then run the result through this tool.
Is the PDF uploaded to a server?
No. pdf.js is a JavaScript library running in your browser. The PDF is read into memory via the File API, parsed locally, and text is extracted in-tab. To verify, open DevTools: the network panel shows only the initial pdf.worker fetch (made once, then served from the browser cache) and nothing else.