Just File Tools

Unicode Inspector

Inspect each codepoint in any text — name, escape forms, invisible-character detection

11 codepoints, 11 UTF-16 code units, 13 UTF-8 bytes, 1 invisible char
With invisible characters shown as badges:
Word[U+200B]Joiner
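| # | Char | Codepoint | Name | Category | Block | JS escape | HTML | UTF-8 |
|---|---|---|---|---|---|---|---|---|
| 0 | W | U+0057 | | letter (Latin) | Basic Latin (ASCII) | | | 57 |
| 1 | o | U+006F | | letter (Latin) | Basic Latin (ASCII) | | | 6F |
| 2 | r | U+0072 | | letter (Latin) | Basic Latin (ASCII) | | | 72 |
| 3 | d | U+0064 | | letter (Latin) | Basic Latin (ASCII) | | | 64 |
| 4 | (invisible) | U+200B | ZERO WIDTH SPACE | invisible | General Punctuation | | | E2 80 8B |
| 5 | J | U+004A | | letter (Latin) | Basic Latin (ASCII) | | | 4A |
| 6 | o | U+006F | | letter (Latin) | Basic Latin (ASCII) | | | 6F |
| 7 | i | U+0069 | | letter (Latin) | Basic Latin (ASCII) | | | 69 |
| 8 | n | U+006E | | letter (Latin) | Basic Latin (ASCII) | | | 6E |
| 9 | e | U+0065 | | letter (Latin) | Basic Latin (ASCII) | | | 65 |
| 10 | r | U+0072 | | letter (Latin) | Basic Latin (ASCII) | | | 72 |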

How to Use Unicode Inspector Online

Paste any text to see each Unicode codepoint with its name, block, category, and escape forms. The inspector surfaces the invisible characters that cause bugs.

  1. Paste text into the input field. The codepoint table updates live.
  2. Look at the stats row: codepoints, UTF-16 units, UTF-8 bytes, invisible character count, astral plane count.
  3. If invisible chars are present, the highlighted preview shows them as inline badges with codepoint hex.
  4. Click any codepoint, JS escape, or HTML entity to copy it. Click the example dropdown to load common gotcha-prone inputs.

About Unicode Inspector

Text is more complicated than it looks. The string `café` could be four codepoints (U+0063 U+0061 U+0066 U+00E9) or five (U+0063 U+0061 U+0066 U+0065 U+0301, where the last is a combining acute accent applied to a separate 'e'). Both render identically on screen. They compare unequal under `===` in JavaScript. They produce different bytes when serialized. Depending on the database's collation, a UNIQUE constraint may treat them as two distinct values. The user can't tell which they have until something breaks; the Unicode Inspector is what you use to find out.
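The `café` case can be reproduced in a few lines. This is a minimal sketch using the standard `String.prototype.normalize` method; the variable names are illustrative, and nothing here is taken from the inspector's own source.

```javascript
// Two spellings of "café": precomposed é (U+00E9) versus e + combining acute (U+0301).
const precomposed = "caf\u00E9";
const decomposed = "cafe\u0301";

console.log(precomposed === decomposed); // false
console.log([...precomposed].length);    // 4 codepoints
console.log([...decomposed].length);     // 5 codepoints

// Normalizing both sides to the same form (NFC here) makes them comparable.
const sameAfterNfc =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");
console.log(sameAfterNfc);               // true
```

Normalizing at the boundary (on input, before storage or comparison) is one common way to avoid the mismatch; whether NFC or NFD is appropriate depends on the system you are feeding.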

The tool's primary job is to surface **invisible characters** — codepoints that take up zero or near-zero visual space but exist in the text and affect everything downstream of rendering. The greatest hits:

- **U+200B ZERO WIDTH SPACE**: an invisible word-break opportunity. Sneaks into pasted content from rich-text editors. Breaks regex like `/^\w+$/` because the character is really there, and is not a word character, even though the input looks like one word.
- **U+200C ZERO WIDTH NON-JOINER** and **U+200D ZERO WIDTH JOINER**: used legitimately in scripts that need ligature control (Arabic, Persian, Devanagari) and, for U+200D, in emoji ZWJ sequences (👨‍👩‍👧). Cause bugs when they leak into Latin text.
- **U+FEFF BYTE ORDER MARK**: the BOM (formal name ZERO WIDTH NO-BREAK SPACE). Legitimately marks UTF-16 byte order; at the start of a UTF-8 file it is just an invisible character that breaks anything expecting the bytes to start with the actual content.
- **U+00AD SOFT HYPHEN**: invisible until line-wrap, then it might appear as a hyphen. Sneaks in from word processors.
- **U+200E LEFT-TO-RIGHT MARK** and **U+200F RIGHT-TO-LEFT MARK**: directional control. Legitimate in mixed-script text; bugs in mono-script text.
- **U+202E RIGHT-TO-LEFT OVERRIDE**: the "trojan source" character that's been weaponized to make malicious code look benign. Source files with U+202E render the text after it in reverse, which can hide the true behavior of a function.
- **U+00A0 NO-BREAK SPACE**: looks like a space but doesn't break lines. Comes from rich-text paste. Breaks code that splits on a literal ASCII space, and breaks `/\s+/` in regex engines whose `\s` is ASCII-only (JavaScript's `\s` does match it).

When any of these are in the input, the stats row shows a yellow warning with the count, and the highlighted preview pane shows them as inline badges with their codepoint hex. The codepoint table below the preview gives every codepoint a row with name, category, block, JS escape sequence, HTML numeric entity, and UTF-8 byte sequence.
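The badge preview and warning count described above can be approximated as follows. This is a sketch, not the tool's source: `INVISIBLE`, `toHex`, and `badgeInvisibles` are hypothetical names, and the watch list contains only the codepoints named in this section.

```javascript
// Hypothetical watch list: the invisible codepoints named above, nothing more.
const INVISIBLE = new Set([
  0x200b, 0x200c, 0x200d, 0xfeff, 0x00ad, 0x200e, 0x200f, 0x202e, 0x00a0,
]);

// Format a codepoint as U+XXXX (at least four hex digits).
function toHex(cp) {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Replace each watched codepoint with a visible [U+XXXX] badge and count them.
function badgeInvisibles(text) {
  let count = 0;
  let preview = "";
  for (const ch of text) { // for...of walks codepoints, not UTF-16 code units
    const cp = ch.codePointAt(0);
    if (INVISIBLE.has(cp)) {
      count += 1;
      preview += `[${toHex(cp)}]`;
    } else {
      preview += ch;
    }
  }
  return { count, preview };
}

const result = badgeInvisibles("Word\u200BJoiner");
console.log(result.count);   // 1
console.log(result.preview); // Word[U+200B]Joiner
```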

**The secondary job** is teaching the Unicode size model. JavaScript strings count UTF-16 code units, which makes `'😀'.length === 2` even though that emoji is a single codepoint. Database `VARCHAR(N)` counts bytes in some systems (Oracle's default, SQL Server's `varchar`) and characters in others (PostgreSQL, MySQL); where it counts bytes, a `VARCHAR(10)` column fits only 3 Chinese characters in UTF-8, at 3 bytes each. URL paths use percent-encoded UTF-8 bytes. Email headers use a different scheme again (RFC 2047 encoded words). The inspector shows codepoints, UTF-16 code units, and UTF-8 bytes simultaneously so you can reason about size limits in whichever layer you're working in.
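The three counts can be computed with standard APIs. The sketch below assumes a runtime that provides the global `TextEncoder` (modern browsers and Node.js); the function name `sizes` is made up for illustration.

```javascript
// Report the three sizes the stats row describes for a given string.
function sizes(text) {
  return {
    codepoints: [...text].length,                     // spread iterates by codepoint
    utf16Units: text.length,                          // what .length counts
    utf8Bytes: new TextEncoder().encode(text).length, // serialized size in UTF-8
  };
}

console.log(sizes("Word\u200BJoiner")); // { codepoints: 11, utf16Units: 11, utf8Bytes: 13 }
console.log(sizes("\u{1F600}"));        // { codepoints: 1, utf16Units: 2, utf8Bytes: 4 }
console.log(sizes("中文字"));            // { codepoints: 3, utf16Units: 3, utf8Bytes: 9 }
```

The last line is the byte-counted `VARCHAR(10)` case: three Chinese characters already take nine bytes, so a fourth does not fit.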

**Block and category info** comes from a compact inline table covering the ~70 most-common Unicode blocks (Basic Latin, Latin Extended, Cyrillic, CJK, Hangul, Devanagari, the various Symbol blocks, the Math blocks, the various Emoji blocks, etc.). This is not a full Unicode Character Database (UCD) replacement — codepoints outside the table show "Unknown / Unassigned" — but it covers everything you're likely to encounter in real text. For exhaustive Unicode lookup, the Unicode Character Database directly (or unicode.org's online tools) is the canonical source.
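A range table of this kind might look like the sketch below. The six entries are an illustrative subset with ranges taken from the Unicode block definitions; the array layout and the name `blockOf` are assumptions, not the tool's actual data structure.

```javascript
// Illustrative subset only: [first codepoint, last codepoint, block name].
const BLOCKS = [
  [0x0000, 0x007f, "Basic Latin (ASCII)"],
  [0x0080, 0x00ff, "Latin-1 Supplement"],
  [0x0400, 0x04ff, "Cyrillic"],
  [0x2000, 0x206f, "General Punctuation"],
  [0x4e00, 0x9fff, "CJK Unified Ideographs"],
  [0x1f600, 0x1f64f, "Emoticons"],
];

// Linear scan is fine for a table of a few dozen ranges.
function blockOf(cp) {
  for (const [start, end, name] of BLOCKS) {
    if (cp >= start && cp <= end) return name;
  }
  return "Unknown / Unassigned"; // fallback for anything outside the table
}

console.log(blockOf(0x0057));  // Basic Latin (ASCII)
console.log(blockOf(0x200b));  // General Punctuation
console.log(blockOf(0x1f600)); // Emoticons
console.log(blockOf(0xe000));  // Unknown / Unassigned (not in this subset)
```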

**Names follow Unicode conventions**: 'LINE FEED', 'ZERO WIDTH SPACE', 'CJK UNIFIED IDEOGRAPH-4E2D', etc. Control characters get full names ('CHARACTER TABULATION' instead of 'tab'). Most CJK ideographs don't have individual names in the Unicode spec — they get the form 'CJK UNIFIED IDEOGRAPH-XXXX' where XXXX is the hex codepoint.
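The ideograph naming rule is mechanical, so it needs no lookup table. A minimal sketch, covering only the main CJK Unified Ideographs block (U+4E00 to U+9FFF) and using a hypothetical function name:

```javascript
// Derive the name of a codepoint in the main CJK Unified Ideographs block.
// Extension blocks are deliberately left out of this sketch.
function cjkName(cp) {
  if (cp < 0x4e00 || cp > 0x9fff) return null;
  return "CJK UNIFIED IDEOGRAPH-" + cp.toString(16).toUpperCase();
}

console.log(cjkName("中".codePointAt(0))); // CJK UNIFIED IDEOGRAPH-4E2D
console.log(cjkName(0x0041));              // null
```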

**JS escape sequences** are output in the form your JavaScript code can paste directly. BMP codepoints (≤ U+FFFF) become `\uXXXX`. Astral codepoints become `\u{XXXXX}` (ES2015 form). HTML entities use the decimal numeric form `&#NNN;` which works in every HTML parser.
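The two output formats described above reduce to a few lines. The function names `jsEscape` and `htmlEntity` are illustrative; the formatting rules are the ones stated in the paragraph.

```javascript
// \uXXXX for BMP codepoints, \u{XXXXX} (ES2015) for astral ones.
function jsEscape(cp) {
  const hex = cp.toString(16).toUpperCase();
  return cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : "\\u{" + hex + "}";
}

// Decimal numeric character reference.
function htmlEntity(cp) {
  return "&#" + cp + ";";
}

console.log(jsEscape(0x0057));    // \u0057
console.log(jsEscape(0x200b));    // \u200B
console.log(jsEscape(0x1f600));   // \u{1F600}
console.log(htmlEntity(0x200b));  // &#8203;
```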

**The trojan-source example** demonstrates the U+202E attack pattern. The input `safe‮nimda‬code` looks like `safeadmincode` because U+202E reverses the direction of everything after it. The inspector shows you what's actually there: 's', 'a', 'f', 'e', RLO marker, 'n', 'i', 'm', 'd', 'a', PDF marker, 'c', 'o', 'd', 'e'. This kind of thing has appeared in real attacks on code review (a 2021 paper coined the term "trojan source"). Linting tools have since added detection for it, but the inspector is the manual fallback.
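The breakdown above can be reproduced by walking the string codepoint by codepoint. In this sketch the input is written with escapes so the controls stay visible in source; `LABELS` and `BIDI_CONTROLS` are hypothetical names, and the regex is one simple check for bidirectional control characters, not the rule set of any particular linter.

```javascript
// The example input, with the two bidi controls written as escapes.
const input = "safe\u202Enimda\u202Ccode";

// Short labels for the two controls involved (U+202E and U+202C).
const LABELS = { 0x202e: "RLO", 0x202c: "PDF" };

const parts = [...input].map((ch) => LABELS[ch.codePointAt(0)] ?? ch);
console.log(parts.join(" "));
// s a f e RLO n i m d a PDF c o d e

// Embeddings/overrides (U+202A..U+202E) and isolates (U+2066..U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;
console.log(BIDI_CONTROLS.test(input));           // true
console.log(BIDI_CONTROLS.test("safeadmincode")); // false
```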

**Privacy.** All analysis runs in your browser. The input you paste, the codepoint table, the highlighted preview — none of it crosses the network. No upload, no third-party lookup. Verify in DevTools — paste anything sensitive (a password, a session token, a private API key with weird characters), and the network panel stays empty. The inspector itself is small (no heavy lib, no font assets); after the page loads it works offline.

**Edge cases handled correctly**: surrogate pairs are joined into single codepoints (not shown as two separate entries); combining marks appear as separate codepoints (so you see them); variation selectors appear with their VS-N name; control characters appear with their CTRL name; the empty string gracefully produces zero rows.
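The surrogate-pair and combining-mark behavior corresponds to iterating a string by codepoint instead of by UTF-16 code unit. A small sketch of the difference, with illustrative variable names:

```javascript
// 'a', U+1F600 (an astral emoji), 'e', U+0301 (combining acute accent)
const text = "a\u{1F600}e\u0301";

const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");

// Splitting on code units breaks the emoji into two lone surrogates.
const units = text.split("").map((u) => hex4(u.charCodeAt(0)));
console.log(units.join(" ")); // 0061 D83D DE00 0065 0301

// Iterating by codepoint joins the pair but keeps the combining mark separate.
const codepoints = Array.from(text, (ch) => hex4(ch.codePointAt(0)));
console.log(codepoints.join(" ")); // 0061 1F600 0065 0301

// The empty string yields no rows.
console.log(Array.from("").length); // 0
```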

Frequently Asked Questions

When do I need this?

When something in your text isn't behaving as expected. Common cases: a regex that should match doesn't, `JSON.parse` rejects what looks like valid JSON, a database UNIQUE constraint lets through what looks like a duplicate, a comparison `a === b` returns false for visually identical strings. The usual culprits are invisible characters (zero-width spaces, byte order marks, RTL/LTR marks) or normalization-form differences. The inspector surfaces them by name so you can see what's actually there.
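Two of these symptoms are easy to reproduce. The sketch below uses only built-in JavaScript; the sample strings are made up for illustration.

```javascript
// 1. A regex that "should" match: the zero-width space is not a word character.
const looksLikeOneWord = "Word\u200BJoiner";
const regexMatches = /^\w+$/.test(looksLikeOneWord);
console.log(regexMatches); // false

// 2. JSON that "looks" valid: a leading BOM is not JSON whitespace.
const withBom = '\uFEFF{"ok":true}';
let parseError = null;
try {
  JSON.parse(withBom);
} catch (err) {
  parseError = err;
}
console.log(parseError instanceof SyntaxError); // true

// Stripping the leading BOM makes the same text parse.
const cleaned = withBom.replace(/^\uFEFF/, "");
console.log(JSON.parse(cleaned).ok); // true
```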

What's an 'invisible character'?

A character that takes up zero or near-zero visual space but exists in the text. The dangerous ones: U+200B zero-width space (invisible word separator), U+200C zero-width non-joiner, U+200D zero-width joiner (used in emoji sequences), U+FEFF byte order mark, U+00AD soft hyphen, U+202E right-to-left override (the 'trojan source' character that reverses display direction). All of these can cause real bugs when present in code, configuration, or user input.

What's the 'trojan source' example demonstrating?

U+202E (right-to-left override) flips the visual direction of the characters after it, up to the matching U+202C (pop directional formatting) or the end of the paragraph. The example `safe‮nimda‬code` is actually `s a f e [U+202E] n i m d a [U+202C] c o d e`, but it renders left-to-right as `safeadmincode`. This was exploited in real-world attacks where malicious code looked like one thing in editors but executed as another. The inspector lets you see exactly what's there, codepoint by codepoint.

Why does my text show more codepoints than characters?

Three possibilities. **Combining characters**: 'é' can be one codepoint (U+00E9, precomposed) or two (U+0065 'e' + U+0301 combining acute). They look identical but are different sequences. **Variation selectors**: many emoji are followed by U+FE0F (variation selector-16) to force emoji-style rendering. **ZWJ sequences**: complex emoji like 👨‍👩‍👧 are multiple codepoints joined by U+200D. The inspector shows each component separately.
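To compare codepoint counts with what a reader perceives as characters, one option is the standard `Intl.Segmenter` API with grapheme granularity. This sketch assumes a runtime that ships it (current browsers and recent Node.js releases); it is not a statement about how the inspector itself counts.

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Count codepoints and user-perceived characters (grapheme clusters).
function counts(text) {
  return {
    codepoints: [...text].length,
    graphemes: [...segmenter.segment(text)].length,
  };
}

console.log(counts("e\u0301"));       // { codepoints: 2, graphemes: 1 }  combining mark
console.log(counts("\u2764\uFE0F"));  // { codepoints: 2, graphemes: 1 }  variation selector
console.log(counts("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"));
// { codepoints: 5, graphemes: 1 }  ZWJ sequence
```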

What's the difference between UTF-16 code units, codepoints, and UTF-8 bytes?

**Codepoints** are the abstract Unicode characters (U+0041 = 'A'). **UTF-16 code units** are how JavaScript stores strings internally — each code unit is 16 bits. Characters above U+FFFF (emoji, ancient scripts) need two code units (surrogate pair), so `'😀'.length === 2` in JS even though it's one codepoint. **UTF-8 bytes** are how strings are typically serialized on disk and over the network — 1 to 4 bytes per codepoint depending on its value. The inspector shows all three so you can reason about size limits and indexing in any layer.
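The 1-to-4-byte range is easy to see per codepoint. A minimal sketch using the global `TextEncoder`; `utf8Hex` is a hypothetical helper name, and the output format mirrors the space-separated hex shown in the table's UTF-8 column.

```javascript
// Encode a single codepoint as UTF-8 and format the bytes as spaced hex.
function utf8Hex(cp) {
  const bytes = new TextEncoder().encode(String.fromCodePoint(cp));
  return Array.from(bytes, (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

console.log(utf8Hex(0x0057));  // 57           (1 byte)
console.log(utf8Hex(0x00e9));  // C3 A9        (2 bytes)
console.log(utf8Hex(0x200b));  // E2 80 8B     (3 bytes)
console.log(utf8Hex(0x1f600)); // F0 9F 98 80  (4 bytes)
```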

Is my input sent to a server?

No. The inspector is pure JavaScript running in your browser. Each codepoint is resolved against a local lookup table; UTF-8 encoding is computed in-tab. Verify in DevTools — paste anything, watch the network panel stay empty. This matters because text passed to this tool often contains sensitive content (config files, API responses, user input under debugging).