TOOL · INSPECTION

Character inspector

Paste any string and see exactly what codepoints are in it — including the invisible ones.

How it works

The inspector iterates the input string by codepoint using a for...of loop, which keeps surrogate pairs together. Indexing with bracket notation, by contrast, returns 16-bit code units and splits supplementary-plane characters such as most emoji. For each codepoint it shows the glyph itself, the canonical U+XXXX notation, the decimal value, an identifying name, the general category, and the UTF-8 byte length.
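A minimal sketch of that loop, assuming a hypothetical helper named inspectCodepoints (the tool's actual function names are not shown on this page). It covers only the columns that need no lookup data: glyph, U+XXXX notation, decimal value, and UTF-8 byte length.

```javascript
// Hypothetical helper: builds one row per codepoint, not per UTF-16 code unit.
function inspectCodepoints(str) {
  const encoder = new TextEncoder();
  const rows = [];
  for (const ch of str) { // for...of yields whole codepoints
    const cp = ch.codePointAt(0);
    rows.push({
      glyph: ch,
      notation: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
      decimal: cp,
      utf8Bytes: encoder.encode(ch).length,
    });
  }
  return rows;
}

// 'a😀' has .length 3, but the loop produces two rows.
console.log(inspectCodepoints('a😀'));
```

The second row reports U+1F600 with a UTF-8 length of 4, whereas 'a😀'[1] would return only the first half of the surrogate pair.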

The summary above the table compares four counts that often disagree:

Graphemes: user-perceived characters, segmented by Intl.Segmenter (part of the ECMAScript Internationalization API) with granularity: 'grapheme'. A flag, a family emoji, and a base letter with combining marks each count as one grapheme.
Codepoints: the number of Unicode codepoints, which is what [...str].length returns.
UTF-16 code units: what JavaScript's plain .length returns. This equals the codepoint count for BMP-only strings and is larger when supplementary-plane characters are present, because each of those takes two code units.
UTF-8 bytes: the encoded byte length on the wire or on disk. This is the count databases enforce when columns are sized in bytes rather than characters.
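The four counts can be reproduced in a few lines. This is a sketch under the assumption of a runtime that provides Intl.Segmenter and TextEncoder as globals (current browsers and recent Node.js versions); countAll is a hypothetical name, not the tool's own.

```javascript
// Hypothetical helper: returns the four counts described above.
function countAll(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    graphemes: [...segmenter.segment(str)].length,
    codepoints: [...str].length,
    utf16Units: str.length,
    utf8Bytes: new TextEncoder().encode(str).length,
  };
}

// A base letter plus a combining mark: one grapheme, two codepoints.
console.log(countAll('e\u0301'));
// → { graphemes: 1, codepoints: 2, utf16Units: 2, utf8Bytes: 3 }
```

For a single supplementary-plane character such as 😀, the same function reports one codepoint, two UTF-16 code units, and four UTF-8 bytes.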

Names are resolved from a small built-in lookup of roughly eighty common codepoints. For everything else, the tool falls back to the Unicode block name (e.g. CJK Unified Ideographs, Emoticons, Cyrillic). Blocks partition the codepoint space into named ranges and are useful even without a per-character name. The category column uses a block-based heuristic to suggest a general category — letter, mark, digit, symbol, control, format, separator. For precise category data you'd consult the Unicode Character Database; this tool deliberately ships a heuristic rather than a 200 KB JSON file.
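The lookup-then-fallback idea might look roughly like the following. The tables here are illustrative assumptions: four names and four blocks stand in for the tool's real data, and the category attached to each block is a guess in the same heuristic spirit, not Unicode Character Database data.

```javascript
// Hypothetical, heavily truncated tables for illustration only.
const KNOWN_NAMES = new Map([
  [0x00A0, 'NO-BREAK SPACE'],
  [0x200B, 'ZERO WIDTH SPACE'],
  [0x200D, 'ZERO WIDTH JOINER'],
  [0x201D, 'RIGHT DOUBLE QUOTATION MARK'],
]);

const BLOCKS = [
  { start: 0x0300, end: 0x036F, name: 'Combining Diacritical Marks', category: 'mark' },
  { start: 0x0400, end: 0x04FF, name: 'Cyrillic', category: 'letter' },
  { start: 0x4E00, end: 0x9FFF, name: 'CJK Unified Ideographs', category: 'letter' },
  { start: 0x1F600, end: 0x1F64F, name: 'Emoticons', category: 'symbol' },
];

// Prefer an exact name; otherwise fall back to the enclosing block's name.
function describe(cp) {
  const block = BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return {
    label: KNOWN_NAMES.get(cp) ?? block?.name ?? 'Unknown block',
    exactName: KNOWN_NAMES.has(cp),
    category: block?.category ?? 'unknown',
  };
}

console.log(describe(0x201D)); // exact name from the lookup
console.log(describe(0x4E2D)); // no per-character name, so the block name is used
```

A per-block category is wrong for some codepoints (the Cyrillic block, for instance, also contains a few combining marks), which is the trade-off the heuristic accepts in exchange for its small size.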

What you'll see

Paste a string and you immediately notice things you couldn't see before. A "smart quote" you copied from a Word document turns out to be U+201D RIGHT DOUBLE QUOTATION MARK, not the ASCII " (U+0022 QUOTATION MARK). A string that "should" match in your database turns out to contain a U+200B ZERO WIDTH SPACE somewhere in the middle, picked up when the text was copied from a web page or rich-text editor. A name field that "should" sort correctly turns out to contain a U+00A0 NO-BREAK SPACE where the user intended a regular space. A family emoji like 👨‍👩‍👧 turns out to be five codepoints, three person emoji joined by two zero-width joiners (U+200D), not a single atomic character.
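To check for the same culprits in code, a rough sketch is a scan against a short watch list. The list and the name findSuspects are assumptions made for this example; the inspector itself shows every codepoint rather than filtering.

```javascript
// Hypothetical watch list: the look-alike and invisible characters named above.
const SUSPECTS = new Map([
  [0x00A0, 'NO-BREAK SPACE'],
  [0x200B, 'ZERO WIDTH SPACE'],
  [0x200D, 'ZERO WIDTH JOINER'],
  [0x201D, 'RIGHT DOUBLE QUOTATION MARK'],
]);

// Returns one entry per suspect codepoint, with its position counted in codepoints.
function findSuspects(str) {
  const hits = [];
  let index = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (SUSPECTS.has(cp)) {
      hits.push({
        index,
        notation: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
        name: SUSPECTS.get(cp),
      });
    }
    index += 1;
  }
  return hits;
}

console.log(findSuspects('John\u00A0Smith'));
// → [ { index: 4, notation: 'U+00A0', name: 'NO-BREAK SPACE' } ]
```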

Worked example

Paste café spelled in two different ways and you can see the difference instantly:

Precomposed: c, a, f, U+00E9 LATIN SMALL LETTER E WITH ACUTE — four codepoints, five UTF-8 bytes.
Decomposed: c, a, f, e, U+0301 COMBINING ACUTE ACCENT — five codepoints, six UTF-8 bytes.

Both render identically. They will not compare equal as JavaScript strings. They will compare equal after both are run through NFC with the normalizer. This is one of the most common variants of the "why doesn't equality work" bug in Unicode-aware systems, and seeing the codepoints side by side makes it obvious.
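The same comparison can be reproduced in a console. This sketch uses JavaScript's built-in String.prototype.normalize as a stand-in for the normalizer tool, which is an assumption about equivalence of results, not a description of how that tool is implemented.

```javascript
const precomposed = 'caf\u00E9';  // c, a, f, U+00E9
const decomposed = 'cafe\u0301';  // c, a, f, e, U+0301

const utf8Length = (s) => new TextEncoder().encode(s).length;

console.log([...precomposed].length, utf8Length(precomposed)); // 4 5
console.log([...decomposed].length, utf8Length(decomposed));   // 5 6

// Raw comparison fails; comparison after NFC succeeds.
console.log(precomposed === decomposed); // false
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true
```

Normalizing both sides to the same form before comparing or storing is the usual fix; NFC is used here because the text above names it.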

Related

Codepoint converter: single-character deep dive
Unicode normalizer: fix mismatches
UTF-8 encoder: focus on byte sequences
Codepoint, character, glyph, grapheme
Unicode normalization explained
How emoji work
All Unicode categories
All Unicode blocks
