The single hardest thing about working with Unicode is that the question "how many characters is this?" has at least four answers. The string café, as displayed in your browser, contains four visible letters. It also contains either four or five codepoints depending on how it was typed. It might be rendered with three or four glyphs depending on the font. And to a user who is told to type a password of at least four characters, it should always count as four. None of those four numbers is wrong; they are answers to different questions.
The four terms
- Codepoint: An integer between U+0000 and U+10FFFF assigned to a character (or reserved for some purpose) in the Unicode standard. The most fundamental unit and the only one with a precise number.
- Character: The abstract idea of a writable thing, such as a letter, a digit, or a punctuation mark. Unicode formally defines abstract characters, which map to one or more codepoints. The mapping is not always one-to-one.
- Glyph: The drawn shape of a character on screen or page. A font defines the glyphs. The same character can be rendered as many glyphs (contextual forms in Arabic, ligatures in Latin), and the same glyph can stand for several characters (the fi ligature).
- Grapheme cluster: The unit a human would call "a character" when counting on screen. Defined by Unicode Standard Annex #29. One grapheme cluster may span many codepoints.
Example one: the accented e
The letter é can be represented as a single codepoint or as two, and both are valid Unicode:
"é" as one codepoint: U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
"é" as two codepoints: U+0065 U+0301 (e + COMBINING ACUTE ACCENT)
The two representations look identical when displayed. They produce different byte sequences in every encoding. They are, however, treated as canonically equivalent by Unicode: a conforming process must consider them the same character for purposes like searching, sorting, and string comparison. Converting one form into the other is the job of normalization: NFC composes the pair into the single codepoint, and NFD decomposes the single codepoint into the pair.
From the four-term perspective:
- Codepoints: 1 in the first form, 2 in the second.
- Characters: 1 — the same abstract character in both cases. (The standard calls this an abstract character with two valid coded representations.)
- Glyphs: 1 — the font draws the same shape.
- Grapheme clusters: 1 — a user counts one letter.
So the question "how many characters is é?" can correctly be answered 1 or 2: two only if you are counting codepoints in the decomposed form, one by every other measure. The classic bug here is a length check: "caf\u00E9".length is 4 in JavaScript, but "cafe\u0301".length is 5, even though both strings display as café. See the é detail page for the byte breakdowns.
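The length mismatch and its fix can be sketched in a few lines. This is a minimal illustration rather than a complete comparison routine; the variable names are made up for the example, and the strings are written with escapes so the two forms cannot be silently converted by an editor.

```javascript
// Two spellings of "café" that display identically.
const precomposed = "caf\u00E9";  // é as the single codepoint U+00E9
const decomposed = "cafe\u0301";  // e followed by U+0301 COMBINING ACUTE ACCENT

console.log(precomposed.length);          // 4
console.log(decomposed.length);           // 5
console.log(precomposed === decomposed);  // false: different code units

// Normalizing both sides to the same form makes them comparable.
const sameAfterNfc =
  precomposed.normalize("NFC") === decomposed.normalize("NFC");
console.log(sameAfterNfc);                        // true
console.log(decomposed.normalize("NFC").length);  // 4
console.log(precomposed.normalize("NFD").length); // 5
```

Either NFC or NFD works for comparison, as long as both strings go through the same form.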
Example two: the family emoji
👨👩👧👦 is one image. To a user it is one character. Underneath it is a ZWJ sequence of seven codepoints, joined by an invisible Zero Width Joiner at U+200D:
👨👩👧👦 = U+1F468 MAN
U+200D ZERO WIDTH JOINER
U+1F469 WOMAN
U+200D ZERO WIDTH JOINER
U+1F467 GIRL
U+200D ZERO WIDTH JOINER
U+1F466 BOY
In UTF-8: F0 9F 91 A8 E2 80 8D F0 9F 91 A9 E2 80 8D
F0 9F 91 A7 E2 80 8D F0 9F 91 A6
— 25 bytes for 7 codepoints, rendered as 1 glyph.
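The listing above can be reproduced with standard JavaScript built-ins. The sketch below assumes a runtime that exposes a global TextEncoder (current browsers and Node.js do); the names family, codepoints and utf8 are chosen only for this example.

```javascript
// Build the sequence from explicit escapes so the invisible joiners
// cannot be lost when the code is copied.
const ZWJ = "\u200D";
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);

// Spreading a string iterates by codepoint, not by UTF-16 code unit.
const codepoints = [...family].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

// TextEncoder always produces UTF-8.
const utf8 = new TextEncoder().encode(family);
const hex = Array.from(utf8, (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
).join(" ");

console.log(codepoints.join(" "));
console.log(codepoints.length); // 7
console.log(hex);
console.log(utf8.length);       // 25
```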
The font carries a ligature that recognises this exact ZWJ sequence and replaces the four people with a single composite picture. If the font lacks that ligature — common on older systems and some Linux desktops — the same sequence renders as four separate people with no joining. The codepoints are the same in either case; what changes is the glyph count.
- Codepoints: 7.
- Characters: 4 abstract characters (the four people), plus 3 ZWJ format characters that carry no semantic meaning of their own.
- Glyphs: 1 in a modern font, 4 in a basic one.
- Grapheme clusters: 1.
This is why "👨👩👧👦".length in JavaScript is 11 (it counts UTF-16 code units), and why [..."👨👩👧👦"].length is 7 (it counts codepoints), and why neither is the answer a user would give. To count grapheme clusters you need a library that implements UAX #29, or in modern JavaScript, Intl.Segmenter:
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment("👨👩👧👦")].length // 1
Example three: the fi ligature
The Latin fi ligature works in the opposite direction from the family emoji. In high-quality typography the two letters f and i are joined into a single shape so that the dot of the i does not collide with the hook of the f.
"fi" as text: U+0066 U+0069 (f, i) → font draws 1 glyph
"ﬁ" as one codepoint: U+FB01 (LATIN SMALL LIGATURE FI) → 1 glyph
The string fi is two codepoints, two abstract characters, two grapheme clusters — but one glyph in any font that has the ligature in its liga OpenType feature. The other string, the precomposed ligature U+FB01, is one codepoint, one glyph, and is meant for compatibility with older typesetting systems. Searching for fi inside the precomposed ligature returns no match unless you first run NFKC normalization, which decomposes U+FB01 back into two letters.
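The search behaviour described here is easy to check. The following is a small sketch using String.prototype.normalize; the sample word and variable names are illustrative only.

```javascript
const ligatureWord = "\uFB01sh"; // U+FB01 followed by "sh", displayed as "fish"

// The precomposed ligature is a different codepoint, so a plain search fails.
console.log(ligatureWord.includes("fi")); // false

// Canonical normalization (NFC) leaves the ligature untouched...
console.log(ligatureWord.normalize("NFC") === ligatureWord); // true

// ...while compatibility normalization (NFKC) expands it to f + i.
const folded = ligatureWord.normalize("NFKC");
console.log(folded);                // "fish"
console.log(folded.includes("fi")); // true
console.log([...folded].length);    // 4
```

Compatibility normalization discards the distinction between the ligature and the plain letters, so it suits search keys better than text that will be stored or displayed.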
Example four: Devanagari
The Hindi word for India, भारत, is rendered in four glyphs in most fonts. It contains four codepoints, but the boundaries do not line up with the visible letters in the way an alphabetic-script reader expects.
भारत = U+092D DEVANAGARI LETTER BHA (भ)
U+093E DEVANAGARI VOWEL SIGN AA (ा)
U+0930 DEVANAGARI LETTER RA (र)
U+0924 DEVANAGARI LETTER TA (त)
The vowel sign U+093E attaches to the preceding consonant; together they form a single grapheme cluster भा. So the codepoint count is 4 and the grapheme cluster count is 3 (भा, र, त). The glyph count stays at 4 in most fonts, because the vowel sign is typically drawn as its own glyph beside the consonant.
Brahmic scripts (Devanagari, Tamil, Bengali, Thai, and many others) make this point routinely. The unit you care about for cursor positioning, selection, and most user-facing string lengths is the grapheme cluster, not the codepoint.
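The cluster boundaries for भारत can be inspected with Intl.Segmenter. This sketch assumes a runtime whose Intl.Segmenter follows the extended grapheme cluster rules of UAX #29, which is what current browsers and Node.js provide; the "hi" locale tag and the variable names are choices made for the example.

```javascript
// भारत written with escapes: BHA, VOWEL SIGN AA, RA, TA.
const bharat = "\u092D\u093E\u0930\u0924";

const hindiSegmenter = new Intl.Segmenter("hi", { granularity: "grapheme" });
const clusters = Array.from(hindiSegmenter.segment(bharat), (s) => s.segment);

console.log([...bharat].length); // 4 codepoints
console.log(clusters.length);    // 3 grapheme clusters

// Codepoints per cluster: the first cluster holds the consonant plus its vowel sign.
const sizes = clusters.map((c) => [...c].length);
console.log(sizes);              // [ 2, 1, 1 ]
```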
When does the distinction bite
| Task | Unit to use | Why |
|---|---|---|
| Password minimum length | Grapheme clusters | "abc😀" should count as 4, not 5 (UTF-16 length) or 7 (UTF-8 bytes). |
| Database storage allocation | Bytes (UTF-8) | What disk and network consume. |
| Cursor movement / selection | Grapheme clusters | Pressing right-arrow should never split a family emoji. |
| Regex character class | Codepoints | Most regex engines match per codepoint; JavaScript does so only with the u or v flag. |
| Sorting / searching | Normalized codepoints | Precomposed "café" (é as U+00E9) must match decomposed "café" (e + U+0301). |
| Font rendering | Glyphs | OpenType liga, calt, locl features decide what is drawn. |
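For the regex row, the difference between code-unit and codepoint matching shows up in JavaScript as soon as a pattern meets a character above U+FFFF. The sketch below uses only built-in regex syntax; the variable names are illustrative.

```javascript
const grin = "\u{1F600}"; // 😀, one codepoint stored as two UTF-16 code units

// Without the u flag, "." consumes a single code unit, so one dot cannot span the emoji.
const matchesWithoutFlag = /^.$/.test(grin);
// With the u flag, "." consumes a whole codepoint.
const matchesWithFlag = /^.$/u.test(grin);

console.log(matchesWithoutFlag); // false
console.log(matchesWithFlag);    // true

// Codepoint matching is still not grapheme matching:
// a decomposed é is two codepoints, so a single "." does not cover it.
const matchesDecomposed = /^.$/u.test("e\u0301");
console.log(matchesDecomposed);  // false
```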
JavaScript's String.length is the count of UTF-16 code units, which is none of the four units above. It happens to equal the codepoint count for BMP text, but not for emoji, mathematical alphanumerics, or any other codepoint above U+FFFF. Use [...str].length for codepoints and Intl.Segmenter for grapheme clusters.
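One way to keep the units apart in code is a small helper that reports all of them side by side. The functions countUnits and meetsMinimumLength below are hypothetical helpers written for this article, not part of any library, and they assume a runtime with Intl.Segmenter and a global TextEncoder.

```javascript
// Report a string's size in each unit discussed above.
function countUnits(str) {
  // undefined selects the runtime's default locale.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf8Bytes: new TextEncoder().encode(str).length, // storage and network size
    utf16Units: str.length,                          // what String.length returns
    codepoints: [...str].length,                     // string iteration is per codepoint
    graphemes: [...segmenter.segment(str)].length,   // what a user would count
  };
}

// A password rule expressed in the unit the user sees.
function meetsMinimumLength(password, min) {
  return countUnits(password).graphemes >= min;
}

const sample = countUnits("abc\u{1F600}");
console.log(sample);
// { utf8Bytes: 7, utf16Units: 5, codepoints: 4, graphemes: 4 }

console.log(meetsMinimumLength("abc\u{1F600}", 4)); // true
console.log(meetsMinimumLength("e\u0301", 2));      // false: one grapheme, two codepoints
```

Creating the segmenter inside the function keeps the sketch short; code that calls it often would probably create the segmenter once and reuse it.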
What to remember
A codepoint is a number. A character is an idea. A glyph is a shape. A grapheme cluster is what a user counts. Whenever code talks about "a character", ask which of the four it means — and which the underlying API actually delivers. Most surprises in Unicode begin with that mismatch.
Further reading
- Unicode normalization explained — when the two forms of é become indistinguishable.
- How emoji work — the full mechanics of the family-emoji example.
- Character inspector — paste any string and see each codepoint listed out.
- é U+00E9 — the single-codepoint precomposed form.
- UTF-8, UTF-16, UTF-32 compared — what indexes are even counting.
- What is Unicode? — the standard underneath all four terms.