An accent in Unicode is not always part of its letter. It can be — é exists as the single codepoint U+00E9 — but it can also stand alone, attached to whatever base character precedes it in the stream. That second arrangement is a combining mark, and it is the mechanism by which Unicode covers every accent on every letter in every script without enumerating each combination as its own codepoint. The cost is paid back in normalization, indexing, and a handful of subtle bugs that surface when code assumes one character is one codepoint.
What a combining mark is
A combining mark is a codepoint whose General Category begins with M. There are three subcategories:
- Mn (Mark, Nonspacing): Zero advance width. The mark renders on top of, below, or through the preceding base character without moving the cursor forward. Most accents, vowel signs, and tone marks live here. U+0301 COMBINING ACUTE ACCENT is Mn.
- Mc (Mark, Spacing Combining): Non-zero advance width. The mark occupies horizontal space of its own while still binding semantically to the preceding base. Devanagari vowel signs like U+093E DEVANAGARI VOWEL SIGN AA are Mc.
- Me (Mark, Enclosing): The mark surrounds the preceding base character. U+20DD COMBINING ENCLOSING CIRCLE turns any base into a circled glyph; U+20E3 COMBINING ENCLOSING KEYCAP is what gives 1️⃣ its boxed shape.
The defining property of every mark is that it modifies the preceding base character — it has no glyph of its own that stands alone. Renderers that meet a combining mark with no base in front of it are expected to display it over a dotted circle placeholder (◌), and most do.
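The three subcategories can be told apart in code with Unicode property escapes. The following is a minimal sketch, assuming a JavaScript engine with `\p{...}` support; `markSubcategory` is a hypothetical helper name, and the answers depend on the Unicode version the engine ships with.

```javascript
// Hypothetical helper: report which Mark subcategory a single codepoint belongs to.
// Relies on the engine's built-in Unicode property data (regex `u` flag).
function markSubcategory(ch) {
  if (/^\p{Mn}$/u.test(ch)) return 'Mn'; // nonspacing
  if (/^\p{Mc}$/u.test(ch)) return 'Mc'; // spacing combining
  if (/^\p{Me}$/u.test(ch)) return 'Me'; // enclosing
  return null; // not a combining mark
}

console.log(markSubcategory('\u0301')); // 'Mn'  COMBINING ACUTE ACCENT
console.log(markSubcategory('\u093E')); // 'Mc'  DEVANAGARI VOWEL SIGN AA
console.log(markSubcategory('\u20DD')); // 'Me'  COMBINING ENCLOSING CIRCLE
console.log(markSubcategory('e'));      // null  a base letter
```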
The five blocks
The marks are distributed across five dedicated blocks, plus scattered marks inside individual script blocks for Indic, Arabic, Hebrew, and a dozen others:
- U+0300–U+036F, Combining Diacritical Marks: 112 codepoints. The main block, covering Latin, Greek, Cyrillic, and IPA. Includes U+0301 acute, U+0300 grave, U+0302 circumflex, U+0303 tilde, U+0307 dot above, U+0308 diaeresis, U+030A ring above, U+0327 cedilla.
- U+1AB0–U+1AFF, Combining Diacritical Marks Extended: Added in Unicode 7.0 for Germanic dialectology and other linguistic notation.
- U+1DC0–U+1DFF, Combining Diacritical Marks Supplement: Medievalist and phonetic extensions, including combining double letters and archaic marks.
- U+20D0–U+20FF, Combining Diacritical Marks for Symbols: Marks that combine with mathematical and currency symbols, such as arrows above, enclosing circles, enclosing squares, and the keycap.
- U+FE20–U+FE2F, Combining Half Marks: Halved marks used to draw a single accent across two adjacent base letters.
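A range lookup over the listed blocks can be sketched as follows. `COMBINING_BLOCKS` and `combiningBlockOf` are hypothetical names introduced for illustration; the ranges are the ones given in the list, and a hit only says the codepoint falls inside a block, not that it is assigned or that it is a mark.

```javascript
// Block ranges as listed in the text. Membership in a block does not imply the
// codepoint is assigned, and many marks live outside these blocks.
const COMBINING_BLOCKS = [
  [0x0300, 0x036f, 'Combining Diacritical Marks'],
  [0x1ab0, 0x1aff, 'Combining Diacritical Marks Extended'],
  [0x1dc0, 0x1dff, 'Combining Diacritical Marks Supplement'],
  [0x20d0, 0x20ff, 'Combining Diacritical Marks for Symbols'],
  [0xfe20, 0xfe2f, 'Combining Half Marks'],
];

// Hypothetical helper: name the dedicated combining block of a character, or null.
function combiningBlockOf(ch) {
  const cp = ch.codePointAt(0);
  const hit = COMBINING_BLOCKS.find(([lo, hi]) => cp >= lo && cp <= hi);
  return hit ? hit[2] : null;
}

console.log(combiningBlockOf('\u0301')); // 'Combining Diacritical Marks'
console.log(combiningBlockOf('\u20E3')); // 'Combining Diacritical Marks for Symbols'
console.log(combiningBlockOf('\u093E')); // null (a mark, but inside the Devanagari block)
```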
The same letter, two spellings
The most consequential property of combining marks is that almost every accented letter has two valid representations: the precomposed single codepoint, and the base-plus-mark sequence.
é precomposed U+00E9
é base + combining U+0065 U+0301
These two strings are canonically equivalent. Section 3.7 of the Unicode standard defines that equivalence, and the standard's conformance clauses forbid a process from assuming that two canonically equivalent sequences are distinct. They render identically in any well-built font, they compare equal under the Unicode collation algorithm, and they hash to the same value after normalization. They do not, however, have the same byte length, the same codepoint length, or (without normalization) the same equality under simple string comparison.
That equivalence is symmetric. NFC pulls the second form to the first; NFD pushes the first to the second. See the normalization guide for the four forms and when each one is required.
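The difference between raw comparison and canonical comparison is easy to see with `String.prototype.normalize`. This is a small illustration rather than a recommended API; `canonicallyEqual` is a hypothetical helper name.

```javascript
const precomposed = '\u00E9';   // é as one codepoint
const decomposed = 'e\u0301';   // e + COMBINING ACUTE ACCENT

console.log(precomposed === decomposed);                  // false
console.log(precomposed.length, decomposed.length);       // 1 2
console.log(precomposed === decomposed.normalize('NFC')); // true
console.log(precomposed.normalize('NFD') === decomposed); // true

// Hypothetical helper: equality up to canonical equivalence.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}
console.log(canonicallyEqual(precomposed, decomposed));   // true
```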
Worked examples across scripts
Combining marks are not a Latin-only mechanism. Every major script with diacritics uses them.
| Glyph | Codepoints | Description |
|---|---|---|
| ñ | U+006E U+0303 | Latin n + combining tilde. |
| n͠ | U+006E U+0360 | Latin n + combining double tilde, drawn across two letters when followed by a second base. |
| ǘ | U+0075 U+0308 U+0301 | Latin u with diaeresis and acute. Vietnamese and pinyin use stacks like this. |
| ё | U+0435 U+0308 | Cyrillic e + combining diaeresis. Also has the precomposed U+0451. |
| с҃ | U+0441 U+0483 | Cyrillic es + combining titlo, used in Church Slavonic to mark abbreviations. |
| اَ | U+0627 U+064E | Arabic alef + fatha (a harakat — short vowel mark). |
| אֱ | U+05D0 U+05B1 | Hebrew alef + hataf segol (a niqqud — vowel point). |
| nǎ | U+006E U+0061 U+030C | Pinyin n + a + combining caron, the third tone for nǎ. |
| ế | U+0065 U+0302 U+0301 | Vietnamese e with circumflex and acute — two marks on one base. |
The Vietnamese, Arabic, and Hebrew cases are the ones to look at carefully. Vietnamese in particular routinely stacks two marks per base, a vowel quality mark plus a tone mark. Precomposed forms exist in Latin Extended Additional (U+1E00–U+1EFF), but normalization to NFD breaks them apart and exposes the underlying sequence.
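To see the underlying sequences for rows like these, it helps to dump codepoints before and after normalization. The sketch below assumes nothing beyond built-in string methods; `codepoints` is a hypothetical helper name.

```javascript
// Hypothetical helper: list a string's codepoints in U+XXXX notation.
function codepoints(s) {
  return Array.from(s, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const eCircumflexAcute = '\u1EBF'; // ế, precomposed in Latin Extended Additional
console.log(codepoints(eCircumflexAcute));                  // [ 'U+1EBF' ]
console.log(codepoints(eCircumflexAcute.normalize('NFD'))); // [ 'U+0065', 'U+0302', 'U+0301' ]

// NFC goes the other way where a precomposed form exists (Cyrillic ё).
console.log(codepoints('\u0435\u0308'.normalize('NFC')));   // [ 'U+0451' ]
// Arabic alef + fatha has no precomposed form, so NFC leaves two codepoints.
console.log(codepoints('\u0627\u064E'.normalize('NFC')));   // [ 'U+0627', 'U+064E' ]
```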
Canonical combining class
When more than one mark follows the same base, the order matters for byte equality but should not matter for visual result. To reconcile the two, every combining mark carries a number called its canonical combining class (ccc), defined in the Unicode Character Database. It is an integer from 0 to 254, of which 240 is the highest value currently assigned. A few of the values:
- ccc = 0: Not reordered. Base characters and starter marks. This is the value for any non-combining codepoint and for combining marks whose interaction with the base makes their order significant.
- ccc = 1: Overlay.
- ccc = 202: Attached below. The class of the cedilla (U+0327) and ogonek (U+0328).
- ccc = 218: Below left.
- ccc = 220: Below.
- ccc = 230: Above. The class of the acute, grave, circumflex, tilde, diaeresis, and most familiar Latin diacritics.
- ccc = 232: Above right.
- ccc = 240: Iota subscript.
The canonical ordering algorithm (Section 3.11 of the Unicode standard) sorts runs of marks that follow a starter into ascending order by ccc, but is forbidden from swapping two marks that share the same non-zero class — they retain their original relative order. The result is that strings differing only in the order of independent marks have a single canonical form.
A worked ordering
Take the letter a with a dot above and a dot below, written in the order the user types:
Input: U+0061 U+0307 U+0323
a dot above (ccc 230) dot below (ccc 220)
Reorder by ccc (ascending, stable):
U+0061 U+0323 U+0307
a dot below (ccc 220) dot above (ccc 230)
The two marks have different non-zero combining classes, so they are reordered. After canonical ordering the byte sequence is fixed regardless of which mark the user typed first. Visually nothing changes; storage now has a single canonical form.
Contrast with two marks of the same class. Two marks above (both ccc 230) keep their input order, because reordering them might change the rendering — the standard reserves that order as a meaningful authoring choice.
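The reordering step can be sketched as a stable sort over each run of marks. JavaScript exposes no combining-class lookup, so the table below is a hand-picked, hypothetical subset (`CCC_SAMPLE`), and anything missing from it is treated as ccc 0; a real implementation would read the values from the Unicode Character Database. `canonicalOrderSketch` is likewise an illustrative name.

```javascript
// Hypothetical subset of canonical combining classes, for illustration only.
const CCC_SAMPLE = new Map([
  [0x0300, 230], // COMBINING GRAVE ACCENT
  [0x0301, 230], // COMBINING ACUTE ACCENT
  [0x0307, 230], // COMBINING DOT ABOVE
  [0x0323, 220], // COMBINING DOT BELOW
  [0x0327, 202], // COMBINING CEDILLA
]);

// Sort each run of non-zero-ccc marks by ccc. Array.prototype.sort is stable,
// so marks that share a class keep their input order.
function canonicalOrderSketch(s) {
  const out = [];
  let run = [];
  const flush = () => {
    run.sort((a, b) => CCC_SAMPLE.get(a) - CCC_SAMPLE.get(b));
    out.push(...run);
    run = [];
  };
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (CCC_SAMPLE.has(cp)) {
      run.push(cp);   // mark with ccc > 0: collect into the current run
    } else {
      flush();        // starter (assumed ccc 0): close the run before it
      out.push(cp);
    }
  }
  flush();
  return String.fromCodePoint(...out);
}

console.log(canonicalOrderSketch('a\u0307\u0323') === 'a\u0323\u0307'); // true: 220 sorts before 230
console.log(canonicalOrderSketch('a\u0301\u0300') === 'a\u0301\u0300'); // true: same class, order kept
```

For these inputs the result matches what `normalize('NFD')` returns, which is a convenient way to check a hand-rolled table against the engine's own data.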
Stacking and the limits of fonts
A well-built font for Latin, Greek, or Cyrillic includes mark anchor points and a GPOS table that positions marks correctly above and below the base. Less well-built fonts collide marks into the base or each other. The behaviour is purely a font-level concern; the codepoint sequence is the same either way.
The pathological case is so-called Zalgo text, a real Unicode phenomenon in which dozens of combining marks are stacked on a single base. Each codepoint is legitimate; the sequence is legal Unicode; rendering simply runs out of vertical space. Stripping marks (NFD followed by removing the Mn category) is the standard mitigation.
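Where stripping every mark is too aggressive, one variation is to cap the number of marks allowed after each base. This is a swapped-in technique rather than the mitigation described above, and the limit of two is an arbitrary assumption; `capCombiningMarks` is a hypothetical helper name.

```javascript
// Hypothetical helper: keep at most `maxMarks` combining marks in each run.
// maxMarks = 0 removes every mark, which is the plain stripping approach.
function capCombiningMarks(s, maxMarks = 2) {
  const overflow = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, 'gu');
  return s
    .normalize('NFD')         // expose every mark as its own codepoint
    .replace(overflow, '$1')  // keep the first maxMarks marks of each run
    .normalize('NFC');        // recompose what is left
}

const zalgoLike = 'e' + '\u0301'.repeat(10); // one base, ten acutes
console.log(Array.from(zalgoLike).length);                    // 11
console.log(Array.from(capCombiningMarks(zalgoLike)).length); // 2
console.log(capCombiningMarks('Vi\u1EC7t') === 'Vi\u1EC7t');  // true: two marks is within the cap
```

Because the cap is applied after NFD, the marks that survive are the first ones in canonical order, not necessarily the first ones typed.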
A combining mark is always optional in the sense that the base remains a valid character without it, and always necessary in the sense that without it the base is a different letter.
The length question
Indexing a string by codepoint and by grapheme cluster gives different answers when combining marks are present. The four common languages disagree on what length means:
| Language | What length returns | "é" precomposed | "é" decomposed |
|---|---|---|---|
| JavaScript .length | UTF-16 code units | 1 | 2 |
| Python len() | Codepoints | 1 | 2 |
| Java .length() | UTF-16 code units | 1 | 2 |
| Swift .count | Grapheme clusters | 1 | 1 |
Swift is the outlier — it counts grapheme clusters by default, so the user-perceived character count is what you get. Everywhere else, code that assumes one character is one unit will produce different results for the same visible string depending on how it was typed. See codepoint, character, glyph, grapheme for the broader picture.
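JavaScript can produce the Swift-style answer with `Intl.Segmenter`. The sketch below assumes a runtime that ships it (recent browsers and Node.js releases do); `graphemeCount` is a hypothetical helper name.

```javascript
// Count user-perceived characters (extended grapheme clusters).
// Assumes a runtime that provides Intl.Segmenter.
function graphemeCount(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(s)).length;
}

const eAcuteDecomposed = 'e\u0301';
console.log(eAcuteDecomposed.length);             // 2 (UTF-16 code units)
console.log(Array.from(eAcuteDecomposed).length); // 2 (codepoints)
console.log(graphemeCount(eAcuteDecomposed));     // 1 (grapheme cluster)
```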
Stripping marks for fuzzy search
A common search requirement is accent-insensitive: café should match cafe. The standard recipe is two steps. First, normalize to NFD to separate every mark from its base. Second, drop everything in the Mark category.
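// JavaScript
// Decompose to NFD, then delete every codepoint in the Mark category (Mn, Mc, Me).
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}
stripDiacritics('café'); // 'cafe'
stripDiacritics('Çığlık'); // 'Cıglık' (dotless ı has no decomposition, so it survives)
stripDiacritics('Việt Nam'); // 'Viet Nam'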
The Unicode property escape \p{M} matches all three mark subcategories. The flag u turns on full Unicode matching. The same idea applies in Python with unicodedata.normalize('NFD', s) followed by filtering unicodedata.category(c).startswith('M').
This is a comparison-only transform. Do not store stripped text: you have thrown away information that may matter (in Turkish, dotted İ and dotless I are different letters, and stripping collapses the first into the second). Compare a stripped copy against a stripped search query; render the original.
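The compare-stripped, render-original rule can be sketched as a filter that builds a throwaway key for each item. `searchKey` and `accentInsensitiveFilter` are hypothetical names, and the lowercasing step is an added assumption for case-insensitive matching that the recipe above does not require.

```javascript
// Hypothetical helper: build a comparison key. Lowercasing is an added
// assumption; drop it if case should matter.
function searchKey(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '').toLowerCase();
}

// Match on the stripped keys, return the untouched originals.
function accentInsensitiveFilter(items, query) {
  const q = searchKey(query);
  return items.filter(item => searchKey(item).includes(q));
}

const places = ['Caf\u00E9 Central', 'Cafeteria', 'Tea House'];
console.log(accentInsensitiveFilter(places, 'cafe'));
// [ 'Café Central', 'Cafeteria' ]
console.log(accentInsensitiveFilter(places, 'CAF\u00C9'));
// [ 'Café Central', 'Cafeteria' ]
```

The stored strings are never modified; the keys exist only for the duration of the comparison.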
Invisible marks and spoofing
Because Mn marks have zero advance width, they can be effectively invisible when a renderer draws no placeholder for a missing base or when their attachment to the base is unobtrusive. An attacker can hide a combining mark inside an identifier (a variable name, an HTTP header, a domain label) and produce a string that compares unequal to its naked form while looking identical. The Trojan Source family of attacks (Boucher and Anderson, 2021) abuses bidirectional formatting characters; the same shape of attack with combining marks is addressed by the security mechanisms in UTS #39, and is one reason IDNA 2008 requires labels to be in NFC and forbids a label from starting with a combining mark, with script-mixing checks left to registries and clients on top of that.
If an identifier system accepts arbitrary Unicode, it should at minimum apply NFKC and the IdentifierStatus filter from UTS #39. Stripping marks alone is not sufficient; some attacks rely on marks that survive NFC but render invisibly.
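A first-pass check along these lines can be sketched with built-ins alone. It is not a substitute for the UTS #39 data, which JavaScript does not expose; `inspectIdentifier` and its field names are hypothetical, and what to do with the result (reject, warn, log) is a policy choice left open here.

```javascript
// Hypothetical pre-check for an identifier. NOT a replacement for the
// UTS #39 identifier data; it only surfaces what NFKC and \p{M} can show.
function inspectIdentifier(id) {
  const normalized = id.normalize('NFKC');
  return {
    normalized,
    changedByNFKC: normalized !== id,
    startsWithMark: /^\p{M}/u.test(normalized),
    // Marks that are still separate codepoints after NFKC.
    survivingMarks: (normalized.match(/\p{M}/gu) || []).length,
  };
}

// U+034F COMBINING GRAPHEME JOINER is a mark that survives normalization
// and normally renders as nothing.
console.log(inspectIdentifier('user\u034Fname').survivingMarks); // 1
console.log(inspectIdentifier('cafe\u0301').survivingMarks);     // 0 (composed into é)
console.log(inspectIdentifier('\u0301abc').startsWithMark);      // true
```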
What to remember
A combining mark is a codepoint that modifies the preceding base. The same accented letter can be one codepoint or two; both are canonically equivalent; normalization picks one shape. Multiple marks on the same base are sorted by canonical combining class: same-class marks keep their order, different-class marks reorder. Code that counts UTF-16 units or codepoints will not agree with users about how long a string is. When you need accent-insensitive matching, NFD-decompose and drop the Mark category; when you need canonical comparison, normalize both sides to the same form (NFC is the usual choice) and compare.
Further reading
- Unicode normalization explained: the four forms that decide when two mark sequences are equal.
- Codepoint, character, glyph, grapheme: why "é".length can be 1 or 2.
- é U+00E9: the precomposed letter from the running example.
- Character inspector: paste a marked string and see every codepoint and its combining class.
- Unicode normalizer: see NFC and NFD side by side for any combining-mark string.
- Marks category: the full inventory of Mn, Mc, and Me codepoints.
- Arabic block: where the harakat live.