Combining diacritical marks in Unicode

An accent in Unicode is not always part of its letter. It can be (é exists as the single codepoint U+00E9), but it can also stand alone, attached to whatever base character precedes it in the stream. That second arrangement is a combining mark, and it is the mechanism by which Unicode covers every accent on every letter in every script without enumerating each combination as its own codepoint. The cost is paid in normalization, indexing, and a handful of subtle bugs that surface when code assumes one character is one codepoint.

What a combining mark is

A combining mark is a codepoint whose General Category begins with M. There are three subcategories:

Mn — Mark, Nonspacing: Zero advance width. The mark renders on top of, below, or through the preceding base character without moving the cursor forward. Most accents, vowel signs, and tone marks live here. U+0301 COMBINING ACUTE ACCENT is Mn.
Mc — Mark, Spacing Combining: Non-zero advance width. The mark occupies horizontal space of its own while still binding semantically to the preceding base. Devanagari vowel signs like U+093E DEVANAGARI VOWEL SIGN AA are Mc.
Me — Mark, Enclosing: The mark surrounds the preceding base character. U+20DD COMBINING ENCLOSING CIRCLE turns any base into a circled glyph; U+20E3 COMBINING ENCLOSING KEYCAP is what gives 1️⃣ its boxed shape.

The defining property of every mark is that it modifies the preceding base character — it has no glyph of its own that stands alone. Renderers that meet a combining mark with no base in front of it are expected to display it over a dotted circle placeholder (◌), and most do.
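As a rough illustration, the three subcategories can be told apart with Unicode property escapes in a regular expression. The helper name markCategory is made up for this sketch, and the answers depend on the Unicode version bundled with the JavaScript engine.

```javascript
// Hypothetical helper: report which mark subcategory a single codepoint
// belongs to, or null if it is not a combining mark at all.
function markCategory(ch) {
  if (/^\p{Mn}$/u.test(ch)) return 'Mn'; // Mark, Nonspacing
  if (/^\p{Mc}$/u.test(ch)) return 'Mc'; // Mark, Spacing Combining
  if (/^\p{Me}$/u.test(ch)) return 'Me'; // Mark, Enclosing
  return null;
}

console.log(markCategory('\u0301')); // COMBINING ACUTE ACCENT
console.log(markCategory('\u093E')); // DEVANAGARI VOWEL SIGN AA
console.log(markCategory('\u20E3')); // COMBINING ENCLOSING KEYCAP
console.log(markCategory('e'));      // a base letter, not a mark
```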

The five blocks

The dedicated marks are distributed across five blocks, plus scattered marks inside individual script blocks for Indic, Arabic, Hebrew, and a dozen others:

U+0300–U+036F: Combining Diacritical Marks. 112 codepoints. The main block, covering Latin, Greek, Cyrillic, and IPA. Includes U+0301 acute, U+0300 grave, U+0302 circumflex, U+0303 tilde, U+0307 dot above, U+0308 diaeresis, U+030A ring above, U+0327 cedilla.
U+1AB0–U+1AFF: Combining Diacritical Marks Extended. Added in Unicode 7.0 for Germanic dialectology and other linguistic notation.
U+1DC0–U+1DFF: Combining Diacritical Marks Supplement. Medievalist and phonetic extensions — combining double letters, archaic marks.
U+20D0–U+20FF: Combining Diacritical Marks for Symbols. Marks that combine with mathematical and currency symbols — arrows above, enclosing circles, enclosing squares, the keycap.
U+FE20–U+FE2F: Combining Half Marks. Halved marks used to draw a single accent across two adjacent base letters.
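The ranges listed above are easy to turn into a lookup. The following is a minimal sketch; the names COMBINING_BLOCKS and combiningBlockOf are invented for illustration. Note that a null result does not mean "not a mark": marks that live inside script blocks fall outside every range in this table.

```javascript
// Block ranges copied from the list above.
const COMBINING_BLOCKS = [
  { start: 0x0300, end: 0x036F, name: 'Combining Diacritical Marks' },
  { start: 0x1AB0, end: 0x1AFF, name: 'Combining Diacritical Marks Extended' },
  { start: 0x1DC0, end: 0x1DFF, name: 'Combining Diacritical Marks Supplement' },
  { start: 0x20D0, end: 0x20FF, name: 'Combining Diacritical Marks for Symbols' },
  { start: 0xFE20, end: 0xFE2F, name: 'Combining Half Marks' },
];

// Hypothetical helper: name the dedicated combining block that contains the
// first codepoint of `ch`, or null if it lies outside all of them.
function combiningBlockOf(ch) {
  const cp = ch.codePointAt(0);
  const block = COMBINING_BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : null;
}

console.log(combiningBlockOf('\u0301')); // Combining Diacritical Marks
console.log(combiningBlockOf('\u20E3')); // Combining Diacritical Marks for Symbols
console.log(combiningBlockOf('\u093E')); // null: a mark, but inside the Devanagari block
```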

The same letter, two spellings

The most consequential property of combining marks is that almost every accented letter has two valid representations: the precomposed single codepoint, and the base-plus-mark sequence.

é   precomposed       U+00E9
é   base + combining  U+0065 U+0301

These two strings are canonically equivalent. Section 3.7 of the Unicode standard requires conforming software to treat them as the same character. They render identically in any well-built font, they compare equal under the Unicode collation algorithm, and they hash to the same value after normalization. They do not, however, have the same byte length, the same codepoint length, or — without normalization — the same equality under simple string comparison.

That equivalence is symmetric. NFC pulls the second form to the first; NFD pushes the first to the second. See the normalization guide for the four forms and when each one is required.
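A short sketch of what this means in practice, using only built-in string methods. The helper names canonicallyEqual and utf8Length are made up for the example; choosing NFC as the comparison form is an assumption, and NFD would work equally well as long as both sides use the same form.

```javascript
const precomposed = '\u00E9';      // é as a single codepoint
const decomposed = '\u0065\u0301'; // e followed by COMBINING ACUTE ACCENT

// Hypothetical helper: compare two strings after bringing both to NFC.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Hypothetical helper: number of bytes the string occupies in UTF-8.
const utf8Length = (s) => new TextEncoder().encode(s).length;

console.log(precomposed === decomposed);                  // false
console.log(canonicallyEqual(precomposed, decomposed));   // true
console.log(utf8Length(precomposed));                     // 2
console.log(utf8Length(decomposed));                      // 3
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true
```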

Worked examples across scripts

Combining marks are not a Latin-only mechanism. Every major script with diacritics uses them.

Glyph	Codepoints	Description
ñ	U+006E U+0303	Latin n + combining tilde.
n͠	U+006E U+0360	Latin n + combining double tilde, drawn across two letters when followed by a second base.
ǘ	U+0075 U+0308 U+0301	Latin u with diaeresis and acute. Vietnamese and pinyin use stacks like this.
ё	U+0435 U+0308	Cyrillic e + combining diaeresis. Also has the precomposed U+0451.
с҃	U+0441 U+0483	Cyrillic es + combining titlo, used in Church Slavonic to mark abbreviations.
اَ	U+0627 U+064E	Arabic alef + fatha (a harakat — short vowel mark).
אֱ	U+05D0 U+05B1	Hebrew alef + hataf segol (a niqqud — vowel point).
nǎ	U+006E U+0061 U+030C	Pinyin n + a + combining caron, the third tone for nǎ.
ế	U+0065 U+0302 U+0301	Vietnamese e with circumflex and acute — two marks on one base.

The Vietnamese, Arabic, and Hebrew cases are the ones to look at carefully. Vietnamese in particular routinely stacks two marks per base: a vowel quality mark plus a tone mark. Precomposed forms exist in Latin Extended Additional (U+1E00–U+1EFF), but normalization to NFD breaks them apart and exposes the underlying sequence.
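To see the underlying sequences for yourself, it helps to print a string as a list of codepoints before and after normalization. This is a small sketch; the helper name codepoints is an invention for the example.

```javascript
// Hypothetical helper: render a string as space-separated U+XXXX labels.
function codepoints(s) {
  return Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

const vietnameseE = '\u1EBF'; // ế, precomposed in Latin Extended Additional

console.log(codepoints(vietnameseE));                  // U+1EBF
console.log(codepoints(vietnameseE.normalize('NFD'))); // U+0065 U+0302 U+0301

// Going the other way: Cyrillic e + combining diaeresis composes to U+0451.
console.log(codepoints('\u0435\u0308'.normalize('NFC'))); // U+0451
```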

Canonical combining class

When more than one mark follows the same base, the order matters for byte equality but should not matter for the visual result. To reconcile the two, every combining mark carries a number called its canonical combining class (ccc), defined in the Unicode Character Database. It is an integer from 0 to 254, of which 240 is the highest value currently assigned. A few of the values:

ccc = 0: Not reordered. Base characters and starter marks. This is the value for any non-combining codepoint and for combining marks whose position in the sequence is significant, so they must never be reordered.
ccc = 1: Overlay.
ccc = 202: Attached below. The class of the cedilla (U+0327) and ogonek (U+0328).
ccc = 218: Below left.
ccc = 220: Below.
ccc = 230: Above. The class of the acute, grave, circumflex, tilde, diaeresis, and most familiar Latin diacritics.
ccc = 232: Above right.
ccc = 240: Iota subscript.

The canonical ordering algorithm (Section 3.11 of the Unicode standard) sorts runs of marks that follow a starter into ascending order by ccc, but is forbidden from swapping two marks that share the same non-zero class — they retain their original relative order. The result is that strings differing only in the order of independent marks have a single canonical form.
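JavaScript does not expose combining classes directly, so the sketch below carries a tiny hand-copied ccc table (an assumption of this example, covering only a few marks) and applies the ordering rule to it: collect each run of non-starters, then sort the run by class with a stable sort so that same-class marks keep their relative order. The names CCC, cccOf, and canonicalOrder are hypothetical. Real code should call normalize, which does this with the full table.

```javascript
// Hypothetical, partial ccc table; unlisted codepoints are treated as ccc 0.
const CCC = new Map([
  [0x0300, 230], [0x0301, 230], [0x0302, 230], [0x0307, 230], [0x0308, 230],
  [0x0323, 220], [0x0327, 202], [0x0345, 240],
]);
const cccOf = (cp) => CCC.get(cp) ?? 0;

// Sort every run of non-starters into ascending ccc order.
function canonicalOrder(s) {
  const out = [];
  let run = [];
  const flush = () => {
    // Array.prototype.sort is stable (ES2019+), so equal classes keep input order.
    run.sort((a, b) => cccOf(a) - cccOf(b));
    out.push(...run);
    run = [];
  };
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (cccOf(cp) === 0) {
      flush();       // a starter ends the current run of marks
      out.push(cp);
    } else {
      run.push(cp);
    }
  }
  flush();
  return String.fromCodePoint(...out);
}

console.log(canonicalOrder('a\u0307\u0323') === 'a\u0323\u0307'); // true: 220 sorts before 230
console.log(canonicalOrder('a\u0301\u0300') === 'a\u0301\u0300'); // true: same class, order kept
```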

A worked ordering

Take the letter a with a dot above and a dot below, written in the order the user types:

Input:        U+0061 U+0307 U+0323
              a      dot above (ccc 230)   dot below (ccc 220)

Reorder by ccc (ascending, stable):
              U+0061 U+0323 U+0307
              a      dot below (ccc 220)   dot above (ccc 230)

The two marks have different non-zero combining classes, so they are reordered. After canonical ordering the byte sequence is fixed regardless of which mark the user typed first. Visually nothing changes; storage now has a single canonical form.

Contrast with two marks of the same class. Two marks above (both ccc 230) keep their input order, because reordering them might change the rendering — the standard reserves that order as a meaningful authoring choice.
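Both halves of this example can be checked against the engine's own normalizer. The variable names are illustrative only.

```javascript
// Different classes (230 and 220): either typing order normalizes to one form.
const aboveThenBelow = 'a\u0307\u0323';
const belowThenAbove = 'a\u0323\u0307';

console.log(aboveThenBelow === belowThenAbove);                  // false
console.log(aboveThenBelow.normalize('NFD') === belowThenAbove); // true
console.log(belowThenAbove.normalize('NFD') === belowThenAbove); // true

// Same class (both 230): input order is preserved, so the strings stay distinct.
const acuteThenGrave = 'a\u0301\u0300';
const graveThenAcute = 'a\u0300\u0301';

console.log(
  acuteThenGrave.normalize('NFD') === graveThenAcute.normalize('NFD')
); // false
```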

Stacking and the limits of fonts

A well-built font for Latin, Greek, or Cyrillic includes mark anchor points and a GPOS table that positions marks correctly above and below the base. Less well-built fonts collide marks into the base or each other. The behaviour is purely a font-level concern; the codepoint sequence is the same either way.

The pathological case is so-called Zalgo text, a real Unicode phenomenon in which dozens of combining marks are stacked on a single base. Each codepoint is legitimate; the sequence is legal Unicode; rendering simply runs out of vertical space. Stripping marks (NFD followed by removing the Mn category) is the standard mitigation.
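Stripping every mark is a blunt tool for user-facing text, because it also removes legitimate accents. A gentler variant, which is a substitute for the full strip described above and not something the standard prescribes, is to cap the number of marks allowed per base. The sketch below assumes a limit of three is enough for ordinary text; the function name clampMarks and the default limit are both inventions for this example.

```javascript
// Hypothetical helper: keep at most `maxMarks` combining marks per base.
// Works on the NFD form so precomposed accents are counted too, then
// recomposes to NFC (so the result may differ in form from the input).
function clampMarks(s, maxMarks = 3) {
  const excess = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, 'gu');
  return s.normalize('NFD').replace(excess, '$1').normalize('NFC');
}

const zalgo = 'a' + '\u0301'.repeat(40); // one base, forty acute accents
const clamped = clampMarks(zalgo, 2);

console.log(Array.from(zalgo).length);                    // 41
console.log(Array.from(clamped.normalize('NFD')).length); // 3
console.log(clampMarks('Vi\u1EC7t Nam') === 'Vi\u1EC7t Nam'); // true: ordinary text is untouched
```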

A combining mark is optional in the sense that the base remains valid text without it, and necessary in the sense that without it the base is a different letter.

The length question

Indexing a string by codepoint and by grapheme cluster gives different answers when combining marks are present. Four common languages disagree on what length means:

Language	What length returns	"é" precomposed	"é" decomposed
JavaScript `.length`	UTF-16 code units	1	2
Python `len()`	Codepoints	1	2
Java `.length()`	UTF-16 code units	1	2
Swift `.count`	Grapheme clusters	1	1

Swift is the outlier — it counts grapheme clusters by default, so the user-perceived character count is what you get. Everywhere else, code that assumes one character is one unit will produce different results for the same visible string depending on how it was typed. See codepoint, character, glyph, grapheme for the broader picture.
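JavaScript can produce all three counts from the table. The sketch below assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js releases do); the function name lengths is made up for the example.

```javascript
// Hypothetical helper: measure one string three different ways.
function lengths(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf16Units: s.length,                              // what .length reports
    codepoints: Array.from(s).length,                  // string iteration is by codepoint
    graphemes: Array.from(segmenter.segment(s)).length // user-perceived characters
  };
}

console.log(lengths('\u00E9'));  // { utf16Units: 1, codepoints: 1, graphemes: 1 }
console.log(lengths('e\u0301')); // { utf16Units: 2, codepoints: 2, graphemes: 1 }
```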

Stripping marks for fuzzy search

A common search requirement is accent-insensitive: café should match cafe. The standard recipe is two steps. First, normalize to NFD to separate every mark from its base. Second, drop everything in the Mark category.

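// JavaScript
function stripDiacritics(s) {
  // NFD splits each accented letter into base + marks; \p{M} then removes the marks.
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

stripDiacritics('café');     // 'cafe'
stripDiacritics('Çığlık');   // 'Cıglık' (dotless ı has no decomposition, so it stays)
stripDiacritics('Việt Nam'); // 'Viet Nam'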

The Unicode property escape \p{M} matches all three mark subcategories. The flag u turns on full Unicode matching. The same idea applies in Python with unicodedata.normalize('NFD', s) followed by filtering unicodedata.category(c).startswith('M').

This is a comparison-only transform. Do not store stripped text: you have thrown away information that may matter (in Turkish, stripping the dot from İ yields I, which is a different letter, not a typo). Compare a stripped copy against a stripped search query; render the original.
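A minimal sketch of that compare-stripped, render-original pattern follows. The function searchIgnoringAccents and the sample data are hypothetical; the match is a plain substring test and is still case-sensitive.

```javascript
function stripDiacritics(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

// Hypothetical helper: match on stripped copies, but return the ORIGINAL
// strings so that nothing is lost in what the user sees.
function searchIgnoringAccents(items, query) {
  const needle = stripDiacritics(query);
  return items.filter((item) => stripDiacritics(item).includes(needle));
}

const menu = ['caf\u00E9 au lait', 'th\u00E9 vert', 'cafe noir', 'chocolat'];

console.log(searchIgnoringAccents(menu, 'cafe'));     // the accented and the plain entry
console.log(searchIgnoringAccents(menu, 'th\u00E9')); // an accented query works too
```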

Invisible marks and spoofing

Because Mn marks have zero advance width, they can be effectively invisible when their attachment to the base is unobtrusive or when the font draws nothing for them. An attacker can hide a combining mark inside an identifier (a variable name, an HTTP header, a domain label) and produce a string that compares unequal to its naked form while looking identical. The Trojan Source family of attacks (Boucher and Anderson, 2021) abuses bidirectional formatting characters; the same shape of attack with combining marks is covered by the security model in UTS #39, and is one reason domain-name and identifier systems layer further checks, such as script-mixing restrictions, on top of normalization.

If an identifier system accepts arbitrary Unicode, it should at minimum apply NFKC and the Identifier_Status filter from UTS #39. Stripping marks alone is not sufficient; some attacks rely on marks that survive NFC but render invisibly.
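As a first line of defense, an identifier can at least be scanned for marks before it is accepted. The sketch below is only a detector: it does not implement the UTS #39 Identifier_Status filter, whose data is not built into JavaScript, and the function name findMarks is hypothetical. The example uses U+034F COMBINING GRAPHEME JOINER, a nonspacing mark with no visible glyph that normalization leaves in place.

```javascript
// Hypothetical helper: list every combining mark in an identifier together
// with its UTF-16 offset, so a reviewer or validator can reject or flag it.
function findMarks(identifier) {
  const hits = [];
  let index = 0;
  for (const ch of identifier) {
    if (/\p{M}/u.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      hits.push({ index, codepoint: 'U+' + hex });
    }
    index += ch.length; // advance by UTF-16 code units
  }
  return hits;
}

const spoofed = 'pay\u034Fpal';

console.log(spoofed === 'paypal');                   // false
console.log(spoofed.normalize('NFKC') === 'paypal'); // false: the mark survives
console.log(findMarks(spoofed));                     // one hit at index 3, U+034F
console.log(findMarks('paypal'));                    // no hits
```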

What to remember

A combining mark is a codepoint that modifies the preceding base. The same accented letter can be one codepoint or two; both are canonically equivalent; normalization picks one shape. Multiple marks on the same base are sorted by canonical combining class: same-class marks keep their order, different-class marks reorder. Code that counts UTF-16 units or codepoints will not agree with users about how long a string is. When you need accent-insensitive matching, NFD-decompose and drop the Mark category; when you need to compare canonically, normalize both sides to NFC and compare.
