The same word can be spelled in Unicode in more than one way, and the user typing it has no idea which spelling their keyboard chose. Normalization is the standardised process of putting a string into a single canonical spelling, so that two strings the user thinks are the same actually compare as equal. The full specification is Unicode Standard Annex #15. The version that matters in code is the one-line decision: which form, NFC, NFD, NFKC, or NFKD?

Two kinds of equivalence

Unicode defines two relations on strings:

Canonical equivalence
Two strings represent the same abstract character. é as one codepoint (U+00E9) and é as two (U+0065 U+0301) are canonically equivalent — the standard requires that conforming software treat them as the same character.
Compatibility equivalence
Two strings represent the same character in a looser sense that may lose formatting distinctions. The superscript digit ² (U+00B2) is compatibility-equivalent to the digit 2 (U+0032). The full-width Latin letter Ａ (U+FF21) is compatibility-equivalent to A (U+0041). The decomposition discards visual distinctions deliberately.

The four normalization forms are the cross product of two choices: which equivalence to apply, and whether the result is composed or decomposed:

Form   Equivalence     Result shape   Use case
NFD    Canonical       Decomposed     Per-character analysis, accent stripping.
NFC    Canonical       Composed       Storage, interchange, transmission. The W3C default.
NFKD   Compatibility   Decomposed     Search indexes, fuzzy match.
NFKC   Compatibility   Composed       Identifier comparison, login systems.
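
The two equivalences can be observed directly. The following is a minimal sketch using Python's standard unicodedata module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # é as one codepoint
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# A raw comparison works codepoint by codepoint, so these differ.
raw_equal = composed == decomposed

# Canonically equivalent strings become identical once both are in NFC.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Compatibility-equivalent characters are only unified by the K forms.
superscript_nfc = unicodedata.normalize("NFC", "\u00b2")    # still U+00B2
superscript_nfkc = unicodedata.normalize("NFKC", "\u00b2")  # plain digit 2
fullwidth_nfkc = unicodedata.normalize("NFKC", "\uff21")    # plain letter A

print(raw_equal, nfc_equal, superscript_nfc == "2", superscript_nfkc == "2")
```

The canonical forms leave the superscript alone; only the compatibility forms fold it to the ordinary digit.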

The four forms on café

Start with the string café typed in the worst possible way: the e followed by a combining acute accent.

Input             Codepoints
café (as typed)   U+0063 U+0061 U+0066 U+0065 U+0301

Form   Codepoints out                       Length
NFD    U+0063 U+0061 U+0066 U+0065 U+0301   5
NFC    U+0063 U+0061 U+0066 U+00E9          4
NFKD   U+0063 U+0061 U+0066 U+0065 U+0301   5
NFKC   U+0063 U+0061 U+0066 U+00E9          4

For text like this, with no compatibility characters, NFC and NFKC produce identical results, as do NFD and NFKD. The compatibility forms differ from the canonical forms only when a character has a compatibility decomposition. The classic examples follow.
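
The café table can be reproduced with a few lines of Python. This is a sketch, and the helper name codepoints is an assumption made for illustration.

```python
import unicodedata

def codepoints(s: str) -> str:
    """Render a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# "café" typed as e followed by a combining acute accent.
typed = "cafe\u0301"

forms = {form: unicodedata.normalize(form, typed)
         for form in ("NFD", "NFC", "NFKD", "NFKC")}

for form, out in forms.items():
    print(f"{form:5}{codepoints(out)}  length {len(out)}")
```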

The four forms on ﬃ

U+FB03 is the Latin small ligature ffi, a single codepoint for the historic typographic ligature. Its canonical decomposition is empty — there is no canonically equivalent multi-codepoint form. Its compatibility decomposition is three separate letters.

Form    Result
Input   ﬃ (U+FB03)
NFD     ﬃ (U+FB03), unchanged
NFC     ﬃ (U+FB03), unchanged
NFKD    ffi (U+0066 U+0066 U+0069)
NFKC    ffi (U+0066 U+0066 U+0069)

The compatibility forms expand the ligature into three separate letters, which is what you want if you are searching for office inside a document where someone has typed oﬃce. NFKC-based mapping is also what IDNA 2003 (Nameprep) and UTS #46 apply to internationalised domain names, partly to prevent ligatures from being used as visual disguises for ASCII; IDNA 2008 itself requires labels to be in NFC and disallows such compatibility characters outright.
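
A short sketch of the search case, assuming a plain substring search and a made-up sample sentence:

```python
import unicodedata

# The word "office" written with the U+FB03 ligature in the middle.
document = "Send it to the o\ufb03ce by noon."
query = "office"

# Neither the raw text nor its NFC form contains the three letters f, f, i.
found_raw = query in document
found_nfc = query in unicodedata.normalize("NFC", document)

# Normalizing both sides to NFKC expands the ligature, so the search matches.
found_nfkc = (unicodedata.normalize("NFKC", query)
              in unicodedata.normalize("NFKC", document))

print(found_raw, found_nfc, found_nfkc)
```

Normalizing the query as well as the document keeps the comparison symmetric, even though this particular query is already plain ASCII.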

The four forms on ½

U+00BD VULGAR FRACTION ONE HALF behaves similarly. The canonical forms preserve it; the compatibility forms decompose it into digit, fraction slash, digit:

Form          Result
Input         ½ (U+00BD)
NFD / NFC     ½ (U+00BD), unchanged
NFKD / NFKC   1⁄2 (U+0031 U+2044 U+0032)

Note that NFKC does not produce the ASCII string 1/2: the slash U+2044 is FRACTION SLASH, not the ASCII solidus U+002F. NFKC applies each character's compatibility decomposition exactly as defined and then recomposes canonically; it removes formatting distinctions but does not go on to fold the result into ASCII. (The compatibility decomposition is defined per character in the Unicode Character Database and is not user-tailorable.)
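
The fraction slash is easy to miss on screen, so it is worth checking in code. A minimal sketch follows; the final replace step is an application-level choice shown for illustration and is not part of any normalization form.

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

nfc = unicodedata.normalize("NFC", half)
nfkc = unicodedata.normalize("NFKC", half)
nfkc_codepoints = [f"U+{ord(ch):04X}" for ch in nfkc]

# The result looks like "1/2" but is not equal to the ASCII string.
equals_ascii = nfkc == "1/2"

# If an application really wants ASCII, it has to map U+2044 itself.
ascii_like = nfkc.replace("\u2044", "/")

print(nfkc_codepoints, equals_ascii, ascii_like)
```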

Other revealing decompositions

A small gallery of cases where NFKC differs from NFC:

Input                        Codepoint   NFKC   Codepoints out
Ａ (full-width A)             U+FF21      A      U+0041
² (superscript 2)            U+00B2      2      U+0032
ｶ (half-width katakana KA)   U+FF76      カ     U+30AB
℡ (telephone sign)           U+2121      TEL    U+0054 U+0045 U+004C
㎏ (square kg)               U+338F      kg     U+006B U+0067
𝐀 (mathematical bold A)      U+1D400     A      U+0041

The mathematical alphanumeric symbols in particular (every styled letter and digit from U+1D400 to U+1D7FF) decompose to their plain counterparts under NFKC: ASCII for the Latin letters and digits, ordinary Greek for the Greek letters. This is why a user-name field that runs NFKC will see 𝐀𝐝𝐦𝐢𝐧 and Admin as identical.
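
The styled user name can be checked the same way. In this sketch the string is spelled with escapes so the codepoints are unambiguous.

```python
import unicodedata

# "Admin" written in MATHEMATICAL BOLD letters (U+1D400 block).
styled = "\U0001D400\U0001D41D\U0001D426\U0001D422\U0001D427"
plain = "Admin"

# NFC leaves the styled letters untouched; NFKC folds them to ASCII.
same_under_nfc = unicodedata.normalize("NFC", styled) == plain
same_under_nfkc = unicodedata.normalize("NFKC", styled) == plain

# Styled digits fold too: MATHEMATICAL BOLD DIGIT ZERO becomes "0".
bold_zero = unicodedata.normalize("NFKC", "\U0001D7CE")

print(same_under_nfc, same_under_nfkc, bold_zero)
```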

When to normalize

Store and transmit
Use NFC. It is the form the W3C and the IETF recommend for web content and protocol identifiers. Browsers do not normalize HTML automatically; the W3C Character Model document recommends that authoring tools save content as NFC. macOS's HFS+ file system notoriously stores filenames in a variant of NFD, which is the source of many cross-platform bugs: a file named café.txt created on macOS may not match the same name on Linux, where file names are compared byte for byte and are usually typed as NFC.
Compare and search
Use NFC at minimum on both sides. Use NFKC if you want compatibility-equivalent strings to match (full-width vs half-width, ligatures vs letters, styled vs unstyled).
Login systems
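Apply NFKC plus case-folding (Unicode's case-insensitive comparison, not ASCII tolower). Standard profiles differ in the details: the PRECIS UsernameCaseMapped profile (RFC 8265) maps width variants, lowercases, normalizes to NFC and rejects most other compatibility characters rather than folding them, so follow the profile exactly if you claim conformance to it.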
Password fields
RFC 8265 (PRECIS OpaqueString) prescribes NFC and a specific allowed-character profile. Do not case-fold passwords.
Accent-insensitive search
Apply NFD, then strip combining marks (codepoints in the Mn category). "café" NFD-decomposed becomes "cafe" + combining acute, after which removing combining marks leaves "cafe".
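
The accent-insensitive recipe above can be sketched as follows. The function name strip_accents is hypothetical, and the final NFC pass is an added assumption so that any remaining sequences are stored in composed form.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose, drop nonspacing marks (category Mn), then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)

print(strip_accents("caf\u00e9"))       # composed input
print(strip_accents("cafe\u0301"))      # decomposed input
print(strip_accents("\u00c5ngstr\u00f6m"))
```

This is lossy by design, so it suits a search key stored alongside the original text, not a replacement for it.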

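Most languages provide normalization in the standard library: JavaScript's String.prototype.normalize(form), Python's unicodedata.normalize(form, s), Java's java.text.Normalizer.normalize(s, form), and in Swift the Foundation string properties such as precomposedStringWithCanonicalMapping (NFC). In JavaScript and Python the form argument is the literal string "NFC", "NFD", "NFKC", or "NFKD"; Java takes the corresponding Normalizer.Form enum constant.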
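
As an illustration of the Python call, here is a rough comparison key for login names built from NFKC and Unicode case folding. The name identifier_key is hypothetical, and this sketch is not an implementation of PRECIS or IDNA; a real system should also apply the allowed-character rules of whichever profile it adopts.

```python
import unicodedata

def identifier_key(name: str) -> str:
    """Build a comparison key: NFKC, case fold, then NFKC again.

    The second pass is there because case folding can produce text
    that is no longer in normalized form.
    """
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

styled_admin = "\U0001D400\U0001D41D\U0001D426\U0001D422\U0001D427"
print(identifier_key(styled_admin))
print(identifier_key("Stra\u00dfe"), identifier_key("STRASSE"))
```

Store the name the user typed for display and use the key only for uniqueness checks and lookups.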

The IDN homograph attack

The most cited security consequence of skipping normalization is the internationalised domain name homograph attack. Consider these two strings:

"apple.com"   ASCII Latin letters    U+0061 U+0070 ...
"аpple.com"   first letter is Cyrillic а  (U+0430 U+0070 ...)

The Cyrillic small letter A (U+0430) is visually indistinguishable from the Latin small letter A (U+0061) in most fonts, but they are distinct codepoints. Neither NFC nor NFKC turns one into the other: they have no shared decomposition. Defending against this takes additional checks layered on top of normalization, applied by registries and browsers rather than by the IDNA 2008 protocol itself: a label that mixes scripts in a suspicious way is rejected or shown in Punycode, and a label that is confusable with another can be flagged using the confusables data file from UTS #39, the Unicode Security Mechanisms document.
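
The sketch below shows that normalization leaves the lookalike untouched, followed by a deliberately crude mixed-script check. Python's standard library does not expose the Unicode Script property, so the hypothetical helper rough_scripts uses the first word of each character name as a stand-in; this is an assumption for illustration and not the UTS #39 algorithm.

```python
import unicodedata

latin = "apple"
spoof = "\u0430pple"  # first letter is CYRILLIC SMALL LETTER A

# No normalization form maps the Cyrillic letter onto the Latin one.
still_different = all(
    unicodedata.normalize(form, spoof) != latin
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

def rough_scripts(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

print(still_different, rough_scripts(latin), rough_scripts(spoof))
```

A production check would use real script data and the confusables tables, and would allow legitimate script combinations that this heuristic would flag.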

Modern browsers fall back to displaying an IDN in Punycode (the all-ASCII form starting with xn--) when a label fails their display policy, for example when it mixes scripts in a suspicious way. The malicious example above is displayed as xn--pple-43d.com in Chrome and Firefox, not as a clickable lookalike.
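
The Punycode form of the spoofed name can be computed with Python's built-in codecs. Note as an assumption of this sketch that the built-in idna codec implements the older IDNA 2003 rules, which is sufficient for seeing the ASCII form here but is not an IDNA 2008 implementation.

```python
spoof_label = "\u0430pple"          # Cyrillic а followed by Latin "pple"
spoof_domain = spoof_label + ".com"

# Raw Punycode for the single label, without the xn-- prefix.
label_punycode = spoof_label.encode("punycode").decode("ascii")

# The idna codec encodes each label and adds the xn-- prefix where needed.
ascii_domain = spoof_domain.encode("idna").decode("ascii")

print(label_punycode)
print(ascii_domain)
```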

Normalization is not a security feature by itself. It is the precondition for comparing strings safely. The actual rules for what is allowed in identifiers — scripts, confusables, mixed-script labels — live in UTS #39 and IDNA 2008, layered on top.

What to remember

If you take only one rule from this page: store text as NFC and compare it as NFC. If your system has identifiers (usernames, filenames, hostnames), apply NFKC plus the appropriate identifier profile from PRECIS or IDNA 2008. Run the normalizer on any string you are not sure about before storing it.

Further reading