Surrogate pairs in UTF-16

From Unicode 1.0 in 1991 until Unicode 2.0 in 1996, Unicode was a purely 16-bit design. Every character had a number between 0 and 65,535, and UCS-2 stored that number as a single 16-bit unit. When the standard outgrew that range in 1996, UTF-16 was retrofitted with a way to address codepoints up to U+10FFFF without breaking any existing software that read 16-bit units. The fix was the surrogate pair. Thirty years later it still defines what "😀".length returns in JavaScript, why Java's charAt can deliver half an emoji, and why some valid in-memory strings cannot be saved as UTF-8.

Why the pair exists

Unicode 1.0 in 1991 fit comfortably inside 16 bits. The original encoding, UCS-2, simply stored each codepoint as a 16-bit unsigned integer. Windows NT, Java, JavaScript, and macOS were all designed around that assumption — strings were arrays of 16-bit code units, indexable by integer position, and every character was exactly one unit. Unicode 2.0 in 1996 broke the assumption. The standard expanded to address 17 planes of 65,536 codepoints each, a total of 1,114,112 possible codepoints. UCS-2 could not represent anything above U+FFFF.

The solution was to reserve a slice of the 16-bit range as an escape mechanism. The codepoints U+D800–U+DFFF — 2,048 in total — were withdrawn from the assignable pool and dedicated to encoding the new supplementary planes inside the old 16-bit container. UCS-2 plus this escape mechanism became UTF-16. Old software that did not understand the escape would see two unfamiliar 16-bit units; new software would recognise them as a single supplementary codepoint.

The 2,048 surrogate codepoints are permanently unassigned in Unicode. They are not characters. They exist only inside UTF-16 streams as encoding scaffolding. UTF-8 and UTF-32 do not use them and explicitly forbid them as content.

High and low surrogates

The 2,048 codepoints split into two equal ranges:

U+D800–U+DBFF: High surrogates (also called leading surrogates). 1,024 codepoints. The first unit of a pair.
U+DC00–U+DFFF: Low surrogates (also called trailing surrogates). 1,024 codepoints. The second unit of a pair.

The two ranges are deliberately disjoint, so a UTF-16 decoder can tell from any single unit which role it plays. If you see a unit in the U+D800–U+DBFF range, the next unit must be a low surrogate; if you see a low surrogate without a preceding high surrogate, the data is malformed.

The split is what gives UTF-16 its self-synchronization property. A decoder that drops into the middle of a stream needs to read at most one unit forward or one unit back to find the next codepoint boundary. The 16-bit fixed unit size is what UTF-16 keeps; the variable codepoint length is what UTF-8 and UTF-16 share.
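
A minimal sketch of that resynchronization step, assuming the input is a JavaScript string. The helper names isHighSurrogate, isLowSurrogate, and codePointStart are hypothetical and chosen for illustration only.

```javascript
// Hypothetical helpers: the names are illustrative, not from any library.
const isHighSurrogate = (u) => u >= 0xD800 && u <= 0xDBFF;
const isLowSurrogate = (u) => u >= 0xDC00 && u <= 0xDFFF;

// Given any code unit index i in s, return the index where the codepoint
// containing that unit starts. At most one step back is ever needed.
function codePointStart(s, i) {
  const inMiddleOfPair =
    i > 0 &&
    isLowSurrogate(s.charCodeAt(i)) &&
    isHighSurrogate(s.charCodeAt(i - 1));
  return inMiddleOfPair ? i - 1 : i;
}

const text = 'a\u{1F600}b'; // units: 0x0061, 0xD83D, 0xDE00, 0x0062
console.log(codePointStart(text, 2)); // 1 (index 2 is the low half of the pair)
console.log(codePointStart(text, 3)); // 3 (already on a boundary)
```

Because the two ranges never overlap, the check needs no state beyond the unit at i and its immediate neighbour.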

The algorithm

To encode a supplementary codepoint cp (a value in the range U+10000 to U+10FFFF) as a surrogate pair:

1.  Subtract 0x10000 from cp, leaving a 20-bit value v.
2.  Split v into a high 10 bits (vh) and a low 10 bits (vl).
3.  high surrogate = 0xD800 | vh
4.  low  surrogate = 0xDC00 | vl

Subtracting 0x10000 lands the value in the range 0 to 0xFFFFF, which fits in 20 bits. The 10 high bits go into the lower 10 bits of the high surrogate (its top 6 bits are 110110, mask 0xD800). The 10 low bits go into the lower 10 bits of the low surrogate (its top 6 bits are 110111, mask 0xDC00). The two units together carry exactly 20 bits of payload, and 0x10000 plus a 20-bit value covers the supplementary range exactly.
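
The four steps can be sketched in a few lines of JavaScript. The function name toSurrogatePair is an assumption made for this example; in practice String.fromCodePoint performs the same conversion.

```javascript
// Hypothetical helper: encode one supplementary codepoint as [high, low].
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('expected a codepoint in U+10000..U+10FFFF');
  }
  const v = cp - 0x10000;  // step 1: 20-bit value
  const vh = v >> 10;      // step 2: high 10 bits
  const vl = v & 0x3FF;    //         low 10 bits
  return [0xD800 | vh, 0xDC00 | vl]; // steps 3 and 4
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// Cross-check against the built-in conversion.
const builtin = String.fromCodePoint(0x1F600);
console.log(builtin.charCodeAt(0) === high && builtin.charCodeAt(1) === low); // true
```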

A worked example: 😀

U+1F600 GRINNING FACE is in the Supplementary Multilingual Plane. The conversion:

codepoint        = 0x1F600
subtract 0x10000 = 0x0F600
                 = 0000 1111 0110 0000 0000  (20 bits)

high 10 bits (vh) = 0000 1111 01           = 0x3D  (61)
low  10 bits (vl) = 10 0000 0000           = 0x200 (512)

high surrogate    = 0xD800 | 0x3D          = 0xD83D
low  surrogate    = 0xDC00 | 0x200         = 0xDE00

UTF-16:             D83D DE00

The same calculation for five more characters that span the supplementary planes:

Character	Codepoint	v = cp - 0x10000	High	Low
𝓐	U+1D4D0	0x0D4D0	D835	DCD0
𓀀	U+13000	0x03000	D80C	DC00
🦄	U+1F984	0x0F984	D83E	DD84
𠮷	U+20BB7	0x10BB7	D842	DFB7
󠄀	U+E0100	0xD0100	DB40	DD00

The mathematical bold script letter U+1D4D0, the Egyptian hieroglyph U+13000, the unicorn-face emoji, the CJK Extension B character U+20BB7 (a variant of 吉 used in the Japanese surname Yoshida), and U+E0100 from the Variation Selectors Supplement are all reached by the same arithmetic. Anything above U+FFFF requires the pair; anything at or below it is a single 16-bit unit.

To decode, the inverse:

cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
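
As a rough sketch, the inverse can be wrapped in a function that also checks that each unit is in its expected range. The name fromSurrogatePair is hypothetical; String.prototype.codePointAt does the same job on a real string.

```javascript
// Hypothetical helper: decode a [high, low] pair back to a codepoint.
function fromSurrogatePair(high, low) {
  if (high < 0xD800 || high > 0xDBFF) {
    throw new RangeError('first unit is not a high surrogate');
  }
  if (low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError('second unit is not a low surrogate');
  }
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

console.log(fromSurrogatePair(0xD83D, 0xDE00).toString(16)); // 1f600
console.log(fromSurrogatePair(0xD842, 0xDFB7).toString(16)); // 20bb7
```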

JavaScript and the divided strings

JavaScript strings are sequences of UTF-16 code units, not codepoints. Everything in the standard library that exposes a length or an index works in 16-bit units. For BMP-only strings this is invisible; for strings containing supplementary characters it leaks the surrogate machinery into the API.

const s = '😀';
s.length;                  // 2  — two UTF-16 code units
[...s].length;             // 1  — iterator yields one codepoint
s.charCodeAt(0);           // 55357 (0xD83D) — high surrogate
s.charCodeAt(1);           // 56832 (0xDE00) — low surrogate
s.codePointAt(0);          // 128512 (0x1F600) — full codepoint
s[0];                      // a lone half — the high surrogate alone
s.charAt(0);               // same lone half

The two ways to iterate a string by codepoint rather than code unit are the iterator (for...of, the spread operator, Array.from) and codePointAt. The methods inherited from ECMAScript 3 — charCodeAt, charAt, indexing by number — work on code units. Anything that splits or slices by index can cut a surrogate pair in half.
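
One way to walk a string by codepoint while still tracking code unit offsets is to step with codePointAt. This is a sketch only; the generator name codePointsWithIndex is an assumption for the example.

```javascript
// Hypothetical generator: yields each codepoint with its code unit index.
function* codePointsWithIndex(s) {
  for (let i = 0; i < s.length; ) {
    const cp = s.codePointAt(i);
    yield { index: i, codePoint: cp };
    // A supplementary codepoint occupies two units; everything else one.
    i += cp > 0xFFFF ? 2 : 1;
  }
}

for (const { index, codePoint } of codePointsWithIndex('a\u{1F600}b')) {
  console.log(index, 'U+' + codePoint.toString(16).toUpperCase());
}
// 0 U+61
// 1 U+1F600
// 3 U+62
```

A lone surrogate is yielded as its own codepoint and advances by one unit, which matches what codePointAt returns for it.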

Java has the same architecture and the same pitfalls. String.length() returns the UTF-16 code unit count; charAt(int) returns a single char, which is a 16-bit unit; codePointAt(int), codePointCount, and codePoints() are the codepoint-aware counterparts. String.chars() is a stream of UTF-16 units, not codepoints: although it arrived in Java 8 alongside codePoints(), it follows the char type and yields one value per 16-bit unit.

Lone surrogates

A surrogate unit on its own — without its required partner — is invalid UTF-16. The high surrogate U+D83D not followed by a low surrogate is incomplete; the low surrogate U+DE00 not preceded by a high surrogate is unpaired. Either case is malformed data that no encoder may emit.

JavaScript permits lone surrogates in its in-memory strings. ECMAScript describes a string of this kind as not well-formed Unicode; a well-formed string is one whose code units decode entirely to Unicode scalar values, where USV (Unicode Scalar Value) is the official term for a codepoint that is not a surrogate. The language is happy to store a string containing only "\uD83D" as a one-unit value. The trouble starts at the boundary: any attempt to encode that string as UTF-8 either rejects it or quietly replaces the lone unit with U+FFFD REPLACEMENT CHARACTER, because UTF-8 has no representation for surrogates.

const lonely = '\uD83D';      // valid JavaScript string, one code unit
new TextEncoder().encode(lonely);
// Uint8Array(3) [ 0xEF, 0xBF, 0xBD ]  // U+FFFD, the replacement character

JSON.stringify(lonely);
// '"\\ud83d"'  // serialised as an escape, not as bytes

new Blob([lonely]);
// Blob contents are EF BF BD: string parts go through the same lossy UTF-8 step

This becomes a real problem when a string-processing pipeline cuts a string at an arbitrary index. Splitting '😀' at position 1 yields two single-unit strings, each a lone surrogate. Saving them via fetch, Blob, fs.writeFile, or any other UTF-8-on-the-wire path will lose the emoji and replace it with two U+FFFDs.
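
A minimal sketch of a truncation helper that backs off by one unit when the cut would land inside a pair. The name truncateUnits is hypothetical, and the policy of dropping the whole character rather than exceeding the limit is an assumption; other callers may prefer to keep it.

```javascript
// Hypothetical helper: cut s to at most maxUnits code units without
// leaving a lone high surrogate at the end of the result.
function truncateUnits(s, maxUnits) {
  if (s.length <= maxUnits) return s;
  let end = maxUnits;
  const before = s.charCodeAt(end - 1); // last unit that would be kept
  const after = s.charCodeAt(end);      // first unit that would be dropped
  const splitsPair =
    end > 0 &&
    before >= 0xD800 && before <= 0xDBFF &&
    after >= 0xDC00 && after <= 0xDFFF;
  if (splitsPair) end -= 1; // drop the whole pair instead of half of it
  return s.slice(0, end);
}

const input = 'ab\u{1F600}';
console.log(input.slice(0, 3).charCodeAt(2).toString(16)); // d83d (lone half)
console.log(truncateUnits(input, 3)); // ab
console.log(truncateUnits(input, 4) === input); // true
```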

WTF-8

The pragmatic patch for environments that need to round-trip JavaScript-style strings through UTF-8 is WTF-8 (Wobbly Transformation Format-8), defined in a 2014 specification by Simon Sapin. WTF-8 is identical to UTF-8 for any valid string, but it additionally permits the encoding of unpaired surrogate codepoints as three-byte sequences using the same UTF-8 rules. This is technically illegal UTF-8, but it lets Rust's OsString on Windows, the Filesystem API in Web platform specs, and a few other interfaces preserve filenames that the underlying OS exposes as ill-formed UTF-16.

WTF-8 is for internal interfaces only. Do not write it to anything that calls itself UTF-8 — RFC 3629 explicitly forbids the encoding of surrogate codepoints, and conforming UTF-8 decoders are required to reject them.
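
To make the byte layout concrete, here is a sketch that applies the ordinary three-byte UTF-8 bit pattern to a lone surrogate, which is what WTF-8 permits, and then shows a strict decoder refusing the result. The function names are hypothetical, and this is not a full WTF-8 encoder.

```javascript
// Hypothetical helper: three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx
// applied to a surrogate codepoint. The output is NOT valid UTF-8.
function encodeLoneSurrogateWtf8(unit) {
  if (unit < 0xD800 || unit > 0xDFFF) {
    throw new RangeError('expected a surrogate code unit');
  }
  return [
    0xE0 | (unit >> 12),
    0x80 | ((unit >> 6) & 0x3F),
    0x80 | (unit & 0x3F),
  ];
}

// Hypothetical helper: true if a strict (fatal) UTF-8 decoder accepts bytes.
function strictUtf8Accepts(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

const bytes = encodeLoneSurrogateWtf8(0xD83D);
console.log(bytes.map((b) => b.toString(16)).join(' ')); // ed a0 bd
console.log(strictUtf8Accepts(bytes)); // false
```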

Where UTF-8 and UTF-32 sidestep the problem

UTF-8 reaches the full Unicode range directly. A supplementary codepoint encodes as four bytes in UTF-8 via the standard variable-length pattern; there is no surrogate-pair step. The codepoint U+1F600 is the byte sequence F0 9F 98 80, computed by spreading the codepoint's bits across the 21 payload bits of a four-byte sequence (3 + 6 + 6 + 6) with the appropriate prefix bits, with no detour through D800–DFFF. Because UTF-8 builds straight from the codepoint, it has no need for surrogates, and its specification forbids encoding them.
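
The four-byte layout (3 + 6 + 6 + 6 payload bits behind the prefixes 11110, 10, 10, 10) can be sketched directly. The function name utf8FourBytes is an assumption for this example; TextEncoder is the real API and is used here only as a cross-check.

```javascript
// Hypothetical helper: encode one supplementary codepoint as UTF-8.
function utf8FourBytes(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('expected a codepoint in U+10000..U+10FFFF');
  }
  return [
    0xF0 | (cp >> 18),          // 11110xxx: top 3 bits
    0x80 | ((cp >> 12) & 0x3F), // 10xxxxxx: next 6 bits
    0x80 | ((cp >> 6) & 0x3F),  // 10xxxxxx: next 6 bits
    0x80 | (cp & 0x3F),         // 10xxxxxx: low 6 bits
  ];
}

const manual = utf8FourBytes(0x1F600);
console.log(manual.map((b) => b.toString(16)).join(' ')); // f0 9f 98 80

const builtin = Array.from(new TextEncoder().encode('\u{1F600}'));
console.log(manual.join() === builtin.join()); // true
```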

UTF-32 sidesteps the problem the other way: every codepoint, supplementary or not, gets a full 32 bits. There is no variable-length anywhere. The codepoint U+1F600 is the four bytes 00 01 F6 00 in big-endian. The surrogate range is simply unused. For O(1) indexing by codepoint, UTF-32 remains the only encoding that delivers it without ambiguity. The cost is four bytes per character even for ASCII, which is why UTF-32 lives in memory and not on the wire. See the comparison guide for the full trade-off.
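
For comparison, a sketch of big-endian UTF-32 output using a DataView. The function name utf32be is hypothetical, and the example assumes a well-formed input string.

```javascript
// Hypothetical helper: encode a string as big-endian UTF-32 bytes.
function utf32be(s) {
  // Array.from iterates by codepoint, so each element is one character.
  const codePoints = Array.from(s, (ch) => ch.codePointAt(0));
  const view = new DataView(new ArrayBuffer(codePoints.length * 4));
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, false)); // false = big-endian
  return new Uint8Array(view.buffer);
}

const out = utf32be('\u{1F600}');
console.log(Array.from(out, (b) => b.toString(16).padStart(2, '0')).join(' '));
// 00 01 f6 00
```

Every codepoint lands at a byte offset that is a multiple of four, which is what makes indexing by codepoint a constant-time operation.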

UTF-16 carries an artefact of its history forever. The surrogate range is the place Unicode keeps the cost of having once been a 16-bit standard.

Why this still matters

Three of the most widely deployed runtimes (JavaScript engines, the JVM, and the .NET CLR) model strings as sequences of UTF-16 code units. So do Windows APIs and large parts of macOS. Every string operation in these environments is a UTF-16 operation, and every supplementary character that passes through them is a surrogate pair. The pair leaks into every API that exposes length, position, slice, or comparison. Code written by someone unaware of the mechanism works on Latin text and corrupts data wherever a user types an emoji.

The mitigations are small but specific. Iterate by codepoint (for...of, codePoints()) when you need codepoint semantics. Iterate by grapheme cluster (Intl.Segmenter, ICU's BreakIterator) when you need user-perceived characters. Never substring a string at an arbitrary index without checking whether you are about to split a pair. Treat lone surrogates as data corruption when you see them at I/O boundaries.
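
The three levels of counting can be compared side by side. This sketch assumes a runtime that ships Intl.Segmenter; the function name countGraphemes and the default locale 'en' are assumptions made for the example.

```javascript
// Hypothetical helper: count user-perceived characters.
function countGraphemes(s, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(s)) count++;
  return count;
}

// Man + ZWJ + woman + ZWJ + girl: one family emoji on screen.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(family.length);          // 8 code units
console.log([...family].length);     // 5 codepoints
console.log(countGraphemes(family)); // 1 grapheme cluster
```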

An easy way to detect a surrogate: a code unit u is one exactly when u >= 0xD800 && u <= 0xDFFF. A surrogate is lone when it is a high surrogate not immediately followed by a low surrogate, or a low surrogate not immediately preceded by a high one. Checking both conditions catches every UTF-16 well-formedness error in a single linear pass.
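
A sketch of that linear pass, returning the index of the first offending unit. The function name findLoneSurrogate is hypothetical; newer engines also provide String.prototype.isWellFormed, which answers the yes/no version of the same question.

```javascript
// Hypothetical helper: index of the first lone surrogate in s, or -1.
function findLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      const next = s.charCodeAt(i + 1); // NaN past the end, so the test fails
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return i;
    }
    if (u >= 0xDC00 && u <= 0xDFFF) {
      // Low surrogate reached without a preceding high surrogate.
      return i;
    }
  }
  return -1;
}

console.log(findLoneSurrogate('ok \u{1F600}')); // -1
console.log(findLoneSurrogate('a\uDE00'));      // 1
console.log(findLoneSurrogate('\uD83D'));       // 0
```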

What to remember

UTF-16 represents supplementary codepoints as two 16-bit units drawn from a reserved range. The arithmetic is fixed: subtract 0x10000, split into 10+10 bits, OR with 0xD800 and 0xDC00. The surrogate codepoints are not characters and never will be. Languages that expose UTF-16 directly — JavaScript, Java, C# — leak the pair into their string APIs, and code that indexes by unit can split a character in half. UTF-8 and UTF-32 sidestep the whole apparatus by reaching the supplementary planes through their own native mechanisms.
