UTF-8, UTF-16, UTF-32 compared

A codepoint is an integer between zero and 10FFFF hex. There are three official ways to write that integer as bytes, and the choice between them quietly determines whether your strings index in constant time, whether they survive a careless copy through an ASCII-only system, and whether they need a byte-order mark at the start. This guide lays out exactly what each encoding does, with the same five characters traced through each.

The five characters

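To keep the comparison honest, the same five characters are used throughout. Between them they cover every encoded length: one ASCII character, one from the Latin-1 range, one from the middle of the Basic Multilingual Plane (BMP, U+0000–U+FFFF), and two from the Supplementary Multilingual Plane (SMP, U+10000–U+1FFFF).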

U+0041: LATIN CAPITAL LETTER A. ASCII range, BMP.
U+00A9: COPYRIGHT SIGN — ©. Latin-1 range, BMP.
U+20AC: EURO SIGN — €. Mid-BMP, three bytes in UTF-8.
U+1D4D0: MATHEMATICAL BOLD SCRIPT CAPITAL A — 𝓐. SMP, requires a surrogate pair.
U+1F30D: EARTH GLOBE EUROPE-AFRICA — 🌍. SMP, four bytes in UTF-8.

UTF-8

UTF-8, designed by Ken Thompson and Rob Pike in September 1992 on a placemat at a New Jersey diner, is a variable-length encoding that uses one to four bytes per codepoint. The first 128 codepoints (U+0000–U+007F) are encoded as a single byte identical to ASCII. Higher codepoints use a continuation pattern: the leading byte carries length information in its top bits, and each continuation byte begins with 10.

Bit pattern (UTF-8):
  1 byte:    0xxxxxxx
  2 bytes:   110xxxxx 10xxxxxx
  3 bytes:   1110xxxx 10xxxxxx 10xxxxxx
  4 bytes:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The top bits of every byte tell the decoder where it stands. A byte starting with 0 is a complete ASCII character. A byte starting with 10 is a continuation byte. A byte starting with 11 is the start of a multi-byte sequence, and the number of leading 1s tells you how many bytes it spans. This property is called self-synchronization: drop a UTF-8 reader into the middle of a stream and it can find the next character boundary by skipping at most three continuation bytes.

Character     Codepoint    UTF-8 bytes (hex)
A             U+0041       41
©             U+00A9       C2 A9
€             U+20AC       E2 82 AC
𝓐             U+1D4D0      F0 9D 93 90
🌍            U+1F30D      F0 9F 8C 8D
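
The four bit patterns translate almost line for line into code. The following is a minimal sketch in Python, not a production encoder: the function name utf8_encode is made up for this example, and the result is cross-checked against Python's built-in codec rather than trusted on its own.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint by hand, following the four bit patterns."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xA9, 0x20AC, 0x1D4D0, 0x1F30D):
    encoded = utf8_encode(cp)
    # Cross-check the hand-rolled result against the built-in codec.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}  {encoded.hex(' ').upper()}")
```

The range checks pick the shortest pattern that fits, which is what the standard requires: a longer-than-necessary ("overlong") encoding of the same codepoint is not valid UTF-8.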

UTF-16

UTF-16 uses 16-bit code units. Codepoints in the BMP fit into a single unit. Codepoints above U+FFFF require two units called a surrogate pair: a high surrogate from the range U+D800–U+DBFF followed by a low surrogate from U+DC00–U+DFFF. Those 2,048 codepoints are permanently unassigned in Unicode for exactly this purpose — they exist only inside UTF-16 streams.

The surrogate pair encoding is mechanical. Subtract 0x10000 from the codepoint, leaving a 20-bit number. The high 10 bits go into the high surrogate, OR'd with 0xD800. The low 10 bits go into the low surrogate, OR'd with 0xDC00. For U+1D4D0:

U+1D4D0 - 0x10000  = 0xD4D0  = 0000 1101 0100 1101 0000
high surrogate  = 0xD800 | (0xD4D0 >> 10)        = 0xD835
low surrogate   = 0xDC00 | (0xD4D0 & 0x3FF)      = 0xDCD0
UTF-16:           D835 DCD0
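
The same arithmetic can be written as a pair of small functions. This is a sketch under the assumption that the input is already known to be a valid codepoint; the names to_surrogates and from_surrogates are invented for the example, and the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 | (offset >> 10)    # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Inverse operation: recombine a surrogate pair into a codepoint."""
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


for cp in (0x1D4D0, 0x1F30D):
    high, low = to_surrogates(cp)
    # Cross-check against the built-in big-endian UTF-16 encoder.
    expected = chr(cp).encode("utf-16-be")
    assert high.to_bytes(2, "big") + low.to_bytes(2, "big") == expected
    assert from_surrogates(high, low) == cp
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
```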

UTF-16 is also sensitive to byte order. The 16-bit code unit 0xD835 can be stored as D8 35 (big-endian, UTF-16BE) or 35 D8 (little-endian, UTF-16LE). To disambiguate, a stream may begin with the codepoint U+FEFF (ZERO WIDTH NO-BREAK SPACE), which serves as a byte order mark (BOM): FE FF means big-endian, FF FE means little-endian. The same codepoint is used as a BOM in UTF-32.

Character     UTF-16 (BE)        UTF-16 (LE)
A             00 41              41 00
©             00 A9              A9 00
€             20 AC              AC 20
𝓐             D8 35 DC D0        35 D8 D0 DC
🌍            D8 3C DF 0D        3C D8 0D DF
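
Byte order and the BOM are easy to see from Python, whose codecs module exposes the two marks as constants. The sketch below is illustrative only: detect_utf16_order is a hypothetical helper, and real detection has to consider more cases (for instance, a UTF-32LE BOM also begins with FF FE).

```python
import codecs

text = "\U0001D4D0"  # the script capital A from the table

big = text.encode("utf-16-be")     # no BOM, high-order byte first
little = text.encode("utf-16-le")  # no BOM, low-order byte first
print(big.hex(" ").upper())
print(little.hex(" ").upper())


def detect_utf16_order(data: bytes) -> str:
    """Hypothetical helper: report byte order from a leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE (a UTF-32LE BOM starts the same way)
        return "little-endian"
    return "unknown"


assert detect_utf16_order(codecs.BOM_UTF16_BE + big) == "big-endian"
assert detect_utf16_order(codecs.BOM_UTF16_LE + little) == "little-endian"
assert detect_utf16_order(big) == "unknown"

# The generic "utf-16" codec reads the BOM, picks the order, and drops the mark.
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == text
```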

UTF-32

UTF-32 uses 32-bit code units. Every codepoint is exactly four bytes. There are no surrogates, no continuation bytes, no length signalling. The encoding is the codepoint, zero-padded to 32 bits, with an optional BOM and a byte-order question for stored streams.

Character     UTF-32 (BE)
A             00 00 00 41
©             00 00 00 A9
€             00 00 20 AC
𝓐             00 01 D4 D0
🌍            00 01 F3 0D
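
Because every code unit is a whole codepoint, encoding is a single fixed-width pack and the n-th codepoint always sits at byte offset 4n. A short sketch, assuming big-endian storage and no BOM; utf32be_encode and codepoint_at are names made up for the example.

```python
import struct


def utf32be_encode(text: str) -> bytes:
    """Write each codepoint as a 4-byte big-endian integer."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def codepoint_at(data: bytes, index: int) -> int:
    """Constant-time lookup: no scanning, just multiply the index by four."""
    (cp,) = struct.unpack_from(">I", data, 4 * index)
    return cp


sample = "A\u00A9\u20AC\U0001D4D0\U0001F30D"  # the five characters
data = utf32be_encode(sample)

assert data == sample.encode("utf-32-be")  # matches the built-in codec
assert len(data) == 4 * len(sample)        # always four bytes per codepoint
assert codepoint_at(data, 3) == 0x1D4D0    # fourth character, found by offset alone
```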

UTF-32 is rare on disk and rarer on the wire because the storage cost is so high: every ASCII character pays for four bytes when it would be content with one. Where it does appear is inside text-processing libraries that want O(1) indexing by codepoint. CPython's internal string representation (since version 3.3), for example, picks between Latin-1, UCS-2 (BMP-only UTF-16), and UTF-32 per string depending on the maximum codepoint, precisely to keep indexing constant-time without paying the UTF-32 storage cost everywhere.

Size in practice

The choice of encoding matters less for compactness than people sometimes assume. The cost depends heavily on the script:

Text sample	UTF-8	UTF-16	UTF-32
"Hello, world" (English, 12 chars)	12 bytes	24 bytes	48 bytes
"Καλημέρα" (Greek, 8 chars)	16 bytes	16 bytes	32 bytes
"你好世界" (Chinese, 4 chars)	12 bytes	8 bytes	16 bytes
"🌍🌎🌏" (emoji, 3 grapheme clusters)	12 bytes	12 bytes	12 bytes

For ASCII text UTF-8 is half the size of UTF-16 and a quarter of UTF-32. For Greek, Cyrillic, Armenian, Hebrew, and Arabic — scripts in the U+0080–U+07FF range where UTF-8 uses two bytes per codepoint — UTF-8 and UTF-16 are tied. For East Asian scripts in the BMP, UTF-16 is more compact than UTF-8. Above the BMP all three encodings converge: four bytes per codepoint.
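
The figures in the table are easy to reproduce. The sketch below measures the four samples with Python's built-in codecs; it uses the explicit little-endian variants so that no BOM is added to the count, and encoded_sizes is a name chosen for the example.

```python
def encoded_sizes(text: str) -> tuple[int, int, int]:
    """Return (UTF-8, UTF-16, UTF-32) sizes in bytes, excluding any BOM."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # "-le" suppresses the BOM
        len(text.encode("utf-32-le")),
    )


samples = ["Hello, world", "Καλημέρα", "你好世界", "🌍🌎🌏"]

for text in samples:
    utf8, utf16, utf32 = encoded_sizes(text)
    print(f"{len(text):2d} codepoints  UTF-8={utf8:2d}  UTF-16={utf16:2d}  UTF-32={utf32:2d}")
```

Note that the plain "utf-16" and "utf-32" codec names in Python prepend a BOM when encoding, which would add two and four bytes respectively to every measurement.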

The BOM

A byte-order mark is the codepoint U+FEFF written at the start of a stream. It has three uses, only one of which is strictly justified:

UTF-16 / UTF-32: Disambiguates big-endian from little-endian. This is the legitimate purpose.
UTF-8: Identifies the stream as UTF-8 (encoded as the three bytes EF BB BF). Largely a Microsoft convention; tolerated but discouraged by most Unix tooling, which does not strip it and so sees an extra character at the start of the file.
Mid-string: U+FEFF embedded anywhere except the start should be read as ZERO WIDTH NO-BREAK SPACE. Modern Unicode reserves a separate codepoint, U+2060 WORD JOINER, for that semantic use, leaving FEFF to its byte-order job.

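A UTF-8 BOM at the start of a PHP, shell, or HTML file causes a different problem in each. In PHP the three bytes are sent as output before any code runs, which breaks later header() calls and pushes a stray character into the document. In a shell script they sit in front of the #! line, so the interpreter line is no longer recognized. In HTML fragments that are concatenated on the server, the BOM can end up in the middle of the page. If you see mysterious extra characters before <!DOCTYPE>, look for a BOM in the source file.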
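
Checking for and removing a UTF-8 BOM takes only a few lines. A minimal sketch in Python: strip_utf8_bom is a hypothetical helper, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw = codecs.BOM_UTF8 + b"<!DOCTYPE html>"

assert strip_utf8_bom(raw) == b"<!DOCTYPE html>"
assert strip_utf8_bom(b"<!DOCTYPE html>") == b"<!DOCTYPE html>"  # no BOM: unchanged

# The plain "utf-8" codec keeps the BOM as a U+FEFF character ...
assert raw.decode("utf-8")[0] == "\ufeff"
# ... while "utf-8-sig" drops it on decode.
assert raw.decode("utf-8-sig") == "<!DOCTYPE html>"
```

When reading files of unknown origin, opening them with encoding="utf-8-sig" is a low-risk default, since that codec accepts input both with and without the mark.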

Why UTF-8 dominates

UTF-8 became the default encoding of the web for a small number of decisive reasons:

ASCII compatibility. Every existing ASCII file is already valid UTF-8. The migration cost for English-language content was zero.
No endianness. A UTF-8 stream has the same bytes everywhere. The BOM is optional and discouraged.
Self-synchronization. A decoder can find character boundaries from any byte position. This makes UTF-8 robust to truncation, partial reads, and search-in-bytes patterns.
Byte-oriented infrastructure. The vast existing apparatus of strcmp, grep, sort, and every file format that thinks in bytes works on UTF-8 without modification, because byte-wise ordering of UTF-8 matches codepoint-wise ordering.
No null bytes. UTF-8 never contains the byte 00 inside a non-null character, so C strings still work.
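
Two of these properties, self-synchronization and byte-wise ordering, can be checked directly. The sketch below assumes well-formed UTF-8 input; next_boundary is a name invented for the example.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first offset >= pos that does not point at a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:  # 10xxxxxx
        pos += 1
    return pos


# Bytes: 41 | C2 A9 | E2 82 AC | F0 9F 8C 8D  (character starts at 0, 1, 3, 6)
data = "A\u00A9\u20AC\U0001F30D".encode("utf-8")

# From any offset, the scan lands on a character start or the end of the data.
assert [next_boundary(data, i) for i in range(len(data))] == [0, 1, 3, 3, 6, 6, 6, 10, 10, 10]

# Sorting by raw UTF-8 bytes gives the same order as sorting by codepoint
# (Python compares str values codepoint by codepoint).
words = ["z", "\u00e9", "\u20ac", "\U0001F30D", "A"]
assert sorted(words, key=lambda w: w.encode("utf-8")) == sorted(words)
```

The loop never advances more than three positions on valid input, because no sequence has more than three continuation bytes.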

By contrast, UTF-16 carries the surrogate-pair complication forever. JavaScript's "𝓐".length returns 2, not 1, because the language was specified before any characters had been assigned outside the BMP and exposes UTF-16 code units directly. Java's String.charAt returns a single UTF-16 code unit, so it has the same problem. UTF-32 avoids surrogates and variable width altogether but spends too much memory to be practical at scale.
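
The gap between code units and codepoints can be reproduced from Python, where len counts codepoints. This is only an emulation of what a UTF-16-based language reports, with utf16_length as a made-up name.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units: two for anything above U+FFFF, else one."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


script_a = "\U0001D4D0"

assert len(script_a) == 1            # one codepoint
assert utf16_length(script_a) == 2   # two UTF-16 code units

# Cross-check against the real encoder: two bytes per code unit.
for sample in ("A", "\u20ac", script_a, "A\U0001F30D"):
    assert utf16_length(sample) == len(sample.encode("utf-16-le")) // 2
```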

For new code, on disk and on the wire, the answer is almost always UTF-8. Internal-memory representations vary, but interchange formats have converged. The W3C has required UTF-8 as the default for HTML since HTML5; modern programming languages default to UTF-8 source files; the IETF has required UTF-8 in new protocols since RFC 2277 (1998).

What to remember

The encodings differ in length and alignment but not in what they can represent — any of them can carry any codepoint. The cost is in indexing, in interoperability, and in the small infelicities at the edges: surrogate pairs in UTF-16, the BOM question in UTF-8, the storage waste in UTF-32. If you control the interface, pick UTF-8 and move on. If you inherit it, know which one you have and where the seams are.
