What is Unicode? A reference introduction

Before Unicode, text was a regional affair. A document written in Athens used one encoding, a document written in Tokyo used another, and neither could be opened reliably on a machine that did not already know which encoding to expect. Files arrived as columns of seemingly random punctuation. The condition had a name — mojibake, from the Japanese moji (character) and bake (transform) — and it was so common that entire industries grew around guessing the right encoding from byte statistics. Unicode is the technical and political agreement that made that guessing unnecessary.

The 1988 proposal

The clearest origin point is a paper written in August 1988 by Joseph Becker, then at Xerox, titled Unicode 88. Becker proposed a single, wide character set that could hold every character used in every modern script, plus a generous reserve for historic and technical use. He estimated that 16 bits per character — 65,536 possible values — would be sufficient. That estimate turned out to be too small, but the architecture survived: separate the abstract idea of a numeric identifier (a codepoint) from the question of how to store it as bytes (an encoding).

Xerox engineers had been wrestling with the problem since the early 1980s as part of the Star workstation. They were joined by Apple, where Mark Davis and others were building international support for the Macintosh. In January 1991 the Unicode Consortium was incorporated as a California nonprofit, and Unicode 1.0 was published in October 1991, containing roughly 7,161 characters. Sun, Microsoft, IBM, Borland, and others joined within months. The first standard was a single book; today the Unicode Character Database is over a hundred plain-text files in a versioned repository, last released as Unicode 16.0 in September 2024.

The problem before Unicode

The need for the standard is easiest to see by example. Consider the French word café. In Windows code page 1252, the de facto Western European encoding through the 1990s, it is stored as four bytes. Latin-1 produces the same four bytes, and UTF-8 needs five:

café  (CP-1252)   →   63 61 66 E9
café  (Latin-1)   →   63 61 66 E9
café  (UTF-8)     →   63 61 66 C3 A9

The first three bytes (63 61 66) are identical in all three, because they are ASCII letters and every modern encoding agrees about ASCII. The disagreement begins at the é. CP-1252 and Latin-1 happen to agree here (both use the single byte E9), but they disagree on dozens of other bytes — CP-1252 occupies positions 80–9F with extra punctuation that Latin-1 leaves undefined. Open a CP-1252 file as Latin-1 and the smart quotes turn into control characters. Open it as Shift-JIS and the byte E9 becomes part of a half-finished Japanese kanji.
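The byte values above can be reproduced with Python's built-in codecs. This is a minimal sketch rather than a definitive tool: `hex_bytes` is a hypothetical helper introduced here for illustration, and the codec names (`cp1252`, `latin-1`, `utf-8`) are Python's labels for the encodings discussed.

```python
# Encode the same word under the three encodings discussed above.
word = "caf\u00e9"  # café, with a precomposed U+00E9


def hex_bytes(text: str, codec: str) -> str:
    """Return the encoded bytes as space-separated uppercase hex."""
    return text.encode(codec).hex(" ").upper()


for codec in ("cp1252", "latin-1", "utf-8"):
    print(f"{codec:8} {hex_bytes(word, codec)}")

# Decoding with the wrong codec is how mojibake happens:
# the two UTF-8 bytes C3 A9 are read as two separate CP-1252 characters.
mojibake = word.encode("utf-8").decode("cp1252")
print(ascii(mojibake))  # 'caf\xc3\xa9', which displays as cafÃ©

# A CP-1252 left smart quote is byte 0x93; read as Latin-1 it becomes
# the C1 control character U+0093.
smart_as_latin1 = "\u201c".encode("cp1252").decode("latin-1")
print(ascii(smart_as_latin1))  # '\x93'
```

Python's `latin-1` codec maps every byte to the codepoint with the same value, which is why the smart quote lands on a control character instead of raising an error.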

Multiply this problem by the dozens of code pages that existed for Eastern European languages, Cyrillic, Greek, Hebrew, Arabic, Thai, and the various incompatible standards for Chinese, Japanese, and Korean, and you have what the standards literature politely calls code page hell. Email systems routinely added a charset= parameter that lied; web browsers shipped with elaborate auto-detection heuristics; software vendors maintained internal conversion tables that disagreed at the edges. The cost was measured in unreadable messages, lost data, and a permanent ceiling on how multilingual any single document could be.

Codepoints

Unicode replaces the code page model with a single integer space. Every character — every letter, digit, punctuation mark, symbol, control code, format character, and emoji — is assigned a unique non-negative integer called a codepoint. The space runs from U+0000 to U+10FFFF, which is 1,114,112 possible values. The U+ prefix is a Unicode-specific notation; the digits after it are hexadecimal.

U+0041: LATIN CAPITAL LETTER A. Decimal 65. Identical to ASCII.
U+00E9: LATIN SMALL LETTER E WITH ACUTE (é). The character at which the encodings of café begin to disagree.
U+20AC: EURO SIGN (€). Added in Unicode 2.1 (1998) when the currency itself was new.
U+1F30D: EARTH GLOBE EUROPE-AFRICA (🌍). Beyond the original 16-bit ceiling, in the Supplementary Multilingual Plane.
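The four codepoints listed above can be inspected from Python's standard library. This is an illustrative sketch: `describe` is a hypothetical helper, and the names it prints come from whichever version of the Unicode Character Database ships with the Python interpreter in use.

```python
import unicodedata

# The four example characters, written as escapes so the source stays ASCII.
SAMPLES = ["\u0041", "\u00e9", "\u20ac", "\U0001F30D"]


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME (decimal N)'."""
    codepoint = ord(ch)
    return f"U+{codepoint:04X} {unicodedata.name(ch)} (decimal {codepoint})"


for ch in SAMPLES:
    print(describe(ch))

# The codepoint space runs from U+0000 to U+10FFFF inclusive.
TOTAL_CODEPOINTS = 0x10FFFF + 1
print(TOTAL_CODEPOINTS)  # 1114112
```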

Of the 1,114,112 possible codepoints, Unicode 16.0 assigns names to 154,998. The rest are either reserved for future use, set aside as private use, or permanently unassigned for technical reasons (the surrogate range, which exists only as a workaround for UTF-16).

Planes

For organisational purposes the codepoint space is divided into 17 planes of 65,536 codepoints each. Plane 0 is the Basic Multilingual Plane (BMP), and it contains nearly everything in daily use: the Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Thai, and Han scripts; all of the punctuation, math, and currency blocks; the Hangul Syllables block for Korean. The remaining sixteen planes are supplementary.

Plane	Range	Name	Contains
0	U+0000 – U+FFFF	BMP	Modern scripts, punctuation, common symbols
1	U+10000 – U+1FFFF	SMP	Historic scripts, musical notation, most emoji
2	U+20000 – U+2FFFF	SIP	Rare CJK ideographs
3	U+30000 – U+3FFFF	TIP	More rare CJK ideographs; Oracle Bone script is proposed but not yet encoded
4–13	U+40000 – U+DFFFF	—	Unassigned
14	U+E0000 – U+EFFFF	SSP	Tag characters, variation selectors supplement
15–16	U+F0000 – U+10FFFF	PUA-A / PUA-B	Supplementary private use
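Because each plane holds exactly 65,536 (0x10000) codepoints, the plane number is simply the codepoint with its low 16 bits dropped. The following is a small sketch of that arithmetic; `plane_of` and `PLANE_SIZE` are names chosen here for illustration.

```python
PLANE_SIZE = 0x10000  # 65,536 codepoints per plane
PLANE_COUNT = 17


def plane_of(codepoint: int) -> int:
    """Return the plane number (0-16) that contains the given codepoint."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {codepoint:#x}")
    return codepoint >> 16  # same as codepoint // PLANE_SIZE


print(plane_of(0x20AC))    # 0  (euro sign, in the BMP)
print(plane_of(0x1F30D))   # 1  (globe emoji, in the SMP)
print(plane_of(0xE0001))   # 14 (a tag character, in the SSP)
print(plane_of(0x10FFFF))  # 16 (the last codepoint)

# Seventeen planes account for the whole codepoint space.
print(PLANE_COUNT * PLANE_SIZE)  # 1114112
```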

The 16-bit ceiling that Becker proposed in 1988 corresponds exactly to the BMP. When the standard moved beyond it — first with Unicode 2.0 in 1996, which introduced surrogate pairs to let UTF-16 reach the higher planes — every encoding had to be revisited. UTF-8 absorbed the change easily; UTF-16 acquired a complication it still has not shed.
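The surrogate-pair mechanism mentioned above is plain arithmetic: subtract 0x10000 to get a 20-bit offset, then split that offset into two 10-bit halves. The sketch below shows the calculation under that assumption; `to_surrogates` and `from_surrogates` are hypothetical helper names, and real code would normally leave this to a UTF-16 codec.

```python
def to_surrogates(codepoint: int) -> tuple:
    """Split a supplementary-plane codepoint into (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above the BMP need surrogates")
    offset = codepoint - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F30D)
print(f"{high:04X} {low:04X}")  # D83C DF0D

# Cross-check against Python's own big-endian UTF-16 encoder.
print("\U0001F30D".encode("utf-16-be").hex(" ").upper())  # D8 3C DF 0D
```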

Encodings

A codepoint is a number. Storing or transmitting it requires turning that number into bytes. Unicode defines three official encodings for this — UTF-8, UTF-16, and UTF-32 — and the choice between them has shaped how text behaves on every platform you use.

UTF-8: Variable length, 1–4 bytes per codepoint, ASCII-compatible. The dominant encoding on the web and on Unix-like systems. A is one byte; é is two; € is three; 🌍 is four.
UTF-16: Variable length, 2 or 4 bytes per codepoint. Native on Windows, Java, JavaScript, .NET, and Objective-C. Characters above the BMP require a surrogate pair — two 16-bit code units in the reserved range U+D800–U+DFFF.
UTF-32: Fixed 4 bytes per codepoint. Conceptually the simplest; used in memory for some text-processing libraries because indexing is O(1). Rare on disk and rarer on the wire.
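The byte counts in the list above can be checked directly. This sketch uses the little-endian codec variants (`utf-16-le`, `utf-32-le`) so that Python does not prepend a byte order mark and inflate the lengths; `encoded_lengths` is a hypothetical helper.

```python
# The four sample characters, keyed by their codepoint labels.
SAMPLES = {
    "U+0041": "\u0041",       # A
    "U+00E9": "\u00e9",       # e with acute
    "U+20AC": "\u20ac",       # euro sign
    "U+1F30D": "\U0001F30D",  # globe emoji
}
CODECS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes the character occupies in each codec."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}


for label, ch in SAMPLES.items():
    print(label, encoded_lengths(ch))
```

Only the globe emoji needs four bytes in UTF-16, because it is the only sample that lies above the BMP.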

The full comparison — including byte tables and the reasons UTF-8 took over — lives in UTF-8, UTF-16, UTF-32 compared.

A codepoint is not the same thing as a character, and a character is not the same thing as a glyph. The distinction matters as soon as you start counting, searching, or comparing text. See codepoint, character, glyph, grapheme.
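A quick way to see why counting is ambiguous is to measure the same visible text in different units. The sketch below is illustrative only: `count_units` is a hypothetical helper, and Python's standard library has no grapheme segmentation, so the user-perceived character count is noted in comments rather than computed.

```python
def count_units(text: str) -> dict:
    """Count a string in codepoints, UTF-16 code units, and UTF-8 bytes."""
    return {
        "codepoints": len(text),  # Python's len() counts codepoints
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


composed = "\u00e9"                  # one codepoint, displays as e-acute
decomposed = "e\u0301"               # e + COMBINING ACUTE ACCENT, same display
flag = "\U0001F1EB\U0001F1F7"        # two regional indicators, one flag

# Each of these is a single user-perceived character on screen.
print(count_units(composed))    # 1 codepoint, 1 UTF-16 unit, 2 UTF-8 bytes
print(count_units(decomposed))  # 2 codepoints, 2 UTF-16 units, 3 UTF-8 bytes
print(count_units(flag))        # 2 codepoints, 4 UTF-16 units, 8 UTF-8 bytes
```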

What Unicode does not solve

Unicode assigns numbers to characters. It does not, by itself, draw them; that is the job of a font, and a font may render the same codepoint as a thousand different glyphs depending on context. The codepoint assignments do not specify how to sort text (the Unicode Collation Algorithm, UTS #10, does, but locale-specific tailoring is still required). They do not specify how to break lines (UAX #14 does that) or how to display bidirectional text (UAX #9, the Bidirectional Algorithm, does that). And they do not specify which sequences of codepoints are to be treated as equivalent; that is the job of normalization (UAX #15).
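Normalization is the easiest of these layers to demonstrate. In the sketch below, two spellings of café compare as unequal until both are brought to the same normalization form; `canonically_equal` is a hypothetical helper built on the standard library's `unicodedata.normalize`, and NFC is used here as one reasonable choice of form, not the only one.

```python
import unicodedata

composed = "caf\u00e9"     # ends in precomposed U+00E9
decomposed = "cafe\u0301"  # ends in e + U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                   # False: different codepoints
print(canonically_equal(composed, decomposed))  # True: same text after NFC

# NFD goes the other way and splits the precomposed letter apart.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```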

The standard is split deliberately into a thin core and a sprawling set of annexes precisely because reasonable people disagree about details. The core is the codepoint assignments, the names, the categories, and the canonical decompositions. The rest is layered on top.

ASCII, Latin-1, Unicode

It is worth being explicit about how the older standards relate to the new one. ASCII, first published in 1963 as X3.4 by the American Standards Association (the body later renamed ANSI), is a 7-bit code with 128 positions; since the 1967 revision, which added lowercase letters, these have been 95 printable characters and 33 control codes. Latin-1 (ISO/IEC 8859-1, 1987) extended ASCII to 8 bits and filled the upper half with Western European letters. Windows code page 1252 is a slight variant of Latin-1 that uses positions 80–9F for smart quotes and other typography rather than leaving them undefined.

Unicode preserves all of ASCII at its original positions (U+0041 is A, the same value it had in 1963) and carries Latin-1 over unchanged in the U+0080–U+00FF range. This compatibility is the reason adoption was as smooth as it was: a UTF-8 file containing only ASCII characters is byte-identical to an ASCII file containing the same text. See Basic Latin and Latin-1 Supplement for the full character lists.
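Both compatibility claims can be checked in a few lines. This is a sketch using Python's built-in `ascii`, `utf-8`, and `latin-1` codecs; the sample sentence and the variable names are arbitrary choices made for illustration.

```python
# 1. Pure-ASCII text produces identical bytes in ASCII and in UTF-8.
english = "Plain English text."
ascii_bytes = english.encode("ascii")
utf8_bytes = english.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True

# 2. Decoding every possible byte as Latin-1 yields the codepoint
#    with the same numeric value, U+0000 through U+00FF.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
codepoints = [ord(ch) for ch in latin1_text]
print(codepoints == list(range(256)))  # True

# 3. The same is not true of UTF-8 above 0x7F: U+00E9 takes two bytes.
print("\u00e9".encode("latin-1").hex(), "\u00e9".encode("utf-8").hex())  # e9 c3a9
```

The third line is the catch: the codepoint values match Latin-1, but the UTF-8 bytes do not, so a Latin-1 file containing accented letters is not valid UTF-8 as it stands.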

How big is it now

Unicode 16.0, released in September 2024, contains 154,998 assigned characters across 168 scripts. CJK Unified Ideographs alone account for over 97,000 codepoints. Emoji, despite their cultural visibility, are a small portion: roughly 3,790 emoji are recommended for general interchange, a figure that already includes many ZWJ and modifier sequences, and the combinatorial space of possible sequences is far larger. New versions add one or two thousand characters a year, mostly historic scripts, technical symbols, and emoji.

"The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world."

That sentence, from the introduction to the standard itself, has stayed the same for thirty years. What it means in practice — and what kind of work is required to honour it — is the subject of the rest of these guides.
