CJK Unified Ideographs (U+4E00–U+9FFF)

The CJK Unified Ideographs block, U+4E00 through U+9FFF, is the largest block in the Basic Multilingual Plane and the core of Unicode's Han repertoire; among ideograph blocks, only Extension B on Plane 2 is larger. It contains 20,992 Han ideographs, the characters used in written Chinese (where they are hànzì), in Japanese (kanji), in Korean (hanja, now used mainly in scholarly and legal contexts), and historically in Vietnamese (chữ Hán, before the Latin-based chữ Quốc ngữ replaced it in the twentieth century). Counting this block together with Extension A in the BMP and the extension blocks on the Supplementary Ideographic Plane and the Tertiary Ideographic Plane, the total Han repertoire in Unicode 16.0 exceeds 97,000 codepoints, more than all other encoded characters in Unicode combined.
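The figure of 20,992 follows directly from the block's endpoints. A minimal sketch in Python; the names BLOCK_START, BLOCK_END, BLOCK_SIZE, and in_core_cjk_block are illustrative choices for this example, not part of any library:

```python
# Endpoints of the CJK Unified Ideographs block, as given in the text.
BLOCK_START = 0x4E00
BLOCK_END = 0x9FFF

# The range is inclusive at both ends, so add 1 to the difference.
BLOCK_SIZE = BLOCK_END - BLOCK_START + 1


def in_core_cjk_block(ch: str) -> bool:
    """Return True if the single character ch lies in U+4E00–U+9FFF.

    Only the core block is tested; ideographs in Extension A or on
    Planes 2 and 3 are deliberately reported as False.
    """
    return BLOCK_START <= ord(ch) <= BLOCK_END


if __name__ == "__main__":
    print(BLOCK_SIZE)                    # 20992
    print(in_core_cjk_block("\u4e2d"))   # True  (U+4E2D, "middle")
    print(in_core_cjk_block("A"))        # False
```

A range test like this answers "is it in this block", not "is it a Han ideograph"; a fuller check would also have to cover the extension blocks.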

About this block

Unicode 1.0.1 (1992) introduced the first set of unified ideographs, 20,902 characters at U+4E00–U+9FA5; Unicode 3.0 (1999) added Extension A; Unicode 3.1 (2001) added Extension B on Plane 2; and subsequent versions through Unicode 15.1 have added Extensions C, D, E, F, G, H, and I. The block itself has always spanned U+4E00–U+9FFF, but the 90 codepoints after U+9FA5 were initially left unassigned and were filled in small batches between Unicode 4.1 (2005) and Unicode 14.0 (2021), bringing the block to its current 20,992 characters. Within this core block, ideographs are ordered by radical and stroke count in the Kangxi dictionary tradition, which is why the first ideograph is U+4E00 一, the character meaning "one" and itself the first radical. It is a small piece of typographic poetry inside an encoding spec, though it follows from the sort order rather than from a special choice. The last ideograph of the original 1992 repertoire was U+9FA5 龥, a rare character whose Japanese on-reading is yaku.

The defining technical decision behind this block is Han unification. The Ideographic Rapporteur Group (IRG, now named the Ideographic Research Group), a body operating under ISO/IEC JTC1/SC2 with representatives from China, Hong Kong, Taiwan, Japan, North Korea, South Korea, Vietnam, Singapore, Macao, and the Unicode Consortium, assigns a single codepoint to characters that are "abstractly the same" even when their glyph shapes differ across CJKV traditions. The character 直 ("straight, honest") at U+76F4 is the canonical illustration: in PRC Simplified Chinese fonts it is drawn one way, in Taiwan Traditional Chinese fonts another, in Japanese Mincho fonts subtly differently again, and in Korean fonts differently still. Unicode nevertheless encodes only one codepoint and leaves the visual variation to language-tagged fonts or to the ideographic variation selectors in the Variation Selectors Supplement (U+E0100–U+E01EF). Critics of the unification, particularly Japanese designers and typographers, argued that semantically meaningful glyph distinctions had been collapsed, especially in proper names where a particular shape carries identity. Defenders pointed out that the Latin letter a looks very different in Times Roman, Helvetica, and Comic Sans, and yet we accept that all three encode the same character. The IVD (Ideographic Variation Database) was added later to register variation sequences, each a base ideograph followed by a variation selector, for cases where a single character must be visually disambiguated.
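In code, a variation sequence is simply the base ideograph followed by one selector codepoint, so it compares unequal to the bare character unless the selector is removed first. The sketch below is illustrative only: the helper names are invented for this example, and the pairing of U+76F4 with U+E0100 is a hypothetical sequence whose registration in the IVD is not checked.

```python
# Range of the selectors used in ideographic variation sequences.
IVS_FIRST = 0xE0100
IVS_LAST = 0xE01EF


def is_ideographic_vs(ch: str) -> bool:
    """True if ch is one of the selectors in U+E0100–U+E01EF."""
    return IVS_FIRST <= ord(ch) <= IVS_LAST


def strip_ideographic_vs(text: str) -> str:
    """Drop those selectors, leaving only the unified base characters."""
    return "".join(ch for ch in text if not is_ideographic_vs(ch))


# Hypothetical sequence: U+76F4 followed by the first selector, U+E0100.
# Whether this exact pair is registered in the IVD is NOT verified here.
base = "\u76f4"
sequence = base + "\U000E0100"

if __name__ == "__main__":
    print(len(base), len(sequence))                  # 1 2
    print(sequence == base)                          # False
    print(strip_ideographic_vs(sequence) == base)    # True
```

Stripping selectors is a reasonable step before searching or comparing text, but it discards exactly the glyph distinction the sequence was meant to preserve, so it should not be applied to stored data such as personal names.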

The semantic density of this block is what makes it culturally indispensable. A literate Chinese reader actively uses perhaps 3,000–4,000 of these characters; a Japanese reader, the ~2,136 jōyō kanji plus a few thousand more; a Korean reader needs only a small set, since hanja appear rarely in modern text. But the long tail of rare characters (names, places, classical texts, technical terminology) is what justifies a 20,992-character block. Some of the most-used characters: U+4E00 一 (one), U+4E8C 二 (two), U+4E09 三 (three), U+4EBA 人 (person), U+5927 大 (big), U+5C0F 小 (small), U+4E0A 上 (up, above), U+4E0B 下 (down, below), U+4E2D 中 (middle, China), U+65E5 日 (sun, day), U+6708 月 (moon, month), U+5E74 年 (year), U+56FD 国 (the simplified-Chinese "country"), and U+570B 國 (the traditional-Chinese "country"). The pair 国/國 demonstrates that unification did not merge simplified and traditional forms: the two are encoded separately when their shapes diverge enough to be considered different characters.
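The codepoints listed above can be inspected directly, and the 国/國 pair can be checked for distinctness. This is a small sketch using only Python's standard unicodedata module; the names LISTED, label, and nfkc are illustrative, and characters are built from their codepoints with chr() so that the example does not depend on how this page is encoded.

```python
import unicodedata

# Codepoints quoted in the paragraph above, in the same order.
LISTED = [0x4E00, 0x4E8C, 0x4E09, 0x4EBA, 0x5927, 0x5C0F, 0x4E0A,
          0x4E0B, 0x4E2D, 0x65E5, 0x6708, 0x5E74, 0x56FD, 0x570B]


def label(cp: int) -> str:
    """Format a codepoint in the usual U+XXXX notation, with its character."""
    return f"U+{cp:04X} {chr(cp)}"


def nfkc(text: str) -> str:
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


simplified = chr(0x56FD)    # the simplified form of "country"
traditional = chr(0x570B)   # the traditional form of "country"

if __name__ == "__main__":
    for cp in LISTED:
        print(label(cp))
    # The two forms are different characters, and normalization
    # does not fold one into the other.
    print(simplified == traditional)               # False
    print(nfkc(simplified) == nfkc(traditional))   # False
```

Because normalization leaves the pair distinct, matching simplified against traditional text needs a separate mapping table, which is outside the scope of this sketch.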

For implementers, the takeaways are practical. UTF-8 encodes every character in this block as three bytes (the range U+4E00–U+9FFF falls entirely within the three-byte UTF-8 range, U+0800–U+FFFF). UTF-16 represents each one as a single 16-bit code unit, with no surrogates required. Font fallback is the most common failure mode: a missing or incomplete CJK font shows tofu (□) for every ideograph, and rendering a Japanese text in a PRC-tuned font (or vice versa) produces visually "wrong" glyphs even though every codepoint is correct. The deeper conceptual point, namely what unification implies about what a "character" really is, is treated in Codepoint, character, glyph, grapheme.
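The encoding sizes can be confirmed with a few lines of Python. This is a sketch using only the built-in str.encode; the helper names utf8_len and utf16_units are illustrative. An Extension B codepoint, U+20000, is included for contrast, since it lies outside the BMP.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units ch occupies in UTF-16.

    Encoding as little-endian without a BOM gives 2 bytes per unit.
    """
    return len(ch.encode("utf-16-le")) // 2


first = "\u4e00"           # first codepoint of the block
last = "\u9fff"            # last codepoint of the block
ext_b = "\U00020000"       # first codepoint of Extension B, for contrast

# Every codepoint in the block needs exactly 3 UTF-8 bytes and 1 UTF-16 unit.
block_is_uniform = all(
    utf8_len(chr(cp)) == 3 and utf16_units(chr(cp)) == 1
    for cp in range(0x4E00, 0x9FFF + 1)
)

if __name__ == "__main__":
    print(first.encode("utf-8").hex())          # e4b880
    print(last.encode("utf-8").hex())           # e9bfbf
    print(block_is_uniform)                     # True
    print(utf8_len(ext_b), utf16_units(ext_b))  # 4 2
```

The contrast case matters in practice: code that assumes one UTF-16 unit per ideograph holds for this block but breaks on the Plane 2 and Plane 3 extensions, which need surrogate pairs.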

Related blocks