
Unicode general categories

Every Unicode codepoint has exactly one General Category, written as a two-letter code such as Lu, Ll, Lo, Nd or Pc.

The General Category is the most-used classification in the Unicode Character Database. It assigns each of the 1,114,112 codepoints to exactly one of thirty two-letter codes. The first letter names a major group — L for letter, M for mark, N for number, P for punctuation, S for symbol, Z for separator, C for other — and the second letter narrows the bucket: u for uppercase, l for lowercase, d for decimal digit, and so on. The full table was largely settled in Unicode 1.0 (1991) and has barely changed since.
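The two-letter structure can be seen directly with Python's standard `unicodedata` module. The sketch below is illustrative only: the `MAJOR_GROUPS` mapping and the `describe` helper are names invented here, and the results reflect whichever Unicode version ships with the Python build in use.

```python
import unicodedata

# Names for the first letter of a General Category code, as listed above.
MAJOR_GROUPS = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}

def describe(ch: str) -> tuple:
    """Return (two-letter code, major group name) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_GROUPS[code[0]]

# U+0301 is COMBINING ACUTE ACCENT, a nonspacing mark.
for ch in ["A", "a", "5", "_", "$", " ", "\u0301"]:
    code, group = describe(ch)
    print(f"U+{ord(ch):04X}  {code}  {group}")
```

The first letter of the returned code is enough to pick out the major group; the second letter only matters when a narrower bucket is needed.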

The category underpins much of the higher-level text processing you encounter. A regex engine implements \p{L} as "any codepoint whose General Category starts with L." Word-break algorithms identify spaces via the Z group, identifier syntax in C, Java and Python is built mainly from L, Nd and Pc (the UAX #31 identifier definition also admits Nl, Mn and Mc), and bidi neutrals are largely the P and S groups. Even Unicode Security Mechanisms (UTS #39) lean on the General Category to decide which codepoints are safe in identifiers.
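As a rough illustration of both ideas, the sketch below emulates a \p{L}-style test with a prefix check and builds a deliberately simplified identifier rule from L, Nd and Pc. The function names are hypothetical, and the identifier rule is an approximation: it is not the exact grammar of C, Java or Python.

```python
import unicodedata

# Emulates a regex class such as \p{L}: true when the General Category
# of ch starts with the given major letter.
def in_group(ch: str, major: str) -> bool:
    return unicodedata.category(ch).startswith(major)

# Simplified identifier character set: any letter, decimal digit, or
# connector punctuation. Real language specs admit more (Nl, Mn, Mc),
# so this is an approximation, not a replacement for str.isidentifier().
def is_ident_char(ch: str) -> bool:
    cat = unicodedata.category(ch)
    return cat[0] == "L" or cat in ("Nd", "Pc")

def is_simple_identifier(s: str) -> bool:
    # Non-empty, does not start with a digit, and every character qualifies.
    if not s or unicodedata.category(s[0]) == "Nd":
        return False
    return all(is_ident_char(ch) for ch in s)

print(in_group("\u0416", "L"))               # Cyrillic capital Zhe
print(is_simple_identifier("na\u00efve_1"))  # letters, underscore, digit
print(is_simple_identifier("1abc"))          # leading digit
```

Python's built-in `re` module has no \p{...} support, which is one reason a category lookup like this is sometimes used in its place.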

All 30 categories

| Code | Major | Name | Description | Examples |
| --- | --- | --- | --- | --- |
| Lu | L · Letter | Letter, uppercase | Uppercase letters in cased scripts. | A B C Ω Ж |
| Ll | L · Letter | Letter, lowercase | Lowercase letters in cased scripts. | a b c α ж |
| Lt | L · Letter | Letter, titlecase | Titlecase forms, 31 in all: four Latin digraphs (chiefly Croatian/Serbian) and 27 Greek capitals with prosgegrammeni. | Dž Lj Nj Dz |
| Lm | L · Letter | Letter, modifier | Small letters that modify the preceding base character. | ʰ ʱ ʲ ˇ ˆ |
| Lo | L · Letter | Letter, other | Letters in scripts with no case distinction: CJK, Arabic, Hebrew, Indic. | 中 あ א ا |
| Mn | M · Mark | Mark, nonspacing | Combining marks with no horizontal advance (most diacritics). | ◌̀ ◌́ ◌̃ ◌̈ |
| Mc | M · Mark | Mark, spacing combining | Combining marks that do take up horizontal width, chiefly Indic vowel signs. | ा ि ी |
| Me | M · Mark | Mark, enclosing | Combining marks that enclose the preceding character. | ◌⃝ ◌⃞ |
| Nd | N · Number | Number, decimal digit | The digits 0–9 of every decimal-digit script. There are 700+ of these. | 0 1 2 ٠ ١ २ |
| Nl | N · Number | Number, letter | Numerals that are also letters or letter-like: Roman numerals, Greek acrophonic numerals, Hangzhou numerals, cuneiform numeric signs. | Ⅰ Ⅱ Ⅲ Ⅳ |
| No | N · Number | Number, other | Fractions, superscript and subscript digits, circled and parenthesised digits. | ½ ¾ ① ② ³ |
| Pc | P · Punctuation | Punctuation, connector | Connecting punctuation, like the underscore, used to join words inside identifiers. | _ ‿ ⁀ |
| Pd | P · Punctuation | Punctuation, dash | The dash family: hyphen-minus, hyphen, en dash, em dash, horizontal bar. The mathematical minus sign U+2212 is Sm, not Pd. | - – — ⸺ |
| Ps | P · Punctuation | Punctuation, open | Opening brackets, parentheses and CJK corner brackets. | ( [ { 「 |
| Pe | P · Punctuation | Punctuation, close | Closing brackets, the mirror partners of the Ps group. | ) ] } 」 |
| Pi | P · Punctuation | Punctuation, initial quote | Opening quotation marks, including French and German guillemets and curly quotes. | “ ‘ ‹ « |
| Pf | P · Punctuation | Punctuation, final quote | Closing quotation marks. Note that ASCII " and ' are Po, not Pf. | ” ’ › » |
| Po | P · Punctuation | Punctuation, other | Everything else: periods, commas, semicolons, the ampersand, the ASCII straight quotes. | . , ; : ! ? @ & |
| Sm | S · Symbol | Symbol, math | Mathematical operators, relational symbols, set-theory glyphs. | + < = ± ∞ ∑ |
| Sc | S · Symbol | Symbol, currency | All 60+ currency signs, from the dollar to the Indian rupee to the bitcoin sign (U+20BF). | $ € £ ¥ ₹ |
| Sk | S · Symbol | Symbol, modifier | Spacing modifier symbols: the standalone forms of accent marks. | ` ^ ¨ ¯ ´ |
| So | S · Symbol | Symbol, other | Everything pictographic: copyright, trademark, dingbats, most emoji. | © ® ™ ☃ ♥ 🌍 |
| Zs | Z · Separator | Separator, space | Horizontal whitespace: ASCII space, NBSP, en quad, em quad, hair space. | U+0020 U+00A0 U+2000 |
| Zl | Z · Separator | Separator, line | A single codepoint, the line separator. | U+2028 |
| Zp | Z · Separator | Separator, paragraph | A single codepoint, the paragraph separator. | U+2029 |
| Cc | C · Other | Other, control | The 65 ASCII and C1 control codes inherited from ISO/IEC 6429. | U+0000–U+001F U+007F–U+009F |
| Cf | C · Other | Other, format | Invisible formatting codepoints: ZWJ, ZWNJ, BOM, bidi controls, soft hyphen. | ZWJ ZWNJ BOM U+200B |
| Cs | C · Other | Other, surrogate | The 2,048 surrogate codepoints. Used by UTF-16; never assigned to characters. | U+D800–U+DFFF |
| Co | C · Other | Other, private use | The three Private Use Areas, totalling 137,468 codepoints reserved for unofficial use. | U+E000–U+F8FF PUA-A PUA-B |
| Cn | C · Other | Other, unassigned | Every codepoint not yet assigned to a character. Includes all 66 noncharacters. | (none; ~819,000 codepoints) |
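The counts quoted in the table can be checked with a short tally over the whole codespace. This is a minimal sketch using Python's `unicodedata`; the version-dependent figures (Nd, Sc, Cn and others) will vary with the Unicode data bundled in the interpreter, while Cc, Cs, Co, Zl and Zp are fixed.

```python
import sys
import unicodedata
from collections import Counter

# Count how many codepoints fall into each General Category.
# sys.maxunicode is 0x10FFFF, so the range covers all 1,114,112 codepoints.
counts = Counter(
    unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
)

print("Unicode version:", unicodedata.unidata_version)
for code, n in sorted(counts.items()):
    print(f"{code}  {n:>9,}")
```

The loop visits every codepoint, including surrogates and unassigned ones, so the per-category totals always sum to 1,114,112.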

The seven major groups

Each major group has its own page with longer-form notes and example characters. Emoji is shown alongside because the Emoji property is independent of General Category: most emoji are So, but a handful are Po (# and *) or even Nd (the digits 0–9).
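The independence of the two properties can be illustrated by looking up the General Category of a few characters that carry the Emoji property. Python's standard library does not expose the Emoji property itself, so the sample list below is hard-coded on the assumption that these characters are listed with Emoji=Yes in the Unicode emoji data files.

```python
import unicodedata

# Characters assumed to have Emoji=Yes; the standard library cannot
# confirm that property, so this list is an illustrative assumption.
samples = ["#", "*", "0", "9", "\u00a9", "\U0001F30D"]

categories = [unicodedata.category(ch) for ch in samples]

for ch, cat in zip(samples, categories):
    print(f"U+{ord(ch):04X}  {cat}  {unicodedata.name(ch)}")
```

Filtering text by General Category alone therefore neither catches all emoji nor avoids matching ordinary digits and punctuation.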
