CATEGORY · P · PUNCTUATION

Punctuation

Seven subcategories cover every connector, dash, bracket, quotation mark and miscellaneous mark in the standard.

Unicode keeps a more granular taxonomy of punctuation than most writers realise. There are seven subcategories under P: connector, dash, open bracket, close bracket, initial quote, final quote, and the catch-all "other". Together they cover around 850 codepoints. The densest clusters are in the Basic Latin, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation blocks, with the remainder spread across script-specific blocks.

The subcategories

Pc
Punctuation, connector: punctuation that visually joins. Just ten codepoints in total: LOW LINE _ (U+005F), UNDERTIE ‿ (U+203F), CHARACTER TIE ⁀ (U+2040), INVERTED UNDERTIE ⁔ (U+2054), and the CJK compatibility and fullwidth variants of the low line. Used inside identifiers (Python, Java and JavaScript all treat U+005F as a valid identifier character).
Pd
Punctuation, dash: the entire dash family. HYPHEN-MINUS - (U+002D), HYPHEN ‐ (U+2010), NON-BREAKING HYPHEN ‑ (U+2011), FIGURE DASH ‒ (U+2012), EN DASH – (U+2013), EM DASH — (U+2014), HORIZONTAL BAR ― (U+2015), TWO-EM DASH ⸺ (U+2E3A), THREE-EM DASH ⸻ (U+2E3B), and script-specific hyphens such as those of Armenian, Hebrew and Mongolian. The ASCII hyphen-minus is included even though MINUS SIGN (U+2212) itself is in Sm.
Ps
Punctuation, open: opening brackets. ( [ { 「 〈 ⟨ ⦃ plus dozens more from CJK, math and ornament ranges. Unicode pairs most Ps characters with a matching Pe through the Bidi_Mirroring_Glyph property, so a bidi engine knows that ( in an RTL context should be drawn with the glyph of ).
Pe
Punctuation, close: closing brackets. ) ] } 」 〉 ⟩ ⦄. Almost always the mirror partner of a Ps. About 80 pairs are registered.
Pi
Punctuation, initial quote: opening quotation marks. LEFT DOUBLE QUOTATION MARK “ (U+201C), LEFT SINGLE QUOTATION MARK ‘ (U+2018), SINGLE LEFT-POINTING ANGLE QUOTATION MARK ‹ (U+2039), LEFT-POINTING DOUBLE ANGLE QUOTATION MARK « (U+00AB), plus the reversed-9 variants ‛ (U+201B) and ‟ (U+201F). The German low quotes ‚ (U+201A) and „ (U+201E) are not in this group; Unicode classes them as Ps.
Pf
Punctuation, final quote: closing quotation marks. ” ’ › » and their relatives. Note: the ASCII straight-quote characters U+0022 " and U+0027 ' are not Pi or Pf. They are Po, because the same codepoint can serve as either opening or closing in plain ASCII.
Po
Punctuation, other: everything that doesn't fit above. FULL STOP . (U+002E), COMMA , (U+002C), SEMICOLON ; (U+003B), COLON : (U+003A), EXCLAMATION MARK ! (U+0021), QUESTION MARK ? (U+003F), COMMERCIAL AT @ (U+0040), AMPERSAND & (U+0026), BULLET • (U+2022), the inverted ¿ and ¡, and full-stop variants from CJK and Indic scripts. The largest of the seven subcategories.
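
The subcategories above can be read straight from the Unicode Character Database. The sketch below uses Python's standard unicodedata module; the P_SUBCATEGORIES table and the punctuation_kind helper are names made up for this illustration, and the results reflect whichever Unicode version the running Python was built with.

```python
import unicodedata
from typing import Optional

# Short labels for the seven P subcategories (illustrative table,
# not something the standard library provides).
P_SUBCATEGORIES = {
    "Pc": "connector",
    "Pd": "dash",
    "Ps": "open",
    "Pe": "close",
    "Pi": "initial quote",
    "Pf": "final quote",
    "Po": "other",
}


def punctuation_kind(ch: str) -> Optional[str]:
    """Return the P subcategory label for ch, or None if ch is not punctuation."""
    return P_SUBCATEGORIES.get(unicodedata.category(ch))


if __name__ == "__main__":
    # U+2212 MINUS SIGN is a math symbol (Sm), so it yields None.
    for ch in ["_", "-", "(", ")", "\u201c", "\u201d", ".", '"', "\u2212"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), punctuation_kind(ch))
```

Looking up the label through the two-letter category keeps the helper independent of any hand-maintained character list.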

The dash family

Confusion over dashes is a recurring source of typographic bugs in publishing software. In Unicode there are at least nine commonly used Pd codepoints. The ASCII hyphen-minus (U+002D) is overloaded: it serves as hyphen, en dash and minus depending on context, which is exactly why dedicated codepoints exist. Editorial style guides give specific rules: the AP Stylebook sets an em dash with a space on each side; the Chicago Manual of Style uses an em dash without surrounding spaces; British and many European typographic traditions use an en dash with surrounding spaces. The figure dash (U+2012) is meant for digit strings such as phone numbers and ISBNs, where the dash should be the width of a digit; the horizontal bar (U+2015) introduces quoted dialogue in some Romance-language typographic styles.
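
One practical way to untangle dash confusion is to list which Pd codepoints a string actually contains. This is a minimal sketch using Python's unicodedata module; find_dashes and the sample string are hypothetical names and data chosen for illustration.

```python
import unicodedata
from typing import List, Tuple


def find_dashes(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, 'U+XXXX', character name) for every Pd character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Pd"
    ]


# Sample text with an en dash, an em dash, a hyphen-minus and a figure dash,
# written as escapes so the four look-alikes stay distinguishable in source.
sample = "pages 10\u201320 \u2014 see co-op, 555\u20120199"

if __name__ == "__main__":
    for index, codepoint, name in find_dashes(sample):
        print(index, codepoint, name)
```

Note that MINUS SIGN (U+2212) would not be reported, because it belongs to Sm rather than Pd.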

The quote families

The Pi/Pf split is locale-dependent in a way that the bracket split isn't. In English the curly opening and closing quotes are visually distinguishable and map directly onto Pi and Pf. German uses „low opening“ and „high closing“ marks: the low opening mark is U+201E, which Unicode classes as Ps, and the closing mark is U+201C, the same Pi codepoint that English uses to open a quotation. French uses guillemets « », which are Pi/Pf but are usually set off from the quoted text by non-breaking spaces. The general category is therefore a default hint rather than a per-language rule, and locale-aware "smart quote" software has to choose the opening and closing codepoints for each language itself. The single straight ASCII ' (U+0027) is Po precisely because plain ASCII can't tell whether you mean apostrophe, opening or closing quote.
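
The locale dependence is easy to see by checking the general category of each language's quotation marks. In the sketch below, the QUOTE_PAIRS table is a small hand-written assumption for three languages, not data shipped with Unicode or Python, and quote_categories and wrap are hypothetical helper names.

```python
import unicodedata
from typing import Tuple

# (opening, closing) marks per language; an illustrative, hand-written table.
QUOTE_PAIRS = {
    "en": ("\u201c", "\u201d"),  # left/right double quotation mark
    "de": ("\u201e", "\u201c"),  # low-9 opening, left double mark as closer
    "fr": ("\u00ab", "\u00bb"),  # guillemets
}


def quote_categories(lang: str) -> Tuple[str, str]:
    """Return the general categories of the opening and closing marks."""
    open_q, close_q = QUOTE_PAIRS[lang]
    return unicodedata.category(open_q), unicodedata.category(close_q)


def wrap(text: str, lang: str) -> str:
    """Wrap text in the quotation marks assumed for lang (no spacing rules)."""
    open_q, close_q = QUOTE_PAIRS[lang]
    return f"{open_q}{text}{close_q}"


if __name__ == "__main__":
    for lang in QUOTE_PAIRS:
        print(lang, quote_categories(lang), wrap("text", lang))
```

The German row is the instructive one: its opener reports Ps and its closer reports Pi, so software that infers quote direction from the category alone would get German wrong.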

Brackets and bidi mirroring

Most Ps characters have a matching Pe, linked through the Bidi_Mirroring_Glyph property. When a parenthesis is resolved to right-to-left direction, the renderer draws the glyph of its mirrored partner so that the visual nesting still makes sense. This is purely a display transformation: the codepoints in memory don't change. The full mirroring table is in BidiMirroring.txt, the data file that bidi implementations such as ICU's draw on, and the same mirroring applies when a browser lays out text under the CSS direction property.
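
The display-only nature of mirroring can be sketched in a few lines. Python's unicodedata.mirrored reports whether a character has the Bidi_Mirrored property, but the standard library does not expose Bidi_Mirroring_Glyph, so the MIRROR table below is a hand-copied subset assumed for illustration. The display_glyph helper is hypothetical and takes a simple rtl flag, whereas a real bidi engine decides per character from resolved embedding levels.

```python
import unicodedata

# Hand-copied subset of mirror pairs; the complete list is BidiMirroring.txt.
MIRROR = {
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
}


def display_glyph(ch: str, rtl: bool) -> str:
    """Return the character whose glyph would be drawn for ch.

    The stored text is never modified; only the drawn glyph is chosen here.
    """
    if rtl and unicodedata.mirrored(ch):
        return MIRROR.get(ch, ch)
    return ch


if __name__ == "__main__":
    stored = "(abc)"
    drawn = "".join(display_glyph(ch, rtl=True) for ch in stored)
    print(repr(stored), "->", repr(drawn))
```

The stored string is left untouched, which matches the point above that mirroring happens at render time rather than in memory.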

Example characters

U+002E · Po · . · Full Stop
U+002C · Po · , · Comma
U+003A · Po · : · Colon
U+003B · Po · ; · Semicolon
U+0021 · Po · ! · Exclamation Mark
U+003F · Po · ? · Question Mark
U+2022 · Po · • · Bullet
U+005F · Pc · _ · Low Line
U+002D · Pd · - · Hyphen-Minus
U+2013 · Pd · – · En Dash
U+2014 · Pd · — · Em Dash
U+0028 · Ps · ( · Left Parenthesis
U+0029 · Pe · ) · Right Parenthesis
U+201C · Pi · “ · Left Double Quote
U+201D · Pf · ” · Right Double Quote
U+00AB · Pi · « · Left Guillemet
