The Separator group is the smallest of the seven major categories with only nineteen total codepoints, but each of them does real work. Zs covers horizontal whitespace from the ASCII space to the typographic em quad. Zl is one codepoint, U+2028, used to mark a line separator. Zp is one codepoint, U+2029, used to mark a paragraph separator. None of these are control characters — that role belongs to Cc — and none are combining marks. They simply separate.
The subcategories
- Zs: Separator, space. Seventeen codepoints: SPACE U+0020, NO-BREAK SPACE U+00A0, OGHAM SPACE MARK U+1680, the typographic spaces EN QUAD U+2000, EM QUAD U+2001, EN SPACE U+2002, EM SPACE U+2003, THREE-PER-EM SPACE U+2004, FOUR-PER-EM SPACE U+2005, SIX-PER-EM SPACE U+2006, FIGURE SPACE U+2007, PUNCTUATION SPACE U+2008, THIN SPACE U+2009, HAIR SPACE U+200A, NARROW NO-BREAK SPACE U+202F, MEDIUM MATHEMATICAL SPACE U+205F, and IDEOGRAPHIC SPACE U+3000.
- Zl: Separator, line. Exactly one codepoint, LINE SEPARATOR U+2028. Intended to mark the end of a line without ending a paragraph. Rarely seen in practice because most software uses LF (U+000A) or CRLF instead, but legitimately used in some XML, plist and InDesign workflows. It notoriously broke JavaScript before ES2019: a raw U+2028 is legal inside a JSON string but was a syntax error inside a JavaScript string literal, so JSON pasted into script source could fail to parse.
- Zp: Separator, paragraph. Exactly one codepoint, PARAGRAPH SEPARATOR U+2029. Intended to mark a hard paragraph boundary. Same provenance and same software-compatibility caveats as Zl.
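The counts above can be checked against whatever Unicode data a runtime ships. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `separators_by_category` is made up for illustration, and the result reflects the Unicode version bundled with the interpreter, so an old build may report different totals.

```python
import sys
import unicodedata

def separators_by_category():
    """Collect every codepoint whose General_Category is Zs, Zl or Zp."""
    groups = {"Zs": [], "Zl": [], "Zp": []}
    for cp in range(sys.maxunicode + 1):
        cat = unicodedata.category(chr(cp))
        if cat in groups:
            groups[cat].append(cp)
    return groups

groups = separators_by_category()
for cat, cps in groups.items():
    # One line per subcategory: name, count, then the codepoints in U+ notation.
    print(cat, len(cps), " ".join(f"U+{cp:04X}" for cp in cps))
```

The scan walks the whole codepoint range once, which is slow-ish but keeps the sketch free of hard-coded tables.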
Why ASCII space is Zs but tab is Cc
The ASCII space U+0020 is Zs. The horizontal tab U+0009 is Cc, not Zs, because Unicode treats tab as a control character inherited from ASCII. The same applies to U+000A LINE FEED, U+000D CARRIAGE RETURN, U+000B VERTICAL TAB and U+000C FORM FEED, all of which are Cc. This trips up regex authors who expect \p{Z} to match all whitespace; it doesn't. To match "whitespace" in the broader sense you usually want the explicit White_Space property (listed in the UCD's PropList.txt), which covers Zs, Zl and Zp plus tab, line feed, vertical tab, form feed, carriage return and U+0085 NEXT LINE. Many modern regex engines expose this as \p{White_Space}, and \s in Unicode mode is usually close to it, though the exact set varies by engine.
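To make the gap between the Z category and White_Space concrete, here is a small sketch. Python's standard `re` module has no `\p{...}` syntax, so the White_Space set is written out by hand; the range table is an assumption transcribed from PropList.txt and should be checked against the UCD version you target. The names `WHITE_SPACE_RANGES`, `is_white_space` and `is_separator` are illustrative only.

```python
import unicodedata

# Assumption: hand-copied from the White_Space entries in PropList.txt.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D),  # tab, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEXT LINE
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # typographic spaces
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]
WHITE_SPACE = {cp for lo, hi in WHITE_SPACE_RANGES for cp in range(lo, hi + 1)}

def is_white_space(ch):
    """True if ch is in the hand-written White_Space set above."""
    return ord(ch) in WHITE_SPACE

def is_separator(ch):
    """True if ch has General_Category Zs, Zl or Zp."""
    return unicodedata.category(ch).startswith("Z")

# White_Space members that \p{Z} would miss, with their actual category.
non_z = sorted(cp for cp in WHITE_SPACE if not is_separator(chr(cp)))
for cp in non_z:
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)))
```

Every codepoint the loop prints is a control character, which is exactly the set a `\p{Z}`-only pattern silently skips.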
U+2028 vs U+2029 vs U+000A vs CRLF
There are five common ways to mark a line boundary in plain text:
- LF (U+000A): Unix line ending, and what \n denotes in C, Python and most programming languages.
- CRLF (U+000D + U+000A): Windows line ending and the line terminator for HTTP, SMTP, FTP and most internet protocols.
- CR (U+000D): classic Mac OS line ending. Largely extinct.
- U+2028 LINE SEPARATOR: explicit, format-agnostic line break that does not end a paragraph.
- U+2029 PARAGRAPH SEPARATOR: explicit paragraph end. The bidi algorithm assigns it Bidi_Class B (Paragraph Separator), a boundary that resets the embedding stack.
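As a quick illustration of the list above, Python's `str.splitlines()` recognizes all five boundaries (along with a few other control characters), whereas splitting on `"\n"` alone does not. This is a sketch with a made-up sample string, not a recommendation for any particular parser.

```python
import unicodedata

# One sample string that uses each of the five boundaries once.
text = "one\ntwo\r\nthree\rfour\u2028five\u2029six"

# splitlines() treats LF, CRLF, CR, U+2028 and U+2029 as line boundaries.
lines = text.splitlines()

# A naive split only sees LF, leaving CR and the Unicode separators embedded.
lf_only = text.split("\n")

# Bidi classes of the two Unicode-specific separators.
bidi_line_sep = unicodedata.bidirectional("\u2028")
bidi_para_sep = unicodedata.bidirectional("\u2029")

print(lines)
print(len(lf_only))
print(bidi_line_sep, bidi_para_sep)
```

Code that normalizes line endings before further processing usually wants the first behavior; code that must preserve the original bytes usually wants neither and should scan for the separators explicitly.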
The two Unicode-specific separators were introduced because Unicode is meant to be transport-format-independent: a text file should be parseable into lines and paragraphs without consulting a separate "platform" property. In practice almost nobody writes U+2028 or U+2029 by hand, but they appear in machine-generated text from PDF extractors, OCR tools, and some word-processor exports. Before ES2019, JavaScript treated both as line terminators even inside string literals, so a raw U+2028 pasted inside "foo" ended the line mid-literal and threw a syntax error. ES2019 made them legal unescaped in string literals; modern bundlers (esbuild, swc, babel) still escape them defensively when emitting code.
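The defensive escaping mentioned above is easy to reproduce when generating script source from another language. The sketch below assumes a hypothetical helper, `dumps_for_script`, that serializes a value to JSON while keeping other non-ASCII text readable and escaping only the two separators. Python's `json.dumps` with its default `ensure_ascii=True` already escapes them along with every other non-ASCII character, so the extra step matters only when `ensure_ascii=False` is used.

```python
import json

def dumps_for_script(value):
    """Serialize to JSON that is also safe inside pre-ES2019 JavaScript source.

    Hypothetical helper: keeps non-ASCII text as-is but rewrites raw
    U+2028 / U+2029 as \\u2028 / \\u2029 escape sequences.
    """
    text = json.dumps(value, ensure_ascii=False)
    return text.replace("\u2028", "\\u2028").replace("\u2029", "\\u2029")

payload = {"note": "line one\u2028line two\u2029caf\u00e9"}
out = dumps_for_script(payload)
print(out)
```

Because the replacement produces ordinary JSON escape sequences, the output still round-trips through any JSON parser to the original value.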
The typographic spaces in Zs
The U+2000–U+200A range encodes a small library of named horizontal spaces drawn from traditional typesetting: an em space is 1 em wide, an en space is half that, a thin space is 1/5 em, a hair space is the thinnest. These are not interchangeable with U+0020 in justified text because renderers treat them as fixed-width: they do not stretch during justification. They are useful in fixed-width data files, technical typography, and the spacing of abbreviations like "U. S. A." where some publishers prefer a thin space between the letters.
U+00A0 NO-BREAK SPACE is the most commonly used non-ASCII space. It looks like a regular space but prevents line breaks from occurring at that position. Common uses include between a number and its unit ("100 km"), between a person's title and surname ("Mr. Smith"), and between a measurement and a percent sign. It is also the source of countless mojibake bugs. In Latin-1 it is the single byte 0xA0, which is invalid UTF-8 on its own because it is a continuation byte with no lead byte. In UTF-8 it is the two bytes 0xC2 0xA0, which a Latin-1 decoder renders as "Â" followed by a no-break space.
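The encoding mismatch can be reproduced in a few lines. This is a minimal sketch using only Python's built-in codecs; the helper name `decodes_as_utf8` is illustrative.

```python
NBSP = "\u00a0"

# The same character under the two encodings.
latin1_bytes = NBSP.encode("latin-1")  # single byte 0xA0
utf8_bytes = NBSP.encode("utf-8")      # two bytes 0xC2 0xA0

def decodes_as_utf8(data):
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Latin-1 bytes read as UTF-8: rejected outright.
latin1_read_as_utf8_ok = decodes_as_utf8(latin1_bytes)

# UTF-8 bytes read as Latin-1: silently accepted, yielding a stray "Â".
mojibake = utf8_bytes.decode("latin-1")

print(latin1_bytes, utf8_bytes, latin1_read_as_utf8_ok, repr(mojibake))
```

The second failure mode is the more insidious one, since Latin-1 accepts every byte and the corruption only shows up on screen.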