HTML entities and escapes

An HTML character reference is a piece of source text, written with an ampersand and a semicolon, that the browser parses into a single character before rendering. Character references were essential in 1995, when documents arrived as ASCII or Latin-1 and there was no way to write U+2014 EM DASH except through an escape. Today, with UTF-8 the default and <meta charset="utf-8"> conventional, they are mostly a fallback. There are exactly four characters you still must escape, and a small number of contexts where escapes are convenient. The rest is history.

The three forms

HTML defines three syntactic forms for a character reference:

Named: An ampersand, a name, and a semicolon. HTML5 defines 2,231 named references. Example: &euro; for €.
Decimal numeric: An ampersand, a hash, decimal digits of the codepoint, and a semicolon. Example: &#8364; for €.
Hexadecimal numeric: An ampersand, hash-x, hex digits of the codepoint, and a semicolon. Example: &#x20AC; for €.

All three produce the same parsed character in any modern browser. The leading & begins a reference; the ; ends it. Inside the numeric forms, leading zeros are permitted (&#08364; works); inside the named form, capitalisation matters for most names.
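As a quick check of the claim above, here is a minimal sketch that decodes each form with Python's standard-library html.unescape, which follows the HTML5 reference table. Python is used only for illustration, and the variable names are made up for this example.

```python
import html

# The three syntactic forms for U+20AC EURO SIGN, plus a numeric
# reference written with a leading zero.
forms = ["&euro;", "&#8364;", "&#x20AC;", "&#08364;"]
decoded = [html.unescape(ref) for ref in forms]

# Every form decodes to the same single character.
all_same = all(ch == "\u20ac" for ch in decoded)

# Named references are case-sensitive: these are two different letters.
upper_theta = html.unescape("&Theta;")  # U+0398 GREEK CAPITAL LETTER THETA
lower_theta = html.unescape("&theta;")  # U+03B8 GREEK SMALL LETTER THETA

print(decoded, all_same, upper_theta != lower_theta)
```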

The four you actually still need

If your document is served as UTF-8 with the correct content-type or meta charset, you can write the character directly for almost everything. The exceptions are the four characters that have a syntactic role in HTML and would confuse the parser if left literal:

Character	Why escape	Named	Decimal	Hex
`&`	Starts a character reference	`&amp;`	`&#38;`	`&#x26;`
`<`	Starts a tag	`&lt;`	`&#60;`	`&#x3C;`
`>`	Ambiguous in some legacy contexts	`&gt;`	`&#62;`	`&#x3E;`
`"`	Closes attribute values	`&quot;`	`&#34;`	`&#x22;`

The single quote ' is also worth escaping (as &#39;) inside attributes delimited with single quotes. The named form &apos; exists in HTML5 but did not exist in HTML 4 and was unsafe in older browsers; the numeric form is universally supported.

Everything else (accented letters, currency signs, em dashes, emoji) can be written directly in UTF-8 source. The browser parses your literal € exactly as it would parse &euro;. The choice between them is editorial, not technical.
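To make the short list concrete, the sketch below uses Python's html.escape, which escapes only the syntactic characters and passes everything else through. It is one possible illustration, not the only way to escape; note that html.escape writes the single quote as the hexadecimal form &#x27; rather than &#39;.

```python
import html

raw = 'Fish & "Chips" <b>'

# Element content: escape &, < and >, and leave quotes alone.
as_content = html.escape(raw, quote=False)

# Attribute value: quote=True (the default) also escapes " and '.
as_attribute = html.escape(raw + " it's")

# Non-ASCII characters such as the euro sign are not touched.
euro = html.escape("price: 5 \u20ac")

print(as_content)
print(as_attribute)
print(euro)
```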

The 30 most useful named entities

Glyph	Named	Codepoint	What it is
&	`&amp;`	U+0026	Ampersand
<	`&lt;`	U+003C	Less-than
>	`&gt;`	U+003E	Greater-than
"	`&quot;`	U+0022	Quotation mark
'	`&apos;`	U+0027	Apostrophe
 	`&nbsp;`	U+00A0	Non-breaking space
©	`&copy;`	U+00A9	Copyright sign
®	`&reg;`	U+00AE	Registered sign
™	`&trade;`	U+2122	Trade mark sign
€	`&euro;`	U+20AC	Euro sign
£	`&pound;`	U+00A3	Pound sign
¥	`&yen;`	U+00A5	Yen sign
°	`&deg;`	U+00B0	Degree sign
±	`&plusmn;`	U+00B1	Plus-minus sign
×	`&times;`	U+00D7	Multiplication sign
÷	`&divide;`	U+00F7	Division sign
—	`&mdash;`	U+2014	Em dash
–	`&ndash;`	U+2013	En dash
…	`&hellip;`	U+2026	Horizontal ellipsis
“	`&ldquo;`	U+201C	Left double quotation mark
”	`&rdquo;`	U+201D	Right double quotation mark
‘	`&lsquo;`	U+2018	Left single quotation mark
’	`&rsquo;`	U+2019	Right single quotation mark
«	`&laquo;`	U+00AB	Left guillemet
»	`&raquo;`	U+00BB	Right guillemet
§	`&sect;`	U+00A7	Section sign
¶	`&para;`	U+00B6	Pilcrow
•	`&bull;`	U+2022	Bullet
←	`&larr;`	U+2190	Leftwards arrow
→	`&rarr;`	U+2192	Rightwards arrow

The full HTML5 named-entity list is maintained at html.spec.whatwg.org/entities.json and contains exactly 2,231 entries. Many are obscure mathematical or technical symbols; in practice almost everyone uses fewer than thirty.
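Rows like the ones in the table can be derived mechanically. The following sketch builds the named, decimal, and hexadecimal forms for a character; reference_forms is a hypothetical helper written for this example. It assumes Python's html.entities.codepoint2name, which covers only the HTML 4 names, so characters whose name was added later (the apostrophe, for instance) come back without a named form.

```python
from html.entities import codepoint2name


def reference_forms(char: str) -> dict:
    """Return the reference forms for a single character."""
    cp = ord(char)
    name = codepoint2name.get(cp)  # HTML 4 names only; None if absent
    return {
        "named": f"&{name};" if name else None,
        "decimal": f"&#{cp};",
        "hex": f"&#x{cp:X};",
        "codepoint": f"U+{cp:04X}",
    }


em_dash = reference_forms("\u2014")
right_arrow = reference_forms("\u2192")
apostrophe = reference_forms("'")

print(em_dash)
print(right_arrow)
print(apostrophe)
```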

HTML, XML, XHTML — the differences

The three syntaxes treat named entities differently and the differences matter when files cross between contexts.

HTML5: ~2,231 named entities are recognised. The DOCTYPE is informational, not a DTD reference. Named entities are part of the parser's hard-coded table.
XML 1.0: Only five named entities are predefined: &amp;, &lt;, &gt;, &quot;, &apos;. Any additional names must be declared in a DTD using <!ENTITY> declarations, otherwise the XML parser raises a well-formedness error.
XHTML 1.0: An XML application that imports the HTML 4 entity sets by referencing the public DTD (-//W3C//DTD XHTML 1.0 Strict//EN). An XHTML file without the DOCTYPE declaration cannot use &nbsp; or any other HTML name without a parse error.
SVG (in HTML): Parsed by the HTML parser; the full HTML named-entity set is available.
SVG (standalone, served as image/svg+xml): Parsed by the XML parser; only the five XML names are available. &nbsp; in a standalone SVG will break the file.

Inside an HTML document:
  <p>Hello &nbsp; world</p>          ← fine

Inside a standalone SVG served as image/svg+xml:
  <text>Hello &nbsp; world</text>    ← XML parse error
  <text>Hello &#xA0; world</text>    ← works (numeric reference)

When portability across HTML and XML matters — typically for SVG, RSS, Atom, and JSON-embedded XML — prefer the numeric forms. They are valid in every XML application without DTD declarations.
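The difference is easy to reproduce with any XML parser. The sketch below uses Python's xml.etree.ElementTree to show the failure and then rewrites HTML-only names as numeric references. named_to_numeric and is_well_formed are hypothetical helpers for this example, and the regular expression is deliberately naive: it does not skip CDATA sections or comments, so treat it as an illustration rather than a production converter.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import html5

# The five names every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(markup: str) -> str:
    """Rewrite HTML-only named references as hexadecimal numeric ones."""
    def replace(match):
        name = match.group(1)
        chars = html5.get(name + ";")
        if name in XML_PREDEFINED or chars is None:
            return match.group(0)  # leave XML names and unknown names alone
        return "".join(f"&#x{ord(c):X};" for c in chars)

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


def is_well_formed(markup: str) -> bool:
    """True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


svg_text = "<text>Hello &nbsp; world</text>"
portable = named_to_numeric(svg_text)

print(is_well_formed(svg_text))   # the XML parser rejects &nbsp;
print(portable)
print(is_well_formed(portable))
```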

Attribute contexts and the encoding rules

The OWASP cross-site scripting cheat sheet treats HTML escaping as a context-sensitive operation. Which characters to escape depends on where the value will appear:

HTML element content: Escape & < >. (" and ' are not required here but harmless.)
HTML double-quoted attribute: Escape & and ".
HTML single-quoted attribute: Escape & and '.
HTML unquoted attribute: Escape &, ", ', space, tab, newline, =, <, >, backtick. Or, more practically, always quote your attributes.
URL: Use percent-encoding (RFC 3986), not HTML entities. See the URL encoder.
JavaScript string: Use JavaScript string escapes (\xHH, \uHHHH, \u{HHHHHH}), not HTML entities. The HTML parser does not decode character references inside <script>.
CSS value: Use CSS escapes (\HHHHHH with hex digits and an optional trailing space).

The general rule: HTML entities are for HTML. URLs need percent-encoding, JavaScript needs \u escapes, CSS needs \ escapes. Mixing them is the source of an embarrassing share of XSS bugs.
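The list above can be sketched as one small function per context. All function names here are hypothetical, and the JavaScript and CSS escapers take a conservative approach (escape everything that is not an ASCII letter or digit) that is an assumption of this example, not something the cheat sheet mandates. Real templates are usually better served by an auto-escaping template engine.

```python
import html
from urllib.parse import quote


def escape_element_content(value: str) -> str:
    # Escapes &, < and >; quotes are left as they are.
    return html.escape(value, quote=False)


def escape_double_quoted_attr(value: str) -> str:
    # Only & and " need escaping inside a double-quoted attribute.
    return value.replace("&", "&amp;").replace('"', "&quot;")


def escape_url_component(value: str) -> str:
    # Percent-encoding (RFC 3986); safe="" also encodes "/".
    return quote(value, safe="")


def _is_plain(ch: str) -> bool:
    return ch.isascii() and ch.isalnum()


def escape_js_string(value: str) -> str:
    # \uHHHH for the Basic Multilingual Plane, \u{H...} above it.
    out = []
    for ch in value:
        cp = ord(ch)
        if _is_plain(ch):
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\u{{{cp:X}}}")
    return "".join(out)


def escape_css_value(value: str) -> str:
    # Backslash, hex digits, and a trailing space that ends the escape.
    return "".join(ch if _is_plain(ch) else f"\\{ord(ch):X} " for ch in value)


print(escape_element_content("<b>Tom & Jerry</b>"))
print(escape_double_quoted_attr('say "hi" & go'))
print(escape_url_component("a b&c=d"))
print(escape_js_string("</script>"))
print(escape_css_value("a b"))
```

The same input produces five different outputs, which is the point: an escaper chosen for the wrong context leaves the dangerous characters of the actual context untouched.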

Numeric vs named — which to use

Named entities are easier to read and harder to mistype: &mdash; reads better in source than &#8212;. They are also slightly longer to transmit, but every reasonable compression algorithm collapses that difference. The arguments against named entities are:

They are not portable to XML without DTD declarations.
The named-entity table grew gradually and some names predate Unicode (the &OElig; name for Œ, for instance, is older than the official Unicode name).
A few have unexpected meanings: &Theta; is the Greek capital theta (U+0398), not the math symbol.

For ordinary editorial use inside an HTML5 document, named entities are fine. For machine-generated content, library APIs that produce escaped output, and any context where XML compatibility matters, prefer numeric. For the four syntactic characters (& < > "), either form is acceptable; named is conventional.

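The number of HTML named entities is large and grows in odd places. Some names are aliases for the same codepoint: &empty; and &varnothing; both resolve to ∅ (U+2205). Others look alike but resolve to different codepoints, such as &Sigma; (U+03A3, the Greek letter) and &sum; (U+2211, the summation sign). When choosing between two names for a similar glyph, check the codepoint and use the name whose codepoint matches the meaning you want.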
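One way to check what a name really resolves to is to look it up in the HTML5 table. The sketch below assumes Python's html.entities.html5 dictionary, whose keys include the trailing semicolon; names_for is a hypothetical helper for this example.

```python
from html.entities import html5


def names_for(char: str) -> list:
    """All semicolon-terminated HTML5 names that resolve to `char`."""
    return sorted(
        name for name, value in html5.items()
        if value == char and name.endswith(";")
    )


# Aliases: several names, one codepoint (U+2205 EMPTY SET).
empty_set_names = names_for("\u2205")

# Look-alikes: similar glyphs, different codepoints.
sigma_letter = html5["Sigma;"]  # U+03A3 GREEK CAPITAL LETTER SIGMA
sum_operator = html5["sum;"]    # U+2211 N-ARY SUMMATION

print(empty_set_names)
print(f"U+{ord(sigma_letter):04X}", f"U+{ord(sum_operator):04X}")
```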

What to remember

With UTF-8 source and <meta charset="utf-8">, you need to escape four characters in HTML: &, <, >, and ". Everything else is editorial preference. For XML and standalone SVG, prefer numeric references. For URLs and scripts, use the encoding mechanisms appropriate to those contexts. The HTML entity encoder converts strings between literal, named, decimal, and hexadecimal forms in either direction.
