An HTML character reference is a piece of source text, written with an ampersand and a semicolon, that the browser parses into a single character before rendering. Character references were essential in 1995, when documents arrived as ASCII or Latin-1 and there was no way to write U+2014 EM DASH except through an escape. Today, with UTF-8 the default and `<meta charset="utf-8">` conventional, character references are mostly a fallback. There are exactly four characters you still must escape, and a small number of contexts where escapes are convenient. The rest is history.
The three forms
HTML defines three syntactic forms for a character reference:
- Named
- An ampersand, a name, and a semicolon. Around 2,231 named references in HTML5. Example: `&euro;` for €.
- Decimal numeric
- An ampersand, a hash, decimal digits of the codepoint, and a semicolon. Example: `&#8364;` for €.
- Hexadecimal numeric
- An ampersand, hash-x, hex digits of the codepoint, and a semicolon. Example: `&#x20AC;` for €.
All three produce the same parsed character in any modern browser. The leading `&` begins a reference; the `;` ends it. Inside the numeric forms, leading zeros are permitted (`&#08364;` works); inside the named form, capitalisation matters for most names.
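The equivalence of the three forms can be checked outside a browser. The sketch below uses Python's standard `html.unescape`, which follows the HTML5 reference rules; the variable names are illustrative only.

```python
import html

# The three syntactic forms for the euro sign, plus a zero-padded decimal variant.
forms = ["&euro;", "&#8364;", "&#x20AC;", "&#08364;"]
decoded = [html.unescape(ref) for ref in forms]

# Names are case-sensitive: these are two different references.
upper_theta = html.unescape("&Theta;")  # Greek capital theta, U+0398
lower_theta = html.unescape("&theta;")  # Greek small theta, U+03B8

for ref, char in zip(forms, decoded):
    print(f"{ref!r:12} -> U+{ord(char):04X}")
```

Every entry in `decoded` is the same one-character string, while the two theta references decode to different letters.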
The four you actually still need
If your document is served as UTF-8 with the correct content-type or meta charset, you can write the character directly for almost everything. The exceptions are the four characters that have a syntactic role in HTML and would confuse the parser if left literal:
| Character | Why escape | Named | Decimal | Hex |
|---|---|---|---|---|
| `&` | Starts a character reference | `&amp;` | `&#38;` | `&#x26;` |
| `<` | Starts a tag | `&lt;` | `&#60;` | `&#x3C;` |
| `>` | Ambiguous in some legacy contexts | `&gt;` | `&#62;` | `&#x3E;` |
| `"` | Closes attribute values | `&quot;` | `&#34;` | `&#x22;` |
The single quote `'` is also worth escaping (as `&#39;`) inside attributes delimited with single quotes. The named form `&apos;` exists in HTML5 but did not exist in HTML 4 and was unsafe in older browsers; the numeric form is universally supported.
Everything else (accented letters, currency signs, em dashes, emoji) can be written directly in UTF-8 source. The browser parses your literal € exactly as it would parse `&euro;`. The choice between them is editorial, not technical.
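As a rough illustration of escaping only the syntactic characters, here is a sketch using Python's standard `html.escape`. The sample string is made up; note that `html.escape` with its default `quote=True` also escapes the single quote, which goes slightly beyond the four characters in the table.

```python
import html

raw = 'if (a < b && b > c) say("hi")'

# For element content: escape &, < and > only.
text_safe = html.escape(raw, quote=False)

# For attribute values: also escape " (and ', which html.escape writes as &#x27;).
attr_safe = html.escape(raw)

print(text_safe)
print(attr_safe)

# Non-syntactic characters pass through untouched.
untouched = html.escape("Caf\u00e9 \u20ac5")
```

Decoding `attr_safe` with `html.unescape` gives back the original string, so the escaping is lossless.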
The 30 most useful named entities
| Glyph | Named | Codepoint | What it is |
|---|---|---|---|
| & | `&amp;` | U+0026 | Ampersand |
| < | `&lt;` | U+003C | Less-than |
| > | `&gt;` | U+003E | Greater-than |
| " | `&quot;` | U+0022 | Quotation mark |
| ' | `&apos;` | U+0027 | Apostrophe |
| (invisible) | `&nbsp;` | U+00A0 | Non-breaking space |
| © | `&copy;` | U+00A9 | Copyright sign |
| ® | `&reg;` | U+00AE | Registered sign |
| ™ | `&trade;` | U+2122 | Trade mark sign |
| € | `&euro;` | U+20AC | Euro sign |
| £ | `&pound;` | U+00A3 | Pound sign |
| ¥ | `&yen;` | U+00A5 | Yen sign |
| ° | `&deg;` | U+00B0 | Degree sign |
| ± | `&plusmn;` | U+00B1 | Plus-minus sign |
| × | `&times;` | U+00D7 | Multiplication sign |
| ÷ | `&divide;` | U+00F7 | Division sign |
| — | `&mdash;` | U+2014 | Em dash |
| – | `&ndash;` | U+2013 | En dash |
| … | `&hellip;` | U+2026 | Horizontal ellipsis |
| “ | `&ldquo;` | U+201C | Left double quotation mark |
| ” | `&rdquo;` | U+201D | Right double quotation mark |
| ‘ | `&lsquo;` | U+2018 | Left single quotation mark |
| ’ | `&rsquo;` | U+2019 | Right single quotation mark |
| « | `&laquo;` | U+00AB | Left guillemet |
| » | `&raquo;` | U+00BB | Right guillemet |
| § | `&sect;` | U+00A7 | Section sign |
| ¶ | `&para;` | U+00B6 | Pilcrow |
| • | `&bull;` | U+2022 | Bullet |
| ← | `&larr;` | U+2190 | Leftwards arrow |
| → | `&rarr;` | U+2192 | Rightwards arrow |
The full HTML5 named-entity list is maintained at html.spec.whatwg.org/entities.json and contains exactly 2,231 entries. Many are obscure mathematical or technical symbols; in practice almost everyone uses fewer than thirty.
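If you would rather query the table than read it, Python ships a copy of the HTML5 named-reference list as `html.entities.html5`. The helper below, `describe`, is a hypothetical name used only for this sketch; it looks up a name and reports the codepoint it stands for.

```python
from html.entities import html5

def describe(name: str) -> str:
    """Return a one-line description of a named reference such as 'mdash'."""
    # Keys in the html5 table include the trailing semicolon, e.g. "mdash;".
    chars = html5[name + ";"]
    codepoints = " ".join(f"U+{ord(c):04X}" for c in chars)
    return f"&{name}; -> {chars} ({codepoints})"

for name in ["mdash", "euro", "copy", "rarr"]:
    print(describe(name))

# Size of Python's copy of the table (it also holds legacy names without ';').
print(len(html5))
```

An unknown name raises `KeyError`, which is a reasonable signal that the name is not part of the HTML5 set.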
HTML, XML, XHTML — the differences
These syntaxes treat named entities differently, and the differences matter when files cross between contexts.
- HTML5
- ~2,231 named entities are recognised. The DOCTYPE is informational, not a DTD reference. Named entities are part of the parser's hard-coded table.
- XML 1.0
- Only five named entities are predefined: `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`. Any additional names must be declared in a DTD using `<!ENTITY>` declarations, otherwise the XML parser raises a well-formedness error.
- XHTML 1.0
- An XML application that imports the HTML 4 entity sets by referencing the public DTD (-//W3C//DTD XHTML 1.0 Strict//EN). An XHTML file without the DOCTYPE declaration cannot use `&nbsp;` or any other HTML name without a parse error.
- SVG (in HTML)
- Parsed by the HTML parser; the full HTML named-entity set is available.
- SVG (standalone, served as `image/svg+xml`)
- Parsed by the XML parser; only the five XML names are available. `&nbsp;` in a standalone SVG will break the file.
Inside an HTML document:
<p>Hello&nbsp;world</p> ← fine
Inside a standalone SVG served as image/svg+xml:
<text>Hello&nbsp;world</text> ← XML parse error
<text>Hello&#160;world</text> ← works (numeric reference)
When portability across HTML and XML matters — typically for SVG, RSS, Atom, and JSON-embedded XML — prefer the numeric forms. They are valid in every XML application without DTD declarations.
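The failure is easy to reproduce with any XML parser. The sketch below uses Python's `xml.etree.ElementTree` to show the parse error, and a hypothetical helper, `named_to_numeric`, that rewrites HTML names as decimal references while leaving the five XML names alone. It is a simple regex pass over well-behaved markup, not a full parser: it does not skip CDATA sections or comments.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import html5

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(markup: str) -> str:
    """Rewrite HTML named references as decimal references (illustrative helper)."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)      # already valid in every XML document
        chars = html5.get(name + ";")
        if chars is None:
            return match.group(0)      # unknown name: leave it for the parser to report
        # A few names expand to two codepoints, so emit one reference per character.
        return "".join(f"&#{ord(c)};" for c in chars)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

def parses_as_xml(markup: str) -> bool:
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

broken = "<text>Hello&nbsp;world</text>"
fixed = named_to_numeric(broken)
print(parses_as_xml(broken), parses_as_xml(fixed), fixed)
```
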
Attribute contexts and the encoding rules
The OWASP cross-site scripting cheat sheet treats HTML escaping as a context-sensitive operation. Which characters to escape depends on where the value will appear:
- HTML element content
- Escape `&`, `<`, `>`. (`"` and `'` are not required here but harmless.)
- HTML double-quoted attribute
- Escape `&` and `"`.
- HTML single-quoted attribute
- Escape `&` and `'`.
- HTML unquoted attribute
- Escape `&`, `"`, `'`, space, tab, newline, `=`, `<`, `>`, backtick. Or, more practically, always quote your attributes.
- URL
- Use percent-encoding (RFC 3986), not HTML entities. See the URL encoder.
- JavaScript string
- Use JavaScript string escapes (`\xHH`, `\uHHHH`, `\u{HHHHHH}`), not HTML entities. The HTML parser does not decode character references inside `<script>`.
- CSS value
- Use CSS escapes (`\HHHHHH` with hex digits and an optional trailing space).
The general rule: HTML entities are for HTML. URLs need percent-encoding, JavaScript needs \u escapes, CSS needs \ escapes. Mixing them is the source of an embarrassing share of XSS bugs.
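To make the contrast concrete, the sketch below runs one made-up value through three standard-library encoders: `html.escape` for HTML, `urllib.parse.quote` for a URL component, and `json.dumps` as a stand-in for a JavaScript string literal. The last is an approximation: JSON string syntax is close to JavaScript's, but `json.dumps` alone does not make a value safe to place inside an inline `<script>` element.

```python
import html
import json
from urllib.parse import quote

value = 'Tom & "Jerry" <3'

as_html_text = html.escape(value, quote=False)  # element content
as_html_attr = html.escape(value, quote=True)   # quoted attribute value
as_url_part = quote(value, safe="")             # percent-encoding for a URL component
as_js_string = json.dumps(value)                # backslash escapes, with surrounding quotes

for label, encoded in [
    ("HTML text", as_html_text),
    ("HTML attribute", as_html_attr),
    ("URL component", as_url_part),
    ("JS string", as_js_string),
]:
    print(f"{label:15} {encoded}")
```

The same input produces four different outputs, and none of them is a safe substitute for another.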
Numeric vs named — which to use
Named entities are easier to read and harder to mistype: `&mdash;` reads better in source than `&#8212;`. They are also slightly longer to transmit, but every reasonable compression algorithm collapses that difference. The arguments against named entities are:
- They are not portable to XML without DTD declarations.
- The named-entity table grew gradually and some names predate Unicode (the `&OElig;` name for Œ, for instance, is older than the official Unicode name).
- A few have unexpected meanings: `&Theta;` is the Greek capital theta (U+0398), not the math symbol.
For ordinary editorial use inside an HTML5 document, named entities are fine. For machine-generated content, library APIs that produce escaped output, and any context where XML compatibility matters, prefer numeric. For the four syntactic characters (& < > "), either form is acceptable; named is conventional.
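For machine-generated output, one way to get numeric references is Python's `xmlcharrefreplace` error handler, which turns every character outside the target encoding into a decimal reference. The function name `to_numeric_html` is an assumption for this sketch; it keeps the named forms for the syntactic characters, as is conventional, and emits pure ASCII.

```python
import html

def to_numeric_html(text: str) -> str:
    """Escape the syntactic characters, then write all non-ASCII as &#NNNN;."""
    escaped = html.escape(text, quote=True)
    # Encoding to ASCII with this handler replaces each unencodable character
    # with its decimal character reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

sample = "Caf\u00e9 \u2014 5 < 6"
encoded = to_numeric_html(sample)
print(encoded)
```

The result survives any transport that mangles non-ASCII bytes, and decoding it with `html.unescape` returns the original text.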
The number of HTML named entities is large and grows in odd places. Some pairs share a glyph but differ in semantics: `&empty;` and `&emptyset;` both render as ∅ (U+2205). When choosing between two names for the same glyph, use the named entity whose Unicode codepoint matches the meaning you want.
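Names that share a codepoint can be listed directly from the table. This sketch again relies on Python's `html.entities.html5` copy of the list; `names_for` is a hypothetical helper, and it requires Python 3.9 or later for the `list[str]` annotation.

```python
from html.entities import html5

def names_for(char: str) -> list[str]:
    """Return every semicolon-terminated HTML5 name that decodes to `char`."""
    return sorted(
        name
        for name, value in html5.items()
        if value == char and name.endswith(";")
    )

# U+2205 EMPTY SET is reachable through more than one name.
empty_set_names = names_for("\u2205")
print(empty_set_names)
```
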
What to remember
With UTF-8 source and <meta charset="utf-8">, you need to escape four characters in HTML: &, <, >, and ". Everything else is editorial preference. For XML and standalone SVG, prefer numeric references. For URLs and scripts, use the encoding mechanisms appropriate to those contexts. The HTML entity encoder converts strings between literal, named, decimal, and hexadecimal forms in either direction.
Further reading
- HTML entity encoder: convert between literal text, named, decimal, and hex character references.
- © U+00A9 Copyright sign: `&copy;` in the table above.
- € U+20AC Euro sign: `&euro;` in the table above.
- — U+2014 Em dash: `&mdash;` in the table above.
- UTF-8, UTF-16, UTF-32 compared: the encoding that made most entities unnecessary.
- Character inspector: see exactly which codepoint a character reference produces.