Most scripts have a built-in direction. Latin, Greek, Cyrillic, Devanagari, Thai, Han — left-to-right. Arabic, Hebrew, Syriac, Thaana — right-to-left. The Unicode standard stores both kinds of text in the same order: logical order, meaning the order a reader would speak them. Turning a logical-order byte sequence into a left-to-right or right-to-left display is the job of the Unicode Bidirectional Algorithm, specified in Unicode Standard Annex #9. Every modern browser, word processor, and operating system runs this algorithm — often without you noticing — every time it paints a line that mixes scripts.
The base direction
Every paragraph has a base direction, either LTR or RTL. In HTML it is set with the dir attribute (dir="ltr", dir="rtl", or dir="auto", which lets the browser guess from the first strong character). The base direction determines two things: the side of the page where lines begin, and the default direction for any neutral characters that have no direction of their own.
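The first-strong heuristic behind dir="auto" can be sketched in a few lines of Python. This is a simplified illustration, not the browser's implementation: the function name `first_strong_direction` and the `default` parameter are assumptions made for the example, and the real rules (P2 and P3 in UAX #9) also skip over isolated content.

```python
import unicodedata

def first_strong_direction(text, default="ltr"):
    """Guess a base direction from the first strong character in text.

    Simplified: ignores isolates and explicit formatting characters,
    which the full UAX #9 rules P2/P3 take into account.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (digits, punctuation, or empty string).
    return default

print(first_strong_direction("123 abc"))
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd world"))
```

Digits and punctuation are skipped because they are weak or neutral, so a string such as "123 abc" is still classified by its first letter.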
The directional types
The Bidi Algorithm starts by assigning every codepoint one of 23 directional types (its Bidi_Class property), listed in the UnicodeData.txt and DerivedBidiClass.txt files. The three big groups are strong, weak, and neutral; a fourth group covers the explicit formatting controls.
| Group | Type | Meaning |
|---|---|---|
| Strong | L | Left-to-right (Latin, Greek, CJK, etc.) |
| | R | Right-to-left (Hebrew) |
| | AL | Right-to-left Arabic (Arabic, Syriac, Thaana) |
| Weak | EN | European number (0–9) |
| | AN | Arabic number (Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩) |
| | ES | European number separator (+ -) |
| | ET | European number terminator ($ %) |
| | CS | Common number separator (, . :) |
| | NSM | Non-spacing combining mark |
| | BN | Boundary neutral (control characters) |
| Neutral | B | Paragraph separator |
| | S | Segment separator (tab) |
| | WS | Whitespace |
| | ON | Other neutrals (most punctuation) |
| Explicit | LRE/RLE | Left-/right-to-left embedding (discouraged) |
| | LRO/RLO | Left-/right-to-left override |
| | PDF | Pop directional formatting |
| | LRI/RLI/FSI | Isolate controls (introduced Unicode 6.3) |
| | PDI | Pop directional isolate |
The algorithm's job is to take a logical-order string of these typed characters and decide the visual order. The strong characters anchor the runs. The weak and neutral characters take their direction from the surrounding context: a comma between two Hebrew words inherits right-to-left, a comma between two English words inherits left-to-right.
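Python's standard library exposes these types directly, which makes it easy to inspect a string before reasoning about its display. The sketch below uses `unicodedata.bidirectional`; the helper name `bidi_types` is an assumption for illustration, and the results depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def bidi_types(text):
    """Return the Bidi_Class abbreviation of every code point in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letters, a space, Hebrew aleph-bet-gimel, a space, three digits.
sample = "abc \u05d0\u05d1\u05d2 123"
for ch, bidi_class in zip(sample, bidi_types(sample)):
    print(f"U+{ord(ch):04X} {bidi_class}")
```

The same call reports AL for Arabic letters, AN for Arabic-Indic digits, and RLO or FSI for the explicit controls.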
A worked example
Consider the string "abc אבג 123": three Latin letters, a space, three Hebrew letters (aleph, bet, gimel), a space, and the three digits 1, 2, 3. Stored in logical order, it is eleven characters: a b c SP א ב ג SP 1 2 3. The base direction is LTR (English paragraph).
Logical order: a b c SP א ב ג SP 1 2 3
Type: L L L WS R R R WS EN EN EN
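Step 1, resolve weak types:
The digits are EN. The nearest preceding strong character is Hebrew (R), not L, so rule W7 leaves them as EN.
Step 2, resolve neutrals:
The space between L and R has opposite directions on its two sides, so it takes the base direction (L).
The space between R and EN resolves to R, because numbers count as R when neutrals are resolved.
Step 3, resolve implicit levels:
L characters and the first space stay at the paragraph level, 0.
R characters and the second space get odd embedding level 1.
EN characters get even embedding level 2.
Levels: 0 0 0 0 1 1 1 1 2 2 2
Step 4, reorder by level (reverse every run at level 2 or higher, then every run at level 1 or higher):
Visual order (left to right):
a b c SP 1 2 3 SP ג ב א
→ → ← (arrows show reading direction within each run)
The Hebrew run reverses: aleph-bet-gimel in logical order becomes gimel-bet-aleph in visual order, because Hebrew is read right-to-left. The digits are reversed twice, once on their own at level 2 and once together with the Hebrew at level 1, so they keep their internal order and still spell 123 when read left-to-right. Their position does change: because they follow the Hebrew in logical order, they continue the right-to-left run and land to the left of it. A reader scanning the Hebrew from right to left reaches the number immediately after the last Hebrew letter.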
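Now flip the paragraph direction. With the same logical string but a base direction of RTL (an Arabic or Hebrew paragraph), the Latin run becomes the embedded one: the Latin letters and the digits rise to level 2, while the Hebrew letters and both spaces sit at level 1.
Visual order (RTL base, listed left to right as it appears on screen):
1 2 3 SP ג ב א SP a b c
Read from the right edge, where an RTL line begins, the runs come in logical order: abc, then אבג, then 123.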
Same logical string, same byte sequence; different paragraph direction, different visual result. This is why dir="auto" exists — if you do not know the language of a user-supplied string in advance, let the algorithm pick the direction from the first strong character.
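The final reordering step can be sketched on its own. The function below is a minimal illustration of rule L2 from UAX #9 (reverse every contiguous run at or above each level, from the highest level down to the lowest odd level). It does not compute the levels; the two level lists are worked out by hand from the implicit-level rules for this particular string, and the name `reorder_visual` is an assumption for the example.

```python
def reorder_visual(chars, levels):
    """Reorder logical-order chars into visual order (UAX #9 rule L2).

    levels holds one resolved embedding level per character.
    The result lists code points from the left edge of the line to the right.
    """
    chars = list(chars)
    levels = list(levels)
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return "".join(chars)  # nothing is right-to-left, nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            # Find the end of the contiguous run at this level or higher.
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            levels[i:j] = levels[i:j][::-1]
            i = j
    return "".join(chars)

text = "abc \u05d0\u05d1\u05d2 123"                  # logical order
ltr_levels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]      # base direction LTR
rtl_levels = [2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2]      # base direction RTL

for levels in (ltr_levels, rtl_levels):
    visual = reorder_visual(text, levels)
    print(" ".join(f"U+{ord(ch):04X}" for ch in visual))
```

The output is printed as code point numbers rather than as text, because a terminal that itself applies the Bidi Algorithm would reorder the already-reordered string a second time.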
Numbers in RTL contexts
European numbers in an Arabic paragraph behave like a small left-to-right island inside a right-to-left flow. The number "2024" inside "السنة 2024" displays in the natural order 2-0-2-4 even though the surrounding text reads right-to-left. The Bidi Algorithm assigns the digits embedding level 2 (one level deeper than the surrounding Arabic level 1) so that they form an LTR run inside an RTL run. Arabic-Indic digits (U+0660 through U+0669) are a separate Bidi type, AN, with different rules for the separators and terminators around them, but they are laid out the same way: most significant digit on the left.
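One detail behind this behaviour is rule W2 of UAX #9: European digits whose nearest preceding strong character is an Arabic letter (type AL) are reclassified as AN before levels are assigned. A rough sketch of that single rule follows. It ignores isolates, embeddings, and the start-of-run type that the full rule considers, and the function name `resolve_digits_w2` is an assumption for illustration.

```python
import unicodedata

def resolve_digits_w2(text):
    """Return bidi types after a simplified rule W2 (EN becomes AN after AL)."""
    resolved = []
    last_strong = None
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("L", "R", "AL"):
            last_strong = bidi_class
        elif bidi_class == "EN" and last_strong == "AL":
            bidi_class = "AN"
        resolved.append(bidi_class)
    return resolved

arabic = "\u0627\u0644\u0633\u0646\u0629 2024"   # the Arabic example above
hebrew = "\u05e9\u05e0\u05d4 2024"               # Hebrew letters are R, not AL
print(resolve_digits_w2(arabic))
print(resolve_digits_w2(hebrew))
```

After Hebrew letters the digits stay EN; after Arabic letters they become AN. Either way they end up one level above the surrounding right-to-left text in an RTL paragraph.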
Embedding, override, and isolation
Beyond the implicit algorithm, Unicode provides format control characters that influence direction explicitly. There are three families:
- Embedding (discouraged since Unicode 6.3, in favour of isolates)
  - U+202A LRE, U+202B RLE, U+202C PDF. Open a run with the chosen direction; close it with PDF. Isolates are preferred because an embedded run still interacts with its surroundings: it acts like a strong character of its own direction and can change how adjacent neutrals and numbers are ordered.
- Override (use with care)
  - U+202D LRO, U+202E RLO, U+202C PDF. Force every character in the run to be treated as L or R regardless of its actual type, including strong characters of the opposite direction.
- Isolate (preferred)
  - U+2066 LRI, U+2067 RLI, U+2068 FSI (first-strong isolate), U+2069 PDI. Treat the contents as a single neutral atom from the perspective of the surrounding text, and pick a base direction for the isolated content. The HTML <bdi> element has the same effect as wrapping text in FSI and PDI.
For interactive HTML — username displays, search results, message lists — the right tool is <bdi> or CSS unicode-bidi: isolate. They prevent a malicious or unexpected piece of RTL content from reordering the user-interface chrome around it.
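As a rough illustration of both approaches, the helpers below wrap an untrusted string in FSI and PDI for plain-text output and in <bdi> for HTML output. The function names and the greeting template are assumptions made for the example. The plain-text version does not remove controls already present in the input, so a stray PDI inside the string could still close the isolate early.

```python
import html

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate_text(untrusted):
    """Wrap a string so it is laid out as one isolated unit in plain text."""
    return f"{FSI}{untrusted}{PDI}"

def isolate_html(untrusted):
    """Escape a string for HTML and wrap it in <bdi>."""
    return f"<bdi>{html.escape(untrusted)}</bdi>"

username = "\u05d3\u05e0\u05d9 123"  # hypothetical user-supplied name
print(repr(f"Welcome, {isolate_text(username)}! You have 3 new messages."))
print(isolate_html("<b>" + username + "</b>"))
```

Without the isolate, the trailing digits of the name and the "3" in the message could be pulled into the same right-to-left run and appear in an unexpected place.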
The U+202E filename attack
The override controls have been used since at least 2009 to disguise filenames. The RIGHT-TO-LEFT OVERRIDE at U+202E forces everything after it to render right-to-left, regardless of script. Consider a file whose actual name in logical order is:
evil [U+202E] gpj.exe
In a file manager that honours Bidi but does not strip U+202E, the displayed name is:
evilexe.jpg
The user sees what looks like a JPEG named evilexe.jpg. The operating system works from the logical-order code points, in which the name ends with .exe, so it dispatches the file as an executable. (With the invisible U+202E left out, the stored name reads evilgpj.exe.) Microsoft Windows began stripping U+202E from filenames in Explorer in 2018; major email clients and chat applications strip it from attachment names; modern source code editors highlight it when present.
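The mismatch between what is shown and what is stored can be reproduced in a few lines. The `naive_display` helper below is an assumption for illustration, not a Bidi implementation: it only handles a single U+202E with no closing PDF in an otherwise left-to-right ASCII name, which is exactly the shape of this attack.

```python
import os.path

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

def naive_display(name):
    """Approximate how an LTR renderer shows a name containing one RLO."""
    head, _, tail = name.partition(RLO)
    # Everything after the override is laid out right to left.
    return head + tail[::-1]

def stored_extension(name):
    """Extension of the name as stored, in logical order."""
    return os.path.splitext(name)[1]

name = "evil" + RLO + "gpj.exe"
print(naive_display(name))      # what the user sees
print(stored_extension(name))   # what the system acts on
print(RLO in name)              # a simple check that flags the name
```

A check as simple as `RLO in name` is enough to flag this file before it is shown to a user.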
The related "Trojan Source" technique (Boucher and Anderson, 2021, CVE-2021-42574) uses bidirectional control characters inside source code comments and string literals, hiding malicious instructions that look benign on screen but compile differently. Many compilers, linters, and code review tools now warn on bidirectional control characters in source; GCC's -Wbidi-chars option and rustc's text_direction_codepoint_in_literal lint are two examples.
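A check of this kind is simple to write. The sketch below scans source text and reports the line, column, and name of each embedding, override, or isolate control it finds. The names `BIDI_CONTROLS` and `find_bidi_controls` are assumptions for the example, and a production linter would typically also decide whether a control is balanced or sits inside a comment or literal.

```python
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source):
    """Return (line, column, name) for every bidi control in source.

    Lines and columns are 1-based; columns count code points.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, BIDI_CONTROLS[ch]))
    return findings

snippet = 'role = "user"\nif role != "admin\u202e \u2066# only admins\u2069":\n    pass\n'
for line_no, col, name in find_bidi_controls(snippet):
    print(f"line {line_no}, column {col}: {name}")
```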
The Bidi controls themselves are useful. The vulnerability is in the renderer's failure to strip or escape them when the content is untrusted (filenames, usernames, source code). When rendering user input, strip or escape the embedding and override controls (U+202A–U+202E), along with any unbalanced isolate controls (U+2066–U+2069), or wrap the content in isolation (<bdi>).
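Both options, stripping and escaping, fit in a short helper. This is a minimal sketch: the function names are assumptions, and for simplicity it removes every isolate control rather than only the unbalanced ones, which is the stricter choice for filenames and usernames.

```python
import re

# Embedding and override controls (U+202A to U+202E) plus isolates (U+2066 to U+2069).
_BIDI_CONTROL_RE = re.compile("[\u202a-\u202e\u2066-\u2069]")

def strip_bidi_controls(text):
    """Remove bidi formatting controls from untrusted text."""
    return _BIDI_CONTROL_RE.sub("", text)

def escape_bidi_controls(text):
    """Replace each bidi control with a visible \\uXXXX escape."""
    return _BIDI_CONTROL_RE.sub(lambda m: f"\\u{ord(m.group()):04x}", text)

name = "evil\u202egpj.exe"
print(strip_bidi_controls(name))
print(escape_bidi_controls(name))
```

Escaping is usually the better choice in logs and review tools, where the reader should see that a control was present; stripping suits display names, where the control has no legitimate purpose.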
What to remember
Logical order is what is stored; visual order is what is rendered. The Bidi Algorithm bridges them. For ordinary documents the algorithm runs automatically and produces the expected results. For interfaces that display untrusted text, use dir="auto", isolate user content in <bdi>, and treat U+202A–U+202E as control characters to be stripped or escaped. The character inspector will reveal any Bidi formatting characters hidden in a string.
Further reading
- Character inspector — paste a string and find any hidden Bidi controls.
- The Private Use Areas — another category of codepoints that look benign and behave specifically.
- General Punctuation block — where the format controls U+200C through U+206F live.
- Unicode normalization explained — another place where visible identity does not equal byte identity.
- Codepoint, character, glyph, grapheme — direction is a property of codepoints, but legibility is a property of glyphs.
- What is Unicode? — the standard inside which UAX #9 lives.