The Arabic block — U+0600 through U+06FF, 256 codepoints — encodes the second-most-used writing system in the world. Arabic is a right-to-left (RTL) script in which most letters connect to their neighbours, taking up to four contextual forms — isolated, initial, medial, and final — depending on what sits beside them. Unicode chose to encode only the abstract letter (one codepoint per letter, regardless of position) and let the renderer or font select the correct shape. That single design decision is what makes modern OpenType-rendered Arabic possible.

About this block

The block was added in Unicode 1.0 (1991) and covers not only Arabic itself but also the additional letters needed by the major Arabic-script languages: Persian (Farsi), Urdu, Pashto, Sindhi, Kurdish (Sorani), Uyghur, Kazakh (in its Arabic-script orthography, which predates the Cyrillic one and is still used by Kazakh speakers in China), and Malay written as Jawi. The 28 standard Arabic letters lie within U+0621–U+064A: ا (alif), ب (ba), ت (ta), ث (tha), ج (jim), ح (ha), خ (kha), د (dal), ذ (dhal), ر (ra), ز (zay), س (sin), ش (shin), ص (sad), ض (dad), ط (ta), ظ (za), ع (ayn), غ (ghayn), ف (fa), ق (qaf), ك (kaf), ل (lam), م (mim), ن (nun), ه (ha), و (waw), ي (ya). Hamza ء U+0621 is the glottal-stop sign, often carried on a "seat" letter such as أ, إ, ؤ, or ئ; Unicode provides both the bare hamza and the precomposed combinations. Ta marbuta ة U+0629 is the feminine ending. The precomposed forms sit at U+0622–U+0626: آ U+0622 (alif with madda, for /ʔaː/), أ U+0623 and إ U+0625 (alif with hamza above and below), ؤ U+0624 (waw with hamza), and ئ U+0626 (ya with hamza).
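The relationship between the precomposed hamza letters and their parts can be inspected with Python's standard unicodedata module. The following is a minimal sketch, not a definitive tool; the helper name describe_hamza_forms is made up for illustration.

```python
import unicodedata

def describe_hamza_forms():
    """Return (codepoint, name, canonical decomposition) for U+0622..U+0626."""
    rows = []
    for cp in range(0x0622, 0x0627):
        ch = chr(cp)
        # NFD splits a precomposed letter into base letter + combining mark.
        decomposed = unicodedata.normalize("NFD", ch)
        parts = " ".join(f"U+{ord(c):04X}" for c in decomposed)
        rows.append((f"U+{cp:04X}", unicodedata.name(ch), parts))
    return rows

for row in describe_hamza_forms():
    print(*row, sep="  ")
```

Each row pairs a precomposed letter with the base letter and combining mark it is canonically equivalent to, which is why NFC and NFD normalization can convert between the two spellings.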

Vowel marks — called tashkeel or harakat — sit at U+064B through U+0652 and are combining marks that decorate the consonants: fatha (a) U+064E, damma (u) U+064F, kasra (i) U+0650, shadda (consonant gemination) U+0651, sukun (absence of vowel) U+0652, plus the three tanwin marks for grammatical case endings. Day-to-day Arabic writing omits these marks; they appear in the Qur'an, in children's textbooks, in poetry, and wherever pronunciation must be unambiguous. The combining nature of tashkeel means a single visible glyph may correspond to two or three codepoints — a classic example of the codepoint-versus-grapheme distinction discussed in Codepoint, character, glyph, grapheme.
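The codepoint-versus-grapheme gap is easy to see in code. The sketch below uses only Python's unicodedata module; the helper names split_base_and_marks and strip_tashkeel are hypothetical, and the stripping function deliberately handles only the U+064B to U+0652 range named above.

```python
import unicodedata

# BEH + SHADDA + FATHA: one consonant carrying two marks (a doubled "b" plus "a").
cluster = "\u0628\u0651\u064E"

def split_base_and_marks(text):
    """Separate base characters from combining marks (combining class != 0)."""
    base = [c for c in text if unicodedata.combining(c) == 0]
    marks = [c for c in text if unicodedata.combining(c) != 0]
    return base, marks

def strip_tashkeel(text):
    """Drop the harakat in U+064B..U+0652, leaving the consonant skeleton."""
    return "".join(c for c in text if not ("\u064B" <= c <= "\u0652"))

base, marks = split_base_and_marks(cluster)
print(len(cluster), "codepoints:", len(base), "base letter,", len(marks), "marks")
```

Stripping the marks like this is a common preprocessing step for search, since everyday text omits them, but it is lossy and should not be applied to text where vocalisation matters.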

Languages other than Arabic add their own letters within this same block. Persian, Urdu, and Pashto introduce پ U+067E (pe), چ U+0686 (che), ژ U+0698 (zhe), گ U+06AF (gaf), and use distinctive forms of kaf and ya: ک U+06A9 (the Persian keheh, with a different tail than Arabic ك) and ی U+06CC (Farsi ya, written without dots in isolated and final position). Urdu adds ٹ U+0679 (tteh, the retroflex t), ڈ U+0688 (ddal, retroflex d), ڑ U+0691 (rreh, retroflex r), ں U+06BA (noon ghunna, the nasalised n), and ھ U+06BE (heh doachashmee, the "two-eyed" h used for aspirated digraphs like بھ). Pashto adds ټ, ډ, ړ, ږ, ښ; Sindhi adds ڄ, ٺ, ٿ, ڀ; Uyghur uses additional vowel letters. Two parallel sets of digits live in the block: the Arabic-Indic digits ٠١٢٣٤٥٦٧٨٩ at U+0660–U+0669, used in the Mashriq, and the Extended Arabic-Indic digits ۰۱۲۳۴۵۶۷۸۹ at U+06F0–U+06F9, used in Persian and Urdu.
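Because both digit sets are ordinary Unicode decimal digits, most string-to-number code handles them without special cases. The sketch below shows this in Python; the constants and the helpers to_ascii_digits and to_extended_digits are illustrative names, not part of any library.

```python
import unicodedata

ARABIC_INDIC = "".join(chr(cp) for cp in range(0x0660, 0x066A))  # Mashriq digits
EXTENDED = "".join(chr(cp) for cp in range(0x06F0, 0x06FA))      # Persian/Urdu digits

def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(ch if value is None else str(value))
    return "".join(out)

def to_extended_digits(text):
    """Map ASCII digits to Extended Arabic-Indic digits (U+06F0..U+06F9)."""
    return text.translate(str.maketrans("0123456789", EXTENDED))

# int() accepts any Unicode decimal digit, so both sets parse directly.
print(int("\u0664\u0662"), int("\u06F4\u06F2"))
```

The two sets look alike but are distinct codepoints, so a string comparison between a Persian and an Arabic rendering of the same number fails unless both are folded to one set first.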

Right-to-left rendering is the other half of the story. Arabic has its own punctuation codepoints: the comma ، U+060C, semicolon ؛ U+061B, and question mark ؟ U+061F are reversed or inverted relative to their Latin counterparts, while the full stop is normally the ordinary U+002E (Urdu uses ۔ U+06D4). The Unicode Bidirectional Algorithm (UAX #9) determines how runs of LTR and RTL text are reordered for display when a paragraph mixes them, the kind of typographic complexity that makes a sentence like "Open the page at /path/to/file" surprisingly nontrivial when "Open" is Arabic. The dedicated guide on bidirectional text and RTL walks through the algorithm in detail. Cursive joining itself is described by the Joining_Type property: most Arabic letters are Dual_Joining (they connect on both sides), while six of the 28 basic letters, د (dal), ذ (dhal), ر (ra), ز (zay), و (waw), and ا (alif), are Right_Joining (they connect only on their right side, to the preceding letter, so the next letter cannot connect to them and starts a new visual run). Hamza is Non_Joining; the harakat are Transparent (they sit on top of joined letters without breaking the join).
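The joining rules can be turned into a small form-selection routine. The sketch below is a simplification under stated assumptions: Python's unicodedata does not expose Joining_Type, so the table is hand-written for U+0620 to U+064A only (the authoritative data is ArabicShaping.txt in the Unicode Character Database), and the function names are hypothetical. It is not how a real shaping engine is structured.

```python
import unicodedata

# Hand-written subset of Joining_Type (an assumption of this sketch).
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)

def joining_type(ch):
    if unicodedata.category(ch) == "Mn":
        return "T"                      # harakat: transparent
    if ch == "\u0640":
        return "C"                      # tatweel: join-causing
    if ch in RIGHT_JOINING:
        return "R"
    if ch != "\u0621" and "\u0620" <= ch <= "\u064A":
        return "D"
    return "U"                          # hamza and anything outside the table

def contextual_forms(word):
    """Return (letter, form) pairs, skipping transparent marks."""
    base = [(c, joining_type(c)) for c in word if joining_type(c) != "T"]
    forms = []
    for i, (ch, jt) in enumerate(base):
        prev_jt = base[i - 1][1] if i > 0 else "U"
        next_jt = base[i + 1][1] if i + 1 < len(base) else "U"
        # Joining to the preceding letter happens on this letter's right side.
        joins_prev = jt in "DRC" and prev_jt in "DC"
        joins_next = jt in "DC" and next_jt in "DRC"
        form = {(True, True): "medial", (True, False): "final",
                (False, True): "initial", (False, False): "isolated"}
        forms.append((ch, form[(joins_prev, joins_next)]))
    return forms

# kaf, teh, alef, beh ("kitab"): alef is Right_Joining, so beh stands alone.
print([form for _, form in contextual_forms("\u0643\u062A\u0627\u0628")])
```

For the sample word the sketch yields initial, medial, final, isolated: the alif joins the preceding letter but breaks the chain, so the last letter starts a new run. Adding harakat between the letters does not change the result, since Transparent marks are skipped.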

Two large supplementary blocks accompany this one. Arabic Presentation Forms-A (U+FB50–U+FDFF) and Arabic Presentation Forms-B (U+FE70–U+FEFF) hold several hundred pre-shaped contextual variants and ligatures: the isolated, final, initial, and medial forms of letters, each encoded as a separate codepoint. These exist purely for compatibility with legacy systems that could not shape text contextually at render time. Modern rendering pipelines on every major operating system shape Arabic correctly from the base block alone, and the presentation-form blocks should not be used for new text.
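Text that arrives in presentation forms, for example from an old PDF extraction, can usually be folded back to the base block through compatibility normalization. The following sketch applies NFKC only to characters in the two presentation-form ranges; the function name fold_presentation_forms is hypothetical, and restricting NFKC to these ranges is a design choice of this example so that unrelated characters are left alone.

```python
import unicodedata

def fold_presentation_forms(text):
    """Map Arabic presentation-form codepoints back to base-block letters."""
    out = []
    for ch in text:
        # Forms-A, and Forms-B up to U+FEFC (U+FEFF is the byte order mark).
        if "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC":
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)

# Pre-shaped "kitab": kaf initial, teh medial, alef final, beh isolated.
legacy = "\uFEDB\uFE98\uFE8E\uFE8F"
print([f"U+{ord(c):04X}" for c in fold_presentation_forms(legacy)])
```

A ligature codepoint such as U+FEFB (lam with alef, isolated form) expands to two base letters, so the folded string can be longer than the input.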