UTF-8 encoder
Paste any string. See each codepoint with its UTF-8, UTF-16, and UTF-32 byte sequences in hex.
| # | Glyph | Codepoint | UTF-8 (hex) | UTF-16 (hex) | UTF-32 (hex) |
|---|---|---|---|---|---|
The tool walks the input string using a for…of loop, which iterates by codepoint rather than by UTF-16 code unit. For each codepoint it produces three encodings.
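The difference between codepoint iteration and code-unit indexing can be sketched as follows. This is a minimal illustration, not the tool's actual source; the helper name `listCodepoints` is hypothetical.

```javascript
// Hypothetical helper (not from the tool): label each codepoint as "U+XXXX".
function listCodepoints(input) {
  const labels = [];
  // for...of uses the string iterator, which yields one whole codepoint
  // per step, so a surrogate pair arrives as a single two-unit string.
  for (const ch of input) {
    const cp = ch.codePointAt(0);
    labels.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
  }
  return labels;
}

// "A", "é" (U+00E9), and U+1F600 written with escapes to avoid ambiguity.
const sample = "A\u00E9\u{1F600}";
console.log(listCodepoints(sample)); // [ 'U+0041', 'U+00E9', 'U+1F600' ]
console.log(sample.length);          // 4, because length counts UTF-16 code units
```

Indexing with `sample[2]` would instead return only the high surrogate of the emoji, which is why the loop form matters here.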
The UTF-8 byte sequence is produced by new TextEncoder().encode(...). The Web Platform's TextEncoder only emits UTF-8, by design — that decision was made when WHATWG specified the API, because UTF-8 is the only encoding the web needs to write. The number of bytes depends on the codepoint: one byte for U+0000 through U+007F (ASCII), two bytes for U+0080 through U+07FF, three bytes for U+0800 through U+FFFF (excluding the surrogate range, which UTF-8 cannot encode), and four bytes for U+10000 through U+10FFFF (the supplementary planes).
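As a rough sketch of the ranges above, the following compares a predicted byte count against what `TextEncoder` actually emits. The helper names `utf8Length` and `utf8Hex` are assumptions made for illustration; only `TextEncoder` comes from the platform.

```javascript
// Hypothetical helper: predicted UTF-8 byte count for a single codepoint.
function utf8Length(cp) {
  if (cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    // Surrogates and out-of-range values have no UTF-8 encoding.
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7f) return 1;    // ASCII
  if (cp <= 0x7ff) return 2;
  if (cp <= 0xffff) return 3;  // rest of the BMP
  return 4;                    // supplementary planes
}

// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

console.log(utf8Hex("\u00E9"));    // C3 A9
console.log(utf8Hex("\u{1F600}")); // F0 9F 98 80
console.log(utf8Length(0x1f600));  // 4
```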
The UTF-16 code units are computed by hand from the codepoint value rather than read back from the string with charCodeAt. For codepoints in the BMP, there is a single 16-bit code unit equal to the codepoint. For codepoints at or above U+10000, the codepoint is split into a surrogate pair using the algorithm: subtract 0x10000 to get a 20-bit value, take the top ten bits and add 0xD800 for the high surrogate, take the bottom ten bits and add 0xDC00 for the low surrogate. The result is two 16-bit code units that, decoded by any conformant UTF-16 reader, recover the original codepoint.
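The surrogate-pair arithmetic can be sketched like this. It is one possible implementation under the steps described above, with hypothetical helper names `utf16Units` and `utf16Hex`; the tool's own code may differ.

```javascript
// Hypothetical helper: UTF-16 code units for one codepoint, as numbers.
function utf16Units(cp) {
  if (cp < 0x10000) return [cp];          // BMP: the unit equals the codepoint
  const offset = cp - 0x10000;            // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top ten bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom ten bits
  return [high, low];
}

// Hypothetical helper: format the units as four-digit hex.
function utf16Hex(cp) {
  return utf16Units(cp)
    .map((u) => u.toString(16).toUpperCase().padStart(4, "0"))
    .join(" ");
}

console.log(utf16Hex(0x00e9));  // 00E9
console.log(utf16Hex(0x1f600)); // D83D DE00

// Round trip: the computed units rebuild the original character.
console.log(String.fromCharCode(...utf16Units(0x1f600)) === "\u{1F600}"); // true
```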
UTF-32 is the simplest: it's just the codepoint as a 32-bit value. The display pads to eight hex digits to make the fixed-width nature obvious. UTF-32 is almost never used for storage or transmission — it wastes three bytes for every ASCII character — but it appears in some internal string representations (Python 3.3+, for example, uses a flexible representation that includes UTF-32 when needed).
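Since the UTF-32 value is the codepoint itself, the display step reduces to zero-padded hex formatting. A minimal sketch, with `utf32Hex` as a hypothetical name:

```javascript
// Hypothetical helper: a codepoint shown as one 32-bit UTF-32 unit,
// padded to eight hex digits.
function utf32Hex(cp) {
  return cp.toString(16).toUpperCase().padStart(8, "0");
}

console.log(utf32Hex("A".codePointAt(0)));         // 00000041
console.log(utf32Hex("\u{1F600}".codePointAt(0))); // 0001F600
```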
Consider the string "Aé😀". It contains three codepoints:
- U+0041 (A): UTF-8 41 (one byte), UTF-16 0041 (one unit), UTF-32 00000041.
- U+00E9 (é): UTF-8 C3 A9 (two bytes), UTF-16 00E9 (one unit), UTF-32 000000E9.
- U+1F600 (😀): UTF-8 F0 9F 98 80 (four bytes), UTF-16 D83D DE00 (two units, a surrogate pair), UTF-32 0001F600.

Total UTF-8 length: seven bytes. Total UTF-16 length: four code units (eight bytes). Total UTF-32 length: three units (twelve bytes). The string contains three Unicode codepoints, but "Aé😀".length in JavaScript returns four, because length counts UTF-16 code units. This mismatch is a common source of bugs when handling user input that may contain emoji.
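The totals for this example can be checked with a few lines of JavaScript. This is an illustrative sketch only; `encodedSizes` is a hypothetical helper, not part of the tool.

```javascript
// Hypothetical helper: size of a string under each encoding form.
function encodedSizes(input) {
  // Array.from uses the string iterator, so it splits by codepoint.
  const codepoints = Array.from(input).length;
  return {
    codepoints,
    utf8Bytes: new TextEncoder().encode(input).length,
    utf16Units: input.length,      // length counts UTF-16 code units
    utf16Bytes: input.length * 2,
    utf32Bytes: codepoints * 4,
  };
}

console.log(encodedSizes("A\u00E9\u{1F600}"));
// { codepoints: 3, utf8Bytes: 7, utf16Units: 4, utf16Bytes: 8, utf32Bytes: 12 }
```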
If the input contains a lone surrogate (an unpaired 0xD800–0xDFFF code unit, possible in JavaScript strings because they're sequences of 16-bit values without enforced well-formedness), TextEncoder replaces it with U+FFFD REPLACEMENT CHARACTER (UTF-8 EF BF BD). Combining marks like U+0301 COMBINING ACUTE ACCENT are listed as separate codepoints — they don't merge with the preceding base character in this view. Use the character inspector if you want grapheme-cluster counts. The for…of loop only addresses codepoint iteration, not grapheme segmentation.
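Both edge cases can be reproduced in a short sketch. The helper names `utf8BytesHex` and `countCodepoints` are assumptions for illustration; the replacement behavior itself comes from `TextEncoder`.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
function utf8BytesHex(input) {
  return Array.from(new TextEncoder().encode(input), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

// Hypothetical helper: number of codepoints (not grapheme clusters).
function countCodepoints(input) {
  return Array.from(input).length;
}

// A lone high surrogate is encoded as U+FFFD REPLACEMENT CHARACTER.
console.log(utf8BytesHex("\uD800"));   // EF BF BD
console.log(utf8BytesHex("a\uD800b")); // 61 EF BF BD 62

// "e" followed by U+0301 renders as one glyph but is two codepoints.
console.log(countCodepoints("e\u0301")); // 2
console.log(utf8BytesHex("e\u0301"));    // 65 CC 81
```

Counting what a reader would perceive as a single character needs grapheme segmentation (for example via `Intl.Segmenter`), which this sketch deliberately leaves out.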