Introduction
In the following tables, a Unicode code point is written in 24-bit (3-byte) notation. For example, the character U+0041 has code point 41 in hexadecimal, which corresponds to the bits 00000000 00000000 01000001.
Each UTF encoding may be prefixed with identification bytes, called the Byte Order Mark (BOM).
UTF-16 and UTF-32 byte sequences can only be interpreted once the endianness is known, which is either Big Endian (BE) or Little Endian (LE).
The endianness code LE or BE is appended to the UTF name.
This gives UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE.
For more information, see the Wikipedia UTF article.
UTF-8
Unicode code point (000zzzzz YYYYYyyy Xxxxxxxx) | Byte1 | Byte2 | Byte3 | Byte4 |
---|---|---|---|---|
U+0000-U+007F (zzzzz YYYYYyyy X = 0) | 0xxxxxxx | | | |
U+0080-U+07FF (zzzzz YYYYY = 0) | 110yyyXx | 10xxxxxx | | |
U+0800-U+FFFF (zzzzz = 0) | 1110YYYY | 10YyyyXx | 10xxxxxx | |
U+010000-U+10FFFF (zzzzz <> 0) | 11110zzz | 10zzYYYY | 10YyyyXx | 10xxxxxx |
The bytes of the UTF-8 byte order mark are: EF BB BF
In UTF-8 (unlike UTF-16 and UTF-32), the byte order is always the same (endianness does not apply).
Examples:
- U+0041 (00000000 00000000 01000001) is encoded: 41
- U+00E9 (00000000 00000000 11101001) is encoded: C3 A9
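The bit layout in the table can be sketched in a few lines of Python. This is only an illustration, not a replacement for a library encoder: the function name `utf8_encode` is made up for this example, and the rejection of surrogate code points (U+D800-U+DFFF) is an addition that the table does not spell out.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 (hypothetical helper)."""
    # Assumption beyond the table: surrogates and out-of-range values are rejected.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyXx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110YYYY 10YyyyXx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzYYYY 10YyyyXx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x41).hex(" ").upper())  # 41
print(utf8_encode(0xE9).hex(" ").upper())  # C3 A9
```

The two printed values match the examples above. In practice, Python's built-in `chr(cp).encode("utf-8")` gives the same bytes.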
UTF-16 (with endianness)
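The table below shows the result in big endian. In little endian, the two bytes of each 16-bit unit are swapped, so the final result would be: xxxxxxxx YYYYyyyy, or aaYYYYyy 110110aa xxxxxxxx 110111yy. The four bits aaaa are equal to zzzzz minus 1: subtracting 0x10000 from the code point leaves the 20 bits aaaa YYYYyyyy xxxxxxxx.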
Unicode code point (000zzzzz YYYYyyyy xxxxxxxx) | Byte1 | Byte2 | Byte3 | Byte4 |
---|---|---|---|---|
U+0000-U+D7FF and U+E000-U+FFFF (zzzzz = 00000 and YYYYy <> 11011) | YYYYyyyy | xxxxxxxx | | |
U+010000-U+10FFFF (zzzzz <> 00000) | 110110aa | aaYYYYyy | 110111yy | xxxxxxxx |
The bytes of the UTF-16 byte order mark are:
- Big endian: FE FF
- Little endian: FF FE
Examples:
- U+0041 (00000000 00000000 01000001) is encoded in little endian: 41 00
- U+0041 (00000000 00000000 01000001) is encoded in big endian: 00 41
- U+00E9 (00000000 00000000 11101001) is encoded in little endian: E9 00
- U+00E9 (00000000 00000000 11101001) is encoded in big endian: 00 E9
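As a rough sketch of the UTF-16 layout, the following Python function builds the 16-bit units first and applies the byte order last. The name `utf16_encode` and its `little_endian` parameter are hypothetical. For code points above U+FFFF, 0x10000 is subtracted first, which leaves 20 bits that are split into two 10-bit halves for the surrogate pair.

```python
def utf16_encode(cp: int, little_endian: bool = False) -> bytes:
    """Encode one Unicode code point to UTF-16 without a BOM (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0xFFFF:
        # One 16-bit unit holding the code point itself.
        units = [cp]
    else:
        # Two units: 110110 + upper 10 bits, then 110111 + lower 10 bits.
        v = cp - 0x10000
        units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
    order = "little" if little_endian else "big"
    # Endianness swaps the bytes inside each unit, not the order of the units.
    return b"".join(u.to_bytes(2, order) for u in units)


print(utf16_encode(0x41, little_endian=True).hex(" ").upper())  # 41 00
print(utf16_encode(0x41).hex(" ").upper())                      # 00 41
print(utf16_encode(0x1F600).hex(" ").upper())                   # D8 3D DE 00
```

Python's own codecs `utf-16-be` and `utf-16-le` produce the same bytes and are the better choice outside of a demonstration.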
UTF-32 (with endianness)
UTF-32 stores code points "as-is" in 4 bytes, and the most significant byte is always 00.
Endianness applies to the whole 4-byte unit: little endian reverses all four bytes.
Examples:
- U+10FF88 is stored as 00 10 FF 88 in big endian
- U+10FF88 is stored as 88 FF 10 00 in little endian
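A short Python check of the UTF-32 byte order can help here. The helper name `utf32_encode` is invented for this sketch; it only writes the code point as a 4-byte integer in the requested order, so little endian is the exact reverse of big endian.

```python
def utf32_encode(cp: int, little_endian: bool = False) -> bytes:
    """Encode one Unicode code point to UTF-32 without a BOM (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    # The code point is stored unchanged as a 32-bit integer.
    return cp.to_bytes(4, "little" if little_endian else "big")


print(utf32_encode(0x10FF88).hex(" ").upper())                      # 00 10 FF 88
print(utf32_encode(0x10FF88, little_endian=True).hex(" ").upper())  # 88 FF 10 00
```

The built-in codecs `utf-32-be` and `utf-32-le` return the same byte sequences.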
The bytes of the UTF-32 byte order mark are:
- Big endian: 00 00 FE FF
- Little endian: FF FE 00 00
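The byte order marks listed in the sections above can be used to guess the encoding of a byte stream. The sketch below is one possible approach, with the hypothetical names `BOMS` and `sniff_bom`. The UTF-32 marks are tested before the UTF-16 ones because FF FE is also the start of FF FE 00 00. That choice is a heuristic: the same four bytes could be UTF-16LE text beginning with U+0000.

```python
from typing import Optional, Tuple

# Longest marks first, so that FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (bytes.fromhex("0000FEFF"), "UTF-32BE"),
    (bytes.fromhex("FFFE0000"), "UTF-32LE"),
    (bytes.fromhex("EFBBBF"), "UTF-8"),
    (bytes.fromhex("FEFF"), "UTF-16BE"),
    (bytes.fromhex("FFFE"), "UTF-16LE"),
]


def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(bytes.fromhex("EFBBBF41")))  # ('UTF-8', 3)
print(sniff_bom(bytes.fromhex("FFFE4100")))  # ('UTF-16LE', 2)
print(sniff_bom(b"A"))                       # (None, 0)
```

A missing BOM says nothing about the encoding, so a caller would still need a default or another source of information in that case.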
Fast conversion between UTF encodings
See the algorithms in the SDN article "Unicode: Technical FAQs", under the question "If I still have to convert between encodings, what is the advantage of Unicode?"
(and especially the UTF-8 to UTF-16BE visual example).
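The referenced article is not reproduced here. As a generic illustration only, and not the algorithm from that article, a conversion between two UTF encodings can go through the code points: decode the source bytes, then encode them again. The helper name `transcode` is hypothetical, and the codec names are those of Python's standard library.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one encoding to another via code points (hypothetical helper)."""
    text = data.decode(src)   # source bytes -> Unicode code points
    return text.encode(dst)   # code points -> destination bytes


utf8_bytes = bytes.fromhex("C3A9")  # U+00E9 in UTF-8
print(transcode(utf8_bytes, "utf-8", "utf-16-be").hex(" ").upper())  # 00 E9
print(transcode(utf8_bytes, "utf-8", "utf-32-le").hex(" ").upper())  # E9 00 00 00
```

Because every UTF encoding represents the same set of code points, this round trip loses no information for valid input.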