Introduction

In the following tables, a Unicode code point is written in 24-bit (3-byte) notation. For example, the character U+0041 corresponds to code point 41 in hexadecimal, which corresponds to the bits 00000000 00000000 01000001.
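
As a quick illustration of this notation, the following Python sketch prints a code point as three groups of 8 bits. The helper name `bits24` is made up for this example.

```python
def bits24(code_point: int) -> str:
    """Return a code point as 24 bits, grouped in 3 bytes."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    bits = format(code_point, "024b")  # zero-padded to 24 binary digits
    return " ".join(bits[i:i + 8] for i in range(0, 24, 8))


print(bits24(0x0041))  # 00000000 00000000 01000001
print(bits24(0x00E9))  # 00000000 00000000 11101001
```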

Data in any UTF encoding may start with identification bytes, called the Byte Order Mark (BOM).

UTF-16 and UTF-32 are only meaningful when the endianness is specified, which can be Big Endian (BE) or Little Endian (LE).
The endianness code LE or BE is appended to the UTF-** name,
giving UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE.
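
For illustration, Python's standard codecs follow the same naming, spelled `utf-16-le`, `utf-16-be`, `utf-32-le` and `utf-32-be`. The sketch below assumes CPython 3.8 or later (for `bytes.hex` with a separator). The un-suffixed `utf-16` codec writes a BOM and uses the platform's byte order, so no fixed output is shown for it.

```python
text = "A"  # U+0041

print(text.encode("utf-16-le").hex(" "))  # 41 00
print(text.encode("utf-16-be").hex(" "))  # 00 41
print(text.encode("utf-32-le").hex(" "))  # 41 00 00 00
print(text.encode("utf-32-be").hex(" "))  # 00 00 00 41

# Without a suffix, the codec writes a BOM first and picks the native
# byte order, so the result depends on the machine.
with_bom = text.encode("utf-16")
print(with_bom.hex(" "))
```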

For more information, see the Wikipedia UTF article.

UTF-8

Unicode code point (000zzzzz YYYYYyyy Xxxxxxxx) | Byte1    | Byte2    | Byte3    | Byte4
------------------------------------------------|----------|----------|----------|---------
U+0000-U+007F (zzzzz YYYYYyyy X = 0)            | 0xxxxxxx |          |          |
U+0080-U+07FF (zzzzz YYYYY = 0)                 | 110yyyXx | 10xxxxxx |          |
U+0800-U+FFFF (zzzzz = 0)                       | 1110YYYY | 10YyyyXx | 10xxxxxx |
U+010000-U+10FFFF (zzzzz <> 0)                  | 11110zzz | 10zzYYYY | 10YyyyXx | 10xxxxxx

The bytes of the UTF-8 byte order mark are: EF BB BF

In UTF-8 (unlike UTF-16 and UTF-32), the order of bytes is always the same (endianness does not apply).

Examples:

  • U+0041 (00000000 00000000 01000001) is encoded: 41
  • U+00E9 (00000000 00000000 11101001) is encoded: C3 A9
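
The bit layout in the table above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is made up for this example, and the rejection of the surrogate range U+D800-U+DFFF is an addition that the table does not show.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8, one branch per row of the table."""
    # Assumption: surrogates (U+D800-U+DFFF) are rejected, as UTF-8 requires.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110yyyXx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110YYYY 10YyyyXx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110zzz 10zzYYYY 10YyyyXx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


print(utf8_encode(0x0041).hex(" "))  # 41
print(utf8_encode(0x00E9).hex(" "))  # c3 a9
```

The results can be cross-checked against Python's built-in codec, for example `chr(0x00E9).encode("utf-8")`.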

UTF-16 (with endianness)

The table below shows the result in big endian. In little endian, the two bytes of each 16-bit unit are swapped, so the final result would be: xxxxxxxx YYYYyyyy, or aaYYYYyy 110110aa xxxxxxxx 110111yy.

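Unicode code point (000zzzzz YYYYyyyy xxxxxxxx)                    | Byte1    | Byte2    | Byte3    | Byte4
-------------------------------------------------------------------|----------|----------|----------|---------
U+0000-U+D7FF and U+E000-U+FFFF (zzzzz = 00000 and YYYYy <> 11011) | YYYYyyyy | xxxxxxxx |          |
U+010000-U+10FFFF (zzzzz <> 00000), with aaaa = zzzzz - 1          | 110110aa | aaYYYYyy | 110111yy | xxxxxxxx

Notes:

  • The range U+D800-U+DFFF (YYYYy = 11011) is reserved for surrogates and does not encode characters on its own.
  • zzzzz has binary value 00001-10000, so aaaa (= zzzzz - 1) has value 0000-1111.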

The bytes of the UTF-16 byte order mark are:

  • Big endian: FE FF
  • Little endian: FF FE

Examples:

  • U+0041 (00000000 00000000 01000001) is encoded in little endian: 41 00
  • U+0041 (00000000 00000000 01000001) is encoded in big endian: 00 41
  • U+00E9 (00000000 00000000 11101001) is encoded in little endian: E9 00
  • U+00E9 (00000000 00000000 11101001) is encoded in big endian: 00 E9
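
The UTF-16 layout, including the surrogate pair used above U+FFFF, can be sketched in Python as follows. The function name `utf16_encode` and its `little_endian` parameter are made up for this example. Subtracting 0x10000 from the code point is the same operation as the "aaaa = zzzzz - 1" step in the table.

```python
def utf16_encode(cp: int, little_endian: bool = False) -> bytes:
    """Encode one code point to UTF-16 (no BOM) in the requested byte order."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0xFFFF:
        units = [cp]                      # one 16-bit unit: YYYYyyyy xxxxxxxx
    else:
        v = cp - 0x10000                  # 20 bits: aaaa YYYYyy | yy xxxxxxxx
        units = [0xD800 | (v >> 10),      # 110110aa aaYYYYyy (high surrogate)
                 0xDC00 | (v & 0x3FF)]    # 110111yy xxxxxxxx (low surrogate)
    order = "little" if little_endian else "big"
    # Endianness swaps the bytes inside each 16-bit unit, not the unit order.
    return b"".join(u.to_bytes(2, order) for u in units)


print(utf16_encode(0x0041, little_endian=True).hex(" "))   # 41 00
print(utf16_encode(0x0041).hex(" "))                       # 00 41
print(utf16_encode(0x1F600).hex(" "))                      # d8 3d de 00
print(utf16_encode(0x1F600, little_endian=True).hex(" "))  # 3d d8 00 de
```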

UTF-32 (with endianness)

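UTF-32 stores each code point "as-is" in 4 bytes. Since a code point needs at most 21 bits, the most significant byte is always 00.

Endianness applies to the whole 4-byte unit: in little endian, all four bytes are stored in reverse order.

Examples:

  • U+10FF88 is stored as 00 10 FF 88 in big endian
  • U+10FF88 is stored as 88 FF 10 00 in little endian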
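
As a minimal Python sketch, a UTF-32 encoder only needs to write the code point as a 4-byte integer in the requested byte order, so in little endian all four bytes are reversed. The function name `utf32_encode` is made up for this example, and the surrogate check is an addition. The results agree with Python's built-in `utf-32-be` and `utf-32-le` codecs.

```python
def utf32_encode(cp: int, little_endian: bool = False) -> bytes:
    """Encode one code point to UTF-32 (no BOM) in the requested byte order."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    # The code point is stored as one 32-bit integer.
    return cp.to_bytes(4, "little" if little_endian else "big")


print(utf32_encode(0x10FF88).hex(" "))                      # 00 10 ff 88
print(utf32_encode(0x10FF88, little_endian=True).hex(" "))  # 88 ff 10 00
```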

The bytes of the UTF-32 byte order mark are:

  • Big endian: 00 00 FE FF
  • Little endian: FF FE 00 00
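
The byte order marks listed on this page can be used to guess the encoding of a byte stream. The Python sketch below uses the BOM constants from the standard `codecs` module; the function name `detect_bom` is made up for this example. The UTF-32 BOMs are tested first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). This means that UTF-16LE data whose first character after the BOM is U+0000 would be misread as UTF-32LE, so treat the result as a hint only.

```python
import codecs

# Longer BOMs first, so that FF FE 00 00 is not mistaken for FF FE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


data = bytes.fromhex("fe ff 00 41")
name, size = detect_bom(data)
print(name, data[size:].decode(name))  # utf-16-be A
```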

Fast conversion between UTF encodings

See the algorithms in the SDN article "Unicode: Technical FAQs", under the question "If I still have to convert between encodings, what is the advantage of Unicode?"
(in particular the UTF-8 to UTF-16BE visual example).
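
For comparison, a conversion can also go through the decoded code points. The Python sketch below is not the algorithm from the SDN article: it simply decodes UTF-8 and re-encodes as UTF-16BE with the built-in codecs. The function name `utf8_to_utf16be` is made up for this example.

```python
def utf8_to_utf16be(data: bytes) -> bytes:
    """Convert UTF-8 bytes to UTF-16BE bytes (no BOM) via decoded text."""
    return data.decode("utf-8").encode("utf-16-be")


print(utf8_to_utf16be(bytes.fromhex("41")).hex(" "))     # 00 41
print(utf8_to_utf16be(bytes.fromhex("c3 a9")).hex(" "))  # 00 e9
```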
