Character encoding

  • ASCII characters (in the range 0-127) are encoded as a single byte.
  • Greek, Hebrew, Arabic and most accented European characters are encoded as two bytes.
  • Other characters (up to U+FFFF) are encoded as three bytes.

The individual characters are encoded according to the following rules.

Single byte encoding

Characters in the range U+0000 to U+007F are encoded as a single byte.

UTF-8 Single Byte Encoding

| Byte   | Leading bits | Character bits |
|--------|--------------|----------------|
| byte 0 | 0            | bits 0-6       |

Two byte encoding

Characters in the range U+0080 to U+07FF are encoded as two bytes.

UTF-8 Two Byte Encoding

| Byte   | Leading bits | Character bits |
|--------|--------------|----------------|
| byte 0 | 110          | bits 6-10      |
| byte 1 | 10           | bits 0-5       |

Three byte encoding

Characters in the range U+0800 to U+FFFF are encoded as three bytes.

UTF-8 Three Byte Encoding

| Byte   | Leading bits | Character bits |
|--------|--------------|----------------|
| byte 0 | 1110         | bits 12-15     |
| byte 1 | 10           | bits 6-11      |
| byte 2 | 10           | bits 0-5       |
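
The three layouts above can be sketched in code. The following is a minimal illustration, not a reference implementation: the function name `encode_code_point` is hypothetical, and the sketch assumes the input is a single code point no larger than U+FFFF. It does not treat any values inside that range specially, since the rules above do not mention any exceptions.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point (U+0000 to U+FFFF) using the byte layouts above.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError(f"code point out of range for this sketch: {cp:#x}")

    if cp <= 0x7F:
        # Single byte: leading 0, then bits 0-6.
        return bytes([cp])

    if cp <= 0x7FF:
        # Two bytes: 110 + bits 6-10, then 10 + bits 0-5.
        return bytes([
            0b11000000 | (cp >> 6),
            0b10000000 | (cp & 0x3F),
        ])

    # Three bytes: 1110 + bits 12-15, then 10 + bits 6-11, then 10 + bits 0-5.
    return bytes([
        0b11100000 | (cp >> 12),
        0b10000000 | ((cp >> 6) & 0x3F),
        0b10000000 | (cp & 0x3F),
    ])


def as_bits(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in data)


print(as_bits(encode_code_point(0x0053)))  # 01010011
print(as_bits(encode_code_point(0x00AE)))  # 11000010 10101110
```

For code points in this range, the result should agree with Python's built-in `chr(cp).encode("utf-8")`, which is a convenient cross-check.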

Notes on encoding rules

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

UTF-8 Encoding Rules

| Bits     | Byte value | Role                                          |
|----------|------------|-----------------------------------------------|
| 0??????? | 0-127      | Single byte encoding of a character           |
| 10?????? | 128-191    | Continuation of a multi-byte encoding         |
| 110????? | 192-223    | First byte of a two byte character encoding   |
| 1110???? | 224-239    | First byte of a three byte character encoding |
| 1111???? | 240-255    | Invalid                                       |
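
The role of a byte can be read off its leading bits with a few mask comparisons. The sketch below is illustrative only: the function name `byte_role` and the short role labels it returns are assumptions made for this example, not part of the encoding rules themselves.

```python
def byte_role(b: int) -> str:
    """Classify a byte value (0-255) by its leading bits.

    Hypothetical helper; the returned labels are illustrative names.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError(f"not a byte value: {b}")

    if (b & 0b10000000) == 0b00000000:
        return "single"        # 0???????  (0-127)
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10??????  (128-191)
    if (b & 0b11100000) == 0b11000000:
        return "lead-2"        # 110?????  (192-223)
    if (b & 0b11110000) == 0b11100000:
        return "lead-3"        # 1110????  (224-239)
    return "invalid"           # 1111????  (240-255)


print([byte_role(b) for b in (0x53, 0xC2, 0xAE)])
# ['single', 'lead-2', 'continuation']
```

Each mask keeps only the bits up to and including the first zero bit, which is why the tests must be made from the shortest prefix to the longest.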

Example encoding

UTF-8 Encoding Example

| Character | Unicode | Bytes                      |
|-----------|---------|----------------------------|
| S         | U+0053  | 01010011                   |
| C         | U+0043  | 01000011                   |
| T         | U+0054  | 01010100                   |
| ®         | U+00AE  | 11000010 10101110          |
| ③         | U+2462  | 11100010 10010001 10100010 |
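
Decoding reverses the process: the first byte determines how many continuation bytes follow, and the character bits are reassembled from most significant to least significant. The sketch below checks this against the first four characters of the example (S, C, T and U+00AE). The name `decode_bytes` is hypothetical, and the sketch makes simplifying assumptions: it accepts only the one, two and three byte forms described here and does not verify that each character uses the shortest possible form.

```python
from typing import List


def decode_bytes(data: bytes) -> List[int]:
    """Decode bytes into code points using the one/two/three byte layouts.

    Hypothetical helper; does not reject non-shortest-form sequences.
    """
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:            # 0???????: single byte
            cp, extra = first, 0
        elif 0xC0 <= first <= 0xDF:  # 110?????: one continuation byte follows
            cp, extra = first & 0x1F, 1
        elif 0xE0 <= first <= 0xEF:  # 1110????: two continuation bytes follow
            cp, extra = first & 0x0F, 2
        else:                        # stray continuation byte or 1111????
            raise ValueError(f"unexpected byte {first:#04x} at offset {i}")

        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation after offset {i}")

        for b in tail:
            cp = (cp << 6) | (b & 0x3F)  # append six character bits per byte

        code_points.append(cp)
        i += 1 + extra
    return code_points


bits = "01010011 01000011 01010100 11000010 10101110"
data = bytes(int(group, 2) for group in bits.split())
print([f"{cp:04X}" for cp in decode_bytes(data)])
# ['0053', '0043', '0054', '00AE']
```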