
Unicode is preferred to ASCII because it permits the inclusion of accents, scientific symbols and characters used in languages other than English. The UTF-8 format is a standard encoding that provides the most efficient means of encoding 16-bit Unicode characters in cases where the majority of characters are in the ASCII range, because each ASCII character occupies a single byte. Both UTF-8 and the alternative UTF-16 encoding are supported by all widely used operating systems and major applications (and have been for more than 15 years).
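
As a rough illustration of the size argument, the following Python sketch compares the encoded length of a mostly ASCII string in UTF-8 and UTF-16. The sample text is an assumption chosen for illustration and is not taken from any release file.

```python
# Hypothetical sample text: mostly ASCII, with two accented characters
# written as escapes so the code points are unambiguous.
sample = "M\u00e9ni\u00e8re's disease"

utf8_bytes = sample.encode("utf-8")
# "utf-16-le" is used so that no byte order mark is added to the count.
utf16_bytes = sample.encode("utf-16-le")

print(len(sample), len(utf8_bytes), len(utf16_bytes))  # 17 19 34
```

Each ASCII character costs one byte in UTF-8 and two in UTF-16, so the saving grows with the share of ASCII characters in the text.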

SNOMED CT uses the UTF-8 representation of characters in terms and other text fields.

Character encoding

...

Single byte encoding

Characters in the range 'U+0000' to 'U+007F' are encoded as a single byte.

...

byte 0:  0 | bits 0-6

Two byte encoding

Characters in the range 'U+0080' to 'U+07FF' are encoded as two bytes.

...

Table 20. Two byte encoding

byte 0:  1 | 1 | 0 | bits 6-10
byte 1:  1 | 0 | bits 0-5
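
The two byte layout can be sketched with plain bit operations. This is a minimal illustration, not part of the specification; the function name `encode_two_bytes` is hypothetical.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point in the range U+0080 to U+07FF as two UTF-8 bytes."""
    if not 0x0080 <= code_point <= 0x07FF:
        raise ValueError("code point outside the two byte range")
    byte0 = 0b11000000 | (code_point >> 6)         # prefix 110, then bits 6-10
    byte1 = 0b10000000 | (code_point & 0b111111)   # prefix 10, then bits 0-5
    return bytes([byte0, byte1])


# U+00AE (registered sign) as a worked case.
encoded = encode_two_bytes(0x00AE)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000010 10101110
```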

Three byte encoding

Characters in the range 'U+0800' to 'U+FFFF' are encoded as three bytes:

...

byte 0:  1 | 1 | 1 | 0 | bits 12-15
byte 1:  1 | 0 | bits 6-11
byte 2:  1 | 0 | bits 0-5
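
Taken together, the single, two and three byte layouts in this section amount to a small encoder for 16-bit code points. The sketch below is one possible way to write it in Python; the name `encode_utf8_16bit` is hypothetical, and the sketch does not reject surrogate code points ('U+D800' to 'U+DFFF'), which Python's built-in encoder refuses.

```python
def encode_utf8_16bit(code_point: int) -> bytes:
    """Encode a code point in U+0000 to U+FFFF using the layouts above."""
    if not 0x0000 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit code points are handled in this sketch")
    if code_point <= 0x007F:
        # Single byte: 0, then bits 0-6.
        return bytes([code_point])
    if code_point <= 0x07FF:
        # Two bytes: 110 + bits 6-10, then 10 + bits 0-5.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # Three bytes: 1110 + bits 12-15, 10 + bits 6-11, 10 + bits 0-5.
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(encode_utf8_16bit(0x2462).hex(" "))  # e2 91 a2
```

For code points outside the surrogate range the result can be compared against `chr(code_point).encode("utf-8")`.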

...

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

Table 22. UTF-8 Encoding Rules

Bits       Byte value   Role
0???????   000-127      Single byte encoding of a character
10??????   128-191      Continuation of a multi-byte encoding
110?????   192-223      First byte of a two byte character encoding
1110????   224-239      First byte of a three byte character encoding
1111????   240-255      Not used for 16-bit characters (240-247 begin a four byte encoding of characters above 'U+FFFF'; 248-255 are invalid in UTF-8)
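
The role of a byte can be read off its leading bits, as the table describes. The following Python sketch classifies a byte value by range; the function name `byte_role` and the wording of the returned labels are assumptions made for illustration.

```python
def byte_role(value: int) -> str:
    """Return the role of one byte value (0-255) in a UTF-8 byte sequence."""
    if not 0 <= value <= 255:
        raise ValueError("a byte value must be in the range 0-255")
    if value < 0x80:   # 0???????  (000-127)
        return "single byte encoding"
    if value < 0xC0:   # 10??????  (128-191)
        return "continuation byte"
    if value < 0xE0:   # 110?????  (192-223)
        return "first byte of a two byte encoding"
    if value < 0xF0:   # 1110????  (224-239)
        return "first byte of a three byte encoding"
    # 1111????  (240-255)
    return "outside the 16-bit encodings covered here"


for b in "\u00ae".encode("utf-8"):
    print(f"{b:08b}", byte_role(b))
```

A reader scanning a byte stream can use this kind of check to find character boundaries, since continuation bytes never start a character.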

Example encoding

...

Character   S          C          T          ®                   ③
Unicode     0053       0043       0054       00AE                2462
Bytes       01010011   01000011   01010100   11000010 10101110   11100010 10010001 10100010
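
The byte patterns for the code points listed in the Unicode row can be cross-checked with Python's built-in UTF-8 encoder. This is a small verification sketch; the helper name `utf8_bits` is hypothetical.

```python
def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 bytes of a code point as space-separated bit strings."""
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{b:08b}" for b in encoded)


# Code points taken from the Unicode row of the example.
for cp in (0x0053, 0x0043, 0x0054, 0x00AE, 0x2462):
    print(f"{cp:04X}  {utf8_bits(cp)}")

# Prints:
# 0053  01010011
# 0043  01000011
# 0054  01010100
# 00AE  11000010 10101110
# 2462  11100010 10010001 10100010
```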