UTF-8

UTF-8 is an efficient encoding of Unicode character strings that recognizes the fact that the majority of text-based communications are in ASCII. It therefore optimizes the encoding of these characters.

Unicode is preferred to ASCII because it permits the inclusion of accents, scientific symbols and characters used in languages other than English. The UTF-8 format is a standard encoding that provides the most efficient means of encoding 16-bit Unicode characters in cases where the majority of characters are in the ASCII range. Both UTF-8 and the alternative UTF-16 encoding are supported by all widely used operating systems and major applications (and have been for more than 15 years).
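The efficiency claim can be checked by comparing encoded sizes. The following is a minimal sketch using Python's built-in codecs; the function name `encoded_sizes` and the sample strings are illustrative assumptions, not part of any SNOMED CT tooling.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-be" is used so that no byte order mark is added to the count.
    utf16_len = len(text.encode("utf-16-be"))
    return utf8_len, utf16_len


if __name__ == "__main__":
    samples = {
        "ASCII only": "SNOMED CT",
        "one accented letter": "caf\u00e9",
        "Greek letters": "\u03b1\u03b2\u03b3",
    }
    for label, text in samples.items():
        utf8_len, utf16_len = encoded_sizes(text)
        print(f"{label}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

For text that is mostly ASCII, the UTF-8 form is about half the size of the UTF-16 form. The advantage shrinks as the share of non-ASCII characters grows.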

ASCII characters are encoded as a single byte;
Greek, Hebrew, Arabic and most accented European characters are encoded as two bytes;
All other 16-bit Unicode characters are encoded as three bytes.

The individual characters are encoded according to the following rules.

Characters in the range U+0000 to U+007F are encoded as a single byte.

| Byte | Leading bits | Character bits |
|---|---|---|
| byte 0 | 0 | bits 0-6 |

Characters in the range U+0080 to U+07FF are encoded as two bytes.

| Byte | Leading bits | Character bits |
|---|---|---|
| byte 0 | 110 | bits 6-10 |
| byte 1 | 10 | bits 0-5 |

Characters in the range U+0800 to U+FFFF are encoded as three bytes.

| Byte | Leading bits | Character bits |
|---|---|---|
| byte 0 | 1110 | bits 12-15 |
| byte 1 | 10 | bits 6-11 |
| byte 2 | 10 | bits 0-5 |
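The three rules above can be sketched as a small encoder. This is an illustrative sketch only: the function name `encode_utf8_16bit` is an assumption, it handles a single 16-bit code point, and it rejects the surrogate range U+D800 to U+DFFF, which UTF-8 does not permit. In practice the built-in `str.encode("utf-8")` would normally be used instead.

```python
def encode_utf8_16bit(code_point):
    """Encode one code point in U+0000..U+FFFF using the one-, two- and three-byte rules."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers 16-bit code points")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # Single byte: leading 0, then bits 0-6.
        return bytes([code_point])

    if code_point <= 0x7FF:
        # Two bytes: 110 + bits 6-10, then 10 + bits 0-5.
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b00111111),
        ])

    # Three bytes: 1110 + bits 12-15, then 10 + bits 6-11, then 10 + bits 0-5.
    return bytes([
        0b11100000 | (code_point >> 12),
        0b10000000 | ((code_point >> 6) & 0b00111111),
        0b10000000 | (code_point & 0b00111111),
    ])


if __name__ == "__main__":
    for cp in (0x0053, 0x00AE, 0x2462):
        print(f"U+{cp:04X} -> {encode_utf8_16bit(cp).hex(' ')}")
```

Comparing the result with Python's own encoder for every valid 16-bit code point is a quick way to confirm that the bit layouts have been read correctly.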

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

| Bits | Byte value | Role |
|---|---|---|
| 0??????? | 000-127 | Single byte encoding of a character |
| 10?????? | 128-191 | Continuation of a multi-byte encoding |
| 110????? | 192-223 | First byte of a two byte character encoding |
| 1110???? | 224-239 | First byte of a three byte character encoding |
| 1111???? | 240-255 | Not used for 16-bit Unicode characters (240-244 start four-byte encodings of characters above U+FFFF; 245-255 are invalid in UTF-8) |
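The role of a byte can be read from its leading bits with a few mask comparisons. The sketch below is illustrative; the function name `byte_role` and the returned labels are assumptions chosen for this example.

```python
def byte_role(value):
    """Classify a byte value (0-255) by its leading bits."""
    if not 0 <= value <= 255:
        raise ValueError("a byte value must be in the range 0-255")
    if value & 0b10000000 == 0b00000000:
        return "single byte"            # 0???????
    if value & 0b11000000 == 0b10000000:
        return "continuation"           # 10??????
    if value & 0b11100000 == 0b11000000:
        return "first of two bytes"     # 110?????
    if value & 0b11110000 == 0b11100000:
        return "first of three bytes"   # 1110????
    # 1111????: outside the one-to-three-byte scheme described here.
    return "outside one-to-three-byte scheme"


if __name__ == "__main__":
    # The registered sign (U+00AE) is encoded as a lead byte plus one continuation byte.
    for b in "\u00ae".encode("utf-8"):
        print(f"{b:08b} {b:3d} {byte_role(b)}")
```

Because continuation bytes are always recognisable, a reader that starts in the middle of a stream can skip forward to the next byte that is not a continuation and resume decoding from there.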

UTF-8 is an IETF Internet Standard (it was initially adopted by the IETF in 1996 and revised to restrict some code values in 1998 and 2003). In 2008 UTF-8 became the most widely used form of encoding in web pages.

SNOMED CT uses the UTF-8 representation of characters in terms and other text fields.

Note that SNOMED CT does not use, or require use of, the Byte Order Mark (BOM) specified by the Unicode standard because all SNOMED CT release files use UTF-8.
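Some editors and tools add a UTF-8 byte order mark when saving a file, so software that reads such files may want to tolerate one. The sketch below shows one way to detect and discard a leading BOM in Python. The function names are hypothetical, and this defensive handling is an assumption for illustration, not a SNOMED CT requirement.

```python
import codecs


def starts_with_utf8_bom(data):
    """Return True if the raw bytes begin with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data):
    """Decode UTF-8 bytes, dropping a leading byte order mark if one is present."""
    # The "utf-8-sig" codec decodes UTF-8 and skips an initial BOM when it exists.
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    plain = "SCT\u00ae".encode("utf-8")
    marked = codecs.BOM_UTF8 + plain
    print(starts_with_utf8_bom(plain), starts_with_utf8_bom(marked))
    print(decode_text(plain) == decode_text(marked))
```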

| Character | Unicode | Bytes |
|---|---|---|
| S | 0053 | 01010011 |
| C | 0043 | 01000011 |
| T | 0054 | 01010100 |
| ® | 00AE | 11000010 10101110 |
| ③ | 2462 | 11100010 10010001 10100010 |
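The bit patterns for any character can be reproduced with Python's built-in UTF-8 encoder. This is a minimal sketch; the helper name `utf8_bits` is an assumption for illustration.

```python
def utf8_bits(text):
    """Return the UTF-8 encoding of text as a list of 8-bit binary strings."""
    return [format(byte, "08b") for byte in text.encode("utf-8")]


if __name__ == "__main__":
    # Code points taken from the Unicode values listed in the example.
    for code_point in (0x0053, 0x0043, 0x0054, 0x00AE, 0x2462):
        print(f"U+{code_point:04X}", " ".join(utf8_bits(chr(code_point))))
```

The number of strings returned for a character shows directly whether it falls under the one-, two- or three-byte rule.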
