UTF-8 is an efficient encoding of Unicode character strings (the String data type) that recognizes the fact that the majority of text-based communications are in ASCII. It therefore optimizes the encoding of these characters.

Unicode is preferred to ASCII because it permits the inclusion of accents, scientific symbols and characters used in languages other than English. The UTF-8 format is a standard encoding that provides the most efficient means of encoding 16-bit Unicode characters in cases where the majority of characters are in the ASCII range. Both UTF-8 and the alternative UTF-16 encoding are supported by all widely used operating systems and major applications (and have been for more than 15 years).
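The efficiency claim for mostly-ASCII text can be illustrated with a minimal Python sketch. The sample string is a hypothetical term chosen for illustration and is not taken from any release file; `utf-16-le` is used so that no byte order mark is counted.

```python
# Compare the encoded size of a mostly-ASCII string in UTF-8 and UTF-16.
# The sample text is an assumption made for this illustration.
sample = "M\u00e9ni\u00e8re's disease (disorder)"

utf8_bytes = sample.encode("utf-8")
utf16_bytes = sample.encode("utf-16-le")  # little-endian, no byte order mark

# 28 characters: 26 ASCII (1 byte each) plus 2 accented letters (2 bytes each)
print(len(sample), len(utf8_bytes), len(utf16_bytes))  # 28 30 56
```

For text like this, UTF-8 needs little more than one byte per character, whereas UTF-16 always needs at least two.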

Individual characters are encoded according to the rules set out in the remainder of this section. In summary:

  • ASCII characters are encoded as a single byte;
  • Greek, Hebrew, Arabic and most accented European characters are encoded as two bytes;
  • All other 16-bit characters are encoded as three bytes.
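The size classes listed above can be checked with Python's built-in encoder. This is only a sketch; the example characters are assumptions picked to represent each class.

```python
# One or more example characters per size class (examples are illustrative only).
examples = {
    "A": 1,        # U+0041, ASCII
    "\u00e9": 2,   # U+00E9, accented Latin letter
    "\u03b1": 2,   # U+03B1, Greek alpha
    "\u05d0": 2,   # U+05D0, Hebrew alef
    "\u0627": 2,   # U+0627, Arabic alef
    "\u20ac": 3,   # U+20AC, euro sign
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, expected in examples.items():
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), expected {expected}")
```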

Characters in the range U+0000 to U+007F are encoded as a single byte.

| Byte | Leading bits | Code point bits |
| byte 0 | 0 | bits 0-6 |

Characters in the range U+0080 to U+07FF are encoded as two bytes.

| Byte | Leading bits | Code point bits |
| byte 0 | 110 | bits 6-10 |
| byte 1 | 10 | bits 0-5 |
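As a worked example of the two byte layout, the following sketch encodes U+00AE (the registered sign). The function name `encode_two_byte` is hypothetical and introduced only for this illustration.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range U+0080 to U+07FF as two bytes."""
    if not 0x0080 <= code_point <= 0x07FF:
        raise ValueError("code point is outside the two byte range")
    byte0 = 0b11000000 | (code_point >> 6)          # 110 + bits 6-10
    byte1 = 0b10000000 | (code_point & 0b00111111)  # 10 + bits 0-5
    return bytes([byte0, byte1])


encoded = encode_two_byte(0x00AE)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000010 10101110
```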

Characters in the range U+0800 to U+FFFF are encoded as three bytes:

| Byte | Leading bits | Code point bits |
| byte 0 | 1110 | bits 12-15 |
| byte 1 | 10 | bits 6-11 |
| byte 2 | 10 | bits 0-5 |
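The one, two and three byte layouts can be combined into a single encoder for 16-bit code points. The sketch below is an illustration under stated assumptions, not a replacement for a library encoder: the names `encode_bmp` and `encode_text` are hypothetical, characters above U+FFFF are deliberately not handled, and surrogate code points are rejected because they are not characters.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one 16-bit code point using the one, two or three byte layout."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit code points are handled in this sketch")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point <= 0x007F:
        return bytes([code_point])                       # 0 + bits 0-6
    if code_point <= 0x07FF:
        return bytes([
            0b11000000 | (code_point >> 6),              # 110 + bits 6-10
            0b10000000 | (code_point & 0x3F),            # 10 + bits 0-5
        ])
    return bytes([
        0b11100000 | (code_point >> 12),                 # 1110 + bits 12-15
        0b10000000 | ((code_point >> 6) & 0x3F),         # 10 + bits 6-11
        0b10000000 | (code_point & 0x3F),                # 10 + bits 0-5
    ])


def encode_text(text: str) -> bytes:
    """Encode a string one character at a time with encode_bmp."""
    return b"".join(encode_bmp(ord(ch)) for ch in text)


sample = "SCT\u00ae \u20ac"  # hypothetical sample text
print(encode_text(sample).hex(" "))
print(encode_text(sample) == sample.encode("utf-8"))
```

Comparing the result with Python's built-in `str.encode("utf-8")` is a simple way to confirm that the hand-written rules agree with the standard encoder.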

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

| Bits | Byte value | Role |
| 0??????? | 000-127 | Single byte encoding of a character |
| 10?????? | 128-191 | Continuation of a multi-byte encoding |
| 110????? | 192-223 | First byte of a two byte character encoding |
| 1110???? | 224-239 | First byte of a three byte character encoding |
| 1111???? | 240-255 | Not used for 16-bit characters (in the full UTF-8 standard, 240-244 begin a four byte encoding of characters above U+FFFF and 245-255 are invalid) |
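Because the role of a byte is visible from its leading bits, a byte stream can be inspected without decoding it fully. The sketch below classifies bytes by value range and counts characters by skipping continuation bytes. The names `byte_role` and `count_characters` and the role labels are hypothetical, chosen for this illustration.

```python
def byte_role(value: int) -> str:
    """Classify a byte value by its leading bits."""
    if not 0 <= value <= 255:
        raise ValueError("not a byte value")
    if value < 0b10000000:
        return "single"        # 0???????  (0-127)
    if value < 0b11000000:
        return "continuation"  # 10??????  (128-191)
    if value < 0b11100000:
        return "lead-2"        # 110?????  (192-223)
    if value < 0b11110000:
        return "lead-3"        # 1110????  (224-239)
    return "other"             # 1111????  (240-255), not used for 16-bit characters


def count_characters(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if byte_role(b) != "continuation")


data = "SCT\u00ae".encode("utf-8")
print([byte_role(b) for b in data])
print(count_characters(data))  # 4
```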

UTF-8 is an IETF Internet Standard. It was initially adopted by the IETF in 1996 and was revised in 1998 and 2003 to restrict some code values. In 2008 UTF-8 became the most widely used form of encoding in web pages.

SNOMED CT uses the UTF-8 representation of characters in terms and other text fields.

Note that SNOMED CT does not use, or require use of, the Byte Order Mark (BOM) specified by the Unicode standard because all SNOMED CT release files use UTF-8.
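A consumer of UTF-8 text fields might decode them strictly and, as a defensive measure, tolerate a stray BOM even though none is expected. The following is a minimal sketch under that assumption; the function name `decode_text_field` is hypothetical and is not part of any SNOMED CT tooling.

```python
import codecs


def decode_text_field(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is unexpectedly present."""
    # No BOM is expected; this check is purely defensive (an assumption of this sketch).
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError.
    return raw.decode("utf-8")


plain = "M\u00e9ni\u00e8re".encode("utf-8")
print(decode_text_field(plain))
print(decode_text_field(codecs.BOM_UTF8 + plain) == decode_text_field(plain))
```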


| Character | S | C | T | ® | |
| Unicode | 0053 | 0043 | 0054 | 00AE | 2462 |
| Bytes | 01010011 | 01000011 | 01010100 | 11000010 10101110 | 11101111 10111111 ...