UTF-8
<character> (UCS transformation format 8) An ASCII-compatible multibyte
Unicode and UCS encoding, used by Java and Plan 9.
The Unicode character set occupies a 16-bit code space. The most obvious Unicode
encoding (known as UCS-2) consists of a sequence of 16-bit words. Such strings
can contain bytes like '\0' or '/' which have a special meaning in filenames and
other C library function parameters. In addition, most Unix tools
expect ASCII files and can't read 16-bit words as characters without major
modifications. For these reasons, UCS-2 is not a suitable external encoding of
Unicode in filenames, text files, environment variables, etc.
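The byte-level problem can be illustrated with a minimal Python sketch. It
assumes that Python's "utf-16-be" codec stands in for big-endian UCS-2 (the two
produce the same bytes for characters below U+10000); the helper name
ucs2_bytes is hypothetical and used only for illustration.

```python
# Sketch: big-endian UCS-2 is approximated by Python's "utf-16-be" codec,
# which yields the same bytes for characters below U+10000 (an assumption
# made for this illustration).

def ucs2_bytes(text: str) -> bytes:
    """Encode text as a sequence of big-endian 16-bit words (hypothetical helper)."""
    return text.encode("utf-16-be")

# Every ASCII character gets a zero high byte, so NUL bytes appear throughout.
ascii_encoded = ucs2_bytes("ab")

# U+012F (LATIN SMALL LETTER I WITH OGONEK) has the low byte 0x2F, i.e. '/'.
ogonek_encoded = ucs2_bytes("\u012f")

print(ascii_encoded)   # contains b"\x00"
print(ogonek_encoded)  # contains b"/"
```

A C library function that treats '\0' as a string terminator, or a filesystem
that treats '/' as a path separator, would misinterpret both byte strings.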
The ISO 10646 Universal Character Set (UCS), a superset of Unicode, occupies a
31-bit code space and the obvious UCS-4 encoding for it (a sequence of 32-bit
words) has the same problems.
The UTF-8 encoding of Unicode and UCS avoids the problems of fixed-length
Unicode encodings because an ASCII file encoded in UTF-8 is exactly the same as
the original ASCII file, and every byte of every non-ASCII character is
guaranteed to have the most significant bit set (bit 0x80). This means that
normal tools for text searching etc. work as expected.
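These two properties can be checked with a short Python sketch. The helper
high_bit_set_everywhere and the sample strings are assumptions chosen for
illustration, not part of any specification.

```python
def high_bit_set_everywhere(data: bytes) -> bool:
    """Return True if every byte in data has bit 0x80 set (hypothetical helper)."""
    return all(byte & 0x80 for byte in data)

# Property 1: pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "plain ASCII text"
ascii_unchanged = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Property 2: every byte of a non-ASCII character has the high bit set.
non_ascii = "\u012f\u20ac"  # U+012F and U+20AC (euro sign)
non_ascii_encoded = non_ascii.encode("utf-8")

# Consequence: a byte-oriented search for '/' finds only the real separator.
path = "dir/\u012f.txt"
slash_count = path.encode("utf-8").count(b"/")

print(ascii_unchanged)
print(high_bit_set_everywhere(non_ascii_encoded))
print(slash_count)
```

Because no byte of a multibyte sequence falls in the ASCII range, a
byte-oriented tool never mistakes part of a non-ASCII character for '\0', '/'
or any other ASCII character.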
UTF-8 is defined in RFC 2279.
["File System Safe UCS Transformation Format (FSS_UTF)", X/Open Preliminary
Specification, X/Open Company Ltd., Document Number: P316. This information also
appears in ISO/IEC 10646, Annex P].
Plan 9 UTF manual entry.
(1998-07-29)