Unicode is the new international standard used in applications like Microsoft Office 97 or later and in operating systems like Windows NT 4.0, 2000 or XP. The Unicode Standard, Version 3.2.0, contains 95,156 printable characters.
For historical reasons, Unicode encoding forms are referred to as Unicode Transformation Formats.
UTF-16 is a 16-bit encoding form, the default transformation format of Unicode. Every character of the Basic Multilingual Plane is encoded in two bytes. What looks like hello in UTF-16-compliant applications (such as Word 97 or later) will be displayed as h e l l o in applications that interpret each byte as one character (e.g. Notepad for Windows 9x or SuperPad). The white spaces between the characters are null bytes that extend single-byte ASCII to 16-bit Unicode.
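The null-byte padding described above can be reproduced in a few lines. This is a minimal sketch, assuming little-endian byte order without a byte order mark; decoding as Latin-1 merely stands in for an editor that treats each byte as one character.

```python
# Encode "hello" as UTF-16, little-endian, without a byte order mark.
data = "hello".encode("utf-16-le")

# Each ASCII character becomes two bytes: the ASCII byte plus a null byte.
print(data)  # b'h\x00e\x00l\x00l\x00o\x00'

# Simulate a byte-per-character viewer: decode as Latin-1 and show
# the null bytes as spaces (an assumption about how such a viewer renders them).
naive_view = data.decode("latin-1").replace("\x00", " ")
print(repr(naive_view))  # 'h e l l o '
```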
Characters outside the Basic Multilingual Plane are encoded in four-byte sequences, called surrogate pairs.
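As an illustration of how a surrogate pair is formed, the sketch below splits a code point above U+FFFF into two 16-bit units. The function name `to_surrogate_pair` is hypothetical, and U+20000 is chosen here only as a sample character outside the Basic Multilingual Plane.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x20000)   # sample code point (assumption)
print(f"{high:04X} {low:04X}")           # D840 DC00

# Cross-check against Python's built-in UTF-16 codec: four bytes in total.
assert chr(0x20000).encode("utf-16-be") == bytes.fromhex("D840DC00")
```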
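UTF-16 has two variants (called encoding schemes) with respect to byte serialization: big-endian (UTF-16BE), which stores the more significant byte of each 16-bit unit first, and little-endian (UTF-16LE), which stores the less significant byte first.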
UTF-16 is very rarely used in HTML or e-mail.
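The two byte orders of UTF-16 can be compared with a short sketch. The sample character is an arbitrary choice, and the byte order mark shown at the end is a detail not covered in the text above; it is included only because Python's generic "utf-16" decoder relies on it.

```python
import codecs

ch = "中"  # U+4E2D, sample character (assumption)

big_endian = ch.encode("utf-16-be")
little_endian = ch.encode("utf-16-le")
print(big_endian.hex())     # 4e2d
print(little_endian.hex())  # 2d4e

# A byte order mark (U+FEFF) at the start of a file tells a reader
# which byte order follows.
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe

# The generic "utf-16" decoder reads the mark and picks the byte order.
assert (codecs.BOM_UTF16_BE + big_endian).decode("utf-16") == ch
assert (codecs.BOM_UTF16_LE + little_endian).decode("utf-16") == ch
```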
Test whether your word processor or text editor can read UTF-16 encoded plain text files. (Click here for a list of word processors and text editors supporting UTF-16 encoded plain text files.) Download utf16txt.zip and extract the big-endian utf16be.txt and the little-endian utf16le.txt. This is what you will see when you open them in English Word 97:
Highlight the squares and select a Chinese, Japanese or Korean font:
If you still see squares, select the same font again:
Now open the files in Notepad for Windows 9x or SuperPad and see what pairs of bytes encode the characters.
UTF-8 is the preferred UTF for the Web. ASCII characters are encoded in single bytes, European and Near-Eastern characters in 2-byte sequences, South and East Asian characters in 3-byte sequences. Click here for details.
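The byte counts above can be checked directly. This is a small sketch; the sample characters are assumptions chosen to represent each group, and they are written as escapes so the listing survives any editor.

```python
# (character, description) pairs; the samples are illustrative choices.
samples = [
    ("A", "ASCII"),
    ("\u00e9", "European (Latin small e with acute)"),
    ("\u05d0", "Near Eastern (Hebrew alef)"),
    ("\u0915", "South Asian (Devanagari ka)"),
    ("\u4e2d", "East Asian (CJK ideograph)"),
]

# Map each code point to the length of its UTF-8 byte sequence.
lengths = {}
for ch, group in samples:
    encoded = ch.encode("utf-8")
    lengths[ord(ch)] = len(encoded)
    print(f"U+{ord(ch):04X} {group}: {len(encoded)} byte(s), hex {encoded.hex()}")
```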
Test your browser by selecting UTF-8 in View : Character Coding, or View : Encoding. The text in the right column should match the GIF in the left column. Click here if it does not.
GIF | Text |
中日韓 |
UTF-7 is a mail-safe 7-bit transformation format. Click here to find out how it works.
If you are using Mozilla or Netscape then you can decode the text in the right column by selecting UTF-7 in View : Character Coding. Click here if you cannot.
GIF | Text |
+Ti1l5ZfT |
Internet Explorer can only decode UTF-7 encoded HTML files if their encoding is specified in the META tag. Try this page.
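The encoded sample +Ti1l5ZfT shown above can also be decoded by hand. The sketch below assumes the usual UTF-7 layout for a shifted run: a leading "+", then unpadded base64 of the big-endian UTF-16 code units, optionally closed by "-". The helper name `decode_utf7_run` is hypothetical, and it handles a single shifted run only, not mixed text.

```python
import base64


def decode_utf7_run(run: str) -> str:
    """Decode one shifted UTF-7 run such as '+Ti1l5ZfT' (illustrative helper)."""
    if not run.startswith("+"):
        raise ValueError("a shifted run must start with '+'")
    payload = run[1:]
    if payload.endswith("-"):
        payload = payload[:-1]            # drop the optional terminator
    # UTF-7 omits base64 padding, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    utf16_bytes = base64.b64decode(payload)
    return utf16_bytes.decode("utf-16-be")


text = decode_utf7_run("+Ti1l5ZfT")
print(text)  # 中日韓
```

Python also ships a "utf-7" codec, so `b"+Ti1l5ZfT-".decode("utf-7")` gives the same three characters without any manual steps.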
UCS-2 is a 2-byte encoding form of the Basic Multilingual Plane. It is upward compatible with UTF-16.
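The compatibility claim can be illustrated with a short sketch. Python has no built-in codec for this 2-byte form, so `encode_two_byte_be` is a hypothetical helper written for this example, assuming big-endian byte order: it writes exactly two bytes per character and rejects anything outside the Basic Multilingual Plane.

```python
def encode_two_byte_be(text: str) -> bytes:
    """Encode BMP-only text as fixed 2-byte big-endian units (illustrative helper)."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError(f"U+{code_point:04X} is outside the Basic Multilingual Plane")
        out += code_point.to_bytes(2, "big")
    return bytes(out)


bmp_text = "中日韓"
two_byte = encode_two_byte_be(bmp_text)
print(two_byte.hex())  # 4e2d65e597d3

# For BMP text the result is byte-for-byte identical to UTF-16 (big-endian).
assert two_byte == bmp_text.encode("utf-16-be")
```

A character such as U+20000 makes the helper raise `ValueError`, whereas UTF-16 encodes it as a surrogate pair.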
© Gyula Zsigri, 2000-2002 | Last updated: December 22, 2002