Unicode


Overview

Unicode is the international character encoding standard used in applications such as Microsoft Office 97 or later and in operating systems such as Windows NT 4.0, 2000 or XP.  The Unicode Standard, Version 3.2.0, contains 95,156 printable characters.


Unicode Transformation Formats

For historical reasons, Unicode encoding forms are referred to as Unicode Transformation Formats.


UTF-16

A 16-bit encoding form, the default transformation format of Unicode.  Every character of the Basic Multilingual Plane is encoded in two bytes.  Text that looks like hello in UTF-16-compliant applications (such as Word 97 or later) will be displayed as

h e l l o

in applications that interpret each byte as one character (e.g. Notepad for Windows 9x or SuperPad).  The white spaces between the characters are null bytes that extend single-byte ASCII to 16-bit Unicode.
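
The effect can be reproduced with a short Python sketch.  It assumes little-endian byte order without a Byte Order Mark, and it uses Latin-1 decoding as a stand-in for an editor that treats every byte as one character; both choices are illustrative assumptions, not something this page prescribes.

```python
# Encode "hello" as UTF-16, little-endian, without a Byte Order Mark.
data = "hello".encode("utf-16-le")
print(data)  # b'h\x00e\x00l\x00l\x00o\x00'

# A byte-per-character viewer sees ten characters: each ASCII letter
# is followed by a null byte (Latin-1 maps every byte to one character).
as_single_bytes = data.decode("latin-1")
print(len(as_single_bytes))  # 10

# Replacing the nulls with spaces gives roughly what such an editor shows.
print(as_single_bytes.replace("\x00", " "))
```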

Characters outside the Basic Multilingual Plane are encoded in four-byte sequences, called surrogate pairs.
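
How a surrogate pair is derived can be sketched as follows.  The arithmetic is the standard UTF-16 rule rather than something spelled out on this page, and the helper name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a high and a low surrogate.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


# U+20000 is the first code point of Plane 2.
high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))  # 0xd840 0xdc00

# Python's own codec produces the same four bytes in big-endian order.
print(chr(0x20000).encode("utf-16-be").hex())  # d840dc00
```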

UTF-16 has two variants (called encoding schemes) with respect to byte serialization:

  1. Big-Endian
    The high byte precedes the low byte.  Cha 'tea' (U+8336) is encoded with bytes 0x83 and 0x36.  UTF-16BE encoded plain text files should start with the Byte Order Mark U+FEFF, serialized as bytes 0xFE and 0xFF.  UTF-16BE is sometimes referred to as the byte-reversed version of UTF-16, even though it uses the original byte order.

  2. Little-Endian
    The low byte precedes the high byte.  Cha 'tea' (U+8336) is encoded with bytes 0x36 and 0x83.  UTF-16LE encoded plain text files should start with the Byte Order Mark U+FEFF, serialized as bytes 0xFF and 0xFE.  UTF-16LE has become more popular than UTF-16BE.
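
The two byte orders can be compared with a few lines of Python.  This is a minimal sketch: the function detect_utf16_byte_order is a hypothetical helper, and it only inspects the first two bytes of the data.

```python
cha = "\u8336"  # cha 'tea'

print(cha.encode("utf-16-be").hex())  # 8336
print(cha.encode("utf-16-le").hex())  # 3683

# The Byte Order Mark is the single character U+FEFF; only its
# serialization differs between the two schemes.
bom = "\ufeff"
print(bom.encode("utf-16-be").hex())  # feff
print(bom.encode("utf-16-le").hex())  # fffe


def detect_utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its first two bytes.

    Hypothetical helper for illustration only.
    """
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "unknown"


print(detect_utf16_byte_order(b"\xff\xfe\x36\x83"))  # little-endian
```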

UTF-16 is very rarely used in HTML or e-mail.

Test whether your word processor or text editor can read UTF-16 encoded plain text files.  (Click here for a list of word processors and text editors supporting UTF-16 encoded plain text files.)  Download utf16txt.zip and extract big-endian utf16be.txt and little-endian utf16le.txt.  This is what you will see when you open them in English Word 97:

word1.gif

Highlight the squares and select a Chinese, Japanese or Korean font:

word2.gif

If you still see squares, select the same font again:

egrcjk.gif

Now open the files in Notepad for Windows 9x or SuperPad and see what pairs of bytes encode the characters.



UTF-8

This is the preferred UTF for the Web.  ASCII characters are encoded in single bytes, European and Near-Eastern characters in 2-byte sequences, South and East Asian characters in 3-byte sequences.  Click here for details.
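
The byte counts can be checked with a short Python sketch.  The sample characters are arbitrary picks, one per group named above, and are assumptions made for illustration.

```python
# One arbitrary sample character per group (illustrative choices).
samples = [
    ("A", "ASCII"),
    ("\u00e9", "European (e with acute)"),
    ("\u05d0", "Near-Eastern (Hebrew alef)"),
    ("\u0915", "South Asian (Devanagari ka)"),
    ("\u4e2d", "East Asian (CJK ideograph)"),
]

for ch, group in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the group, the length and the bytes in hex.
    print(f"U+{ord(ch):04X}  {group}: {len(encoded)} byte(s), {encoded.hex()}")

lengths = [len(ch.encode("utf-8")) for ch, _ in samples]
print(lengths)  # [1, 2, 2, 3, 3]
```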

Test your browser by selecting UTF-8 in View : Character Coding, or View : Encoding.  The text in the right column should match the GIF in the left column.  Click here if it does not.

GIF    Text
CJK    中日韓

UTF-7

A mail-safe 7-bit transformation format.  Click here to find out how it works.

If you are using Mozilla or Netscape then you can decode the text in the right column by selecting UTF-7 in View : Character Coding.  Click here if you cannot.

GIF Text
CJK +Ti1l5ZfT
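
The encoded string above can be derived by hand: the text is serialized as big-endian UTF-16, the bytes are Base64-encoded without padding, and the result is prefixed with a plus sign.  The sketch below is a simplification under that assumption; the helper utf7_shift_sequence is hypothetical, and a complete UTF-7 encoder also passes most ASCII characters through unchanged and normally closes the shifted block with a hyphen.

```python
import base64


def utf7_shift_sequence(text):
    """Encode text as a single UTF-7 shifted block.

    Hypothetical helper: '+' followed by the unpadded Base64 form of
    the text's big-endian UTF-16 bytes. ASCII pass-through is not handled.
    """
    b64 = base64.b64encode(text.encode("utf-16-be")).decode("ascii")
    return "+" + b64.rstrip("=")


encoded = utf7_shift_sequence("\u4e2d\u65e5\u97d3")
print(encoded)  # +Ti1l5ZfT

# Round trip through Python's built-in UTF-7 codec, with the closing hyphen added.
decoded = (encoded + "-").encode("ascii").decode("utf-7")
print(decoded == "\u4e2d\u65e5\u97d3")  # True
```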

Internet Explorer can only decode UTF-7 encoded HTML files if their encoding is specified in the META tag.  Try this page.



What is UCS-2?

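A 2-byte encoding form of the Basic Multilingual Plane.  It is upward compatible with UTF-16: text encoded in UCS-2 can be read as UTF-16, but UCS-2 has no surrogate pairs and so cannot represent characters outside the Basic Multilingual Plane.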
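
The relationship can be illustrated with a small Python check.  The helper fits_in_ucs2 is hypothetical; it simply tests whether every character lies in the Basic Multilingual Plane, which is the condition under which the UTF-16 bytes need no surrogate pairs.

```python
def fits_in_ucs2(text):
    """Return True if every character is in the Basic Multilingual Plane.

    Hypothetical helper for illustration only.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "\u8336"        # cha 'tea', inside the BMP
non_bmp_char = "\U00020000"  # outside the BMP

print(fits_in_ucs2(bmp_char))      # True
print(fits_in_ucs2(non_bmp_char))  # False

# In UTF-16 the first takes two bytes, the second a four-byte surrogate pair.
print(len(bmp_char.encode("utf-16-be")))      # 2
print(len(non_bmp_char.encode("utf-16-be")))  # 4
```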




© Gyula Zsigri, 2000-2002.  Last updated: December 22, 2002