UTF-8


This page only illustrates how you can encode a Unicode character in UTF-8. Read RFC 2279 (since superseded by RFC 3629) for first-hand information.

  1. Look up the Unicode value (code point) of the character to find out how many bytes you need. The ranges below are in hexadecimal:

      0000-007F    1 byte
      0080-07FF    2 bytes
      0800-FFFF    3 bytes
     10000-10FFFF  4 bytes

  2. Convert the hex code to binary form and fill in the empty bits (the x positions) from right to left, padding any unused leading positions with zeros:

    1 byte 0xxxxxxx
    2 bytes 110xxxxx 10xxxxxx
    3 bytes 1110xxxx 10xxxxxx 10xxxxxx
    4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
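
The two steps above can be sketched in Python as follows. This is only an illustration, not a reference implementation: the function name utf8_encode is made up for this page, the upper bound 0x10FFFF is the highest code point Unicode defines, and the sketch does not reject the surrogate range D800-DFFF, which a strict encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand (illustrative sketch)."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    # Step 1: the value decides how many bytes are needed.
    # Step 2: shift and mask the value into the x positions of the template.
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point is above 0x10FFFF")


print(utf8_encode(0x8336).hex(" "))  # e8 8c b6
```

In everyday code you would simply call chr(0x8336).encode("utf-8"), which gives the same three bytes; the hand-written version only makes the bit templates visible.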

Example

The Unicode value of cha 'tea' (茶) is 8336 in hexadecimal, which falls in the range 0800-FFFF, so you need 3 bytes. The binary form of hexadecimal 8336 is

10000011 00110110

Fill the empty slots of the three-byte template with the binary value of Cha and you will get:

11101000 10001100 10110110
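
If you want to check the slot filling yourself, a short Python sketch can split the 16 bits into groups of 4, 6 and 6 and prepend the three-byte prefixes. The variable names here are chosen only for this illustration.

```python
cha = 0x8336

# 16 binary digits, zero-padded on the left
bits = format(cha, "016b")

# The three-byte template has 4 + 6 + 6 free slots
groups = [bits[:4], bits[4:10], bits[10:]]

# Put the fixed prefixes 1110, 10 and 10 in front of each group
filled = ["1110" + groups[0], "10" + groups[1], "10" + groups[2]]
encoded = bytes(int(b, 2) for b in filled)

print(" ".join(filled))   # 11101000 10001100 10110110
print(encoded.hex(" "))   # e8 8c b6
```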

Thus you have converted 0x8336 to 0xE8 0x8C 0xB6. Set your browser to UTF-8 to see these three bytes as one Chinese character: 茶 (cha). In Western mode, you will see three characters: e grave (è), the OE ligature (Œ) and the paragraph sign (¶).
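
The same experiment can be tried without a browser by decoding the three bytes in Python under two different encodings. This sketch assumes that "Western mode" means the Windows-1252 code page (cp1252 in Python); in strict ISO 8859-1 the byte 0x8C is a control character rather than OE.

```python
data = bytes([0xE8, 0x8C, 0xB6])

# Read as UTF-8: one character, U+8336
as_utf8 = data.decode("utf-8")

# Read as Windows-1252 (assumed meaning of "Western"): three characters
as_western = data.decode("cp1252")

print(len(as_utf8), hex(ord(as_utf8)))     # 1 0x8336
print([hex(ord(c)) for c in as_western])   # ['0xe8', '0x152', '0xb6']
```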


Source of Information

François Yergeau. 1998. UTF-8, a transformation format of ISO 10646. RFC 2279.


© 2000-2002 Gyula Zsigri. Last updated: December 22, 2002