This page is only an illustration of how you can encode a Unicode character in UTF-8. Read RFC 2279 for first-hand information.
Take the Unicode value of the character to find out how many bytes you need. Unicode values are given in hexadecimal numbers:
0000-007F | 1 byte |
0080-07FF | 2 bytes |
0800-FFFF | 3 bytes |
10000-10FFFF | 4 bytes |
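The table lookup can be sketched in Python as below. The function name `utf8_length` is made up for this illustration, and the sketch assumes that the Unicode code space ends at 10FFFF.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point (hypothetical helper)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:  # assumption: upper limit of the Unicode code space
        return 4
    raise ValueError("code point is outside the Unicode range")


print(utf8_length(0x41))    # 1
print(utf8_length(0x8336))  # 3
```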
Convert the hex code to binary form and fill in the empty bits (marked x) of the matching template, padding with leading zeros on the left if needed:
1 byte | 0xxxxxxx |
2 bytes | 110xxxxx 10xxxxxx |
3 bytes | 1110xxxx 10xxxxxx 10xxxxxx |
4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
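A minimal sketch of filling these templates with bit operations might look like this. The name `utf8_encode` is hypothetical, and the sketch assumes the input is a valid code point; it does not reject surrogates (D800-DFFF), which a real encoder would refuse.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the UTF-8 bit templates (illustrative sketch)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:  # assumption: upper limit of the Unicode code space
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")


print(utf8_encode(0x8336).hex())  # e88cb6
```

The result can be checked against Python's built-in encoder with `chr(0x8336).encode("utf-8")`.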
The Unicode value of the Chinese character for 'tea' (茶) is 8336 in hexadecimal, so you need 3 bytes. The binary form of hexadecimal 8336 is
10000011 00110110
Fill the empty slots of the three-byte template with these 16 bits and you will get:
11101000 10001100 10110110
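The same worked example can be reproduced step by step with strings of bits. This is only an illustration of the three-byte case; the variable names are chosen for this sketch.

```python
# Write 0x8336 as 16 binary digits.
bits = format(0x8336, "016b")

# The three-byte template has 4 + 6 + 6 empty slots.
high, middle, low = bits[:4], bits[4:10], bits[10:]

# Prefix each group with the fixed marker bits of the template.
encoded = ["1110" + high, "10" + middle, "10" + low]

print(bits)               # 1000001100110110
print(" ".join(encoded))  # 11101000 10001100 10110110

# Convert each 8-bit group back to hexadecimal.
hex_bytes = [format(int(group, 2), "02X") for group in encoded]
print(hex_bytes)          # ['E8', '8C', 'B6']
```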
Thus you have converted 0x8336 to 0xE8 0x8C 0xB6. Set your browser to UTF-8 to see these three bytes as one Chinese character: 茶. In Western mode, you will see three separate characters instead.
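The effect of the browser setting can be imitated by decoding the same three bytes in two ways. The sketch below assumes that "Western mode" means the Windows-1252 code page (`cp1252` in Python); other Western encodings may treat the byte 0x8C differently.

```python
data = bytes([0xE8, 0x8C, 0xB6])

# Read as UTF-8, the three bytes form a single character.
as_utf8 = data.decode("utf-8")

# Read as Windows-1252 (an assumption for "Western mode"),
# every byte becomes a character of its own.
as_western = data.decode("cp1252")

print(len(as_utf8), hex(ord(as_utf8)))  # 1 0x8336
print(len(as_western))                  # 3
```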
François Yergeau. 1998. UTF-8, a transformation format of ISO 10646. RFC 2279.
© 2000-2002 Gyula Zsigri | Last updated: December 22, 2002