UTF-7


This page is only an illustration of how you can convert UTF-16BE into UTF-7.  Read RFC 2152 to get first-hand information.


  1. Convert each of the following characters from its 16-bit encoding to a 7-bit one by dropping the leading null byte:
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    abcdefghijklmnopqrstuvwxyz
    0123456789
    '(),-./:?
    Example.  Turn the encoding of letter A from 0x0041 to 0x41.
     
  2. Optionally, do the same with the following characters:
    !"#$%&*;<=>@[]^_`{|}
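  3. Encode every other character with Base64: take its UTF-16BE bytes and turn each sequence of three bytes (24 bits) into four characters (6 bits each).  If the last group is incomplete, pad it with zero bits; the = padding character is not used.  The Base64 alphabet is: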
    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    abcdefghijklmnopqrstuvwxyz
    0123456789+/
  4. Signal the beginning of a Base64 sequence with a + sign and signal its end with a hyphen.  The end signal is optional if the Base64 sequence is followed by a character that is neither a member of the Base64 alphabet nor itself a hyphen.  A literal + sign can be written as +-.
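
The four steps above can be sketched in Python.  This is a minimal illustration, not a reference implementation: the function name utf7_encode and its direct_optional flag are hypothetical, and the sketch Base64-encodes every character outside the step 1 and step 2 lists, including spaces and the + sign itself.  It assumes that standard Base64 with the trailing = padding removed is what step 3 calls for, and it also writes the closing hyphen when the next character is itself a hyphen, so that a decoder does not swallow it.

```python
import base64

# Step 1: characters that are always written directly.
DIRECT = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "'(),-./:?"
)
# Step 2: characters that may optionally be written directly.
OPTIONAL = set("!\"#$%&*;<=>@[]^_`{|}")
# Step 3: the Base64 alphabet.
BASE64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def utf7_encode(text, direct_optional=False):
    """Hypothetical helper: encode text as UTF-7 following steps 1 to 4."""
    direct = DIRECT | OPTIONAL if direct_optional else DIRECT
    out = []
    pending = []  # characters waiting to be Base64-encoded as one run

    def flush(next_char):
        if not pending:
            return
        raw = "".join(pending).encode("utf-16-be")
        # Standard Base64 minus "=" padding leaves the zero-bit padding in place.
        b64 = base64.b64encode(raw).decode("ascii").rstrip("=")
        out.append("+" + b64)
        # Step 4: the closing hyphen may be dropped only when the next
        # character cannot be mistaken for part of the Base64 run.
        if next_char is None or next_char in BASE64_CHARS or next_char == "-":
            out.append("-")
        pending.clear()

    for ch in text:
        if ch in direct:
            flush(ch)
            out.append(ch)
        else:
            pending.append(ch)
    flush(None)  # the sketch always closes a run at the end of the text
    return "".join(out)


print(utf7_encode("A"))                   # A
print(utf7_encode("\u4e2d\u65e5\u97d3"))  # +Ti1l5ZfT-
print(utf7_encode("Hi \u4e2d."))          # Hi+ACBOLQ.
```

Other encoders may make different optional choices (for example, writing spaces directly), so their output can differ from this sketch while still being valid UTF-7.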

Example

These three characters

中日韓

are encoded in UTF-16BE as 0x4E2D 0x65E5 0x97D3.  The binary form of these six bytes is:

     01001110 00101101 01100101 11100101 10010111 11010011
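
As a quick check, the byte values and their binary form can be reproduced with a few lines of Python.  This is only a sketch; the variable names are illustrative, and the characters are written as \u escapes so that the snippet does not depend on the file encoding.

```python
text = "\u4e2d\u65e5\u97d3"      # U+4E2D, U+65E5, U+97D3
raw = text.encode("utf-16-be")   # six bytes, big-endian, no byte order mark

# Show the bytes as 16-bit code units in hex, then as 8-bit binary groups.
hex_units = " ".join(
    f"0x{raw[i]:02X}{raw[i + 1]:02X}" for i in range(0, len(raw), 2)
)
binary = " ".join(f"{b:08b}" for b in raw)

print(hex_units)  # 0x4E2D 0x65E5 0x97D3
print(binary)     # 01001110 00101101 01100101 11100101 10010111 11010011
```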

Group these bits into 6-bit segments:

     010011 100010 110101 100101 111001 011001 011111 010011
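
The regrouping can be sketched in Python as follows.  The helper name six_bit_groups is hypothetical.  The zero-bit padding of a short final group is not needed for this example (48 bits divide evenly by 6), but it is included on the assumption that other inputs will not always divide evenly.

```python
def six_bit_groups(data):
    """Split bytes into 6-bit segments, given as strings of 0s and 1s."""
    bits = "".join(f"{b:08b}" for b in data)
    # Pad the final segment with zero bits if the total is not a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    return [bits[i:i + 6] for i in range(0, len(bits), 6)]


raw = bytes.fromhex("4E2D65E597D3")
print(" ".join(six_bit_groups(raw)))
# 010011 100010 110101 100101 111001 011001 011111 010011
```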

Now convert each segment into a character of the Base64 alphabet, using the table below, and put a + sign in front of the result to get +Ti1l5ZfT.

     000000 -> A     001101 -> N     011010 -> a     100111 -> n     110100 -> 0
     000001 -> B     001110 -> O     011011 -> b     101000 -> o     110101 -> 1
     000010 -> C     001111 -> P     011100 -> c     101001 -> p     110110 -> 2
     000011 -> D     010000 -> Q     011101 -> d     101010 -> q     110111 -> 3
     000100 -> E     010001 -> R     011110 -> e     101011 -> r     111000 -> 4
     000101 -> F     010010 -> S     011111 -> f     101100 -> s     111001 -> 5
     000110 -> G     010011 -> T     100000 -> g     101101 -> t     111010 -> 6
     000111 -> H     010100 -> U     100001 -> h     101110 -> u     111011 -> 7
     001000 -> I     010101 -> V     100010 -> i     101111 -> v     111100 -> 8
     001001 -> J     010110 -> W     100011 -> j     110000 -> w     111101 -> 9
     001010 -> K     010111 -> X     100100 -> k     110001 -> x     111110 -> +
     001011 -> L     011000 -> Y     100101 -> l     110010 -> y     111111 -> /
     001100 -> M     011001 -> Z     100110 -> m     110011 -> z
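
The table lookup amounts to using each 6-bit value as an index into the Base64 alphabet.  A minimal Python sketch, with illustrative variable names, cross-checked against the standard base64 module and Python's built-in utf-7 codec:

```python
import base64
import string

# Index 0..63 maps to A-Z, a-z, 0-9, +, / in that order, as in the table.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

segments = "010011 100010 110101 100101 111001 011001 011111 010011".split()
encoded = "".join(ALPHABET[int(seg, 2)] for seg in segments)
print("+" + encoded)  # +Ti1l5ZfT

# Cross-check: the same six bytes through the standard Base64 encoder.
assert encoded == base64.b64encode(bytes.fromhex("4E2D65E597D3")).decode("ascii")

# Cross-check: decoding the result (with the closing hyphen) restores the text.
decoded = ("+" + encoded + "-").encode("ascii").decode("utf-7")
assert decoded == "\u4e2d\u65e5\u97d3"
```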

Source of Information

David Goldsmith and Mark Davis. 1997. UTF-7: a mail-safe transformation format of Unicode. RFC 2152.

© 2000-2002 Gyula Zsigri.  Last updated:  July 11, 2002