1 Reply Latest reply: Aug 6, 2009 12:11 PM by DrClap

    Encode/Decode a String containing Chinese etc. characters to/from Unicode

    843810
      Hi, I am working to encode a multilingual String (English, Chinese, Japanese, Latin, etc.) as Unicode. The encoded string is used in two ways:
      1) It is decoded by the user interface (a web UI) for display.
      2) It is read directly by users, so it should stay human-readable when it contains only English and special characters.

      Thus, given a string that mixes English, Japanese, Chinese, etc., we want the Japanese/Chinese characters to be encoded as the percent-escaped hex values of their UTF-8 bytes; for example, 私 would be encoded as %E7%A7%81.
      However, special characters such as ! ? , and the space character should preferably be left unencoded.

      The character encoding itself is achievable with java.net.URLEncoder, but it also replaces most special characters, including the space character (which becomes +), and that makes the result painful to read.
      Unfortunately, URLEncoder has no API for configuring which characters to encode. Any suggestions on how I can proceed, or which encoder I can use?
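
      To illustrate the behaviour described above, here is a minimal sketch. The class name UrlEncoderDemo and the sample string are made up for illustration; only URLEncoder.encode(String, String) is the real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; only URLEncoder itself comes from the JDK.
class UrlEncoderDemo {

    // Encodes with UTF-8. The checked exception cannot occur in practice,
    // because every Java platform is required to support UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u79C1" is the character 私, written as an escape so the source
        // file encoding does not matter.
        System.out.println(encode("Hi there! \u79C1"));
        // prints: Hi+there%21+%E7%A7%81
    }
}
```

      The space becomes + and the ! becomes %21, which is the readability problem for plain English text.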

      Thanks in advance!
      Regards
        • 1. Re: Encode/Decode a String containing Chinese etc. characters to/from Unicode
          DrClap
          I have to say that I disagree that "%E7%A7%81" is "human-readable". On the contrary, I would say that "私" is human-readable; even though I don't understand it myself, there are a lot of humans who do.

          But let's say that "human-readable" wasn't the right term. Maybe you just wanted a representation of an arbitrary Unicode string in ASCII characters only, in which case I would recommend Base64 encoding. That makes everything unreadable, though, whereas your requirement appears to be to make only languages other than English unreadable. So if you don't like that, you are going to have to write your own encoder.
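
          As a rough idea of what "write your own encoder" could look like, here is a minimal sketch rather than a definitive implementation. The class name SelectiveEncoder and its two methods are hypothetical. It assumes Java 7 or later (for StandardCharsets) and that the % character itself must also be escaped so the result can be decoded unambiguously.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: percent-encodes only non-ASCII characters (as UTF-8
// bytes) plus '%' itself, and leaves all other ASCII characters untouched.
class SelectiveEncoder {

    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);      // handles surrogate pairs
            i += Character.charCount(cp);
            if (cp < 0x80 && cp != '%') {
                out.append((char) cp);          // plain ASCII stays readable
            } else {
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '%') {
                if (i + 3 > input.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                // parseInt throws NumberFormatException on non-hex digits
                bytes.write(Integer.parseInt(input.substring(i + 1, i + 3), 16));
                i += 3;
            } else if (c < 0x80) {
                bytes.write(c);
                i++;
            } else {
                throw new IllegalArgumentException("Non-ASCII character at index " + i);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Hi there! \u79C1";   // \u79C1 is 私
        String encoded = encode(original);
        System.out.println(encoded);            // prints: Hi there! %E7%A7%81
        System.out.println(decode(encoded).equals(original)); // prints: true
    }
}
```

          Note that the output is not a valid URL component, since spaces and other reserved characters are left as they are. It is only meant as an ASCII-only form that a matching decoder can reverse.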

          Note also that your original premise:
          Hi, I am working to encode a multilingual String (English, Chinese, Japanese, Latin, etc.) as Unicode.
          is misguided, since all data in Java Strings is already Unicode characters.
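
          A small sketch of that point, with a hypothetical class UnicodeDemo and helper hexOfUtf8, and assuming Java 7 or later: the String already holds the Unicode character, and the bytes E7 A7 81 only appear once you pick a byte encoding such as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing that a Java String is already Unicode.
class UnicodeDemo {

    // Returns the UTF-8 bytes of s as an uppercase hex string.
    static String hexOfUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u79C1";                                       // the character 私
        System.out.println(s.length());                            // prints: 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints: 79c1
        System.out.println(hexOfUtf8(s));                          // prints: E7A781
    }
}
```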