3 Replies. Latest reply: Dec 10, 2009 4:16 PM by 843810

    Handling of extra bytes by Java String

    843810
      Hi,

      I was wondering how Java's String handles bytes that are not valid UTF-8. For example:

      Let's say I have some bytes coming from a server, and I expect them to be UTF-8. But due to a bug of some sort, the server accidentally appends some bytes that are not valid UTF-8.
      Now, on the client side, I read in these bytes, convert them to a String, and display it. So:

      1) Does the String class handle these extra bytes itself, chopping them off and displaying only the valid UTF-8 portion of the string?
      2) Does it throw an error?
      3) If not 2), is there any way to validate the bytes so that we can tell that the server has a bug?

      Any help will be appreciated...

      Thanks.
        • 1. Re: Handling of extra bytes by Java String
          843810
          #1. I'm not sure Java will decode UTF-8 bytes correctly unless you explicitly tell it to,

          i.e.
          byte[] stringData = ...;
          String str1 = new String(stringData); // decodes with the platform default charset, which may or may not be UTF-8
          String str2 = new String(stringData, "UTF-8"); // decodes explicitly as UTF-8

          #2. How can Java "know" that extra bytes have been added to your string? When decoding a UTF-8 string, Java will follow exactly the specification that defines UTF-8.

          So to your questions:

          #1 no
          #2 no
          #3 Read in the string as byte[] data. This is the safest way to go. Then send that back to the sender and see if it is the same as what they sent.
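
          For what it's worth, here is a minimal sketch of what the String constructor tends to do with a stray byte. It assumes Java 7 or later (for StandardCharsets) and uses the String(byte[], Charset) overload, which is documented to replace malformed input with the charset's replacement string (U+FFFD for UTF-8). The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class LenientDecodeDemo {
    // Decodes with String(byte[], Charset). This overload does not throw on
    // malformed input; it substitutes the charset's replacement string instead.
    static String decodeLeniently(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "hi" followed by 0xFF, a byte that can never occur in valid UTF-8.
        byte[] data = { 'h', 'i', (byte) 0xFF };
        String s = decodeLeniently(data);

        // The valid prefix survives, and the bad byte shows up as U+FFFD
        // rather than being chopped off or raising an exception.
        System.out.println("starts with \"hi\": " + s.startsWith("hi"));
        System.out.println("contains U+FFFD: " + (s.indexOf('\uFFFD') >= 0));
    }
}
```

          So the bad bytes are neither silently dropped nor reported; checking the result for U+FFFD is only a rough heuristic, because a legitimate string may contain that character too.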

          Edited by: tjacobs01 on Dec 10, 2009 8:28 PM
          • 2. Re: Handling of extra bytes by Java String
            DrClap
            Good question. I didn't know the answer, so I had a look at the API documentation for the String(byte[], charset) constructor, and it turns out to say:
            "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required."
            So there you go. The answer to (1) is "It depends". The answer to (2) is "No". The answer to (3) is "Use the java.nio.charset.CharsetDecoder class".
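
            A rough sketch of how CharsetDecoder could be used for the validation in (3), assuming Java 7 or later for StandardCharsets (on older versions, Charset.forName("UTF-8") should work the same way). The class name StrictUtf8 and its two helper methods are hypothetical; only the java.nio.charset calls come from the standard library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Decodes the bytes as UTF-8 and throws CharacterCodingException
    // if any byte sequence is malformed or unmappable.
    static String decodeStrict(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    // Convenience wrapper: true if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            decodeStrict(data);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = { 'h', 'i' };
        byte[] bad = { 'h', 'i', (byte) 0xFF }; // 0xFF never occurs in valid UTF-8
        System.out.println("good is valid: " + isValidUtf8(good));
        System.out.println("bad is valid: " + isValidUtf8(bad));
    }
}
```

            With REPORT as the error action, the decoder throws instead of substituting a replacement character, so a buggy server response can be detected and logged before anything is displayed.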
            • 3. Re: Handling of extra bytes by Java String
              843810
              Great! Thanks, guys. I have not had a chance to use the CharsetDecoder class, but I will now, and I know where to come back in case I have any more issues ;)