12 Replies Latest reply: Oct 31, 2007 1:40 AM by 796440

    IO.DataInputStream.readUTF has problem of handling ascii 128-255

    807603
      Hi all,

      My receiving-end servlet uses IO.DataInputStream.readUTF to read a socket stream from my provider end (C++ code). It throws UTFDataFormatException when it encounters the character » (entered with ALT + 0187) in the socket stream. This still happens even after I convert it to UTF-8 on the provider end before sending it to the servlet. It seems to have a problem with any character in ASCII 128-255. I'm using JDK 1.4.2. Has anyone experienced the same kind of problem? Is there a solution?

      Thanks
        • 1. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
          800351
          The readUTF() should be used when and only when the sender uses writeUTF() of the DataOutputStream class.
          • 2. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
            796440
            I'm pretty sure that 128-255 are not valid as single bytes in UTF-8. I think those characters need to be expressed as two bytes.
            • 3. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
              807603
              I did convert it to two bytes before sending it to the servlet. For example, » (ALT+0187) was converted to UTF-8 on the provider side as the two-byte sequence Â» (i.e. 11000010 10111011). However, the servlet-side IO.DataInputStream.readUTF still failed and threw UTFDataFormatException.

              Thanks
              • 4. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                807603
                javac wrote:
                Hi all,

                My receiving-end servlet uses IO.DataInputStream.readUTF to read a socket stream from my provider end (C++ code). It throws UTFDataFormatException when it encounters the character » (entered with ALT + 0187) in the socket stream. This still happens even after I convert it to UTF-8 on the provider end before sending it to the servlet. It seems to have a problem with any character in ASCII 128-255. I'm using JDK 1.4.2. Has anyone experienced the same kind of problem? Is there a solution?
                Yes, everyone from time to time experiences the problems caused by not consulting the API Javadoc. The solution is to read it.

                Then you learn. And eventually you learn to start with this, thus short-circuiting the whole mess.

                You need to examine what the readUTF method actually does because it doesn't do what you think.
                String readUTF() throws IOException
                Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.
                First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read.
                • 5. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                  807603
                  Thanks for your opinion. I not only read the documentation but also studied its implementation code. What you highlighted is the internal implementation of readUTF, as you can see if you look at the implementation code. I think I followed the documentation. I have no problem sending ASCII code set 1 characters. The problem arises with ASCII code set 2 characters. In my example, it's ». I converted it to UTF-8 (i.e. Â») and sent it to the servlet over a TCP/IP socket. So the servlet-side readUTF reads a stream containing UTF-8 data. Is that not what readUTF expects? Is there any part of the picture I'm missing here? Please explain your opinion, not just quote document comments.

                  Thanks
                  • 6. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                    807603
                    readUTF is expecting data sent in the following format.

                    2 bytes - unsigned short - length of data to read

                    x bytes - data

                    This is the format that writeUTF uses.

                    But from your description this is not what the C++ code is sending you. So what is the C++ code sending you?
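To make the framing above concrete, here is a minimal Java sketch that feeds hand-built frames to readUTF, as if they had arrived from the socket. The class name `ReadUtfFraming`, the helper `decode`, and the "bad" frame are hypothetical illustrations; the thread never shows what the C++ sender actually transmits, so the bad frame is only one guess at what could go wrong.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class ReadUtfFraming {
    // Wraps a byte array in a DataInputStream and reads one readUTF record from it.
    public static String decode(byte[] frame) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        // Big-endian length 0x0002, followed by the two UTF-8 bytes of U+00BB.
        byte[] good = {0x00, 0x02, (byte) 0xC2, (byte) 0xBB};
        System.out.println((int) decode(good).charAt(0)); // 187

        // Hypothetical sender bug: the raw single byte 0xBB sent with length 1.
        byte[] bad = {0x00, 0x01, (byte) 0xBB};
        try {
            decode(bad);
        } catch (UTFDataFormatException e) {
            System.out.println("UTFDataFormatException");
        }
    }
}
```

Note that the length prefix counts bytes, not characters, so a sender that converts » to two bytes must also report a length of 2 for it.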
                    • 7. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                      796440
                      javac wrote:
                      Please explain your opinion, not just quote document comments.
                      Yeah, quoting the docs is pointless when the person the quote is intended for refuses to read them.

                      :-P
                      • 8. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                        807603
                        Of course I followed its spec. Otherwise, it wouldn't have worked with ASCII 0-127 either. However, as I said, I have no problem communicating ASCII 0-127 chars between my provider side and my servlet side using readUTF. That should have implied to you that the message was framed according to the spec: length first, then data bytes. So please don't jump the gun so quickly. When you ask me to read your comments carefully, please think about my answer intelligently too.

                        Secondly, socket communication is based on protocol and data format. writeUTF and readUTF may be paired up in Java. However, that doesn't mean that one end of a socket communication channel can't use another implementation, as long as it follows the same protocol and data format.

                        Thanks
                        • 9. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                          800351
                          It's a sender problem, not reader.
                          no problem with communicating ascii 0-127 chars
                          They are one-byte chars even in UTF-8, just the same as good old ASCII.

                          If the sender can't correctly mimic writeUTF() (read the Javadoc), the reader should stop using readUTF() and do a simple byte read of the original, non-UTF-8-converted text data. Then construct a Java String using the proper charset name.
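The byte-read alternative suggested above might look roughly like the following sketch. It assumes the sender still prefixes each message with a 2-byte big-endian length and sends the text as single-byte ISO-8859-1; both are assumptions, since the thread does not specify the sender's real framing or code page. `RawBytesReader` and `readText` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class RawBytesReader {
    // Reads a 2-byte length (assumed framing), then that many raw bytes,
    // and decodes them with the named charset instead of modified UTF-8.
    public static String readText(DataInputStream in, String charsetName) throws IOException {
        int length = in.readUnsignedShort();
        byte[] data = new byte[length];
        in.readFully(data); // blocks until all 'length' bytes have arrived
        return new String(data, charsetName);
    }

    public static void main(String[] args) throws IOException {
        // "A»Z" as three single-byte characters, preceded by length 0x0003.
        byte[] frame = {0x00, 0x03, 0x41, (byte) 0xBB, 0x5A};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
        String s = readText(in, "ISO-8859-1");
        System.out.println(s.length());        // 3
        System.out.println((int) s.charAt(1)); // 187
    }
}
```

If the sender does emit standard UTF-8, passing "UTF-8" as the charset name to the same helper would be the corresponding choice.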
                          • 10. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                            EJP
                            Exactly. Instead of
                            Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                            the correct statement of the situation is that your C++ implementation has problem of writing ascii 128-255 according to the readUTF() specification.
                            • 11. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                              796440
                              javac wrote:
                              Of course I followed its spec. Otherwise, it wouldn't have worked with ASCII 0-127 either. However, as I said, I have no problem communicating ASCII 0-127 chars between my provider side and my servlet side using readUTF.
                              Well, you're doing something wrong.
                              bsh % dos = new DataOutputStream(new FileOutputStream("C:/cygwin/tmp/utf.utf"));
                              bsh % dos.writeUTF("ABC");
                              bsh % dos.writeUTF("A»Z");
                              bsh % dos.writeUTF("AあんZ");
                              bsh % dos.flush();
                              bsh % dos.close();
                              
                              bsh % dos = new DataInputStream(new FileInputStream("C:/cygwin/tmp/utf.utf"));
                              bsh % dos.readUTF();
                              <ABC>
                              bsh % dos.readUTF();
                              <A»Z>
                              bsh % dos.readUTF();
                              <AあんZ>

                              File has:
                              00000000:00 03 41 42 43 00 04 41 c2 bb 5a 00 08 41 e3 81
                              00000010:82 e3 82 93 5a      
                              which, according to

                              http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8

                              looks to me like

                              0003 = 3 UTF bytes follow
                              41 42 43 = The 3 bytes for 'A', 'B', and 'C'

                              0004 = 4 UTF bytes follow
                              41 = 'A'
                              c2bb = 1100 0010 1011 1011 = 110 (00010) 10 (111011) = character (0000 0000 1011 1011) = 00bb = '»'
                              5a = 'Z'

                              0008 = 8 bytes
                              41 = A
                              e3 81 82 probably = 'あ' but I'm not going to check
                              e3 82 93 probably = 'ん' but I'm not going to check
                              5a = 'Z'

                              Looks like it works to me.
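For comparing against whatever a non-Java sender puts on the wire, a small helper along these lines can print the exact bytes writeUTF produces for a given string. This is only a sketch: `WriteUtfDump` and `hexOfWriteUTF` are hypothetical names, and it uses Java 5+ syntax rather than the JDK 1.4.2 mentioned earlier in the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDump {
    // Returns the bytes written by writeUTF(s) as space-separated lowercase hex.
    public static String hexOfWriteUTF(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();

        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hexOfWriteUTF("A\u00BBZ")); // 00 04 41 c2 bb 5a
    }
}
```

A capture of the C++ sender's output for the same string should match this byte for byte, including the two length bytes.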

                              Edited by: jverd on Oct 30, 2007 11:43 PM
                              • 12. Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255
                                796440
                                javac wrote:
                                Of course I followed its spec.
                                Obviously not.