javac wrote: Yes, everyone from time to time experiences the problems caused by not consulting the API Javadoc. The solution is to read it.
My receiving-end servlet uses java.io.DataInputStream.readUTF to read the socket stream from my provider end (C++ code). It throws UTFDataFormatException when it encounters the character » (entered with ALT + 0187) in the socket stream. It still happens even after I convert the character to UTF-8 on the provider end before sending it to the servlet. It seems to have a problem with any character in the extended ASCII range 128-255. I'm using JDK 1.4.2. Has anyone experienced the same kind of problem? Is there a solution?
Then you learn. And eventually you learn to start with this, thus short-circuiting the whole mess.
You need to examine what the readUTF method actually does because it doesn't do what you think.
String readUTF() throws IOException
Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read.
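To make the quoted contract concrete, here is a minimal sketch that feeds hand-built frames to readUTF. The class name ReadUtfFrameDemo and the helper decodeOrNull are made up for illustration, and the try-with-resources syntax needs a newer JDK than 1.4.2. The first frame has the two-byte length followed by the UTF-8 bytes of A»Z. The second sends » as the single byte 0xBB, which is one plausible way (an assumption, not something confirmed in this thread) for a sender to trigger UTFDataFormatException.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ReadUtfFrameDemo {
    /** Returns the decoded string, or null if readUTF rejects the frame. */
    static String decodeOrNull(byte[] frame) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            return in.readUTF();
        } catch (IOException e) {
            // UTFDataFormatException and EOFException are both IOExceptions.
            System.out.println("rejected: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        // UTF length 0x0004, then 'A', U+00BB encoded as C2 BB, then 'Z'.
        byte[] good = {0x00, 0x04, 0x41, (byte) 0xC2, (byte) 0xBB, 0x5A};
        System.out.println(decodeOrNull(good));

        // UTF length 0x0003, but 0xBB alone is not a valid lead byte.
        byte[] bad = {0x00, 0x03, 0x41, (byte) 0xBB, 0x5A};
        System.out.println(decodeOrNull(bad));
    }
}
```

Note that the length prefix counts encoded bytes, not characters, so A»Z is 4, not 3.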
Thanks for your opinion. I not only read the documentation but also studied the implementation code. What you highlighted is the internal implementation of readUTF, as you can see if you look at the implementation code. I think I followed the documentation. I have no problem sending ASCII code set 1 characters. The problem arises with ASCII code set 2 characters; in my example, it's ». I converted it to UTF-8 (i.e. Â», the bytes C2 BB) and sent it to the servlet over a TCP/IP socket. So the servlet-side readUTF reads a stream containing UTF-8 data. Is that not what readUTF expects? Am I missing part of the picture? Please explain your opinion rather than just quoting documentation comments.
Of course I followed its spec. Otherwise, it wouldn't have worked with ASCII 0-127 either. However, as I said, I have no problem communicating ASCII 0-127 chars between my provider side and my servlet side using readUTF. That should have told you that the message was packaged according to the spec: length first, then data bytes. So please don't jump the gun so quickly. When you ask me to read your comments carefully, you should also consider my answer intelligently.
Secondly, socket communication is based on a protocol and a data format. writeUTF and readUTF may be paired up in Java. However, that doesn't mean one end of a socket communication channel can't use another implementation, as long as it follows the same protocol and data format.
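For what "the same protocol and data format" would require of a hand-rolled sender, here is a rough sketch in Java (the real sender in this thread is C++, so treat it as a model of the byte layout only). The class WriteUtfFrame and its methods are hypothetical names. It builds the frame by hand and compares it with what DataOutputStream.writeUTF emits. It assumes the text contains no U+0000 and no characters outside the Basic Multilingual Plane, since modified UTF-8 differs from standard UTF-8 in those cases.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriteUtfFrame {
    /** Hand-built frame: 2-byte big-endian byte count, then the UTF-8 bytes. */
    static byte[] frame(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 0xFFFF) {
            throw new IllegalArgumentException("encoded text exceeds 65535 bytes");
        }
        byte[] out = new byte[2 + utf8.length];
        out[0] = (byte) (utf8.length >>> 8); // high byte of the encoded length
        out[1] = (byte) utf8.length;         // low byte of the encoded length
        System.arraycopy(utf8, 0, out, 2, utf8.length);
        return out;
    }

    /** Reference bytes produced by the JDK's own writeUTF. */
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(buf)) {
            dos.writeUTF(text);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String text = "A\u00bbZ"; // A»Z
        System.out.println(Arrays.equals(frame(text), viaWriteUTF(text))); // prints true
    }
}
```

The point to watch on the sending side is that the length must be computed after the UTF-8 conversion, because » grows from one byte to two.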
It's a sender problem, not a reader problem.
no problem with communicating ascii 0-127 chars
They are one-byte chars even in UTF-8, the same as good old ASCII.
If the sender can't correctly mimic writeUTF() (read the Javadoc), the reader should stop using readUTF() and do a simple byte read of the original, non-UTF-8-converted text data, then construct a Java String using the proper charset name.
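A minimal sketch of that fallback follows. The framing is an assumption made up for the example: a 2-byte big-endian length followed by the text in ISO-8859-1. The actual wire format and charset of the C++ sender are not stated in this thread, so both would need to be adjusted. RawBytesReader and readText are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawBytesReader {
    /**
     * Reads one message as raw bytes and decodes it with the named charset.
     * Assumed framing: unsigned 16-bit big-endian length, then that many bytes.
     */
    static String readText(InputStream raw, String charsetName) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int length = in.readUnsignedShort();
        byte[] payload = new byte[length];
        in.readFully(payload); // blocks until all bytes arrive or throws EOFException
        return new String(payload, charsetName);
    }

    public static void main(String[] args) throws IOException {
        // 'A', 0xBB (» in ISO-8859-1), 'Z' sent unconverted, one byte per character.
        byte[] wire = {0x00, 0x03, 0x41, (byte) 0xBB, 0x5A};
        String text = readText(new ByteArrayInputStream(wire), "ISO-8859-1");
        System.out.println(text.equals("A\u00bbZ")); // prints true
    }
}
```

With a real socket, the InputStream would come from Socket.getInputStream() or the servlet request instead of a ByteArrayInputStream.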
javac wrote: Well, you're doing something wrong.
Of course, I did according to its spec. Otherwise, it wouldn't have worked with ascii 0-127 either. However, as I said, I have no problem with communicating ascii 0-127 chars between my provider side and my servlet side using readUTF too.
(That last line shows up properly in my bsh window. Forum must be munging it.)
bsh % dos = new DataOutputStream(new FileOutputStream("C:/cygwin/tmp/utf.utf"));
bsh % dos.writeUTF("ABC");
bsh % dos.writeUTF("A»Z");
bsh % dos.writeUTF("AあんZ");
bsh % dos.flush();
bsh % dos.close();
bsh % dos = new DataInputStream(new FileInputStream("C:/cygwin/tmp/utf.utf"));
bsh % dos.readUTF();
<ABC>
bsh % dos.readUTF();
<A»Z>
bsh % dos.readUTF();
<AあんZ>
which, according to
00000000: 00 03 41 42 43 00 04 41 c2 bb 5a 00 08 41 e3 81
00000010: 82 e3 82 93 5a
looks to me like
0003 = 3 UTF bytes follow
41 42 43 = The 3 bytes for 'A', 'B', and 'C'
0004 = 4 UTF bytes follow
41 = 'A'
c2bb = 1100 0010 1011 1011 = 110 (00010) 10 (111011) = character (0000 0000 1011 1011) = 00bb = '»'
5a = 'Z'
0008 = 8 bytes
41 = A
e3 81 82 probably = 'あ' but I'm not going to check
e3 82 93 probably = 'ん' but I'm not going to check
5a = 'Z'
Looks like it works to me.
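The bit arithmetic in the breakdown above, including the two three-byte sequences that were left unchecked, can be verified with a few lines. This is only a sketch: Utf8BitDecode, decode2 and decode3 are names invented for the example, and the methods do no validation of the lead or continuation bits.

```java
public class Utf8BitDecode {
    /** Two-byte form 110xxxxx 10xxxxxx: 5 payload bits, then 6. */
    static int decode2(int b1, int b2) {
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    }

    /** Three-byte form 1110xxxx 10xxxxxx 10xxxxxx: 4 payload bits, then 6, then 6. */
    static int decode3(int b1, int b2, int b3) {
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    }

    public static void main(String[] args) {
        System.out.printf("c2 bb    -> U+%04X%n", decode2(0xC2, 0xBB));       // U+00BB
        System.out.printf("e3 81 82 -> U+%04X%n", decode3(0xE3, 0x81, 0x82)); // U+3042
        System.out.printf("e3 82 93 -> U+%04X%n", decode3(0xE3, 0x82, 0x93)); // U+3093
    }
}
```

U+00BB is », U+3042 is あ and U+3093 is ん, so the dump matches the three strings that were written.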
Edited by: jverd on Oct 30, 2007 11:43 PM