The readUTF() should be used when and only when the sender uses writeUTF() of the DataOutputStream class.
I'm pretty sure that 128-255 are not valid unicode characters. I think those need to be expressed as two bytes.
I did convert it to two bytes before sending it to the servlet. For example, » (ALT+0187) was converted to UTF-8 at the provider side as the two-byte sequence Â» (i.e. 11000010 10111011). However, the servlet-side IO.DataInputStream.readUTF still failed and threw a UTFDataFormatException.
javac wrote: Yes, everyone from time to time experiences the problems caused by not consulting the API Javadoc. The solution is to read them.
My receiving-end servlet uses IO.DataInputStream.readUTF to read the socket stream from my provider end (C++ code). It throws a UTFDataFormatException when it encounters the character » (entered with ALT+0187) in the socket stream. This still happens even after I convert it to UTF-8 at the provider end before sending it to the servlet. It seems to have a problem with any character in ASCII 128-255. I'm using jdk1.4.2. Has anyone experienced the same kind of problem? Any solution?
Then you learn. And eventually you learn to start with this thus short circuiting the whole mess.
You need to examine what the readUTF method actually does because it doesn't do what you think.
String readUTF() throws IOException
Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read.
Thanks for your opinion. I not only read the documentation but also studied its implementation code. What you highlighted is the internal implementation of readUTF, as you can see if you look at the code. I think I followed the documentation. I have no problem sending ASCII code set 1 characters. The problem arises with ASCII code set 2 characters; in my example, it's ». I converted it to UTF-8 (i.e. Â») and sent it to the servlet over a TCP/IP socket. So the servlet-side readUTF reads a stream containing UTF-8 data. Is that not what readUTF expects? Is there any part of the picture I'm missing here? Please explain your opinion, not just quote document comments.
readUTF is expecting data sent in the following format.
2 bytes - unsigned short - length of data to read
x bytes - data
This is the format that writeUTF uses.
But from your description this is not what the C++ code is sending you. So what is the C++ code sending you?
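To make the expected layout concrete, here is a minimal sketch of what a non-Java sender would have to put on the wire. The class name ReadUtfFrameSketch and the helpers frame and decode are made up for illustration; only DataInputStream.readUTF is the real API. The sketch assumes the length prefix counts encoded bytes (not characters), as the javadoc quoted above says, and it only covers characters such as » where modified UTF-8 and standard UTF-8 produce the same bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class ReadUtfFrameSketch {

    // Hypothetical helper: builds the frame a non-Java sender must emit,
    // i.e. a 2-byte big-endian byte count followed by the encoded bytes.
    static byte[] frame(byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((payload.length >>> 8) & 0xFF); // high byte of the length
        out.write(payload.length & 0xFF);         // low byte of the length
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    // Hypothetical helper: feeds a frame to the real readUTF().
    static String decode(byte[] framed) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
    }

    public static void main(String[] args) throws IOException {
        // 'A', U+00BB encoded as C2 BB, 'Z': the prefix must say 4, not 3.
        byte[] good = frame(new byte[] {0x41, (byte) 0xC2, (byte) 0xBB, 0x5A});
        System.out.println(decode(good).equals("A\u00bbZ"));

        // Same text, but U+00BB sent as the single Latin-1 byte BB.
        byte[] bad = frame(new byte[] {0x41, (byte) 0xBB, 0x5A});
        try {
            decode(bad);
        } catch (UTFDataFormatException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A lone byte in the 0x80-0xBF range is a continuation byte with no lead byte, which is why the second frame is rejected while plain 0-127 text goes through either way.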
javac wrote: Please explain your opinion, not just quote document comments.
Yeah, quoting the docs is pointless when the person the quote is intended for refuses to read them.
Of course, I did according to its spec. Otherwise, it wouldn't have worked with ascii 0-127 either. However, as I said, I have no problem with communicating ascii 0-127 chars between my provider side and my servlet side using readUTF too. That should have implied to you that the message was packaged following the spec: length first and then data bytes. So please don't jump the gun so quickly. When you ask me to read your comments carefully, please consider my answer intelligently too.
Secondly, socket communication is based on protocol and data format. writeUTF and readUTF may be paired up in java. However, it doesn't mean that one end of socket communication channel can't use another implementation as long as it follows the same protocol and data format.
It's a sender problem, not reader.
no problem with communicating ascii 0-127 chars
They are one-byte chars even in UTF-8, quite the same as good old ASCII.
If the sender can't correctly mimic writeUTF() (read the javadoc), the reader should stop using readUTF() and do a simple byte read of the original, non-UTF-8-converted text data, then construct a Java String using the proper charset name.
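A rough sketch of that alternative, under assumptions the thread does not confirm: the sender is taken to prefix the raw text with a 2-byte big-endian length, and the text is taken to be ISO-8859-1. The class RawBytesSketch and the method readText are hypothetical names; the real framing and charset depend on what the C++ side actually sends.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RawBytesSketch {

    // Assumed framing: 2-byte big-endian length, then that many raw text
    // bytes in the sender's own charset. No UTF-8 conversion on the sender.
    static String readText(InputStream in, String charsetName) throws IOException {
        DataInputStream din = new DataInputStream(in);
        int length = din.readUnsignedShort();
        byte[] raw = new byte[length];
        din.readFully(raw);                  // blocks until all bytes arrive
        return new String(raw, charsetName); // decode with the named charset
    }

    public static void main(String[] args) throws IOException {
        // 'A', the single Latin-1 byte BB, 'Z', as a stand-in for socket data.
        byte[] wire = {0x00, 0x03, 0x41, (byte) 0xBB, 0x5A};
        String text = readText(new ByteArrayInputStream(wire), "ISO-8859-1");
        System.out.println(text.equals("A\u00bbZ"));
    }
}
```

With this approach the reader never calls readUTF() at all, so the sender only has to agree on a charset name and on how the message length is conveyed.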
Exactly. Instead of
"Re: IO.DataInputStream.readUTF has problem of handling ascii 128-255"
the correct statement of the situation is that your C++ implementation has a problem writing ascii 128-255 according to the readUTF() specification.
javac wrote: Of course, I did according to its spec. Otherwise, it wouldn't have worked with ascii 0-127 either. However, as I said, I have no problem with communicating ascii 0-127 chars between my provider side and my servlet side using readUTF too.
Well, you're doing something wrong.
(That last line shows up properly in my bsh window. Forum must be munging it.)
bsh % dos = new DataOutputStream(new FileOutputStream("C:/cygwin/tmp/utf.utf"));
bsh % dos.writeUTF("ABC");
bsh % dos.writeUTF("A»Z");
bsh % dos.writeUTF("AあんZ");
bsh % dos.flush();
bsh % dos.close();
bsh % dos = new DataInputStream(new FileInputStream("C:/cygwin/tmp/utf.utf"));
bsh % dos.readUTF();
<ABC>
bsh % dos.readUTF();
<A»Z>
bsh % dos.readUTF();
<AあんZ>
which, according to
00000000:00 03 41 42 43 00 04 41 c2 bb 5a 00 08 41 e3 81
00000010:82 e3 82 93 5a
looks to me like
0003 = 3 UTF bytes follow
41 42 43 = The 3 bytes for 'A', 'B', and 'C'
0004 = 4 UTF bytes follow
41 = 'A'
c2bb = 1100 0010 1011 1011 = 110 (00010) 10 (111011) = character (0000 0000 1011 1011) = 00bb = '»'
5a = 'Z'
0008 = 8 bytes
41 = A
e3 81 82 probably = 'あ' but I'm not going to check
e3 82 93 probably = 'ん' but I'm not going to check
5a = 'Z'
Looks like it works to me.
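For anyone who would rather check the breakdown mechanically than by hand, here is a small sketch that feeds the 21 bytes from the dump above straight into readUTF(). The class name HexDumpCheck and the method readAll are made up for illustration; the expected strings are written with Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class HexDumpCheck {

    // The 21 bytes from the hex dump, grouped per writeUTF() call.
    static final int[] DUMP = {
        0x00, 0x03, 0x41, 0x42, 0x43,
        0x00, 0x04, 0x41, 0xc2, 0xbb, 0x5a,
        0x00, 0x08, 0x41, 0xe3, 0x81, 0x82, 0xe3, 0x82, 0x93, 0x5a
    };

    // Reads the three strings back with the real readUTF().
    static String[] readAll() throws IOException {
        byte[] bytes = new byte[DUMP.length];
        for (int i = 0; i < DUMP.length; i++) {
            bytes[i] = (byte) DUMP[i];
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        return new String[] { in.readUTF(), in.readUTF(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        String[] s = readAll();
        System.out.println(s[0].equals("ABC"));
        System.out.println(s[1].equals("A\u00bbZ"));       // U+00BB from c2 bb
        System.out.println(s[2].equals("A\u3042\u3093Z")); // U+3042, U+3093
    }
}
```

Working the bits the same way as for c2bb, e3 81 82 comes out as U+3042 and e3 82 93 as U+3093, which are the two hiragana in the third string.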
Edited by: jverd on Oct 30, 2007 11:43 PM
javac wrote: Of course, I did according to its spec.
Obviously not.