RyanAllaby wrote: I suspect it is the default encoding on the computer the software is running on. If this is true, then how do I force the application to honor German umlauts?
I suspect you are right. So don't do anything that uses the default encoding. That includes creating Readers and Writers without naming a charset (use InputStreamReader and OutputStreamWriter as intermediaries) and calling the no-argument getBytes() method, for a start.
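To make the advice about Readers concrete, here is a minimal sketch that wraps a stream in an InputStreamReader with an explicit charset. The class name ExplicitCharsetRead and the helper readFirstLine are made up for illustration, and the byte array only stands in for a real file or socket.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not from the original thread.
final class ExplicitCharsetRead {
    // Reads the first line of a stream, decoding with the given charset
    // instead of whatever the platform default happens to be.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // German greeting with u-umlaut and sharp s, written with escapes
        // so the source file's own encoding cannot interfere.
        String original = "Gr\u00fc\u00dfe";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String line = readFirstLine(new ByteArrayInputStream(utf8),
                                    StandardCharsets.UTF_8);
        System.out.println(line.equals(original)); // true
    }
}
```

The same pattern applies on the output side with OutputStreamWriter and an explicit charset.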
hi DrClap, thanks for your post.
I use ByteBuffer with getBytes() to convert the data from UTF-8 to ISO-8859-1, and this method works great with French, Spanish and German characters, including umlauts, in test mode. On the same computer, running from a jar, it fails with umlauts.
I am at a loss on how I am going to mitigate this problem.
//input is a String()
Charset utf8charset = Charset.forName( "UTF-8" );
Charset iso88591charset = Charset.forName( "ISO-8859-1" );
ByteBuffer inputBuffer = ByteBuffer.wrap( input.getBytes() );
// decode UTF-8
CharBuffer data = utf8charset.decode( inputBuffer );
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode( data );
byte[] outputData = outputBuffer.array();
return new String( outputData );
So you start with a string called "input"; where did that come from? As far as we know, it could already have been corrupted.
Here you convert the string to a byte array using the default encoding. You say you've set the default to UTF-8, but how do you know it worked on the customer's machine? When we advise you not to rely on the default encoding, we don't mean you should override that system property; we mean you should always specify the encoding in your code. There's a getBytes() method that lets you do that.
ByteBuffer inputBuffer = ByteBuffer.wrap( input.getBytes() );
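For illustration, a small sketch of the getBytes() overload that takes a charset. The class and method names (ExplicitGetBytes, toUtf8, toLatin1) are invented for this example; the point is only that the result no longer depends on the machine's default.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not from the original thread.
final class ExplicitGetBytes {
    // Encodes with a named charset, so the result is the same on every machine.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String input = "\u00fc"; // u with umlaut
        System.out.println(Arrays.toString(toUtf8(input)));   // [-61, -68]
        System.out.println(Arrays.toString(toLatin1(input))); // [-4]
    }
}
```

The same character becomes two bytes in UTF-8 and one byte in ISO-8859-1, which is why guessing the encoding wrong corrupts umlauts.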
Now you decode the byte array that you think is UTF-8, as UTF-8. If getBytes() did in fact encode the string as UTF-8, this is a wash; you just wasted a lot of time and ended up with the exact same string you started with. On the other hand, if getBytes() used something other than UTF-8, you've just created a load of garbage.
CharBuffer data = utf8charset.decode( inputBuffer );
Next you create yet another byte array, this time using the ISO-8859-1 encoding. If the string was valid to begin with, and the previous steps didn't corrupt it, there could be characters in it that can't be encoded in ISO-8859-1. Those characters will be lost.
ByteBuffer outputBuffer = iso88591charset.encode( data );
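A quick way to see that loss, sketched with invented names (Latin1Loss, fitsLatin1, roundTripLatin1): the euro sign has no ISO-8859-1 code, so encoding replaces it with the charset's replacement byte, a question mark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not from the original thread.
final class Latin1Loss {
    // True if every character of s has an ISO-8859-1 code.
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    // Encodes to ISO-8859-1 and decodes again; unmappable characters
    // come back as '?', because getBytes(Charset) substitutes them.
    static String roundTripLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("M\u00fcller"));   // true: u-umlaut is in Latin-1
        System.out.println(fitsLatin1("\u20ac5"));       // false: euro sign is not
        System.out.println(roundTripLatin1("\u20ac5"));  // ?5
    }
}
```

Checking with canEncode() first lets you decide what to do about such characters instead of losing them silently.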
Finally, you decode the byte array once more, this time using the default encoding. As with getBytes(), there's a String constructor that lets you specify the encoding, but it doesn't really matter. For the previous steps to have worked, the default had to be UTF-8. That means you have a byte array that's encoded as ISO-8859-1 and you're decoding it as UTF-8. What's wrong with this picture?
byte[] outputData = outputBuffer.array();
return new String( outputData );
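To show what goes wrong in that last step, here is a sketch that names both charsets explicitly, simulating a machine whose default is UTF-8. The class name Mojibake and the method latin1BytesReadAsUtf8 are made up for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not from the original thread.
final class Mojibake {
    // Encodes as ISO-8859-1, then (wrongly) decodes those bytes as UTF-8,
    // which is what the last step does when the default is UTF-8.
    static String latin1BytesReadAsUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "M\u00fcller"; // contains u with umlaut
        String damaged = latin1BytesReadAsUtf8(original);
        // The single byte 0xFC is not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD for it.
        System.out.println(damaged.equals(original));   // false
        System.out.println(damaged.contains("\uFFFD")); // true
        // Plain ASCII is identical in both encodings, so it survives.
        System.out.println(latin1BytesReadAsUtf8("Miller").equals("Miller")); // true
    }
}
```

This would also explain why text without accented characters appears to work: ASCII bytes mean the same thing in both encodings.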
This whole sequence makes no sense anyway; at best, it's a huge waste of clock cycles. It looks like you're trying to change the encoding of the string, which is impossible. No matter what platform it runs on, Java always uses the same encoding for strings. That encoding is UTF-16, but you don't really need to know that. You should only have to deal with character encodings when your app communicates with something outside itself, like a network or a file system.
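If what you actually have is bytes in one encoding and you need bytes in another (say, at a file or network boundary), a minimal sketch looks like this. The names Transcode and utf8ToLatin1 are invented here; the String in the middle has no "encoding" you need to care about.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not from the original thread.
final class Transcode {
    // Decode the incoming bytes with the encoding they were written in,
    // then encode the resulting String with the encoding the receiver expects.
    static byte[] utf8ToLatin1(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xBC}; // u-umlaut in UTF-8
        byte[] latin1 = utf8ToLatin1(utf8);
        System.out.println(Arrays.toString(latin1)); // [-4]
    }
}
```

Note that the conversion goes from bytes to bytes; there is never a step that turns a String into another String.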
What's the real problem you're trying to solve?
You might find the following website useful. It gives explanations of Unicode, UTF, character encoding, input and output, etc. that are as good as any I've found, with useful code examples. It's written for the Vietnamese language, but it applies directly to other encodings if you just change the charset. Highly recommended.
Stem the URL to http://vietunicode.sourceforge.net/ for an extensive set of FAQs and links.
RyanAllaby wrote: I guess the real issue is determining the default encoding used by a web service I am consuming.
Does this web service involve XML? If so, then you shouldn't have to worry about "default" encoding, whatever that might mean. The XML should declare its encoding, or if it doesn't, it should use UTF-8 or UTF-16. At any rate, your XML parser should take care of determining the encoding.
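As a sketch of letting the XML parser work out the encoding: hand it the raw bytes (an InputStream), not a String or Reader you decoded yourself. The class name XmlDeclaredEncoding and the method rootText are made up for the example, which uses the JAXP parser that ships with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not from the original thread.
final class XmlDeclaredEncoding {
    // Parses raw bytes; the parser reads the encoding from the XML declaration.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>M\u00fcller</name>";
        // The document is stored as ISO-8859-1, matching its declaration.
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(latin1).equals("M\u00fcller")); // true
    }
}
```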
If it isn't XML, then perhaps the easiest course of action would be to ask the owner of the service what encoding you should use.
Also note that if the document is being transferred via HTTP, and there's an HTTP header which specifies the charset, this value overrides whatever the document says is its encoding. Again, an XML parser should take this into account, but often people don't pass the HTTP URL to the parser, they do the HTTP connection themselves and then pass the resulting InputStream. In that case it would be your responsibility to extract the charset from the HTTP connection and make sure the parser uses that for the document's encoding.
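If you do handle the HTTP connection yourself, pulling the charset out of the Content-Type header might look roughly like this. It is a simplified sketch, not a full header parser; the class ContentTypeCharset and the method charsetOf are invented names, and the header string would typically come from something like URLConnection.getContentType().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not from the original thread.
final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/xml; charset=ISO-8859-1", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=ISO-8859-1",
                                     StandardCharsets.UTF_8)); // ISO-8859-1
        System.out.println(charsetOf("text/xml",
                                     StandardCharsets.UTF_8)); // UTF-8
    }
}
```

You could then pass the result to an InputStreamReader wrapped around the connection's stream, or set it on the InputSource you give to the parser.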
It's also possible that the producer of the document has screwed up the encoding. But I would first assume you are the one screwing up the encoding. (I thought Google was screwing up the encoding of an XML document until I found the rule that I described in the previous paragraph; then I realized it was me screwing it up.)