8 Replies Latest reply: Jul 11, 2009 12:05 AM by 796365

    German Umlauts OK in Test Environment, Question Marks (??) in production

    843810
      Hi Sun Forums,

      I have a simple Java application that uses a JFrame for its window and a JTextArea for console output. When I run the application in test mode (that is, locally within the Eclipse development environment), it handles all German umlauts in the JTextArea properly. I also use Log4J to write the same output to a file, and that is OK too. In fact, the application is flawless from this perspective.

      However, when I deploy the application to multiple environments, the umlauts are displayed as ??. Deployment targets are Mac OS X (10.4/10.5) and Windows-based computers (XP, Vista), with Java 1.5 as the minimum requirement.

      On the test computer (Mac OS X 10.5), the test environment is OK, but when I run the application as a runnable jar, German umlauts become question marks (??). I use Jar Bundler on Mac to produce an application object, and Launch4J to build a Windows executable.

      I am setting the default encoding to UTF-8 at the start of my app. Other international characters (e and a with accents) are handled correctly after deployment. The failure seems to be limited to German umlaut characters.

      I have encoded my source files as UTF-8 in Eclipse. I am having a hard time understanding what the root cause is. I suspect it is the default encoding on the computer the software is running on. If this is true, then how do I force the application to honor German umlauts?

      Thanks very much,

      Ryan Allaby
      RA-CC.COM
      J2EE/Java Developer

      Edited by: RyanAllaby on Jul 10, 2009 2:50 PM
        • 1. Re: German Umlauts OK in Test Environment, Question Marks (??) in production
          DrClap
          RyanAllaby wrote:
          I suspect it is the default encoding on the computer the software is running on. If this is true, then how do I force the application to honor German umlauts?
          I suspect you are right. So don't do anything which uses the default encoding. That includes the creation of Readers and Writers (use InputStreamReader and OutputStreamWriter as intermediaries) and the use of the getBytes() method for a start.
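A minimal sketch of the advice above, not code from the original application: wrap the byte stream in an OutputStreamWriter or InputStreamReader and name the charset explicitly, so the platform default never comes into play. The class name `ExplicitCharsetIo` and its two methods are hypothetical, and in-memory streams stand in for a real file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper: every byte<->char conversion names its charset.
final class ExplicitCharsetIo {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Encode text through a Writer that is told to use UTF-8.
    static byte[] write(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            Writer out = new OutputStreamWriter(bytes, UTF8);
            out.write(text);
            out.close(); // closing flushes the encoder into the byte stream
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Decode bytes through a Reader that is told to use UTF-8.
    static String read(byte[] data) {
        StringBuilder text = new StringBuilder();
        try {
            Reader in = new InputStreamReader(new ByteArrayInputStream(data), UTF8);
            char[] buffer = new char[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            in.close();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "Gr\u00fc\u00dfe aus K\u00f6ln";
        byte[] encoded = write(original);
        System.out.println(read(encoded).equals(original));
    }
}
```

For a real file, the same pattern would apply with a FileOutputStream or FileInputStream in place of the in-memory streams.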
          • 2. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
            843810
            Hi DrClap, thanks for your post.

            I use ByteBuffer with getBytes() to convert the data from UTF-8 to ISO-8859-1, and this method works great with French, Spanish and German characters, including umlauts, in test mode. On the same computer, running from a jar, it fails with umlauts.

            I am at a loss on how I am going to mitigate this problem.
                 // input is a String
                 Charset utf8charset = Charset.forName( "UTF-8" );
                 Charset iso88591charset = Charset.forName( "ISO-8859-1" );
                 ByteBuffer inputBuffer = ByteBuffer.wrap( input.getBytes() );
                 // decode UTF-8
                 CharBuffer data = utf8charset.decode( inputBuffer );

                 // encode ISO-8859-1
                 ByteBuffer outputBuffer = iso88591charset.encode( data );

                 byte[] outputData = outputBuffer.array();
                 return new String( outputData );
            • 3. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
              843810
              So you start with a string called "input"; where did that come from? As far as we know, it could already have been corrupted.
              ByteBuffer inputBuffer = ByteBuffer.wrap( input.getBytes() );
              Here you convert the string to a byte array using the default encoding. You say you've set the default to UTF-8, but how do you know it worked on the customer's machine? When we advise you not to rely on the default encoding, we don't mean you should override that system property; we mean you should always specify the encoding in your code. There's a getBytes() method that lets you do that.
              CharBuffer data = utf8charset.decode( inputBuffer );
              Now you decode the byte[] that you think is UTF-8, as UTF-8. If getBytes() did in fact encode the string as UTF-8, this is a wash; you just wasted a lot of time and ended up with the exact same string you started with. On the other hand, if getBytes() used something other than UTF-8, you've just created a load of garbage.
              ByteBuffer outputBuffer = iso88591charset.encode( data );
              Next you create yet another byte array, this time using the ISO-8859-1 encoding. If the string was valid to begin with, and the previous steps didn't corrupt it, there could be characters in it that can't be encoded in ISO-8859-1. Those characters will be lost.
              byte[] outputData = outputBuffer.array();
              return new String( outputData );
              Finally, you decode the byte[] once more, this time using the default encoding. As with getBytes(), there's a String constructor that lets you specify the encoding, but it doesn't really matter. For the previous steps to have worked, the default had to be UTF-8. That means you have a byte[] that's encoded as ISO-8859-1 and you're decoding it as UTF-8. What's wrong with this picture?

              This whole sequence makes no sense anyway; at best, it's a huge waste of clock cycles. It looks like you're trying to change the encoding of the string, which is impossible. No matter what platform it runs on, Java always uses the same encoding for strings. That encoding is UTF-16, but you don't really need to know that. You should only have to deal with character encodings when your app communicates with something outside itself, like a network or a file system.
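To make the points above concrete, here is a small sketch, under my own assumptions, of what happens when the charset is named explicitly at each conversion and what happens when the two sides disagree. The class `CharsetBoundaryDemo` and its `recode` method are hypothetical. `String.getBytes(Charset)` and `new String(byte[], Charset)` need Java 6; on Java 5 the overloads taking a charset name do the same job but throw a checked UnsupportedEncodingException.

```java
import java.nio.charset.Charset;

// Hypothetical demo: encode with one named charset, decode with another.
final class CharsetBoundaryDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith); // explicit, never the platform default
        return new String(bytes, decodeWith);     // explicit here as well
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // u with diaeresis
        String euro = "\u20ac";   // euro sign, which ISO-8859-1 cannot represent

        // Same charset on both sides: the string survives unchanged.
        System.out.println(recode(umlaut, UTF8, UTF8).equals(umlaut));

        // UTF-8 bytes read as ISO-8859-1: one character turns into two.
        System.out.println(recode(umlaut, UTF8, LATIN1).equals("\u00c3\u00bc"));

        // ISO-8859-1 bytes read as UTF-8: the lone byte is malformed UTF-8,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(recode(umlaut, LATIN1, UTF8).indexOf('\uFFFD') >= 0);

        // Unencodable characters are silently replaced by '?' when encoding.
        System.out.println(recode(euro, LATIN1, LATIN1).equals("?"));
    }
}
```

The last case is one plausible way for question marks to appear: encoding to a charset that cannot represent a character replaces it with '?'. Umlauts themselves are representable in ISO-8859-1, so in the code being discussed a charset mismatch is the more likely cause.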

              What's the real problem you're trying to solve?
              • 4. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
                796365
                You might find the following website useful. Its explanations of Unicode, UTF, character encoding, input/output, etc. are as good as any I've found, and it includes useful code examples. It's written for the Vietnamese language, but it applies directly to any encoding if you just change the charset. Highly recommended.

                [http://vietunicode.sourceforge.net/howto/]

                Trim the URL to [http://vietunicode.sourceforge.net/] for an extensive set of FAQs and links.
                • 5. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
                  843810
                  > What's the real problem you're trying to solve?
                  I guess the real issue is determining the default encoding used by a web service I am consuming.
                  • 6. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
                    DrClap
                    RyanAllaby wrote:
                    I guess the real issue is determining the default encoding used by a web service I am consuming.
                    Does this web service involve XML? If so then you shouldn't have to worry about "default" encoding -- whatever that might mean. The XML should declare its encoding, or if it doesn't, it should use UTF-8 or UTF-16. At any rate your XML parser should take care of determining the encoding.

                    If it isn't XML, then perhaps the easiest course of action would be to ask the owner of the service what encoding you should use.
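For the XML case described above, a rough sketch of letting the parser work out the encoding: hand it the raw bytes (an InputStream, not a String or Reader) and it reads the encoding from the XML declaration. The class `XmlDeclaredEncodingDemo` and the sample `<msg>` document are made up for illustration. The JAXP and DOM calls are the standard ones shipped with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo: the parser, not our code, decodes the bytes.
final class XmlDeclaredEncodingDemo {

    // Parse raw bytes and return the text of the root element.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String body = "<msg>Gr\u00fc\u00dfe</msg>";

        // The same document serialized twice, each with a matching declaration.
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(Charset.forName("ISO-8859-1"));
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(Charset.forName("UTF-8"));

        // Both byte sequences differ, yet both parse to the same Java string.
        System.out.println(rootText(latin1).equals("Gr\u00fc\u00dfe"));
        System.out.println(rootText(utf8).equals("Gr\u00fc\u00dfe"));
    }
}
```

This only works when the declaration matches the actual bytes. If the producer declares one encoding and writes another, the parser has no way to know.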
                    • 7. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
                      843810
                      I see that the WSDL file is declared as ISO-8859-1, yet the owner of the web service says it should be encoded as UTF-8.

                      Is there a book you would recommend on this issue?

                      I appreciate your help DrClap.

                      thanks,

                      ryan
                      • 8. Re: German Umlauts OK in Test Environment, Question Marks (??) in productio
                        DrClap
                        Also note that if the document is being transferred via HTTP, and there's an HTTP header which specifies the charset, this value overrides whatever the document says is its encoding. Again, an XML parser should take this into account, but often people don't pass the HTTP URL to the parser, they do the HTTP connection themselves and then pass the resulting InputStream. In that case it would be your responsibility to extract the charset from the HTTP connection and make sure the parser uses that for the document's encoding.

                        It's also possible that the producer of the document has screwed up the encoding. But I would first assume you are the one screwing up the encoding. (I thought Google was screwing up the encoding of an XML document until I found the rule that I described in the previous paragraph; then I realized it was me screwing it up.)
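The header rule described above can be sketched roughly as follows, assuming you open the HTTP connection yourself and hand the parser an InputStream. The class `HttpCharsetXml` and both of its methods are hypothetical, and the Content-Type parsing is deliberately simple. `InputSource.setEncoding` is the standard SAX way to tell the parser which encoding to use for a byte stream.

```java
import java.io.InputStream;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: pass the charset from the HTTP header on to the parser.
final class HttpCharsetXml {

    // Return the charset parameter of a Content-Type value, or null if absent.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ENGLISH).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip quotes
                }
                return value.length() == 0 ? null : value;
            }
        }
        return null;
    }

    // Parse the response body, using the header's charset when one is given.
    static Document parse(InputStream body, String contentType) {
        InputSource source = new InputSource(body);
        String charset = charsetOf(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        } // otherwise the parser falls back to its own detection
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

With a java.net.URLConnection named `conn`, the call would look like `HttpCharsetXml.parse(conn.getInputStream(), conn.getContentType())`.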