This discussion is archived
4 Replies Latest reply: Apr 20, 2010 4:05 AM by 843810

Japanese Characters Encoding Problem

843810 Newbie
Hi All,
I have been looking at the problems posted in this forum and quite a few describe the issue I am facing currently but none has been able to provide a solution.

The problem I am facing is as follows:

Step 1: I am retrieving Japanese data from Oracle DB 9i (Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit) using standard JDBC API calls. [NLS_CHARACTERSET : AL32UTF8,  NLS_NCHAR_CHARACTERSET : AL16UTF16]

byte[] title = resultSet.getBytes("COLUMN_NAME");

Step 2: I pass the retrieved bytes to a method that decodes them as SJIS and returns the resulting String.

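import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

private String getStringSJIS(byte[] bytesToBeEncoded) {
          StringBuilder sb = new StringBuilder();
          // Nothing to decode: return an empty String.
          if (bytesToBeEncoded == null) {
               return sb.toString();
          }
          try {
               // Decode the raw bytes as SJIS, one character at a time.
               ByteArrayInputStream bais = new ByteArrayInputStream(bytesToBeEncoded);
               InputStreamReader isr = new InputStreamReader(bais, "SJIS");

               for (int c = isr.read(); c != -1; c = isr.read()) {
                    sb.append((char) c);
               }
          } catch (IOException ex) {
               // Surface the failure instead of silently swallowing it.
               throw new RuntimeException(ex);
          }
          return sb.toString();
}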

Step 3: I am using an HTML Parser JAR to print the decimal values of the decoded String.
String after = getStringSJIS(title);
System.out.println(Translate.encode(after));

I get an output of String 1: ツ禿コツ本ツ古ェツサツイツト
which contains 14 decimal character codes.

The same data is being read by another application that uses JDBC again and connects to the same DB and returns the decimal values as: String 2: 日本語サイト

These two Strings look significantly different when viewed in the browser.
It seems String 1 contains single-byte half-width characters and String 2 does not. Does anyone know why the bytes are being modified when they are retrieved from the database for the same column value?
  • 1. Re: Japanese Characters Encoding Problem
    843810 Newbie
    When you retrieve the byte array from the resultset, it appears to be in either UTF-8 or UTF-16 encoding. But you're trying to decode it as SJIS, so you get garbage.
  • 2. Re: Japanese Characters Encoding Problem
    843810 Newbie
    The encoding for the bytes being returned from the database is Cp1252 but this encoding, I understand, depends on the underlying platform I am using.

    If indeed the data from the DB is in UTF-8 or UTF-16, shouldn't it be displayed correctly in the browser? No encoding/decoding should be required on the data then. In the browser it gets displayed as “ú–{ŒêƒTƒCƒg. (The encoding of the JSP page is set to UTF-8.)
  • 3. Re: Japanese Characters Encoding Problem
    843810 Newbie
    Why do you want to retrieve the text as raw bytes, anyway? If you set up the database properly, you should be able to use getString() instead of getBytes(), and not have to worry about the encoding. Anyway, it looks like the text is actually encoded as SJIS in the DB, but it's being converted (incorrectly) to UTF-8 upon retrieval.

    Here's what seems to be happening. The field you're accessing contains these bytes:
    93 FA 96 7B 8C EA 83 54 83 43 83 67
    which represents this text in SJIS encoding:
    日本語サイト
    But the DB seems to think it's encoded as ISO-8859-1, and when you retrieve it with the getBytes() call, it's being encoded as UTF-8, resulting in this byte sequence:
    C2 93 C3 BA C2 96 7B C2 8C C3 AA C2 83 54 C2 83 43 C2 83 67
    If you decode the original bytes (before the UTF-8 step) as cp1252 you get
    “ú–{ŒêƒTƒCƒg
    If you decode the UTF-8 byte sequence as SJIS, as your getStringSJIS() method does, you get
    ツ禿コツ本ツ古ェツサツイツト
    But, as I said earlier, you shouldn't have to decode anything yourself. You just need to set up the database (or at least the connection) so it knows that column contains text in the SJIS encoding, then retrieve it with getString().
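The chain described above can be reproduced outside the database. The following is a minimal sketch, assuming a JDK that ships the Shift_JIS charset (it is not one of the charsets Java guarantees); the class and method names (MojibakeDemo, hex, corrupt) are hypothetical and only serve the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Assumption: Shift_JIS is available in this JDK (standard distributions include it).
    static final Charset SJIS = Charset.forName("Shift_JIS");

    /** Formats bytes as space-separated uppercase hex, e.g. "93 FA". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Misreads SJIS bytes as ISO-8859-1, then re-encodes the result as UTF-8. */
    static byte[] corrupt(byte[] sjisBytes) {
        String misread = new String(sjisBytes, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The six characters of the sample text, written as escapes
        // so the source file's own encoding cannot interfere.
        String text = "\u65E5\u672C\u8A9E\u30B5\u30A4\u30C8";
        byte[] original = text.getBytes(SJIS);
        byte[] corrupted = corrupt(original);

        System.out.println(hex(original));   // the bytes stored in the field
        System.out.println(hex(corrupted));  // the bytes seen after retrieval
        // Decoding the corrupted bytes as SJIS gives 14 characters, not 6.
        System.out.println(new String(corrupted, SJIS).length());
    }
}
```

Comparing the two hex lines with the byte sequences quoted above is a quick way to confirm that this is the corruption actually taking place.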
  • 4. Re: Japanese Characters Encoding Problem
    843810 Newbie
    Thanks a lot for the detailed explanation. It makes sense.

    I did remove the getBytes() call and used the simple getString() but that still doesn't work. (Let's call this database DB1).

    Surprisingly, I have another configured schema that also holds Japanese data, and this database (let's say DB2) returns the correct characters with no additional configuration required at any level. Just a simple getString() works.

    Both of these Oracle databases have NLS_CHARACTERSET set to AL32UTF8.

    The only difference has been the way I have been configuring the JDBC connection. For DB2, the connection has been configured using certain IBM-specific classes like DBSelect, DBStatement etc. But for DB1, I had been using the simple Class.forName(drivername) connection setup.

    Anyway, for DB1 also, I used the same IBM-specific code + I also used the Initial Context/Data Source-returned connection but it still isn't working.

    My guess is that the way the data has been inserted in DB1 may be causing the retrieval problem. I will now try to insert the data first and then retrieve it so as to ascertain if the bytes are lost anywhere during the round-trip.
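For a round-trip check like this, it may be easier to compare code points than rendered characters, since consoles and browsers apply their own encodings. A minimal sketch, assuming Java 8 or later; the helper name dump is hypothetical. Correctly stored text should show code points such as U+65E5, while values in the U+0080 to U+00FF range would suggest the SJIS bytes were stored as if they were single-byte characters.

```java
class CodePointDump {
    /** Lists each code point of s as U+XXXX, separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // In practice the argument would be the value returned by getString().
        System.out.println(dump("\u65E5\u672C"));  // two correctly decoded characters
        System.out.println(dump("\u0093\u00FA"));  // one character misread as two
    }
}
```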

    Can you please let me know how to specify the charset during the connection creation process?
    I tried using the CONVERT function, but that does not work either. I did come across a CodePageOverride property, but this seems to be specific to the WebLogic driver. Any help will be appreciated! Thanks!
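Independently of any connection setting, a client-side workaround is possible if the analysis in reply 3 holds, that is, if getString() returns one character in the U+0000 to U+00FF range per original SJIS byte. The sketch below is only that, a workaround under this assumption, not a connection-level fix; the class and method names (SjisRepair, repair) are hypothetical, and it would not work if the bytes had been mapped through cp1252 rather than ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SjisRepair {
    // Assumption: Shift_JIS is available in this JDK (standard distributions include it).
    static final Charset SJIS = Charset.forName("Shift_JIS");

    /**
     * Turns each char back into the single byte it was misread from
     * (ISO-8859-1 maps U+0000..U+00FF to the same byte values),
     * then decodes those bytes as SJIS.
     */
    static String repair(String misread) {
        byte[] raw = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, SJIS);
    }

    public static void main(String[] args) {
        // The misread form of the sample text, one char per original SJIS byte.
        String misread = "\u0093\u00FA\u0096{\u008C\u00EA\u0083T\u0083C\u0083g";
        String repaired = repair(misread);
        System.out.println(repaired.length());
    }
}
```

Repairing the stored data itself would remove the need for this step on every read.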