Discussions

Where is the Multi-Byte Character?

13 Comments

  • jschellSomeoneStoleMyAlias
    Member Posts: 24,877 Gold Badge
    Sanath_K wrote:
    Hello All

    While reading data from the DB, our middleware interface gave the following error.
    java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

    I understand that this failure is because of a multi-byte character, and that the 10g driver will fix this bug.
    I suggested that the integration admin team replace the current 9i driver with the 10g one, and they are on it.
    Any such problem would of course not be caused by a character but rather by a character set.

    Although it is possible that an Oracle driver has a bug in it, the Oracle drivers have been handling character set conversions for years.
    Naturally if you do not set up the driver correctly then it will cause a problem.
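
    For what it's worth, one way to sanity-check the setup is to ask the database which character sets it is actually configured with before blaming the driver. A rough sketch (the connection details are made up, substitute your own):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NlsCharsetCheck
    {
      public static void main(String[] args) throws Exception {
        // Hypothetical connection details - use your own host/SID/credentials.
        Connection con = DriverManager.getConnection(
            "jdbc:oracle:thin:@dbhost:1521:ORCL", "scott", "tiger");
        Statement stmt = con.createStatement();
        // NLS_CHARACTERSET is what the server stores CHAR/VARCHAR2 data in;
        // NLS_NCHAR_CHARACTERSET is used for NCHAR/NVARCHAR2 columns.
        ResultSet rs = stmt.executeQuery(
            "SELECT parameter, value FROM nls_database_parameters"
            + " WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET')");
        while (rs.next()) {
          System.out.println(rs.getString(1) + " = " + rs.getString(2));
        }
        rs.close();
        stmt.close();
        con.close();
      }
    }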
  • jschellSomeoneStoleMyAlias
    Member Posts: 24,877 Gold Badge
    Sanath_K wrote:
    My challenge is to spot the multi-byte character hidden in this big data file.
    ..
    With this, I randomly placed Chinese and Korean characters and the program found them.
    Any alternative (better code, of course) way to catch this black sheep?
    That of course is ridiculous.

    Bytes are encoded to represent a character set. If you have a data file with text in it, then that text is encoded in at least one, and perhaps more than one, character set.

    Attempting to determine the character set of a file is generally a non-deterministic problem - a computer cannot determine the answer all the time.
    It will definitely not be able to do it with a misplaced single character from another character set.

    So your problem is actually one of the following:
    - Determine what the character set is rather than using the wrong one.
    - Attempt to scrub the incoming data to remove garbage (because that is what wrong character set characters would be) from the data before attempting to insert it into the database, and provide enough error detection that problems can be dealt with manually. In this case you are NOT attempting to recognize characters from another character set, but rather excluding anything that doesn't fit into the set that you are using (see the sketch below).
    - Make the source of the file start producing a file in a single/correct character set.
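
    For the scrubbing option, a rough sketch (it assumes, purely as an example, that the target character set is UTF-8; substitute whatever set you actually use):

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class CharsetScrubber
    {
      public static void main(String[] args) throws Exception {
        // Slurp the file into memory - fine for a one-off check, not for huge files.
        FileInputStream in = new FileInputStream(args[0]);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
          buffer.write(chunk, 0, n);
        }
        in.close();

        // Decode as the character set the database expects; anything that does not
        // fit is replaced with '?' instead of killing the whole load.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith("?");
        String scrubbed = decoder.decode(ByteBuffer.wrap(buffer.toByteArray())).toString();
        System.out.print(scrubbed);
      }
    }

    Use CodingErrorAction.REPORT instead if you would rather have the load fail with a CharacterCodingException, so the bad records can be dealt with manually.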
  • 800308
    Member Posts: 8,227
    JoachimSauer wrote:
    corlettk wrote:
    2. How many bytes (a signed 8-bit integer value) exceed 300?
    301 bytes exceed 300 bytes!
    EPIC FAIL ;-)
  • 800308
    Member Posts: 8,227
    Sanath_K wrote:
    lakhs of records
    I'm stealing that word. It's just so... ummm.... "pithy"!
    A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
    ~~ http://en.wikipedia.org/wiki/Lakh
  • 800308
    Member Posts: 8,227
    Folks,

    Combining the just-read-single-bytes technique suggested by endasil with the bit-twiddling-highorder-test suggested by tschodt, I get:

    (*NOTE:* The source code is its own test data. It contains two "extended characters" in the comment, and therefore must be saved in a UTF-8 (or equivalent) encoded file, and compiled using [javac's|http://www.manpagez.com/man/1/javac/] -encoding UTF8 argument.)

    ExtendedCharacterDetector.java
    package forums;
    
    import java.io.FileInputStream;
    import java.io.BufferedInputStream;
    
    // ÊPIC FÃIL!
    
    public class ExtendedCharacterDetector
    {
      public static void main(String[] args) {
        String filename = args.length>0 ? args[0] : "ExtendedCharacterDetector.java";
        try {
          BufferedInputStream input = null;
          try {
            input = new BufferedInputStream(new FileInputStream(filename));
            int count = 0;
            for ( int i=0,b=-1; (b=input.read()) != -1; ++i ) {
              if ( (b&0x80) != 0 ) {
                System.out.println("byte "+i+" is "+b);
                count++;
              }
            }
            System.out.println("Number of bytes exceeding "+0x80+" = "+count);
          } finally {
            if(input!=null)input.close();
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
    output
    byte 94 is 195
    byte 95 is 138
    byte 101 is 195
    byte 102 is 131
    Number of bytes exceeding 128 = 4
    This still isn't very useful though, is it? Bytes 94 and 95 are "busted"... Goodo! So WTF is byte 94? I suppose you could download a [free HEX editor|http://www.google.com/search?q=free+hex+editor] and use that to spy-out the offending characters... (I use NEO, it works).

    So... presuming that you know what encoding the file is supposed to be in... I am still of the humble opinion that your users would be better served if you read characters from the file, and reported character offsets and values to the user... the "extendedness test" remains logically the same... extended (non 7-bit-ascii) characters have a value of 128 (2^7) or above, and this is (AFAIK) the same no matter which charset the character has been encoded in, because all modern charsets use the same "basic ascii table" (i.e. code points below 128).
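
    Something like this, for instance (a rough sketch; it assumes the file really is UTF-8, so swap in whatever encoding you actually expect):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class ExtendedCharacterReporter
    {
      public static void main(String[] args) throws Exception {
        String filename = args.length > 0 ? args[0] : "ExtendedCharacterDetector.java";
        // Decode characters while reading, instead of inspecting raw bytes.
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(filename), "UTF-8"));
        try {
          int count = 0;
          int c;
          for (int i = 0; (c = reader.read()) != -1; ++i) {
            if (c > 127) { // outside 7-bit ASCII
              System.out.println("char " + i + " is '" + (char) c
                  + "' (U+" + Integer.toHexString(c).toUpperCase() + ")");
              count++;
            }
          }
          System.out.println("Number of extended characters = " + count);
        } finally {
          reader.close();
        }
      }
    }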

    Cheers. Keith.
  • 807580
    Member Posts: 33,048 Green Ribbon
    It occurs to me that the underlying codecs, defined in java.nio.charset or thereabouts, work between ByteBuffer and CharBuffer, so if you put your data file contents into a ByteBuffer (or map the file to a ByteBuffer) and run the decoder directly, it should leave the buffer pointers at the point of failure.

    Could save you a lot of bit twiddling.
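
    Something along these lines, perhaps (a rough sketch, not compiled; it assumes the file is supposed to be UTF-8):

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.CodingErrorAction;

    public class DecodeFailureLocator
    {
      public static void main(String[] args) throws Exception {
        // Map the whole file into a ByteBuffer.
        FileChannel channel = new FileInputStream(args[0]).getChannel();
        ByteBuffer bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        // A strict decoder: stop on the first malformed sequence instead of replacing it.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

        CharBuffer chars = CharBuffer.allocate((int) channel.size());
        CoderResult result = decoder.decode(bytes, chars, true);
        if (result.isError()) {
          // The decoder leaves the input buffer positioned at the offending bytes.
          System.out.println("Decoding failed at byte offset " + bytes.position()
              + " (" + result.length() + " bad byte(s))");
        } else {
          System.out.println("File decoded cleanly");
        }
        channel.close();
      }
    }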

    But I'm not sure how knowing the offset of the problem is going to help you. The likely cause is that you've somehow told the middleware that this text is UTF-8 when it's in some other encoding, and byte offsets won't help you there.

    In other words you probably have a configuration problem.
  • 807580
    Member Posts: 33,048 Green Ribbon
    corlettk wrote:
    Sanath_K wrote:
    lakhs of records
    I'm stealing that word. It's just so... ummm.... "pithy"!
    A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
    ~~ http://en.wikipedia.org/wiki/Lakh
    Laks is Dutch for "extremely negligent".
  • DrClap
    Member Posts: 25,479
    BalusC wrote:
    Laks is Dutch for "extremely negligent".
    Just like the English word lax only more emphatic.
  • sanath_k
    Member Posts: 62
    As the DBA suggested, the file was extracted with the select query, where CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.
  • 791266
    Member Posts: 18,005
    Sanath_K wrote:
    As the DBA suggested, the file was extracted with the select query, where CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.
    I wonder why, considering that you had so many bugs in your original code.
This discussion has been closed.