Where is the Multi-Byte Character.

sanath_k
sanath_k Member Posts: 62
edited October 2009 in Java Programming
Hello All

While reading data from the DB, our middleware interface gave the following error.
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

I understand that this failure is caused by a multi-byte character, and that the 10g driver fixes this bug.
I suggested that the integration admin team replace the current 9i driver with the 10g one, and they are on it.

In addition to this, I wanted to tell the data input team exactly where the failure occurred.
I asked them for a download of the dat file, and my intention was to find out exactly where
the multi-byte character that caused this failure is located.

I wrote the following code to check this.
import java.io.*;
public class X
{
  public static void main(String ar[])
  {
    int linenumber=1,columnnumber=1;
    long totalcharacters=0;
    try
    {
      File file = new File("inputfile.dat");
      FileInputStream fin = new FileInputStream(file);
      byte fileContent[] = new byte[(int)file.length()];
      fin.read(fileContent);
      for(int i=0;i<fileContent.length;i++)
      {
        columnnumber++;totalcharacters++;
        if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300) // if invalid
        {System.out.println("failure at position: "+i);break;}
        if(fileContent[i]==10 || fileContent[i]==13) // if new line
        {linenumber++;columnnumber=1;}
      }
      fin.close();
      System.out.println("Finished successfully, total lines : "+linenumber+" total file size : "+totalcharacters);
    }
    catch (Exception e)
    {
      e.printStackTrace();
      System.out.println("Exception at Line: "+linenumber+" columnnumber: " +columnnumber);
    }
  }
}
But this shows that the file is good and no issue with this.
Whereas the middleware interface fails with the above exception while reading exactly the same input file.

Am I doing anything wrong in trying to locate that multi-byte character?
I would greatly appreciate any help, everyone!

Thanks

Comments

  • 807580
    807580 Member Posts: 33,048
    edited October 2009
    I have to admit that I do not know how to determine if some bytes constitute a legitimate UTF-8 value, perhaps there is something in Character that might help.

    However this if statement can't be what you want since as far as I can tell, it can never be true.
    if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300)

    What single value will satisfy the first and last conditions?
    
    Edited by: johndjr on Oct
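
To make the point concrete, here is a minimal sketch (the class name `SignedByteDemo` and its methods are made up for illustration, not taken from the thread). A Java `byte` is signed and ranges from -128 to 127, so no value can be both below 0 and above 300. Masking with `0xFF` recovers the unsigned value in the range 0..255.

```java
// Illustrative only: SignedByteDemo is a hypothetical name, not from the thread.
final class SignedByteDemo {

    /** The condition from the original post, applied to a single byte. */
    static boolean originalTest(byte b) {
        return b < 0 && b != 10 && b != 13 && b > 300;
    }

    /** Counts how many of the 256 possible byte values satisfy originalTest. */
    static int countMatches() {
        int matches = 0;
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            if (originalTest((byte) v)) {
                matches++;
            }
        }
        return matches;
    }

    /** Converts a signed byte to its unsigned value in the range 0..255. */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println("bytes matching the original condition: " + countMatches());
        byte lead = (byte) 0xC3; // first byte of a two-byte UTF-8 sequence
        System.out.println("signed: " + lead + ", unsigned: " + toUnsigned(lead));
    }
}
```

Running `main` should report zero matches, and show that the byte 0xC3 reads as -61 when signed and 195 after masking.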
  • 800308
    800308 Member Posts: 8,227
    Sanath,

    It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

    Pop quiz:

    1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

    2. How many bytes (a signed 8-bit integer value) exceed 300?

    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore that this technique will work reliably for you anyway)?

    4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

    5. Have you ever considered a career in the armed services?

    Cheers. Keith.
  • JoachimSauer
    JoachimSauer Member Posts: 4,780
    corlettk wrote:
    2. How many bytes (a signed 8-bit integer value) exceed 300?
    301 bytes exceed 300 bytes!
  • sanath_k
    sanath_k Member Posts: 62
    My challenge is to spot the multi-byte character hidden in this big dat file.
    This is because the data entry team asked me to spot out the record and column that has issue out of
    lakhs of records they sent inside this file.

    Lets have the validation code like this...
       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
    {System.out.println("failure at position: "+i);break;}
    < 0 - As I tested, some chars generated -ve values for some codes.
    > 300 - was a try to find out if any characters exceeds actual chars. range.
    10 and 13 are for line-feed.

    any alternative (better code ofcourse) way to catch this black sheep ?
  • sanath_k
    sanath_k Member Posts: 62
    edited October 2009
    My challenge is to spot the multi-byte character hidden in this big dat file.
    This is because the data entry team asked me to spot out the record and column that has issue out of
    lakhs of records they sent inside this file.

    Lets have the validation code like this...
       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
    {System.out.println("failure at position: "+i);break;}
    less than 0 - I saw some -ve values when I was testing with other files.
    greater than 300 - was a try to find out if any characters exceeds actual chars. range.
    10 and 13 are for line-feed.
    
    with this, I randomly placed chinese, korean characters and program found them.
    any alternative (better code ofcourse) way to catch this black sheep ?
    
    Edited by: Sanath_K on Oct
  • 807580
    807580 Member Posts: 33,048
    Sanath_K wrote:
       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
    {System.out.println("failure at position: "+i);break;}
    less than 0 - I saw some -ve values when I was testing with other files.
    greater than 300 - was a try to find out if any characters exceeds actual chars. range.
    10 and 13 are for line-feed.
    
    with this, I randomly placed chinese, korean characters and program found them.
    any alternative (better code ofcourse) way to catch this black sheep ?
    A less obfuscated way of doing that would be
       if( (fileContent[i]&0x80)!=0 ) // if not ASCII-7
    {System.out.println("failure at position: "+i);break;}
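
Since the original goal was to point the data entry team at a record and column, the high-bit test above could be combined with line and column counting. What follows is only a sketch under assumptions: `NonAsciiLocator` and `findFirstNonAscii` are hypothetical names, the whole file is assumed to be in memory already, and the column counts bytes, not characters.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not code from the thread.
final class NonAsciiLocator {

    /**
     * Returns {offset, line, column} for the first byte with the high bit set,
     * or null if every byte is 7-bit ASCII. Offset is 0-based; line and column
     * are 1-based, and the column counts bytes rather than characters.
     */
    static int[] findFirstNonAscii(byte[] data) {
        int line = 1;
        int column = 1;
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                return new int[] { i, line, column };
            }
            if (data[i] == '\n') { // a line feed starts a new line
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample data with one two-byte character on the second line.
        byte[] sample = "abc\nde\u00CAf".getBytes(Charset.forName("UTF-8"));
        int[] hit = findFirstNonAscii(sample);
        if (hit == null) {
            System.out.println("only 7-bit ASCII found");
        } else {
            System.out.println("offset " + hit[0] + ", line " + hit[1] + ", column " + hit[2]);
        }
    }
}
```

Only the line feed (10) advances the line counter here, so a CR LF pair counts as one line break instead of two, unlike the loop in the original post.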
  • 807580
    807580 Member Posts: 33,048
    edited October 2009
    corlettk wrote:
    Sanath,

    It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

    Pop quiz:

    1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

    2. How many bytes (a signed 8-bit integer value) exceed 300?

    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore that this technique will work reliably for you anyway)?

    4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

    5. Have you ever considered a career in the armed services?
    6. How much data do you think you've read when you do this:
    fin.read(fileContent);
    You might have read as little as one byte, meaning you're skimming over all but one byte of the file.
  • sanath_k
    sanath_k Member Posts: 62
    From right-click, File, Properties, I found the size: 12512196 bytes.
    The same value comes back from file.length(), from the byte array size before the for loop, and finally from the value I print to verify, i.e. totalcharacters.
    From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
  • 807580
    807580 Member Posts: 33,048
    edited October 2009
    Sanath_K wrote:
    From right-click, File, Properties, I found the size: 12512196 bytes.
    The same value comes back from file.length(), from the byte array size before the for loop, and finally from the value I print to verify, i.e. totalcharacters.
    From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
    If this is a disposable program, fine. But you're doing it wrong. And you still have serious issues with how you actually read in data. Namely, you look through every byte of fileContent but it's more than likely that only the first few bytes actually contain data from your file.
  • 807580
    807580 Member Posts: 33,048
    edited October 2009
    If you want the entire contents of the file in a byte array, here's how you can do it:
    FileInputStream fin = new FileInputStream("inputfile.dat"); // file name taken from the original post
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    int len;
    byte[] buf = new byte[1024];
    while ( (len = fin.read(buf)) != -1 ) {
       baos.write(buf, 0, len);
    }
    
    byte[] fileContents = baos.toByteArray();
    But you're probably fine looking at it chunk-by-chunk.

    In fact, if you're only interested in doing it byte-by-byte, just do this:
    BufferedInputStream bin = new BufferedInputStream(fin);
    for (int b = -1; (b = bin.read()) != -1; ) {
      //deal with this byte
    }
    Edited by: endasil on 23-Oct-2009 11:36 AM
  • sanath_k
    sanath_k Member Posts: 62
    lot of helpful comments on the logic...thanks.
    question still haunts...
    is there any way we can seek help from java code to locate multi-byte characters in a dat file ?
    I hope to help the data-entry team to rectify the error and re-process the file.
  • 807580
    807580 Member Posts: 33,048
    edited October 2009
    Sanath_K wrote:
    lot of helpful comments on the logic...thanks.
    question still haunts...
    is there any way we can seek help from java code to locate multi-byte characters in a dat file ?
    I hope to help the data-entry team to rectify the error and re-process the file.
    This is UTF-8 encoded text? Look at each byte. If the high bit is set, it's a participant in a multi-byte character. [See here|http://en.wikipedia.org/wiki/UTF-8#Description]. tschodt tells you how to check for this in a previous reply.
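
As a rough illustration of what "participant in a multi-byte character" means, the top bits of each byte tell you its role in UTF-8. The sketch below uses a hypothetical class `Utf8ByteKind`; it looks at bit patterns only, so it does not reject bytes such as 0xC0, 0xC1 or 0xF5..0xF7, which match a lead-byte pattern but never occur in well-formed UTF-8.

```java
// Hypothetical helper for illustration; classifies a byte by its top bits only.
final class Utf8ByteKind {

    /** Describes the role a single byte would play in a UTF-8 sequence. */
    static String classify(int b) {
        b &= 0xFF; // treat the value as unsigned, 0..255
        if ((b & 0x80) == 0x00) return "ASCII";         // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";  // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead of 2";     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead of 3";     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead of 4";     // 11110xxx
        return "invalid";                               // 11111xxx
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xC3, 0x8A, 0xE4, 0xF0, 0xFF };
        for (int b : samples) {
            System.out.println("0x" + Integer.toHexString(b) + " -> " + classify(b));
        }
    }
}
```

A lead byte should be followed by exactly as many continuation bytes as its pattern announces; a continuation byte with no lead before it, or a lead byte cut short, is what a strict decoder would call malformed.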
  • DrClap
    DrClap Member Posts: 25,479
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Reader r = new InputStreamReader(new FileInputStream(file), "UTF-8");
    int character = 0;
    while ((character = r.read()) >= 0) {
      // here we have a stream of characters decoded using UTF-8
      if (character > 127) {
        // this one isn't ASCII
      }
    }
  • 807580
    807580 Member Posts: 33,048
    DrClap wrote:
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Assuming the file contains valid UTF-8.
    If the file contains [invalid byte sequences|http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences] you get UTFDataFormatException.
  • DrClap
    DrClap Member Posts: 25,479
    tschodt wrote:
    DrClap wrote:
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Assuming the file contains valid UTF-8.
    If the file contains [invalid byte sequences|http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences] you get UTFDataFormatException.
    That's true, and certainly a possibility. But we don't know whether that's the OP's problem. All we have is some guff about "multi-byte" characters. If I were doing this -- well I wouldn't be doing this because I would get around to asking the right questions -- I would start with that, then if it threw an exception I would change it to count the number of characters read before the exception was thrown.
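
One way to do the "count up to the failure" idea is to decode strictly and ask the decoder where it stopped. As far as I know, an `InputStreamReader` built from a charset name substitutes U+FFFD for malformed input and does not throw, so getting an error at all requires a `CharsetDecoder` configured with `CodingErrorAction.REPORT`. The sketch below assumes the data is already in a byte array; `StrictUtf8Check` and `firstMalformedOffset` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not code from the thread.
final class StrictUtf8Check {

    /**
     * Returns the 0-based byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole array decodes cleanly.
     */
    static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            // endOfInput is true: a sequence cut short at the end counts as malformed.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // the bad bytes begin at the buffer's position
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: the decoded chars are not needed, make room
        }
    }

    public static void main(String[] args) {
        byte[] broken = { 'a', 'b', (byte) 0xC3, 0x28, 'c' }; // 0xC3 needs a continuation byte
        System.out.println("first malformed offset: " + firstMalformedOffset(broken));
    }
}
```

The returned offset could then be fed to a line and column counter, or looked up in a hex editor.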
  • 796440
    796440 Member Posts: 19,179
    corlettk wrote:
    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents.
    Erm?
  • 807580
    807580 Member Posts: 33,048
    jverd wrote:
    corlettk wrote:
    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole files contents.
    Erm?
    Think he was thinking InputStream.available(). I wouldn't use File.length either way, because then you can't use it against pipes, etc.
  • 807580
    807580 Member Posts: 33,048
    I don't see one.

    But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
  • 796440
    796440 Member Posts: 19,179
    BalusC wrote:
    I don't see one.

    But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
    It's not the byte array that's the problem in that case, but rather the fact that you're going to read the whole file. However, if you do decide to read the whole file, and if you know it's a regular file, not a pipe or something, and if you can assume that the size won't change while you're reading it, then declaring a byte[] of exactly the file's length would be a good way to do it.

    It's not the way I'd normally read a file, but I wouldn't rule it out.
  • DrClap
    DrClap Member Posts: 25,479
    BalusC wrote:
    But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
    Even more fun if the file length exceeds the maximum value of an integer and it gets truncated to fit in an integer, which you then use as the size of your array.
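
A small sketch of that truncation, using the same `(int)` cast as the original post. The names `LengthCastDemo`, `castLength` and `checkedLength` are hypothetical, and the guard shown is just one possible way to fail loudly.

```java
// Hypothetical demo of what (int) file.length() does for very large files.
final class LengthCastDemo {

    /** The narrowing cast used in the original post: keeps only the low 32 bits. */
    static int castLength(long length) {
        return (int) length;
    }

    /** A guarded alternative that refuses lengths an int cannot represent. */
    static int checkedLength(long length) {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length does not fit in an int: " + length);
        }
        return (int) length;
    }

    public static void main(String[] args) {
        long fiveGiB = 5L << 30;  // 5,368,709,120 bytes
        long threeGiB = 3L << 30; // 3,221,225,472 bytes
        System.out.println("5 GiB cast to int: " + castLength(fiveGiB));
        System.out.println("3 GiB cast to int: " + castLength(threeGiB));
    }
}
```

The 5 GiB case is the nasty one: the cast yields 1,073,741,824, so the array is allocated without complaint and simply holds the wrong amount. The 3 GiB case yields a negative number, and `new byte[...]` fails with a NegativeArraySizeException.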
  • Sanath_K wrote:
    Hello All

    While reading data from DB, our middileware interface gave following error.
    java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

    I understand that this failure is because of a multi-byte character, where 10g driver will fix this bug.
    I suggested the integration admin team to replace current 9i driver with 10g one and they are on it.
    Any such problem would of course not be caused by a character but rather by a character set.

    Although it is possible that an Oracle driver has a bug in it, the Oracle drivers have been handling character set conversions for years.
    Naturally if you do not set up the driver correctly then it will cause a problem.
  • Sanath_K wrote:
    My challenge is to spot the multi-byte character hidden in this big dat file.
    ..
    with this, I randomly placed chinese, korean characters and program found them.
    any alternative (better code ofcourse) way to catch this black sheep ?
    That of course is ridiculous.

    Bytes are encoded to represent a character set. If you have a data file with text in it then it must be because the data file has at least one and perhaps more character sets.

    Attempting to determine the character set of a file is generally a non-deterministic problem - a computer cannot determine the answer all the time.
    It will definitely not be able to do it with a misplaced single character from another character set.

    So your problem is actually one of the following
    - Determine what the character set is rather than using the wrong one.
    - Attempt to scrub the incoming data to remove garbage (because that is what wrong character set characters would be) from the data before attempting to insert it into the database. And provide enough error detection that problems can be dealt with manually. In this case you are NOT attempting to recognize characters from another character set but rather excluding anything that doesn't fit into the set that you are using.
    - Make the source of the file start producing a file in a single/correct character set.
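
For the second option, excluding anything that does not fit the character set in use, one possible sketch is to ask a `CharsetEncoder` for the target set which characters it can encode and to report the rest for manual follow-up. `CharsetScrubber` and `unencodablePositions` are hypothetical names, and the choice of ISO-8859-1 as target is purely an assumption for the example; the thread never says which character set the database uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; the target charset is an assumption.
final class CharsetScrubber {

    /**
     * Returns the 0-based indexes of the chars in text that the target charset
     * cannot encode. Works per UTF-16 char, so a supplementary character
     * (a surrogate pair) shows up as two positions.
     */
    static List<Integer> unencodablePositions(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        String record = "abc\u4E2Dd"; // contains one CJK character
        System.out.println("unencodable at: " + unencodablePositions(record, latin1));
    }
}
```

This only helps once the incoming bytes have been decoded with the right character set; if the decoding step itself is wrong, the positions reported here are meaningless.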
  • 800308
    800308 Member Posts: 8,227
    JoachimSauer wrote:
    corlettk wrote:
    2. How many bytes (a signed 8-bit integer value) exceed 300?
    301 bytes exceed 300 bytes!
    EPIC FAIL ;-)
  • 800308
    800308 Member Posts: 8,227
    Sanath_K wrote:
    lakhs of records
    I'm stealing that word. It's just so... ummm.... "pithy"!
    A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
    ~~ http://en.wikipedia.org/wiki/Lakh
  • 800308
    800308 Member Posts: 8,227
    Folks,

    Combining the just-read-single-bytes technique suggested by endasil with the bit-twiddling-highorder-test suggested by tschodt I get:

    (*NOTE:* The source code is its own test-data. It contains two "extended characters" in the comment, and therefore must be saved in a UTF-8 (or equivalent) encoded file, and compiled using [javac's|http://www.manpagez.com/man/1/javac/] -encoding UTF8 argument.)

    ExtendedCharacterDetector.java
    package forums;
    
    import java.io.FileInputStream;
    import java.io.BufferedInputStream;
    
    // ÊPIC FÃIL!
    
    public class ExtendedCharacterDetector
    {
      public static void main(String[] args) {
        String filename = args.length>0 ? args[0] : "ExtendedCharacterDetector.java";
        try {
          BufferedInputStream input = null;
          try {
            input = new BufferedInputStream(new FileInputStream(filename));
            int count = 0;
            for ( int i=0,b=-1; (b=input.read()) != -1; ++i ) {
              if ( (b&0x80) != 0 ) {
                System.out.println("byte "+i+" is "+b);
                count++;
              }
            }
            System.out.println("Number of bytes exceeding "+0x80+" = "+count);
          } finally {
            if(input!=null)input.close();
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
    output
    byte 94 is 195
    byte 95 is 138
    byte 101 is 195
    byte 102 is 131
    Number of bytes exceeding 128 = 4
    This still isn't very useful though, is it? Bytes 94 and 95 are "busted"... Goodo! So WTF is byte 94? I suppose you could download a [free HEX editor|http://www.google.com/search?q=free+hex+editor] and use that to spy-out the offending characters... (I use NEO, it works).

    So... presuming that you know what encoding the file is supposed to be in... I am still of the humble opinion that your users would be better-served if you read characters from the file, and reported character offsets and values to the user... the "extendedness test" remains logically the same... extended (non 7-bit-ascii) characters have a value of 128 (2^7) or more, and this is (AFAIK) the same for any ASCII-compatible charset (UTF-8, ISO-8859-x, windows-125x and so on), because they all share the same "basic ascii table" (i.e. code points < 128).

    Cheers. Keith.
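
A sketch of that character-level report, assuming the encoding is known and passed in by name. `ExtendedCharReport` and `report` are hypothetical names; line and column are 1-based and count decoded chars (UTF-16 units), so a character outside the Basic Multilingual Plane appears as two entries.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not code from the thread.
final class ExtendedCharReport {

    /** Decodes the stream and lists every char above 127 with its line and column. */
    static List<String> report(InputStream in, String charsetName) throws IOException {
        List<String> hits = new ArrayList<String>();
        Reader reader = new InputStreamReader(in, charsetName);
        int line = 1;
        int column = 1;
        int c;
        while ((c = reader.read()) != -1) {
            if (c > 127) {
                hits.add("line " + line + ", column " + column + ": U+" + String.format("%04X", c));
            }
            if (c == '\n') { // a line feed starts a new line
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ExtendedCharReport <file>");
            return;
        }
        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        try {
            for (String hit : report(in, "UTF-8")) {
                System.out.println(hit);
            }
        } finally {
            in.close();
        }
    }
}
```

Because the reader substitutes U+FFFD for bytes it cannot decode, malformed input would also show up in this list, as `U+FFFD` entries.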
This discussion has been closed.