32 Replies. Latest reply: Oct 30, 2009 8:39 PM by 800374

    Where is the Multi-Byte Character.

    800374
      Hello All

      While reading data from the DB, our middleware interface gave the following error:
      java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

      I understand that this failure is caused by a multi-byte character, and that the 10g driver fixes this bug.
      I suggested that the integration admin team replace the current 9i driver with the 10g one, and they are on it.

      In addition to this, I wanted to tell the data input team exactly where the failure occurred.
      I asked them for a download of the dat file; my intention was to find out exactly where
      the multi-byte character that caused this failure is located.

      I wrote the following code to check this.
      import java.io.*;
      public class X
      {
        public static void main(String ar[])
        {
          int linenumber=1,columnnumber=1;
          long totalcharacters=0;
          try
          {
            File file = new File("inputfile.dat");
            FileInputStream fin = new FileInputStream(file);
            byte fileContent[] = new byte[(int)file.length()];
            fin.read(fileContent);
            for(int i=0;i<fileContent.length;i++)
            {
              columnnumber++;totalcharacters++;
              if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300) // if invalid
              {System.out.println("failure at position: "+i);break;}
              if(fileContent[i]==10 || fileContent[i]==13) // if new line
              {linenumber++;columnnumber=1;}
            }
            fin.close();
            System.out.println("Finished successfully, total lines : "+linenumber+" total file size : "+totalcharacters);
          }
          catch (Exception e)
          {
            e.printStackTrace();
            System.out.println("Exception at Line: "+linenumber+" columnnumber: " +columnnumber);
          }
        }
      }
      But this reports that the file is good and has no issue.
      Whereas the middleware interface fails with the above exception while reading exactly the same input file.

      Am I doing anything wrong in trying to locate that multi-byte character?
      I would greatly appreciate any help, everyone!

      Thanks.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
        • 1. Re: Where is the Multi-Byte Character.
          807580
          I have to admit that I do not know how to determine whether some bytes constitute a legitimate UTF-8 value; perhaps there is something in Character that might help.

          However, this if statement can't be what you want since, as far as I can tell, it can never be true.
          if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300)
          What single value will satisfy both the first and the last conditions?
          
          Edited by: johndjr on Oct 23, 2009 8:26 AM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
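To make the point concrete: a Java byte is signed and ranges from -128 to 127, so no byte can ever be greater than 300, and every byte with the high bit set shows up as a negative number. Here is a minimal sketch of that; the class and method names are made up for illustration and are not from the original post.

```java
final class SignedByteDemo {
    /** Returns the unsigned value (0..255) of a signed Java byte. */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    /** True if the byte has its high bit set, i.e. it is not 7-bit ASCII. */
    static boolean isNonAscii(byte b) {
        return b < 0; // same test as (b & 0x80) != 0
    }

    public static void main(String[] args) {
        byte lead = (byte) 0xC3; // first byte of "é" when encoded as UTF-8
        System.out.println(lead);                   // a negative number
        System.out.println(toUnsigned(lead));       // 195
        System.out.println(isNonAscii(lead));       // true
        System.out.println(isNonAscii((byte) 'A')); // false
    }
}
```

So a test of the form "less than 0" alone already catches every non-ASCII byte, and the "greater than 300" part adds nothing.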
          • 2. Re: Where is the Multi-Byte Character.
            800308
            Sanath,

            It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

            Pop quiz:

            1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

            2. How many bytes (a signed 8-bit integer value) exceed 300?

            3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyway)?

            4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

            5. Have you ever considered a career in the armed services?

            Cheers. Keith.
            • 3. Re: Where is the Multi-Byte Character.
              JoachimSauer
              corlettk wrote:
              2. How many bytes (a signed 8-bit integer value) exceed 300?
              301 bytes exceed 300 bytes!
              • 4. Re: Where is the Multi-Byte Character.
                800374
                My challenge is to spot the multi-byte character hidden in this big dat file.
                This is because the data entry team asked me to point out the record and column that has the issue, out of
                the lakhs of records they sent inside this file.

                Let's have the validation code like this...
                   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
                {System.out.println("failure at position: "+i);break;}
                < 0 - As I tested, some chars generated negative values for some codes.

                300 - was a try to find out if any character exceeds the actual character range.
                10 and 13 are for line feed and carriage return. Is there any alternative (better code, of course) way to catch this black sheep?
                • 5. Re: Where is the Multi-Byte Character.
                  800374
                  My challenge is to spot the multi-byte character hidden in this big dat file.
                  This is because the data entry team asked me to point out the record and column that has the issue, out of
                  the lakhs of records they sent inside this file.

                  Let's have the validation code like this...
                     if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
                  {System.out.println("failure at position: "+i);break;}
                  less than 0 - I saw some negative values when I was testing with other files.
                  greater than 300 - was a try to find out if any character exceeds the actual character range.
                  10 and 13 are for line feed and carriage return.

                  With this, I randomly placed Chinese and Korean characters, and the program found them.
                  Is there any alternative (better code, of course) way to catch this black sheep?
                  
                  Edited by: Sanath_K on Oct 23, 2009 8:06 PM                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                  • 6. Re: Where is the Multi-Byte Character.
                    807580
                    Sanath_K wrote:
                       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
                    {System.out.println("failure at position: "+i);break;}
                    less than 0 - I saw some negative values when I was testing with other files.
                    greater than 300 - was a try to find out if any character exceeds the actual character range.
                    10 and 13 are for line feed and carriage return.

                    With this, I randomly placed Chinese and Korean characters, and the program found them.
                    Is there any alternative (better code, of course) way to catch this black sheep?
                    A less obfuscated way of doing that would be
                       if( (fileContent[i]&0x80)!=0 ) // if not ASCII-7
                    {System.out.println("failure at position: "+i);break;}
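Since the data entry team wants a record and column rather than a raw position, the high-bit test can be combined with line counting. The following is only a sketch under some assumptions: the class and method names are invented here, records are assumed to end with LF (so CRLF counts as a single line break), and the column is a byte count, not a character count. The file name "inputfile.dat" is the one from the original post.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class FirstNonAscii {
    /**
     * Scans the stream and returns {byteOffset, line, column} of the first byte
     * with the high bit set, or null if every byte is 7-bit ASCII.
     * Offset is 0-based; line and column are 1-based.
     */
    static long[] find(InputStream in) throws IOException {
        long offset = 0, line = 1, column = 1;
        int b;
        while ((b = in.read()) != -1) { // read() returns 0..255, or -1 at end of stream
            if ((b & 0x80) != 0) {
                return new long[] { offset, line, column };
            }
            if (b == '\n') { // count LF only, so CRLF is one line break
                line++;
                column = 1;
            } else {
                column++;
            }
            offset++;
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream("inputfile.dat"));
        try {
            long[] hit = find(in);
            System.out.println(hit == null
                    ? "no non-ASCII bytes found"
                    : "offset " + hit[0] + ", line " + hit[1] + ", column " + hit[2]);
        } finally {
            in.close();
        }
    }
}
```

Reading one byte at a time through a BufferedInputStream also avoids having to size an array from File.length() or to check how much a single read() call returned.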
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                    • 7. Re: Where is the Multi-Byte Character.
                      807580
                      corlettk wrote:
                      Sanath,

                      It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

                      Pop quiz:

                      1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

                      2. How many bytes (a signed 8-bit integer value) exceed 300?

                      3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyway)?

                      4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

                      5. Have you ever considered a career in the armed services?
                      6. How much data do you think you've read when you do this:
                      fin.read(fileContent);
                      You might have read as little as one byte, meaning you're skimming over all but one byte of the file.
                      • 8. Re: Where is the Multi-Byte Character.
                        800374
                        From right-click, File, Properties, I found the size: 12512196 bytes.
                        file.length() returns the same value, as do the byte array size before the for loop and the value I am printing to verify, i.e. totalcharacters.
                        From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
                        • 9. Re: Where is the Multi-Byte Character.
                          807580
                          Sanath_K wrote:
                          From right-click, File, Properties, I found the size: 12512196 bytes.
                          file.length() returns the same value, as do the byte array size before the for loop and the value I am printing to verify, i.e. totalcharacters.
                          From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
                          If this is a disposable program, fine. But you're doing it wrong. And you still have serious issues with how you actually read in data. Namely, you look through every byte of fileContent but it's more than likely that only the first few bytes actually contain data from your file.
                          • 10. Re: Where is the Multi-Byte Character.
                            807580
                            If you want the entire contents of the file in a byte array, here's how you can do it:
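                            FileInputStream fin = new FileInputStream("inputfile.dat");
                            ByteArrayOutputStream baos = new ByteArrayOutputStream();
                            int len;
                            byte[] buf = new byte[1024];
                            // read() may return fewer bytes than requested, so loop until end of stream
                            while ( (len = fin.read(buf)) != -1 ) {
                               baos.write(buf, 0, len);
                            }
                            fin.close();

                            byte[] fileContents = baos.toByteArray();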
                            But you're probably fine looking at it chunk-by-chunk.

                            In fact, if you're only interested in doing it byte-by-byte, just do this:
                            BufferedInputStream bin = new BufferedInputStream(fin);
                            for (int b = -1; (b = bin.read()) != -1; ) {
                              //deal with this byte
                            }
                            Edited by: endasil on 23-Oct-2009 11:36 AM
                            • 11. Re: Where is the Multi-Byte Character.
                              800374
                              Lots of helpful comments on the logic... thanks.
                              The question still haunts me...
                              Is there any way Java code can help us locate multi-byte characters in a dat file?
                              I hope to help the data-entry team rectify the error and re-process the file.
                              • 12. Re: Where is the Multi-Byte Character.
                                807580
                                Sanath_K wrote:
                                Lots of helpful comments on the logic... thanks.
                                The question still haunts me...
                                Is there any way Java code can help us locate multi-byte characters in a dat file?
                                I hope to help the data-entry team rectify the error and re-process the file.
                                This is UTF-8 encoded text? Look at each byte. If the high bit is set, it's a participant in a multi-byte character. See http://en.wikipedia.org/wiki/UTF-8#Description. tschodt tells you how to check for this in a previous reply.
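The bit patterns on that Wikipedia page can be turned into a small classifier: 0xxxxxxx is ASCII, 10xxxxxx is a continuation byte, and 110xxxxx, 1110xxxx and 11110xxx start sequences of two, three and four bytes. A rough sketch follows; the class and method names are invented for illustration, and it only looks at one byte at a time, so it does not validate whole sequences.

```java
final class Utf8ByteKind {
    /**
     * Returns the total length of the UTF-8 sequence introduced by this byte:
     * 1 for ASCII, 2 to 4 for a lead byte, 0 for a continuation byte,
     * and -1 for a byte value that can never appear in UTF-8.
     */
    static int sequenceLength(byte b) {
        int u = b & 0xFF;        // work with the unsigned value 0..255
        if (u < 0x80) return 1;  // 0xxxxxxx: ASCII
        if (u < 0xC0) return 0;  // 10xxxxxx: continuation byte
        if (u < 0xC2) return -1; // 0xC0, 0xC1: would only encode overlong forms
        if (u < 0xE0) return 2;  // 110xxxxx: lead byte of a 2-byte sequence
        if (u < 0xF0) return 3;  // 1110xxxx: lead byte of a 3-byte sequence
        if (u < 0xF5) return 4;  // 11110xxx: lead byte, up to U+10FFFF
        return -1;               // 0xF5..0xFF: not used by UTF-8
    }

    public static void main(String[] args) {
        byte[] euro = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC }; // the euro sign in UTF-8
        for (byte b : euro) {
            System.out.println(sequenceLength(b)); // 3, then 0, then 0
        }
    }
}
```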
                                • 13. Re: Where is the Multi-Byte Character.
                                  DrClap
                                  If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
                                  Reader r = new InputStreamReader(new FileInputStream(file), "UTF-8");
                                  int character = 0;
                                  while ((character = r.read()) >= 0) {
                                    // here we have a stream of characters decoded using UTF-8
                                    if (character > 127) {
                                      // this one isn't ASCII
                                    }
                                  }
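The same loop can be extended to report where each non-ASCII character sits. This is just a sketch with invented names (NonAsciiChars, locate); it assumes lines end with LF, and the column counts Java chars (UTF-16 units), so a character outside the Basic Multilingual Plane would be reported as two surrogates.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

final class NonAsciiChars {
    /** Returns one "line L, column C: U+XXXX" entry per non-ASCII char read. */
    static List<String> locate(Reader reader) throws IOException {
        List<String> hits = new ArrayList<String>();
        int line = 1, column = 1;
        int c;
        while ((c = reader.read()) >= 0) {
            if (c > 127) { // decoded character is outside ASCII
                hits.add("line " + line + ", column " + column
                        + ": U+" + String.format("%04X", c));
            }
            if (c == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // "inputfile.dat" is the file name used in the original post.
        Reader r = new BufferedReader(
                new InputStreamReader(new FileInputStream("inputfile.dat"), "UTF-8"));
        try {
            for (String hit : locate(r)) {
                System.out.println(hit);
            }
        } finally {
            r.close();
        }
    }
}
```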
                                  • 14. Re: Where is the Multi-Byte Character.
                                    807580
                                    DrClap wrote:
                                    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
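                                    Assuming the file contains valid UTF-8.
                                    If the file contains invalid byte sequences (http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences), an InputStreamReader created with a charset name does not throw; it silently replaces the bad bytes with U+FFFD. To get an exception (MalformedInputException), decode with a CharsetDecoder configured with CodingErrorAction.REPORT. UTFDataFormatException comes from DataInputStream.readUTF, which is not what is being used here.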
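Given that the original error is a failed UTF-8 conversion, the bytes worth hunting for are probably the ones that are not valid UTF-8 at all. One way to find them, sketched here with made-up names (Utf8Validator, firstMalformedOffset), is to run java.nio's CharsetDecoder in strict mode and look at where the input buffer stops. It assumes the whole file fits in memory, which should be fine for the 12 MB file mentioned earlier.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Validator {
    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if the data is valid. */
    static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // the decoder stops at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: throw away the decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        // "inputfile.dat" is the file name used in the original post.
        byte[] data = Files.readAllBytes(Paths.get("inputfile.dat"));
        int offset = firstMalformedOffset(data);
        System.out.println(offset < 0 ? "valid UTF-8" : "malformed sequence at byte offset " + offset);
    }
}
```

The returned offset can then be mapped to a line and column by counting the line feeds that precede it.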