Where is the Multi-Byte Character.

sanath_k
sanath_k Member Posts: 62
edited Oct 30, 2009 9:39PM in Java Programming
Hello All

While reading data from the DB, our middleware interface gave the following error:
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

I understand that this failure is caused by a multi-byte character and that the 10g driver fixes this bug.
I suggested that the integration admin team replace the current 9i driver with the 10g one, and they are working on it.

In addition, I wanted to tell the data input team exactly where the failure occurred.
I asked them for the dat file and downloaded it; my intention was to find out exactly where
the multi-byte character that caused this failure is located.

I wrote the following code to check this.
import java.io.*;
public class X
{
    public static void main(String ar[])
    {
        int linenumber=1,columnnumber=1;
        long totalcharacters=0;
        try
        {
            File file = new File("inputfile.dat");
            FileInputStream fin = new FileInputStream(file);
            byte fileContent[] = new byte[(int)file.length()];
            fin.read(fileContent);
            for(int i=0;i<fileContent.length;i++)
            {
                columnnumber++;totalcharacters++;
                if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300) // if invalid
                {System.out.println("failure at position: "+i);break;}
                if(fileContent[i]==10 || fileContent[i]==13) // if new line
                {linenumber++;columnnumber=1;}
            }
            fin.close();
            System.out.println("Finished successfully, total lines : "+linenumber+" total file size : "+totalcharacters);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.out.println("Exception at Line: "+linenumber+" columnnumber: " +columnnumber);
        }
    }
}
But this reports that the file is fine and has no issues,
whereas the middleware interface fails with the above exception while reading exactly the same input file.

Am I doing anything wrong in trying to locate that multi-byte character?
I would greatly appreciate any help!

Thanks.

Comments

  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    edited Oct 23, 2009 8:27AM
    I have to admit that I do not know how to determine whether some bytes constitute a legitimate UTF-8 value; perhaps there is something in Character that might help.

    However, this if statement can't be what you want since, as far as I can tell, it can never be true.
    if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300)

    What single value will satisfy the first and fourth conditions?

    Edited by: johndjr on Oct 23, 2009 8:26 AM
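On the question of how to tell whether some bytes are legitimate UTF-8: one possible approach, sketched here under the assumption that the dat file is meant to be UTF-8 (the thread does not confirm this), is to run the bytes through a strict java.nio.charset.CharsetDecoder and ask it where decoding fails. The class name Utf8Check and the method firstMalformedOffset are hypothetical names chosen for illustration, and StandardCharsets needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original thread.
public class Utf8Check {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if the data decodes cleanly. */
    static int firstMalformedOffset(byte[] data) {
        // REPORT makes the decoder stop at bad input instead of silently replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error, the input position is left at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: the decoded characters are not needed, so discard them
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file path is passed as the first argument.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("first malformed offset: " + firstMalformedOffset(data));
    }
}
```

A result of -1 only says the bytes are well-formed UTF-8. It does not prove the file is free of multi-byte characters, since valid multi-byte sequences decode without error.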
  • 800308
    800308 Member Posts: 8,227
    Sanath,

    It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

    Pop quiz:

    1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

    2. How many bytes (a signed 8-bit integer value) exceed 300?

    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore that this technique will work reliably for you anyway)?

    4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

    5. Have you ever considered a career in the armed services?

    Cheers. Keith.
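Quiz items 1 and 2 come down to the fact that a Java byte is a signed 8-bit value in the range -128..127, so it can never exceed 300, and any byte of 0x80 or above shows up as a negative number. A minimal sketch of that behaviour follows; ByteRange and unsigned are hypothetical names used only for this illustration.

```java
// Hypothetical demo class, not part of the original thread.
public class ByteRange {

    /** Converts a signed Java byte (-128..127) to its unsigned value (0..255). */
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE4; // a typical lead byte of a three-byte UTF-8 sequence
        System.out.println(b);           // prints -28
        System.out.println(unsigned(b)); // prints 228
        System.out.println(Byte.MIN_VALUE + ".." + Byte.MAX_VALUE); // prints -128..127
    }
}
```

This is why a test such as fileContent[i] > 300 is always false, while fileContent[i] < 0 matches every byte with the high bit set.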
  • JoachimSauer
    JoachimSauer Member Posts: 4,780
    corlettk wrote:
    2. How many bytes (a signed 8-bit integer value) exceed 300?
    301 bytes exceed 300 bytes!
  • sanath_k
    sanath_k Member Posts: 62
    edited Oct 23, 2009 10:37AM
    My challenge is to spot the multi-byte character hidden in this big dat file.
    This is because the data entry team asked me to point out the record and column that has the issue, out of
    the lakhs of records they sent in this file.

    Let's use validation code like this:
       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
    {System.out.println("failure at position: "+i);break;}
    Less than 0: I saw some negative values when I was testing with other files.
    Greater than 300: this was an attempt to find any characters outside the actual character range.
    10 and 13 are the line-break characters (line feed and carriage return).

    With this, I randomly placed Chinese and Korean characters in the file and the program found them.
    Is there any alternative (and, of course, better) way to catch this black sheep?

    Edited by: Sanath_K on Oct 23, 2009 8:06 PM
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    Sanath_K wrote:
       if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
    {System.out.println("failure at position: "+i);break;}
    Less than 0: I saw some negative values when I was testing with other files.
    Greater than 300: this was an attempt to find any characters outside the actual character range.
    10 and 13 are the line-break characters (line feed and carriage return).

    With this, I randomly placed Chinese and Korean characters in the file and the program found them.
    Is there any alternative (and, of course, better) way to catch this black sheep?

    A less obfuscated way of doing that would be
       if( (fileContent[i]&0x80)!=0 ) // if not 7-bit ASCII
    {System.out.println("failure at position: "+i);break;}
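Since the data entry team asked for the record and column, the high-bit test above can be combined with line and column counting. The following is only a sketch under stated assumptions: NonAsciiLocator and findFirstNonAscii are hypothetical names, lines are counted on '\n' alone so that a CRLF pair counts as one line break, and the column is a byte position within the line rather than a character position.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original thread.
public class NonAsciiLocator {

    /**
     * Returns {line, column, byteOffset} of the first byte with the high bit set,
     * or null if every byte is 7-bit ASCII. Line and column are 1-based; the offset is 0-based.
     */
    static long[] findFirstNonAscii(InputStream in) throws IOException {
        long line = 1, column = 0, offset = 0;
        int b;
        while ((b = in.read()) != -1) { // read() returns 0..255, or -1 at end of stream
            column++;
            if ((b & 0x80) != 0) {
                return new long[] { line, column, offset };
            }
            if (b == '\n') { // count only LF so that CRLF is a single line break
                line++;
                column = 0;
            }
            offset++;
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file path is passed as the first argument.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long[] hit = findFirstNonAscii(in);
            if (hit == null) {
                System.out.println("only 7-bit ASCII bytes found");
            } else {
                System.out.println("line " + hit[0] + ", column " + hit[1] + ", byte offset " + hit[2]);
            }
        }
    }
}
```

Reading through a BufferedInputStream one byte at a time avoids holding the whole file in memory, so the same code works regardless of file size.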
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    edited Oct 23, 2009 11:01AM
    corlettk wrote:
    Sanath,

    It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

    Pop quiz:

    1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

    2. How many bytes (a signed 8-bit integer value) exceed 300?

    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore that this technique will work reliably for you anyway)?

    4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

    5. Have you ever considered a career in the armed services?
    6. How much data do you think you've read when you do this:
    fin.read(fileContent);
    You might have read as little as one byte, meaning you're skimming over all but one byte of the file.
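The point in item 6 is that InputStream.read(byte[]) returns the number of bytes it actually read, which may be fewer than the array length. A minimal sketch of a loop that keeps reading until the array is full is shown below; ReadFully and readFully are hypothetical names for this illustration. The standard library already offers DataInputStream.readFully, and Files.readAllBytes on Java 7 and later, for the same purpose.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original thread.
public class ReadFully {

    /** Fills buf completely from the stream, or throws EOFException if the stream ends first. */
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            // read() may deliver fewer bytes than requested, so track how many arrived
            int n = in.read(buf, off, buf.length - off);
            if (n == -1) {
                throw new EOFException("stream ended after " + off + " of " + buf.length + " bytes");
            }
            off += n;
        }
    }
}
```

Checking the return value this way also makes a short read visible instead of leaving the rest of the array silently filled with zeros.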
  • sanath_k
    sanath_k Member Posts: 62
    From right-clicking the file and choosing Properties, I found the size: 12512196 bytes.
    file.length() returns the same number, as do the byte array size before the for loop and the value I print at the end to verify, i.e. totalcharacters.
    From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    edited Oct 23, 2009 11:32AM
    Sanath_K wrote:
    From right-clicking the file and choosing Properties, I found the size: 12512196 bytes.
    file.length() returns the same number, as do the byte array size before the for loop and the value I print at the end to verify, i.e. totalcharacters.
    From this, I felt it was OK to go ahead and check each byte value, as the aim is to locate the first special character.
    If this is a disposable program, fine. But you're doing it wrong. And you still have serious issues with how you actually read in data. Namely, you look through every byte of fileContent but it's more than likely that only the first few bytes actually contain data from your file.
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    edited Oct 23, 2009 11:38AM
    If you want the entire contents of the file in a byte array, here's how you can do it:
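    // "inputfile.dat" is the file name used in the original post
    FileInputStream fin = new FileInputStream("inputfile.dat");
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    int len;
    byte[] buf = new byte[1024];
    // read() may return fewer bytes than requested, so loop until end of stream (-1)
    while ( (len = fin.read(buf)) != -1 ) {
       baos.write(buf, 0, len);
    }
    fin.close();

    byte[] fileContents = baos.toByteArray();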
    But you're probably fine looking at it chunk-by-chunk.

    In fact, if you're only interested in doing it byte-by-byte, just do this:
    BufferedInputStream bin = new BufferedInputStream(fin);
    for (int b = -1; (b = bin.read()) != -1; ) {
      //deal with this byte
    }
    Edited by: endasil on 23-Oct-2009 11:36 AM
This discussion has been closed.