Java Programming

Where is the Multi-Byte Character.

sanath_k (Oct 23 2009, edited Oct 30 2009)

Hello All

While reading data from the DB, our middleware interface gave the following error.
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

I understand that this failure is caused by a multi-byte character, and that the 10g driver fixes this bug.
I suggested that the integration admin team replace the current 9i driver with the 10g one, and they are on it.

In addition to this, I wanted to tell the data input team exactly where the failure occurred.
I asked them for a download of the dat file, and my intention was to find out exactly where
the multi-byte character that caused this failure is located.

I wrote the following code to check this.

import java.io.*;

public class X
{
    public static void main(String ar[])
    {
        int linenumber=1,columnnumber=1;
        long totalcharacters=0;
        try
        {
            File file = new File("inputfile.dat");
            FileInputStream fin = new FileInputStream(file);
            byte fileContent[] = new byte[(int)file.length()];
            fin.read(fileContent);
            for(int i=0;i<fileContent.length;i++)
            {
                columnnumber++;totalcharacters++;
                if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300) // if invalid
                {System.out.println("failure at position: "+i);break;}
                if(fileContent[i]==10 || fileContent[i]==13) // if new line
                {linenumber++;columnnumber=1;}
            }
            fin.close();
            System.out.println("Finished successfully, total lines : "+linenumber+" total file size : "+totalcharacters);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            System.out.println("Exception at Line: "+linenumber+" columnnumber: " +columnnumber);
        }
    }
}

But this shows that the file is good, with no issues.
Whereas the middleware interface fails with the above exception while reading exactly the same input file.

Am I doing anything wrong in trying to locate that multi-byte character?
I would greatly appreciate any help, everyone!

Thanks.

807580

I have to admit that I do not know how to determine if some bytes constitute a legitimate UTF-8 value, perhaps there is something in Character that might help.

However this if statement can't be what you want since as far as I can tell, it can never be true.

if(fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13 && fileContent[i]>300)


What single value will satisfy both the first and the last condition?

Edited by: johndjr on Oct 23, 2009 8:26 AM

800308

Sanath,

It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

Pop quiz:

1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

2. How many bytes (a signed 8-bit integer value) exceed 300?

3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyway)?

4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

5. Have you ever considered a career in the armed services?

Cheers. Keith.

JoachimSauer

corlettk wrote:
2. How many bytes (a signed 8-bit integer value) exceed 300?

301 bytes exceed 300 bytes!

sanath_k

My challenge is to spot the multi-byte character hidden in this big dat file.
This is because the data entry team asked me to point out the record and column that has the issue, out of
the lakhs of records they sent inside this file.

Let's have the validation code like this...

   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
   {System.out.println("failure at position: "+i);break;}

< 0 - As I tested, some chars generated -ve values for some codes.
> 300 - was a try to find out if any character exceeds the actual character range.
10 and 13 are for line feed.

Any alternative (better code, of course) way to catch this black sheep?

sanath_k

   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
   {System.out.println("failure at position: "+i);break;}

less than 0 - I saw some -ve values when I was testing with other files.
greater than 300 - was a try to find out if any character exceeds the actual character range.
10 and 13 are for line feed.

With this, I randomly placed Chinese and Korean characters and the program found them.
Any alternative (better code, of course) way to catch this black sheep?

Edited by: Sanath_K on Oct 23, 2009 8:06 PM

807580

Sanath_K wrote:

   if( (fileContent[i]<0 && fileContent[i]!=10 && fileContent[i]!=13) || fileContent[i]>300) // if invalid
   {System.out.println("failure at position: "+i);break;}

less than 0 - I saw some -ve values when I was testing with other files.
greater than 300 - was a try to find out if any character exceeds the actual character range.
10 and 13 are for line feed.

With this, I randomly placed Chinese and Korean characters and the program found them.
Any alternative (better code, of course) way to catch this black sheep?

A less obfuscated way of doing that would be

   if( (fileContent[i]&0x80)!=0 ) // if not ASCII-7
   {System.out.println("failure at position: "+i);break;}

807580

corlettk wrote:
Sanath,

It is my considered opinion that you're in way over your head. That code has some really basic noob mistakes.

Pop quiz:

1. For what value(s) of x is this statement true (x<0 && x>300)? Putz!

2. How many bytes (a signed 8-bit integer value) exceed 300?

3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents. Why do you imagine that you are special (and therefore this technique will work reliably for you anyway)?

4. How were you planning to determine the size of each character by reading bytes? Whitefella mojo maybe? You might also want to try a nice Séance, but I don't much like your chances there either.

5. Have you ever considered a career in the armed services?

6. How much data do you think you've read when you do this:

fin.read(fileContent);

You might have read as little as one byte, meaning you're skimming over all but one byte of the file.
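
To make the point about read() concrete, here is a minimal sketch of a loop that keeps reading until the array is actually full. The names ReadFully and readAll are made up for illustration, and an in-memory stream stands in for the file so the example runs anywhere.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not from the thread: fills a byte array completely.
final class ReadFully {

    // Calls read() repeatedly until 'length' bytes have arrived or the stream ends early.
    static byte[] readAll(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        int filled = 0;
        while (filled < length) {
            // A single read() may deliver fewer bytes than requested, so keep a running total.
            int n = in.read(data, filled, length - filled);
            if (n == -1) {
                throw new EOFException("stream ended after " + filled + " of " + length + " bytes");
            }
            filled += n;
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a FileInputStream, so the sketch runs without a file.
        InputStream in = new ByteArrayInputStream(new byte[] {72, 105, 10});
        System.out.println("read " + readAll(in, 3).length + " bytes"); // prints "read 3 bytes"
    }
}
```

With a real file you would pass a FileInputStream and the file length instead. java.io.DataInputStream.readFully does the same kind of loop if you would rather not write it yourself.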

sanath_k

From right-click, File, Properties, I found size: 12512196 bytes.
The same value comes back from file.length, from the byte array size before the for loop, and finally from the value I am printing to verify, i.e. totalcharacters.
From this, I felt it is OK to go ahead and check each byte value, as the aim is to locate the first special character.

807580

Sanath_K wrote:
From right-click, File, Properties, I found size: 12512196 bytes.
The same value comes back from file.length, from the byte array size before the for loop, and finally from the value I am printing to verify, i.e. totalcharacters.
From this, I felt it is OK to go ahead and check each byte value, as the aim is to locate the first special character.

If this is a disposable program, fine. But you're doing it wrong. And you still have serious issues with how you actually read in data. Namely, you look through every byte of fileContent but it's more than likely that only the first few bytes actually contain data from your file.

807580

If you want the entire contents of the file in a byte array, here's how you can do it:

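FileInputStream fin = new FileInputStream("inputfile.dat"); // the file name from the original post
ByteArrayOutputStream baos = new ByteArrayOutputStream();
int len;
byte[] buf = new byte[1024];
// read() returns how many bytes it actually delivered; -1 means end of stream
while ( (len = fin.read(buf)) != -1 ) {
   baos.write(buf, 0, len);
}

byte[] fileContents = baos.toByteArray();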

But you're probably fine looking at it chunk-by-chunk.

In fact, if you're only interested in doing it byte-by-byte, just do this:

BufferedInputStream bin = new BufferedInputStream(fin);
for (int b = -1; (b = bin.read()) != -1; ) {
  //deal with this byte
}

Edited by: endasil on 23-Oct-2009 11:36 AM

sanath_k

Lots of helpful comments on the logic... thanks.
The question still haunts me...
Is there any way Java code can help us locate multi-byte characters in a dat file?
I hope to help the data-entry team to rectify the error and re-process the file.

807580

Sanath_K wrote:
Lots of helpful comments on the logic... thanks.
The question still haunts me...
Is there any way Java code can help us locate multi-byte characters in a dat file?
I hope to help the data-entry team to rectify the error and re-process the file.

This is UTF-8 encoded text? Look at each byte. If the high bit is set, it's a participant in a multi-byte character. See here: http://en.wikipedia.org/wiki/UTF-8#Description. tschodt tells you how to check for this in a previous reply.
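
The bit patterns behind that rule can be sketched in a few lines. In UTF-8 a byte of the form 0xxxxxxx is plain ASCII, 10xxxxxx is a continuation byte, and 110xxxxx, 1110xxxx and 11110xxx start 2-, 3- and 4-byte sequences. The class and method names below are hypothetical, and the sketch only classifies single bytes; it does not check that a whole sequence is well formed.

```java
// Hypothetical helper: classifies one byte by its UTF-8 bit pattern.
final class Utf8ByteKind {

    static String kind(int b) {
        b &= 0xFF;                                        // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return "ascii";           // 0xxxxxxx
        if ((b & 0xC0) == 0x80) return "continuation";    // 10xxxxxx
        if ((b & 0xE0) == 0xC0) return "lead-2";          // 110xxxxx
        if ((b & 0xF0) == 0xE0) return "lead-3";          // 1110xxxx
        if ((b & 0xF8) == 0xF0) return "lead-4";          // 11110xxx
        return "invalid";                                 // 11111xxx never occurs in UTF-8
    }

    public static void main(String[] args) {
        // U+00CA is encoded in UTF-8 as the two bytes 0xC3 0x8A.
        System.out.println(kind(0xC3)); // prints "lead-2"
        System.out.println(kind(0x8A)); // prints "continuation"
        System.out.println(kind('A'));  // prints "ascii"
    }
}
```

The mask on the first line of kind() matters because Java bytes are signed, which is where the negative values seen earlier in the thread come from.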

DrClap

If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:

Reader r = new InputStreamReader(new FileInputStream(file), "UTF-8");
int character = 0;
while ((character = r.read()) >= 0) {
  // here we have a stream of characters decoded using UTF-8
  if (character > 127) {
    // this one isn't ASCII
  }
}

807580

DrClap wrote:
If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:

Assuming the file contains valid UTF-8.
If the file contains invalid byte sequences (http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences) you get UTFDataFormatException.

DrClap

tschodt wrote:

Assuming the file contains valid UTF-8.
If the file contains invalid byte sequences (http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences) you get UTFDataFormatException.

That's true, and certainly a possibility. But we don't know whether that's the OP's problem. All we have is some guff about "multi-byte" characters. If I were doing this -- well I wouldn't be doing this because I would get around to asking the right questions -- I would start with that, then if it threw an exception I would change it to count the number of characters read before the exception was thrown.

796440

corlettk wrote:
3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents.

Erm?

807580

jverd wrote:

corlettk wrote:
3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte-array to hold the whole file's contents.

Erm?

Think he was thinking InputStream.available(). I wouldn't use File.length either way, because then you can't use it against pipes, etc.

807580

I don't see one.

But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.

796440

BalusC wrote:
I don't see one.

But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.

It's not the byte array that's the problem in that case, but rather the fact that you're going to read the whole file. However, if you do decide to read the whole file, and if you know it's a regular file, not a pipe or something, and if you can assume that the size won't change while you're reading it, then declaring a byte[] of exactly the file's length would be a good way to do it.

It's not the way I'd normally read a file, but I wouldn't rule it out.

DrClap

BalusC wrote:
But it is nothing more than obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.

Even more fun if the file length exceeds the maximum value of an integer and it gets truncated to fit in an integer, which you then use as the size of your array.
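
The truncation is easy to demonstrate. In the sketch below the 5 GB length is a made-up figure and safeArraySize is a hypothetical guard; the only point is that a narrowing cast from long to int silently drops the high bits.

```java
// Hypothetical demo of what (int) file.length() does with a very large file.
final class LengthCast {

    // Refuses lengths that do not fit in an int instead of silently truncating them.
    static int safeArraySize(long fileLength) {
        if (fileLength < 0 || fileLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("cannot size an array for " + fileLength + " bytes");
        }
        return (int) fileLength;
    }

    public static void main(String[] args) {
        long fiveGb = 5L * 1024 * 1024 * 1024;          // 5368709120, an assumed file length
        System.out.println((int) fiveGb);                // prints 1073741824: only the low 32 bits survive
        System.out.println(safeArraySize(12512196L));    // prints 12512196, the size reported in this thread
    }
}
```

A truncated length like this one is still a legal array size, so the original program would allocate the array without complaint and then scan only part of the file.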

jschellSomeoneStoleMyAlias

Sanath_K wrote:
Hello All

While reading data from the DB, our middleware interface gave the following error.
java.sql.SQLException: Fail to convert between UTF8 and UCS2: failUTF8Conv

I understand that this failure is because of a multi-byte character, where 10g driver will fix this bug.
I suggested the integration admin team to replace current 9i driver with 10g one and they are on it.

Any such problem would of course not be caused by a character but rather by a character set.

Although it is possible that an Oracle driver has a bug in it, the Oracle drivers have been handling character set conversions for years.
Naturally if you do not set up the driver correctly then it will cause a problem.

jschellSomeoneStoleMyAlias

Sanath_K wrote:
My challenge is to spot the multi-byte character hidden in this big dat file.
..
with this, I randomly placed chinese, korean characters and program found them.
any alternative (better code ofcourse) way to catch this black sheep ?

That of course is ridiculous.

Bytes are encoded to represent a character set. If you have a data file with text in it then it must be because the data file has at least one and perhaps more character sets.

Attempting to determine the character set of a file is generally a non-deterministic problem - a computer cannot determine the answer all the time.
It will definitely not be able to do it with a misplaced single character from another character set.

So your problem is actually one of the following
- Determine what the character set is rather than using the wrong one.
- Attempt to scrub the incoming data to remove garbage (because that is what wrong character set characters would be) from the data before attempting to insert it into the database. And provide enough error detection that problems can be dealt with manually. In this case you are NOT attempting to recognize characters from another character set but rather excluding anything that doesn't fit into the set that you are using.
- Make the source of the file start producing a file in a single/correct character set.

800308

JoachimSauer wrote:

corlettk wrote:
2. How many bytes (a signed 8-bit integer value) exceed 300?

301 bytes exceed 300 bytes!

EPIC FAIL ;-)

800308

Sanath_K wrote:
lakhs of records

I'm stealing that word. It's just so... ummm.... "pithy"!

A lakh (English pronunciation: /ˈlæk/ or /ˈlɑːk/; Hindi: लाख, pronounced [ˈlaːkʰ]) (also written lac) is a unit in the Indian numbering system equal to one hundred thousand (100,000; 10^5). It is widely used both in official and other contexts in Bangladesh, India, Maldives, Nepal, Sri Lanka, Myanmar and Pakistan, and is often used in Indian English.
~~ http://en.wikipedia.org/wiki/Lakh

800308

Folks,

Combining the just-read-single-bytes technique suggested by endasil with the bit-twiddling-highorder-test suggested by tschodt I get:

(NOTE: The source code is its own test-data. It contains two "extended characters" in the comment, and therefore must be saved in a UTF-8 (or equivalent) encoded file, and compiled using javac's (http://www.manpagez.com/man/1/javac/) -encoding UTF8 argument.)

ExtendedCharacterDetector.java

package forums;

import java.io.FileInputStream;
import java.io.BufferedInputStream;

// ÊPIC FÃIL!

public class ExtendedCharacterDetector
{
  public static void main(String[] args) {
    String filename = args.length>0 ? args[0] : "ExtendedCharacterDetector.java";
    try {
      BufferedInputStream input = null;
      try {
        input = new BufferedInputStream(new FileInputStream(filename));
        int count = 0;
        for ( int i=0,b=-1; (b=input.read()) != -1; ++i ) {
          if ( (b&0x80) != 0 ) {
            System.out.println("byte "+i+" is "+b);
            count++;
          }
        }
        System.out.println("Number of bytes exceeding "+0x80+" = "+count);
      } finally {
        if(input!=null)input.close();
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

output

byte 94 is 195
byte 95 is 138
byte 101 is 195
byte 102 is 131
Number of bytes exceeding 128 = 4

This still isn't very useful though, is it? Bytes 94 and 95 are "busted"... Goodo! So WTF is byte 94? I suppose you could download a free HEX editor (http://www.google.com/search?q=free+hex+editor) and use that to spy-out the offending characters... (I use NEO, it works).

So... presuming that you know what encoding the file is supposed to be in... I am still of the humble opinion that your users would be better-served if you read characters from the file, and reported character offsets and values to the user... the "extendedness test" remains logically the same... extended (non 7-bit-ascii) characters have a value of 128 (2^7) or more, and this is (AFAIK) the same no matter which charset the character has been encoded in, because all modern charsets use the same "basic ascii table" (i.e. code points below 128).
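
Reporting characters rather than bytes might look like the sketch below. It assumes the file really is UTF-8 and that records end with '\n'; NonAsciiLocator and report are hypothetical names. As far as I can tell, InputStreamReader replaces malformed input with U+FFFD instead of failing, so a U+FFFD in the report is itself a hint that the bytes were not valid UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists every non-ASCII character with its line and column.
final class NonAsciiLocator {

    // Lines and columns are 1-based; columns count UTF-16 code units, not bytes.
    static List<String> report(InputStream in) throws IOException {
        List<String> hits = new ArrayList<>();
        Reader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        int line = 1, column = 0, c;
        while ((c = r.read()) != -1) {
            if (c == '\n') {            // assumed record separator
                line++;
                column = 0;
                continue;
            }
            column++;
            if (c > 127) {              // anything outside 7-bit ASCII
                hits.add(String.format("line %d, column %d: U+%04X", line, column, c));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // With no argument, scan a small in-memory sample instead of a file.
        byte[] sample = "id,name\n1,caf\u00E9\n".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = args.length > 0
                ? new FileInputStream(args[0])
                : new ByteArrayInputStream(sample)) {
            for (String hit : report(in)) {
                System.out.println(hit); // sample prints "line 2, column 6: U+00E9"
            }
        }
    }
}
```

A line and column is something the data entry team can act on directly, which a raw byte offset is not.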

Cheers. Keith.

807580

It occurs to me that the underlying codecs, defined in java.nio.charset or thereabouts, work between ByteBuffer and CharBuffer, so if you put your data file contents into a ByteBuffer (or map the file to a ByteBuffer) and run the decoder directly it should leave the buffer pointers at the point of failure.

Could save you a lot of bit twiddling.

But I'm not sure how knowing the offset of the problem is going to help you. The likely cause is that you've somehow told the middleware that this text is UTF-8 when it's in some other encoding, and byte offsets won't help you there.

In other words you probably have a configuration problem.
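
For what it's worth, running the decoder directly could look like the sketch below. It assumes the data is meant to be UTF-8 and fits in memory; FirstBadByte and firstMalformedOffset are hypothetical names. With the default error action (REPORT), CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean) returns an error result rather than throwing, and the input buffer's position is left at the start of the offending bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: finds where a byte array stops being valid UTF-8.
final class FirstBadByte {

    // Returns the offset of the first byte sequence that is not valid UTF-8, or -1 if everything decodes.
    static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // default action is REPORT
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position();   // the bad sequence starts here
            }
            if (result.isUnderflow()) {
                return -1;              // all input consumed without an error
            }
            out.clear();                // output full: discard decoded chars and continue
        }
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1 but not a valid UTF-8 sequence here.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9, '\n'};
        System.out.println(firstMalformedOffset(latin1)); // prints 3
    }
}
```

The sample in main is the configuration problem in miniature: ISO-8859-1 bytes handed to a UTF-8 decoder fail at the first accented character.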

807580

corlettk wrote:

Sanath_K wrote:
lakhs of records

I'm stealing that word. It's just so... ummm.... "pithy"!

Laks is Dutch for "extremely negligent".

DrClap

BalusC wrote:
Laks is Dutch for "extremely negligent".

Just like the English word lax only more emphatic.

sanath_k

As the DBA suggested, the file was extracted with a select query in which CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.

791266

Sanath_K wrote:
As the DBA suggested, the file was extracted with a select query in which CONVERT (column_name, 'UTF8', 'WE8ISO8859P1') was used, and this resolved the issue in file processing. It was done for all the description columns since we couldn't spot the exact field.

I wonder why considering that you had so many bugs in your original code.

sanath_k

The concern from a quality standpoint is that using the convert function for all columns, instead of finding the one record's field where the data could be corrected, is a much more time-consuming process.

sanath_k

Thanks for all the valuable responses.


Locked on Nov 27 2009

Added on Oct 23 2009
