Where is the Multi-Byte Character.

  • sanath_k
    sanath_k Member Posts: 62
    Lots of helpful comments on the logic, thanks.
    The question still haunts me...
    Is there any way to use Java code to locate multi-byte characters in a .dat file?
    I hope to help the data-entry team correct the errors and re-process the file.
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    edited Oct 23, 2009 11:55AM
    Sanath_K wrote:
    Lots of helpful comments on the logic, thanks.
    The question still haunts me...
    Is there any way to use Java code to locate multi-byte characters in a .dat file?
    I hope to help the data-entry team correct the errors and re-process the file.
    Is this UTF-8 encoded text? If so, look at each byte. If the high bit is set, that byte is part of a multi-byte character. See here: http://en.wikipedia.org/wiki/UTF-8#Description. tschodt explains how to check for this in a previous reply.
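The high-bit check described above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name `HighBitScanner` and method name `findHighBitOffsets` are made up for the example, and it assumes the input really is UTF-8, where every byte of a multi-byte character has its top bit set.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named for this example only.
final class HighBitScanner {

    /** Returns the zero-based offset of every byte whose high bit (0x80) is set. */
    static List<Long> findHighBitOffsets(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        int b;
        // read() returns the next byte as 0..255, or -1 at end of stream
        while ((b = in.read()) != -1) {
            if ((b & 0x80) != 0) {
                offsets.add(offset);
            }
            offset++;
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // "a", U+00E9 and "b" encode to the bytes 61 C3 A9 62 in UTF-8
        byte[] sample = "a\u00e9b".getBytes(StandardCharsets.UTF_8);
        System.out.println(findHighBitOffsets(new ByteArrayInputStream(sample))); // prints [1, 2]
    }
}
```

For a real file you would probably pass `new BufferedInputStream(new FileInputStream(file))`, since reading an unbuffered stream one byte at a time is slow.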
  • DrClap
    DrClap Member Posts: 25,479
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Reader r = new InputStreamReader(new FileInputStream(file), "UTF-8");
    int character = 0;
    while ((character = r.read()) >= 0) {
      // here we have a stream of characters decoded using UTF-8
      if (character > 127) {
        // this one isn't ASCII
      }
    }
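Since the original question was about pointing the data-entry team at the offending characters, the decoding loop above could be extended to report where each non-ASCII character sits. The following is a sketch under assumptions: `NonAsciiLocator` and `locate` are hypothetical names, the file is assumed to be line-oriented text, and the column is counted in UTF-16 code units, which may differ from what an editor displays for characters outside the Basic Multilingual Plane.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, named for this example only.
final class NonAsciiLocator {

    /** Returns one description per non-ASCII character, with 1-based line and column. */
    static List<String> locate(Reader reader) throws IOException {
        List<String> hits = new ArrayList<>();
        BufferedReader lines = new BufferedReader(reader);
        String line;
        int lineNo = 0;
        while ((line = lines.readLine()) != null) {
            lineNo++;
            for (int i = 0; i < line.length(); ) {
                int codePoint = line.codePointAt(i);
                if (codePoint > 127) {
                    // anything above 127 needs more than one byte in UTF-8
                    hits.add(String.format(Locale.ROOT,
                            "line %d, column %d: U+%04X", lineNo, i + 1, codePoint));
                }
                // step over both halves of a surrogate pair, if there is one
                i += Character.charCount(codePoint);
            }
        }
        return hits;
    }
}
```

You would call it with something like `NonAsciiLocator.locate(new InputStreamReader(new FileInputStream(file), "UTF-8"))` and print the returned list.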
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    DrClap wrote:
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Assuming the file contains valid UTF-8.
    If the file contains invalid byte sequences (http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences) you get UTFDataFormatException.
  • DrClap
    DrClap Member Posts: 25,479
    tschodt wrote:
    DrClap wrote:
    If you want to scan a sequence of bytes from a file and identify locations where a subsequence of two or more bytes would represent a UTF-8 character which isn't in the ASCII character set, then the simplest way is to ask Java to do the UTF-8 part:
    Assuming the file contains valid UTF-8.
    If the file contains invalid byte sequences (http://en.wikipedia.org/wiki/Utf-8#Invalid_byte_sequences) you get UTFDataFormatException.
    That's true, and certainly a possibility. But we don't know whether that's the OP's problem. All we have is some guff about "multi-byte" characters. If I were doing this -- well I wouldn't be doing this because I would get around to asking the right questions -- I would start with that, then if it threw an exception I would change it to count the number of characters read before the exception was thrown.
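One way to find out whether invalid UTF-8 is actually the problem is to decode strictly and note where decoding fails. The sketch below uses a swapped-in technique rather than the one suggested above: it reports the byte offset of the first bad sequence instead of counting characters read before an exception. It relies on `CharsetDecoder` with `CodingErrorAction.REPORT`, because as far as I can tell a plain `InputStreamReader` substitutes U+FFFD for malformed input rather than throwing. The names `StrictUtf8Check` and `firstInvalidOffset` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for this example only.
final class StrictUtf8Check {

    /** Returns the byte offset of the first invalid UTF-8 sequence, or -1 if the data is valid. */
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true: a truncated sequence at the end counts as malformed
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // on error the input position is left at the start of the bad sequence
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard the decoded chars and keep going
        }
    }
}
```

This version takes a `byte[]`, so it assumes the file is small enough to hold in memory. For larger files the same decoder could be fed chunk by chunk, at the cost of some extra bookkeeping.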
  • 796440
    796440 Member Posts: 19,179 Gold Trophy
    corlettk wrote:
    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte array to hold the whole file's contents.
    Erm?
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    jverd wrote:
    corlettk wrote:
    3. If you read the API doco for File.length() you'll find that it specifically warns against using it to size a byte array to hold the whole file's contents.
    Erm?
    I think he was thinking of InputStream.available(). I wouldn't use File.length() either way, because then you can't use it against pipes, etc.
  • 807580
    807580 Member Posts: 33,048 Green Ribbon
    I don't see one.

    But it should be obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
  • 796440
    796440 Member Posts: 19,179 Gold Trophy
    BalusC wrote:
    I don't see one.

    But it should be obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
    It's not the byte array that's the problem in that case, but rather the fact that you're going to read the whole file. However, if you do decide to read the whole file, and if you know it's a regular file, not a pipe or something, and if you can assume that the size won't change while you're reading it, then declaring a byte[] of exactly the file's length would be a good way to do it.

    It's not the way I'd normally read a file, but I wouldn't rule it out.
  • DrClap
    DrClap Member Posts: 25,479
    BalusC wrote:
    But it should be obvious that you shouldn't declare a byte array whose length is exactly the file's length. That's a lot of fun if the file length exceeds the available JVM heap memory.
    Even more fun if the file length exceeds the maximum value of an integer and it gets truncated to fit in an integer, which you then use as the size of your array.
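The truncation hazard mentioned above is easy to guard against if you do decide to read a whole regular file into one array. A minimal sketch, assuming a regular file whose size does not change while it is read; `WholeFileReader`, `checkedArraySize` and `readAll` are hypothetical names. A plain `(int)` cast keeps only the low 32 bits, so a length of 4294967301 would silently become 5.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper, named for this example only.
final class WholeFileReader {

    /** Converts a file length to an array size, refusing lengths that do not fit in an int. */
    static int checkedArraySize(long fileLength) {
        if (fileLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "File too large for a single byte[]: " + fileLength + " bytes");
        }
        return (int) fileLength;
    }

    /** Reads a regular file fully into memory; assumes its size is stable while reading. */
    static byte[] readAll(File file) throws IOException {
        byte[] buffer = new byte[checkedArraySize(file.length())];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            // readFully keeps reading until the buffer is full or throws EOFException
            in.readFully(buffer);
        }
        return buffer;
    }
}
```

This only addresses the integer overflow. A file that fits in an int can still exceed the available heap, so the earlier point about not reading huge files in one go still applies.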
This discussion has been closed.