2 Replies. Latest reply: Nov 21, 2007 1:59 PM by 807603

    Reading in Latin Extended-A character set from a text file

    807603
      Hello all,

      I am writing a small program that reads a text file containing special characters (beyond the ASCII character set) and converts them into "regular" characters. For example, I would read in a u with an accent and replace it with a plain u.

      Now, I realize that Unicode support is built into Java from the ground up, but it only goes so far: you still need the relevant character set to read the file. My code is as follows:

      InputStreamReader inStreamReader = new InputStreamReader(new FileInputStream("input.txt"), "ISO-8859-1");
      BufferedReader bufferedReader = new BufferedReader(inStreamReader);

      String line = null;
      StringBuffer buff = new StringBuffer();

      while ((line = bufferedReader.readLine()) != null) {
          char[] charArray = line.toCharArray();

          for (int i = 0; i < charArray.length; i++) {
              // index into the array; casting the whole array to int does not compile
              int x = (int) charArray[i];

              switch (x) {
                  case 224: // this is agrave .. we need to replace it with a
                      buff.append('a');
                      break;
                  case 230: // this is aelig .. we need to replace it with ae
                      buff.append("ae");
                      break;

                  ///////// and so on

      Since I am reading in as ISO-8859-1, this works up to unicode 255. For the rest of the characters, apparently I need a Latin Extended-A and Latin Extended-B character set. How can I get that installed on my Windows OS machine? I am using jdk 1.4.1 on Windows XP. Any help is appreciated.

      Thanks,
      -vkat
        • 1. Re: Reading in Latin Extended-A character set from a text file
          807603
          vkat wrote:
          Since I am reading in as ISO-8859-1, this works up to unicode 255. For the rest of the characters, apparently I need a Latin Extended-A and Latin Extended-B character set. How can I get that installed on my Windows OS machine? I am using jdk 1.4.1 on Windows XP. Any help is appreciated.
          If your file has characters outside ISO-8859-1's range (0 to 255), then it isn't ISO-8859-1 encoded. You need to know which encoding was used to store the file. It sounds like it may actually be Unicode text, in which case you need to know which encoding (UTF-8, UTF-16, etc.) was used.
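
To illustrate the point about encodings, here is a minimal sketch (not from the original thread) showing why a reader configured for ISO-8859-1 can never produce a character above 255: every byte becomes exactly one character in the 0 to 255 range. The class name `DecodeDemo`, the helper `decode`, and the sample character U+0151 are hypothetical choices for illustration, and the code assumes Java 6 or later rather than the JDK 1.4.1 mentioned in the question.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; names and sample data are illustrative only.
class DecodeDemo {
    // Decode raw bytes with the named charset and return the resulting string.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+0151 (o with double acute) lives in Latin Extended-A, outside ISO-8859-1.
        byte[] utf16 = "\u0151".getBytes(Charset.forName("UTF-16BE")); // two bytes: 0x01 0x51

        String right = decode(utf16, "UTF-16BE");   // one char, code 337
        String wrong = decode(utf16, "ISO-8859-1"); // two chars, one per byte

        System.out.println("UTF-16BE:   " + right.length() + " char(s), first code " + (int) right.charAt(0));
        System.out.println("ISO-8859-1: " + wrong.length() + " char(s), codes "
                + (int) wrong.charAt(0) + " and " + (int) wrong.charAt(1));
    }
}
```

Decoded with the wrong charset, the two bytes of a single UTF-16 character show up as two unrelated characters, which is why no `case` label above 255 could ever match in the original loop.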
          • 2. Re: Reading in Latin Extended-A character set from a text file
            807603
            Hi,

            I figured it out. I saved the input file as a Unicode-encoded file (using WordPad) and used UTF-16 when reading it in. Now I can read the correct Unicode values and parse them correctly!
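
For anyone landing here later, the fix described above might look roughly like the sketch below. It is not the poster's actual code: the class `Utf16Fold`, the methods `fold` and `readAndFold`, and the four sample mappings are hypothetical. It assumes the file begins with a byte-order mark; Java's "UTF-16" charset uses the mark to pick the byte order when decoding and falls back to big-endian when none is present.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical sketch; only a few mappings are shown.
class Utf16Fold {
    // Replace a handful of accented characters with plain ASCII; others pass through.
    static String fold(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (c) {
                case 0x00E0: out.append('a');  break; // a with grave (Latin-1)
                case 0x00E6: out.append("ae"); break; // ae ligature (Latin-1)
                case 0x0151: out.append('o');  break; // o with double acute (Latin Extended-A)
                case 0x0153: out.append("oe"); break; // oe ligature (Latin Extended-A)
                default:     out.append(c);
            }
        }
        return out.toString();
    }

    // Read a UTF-16 file line by line and fold each line.
    static String readAndFold(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-16"));
        try {
            StringBuilder result = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(fold(line)).append('\n');
            }
            return result.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file; Java's "UTF-16" encoder emits a byte-order mark.
        File file = File.createTempFile("input", ".txt");
        file.deleteOnExit();
        Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-16");
        writer.write("b\u0151r \u0153uvre\n");
        writer.close();

        System.out.print(readAndFold(file));
    }
}
```

The only change relative to the code in the question is the charset name passed to `InputStreamReader`; once the bytes are decoded correctly, the `switch` can match code points above 255 in the same way it matched 224 and 230.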

            Thanks,
            -Vamsi