5 Replies Latest reply: Feb 25, 2013 7:10 AM by andy dufresne

    Converting a InputStream to a Reader

    andy dufresne
      I want to virtually split a huge text file, which may be in any of several character encodings (ASCII, UTF-8, UTF-16, etc.). To do this, I determine the file length in bytes and divide it by the number of parts I need. Since the text file contains records, I then need to find a valid record boundary after skipping splitFileSize bytes. The code looks roughly like this:
                  int startBytePointer = 47;
                  CountingInputStream countingInputStream = new CountingInputStream(new FileInputStream(inputFilePath.toFile()));
                  long actualSkipped = countingInputStream.skip(startBytePointer);
                  Reader reader = new InputStreamReader(countingInputStream, getCharSetDecoder());
                  char delimiter = '\n';
                  int readCharacter; // int, so that the -1 end-of-stream marker is not lost in a cast to char
                  while ((readCharacter = reader.read()) != -1 && readCharacter != delimiter) {
                      logger.debug("Bytes Read : {}", countingInputStream.getCount());
                  }
      In the above code, I use CountingInputStream (from Guava) to determine the byte count. The logger prints 8237 for every character read. This is because sun.nio.cs.StreamDecoder (which InputStreamReader uses internally) has a default buffer size of 8192 bytes (8 KB), and there is no way to override it. Because the underlying InputStream's position has already moved ahead, I cannot determine the record boundary.

      How do I convert an InputStream to a Reader that does not buffer ahead?
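The read-ahead described above can be reproduced without Guava. The following is a minimal sketch, not the poster's code: `ByteCounter` is a hypothetical stand-in for CountingInputStream, and the input is a small in-memory array, so the reported count is bounded by the array length rather than by the decoder's buffer size.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {

    /** Hypothetical stand-in for a counting stream: counts the bytes handed to the caller. */
    static final class ByteCounter extends FilterInputStream {
        long count;

        ByteCounter(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = in.read();
            if (b != -1) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Reads one character through an InputStreamReader and returns how many bytes were pulled from the stream. */
    static long bytesConsumedForOneChar(byte[] data) {
        ByteCounter counter = new ByteCounter(new ByteArrayInputStream(data));
        try {
            Reader reader = new InputStreamReader(counter, StandardCharsets.US_ASCII);
            reader.read(); // asks for a single char, but the decoder fills its byte buffer first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.count;
    }

    public static void main(String[] args) {
        byte[] data = "first record\nsecond record\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("Bytes consumed for one char: " + bytesConsumedForOneChar(data));
    }
}
```

The count is already past the first character after a single read(), which is why a byte counter underneath an InputStreamReader cannot locate a record boundary.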
        • 1. Re: Converting a InputStream to a Reader
          jtahlborn
          you need to create a custom InputStream which "limits" the amount of data returned from the underlying stream, so that the Reader can only read a single record at a time.
          • 2. Re: Converting a InputStream to a Reader
            andy dufresne
            Didn't completely understand your suggestion.

            By a custom InputStream, do you mean I should override the read() method that accepts a byte array, and always read only one byte in it? The reader (StreamDecoder) would still expect 8192 bytes, right?
            • 3. Re: Converting a InputStream to a Reader
              jtahlborn
              well, that would be very inefficient. why would you return a single byte if more are available? by "limit" i meant that the custom stream allows at most record_size bytes to be read from the underlying stream. that would most likely involve overriding all the read methods. since you are already using guava, you could use ByteStreams.limit:

              http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/io/ByteStreams.html#limit%28java.io.InputStream,%20long%29
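To illustrate the "limit" idea, here is a minimal stdlib-only sketch. `BoundedInputStream` and `decodePrefix` are hypothetical names, not Guava's implementation; with Guava on the classpath, ByteStreams.limit(in, limit) is meant to fill the same role. The point is that the Reader may buffer as aggressively as it likes, because the wrapped stream reports end-of-stream once the byte budget is used up.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LimitDemo {

    /** Hypothetical limiting stream: exposes at most `limit` bytes of the wrapped stream. */
    static final class BoundedInputStream extends FilterInputStream {
        private long remaining;

        BoundedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = in.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            // Never ask the underlying stream for more than the remaining budget.
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = in.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public boolean markSupported() {
            return false; // mark/reset would bypass the byte budget in this sketch
        }
    }

    /** Decodes only the first `byteLimit` bytes of `data`, regardless of how far the Reader buffers ahead. */
    static String decodePrefix(byte[] data, long byteLimit, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new BoundedInputStream(new ByteArrayInputStream(data), byteLimit), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = "rec1\nrec2\nrec3\n".getBytes(StandardCharsets.UTF_8);
        System.out.print(decodePrefix(data, 10, StandardCharsets.UTF_8));
    }
}
```

The limit has to fall on a character boundary in the chosen encoding; otherwise the decoder sees a truncated byte sequence at the end of the range.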
              • 4. Re: Converting a InputStream to a Reader
                EJP
                Get rid of the CountingInputStream and write yourself a FilterReader subclass that does the counting directly.
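A minimal sketch of that suggestion, assuming a java.io.FilterReader subclass (`CountingReader` and `charsThroughDelimiter` are hypothetical names). Note that it counts Java chars, not bytes: the two are equal for single-byte encodings such as ASCII, and UTF-16 uses two bytes per Java char (plus any byte order mark), but for UTF-8 a byte offset would have to be derived by re-encoding the characters read.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CountingReaderDemo {

    /** Hypothetical FilterReader subclass that counts the chars handed to the caller. */
    static final class CountingReader extends FilterReader {
        private long charCount;

        CountingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c != -1) {
                charCount++;
            }
            return c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                charCount += n;
            }
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(n);
            charCount += skipped;
            return skipped;
        }

        long getCharCount() {
            return charCount;
        }
    }

    /** Returns the number of chars consumed up to and including the first delimiter, or -1 if there is none. */
    static long charsThroughDelimiter(Reader source, char delimiter) {
        try (CountingReader reader = new CountingReader(source)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == delimiter) {
                    return reader.getCharCount();
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(charsThroughDelimiter(new StringReader("abc\ndef\n"), '\n'));
    }
}
```

Because the counter sits above the decoder, the decoder's internal byte buffering no longer affects the count.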
                • 5. Re: Converting a InputStream to a Reader
                  andy dufresne
                  Here is the code that skips a calculated number of bytes and then reads the input file byte by byte to find the record delimiter. Let me know if you see any issues with it, especially given that the input file could be in different character encodings.
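                            // imports: java.io.FileInputStream, java.util.Arrays,
                            //          com.google.common.io.ByteStreams, com.google.common.io.CountingInputStream
                            {
                              // try-with-resources closes the stream even if the search throws
                              try (CountingInputStream countingInputStream = new CountingInputStream(new FileInputStream(inputFilePath.toFile()))) {
                                  long endPointer = -1; // byte offset just past the last delimiter found; -1 until one is found
                                  while (true) {
                                      long actualSkipped = countingInputStream.skip(skipCount);
                                      if (actualSkipped == 0) {
                                          logger.info("Nothing to skip");
                                          break; // nothing to skip now
                                      }

                                      // Fill the comparison window. A single read(byte[]) call may return fewer
                                      // bytes than requested, so use Guava's ByteStreams.read, which keeps reading
                                      // until the array is full or the stream ends.
                                      byte[] inputBytes = new byte[recordDelimiterBytes.length];
                                      int noOfBytesRead = ByteStreams.read(countingInputStream, inputBytes, 0, inputBytes.length);
                                      if (noOfBytesRead < inputBytes.length) {
                                          // end of file already reached!
                                          endPointer = countingInputStream.getCount();
                                          break;
                                      }
                                      // Slide the window one byte at a time until it equals the delimiter.
                                      while (!Arrays.equals(recordDelimiterBytes, inputBytes)) {
                                          shiftLeft(inputBytes);
                                          int readByte = countingInputStream.read();
                                          if (readByte == -1) {
                                              throw new IllegalStateException("EOF reached before getting the delimiter");
                                          }
                                          inputBytes[inputBytes.length - 1] = (byte) readByte;
                                      }
                                      endPointer = countingInputStream.getCount();
                                  }
                              }
                            }

                            // Shifts every byte one position to the left; the caller then overwrites the last slot.
                            private void shiftLeft(byte[] inputBytes) {
                                for (int i = 0; i < inputBytes.length - 1; i++) {
                                    inputBytes[i] = inputBytes[i + 1];
                                }
                            }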
                  Once the start and end pointers are noted, I create an InputStream using the Guava method you suggested (ByteStreams.limit) and restrict each reader to the input file up to a specific byte count.
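On the encoding question raised in this post, one point seems worth checking: a byte-by-byte delimiter search is safe for ASCII and UTF-8, where the byte 0x0A never occurs inside a multi-byte sequence, but in UTF-16 the delimiter's bytes can also appear straddling two adjacent characters. The sketch below is an illustration under stated assumptions, not part of the posted code: `indexOfAligned` is a hypothetical helper, and offsets are assumed to be counted from a position that is itself code-unit aligned.

```java
import java.nio.charset.StandardCharsets;

class DelimiterSearchDemo {

    /**
     * Returns the offset of the first occurrence of `delimiter` in `data` at or after `from`,
     * testing only offsets that are a multiple of `unitSize` (1 for ASCII/UTF-8, 2 for UTF-16).
     * Returns -1 if there is no such occurrence.
     */
    static int indexOfAligned(byte[] data, byte[] delimiter, int from, int unitSize) {
        // Round `from` up to the next code-unit boundary.
        int start = ((from + unitSize - 1) / unitSize) * unitSize;
        for (int i = start; i + delimiter.length <= data.length; i += unitSize) {
            boolean match = true;
            for (int j = 0; j < delimiter.length; j++) {
                if (data[i + j] != delimiter[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // In UTF-16BE, U+0100 encodes as 01 00 and U+0A05 as 0A 05, so the newline
        // bytes 00 0A also appear at offset 1, straddling the two characters.
        byte[] data = "\u0100\u0A05\n".getBytes(StandardCharsets.UTF_16BE);
        byte[] delimiter = "\n".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(indexOfAligned(data, delimiter, 0, 1)); // byte-wise search: false match at 1
        System.out.println(indexOfAligned(data, delimiter, 0, 2)); // code-unit-aligned search: real newline at 4
    }
}
```

For the splitting code this suggests rounding skipCount to a multiple of the code unit size and advancing the comparison window by whole code units when the file is UTF-16.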