
    Parse large (2GB) text file

    860177
      Hi all,

      I would like your expert tips on efficient ways (speed and memory considerations) for parsing large text files (~2GB) in Java.

      Specifically, the text files in hand contain mobility traces that I need to process. Traces have a predefined format, and each trace is given on a new line in the text file.
      To obtain the attributes of each trace I use java.util.regex.Pattern and java.util.regex.Matcher.

      Thanks in advance,
      Nick
        • 1. Re: Parse large (2GB) text file
          796440
          857174 wrote:
          Hi all,

          I would like your expert tips on efficient ways (speed and memory considerations) for parsing large text files (~2GB) in Java.

          Specifically, the text files in hand contain mobility traces that I need to process. Traces have a predefined format, and each trace is given on a new line in the text file.
           To obtain the attributes of each trace I use java.util.regex.Pattern and java.util.regex.Matcher.
          Sounds like you've got it solved. Are you having a particular problem? As long as you're using BufferedReader rather than reading a byte at a time from a lower-level InputStream or Reader, and don't have a horribly backtracking regex, there's not much you can do to make it any faster. And as long as you're reading and processing a line or smallish group of lines at a time, as opposed to reading the whole file into memory at once, you've got the memory issue knocked.

          It's difficult to offer any more detailed advice without a more detailed question or a specific problem that you've observed.
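
           For instance, the usual skeleton (the file name here is just a placeholder) only ever holds one line in memory at a time:
           {code}
           BufferedReader in = new BufferedReader(new FileReader("traces.txt"));
           String line;
           while ((line = in.readLine()) != null) {
               // parse/match this single line, then move on to the next
           }
           in.close();
           {code}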
          • 2. Re: Parse large (2GB) text file
            860177
            Hi jverd,

            Thanks for the prompt reply.

             So far I had the following generic solution, which worked fine for small mobility trace files (~100MB), until today, when I tried to parse large trace files and got a Java out-of-memory error. As you can see, the problem can be traced to the CharBuffer, since decoding tries to load the whole file into memory.


            File file = new File( conf.getMobilityFile() );
            FileInputStream fileInputStream = new FileInputStream( file );
            FileChannel fileChannel = fileInputStream.getChannel();

            ByteBuffer byteBuffer = fileChannel.map( FileChannel.MapMode.READ_ONLY,  0, ( int )fileChannel.size() );
            Charset charSet = Charset.forName( "8859_1" ); //$NON-NLS-1$
            CharsetDecoder charsetDecoder = charSet.newDecoder();
            CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );

             /* Read list of Node Initial Location */
             String regex = "......";
             Pattern pattern = Pattern.compile( regex );

             /* Run Pattern Matching */
             Matcher matcher = pattern.matcher( charBuffer );

             while( matcher.find() ) {
                 // Do some processing
             }

            I would like to retain the RegEx and Matching part but change the file reading method.

            Thanks.

            Edited by: 857174 on May 6, 2011 1:33 PM
            • 3. Re: Parse large (2GB) text file
              baftos
              I had to face a similar problem a few years ago. One important consideration was speed: my program had to run once a week overnight, so a few hours of processing was not a problem. I kept it as simple as possible: read the whole thing into memory, populate some collections, and start processing until the CPU smelled like BBQ. Of course I made sure the collections could accommodate enough data for the foreseeable future, and that 4 hours vs. 8 hours of processing would not matter. I don't recommend writing dumb programs, but if you are not too restricted, simplicity is a quality. Sure, a database would have been better, but the program was doing exactly this: populating a database with processed data originating from files over which I had no control.

              Edit: After reading your reply to jverd, I have to emphasize that I was not reading the whole file in memory, but one line at a time.

              Edited by: baftos on May 6, 2011 4:49 PM
              • 4. Re: Parse large (2GB) text file
                jduprez
                So far I had the following generic solution, which worked fine for small mobility trace files (~100MB), until today, when I tried to parse large trace files and got a Java out-of-memory error. As you can see, the problem can be traced to the CharBuffer, since decoding tries to load the whole file into memory.
                As specified by the API Javadoc for CharsetDecoder.decode(...), this method tries to read and decode the whole byte buffer, which apparently won't fit in your program's memory.
                Instead, use a BufferedReader around a FileReader. You'll find ways in the java.io package to specify a specific charset encoding.
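
                For example, something along these lines (the file name is just a placeholder) keeps the ISO-8859-1 encoding from your snippet while letting you read line by line:
                {code}
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader( new FileInputStream( "mobility.tr" ),
                                               Charset.forName( "8859_1" ) ) );
                {code}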

                Regards,

                J.

                P.S.: I didn't know Apple had published the format of their iPhone's location-tracing file :o)
                • 5. Re: Parse large (2GB) text file
                  860177
                  jduprez wrote:

                  P.S.: I didn't know Apple had published the format of their iPhone's location-tracing file :o)
                  Haha...It's a 2GB file ... Apple wouldn't track that much info :P
                  • 6. Re: Parse large (2GB) text file
                    YoungWinston
                    857174 wrote:
                    Haha...It's a 2GB file ... Apple wouldn't track that much info :P
                    One thing to think about is that 2Gb (Edit: should have said 2 billion entries) is the absolute limit of any array in Java - and therefore any class that uses an array as its backing storage.

                    Therefore:
                    ByteBuffer byteBuffer = fileChannel.map(
                       FileChannel.MapMode.READ_ONLY, 0,
                       ( int )fileChannel.size() );
                    may well be your limiting factor.

                     (In fact, the docs for the method explicitly say so: "size - The size of the region to be mapped; must be non-negative and no greater than Integer.MAX_VALUE".)

                     What I don't understand is that if your Traces are held in lines, why don't you just use BufferedReader.readLine()? 2 billion lines is a hell of a lot more than 2 billion bytes.

                     Of course you may run into other memory problems then :-).

                    Winston

                    Edited by: YoungWinston on May 7, 2011 3:20 PM
                    • 7. Re: Parse large (2GB) text file
                      860177
                       YoungWinston wrote:
                      Of course you may run into other memory problems then :-).

                      Winston
                      Hi Winston,

                       What kind of other memory problems might I run into?

                       Btw, how can I use regex as I currently do, with a BufferedReader?

                      Nick
                      • 8. Re: Parse large (2GB) text file
                        796440
                        857174 wrote:
                        Of course you may run into other memory problems then :-).

                        Winston
                        Hi Winston,

                         What kind of other memory problems might I run into?

                         Btw, how can I use regex as I currently do, with a BufferedReader?

                        Nick
                         You don't use regex with a BufferedReader. You use regex to examine and manipulate Strings; the regex doesn't know or care whether the String came from a BufferedReader. I haven't kept up on this thread in detail, but my understanding from your initial post is that your data is line-based and you can process each line individually. So you read a line, then match it against the regex or do whatever you need to do. Then read another line and repeat. Keep doing that until you've processed the entire file.
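
                         In outline, something like this (the variable names are just placeholders for whatever you already have):
                         {code}
                         Pattern pattern = Pattern.compile(regex);   // your existing per-line regex
                         BufferedReader in = new BufferedReader(new FileReader(file));
                         String line;
                         while ((line = in.readLine()) != null) {
                             Matcher matcher = pattern.matcher(line);
                             if (matcher.find()) {
                                 // extract the groups for this one trace and process them
                             }
                         }
                         in.close();
                         {code}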
                        • 9. Re: Parse large (2GB) text file
                          sabre150
                           I'm a great fan of regex, but I question its use in this context. Before deciding that regex is the solution to the parsing, I would need to know what the format of each line is, what you are extracting from each line, and what you are going to do with the extracted components.
                          • 10. Re: Parse large (2GB) text file
                            860177
                            Hi Sabre,

                            these traces are from a mobility simulator.

                            The format is something like this:

                            tr at 2.0 "$obj(1) pos 123.20 270.98 0.0 2.4"

                            where:
                             2.0 : time (double)
                             123.20 : X coordinate (double)
                             270.98 : Y coordinate (double)
                             0.0 : Z coordinate (double)
                             2.4 : velocity (double)


                             I parse the node positions in order to come up with some mobility estimates.
                            • 11. Re: Parse large (2GB) text file
                              802316
                               Memory-mapped files are faster when you need random access and don't need to load all the data; here, however, they just add complexity you don't need, IMHO.
                               I suspect most of the time is taken by the parser, so a customised parser could be faster. Here is a simple custom parser:
                               {code}
                               // imports needed: java.io.*, java.util.Date
                               public static void main(String... args) throws IOException {
                                   String template = "tr at %.1f \"$obj(1) pos 123.20 270.98 0.0 2.4\"%n";
                                   File file = new File("/tmp/deleteme.txt");

                                   // Generate a ~2 GB sample file of trace lines (comment this block out if the file already exists).
                                   System.out.println(new Date() + ": Writing to " + file);
                                   PrintWriter pw = new PrintWriter(file);
                                   for (int i = 0; i < Integer.MAX_VALUE / template.length(); i++)
                                       pw.printf(template, i / 10.0);
                                   pw.close();
                                   System.out.println(new Date() + ": ... finished writing to " + file + " length= " + file.length() / 1024 / 1024 + " MB.");

                                   // Read it back line by line, pulling the five fields out by index.
                                   long start = System.nanoTime();
                                   final BufferedReader br = new BufferedReader(new FileReader(file), 64 * 1024);
                                   for (String line; (line = br.readLine()) != null; ) {
                                       int pos = 6;                               // first digit of the time, just after "tr at "
                                       int end = line.indexOf(' ', pos);
                                       double time = Double.parseDouble(line.substring(pos, end));

                                       pos = line.indexOf('s', end + 12) + 2;     // 's' of "pos"; +2 lands on the X field
                                       end = line.indexOf(' ', pos + 1);
                                       double x = Double.parseDouble(line.substring(pos, end));

                                       pos = end + 1;
                                       end = line.indexOf(' ', pos + 1);
                                       double y = Double.parseDouble(line.substring(pos, end));

                                       pos = end + 1;
                                       end = line.indexOf(' ', pos + 1);
                                       double z = Double.parseDouble(line.substring(pos, end));

                                       pos = end + 1;
                                       end = line.indexOf('"', pos + 1);
                                       double velocity = Double.parseDouble(line.substring(pos, end));
                                   }
                                   br.close();

                                   long time = System.nanoTime() - start;
                                   System.out.printf(new Date() + ": Took %,f sec to read %s%n", time / 1e9, file.toString());
                               }
                              {code}
                              prints
                              {code}
                              Sun May 08 09:38:02 BST 2011: Writing to /tmp/deleteme.txt
                              Sun May 08 09:42:15 BST 2011: ... finished writing to /tmp/deleteme.txt length= 2208 MB.
                              Sun May 08 09:43:21 BST 2011: Took 66.610883 sec to read /tmp/deleteme.txt
                               {code}
                              • 12. Re: Parse large (2GB) text file
                                sabre150
                                857174 wrote:
                                these traces are from a mobility simulator.

                                The format is something like this:
                                 Since the line format is very simple, it looks as if regex is going to be the best approach for ease of use. For speed, I suspect you might do slightly better with a very simple hand-crafted parser, but the difference is likely to be very small. In your position I would start with the regex approach; if it did present a performance problem, it would be trivial to switch to a hand-crafted parser.
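
                                 For the line you posted, an (untested) pattern along these lines would hand you the five fields as groups:
                                 {code}
                                 Pattern trace = Pattern.compile(
                                     "tr at (\\S+) \"\\$obj\\((\\d+)\\) pos (\\S+) (\\S+) (\\S+) (\\S+)\"");

                                 Matcher m = trace.matcher("tr at 2.0 \"$obj(1) pos 123.20 270.98 0.0 2.4\"");
                                 if (m.matches()) {
                                     double time     = Double.parseDouble(m.group(1));
                                     int    node     = Integer.parseInt(m.group(2));
                                     double x        = Double.parseDouble(m.group(3));
                                     double y        = Double.parseDouble(m.group(4));
                                     double z        = Double.parseDouble(m.group(5));
                                     double velocity = Double.parseDouble(m.group(6));
                                 }
                                 {code}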
                                • 13. Re: Parse large (2GB) text file
                                  YoungWinston
                                  857174 wrote:
                                   What kind of other memory problems might I run into?
                                   Well, your example line is 45 characters long, and a character in Java is 2 bytes.
                                   45 x 2G x 2 = 180GB
                                   Do you have that much memory on your machine?

                                   But if you can process the lines one at a time, the overhead is ~90 bytes per line. Quite a difference.

                                  The problem comes if your parser needs to "remember" information from previous lines; or if there is cross-related data in the file. Then, I suspect, you should probably use a database (and I might ask the suppliers of this enormous 'flat-file' why they didn't do that in the first place).

                                  Winston
                                  • 14. Re: Parse large (2GB) text file
                                    860177
                                     OK, I changed the code to use a BufferedReader and I realized that I have a problem :)

                                     My mistake, I forgot to mention that the trace file contains some initialization data for each object. So in essence the trace file looks like this:

                                     $obj(1) set X_ 100.0
                                     $obj(1) set Y_ 200.0
                                     $obj(1) set Z_ 0.0
                                     $obj(2) set X_ 150.0
                                     $obj(2) set Y_ 250.0
                                     $obj(2) set Z_ 0.0
                                    tr at 2.0 "$obj(1) pos 123.20 270.98 0.0 2.4"
                                    tr at 2.0 "$obj(2) pos 122.10 210.82 0.0 2.1"

                                     Apparently, my code was working before because I was using an InputStream and the regex Matcher was set up to handle the initialization traces as multi-line Strings. In essence I had two regexes to match: one for the init data (with Pattern.MULTILINE) and one for the rest of the mobility traces. Since BufferedReader reads line by line, the init data are not matched and are therefore skipped.
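
                                     Roughly speaking (simplified for illustration, these are not my exact patterns), the init regex spans the three "set" lines of each object, which is why it can never match a single line returned by readLine():
                                     {code}
                                     // Simplified illustration only -- not the real patterns.
                                     Pattern initPattern = Pattern.compile(
                                           "\\$obj\\((\\d+)\\) set X_ (\\S+)\\s+"
                                         + "\\$obj\\(\\1\\) set Y_ (\\S+)\\s+"
                                         + "\\$obj\\(\\1\\) set Z_ (\\S+)",
                                         Pattern.MULTILINE);

                                     // This can match against the whole CharBuffer, but never against one line at a time.
                                     {code}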

                                    Any suggestions on how to overcome this?