33 Replies. Latest reply: May 9, 2011 9:39 AM by maheshguruswamy
      • 15. Re: Parse large (2GB) text file
        802316
        If you want your parsing to be fast, avoid using regular expressions. When I tried the code I gave you with a Scanner it was 30x slower.
        I suggest you try parsing the lines yourself; I would still use the approach I took in my example.
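        As a rough illustration of "parsing the lines yourself", here is a minimal sketch that pulls the fields out of a 'tr' line with indexOf, substring and StringTokenizer, with no java.util.regex involved. The class name HandParsedTrace and its fields are hypothetical, and the line layout is assumed from the sample lines quoted later in this thread; it is not the code from the earlier example.

```java
import java.util.StringTokenizer;

/** Hypothetical helper (not from the thread): parses one 'tr' line without java.util.regex. */
final class HandParsedTrace {
    final double time;
    final int nodeId;
    final double[] values; // the numbers after "pos", in file order

    private HandParsedTrace(double time, int nodeId, double[] values) {
        this.time = time;
        this.nodeId = nodeId;
        this.values = values;
    }

    /** Returns null when the line does not look like: tr at T "$obj(N) pos a b c d". */
    static HandParsedTrace parse(String line) {
        if (!line.startsWith("tr at ")) {
            return null;
        }
        int quote = line.indexOf('"');
        int open = line.indexOf("$obj(", quote);
        int close = line.indexOf(')', open);
        int lastQuote = line.lastIndexOf('"');
        if (quote < 0 || open < 0 || close < 0 || lastQuote <= close) {
            return null;
        }
        // "tr at " is 6 characters long; the time sits between it and the opening quote.
        double time = Double.parseDouble(line.substring(6, quote).trim());
        int nodeId = Integer.parseInt(line.substring(open + 5, close));

        // Whitespace-separated tokens between ")" and the closing quote.
        StringTokenizer st = new StringTokenizer(line.substring(close + 1, lastQuote));
        if (!st.hasMoreTokens() || !st.nextToken().equals("pos")) {
            return null;
        }
        double[] values = new double[st.countTokens()];
        for (int i = 0; i < values.length; i++) {
            values[i] = Double.parseDouble(st.nextToken());
        }
        return new HandParsedTrace(time, nodeId, values);
    }
}
```

        A malformed number would still throw NumberFormatException here; whether to skip or report such lines is left open.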
        • 16. Re: Parse large (2GB) text file
          YoungWinston
          857174 wrote:
          My mistake, I forgot to mention that the trace file contains some initialization data for each object. So in essence the trace file looks like this:
          $obj(1) set X_ 100.0
          $obj(1) set Y_ 200.0
          $obj(1) set Z_ 0.0
          $obj(2) set X_ 150.0
          $obj(2) set Y_ 250.0
          $obj(2) set Z_ 0.0
          tr at 2.0 "$obj(1) pos 123.20 270.98 0.0 2.4"
          tr at 2.0 "$obj(2) pos 122.10 210.82 0.0 2.1"
          Apparently, my code before was working, since I was using an Input Stream, and RegEx Matcher was set to handle the initialization traces as MULTILINE Strings.
          I wouldn't do that. Your patterns are still single-line, you just have two of them.
          In essence I had two regexes to match, one for the init data...and one...for the rest of the mobility traces.
          OK, so write them. You might even use the one for the 'set' lines to create a list of "settings" (which look to me like they might fit nicely into a Properties object), and substitute the result of that into your 'tr' lines, maybe something like:
          tr at 2.0 "(x=100.0,y=200.0,z=0.0) pos 123.20 270.98 0.0 2.4"
          However you look at it though, this is now beyond the realms of a simple regex (although you could probably still use them to parse the individual line types).
          Since BufferedReader reads line by line, the init data are not matched and therefore skipped.
          That's got nothing to do with it. The fact is, you forgot these lines existed. So correct it.

          Winston
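          A minimal sketch of the "two single-line patterns" idea, assuming the line layout of the sample quoted above (the real file may differ, for example in how the object name is spelled). The class TwoLinePatterns and its members are hypothetical, and a plain HashMap stands in for the Properties object suggested above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: one precompiled pattern per line type, applied to one line at a time. */
final class TwoLinePatterns {
    // Assumed layout: $obj(1) set X_ 100.0
    static final Pattern SET_LINE =
            Pattern.compile("\\$obj\\((\\d+)\\)\\s+set\\s+([XYZ])_\\s+([\\d.]+)");
    // Assumed layout: tr at 2.0 "$obj(1) pos 123.20 270.98 ..."
    static final Pattern TR_LINE =
            Pattern.compile("tr\\s+at\\s+([\\d.]+)\\s+\"\\$obj\\((\\d+)\\)\\s+pos\\s+([\\d.]+)\\s+([\\d.]+)");

    // node id -> {x, y, z} collected from the 'set' lines
    final Map<Integer, double[]> initial = new HashMap<>();
    int traceCount;

    void accept(String line) {
        Matcher m = SET_LINE.matcher(line);
        if (m.find()) {
            double[] xyz = initial.computeIfAbsent(Integer.valueOf(m.group(1)), k -> new double[3]);
            // 'X' -> index 0, 'Y' -> index 1, 'Z' -> index 2
            xyz[m.group(2).charAt(0) - 'X'] = Double.parseDouble(m.group(3));
            return;
        }
        m = TR_LINE.matcher(line);
        if (m.find()) {
            traceCount++; // a real parser would update the node's trajectory here
        }
    }
}
```

          Each line read from the file would be handed to accept(); neither pattern needs the MULTILINE flag because neither ever sees more than one line.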
          • 17. Re: Parse large (2GB) text file
            sabre150
            857174 wrote:
            Apparently, my code before was working, since I was using an Input Stream, and RegEx Matcher was set to handle the initialization traces as MULTILINE Strings.
            I don't really understand this. There is nothing in Matcher or Pattern to handle the input from an InputStream. Can you post your code for this approach?
            • 18. Re: Parse large (2GB) text file
              sabre150
              Peter Lawrey wrote:
              If you want your parsing to be fast, avoid using regular expressions. When I tried the code I gave you with a Scanner it was 30x slower.
              I suggest you try parsing the lines yourself; I would still use the approach I took in my example.
              I would expect regex to be slower than Scanner, but I have trouble with it being 30x slower. Can you post the regex and the code that used it?
              • 19. Re: Parse large (2GB) text file
                EJP
                When I tried the code I gave you with a Scanner it was 30x slower.
                Why on God's earth would you use regex in conjunction with a Scanner? This is senseless. Use it with readLine(). And precompile the pattern please.
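                A minimal sketch of what "readLine() plus a precompiled pattern" could look like. The pattern is an assumption based on the sample 'tr' lines quoted earlier in the thread, and the class and method names are hypothetical. The Pattern is compiled once, and a single Matcher is reused through reset() rather than being created for every line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: one compiled Pattern, one reused Matcher, one readLine() loop. */
final class ReadLineWithPattern {
    // Compiled once, not once per line. Assumed layout: tr at 2.0 "$obj(1) pos ..."
    private static final Pattern TR_LINE =
            Pattern.compile("^tr\\s+at\\s+([\\d.]+)\\s+\"\\$obj\\((\\d+)\\)\\s+pos\\s+");

    static int countTraceLines(Reader source) throws IOException {
        int count = 0;
        Matcher m = TR_LINE.matcher("");
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (m.reset(line).find()) {
                    count++; // m.group(1) is the time, m.group(2) the node id
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        String sample = "$obj(1) set X_ 100.0\n"
                + "tr at 2.0 \"$obj(1) pos 123.20 270.98 0.0 2.4\"\n";
        System.out.println(countTraceLines(new StringReader(sample)));
    }
}
```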
                • 20. Re: Parse large (2GB) text file
                  YoungWinston
                  sabre150 wrote:
                  Peter Lawrey wrote:
                  If you want your parsing to be fast, avoid using regular expressions. When I tried the code I gave you with a Scanner it was 30x slower.
                  I suggest you try parsing the lines yourself; I would still use the approach I took in my example.
                  I would expect regex to be slower than Scanner, but I have trouble with it being 30x slower. Can you post the regex and the code that used it?
                  I wonder how much of that has to do with regexes and how much to do with Scanner. Mind you, I'm biased, because I loathe Scanner :-).

                  Seems to me that OP's problem is very simple:
                  1. Read lines.
                  2. Parse 'em.
                  Now we might argue about what the best method is to 'parse 'em', but BufferedReader.readLine() has been around for an awfully long time.

                  Winston
                  • 21. Re: Parse large (2GB) text file
                    860177
                    So here is my complete code as it was working before:
                        File file = new File( conf.getMobilityFile() );
                        FileInputStream fileInputStream = new FileInputStream( file );
                        FileChannel fileChannel = fileInputStream.getChannel();
                        
                        ByteBuffer byteBuffer = fileChannel.map( FileChannel.MapMode.READ_ONLY,  0, ( int )fileChannel.size() );
                        Charset charSet = Charset.forName( "8859_1" ); //$NON-NLS-1$
                        CharsetDecoder charsetDecoder = charSet.newDecoder();
                        CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );
                        
                        /* Read list of Node Initial Location */
                        String regex = "\\$obj_\\((\\d+)\\)\\s+set\\s+X\\_\\s+([\\d\\.]+)\\s+\\$obj_\\((\\d+)\\)\\s+set\\s+Y\\_\\s+([\\d\\.]+)\\s+\\$obj_\\((\\d+)\\)\\s+set\\s+Z\\_\\s+([\\d\\.]+)"; //$NON-NLS-1$
                        Pattern pattern = Pattern.compile( regex );
                        
                        /* Run Pattern Matching */
                        Matcher matcher = pattern.matcher( charBuffer );
                    
                        startTime = System.currentTimeMillis();
                        
                        while( matcher.find() ) {
                          nodeID = Integer.parseInt( matcher.group( 1 ) );
                          xCoord = Float.parseFloat( matcher.group( 2 ) );
                          yCoord = Float.parseFloat( matcher.group( 4 ) );      
                          
                          // Store the init coords for this node and continue ...
                        }  
                        
                        
                        regex = "^\\$tr_\\s+at\\s+([\\d\\.]+)\\s+\\\"\\$obj\\((\\d+)\\)\\s+pos\\s+([\\d\\.]+)\\s+([\\d\\.]+)\\s+([\\d\\.]+)\\\""; //$NON-NLS-1$
                        pattern = Pattern.compile( regex, Pattern.MULTILINE );
                        charBuffer.rewind();
                        matcher = pattern.matcher( charBuffer );
                    
                        boolean firstWrite = false;    
                        while( matcher.find() ) {
                          /* Get the current Time */
                          time = Float.parseFloat( matcher.group( 1 ) );
                    
                          /* Get node attributes */
                          nodeID = Integer.parseInt( matcher.group( 2 ) );      
                          xCoord = Float.parseFloat( matcher.group( 3 ) );
                          yCoord = Float.parseFloat( matcher.group( 4 ) );
                          speed =  Float.parseFloat( matcher.group( 5 ) );
                         
                          // Update node trajectory ...
                          
                        }
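                    One detail of the map() call above is worth checking against a file of this size: FileChannel.map() takes a long size, but a single mapping cannot be larger than Integer.MAX_VALUE bytes, and casting a size of 2 GiB or more to int wraps to a negative number. The arithmetic can be sketched as follows; the class and method names are hypothetical, and whether the real file crosses that limit is not known from the thread.

```java
/** Hypothetical sketch of the size arithmetic behind mapping a very large file. */
final class MapSizeCheck {
    /** Size of the next mapping window: at most Integer.MAX_VALUE bytes. */
    static long windowSize(long fileSize, long position) {
        return Math.min((long) Integer.MAX_VALUE, fileSize - position);
    }

    public static void main(String[] args) {
        long twoGiB = 2L * 1024 * 1024 * 1024;     // 2147483648 bytes
        System.out.println((int) twoGiB);           // the cast wraps to a negative int
        System.out.println(windowSize(twoGiB, 0));  // largest window that can be mapped at once
    }
}
```

                    Mapping in windows would also mean a match could straddle two windows, which is one more reason reading line by line is the simpler route.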
                    • 22. Re: Parse large (2GB) text file
                      802316
                      EJP wrote:
                      When I tried the code I gave you with a Scanner it was 30x slower.
                      Why on God's earth would you use regex in conjunction with a Scanner? This is senseless. Use it with readLine(). And precompile the pattern please.
                      Scanner uses regex for its delimiters. Do you know how to Scanner with using regex?

                      From the code for Scanner
                          // Pattern used to delimit tokens
                          private Pattern delimPattern;
                      
                          // Pattern found in last hasNext operation
                          private Pattern hasNextPattern;
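                      To make the point about delimiters concrete, here is a small sketch of a Scanner whose delimiter is a regex that swallows everything except digits and dots, so that only the numeric tokens of a line come out. The delimiter expression and the class name are assumptions for illustration; the Scanner code that was actually benchmarked has not been posted in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Hypothetical sketch: a Scanner always tokenizes with a regex delimiter. */
final class ScannerDelimiterDemo {
    static final String NON_NUMERIC = "[^0-9.]+"; // assumed delimiter, not from the thread

    /** Returns the runs of digits and dots found in the line, in order. */
    static List<String> numericTokens(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(line)) {
            sc.useDelimiter(NON_NUMERIC);
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    /** The delimiter a Scanner is using is exposed as a compiled Pattern. */
    static Pattern delimiterOf(Scanner sc) {
        return sc.delimiter();
    }
}
```

                      Tokens come back as strings here and would still need Double.parseDouble or Integer.parseInt; nextDouble() is avoided in the sketch because it depends on the Scanner's locale.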
                      • 23. Re: Parse large (2GB) text file
                        sabre150
                        I like using regex, but those regexes are overcomplicated, are written in a form that makes them unmaintainable, and are truly horrendous.

                        I think I will drop out of this thread.
                        • 24. Re: Parse large (2GB) text file
                          EJP
                          Scanner uses regex for its delimiters.
                          Exactly my point. You are using regex at least twice per line. Why would you do that when you could do it all once?
                          Do you know how to Scanner with using regex?
                          Please translate that into standard English.

                          Your sample of code copyright Oracle is pointless without explanation.
                          • 25. Re: Parse large (2GB) text file
                            EJP
                            So here is my complete code as it was working before:
                            Pretty much as expected. It would be more interesting if you posted it as it is now after all the help you've been given.
                            • 26. Re: Parse large (2GB) text file
                              802316
                              YoungWinston wrote:
                              Now we might argue about what the best method is to 'parse 'em', but BufferedReader.readLine() has been around for an awfully long time.
                              Also, it is surprisingly fast and unlikely to be an issue here. Performing just readLine() on a 2 GB file with the sample data format took about 6 seconds on my machine. (I would assume most of the file is in the disk cache.)
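                              A measurement of this kind could be reproduced with something like the sketch below, which does nothing but call readLine() until the file ends and reports the elapsed time. It is not the code used for the figure above; the class name is hypothetical, and the ISO-8859-1 charset is an assumption taken from the "8859_1" charset in the code posted earlier in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch: time a bare readLine() pass over a file. */
final class ReadLineTiming {
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            while (in.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]); // path to the trace file, given on the command line
        long start = System.nanoTime();
        long lines = countLines(file);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines in " + millis + " ms");
    }
}
```

                              Running it twice gives a feel for how much of the time is disk and how much is the loop itself, since the second run is more likely to be served from the disk cache.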
                              • 27. Re: Parse large (2GB) text file
                                802316
                                EJP wrote:
                                Please translate that into standard English.
                                That may be the problem.
                                Scanner uses regex for its delimiters.
                                Exactly my point. You are using double regex. Why would you do that when you could do it all with one?
                                I was using the regex in the scanner. I set the delimiter to ignore characters which were not needed.
                                • 28. Re: Parse large (2GB) text file
                                  YoungWinston
                                  857174 wrote:
                                  So here is my complete code as it was working before:...
                                  Which is still using that 'orrible FileChannel construct.

                                  Think about the problem:
                                  1. You have a bunch of lines to read.
                                  2. They are grouped into Traces, which have a set of rules, which look to me something like "0 or more 'set' lines, followed by 1 or more 'tr' lines".
                                  (It would be a hell of a lot simpler if whoever supplied this stuff for you also supplied 'Trace Start' and 'Trace End' lines to go with it; but that's by-the-by.)

                                  So, read lines in up to the end of a particular Trace, and then set about parsing them. I think I'd probably use something like a LinkedList<String> to store the lines for a particular Trace, but it's up to you.

                                  Winston
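                                  The grouping rule described above ("0 or more 'set' lines, followed by 1 or more 'tr' lines") can be sketched as below. This is only one reading of that rule: any line that does not start with "tr " is treated as a 'set' line, and a 'set' line arriving after at least one 'tr' line is taken to start the next Trace. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical sketch: split a stream of lines into Traces. */
final class TraceGrouper {
    static List<List<String>> group(Iterable<String> lines) {
        List<List<String>> traces = new ArrayList<>();
        List<String> current = new LinkedList<>();
        boolean seenTr = false;
        for (String line : lines) {
            boolean isTr = line.startsWith("tr ");
            if (!isTr && seenTr) {
                // A non-'tr' line after 'tr' lines closes the current Trace.
                traces.add(current);
                current = new LinkedList<>();
                seenTr = false;
            }
            current.add(line);
            seenTr |= isTr;
        }
        if (!current.isEmpty()) {
            traces.add(current);
        }
        return traces;
    }
}
```

                                  For a 2 GB file one would parse and discard each Trace as soon as it is complete rather than collect them all in a list as this sketch does.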
                                  • 29. Re: Parse large (2GB) text file
                                    EJP
                                    That may be the problem.
                                    It is the problem. I don't know what 'Do you know how to Scanner with using regex?' means. Nobody does. It is not standard English. It is ungrammatical nonsense.
                                    I was using the regex in the scanner.
                                    Ah, I see. So when you gave him code using BufferedReader and then said 'avoid using regular expressions. When I tried the code I gave you with a Scanner' you didn't mean that at all. You didn't mean the code you gave him plus Scanner, you meant different code with Scanner instead.

                                    Pardon me if I find all this far from clear.