9 Replies Latest reply: Nov 5, 2008 7:00 AM by 800308 RSS

    Regex Optimization

    843785
      I was tasked with writing a regex that matches the following input.
      Vaild input egs:
      23+:34
      23-23:34
      23+:34
      23.3-23.3:34.4
      34.3+:34.4
      The input can contain lines in the form "number-number:number" or "number+:number" . The last line should be in the form number+:number. The regex should ignore any whitespaces or newlines in between the number and the symbols ( +,:) and any trailing spaces or newlines.
      The following regex seems to do the job
      ^((\s*\d+(\.\d+)?\s*-\s*\d+(\.\d+)?\s*:\s*\d+(\.\d+)?(\s*\n*)+))*(\s*\n*)*\s*\d+(\.\d+)?\s*\+\s*:\s*\d+(\.\d+)?(\s*\n*)*$
      However i run into a little performance problem while using it . It matches the correct input fairly quickly , however when the input us multiple lines of the form "number-number:number" without the line "number+:number" as the last line , the program seems to take much time depending on the number of lines in the input . I'm attaching a piece of code that'll illustrate the problem ,
      thx.
      public class Rex
        {
      
          /**
           * @param args
           * void
           */
          public static void main(String[] args)
            {
              String ip1="22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 34+:34";
              String ip2="22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 "; // this one will take some time 
              String ip3="22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34"; // this one will take lotza time 
              String regex="^((\\s*\\d+(\\.\\d+)?\\s*-\\s*\\d+(\\.\\d+)?\\s*:\\s*\\d+(\\.\\d+)?(\\s*\\n*)+))*(\\s*\\n*)*\\s*\\d+(\\.\\d+)?\\s*\\+\\s*:\\s*\\d+(\\.\\d+)?(\\s*\\n*)*$";
              System.out.println(ip1.matches(regex));
              System.out.println(ip2.matches(regex));
            }
      
        }
      Edited by: darth_code_r on Nov 4, 2008 1:56 PM
      removed the extra \ from the expression
        • 1. Re: Regex Optimization
          796293
          Hi,

          The way your going about writing and testing your RegEx seems a 'small' bit convoluted, especially for the task you've outlined.
          Really the RegEx should be quite short.

          The following link gives a good example of a RegEx tester and all the info you need to write more efficient RegEx's!
          [http://java.sun.com/docs/books/tutorial/essential/regex/index.html]

          Good luck.
          • 2. Re: Regex Optimization
            800308
            Darth,

            1. First trick is to "normalize" the "code" before you "parse" it... ie: remove all whitespace, which simplifies your "parsing" regex.

            2. The performance his is coming from "backtracking" from the $... the more you can do to fail the regex without having to parse to the end, the quicker it'll execute.

            3. Does it have to be one method of a regex? Why not just parse each line individually? Simpler ;-) I mean, then your only "special requirement" in-an isLastLine() test... which isn't exactly rocket surgery, even for a simpleton like me.

            Cheers. Keith.
            • 3. Re: Regex Optimization
              791266
              @Op. You should also try to avoid using capturing groups since you don't need them.

              Kaj
              • 4. Re: Regex Optimization
                800282
                As Kaj mentioned, use non-capturing groups. Also make all those greedy * and + possessive by adding another + after it:
                "^(?:(?:\\s*+\\d++(?:\\.\\d++)?\\s*+-\\s*+\\d++(?:\\.\\d++)?\\s*+:\\s*+\\d++(?:\\.\\d++)?(?:\\s*+\\n*+)++))*+(?:\\s*+\\n*+)*+\\s*+\\d++(?:\\.\\d++)?\\s*+\\+\\s*+:\\s*+\\d++(?:\\.\\d++)?(?:\\s*+\\n*+)*+$"
                Note that I didn't test it, I only adjusted your regex.
                • 5. Re: Regex Optimization
                  843785
                  Your biggest problem is the parts where you match the line separators and other whitespace. like these:
                  (\s*\n*)+
                  
                  (\s*\n*)*\s*
                  There are just too many ways these subexpressions can match, and if the overall match can't succeed, the regex ends up trying all of them. They're internally redundant anyway, since \s matches all whitespace characters, including \n. When I changed every instance of
                  (\s*\n*)
                  to merely
                  \s
                  all three of your test strings match or fail instantaneously.

                  Beyond that, I second the advice given by corlettk and kajbj, and I would also make all the quantifiers possessive (i.e., add a '+' after every '*', '+' or '?').
                   

                  Too slow! But I did test it, and making the quantifiers possessive does work (after the other change I recommended).
                  • 6. Re: Regex Optimization
                    843785
                    thanks all . I had tried using possessive quantifiers , but my browser showed \s*+ as a script error ( I tested using the javascript regex). It shows no error if i use like (\s+)* , but that doesnt mean the same , does it?
                    • 7. Re: Regex Optimization
                      800282
                      JavaScript does not support possessive quantifiers, Java does. And you are correct, \s*+ is not the same as (\s*)+.
                      • 8. Re: Regex Optimization
                        843785
                        prometheuzz wrote:
                        JavaScript does not support possessive quantifiers.
                        Thanks , that clears a lot of things...
                        • 9. Re: Regex Optimization
                          800308
                          Hmmm... Need to use the new nanoTime to measure it.
                          package forums;
                          
                          import java.util.regex.Pattern;
                          import java.util.regex.Matcher;
                          
                          class RegexTest
                          {
                            public static void main(String[] args) {
                              try {
                                final String FLOAT = "\\s*\\d++(\\.\\d++)?\\s*";
                                for (String regex : new String[] {
                                  "^("+FLOAT+"-"+FLOAT+":"+FLOAT+"\\s++)*(("+FLOAT+"-)?"+FLOAT+"\\+:"+FLOAT+")\\s*$"
                                //, "^((\\s*\\d+(\\.\\d+)?\\s*-\\s*\\d+(\\.\\d+)?\\s*:\\s*\\d+(\\.\\d+)?(\\s*\\n*)+))*(\\s*\\n*)*\\s*\\d+(\\.\\d+)?\\s*\\+\\s*:\\s*\\d+(\\.\\d+)?(\\s*\\n*)*$"
                                } ) {
                                  Pattern p = Pattern.compile(regex);
                                  for ( String s : new String[] {
                                    "22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 34+:34"
                                  , "22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34+:34 "                        // this one will take some time 
                                  , "22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34 \n 22-34:34"  // this one will take lotza time 
                                  } ) {
                                    long start = System.nanoTime();
                                    Matcher m = p.matcher(s);
                                    long stop = System.nanoTime();
                                    System.out.println(m.matches()+" "+((stop-start)/Math.pow(10,6))+" ms");
                                  }
                                }
                              } catch (Exception e) {
                                e.printStackTrace();
                              }
                            }
                          
                          }
                          uncle_alice should publish... I'm thinking:
                          (?iThe ([MBS]ad){0,3} Owl (in the Tutu)* Book
                          ... The cover photo would be an eagle (with turkey legs), obviously ;-)