3 Replies Latest reply: Jan 2, 2008 4:49 AM by 807603

    finding mutated strings using regex

    807603
      Hi all, I'm fairly new to regex and am writing something that generates regexes to find strings similar to a known one.

      Here is the deal: I am writing something that uses bibliographic data to crawl the web looking for the corresponding PDFs. So far I have a web crawler that looks through Google search results and the PDFs they eventually lead to in order to find the correct one. I am relying heavily on the title of the document, but in the process of parsing the HTML and/or PDFs for their text, the strings get somewhat messed up.

      The idea is to generate some regex using the titles to make up for the text getting screwed up.
      StringBuffer titleRegex = new StringBuffer();
                
      for(String s : titleList)
          titleRegex.append(s.toLowerCase() + "[.\\s]{1,15}");
                
      if(debug)
          System.out.println(titleRegex.toString());
                
      this.titleRegex = titleRegex.toString().toLowerCase();
      return titleRegex.toString();
      This code turns the title "On the computational power of DNA"

      into

      ' on[.\s]{1,15}the[.\s]{1,15}computational[.\s]{1,15}power[.\s]{1,15}of[.\s]{1,15}dna[.\s]{1,15}'

      I'm going for something that looks for the title with 1 to 15 spaces or other characters between each pair of words. For some reason it is not working: it only finds about 30% of what it is supposed to. I have never really dealt with regular expressions before, so I am not sure that the
      "[.\\s]{1,15}"
      is the best choice to put in between the words of the title.

      Thanks for reading this, help would be much appreciated.

      Edited by: Croncheezy on Jan 2, 2008 10:22 AM
        • 1. Re: finding mutated strings using regex
          807603
          I don't like the approach, but to get the match you seem to want, your regex separator should just be ".{0,15}". There are two reasons:
          1) '.' matches any character only when it is outside a character class; inside [...] it is a literal dot
          2) a space is matched by '.' anyway, so the \s is not needed for that.
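
To illustrate the first point, here is a minimal sketch. The class name DotInClassDemo, the helper found, and the sample strings are made up for illustration. It assumes default flags, under which '.' outside a character class does not match line terminators unless Pattern.DOTALL is set.

```java
import java.util.regex.Pattern;

class DotInClassDemo {
    // Returns true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Inside [...], '.' is a literal dot, so "[.\\s]" cannot match '-'.
        System.out.println(found("power[.\\s]{1,15}of", "power-of dna"));
        // Outside a class, '.' matches any character except line terminators.
        System.out.println(found("power.{0,15}of", "power-of dna"));
        // Without Pattern.DOTALL, '.' does not match a newline.
        System.out.println(found("power.{0,15}of", "power\nof dna"));
    }
}
```

The first and third calls report no match and the second reports a match, which may explain why the original separator misses titles whose words are split by punctuation other than a dot.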

          BUT BUT BUT

          If you want to match up to 15 characters that are not spaces terminated by a space then "[^\\s]{0,15}\\s+" would be better.
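
A rough sketch of how that separator could be plugged into the title regex follows. This is not the original poster's code: the class name TitleRegexSketch and its methods are hypothetical, and three choices are my own assumptions: the separator goes between words rather than after every word, each word is passed through Pattern.quote so regex metacharacters in a title stay literal, and Pattern.CASE_INSENSITIVE replaces the toLowerCase() calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TitleRegexSketch {
    // Separator from the reply: up to 15 non-whitespace characters, then whitespace.
    static final String SEPARATOR = "[^\\s]{0,15}\\s+";

    // Builds a case-insensitive pattern with the separator between (not after) the words.
    static Pattern build(List<String> titleWords) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < titleWords.size(); i++) {
            if (i > 0) {
                regex.append(SEPARATOR);
            }
            // Pattern.quote keeps characters such as '+' or '(' in a title literal.
            regex.append(Pattern.quote(titleWords.get(i)));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Returns true if the title pattern is found anywhere in the text.
    static boolean found(List<String> titleWords, String text) {
        return build(titleWords).matcher(text).find();
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("On", "the", "computational", "power", "of", "DNA");
        // Extra spaces, a stray comma, and a line break are all tolerated.
        System.out.println(found(title, "On the computational  power, of\nDNA"));
        // A substituted word is not tolerated.
        System.out.println(found(title, "On the computational strength of DNA"));
    }
}
```

Note that this separator still requires at least one whitespace character between words, so text where two title words have been run together will not match.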

          Edited by: sabre150 on Jan 2, 2008 10:33 AM
          • 2. Re: finding mutated strings using regex
            807603
            Thanks for the reply. I changed it around as you suggested and had very good results.

            P.S. I have been through a lot of different ways of searching for the title, and in most test cases this approach turned up the best results. I would greatly appreciate input on this if anyone has an idea.
            • 3. Re: finding mutated strings using regex
              807603
              sabre150 wrote:
              If you want to match up to 15 characters that are not spaces terminated by a space then "[^\\s]{0,15}\\s+" would be better.
              Once more bitten by the ridiculous forum markup
               "[^\\s]{0,15}\\s+"