3 Replies Latest reply: Jan 2, 2008 2:49 AM by 807603

finding mutated strings using regex

807603 Newbie
Hi all, I'm fairly new to regular expressions and am writing something that generates regexes to find strings similar to a known one.

Here is the deal: I am writing something that uses bibliographic data to crawl the web looking for the corresponding PDFs. So far I have a web crawler that looks through Google search results and the PDFs they eventually lead to in order to find the correct one. I am relying heavily on the title of the document, but in the process of parsing the HTML and/or PDFs for their text, the strings get somewhat messed up.

The idea is to generate a regex from each title to make up for the text getting screwed up.
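// Build one pattern from the title words, with a separator after each word
StringBuilder titleRegex = new StringBuilder();

for (String s : titleList) {
    titleRegex.append(s.toLowerCase()).append("[.\\s]{1,15}");
}

if (debug) {
    System.out.println(titleRegex);
}

// The words are lower-cased above, so no second toLowerCase() is needed
this.titleRegex = titleRegex.toString();
return this.titleRegex;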
This code turns the title "On the computational power of DNA"

into

' on[.\s]{1,15}the[.\s]{1,15}computational[.\s]{1,15}power[.\s]{1,15}of[.\s]{1,15}dna[.\s]{1,15}'

I'm going for something that looks for the title with 1 to 15 spaces or other characters between each word of the title. For some reason it is not working: it's only finding about 30% of what it is supposed to. I have never really dealt with regular expressions before, so I am not sure that
"[.\\s]{1,15}"
is the best choice to put in between the words of the title.

Thanks for reading this, help would be much appreciated.

Edited by: Croncheezy on Jan 2, 2008 10:22 AM
  • 1. Re: finding mutated strings using regex
    807603 Newbie
    I don't like the approach, but to get the match you seem to want, your regex separator should just be ".{0,15}". There are two reasons:
    1) '.' only matches any character when it is outside a character class; inside [...] it is a literal dot
    2) a space is matched by '.' anyway, so the \s adds nothing.
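
A quick way to see both points is to try the two separators on a few strings. This is only a sketch: the class name, the `found` helper and the sample strings are made up for illustration and are not from the original crawler. It also shows one caveat not mentioned above: by default '.' does not match a line terminator, so a title broken across lines needs the DOTALL flag, written inline here as `(?s)`.

```java
import java.util.regex.Pattern;

public class SeparatorDemo {
    // Hypothetical helper: true if `regex` is found anywhere in `text`.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Inside [...] the dot is literal, so the class accepts only '.' or whitespace.
        System.out.println(found("on[.\\s]{1,15}the", "on the"));   // space: accepted
        System.out.println(found("on[.\\s]{1,15}the", "on. the"));  // dot + space: accepted
        System.out.println(found("on[.\\s]{1,15}the", "on-the"));   // '-': rejected

        // Outside a class the dot matches any character except line terminators.
        System.out.println(found("on.{0,15}the", "on-the"));        // accepted
        System.out.println(found("on.{0,15}the", "on\nthe"));       // newline: rejected
        System.out.println(found("(?s)on.{0,15}the", "on\nthe"));   // DOTALL: accepted
    }
}
```

If the extracted text can contain line breaks inside a title, the `(?s)` variant (or a separator built on `\s`) is probably the safer choice.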

    BUT BUT BUT

    If you want to match up to 15 characters that are not spaces terminated by a space then "[^\\s]{0,15}\\s+" would be better.
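
For what it's worth, here is a rough sketch of how that separator could be wired into a title pattern. Everything except the separator string is an assumption for illustration: the class and method names are hypothetical, the separator is placed only between words (not after the last one), and each word goes through `Pattern.quote` so that punctuation in a title is not read as regex syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TitleRegexSketch {
    // Up to 15 non-whitespace characters, then at least one whitespace character.
    static final String SEPARATOR = "[^\\s]{0,15}\\s+";

    // Hypothetical helper: joins the lower-cased, quoted words with SEPARATOR.
    static Pattern titlePattern(List<String> words) {
        String regex = words.stream()
                .map(w -> Pattern.quote(w.toLowerCase()))
                .collect(Collectors.joining(SEPARATOR));
        return Pattern.compile(regex);
    }

    // Hypothetical helper: lower-cases the text and searches it for the title.
    static boolean containsTitle(List<String> words, String text) {
        return titlePattern(words).matcher(text.toLowerCase()).find();
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("On", "the", "computational", "power", "of", "DNA");

        // Clean text, stray punctuation, and extra whitespace are all tolerated.
        System.out.println(containsTitle(title, "On the computational power of DNA"));
        System.out.println(containsTitle(title, "On the computational, power of DNA.pdf"));
        System.out.println(containsTitle(title, "On  the\ncomputational power of DNA"));

        // Two words fused together are not: the separator requires whitespace.
        System.out.println(containsTitle(title, "On the computationalpower of DNA"));
    }
}
```

The last case is the trade-off of this separator: it tolerates junk attached to a word, but not a missing space between two title words.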

    Edited by: sabre150 on Jan 2, 2008 10:33 AM
  • 2. Re: finding mutated strings using regex
    807603 Newbie
    Thanks for the reply. I changed it around as you suggested and had very good results.

    P.S. I have been through a lot of different ways of searching for the title, and in most test cases the regex approach turned up the best results. I would greatly appreciate input on this if anyone has a better idea.
  • 3. Re: finding mutated strings using regex
    807603 Newbie
    sabre150 wrote:
    If you want to match up to 15 characters that are not spaces terminated by a space then "[^\\s]{0,15}\\s+" would be better.
    Once more bitten by the ridiculous forum markup:
     "[^\\s]{0,15}\\s+"