1 Reply Latest reply on Aug 14, 2009 8:17 AM by 843853

    Many searches on a single string

      I'm working on a problem that involves searching for many keywords in a single document and could use any help you can offer.

      Basically a text string (a document) arrives that ranges anywhere from a few bytes to around 10 MB. I have thousands of users that have pre-setup search criteria to see if they are interested in the incoming string/document. Each user has (possibly) hundreds of search terms resulting in 100,000+ search terms that have to match against each document. The search strings are usually fairly short (1 to 30 bytes usually). Also documents arrive around once a second or faster so I have to be very fast in matching and passing on for delivery. The users can also change their search criteria at any time though they usually don't change more than a search term or two per day.

      Searching around on Google has been fairly useless to me (though I may not know exactly what to search for). Most text searching involves one of two problems - searching for one string in another one string, or searching for one string among many strings. The first being solved by the Boyer-Moore algorithm usually and the second being solved by text indexing like Google or database text indexing ( [Sybase Full Text|http://infocenter.sybase.com/help/topic/com.sybase.help.ase_15.0.verity/verity.pdf] or [Oracle Full Text|http://www.oracle-base.com/articles/9i/FullTextIndexingUsingOracleText9i.php]). My problem is in the other direction (many searches against one string) and faces it's own issues.

      I've tried Boyer-Moore and naive string matching on some sample data (looping through all criteria) and the results were not promising, some times taking minutes to evaluate large messages. Though it took less than a tenth of a second on the smallest messages. This just seems like the wrong way to approach the problem to repeat the same work over and over when it seems like you could make a single pass through the document to do the same work.

      The only two algorithms that I've seen that seem efficient for this problem are the following (though I haven't tested either):
      1) Least frequent trigraph: For each search term, find 3 consecutive characters that are the least likely to occur and then simultaneously search for all 100,000+ trigraphs with each message that comes in using only a single pass through the document. This then (should) reduce the problem set down to something manageable. For each least frequent trigraph found, you keep the location within the message and can then just use a direct comparison of the characters around that match to verify whether or not the whole term matches. Since the search criteria changes infrequently, the database of LFT for each term can be kept and just updated as needed. Also the statistics table to determine which trigraph truly is the least likely to occur can become smarter as time goes on based upon incoming data.

      Seems complex but doable. Efficiency all depends upon how likely the LFT for each term is to be found in each message.

      2) [Suffix Array|http://en.wikipedia.org/wiki/Suffix_array] After doing some pre-computation on the document, you can do O(log n) searches to find any word within it.

      Seems slightly less efficient (and possibly painful on large documents), but still workable and easy to implement.

      Does anyone have some experience with this type of problem? Any other algorithms that you've found that work well with this backwards string matching problem? Is there a version of Boyer-Moore that takes many strings as input?

      Thanks for any help you can offer.
        • 1. Re: Many searches on a single string
          This multi-pattern search might be an approach - [http://www.google.co.uk/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Fwebglimpse.net%2Fpubs%2FTR94-17.pdf&ei=_xeFSq28BKLbjQfqkOyiCw&rct=j&q=Wu-Manber+algorithm&usg=AFQjCNH2OFAym3PHMSbWtp86sYsOSWNbNA&sig2=qMH89k-mekh8BWtCHljuxA|http://www.google.co.uk/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Fwebglimpse.net%2Fpubs%2FTR94-17.pdf&ei=_xeFSq28BKLbjQfqkOyiCw&rct=j&q=Wu-Manber+algorithm&usg=AFQjCNH2OFAym3PHMSbWtp86sYsOSWNbNA&sig2=qMH89k-mekh8BWtCHljuxA] .