This discussion is archived
7 Replies Latest reply: Nov 9, 2012 1:45 AM by AlbertoFaenza RSS

Regular Expression matching all the given words

user1421762 Newbie
Currently Being Moderated
I'm trying to match all sentences that contain words starting with given search tokens at least once. For example: if the given search token words are one and two then only sentences like "one plus one is two" should match. And should not match sentences like "one plus three is four". I was able to come up with this but I need a AND condition which I'm unable to get it right. Any help is appreciated.

select count(*) from dual where regexp_like('one plus one is two', '(^|\s)one|three', 'i');

Currently this gives a count of 1. But needs to give a count of 0 when the regexp is fixed.

search words : one two

one is less than two -> match
two is greater than one -> match
onetwo is union of two numbers -> match
onetwo is union of 2 numbers -> not a match as 'two' is not at the beginning of a word
one is less than three -> not a match as two is not present.

Thanks

Edited by: user1421762 on Nov 8, 2012 11:28 AM
  • 1. Re: Regular Expression matching all the given words
    AlbertoFaenza Expert
    Currently Being Moderated
    Hi,

    clarify this:
    search words: one two
     text
    one is less then two -> match
    two is more than one -> ??
    onetwo is union of 2 numbers ->??
    twoone is not three ->??
    So:
    Do you need to search only matching words (not partial)?
    Does order has important (one xxx two) and (two xxx one)?

    Please provide some sample and expected output.

    Regards.
    Al
  • 2. Re: Regular Expression matching all the given words
    Frank Kulash Guru
    Currently Being Moderated
    Hi,

    Welcome to the forum!

    Assuming you have a table called table_x, and it contains a unique string called sentence, you can find the row where sentence contains all of the given targets like this:
    WITH     all_targets      AS
    (
         SELECT  'one' AS target  FROM dual  UNION ALL
         SELECT     'two'             FROM dual
    )
    SELECT       x.sentence
    FROM       table_x      x
    JOIN       all_targets  t  ON  x.sentence  LIKE '%'
                                          || t.target
                                 || '%'
    GROUP BY  x.sentence
    HAVING    COUNT (*) = (
                          SELECT  COUNT (*)
                     FROM       all_targets
                      )
    ;
    If you don't want to hard-code the targets, you can make all_targets a real table, perhaps a global temporary table.

    Watch out for the "mother is in chemotherapy" problem. Does the sentence "Once there lived a l<b>one</b>ly Belgian named An<b>two</b>rp.' contain the targets 'one' and 'two'? The query above assumes it does.
    Do the targets have to appear in any particular order? The query above assumes they don't.


     

    I hope this answers your question.
    If not, post a little sample data (CREATE TABLE and INSERT statements, relevant columns only) for all tables involved, and also post the results you want from that data.
    Explain, using specific examples, how you get those results from that data.
    Always say which version of Oracle you're using (e.g., 11.2.0.2.0).
    See the forum FAQ {message:id=9360002}
  • 3. Re: Regular Expression matching all the given words
    user1421762 Newbie
    Currently Being Moderated
    search words : one two

    one is less than two -> match
    two is greater than one -> match
    onetwo is union of two numbers -> match
    onetwo is union of 2 numbers -> not a match as 'two' is not at the beginning of a word
    one is less than three -> not a match as two is not present.
  • 4. Re: Regular Expression matching all the given words
    Arild Explorer
    Currently Being Moderated
    Why not just :
    with dummy as (
    select 'one is less than two' str from dual union all
    select 'two is greater than one' from dual union all
    select 'onetwo is union of two numbers ' from dual union all
    select 'onetwo is union of 2 numbers' from dual union all
    select 'one is less than three' from dual )
    
    select str,
        case 
            when regexp_like(str, '(^| )one') and regexp_like(str, '(^| )two') then 'match'
            else 'no match '
        end result
    from dummy;
    
    STR                             RESULT   
    ------------------------------- ---------
    one is less than two            match    
    two is greater than one         match    
    onetwo is union of two numbers  match    
    onetwo is union of 2 numbers    no match 
    one is less than three          no match 
    
    5 rows selected.
  • 5. Re: Regular Expression matching all the given words
    user1421762 Newbie
    Currently Being Moderated
    Users can give any number of search tokens. Then the length of the where clause might keep increasing depending on the number of search words given which doesn't sound good.
  • 6. Re: Regular Expression matching all the given words
    chris227 Guru
    Currently Being Moderated
    depending on the concret scenario one could consider using oracle text.
    Sorry, i had to correct this as i forgot about words in stoplists.
    I wonder why 'onetwo is union of two numbers' should be in the result as there is no single one in it.
    create table testtable(id number, str varchar2(100))
    
    insert into testtable (
    select 1, 'one is less than two' from dual union all
    select 2, 'two is greater than one' from dual union all
    select 3, 'onetwo is union of two numbers ' from dual union all
    select 4, 'onetwo is union of 2 numbers' from dual union all
    select 5, 'one is less than three' from dual )
    
    CREATE INDEX  testtable_str_ctx ON  testtable (str) 
       INDEXTYPE IS "CTXSYS"."CONTEXT"
       PARAMETERS ('STOPLIST mystoplist')
    
    SELECT * from testtable
    where
    contains (str, 'one & two') > 0
    
    ID STR 
    1 one is less than two 
    2 two is greater than one 
    May be the answer to my question above is
    SELECT * from testtable
    where
    contains (str, 'one% & two%') > 0
    
    ID STR 
    1 one is less than two 
    2 two is greater than one 
    3 onetwo is union of two numbers 
    
    Naturally you will use a bind variable at then end
    
    SELECT * from testtable
    where
    contains (str, :p) > 0
    Edited by: chris227 on 09.11.2012 01:21
    corrected regarding stoplists

    Edited by: chris227 on 09.11.2012 01:26
  • 7. Re: Regular Expression matching all the given words
    AlbertoFaenza Expert
    Currently Being Moderated
    Hi,

    I have used an approach similar to Frank's one.
    I could use REGEXP_COUNT but I did not have Oracle 11g at hand.

    You can simplify my query but I just make it like this to let you understand the logic:
    -- Text with one or more words to search
    WITH wordstosearch AS
    (
       SELECT '  one    two   ' txt from dual
    )
    -- Create a table with one row for each word to search 
    , wordtable AS
    (
     SELECT REGEXP_substr(txt, '\w+',1,LEVEL) word
       FROM wordstosearch   
    CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(txt,'(\s+)?\w+(\s+)?','x'))
    )
    -- This is my table with comments
    , comments AS 
    (
       SELECT 'one is less than two'            cmt FROM DUAL UNION ALL
       SELECT 'two is greater than one'         cmt FROM DUAL UNION ALL
       SELECT 'onetwo is union of two numbers ' cmt FROM DUAL UNION ALL
       SELECT 'onetwo is union of 2 numbers'    cmt FROM DUAL UNION ALL
       SELECT 'one is less than three'          cmt FROM DUAL UNION ALL 
       SELECT 'oneone and twotwo are valid too' cmt FROM DUAL 
    )
    -- the final query
    SELECT cmt
      FROM comments c JOIN wordtable w
           ON REGEXP_LIKE(c.cmt,'(^|\s)'||w.word)
     GROUP BY cmt
    HAVING count(*) = (SELECT COUNT(*) from wordtable);
    
    CMT                            
    -------------------------------
    two is greater than one        
    onetwo is union of two numbers 
    oneone and twotwo are valid too
    one is less than two           
    If you have Oracle 11g this part:
    CONNECT BY LEVEL <= LENGTH(REGEXP_REPLACE(txt,'(\s+)?\w+(\s+)?','x'))
    coult be probably replaced by
    CONNECT BY LEVEL <= REGEXP_COUNT(txt, '\w+')
    However I cannot test this part as I don't have 11g now.

    Regards.
    Al

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points