This discussion is archived
5 Replies Latest reply: Jun 12, 2012 2:07 AM by Bagagem RSS

[Text Index] Execute queries with repeated words

Bagagem Newbie
Currently Being Moderated
[Post redirected from  https://forums.oracle.com/forums/thread.jspa?threadID=2400547&tstart=0 ]

Hello everyone,

We have a text index of type CONTEXT, to enable fast searches on a table with thousands of records. Each record have multiple words, the most common is between 3 and 10.

Our purpose is to provide flexible text queries - ignoring portuguese special characters - in this table, which includes the possibility to type repeated words in the query clause, for example:

The query "JOÃO JOÃO" (provided by the where clause "contains(t.column, 'JOÃO AND JOÃO', 1)" ) should return records such as "JOÃO JOAO", "JOAO JOÃO DOE" or "HENRY JOÃO JOAO", among others.

The problem is that the indexer detects a duplicate word and searches all the records that includes the word "JOHN" only once, which is not desirable in this case...

The attributes used to create the indexer are as follows:
Ctx_Ddl.Create_Preference ('my_lexer','BASIC_LEXER');
Ctx_Ddl.Set_Attribute ( 'my_lexer', 'mixed_case', 'FALSE');
Ctx_Ddl.Set_Attribute ( 'my_lexer', 'base_letter','TRUE');

Is there any way to enable this kind of 'literal' queries through the indexer?

Best regards,
Bagagem

Edited by: Bagagem on 11/Jun/2012 9:59
  • 1. Re: [Text Index] Execute queries with repeated words
    Roger Ford Expert
    Currently Being Moderated
    Sounds like you just need a PHRASE search rather than an AND search. Try:
    "contains(t.column, 'JOÃO JOÃO', 1)"

    That will only find occurences of the same word twice.

    If you need to find documents such as "JOÃO FOO BAR BAZ JOÃO" as well, that's a little harder (but can be done).
  • 2. Re: [Text Index] Execute queries with repeated words
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    Please see the example below.
    SCOTT@orcl_11gR2> CREATE TABLE t (col  VARCHAR2 (60))
      2  /
    
    Table created.
    
    SCOTT@orcl_11gR2> INSERT ALL
      2  INTO t (col) VALUES ('JOÃO JOAO')
      3  INTO t (col) VALUES ('JOAO JOÃO DOE')
      4  INTO t (col) VALUES ('HENRY JOÃO JOAO')
      5  INTO t (col) VALUES ('JOÃO HENRY JOAO')
      6  INTO t (col) VALUES ('JOÃO')
      7  INTO t (col) VALUES ('JOAO')
      8  SELECT * FROM DUAL
      9  /
    
    6 rows created.
    
    SCOTT@orcl_11gR2> BEGIN
      2    Ctx_Ddl.Create_Preference ('my_lexer','BASIC_LEXER');
      3    Ctx_Ddl.Set_Attribute ( 'my_lexer', 'mixed_case', 'FALSE');
      4    Ctx_Ddl.Set_Attribute ( 'my_lexer', 'base_letter','TRUE');
      5  END;
      6  /
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> CREATE INDEX t_col_idx ON t (col)
      2  INDEXTYPE IS CTXSYS.CONTEXT
      3  PARAMETERS ('LEXER  my_lexer')
      4  /
    
    Index created.
    
    SCOTT@orcl_11gR2> -- JOÃO twice one after the other:
    SCOTT@orcl_11gR2> SELECT * FROM t
      2  where  contains (t.col, 'JOÃO JOÃO', 1) > 0
      3  /
    
    COL
    ------------------------------------------------------------
    JOÃO JOAO
    JOAO JOÃO DOE
    HENRY JOÃO JOAO
    
    3 rows selected.
    
    SCOTT@orcl_11gR2> -- JOÃO twice:
    SCOTT@orcl_11gR2> SELECT * FROM t
      2  where  contains (t.col, 'JOÃO NEAR JOÃO', 1) > 0
      3  /
    
    COL
    ------------------------------------------------------------
    JOÃO JOAO
    JOAO JOÃO DOE
    HENRY JOÃO JOAO
    JOÃO HENRY JOAO
    
    4 rows selected.
  • 3. Re: [Text Index] Execute queries with repeated words
    Roger Ford Expert
    Currently Being Moderated
    For completeness - if you did want to find documents which have JOÃO more than once, even if there are other words in between:

    In your case, since you have a limited number of words, you can use the NEAR operator which finds words within 100 words of each other:

    contains ( 'NEAR( (JOÃO, JOÃO), 100)' ) > 0;

    but for the more general case where they may be separated by more than 100 words, you would need to use a query template and the COUNT scoring algorithm. I've used "john" here since I can type it without cutting and pasting:
    SQL> create table foo(bar varchar2(60));
    
    Table created.
    
    SQL> insert into foo values ('john smith');
    
    1 row created.
    
    SQL> insert into foo values ('john john smith');
    
    1 row created.
    
    SQL> insert into foo values ('john smith john');
    
    1 row created.
    
    SQL> create index fooindex on foo(bar) indextype is ctxsys.context;
    
    Index created.
    
    SQL> 
    SQL> select score(1), bar from foo where contains (bar,
      2  '<query>
      3      <textquery>
      4         john
      5      </textquery>
      6      <score algorithm="COUNT"/>
      7   </query>'
      8  , 1) > 0;
    
      SCORE(1) BAR
    ---------- ------------------------------------------------------------
             1 john smith
             2 john john smith
             2 john smith john
    You can see here that the score(1) is the number of occurences. We can add:
     and score(1) > 1
    to the end of the query to ensure we only get rows with at least two occurences of "john". Or, since CONTAINS returns the score, we could instead do:
    select score(1), bar from foo where contains (bar,
    '<query>
       <textquery>
         john
       </textquery>
       <score algorithm="COUNT"/>
     </query>'
    , 1) > 1
    Note the "> 1" on the last line. This is one of the very few occasions where it makes sense to use anything other than "> 0" with a CONTAINS clause.

    Edited by: Roger Ford on Jun 11, 2012 11:31 AM

    Edited by: Roger Ford on Jun 11, 2012 11:33 AM
  • 4. Re: [Text Index] Execute queries with repeated words
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    You can also use DEFINESCORE with OCCURRENCE or DEFINEMERGE with ADD to affect the score, as shown below.
    SCOTT@orcl_11gR2> SELECT SCORE(1), col FROM t
      2  where  contains (t.col, 'DEFINESCORE (JOÃO, OCCURRENCE)', 1) > 0
      3  ORDER  BY SCORE(1) DESC
      4  /
    
      SCORE(1) COL
    ---------- ------------------------------------------------------------
             2 JOÃO JOAO
             2 JOAO JOÃO DOE
             2 HENRY JOÃO JOAO
             2 JOÃO HENRY JOAO
             1 JOÃO
             1 JOAO
    
    6 rows selected.
    
    SCOTT@orcl_11gR2> SELECT SCORE(1), col FROM t
      2  where  contains (t.col, 'DEFINEMERGE (((JOÃO), (JOÃO)), AND, ADD)', 1) > 0
      3  ORDER  BY SCORE(1) DESC
      4  /
    
      SCORE(1) COL
    ---------- ------------------------------------------------------------
            12 JOÃO JOAO
            12 JOAO JOÃO DOE
            12 HENRY JOÃO JOAO
            12 JOÃO HENRY JOAO
             6 JOÃO
             6 JOAO
    
    6 rows selected.
  • 5. Re: [Text Index] Execute queries with repeated words
    Bagagem Newbie
    Currently Being Moderated
    Thank you guys, a lot!

    I think I was creating a little too much confusion, as I had already tried the
    contains (t.col, 'JOÃO JOÃO', 1) > 0
    and this is exactly what we pretend.

    The thing is that when I tested this option I was developing on a single SQL developer worksheet (under a single session), and rarely performed a commit to the changes i made on the database... And I forgot this little detail:
    parameters ('LEXER my_lexer SYNC ( ON COMMIT )');
    A little distraction that cost me a few hours... :)

    Once again thank you Roger Ford and Barbara Boehmer!
    Bagagem

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points