This discussion is archived
7 Replies Latest reply: Jan 17, 2013 4:02 PM by Barbara Boehmer RSS

Question on optimum choice of index - whether to use CTXCAT or CONTEXT

orausern Explorer
Currently Being Moderated
Hi ,

I have a situation in which there are short texts that are to be searched for diacritical characters and for that I implemented CTXCAT type of index. The solution works fine except for left side wild card search - in that case I have suggested the developers to use the query template feature. -this is the background information for the question I have and following example demonstrates it:
 
CREATE TABLE TEST_USER 
( 
  FIRST_NAME  VARCHAR2(64 CHAR)                 NOT NULL, 
  LAST_NAME   VARCHAR2(64 CHAR)                 NOT NULL 
); 


CREATE INDEX TEST_USER_IDX3 ON TEST_USER 
(FIRST_NAME) 
INDEXTYPE IS CTXSYS.CTXCAT 
PARAMETERS('LEXER cust_lexer'); 


CREATE INDEX TEST_USER_IDX4 ON TEST_USER 
(LAST_NAME) 
INDEXTYPE IS CTXSYS.CTXCAT 
PARAMETERS('LEXER cust_lexer'); 

Don't worry about the cust_lexer, it is for diacritical search and it is not relevant to this question so I am not copying the code for the preference I created etc. 
Now I have a row of data in the table with first_name column as Supervisor. If I run the below sql, it gives output: 

SELECT * 
  FROM test_user 
WHERE catsearch (first_name, 'Supervisor', NULL) > 0; 

FIRST_NAME                     LAST_NAME 
------------------------------ ------------------------------ 
Supervisor                     upervisor 

--even the below sql with wild card (*) at the end works fine... 
SQL> SELECT * 
  2    FROM test_user 
  3   WHERE catsearch (first_name, 'Super*', NULL) > 0; 

FIRST_NAME                     LAST_NAME 
------------------------------ ------------------------------ 
Supervisor                     upervisor 


However the below sql queries doesn't give any output, though they should return the same row as above! 

SQL> SELECT * 
  2    FROM test_user 
  3   WHERE catsearch (first_name, '*visor', NULL) > 0; 

no rows selected 


SQL> SELECT * 
  2    FROM test_user 
  3   WHERE catsearch (first_name, '*vis*', NULL) > 0; 

no rows selected 

--Using query template as below solves the issue: 
select * from test_user 
where catsearch(first_name, 
'<query> 
  <textquery grammar="context"> 
     %viso% 
  </textquery> 
</query>','')>0 

FIRST_NAME                     LAST_NAME 
------------------------------ ------------------------------ 
Supervisor                     upervisor 

Note that I verified the query execution plan and it uses the index and there is no Full Table Scan: 


---------------------------------------------------------------------------------------------- 
| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     | 
---------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT            |                |       |       |     9 (100)|          | 
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_USER      |   376 |    99K|     9   (0)| 00:00:01 | 
|*  2 |   DOMAIN INDEX              | TEST_USER_IDX3 |       |       |            |          | 
---------------------------------------------------------------------------------------------- 
Up to the above , all details were by way of the back ground...now the question is - it is this the right choice? I am using context grammer using query template. There is another thread on this forum where an expert (Barbara) said that
". It should be better to use a context index than a ctxcat index with a query template that uses context grammar. " -this was said on this question link: Re: Wildcard Search
So I am getting this doubt. However I have good data here that shows that the query doesn't do full table scan - still is it a bad choice? Note that there are several issues with CONTENT type of indexes( as per my limited understanding) - because they are not transactional in nature and so we have to take extra steps/measures to have the indexes updated which seems like a major pain area to me.

My doubt is , did I do the right thing by using query template or should I use the CONTEXT type of index instead of CTXCAT type of index?

Thanks,
Nirav

Edited by: orausern on Jan 17, 2013 1:40 AM

Edited by: orausern on Jan 17, 2013 1:43 AM
  • 1. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    Roger Ford Expert
    Currently Being Moderated
    CTXCAT grammar does not support left-truncation. See:
    http://docs.oracle.com/cd/B10500_01/text.920/a96518/csql.htm#29433

    This is for performance reasons. Although you're not seeing a full table scan in your query plan, there will be a full table scan on the underlying DR$indexname$I table which contains the list of words and their locations. Since the SQL doing this is internal (recursive) SQL, it doesn't show up on your plan.

    Is this a problem? It depends how much data you have. If your $I table contains a few hundred thousand entries, it will probably get cached comfortably in the SGA and a full table scan won't be an issue. But once the $I gets too big to cache in the SGA it will definitely start being a problem.

    The proper solution is to use a CONTEXT indextype with the SUBSTRING_INDEX option, which is designed to support double truncation. Yes, you'll have to deal with syncing and optimizing the index. If you use SYNC (ON COMMIT) then the sync side of things is covered easily, but that will require you to set up a periodic (daily?) optimize job to keep the index in good shape.
  • 2. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    orausern Explorer
    Currently Being Moderated
    Hi Roger,

    Thanks a LOT for the excellent help!! I have some more questions if you can review further:

    1) I had to create a preference with BASIC_LEXER as our requirement is that we have to search in a way such that diacritical characters don't matter (words from different languages and words such as peña and pena should both return from search and so on.)

    So I created the preference as follows:
     
    BEGIN 
       ctxsys.ctx_ddl.create_preference ('cust_lex1', 'BASIC_LEXER'); 
       ctxsys.ctx_ddl.set_attribute ('cust_lex1', 'base_letter', 'YES'); -- removes diacritics 
    EXCEPTION 
       WHEN OTHERS 
       THEN 
          v_err := SQLERRM; 
          v_sqlcode := SQLCODE; 
          v_count := INSTR (v_err, 'DRG-10701'); 
    
          IF v_count > 0 
          THEN 
             DBMS_OUTPUT.put_line ( 
                'The required preference named cust_lex1 with BASIC LEXER is already set up'); 
          ELSE 
             RAISE; 
          END IF; 
    END; 
    / 
    
    and after that I created the index using this as parameter as follows: 
    CREATE INDEX TEST_USER_IDX3 ON TEST_USER 
    (FIRST_NAME) 
    INDEXTYPE IS CTXSYS.CTXCAT 
    PARAMETERS('LEXER cust_lexer'); 
    Now if I have to use the 'BASIC_WORDLIST' and SUBSTRING_INDEX how do I combine along with the BASIC_LEXER? Can you help to give an example? Also I didn't get your point about periodic (daily) optimize job - what are you referring to? Once the index with SYNC(ON COMMIT) option is set up , isn't it all done? or do we need to set up again some other optimize job and if so how is that done - can you please mention the commands needed for it?

    Thank you,
    OrauserN
  • 3. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    Roger Ford Expert
    Currently Being Moderated
    See the example below - combining the wordlist preference with the lexer preference is simple enough.

    Unfortunately SYNC (ON COMMIT) on its own is not enough. As you add new rows to the table, the index will become fragmented. If you run the example below, you will see that there is only one row in the $I (main index) table for the word "JOHN" even though it appears in two rows of the original table. If we add an extra row to the original table, a new entry for "JOHN" appears in the $I table. Only when we run OPTIMIZE_INDEX does that get absorbed into the original "JOHN" entry.

    For more info on Oracle Text index maintenance, see my paper here:
    http://www.oracle.com/technetwork/database/enterprise-edition/index-maintenance-089308.html

    You may also want to review the more basic paper "How Oracle Text processes DML" here:
    http://www.oracle.com/technetwork/database/enterprise-edition/text-dml-processing-092316.html
    The paper refers to interMedia Text, which is an old name for Oracle Text.
    drop table test_user;
    create table test_user (first_name varchar2(200));
    
    insert into test_user values ('john smith');
    insert into test_user values ('john jones');
    
    exec ctx_ddl.drop_preference   ('cust_lexer')
    exec ctx_ddl.create_preference ('cust_lexer', 'BASIC_LEXER')
    exec ctx_ddl.set_attribute ('cust_lex1', 'base_letter', 'YES')
    
    exec ctx_ddl.drop_preference   ('cust_wl')
    exec ctx_ddl.create_preference ('cust_wl', 'BASIC_WORDLIST')
    exec ctx_ddl.set_attribute     ('cust_wl', 'SUBSTRING_INDEX', 'true')
    
    CREATE INDEX TEST_USER_IDX ON TEST_USER 
    (FIRST_NAME) 
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS('LEXER cust_lexer WORDLIST cust_wl sync (on commit)'); 
    
    select token_text from dr$test_user_idx$i;
    
    insert into test_user values ('john brown');
    commit;
    
    select token_text from dr$test_user_idx$i;
    
    exec ctx_ddl.optimize_index('test_user_idx', 'FULL')
    
    select token_text from dr$test_user_idx$i;
  • 4. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    orausern Explorer
    Currently Being Moderated
    That helps a LOT again!! I will go through your paper and come back if I need some more help later today or tomorrow.

    Best regards and thanks a lot!
    OrauserN
  • 5. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    I would just like to add a few comments.

    Alhough it is documented that the ctxcxat index and catsearch do not support a wildcard in front of the term, a workaround is to use two asterisks on the left side of the term, as demonstrated below. I provide this only for clarification and interesting trivia. I would still use a context index for various reasons.
    SCOTT@orcl_11gR2> CREATE TABLE TEST_USER
      2    (FIRST_NAME  VARCHAR2(64 CHAR))
      3  /
    
    Table created.
    
    SCOTT@orcl_11gR2> INSERT INTO test_user VALUES ('Supervisor')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> CREATE INDEX TEST_USER_IDX
      2  ON TEST_USER (FIRST_NAME)
      3  INDEXTYPE IS CTXSYS.CTXCAT
      4  /
    
    Index created.
    
    SCOTT@orcl_11gR2> SET AUTOTRACE ON EXPLAIN
    SCOTT@orcl_11gR2> SELECT * FROM test_user
      2  WHERE  catsearch (first_name, '**vis*', NULL) > 0
      3  /
    
    FIRST_NAME
    ----------------------------------------------------------------
    Supervisor
    
    1 row selected.
    
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 4046491764
    
    ---------------------------------------------------------------------------------------------
    | Id  | Operation                   | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT            |               |     1 |   142 |     3   (0)| 00:00:01 |
    |   1 |  TABLE ACCESS BY INDEX ROWID| TEST_USER     |     1 |   142 |     3   (0)| 00:00:01 |
    |*  2 |   DOMAIN INDEX              | TEST_USER_IDX |       |       |            |          |
    ---------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
       2 - access("CTXSYS"."CATSEARCH"("FIRST_NAME",'**vis*',NULL)>0)
    
    Note
    -----
       - dynamic sampling used for this statement (level=2)
    
    SCOTT@orcl_11gR2>
    The only time that I am aware of that there is a conflict between a wordlist and a lexer is when you specify stemming in both. If you are using stemming, you can still use both a wordlist and a lexer, but only set the stemmer attribute in the wordlist, not the index_stems attribute in the lexer.
  • 6. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    Roger Ford Expert
    Currently Being Moderated
    I didn't know that! How strange ... not sure why that should be. I don't think it was intentional, but it seems like we should avoid "fixing" it as it could be useful.
  • 7. Re: Question on optimum choice of index - whether to use CTXCAT or CONTEXT
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    I am trying to guess at why that works. Catsearch is documented to support searches like 's*vis*', which would find anything that starts with 's' followed by anything, then 'vis', then anything. I think maybe '**vis*' gets interpreted as starts with anything, followed by anything, then 'vis', then anything. I found that '?*vis*' also produces the same results. It may be that any character that Oracle Text ignores is treated like anything and so the following wildcard works.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points