This discussion is archived
6 Replies Latest reply: Oct 10, 2012 11:18 AM by Barbara Boehmer RSS

stop words handling with CONTEXT index - weird behavior

757652 Newbie
Currently Being Moderated
I have a context index with the following output from the report (describe index report).


CTX_REPORT.DESCRIBE_INDEX('KWTI10569_20121010115054')
------------------------------------------------------------------------------------------------------------------------
===========================================================================
INDEX DESCRIPTION
===========================================================================
index name: "METCALF_T"."KWTI10569_20121010115054"
index id: 1524
index type: context

base table: "METCALF_T"."KWTD10569_20121010115054"
primary key column:
text column: MESSAGE_CONTENT
text column type: RAW(2000)
language column:
format column: FMT
charset column: CSET


===========================================================================
INDEX OBJECTS
===========================================================================
datastore: DIRECT_DATASTORE

filter: CHARSET_FILTER
charset: UTF8

section group: NULL_SECTION_GROUP

lexer: BASIC_LEXER
punctuations: .?!
skipjoins: _-"'`~!@#$%^&*()+=|}{[]\:;<>?/,
continuation: \-
index_stems: NONE

wordlist: BASIC_WORDLIST
stemmer: ENGLISH
fuzzy_match: GENERIC

stoplist: BASIC_STOPLIST
stop_word: how
stop_word: however
stop_word: i
stop_word: if
<trimmed for brevity of message......but all default stop words provided by Oracle has been added here>

storage: BASIC_STORAGE
i_table_clause: tablespace TEXT_INDEX storage (initial 10M next 10M)
k_table_clause: tablespace TEXT_INDEX storage (initial 10M next 10M)
r_table_clause: tablespace TEXT_INDEX storage (initial 1M) lob (data) store as (cache)
n_table_clause: tablespace TEXT_INDEX storage (initial 1M)
i_index_clause: tablespace TEXT_INDEX storage (initial 1M) compress 2

DB: 10g (10.2.0.4)
DB characterset: UTF8

Distinct tokens from index:
SQL> select distinct token_text from dr$KWTI10569_20121010115054$i;

TOKEN_TEXT
----------------------------------------------------------------
BLAH
EXPIRE
OFFER

My text content:

SQL>
SQL> select distinct utl_raw.cast_to_varchar2(message_content) from KWTD10569_20121010115054;

UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
------------------------------------------------------------------------------------------------------------------------
blah blah offer will expire blah blah
offer expire
this offer shall expire
offer to expire
offer expire
blah blah offer expire blah blah
blah blah offer to expire blah blah
blah blah offer expire blah blah
offer will expire
blah blah this offer shall expire blah blah

10 rows selected.


Now, when i perform some contain queries i get some behavior that i cant understand.

When i search for "this offer will expire" i dont get every row (10 rows) - why is that?


SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'this offer will expire')>0;

UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
------------------------------------------------------------------------------------------------------------------------
blah blah offer will expire blah blah
this offer shall expire
blah blah offer to expire blah blah
blah blah this offer shall expire blah blah


Also, when i search for "offer expire" i get the following
SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'offer expire')>0;

UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
------------------------------------------------------------------------------------------------------------------------
offer expire
blah blah offer expire blah blah
blah blah offer expire blah blah
offer expire


I was thinking that the stop words will be ignored while searching in context grammar, so i would get all my rows back? Isnt that correct?

What i really want to achieve is that all these stop words are stripped from the content AND the keywords when i run the query and i get 100% matches. Any pointers on how that can be accomplished?
  • 1. Re: stop words handling with CONTEXT index - weird behavior
    Roger Ford Expert
    Currently Being Moderated
    Two facts should explain the behavior you're seeing:

    (1) Searching for "several words together" is a phrase search - all the words must appear in the order specified.

    (2) A stopword in a phrase search will match any word - but there has to be a word present. So "offer shall expire" will match "offer will expire", "offer to expire", "offer aardvark expire" but not "offer expire".

    It looks like an AND search will achieve what you want - search for "offer AND expire" or "offer & expire". If you need them in the right order, then a NEAR search with appropriate arguments should do it.
  • 2. Re: stop words handling with CONTEXT index - weird behavior
    757652 Newbie
    Currently Being Moderated
    Roger-
    Thanks again. Is there any place in Oracle doc that documents these two facts?

    Please see the example below, does the number of words also matter? My search phrase was "the offer will expire" but why is that i didnt get rows like "offer to expire" back?

    SQL> select distinct utl_raw.cast_to_varchar2(message_content) from KWTD10569_20121010115054;

    UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
    ------------------------------------------------------------------------------------------------------------------------
    offer expire
    blah blah offer expire blah blah
    blah blah offer will expire blah blah
    this offer shall expire
    offer expire
    offer to expire
    blah blah offer to expire blah blah
    blah blah offer expire blah blah
    offer will expire
    blah blah this offer shall expire blah blah

    10 rows selected.

    SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'the offer will expire')>0;

    UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
    ------------------------------------------------------------------------------------------------------------------------
    blah blah offer will expire blah blah
    this offer shall expire
    blah blah offer to expire blah blah
    blah blah this offer shall expire blah blah
  • 3. Re: stop words handling with CONTEXT index - weird behavior
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    >
    ...
    ...My search phrase was "the offer will expire" but why is that i didnt get rows like "offer to expire" back?
    ...

    "The" and "will" are both stopwords. So, searching for "the offer will expire" results in searching for "anyword offer anyword expire". "Offer to expire" is missing a word before "offer", so it does not match.
  • 4. Re: stop words handling with CONTEXT index - weird behavior
    757652 Newbie
    Currently Being Moderated
    Thanks. I am trying to find documentation on how phrase search works with CONTEXT indexes but i cant find them. Would anyone be able to point me to that?
  • 5. Re: stop words handling with CONTEXT index - weird behavior
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    Per the following section of the documentation:

    http://docs.oracle.com/cd/E11882_01/text.112/e24435/query.htm#CCAPP9149

    "When you include a stopword within your query phrase, the stopword matches any word."
  • 6. Re: stop words handling with CONTEXT index - weird behavior
    Barbara Boehmer Oracle ACE
    Currently Being Moderated
    The following manual section deals with querying with Oracle Text in general:

    http://docs.oracle.com/cd/E11882_01/text.112/e24435/query.htm#CCAPP0400

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points