6 Replies Latest reply: Oct 10, 2012 1:18 PM by Barbara Boehmer RSS

    stop words handling with CONTEXT index - weird behavior

    757652
      I have a context index with the following output from the report (describe index report).


      CTX_REPORT.DESCRIBE_INDEX('KWTI10569_20121010115054')
      ------------------------------------------------------------------------------------------------------------------------
      ===========================================================================
      INDEX DESCRIPTION
      ===========================================================================
      index name: "METCALF_T"."KWTI10569_20121010115054"
      index id: 1524
      index type: context

      base table: "METCALF_T"."KWTD10569_20121010115054"
      primary key column:
      text column: MESSAGE_CONTENT
      text column type: RAW(2000)
      language column:
      format column: FMT
      charset column: CSET


      ===========================================================================
      INDEX OBJECTS
      ===========================================================================
      datastore: DIRECT_DATASTORE

      filter: CHARSET_FILTER
      charset: UTF8

      section group: NULL_SECTION_GROUP

      lexer: BASIC_LEXER
      punctuations: .?!
      skipjoins: _-"'`~!@#$%^&*()+=|}{[]\:;<>?/,
      continuation: \-
      index_stems: NONE

      wordlist: BASIC_WORDLIST
      stemmer: ENGLISH
      fuzzy_match: GENERIC

      stoplist: BASIC_STOPLIST
      stop_word: how
      stop_word: however
      stop_word: i
      stop_word: if
      <trimmed for brevity of message......but all default stop words provided by Oracle has been added here>

      storage: BASIC_STORAGE
      i_table_clause: tablespace TEXT_INDEX storage (initial 10M next 10M)
      k_table_clause: tablespace TEXT_INDEX storage (initial 10M next 10M)
      r_table_clause: tablespace TEXT_INDEX storage (initial 1M) lob (data) store as (cache)
      n_table_clause: tablespace TEXT_INDEX storage (initial 1M)
      i_index_clause: tablespace TEXT_INDEX storage (initial 1M) compress 2

      DB: 10g (10.2.0.4)
      DB characterset: UTF8

      Distinct tokens from index:
      SQL> select distinct token_text from dr$KWTI10569_20121010115054$i;

      TOKEN_TEXT
      ----------------------------------------------------------------
      BLAH
      EXPIRE
      OFFER

      My text content:

      SQL>
      SQL> select distinct utl_raw.cast_to_varchar2(message_content) from KWTD10569_20121010115054;

      UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
      ------------------------------------------------------------------------------------------------------------------------
      blah blah offer will expire blah blah
      offer expire
      this offer shall expire
      offer to expire
      offer expire
      blah blah offer expire blah blah
      blah blah offer to expire blah blah
      blah blah offer expire blah blah
      offer will expire
      blah blah this offer shall expire blah blah

      10 rows selected.


      Now, when i perform some contain queries i get some behavior that i cant understand.

      When i search for "this offer will expire" i dont get every row (10 rows) - why is that?


      SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'this offer will expire')>0;

      UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
      ------------------------------------------------------------------------------------------------------------------------
      blah blah offer will expire blah blah
      this offer shall expire
      blah blah offer to expire blah blah
      blah blah this offer shall expire blah blah


      Also, when i search for "offer expire" i get the following
      SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'offer expire')>0;

      UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
      ------------------------------------------------------------------------------------------------------------------------
      offer expire
      blah blah offer expire blah blah
      blah blah offer expire blah blah
      offer expire


      I was thinking that the stop words will be ignored while searching in context grammar, so i would get all my rows back? Isnt that correct?

      What i really want to achieve is that all these stop words are stripped from the content AND the keywords when i run the query and i get 100% matches. Any pointers on how that can be accomplished?
        • 1. Re: stop words handling with CONTEXT index - weird behavior
          Roger Ford-Oracle
          Two facts should explain the behavior you're seeing:

          (1) Searching for "several words together" is a phrase search - all the words must appear in the order specified.

          (2) A stopword in a phrase search will match any word - but there has to be a word present. So "offer shall expire" will match "offer will expire", "offer to expire", "offer aardvark expire" but not "offer expire".

          It looks like an AND search will achieve what you want - search for "offer AND expire" or "offer & expire". If you need them in the right order, then a NEAR search with appropriate arguments should do it.
          • 2. Re: stop words handling with CONTEXT index - weird behavior
            757652
            Roger-
            Thanks again. Is there any place in Oracle doc that documents these two facts?

            Please see the example below, does the number of words also matter? My search phrase was "the offer will expire" but why is that i didnt get rows like "offer to expire" back?

            SQL> select distinct utl_raw.cast_to_varchar2(message_content) from KWTD10569_20121010115054;

            UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
            ------------------------------------------------------------------------------------------------------------------------
            offer expire
            blah blah offer expire blah blah
            blah blah offer will expire blah blah
            this offer shall expire
            offer expire
            offer to expire
            blah blah offer to expire blah blah
            blah blah offer expire blah blah
            offer will expire
            blah blah this offer shall expire blah blah

            10 rows selected.

            SQL> select UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT) from KWTD10569_20121010115054 where contains(message_content,'the offer will expire')>0;

            UTL_RAW.CAST_TO_VARCHAR2(MESSAGE_CONTENT)
            ------------------------------------------------------------------------------------------------------------------------
            blah blah offer will expire blah blah
            this offer shall expire
            blah blah offer to expire blah blah
            blah blah this offer shall expire blah blah
            • 3. Re: stop words handling with CONTEXT index - weird behavior
              Barbara Boehmer
              >
              ...
              ...My search phrase was "the offer will expire" but why is that i didnt get rows like "offer to expire" back?
              ...

              "The" and "will" are both stopwords. So, searching for "the offer will expire" results in searching for "anyword offer anyword expire". "Offer to expire" is missing a word before "offer", so it does not match.
              • 4. Re: stop words handling with CONTEXT index - weird behavior
                757652
                Thanks. I am trying to find documentation on how phrase search works with CONTEXT indexes but i cant find them. Would anyone be able to point me to that?
                • 5. Re: stop words handling with CONTEXT index - weird behavior
                  Barbara Boehmer
                  Per the following section of the documentation:

                  http://docs.oracle.com/cd/E11882_01/text.112/e24435/query.htm#CCAPP9149

                  "When you include a stopword within your query phrase, the stopword matches any word."
                  • 6. Re: stop words handling with CONTEXT index - weird behavior
                    Barbara Boehmer
                    The following manual section deals with querying with Oracle Text in general:

                    http://docs.oracle.com/cd/E11882_01/text.112/e24435/query.htm#CCAPP0400