3 Replies Latest reply: Dec 9, 2013 8:52 AM by Roger Ford-Oracle RSS

    CONTAINS operator: how to weight down matches on frequent words?

    user10214810

      Hello,

       

      I have a table with a column containing geographical names, indexed by CTXSYS.CONTEXT index with default parameters. I query it using CONTAINS operator with an expression like 'fuzzy(term1),fuzzy(term2),fuzzy(term3)'  e.g. 'fuzzy({TAK},66,1,weight),fuzzy({AFGHANISTAN},66,1,weight)'


      Some geographical terms contain words typical for a particular country (e.g. TAK for Thailand), and such words are repeated in almost every geographical name within this country.  Hence, in my table such words occur at multiple records, almost like stop-word. Though I do not like declare it as a stop word, I rather would like weight their match scores down.  In my example, I would like the match on AFGHANISTAN is ranked highest


      How can I do that?

      Thanks, Dmitry.

       

      Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

      PL/SQL Release 11.2.0.3.0 - Production

      "CORE 11.2.0.3.0 Production"

      TNS for Linux: Version 11.2.0.3.0 - Production

      NLSRTL Version 11.2.0.3.0 - Production

        • 1. Re: CONTAINS operator: how to weight down matches on frequent words?
          Barbara Boehmer

          You can multiply it, as demonstrated below.

           

          SCOTT@orcl12c> create table test_tab (test_col    varchar2(60))

            2  /

           

          Table created.

           

          SCOTT@orcl12c> insert all

            2  into test_tab values ('TAK THAILAND')

            3  into test_tab values ('AFGHANISTAN')

            4  into test_tab values ('UNITED STATES OF AMERICA')

            5  select * from dual

            6  /

           

          3 rows created.

           

          SCOTT@orcl12c> create index test_idx on test_tab (test_col) indextype is ctxsys.context

            2  /

           

          Index created.

           

          SCOTT@orcl12c> select score(1), test_col from test_tab

            2  where  contains

            3            (test_col,

            4             'fuzzy({TAK},66,1,weight),

            5          fuzzy({AFGHANISTAN},66,1,weight)',

            6             1) > 0

            7  order  by score(1) desc

            8  /

           

            SCORE(1) TEST_COL

          ---------- ------------------------------------------------------------

                  22 AFGHANISTAN

                  22 TAK THAILAND

           

          2 rows selected.

           

          SCOTT@orcl12c> select score(1), test_col from test_tab

            2  where  contains

            3            (test_col,

            4             'fuzzy({TAK},66,1,weight)*.1,

            5          fuzzy({AFGHANISTAN},66,1,weight)*10',

            6             1) > 0

            7  order  by score(1) desc

            8  /

           

            SCORE(1) TEST_COL

          ---------- ------------------------------------------------------------

                  98 AFGHANISTAN

                   1 TAK THAILAND

           

          2 rows selected.

          • 2. Re: CONTAINS operator: how to weight down matches on frequent words?
            user10214810

            Thanks for your answer.

            Yes, I know a manually assigned factor might be used to attenuate importance of unwanted matches. My question was if Oracle text can evaluate such a factor automatically, on the basis that some words (like "Tak") are omnipresent across multiple records, like:

             

            SCOTT@orcl12c> insert all

              2  into test_tab values ('TAK THAILAND')

              4  into test_tab values ('Changwat Tak THAILAND')

              5  into test_tab values ('Muang Tak THAILAND')

              6  into test_tab values ('Tak Fa THAILAND')

              7  into test_tab values ('Tak Bai THAILAND')

              8  into test_tab values ('AFGHANISTAN')

              9  into test_tab values ('UNITED STATES OF AMERICA')

              10  select * from dual

              11  /

             

            Thanks,

            Dmitry.

            • 3. Re: CONTAINS operator: how to weight down matches on frequent words?
              Roger Ford-Oracle

              The standard scoring algorithm automatically weights words according to their "rarity" - a common word will score less than an uncommon word. For example:

               

               

              SQL> create table foo(bar varchar2(200));

              Table created.

              SQL> insert into foo values ('tak england');

              1 row created.

              SQL> insert into foo values ('tak usa');

              1 row created.

              SQL> insert into foo values ('tak afghanistan');

              1 row created.

              SQL> insert into foo values ('afghanistan')

              1 row created.

              SQL> create index foobar on foo(bar) indextype is ctxsys.context;

              Index created.

              SQL> select score(1), bar from foo where contains(bar, 'tak or afghanistan', 1) > 0;

               

                SCORE(1) BAR

              ---------- ------------------------------

                       3 tak england

                       3 tak usa

                       4 tak afghanistan

                       4 afghanistan

               

              Notice that "tak" which appears three times scores less than "afghanistan" which appears twice.

               

              However, this may not work with fuzzy, since fuzzy returns a score based on how good a match it gets.

               

              See The Oracle Text Scoring Algorithm