This discussion is archived
6 Replies Latest reply: Sep 18, 2012 12:10 AM by udayaraj RSS

Assigning score based on criteria

udayaraj Newbie
Currently Being Moderated
I have the following test case on Oracle 10gR2


-----
CREATE TABLE t (id  VARCHAR2 (60), data clob)
/ 
 

INSERT ALL
    INTO t (id, data) VALUES ('Test_1', '<book title="Tale"> 
            It was the best of times. 
     </book> ')
    INTO t (id, data) VALUES ('Test_2', '<book title="Disclosure"> 
            Opening times. 
     </book>')
    INTO t (id, data) VALUES ('Test_3', '<book title="Blue"> 
            No of times. 
     </book> ')
    INTO t (id, data) VALUES ('Test_4', '<book title="Olympics"> 
            It was the best of times.
     </book> ')
    SELECT * FROM DUAL
/ 
 

BEGIN
     Ctx_Ddl.drop_Preference ('my_lexer');
END;
/ 

BEGIN
     Ctx_Ddl.Create_Preference ('my_lexer','BASIC_LEXER');
     Ctx_Ddl.Set_Attribute ( 'my_lexer', 'mixed_case', 'FALSE');
     Ctx_Ddl.Set_Attribute ( 'my_lexer', 'base_letter','TRUE');
END;
/ 

BEGIN
     Ctx_Ddl.drop_section_group ('myxmlgroup');
END;
   /

begin
     ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
     ctx_ddl.add_zone_section('myxmlgroup', 'book', 'book');
     ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title');
end;
/
 
CREATE INDEX t_data_idx ON t (data) INDEXTYPE IS ctxsys.context
     PARAMETERS 
          ('LEXER  my_lexer
                  datastore ctxsys.default_datastore 
                  filter ctxsys.null_filter 
                  section group myxmlgroup'
                )
/
-----

If searching with the string "times", would it be possible to get the following result

     1. for hits in <book title="Blue"> should have highest score
     2. for hits in <book title="Disclosure"> should have second highest score
     3. for hits in <book title="Tale"> should have third highest score
     4. for hits other than the above 3 should have the lowest score


The result should be for example
ID                         Element                       Score
Test_3                    Blue                    10     -- sample score
Test_2                    Disclosure               8     
Test_1                    Tale                    6     
Test_4                    Olympics               3
Could anyone kindly help me on this please

Thanking you in advance
Raj

Edited by: udayaraj on Sep 11, 2012 1:29 AM

Edited by: udayaraj on Sep 11, 2012 1:30 AM

Edited by: udayaraj on Sep 11, 2012 1:34 AM

Edited by: udayaraj on Sep 11, 2012 1:36 AM
  • 1. Re: Assigning score based on criteria
    Roger Ford Expert
    Currently Being Moderated
    Here we go. A little bit of XML Database magic to extract the title, and we get:
    column id format a20
    column title format a30
    
    select id, xt.title, score(1)
    from t, xmltable( '/book' PASSING XMLTYPE(t.data) 
                      COLUMNS title VARCHAR2(30) PATH '@title' 
                    ) as xt 
    where contains (data, '
       times ACCUM ( 
                    (blue WITHIN booktitle)*5 OR
                    (disclosure WITHIN booktitle)*4 OR
                    (tale WITHIN booktitle)*3 
                   )', 1) > 0
    order by score(1) desc;
    which gives us:
    ID               TITLE                     SCORE(1)
    -------------------- ------------------------------ ----------
    Test_3               Blue                        57
    Test_2               Disclosure                    56
    Test_1               Tale                        54
    Test_4               Olympics                         2
    Note that this works fine for this simple example, but might get it wrong if, for example, you had a document "Times, times, times, times, times, times, times, times". In this case the score of the left hand side of the ACCUM would be higher than the right hand side, so that document would get boosted too high. Instead you could use progressive relaxation:
    select id, xt.title, score(1)
    from t, xmltable( '/book' PASSING XMLTYPE(t.data) 
                      COLUMNS title VARCHAR2(30) PATH '@title' 
                    ) as xt 
    where contains (data, '
    <query>
      <textquery>
        <progression>
          <seq>times AND (blue WITHIN booktitle)*10*10</seq>
          <seq>times AND (disclosure WITHIN booktitle)*10*10</seq>
          <seq>times AND (tale WITHIN booktitle)*10*10</seq>
          <seq>times</seq>
       </progression>
      </textquery>
    </query>
    ', 1) > 0
    order by score(1) desc;
    This is guaranteed to ensure that the documents appear in the specified order of titles.

    Note the *10*10 is to push the scoring of the right hand side of the AND to the maximum value of 100. Since AND scores the minimum of its operands, that makes sure that the "WITHIN booktitle" part has no effect on the overall score.
  • 2. Re: Assigning score based on criteria
    udayaraj Newbie
    Currently Being Moderated
    Thank you very much Roger.

    I read an article on Progressive relaxation written by you on [Oracle Technet|http://www.oracle.com/technetwork/database/enterprise-edition/prog-relax-099280.html] which made it more easier for me to understand the subject

    Thank you once again

    Kind Regards
    Raj
  • 3. Re: Assigning score based on criteria
    udayaraj Newbie
    Currently Being Moderated
    Hello Roger

    I was trying the solution you had given me on the original data, but the result seems to be not in proper order. Could you kindly have a look?

    Oracle DB version - Enterprise edition 10.2.0.5.0

    raj@cms> drop table c
      2  /
    
    Table dropped.
    
    Elapsed: 00:00:00.18
    raj@cms> 
    raj@cms> create table c(id varchar2(255), c_type_id number(4), data clob)
      2  /
    
    Table created.
    
    Elapsed: 00:00:00.02
    raj@cms> 
    raj@cms> alter table c add constraint c_PK primary key (id, c_type_id)
      2  /
    
    Table altered.
    
    Elapsed: 00:00:00.02
    raj@cms> insert all
      2  into c (id, c_Type_id, data) values
      3  ('POI_DE3_TAX_19701106_0001', 1701, '<n-docbody>
      4                 <poi class="id">POI_DE3_TAX_19701106_0001 DE3_TAX_19701106_0001</poi>
      5                 <poi class="pk">昭和45年11月 6日</poi>
      6                 <poi class="fk">昭和45年 9月30日</poi>
      7                 <poi class="abstract">共同相続人</poi>
      8            </n-docbody>')
      9  into c (id, c_type_id, data) values
     10  ('POI_DE3_TAX_19700930_0002', 1701, '<n-docbody>
     11                 <poi class="id">POI_DE3_TAX_19700930_0002 DE3_TAX_19700930_0002</poi>
     12                 <poi class="pk">DE3 昭和45年 9月30日</poi>
     13                 <poi class="fk">昭和45年11月 6日</poi>
     14                 <poi class="abstract">有価証券の売買は</poi>
     15            </n-docbody>')
     16  into c (id, c_type_id, data) values
     17  ('POI_DE3_TAX_19701029_0003', 1701, '<n-docbody>
     18                 <poi class="id">POI_DE3_TAX_19701029_0003 DE3_TAX_19701029_0003</poi>
     19                 <poi class="pk">昭和45年10月29日</poi>
     20                 <poi class="abstract">昭和45年11月6日 0001、 昭和45年9月30日 </poi>
     21            </n-docbody>')
     22  select * from dual
     23  /
    
    3 rows created.
    
    Elapsed: 00:00:00.02
    raj@cms> 
    raj@cms> begin
      2       Ctx_Ddl.Drop_Preference('my_jp_lexer');
      3  end;
      4  /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:00.01
    raj@cms> 
    raj@cms> begin
      2       Ctx_Ddl.Create_Preference('my_jp_lexer','JAPANESE_VGRAM_LEXER');
      3  end;
      4  /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:00.01
    raj@cms> 
    raj@cms> 
    raj@cms> begin
      2       ctx_ddl.drop_section_group('poigroup');
      3  end;
      4  /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:00.01
    raj@cms> 
    raj@cms> begin
      2       ctx_ddl.create_section_group('poigroup', 'XML_SECTION_GROUP');
      3       ctx_ddl.add_zone_section('poigroup', 'poi', 'poi');
      4       ctx_ddl.add_attr_section('poigroup', 'poiclass', 'poi@class');
      5  end;
      6  /
    
    PL/SQL procedure successfully completed.
    
    Elapsed: 00:00:00.01
    raj@cms> 
    raj@cms> CREATE INDEX
      2       c_text_index ON c(DATA)
      3  INDEXTYPE
      4       IS CTXSYS.CONTEXT
      5  PARAMETERS(
      6            'transactional
      7            LEXER my_jp_lexer
      8            sync (every "SYSDATE+30/1440" MEMORY 300M PARALLEL 2)
      9            STOPLIST ctxsys.default_stoplist
     10            filter ctxsys.null_filter
     11            section group poigroup
     12            '
     13         )
     14  /
    
    Index created.
    
    Elapsed: 00:00:00.13
    raj@cms> select
      2       id, score(1)
      3  from c
      4  where
      5       contains (data, '<query>
      6                        <textquery>
      7                               <progression>
      8                                      <seq>昭和45年11月 6日 AND (id WITHIN poiclass)*10*10</seq>
      9                                      <seq>昭和45年11月 6日 AND (pk WITHIN poiclass)*10*10</seq>
     10                                      <seq>昭和45年11月 6日 AND (fk WITHIN poiclass)*10*10</seq>
     11                                      <seq>昭和45年11月 6日</seq>
     12                              </progression>
     13                        </textquery>
     14                 </query>', 1) > 0
     15  order by
     16       score(1) desc
     17  /
    
    ID                                         SCORE(1)
    ---------------------------------------- ----------
    POI_DE3_TAX_19701106_0001                        76
    POI_DE3_TAX_19701029_0003                        76
    POI_DE3_TAX_19700930_0002                        76
    
    Elapsed: 00:00:00.04
    raj@cms> 
    raj@cms> select
      2       id, xt.class, score(1)
      3  from c, xmltable( '/n-docbody/poi' PASSING XMLTYPE(c.data)
      4                    COLUMNS class VARCHAR2(30) PATH '@class'
      5                  ) as xt
      6  where
      7       contains (data, '<query>
      8                        <textquery>
      9                               <progression>
     10                                      <seq>昭和45年11月 6日 AND (id WITHIN poiclass)*10*10</seq>
     11                                      <seq>昭和45年11月 6日 AND (pk WITHIN poiclass)*10*10</seq>
     12                                      <seq>昭和45年11月 6日 AND (fk WITHIN poiclass)*10*10</seq>
     13                                      <seq>昭和45年11月 6日</seq>
     14                              </progression>
     15                        </textquery>
     16                 </query>', 1) > 0
     17  order by
     18       score(1) desc
     19  /
    
    ID                                       CLASS                                                          SCORE(1)
    ---------------------------------------- ------------------------------------------------------------ ----------
    POI_DE3_TAX_19701106_0001                id                                                                   76
    POI_DE3_TAX_19701106_0001                pk                                                                   76
    POI_DE3_TAX_19701106_0001                fk                                                                   76
    POI_DE3_TAX_19701106_0001                abstract                                                             76
    POI_DE3_TAX_19700930_0002                id                                                                   76
    POI_DE3_TAX_19701029_0003                abstract                                                             76
    POI_DE3_TAX_19700930_0002                fk                                                                   76
    POI_DE3_TAX_19700930_0002                abstract                                                             76
    POI_DE3_TAX_19701029_0003                id                                                                   76
    POI_DE3_TAX_19701029_0003                pk                                                                   76
    POI_DE3_TAX_19700930_0002                pk                                                                   76
    
    11 rows selected.
    
    Elapsed: 00:00:00.07
    The results should be in the following order

    1. POI_DE3_TAX_19701106_0001 => The string "昭和45年11月 6日" which is being searched, exists in <poi class=pk> so should have higher score
    2. POI_DE3_TAX_19700930_0002 => The string "昭和45年11月 6日" which is being searched, exists in <poi class=fk> so should have second highest score
    3. POI_DE3_TAX_19701029_0003 => The string "昭和45年11月 6日" which is being searched, does not exist in <poi class=pk> or <poi class=fk> but in <poi class=abstract> so should have the lowest score

    Also for the query which extracts the "class", the resultset shows matches for <poi class="id">. The string "昭和45年11月 6日" does'nt exist in <poi class="id"> at all.

    Thanking you in advance
    Kind Regards
    Raj
  • 4. Re: Assigning score based on criteria
    Roger Ford Expert
    Currently Being Moderated
    All of your documents have <poi class="id"> so they're all matched in the first part of the <progression>

    To match only those strings that occur in that specific segment, you'll probably need to do a nested WITHIN search.

    Let me see if I can demonstrate this.
  • 5. Re: Assigning score based on criteria
    Roger Ford Expert
    Currently Being Moderated
    It seems we can't use nested sections, as Oracle Text does not allow you to nest attribute sections within other sections.

    The solution is to use PATH_SECTION_GROUP which allows much more flexibility in XML searching than the simple XML section group.

    Also, because of the way your XML is constructed, it's no longer possible for us to use XMLTABLE to identify the class - several different classes are present in each document, and we can't tell which one our hit came from. So that gives us:
    drop table c
    /
    create table c(id varchar2(255), c_type_id number(4), data clob)
    / 
    alter table c add constraint c_PK primary key (id, c_type_id)
    / 
    insert all
     into c (id, c_Type_id, data) values
     ('POI_DE3_TAX_19701106_0001', 1701, '<n-docbody>
                    <poi class="id">POI_DE3_TAX_19701106_0001 DE3_TAX_19701106_0001</poi>
                    <poi class="pk">昭和45年11月 6日</poi>
                    <poi class="fk">昭和45年 9月30日</poi>
                   <poi class="abstract">共同相続人</poi>
              </n-docbody>')
    into c (id, c_type_id, data) values
    ('POI_DE3_TAX_19700930_0002', 1701, '<n-docbody>
                      <poi class="id">POI_DE3_TAX_19700930_0002 DE3_TAX_19700930_0002</poi>
                   <poi class="pk">DE3 昭和45年 9月30日</poi>
                   <poi class="fk">昭和45年11月 6日</poi>
                   <poi class="abstract">有価証券の売買は</poi>
                            </n-docbody>')
    into c (id, c_type_id, data) values
    ('POI_DE3_TAX_19701029_0003', 1701, '<n-docbody>
                   <poi class="id">POI_DE3_TAX_19701029_0003 DE3_TAX_19701029_0003</poi>
                   <poi class="pk">昭和45年10月29日</poi>
                   <poi class="abstract">昭和45年11月 6日 0001、 昭和45年9月30日 </poi>
              </n-docbody>')
    select * from dual
    / 
    
    begin
         Ctx_Ddl.Drop_Preference('my_jp_lexer');
    end;
    / 
    
    begin
         Ctx_Ddl.Create_Preference('my_jp_lexer','JAPANESE_VGRAM_LEXER');
    end;
    / 
    begin
         ctx_ddl.drop_section_group('poigroup');
    end;
    / 
    
    begin
            ctx_ddl.create_section_group('poigroup', 'PATH_SECTION_GROUP');
    end;
    / 
    
    CREATE INDEX
         c_txt_index ON c(DATA)
    INDEXTYPE
         IS CTXSYS.CONTEXT
    PARAMETERS(
              'transactional
              LEXER my_jp_lexer
              STOPLIST ctxsys.default_stoplist
              filter ctxsys.null_filter
              section group poigroup
              '
           )
    / 
    
    column id format a25
    column class format a10
    
    select
    id, score(1)
    from c
    where
         contains (data, '<query>
                          <textquery>
                                 <progression>
                                        <seq>(昭和45年11月 6日) INPATH (//poi[@class="id"])</seq>
                                        <seq>(昭和45年11月 6日) INPATH (//poi[@class="pk"])</seq>
                                        <seq>(昭和45年11月 6日) INPATH (//poi[@class="fk"])</seq>
                                        <seq>(昭和45年11月 6日) INPATH (//poi[@class])</seq>
                                </progression>
                          </textquery>
                   </query>', 1) > 0
    order by
         score(1) desc
    / 
    And the output of that is:
    ID                   SCORE(1)
    ------------------------- ----------
    POI_DE3_TAX_19701106_0001       51
    POI_DE3_TAX_19700930_0002       26
    POI_DE3_TAX_19701029_0003        1
    Note I made a couple of extra changes to your testcase:
    1/ I removed "sync every ..." from the parameters clause due to problems with the job scheduler on my test machine.
    2/ I added an extra space before the 6 in "昭和45年11月6日" in the third document, to make it the same as the other two. Without that, I wasn't getting a match on that document.
  • 6. Re: Assigning score based on criteria
    udayaraj Newbie
    Currently Being Moderated
    Thank you very much, Roger for helping out.

    The actual table holds huge number of xml documents with different content types and is already indexed using XML_SECTION_GROUP for basic keyword search within XML sections. Changing it to PATH_SECTION_GROUP, need to test and assess how it will impact on the existing search functionality.

    Once testing is done with different test cases, will get back to you.

    Also regarding the extra space before the 6 in "昭和45年11月6日", the actual content has values with missing 'spaces' in some cases. The lexer being JAPNESE_VGRAM_LEXER, has limitations in dealing with special characters. Is there any possible workaround for this?


    Thank you once again
    Kind Regards
    Raj

    Edited by: udayaraj on Sep 18, 2012 12:03 AM

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points