6 Replies Latest reply: Sep 18, 2012 2:10 AM by udayaraj RSS

    Assigning score based on criteria

    udayaraj
      I have the following test case on Oracle 10gR2


      -----
      CREATE TABLE t (id  VARCHAR2 (60), data clob)
      / 
       
      
      INSERT ALL
          INTO t (id, data) VALUES ('Test_1', '<book title="Tale"> 
                  It was the best of times. 
           </book> ')
          INTO t (id, data) VALUES ('Test_2', '<book title="Disclosure"> 
                  Opening times. 
           </book>')
          INTO t (id, data) VALUES ('Test_3', '<book title="Blue"> 
                  No of times. 
           </book> ')
          INTO t (id, data) VALUES ('Test_4', '<book title="Olympics"> 
                  It was the best of times.
           </book> ')
          SELECT * FROM DUAL
      / 
       
      
      BEGIN
           Ctx_Ddl.drop_Preference ('my_lexer');
      END;
      / 
      
      BEGIN
           Ctx_Ddl.Create_Preference ('my_lexer','BASIC_LEXER');
           Ctx_Ddl.Set_Attribute ( 'my_lexer', 'mixed_case', 'FALSE');
           Ctx_Ddl.Set_Attribute ( 'my_lexer', 'base_letter','TRUE');
      END;
      / 
      
      BEGIN
           Ctx_Ddl.drop_section_group ('myxmlgroup');
      END;
         /
      
      begin
           ctx_ddl.create_section_group('myxmlgroup', 'XML_SECTION_GROUP');
           ctx_ddl.add_zone_section('myxmlgroup', 'book', 'book');
           ctx_ddl.add_attr_section('myxmlgroup', 'booktitle', 'book@title');
      end;
      /
       
      CREATE INDEX t_data_idx ON t (data) INDEXTYPE IS ctxsys.context
           PARAMETERS 
                ('LEXER  my_lexer
                        datastore ctxsys.default_datastore 
                        filter ctxsys.null_filter 
                        section group myxmlgroup'
                      )
      /
      -----

      If searching with the string "times", would it be possible to get the following result

           1. for hits in <book title="Blue"> should have highest score
           2. for hits in <book title="Disclosure"> should have second highest score
           3. for hits in <book title="Tale"> should have third highest score
           4. for hits other than the above 3 should have the lowest score


      The result should be for example
      ID                         Element                       Score
      Test_3                    Blue                    10     -- sample score
      Test_2                    Disclosure               8     
      Test_1                    Tale                    6     
      Test_4                    Olympics               3
      Could anyone kindly help me on this please

      Thanking you in advance
      Raj

      Edited by: udayaraj on Sep 11, 2012 1:29 AM

      Edited by: udayaraj on Sep 11, 2012 1:30 AM

      Edited by: udayaraj on Sep 11, 2012 1:34 AM

      Edited by: udayaraj on Sep 11, 2012 1:36 AM
        • 1. Re: Assigning score based on criteria
          Roger Ford-Oracle
          Here we go. A little bit of XML Database magic to extract the title, and we get:
          column id format a20
          column title format a30
          
          select id, xt.title, score(1)
          from t, xmltable( '/book' PASSING XMLTYPE(t.data) 
                            COLUMNS title VARCHAR2(30) PATH '@title' 
                          ) as xt 
          where contains (data, '
             times ACCUM ( 
                          (blue WITHIN booktitle)*5 OR
                          (disclosure WITHIN booktitle)*4 OR
                          (tale WITHIN booktitle)*3 
                         )', 1) > 0
          order by score(1) desc;
          which gives us:
          ID               TITLE                     SCORE(1)
          -------------------- ------------------------------ ----------
          Test_3               Blue                        57
          Test_2               Disclosure                    56
          Test_1               Tale                        54
          Test_4               Olympics                         2
          Note that this works fine for this simple example, but might get it wrong if, for example, you had a document "Times, times, times, times, times, times, times, times". In this case the score of the left hand side of the ACCUM would be higher than the right hand side, so that document would get boosted too high. Instead you could use progressive relaxation:
          select id, xt.title, score(1)
          from t, xmltable( '/book' PASSING XMLTYPE(t.data) 
                            COLUMNS title VARCHAR2(30) PATH '@title' 
                          ) as xt 
          where contains (data, '
          <query>
            <textquery>
              <progression>
                <seq>times AND (blue WITHIN booktitle)*10*10</seq>
                <seq>times AND (disclosure WITHIN booktitle)*10*10</seq>
                <seq>times AND (tale WITHIN booktitle)*10*10</seq>
                <seq>times</seq>
             </progression>
            </textquery>
          </query>
          ', 1) > 0
          order by score(1) desc;
          This is guaranteed to ensure that the documents appear in the specified order of titles.

          Note the *10*10 is to push the scoring of the right hand side of the AND to the maximum value of 100. Since AND scores the minimum of its operands, that makes sure that the "WITHIN booktitle" part has no effect on the overall score.
          • 2. Re: Assigning score based on criteria
            udayaraj
            Thank you very much Roger.

            I read an article on Progressive relaxation written by you on [Oracle Technet|http://www.oracle.com/technetwork/database/enterprise-edition/prog-relax-099280.html] which made it more easier for me to understand the subject

            Thank you once again

            Kind Regards
            Raj
            • 3. Re: Assigning score based on criteria
              udayaraj
              Hello Roger

              I was trying the solution you had given me on the original data, but the result seems to be not in proper order. Could you kindly have a look?

              Oracle DB version - Enterprise edition 10.2.0.5.0

              raj@cms> drop table c
                2  /
              
              Table dropped.
              
              Elapsed: 00:00:00.18
              raj@cms> 
              raj@cms> create table c(id varchar2(255), c_type_id number(4), data clob)
                2  /
              
              Table created.
              
              Elapsed: 00:00:00.02
              raj@cms> 
              raj@cms> alter table c add constraint c_PK primary key (id, c_type_id)
                2  /
              
              Table altered.
              
              Elapsed: 00:00:00.02
              raj@cms> insert all
                2  into c (id, c_Type_id, data) values
                3  ('POI_DE3_TAX_19701106_0001', 1701, '<n-docbody>
                4                 <poi class="id">POI_DE3_TAX_19701106_0001 DE3_TAX_19701106_0001</poi>
                5                 <poi class="pk">昭和45年11月 6日</poi>
                6                 <poi class="fk">昭和45年 9月30日</poi>
                7                 <poi class="abstract">共同相続人</poi>
                8            </n-docbody>')
                9  into c (id, c_type_id, data) values
               10  ('POI_DE3_TAX_19700930_0002', 1701, '<n-docbody>
               11                 <poi class="id">POI_DE3_TAX_19700930_0002 DE3_TAX_19700930_0002</poi>
               12                 <poi class="pk">DE3 昭和45年 9月30日</poi>
               13                 <poi class="fk">昭和45年11月 6日</poi>
               14                 <poi class="abstract">有価証券の売買は</poi>
               15            </n-docbody>')
               16  into c (id, c_type_id, data) values
               17  ('POI_DE3_TAX_19701029_0003', 1701, '<n-docbody>
               18                 <poi class="id">POI_DE3_TAX_19701029_0003 DE3_TAX_19701029_0003</poi>
               19                 <poi class="pk">昭和45年10月29日</poi>
               20                 <poi class="abstract">昭和45年11月6日 0001、 昭和45年9月30日 </poi>
               21            </n-docbody>')
               22  select * from dual
               23  /
              
              3 rows created.
              
              Elapsed: 00:00:00.02
              raj@cms> 
              raj@cms> begin
                2       Ctx_Ddl.Drop_Preference('my_jp_lexer');
                3  end;
                4  /
              
              PL/SQL procedure successfully completed.
              
              Elapsed: 00:00:00.01
              raj@cms> 
              raj@cms> begin
                2       Ctx_Ddl.Create_Preference('my_jp_lexer','JAPANESE_VGRAM_LEXER');
                3  end;
                4  /
              
              PL/SQL procedure successfully completed.
              
              Elapsed: 00:00:00.01
              raj@cms> 
              raj@cms> 
              raj@cms> begin
                2       ctx_ddl.drop_section_group('poigroup');
                3  end;
                4  /
              
              PL/SQL procedure successfully completed.
              
              Elapsed: 00:00:00.01
              raj@cms> 
              raj@cms> begin
                2       ctx_ddl.create_section_group('poigroup', 'XML_SECTION_GROUP');
                3       ctx_ddl.add_zone_section('poigroup', 'poi', 'poi');
                4       ctx_ddl.add_attr_section('poigroup', 'poiclass', 'poi@class');
                5  end;
                6  /
              
              PL/SQL procedure successfully completed.
              
              Elapsed: 00:00:00.01
              raj@cms> 
              raj@cms> CREATE INDEX
                2       c_text_index ON c(DATA)
                3  INDEXTYPE
                4       IS CTXSYS.CONTEXT
                5  PARAMETERS(
                6            'transactional
                7            LEXER my_jp_lexer
                8            sync (every "SYSDATE+30/1440" MEMORY 300M PARALLEL 2)
                9            STOPLIST ctxsys.default_stoplist
               10            filter ctxsys.null_filter
               11            section group poigroup
               12            '
               13         )
               14  /
              
              Index created.
              
              Elapsed: 00:00:00.13
              raj@cms> select
                2       id, score(1)
                3  from c
                4  where
                5       contains (data, '<query>
                6                        <textquery>
                7                               <progression>
                8                                      <seq>昭和45年11月 6日 AND (id WITHIN poiclass)*10*10</seq>
                9                                      <seq>昭和45年11月 6日 AND (pk WITHIN poiclass)*10*10</seq>
               10                                      <seq>昭和45年11月 6日 AND (fk WITHIN poiclass)*10*10</seq>
               11                                      <seq>昭和45年11月 6日</seq>
               12                              </progression>
               13                        </textquery>
               14                 </query>', 1) > 0
               15  order by
               16       score(1) desc
               17  /
              
              ID                                         SCORE(1)
              ---------------------------------------- ----------
              POI_DE3_TAX_19701106_0001                        76
              POI_DE3_TAX_19701029_0003                        76
              POI_DE3_TAX_19700930_0002                        76
              
              Elapsed: 00:00:00.04
              raj@cms> 
              raj@cms> select
                2       id, xt.class, score(1)
                3  from c, xmltable( '/n-docbody/poi' PASSING XMLTYPE(c.data)
                4                    COLUMNS class VARCHAR2(30) PATH '@class'
                5                  ) as xt
                6  where
                7       contains (data, '<query>
                8                        <textquery>
                9                               <progression>
               10                                      <seq>昭和45年11月 6日 AND (id WITHIN poiclass)*10*10</seq>
               11                                      <seq>昭和45年11月 6日 AND (pk WITHIN poiclass)*10*10</seq>
               12                                      <seq>昭和45年11月 6日 AND (fk WITHIN poiclass)*10*10</seq>
               13                                      <seq>昭和45年11月 6日</seq>
               14                              </progression>
               15                        </textquery>
               16                 </query>', 1) > 0
               17  order by
               18       score(1) desc
               19  /
              
              ID                                       CLASS                                                          SCORE(1)
              ---------------------------------------- ------------------------------------------------------------ ----------
              POI_DE3_TAX_19701106_0001                id                                                                   76
              POI_DE3_TAX_19701106_0001                pk                                                                   76
              POI_DE3_TAX_19701106_0001                fk                                                                   76
              POI_DE3_TAX_19701106_0001                abstract                                                             76
              POI_DE3_TAX_19700930_0002                id                                                                   76
              POI_DE3_TAX_19701029_0003                abstract                                                             76
              POI_DE3_TAX_19700930_0002                fk                                                                   76
              POI_DE3_TAX_19700930_0002                abstract                                                             76
              POI_DE3_TAX_19701029_0003                id                                                                   76
              POI_DE3_TAX_19701029_0003                pk                                                                   76
              POI_DE3_TAX_19700930_0002                pk                                                                   76
              
              11 rows selected.
              
              Elapsed: 00:00:00.07
              The results should be in the following order

              1. POI_DE3_TAX_19701106_0001 => The string "昭和45年11月 6日" which is being searched, exists in <poi class=pk> so should have higher score
              2. POI_DE3_TAX_19700930_0002 => The string "昭和45年11月 6日" which is being searched, exists in <poi class=fk> so should have second highest score
              3. POI_DE3_TAX_19701029_0003 => The string "昭和45年11月 6日" which is being searched, does not exist in <poi class=pk> or <poi class=fk> but in <poi class=abstract> so should have the lowest score

              Also for the query which extracts the "class", the resultset shows matches for <poi class="id">. The string "昭和45年11月 6日" does'nt exist in <poi class="id"> at all.

              Thanking you in advance
              Kind Regards
              Raj
              • 4. Re: Assigning score based on criteria
                Roger Ford-Oracle
                All of your documents have <poi class="id"> so they're all matched in the first part of the <progression>

                To match only those strings that occur in that specific segment, you'll probably need to do a nested WITHIN search.

                Let me see if I can demonstrate this.
                • 5. Re: Assigning score based on criteria
                  Roger Ford-Oracle
                  It seems we can't use nested sections, as Oracle Text does not allow you to nest attribute sections within other sections.

                  The solution is to use PATH_SECTION_GROUP which allows much more flexibility in XML searching than the simple XML section group.

                  Also, because of the way your XML is constructed, it's no longer possible for us to use XMLTABLE to identify the class - several different classes are present in each document, and we can't tell which one our hit came from. So that gives us:
                  drop table c
                  /
                  create table c(id varchar2(255), c_type_id number(4), data clob)
                  / 
                  alter table c add constraint c_PK primary key (id, c_type_id)
                  / 
                  insert all
                   into c (id, c_Type_id, data) values
                   ('POI_DE3_TAX_19701106_0001', 1701, '<n-docbody>
                                  <poi class="id">POI_DE3_TAX_19701106_0001 DE3_TAX_19701106_0001</poi>
                                  <poi class="pk">昭和45年11月 6日</poi>
                                  <poi class="fk">昭和45年 9月30日</poi>
                                 <poi class="abstract">共同相続人</poi>
                            </n-docbody>')
                  into c (id, c_type_id, data) values
                  ('POI_DE3_TAX_19700930_0002', 1701, '<n-docbody>
                                    <poi class="id">POI_DE3_TAX_19700930_0002 DE3_TAX_19700930_0002</poi>
                                 <poi class="pk">DE3 昭和45年 9月30日</poi>
                                 <poi class="fk">昭和45年11月 6日</poi>
                                 <poi class="abstract">有価証券の売買は</poi>
                                          </n-docbody>')
                  into c (id, c_type_id, data) values
                  ('POI_DE3_TAX_19701029_0003', 1701, '<n-docbody>
                                 <poi class="id">POI_DE3_TAX_19701029_0003 DE3_TAX_19701029_0003</poi>
                                 <poi class="pk">昭和45年10月29日</poi>
                                 <poi class="abstract">昭和45年11月 6日 0001、 昭和45年9月30日 </poi>
                            </n-docbody>')
                  select * from dual
                  / 
                  
                  begin
                       Ctx_Ddl.Drop_Preference('my_jp_lexer');
                  end;
                  / 
                  
                  begin
                       Ctx_Ddl.Create_Preference('my_jp_lexer','JAPANESE_VGRAM_LEXER');
                  end;
                  / 
                  begin
                       ctx_ddl.drop_section_group('poigroup');
                  end;
                  / 
                  
                  begin
                          ctx_ddl.create_section_group('poigroup', 'PATH_SECTION_GROUP');
                  end;
                  / 
                  
                  CREATE INDEX
                       c_txt_index ON c(DATA)
                  INDEXTYPE
                       IS CTXSYS.CONTEXT
                  PARAMETERS(
                            'transactional
                            LEXER my_jp_lexer
                            STOPLIST ctxsys.default_stoplist
                            filter ctxsys.null_filter
                            section group poigroup
                            '
                         )
                  / 
                  
                  column id format a25
                  column class format a10
                  
                  select
                  id, score(1)
                  from c
                  where
                       contains (data, '<query>
                                        <textquery>
                                               <progression>
                                                      <seq>(昭和45年11月 6日) INPATH (//poi[@class="id"])</seq>
                                                      <seq>(昭和45年11月 6日) INPATH (//poi[@class="pk"])</seq>
                                                      <seq>(昭和45年11月 6日) INPATH (//poi[@class="fk"])</seq>
                                                      <seq>(昭和45年11月 6日) INPATH (//poi[@class])</seq>
                                              </progression>
                                        </textquery>
                                 </query>', 1) > 0
                  order by
                       score(1) desc
                  / 
                  And the output of that is:
                  ID                   SCORE(1)
                  ------------------------- ----------
                  POI_DE3_TAX_19701106_0001       51
                  POI_DE3_TAX_19700930_0002       26
                  POI_DE3_TAX_19701029_0003        1
                  Note I made a couple of extra changes to your testcase:
                  1/ I removed "sync every ..." from the parameters clause due to problems with the job scheduler on my test machine.
                  2/ I added an extra space before the 6 in "昭和45年11月6日" in the third document, to make it the same as the other two. Without that, I wasn't getting a match on that document.
                  • 6. Re: Assigning score based on criteria
                    udayaraj
                    Thank you very much, Roger for helping out.

                    The actual table holds huge number of xml documents with different content types and is already indexed using XML_SECTION_GROUP for basic keyword search within XML sections. Changing it to PATH_SECTION_GROUP, need to test and assess how it will impact on the existing search functionality.

                    Once testing is done with different test cases, will get back to you.

                    Also regarding the extra space before the 6 in "昭和45年11月6日", the actual content has values with missing 'spaces' in some cases. The lexer being JAPNESE_VGRAM_LEXER, has limitations in dealing with special characters. Is there any possible workaround for this?


                    Thank you once again
                    Kind Regards
                    Raj

                    Edited by: udayaraj on Sep 18, 2012 12:03 AM