9 Replies Latest reply: Jun 6, 2012 1:23 PM by Barbara Boehmer RSS

    Fuzzy search - more accurate score??

    909317
      Hello again :)

      I have a problem with score returned from fuzzy search. Short example:

      Table Customer contain column sortname. Sortname is in format "Surname, Firstname". Table customer contain a lot of members with same Surname for example: "Karlsson"

      Then I run fuzzy search for name "Karlsson Hanna" and I want to have more accurate results on top.

      But the problem is that "Karlsson, Hanna" has score 7 and also "Karlsson, Anna" or "Carlsson, Anna" has score 7.
      Why? Why the "Karlsson, Hanna" does not have more??

      Example:

      --create table
      CREATE TABLE TEMP_CUSTOMER
      (     
           SORTNAME VARCHAR2(200 BYTE)
      );

      --import
      Import.sql is here - LINK REMOVED BY MODERATOR

      --create index
      exec ctx_ddl.drop_preference('TEST1_LEXER');
      exec ctx_ddl.create_preference('TEST1_LEXER', 'BASIC_LEXER');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'index_themes', 'NO');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'index_text', 'YES');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'index_stems', 'NONE');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'alternate_spelling', 'SWEDISH');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'base_letter', 'YES');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'base_letter_type', 'GENERIC');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'override_base_letter', 'TRUE');

      exec ctx_ddl.drop_preference('TEST1_SWEDISH_FUZZY_PREF');
      exec ctx_ddl.create_preference('TEST1_SWEDISH_FUZZY_PREF', 'BASIC_WORDLIST');
      exec ctx_ddl.set_attribute('TEST1_SWEDISH_FUZZY_PREF','FUZZY_MATCH','AUTO');
      exec ctx_ddl.set_attribute('TEST1_SWEDISH_FUZZY_PREF','FUZZY_SCORE','60');
      exec ctx_ddl.set_attribute('TEST1_SWEDISH_FUZZY_PREF','FUZZY_NUMRESULTS','100');
      exec ctx_ddl.set_attribute('TEST1_SWEDISH_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      exec ctx_ddl.set_attribute('TEST1_SWEDISH_FUZZY_PREF','STEMMER','AUTO');

      drop index IX_CUST_TMP_CTXSORTNAME;
      create index IX_CUST_TMP_CTXSORTNAME on TEMP_CUSTOMER (sortname) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS ('LEXER TEST1_LEXER Wordlist TEST1_SWEDISH_FUZZY_PREF SYNC(ON COMMIT)');

      --search
      SELECT sortname FROM TEMP_CUSTOMER where CONTAINS(sortname, 'Karlsson Hanna') > 0;
      SELECT score(1), sortname FROM TEMP_CUSTOMER where CONTAINS(sortname, 'fuzzy({Karlsson Hanna}, 50, 100, weight)',1) > 0 order by score(1) desc;
      SELECT score(1), sortname FROM TEMP_CUSTOMER where CONTAINS(sortname, 'fuzzy({Karlsson Hanna}, 70, 100, weight)',1) > 0 order by score(1) desc;

      Is it possible to do something with this?
      I also try to change the lexer preferences, change the printjoins, skipjoins, whitespaces, punctations, but nothing works.

      exec ctx_ddl.set_attribute('TEST1_LEXER', 'endjoins', ',');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'printjoins', ',');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'skipjoins', ',');
      exec ctx_ddl.set_attribute('TEST1_LEXER', 'punctuations', ',');

      EDIT: remove exec ctx_ddl.set_attribute('TEST1_LEXER', 'endjoins', ','); from example

      Edited by: Mariooo on 14.3.2012 3:55

      Edited by: Mariooo on 14.3.2012 3:59

      Edited by: Roger Ford on Mar 14, 2012 9:46 AM
        • 1. Re: Fuzzy search - more accurate score??
          Dom Brooks
          See the documentation for more information on the default score algorithm, particularly regarding the sort of frequency you need to get 100%:
          http://docs.oracle.com/cd/B28359_01/text.111/b28304/ascore.htm
          • 2. Re: Fuzzy search - more accurate score??
            909317
            Thank you.

            So I assume that there is no possibility to change algorithm, Am I right? (inverse frequency algorithm based on Salton's formula.)

            Or it is possible to change the behaviour with DEFINESCORE operator? I am nut sure if I understood this operator correctly.
            • 3. Re: Fuzzy search - more accurate score??
              Roger Ford-Oracle
              Thanks for posting a testcase, but your I'm not happy with that "sendspace" link you posted, as the ads on the site try to trick you into downloading suspect executables. I removed the link. I do have the file, though, and will try your testcase.

              Edited by: Roger Ford on Mar 14, 2012 10:28 AM
              • 4. Re: Fuzzy search - more accurate score??
                Roger Ford-Oracle
                I've reposted your testfile on Dropbox: http://dl.dropbox.com/u/21167892/export.sql
                • 5. Re: Fuzzy search - more accurate score??
                  Roger Ford-Oracle
                  I think the problem is that Oracle Text doesn't deal very well with fuzzy applied to phrases. We can express the query better as:
                  SELECT score(1), sortname FROM TEMP_CUSTOMER where 
                  CONTAINS(sortname, 'fuzzy({Karlsson}, 50, 50, weight) ACCUM fuzzy({Hanna}, 50, 50, weight)',1) > 0 order by score(1) desc
                  Which will separately process a fuzzy match on each word, and add the scores of the results.

                  We get
                    SCORE(1) SORTNAME
                  ---------- ------------------------------------------------------------
                          81 Leif Karlsson/Anna-Lisa Planho, Leif Karlsson/Anna-L
                          72 Karlsson, Hanna-Liisa
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                          72 Karlsson, Hanna
                  Note that the first row is scoring higher because there is more than one fuzzy match on "Hanna" in it. I can't immediately think of a solution to get rid of that, although you might want to consider using progressive relaxation to fetch exact matches first:
                  SELECT score(1), sortname FROM TEMP_CUSTOMER where 
                  CONTAINS(sortname, '
                  <textquery>
                    <query>
                      <progression>
                        <seq>Karlsson Hanna</seq>
                        <seq>Karlsson AND Hanna</seq>
                        <seq>fuzzy({Karlsson}, 50, 50, weight) ACCUM fuzzy({Hanna}, 50, 50, weight)</seq>
                      </progression>
                    </query>
                  </textquery>
                  ', 1 ) > 0
                  That produces:
                    SCORE(1) SORTNAME
                  ---------- ------------------------------
                          69 Karlsson, Hanna-Liisa
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                          69 Karlsson, Hanna
                  Adding "-" as a PRINTJOINS would prevent "Hanna-Liisa" being in that list.
                  • 6. Re: Fuzzy search - more accurate score??
                    Barbara Boehmer
                    Thanks for posting the thorough test case.

                    In addition to what Roger suggested, you could add weighting to exact matches. If you multiply any exact match by 10 twice, then it ensures than any exact match will score 100. Since you appear to be searching just names, you could also sub-order by some variation of jaro winkler and/or edit distance and/or length.

                    Suggested queries:
                    SELECT score(1), sortname 
                    FROM   TEMP_CUSTOMER 
                    where  CONTAINS
                             (sortname, 
                              '{Karlsson Hanna} * 10 * 10 
                               or ({Karlsson} and {Hanna}) * 1.1 
                               or (fuzzy({Karlsson}, 70, 100, weight) ACCUM 
                                   fuzzy({Hanna}, 70, 100, weight))',
                              1) > 0 
                    order  by score(1) desc, 
                              utl_match.jaro_winkler_similarity ('Karlsson, Hanna', sortname) desc,
                              utl_match.edit_distance ('Karlsson, Hanna', sortname),
                              length (sortname);
                    SELECT score(1), sortname 
                    FROM   TEMP_CUSTOMER 
                    where  CONTAINS
                             (sortname, 
                              '<query>
                                 <textquery>
                                   <progression>
                                     <seq>{Karlsson Hanna}</seq>
                                     <seq>{Karlsson} and {Hanna}</seq>
                                     <seq>fuzzy({Karlsson}, 70, 100, weight) ACCUM 
                                          fuzzy({Hanna}, 70, 100, weight)</seq>
                                   </progression>
                                 </textquery>
                               </query>',
                              1) > 0 
                    order  by score(1) desc, 
                              utl_match.jaro_winkler_similarity ('Karlsson, Hanna', sortname) desc,
                              utl_match.edit_distance ('Karlsson, Hanna', sortname),
                              length (sortname);
                    • 7. Re: Fuzzy search - more accurate score??
                      909317
                      Thank you for your inputs. I will try to look at this.

                      For now It seems that I have some kind of "workaround" how I can score more accurate matches at the top, so for now I am satisfied :)
                      • 8. Re: Fuzzy search - more accurate score??
                        719141
                        Hi Barbara,

                        Slightly deviating from the main question above, I have a need for fundamental fuzzy string matching and I have been reading and trying out all possible approaches. The problem that I need to solve is as "simple to put in words" as having two tables with customer related information and trying to match customer name from one table to find the closes match(s) in the other table.
                        TABLE SOURCE_CUSTOMERS_TAB
                        ( CUSTOMER_ID NUMBER
                        , CUSTOMER_NAME VARCHAR2(4000))
                        
                        AND 
                        
                        TABLE TARGET_CUSTOMERS_TAB
                        ( CUSTOMER_ID NUMBER
                        , CUSTOMER_NAME VARCHAR2(4000))
                        
                        so what needs to be done is in below pseudo code
                        
                        FOR every CUSTOMER_NAME in SOURCE_CUSTOMERS_TAB
                        LOOP
                        
                          FIND all "possible/closest match" CUSTOMER_NAMEs from TARGET_CUSTOMERS_TAB (can be more than one in order or best to worst match, or ideally one BEST match)
                        
                        END LOOP
                        This boils down to the old "FUZZY STRING MATCHING" problem.

                        Till now I have tried the inbuilt Jaro-Winkler, LD, EDIT_DISTANCE etc.. But some how I am not happy with the results since we have customer names which are GLOBAL in nature in both the tables and each of these algorithms have their own limitations. I think I have used one of the PL/SQL function written by you for Levenshtein Distance as well. I am currently trying and working on the Oracle TEXT option using the native FUZZY clause in the CONTAINS part to see if it can help.

                        If you could be kind enough to tell me, I wanted to know what would be the best approach to solve this problem

                        Cheers
                        Goldi
                        • 9. Re: Fuzzy search - more accurate score??
                          Barbara Boehmer
                          Goldi,

                          In the future, please start a new thread for a new question. It would also help to have some sample data, desired results based on that data, and your Oracle version. The new NDATA is designed for finding misspelled and out of order names. You might start with that, since it uses an index, then use other things like JARO_WINKLER_SIMILARITY, which is also designed for names, to further order the result set. You can rank the results based on a combination of those things. If you only want the closes match, then just select those with a ranking of 1. Please see the demonstration below.
                          SCOTT@orcl_11gR2> CREATE TABLE SOURCE_CUSTOMERS_TAB
                            2    ( CUSTOMER_ID     NUMBER
                            3    , CUSTOMER_NAME     VARCHAR2(4000))
                            4  /
                          
                          Table created.
                          
                          SCOTT@orcl_11gR2> INSERT ALL
                            2  INTO source_customers_tab VALUES (1, 'black smith')
                            3  INTO source_customers_tab VALUES (2, 'goldsmyth')
                            4  SELECT * FROM DUAL
                            5  /
                          
                          2 rows created.
                          
                          SCOTT@orcl_11gR2> CREATE TABLE TARGET_CUSTOMERS_TAB
                            2    ( CUSTOMER_ID     NUMBER
                            3    , CUSTOMER_NAME     VARCHAR2(4000))
                            4  /
                          
                          Table created.
                          
                          SCOTT@orcl_11gR2> INSERT ALL
                            2  INTO target_customers_tab VALUES (1, 'black smith')
                            3  INTO target_customers_tab VALUES (2, 'smith black')
                            4  INTO target_customers_tab VALUES (3, 'blacksmith')
                            5  INTO target_customers_tab VALUES (4, 'goldsmith')
                            6  INTO target_customers_tab VALUES (5, 'black smythe')
                            7  SELECT * FROM DUAL
                            8  /
                          
                          5 rows created.
                          
                          SCOTT@orcl_11gR2> BEGIN
                            2    CTX_DDL.CREATE_PREFERENCE ('customer_ds', 'MULTI_COLUMN_DATASTORE');
                            3    CTX_DDL.SET_ATTRIBUTE ('customer_ds', 'COLUMNS', 'customer_name');
                            4    CTX_DDL.CREATE_SECTION_GROUP ('customer_sg', 'BASIC_SECTION_GROUP');
                            5    CTX_DDL.ADD_NDATA_SECTION ('customer_sg', 'customer_name', 'customer_name');
                            6  END;
                            7  /
                          
                          PL/SQL procedure successfully completed.
                          
                          SCOTT@orcl_11gR2> CREATE INDEX customer_idx
                            2  ON target_customers_tab (customer_name)
                            3  INDEXTYPE IS CTXSYS.CONTEXT
                            4  PARAMETERS
                            5    ('DATASTORE     customer_ds
                            6        SECTION GROUP     customer_sg')
                            7  /
                          
                          Index created.
                          
                          SCOTT@orcl_11gR2> COLUMN source FORMAT A20
                          SCOTT@orcl_11gR2> COLUMN target FORMAT A20
                          SCOTT@orcl_11gR2> SELECT s.customer_name source,
                            2           SCORE(1) score,
                            3           UTL_MATCH.JARO_WINKLER_SIMILARITY
                            4             (s.customer_name,
                            5              t.customer_name) jws,
                            6           ROW_NUMBER () OVER
                            7             (PARTITION BY s.customer_name
                            8              ORDER BY SCORE(1) DESC,
                            9              UTL_MATCH.JARO_WINKLER_SIMILARITY
                           10             (s.customer_name,
                           11              t.customer_name) DESC) ranking,
                           12           t.customer_name target
                           13  FROM   source_customers_tab s,
                           14           target_customers_tab t
                           15  WHERE  CONTAINS
                           16             (t.customer_name,
                           17              'NDATA (customer_name, ' || s.customer_name || ', N, P)', 1) > 0
                           18  ORDER  BY source, score DESC, ranking
                           19  /
                          
                          SOURCE                    SCORE        JWS    RANKING TARGET
                          -------------------- ---------- ---------- ---------- --------------------
                          black smith                 100        100          1 black smith
                          black smith                 100         39          2 smith black
                          black smith                  88         94          3 black smythe
                          black smith                  82         98          4 blacksmith
                          black smith                  38         73          5 goldsmith
                          goldsmyth                    92         95          1 goldsmith
                          goldsmyth                    40         72          2 black smythe
                          
                          7 rows selected.
                          
                          SCOTT@orcl_11gR2> COLUMN closest_match FORMAT A20
                          SCOTT@orcl_11gR2> SELECT source, target closest_match
                            2  FROM   (SELECT s.customer_name source,
                            3                ROW_NUMBER () OVER
                            4                  (PARTITION BY s.customer_name
                            5                   ORDER BY SCORE(1) DESC,
                            6                   UTL_MATCH.JARO_WINKLER_SIMILARITY
                            7                  (s.customer_name,
                            8                   t.customer_name) DESC) ranking,
                            9                t.customer_name target
                           10            FROM   source_customers_tab s,
                           11                target_customers_tab t
                           12            WHERE  CONTAINS
                           13                  (t.customer_name,
                           14                   'NDATA (customer_name, ' || s.customer_name || ', N, P)', 1) > 0)
                           15  WHERE  ranking = 1
                           16  ORDER  BY source
                           17  /
                          
                          SOURCE               CLOSEST_MATCH
                          -------------------- --------------------
                          black smith          black smith
                          goldsmyth            goldsmith
                          
                          2 rows selected.