Forum Stats

  • 3,825,765 Users
  • 2,260,558 Discussions
  • 7,896,665 Comments

Discussions

fuzzy matching on names

dba94403
dba94403 Member Posts: 20 Blue Ribbon
edited Aug 14, 2017 1:41PM in Text

We are currently using edit_distance to find similarity on user's first name.  We like to expand the search capability such that it finds similarity in nicknames like Bill and William, Anthony and Tony.  Is there any feature in Oracle Text that can handle this?  Thanks.

Best Answer

  • Barbara Boehmer
    Barbara Boehmer Member Posts: 4,625 Gold Trophy
    edited Aug 9, 2017 12:48AM Answer ✓

    There are various things that you might use.  For the nicknames, you will need a thesaurus that has the names and nicknames as synonyms.  Oracle provides a sample thesaurus of nicknames in <oracle_home>>ctx\sample\thes\dr0thsnames.txt.  You can start with that or provide your own, either by loading from a text file using ctxload or creating the thesaurus and loading it using the ctx_thes package.  You can modify the text file directly prior to loading or add synonyms after loading using the ctx_thes package.  You can then use this with any of the functions in the utl_match package.  For names, you may find that jaro_winkler is more appropriate than edit_distance.  Alternatively, you could create a context index and use contains with fuzzy and syn.  The index should provide faster searches on large data sets and will limit your results.  You can then use utl_match functions for ordering the limited result set.  I have provided some examples below of loading the sample thesaurus of nicknames, adding additional nicknames as synonyms, what querying the thesaurus it creates can produce, and using those synonyms in queries with utl_match functions and using those synonyms in queries using contains and fuzzy and syn.  These are just a few ideas.  There are other methods, such as ndata, that you might look into.

    [email protected]_12.1.0.2.0> -- one time load of thesaurus:
    [email protected]_12.1.0.2.0> connect ctxsys/ctxsys
    Connected.
    [email protected]_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\app\baboehme\product\12.1.0\dbhome_2\ctx\sample\thes\dr0thsnames.txt
    Connecting...
    Creating thesaurus nicknames...
    Thesaurus nicknames created...
    Processing...
    1000 lines processed
    1996 lines processed successfully
    Beginning insert...1996 lines inserted successfully
    Disconnected

    [email protected]_12.1.0.2.0> -- example of adding synonyms:
    [email protected]_12.1.0.2.0> begin
      2    ctx_thes.create_relation ('nicknames', 'Anthony', 'syn', 'Toni');
      3  end;
      4  /

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> connect scott/tiger
    Connected.
    [email protected]_12.1.0.2.0> -- examples of what can be extracted from the thesaurus:
    [email protected]_12.1.0.2.0> select ctx_thes.syn ('Bill', 'nicknames') from dual
      2  /

    CTX_THES.SYN('BILL','NICKNAMES')
    --------------------------------------------------------------------------------
    {BILL}|{WILLIAM }|{BUD}|{WILL}|{BILL}|{WILLIS  }|{WILLY}

    1 row selected.

    [email protected]_12.1.0.2.0> select ctx_thes.syn ('Anthony', 'nicknames') from dual
      2  /

    CTX_THES.SYN('ANTHONY','NICKNAMES')
    --------------------------------------------------------------------------------
    {ANTHONY}|{TONY}|{TONI}

    1 row selected.

    [email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
      2  from  (select ctx_thes.syn ('Bill', 'nicknames') names
      3          from  dual)
      4  connect by level <= regexp_count (names, '{')
      5  /

    RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
    --------------------------------------------------------------------------------
    BILL
    WILLIAM
    BUD
    WILL
    BILL
    WILLIS
    WILLY

    7 rows selected.

    [email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
      2  from  (select ctx_thes.syn ('Anthony', 'nicknames') names
      3          from  dual)
      4  connect by level <= regexp_count (names, '{')
      5  /

    RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
    --------------------------------------------------------------------------------
    ANTHONY
    TONY
    TONI

    3 rows selected.

    [email protected]_12.1.0.2.0> -- table and data for testing:
    [email protected]_12.1.0.2.0> create table test_tab (first_name varchar2(30))
      2  /

    Table created.

    [email protected]_12.1.0.2.0> insert all
      2  into test_tab values ('Bill')
      3  into test_tab values ('William')
      4  into test_tab values ('Anthony')
      5  into test_tab values ('Tony')
      6  into test_tab values ('Toni')
      7  into test_tab values ('Tiny')
      8  into test_tab values ('John')
      9  select * from dual
    10  /

    7 rows created.

    [email protected]_12.1.0.2.0> -- sample queries using utl_match and ctx_thes.syn:
    [email protected]_12.1.0.2.0> variable search_name varchar2(30)
    [email protected]_12.1.0.2.0> exec :search_name := 'Bill'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> column first_name format a15
    [email protected]_12.1.0.2.0> select max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
      2        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
      3        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
      4        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
      5        test_tab.first_name
      6  from  test_tab,
      7        (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
      8          from  (select ctx_thes.syn (:search_name, 'nicknames') names
      9                  from  dual)
    10          connect by level <= regexp_count (names, '{')) names
    11  group  by test_tab.first_name
    12  order  by lds1 desc, lds2 desc, jws1 desc, jws2 desc
    13  /

          LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------------
          100        100        100        100 Bill
          100        43        100        72 William
            40        25        63        50 Tiny
            20          0        48          0 Tony
            17          0        47          0 Toni
            15          0        44          0 Anthony
            0          0          0          0 John

    7 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Anthony'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

          LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------------
          100        100        100        100 Anthony
          100        58        100        42 Tony
          100        43        100        42 Toni
            75        43        85        42 Tiny
            29        29        66        42 John
            15          0        46          0 William
            0          0          0          0 Bill

    7 rows selected.

    [email protected]_12.1.0.2.0> -- usage of context index and fuzzy matching with utl_match only for ordering results:
    [email protected]_12.1.0.2.0> create index test_idx on test_tab (first_name) indextype is ctxsys.context parameters ('sync (on commit)')
      2  /

    Index created.

    [email protected]_12.1.0.2.0> select score(1) score,
      2        utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name)) lds,
      3        utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))  jws,
      4        first_name
      5  from  test_tab
      6  where  contains (first_name, 'fuzzy (syn (' || :search_name || ', nicknames), 1, 5000, W)', 1) > 0
      7  order  by score desc, lds desc, jws desc
      8  /

        SCORE        LDS        JWS FIRST_NAME
    ---------- ---------- ---------- ---------------
            6        100        100 Anthony
            6        58        42 Tony
            6        43        42 Toni
            5        43        42 Tiny

    4 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Bill'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

        SCORE        LDS        JWS FIRST_NAME
    ---------- ---------- ---------- ---------------
            6        100        100 Bill
            6        43        72 William

    2 rows selected.

    [email protected]_12.1.0.2.0> -- combination of methods (may need to loop through names values with larger data sets):
    [email protected]_12.1.0.2.0> select max (score(1)) score,
      2        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
      3        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
      4        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
      5        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
      6        test_tab.first_name
      7  from  test_tab,
      8        (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
      9          from  (select ctx_thes.syn (:search_name, 'nicknames') names
    10                  from  dual)
    11          connect by level <= regexp_count (names, '{')) names
    12  where  contains (test_tab.first_name, 'fuzzy (' || names.name || ', 1, 5000, W)', 1) > 0
    13  group  by test_tab.first_name
    14  order  by score desc, lds1 desc, lds2 desc, jws1 desc, jws2 desc
    15  /

        SCORE      LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------- ---------------
            55        100        100        100        100 Bill
            55        100        43        100        72 William

    2 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Anthony'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

        SCORE      LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------- ---------------
            55        100        100        100        100 Anthony
            46        100        58        100        42 Tony
            46        100        43        100        42 Toni
            21        75        43        85        42 Tiny

    4 rows selected.

Answers

  • Barbara Boehmer
    Barbara Boehmer Member Posts: 4,625 Gold Trophy
    edited Aug 9, 2017 12:48AM Answer ✓

    There are various things that you might use.  For the nicknames, you will need a thesaurus that has the names and nicknames as synonyms.  Oracle provides a sample thesaurus of nicknames in <oracle_home>>ctx\sample\thes\dr0thsnames.txt.  You can start with that or provide your own, either by loading from a text file using ctxload or creating the thesaurus and loading it using the ctx_thes package.  You can modify the text file directly prior to loading or add synonyms after loading using the ctx_thes package.  You can then use this with any of the functions in the utl_match package.  For names, you may find that jaro_winkler is more appropriate than edit_distance.  Alternatively, you could create a context index and use contains with fuzzy and syn.  The index should provide faster searches on large data sets and will limit your results.  You can then use utl_match functions for ordering the limited result set.  I have provided some examples below of loading the sample thesaurus of nicknames, adding additional nicknames as synonyms, what querying the thesaurus it creates can produce, and using those synonyms in queries with utl_match functions and using those synonyms in queries using contains and fuzzy and syn.  These are just a few ideas.  There are other methods, such as ndata, that you might look into.

    [email protected]_12.1.0.2.0> -- one time load of thesaurus:
    [email protected]_12.1.0.2.0> connect ctxsys/ctxsys
    Connected.
    [email protected]_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\app\baboehme\product\12.1.0\dbhome_2\ctx\sample\thes\dr0thsnames.txt
    Connecting...
    Creating thesaurus nicknames...
    Thesaurus nicknames created...
    Processing...
    1000 lines processed
    1996 lines processed successfully
    Beginning insert...1996 lines inserted successfully
    Disconnected

    [email protected]_12.1.0.2.0> -- example of adding synonyms:
    [email protected]_12.1.0.2.0> begin
      2    ctx_thes.create_relation ('nicknames', 'Anthony', 'syn', 'Toni');
      3  end;
      4  /

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> connect scott/tiger
    Connected.
    [email protected]_12.1.0.2.0> -- examples of what can be extracted from the thesaurus:
    [email protected]_12.1.0.2.0> select ctx_thes.syn ('Bill', 'nicknames') from dual
      2  /

    CTX_THES.SYN('BILL','NICKNAMES')
    --------------------------------------------------------------------------------
    {BILL}|{WILLIAM }|{BUD}|{WILL}|{BILL}|{WILLIS  }|{WILLY}

    1 row selected.

    [email protected]_12.1.0.2.0> select ctx_thes.syn ('Anthony', 'nicknames') from dual
      2  /

    CTX_THES.SYN('ANTHONY','NICKNAMES')
    --------------------------------------------------------------------------------
    {ANTHONY}|{TONY}|{TONI}

    1 row selected.

    [email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
      2  from  (select ctx_thes.syn ('Bill', 'nicknames') names
      3          from  dual)
      4  connect by level <= regexp_count (names, '{')
      5  /

    RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
    --------------------------------------------------------------------------------
    BILL
    WILLIAM
    BUD
    WILL
    BILL
    WILLIS
    WILLY

    7 rows selected.

    [email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
      2  from  (select ctx_thes.syn ('Anthony', 'nicknames') names
      3          from  dual)
      4  connect by level <= regexp_count (names, '{')
      5  /

    RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
    --------------------------------------------------------------------------------
    ANTHONY
    TONY
    TONI

    3 rows selected.

    [email protected]_12.1.0.2.0> -- table and data for testing:
    [email protected]_12.1.0.2.0> create table test_tab (first_name varchar2(30))
      2  /

    Table created.

    [email protected]_12.1.0.2.0> insert all
      2  into test_tab values ('Bill')
      3  into test_tab values ('William')
      4  into test_tab values ('Anthony')
      5  into test_tab values ('Tony')
      6  into test_tab values ('Toni')
      7  into test_tab values ('Tiny')
      8  into test_tab values ('John')
      9  select * from dual
    10  /

    7 rows created.

    [email protected]_12.1.0.2.0> -- sample queries using utl_match and ctx_thes.syn:
    [email protected]_12.1.0.2.0> variable search_name varchar2(30)
    [email protected]_12.1.0.2.0> exec :search_name := 'Bill'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> column first_name format a15
    [email protected]_12.1.0.2.0> select max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
      2        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
      3        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
      4        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
      5        test_tab.first_name
      6  from  test_tab,
      7        (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
      8          from  (select ctx_thes.syn (:search_name, 'nicknames') names
      9                  from  dual)
    10          connect by level <= regexp_count (names, '{')) names
    11  group  by test_tab.first_name
    12  order  by lds1 desc, lds2 desc, jws1 desc, jws2 desc
    13  /

          LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------------
          100        100        100        100 Bill
          100        43        100        72 William
            40        25        63        50 Tiny
            20          0        48          0 Tony
            17          0        47          0 Toni
            15          0        44          0 Anthony
            0          0          0          0 John

    7 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Anthony'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

          LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------------
          100        100        100        100 Anthony
          100        58        100        42 Tony
          100        43        100        42 Toni
            75        43        85        42 Tiny
            29        29        66        42 John
            15          0        46          0 William
            0          0          0          0 Bill

    7 rows selected.

    [email protected]_12.1.0.2.0> -- usage of context index and fuzzy matching with utl_match only for ordering results:
    [email protected]_12.1.0.2.0> create index test_idx on test_tab (first_name) indextype is ctxsys.context parameters ('sync (on commit)')
      2  /

    Index created.

    [email protected]_12.1.0.2.0> select score(1) score,
      2        utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name)) lds,
      3        utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))  jws,
      4        first_name
      5  from  test_tab
      6  where  contains (first_name, 'fuzzy (syn (' || :search_name || ', nicknames), 1, 5000, W)', 1) > 0
      7  order  by score desc, lds desc, jws desc
      8  /

        SCORE        LDS        JWS FIRST_NAME
    ---------- ---------- ---------- ---------------
            6        100        100 Anthony
            6        58        42 Tony
            6        43        42 Toni
            5        43        42 Tiny

    4 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Bill'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

        SCORE        LDS        JWS FIRST_NAME
    ---------- ---------- ---------- ---------------
            6        100        100 Bill
            6        43        72 William

    2 rows selected.

    [email protected]_12.1.0.2.0> -- combination of methods (may need to loop through names values with larger data sets):
    [email protected]_12.1.0.2.0> select max (score(1)) score,
      2        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
      3        max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
      4        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
      5        max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
      6        test_tab.first_name
      7  from  test_tab,
      8        (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
      9          from  (select ctx_thes.syn (:search_name, 'nicknames') names
    10                  from  dual)
    11          connect by level <= regexp_count (names, '{')) names
    12  where  contains (test_tab.first_name, 'fuzzy (' || names.name || ', 1, 5000, W)', 1) > 0
    13  group  by test_tab.first_name
    14  order  by score desc, lds1 desc, lds2 desc, jws1 desc, jws2 desc
    15  /

        SCORE      LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------- ---------------
            55        100        100        100        100 Bill
            55        100        43        100        72 William

    2 rows selected.

    [email protected]_12.1.0.2.0> exec :search_name := 'Anthony'

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> /

        SCORE      LDS1      LDS2      JWS1      JWS2 FIRST_NAME
    ---------- ---------- ---------- ---------- ---------- ---------------
            55        100        100        100        100 Anthony
            46        100        58        100        42 Tony
            46        100        43        100        42 Toni
            21        75        43        85        42 Tiny

    4 rows selected.

  • Roger Ford-Oracle
    Roger Ford-Oracle Member Posts: 1,132 Employee
    edited Aug 9, 2017 6:07AM

    I would take a look at NDATA first, as it's specifically designed for names.

    There's some instructions and sample code here

    NDATA includes the option to add your own thesaurus to support nicknames / diminutives like Bill / William or Lizzie / Beth / Elizabeth. The comments in the sample code should show you how to use it.

This discussion has been closed.