Discussions
Categories
- 17.9K All Categories
- 3.4K Industry Applications
- 3.3K Intelligent Advisor
- 62 Insurance
- 536K On-Premises Infrastructure
- 138.2K Analytics Software
- 38.6K Application Development Software
- 5.7K Cloud Platform
- 109.4K Database Software
- 17.5K Enterprise Manager
- 8.8K Hardware
- 71.1K Infrastructure Software
- 105.2K Integration
- 41.6K Security Software
fuzzy matching on names
We are currently using edit_distance to find similarity on user's first name. We like to expand the search capability such that it finds similarity in nicknames like Bill and William, Anthony and Tony. Is there any feature in Oracle Text that can handle this? Thanks.
Best Answer
-
There are various things that you might use. For the nicknames, you will need a thesaurus that has the names and nicknames as synonyms. Oracle provides a sample thesaurus of nicknames in <oracle_home>>ctx\sample\thes\dr0thsnames.txt. You can start with that or provide your own, either by loading from a text file using ctxload or creating the thesaurus and loading it using the ctx_thes package. You can modify the text file directly prior to loading or add synonyms after loading using the ctx_thes package. You can then use this with any of the functions in the utl_match package. For names, you may find that jaro_winkler is more appropriate than edit_distance. Alternatively, you could create a context index and use contains with fuzzy and syn. The index should provide faster searches on large data sets and will limit your results. You can then use utl_match functions for ordering the limited result set. I have provided some examples below of loading the sample thesaurus of nicknames, adding additional nicknames as synonyms, what querying the thesaurus it creates can produce, and using those synonyms in queries with utl_match functions and using those synonyms in queries using contains and fuzzy and syn. These are just a few ideas. There are other methods, such as ndata, that you might look into.
[email protected]_12.1.0.2.0> -- one time load of thesaurus:
[email protected]_12.1.0.2.0> connect ctxsys/ctxsys
Connected.
[email protected]_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\app\baboehme\product\12.1.0\dbhome_2\ctx\sample\thes\dr0thsnames.txt
Connecting...
Creating thesaurus nicknames...
Thesaurus nicknames created...
Processing...
1000 lines processed
1996 lines processed successfully
Beginning insert...1996 lines inserted successfully
Disconnected[email protected]_12.1.0.2.0> -- example of adding synonyms:
[email protected]_12.1.0.2.0> begin
2 ctx_thes.create_relation ('nicknames', 'Anthony', 'syn', 'Toni');
3 end;
4 /PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> connect scott/tiger
Connected.
[email protected]_12.1.0.2.0> -- examples of what can be extracted from the thesaurus:
[email protected]_12.1.0.2.0> select ctx_thes.syn ('Bill', 'nicknames') from dual
2 /CTX_THES.SYN('BILL','NICKNAMES')
--------------------------------------------------------------------------------
{BILL}|{WILLIAM }|{BUD}|{WILL}|{BILL}|{WILLIS }|{WILLY}1 row selected.
[email protected]_12.1.0.2.0> select ctx_thes.syn ('Anthony', 'nicknames') from dual
2 /CTX_THES.SYN('ANTHONY','NICKNAMES')
--------------------------------------------------------------------------------
{ANTHONY}|{TONY}|{TONI}1 row selected.
[email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
2 from (select ctx_thes.syn ('Bill', 'nicknames') names
3 from dual)
4 connect by level <= regexp_count (names, '{')
5 /RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
--------------------------------------------------------------------------------
BILL
WILLIAM
BUD
WILL
BILL
WILLIS
WILLY7 rows selected.
[email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
2 from (select ctx_thes.syn ('Anthony', 'nicknames') names
3 from dual)
4 connect by level <= regexp_count (names, '{')
5 /RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
--------------------------------------------------------------------------------
ANTHONY
TONY
TONI3 rows selected.
[email protected]_12.1.0.2.0> -- table and data for testing:
[email protected]_12.1.0.2.0> create table test_tab (first_name varchar2(30))
2 /Table created.
[email protected]_12.1.0.2.0> insert all
2 into test_tab values ('Bill')
3 into test_tab values ('William')
4 into test_tab values ('Anthony')
5 into test_tab values ('Tony')
6 into test_tab values ('Toni')
7 into test_tab values ('Tiny')
8 into test_tab values ('John')
9 select * from dual
10 /7 rows created.
[email protected]_12.1.0.2.0> -- sample queries using utl_match and ctx_thes.syn:
[email protected]_12.1.0.2.0> variable search_name varchar2(30)
[email protected]_12.1.0.2.0> exec :search_name := 'Bill'PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> column first_name format a15
[email protected]_12.1.0.2.0> select max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
2 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
3 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
4 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
5 test_tab.first_name
6 from test_tab,
7 (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
8 from (select ctx_thes.syn (:search_name, 'nicknames') names
9 from dual)
10 connect by level <= regexp_count (names, '{')) names
11 group by test_tab.first_name
12 order by lds1 desc, lds2 desc, jws1 desc, jws2 desc
13 /LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------------
100 100 100 100 Bill
100 43 100 72 William
40 25 63 50 Tiny
20 0 48 0 Tony
17 0 47 0 Toni
15 0 44 0 Anthony
0 0 0 0 John7 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Anthony'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------------
100 100 100 100 Anthony
100 58 100 42 Tony
100 43 100 42 Toni
75 43 85 42 Tiny
29 29 66 42 John
15 0 46 0 William
0 0 0 0 Bill7 rows selected.
[email protected]_12.1.0.2.0> -- usage of context index and fuzzy matching with utl_match only for ordering results:
[email protected]_12.1.0.2.0> create index test_idx on test_tab (first_name) indextype is ctxsys.context parameters ('sync (on commit)')
2 /Index created.
[email protected]_12.1.0.2.0> select score(1) score,
2 utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name)) lds,
3 utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name)) jws,
4 first_name
5 from test_tab
6 where contains (first_name, 'fuzzy (syn (' || :search_name || ', nicknames), 1, 5000, W)', 1) > 0
7 order by score desc, lds desc, jws desc
8 /SCORE LDS JWS FIRST_NAME
---------- ---------- ---------- ---------------
6 100 100 Anthony
6 58 42 Tony
6 43 42 Toni
5 43 42 Tiny4 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Bill'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
SCORE LDS JWS FIRST_NAME
---------- ---------- ---------- ---------------
6 100 100 Bill
6 43 72 William2 rows selected.
[email protected]_12.1.0.2.0> -- combination of methods (may need to loop through names values with larger data sets):
[email protected]_12.1.0.2.0> select max (score(1)) score,
2 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
3 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
4 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
5 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
6 test_tab.first_name
7 from test_tab,
8 (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
9 from (select ctx_thes.syn (:search_name, 'nicknames') names
10 from dual)
11 connect by level <= regexp_count (names, '{')) names
12 where contains (test_tab.first_name, 'fuzzy (' || names.name || ', 1, 5000, W)', 1) > 0
13 group by test_tab.first_name
14 order by score desc, lds1 desc, lds2 desc, jws1 desc, jws2 desc
15 /SCORE LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------- ---------------
55 100 100 100 100 Bill
55 100 43 100 72 William2 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Anthony'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
SCORE LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------- ---------------
55 100 100 100 100 Anthony
46 100 58 100 42 Tony
46 100 43 100 42 Toni
21 75 43 85 42 Tiny4 rows selected.
Answers
-
There are various things that you might use. For the nicknames, you will need a thesaurus that has the names and nicknames as synonyms. Oracle provides a sample thesaurus of nicknames in <oracle_home>>ctx\sample\thes\dr0thsnames.txt. You can start with that or provide your own, either by loading from a text file using ctxload or creating the thesaurus and loading it using the ctx_thes package. You can modify the text file directly prior to loading or add synonyms after loading using the ctx_thes package. You can then use this with any of the functions in the utl_match package. For names, you may find that jaro_winkler is more appropriate than edit_distance. Alternatively, you could create a context index and use contains with fuzzy and syn. The index should provide faster searches on large data sets and will limit your results. You can then use utl_match functions for ordering the limited result set. I have provided some examples below of loading the sample thesaurus of nicknames, adding additional nicknames as synonyms, what querying the thesaurus it creates can produce, and using those synonyms in queries with utl_match functions and using those synonyms in queries using contains and fuzzy and syn. These are just a few ideas. There are other methods, such as ndata, that you might look into.
[email protected]_12.1.0.2.0> -- one time load of thesaurus:
[email protected]_12.1.0.2.0> connect ctxsys/ctxsys
Connected.
[email protected]_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\app\baboehme\product\12.1.0\dbhome_2\ctx\sample\thes\dr0thsnames.txt
Connecting...
Creating thesaurus nicknames...
Thesaurus nicknames created...
Processing...
1000 lines processed
1996 lines processed successfully
Beginning insert...1996 lines inserted successfully
Disconnected[email protected]_12.1.0.2.0> -- example of adding synonyms:
[email protected]_12.1.0.2.0> begin
2 ctx_thes.create_relation ('nicknames', 'Anthony', 'syn', 'Toni');
3 end;
4 /PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> connect scott/tiger
Connected.
[email protected]_12.1.0.2.0> -- examples of what can be extracted from the thesaurus:
[email protected]_12.1.0.2.0> select ctx_thes.syn ('Bill', 'nicknames') from dual
2 /CTX_THES.SYN('BILL','NICKNAMES')
--------------------------------------------------------------------------------
{BILL}|{WILLIAM }|{BUD}|{WILL}|{BILL}|{WILLIS }|{WILLY}1 row selected.
[email protected]_12.1.0.2.0> select ctx_thes.syn ('Anthony', 'nicknames') from dual
2 /CTX_THES.SYN('ANTHONY','NICKNAMES')
--------------------------------------------------------------------------------
{ANTHONY}|{TONY}|{TONI}1 row selected.
[email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
2 from (select ctx_thes.syn ('Bill', 'nicknames') names
3 from dual)
4 connect by level <= regexp_count (names, '{')
5 /RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
--------------------------------------------------------------------------------
BILL
WILLIAM
BUD
WILL
BILL
WILLIS
WILLY7 rows selected.
[email protected]_12.1.0.2.0> select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9))
2 from (select ctx_thes.syn ('Anthony', 'nicknames') names
3 from dual)
4 connect by level <= regexp_count (names, '{')
5 /RTRIM(REGEXP_SUBSTR(NAMES,'[^{}|]+',1,ROWNUM),CHR(9))
--------------------------------------------------------------------------------
ANTHONY
TONY
TONI3 rows selected.
[email protected]_12.1.0.2.0> -- table and data for testing:
[email protected]_12.1.0.2.0> create table test_tab (first_name varchar2(30))
2 /Table created.
[email protected]_12.1.0.2.0> insert all
2 into test_tab values ('Bill')
3 into test_tab values ('William')
4 into test_tab values ('Anthony')
5 into test_tab values ('Tony')
6 into test_tab values ('Toni')
7 into test_tab values ('Tiny')
8 into test_tab values ('John')
9 select * from dual
10 /7 rows created.
[email protected]_12.1.0.2.0> -- sample queries using utl_match and ctx_thes.syn:
[email protected]_12.1.0.2.0> variable search_name varchar2(30)
[email protected]_12.1.0.2.0> exec :search_name := 'Bill'PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> column first_name format a15
[email protected]_12.1.0.2.0> select max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
2 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
3 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
4 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
5 test_tab.first_name
6 from test_tab,
7 (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
8 from (select ctx_thes.syn (:search_name, 'nicknames') names
9 from dual)
10 connect by level <= regexp_count (names, '{')) names
11 group by test_tab.first_name
12 order by lds1 desc, lds2 desc, jws1 desc, jws2 desc
13 /LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------------
100 100 100 100 Bill
100 43 100 72 William
40 25 63 50 Tiny
20 0 48 0 Tony
17 0 47 0 Toni
15 0 44 0 Anthony
0 0 0 0 John7 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Anthony'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------------
100 100 100 100 Anthony
100 58 100 42 Tony
100 43 100 42 Toni
75 43 85 42 Tiny
29 29 66 42 John
15 0 46 0 William
0 0 0 0 Bill7 rows selected.
[email protected]_12.1.0.2.0> -- usage of context index and fuzzy matching with utl_match only for ordering results:
[email protected]_12.1.0.2.0> create index test_idx on test_tab (first_name) indextype is ctxsys.context parameters ('sync (on commit)')
2 /Index created.
[email protected]_12.1.0.2.0> select score(1) score,
2 utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name)) lds,
3 utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name)) jws,
4 first_name
5 from test_tab
6 where contains (first_name, 'fuzzy (syn (' || :search_name || ', nicknames), 1, 5000, W)', 1) > 0
7 order by score desc, lds desc, jws desc
8 /SCORE LDS JWS FIRST_NAME
---------- ---------- ---------- ---------------
6 100 100 Anthony
6 58 42 Tony
6 43 42 Toni
5 43 42 Tiny4 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Bill'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
SCORE LDS JWS FIRST_NAME
---------- ---------- ---------- ---------------
6 100 100 Bill
6 43 72 William2 rows selected.
[email protected]_12.1.0.2.0> -- combination of methods (may need to loop through names values with larger data sets):
[email protected]_12.1.0.2.0> select max (score(1)) score,
2 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (names.name))) lds1,
3 max (utl_match.edit_distance_similarity (upper (test_tab.first_name), upper (:search_name))) lds2,
4 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (names.name))) jws1,
5 max (utl_match.jaro_winkler_similarity (upper (test_tab.first_name), upper (:search_name))) jws2,
6 test_tab.first_name
7 from test_tab,
8 (select rtrim (regexp_substr (names, '[^{}|]+', 1, rownum), chr(9)) name
9 from (select ctx_thes.syn (:search_name, 'nicknames') names
10 from dual)
11 connect by level <= regexp_count (names, '{')) names
12 where contains (test_tab.first_name, 'fuzzy (' || names.name || ', 1, 5000, W)', 1) > 0
13 group by test_tab.first_name
14 order by score desc, lds1 desc, lds2 desc, jws1 desc, jws2 desc
15 /SCORE LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------- ---------------
55 100 100 100 100 Bill
55 100 43 100 72 William2 rows selected.
[email protected]_12.1.0.2.0> exec :search_name := 'Anthony'
PL/SQL procedure successfully completed.
[email protected]_12.1.0.2.0> /
SCORE LDS1 LDS2 JWS1 JWS2 FIRST_NAME
---------- ---------- ---------- ---------- ---------- ---------------
55 100 100 100 100 Anthony
46 100 58 100 42 Tony
46 100 43 100 42 Toni
21 75 43 85 42 Tiny4 rows selected.
-
I would take a look at NDATA first, as it's specifically designed for names.
There's some instructions and sample code here
NDATA includes the option to add your own thesaurus to support nicknames / diminutives like Bill / William or Lizzie / Beth / Elizabeth. The comments in the sample code should show you how to use it.