Skip to Main Content

SQL & PL/SQL

Announcement

For appeals, questions and feedback about Oracle Forums, please email oracle-forums-moderators_us@oracle.com. Technical questions should be asked in the appropriate category. Thank you!

Interested in getting your voice heard by members of the Developer Marketing team at Oracle? Check out this post for AppDev or this post for AI focus group information.

Address Matching Using REGEXP_LIKE

571669Dec 29 2009 — edited Dec 30 2009
Hey all,
I am a newbie to REGEXP and have an issue where I have 2 spatial tables, for each object in table1 and am doing an SDO_NN to find the nearest 2 objects in table2. I also have an address column for each spatial table. I am trying to compare address1 (from table1) to both addresses (address2 from table2) to find the most likely match candidate.
After reading about REGEXP it seemed to me the REGEXP_LIKE was most likely best way of achieving my matches.
What I have come up with works alot of times but not always and I don't quite understand it.

Here is an example where I don't understand what is going on where I used REGEXP_LIKE as this
REGEXP_LIKE(address1,replace(address2,' ','|'))
SELECT 
  id,  
  address1,
  address2,
  CASE
    WHEN address1=address2 THEN 'EXACT MATCH'
    WHEN REGEXP_LIKE(address1,replace(address2,' ','|')) THEN 'PROBABLE MATCH'
    ELSE 'NON MATCH'
  END AS TEST
FROM
  (
  SELECT
    1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
  FROM
    DUAL  
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
  FROM
    DUAL    
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
  FROM
    DUAL    
  UNION SELECT
    4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST' address2
  FROM
    DUAL      
  );

RESULTS
ID ADDRESS1            ADDRESS2            TEST
-- ------------------  ------------------  --------------
1  930 OLD STEESE HWY  930 OLD STEESE HWY  EXACT MATCH
2  930 OLD STEESE HWY  920 OLD STEESE HWY  PROBABLE MATCH
3  930 OLD STEESE HWY  99 COLLEGE RD       NON MATCH
4  930 OLD STEESE HWY  80 5TH ST           PROBABLE MATCH
I am really confused as to the 4th line in the results?
Any thoughts and or suggestions?
Appreciate it!
Cheers,
Eric
This post has been answered by 730428 on Dec 29 2009
Jump to Answer

Comments

730428
Answer
Second and fourth row match because there is a word in address2 that is present (at least as substring) in address1: For the fourth row ST is substring of STEESE.

If you want a PROBALE MATCH only when there's a word in address2 that is present as a word in address1, you can change this way:
SQL> SELECT
  2    id,
  3    address1,
  4    address2,
  5    CASE
  6      WHEN address1=address2 THEN 'EXACT MATCH'
  7      WHEN REGEXP_LIKE(address1,'(^| )'||replace(address2,' ','|')||'($| )') THEN 'PROBABLE MATCH'
  8      ELSE 'NON MATCH'
  9    END AS TEST
 10  FROM
 11    (
 12    SELECT
 13      1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
 14    FROM
 15      DUAL
 16    UNION SELECT
 17      3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
 18    FROM
 19      DUAL
 20    UNION SELECT
 21      3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
 22    FROM
 23      DUAL
 24    UNION SELECT
 25      4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST' address2
 26    FROM
 27      DUAL
 28    );

        ID ADDRESS1           ADDRESS2           TEST
---------- ------------------ ------------------ --------------
         1 930 OLD STEESE HWY 930 OLD STEESE HWY EXACT MATCH
         3 930 OLD STEESE HWY 920 OLD STEESE HWY PROBABLE MATCH
         3 930 OLD STEESE HWY 99 COLLEGE RD      NON MATCH
         4 930 OLD STEESE HWY 80 5TH ST          NON MATCH
In Oracle 11g you can use REGEXP_COUNT to count the common words between address1 and address2:
  2    id,
  3    address1,
  4    address2,
  5    CASE
  6      WHEN address1=address2 THEN 'EXACT MATCH'
  7      WHEN REGEXP_LIKE(address1,'(^| )'||replace(address2,' ','|')||'($| )') THEN 'PROBABLE MATCH'
  8      ELSE 'NON MATCH'
  9    END AS TEST,
 10    regexp_count(address1,'(^| )'||replace(address2,' ','|')||'($| )') common_words
 11  FROM
 12    (
 13    SELECT
 14      1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
 15    FROM
 16      DUAL
 17    UNION SELECT
 18      3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
 19    FROM
 20      DUAL
 21    UNION SELECT
 22      3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
 23    FROM
 24      DUAL
 25    UNION SELECT
 26      4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST' address2
 27    FROM
 28      DUAL
 29    );

        ID ADDRESS1           ADDRESS2           TEST           COMMON_WORDS
---------- ------------------ ------------------ -------------- ------------
         1 930 OLD STEESE HWY 930 OLD STEESE HWY EXACT MATCH               4
         3 930 OLD STEESE HWY 920 OLD STEESE HWY PROBABLE MATCH            3
         3 930 OLD STEESE HWY 99 COLLEGE RD      NON MATCH                 0
         4 930 OLD STEESE HWY 80 5TH ST          NON MATCH                 0
Max
[My Italian Oracle blog|http://oracleitalia.wordpress.com/2009/12/29/estrarre-i-dati-in-formato-xml-da-sql/]

Edited by: Massimo Ruocchio on Dec 29, 2009 10:48 PM
Added regexp_count example.
Marked as Answer by 571669 · Sep 27 2020
571669
Hey Thanks so much Max...
now may I ask, adding what you did actually does what exactly :)
(^| )'||replace(address2,' ','|')||'($| )
Thanks!


ahh bugger,
SELECT 
  id,  
  address1,
  address2,
  CASE
    WHEN address1=address2 THEN 'EXACT MATCH'
    WHEN REGEXP_LIKE(address1,'(^| )'||replace(address2,' ','|')||'($| )') THEN 'PROBABLE MATCH'
    ELSE 'NON MATCH'
  END AS TEST
FROM
  (
  SELECT
    1 ID, '2591 GOLDSTREAM RD' address1, '446 OLD CHENA PUMP' address2
  FROM
    DUAL  
  UNION SELECT
    3 ID, '2591 GOLDSTREAM RD' address1, '2591 GOLDSTREAM RD' address2
  FROM
    DUAL    
  );
seems to be matching on 'OLD'

Edited by: ebrose on Dec 29, 2009 2:05 PM
730428
After replace
(^| )'||replace(address2,' ','|')||'($| )
becomes, for example
(^| )920|OLD|STEESE|HWY($| )
That means that you're looking for substrings in address1 that
1) Start at begin of address1 or with a blank (^| )
2) continue with 920 or OLD or STEESE or HWY (920|OLD|STEESE|HWY)
3) end at end of address1 or with a blank ($| )

The REGEXP_COUNT function counts these substrings in address1, the REGEXP_LIKE condition returns true if at least one such substring exists in address1...

Hope this clarifies...

Max
[My Italian Oracle blog|http://oracleitalia.wordpress.com/2009/12/29/estrarre-i-dati-in-formato-xml-da-sql/]
Solomon Yakobson
ebrose wrote:

ahh bugger,
It is, in fact, same "bugger" as:
SELECT
  id,
  address1,
  address2,
  CASE
    WHEN address1=address2 THEN 'EXACT MATCH'
    WHEN REGEXP_LIKE(address1,'(^| )'||replace(address2,' ','|')||'($| )') THEN 'PROBABLE MATCH'
    ELSE 'NON MATCH'
  END AS TEST
FROM
   (
    SELECT
    1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
  FROM
    DUAL
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
  FROM
    DUAL
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
  FROM
    DUAL
  UNION SELECT
    4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST WEST' address2
  FROM
    DUAL
  );

        ID ADDRESS1           ADDRESS2           TEST
---------- ------------------ ------------------ --------------
         1 930 OLD STEESE HWY 930 OLD STEESE HWY EXACT MATCH
         3 930 OLD STEESE HWY 920 OLD STEESE HWY PROBABLE MATCH
         3 930 OLD STEESE HWY 99 COLLEGE RD      NON MATCH
         4 930 OLD STEESE HWY 80 5TH ST WEST     PROBABLE MATCH

SQL> 
So it was not a generic enough colution. I suggest something like:
SELECT
  id,
  address1,
  address2,
  CASE
    WHEN address1=address2 THEN 'EXACT MATCH'
    WHEN REGEXP_LIKE(address1,'(^|\s+)'||replace(rtrim(address2),' ','\s+|\s+') || '$' ) THEN 'PROBABLE MATCH'
    ELSE 'NON MATCH'
  END AS TEST
FROM
   (
    SELECT
    1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
  FROM
    DUAL
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
  FROM
    DUAL
  UNION SELECT
    3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
  FROM
    DUAL
  UNION SELECT
    4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST' address2
  FROM
    DUAL
  UNION SELECT
    5 ID, '930 OLD STEESE HWY' address1, '80 5TH ST WEST' address2
  FROM
    DUAL
  UNION SELECT
    6 ID, '2591 GOLDSTREAM RD' address1, '446 OLD CHENA PUMP' address2
  FROM
    DUAL  
  UNION SELECT
    7 ID, '2591 GOLDSTREAM RD' address1, '2591 GOLDSTREAM RD' address2
  FROM
    DUAL    
  );

        ID ADDRESS1           ADDRESS2           TEST
---------- ------------------ ------------------ --------------
         1 930 OLD STEESE HWY 930 OLD STEESE HWY EXACT MATCH
         3 930 OLD STEESE HWY 920 OLD STEESE HWY PROBABLE MATCH
         3 930 OLD STEESE HWY 99 COLLEGE RD      NON MATCH
         4 930 OLD STEESE HWY 80 5TH ST          NON MATCH
         5 930 OLD STEESE HWY 80 5TH ST WEST     NON MATCH
         6 2591 GOLDSTREAM RD 446 OLD CHENA PUMP NON MATCH
         7 2591 GOLDSTREAM RD 2591 GOLDSTREAM RD EXACT MATCH

7 rows selected.

SQL> 
SY.
730428
SQL> SELECT
  2    id,
  3    address1,
  4    address2,
  5    CASE
  6      WHEN address1=address2 THEN 'EXACT MATCH'
  7      WHEN REGEXP_LIKE(address1,'(^| )('||replace(address2,' ','|')||')') THEN 'PROBABLE MATCH'
  8      ELSE 'NON MATCH'
  9    END AS TEST,
 10    regexp_count(address1,'(^| )('||replace(address2,' ','|')||')') common_words
 11  FROM
 12    (
 13    SELECT
 14      1 ID, '2591 GOLDSTREAM RD' address1, '446 OLD CHENA PUMP' address2
 15    FROM
 16      DUAL
 17    UNION SELECT
 18      3 ID, '2591 GOLDSTREAM RD' address1, '2591 GOLDSTREAM RD' address2
 19    FROM
 20      DUAL
 21    UNION SELECT
 22       1 ID, '930 OLD STEESE HWY' address1, '930 OLD STEESE HWY' address2
 23     FROM
 24       DUAL
 25     UNION SELECT
 26       3 ID, '930 OLD STEESE HWY' address1, '920 OLD STEESE HWY' address2
 27     FROM
 28       DUAL
 29     UNION SELECT
 30       3 ID, '930 OLD STEESE HWY' address1, '99 COLLEGE RD' address2
 31     FROM
 32       DUAL
 33     UNION SELECT
 34       4 ID, '930 OLD STEESE HWY' address1, '80 5TH ST' address2
 35     FROM
 36       DUAL
 37    );

        ID ADDRESS1           ADDRESS2           TEST           COMMON_WORDS
---------- ------------------ ------------------ -------------- ------------
         1 2591 GOLDSTREAM RD 446 OLD CHENA PUMP NON MATCH                 0
         1 930 OLD STEESE HWY 930 OLD STEESE HWY EXACT MATCH               4
         3 2591 GOLDSTREAM RD 2591 GOLDSTREAM RD EXACT MATCH               3
         3 930 OLD STEESE HWY 920 OLD STEESE HWY PROBABLE MATCH            3
         3 930 OLD STEESE HWY 99 COLLEGE RD      NON MATCH                 0
         4 930 OLD STEESE HWY 80 5TH ST          PROBABLE MATCH            1

Selezionate 6 righe.
Max
[My Italian Oracle blog|http://oracleitalia.wordpress.com/2009/12/29/estrarre-i-dati-in-formato-xml-da-sql/]
Solomon Yakobson
Solomon Yakobson wrote:

So it was not a generic enough colution. I suggest something like:
Still, keep in mind - one common word and it is a probable match:
SELECT
  id,
  address1,
  address2,
  CASE
    WHEN address1=address2 THEN 'EXACT MATCH'
    WHEN REGEXP_LIKE(address1,'(^|\s+)'||replace(rtrim(address2),' ','\s+|\s+') || '$' ) THEN 'PROBABLE MATCH'
    ELSE 'NON MATCH'
  END AS TEST
FROM
   (
    SELECT
    8 ID, '2591 OLD GOLDSTREAM RD' address1, '446 OLD CHENA PUMP' address2
  FROM
    DUAL
  )
/

        ID ADDRESS1               ADDRESS2           TEST
---------- ---------------------- ------------------ --------------
         8 2591 OLD GOLDSTREAM RD 446 OLD CHENA PUMP PROBABLE MATCH

SQL> 
SY.
571669
Thanks so much everyone for all your suggestions and help. I realize that there are going to be probable matches that really shouldn't be... I think that is something I am going to have to live with... I am hoping that with the SDO_NN Distance as a qualifier (~<1000 meters) and the new address match it should be good enough for a first round go off matching and reviewing.
Again thank you for your help!
Eric
1 - 7
Locked Post
New comments cannot be posted to this locked post.

Post Details

Locked on Jan 27 2010
Added on Dec 29 2009
7 comments
1,974 views