This discussion is archived
9 Replies Latest reply: Mar 6, 2013 4:50 AM by chris227 RSS

Match Condition

sliderrules Newbie
Currently Being Moderated
Hi,

I have two tables where I will be comparing a column that exists in both tables and that contains the same values but can be worded different e.g. (Case Sensitive, LTD Missing etc.)

I am using database version 11.2.0.3.0

Below is some sample data:

CREATE TABLE t1(
NAME VARCHAR2(255));

CREATE TABLE t2(
NAME VARCHAR2(255));

INSERT INTO T1 VALUES('ABC');
INSERT INTO T2 VALUES('ABC Ltd');
INSERT INTO T1 VALUES('Solutions UK Ltd');
INSERT INTO T2 VALUES('SOLUTIONS LTD');
INSERT INTO T1 VALUES('Associates Dawner Ltd');
INSERT INTO T2 VALUES('Dawner LTD');
INSERT INTO T1 VALUES('Equity (PVL)');
INSERT INTO T2 VALUES('EQUILIBRIUM PVL');
INSERT INTO T1 VALUES('Abacus PLC');
INSERT INTO T2 VALUES('Abacus T/A ABC');
INSERT INTO T1 VALUES('Asset Solutions Ltd');
INSERT INTO T2 VALUES('Asset (Kingsley) Ltd');
INSERT INTO T1 VALUES('Hunt Windows Ltd');
INSERT INTO T2 VALUES('Joe Hunt Ltd');



Can anyone help with a query that provides the closest match on the columns in both tables?

Thanks
  • 1. Re: Match Condition
    Gaff Journeyer
    Currently Being Moderated
    Some combination of upper(), lower(), "like", regexp_like and possibly Oracle text functionality? Although Oracle text is probably overkill for this.

    Actually regexp_like does case insensitive if you pass in the proper argument. So I would say regexp_like().

    http://psoug.org/reference/regexp.html

    Edited by: Gaff on Mar 5, 2013 4:02 PM
  • 2. Re: Match Condition
    944080 Newbie
    Currently Being Moderated
    You need to work with the owners of the data on both ends to clean up the data to make it more consistent. You're in the zone where a program has to make judgement calls, which means data will get lost, mismatched, unmatched...

    Is there another field in both data sets you can work with to create a solid match, such as address?

    I had a situation similar to this with an aging report. You could create a function that strikes out common words like Ltd amd PLC and does a UPPER comparison. Send matches to one file or table, and unmatched data to a separate. Analyze both, refine your methodology, and iterate until you have a manageable set of questionable data you could then send to the business user to make their own subjective calls on. That way you've filtered out the easy calls so you're not swamping them with too much information.

    Edited by: Bill Haverberg on Mar 5, 2013 1:03 PM
  • 3. Re: Match Condition
    ---Brodyaga--- Explorer
    Currently Being Moderated
    Hi

    You can play with function SOUNDEX. It give good result (>75% match). Example:
    select n1, n2
    from (select q.*,
                 min(tp) over (partition by n1) mtp1,
                 min(tp) over (partition by n2) mtp2
          from (select t1.name n1, t2.name n2, 1 tp
                from (select name, soundex(name) snd
                      from t1) t1,
                     (select name, soundex(name) snd
                      from t2) t2
                where t1.snd = t2.snd
                
                union all
                
                select t1.name, t2.name, 2
                from (select name, soundex(name) snd
                      from t1) t1,
                     (select name, soundex(name) snd
                      from t2) t2
                where t1.snd != t2.snd
                and substr(t1.snd, 1, 1) = substr(t2.snd, 1, 1)
                and substr(t1.snd, 2) between substr(t2.snd, 2) - 10 and substr(t2.snd, 2) + 10
                
                union all
                
                select t1.name, t2.name, 3
                from (select name, soundex(name) snd
                      from t1) t1,
                     (select name, soundex(name) snd
                      from t2) t2
                where t1.snd != t2.snd
                and substr(t1.snd, 1, 1) != substr(t2.snd, 1, 1)
                and substr(t1.snd, 2) between substr(t2.snd, 2) - 10 and substr(t2.snd, 2) + 10) q) q
    where q.mtp1 = q.mtp2
    and q.mtp1 = q.tp
    Abacus PLC               Abacus T/A ABC
    ABC                       ABC Ltd
    Asset Solutions Ltd       Asset (Kingsley) Ltd
    Equity (PVL)               EQUILIBRIUM PVL
    Hunt Windows Ltd       Joe Hunt Ltd
    Solutions UK Ltd          SOLUTIONS LTD
    Edited by: ---Brodyaga--- on Mar 5, 2013 11:27 PM

    Edited by: ---Brodyaga--- on Mar 6, 2013 12:34 AM
  • 4. Re: Match Condition
    Another_user Explorer
    Currently Being Moderated
    You may want to look at Barbara's post here:

    Re: Fuzzy search - more accurate score??
  • 5. Re: Match Condition
    chris227 Guru
    Currently Being Moderated
    Some ideas put in one query.
    Iterative approach as proposed above seems to be a good idea, because this below is not effiient at all.
    Oralce text could be considered also
    with matches1 as (
    select 
     t1.name n1
    ,t2.name n2
    ,rank() over ( partition by t1.name
                   order by utl_match.jarO_winkler(upper(t1.name), upper(t2.name)) desc
                 ) rnkj
    ,rank() over ( partition by t1.name
                   order by utl_match.edit_distance(upper(t1.name), upper(t2.name))
                 ) rnke
    from t1,t2
    where
        upper(t1.name) like '%'||substr(upper(t2.name), 1, instr(t2.name,' ') - 1)||'%'
     or upper(t2.name) like '%'||substr(upper(t1.name), 1, instr(t1.name,' ') - 1)||'%'
    )
    , matches2 as (
    select 
     t1.name n1
    ,t2.name n2
    ,rank() over ( partition by t1.name
                   order by utl_match.jarO_winkler(upper(t1.name), upper(t2.name)) desc
                 ) rnkj
    ,rank() over ( partition by t1.name
                   order by utl_match.edit_distance(upper(t1.name), upper(t2.name))
                 ) rnke
    from t1,t2
    )
    
    
    select 
     *
    from matches1
    where
     rnkj = 1 or rnke = 1
    union
    select
    *
    from matches2
    where
      (rnkj = 1 and rnke = 1)
      and
      n1 not in (select n1 from matches1)
    
    N1     N2     RNKJ     RNKE
    ABC     ABC Ltd     1     1
    Abacus PLC     Abacus T/A ABC     1     1
    Asset Solutions Ltd     Asset (Kingsley) Ltd     1     2
    Asset Solutions Ltd     SOLUTIONS LTD     2     1
    Associates Dawner Ltd     Dawner LTD     1     1
    Equity (PVL)     EQUILIBRIUM PVL     1     1
    Hunt Windows Ltd     Joe Hunt Ltd     1     1
    Solutions UK Ltd     SOLUTIONS LTD     1     1
  • 6. Re: Match Condition
    sliderrules Newbie
    Currently Being Moderated
    Hi,

    I am looking at the post you recommended but have changed it to point at my test tables:

    SELECT s.name source,
    SCORE(1) score,
         UTL_MATCH.JARO_WINKLER_SIMILARITY
         (S.NAME,
         t.name) jws,
         ROW_NUMBER () OVER
         (PARTITION BY s.name
         ORDER BY SCORE(1) DESC,
         UTL_MATCH.JARO_WINKLER_SIMILARITY
              (s.name,
              t.name) DESC) ranking,
         t.name target
    FROM T1 S,
         t2 t
    WHERE CONTAINS
         (T.NAME,
         'NDATA (customer_name, ' || s.name || ', N, P)', 1) > 0
    ORDER BY SOURCE, SCORE DESC, RANKING

    I want to know why the following PL/SQL is required to be run before the select statement above:

    SCOTT@orcl_11gR2> BEGIN
    2 CTX_DDL.CREATE_PREFERENCE ('customer_ds', 'MULTI_COLUMN_DATASTORE');
    3 CTX_DDL.SET_ATTRIBUTE ('customer_ds', 'COLUMNS', 'customer_name');
    4 CTX_DDL.CREATE_SECTION_GROUP ('customer_sg', 'BASIC_SECTION_GROUP');
    5 CTX_DDL.ADD_NDATA_SECTION ('customer_sg', 'customer_name', 'customer_name');
    6 END;
    7 /

    I get the following error when running the procedure above:

    Error starting at line 21 in command:
    BEGIN
    CTX_DDL.CREATE_PREFERENCE ('customer_ds', 'MULTI_COLUMN_DATASTORE');
    CTX_DDL.SET_ATTRIBUTE ('customer_ds', 'COLUMNS', 'name');
    CTX_DDL.CREATE_SECTION_GROUP ('customer_sg', 'BASIC_SECTION_GROUP');
    CTX_DDL.ADD_NDATA_SECTION ('customer_sg', 'name', 'name');
    END;
    Error report:
    ORA-06550: line 2, column 1:
    PLS-00201: identifier 'CTX_DDL' must be declared
    ORA-06550: line 2, column 1:
    PL/SQL: Statement ignored

    SCOTT@orcl_11gR2> CREATE INDEX customer_idx
    2 ON target_customers_tab (customer_name)
    3 INDEXTYPE IS CTXSYS.CONTEXT
    4 PARAMETERS
    5 ('DATASTORE     customer_ds
    6      SECTION GROUP     customer_sg')
    7 /

    Can anyone help?

    Thanks

    Edited by: sliderrules on 06-Mar-2013 04:10
  • 7. Re: Match Condition
    chris227 Guru
    Currently Being Moderated
    You must have the rights (role) to use oracle text. Your actual user doesnt sees the package CTX_DDL.
    You have to create an oracle text index for the usage of contains.
    If you compare the query given by me with the query you choose to work with, you will perhaps find the same ideas.
    Instead of using oracle text, i tried to find text-matchesbased on the first tokens of each name.
    For the rows below there was no rule to decide which is the correct match, because the second could also match because SOLUTIONS is conained in Asset Solutions Ltd and also the metrics dont give an unambiguously hint.
    I guess you wont neither get an unambiguously result by the other solution, but its worth a try.
    Asset Solutions Ltd     Asset (Kingsley) Ltd     1     2
    Asset Solutions Ltd     SOLUTIONS LTD     2     1
    Soundex (both function and oracle text contains operator) and oracle text operator fuzzy are also worth to consider.

    Edited by: chris227 on 06.03.2013 04:38

    Edited by: chris227 on 06.03.2013 04:40
  • 8. Re: Match Condition
    sliderrules Newbie
    Currently Being Moderated
    Thanks for getting back. Does the CTX_DDL package need to be run each time the query is run or does it just need to be executed once. I had a go at your sample script and it also works but was looking for a solution with less code
  • 9. Re: Match Condition
    chris227 Guru
    Currently Being Moderated
    sliderrules wrote:
    Thanks for getting back. Does the CTX_DDL package need to be run each time the query is run or does it just need to be executed once.
    Just once, It is just setting up the attributes used by the index creation.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points