9 Replies Latest reply: Mar 6, 2013 6:50 AM by chris227 RSS

    Match Condition

    sliderrules
      Hi,

      I have two tables where I will be comparing a column that exists in both tables and that contains the same values but can be worded different e.g. (Case Sensitive, LTD Missing etc.)

      I am using database version 11.2.0.3.0

      Below is some sample data:

      CREATE TABLE t1(
      NAME VARCHAR2(255));

      CREATE TABLE t2(
      NAME VARCHAR2(255));

      INSERT INTO T1 VALUES('ABC');
      INSERT INTO T2 VALUES('ABC Ltd');
      INSERT INTO T1 VALUES('Solutions UK Ltd');
      INSERT INTO T2 VALUES('SOLUTIONS LTD');
      INSERT INTO T1 VALUES('Associates Dawner Ltd');
      INSERT INTO T2 VALUES('Dawner LTD');
      INSERT INTO T1 VALUES('Equity (PVL)');
      INSERT INTO T2 VALUES('EQUILIBRIUM PVL');
      INSERT INTO T1 VALUES('Abacus PLC');
      INSERT INTO T2 VALUES('Abacus T/A ABC');
      INSERT INTO T1 VALUES('Asset Solutions Ltd');
      INSERT INTO T2 VALUES('Asset (Kingsley) Ltd');
      INSERT INTO T1 VALUES('Hunt Windows Ltd');
      INSERT INTO T2 VALUES('Joe Hunt Ltd');



      Can anyone help with a query that provides the closest match on the columns in both tables?

      Thanks
        • 1. Re: Match Condition
          Gaff
          Some combination of upper(), lower(), "like", regexp_like and possibly Oracle text functionality? Although Oracle text is probably overkill for this.

          Actually regexp_like does case insensitive if you pass in the proper argument. So I would say regexp_like().

          http://psoug.org/reference/regexp.html

          Edited by: Gaff on Mar 5, 2013 4:02 PM
          • 2. Re: Match Condition
            Bill Haverberg
            You need to work with the owners of the data on both ends to clean up the data to make it more consistent. You're in the zone where a program has to make judgement calls, which means data will get lost, mismatched, unmatched...

            Is there another field in both data sets you can work with to create a solid match, such as address?

            I had a situation similar to this with an aging report. You could create a function that strikes out common words like Ltd amd PLC and does a UPPER comparison. Send matches to one file or table, and unmatched data to a separate. Analyze both, refine your methodology, and iterate until you have a manageable set of questionable data you could then send to the business user to make their own subjective calls on. That way you've filtered out the easy calls so you're not swamping them with too much information.

            Edited by: Bill Haverberg on Mar 5, 2013 1:03 PM
            • 3. Re: Match Condition
              ---Brodyaga---
              Hi

              You can play with function SOUNDEX. It give good result (>75% match). Example:
              select n1, n2
              from (select q.*,
                           min(tp) over (partition by n1) mtp1,
                           min(tp) over (partition by n2) mtp2
                    from (select t1.name n1, t2.name n2, 1 tp
                          from (select name, soundex(name) snd
                                from t1) t1,
                               (select name, soundex(name) snd
                                from t2) t2
                          where t1.snd = t2.snd
                          
                          union all
                          
                          select t1.name, t2.name, 2
                          from (select name, soundex(name) snd
                                from t1) t1,
                               (select name, soundex(name) snd
                                from t2) t2
                          where t1.snd != t2.snd
                          and substr(t1.snd, 1, 1) = substr(t2.snd, 1, 1)
                          and substr(t1.snd, 2) between substr(t2.snd, 2) - 10 and substr(t2.snd, 2) + 10
                          
                          union all
                          
                          select t1.name, t2.name, 3
                          from (select name, soundex(name) snd
                                from t1) t1,
                               (select name, soundex(name) snd
                                from t2) t2
                          where t1.snd != t2.snd
                          and substr(t1.snd, 1, 1) != substr(t2.snd, 1, 1)
                          and substr(t1.snd, 2) between substr(t2.snd, 2) - 10 and substr(t2.snd, 2) + 10) q) q
              where q.mtp1 = q.mtp2
              and q.mtp1 = q.tp
              Abacus PLC               Abacus T/A ABC
              ABC                       ABC Ltd
              Asset Solutions Ltd       Asset (Kingsley) Ltd
              Equity (PVL)               EQUILIBRIUM PVL
              Hunt Windows Ltd       Joe Hunt Ltd
              Solutions UK Ltd          SOLUTIONS LTD
              Edited by: ---Brodyaga--- on Mar 5, 2013 11:27 PM

              Edited by: ---Brodyaga--- on Mar 6, 2013 12:34 AM
              • 4. Re: Match Condition
                Another_user
                You may want to look at Barbara's post here:

                Re: Fuzzy search - more accurate score??
                • 5. Re: Match Condition
                  chris227
                  Some ideas put in one query.
                  Iterative approach as proposed above seems to be a good idea, because this below is not effiient at all.
                  Oralce text could be considered also
                  with matches1 as (
                  select 
                   t1.name n1
                  ,t2.name n2
                  ,rank() over ( partition by t1.name
                                 order by utl_match.jarO_winkler(upper(t1.name), upper(t2.name)) desc
                               ) rnkj
                  ,rank() over ( partition by t1.name
                                 order by utl_match.edit_distance(upper(t1.name), upper(t2.name))
                               ) rnke
                  from t1,t2
                  where
                      upper(t1.name) like '%'||substr(upper(t2.name), 1, instr(t2.name,' ') - 1)||'%'
                   or upper(t2.name) like '%'||substr(upper(t1.name), 1, instr(t1.name,' ') - 1)||'%'
                  )
                  , matches2 as (
                  select 
                   t1.name n1
                  ,t2.name n2
                  ,rank() over ( partition by t1.name
                                 order by utl_match.jarO_winkler(upper(t1.name), upper(t2.name)) desc
                               ) rnkj
                  ,rank() over ( partition by t1.name
                                 order by utl_match.edit_distance(upper(t1.name), upper(t2.name))
                               ) rnke
                  from t1,t2
                  )
                  
                  
                  select 
                   *
                  from matches1
                  where
                   rnkj = 1 or rnke = 1
                  union
                  select
                  *
                  from matches2
                  where
                    (rnkj = 1 and rnke = 1)
                    and
                    n1 not in (select n1 from matches1)
                  
                  N1     N2     RNKJ     RNKE
                  ABC     ABC Ltd     1     1
                  Abacus PLC     Abacus T/A ABC     1     1
                  Asset Solutions Ltd     Asset (Kingsley) Ltd     1     2
                  Asset Solutions Ltd     SOLUTIONS LTD     2     1
                  Associates Dawner Ltd     Dawner LTD     1     1
                  Equity (PVL)     EQUILIBRIUM PVL     1     1
                  Hunt Windows Ltd     Joe Hunt Ltd     1     1
                  Solutions UK Ltd     SOLUTIONS LTD     1     1
                  • 6. Re: Match Condition
                    sliderrules
                    Hi,

                    I am looking at the post you recommended but have changed it to point at my test tables:

                    SELECT s.name source,
                    SCORE(1) score,
                         UTL_MATCH.JARO_WINKLER_SIMILARITY
                         (S.NAME,
                         t.name) jws,
                         ROW_NUMBER () OVER
                         (PARTITION BY s.name
                         ORDER BY SCORE(1) DESC,
                         UTL_MATCH.JARO_WINKLER_SIMILARITY
                              (s.name,
                              t.name) DESC) ranking,
                         t.name target
                    FROM T1 S,
                         t2 t
                    WHERE CONTAINS
                         (T.NAME,
                         'NDATA (customer_name, ' || s.name || ', N, P)', 1) > 0
                    ORDER BY SOURCE, SCORE DESC, RANKING

                    I want to know why the following PL/SQL is required to be run before the select statement above:

                    SCOTT@orcl_11gR2> BEGIN
                    2 CTX_DDL.CREATE_PREFERENCE ('customer_ds', 'MULTI_COLUMN_DATASTORE');
                    3 CTX_DDL.SET_ATTRIBUTE ('customer_ds', 'COLUMNS', 'customer_name');
                    4 CTX_DDL.CREATE_SECTION_GROUP ('customer_sg', 'BASIC_SECTION_GROUP');
                    5 CTX_DDL.ADD_NDATA_SECTION ('customer_sg', 'customer_name', 'customer_name');
                    6 END;
                    7 /

                    I get the following error when running the procedure above:

                    Error starting at line 21 in command:
                    BEGIN
                    CTX_DDL.CREATE_PREFERENCE ('customer_ds', 'MULTI_COLUMN_DATASTORE');
                    CTX_DDL.SET_ATTRIBUTE ('customer_ds', 'COLUMNS', 'name');
                    CTX_DDL.CREATE_SECTION_GROUP ('customer_sg', 'BASIC_SECTION_GROUP');
                    CTX_DDL.ADD_NDATA_SECTION ('customer_sg', 'name', 'name');
                    END;
                    Error report:
                    ORA-06550: line 2, column 1:
                    PLS-00201: identifier 'CTX_DDL' must be declared
                    ORA-06550: line 2, column 1:
                    PL/SQL: Statement ignored

                    SCOTT@orcl_11gR2> CREATE INDEX customer_idx
                    2 ON target_customers_tab (customer_name)
                    3 INDEXTYPE IS CTXSYS.CONTEXT
                    4 PARAMETERS
                    5 ('DATASTORE     customer_ds
                    6      SECTION GROUP     customer_sg')
                    7 /

                    Can anyone help?

                    Thanks

                    Edited by: sliderrules on 06-Mar-2013 04:10
                    • 7. Re: Match Condition
                      chris227
                      You must have the rights (role) to use oracle text. Your actual user doesnt sees the package CTX_DDL.
                      You have to create an oracle text index for the usage of contains.
                      If you compare the query given by me with the query you choose to work with, you will perhaps find the same ideas.
                      Instead of using oracle text, i tried to find text-matchesbased on the first tokens of each name.
                      For the rows below there was no rule to decide which is the correct match, because the second could also match because SOLUTIONS is conained in Asset Solutions Ltd and also the metrics dont give an unambiguously hint.
                      I guess you wont neither get an unambiguously result by the other solution, but its worth a try.
                      Asset Solutions Ltd     Asset (Kingsley) Ltd     1     2
                      Asset Solutions Ltd     SOLUTIONS LTD     2     1
                      Soundex (both function and oracle text contains operator) and oracle text operator fuzzy are also worth to consider.

                      Edited by: chris227 on 06.03.2013 04:38

                      Edited by: chris227 on 06.03.2013 04:40
                      • 8. Re: Match Condition
                        sliderrules
                        Thanks for getting back. Does the CTX_DDL package need to be run each time the query is run or does it just need to be executed once. I had a go at your sample script and it also works but was looking for a solution with less code
                        • 9. Re: Match Condition
                          chris227
                          sliderrules wrote:
                          Thanks for getting back. Does the CTX_DDL package need to be run each time the query is run or does it just need to be executed once.
                          Just once, It is just setting up the attributes used by the index creation.