3 Replies Latest reply: Jan 15, 2008 1:17 AM by 617589

    finding duplicates in large table with the use of utl_match

    617589
      Hello specialists,

      I'm relatively new to the world of text comparisons, but the client has thrown me in the pool and now I need to start swimming.

      The basic concept is that, based on some predefined rules, I have to find duplicates in a table. Duplicates can be exact matches on addresses and names; those are the easier ones. It gets harder when utl_match needs to kick in.

      How would you solve the problem of finding duplicates with utl_match in a 20-million-record table? I have a source table of 20M records, and at the end I need a table containing a field that indicates that record X is a match for record Y.

      Thanks in advance for helping me out and sharing your expertise.

      Koen Verheyen
      Belgium
        • 1. Re: finding duplicates in large table with the use of utl_match
          Barbara Boehmer
          It might help to have some more details, like your Oracle version, the structure of your source table, the structure of the table that you want to end up with, and what those predefined rules are. That would help determine which of the utl_match functions is appropriate and what combination of other things might help.

          In general, if you just try to "create table ... as select ... from ... where utl_match.... (..., ...) ...;" you are likely to exceed most standard resources and get disconnected quickly.

          Tom Kyte suggested a solution to a similar problem here:

          http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1400328669393#10577325169281

          In general, utl_match operates slowly on large result sets. Therefore, it is usually best to apply whatever other filter conditions you can to reduce the size of the result set that you are going to use it on. I don't know whether you have your names and addresses all in one column or whether you have things (hopefully) in separate columns. You might, for example, be able to limit your initial result set to rows where the state and/or zip code match, then apply utl_match to that.

          Once you have narrowed down your result set as much as possible and are ready to apply utl_match, you are going to end up comparing every row in that set to every row in that set. So, no matter how you do it, whether you select using a self-join or loop through nested cursors, you are going to be comparing a cartesian product. The size of that cartesian product has the greatest effect on performance, so the more you can reduce it the better. The length of the strings that you are comparing also affects performance.
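          Outside the database, the idea of filtering on a cheap shared key before any expensive comparison (often called "blocking") can be sketched like this. This is a hypothetical Python illustration; the zip codes and names are made up, and in practice the grouping would be the SQL join condition on state/zip:

```python
# Blocking: compare rows only within groups that share a cheap key
# (here, a zip code) instead of across the full cartesian product.
from itertools import combinations
from collections import defaultdict

records = [
    (1, "1000", "Koen Verheyen"),
    (2, "1000", "Koen Verheyn"),
    (3, "2000", "Jan Peeters"),
    (4, "2000", "Jan Peters"),
    (5, "3000", "An Willems"),
]

# Full self-comparison: n*(n-1)/2 candidate pairs.
full_pairs = list(combinations(records, 2))

# Blocked: group by zip code, pair only within each group.
blocks = defaultdict(list)
for rec in records:
    blocks[rec[1]].append(rec)
blocked_pairs = [p for grp in blocks.values() for p in combinations(grp, 2)]

print(len(full_pairs), len(blocked_pairs))  # 10 vs 2
```

With only five rows the saving is 10 pairs down to 2; at 20M rows the same ratio is the difference between an impossible job and a feasible one.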

          I have provided an example of a method below to first extract exact matches, then the remaining close matches.

          The example below finds the exact matches quickly by creating a unique constraint, then saving the rowid's of the rows that violate that constraint into an exceptions table, then selecting the corresponding values from the source table.

          The example below then finds the close matches by inserting a cartesian product as mentioned above that has been filtered to reduce its size as much as possible, then creating a function-based index that compares the columns. Since utl_match.edit_distance_similarity was apparently not created using the deterministic keyword, I have added a wrapper function to make it deterministic, so that a function-based index can be created. Then the rows that are not close matches can be quickly deleted using that index, leaving only the closest matches.
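          For intuition about what the wrapped function computes, here is a rough Python re-implementation of the idea behind edit_distance_similarity: a Levenshtein distance normalized to a 0-100 score, where 100 means identical. This is an illustration only and is not guaranteed to match Oracle's exact rounding:

```python
# Hypothetical sketch of an edit-distance-based similarity score.
def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def eds(s, t):
    # A pure function of its inputs, hence safely deterministic --
    # which is the property the PL/SQL wrapper asserts for indexing.
    if not s and not t:
        return 100
    return round((1 - edit_distance(s, t) / max(len(s), len(t))) * 100)

print(eds("I_DEPENDENCY1", "I_DEPENDENCY2"))  # one char differs in 13 -> 92
```

Because the result depends only on the two inputs, declaring the PL/SQL wrapper DETERMINISTIC is safe, and that is what allows the function-based index to be created and used.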

          I have used a small data set below to demonstrate the functionality and truncated the results to save space.

          -- test data with index on column to be compared and current statistics:
          SCOTT@orcl_11g> CREATE TABLE source_table AS
            2  SELECT object_id AS id, object_name AS name_and_address
            3  FROM   all_objects
            4  WHERE  object_name NOT LIKE 'SYS%'
            5  AND    object_name NOT LIKE 'DR$%'
            6  AND    ROWNUM < 10000
            7  /

          Table created.

          SCOTT@orcl_11g> SELECT COUNT (*) FROM source_table
            2  /

            COUNT(*)
          ----------
                9999

          SCOTT@orcl_11g> CREATE INDEX name_and_ad_idx ON source_table (name_and_address)
            2  /

          Index created.

          SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT', 'SOURCE_TABLE')

          PL/SQL procedure successfully completed.





          -- find exact matches quickly by creating unique constraint,
          -- saving rowid's of rows that violate the constraint into an exceptions table,
          -- creating an index and gathering statistics,
          -- then inserting the corresponding values into an exact_matches table:
          SCOTT@orcl_11g> SELECT SYSDATE FROM DUAL
            2  /

          SYSDATE
          --------------------
          14-JAN-2008 17:41:23

          SCOTT@orcl_11g> CREATE TABLE exceptions
            2    (row_id           ROWID,
            3       owner           VARCHAR2 (30),
            4       table_name    VARCHAR2 (30),
            5       constraint    VARCHAR2 (30))
            6  /

          Table created.

          SCOTT@orcl_11g> ALTER TABLE source_table
            2  ADD CONSTRAINT unique_name_and_ad UNIQUE (name_and_address)
            3  EXCEPTIONS INTO exceptions
            4  /
          ADD CONSTRAINT unique_name_and_ad UNIQUE (name_and_address)
                         *
          ERROR at line 2:
          ORA-02299: cannot validate (SCOTT.UNIQUE_NAME_AND_AD) - duplicate keys found


          SCOTT@orcl_11g> CREATE INDEX row_id_idx ON exceptions (row_id)
            2  /

          Index created.

          SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT', 'EXCEPTIONS')

          PL/SQL procedure successfully completed.

          SCOTT@orcl_11g> CREATE TABLE exact_matches
            2    (id1           NUMBER,
            3       id2           NUMBER,
            4       name_and_ad1  VARCHAR2 (30),
            5       name_and_ad2  VARCHAR2 (30))
            6  /

          Table created.

          SCOTT@orcl_11g> SET AUTOTRACE ON EXPLAIN
          SCOTT@orcl_11g> INSERT INTO exact_matches (id1, id2, name_and_ad1, name_and_ad2)
            2  SELECT s1.id, s2.id, s1.name_and_address, s2.name_and_address
            3  FROM   source_table s1, source_table s2, exceptions e1, exceptions e2
            4  WHERE  s1.ROWID = e1.row_id
            5  AND    s2.ROWID = e2.row_id
            6  AND    e1.row_id < e2.row_id
            7  AND    s1.name_and_address = s2.name_and_address
            8  /

          2478 rows created.


          Execution Plan
          ----------------------------------------------------------
          Plan hash value: 1289264384

          ------------------------------------------------------------------------------------------
          | Id  | Operation                | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
          ------------------------------------------------------------------------------------------
          |   0 | INSERT STATEMENT         |               |   125 |  8000 |    40   (5)| 00:00:01 |
          |   1 |  LOAD TABLE CONVENTIONAL | EXACT_MATCHES |       |       |            |          |
          |*  2 |   HASH JOIN              |               |   125 |  8000 |    40   (5)| 00:00:01 |
          |   3 |    INDEX FAST FULL SCAN  | ROW_ID_IDX    |  4392 | 43920 |     6   (0)| 00:00:01 |
          |*  4 |    HASH JOIN             |               |  5697 |   300K|    33   (4)| 00:00:01 |
          |*  5 |     HASH JOIN            |               |  4392 |   137K|    20   (5)| 00:00:01 |
          |   6 |      INDEX FAST FULL SCAN| ROW_ID_IDX    |  4392 | 43920 |     6   (0)| 00:00:01 |
          |   7 |      TABLE ACCESS FULL   | SOURCE_TABLE  |  9999 |   214K|    13   (0)| 00:00:01 |
          |   8 |     TABLE ACCESS FULL    | SOURCE_TABLE  |  9999 |   214K|    13   (0)| 00:00:01 |
          ------------------------------------------------------------------------------------------

          Predicate Information (identified by operation id):
          ---------------------------------------------------

             2 - access("S2".ROWID="E2"."ROW_ID")
                 filter("E1"."ROW_ID"<"E2"."ROW_ID")
             4 - access("S1"."NAME_AND_ADDRESS"="S2"."NAME_AND_ADDRESS")
             5 - access("S1".ROWID="E1"."ROW_ID")

          SCOTT@orcl_11g> SET AUTOTRACE OFF
          SCOTT@orcl_11g> SELECT SYSDATE FROM DUAL
            2  /

          SYSDATE
          --------------------
          14-JAN-2008 17:41:24

          SCOTT@orcl_11g> SELECT * FROM exact_matches
            2  /

                 ID1        ID2 NAME_AND_AD1                   NAME_AND_AD2
          ---------- ---------- ------------------------------ ------------------------------
                3804       3805 USER_IND_PENDING_STATS         USER_IND_PENDING_STATS
                3806       3807 ALL_COL_PENDING_STATS          ALL_COL_PENDING_STATS
                3808       3809 DBA_COL_PENDING_STATS          DBA_COL_PENDING_STATS
                3810       3811 USER_COL_PENDING_STATS         USER_COL_PENDING_STATS

          -- results truncated to save space

                8306      10789 DBMS_SERVER_ALERT_PRVT         DBMS_SERVER_ALERT_PRVT
                7747      10790 DBMS_SERVER_ALERT_EXPORT       DBMS_SERVER_ALERT_EXPORT
                7746      10790 DBMS_SERVER_ALERT_EXPORT       DBMS_SERVER_ALERT_EXPORT
               10791      10796 DBMS_AUTOTASK_PRVT             DBMS_AUTOTASK_PRVT
                9851      10797 DBMS_AUTO_TASK_IMMEDIATE       DBMS_AUTO_TASK_IMMEDIATE
                9850      10797 DBMS_AUTO_TASK_IMMEDIATE       DBMS_AUTO_TASK_IMMEDIATE

          2478 rows selected.





          -- find close matches by creating a table containing a cartesian product of all rows
          -- and subtracting the exact matches already found from it,
          -- then creating a wrapper function around utl_match.edit_distance_similarity,
          -- so that it is deterministic and can be used in a function-based index,
          -- then creating the function-based index and gathering statistics,
          -- then deleting the rows that are not close matches:
          SCOTT@orcl_11g> CREATE TABLE close_matches
            2    (id1           NUMBER,
            3       id2           NUMBER,
            4       name_and_ad1  VARCHAR2 (30),
            5       name_and_ad2  VARCHAR2 (30))
            6  /

          Table created.

          SCOTT@orcl_11g> INSERT INTO close_matches (id1, id2, name_and_ad1, name_and_ad2)
            2  SELECT s1.id, s2.id, s1.name_and_address, s2.name_and_address
            3  FROM   source_table s1, source_table s2
            4  WHERE  s1.ROWID < s2.ROWID
            5  AND    ABS (s1.id - s2.id) < 10 -- simulates additional filter conditions
            6  MINUS
            7  SELECT id1, id2, name_and_ad1, name_and_ad2
            8  FROM   exact_matches
            9  /

          83574 rows created.

          SCOTT@orcl_11g> COMMIT
            2  /

          Commit complete.

          SCOTT@orcl_11g> SELECT SYSDATE FROM DUAL
            2  /

          SYSDATE
          --------------------
          14-JAN-2008 17:43:26

          SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION eds
            2    (p_source_string IN VARCHAR2,
            3       p_target_string IN VARCHAR2)
            4    RETURN NUMBER DETERMINISTIC
            5  AS
            6  BEGIN
            7    RETURN utl_match.edit_distance_similarity (p_source_string, p_target_string);
            8  END eds;
            9  /

          Function created.

          SCOTT@orcl_11g> CREATE INDEX eds_fbi ON close_matches (eds (name_and_ad1, name_and_ad2))
            2  /

          Index created.

          SCOTT@orcl_11g> EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT', 'CLOSE_MATCHES')

          PL/SQL procedure successfully completed.

          SCOTT@orcl_11g> SET AUTOTRACE ON EXPLAIN
          SCOTT@orcl_11g> DELETE FROM close_matches WHERE eds (name_and_ad1, name_and_ad2) < 92
            2  /

          81780 rows deleted.


          Execution Plan
          ----------------------------------------------------------
          Plan hash value: 3211811259

          ----------------------------------------------------------------------------------------------
          | Id  | Operation                    | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
          ----------------------------------------------------------------------------------------------
          |   0 | DELETE STATEMENT             |               | 79266 |   232K|   157   (1)| 00:00:02 |
          |   1 |  DELETE                      | CLOSE_MATCHES |       |       |            |          |
          |   2 |   TABLE ACCESS BY INDEX ROWID| CLOSE_MATCHES | 79266 |   232K|   157   (1)| 00:00:02 |
          |*  3 |    INDEX RANGE SCAN          | EDS_FBI       | 79266 |       |   157   (1)| 00:00:02 |
          ----------------------------------------------------------------------------------------------

          Predicate Information (identified by operation id):
          ---------------------------------------------------

             3 - access("SCOTT"."EDS"("NAME_AND_AD1","NAME_AND_AD2")<92)

          SCOTT@orcl_11g> SET AUTOTRACE OFF
          SCOTT@orcl_11g> COMMIT
            2  /

          Commit complete.

          SCOTT@orcl_11g> SELECT SYSDATE FROM DUAL
            2  /

          SYSDATE
          --------------------
          14-JAN-2008 17:43:43

          SCOTT@orcl_11g> COLUMN name_and_ad1 FORMAT A30
          SCOTT@orcl_11g> COLUMN name_and_ad2 FORMAT A30
          SCOTT@orcl_11g> SELECT * FROM close_matches ORDER BY id1, id2
            2  /

                 ID1        ID2 NAME_AND_AD1                   NAME_AND_AD2
          ---------- ---------- ------------------------------ ------------------------------
                   9          8 I_FILE#_BLOCK#                 C_FILE#_BLOCK#
                  26         27 I_PROXY_ROLE_DATA$_1           I_PROXY_ROLE_DATA$_2
                 105        106 I_DEPENDENCY1                  I_DEPENDENCY2
                 154        155 I1_JIREFRESHSQL$               I2_JIREFRESHSQL$
                 160        161 I_TRIGGERCOL1                  I_TRIGGERCOL2

          -- results truncated to save space:

               10732      10737 DBMS_RULE_EXP_EC_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10733      10734 DBMS_RULE_EXP_EC_INTERNAL      DBMS_RULE_EXP_RS_INTERNAL
               10733      10735 DBMS_RULE_EXP_EC_INTERNAL      DBMS_RULE_EXP_RS_INTERNAL
               10733      10736 DBMS_RULE_EXP_EC_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10733      10737 DBMS_RULE_EXP_EC_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10734      10736 DBMS_RULE_EXP_RS_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10734      10737 DBMS_RULE_EXP_RS_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10735      10736 DBMS_RULE_EXP_RS_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL
               10735      10737 DBMS_RULE_EXP_RS_INTERNAL      DBMS_RULE_EXP_RL_INTERNAL

          1794 rows selected.

          SCOTT@orcl_11g>
          • 2. Re: finding duplicates in large table with the use of utl_match
            617589
            Hello Barbara

            Thanks for the very in-depth example. This would be a very useful technique, and I'm going to think about how we can implement it here. The thing is that the client wants to maximise the use of OWB, so I have to give that some thought as well.

            Koen
            • 3. Re: finding duplicates in large table with the use of utl_match
              617589
              Another issue is that the deduplication I need to do is not only within the file but also from the file against the whole customer database. The unique-constraint approach is not really usable then (I think).

              The files I get in are loaded into a fixed staging table that usually contains around 100,000 records; the table I'm comparing against holds around 20,000,000 records.
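              In that file-against-database case the comparison becomes a join between the small staging set and the large customer set rather than a self-join, and the blocking idea still applies: index the big side once by a cheap key and probe it with each staging row. A hypothetical Python sketch (made-up names and zip codes; in the database this would be a join on the blocking key with utl_match applied only to the surviving pairs):

```python
# Matching a small staging set (the incoming file) against a large
# customer set without a full cartesian product.
from collections import defaultdict

customers = [  # stands in for the 20M-row customer table
    (101, "1000", "Koen Verheyen"),
    (102, "2000", "Jan Peeters"),
    (103, "2000", "Mia Maes"),
]
staging = [  # stands in for the ~100K-row file table
    (9001, "1000", "Koen Verheyn"),
    (9002, "3000", "Piet Pauwels"),
]

# Index the big side once by the blocking key...
by_zip = defaultdict(list)
for cust in customers:
    by_zip[cust[1]].append(cust)

# ...then probe with each staging row; only same-zip pairs survive
# to be scored by the expensive similarity function.
candidates = [(s, c) for s in staging for c in by_zip[s[1]]]
print(len(candidates))  # 1 candidate pair instead of 2*3 = 6
```

At 100K x 20M rows the unfiltered product is 2 trillion pairs, so some such blocking key is essential before any utl_match scoring.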