
    Ideas on cleaning up large table with massive duplicate rows.

    EdStevens

      This thread is a branch from "Analytics to identify duplicate rows not working as expected", to allow a broader discussion of the core problem I am working on.

       

      Oracle 11.2.0.4.0, StdEd One, 64-bit

      Oracle Grid Infrastructure 11.2.0.4.0, standalone ASM, no cluster, no RAC

      Oracle Linux 6.

       

      Here’s the situation. It's a bit lengthy, as I'll try to provide as much relevant information as possible up front so you don't have to drag it out of me a bit at a time.

       

      I have one table, used as a staging table for a data warehouse app.  Its only source of data is a weekly job that loads the entire contents of a file provided by an outside source; there is no other input to this table.  The primary business analyst says the non-duplicated rows must be retained for historical research.

       

      Each week the data file has all of the content from the previous week, plus new content.  Thus, all rows loaded last week will be loaded again this week, along with new data.  And all data loaded this week will be loaded again next week, along with next week’s new data.  Once per quarter the source does some cleanup and housekeeping, but the data file is never restarted from zero.  You can see how this leads to a massive build-up of duplicated rows.

       

      Currently the table has 1.13 billion rows in 29,655,781 blocks, with an average row length of 184 bytes.  The table has 37 columns, of which a match on 35 defines a ‘duplicate’.  The two columns not factoring into duplicate status are the PK, a surrogate generated from a sequence at time of insert, and INSERT_USER_DATE, populated by SYSDATE at time of insert.  The table has its own dedicated tablespace, currently at 9 data files totaling 242 GB.  There is an MV dependent on this table.
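
      For scale, a rough sizing query along these lines (sketched here on the POC columns shown further down; the '|' delimiter is simplistic but close enough for a count) would show how many rows would actually survive de-duplication:

      -- rough duplication check on the POC columns (full scan; expensive on 1.13 billion rows)
      SELECT COUNT(*) AS total_rows,
             COUNT(DISTINCT company_code || '|' || sales_division_code || '|' ||
                            purchase_account_num || '|' || account_name) AS distinct_rows
      FROM   customer_account;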

       

      There are three indexes, all in one tablespace that contains only these three segments.  The index tablespace has 4 data files totaling 72.1 GB.  I can provide index stats if deemed relevant.

       

       

      So, that’s the landscape. With that, I have two objectives:

      First, stop the bleeding.

      Second, clean up the duplicates.

       

      On ‘stop the bleeding’ I thought I had a good method by replacing the INSERT with a MERGE that only exercises the ‘insert’ option.  In a low-volume proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it.  (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records.)  So it appears that I’m not going to achieve an acceptable run time on the load until I can get the trash removed.
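
      For reference, the shape of that insert-only MERGE, sketched against the POC table further down (the real statement matches on all 35 columns; STAGE_LOAD and CUSTOMER_ACCOUNT_SEQ are illustrative names):

      -- insert-only MERGE sketch; NVL guards the nullable PURCHASE_ACCOUNT_NUM
      MERGE INTO customer_account ca
      USING stage_load s
      ON (    ca.company_code        = s.company_code
          AND ca.sales_division_code = s.sales_division_code
          AND NVL(ca.purchase_account_num, '~') = NVL(s.purchase_account_num, '~')
          AND ca.account_name        = s.account_name)
      WHEN NOT MATCHED THEN
        INSERT (customer_account_id, company_code, sales_division_code,
                purchase_account_num, account_name, insert_user_date)
        VALUES (customer_account_seq.NEXTVAL, s.company_code, s.sales_division_code,
                s.purchase_account_num, s.account_name, SYSDATE);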

       

      On ‘clean up the duplicates’, I see four methods.

        One, copy the data to be retained to a new table, then drop the original, rename the new, and rebuild all dependencies.

       

        Two, copy the data to be retained to a new table, truncate the original, then copy the data from the new back to the original.

       

        Three, working in identified subsets of the original data (say, all purchase_account_num between 10000 and 20000), copy data to be retained to ‘working table’, delete all identified data from original, copy everything from working table back to original.

       

        Four, working with identified chunks of data, delete in place.

       

      I ran a full-volume trial of the INSERT (SELECT .. FROM) method.  I killed it after it had run for 24 hours with no end in sight.
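
      That trial was essentially this shape, sketched on the POC columns (CUSTOMER_ACCOUNT_KEEP is an illustrative name for the empty copy), keeping the earliest INSERT_USER_DATE per duplicate group:

      CREATE TABLE customer_account_keep AS
        SELECT * FROM customer_account WHERE 1 = 0;

      INSERT /*+ APPEND */ INTO customer_account_keep
      SELECT customer_account_id, company_code, sales_division_code,
             purchase_account_num, account_name, insert_user_date
      FROM  (SELECT ca.*,
                    ROW_NUMBER() OVER (
                      PARTITION BY company_code, sales_division_code,
                                   purchase_account_num, account_name
                      ORDER BY insert_user_date) rn
             FROM   customer_account ca)
      WHERE  rn = 1;

      COMMIT;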

       

      I’ve been reading and re-reading the discussion on this at AskTom, but am still spinning my wheels.

       

      Here's a script that creates and loads a POC table.  This is an abridged version of the real table, but it does include the three indexed columns along with a representative group of other columns.

      --

      -- create test tables

      --

      drop TABLE CUSTOMER_ACCOUNT purge;

      CREATE TABLE CUSTOMER_ACCOUNT

         (    CUSTOMER_ACCOUNT_ID NUMBER(*,0) NOT NULL ENABLE,

              COMPANY_CODE CHAR(2),

              SALES_DIVISION_CODE CHAR(2),

              PURCHASE_ACCOUNT_NUM CHAR(10),

              ACCOUNT_NAME VARCHAR2(20),

              INSERT_USER_DATE DATE,

            constraint ca_pk primary key (CUSTOMER_ACCOUNT_ID)

          )

      ;  

      CREATE INDEX IX_INSERT_USER_DATE

      ON CUSTOMER_ACCOUNT (INSERT_USER_DATE)

      ;

      CREATE INDEX IX_PURCHASE_ACCOUNT_NUM

      ON CUSTOMER_ACCOUNT (PURCHASE_ACCOUNT_NUM)

      ;

      --

      --  insert test data

      --

      insert into CUSTOMER_ACCOUNT values (1,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-01','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (2,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-02','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (3,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (4,'BB','BB',NULL,'HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));

      --

      insert into CUSTOMER_ACCOUNT values (5,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-01','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (6,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-02','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (7,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-03','yyyy-mm-dd'));

      --

      insert into CUSTOMER_ACCOUNT values (8,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-01','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (9,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-02','yyyy-mm-dd'));

      insert into CUSTOMER_ACCOUNT values (10,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-03','yyyy-mm-dd'));

      --

      insert into CUSTOMER_ACCOUNT values (11,'AA','AA','AAAA','MURKEY RESEARCH INC.',to_date('2017-03-03','yyyy-mm-dd'));

      --

      commit;

      From the above data,

      - PK 1 to 3 are duplicated on all columns except the PK and insert date.  Retain the one with the earliest insert date.

      - PK 4 has no dupes across all relevant columns.  Keep it.

      - PK 5 to 7 are duplicated on all columns except the PK and insert date.  Retain the one with the earliest insert date.

      - PK 8 to 10 are duplicated on all columns except the PK and insert date.  Retain the one with the earliest insert date.

      - PK 11 has no dupes across all relevant columns.  Keep it.

       

      I am open to any and all approaches.

        • 1. Re: Ideas on cleaning up large table with massive duplicate rows.
          JuanM

          I chose your option 1.

          >>>>One, copy the data to be retained to a new table, then drop the original, rename the new, and rebuild all dependencies

           

          Have you read this article: How to Find and Delete Duplicate Rows with SQL, in particular the section "Unvalidated constraints"?

           

          You can adapt the delete process described there into an insert process into a new table:

          a) Insert in bulk (this would be faster with parallel options, but you are on SE One) the rows that are not listed in the exceptions table.

          b) Insert with PL/SQL just the few rows you need from the exceptions table (if they are few; and now you have the ROWID).

          Also, try to configure your new table with uniform extent size and NOLOGGING

           

          This option will probably require extra storage for index creation, but it could be faster. I have not tested it with that many rows, nor on SE One, but read the article.
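
          A rough sketch of the idea on the POC columns (the EXCEPTIONS table comes from ?/rdbms/admin/utlexcpt.sql; the constraint name is illustrative):

          -- build the standard EXCEPTIONS table
          @?/rdbms/admin/utlexcpt.sql

          -- the constraint fails to validate (ORA-02299), but the ROWIDs of every row
          -- in a duplicate group are written to EXCEPTIONS for later processing
          ALTER TABLE customer_account
            ADD CONSTRAINT ca_dedup_uq
            UNIQUE (company_code, sales_division_code, purchase_account_num, account_name)
            EXCEPTIONS INTO exceptions;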

           

          HTH

          Juan Manuel

          • 2. Re: Ideas on cleaning up large table with massive duplicate rows.
            EdStevens

            Juan - thanks for the response.

             

            your link returns a 404 not found

            • 4. Re: Ideas on cleaning up large table with massive duplicate rows.
              JuanM

              EdStevens wrote:

               

              Juan - thanks for the response.

               

              your link returns a 404 not found

              Sorry, the link is fixed now.

              • 5. Re: Ideas on cleaning up large table with massive duplicate rows.
                rp0428

                I am open to any and all approaches.

                Ok - except you never responded to ANY of the suggestions many of us made in the other thread.

                 

                Mine was:

                 

                Identify dups

                 

                1. create a separate table (for now)

                 

                2. scan the current table, compute a hash for each row and store the hash and ROWID in the new table

                 

                3. identify your current duplicates by processing that new, small table

                 

                Stop the bleeding

                 

                1. add the hash to the current table (or continue to use the new 'hash lookup' table)

                 

                2. compute a hash on new rows added

                 

                3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table

                 

                4. abandon the insert (or whatever you need/want to do) if it is a dup.

                 

                5. accept the insert if it is not a dup. Add the hash value to the main table or the lookup table depending on which you are using.
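
                Something along these lines, as a sketch only (POC columns, '|' delimiter; assumes EXECUTE on DBMS_CRYPTO; NULL handling needs more care on the real 35 columns):

                -- steps 1 and 2: hash + ROWID lookup table built from the existing data
                -- (3 = DBMS_CRYPTO.HASH_SH1, written as a literal because package
                --  constants can't be referenced directly inside a SQL statement)
                CREATE TABLE ca_hash_lookup (
                  row_hash RAW(20) NOT NULL,
                  rid      ROWID   NOT NULL
                );
                CREATE INDEX ca_hash_lookup_ix ON ca_hash_lookup (row_hash);

                INSERT INTO ca_hash_lookup (row_hash, rid)
                SELECT DBMS_CRYPTO.HASH(
                         UTL_RAW.CAST_TO_RAW(
                           company_code || '|' || sales_division_code || '|' ||
                           purchase_account_num || '|' || account_name),
                         3),
                       ROWID
                FROM   customer_account;

                -- step 3: duplicate groups are simply the hashes that occur more than once
                SELECT row_hash, COUNT(*)
                FROM   ca_hash_lookup
                GROUP  BY row_hash
                HAVING COUNT(*) > 1;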

                • 6. Re: Ideas on cleaning up large table with massive duplicate rows.
                  EdStevens

                  rp0428 wrote:

                   

                  I am open to any and all approaches.

                  Ok - except you never responded to ANY of the suggestions many of us made in the other thread.

                   

                  Because I saw the discussion was quickly expanding well beyond the original question and felt it best to do some additional testing and benchmarking, then "respond" with a new thread giving a more comprehensive view of the entire issue.  Partly to get all the cards on a clean table and not be accused of withholding critical information.

                   

                  Mine was:

                   

                  Identify dups

                   

                  1. create a separate table (for now)

                   

                  2. scan the current table, compute a hash for each row and store the hash and ROWID in the new table

                   

                  3. identify your current duplicates by processing that new, small table

                   

                  Stop the bleeding

                   

                  1. add the hash to the current table (or continue to use the new 'hash lookup' table)

                   

                  2. compute a hash on new rows added

                   

                  3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table

                   

                  4. abandon the insert (or whatever you need/want to do) if it is a dup.

                   

                  5. accept the insert if it is not a dup. Add the hash value to the main table or the lookup table depending on which you are using.

                   

                  I promise I will take another look at this, though I am concerned about Jonathan Lewis's comment that hashes (even very sophisticated ones) only reduce the possibility of a hash collision.  With these data volumes, one collision in a billion would cause data loss.

                  • 7. Re: Ideas on cleaning up large table with massive duplicate rows.
                    EdStevens

                    John Stegeman wrote:

                     

                    it doesn't return a 404, it's not formed right

                     

                    https://blogs.oracle.com/sql/entry/how_to_find_and_delete

                    Ah, that one.  The one I had previously bookmarked and was using as a primary source, and referenced in the original thread. 

                    I will give it another review.

                    • 8. Re: Ideas on cleaning up large table with massive duplicate rows.
                      Solomon Yakobson

                      I'd try something like:

                       

                      CREATE TABLE CUSTOMER_ACCOUNT_NODUPS(

                                                           CUSTOMER_ACCOUNT_ID  NUMBER(*,0) NOT NULL ENABLE,

                                                           COMPANY_CODE        CHAR(2),

                                                           SALES_DIVISION_CODE  CHAR(2),

                                                           PURCHASE_ACCOUNT_NUM CHAR(10),

                                                           ACCOUNT_NAME        VARCHAR2(20),

                                                           INSERT_USER_DATE    DATE

                                                          )

                      /

                      CREATE UNIQUE INDEX CUSTOMER_ACCOUNT_NODUPS_UIDX1

                        ON CUSTOMER_ACCOUNT_NODUPS(

                                                   COMPANY_CODE,

                                                   SALES_DIVISION_CODE,

                                                   PURCHASE_ACCOUNT_NUM,

                                                   ACCOUNT_NAME

                                                  )

                      /

                      INSERT /*+ IGNORE_ROW_ON_DUPKEY_INDEX(CUSTOMER_ACCOUNT_NODUPS,CUSTOMER_ACCOUNT_NODUPS_UIDX1) */

                        INTO CUSTOMER_ACCOUNT_NODUPS

                        SELECT  *

                          FROM  CUSTOMER_ACCOUNT

                          ORDER BY COMPANY_CODE,

                                   SALES_DIVISION_CODE,

                                   PURCHASE_ACCOUNT_NUM,

                                   ACCOUNT_NAME,

                                   INSERT_USER_DATE

                      /

                       

                      Then drop original table and rename nodups table back to original. For example:

                       

                      SQL> SELECT  EMPNO,

                        2          DEPTNO,

                        3          JOB,

                        4          HIREDATE

                        5    FROM  EMP

                        6    ORDER BY DEPTNO,

                        7             JOB,

                        8             HIREDATE

                        9  /

                       

                           EMPNO     DEPTNO JOB        HIREDATE

                      ---------- ---------- ---------- ---------

                            7934         10 CLERK      23-JAN-82

                            7782         10 MANAGER    09-JUN-81

                            7839         10 PRESIDENT  17-NOV-81

                            7902         20 ANALYST    03-DEC-81

                            7788         20 ANALYST    09-DEC-82

                            7369         20 CLERK      17-DEC-80

                            7876         20 CLERK      12-JAN-83

                            7566         20 MANAGER    02-APR-81

                            7900         30 CLERK      03-DEC-81

                            7698         30 MANAGER    01-MAY-81

                            7499         30 SALESMAN   20-FEB-81

                            7521         30 SALESMAN   22-FEB-81

                            7844         30 SALESMAN   08-SEP-81

                            7654         30 SALESMAN   28-SEP-81

                       

                      14 rows selected.

                       

                      SQL> CREATE TABLE TBL

                        2    AS

                        3      SELECT  EMPNO,

                        4              DEPTNO,

                        5              JOB,

                        6              HIREDATE

                        7        FROM  EMP

                        8        WHERE 1 = 2

                        9  /

                       

                      Table created.

                       

                      SQL> CREATE UNIQUE INDEX TBL_UIDX1

                        2    ON TBL(

                        3           DEPTNO,

                        4           JOB

                        5          )

                        6  /

                       

                      Index created.

                       

                      SQL> INSERT /*+ IGNORE_ROW_ON_DUPKEY_INDEX(TBL,TBL_UIDX1) */

                        2    INTO TBL

                        3    SELECT  EMPNO,

                        4            DEPTNO,

                        5            JOB,

                        6            HIREDATE

                        7      FROM  EMP

                        8      ORDER BY DEPTNO,

                        9               JOB,

                      10               HIREDATE

                      11  /

                       

                      9 rows created.

                       

                      SQL> SELECT  *

                        2    FROM  TBL

                        3  /

                       

                           EMPNO     DEPTNO JOB        HIREDATE

                      ---------- ---------- ---------- ---------

                            7934         10 CLERK      23-JAN-82

                            7782         10 MANAGER    09-JUN-81

                            7839         10 PRESIDENT  17-NOV-81

                            7902         20 ANALYST    03-DEC-81

                            7369         20 CLERK      17-DEC-80

                            7566         20 MANAGER    02-APR-81

                            7900         30 CLERK      03-DEC-81

                            7698         30 MANAGER    01-MAY-81

                            7499         30 SALESMAN   20-FEB-81

                       

                      9 rows selected.

                       

                      SQL>

                       

                      SY.

                      • 9. Re: Ideas on cleaning up large table with massive duplicate rows.
                        rp0428

                        I promise I will take another look at this, though I am concerned about Jonathan Lewis's comment that hashes (even very sophisticated ones) only reduce the possibility of a hash collision.  With these data volumes, one collision in a billion would cause data loss.

                        Ok - two points

                         

                        First - your statement is just plain wrong. No number of collisions is going to cause data loss. Period. I don't even understand how you can think that. The ONLY possible data loss would occur if you actually deleted data that was not really a duplicate. And if you use the process I outlined properly, that is just NOT possible.

                         

                        Please read my reply again and think through the steps I outlined. I tried to 'outline' the basic steps involved - not design the entire process for you.

                        3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table

                         

                        4. abandon the insert (or whatever you need/want to do) if it is a dup.

                        So you compute a hash on some new row (or other large object), and in step #3 you check to see if you already have that hash.

                         

                        Let's say you get a hit with an existing hash. Well - I thought it would be obvious that if you had any concern about a 'false collision' that might cause data loss, you would just:

                         

                        1. Use the rowid in the hash lookup table to fetch the actual row of data from the main table

                        2. Compare the existing row with the incoming row

                         

                        That tells you if the collision was a phantom or not. If it was, and I REALLY mean this, you should contact the Feds because they will be EXTREMELY interested in those two DIFFERENT rows that produced the same hash.

                         

                        Back to the issue - the point everyone, seemingly even Jonathan, seems to have missed is that the primary purpose of using a hash for a data use case like yours is NOT as a final arbiter of identity. The purpose is to simplify and improve the performance of the process.

                         

                        Using a hash makes it possible to do what I just described - finding any existing duplicate 'candidate'. In your use case you will only have one at most. Then you do an actual comparison to confirm that you don't have a false positive.
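
                        As a sketch (reusing the illustrative CA_HASH_LOOKUP names from my earlier post; the :new_* binds stand for the incoming record, POC columns only):

                        -- confirm a hash hit is a real duplicate before rejecting the incoming row
                        SELECT COUNT(*)
                        FROM   customer_account ca
                        WHERE  ca.ROWID IN (SELECT rid
                                            FROM   ca_hash_lookup
                                            WHERE  row_hash = :new_hash)
                        AND    ca.company_code        = :new_company_code
                        AND    ca.sales_division_code = :new_sales_division_code
                        AND    NVL(ca.purchase_account_num, '~') = NVL(:new_purchase_account_num, '~')
                        AND    ca.account_name        = :new_account_name;
                        -- count > 0 : true duplicate, skip the insert
                        -- count = 0 : phantom collision, accept the insert and record its hash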

                         

                        Without the hash how do you know which existing rows to check as duplicates for incoming data? You don't.

                         

                        Consider this additional fact: using the same algorithm, it is IMPOSSIBLE for a given string of bytes to produce a different hash on different executions.

                         

                        That means that if the hash of the new rows does NOT MATCH a hash in the lookup table you are GUARANTEED that the data does not exist.

                         

                        Regarding Jonathan's last comment you quoted above?

                         

                        Well - I didn't respond to that because I assumed he was being facetious for the reasons I just gave above. Collisions don't matter for your use case since you don't base your decision to throw data away just on the existence of a collision. You use the collision to identify the row you need to perform the full 'identity' check on.

                         

                        For anyone interested - the U.S. federal government, no less, is the one that sets those standards that all federal agencies must use. Clearly they aren't worried about the minuscule possibility of collisions.

                         

                        The comment I made to Jonathan  was actually serious. If you can discover two different rows in your data that cause a collision the Feds, and most other Crypto agencies would be VERY INTERESTED because it would mean they missed something very important.

                         

                        But since we need to beat it to death, here are the two primary FIPS standards docs that you might want to review. The first one is the standard itself. See section 3 on page iv in particular:

                        http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf

                        The hash algorithms specified in this Standard are called secure because, for a given algorithm, it is computationally infeasible 1) to find a message that corresponds to a given message digest, or 2) to find two different messages that produce the same message digest. Any change to a message will, with a very high probability, result in a different message digest. This will result in a verification failure when the secure hash algorithm is used with a digital signature algorithm or a keyed-hash message authentication algorithm.

                        At the end of that doc they mention that a different doc has the actual data about collision resistance of different algorithms.

                         

                        See table 1 in section 4.2 on page 8 and related info.

                        https://www.google.com/search?q=SP+800-+107&ie=utf-8&oe=utf-8

                         

                        Summary:

                         

                        1. A one-time table scan to compute a hash (use MD5 if you want) is the fastest way I know to prepare for a de-duplication process.

                        2. Store the hash values in a lookup table along with other metadata as needed

                        3. Store the hash in the main table also. The lack of a hash can indicate you just haven't processed that data yet.

                        4. Hash incoming data and check for collisions with the prestored/indexed hash values of the existing data

                        5. For any 'collisions' (i.e. the hash exists) perform a full duplicate check (e.g. column by column for your use case)

                        6. Hash values make EXCELLENT primary keys - I would use a simple index-organized table for the few columns you need.

                         

                        Hopefully you can see how the above process lets you proceed at your own pace in both identifying duplicates and in cleaning up the data.

                         

                        Your 'create a hash' can just query 1000 rows a night that have a null hash in the main table. Those are the rows you haven't processed yet. When you process them you will either add the hash value to them or delete the row depending on your needs.
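
                        A sketch of that nightly pass (assumes a ROW_HASH RAW(20) column has been added to the main table; 3 = DBMS_CRYPTO.HASH_SH1, POC columns only):

                        UPDATE customer_account ca
                        SET    ca.row_hash = DBMS_CRYPTO.HASH(
                                               UTL_RAW.CAST_TO_RAW(
                                                 ca.company_code || '|' || ca.sales_division_code || '|' ||
                                                 ca.purchase_account_num || '|' || ca.account_name),
                                               3)
                        WHERE  ca.row_hash IS NULL
                        AND    ROWNUM <= 1000;

                        COMMIT;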

                         

                        Meanwhile you can use the process above for any new data. Yes - you might add a new row that, unknown for now, duplicates an existing row. But your 'create a hash' process will find those when it gets to those existing rows because they won't have hash values.

                         

                        Gotta tell you it seems to meet EVERY requirement you posted and is EXTREMELY easy to implement.

                        • 10. Re: Ideas on cleaning up large table with massive duplicate rows.
                          Jonathan Lewis

                          Does every row in the weekly batch get the same value for insert date, or is the load effected through multiple calls that end up with slightly different values of sysdate?

                           

                          Is 2.2M the number of rows in the file, or the number of rows that survive any current deduplication mechanism? If the latter how big is the file to start with, and how many rows will your table end up with if you do a full deduplication?

                           

                          Modern Fermat - there is an obvious strategy, but I don't want to type it on an iPad.  Step 1, though: a rolling dedup by insert date to a copy table.

                           

                          Regards

                          Jonathan Lewis

                          • 11. Re: Ideas on cleaning up large table with massive duplicate rows.
                            Stew Ashton

                            This is a suggestion for deduplicating the existing table to a new table, "working in identified subsets". I generated 1,100,000 extra rows from your data and filled the new table in 2.5 seconds. Extrapolating, this would take less than an hour.

                             

                            The actual number after LIMIT depends on how many distinct purchase account numbers you have. I would use 1000 max, but no more than 1% of the count of distinct purchase account numbers.

                             

                            insert into CUSTOMER_ACCOUNT
                            with data as (select to_char(level,'fm00009') n from dual connect by level <= 100000)
                            , start_id as (select max(customer_account_id)+1 id from customer_account)
                            select id+rownum customer_account_id,
                              COMPANY_CODE, SALES_DIVISION_CODE,
                              case when purchase_account_num is not null then trim(purchase_account_num)||n end PURCHASE_ACCOUNT_NUM,
                              ACCOUNT_NAME,
                              INSERT_USER_DATE+rownum INSERT_USER_DATE
                            from CUSTOMER_ACCOUNT, data, start_id;
                            
                            
                            drop table no_dups purge;
                            create table no_dups as select * from customer_account where 1=0;
                            
                            
                            declare
                              cursor c_pan is
                                select distinct PURCHASE_ACCOUNT_NUM
                                from CUSTOMER_ACCOUNT
                                where PURCHASE_ACCOUNT_NUM is not null
                                -- and PURCHASE_ACCOUNT_NUM not in (select PURCHASE_ACCOUNT_NUM from no_dups)
                                -- add above line if restarting.
                                order by 1
                                ;
                              type tt_PAN is table of customer_account.PURCHASE_ACCOUNT_NUM%type;
                              lt_PAN tt_PAN;
                              procedure insert_pan(
                                P_PAN_FROM in customer_account.PURCHASE_ACCOUNT_NUM%type,
                                P_PAN_THROUGH in customer_account.PURCHASE_ACCOUNT_NUM%type
                              ) is
                                PRAGMA AUTONOMOUS_TRANSACTION;
                              begin
                                insert /*+ append */ into no_dups (
                                CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
                                )
                                select
                                CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
                                from (
                                  select ca.*,
                                  row_number() over(partition by
                                  COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME
                                  order by INSERT_USER_DATE) rn
                                  from CUSTOMER_ACCOUNT ca
                                  where PURCHASE_ACCOUNT_NUM between P_PAN_FROM and P_PAN_THROUGH
                                )
                                where rn = 1;
                                commit;
                              end insert_pan;
                            begin
                              open c_pan;
                              loop
                                fetch c_pan bulk collect into lt_pan limit 1000;
                                exit when lt_pan.count = 0; -- guard: the final fetch can return zero rows
                                insert_pan(lt_pan(1), lt_pan(lt_pan.count));
                                exit when c_pan%notfound;
                              end loop;
                              close c_pan;
                              insert into no_dups (
                              CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
                              )
                              select
                              CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
                              from (
                                select ca.*,
                                row_number() over(partition by
                                COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME
                                order by INSERT_USER_DATE) rn
                                from CUSTOMER_ACCOUNT ca
                                where PURCHASE_ACCOUNT_NUM is null
                              )
                              where rn = 1;
                              commit;
                            end;
                            /
                            
                            • 12. Re: Ideas on cleaning up large table with massive duplicate rows.
                              Rob the Real Relic

                              You wrote :

                              On ‘stop the bleeding’ I thought I had a good method by replacing the INSERT with a MERGE that only exercises the ‘insert’ option.  On low volumes proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it.  (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records).  So it appears that I’m not going to achieve acceptable run-time on the load until I can get the trash removed.

                               

                              So my question here would be: why not reverse your objectives?

                              1.  Clean up the duplicates - this may take several iterations of your chosen method if the 'cleanup' task overlaps with the data load process.

                              2.  Stop the bleeding once you've fully removed the duplicates.

                              • 13. Re: Ideas on cleaning up large table with massive duplicate rows.
                                EdStevens

                                Solomon Yakobson wrote:

                                 

                                I'd try something like:

                                 

                                Interesting.  I'll do some bench-marking on it. 

                                 

                                Thanks.

                                • 14. Re: Ideas on cleaning up large table with massive duplicate rows.
                                  Stew Ashton

                                  EdStevens wrote:

                                   

                                  Solomon Yakobson wrote:

                                   

                                  I'd try something like:

                                   

                                  Interesting. I'll do some bench-marking on it.

                                   

                                  Thanks.

                                  [UPDATE: I did misread Solomon's post. He creates the index on the target table, not the source.]

                                   

                                  Ed, with all due respect to Solomon and you, I suggest you try another solution. Unless I have completely misread his post, Solomon requires that you create an index on almost all the columns of that huge table. That may take a long time. Then you can start doing the insert, which may also take a while since his code sorts the big table again. I'm not sure why that sort is necessary, but just the preliminary index creation may be a killer.

                                   

                                  Best regards, Stew
