
Ideas on cleaning up large table with massive duplicate rows.

EdStevens
EdStevens Member Posts: 28,143 Gold Crown
edited Feb 23, 2017 2:45PM in SQL & PL/SQL

This thread is a branch from Analytics to identify duplicate rows not working as expected, opened to have a broader discussion of the core problem I am working on.

Oracle 11.2.0.4.0, StdEd One, 64-bit

Oracle Grid Infrastructure 11.2.0.4.0, standalone ASM, no cluster, no RAC

Oracle Linux 6.

Here’s the situation. It's a bit lengthy, as I'll try to provide as much relevant information as possible up front and avoid making you drag it out a bit at a time.

I have one table, used as a staging table for a data warehouse app.  The only source of data for this table is a weekly load from a file provided by an outside source.  The entire file is loaded into the table in a weekly job.  This is the only input source to this table. The primary business analyst says the non-duplicated rows must be retained for historical research.

Each week the data file has all of the content from the previous week, plus new content.  Thus, all rows loaded last week will be loaded again this week, along with new data.  And all data loaded this week will be loaded again next week, along with next week’s new data.  Once per quarter the source does some cleanup and housekeeping, but the data file is never restarted from zero.  You can see how this gets a massive build-up of duplicated rows.

Currently the table has 1.13 billion rows across 29,655,781 blocks, with an average row length of 184 bytes. The table has 37 columns, of which a match on 35 defines a ‘duplicate’. The two columns not factoring into duplication status are the PK, a surrogate generated from a sequence at insert time, and INSERT_USER_DATE, populated with SYSDATE at insert time. The table has its own dedicated tablespace that currently has 9 data files totaling 242 GB. There is an MV dependent on this table.

 

There are three indexes, all in one tablespace that contains only these 3 segments. The index tablespace has 4 data files totaling 72.1 GB. I can provide index stats if deemed relevant.

So, that’s the landscape. With that, I have two objectives:

First, stop the bleeding.

Second, clean up the duplicates.

On ‘stop the bleeding’ I thought I had a good method: replacing the INSERT with a MERGE that only exercises the ‘insert’ option. In a low-volume proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it. (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records.) So it appears that I’m not going to achieve acceptable run-time on the load until I can get the trash removed.
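
For reference, a minimal sketch of what that insert-only MERGE looked like, written against the abridged POC table shown further down. The external table and sequence names here are placeholders, and NULL handling is shown only for PURCHASE_ACCOUNT_NUM:

-- sketch only: CUSTOMER_ACCOUNT_EXT and CUSTOMER_ACCOUNT_SEQ are placeholder names
merge into customer_account ca
using (select distinct company_code, sales_division_code,
                       purchase_account_num, account_name
         from customer_account_ext) src
on (    ca.company_code = src.company_code
    and ca.sales_division_code = src.sales_division_code
    and nvl(ca.purchase_account_num, chr(0)) = nvl(src.purchase_account_num, chr(0))
    and ca.account_name = src.account_name)
when not matched then
  insert (customer_account_id, company_code, sales_division_code,
          purchase_account_num, account_name, insert_user_date)
  values (customer_account_seq.nextval, src.company_code, src.sales_division_code,
          src.purchase_account_num, src.account_name, sysdate);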

On ‘clean up the duplicates’, I see four methods.

  One, copy the data to be retained to a new table, then drop the original, rename the new, and rebuild all dependencies (the copy step is sketched just after this list).

  Two, copy the data to be retained to a new table, truncate the original, then copy the data from the new back to the original.

  Three, working in identified subsets of the original data (say, all purchase_account_num between 10000 and 20000), copy the data to be retained to a ‘working’ table, delete all identified data from the original, then copy everything from the working table back to the original.

  Four, working with identified chunks of data, delete in place.
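
For what it's worth, the 'copy the keepers' step in options one and two would look something like this against the abridged POC table, keeping the row with the earliest insert date in each duplicate group (a sketch; not something I have run at full volume):

create table customer_account_new as
select customer_account_id, company_code, sales_division_code,
       purchase_account_num, account_name, insert_user_date
  from (select ca.*,
               row_number() over (partition by company_code, sales_division_code,
                                               purchase_account_num, account_name
                                  order by insert_user_date) rn
          from customer_account ca)
 where rn = 1;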

I ran a full volume trial of the INSERT (SELECT .. FROM) method.  I killed it after it had run 24 hours with no end in sight.

I’ve been reading and re-reading the discussion on this at AskTom  but am still spinning my wheels.

Here's a script that creates and loads a POC table. It's an abridged version of the real table, but it does include the three indexed columns along with a representative group of other columns.

--
-- create test tables
--
drop TABLE CUSTOMER_ACCOUNT purge;

CREATE TABLE CUSTOMER_ACCOUNT
   (    CUSTOMER_ACCOUNT_ID NUMBER(*,0) NOT NULL ENABLE,
        COMPANY_CODE CHAR(2),
        SALES_DIVISION_CODE CHAR(2),
        PURCHASE_ACCOUNT_NUM CHAR(10),
        ACCOUNT_NAME VARCHAR2(20),
        INSERT_USER_DATE DATE,
        constraint ca_pk primary key (CUSTOMER_ACCOUNT_ID)
   );

CREATE INDEX IX_INSERT_USER_DATE ON CUSTOMER_ACCOUNT (INSERT_USER_DATE);
CREATE INDEX IX_PURCHASE_ACCOUNT_NUM ON CUSTOMER_ACCOUNT (PURCHASE_ACCOUNT_NUM);

--
-- insert test data
--
insert into CUSTOMER_ACCOUNT values (1,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (2,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (3,'AA','AA','AAAA','HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (4,'BB','BB',NULL,'HELEN BACH',to_date('2017-01-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (5,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (6,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (7,'AA','AA','AAAA','CHEATEM AND HOWE',to_date('2017-02-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (8,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-01','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (9,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-02','yyyy-mm-dd'));
insert into CUSTOMER_ACCOUNT values (10,'AA','AA','AAAA','Pikof Andropov',to_date('2017-03-03','yyyy-mm-dd'));
--
insert into CUSTOMER_ACCOUNT values (11,'AA','AA','AAAA','MURKEY RESEARCH INC.',to_date('2017-03-03','yyyy-mm-dd'));
--
commit;

From the above data,

- PK 1 to 3 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 4 has no dupes across all relevant columns. Keep it.

- PK 5 to 7 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 8 to 10 are duplicated on all columns except the PK and insert date. Retain the one with the earliest insert date.

- PK 11 has no dupes across all relevant columns. Keep it.

I am open to any and all approaches.


Best Answer

  • Paulzip
    Paulzip Member Posts: 8,322 Blue Diamond
    edited Feb 10, 2017 2:16PM Accepted Answer

    My tuppence worth.

    Stop the bleeding

    1. Rename big table CUSTOMER_ACCOUNT to CUSTOMER_ACCOUNT_RAW.

    2. Create two new empty tables, properly keyed / indexed on the columns which determine uniqueness (use concatenation with chr(0) if there are too many columns): one called CUSTOMER_ACCOUNT_STAGED (for external table loads), the other CUSTOMER_ACCOUNT_FIXED (for the de-duped data).

    3. Create a view called CUSTOMER_ACCOUNT that UNION ALLs these three tables; this preserves any current references (see the sketch after step 5).

    4. Load external table data into CUSTOMER_ACCOUNT_STAGED going forward.

    5. Loads will take the form:

    insert /*+ append */ into CUSTOMER_ACCOUNT_STAGED (col1, col2, .... coln)
    select col1, col2, .... coln
      from EXTERNAL_TABLE
    minus
    select col1, col2, .... coln
      from CUSTOMER_ACCOUNT_STAGED
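
    A minimal sketch of the view from step 3, assuming the three tables share the same column list:

    create or replace view CUSTOMER_ACCOUNT as
    select * from CUSTOMER_ACCOUNT_RAW
    union all
    select * from CUSTOMER_ACCOUNT_STAGED
    union all
    select * from CUSTOMER_ACCOUNT_FIXED;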

    De-Dupe

    1. Add an index on INSERT_USER_DATE

    2. Create a table LOAD_DATES with a PROCESSED flag (default = 'N') and load it with the unique dates from CUSTOMER_ACCOUNT_RAW using Markus Winand's loose index scan CTE technique (or Jonathan Lewis's min/max index scan technique) - a sketch follows at the end of this answer.

    3. Loop through unprocessed LOAD_DATES oldest first, and chunk-load CUSTOMER_ACCOUNT_RAW filtered by INSERT_USER_DATE into CUSTOMER_ACCOUNT_FIXED using

    insert /*+ append */ into CUSTOMER_ACCOUNT_FIXED (col1, col2, .... coln)
    select col1, col2, .... coln
      from CUSTOMER_ACCOUNT_RAW
     where INSERT_USER_DATE = <LOAD_DATE>
    minus
    select col1, col2, .... coln
      from CUSTOMER_ACCOUNT_FIXED

    4. This could be scheduled, if need be, to run out of hours and perhaps over a period of days.

    5. Mark LOAD_DATES.PROCESSED = 'Y' after each chunk.

    6. When all dates are processed, recreate the view CUSTOMER_ACCOUNT as CUSTOMER_ACCOUNT_FIXED union all'd with CUSTOMER_ACCOUNT_STAGED.

    7. Merge the staged data into the fixed data using the insert append / minus technique mentioned above.

    8. Drop the CUSTOMER_ACCOUNT view and rename CUSTOMER_ACCOUNT_FIXED to CUSTOMER_ACCOUNT.

    9. Change the external table data load destination to CUSTOMER_ACCOUNT.

    10. Drop CUSTOMER_ACCOUNT_STAGED, and when you're happy drop CUSTOMER_ACCOUNT_RAW.
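
    The sketch referred to in de-dupe step 2 - a loose-index-scan style recursive CTE, assuming a LOAD_DATES(LOAD_DATE DATE, PROCESSED CHAR(1)) table and relying on the INSERT_USER_DATE index from step 1:

    insert into LOAD_DATES (LOAD_DATE, PROCESSED)
    with d (LOAD_DATE) as (
      select min(INSERT_USER_DATE) from CUSTOMER_ACCOUNT_RAW
      union all
      select (select min(r.INSERT_USER_DATE)          -- next distinct date above the previous one
                from CUSTOMER_ACCOUNT_RAW r
               where r.INSERT_USER_DATE > d.LOAD_DATE)
        from d
       where d.LOAD_DATE is not null
    )
    select LOAD_DATE, 'N' from d where LOAD_DATE is not null;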


Answers

  • JuanM
    JuanM Member Posts: 2,144 Gold Trophy
    edited Feb 2, 2017 1:54PM

    My choice would be your option 1.

    >>>>One, copy the data to be retained to a new table, then drop the original, rename the new, and rebuild all dependencies

    Have you read the article How to Find and Delete Duplicate Rows with SQL, section "Unvalidated constraints"?

    You can adapt the delete process into an insert process into a new table:

    a) Bulk-inserting the rows that are not included in the exceptions table (this would be faster with parallel options, but you are on SE One).

    b) Inserting with PL/SQL just the few rows from the exceptions table (if they are few; and now you have the ROWID).

    Also, try to configure your new table with a uniform extent size and NOLOGGING.

    This option will probably require extra storage for the index creation, but could be faster. I have not tested it with this many rows, nor on SE One. But read the article.

    HTH

    Juan Manuel

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 2, 2017 1:52PM

    Juan - thanks for the response.

    Your link returns a 404 Not Found.

  • John Stegeman
    John Stegeman Member Posts: 24,269 Blue Diamond
    edited Feb 2, 2017 1:53PM
    It doesn't return a 404, it's not formed right: https://blogs.oracle.com/sql/entry/how_to_find_and_delete
  • JuanM
    JuanM Member Posts: 2,144 Gold Trophy
    edited Feb 2, 2017 1:55PM
    EdStevens wrote: Juan - thanks for the response. Your link returns a 404 not found.

    Sorry, the link is fixed now.

  • Unknown
    edited Feb 2, 2017 1:59PM
    I am open to any and all approaches.

    Ok - except you never responded to ANY of the suggestions many of us made in the other thread.

    Mine was:

    Identify dups

    1. create a separate table (for now)

    2. scan the current table, compute a hash for each row and store the hash and ROWID in the new table (a sketch follows at the end of this post)

    3. identify your current duplicates by processing that new, small table

    Stop the bleeding

    1. add the hash to the current table (or continue to use the new 'hash lookup' table)

    2. compute a hash on new rows added

    3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table

    4. abandon the insert (or whatever you need/want to do) if it is a dup.

    5. accept the insert if it is not a dup. Add the hash value to the main table or the lookup table depending on which you are using.
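
    Here is that sketch of 'identify dups' steps 1 and 2 - hypothetical names, assuming DBMS_CRYPTO is available (it needs an EXECUTE grant) and using the abridged POC columns; the real 35 identity columns would be concatenated the same way:

    create table ca_hash_lookup (
      row_hash raw(16) not null,   -- MD5 is 16 bytes
      rid      rowid   not null
    );

    insert /*+ append */ into ca_hash_lookup (row_hash, rid)
    select dbms_crypto.hash(
             utl_raw.cast_to_raw(
               company_code || chr(0) || sales_division_code || chr(0) ||
               purchase_account_num || chr(0) || account_name),
             dbms_crypto.hash_md5),
           rowid
      from customer_account;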

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 2, 2017 3:28PM

    rp0428 wrote:

    I am open to any and all approaches.

    Ok - except you never responded to ANY of the suggestions many of us made in the other thread.

    Because I saw the discussion was quickly expanding well beyond the original question and felt it best to do some additional testing and benchmarking, then "respond" with a new thread giving a more comprehensive view of the entire issue.  Partly to get all the cards on a clean table and not be accused of withholding critical information.

    Mine was:
    Identify dups
    1. create a separate table (for now)
    2. scan the current table, compute a hash for each row and store the hash and ROWID in the new table
    3. identify your current duplicates by processing that new, small table
    Stop the bleeding
    1. add the hash to the current table (or continue to use the new 'hash lookup' table)
    2. compute a hash on new rows added
    3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table
    4. abandon the insert (or whatever you need/want to do) if it is a dup.
    5. accept the insert if it is not a dup. Add the hash value to the main table or the lookup table depending on which you are using.

    I promise I will take another look at this, though I am concerned about Jonathan Lewis's comment about hashes (even very sophisticated ones) only reducing the possibility of a hash collision. With these data volumes, one collision in a billion would cause data loss.

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 2, 2017 3:32PM
    John Stegeman wrote: it doesn't return a 404, it's not formed right https://blogs.oracle.com/sql/entry/how_to_find_and_delete

    Ah, that one.  The one I had previously bookmarked and was using as a primary source, and referenced in the original thread. 

    I will give it another review.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,398 Black Diamond
    edited Feb 2, 2017 4:52PM

    I'd try something like:

    CREATE TABLE CUSTOMER_ACCOUNT_NODUPS(
                                         CUSTOMER_ACCOUNT_ID  NUMBER(*,0) NOT NULL ENABLE,
                                         COMPANY_CODE         CHAR(2),
                                         SALES_DIVISION_CODE  CHAR(2),
                                         PURCHASE_ACCOUNT_NUM CHAR(10),
                                         ACCOUNT_NAME         VARCHAR2(20),
                                         INSERT_USER_DATE     DATE
                                        )
    /
    CREATE UNIQUE INDEX CUSTOMER_ACCOUNT_NODUPS_UIDX1
      ON CUSTOMER_ACCOUNT_NODUPS(
                                 COMPANY_CODE,
                                 SALES_DIVISION_CODE,
                                 PURCHASE_ACCOUNT_NUM,
                                 ACCOUNT_NAME
                                )
    /
    INSERT /*+ IGNORE_ROW_ON_DUPKEY_INDEX(CUSTOMER_ACCOUNT_NODUPS,CUSTOMER_ACCOUNT_NODUPS_UIDX1) */
      INTO CUSTOMER_ACCOUNT_NODUPS
      SELECT  *
        FROM  CUSTOMER_ACCOUNT
        ORDER BY COMPANY_CODE,
                 SALES_DIVISION_CODE,
                 PURCHASE_ACCOUNT_NUM,
                 ACCOUNT_NAME,
                 INSERT_USER_DATE
    /

    Then drop the original table and rename the NODUPS table back to the original. For example:

    SQL> SELECT  EMPNO,
      2          DEPTNO,
      3          JOB,
      4          HIREDATE
      5    FROM  EMP
      6    ORDER BY DEPTNO,
      7             JOB,
      8             HIREDATE
      9  /

         EMPNO     DEPTNO JOB        HIREDATE
    ---------- ---------- ---------- ---------
          7934         10 CLERK      23-JAN-82
          7782         10 MANAGER    09-JUN-81
          7839         10 PRESIDENT  17-NOV-81
          7902         20 ANALYST    03-DEC-81
          7788         20 ANALYST    09-DEC-82
          7369         20 CLERK      17-DEC-80
          7876         20 CLERK      12-JAN-83
          7566         20 MANAGER    02-APR-81
          7900         30 CLERK      03-DEC-81
          7698         30 MANAGER    01-MAY-81
          7499         30 SALESMAN   20-FEB-81
          7521         30 SALESMAN   22-FEB-81
          7844         30 SALESMAN   08-SEP-81
          7654         30 SALESMAN   28-SEP-81

    14 rows selected.

    SQL> CREATE TABLE TBL
      2    AS
      3      SELECT  EMPNO,
      4              DEPTNO,
      5              JOB,
      6              HIREDATE
      7        FROM  EMP
      8        WHERE 1 = 2
      9  /

    Table created.

    SQL> CREATE UNIQUE INDEX TBL_UIDX1
      2    ON TBL(
      3           DEPTNO,
      4           JOB
      5          )
      6  /

    Index created.

    SQL> INSERT /*+ IGNORE_ROW_ON_DUPKEY_INDEX(TBL,TBL_UIDX1) */
      2    INTO TBL
      3    SELECT  EMPNO,
      4            DEPTNO,
      5            JOB,
      6            HIREDATE
      7      FROM  EMP
      8      ORDER BY DEPTNO,
      9               JOB,
     10               HIREDATE
     11  /

    9 rows created.

    SQL> SELECT  *
      2    FROM  TBL
      3  /

         EMPNO     DEPTNO JOB        HIREDATE
    ---------- ---------- ---------- ---------
          7934         10 CLERK      23-JAN-82
          7782         10 MANAGER    09-JUN-81
          7839         10 PRESIDENT  17-NOV-81
          7902         20 ANALYST    03-DEC-81
          7369         20 CLERK      17-DEC-80
          7566         20 MANAGER    02-APR-81
          7900         30 CLERK      03-DEC-81
          7698         30 MANAGER    01-MAY-81
          7499         30 SALESMAN   20-FEB-81

    9 rows selected.

    SQL>

    SY.

  • Unknown
    edited Feb 2, 2017 7:05PM
    I promise I will take another look at this, though I am concerned about Jonathan Lewis's comment about hashes (even very sophisticated ones) only reducing the possibility of a hash collision. With these data volumes, one collision in a billion would cause data loss.

    Ok - two points

    First - your statement is just plain wrong. No number of collisions is going to cause data loss. Period. I don't even understand how you can think that. The ONLY possible data loss would occur if you actually delete data that is duplicated. And if you use the process I outlined properly that is just NOT possible.

    Please read my reply again and think through the steps I outlined. I tried to 'outline' the basic steps involved - not design the entire process for you.

    3. verify the row is (or is not) a duplicate by checking for the hash in the hash lookup table
    4. abandon the insert (or whatever you need/want to do) if it is a dup.

    So you compute a hash on some new row (or other large object). And in step #3 you check whether you already have that hash.

    Let's say you get a hit with an existing hash. Well - I guess I thought it would be obvious that if you had any concern about a 'false collision' that might cause data loss you would just:

    1. Use the rowid in the hash lookup table to fetch the actual row of data from the main table

    2. Compare the existing row with the incoming row

    That tells you if the collision was a phantom or not. If it was, and I REALLY mean this, you should contact the Feds because they will be EXTREMELY interested in those two DIFFERENT rows that produced the same hash.

    Back to the issue - the point everyone, even Jonathan it seems, has missed is that the primary purpose of using a hash for a data use case like yours is NOT as a final arbiter of identity. The purpose is to simplify and improve the performance of the process.

    Using a hash makes it possible to do what I just described - finding any existing duplicate 'candidate'. In your use case you will only have one at most. Then you do an actual comparison to confirm that you don't have a false positive.

    Without the hash how do you know which existing rows to check as duplicates for incoming data? You don't.

    Consider this additional fact: using the same algorithm, it is IMPOSSIBLE for a given string of bytes to produce a different hash on different executions.

    That means that if the hash of a new row does NOT MATCH a hash in the lookup table, you are GUARANTEED that the data does not already exist.

    Regarding Jonathan's last comment you quoted above?

    Well - I didn't respond to that because I assumed he was being facetious for the reasons I just gave above. Collisions don't matter for your use case since you don't base your decision to throw data away just on the existence of a collision. You use the collision to identify the row you need to perform the full 'identity' check on.

    For anyone interested - the U.S. federal government, no less, is the one that sets those standards that all federal agencies must use. Clearly they aren't worried about the minuscule possibility of collisions.

    The comment I made to Jonathan was actually serious. If you can discover two different rows in your data that cause a collision, the Feds, and most other crypto agencies, would be VERY INTERESTED because it would mean they missed something very important.

    But since we need to beat it to death, here are the two primary FIPS standards docs that you might want to review. The first one is the standard itself. See section 3 on page iv in particular:

    http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf

    The hash algorithms specified in this Standard are called secure because, for a given algorithm, it is computationally infeasible 1) to find a message that corresponds to a given message digest, or 2) to find two different messages that produce the same message digest. Any change to a message will, with a very high probability, result in a different message digest. This will result in a verification failure when the secure hash algorithm is used with a digital signature algorithm or a keyed-hash message authentication algorithm.  

    At the end of that doc they mention that a different doc has the actual data about collision resistance of different algorithms.

    See table 1 in section 4.2 on page 8 and related info.

    https://www.google.com/search?q=SP+800-+107&ie=utf-8&oe=utf-8

    Summary:

    1. A one time table scan to compute a hash (use MD5 if you want) is the fastest way I know to prepare for a de-duplication process.

    2. Store the hash values in a lookup table along with other metadata as needed

    3. Store the hash in the main table also. The lack of a hash can indicate you just haven't processed that data yet.

    4. Hash incoming data and check for collisions with the prestored/indexed hash values of the existing data

    5. For any 'collisions' (i.e. the hash exists) perform a full duplicate check (e.g. column by column for your use case)

    6. hash values make EXCELLENT primary keys - I would use a simple index-organized table for the few columns you need

    Hopefully you can see how the above process lets you proceed at your own pace in both identifying duplicates and in cleaning up the data.

    Your 'create a hash' can just query 1000 rows a night that have a null hash in the main table. Those are the rows you haven't processed yet. When you process them you will either add the hash value to them or delete the row depending on your needs.

    Meanwhile you can use the process above for any new data. Yes - you might add a new row that, unknown for now, duplicates an existing row. But your 'create a hash' process will find those when it gets to those existing rows because they won't have hash values.
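
    A sketch of that throttled catch-up job, assuming a ROW_HASH RAW(16) column has been added to the main table and using the abridged POC columns again:

    update customer_account
       set row_hash = dbms_crypto.hash(
                        utl_raw.cast_to_raw(
                          company_code || chr(0) || sales_division_code || chr(0) ||
                          purchase_account_num || chr(0) || account_name),
                        dbms_crypto.hash_md5)
     where row_hash is null
       and rownum <= 1000;   -- small nightly batch; repeat until no NULL hashes remain
    commit;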

    Gotta tell you it seems to meet EVERY requirement you posted and is EXTREMELY easy to implement.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown
    edited Feb 2, 2017 9:36PM

    Does every row in the weekly batch get the same value for insert date,  or is the load effected through multiple calls that end up with slightly different values of sysdate ?

    Is 2.2M the number of rows in the file, or the number of rows that survive any current deduplication mechanism? If the latter how big is the file to start with, and how many rows will your table end up with if you do a full deduplication?

    Modern Fermat - there is an obvious strategy, but I don't want to type it on an iPad.  Step 1, though, rolling dedup by insert date to a copy table.

    Regards

    Jonathan Lewis

  • Stew Ashton
    Stew Ashton Member Posts: 2,849 Gold Trophy
    edited Feb 3, 2017 3:05AM

    This is a suggestion for deduplicating the existing table to a new table, "working in identified subsets". I generated 1,100,000 extra rows from your data and filled the new table in 2.5 seconds. Extrapolating, this would take less than an hour.

    The actual number after LIMIT depends on how many distinct purchase account numbers you have. I would use 1000 max, but no more than 1% of the count of distinct purchase account numbers.

    -- generate 1,100,000 extra test rows from the POC data
    insert into CUSTOMER_ACCOUNT
    with data as (select to_char(level,'fm00009') n from dual connect by level <= 100000),
    start_id as (select max(customer_account_id)+1 id from customer_account)
    select id+rownum customer_account_id,
      COMPANY_CODE, SALES_DIVISION_CODE,
      case when purchase_account_num is not null then trim(purchase_account_num)||n end PURCHASE_ACCOUNT_NUM,
      ACCOUNT_NAME,
      INSERT_USER_DATE+rownum INSERT_USER_DATE
    from CUSTOMER_ACCOUNT, data, start_id;

    drop table no_dups purge;
    create table no_dups as select * from customer_account where 1=0;

    declare
      cursor c_pan is
        select distinct PURCHASE_ACCOUNT_NUM
        from CUSTOMER_ACCOUNT
        where PURCHASE_ACCOUNT_NUM is not null
        -- and PURCHASE_ACCOUNT_NUM not in (select PURCHASE_ACCOUNT_NUM from no_dups)
        -- add above line if restarting.
        order by 1
        ;
      type tt_PAN is table of customer_account.PURCHASE_ACCOUNT_NUM%type;
      lt_PAN tt_PAN;
      procedure insert_pan(
        P_PAN_FROM    in customer_account.PURCHASE_ACCOUNT_NUM%type,
        P_PAN_THROUGH in customer_account.PURCHASE_ACCOUNT_NUM%type
      ) is
        PRAGMA AUTONOMOUS_TRANSACTION;
      begin
        insert /*+ append */ into no_dups (
          CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
        )
        select
          CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
        from (
          select ca.*,
                 row_number() over(partition by
                   COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME
                   order by INSERT_USER_DATE) rn
          from CUSTOMER_ACCOUNT ca
          where PURCHASE_ACCOUNT_NUM between P_PAN_FROM and P_PAN_THROUGH
        )
        where rn = 1;
        commit;
      end insert_pan;
    begin
      open c_pan;
      loop
        fetch c_pan bulk collect into lt_pan limit 1000;
        exit when lt_pan.count = 0;   -- guard: the final fetch can return no rows
        insert_pan(lt_pan(1), lt_pan(lt_pan.count));
        exit when c_pan%notfound;
      end loop;
      close c_pan;
      -- finally, handle the rows with a NULL purchase_account_num
      insert into no_dups (
        CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
      )
      select
        CUSTOMER_ACCOUNT_ID, COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME, INSERT_USER_DATE
      from (
        select ca.*,
               row_number() over(partition by
                 COMPANY_CODE, SALES_DIVISION_CODE, PURCHASE_ACCOUNT_NUM, ACCOUNT_NAME
                 order by INSERT_USER_DATE) rn
        from CUSTOMER_ACCOUNT ca
        where PURCHASE_ACCOUNT_NUM is null
      )
      where rn = 1;
      commit;
    end;
    /
  • Rob the Real Relic
    Rob the Real Relic Member Posts: 335
    edited Feb 3, 2017 3:17AM

    You wrote :

    On ‘stop the bleeding’ I thought I had a good method by replacing the INSERT with a MERGE that only exercises the ‘insert’ option.  On low volumes proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it.  (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records).  So it appears that I’m not going to achieve acceptable run-time on the load until I can get the trash removed.

    So my question here would be, why not reverse your objectives? :

    1.  Clean up the duplicates - this may take several iterations of your chosen method if the 'cleanup' task overlaps with the data load process.

    2.  Stop the bleeding once you've fully removed the duplicates.

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 10:17AM
    Solomon Yakobson wrote: I'd try something like:

    Interesting.  I'll do some bench-marking on it. 

    Thanks.

  • Stew Ashton
    Stew Ashton Member Posts: 2,849 Gold Trophy
    edited Feb 3, 2017 11:24AM
    EdStevens wrote:
    Solomon Yakobson wrote: I'd try something like:
    Interesting. I'll do some bench-marking on it. Thanks.

    [UPDATE: I did misread Solomon's post. He creates the index on the target table, not the source.]

    Ed, with all due respect to Solomon and you, I suggest you try another solution. Unless I have completely misread his post, Solomon requires that you create an index on almost all the columns of that huge table. That may take a long time. Then you can start doing the insert, which may also take a while since his code sorts the big table again. I'm not sure why that sort is necessary, but just the preliminary index creation may be a killer.

    Best regards, Stew

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 10:35AM
    rp0428 wrote:
    I promise I will take another look at this, though I am concerned about Jonathan Lewis's comment about hashes (even very sophisticated ones) only reducing the possibility of a hash collision. With these data volumes, one collision in a billion would cause data loss.
    Ok - two points
    First - your statement is just plain wrong. No number of collisions is going to cause data loss. Period. I don't even understand how you can think that. The ONLY possible data loss would occur if you actually delete data that is duplicated. And if you use the process I outlined properly that is just NOT possible.
    Please read my reply again and think through the steps I outlined. I tried to 'outline' the basic steps involved - not design the entire process for you.

    Thanks for the further explanation.  It may seem perfectly clear to you but will take me a while to digest.  Sometimes I have to spend time trying to develop a rebuttal before the light goes on and I see the error in my own thinking. 

    A couple of things I have to consider here, not the least of which is the human element.  This approach does put more moving parts in the load procedure, and those parts use things our developers have never seen.  I fear the developers would resist due to their own lack of understanding and attempt to KISS.  In this case in particular, the load procedure for this table is one of many that do essentially the same thing - load a staging table from an external table - and they all use the same outline.  When a new table/data source comes up they simply copy the code from an existing procedure and make as few changes as possible - name of staging and external tables, and the column list.  This may seem to some like it is easily dismissed, but "on the ground" it is a real factor to consider.

    One thing I did miss on your original suggestion was that a hash collision is not the final arbiter, but simply triggers a further, actual row comparison.

    Another thing I need to consider is that adding a hash column to the existing table will almost certainly lead to a massive amount of row migration.

    Please understand that none of the above means I'm rejecting this approach out of hand.  Just letting you know additional concerns that it brings up and that I will need to think on it some more.

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,398 Black Diamond
    edited Feb 3, 2017 10:35AM
    Stew Ashton wrote: Unless I have completely misread his post, Solomon requires that you create an index on almost all the columns of that huge table.

    You did. The index is created on the empty NODUPS table. It serves as a filter, not letting duplicate rows with more recent dates be inserted, and the hint masks the exception. The weak point of this solution is sorting that huge table, but that is a step you probably can't avoid in any solution.

    SY.

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 10:42AM
    Jonathan Lewis wrote: Does every row in the weekly batch get the same value for insert date, or is the load effected through multiple calls that end up with slightly different values of sysdate?

    Every row gets the same insert date.  The load is a single INSERT .. INTO ... SELECT ... FROM EXTERN_TBL, with SYSDATE the supplied value for that column.  I just ran a query to verify that all insert dates for a given trunc(insert_date) have the same time.
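
    Roughly this kind of check (a sketch of what I ran):

    select trunc(insert_user_date) load_day,
           count(distinct insert_user_date) distinct_timestamps
      from customer_account
     group by trunc(insert_user_date)
    having count(distinct insert_user_date) > 1;   -- no rows back = one timestamp per load day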

    Is 2.2M the number of rows in the file, or the number of rows that survive any current deduplication mechanism? If the latter how big is the file to start with, and how many rows will your table end up with if you do a full deduplication?

    There is no current de-duplication process.  That is the root of the existing problem.  So yes, the 2.2M rows in the file are also 2.2M rows added to the table every week.

    Modern Fermat - there is an obvious strategy, but I don't want to type it on an iPad. Step 1, though, rolling dedup by insert date to a copy table. Regards, Jonathan Lewis
  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 10:44AM
    Stew Ashton wrote:
    This is a suggestion for deduplicating the existing table to a new table, "working in identified subsets". I generated 1,100,000 extra rows from your data and filled the new table in 2.5 seconds. Extrapolating, this would take less than an hour.
    The actual number after LIMIT depends on how many distinct purchase account numbers you have. I would use 1000 max but no more than 1% of the count of distinct purchase account numbers.

    Thanks.  Like RP's suggestion, this one will take me a bit to digest, but I will study it.

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 11:03AM
    Rob the Real Relic wrote:
    You wrote: On ‘stop the bleeding’ I thought I had a good method by replacing the INSERT with a MERGE that only exercises the ‘insert’ option. On low volumes proof of concept this worked exactly as expected, but given the current size of the table, it ran for two hours before I killed it. (The current production procedure runs in about 45 minutes, inserting approx. 2.2 million records). So it appears that I’m not going to achieve acceptable run-time on the load until I can get the trash removed.
    So my question here would be, why not reverse your objectives?
    1. Clean up the duplicates - this may take several iterations of your chosen method if the 'cleanup' task overlaps with the data load process.
    2. Stop the bleeding once you've fully removed the duplicates.

    Yes, I've already come to that conclusion, and that's where my testing and benchmarking have been focused the last couple of days. If I can just clean up the duplicates and buy back the disk space, then 'simple' procedures to prevent further duplication may become feasible vis-à-vis run time on the load procedure (RP's hashing solution notwithstanding).

    Worst case on that is not being able to come up with a 'prevention' method that is acceptable to all parties (back to the human element), once the initial cleanout is accomplished. The initial cleanout may have to be spread over some time spanning load cycles, but I'd think that after it is completed, ongoing cleanouts could probably be accomplished in a reasonable time frame and could be scheduled periodically. I guess a fine distinction here is that duplicates, in and of themselves, are not an issue. It's the continual multiplication of those duplicates going unchecked. So while the ideal solution would prevent future occurrences, it would be acceptable to simply stabilize the amount of duplicates.

  • Stew Ashton
    Stew Ashton Member Posts: 2,849 Gold Trophy
    edited Feb 3, 2017 11:26AM
    Solomon Yakobson wrote:
    Stew Ashton wrote: Unless I have completely misread his post, Solomon requires that you create an index on almost all the columns of that huge table.
    You did. The index is created on the empty NODUPS table. It serves as a filter, not letting duplicate rows with more recent dates be inserted, and the hint masks the exception. The weak point of this solution is sorting that huge table, but that is a step you probably can't avoid in any solution. SY.

    I most certainly did! Why did I think the index was created on the huge table? Of course it is created on the target table, which is empty.

    My later suggestion does not sort the huge table all at once, it sorts subsets, which should allow for sorting in memory.

    But why do you "sort the huge table" anyway? Is that really necessary?

    [UPDATE: I see now why Solomon sorts the table. It is in order to insert the row with the earliest insert date. If Ed let that requirement go, there would be no need for the sort and Solomon's solution would probably go MUCH faster...]

    Thanks and best regards, Stew

  • Stew Ashton
    Stew Ashton Member Posts: 2,849 Gold Trophy
    edited Feb 3, 2017 12:18PM

    Hm. I tried Solomon's solution on my million row dataset and it took several minutes to complete - without doing the sort. With the sort, I ran out of temp space.

    I am surprised. I thought Solomon had the winner once he straightened me out.

    Best regards, Stew

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 12:58PM
    Solomon Yakobson wrote: I'd try something like: CREATE TABLE CUSTOMER_ACCOUNT_NODUPS( <snip>

    I must need another set of eyes.  Got an ORA-00001: unique constraint (ESTEVENS.CA_DEDUPE_UIDX1) violated.

    Trying to run that to ground, I found (Guy Harrison - Yet Another Database Blog - The 11gR2 IGNORE_ROW_ON_DUPKEY_INDEX hint) where Guy Harrison says "As with all hints, a syntax error in the hint causes it to be silently ignored. The result will be that ORA-00001 will be raised, just as if no hint were used."

    But I'm not seeing it.

    SQL> create table ca_dedupe
      2  as (select * from customer_account
      3      where 1=2)
      4  ;

    Table created.

    Elapsed: 00:00:00.01
    SQL> create unique index ca_dedupe_uidx1
      2  on ca_dedupe
      3      (COMPANY_CODE,
      4        SALES_DIVISION_CODE,
      5        PURCHASE_ACCOUNT_NUM,
      6        ACCOUNT_NAME
      7        )
      8  ;

    Index created.

    Elapsed: 00:00:00.01
    SQL> --
    SQL> insert /* IGNORE_ROW_ON_DUPKEY_INDEX(CA_DEDUPE,CA_DEDUPE_UIDX1) */
      2    into ca_dedupe
      3    select *
      4      from customer_account
      5      order by
      6        COMPANY_CODE,
      7        SALES_DIVISION_CODE,
      8        PURCHASE_ACCOUNT_NUM,
      9        ACCOUNT_NAME,
     10        INSERT_USER_DATE
     11  ;
    insert /* IGNORE_ROW_ON_DUPKEY_INDEX(CA_DEDUPE,CA_DEDUPE_UIDX1) */
    *
    ERROR at line 1:
    ORA-00001: unique constraint (ESTEVENS.CA_DEDUPE_UIDX1) violated

    Elapsed: 00:00:00.01
  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,398 Black Diamond
    edited Feb 3, 2017 1:03PM

    insert /*+ IGNORE_ROW_ON_DUPKEY_INDEX(CA_DEDUPE,CA_DEDUPE_UIDX1) */

    SY.

  • Mike Kutz
    Mike Kutz Member Posts: 5,588 Silver Crown
    edited Feb 3, 2017 1:38PM

    I'm still trying to digest your business requirements:

    • you stage a file each week into the same table
    • each file is a cumulative copy of the prior week's file plus the current week's data
    • you need to keep all prior loaded data for "debugging" purposes
    • you have duplicates you want to remove

    Step 1 - Stop the bleeding

    Instead of using a "hello kitty" band-aid solution to stop your bleeding, try using stitches.

    Use In-Database Archiving (or a similar technique - a rough 11.2 stand-in is sketched below) and "archive" your staged data when you have finished processing it.

    Under normal circumstances, you will only see the "currently staged file". Since the files are cumulative, you only need to process the "currently staged file".

    You can turn off the "VPD" filter when you need to "debug".
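
    In-Database Archiving itself is a 12c feature, so on 11.2 the "similar technique" could be approximated with an ordinary flag plus a view (hypothetical names; CUSTOMER_ACCOUNT_STAGE stands in for whatever the staging table ends up being):

    alter table customer_account_stage add (processed_flag char(1) default 'N' not null);

    create or replace view customer_account_current as
      select * from customer_account_stage
       where processed_flag = 'N';

    -- after the weekly processing finishes, "archive" the rows just handled
    update customer_account_stage
       set processed_flag = 'Y'
     where processed_flag = 'N';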

    Step 2 - Stage the Stage.

    If you need to "dedup" the "currently staged file", I would probably do an "INSERT .. SELECT" of the distinct records into a second table,

    and, if needed, follow that by a "PARTITION EXCHANGE".

    (Actually, doing the "reverse" might be better: stage into the 2nd table; insert .. select into the main stage.)

    My $0.02

    MK

  • EdStevens
    EdStevens Member Posts: 28,143 Gold Crown
    edited Feb 3, 2017 2:22PM
    Solomon Yakobson wrote: insert /*+ IGNORE_ROW_ON_DUPKEY_INDEX(CA_DEDUPE,CA_DEDUPE_UIDX1) */ SY.

    D'oh! [image of Homer Simpson]

This discussion has been closed.