Fastest way to batch delete data from a table with 1 billion rows


Answers

  • OraC (Member Posts: 70, Blue Ribbon)

    Apologies for the confusion, @Jonathan Lewis, and thank you for drawing my attention to the discrepancies. See the tables below.

    create table ITD (
      CIDES_INSTANCE_ID  NUMBER(20,0)   not null,
      CUST_ID            NUMBER(20,0)   not null,
      DATA_ID            NUMBER(20,0)   not null,
      D_TYPE             VARCHAR2(255)  default 'NOVALUE' not null,
      EK_ID              NUMBER(20,0),
      MK_ID              NUMBER(20,0),
      CREATED            TIMESTAMP      not null,
      UUN                VARCHAR2(255),
      UAN                VARCHAR2(255),
      MC                 BLOB,
      CHR_VALUE          VARCHAR2(1024),
      BLOB_VALUE         BLOB,
      constraint P_ITD primary key (CIDES_INSTANCE_ID) using index tablespace DU_INDEXES,
      constraint A_ITD_U1 unique (CUST_ID, DATA_ID, D_TYPE) using index tablespace DU_INDEXES
    )
    tablespace DU_DATA
    lob (BLOB_VALUE) store as securefile BLOBSEG_ITD_BLOBVAL
    lob (MC) store as BLOBSEG_ITD_MC;


    create index ITD_CIDS_1 on ITD (CUST_ID ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_2 on ITD (DATA_ID ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_3 on ITD (MK_ID ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_4 on ITD (EK_ID ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_5 on ITD (D_TYPE ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_6 on ITD (CREATED ASC) tablespace DU_INDEXES;
    create index ITD_CIDS_7 on ITD (DATA_ID ASC, CREATED ASC) tablespace DU_INDEXES;


    create table SPECS (
      DATA_ID     NUMBER(20,0)   not null,
      SERVICE_ID  NUMBER(20,0)   not null,
      SP_NAME     VARCHAR2(255)  not null,
      SP_TYPE     VARCHAR2(40)   not null
        constraint C_DATA_ELE_SP_TYPE check (SP_TYPE in ('CHAR','BLOB')),
      E_I         VARCHAR2(1)    default 'N' not null
        constraint C_DATA_ELE_E_I check (E_I in ('Y','N') and E_I = upper(E_I)),
      constraint P_SPECS primary key (DATA_ID) using index tablespace DC_INDEXES,
      constraint A_SPECS_U1 unique (SERVICE_ID, SP_NAME) using index tablespace DC_INDEXES
    )
    tablespace DC_DATA;


    create index SPECS_IDX_1 on SPECS (SERVICE_ID ASC) tablespace DC_INDEXES;


    On the ITD table we have FK constraints to parent tables on the following columns: CUST_ID, EK_ID, MK_ID, hence the indexes on those columns. DATA_ID also has an FK constraint to the SPECS parent table.

    On the SPECS table, SERVICE_ID has an FK constraint to a parent table.
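
    For illustration, the DATA_ID relationship described would be of this shape (the constraint name here is hypothetical; only the columns and the SPECS parent are real):

    alter table ITD add constraint FK_ITD_DATA_ID  -- hypothetical name
      foreign key (DATA_ID) references SPECS (DATA_ID);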

    The MC blob is tiny and has a length of 32 for all rows.

    The skew of data in DATA_ID, as mentioned above, is 95-100% across the three values (56, 76, 77).

    Counts for the other DATA_IDs in the ITD table, associated with other SERVICE_IDs, are as follows:

    DATA_ID 43: 6,485,780
    DATA_ID 15: 3,261,283
    DATA_ID 54: 152,108

    After that the numbers reduce to about 200 records for 15 other DATA_IDs associated with other SERVICE_IDs.
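
    For reference, a breakdown like this can be reproduced with a simple aggregate:

    select data_id, count(*) as row_count
    from itd
    group by data_id
    order by row_count desc;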

  • Jonathan Lewis (Member Posts: 9,753, Gold Crown)


    You've got more indexes than you need on the big table.

    Deciding which to keep, which to drop, and which to change depends entirely on your strategic requirements.

    The (created) index might be causing buffer busy waits: all the inserts in the same second go into the same leaf block, so if you have multiple sessions inserting data they will keep colliding on that leaf block. This means (data_id, created) might be the one to keep, provided you can "delete where data_id = constant and created <= xxxx".

    As it stands, then,

    • you don't need the index (cust_id) as a "foreign key" index because you've got (cust_id, data_id, d_type).
    • you don't need (data_id) as a foreign key index because you've got (data_id, created)
    • you may not need (created) if you can delete with the predicate above (see the sketch after this list)
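
    If that analysis holds, the cleanup would be along these lines. Treat it as a sketch rather than a recommendation, and verify against your own execution plans before dropping anything:

    drop index ITD_CIDS_1;  -- (cust_id) is a prefix of the unique (cust_id, data_id, d_type)
    drop index ITD_CIDS_2;  -- (data_id) is a prefix of (data_id, created)
    drop index ITD_CIDS_6;  -- (created), only if every purge delete also filters on data_id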

    There are various ways you could try to code your delete with an "in list" to emulate the effect of the delete statement I've suggested, but possibly the simplest thing to do is (pseudo-code):

    select data_id bulk collect into data_id_array from specs;
    for i in 1 .. data_id_array.count loop
      loop
        delete /*+ index(itd (data_id created)) */ from itd where data_id = data_id_array(i) and created <= ...;
        exit when sql%rowcount = 0;
        commit;
      end loop;
    end loop;
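
    For what it's worth, a complete runnable shape of that sketch might look like this; the cut-off date and the 10,000-row batch size are illustrative assumptions, not recommendations:

    declare
      type t_id_tab is table of specs.data_id%type;
      data_id_array  t_id_tab;
      v_cutoff       itd.created%type := systimestamp - interval '365' day;  -- assumed retention limit
    begin
      select data_id bulk collect into data_id_array from specs;
      for i in 1 .. data_id_array.count loop
        loop
          delete /*+ index(itd (data_id created)) */ from itd
          where  data_id  = data_id_array(i)
          and    created <= v_cutoff
          and    rownum  <= 10000;       -- assumed batch size
          exit when sql%rowcount = 0;    -- test the count before the commit
          commit;
        end loop;
      end loop;
    end;
    /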
    


    Regards

    Jonathan Lewis

  • OraC (Member Posts: 70, Blue Ribbon)
    edited Feb 8, 2021 3:37PM

    Thank you @Jonathan Lewis

    I am investigating with the development team to determine if we can drop some of these indexes.

    A new issue is hindering my testing progress. I've created a copy of the database for testing these delete options; the plan is to run a number of deletions and then restore the backup. Immediately before I took the backup, the delete batches were completing in a matter of seconds, and I had gathered stats on the table using:

    exec DBMS_STATS.GATHER_TABLE_STATS('SCHEMA', 'ITD', cascade => TRUE);

    After a reboot, the delete batch is taking minutes to run, rendering my tests almost useless. I expect this is due to a lack of index caching, table caching, or both. I've read that a full table scan is one way to load up this data; however, for a table this size that is not feasible. Are there any other options within Standard Edition to alleviate those awful initial batch executions after a database restore, or even a reboot?

    Thanks,

    OraC

  • Jonathan Lewis (Member Posts: 9,753, Gold Crown)

    After a reboot, the delete batch is taking minutes to run, rendering my tests almost useless. I expect this is due to a lack of index caching, table caching, or both.

    By "reboot" do you mean rebooting the machine, or restarting the instance? In either case (though at different levels), the need to reload caches after restarting can make a big difference between clean testing and preliminary experiments against a system that's been running for a long time and has a "live" cache.

    That's one of the reasons you need to know what the data looks like and what the real patterns of usage are so that you can make reasonable guesses about what fraction of the I/O you see in testing would have been cached on a production system.

    Running big tablescans (apart from the potential for not doing what you expect because of serial direct reads) will probably not be a realistic way to warm the instance. Possibly you could review a few Statspack reports to get an idea of the most-used, high-volume SQL statements and create a startup model that runs some of those statements (with different, randomised bind values) a few thousand times to warm things up.
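
    A minimal sketch of that idea follows; the statement and the bind range are assumptions for illustration, and in practice you would lift the real statements from the Statspack reports:

    declare
      v_id itd.cides_instance_id%type;
    begin
      for i in 1 .. 5000 loop
        begin
          -- replay a representative lookup with a randomised bind value
          select cides_instance_id into v_id
          from   itd
          where  cust_id = trunc(dbms_random.value(1, 1000000))
          and    rownum  = 1;
        exception
          when no_data_found then null;  -- a random id may not exist
        end;
      end loop;
    end;
    /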

    Regards

    Jonathan Lewis

  • OraC (Member Posts: 70, Blue Ribbon)

    Thanks again for that suggestion, @Jonathan Lewis.

    We have set up a loading utility to run through a whole set of application queries, and it has definitely improved the performance of the deletion script after a restart, be it an instance restart or a server reboot.

    Regarding the suggestion above of using an array loop:

    1) Is array processing more efficient than using a cursor?

    2) Would one option be more suitable than the other, given the requirement to specify 9 million DATA_ID NUMBER(20) entries in a single execution?

    3) What kind of session memory problems could we encounter, and could they have a knock-on impact by paging other valuable data out of memory?

    4) Should we be resizing any memory structures or modifying any initialisation parameters to reduce the impact while running other online transactions?

    Regards,

    O