
Bulk Deletes from a Large Table

Answers

  • John Stegeman
    John Stegeman Member Posts: 24,269 Blue Diamond
    edited May 14, 2015 1:13PM

    Oyvind - yes, but you said "Deleting 200 million rows may generate.... and space management." I was pointing out that space management doesn't apply to the "delete" scenario.

    If partitioning is an option for the OP, and the table can be partitioned in a way that makes the delete a "partition swap" activity, that will be, by FAR, the fastest option with the least I/O.
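
    A minimal sketch of what that could look like, assuming a hypothetical table big_tab range-partitioned on the date column that drives the monthly purge (all object names here are placeholders, not anything from the thread):

        -- Swap the expired partition with an empty table of the same structure:
        -- the 200M rows disappear as a dictionary operation, not a row-by-row delete.
        ALTER TABLE big_tab
          EXCHANGE PARTITION p_2015_04 WITH TABLE big_tab_empty
          INCLUDING INDEXES WITHOUT VALIDATION;

        -- Or simply drop the partition outright:
        ALTER TABLE big_tab DROP PARTITION p_2015_04 UPDATE GLOBAL INDEXES;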

  • Paul V.
    Paul V. Member Posts: 42 Blue Ribbon
    edited May 14, 2015 1:20PM

    Thank you for some very interesting ideas.  There were some roadblocks to partitioning, but after reading some of the linked articles, perhaps there are some workarounds that I had not thought of.  I am fortunate to have a test environment with the same hardware and space, so I can run some real-world tests.  I will update when I determine the best path.  Thanks again.

  • Oyvind Isene
    Oyvind Isene Member Posts: 64 Bronze Badge
    edited May 14, 2015 1:25PM

    Agree 100% with respect to partitioning. I mentioned CTAS in case it was impossible because of licenses or other reasons.
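
    For reference, a CTAS-based purge usually works the other way round - copy the rows you want to KEEP and swap the tables. A minimal sketch, where big_tab, col2 and purge_list are placeholder names and the retention predicate is only an assumption:

        -- Direct-path copy of the surviving rows (NOLOGGING only if the DBA allows it)
        CREATE TABLE big_tab_new NOLOGGING AS
          SELECT *
          FROM   big_tab
          WHERE  col2 NOT IN (SELECT col2 FROM purge_list);

        -- Rebuild indexes, constraints and grants on big_tab_new, then swap the names
        ALTER TABLE big_tab     RENAME TO big_tab_old;
        ALTER TABLE big_tab_new RENAME TO big_tab;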

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,010 Blue Diamond
    edited May 14, 2015 3:04PM

    There are a few strategies that might optimise this delete, depending on things like clustering, the number of rows per value of col2, whether or not all the rows for a given col2 would be deleted, etc. Part of the problem is minimising read-consistency costs; part is to do with the effects of random I/O.

    E.g. if there are many, well-clustered rows per col2, then a bulk load of all the target col2 values, followed by an index-driven delete for each value in order, could minimise undo, redo, random I/O, buffer visits and block changes.

    This relates to an interesting side effect of different execution paths that you should consider when doing large deletes: https://jonathanlewis.wordpress.com/2006/11/22/tuning-updates/  The article is old (2006) but still relevant, and it holds for deletes even though it talks mostly about updates.
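
    A rough PL/SQL sketch of that kind of ordered, index-driven delete - t1, col2 and purge_list are placeholder names, and a production version would need proper commit/restart handling:

        DECLARE
          -- purge_list holds the col2 values to remove; working through them
          -- in order keeps the deletes walking the col2 index sequentially.
          TYPE col2_tab IS TABLE OF t1.col2%TYPE;
          l_vals col2_tab;
        BEGIN
          SELECT DISTINCT col2
          BULK COLLECT INTO l_vals
          FROM   purge_list
          ORDER  BY col2;

          FOR i IN 1 .. l_vals.COUNT LOOP
            DELETE FROM t1 WHERE col2 = l_vals(i);  -- index-driven, one value at a time
            COMMIT;                                 -- or commit every N values
          END LOOP;
        END;
        /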

    Regards

    Jonathan Lewis

  • jgarry
    jgarry Member Posts: 13,844 Gold Crown
    edited May 14, 2015 2:13PM

    The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues.

  • Unknown
    edited May 14, 2015 3:22PM
    jgarry wrote:
    
    The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues. 
    

    Huh? What 'Alberto' are you talking about? Comment #5 was by John Stegeman - and his reply was a one-liner.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,010 Blue Diamond
    edited May 14, 2015 3:27PM

    Joel meant comment 5 of the blog post I referenced.

    Regards

    Jonathan Lewis

  • Unknown
    edited May 14, 2015 3:40PM

    Unless we know how the data is really used, it's hard to give you any specific advice.

    Table has around 8 billion records and it is not partitioned.
     

    Why wasn't it partitioned?

    Tell us the history of the table:

    1. how often are rows added? how many rows are added each time?

    2. how often are rows deleted? (other than your monthly)

    3. how often are rows updated?

    Once a month, I have to delete approximately 200,000,000 records.
    
    For background, when this table/application was designed about eight years ago, we were never going to delete data from it.

    Now there are some new "policies" concerning some of the data that resides here causing this new delete requirement.

    What 'policies' are you talking about?

    So far you haven't mentioned ANY performance or other issues (duplicates, running out of space) with the table or amount of data.

    Which suggests that the 'policies' are perhaps related ONLY to access privileges. If so then RLS (row level security) could be used to flag those rows so they can't be accessed any more. Then you don't have to delete them at all.

    And if you designed an RLS strategy that took into account HOW new data is added and HOW data will need to be deleted, then you might be able to add an appropriate flag column that gets populated when the data is first loaded and that the RLS policy could use to make that set of data 'invisible' automatically at the appropriate time.
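
    As a rough sketch of that idea - big_tab, purge_flag and the function name are placeholders, not anything from the thread - a policy function plus DBMS_RLS.ADD_POLICY might look like:

        -- Hypothetical policy function: hide any row whose purge_flag is set.
        CREATE OR REPLACE FUNCTION hide_purged (
          p_schema IN VARCHAR2,
          p_object IN VARCHAR2
        ) RETURN VARCHAR2
        IS
        BEGIN
          RETURN 'purge_flag IS NULL';  -- predicate appended to queries on the table
        END;
        /

        BEGIN
          DBMS_RLS.ADD_POLICY(
            object_schema   => 'APP',
            object_name     => 'BIG_TAB',
            policy_name     => 'BIG_TAB_PURGE_POLICY',
            function_schema => 'APP',
            policy_function => 'HIDE_PURGED',
            statement_types => 'SELECT');
        END;
        /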

    Post more info about the REAL issue that needs to be solved (i.e. how to implement those 'policies') and then we can help you find a REAL answer.

    For just your 'how to delete a bunch of rows' question, I suggest you start at the beginning.

    You said you are deleting in 5k chunks.

    WHY?

    I suggest that the FIRST test you perform is to just do a DELETE statement and gather the metrics on how long it takes and the IO/REDO/ETC that is involved.

    You need a baseline to work from. The best way to do a delete is usually to just do a delete; until you measure that, you have no baseline. You haven't told us ANYTHING about the metrics for the column you are basing the delete on. Is there an index on it? How many unique values does it have? Are you deleting RANDOM values from it? Or are you deleting a specific SET of values?
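
    One simple way to capture that baseline in the test environment - big_tab, col2 and purge_list are placeholder names - is to snapshot the session statistics around a single, un-chunked DELETE:

        -- Run before and after the delete, and diff the values.
        SELECT sn.name, st.value
        FROM   v$mystat st
        JOIN   v$statname sn ON sn.statistic# = st.statistic#
        WHERE  sn.name IN ('redo size', 'undo change vector size',
                           'consistent gets', 'db block gets');

        DELETE FROM big_tab
        WHERE  col2 IN (SELECT col2 FROM purge_list);

        -- SET TIMING ON in SQL*Plus (or an AWR/Statspack snapshot pair)
        -- gives the elapsed time to go with the redo/undo/buffer figures.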

    Start at the beginning - there are NO SHORTCUTS.

  • Martin Preiss
    Martin Preiss Member Posts: 2,381 Gold Trophy
    edited May 14, 2015 3:50PM

    at this point I would vote up for - if I had not already. Of course that would not have helped in the given situation...

  • himmy
    himmy Member Posts: 322 Blue Ribbon
    edited May 14, 2015 3:55PM

    Hi Lordan,

    Your points seem interesting; could you please elaborate on what you mean by the point below? I didn't get it.

    When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement
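
    The quoted point seems to be that a single set-based statement can clear every doomed row in a block in one visit, while row-at-a-time deletes keep revisiting the same block. Roughly, with placeholder names:

        -- Set-based: each block holding doomed rows is visited and changed once.
        DELETE FROM big_tab WHERE col2 = :purge_value;

        -- Row-at-a-time: the same block may be revisited many times.
        BEGIN
          FOR r IN (SELECT ROWID rid FROM big_tab WHERE col2 = :purge_value) LOOP
            DELETE FROM big_tab WHERE ROWID = r.rid;
          END LOOP;
        END;
        /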
    
This discussion has been closed.