      • 15. Re: Bulk Deletes from a Large Table
        jgarry

        The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues.

        • 16. Re: Bulk Deletes from a Large Table

          jgarry wrote:

           

          The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues.

          Huh? What 'Alberto' are you talking about? Comment #5 was by John Stegeman - and his reply was a one-liner.

          • 17. Re: Bulk Deletes from a Large Table
            Jonathan Lewis

            Joel meant comment 5 of the blog post I referenced.

             

            Regards

            Jonathan Lewis

            • 18. Re: Bulk Deletes from a Large Table

Unless we know how the data is really used, it's hard to give you any specific advice.

              Table has around 8 billion records and it is not partitioned.

              Why wasn't it partitioned?

               

              Tell us the history of the table:

               

              1. how often are rows added? how many rows are added each time?

              2. how often are rows deleted? (other than your monthly)

              3. how often are rows updated?

              Once a month, I have to delete approximately 200,000,000 records.

               

For background, when this table/application was designed about eight years ago, we were never going to delete data from it.

              Now there are some new "policies" concerning some of the data that resides here causing this new delete requirement.

              What 'policies' are you talking about?

               

So far you haven't mentioned ANY performance or other issues (duplicates, running out of space) with the table or the amount of data.

               

              Which suggests that the 'policies' are perhaps related ONLY to access privileges. If so then RLS (row level security) could be used to flag those rows so they can't be accessed any more. Then you don't have to delete them at all.

               

              And if you designed an RLS strategy that took into account HOW new data is added and HOW data will need to be deleted then you might be able to set an appropriate new flag column that would get populated when the data is first loaded and that the RLS policy could use automatically to make that set of data 'invisible' at the appropriate time.
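To make that idea concrete, here is a minimal sketch of what such an RLS (VPD) policy might look like. It assumes a hypothetical PURGE_FLAG column that gets populated at load time; the schema, table, function, and policy names are illustrative, not from the original post.

CREATE OR REPLACE FUNCTION hide_purged_rows (
  p_schema IN VARCHAR2,
  p_table  IN VARCHAR2
) RETURN VARCHAR2
IS
BEGIN
  -- predicate silently appended to every query against the protected table
  RETURN 'purge_flag = ''N''';
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'APP_OWNER',
    object_name     => 'BIG_TABLE',
    policy_name     => 'HIDE_PURGED_ROWS',
    function_schema => 'APP_OWNER',
    policy_function => 'HIDE_PURGED_ROWS',
    statement_types => 'SELECT');
END;
/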

               

              Post more info about the REAL issue that needs to be solved (i.e. how to implement those 'policies') and then we can help you find a REAL answer.

               

For just your 'how to delete a bunch of rows' question, I suggest you start at the beginning.

               

              You said you are deleting in 5k chunks.

               

              WHY?

               

              I suggest that the FIRST test you perform is to just do a DELETE statement and gather the metrics on how long it takes and the IO/REDO/ETC that is involved.

               

              You need a baseline to work from. The best way to do a delete is usually to just do a delete. Until you measure that you have no baseline. You haven't told us ANYTHING about the metrics for the column you are basing the delete on. Is there an index on it? How many unique values does it have? Are you deleting RANDOM values from it? Or are you deleting a specific SET of values?
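A minimal sketch of that baseline test, assuming a hypothetical PURGE_DATE column drives the delete (table and column names are illustrative). Run it in the test environment and compare the session statistics before and after:

SET TIMING ON

-- session redo/undo before the delete
SELECT sn.name, ms.value
FROM   v$mystat ms
JOIN   v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name IN ('redo size', 'undo change vector size');

DELETE FROM big_table
WHERE  purge_date < ADD_MONTHS(SYSDATE, -3);

-- same statistics after the delete; the difference is the cost of the single big DELETE
SELECT sn.name, ms.value
FROM   v$mystat ms
JOIN   v$statname sn ON sn.statistic# = ms.statistic#
WHERE  sn.name IN ('redo size', 'undo change vector size');

ROLLBACK;   -- keep the test repeatable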

               

              Start at the beginning - there are NO SHORTCUTS.

              • 19. Re: Bulk Deletes from a Large Table
                Martin Preiss

At this point I would vote up for Auto-Numbering of Comments - if I had not already. Of course that would not have helped in the given situation...

                • 20. Re: Bulk Deletes from a Large Table
                  himmy

Hi Iordan,

                   

Your points seem interesting. Could you please elaborate on what you mean by the quote below? I didn't get it.

                   

                  When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement

                  • 21. Re: Bulk Deletes from a Large Table
                    Your points seem interesting could you please elaborate what do you mean by below i didn't get it

                     

                    When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement

                    Can't speak for him but can offer a couple of related possibilities.

                     

                    OP said they were deleting in chunks of 5k rows. But they did NOT say how those chunks were determined.

                     

If an index was being used, then the index entries for all of the rows in a given table block are NOT necessarily co-located in the same leaf blocks of the index. That all depends on the columns in the index and the related clustering factor.

                     

                    If a full table scan is being used then one, or a few, rows in the block may be at the end of a 5k chunk and the remaining rows in the block may be in another chunk. So rows in the block would be split among multiple chunks.
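As a hypothetical illustration of keeping all deletable rows of a block in the same chunk, the candidate rowids could be ordered by file and block number before being split into batches. The table, the PURGE_DATE predicate, and the 5,000-row batch size are illustrative, not from the original post.

DECLARE
  CURSOR c IS
    SELECT rowid AS rid
    FROM   big_table
    WHERE  purge_date < ADD_MONTHS(SYSDATE, -3)
    ORDER  BY DBMS_ROWID.ROWID_RELATIVE_FNO(rowid),
              DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid);
  TYPE rid_tab IS TABLE OF ROWID;
  l_rids rid_tab;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l_rids LIMIT 5000;
    EXIT WHEN l_rids.COUNT = 0;
    -- all deletable rows of a given block now tend to land in the same batch
    FORALL i IN 1 .. l_rids.COUNT
      DELETE FROM big_table WHERE rowid = l_rids(i);
    COMMIT;   -- note: committing across an open cursor risks ORA-01555 on long runs
  END LOOP;
  CLOSE c;
END;
/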

                    • 22. Re: Bulk Deletes from a Large Table
                      JohnWatson2

                      Paul V. wrote:

                       

Thank you for some very interesting ideas.  There were some roadblocks to partitioning but after reading some of the linked articles, perhaps there are some workarounds that I had not thought of.  I am fortunate to have a test environment with the same hardware and space so I can run some real-world tests.  I will update when I determine the best path.  Thanks again.

                      If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.

                      • 23. Re: Bulk Deletes from a Large Table
                        Vsevolod Afanassiev

                        Simple suggestion: instead of running delete once per month run it every day. Deleting 7 million records should be easier than deleting 200 million.

                         

If you decide to use parallel delete then make sure that indexes have INITRANS and PCTFREE high enough: we have an application where every few days we delete a large number of records from a table using parallel DML with degree = 64. Performance improved after we increased INITRANS to 16.
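A minimal sketch of that approach, with illustrative table/index names and a hypothetical PURGE_DATE predicate; the degree of 64 and INITRANS of 16 are the values quoted above:

-- raise INITRANS (and PCTFREE if needed) on the indexes before highly concurrent deletes
ALTER INDEX big_table_ix1 REBUILD INITRANS 16 PCTFREE 20;

ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ PARALLEL(t 64) */ FROM big_table t
WHERE  purge_date < ADD_MONTHS(SYSDATE, -3);

COMMIT;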

                        • 24. Re: Bulk Deletes from a Large Table
                          Jonathan Lewis

                          Minor bit of hair-splitting, but deleting in bulk tends to result in bitmap space management blocks being updated, and Oracle does have some problems handling that part of the process in a timely fashion. In fact, the "delete 5,000 and commit" is one damage limitation strategy for one of the problems that Oracle used to have (may still have) with bitmap space management on large deletes.

                           

                          Regards

                          Jonathan Lewis

                          • 25. Re: Bulk Deletes from a Large Table
                            Jonathan Lewis

                            JohnWatson2 wrote:

                             

                            If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.

                             

                            I agree that looks like a potentially useful idea, though with the PK on col1 and a unique constraint on (col2, col3) there's still going to be at least one global index, and the partitioning may result in increased index probes to find the data at query time.

                             

                            The thought crossed my mind that this table (call it t1) might be an intersection table between tables t2 and t3, with col2 the PK of t2 and col3 the PK of t3, where t1.col1 [edited] has been introduced (logically redundantly) because some application development tool (or standard) said that all tables must have a single column numeric key.  If this is the case then (col2, col3) is the real PK, and an IOT of just those two columns (compressed on the first column) with a secondary index on (col3, col2) - probably similarly compressed - would appear to be the ideal solution.
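A possible DDL sketch of that idea, using the column names from the original post (constraint and index names are illustrative):

CREATE TABLE t1 (
  col2  NUMBER NOT NULL,   -- FK to t2
  col3  NUMBER NOT NULL,   -- FK to t3
  CONSTRAINT t1_pk PRIMARY KEY (col2, col3)
)
ORGANIZATION INDEX
COMPRESS 1;                -- compress on the leading column

CREATE INDEX t1_ix1 ON t1 (col3, col2) COMPRESS 1;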

                             

                            Partitioning (as also suggested by John Stegeman) on one or other of the two columns (perhaps by range, possibly by hash) could have several performance benefits, though with some impact on queries involving only the non-partitioning column. I'm assuming, when I say this, that the usual use of this table is to:  "find all rows from t3 where t2.column = something", or "find all rows from t2 where t3.column = something"; the partitioning column should be dictated by the more important/frequent of these two queries, and the number of partitions should be decided with a view to how many index probes would be needed "going the wrong way" and how much extra work this would introduce to the whole query.

                             

                            Regards

                            Jonathan Lewis

                             

                             

Edited for various typos, and a missing reference to col1.

                            • 26. Re: Bulk Deletes from a Large Table
                              Paul V.

Once again, thank you for all of the responses.  Since there have been some questions about the entire scope of the problem, I will try to explain things a little further.   The system is a "document warehouse" where we get documents from several sources (at the moment, we have 30 sources).  There is a main DOCUMENTS table which has a partition for each source.  The document data is XML which resides in a CLOB column and there are other non-CLOB columns in the DOCUMENTS table such as a numeric primary key and date loaded.  Our application searches each document looking for entities - names, addresses, SSNs, and other such things.  These values are written to an ENTITY table.  The table mentioned in the original post tells the application which entities are found in which documents.  For example, the name John Doe could appear in dozens of documents, but it would only have one record in the ENTITY table, while it could have several records in the table from the original post.  So, from the original post,

                               

                              COL1 is a unique integer which serves as primary key.

                              COL2 is the ID from the DOCUMENTS table

                              COL3 is the ID from the ENTITY table

                               

The documents table has 590,000,000 records and uses about 18 TB of disk space and, as mentioned, is partitioned.  The table in question was not originally partitioned, I assume, because of the small size per row and probably a lack of foresight into how large the table would grow (currently 8 billion records).  Our application loads data into these tables nearly around the clock, which has not been a problem for close to 10 years now.  The issue that I am trying to deal with is that one of our sources has decided that their data can only exist in our system for three months.  They provide the data once a month, so once a month we have to delete a month's worth of data before loading the latest month from this source.  I am trying to figure out how to do that with minimal impact to the rest of the system.  I am looking into partitioning the table from the original post and using a partition swap when delete time comes.
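As a hypothetical sketch of the partition-swap idea: if the table were partitioned so that the expiring month for that source mapped to its own partition, the purge could become a near-instant metadata operation (all object names below are illustrative):

-- swap the expiring partition with an empty table of the same shape ...
ALTER TABLE doc_entity_map
  EXCHANGE PARTITION p_src30_2015_04
  WITH TABLE doc_entity_map_expired
  INCLUDING INDEXES WITHOUT VALIDATION;

TRUNCATE TABLE doc_entity_map_expired;

-- ... or simply drop the expiring partition and maintain any global indexes
ALTER TABLE doc_entity_map
  DROP PARTITION p_src30_2015_04 UPDATE GLOBAL INDEXES;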

                               

                              I hope that explains a little better what the issue is.

                              • 27. Re: Bulk Deletes from a Large Table

                                I hope that explains a little better what the issue is.

                                Thanks for the additional info.

                                 

The info you provided suggests that you need a 'vendor' indicator and/or a 'security' indicator that can be used either to identify the data from a specific vendor or to identify data that requires different privileges (public, private, etc.).

                                 

Partitioning could then likely be used effectively to isolate data by vendor/security. Then your 'delete' or other data management operations could be performed on only one, or a limited number of, partitions.

                                 

                                Or you could consider 'manual' partitioning by using a new, separate table in the data model to hold data for time-sensitive vendors. Again the data management 'deletes' would only need to be performed on that new, much smaller, table.

                                 

                                You could use a view to provide a 'merged' view of the data. That second method is similar to the old 'partitioned view' methodology that was used before partitioning was introduced.
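A rough sketch of that 'manual partitioning' approach, with all names illustrative and assuming the side table carries whatever column (a load date is used here) is needed to drive the purge:

-- small side table holding only the time-limited source's rows
CREATE TABLE doc_entity_map_volatile AS
  SELECT * FROM doc_entity_map WHERE 1 = 0;

-- 'merged' view over both tables, in the style of the old partitioned views
CREATE OR REPLACE VIEW doc_entity_map_all AS
  SELECT * FROM doc_entity_map
  UNION ALL
  SELECT * FROM doc_entity_map_volatile;

-- the monthly purge now only touches the small table
-- (load_date is a hypothetical column; the real predicate depends on the data model)
DELETE FROM doc_entity_map_volatile
WHERE  load_date < ADD_MONTHS(SYSDATE, -3);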

                                • 28. Re: Bulk Deletes from a Large Table
                                  Jonathan Lewis

                                  From what you've said it sounds like:

                                   

                                  The DOCUMENTS table should have a primary key that includes the source (so that the primary key index can be locally partitioned), with a second "document within source" identifier.

                                   

                                  COL1 shouldn't exist on the table you've described.

                                   

                                  The intersection table should be (source_id, doc_within_source_id, entity_id) - probably IOT, partitioned by source_id, indexed on (source_id, doc_within_source_id, entity_id) compress 2.

There should be a secondary index on (entity_id, source_id, doc_within_source_id), locally partitioned, compress 2.
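A hypothetical DDL sketch of that design (table, constraint, index, and partition names are illustrative; one list partition per source is assumed):

CREATE TABLE doc_entity_map (
  source_id            NUMBER NOT NULL,
  doc_within_source_id NUMBER NOT NULL,
  entity_id            NUMBER NOT NULL,
  CONSTRAINT doc_entity_map_pk
    PRIMARY KEY (source_id, doc_within_source_id, entity_id)
)
ORGANIZATION INDEX
COMPRESS 2
PARTITION BY LIST (source_id) (
  PARTITION p_src_1 VALUES (1),
  PARTITION p_src_2 VALUES (2)
  -- ... one partition per source
);

CREATE INDEX doc_entity_map_ix1
  ON doc_entity_map (entity_id, source_id, doc_within_source_id)
  LOCAL COMPRESS 2;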

                                   

                                  Regards

                                  Jonathan Lewis

                                  • 29. Re: Bulk Deletes from a Large Table
                                    Iordan Iotzov

                                    Hi,

                                     

                                    Finally found some time for this…

                                    Since the explanation/test is rather long for an OTN post, I created a blog entry -

                                    https://iiotzov.wordpress.com/2015/07/27/importance-of-data-clustering-when-deleting-in-batches/

                                     

                                    Regards,

                                    Iordan Iotzov
