
Bulk Deletes from a Large Table


Answers

  • Unknown
    edited May 14, 2015 4:12PM
    Your points seem interesting. Could you please elaborate on what you mean by the below? I didn't get it:
    
    
    When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement
    
    

    Can't speak for him but can offer a couple of related possibilities.

    OP said they were deleting in chunks of 5k rows. But they did NOT say how those chunks were determined.

    If an index was being used, then all of the rows in a block are NOT necessarily colocated in the same leaf blocks of the index. That all depends on the columns in the index and the related clustering factor.

    If a full table scan is being used, then one, or a few, rows in the block may be at the end of a 5k chunk and the remaining rows in the block may be in another chunk. So the rows in the block would be split among multiple chunks.
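
    For example, something like the following (the table name, date column, and predicate are hypothetical) keeps each 5k chunk block-clustered by driving the delete from the target ROWIDs sorted in ROWID order:

    DECLARE
        TYPE rowid_tab IS TABLE OF ROWID;
        l_rowids  rowid_tab;
        CURSOR c IS
            SELECT rowid
            FROM   big_t
            WHERE  load_date < ADD_MONTHS(SYSDATE, -3)
            ORDER  BY rowid;     -- rows that share a block land in the same batch
    BEGIN
        OPEN c;
        LOOP
            FETCH c BULK COLLECT INTO l_rowids LIMIT 5000;
            EXIT WHEN l_rowids.COUNT = 0;
            FORALL i IN 1 .. l_rowids.COUNT
                DELETE FROM big_t WHERE rowid = l_rowids(i);
            COMMIT;              -- same "delete 5k and commit" pattern as the OP
        END LOOP;
        CLOSE c;
    END;
    /

    (Fetching across commits like this does risk ORA-01555 on a busy system, so undo retention needs to be adequate.)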

  • JohnWatson2
    JohnWatson2 Member Posts: 4,488 Silver Crown
    edited May 14, 2015 4:53PM
    Paul V. wrote:
    
    Thank you for some very interesting ideas.  There were some roadblocks to partitioning, but after reading some of the linked articles, perhaps there are some workarounds that I had not thought of.  I am fortunate to have a test environment with the same hardware and space so I can run some real-world tests.  I will update when I determine the best path.  Thanks again.
    

    If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.
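
    Purely as a sketch (using the COL1/COL2/COL3 naming from the original post; the new table name and the NUMBER datatypes are assumptions), the restructure might look like:

    CREATE TABLE t1_iot (
        col1  NUMBER NOT NULL,
        col2  NUMBER NOT NULL,
        col3  NUMBER NOT NULL,
        CONSTRAINT t1_iot_pk PRIMARY KEY (col1),
        CONSTRAINT t1_iot_uk UNIQUE (col2, col3)
    )
    ORGANIZATION INDEX;

    INSERT /*+ APPEND */ INTO t1_iot
    SELECT col1, col2, col3 FROM t1;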

  • Vsevolod Afanassiev
    Vsevolod Afanassiev Member Posts: 360 Bronze Badge
    edited May 15, 2015 4:12AM

    Simple suggestion: instead of running the delete once per month, run it every day. Deleting 7 million records should be easier than deleting 200 million.

    If you decide to use a parallel delete, then make sure that the indexes have INITRANS and PCTFREE set high enough: we have an application where every few days we delete a large number of records from a table using parallel DML with degree = 64. Performance improved after we increased INITRANS to 16.
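
    Roughly like this (the index name, the exact settings, and the delete predicate are just placeholders):

    -- PCTFREE on an index can only be changed via a rebuild, which also applies
    -- the new INITRANS to the existing blocks.
    ALTER INDEX big_t_ix REBUILD PCTFREE 20 INITRANS 16;

    -- Parallel DML has to be enabled at session level for the delete itself to run in parallel.
    ALTER SESSION ENABLE PARALLEL DML;

    DELETE /*+ PARALLEL(t, 64) */ FROM big_t t
    WHERE  load_date < ADD_MONTHS(SYSDATE, -3);

    COMMIT;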

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,058 Blue Diamond
    edited May 15, 2015 4:44AM

    Minor bit of hair-splitting, but deleting in bulk tends to result in bitmap space management blocks being updated, and Oracle does have some problems handling that part of the process in a timely fashion. In fact, the "delete 5,000 and commit" is one damage limitation strategy for one of the problems that Oracle used to have (may still have) with bitmap space management on large deletes.

    Regards

    Jonathan Lewis

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,058 Blue Diamond
    edited May 15, 2015 7:49AM
    JohnWatson2 wrote:
    
    If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.
    
    
    

    I agree that looks like a potentially useful idea, though with the PK on col1 and a unique constraint on (col2, col3) there's still going to be at least one global index, and the partitioning may result in increased index probes to find the data at query time.

    The thought crossed my mind that this table (call it t1) might be an intersection table between tables t2 and t3, with col2 the PK of t2 and col3 the PK of t3, where t1.col1 [edited] has been introduced (logically redundantly) because some application development tool (or standard) said that all tables must have a single column numeric key.  If this is the case then (col2, col3) is the real PK, and an IOT of just those two columns (compressed on the first column) with a secondary index on (col3, col2) - probably similarly compressed - would appear to be the ideal solution.

    Partitioning (as also suggested by John Stegeman) on one or other of the two columns (perhaps by range, possibly by hash) could have several performance benefits, though with some impact on queries involving only the non-partitioning column. I'm assuming, when I say this, that the usual use of this table is to:  "find all rows from t3 where t2.column = something", or "find all rows from t2 where t3.column = something"; the partitioning column should be dictated by the more important/frequent of these two queries, and the number of partitions should be decided with a view to how many index probes would be needed "going the wrong way" and how much extra work this would introduce to the whole query.
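
    As a sketch under those assumptions (the new table name, the partition count, and the choice of hash partitioning on col2 are illustrative only):

    CREATE TABLE t1_new (
        col2  NUMBER NOT NULL,                 -- PK of t2
        col3  NUMBER NOT NULL,                 -- PK of t3
        CONSTRAINT t1_new_pk PRIMARY KEY (col2, col3)
    )
    ORGANIZATION INDEX
    COMPRESS 1                                 -- compress on the leading column
    PARTITION BY HASH (col2) PARTITIONS 16;

    -- "Going the other way": similarly compressed; could be created LOCAL instead,
    -- at the cost of extra index probes for queries driven by col3 alone.
    CREATE INDEX t1_new_ix ON t1_new (col3, col2)
        COMPRESS 1;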

    Regards

    Jonathan Lewis

    Edited for various typos, and a missing reference to col1.

  • Paul V.
    Paul V. Member Posts: 42 Blue Ribbon
    edited May 15, 2015 9:23AM

    Once again, thank you for all of the responses.  Since there have been some questions about the entire scope of the problem, I will try to explain things a little further.  The system is a "document warehouse" where we get documents from several sources (at the moment, we have 30 sources).  There is a main DOCUMENTS table which has a partition for each source.  The document data is XML which resides in a CLOB column, and there are other non-CLOB columns in the DOCUMENTS table such as a numeric primary key and date loaded.  Our application searches each document looking for entities - names, addresses, SSNs, and other such things.  These values are written to an ENTITY table.  The table mentioned in the original post tells the application which entities are found in which documents.  For example, the name John Doe could appear in dozens of documents, but it would have only one record in the ENTITY table and possibly several records in the table from the original post.  So, from the original post,

    COL1 is a unique integer which serves as primary key.

    COL2 is the ID from the DOCUMENTS table

    COL3 is the ID from the ENTITY table

    The documents table has 590,000,000 records, uses about 18 TB of disk space and, as mentioned, is partitioned.  The table in question was not originally partitioned, I assume, because of the small size per row and probably a lack of foresight into how large the table would grow (currently 8 billion records).  Our application loads data into these tables nearly around the clock, which has not been a problem for close to 10 years now.  The issue that I am trying to deal with is that one of our sources has decided that their data can only exist in our system for three months.  They provide the data once a month, so once a month we have to delete a month's worth of data before loading the latest month from this source.  I am trying to figure out how to do that with minimal impact to the rest of the system.  I am looking into partitioning the table from the original post and using a partition swap when delete time comes.
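
    For the partition swap, I am picturing something along these lines (the partition and staging-table names are just placeholders), assuming the table is partitioned so that the expiring month from that source lands in its own partition and doc_entity_stage is an empty table with matching structure and indexes:

    ALTER TABLE doc_entity
        EXCHANGE PARTITION p_expiring_month
        WITH TABLE doc_entity_stage
        INCLUDING INDEXES
        WITHOUT VALIDATION;

    -- The expired rows now sit in the staging table, which can be truncated or dropped.
    TRUNCATE TABLE doc_entity_stage;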

    I hope that explains a little better what the issue is.

  • Unknown
    edited May 15, 2015 10:56AM
    I hope that explains a little better what the issue is.
    

    Thanks for the additional info.

    The info you provided suggests that you need a 'vendor' indicator and/or a 'security' indicator that can be used either to isolate the data from a specific vendor or to identify data that meets different privilege levels (public, private, etc.).

    Partitioning could then likely be used effectively to isolate data by vendor/security. Then your 'delete' or other data management operations could be performed on only one, or a limited number of, partitions.

    Or you could consider 'manual' partitioning by using a new, separate table in the data model to hold data for time-sensitive vendors. Again the data management 'deletes' would only need to be performed on that new, much smaller, table.

    You could use a view to provide a 'merged' view of the data. That second method is similar to the old 'partitioned view' methodology that was used before partitioning was introduced.
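
    A rough sketch of that second approach (the table names are invented; the columns follow the original post):

    -- Rows from time-limited sources live in their own, much smaller table;
    -- a UNION ALL view presents the merged data to the application.
    CREATE OR REPLACE VIEW doc_entity_all AS
        SELECT col1, col2, col3 FROM doc_entity           -- long-retention sources
        UNION ALL
        SELECT col1, col2, col3 FROM doc_entity_90day;    -- time-limited source(s)

    -- The monthly purge then only has to run against doc_entity_90day.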

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,058 Blue Diamond
    edited May 15, 2015 11:11AM

    From what you've said it sounds like:

    The DOCUMENTS table should have a primary key that includes the source (so that the primary key index can be locally partitioned), with a second "document within source" identifier.

    COL1 shouldn't exist on the table you've described.

    The intersection table should be (source_id, doc_within_source_id, entity_id) - probably IOT, partitioned by source_id, indexed on (source_id, doc_within_source_id, entity_id) compress 2.

    There should be a secondary index on (entity_id, source_id, doc_within_source_id), locally partitioned, compress 2.
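
    Expressed as DDL, that might look roughly like the following (the object names, the list-partitioning layout, and the NUMBER datatypes are illustrative):

    CREATE TABLE doc_entity_new (
        source_id             NUMBER NOT NULL,
        doc_within_source_id  NUMBER NOT NULL,
        entity_id             NUMBER NOT NULL,
        CONSTRAINT doc_entity_new_pk
            PRIMARY KEY (source_id, doc_within_source_id, entity_id)
    )
    ORGANIZATION INDEX
    COMPRESS 2
    PARTITION BY LIST (source_id) (
        PARTITION p_source_1 VALUES (1),
        PARTITION p_source_2 VALUES (2)
        -- ... one partition per source
    );

    CREATE INDEX doc_entity_new_ix
        ON doc_entity_new (entity_id, source_id, doc_within_source_id)
        COMPRESS 2
        LOCAL;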

    Regards

    Jonathan Lewis

  • Iordan Iotzov
    Iordan Iotzov Member Posts: 750 Silver Badge
    edited Jul 27, 2015 4:25PM

    Hi,

    Finally found some time for this…

    Since the explanation/test is rather long for an OTN post, I created a blog entry -

    https://iiotzov.wordpress.com/2015/07/27/importance-of-data-clustering-when-deleting-in-batches/

    Regards,

    Iordan Iotzov

This discussion has been closed.