Oracle Database Discussions

Bulk Deletes from a Large Table

Paul V.May 14 2015 — edited Jul 27 2015

DB: 11.2.0.3

OS: Linux, kernel 2.6.18

I have the following three column table,

Col1  not null number

Col2  not null number

Col3  not null number

Col1 is the primary key.  There is a unique index on Col2/Col3 combined.  There is a non-unique index on Col3.

Table has around 8 billion records and it is not partitioned.
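
For reference, the structure is essentially this (names are made up for illustration):

-- Illustrative DDL only; the real names are different.
CREATE TABLE big_table (
  col1  NUMBER NOT NULL,
  col2  NUMBER NOT NULL,
  col3  NUMBER NOT NULL,
  CONSTRAINT big_table_pk PRIMARY KEY (col1)
);

CREATE UNIQUE INDEX big_table_uk  ON big_table (col2, col3);
CREATE INDEX        big_table_ix3 ON big_table (col3);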

Once a month, I have to delete approximately 200,000,000 records.  I know this is a generic question, but I would appreciate any suggestions on the best way to accomplish this.  I have been deleting in chunks of about 5,000 records at a time, but that is taking longer than I would like.  Am I missing some obvious way to solve this problem?

For background, when this table/application was designed about eight years ago, we were never going to delete data from it.  Now there are some new "policies" concerning some of the data that resides here, which has created this new delete requirement.

Comments

unknown-951199

Paul V. wrote:

DB: 11.2.0.3

OS: Linux, kernel 2.6.18

I have the following three column table,

Col1  not null number

Col2  not null number

Col3  not null number

Col1 is the primary key.  There is a unique index on Col2/Col3 combined.  There is a non-unique index on Col3.

Table has around 8 billion records and it is not partitioned.

Once a month, I have to delete approximately 200,000,000 records.  I know this is a generic question, but I would appreciate any suggestions on the best way to accomplish this.  I have been deleting in chunks of about 5,000 records at a time, but that is taking longer than I would like.  Am I missing some obvious way to solve this problem?

For background, when this table/application was designed about eight years ago, we were never going to delete data from it.  Now there are some new "policies" concerning some of the data that resides here, which has created this new delete requirement.

How is the decision made as to which rows get removed?

> I have been deleting in chunks of about 5,000 records at a time

Multiple COMMITs increase elapsed time.

Paul V.

Currently deletes are based on Col2 values found in a different table, but I could write some PL/SQL to utilize the primary key instead.

Cobert

At present, are you running a straightforward delete, or are you doing row-by-row processing with context switching? See this Oracle Magazine piece on bulk processing; it may be of some help: PLSQL 101

Quite tricky with that many rows though!
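
Purely as an illustration of the bulk-processing idea (BIG_TABLE and PURGE_LIST are made-up names, PURGE_LIST standing in for whatever drives the delete), something along these lines:

-- Collect the primary keys of the doomed rows and delete them in large
-- batches, with one commit per batch instead of one per row.
DECLARE
  TYPE t_ids IS TABLE OF big_table.col1%TYPE;
  l_ids t_ids;
  CURSOR c IS
    SELECT b.col1
      FROM big_table b
      JOIN purge_list p ON p.col2 = b.col2;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l_ids LIMIT 50000;   -- bigger chunks, fewer commits
    EXIT WHEN l_ids.COUNT = 0;
    FORALL i IN 1 .. l_ids.COUNT
      DELETE FROM big_table WHERE col1 = l_ids(i);
    COMMIT;   -- note: fetching across commits risks ORA-01555 on a busy table
  END LOOP;
  CLOSE c;
END;
/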

Iordan Iotzov

I cannot offer dramatic improvements, just some ideas that can help a little bit:

  1. As already mentioned by sol.beach, you can make the chunks larger and thus reduce the number of commits.
  2. When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement. Reviewing the index clustering factor can be useful in determining the best way forward.
  3. Consider going parallel (DBMS_PARALLEL_EXECUTE or parallel DML) - a rough sketch follows below. Please note that each of these methods (particularly parallel DML) comes with some restrictions/limitations.
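
A rough sketch of option 3 with DBMS_PARALLEL_EXECUTE (APP_OWNER, BIG_TABLE and PURGE_LIST are hypothetical names) - it splits the delete into rowid-range chunks and runs them as parallel jobs:

BEGIN
  DBMS_PARALLEL_EXECUTE.create_task(task_name => 'purge_big_table');

  -- Carve the table into rowid-range chunks of roughly 100,000 rows.
  DBMS_PARALLEL_EXECUTE.create_chunks_by_rowid(
    task_name   => 'purge_big_table',
    table_owner => 'APP_OWNER',
    table_name  => 'BIG_TABLE',
    by_row      => TRUE,
    chunk_size  => 100000);

  -- Run the delete for each chunk across 8 job slaves; the framework
  -- commits after each chunk.
  DBMS_PARALLEL_EXECUTE.run_task(
    task_name      => 'purge_big_table',
    sql_stmt       => 'DELETE FROM app_owner.big_table b
                        WHERE b.rowid BETWEEN :start_id AND :end_id
                          AND b.col2 IN (SELECT col2 FROM app_owner.purge_list)',
    language_flag  => DBMS_SQL.native,
    parallel_level => 8);

  -- In real use, check DBMS_PARALLEL_EXECUTE.task_status before dropping.
  DBMS_PARALLEL_EXECUTE.drop_task(task_name => 'purge_big_table');
END;
/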

HTH,

Iordan Iotzov

John Stegeman

Well, let me throw out the obvious - have you thought about partitioning the table?

Oyvind Isene

Besides suggesting to partition the table (as previous poster did), you can look into a simple routine that creates a new table with select (CTAS) as explained by Tom Kyte here.

If partitioning is possible, you can use the package dbms_redefinition to do so. This recipe by Dr Tim Hall shows you how.
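
For what it's worth, the CTAS route is roughly this (hypothetical names; indexes, constraints and grants have to be recreated afterwards, and the application needs an outage):

-- Build a new table containing only the rows to keep, then swap names.
CREATE TABLE big_table_new NOLOGGING PARALLEL 8 AS
  SELECT /*+ parallel(b 8) */ b.*
    FROM big_table b
   WHERE NOT EXISTS (SELECT NULL FROM purge_list p WHERE p.col2 = b.col2);

-- Recreate the primary key, the unique (col2, col3) index, the col3 index,
-- grants and statistics, then:
ALTER TABLE big_table     RENAME TO big_table_old;
ALTER TABLE big_table_new RENAME TO big_table;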

John Stegeman

I wouldn't ordinarily suggest CTAS when faced with deleting 200 million out of 8 billion records. Copying the remaining 7.8 billion records is a lot of I/O.

Oyvind Isene

If I could afford the space I would certainly have tested it. Deleting 200 million rows may generate a lot of extra I/O from redo, index maintenance, and space management.

John Stegeman

deleting rows = no space management

Oyvind Isene

True. But the next inserts, unless the APPEND hint is used, will try to fill those empty blocks, right? Looks like I should create an experiment and blog it.

John Stegeman

Oyvind - yes, but you said "Deleting 200 million rows may generate... and space management". I was pointing out that space management doesn't apply to the "delete" scenario.

If partitioning is an option for the OP and the table can be partitioned in a way that makes the delete a "partition swap" activity - that will be, by FAR, the fastest option with the least I/O
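
To illustrate the swap - assuming the table could be (re)built partitioned so the doomed rows sit in their own partition (all names hypothetical):

-- Exchange the partition holding the doomed rows with an empty table of the
-- same shape; the data "disappears" as a dictionary operation.
CREATE TABLE big_table_empty AS
  SELECT * FROM big_table WHERE 1 = 0;

ALTER TABLE big_table
  EXCHANGE PARTITION p_purge WITH TABLE big_table_empty
  WITHOUT VALIDATION;    -- near-instant; global indexes still need maintenance

DROP TABLE big_table_empty PURGE;   -- the unwanted rows go with it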

Paul V.

Thank you for some very interesting ideas.  There were some roadblocks to partitioning, but after reading some of the linked articles, perhaps there are some workarounds that I had not thought of.  I am fortunate to have a test environment with the same hardware and space, so I can run some real-world tests.  I will update when I determine the best path.  Thanks again.

Oyvind Isene

Agree 100% with respect to partitioning. I mentioned CTAS in case partitioning was impossible because of licensing or other reasons.

Jonathan Lewis

There are a few strategies that might optimise this delete, and the choice depends on things like clustering, the number of rows per value of col2, whether or not all the rows for a given col2 would be deleted, etc. Part of the problem is minimising read-consistency costs; part is to do with the effects of random I/O.

E.g. if there are many, well-clustered rows per col2, then a bulk load of all the target col2 values followed by an index-driven delete for each value in order could minimise undo, redo, random I/O, buffer visits and block changes.

This relates to an interesting side effect of different execution paths that you should consider when doing large deletes: https://jonathanlewis.wordpress.com/2006/11/22/tuning-updates/  The article is old (2006) but still relevant, and it holds for deletes even though it talks mostly about updates.
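
A sketch of that approach, with PURGE_LIST standing in for a staging table of the target col2 values and the index name purely illustrative:

-- Load the target col2 values once, walk them in order, and let each delete
-- be driven by the (col2, col3) index so the work for one value stays together.
BEGIN
  FOR r IN (SELECT DISTINCT col2 FROM purge_list ORDER BY col2) LOOP
    DELETE /*+ index(b big_table_uk) */ FROM big_table b
     WHERE b.col2 = r.col2;
    COMMIT;   -- in practice, commit every N values to limit the number of commits
  END LOOP;
END;
/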

Regards

Jonathan Lewis

jgarry

The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues.

unknown-7404

jgarry wrote:

The last line in Alberto's comment #5 seems prescient in describing more recent undo performance issues.

Huh? What 'Alberto' are you talking about? Comment #5 was by John Stegeman - and his reply was a one-liner.

Jonathan Lewis

Joel meant comment 5 of the blog post I referenced.

Regards

Jonathan Lewis

unknown-7404

Unless we know how the data is really used, it's hard to give you any specific advice.

Table has around 8 billion records and it is not partitioned.

Why wasn't it partitioned?

Tell us the history of the table:

1. how often are rows added? how many rows are added each time?

2. how often are rows deleted? (other than your monthly)

3. how often are rows updated?

Once a month, I have to delete approximately 200,000,000 records.

For background, when this table/application was designed about eight years ago, we were never going to delete data from it.

Now there are some new "policies" concerning some of the data that resides here causing this new delete requirement.

What 'policies' are you talking about?

So far you haven't mentioned ANY performance or other issues (duplicates, running out of space) with the table or the amount of data.

Which suggests that the 'policies' are perhaps related ONLY to access privileges. If so then RLS (row level security) could be used to flag those rows so they can't be accessed any more. Then you don't have to delete them at all.

And if you designed an RLS strategy that took into account HOW new data is added and HOW data will need to be deleted then you might be able to set an appropriate new flag column that would get populated when the data is first loaded and that the RLS policy could use automatically to make that set of data 'invisible' at the appropriate time.
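
As a rough sketch of that idea (EXPIRY_DATE, HIDE_EXPIRED and the owner/table names are all made up):

-- A hypothetical EXPIRY_DATE column, populated at load time for sources whose
-- data must become invisible after a fixed period (NULL = keep forever).
ALTER TABLE big_table ADD (expiry_date DATE);

CREATE OR REPLACE FUNCTION hide_expired (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2) RETURN VARCHAR2
IS
BEGIN
  -- Predicate appended transparently to every query against the table.
  RETURN '(expiry_date IS NULL OR expiry_date > SYSDATE)';
END;
/

BEGIN
  DBMS_RLS.add_policy(
    object_schema   => 'APP_OWNER',
    object_name     => 'BIG_TABLE',
    policy_name     => 'big_table_expiry',
    function_schema => 'APP_OWNER',
    policy_function => 'HIDE_EXPIRED',
    statement_types => 'SELECT');
END;
/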

Post more info about the REAL issue that needs to be solved (i.e. how to implement those 'policies') and then we can help you find a REAL answer.

For just your 'how to delete a bunch of rows' question, I suggest you start at the beginning.

You said you are deleting in 5k chunks.

WHY?

I suggest that the FIRST test you perform is to just do a DELETE statement and gather the metrics on how long it takes and the IO/REDO/ETC that is involved.
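
Something like this in the test environment, with PURGE_LIST again standing in for whatever drives the delete:

-- Baseline test, run in the test environment (SQL*Plus shown).
SET TIMING ON

-- Snapshot a few session statistics before the delete ...
SELECT sn.name, ms.value
  FROM v$mystat ms JOIN v$statname sn ON sn.statistic# = ms.statistic#
 WHERE sn.name IN ('redo size', 'undo change vector size',
                   'physical reads', 'db block changes');

DELETE FROM big_table
 WHERE col2 IN (SELECT col2 FROM purge_list);

-- ... then re-run the statistics query (before committing or rolling back)
-- and diff the values to see what one straight delete really costs.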

You need a baseline to work from. The best way to do a delete is usually to just do a delete. Until you measure that, you have no baseline. You haven't told us ANYTHING about the metrics for the column you are basing the delete on. Is there an index on it? How many unique values does it have? Are you deleting RANDOM values from it? Or are you deleting a specific SET of values?

Start at the beginning - there are NO SHORTCUTS.

Martin Preiss

At this point I would vote this up - if I had not already. Of course that would not have helped in the given situation...

himmy

Hi Iordan,

Your points seem interesting. Could you please elaborate on what you mean by the point below? I didn't get it:

When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement

unknown-7404
Your points seem interesting. Could you please elaborate on what you mean by the point below? I didn't get it:

When going after the deleted data, take into account how it is clustered. If one DB block contains 10 records to be deleted, it is better to have all 10 of them deleted in one shot, instead of each of the 10 records deleted with a separate statement

Can't speak for him but can offer a couple of related possibilities.

OP said they were deleting in chunks of 5k rows. But they did NOT say how those chunks were determined.

If an index was being used, then the entries for all of the rows in a given table block are NOT necessarily colocated in the same leaf blocks of the index. That all depends on the columns in the index and the related clustering factor.

If a full table scan is being used then one, or a few, rows in the block may be at the end of a 5k chunk and the remaining rows in the block may be in another chunk. So rows in the block would be split among multiple chunks.
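
One quick way to see how well the delete key clusters with the table data (table name hypothetical):

-- Clustering factor of the existing indexes.
SELECT index_name, clustering_factor, num_rows, leaf_blocks
  FROM user_indexes
 WHERE table_name = 'BIG_TABLE';
-- A clustering_factor close to the number of table blocks means rows for a
-- given key tend to share blocks; close to num_rows means they are scattered.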

JohnWatson2

Paul V. wrote:

Thank you for some very interesting ideas.  There were some roadblocks to partitioning, but after reading some of the linked articles, perhaps there are some workarounds that I had not thought of.  I am fortunate to have a test environment with the same hardware and space, so I can run some real-world tests.  I will update when I determine the best path.  Thanks again.

If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.
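
Roughly like this, for illustration (names made up):

-- Illustrative only: the existing three columns as an index-organized table.
CREATE TABLE big_table_iot (
  col1  NUMBER NOT NULL,
  col2  NUMBER NOT NULL,
  col3  NUMBER NOT NULL,
  CONSTRAINT big_table_iot_pk PRIMARY KEY (col1)
) ORGANIZATION INDEX;

-- Secondary indexes replace the current unique and non-unique indexes.
CREATE UNIQUE INDEX big_table_iot_uk  ON big_table_iot (col2, col3);
CREATE INDEX        big_table_iot_ix3 ON big_table_iot (col3);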

Vsevolod Afanassiev

Simple suggestion: instead of running the delete once per month, run it every day. Deleting 7 million records should be easier than deleting 200 million.

If you decide to use parallel delete, then make sure that the indexes have INITRANS and PCTFREE high enough: we have an application where every few days we delete a large number of records from a table using parallel DML with degree = 64. Performance improved after we increased INITRANS to 16.
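
For illustration only (index names made up; parallel DML has its own restrictions, e.g. around non-partitioned tables, so test it on your version first):

-- Rebuild the indexes with more ITL slots and free space before heavy
-- concurrent deletes (rebuilding indexes of this size is itself a big job).
ALTER INDEX big_table_uk  REBUILD INITRANS 16 PCTFREE 20 PARALLEL 8;
ALTER INDEX big_table_ix3 REBUILD INITRANS 16 PCTFREE 20 PARALLEL 8;

-- Parallel DML must be enabled per session.
ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ parallel(b 16) */ FROM big_table b
 WHERE b.col2 IN (SELECT col2 FROM purge_list);
COMMIT;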

Jonathan Lewis

Minor bit of hair-splitting, but deleting in bulk tends to result in bitmap space management blocks being updated, and Oracle does have some problems handling that part of the process in a timely fashion. In fact, the "delete 5,000 and commit" is one damage limitation strategy for one of the problems that Oracle used to have (may still have) with bitmap space management on large deletes.

Regards

Jonathan Lewis

Jonathan Lewis

JohnWatson2 wrote:

If restructuring the table really is a possibility, you might consider making it an IOT. You are already indexing every column (one column twice) so you don't really need the table at all. An IOT might speed up a few operations. And you can partition it, of course.

I agree that looks like a potentially useful idea, though with the PK on col1 and a unique constraint on (col2, col3) there's still going to be at least one global index, and the partitioning may result in increased index probes to find the data at query time.

The thought crossed my mind that this table (call it t1) might be an intersection table between tables t2 and t3, with col2 the PK of t2 and col3 the PK of t3, where t1.col1 [edited] has been introduced (logically redundantly) because some application development tool (or standard) said that all tables must have a single column numeric key.  If this is the case then (col2, col3) is the real PK, and an IOT of just those two columns (compressed on the first column) with a secondary index on (col3, col2) - probably similarly compressed - would appear to be the ideal solution.

Partitioning (as also suggested by John Stegeman) on one or other of the two columns (perhaps by range, possibly by hash) could have several performance benefits, though with some impact on queries involving only the non-partitioning column. I'm assuming, when I say this, that the usual use of this table is to:  "find all rows from t3 where t2.column = something", or "find all rows from t2 where t3.column = something"; the partitioning column should be dictated by the more important/frequent of these two queries, and the number of partitions should be decided with a view to how many index probes would be needed "going the wrong way" and how much extra work this would introduce to the whole query.
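
In DDL terms that idea is roughly the following (names are illustrative):

-- (col2, col3) as the real key: a two-column IOT compressed on the leading
-- column, plus a compressed secondary index going "the other way".
CREATE TABLE doc_entity_iot (
  col2  NUMBER NOT NULL,     -- document id
  col3  NUMBER NOT NULL,     -- entity id
  CONSTRAINT doc_entity_iot_pk PRIMARY KEY (col2, col3)
) ORGANIZATION INDEX COMPRESS 1;

CREATE INDEX doc_entity_iot_rev ON doc_entity_iot (col3, col2) COMPRESS 1;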

Regards

Jonathan Lewis

Edited for various typos, and a missing reference to co1.

Paul V.

Once again, thank you for all of the responses.  Since there have been some questions about the entire scope of the problem, I will try to explain things a little further.   The system is a "document warehouse" where we get documents from several sources (at the moment, we have 30 sources).  There is a main DOCUMENTS table which has a partition for each source.  The document data is XML which resides in a CLOB column, and there are other non-CLOB columns in the DOCUMENTS table such as a numeric primary key and date loaded.  Our application searches each document looking for entities - names, addresses, SSNs, and other such things.  These values are written to an ENTITY table.  The table mentioned in the original post tells the application which entities are found in which documents.  For example, the name John Doe could appear in dozens of documents, but it would have only one record in the ENTITY table and could have several records in the table from the original post.  So, from the original post,

COL1 is a unique integer which serves as the primary key.

COL2 is the ID from the DOCUMENTS table.

COL3 is the ID from the ENTITY table.

The documents table has 590,000,000 records and uses about 18 TB of disk space and, as mentioned, is partitioned.  The table in question was not originally partitioned, I assume, because of the small size per row and probably a lack of foresight into how large the table would grow (currently 8 billion records).  Our application loads data into these tables nearly around the clock, which has not been a problem for close to 10 years now.  The issue that I am trying to deal with is that one of our sources has decided that their data can only exist in our system for three months.  They provide the data once a month, so once a month we have to delete a month's worth of data before loading the latest month from this source.  I am trying to figure out how to do that with minimal impact to the rest of the system.  I am looking into partitioning the table from the original post and using a partition swap when delete time comes.

I hope that explains a little better what the issue is.

unknown-7404

I hope that explains a little better what the issue is.

Thanks for the additional info.

The info you provided suggests that you need a 'vendor' indicator and/or a 'security' indicator that can be used either to identify the data from a specific vendor or to identify data that requires different privileges (public, private, etc.).

Partitioning could then likely be used effectively to isolate data by vendor/security. Then your 'delete' or other data management operations could be performed only on one, or a limited number of, partitions.

Or you could consider 'manual' partitioning by using a new, separate table in the data model to hold data for time-sensitive vendors. Again the data management 'deletes' would only need to be performed on that new, much smaller, table.

You could use a view to provide a 'merged' view of the data. That second method is similar to the old 'partitioned view' methodology that was used before partitioning was introduced.
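
A minimal sketch of that second, 'manual partitioning' method (all names made up):

-- Rows for the time-limited vendor live in their own, much smaller table;
-- everything is presented to the application through a UNION ALL view.
CREATE TABLE doc_entity_std   AS SELECT * FROM big_table WHERE 1 = 0;
CREATE TABLE doc_entity_vendx AS SELECT * FROM big_table WHERE 1 = 0;

CREATE OR REPLACE VIEW doc_entity_all AS
  SELECT col1, col2, col3 FROM doc_entity_std
  UNION ALL
  SELECT col1, col2, col3 FROM doc_entity_vendx;

-- The monthly purge then only has to run against DOC_ENTITY_VENDX.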

Jonathan Lewis

From what you've said it sounds like:

The DOCUMENTS table should have a primary key that includes the source (so that the primary key index can be locally partitioned), with a second "document within source" identifier.

COL1 shouldn't exist on the table you've described.

The intersection table should be (source_id, doc_within_source_id, entity_id) - probably IOT, partitioned by source_id, indexed on (source_id, doc_within_source_id, entity_id) compress 2.

There should be a secondary index on (entity_id, source_id, doc_within_source_id), locally partitioned, compress 2.
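
Expressed as DDL, that layout would be roughly (names and partition values are illustrative):

-- Each source gets its own partition, so the monthly purge for one source
-- becomes a partition-level operation instead of a delete across the table.
CREATE TABLE doc_entity (
  source_id             NUMBER NOT NULL,
  doc_within_source_id  NUMBER NOT NULL,
  entity_id             NUMBER NOT NULL,
  CONSTRAINT doc_entity_pk
    PRIMARY KEY (source_id, doc_within_source_id, entity_id)
) ORGANIZATION INDEX COMPRESS 2
  PARTITION BY LIST (source_id) (
    PARTITION p_src_1 VALUES (1),
    PARTITION p_src_2 VALUES (2)
    -- one partition per source
  );

CREATE INDEX doc_entity_ent_ix
  ON doc_entity (entity_id, source_id, doc_within_source_id)
  LOCAL COMPRESS 2;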

Regards

Jonathan Lewis

Iordan Iotzov

Hi,

Finally found some time for this…

Since the explanation/test is rather long for an OTN post, I created a blog entry -

https://iiotzov.wordpress.com/2015/07/27/importance-of-data-clustering-when-deleting-in-batches/

Regards,

Iordan Iotzov
