
Oracle Database Discussions


which one is better

ASK_NM Nov 25 2011 — edited Nov 29 2011
When we want to update more than 100 million records, is a bulk update or a MERGE statement better? When I used a bulk statement it took 24 seconds to update 10,000 records, and when I used MERGE it took only 89 milliseconds. What is the difference between them?
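For reference, the two approaches being compared look roughly like this. This is only a sketch — the table `big_table`, the staging table `updates_stage`, and the columns `id` and `amount` are made-up names for illustration:

```sql
-- Approach 1: a single MERGE statement, done entirely in the SQL engine.
MERGE INTO big_table t
USING updates_stage s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount;

-- Approach 2: a PL/SQL "bulk" update (BULK COLLECT + FORALL). The data is
-- pulled into PL/SQL collections and pushed back, adding context switches
-- between the PL/SQL and SQL engines that the MERGE avoids.
DECLARE
  TYPE t_ids  IS TABLE OF big_table.id%TYPE;
  TYPE t_amts IS TABLE OF big_table.amount%TYPE;
  l_ids  t_ids;
  l_amts t_amts;
BEGIN
  SELECT id, amount BULK COLLECT INTO l_ids, l_amts FROM updates_stage;
  FORALL i IN 1 .. l_ids.COUNT
    UPDATE big_table SET amount = l_amts(i) WHERE id = l_ids(i);
  COMMIT;
END;
/
```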

Comments

orawiss
Nag wrote:
When we want to update more than 100 million records, is a bulk update or a MERGE statement better? When I used a bulk statement it took 24 seconds to update 10,000 records, and when I used MERGE it took only 89 milliseconds. What is the difference between them?
What about creating a new table, inserting the updated records into it, dropping the original table and then renaming the new table to the original name?
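The rebuild suggestion above, sketched with the same hypothetical names (`big_table`, `updates_stage`) — assuming a mass change, since a rebuild can use direct-path inserts and avoids generating undo for every changed row:

```sql
-- Build a replacement table, taking the staged value where one exists.
CREATE TABLE big_table_new NOLOGGING AS
SELECT t.id,
       NVL(s.amount, t.amount) AS amount
FROM   big_table t
LEFT JOIN updates_stage s ON s.id = t.id;

DROP TABLE big_table;
RENAME big_table_new TO big_table;
-- Note: indexes, constraints, grants and triggers on the original table
-- are lost by the drop and must be recreated on the new table.
```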
ASK_NM
Creating a new table would occupy a lot of space in the database, so that is not allowed. Moreover, the table has billions of records, of which only around 100-200 are updated.
user13653962
That's fine for a development environment or UAT, but on a 24/7 production OLTP system it would create a lot of mess.

@OP you need to benchmark for yourself which approach best fits the purpose; it all depends on your data.
sb92075
Nag wrote:
When we want to update more than 100 million records, is a bulk update or a MERGE statement better? When I used a bulk statement it took 24 seconds to update 10,000 records, and when I used MERGE it took only 89 milliseconds. What is the difference between them?
BULK update is "ROW by ROW" processing, which gives SLOW by SLOW results.
Billy Verreynne
Nag wrote:
When we want to update more than 100 million records, is a bulk update or a MERGE statement better? When I used a bulk statement it took 24 seconds to update 10,000 records, and when I used MERGE it took only 89 milliseconds. What is the difference between them?
A pure update of 100 million rows will be relatively slow, simply because that is a LOT of data that needs to be I/O'ed.

And the SAME number of rows will need to be updated whether you use a bulk update, a non-bulk update, a native SQL update, or whatever other method you can come up with.

The work to be done is constant - update 100 million rows.

A bulk update is not going to make this workload less. In fact, it will add to the workload as the PL/SQL engine and context switching between it and the SQL engine are added to the workload.

What will make a difference is splitting the workload into smaller chunks and then doing these in parallel. And this can be done using parallel DML and native SQL.
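A minimal sketch of the parallel DML approach described above, again using the hypothetical `big_table` (the predicate and degree of parallelism are placeholders):

```sql
-- Parallel DML is disabled by default and must be enabled per session.
ALTER SESSION ENABLE PARALLEL DML;

-- Ask the optimizer to split the update across 8 parallel slaves.
UPDATE /*+ PARALLEL(t 8) */ big_table t
SET    t.amount = t.amount * 1.1
WHERE  t.flag = 'Y';

-- A parallel DML transaction must be committed (or rolled back) before
-- the session can query the modified table again.
COMMIT;
```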

As for benchmarking your approaches using elapsed time? That is a dangerous and misleading thing to do. Elapsed time measures how fast a workload can be moved and not the size of the workload. And the very SAME workload will show DIFFERENT execution times depending on the runtime environment. The 1st execution of that workload could require 80% of all the I/O of that workload to hit the disk via slow physical I/O. The 2nd execution of that could result in 80% I/O being faster logical I/O hitting memory (buffer cache).

So using elapsed time as the measurement to determine which option or approach is best is pretty much a flawed approach. The workload should instead be determined, and the option with the smallest workload selected.

In that respect, naive native SQL will always be faster than slower PL/SQL bulk processing. And when large data volumes need to be processed, parallel processing can make optimal use of the I/O bandwidth and capacity.

Edited by: Billy Verreynne on Nov 29, 2011 6:31 AM
AdamMartin
Using elapsed time can be misleading. True. Especially if it is used as the sole measurement. A single test can result in "fast" or "slow" elapsed times, but then timings will vary on subsequent runs or in different environments.

As was pointed out, generally the best way to approach a tuning problem is to reduce the workload of the process as much as possible. Taken to extremes, the fastest process is one that does not have to run at all (has zero work to do).

However, there is a popular misconception that measuring elapsed time is useless, meaningless, or counterproductive. In a question such as this one, a reduction of total execution time is actually a goal. It should be measured.

An example was given above of how elapsed time measurements can mislead. It describes a process that is run twice with the same workload but has two different elapsed times due to the amount of I/O to do and the effects of caching. The cause of this discrepancy is a poor test setup by the developer, who neglected to clear the cache. There could be multiple factors beyond caching that distort the times in two subsequent runs of the same process. Test scenarios should minimize these types of differences as much as possible.

Usually, the purpose for tuning is to reduce resource consumption or to reduce the elapsed time for an operation to complete. Sometimes the best approach for a particular situation is not the one with the smallest workload. A suggestion was made in this thread that recommends using parallel DML, which is actually one approach to tuning for lower elapsed time instead of the absolute smallest workload. It's a trade off: additional overhead and more work for the DB server in exchange for a shorter execution time.
jgarry
An additional issue is the error handling required of merge - if anything is not handled, for example a unique constraint violation, the whole thing will roll back. So if you are doing this once with a few hundred rows, you can fix it and do it again, but if you are creating a periodic process with downstream dependencies and some unbounded number of rows you really have to handle all situations, including future obscure DDL changes (even by [url http://tkyte.blogspot.com/2010/04/something-i-recently-unlearned.html]Oracle versions[/url]).
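One way to keep a single bad row from rolling back the whole statement is DML error logging (available since 10gR2). A sketch, reusing the hypothetical `big_table` / `updates_stage` names:

```sql
-- One-time setup: creates an error table named ERR$_BIG_TABLE by default.
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'BIG_TABLE');
END;
/

-- Rows that hit a constraint violation are written to ERR$_BIG_TABLE
-- (tagged so each run can be identified) instead of failing the MERGE.
MERGE INTO big_table t
USING updates_stage s
ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
LOG ERRORS INTO err$_big_table ('nightly run') REJECT LIMIT UNLIMITED;
```

The rejected rows still have to be inspected and reprocessed afterwards, so this moves the error handling rather than removing it.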

As with so many things, it depends. Some people are really against merge, some people think it's the best thing since multitasking. Billy pointed out that it is generally better if you can do things in native SQL (sophisticated SQL, not naive SQL ;) ), but there are situations where that breaks down. Merge is one of those things you can't really tell from limited information if it is the best tool for the job.
jgarry
An additional problem with elapsed time is the difficulty in replicating concurrency issues in a test environment. Repeatedly testing a mass update in a test environment can eventually converge on a reasonable expectation of how long something will take - especially with direct i/o avoiding the SGA. But then doing the same thing in a production environment with re-reading undo, multiple checkpoints, delayed cleanouts and so on can lead to very unpleasant surprises.
Billy Verreynne
Adam Martin wrote:

However, there is a popular misconception that measuring elapsed time is useless, meaningless, or counterproductive. In a question such as this one, a reduction of total execution time is actually a goal. It should be measured.
I beg to differ. Around here I'm pretty much the only one who is vocal in warning against using elapsed time as a metric. It is incorrectly used the vast majority of the time (the warp drive principle).

It is not a popular "opinion".

Nor is it a misconception (as the warp drive principle shows).

Developers need to understand what elapsed time means and how it can be used. IMO, the majority do not. So that is why I am bashing away at the flawed idea of determining fast/good code versus slow/bad code using elapsed time as the only metric. (and that is exactly what is being used and nothing else)
Taken to extremes, the fastest process is one that does not have to run at all (has zero work to do).
There's a nice story about this in the AI environment. The s/w was programmed to adapt itself (neural net) with the basic goal being to eliminate as many processing mistakes as possible. It learned that when it does zero work, it achieves that goal. So it "committed suicide" continuously by shutting itself down, until this rule was adapted to say that not doing any work is not minimising mistakes. ;-)
Billy Verreynne
jgarry wrote:

.. it is generally better if you can do things in native SQL (sophisticated SQL, not naive SQL ;) ),
Hehehe... fixed. :-)

Post Details

Locked on Dec 27 2011
Added on Nov 25 2011
10 comments
183 views