I would ignore the long checkpoints on Oracle 10g unless other waits are giving you cause for concern.
1) Can you confirm whether you are using RMAN backups and have a large pool allocated? That way we know whether we need to consider tuning RMAN.
2) Or are you doing a user managed hot backup?
3) Can you also give your exact database version?
You can probably reduce the size of your checkpoint by reducing FAST_START_MTTR_TARGET, but this may impact performance. I wouldn't do this.
With 10g the system should be doing incremental checkpointing for you anyway, unlike 8.1.7, where a checkpoint had to occur at every log switch.
Your redo logs are quite large, and are mirrored, and there are 2*6 of them.
This may be appropriate for this database. But assuming you are in archivelog mode, this typically means a 2GB archivelog has to be read and written off somewhere on each log switch, and depending on your setup this could be impacting performance.
Others answering this thread may give better insight.
I also have a sixth sense that some of your /uNN mount points may in fact reside on the same physical disk. If so, that may be giving you some I/O contention.
Rgds - bigdelboy
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit
We are using BCV backup; I think it's also called three-way mirroring. How can I find out if writing the archivelog on log switch is the problem? Is there a way to validate that? Also, why should it impact performance on customer transactions?
By /uNN do you mean the disks used for BCV backups? How can I validate that?
I am getting a little nervous every time I see performance being impacted during backup. One thing I can think of, as suggested by bigdelboy (earlier post), is that when the redo log switches and the next redo log has not been archived, the transactions wait until that redo log has been archived successfully. Could this be a problem? How can I track this down?
Is there a way to know if the archive log was really the cause of the performance impact?
Try Statspack, or since you are using 10g you can go for ADDM or AWR. It will help you analyze the top wait events in your database.
Also check your alert log for more details regarding the log switch.
Are you using RAID or SAN?
And since you said you are using BCV backup and not RMAN, it may lead to some performance issues, since it generally puts the entire DB into backup mode, and as a result there is more overhead on the online redo logs.
Check your backup policy thoroughly, and if possible switch to RMAN.
Thanks & Regards,
Edited by: Prosenjit on Mar 24, 2009 6:31 AM
Everything is indicating to me this is a big database.
My understanding, possibly flawed, is that to use a backup technique of splitting off the third mirror, the tablespaces should be put in backup mode before the split and taken out of backup mode after the split. (I believe this should be visible as entries in the alert log.)
Whilst tablespaces are in backup mode extra redo will be generated, so it is important to check that this time is minimized.
If data paths have been properly separated then there should only be a small blip in performance while the mirror split is occurring,
and possibly a drop when the third mirror is resilvered back in.
The fact that you are on a 3-way mirror backup design means you need to have documented the design and infrastructure of that setup and
ensure the backup is documented, understood and monitored. This might also help pinpoint where in the process response times are increasing.
You might also try to monitor whether your disk access wait times are increasing during backups, or whether you have increased network latency.
(At absolute worst case, in my opinion, bad operation of this form of backup could lead to an unrecoverable database .... some system/backup admins might not appreciate this .... )
What could however be happening is that, while the third mirror is being backed up elsewhere, that backup is contending with your database's access to storage.
This is all infrastructure dependent.
On a slightly different topic, what I meant was that the /uNN mounts may be filesystems set up by a volume manager, and may be on the same physical disk. Understanding the
3-way mirror backup takes priority over this.
In summary I suggest you seek all that is known about the design of your backup and its supporting documents. In your position I think I certainly would.
(It is past my bedtime and I may have veered off topic).
It is probably worth noting how often you switch redo logs and whether that rate gets more frequent at certain times of day
(i.e. graph when your redo logs were created and at what size).
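As a rough illustration of the graphing idea, here is a minimal Python sketch that buckets log switch times by clock hour and prints a crude text histogram. The function name `switches_per_hour` and the sample timestamps are hypothetical; in practice the times would have to come from somewhere like V$LOG_HISTORY or the alert log.

```python
from collections import Counter
from datetime import datetime


def switches_per_hour(switch_times):
    """Count log switches in each clock hour, returned in time order."""
    counts = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in switch_times
    )
    return dict(sorted(counts.items()))


# Hypothetical switch times, for illustration only (not taken from a real system).
sample = [
    datetime(2009, 3, 15, 8, 5, 0),
    datetime(2009, 3, 15, 8, 40, 0),
    datetime(2009, 3, 15, 9, 8, 11),
    datetime(2009, 3, 15, 9, 20, 0),
    datetime(2009, 3, 15, 9, 45, 0),
]

# One line per hour: the hour, a bar of '#' characters, and the count.
for hour, n in switches_per_hour(sample).items():
    print(hour.strftime("%Y-%m-%d %H:00"), "#" * n, n)
```

If the hours with the most switches line up with the backup window, that would be a hint (not proof) that the switch rate is worth a closer look.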
Use Database Control / Grid Control to view AWR reports especially watching I/O quantity and waits overnight.
Before I forget, it may be worth ensuring FAST_START_MTTR_TARGET is set to a non-zero value (e.g. 300s). This has an effect on incremental checkpointing.
Hope some of this helps - bigdelboy
You say you have long checkpoints. The term 'long' is relative ... Is there any way you could give us a few hints ...
- operating system
- size of database (storage method (files, raw, ASM), tablespaces, files, ASSM?)
- something about the disk subsystem ( NAS, SAN, attached, RAID (which), controllers )
- was the database installed at 10g or upgraded to 10g
Any snippets or hints around the above could be relevant.
One thing I don't understand is why there is a kind of pause. Our response times jump 20X, so I am wondering whether the redo logs are not being archived in time, so that when a switch occurs it waits until the next redo log is ready for use. Could that happen? Should I verify that? And how can I verify that?
I'll also check other things like ADDM, AWR etc.
The checkpoints are for DBWR to write dirty buffers to the database files. LGWR and Archiver, in theory, are not impacted by checkpoints. Of course, they are all on the same storage, so the symptoms appear together.
Since DBWR is writing to the database files, this can impact backup time -- and vice versa. Online backups can impact checkpoint performance.
Are the database files on filesystem / raw devices / ASM ? In the latter two cases, there is no filesystem cache.
Also, if the filesystem cache is forcibly set very small and you are doing backups to disk, then the backups to disk can be very slow -- e.g. backups to filesystems mounted with directio are slower.
I guess I wasn't clear. I don't have an issue with the backup. The issue is that when the backup runs, the user transactions' response time jumps 20X. I am guessing there is a pause of some kind, and I was trying to figure out if it could have anything to do with redo logs not being archived fast enough.
Evidently a bottleneck may be occurring at the backup.
A) Is the 20x user response time consistent throughout the backup, or is it only when a log switch (and/or checkpoint) is occurring?
Do you by any chance have log_checkpoint_timeout set?
B) I understand your backup technique involves splitting off a mirror. It probably therefore involves the following phases (which may be oversimplified):
1) Tablespaces into backup mode and mirror split occurring, then tablespaces back into non-backup mode.
2) The split off mirror is backed off to somewhere else.
3) The split mirror is re-silvered back into the mirrored pair.
I anticipate therefore that you may be experiencing your performance issue in phase (2) or (3), or both. Phase (1) should be quite short.
What I would be looking out for is the operation in (2) or (3) contending with access to the database disks.
If you are on a SAN for storage then there could be other heavy-bandwidth processes using the SAN at backup time, and this may affect your disk access.
It is also possible that the network between the user and the server is busy due to backup network traffic and this is causing the slowdown.
I think it would be helpful if any of these stages can be confirmed as a possible issue or eliminated as not causing a problem.
As a matter of fact I do see a performance impact when a log switch occurs. It's not consistent, but about once in 3 times I see high response time. Our avg req. takes around 1sec but it goes as high as 25sec.
log_checkpoint_interval integer 0
log_checkpoint_timeout integer 0
log_checkpoints_to_alert boolean TRUE
I just don't get it. I initially thought that maybe it's not archiving fast enough, so the process hangs for a while, but based on your answer it looks like that's not the case. Besides, I don't know how to prove these theories. We are using SAN to EMC storage.
If it was not archiving fast enough I think I would expect to see "cannot allocate new log" and "checkpoint not complete" in your alert file.
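A minimal Python sketch of that check, assuming the alert log is readable as plain text. The two marker strings are the ones quoted above; the helper name `find_stalls` and the sample lines are hypothetical and only illustrate the shape of the output.

```python
# Marker strings as quoted in this thread; matched case-insensitively.
STALL_MARKERS = ("cannot allocate new log", "checkpoint not complete")


def find_stalls(lines):
    """Return (line_number, text) for every line containing a stall marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in STALL_MARKERS):
            hits.append((number, line.strip()))
    return hits


# Hypothetical alert log lines, for illustration only.
sample_lines = [
    "Thread 1 advanced to log sequence 7524 (LGWR switch)",
    "Thread 1 cannot allocate new log, sequence 7525",
    "Checkpoint not complete",
]

for number, text in find_stalls(sample_lines):
    print(number, text)
```

To run it against a real file, something like `find_stalls(open(path))` would do, where `path` is wherever your alert log lives. No hits during the slow periods would suggest the log writer was not being held up waiting for a log.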
You may be able to check V$LOG, V$LOG_HISTORY or archive log list.
There appear to be two things that MAY be going on at the time of a log switch that can affect performance:
(1) A checkpoint
(2) Archiver copying redo log to a number of archived redo logs. (As your logs are 2GB this is quite a big copy)
- The amount of effort required in (1) might be reducible by setting FAST_START_MTTR_TARGET.
- A smaller redo log might reduce the effort in (2), but that might cause you to switch logs too frequently (e.g. more often than every 10 mins). It might lessen the peak of slow response time, but perhaps at slightly less throughput overall.
There should be a redo log size advisor with 10g, and that might help.
You may find it interesting to check [http://www.dbazine.com/blogs/blog-cf/chrisfoot/blogentry.2006-05-12.4552485084], as it does seem to cover some points quite well.
Exceptionally, it might even be that when you have the long archive log copy time you are actually using the same physical disks for the redo log and the archived redo log, and when you have the quick one the disks are different.
Variations could also occur depending on whether the caching on the storage subsystem was swamped at the time.
This is yet more postulation rather than answers - rgds bigdelboy
Isn't a checkpoint in Oracle asynchronous, so that it doesn't keep a user thread/transaction hanging? Also, why would the archiver copying a redo log cause a performance impact on customer-facing transactions? Any possible reasons? I am just trying to get different points of view, because the information I have is just from the alert logs and doesn't seem sufficient to come to a definite conclusion. Thanks for your help.
I see the messages below in the alert log, and I see performance being degraded between 09:08:11 and 09:09:12. I can pinpoint it exactly:
Edited by: user628400 on Mar 24, 2009 12:46 PM
Sun Mar 15 09:08:10 2009
ALTER SYSTEM ARCHIVE LOG
Sun Mar 15 09:08:11 2009
Beginning log switch checkpoint up to RBA [0x1d64.2.10], SCN: 2502556541
Sun Mar 15 09:08:11 2009
Thread 1 advanced to log sequence 7524 (LGWR switch)
  Current log# 4 seq# 7524 mem# 0: /u07/oradata/p/redo04a.log
  Current log# 4 seq# 7524 mem# 1: /u08/oradata/p/redo04b.log
Sun Mar 15 09:09:12 2009
alter database backup controlfile to '/scripts/oracle/backuplogs/p/backup.controlfile.before'
Sun Mar 15 09:09:13 2009
Completed: alter database backup controlfile to '/scripts/oracle/backuplogs/p/backup.controlfile.before'
Sun Mar 15 09:09:13 2009
alter database backup controlfile to trace
Completed: alter database backup controlfile to trace
Sun Mar 15 09:09:13 2009
alter tablespace E_DATA begin backup
Sun Mar 15 09:09:13 2009
Completed checkpoint up to RBA [0x1d64.2.10], SCN: 2502556541
Sun Mar 15 09:09:13 2009
Completed: alter tablespace E_DATA begin backup
Sun Mar 15 09:09:14 2009
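Since log_checkpoints_to_alert is TRUE here, the checkpoint begin and completion messages can be paired up by RBA to see how long each log switch checkpoint took. The following is only a sketch in Python: it assumes the alert log has a timestamp on its own line followed by the message lines (as in 10g), the helper name `checkpoint_durations` is hypothetical, and the excerpt is a cut-down copy of the lines posted above.

```python
import re
from datetime import datetime

# A 10g-style alert log timestamp line, e.g. "Sun Mar 15 09:08:11 2009".
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$"
)
BEGIN = re.compile(r"Beginning log switch checkpoint up to RBA \[([^\]]+)\]")
END = re.compile(r"Completed checkpoint up to RBA \[([^\]]+)\]")


def checkpoint_durations(lines):
    """Return (rba, start_time, seconds) for each begin/complete pair found."""
    current = None   # timestamp of the most recent timestamp line
    started = {}     # RBA -> time the checkpoint began
    durations = []
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current = datetime.strptime(line, "%a %b %d %H:%M:%S %Y")
            continue
        if current is None:
            continue  # message before any timestamp: cannot be timed
        begin = BEGIN.search(line)
        if begin:
            started[begin.group(1)] = current
            continue
        end = END.search(line)
        if end and end.group(1) in started:
            start = started.pop(end.group(1))
            seconds = (current - start).total_seconds()
            durations.append((end.group(1), start, seconds))
    return durations


# Cut-down excerpt of the alert log lines posted in this thread.
EXCERPT = """\
Sun Mar 15 09:08:11 2009
Beginning log switch checkpoint up to RBA [0x1d64.2.10], SCN: 2502556541
Sun Mar 15 09:08:11 2009
Thread 1 advanced to log sequence 7524 (LGWR switch)
Sun Mar 15 09:09:13 2009
Completed checkpoint up to RBA [0x1d64.2.10], SCN: 2502556541
"""

for rba, start, seconds in checkpoint_durations(EXCERPT.splitlines()):
    print(rba, start.strftime("%H:%M:%S"), seconds)
```

For the excerpt above the checkpoint runs from 09:08:11 to 09:09:13, i.e. 62 seconds, which covers the window where the degradation was seen. Running the same pairing over a whole day of alert log would show whether the slow periods always coincide with long checkpoints or only sometimes.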
- I am interested in the effect the alter database backup controlfile ... has in this as well, as this was also ongoing at the time of the performance issue.
This means we have:
-1- A (manual) log switch (or is this an alter system archive log ... ) ... not an issue in itself, the log sequence advances quite normally.
-2- Checkpoint caused by log switch ... not believed in itself to be an issue.
-3- Copying of redo logs to archive log destination .... might cause disk contention.
-4- An ALTER DATABASE BACKUP CONTROLFILE .... may be related ... may be copying to the same disks as the archived logs.
-5- Tablespace(s) going into backup mode ... this does not seem to have caused an issue.
Thought (silver bullet thought - not good): would an ALTER SYSTEM CHECKPOINT before the ALTER SYSTEM ARCHIVE LOG be beneficial?
Thought (silver bullet thought II - not good): would an ALTER SYSTEM SWITCH LOGFILE 5 minutes before the backup be beneficial?
- What I should have asked is: can you find another example of the performance degradation at a different point in the backup cycle, where only a log switch was occurring and nothing else?
rgds - bigdelboy
**** Some of my comments here are rubbish ... see below **** (added after post by bigdelboy)
Edited by: bigdelboy on 24-Mar-2009 14:51