This content has been marked as final. Show 42 replies
Just trying to understand, why do you think writing control file to same disk as archived logs would cause performance impact on user transaction?
To answer your question: I looked at previous alert log messages and all of them look same as the what I posted earlier. High DB performance time seem to have happened during switching log (also looking at timestamps) I thought may be that's an issue. But looks like you think that probably is not the case.
Sorry, my error, I misread (distracted myself) I had the control file backup taking 1m 12s , not the actual 1s it took.1 person found this helpful
The best position to be in would be to have the wait event the user was waiting for at the time of the problem.
I still have in my mind two basic possible causes for contention at this time:
Possibility (1) is i/o related where reads and writes to datafiles are slowed by the log archive .... sometimes a large file copy will can be seen as having large continous writes will take much more bandwidth than individual reads of a few database blocks ... this assumes these processes are in contention, and this depends on infrastructure. At this time the dbwr is also trying to write out many blocks down to the database at the same time, so there are possiblities of contention.
Possibility (2) Says that the waits are being caused by the checkpoint process writing dirty blocks to disk.
Certainly an alter database checkpoint; prior to an alter system archive log; might well reduce contention between arc and dbw (I have in fact heard of this advice being given before).
Snippet of code from our script:
Looks like we doing checkpoint before running archive log current.
alter system checkpoint; alster system archive log current; alter databsae backup controlfile to .. alter database backup controlfile to trace;
Below are the recommendation from ADDM:
Our disks are already stripped I am not sure why it's complaining. I am just lost.
FINDING 1: 28% impact (369 seconds) ----------------------------------- Waits on event "log file sync" while performing COMMIT and ROLLBACK operations were consuming significant database time. RECOMMENDATION 1: Host Configuration, 28% benefit (369 seconds) ACTION: Investigate the possibility of improving the performance of I/O to the online redo log files. RATIONALE: The average size of writes to the online redo log files was 31 K and the average time per write was 2 milliseconds. SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Wait class "Commit" was consuming significant database time. (28% impact [369 seconds]) FINDING 6: 4.2% impact (56 seconds) ----------------------------------- The throughput of the I/O subsystem was significantly lower than expected. RECOMMENDATION 1: Host Configuration, 4.2% benefit (56 seconds) ACTION: Consider increasing the throughput of the I/O subsystem. Oracle's recommended solution is to stripe all data file using the SAME methodology. You might also need to increase the number of disks for better performance. Alternatively, consider using Oracle's Automatic Storage Management solution. RATIONALE: During the analysis period, the average data files' I/O throughput was 301 K per second for reads and 318 K per second for writes. The average response time for single block reads was 11 milliseconds. RECOMMENDATION 2: Host Configuration, 4.2% benefit (56 seconds) ACTION: The performance of file /u08/oradata/p/data_002.dbf was significantly worse than other files. If striping all files using the SAME methodology is not possible, consider striping this file over multiple disks. RELEVANT OBJECT: database file "/u08/oradata/p/data_002.dbf" RATIONALE: The average response time for single block reads for this file was 34 milliseconds. SYMPTOMS THAT LED TO THE FINDING: SYMPTOM: Wait class "User I/O" was consuming significant database time. (21% impact [285 seconds])
Edited by: user628400 on Mar 24, 2009 3:42 PM
In my sometimes wrong opinion symptoms relating to Findings (1) and (6) probably related to user repsonse time being impacted ( User I/O and Commit / Rollback).1 person found this helpful
Now I have a high suspicion the I/O system is being swamped by the archiving of a redo log which could be up to 2GB in length.
There should be monitoring on the netapp array that could confirm this.
In my opinion, if I was a disk head, is that one disadvantage of SAME it that every disk head has to move to partake in every read and every write, and head movements are expensive. Normally the caching on the storage system will smooth this out a lot but I suspect it could be swamped by the reading and writing of a large file. Now I suspect every block of the redo block is being read from the netapp storage to the server so it can be written back to it again. It may well favour reading and writing of redo log to its archive as it is a contigous file. In between all of this writes from the log buffer are being delayed and reads from the datafile are being slowed as well.
The short answer is I would consider reducing the size of the redo logs, and increasing the number of groups if necessary, and think about any impact that might have. Note that on a manual alter system switch log the archive log generated will be less that 2GB in length, and that should be proprotionate to the work done in writing it. A smaller size might also have a greater chance of not overwhelming the storage system cache.
I too am interested to hear what others say.
rgds - bigdelboy
I have a high suspicion the I/O system is being swampedI agree. Archiving the Redo Logs is done to Filesystem -- thus through the FileSystem Cache. This might be getting stressed out and "freezing" other filesystem I/Os.
Similarly, the SAN cache may be getting swamped at times and everything has to "freeze" till the SAN cache finishes write operations.
I also wonder how big the control files are.
I mentioned here you were on netapp, of course you are on emc.
Our Control file are 18MB big and are being written to local disk and not EMC. How should I go about with this problem now? Any suggestions. Reducing log file size might impact overall performance if the switching occurs often.
Do we need to do alter system archive log current? Could that be the problem.
Instead of reducing the size of online redo logs or running a job for ALTER SYSTEM ARCHIVE LOG CURRENT you could also use the ARCHIVE_LAG_TARGET parameter to force log switching and archiving.
I am still trying to comprehend why does it impact front end custoners. We have log buffer of 14MB so I am assuming if log buffer has space then transactions should continue to write to log buffer and shouldn't have to wait for anything else. Isn't that correct?
1) You should have a history of the dates logs are produced and their size, by looking at the filesystem (best?) or V$LOG_HISTORY/V$LOGHIST/V$ARCHIVED_LOG ; or my analyzing alert log.
-- Happened to see this script as I was browsing arround if you have nothing better [http://www.ss64.com/orav/V$LOG_HISTORY_perhour.html]
-- Need to see how often you switching normally and if you have any batch processes that give rise for concerns. (If you already usually switching over 6 - 10 times hour then reducing the redo log size is less appropriate).
-- Manually forced and startup/shutdown log switches should produce archived redo log proportionate to the redo actally produced, and could be any size up to 2GB.
2) Setting ARCHIVE_LAG_TARGET is probably a good thing anyway (ack prev poster).
however it does not stop 2GB archive redo logs being produced and those could perform the performance hit when produced.
I think if I was in your shoes I would:-
a) Analyze and record my current rate of redo production and associated production of archived redo logs.
b) When happy with this, set ARCHIVE_LAG_TARGET, if it starts causing issues then its easy to disable.
c) Ensure FAST_MTTR_START_TARGET is explicitly set, if necessary to default of 300. This way the 'optimal' redo log size should be should in V$INSTANCE_RECOVERY.
d) If appropriate reduce redo log size.
3) Another option is to reorganise locations so the redo logs and/or archive logs and/or FRA are relocated to a different location with less contention. Arguably this might go against SAME methology, but I do sometimes prefer when copying source and target to be separate sets of spindles. Mainly for my own benefit I've found a powerpoint on SAME methodology here [http://www.oracle.com/technology/deploy/availability/pdf/OOW2000_same_ppt.pdf], though it may be worth reviewing you ar compliant with partitioning philosophy with regard to redo logs. Interestingly page 31 indicates SAME is viable because the RAID cache absorbs high write rates ... interestingly does that mean SAME is not viable if the raid cache is unable to absorb high write rates? Anyway reorganising layout / partitions could create ass many challenges as it solves.
Note sure if this confuses more than helps, but the next step is surely to get information on (1)?
Rgds - bigdelboy
One more question. Does log switch hangs user transactions? If so then in what situations could that occur and why is that? My understanding is if the blog buffer cache is high then all transactions get written to the buffer and doesn't have to wait for log buffer to be completely empty? Am I wrong?
With regard to log buffer, as soon as a commit occurs the log buffer must be flushed to disk and the committing session is held until to storage system reports the flush of the log to the redo buffer was successful. This is to ensure the Durability of commited transactions. In practices write will in fact be occuring from the log buffer to the current redo log at intervals of not less than 3 seconds so or when (is ist 1/3 full) when a commit occurs not much has to be written out(I have not double checked the exact algoritm completely, though I should know it). I believe current thinking is the log_buffer_size should not be set so oracle can tune it, and there is a possilbe peformance issue if it is too large. (If you let oracle self size it, and Log Buffer Space wait events occur then manual tuning might be better). Oracle Applications at some version(s) I believe goes against this advice and recommends 10mb according to a couple of bits I have rad on the net.1 person found this helpful
I do not believe you are hanging users at log switch, however if they have issued a commit or need to read a block not in buffer cache then they are being severely impeded by a root cause of i/o contention
Hope this helps - bigdelboy
I ran that script and see that log switch occurs once in 3 hrs. But during backup it occurs 3 log switches occur.