Forum Stats

  • 3,728,697 Users
  • 2,245,675 Discussions
  • 7,853,706 Comments

Discussions

Control File Sequential Reads

Jérôme G
Jérôme G Member Posts: 11 Green Ribbon

Hi,

We have a 12.1.0.2. NO RAC database on Linux X86 64-bit 16 CPUu Cores and 125 Go Memory

Looking at the "top 5 timed events" section of the AWR reports (always for 1 hour), I am seeing "control file sequential read" as the number one event




SELECT executions, end_of_fetch_count, elapsed_time/px_servers elapsed_time, cpu_time/px_servers cpu_time, buffer_gets/executions buffer_gets FROM (SELECT sum(executions_delta) as EXECUTIONS, sum(case when px_servers_execs_delta > 0 then px_servers_execs_delta else executions_delta end) as px_servers, sum(end_of_fetch_count_delta) as end_of_fetch_count, sum(elapsed_time_delta) as ELAPSED_TIME, sum(cpu_time_delta) as CPU_TIME, sum(buffer_gets_delta) as BUFFER_GETS FROM DBA_HIST_SQLSTAT s, V$DATABASE d, DBA_HIST_SNAPSHOT sn WHERE s.dbid = d.dbid AND bitand(nvl(s.flag, 0), 1) = 0 AND sn.end_interval_time > (select systimestamp at TIME ZONE dbtimezone from dual) - 7 AND s.sql_id = :1 AND s.snap_id = sn.snap_id AND s.instance_number = sn.instance_number AND s.dbid = sn.dbid AND parsing_schema_name = :2)

How can identified the origin of my issue ?

I saw that that SQl ID is used by the new feature Adapative Dynamic Statc, and it is on the top 10 ordered by Elapsed time during 3 one hour snap


The ADDM expose enhard parsing issue and undersized DGA




Tagged:

Comments

  • EdStevens
    EdStevens Member Posts: 28,052 Gold Crown

    Do you have an actual perforamce issue? Or are you simply chasing the top wait event? If you solve this one, you will still have a 'number one' wait event, but it will be something else. Beware of contracting a case of Compulsive Tuning Disorder.

  • user9540031
    user9540031 Member Posts: 87 Silver Badge

    Hello,

    For a host with 16 cores, the load average being 64 in the beginning of the report period, and 34 in the end, shows that the CPUs are over-subscribed. Ideally the load average should remain below 16.

    If the report duration is 1 hr as you said, knowing that the database CPU time is reported as 2+ hr roughly (8209 s), if we compare this to 16 hr of CPU capacity, then I'd think that maybe this is not the database's fault—but you'd have to confirm this of course.

    When a system is overloaded like this, it's not unusual to see wait events taking longer than they would in a normal situation: this is because server processes spend time waiting for CPU in OS run queues, in addition to waiting for wait events themselves.

    And your I/O subsystem seems to be struggling too...

    So yes, obviously you have a performance issue going on, but you need to analyze it by:

    1. First, confirm that the CPU load is too high, and find out why—you need to do this from the OS level: AWR will not help much here, because it does not distinguish between actual CPU time and time spent by processes waiting for CPU in OS run queues.
    2. Find out if the DB activity is as expected or not. That query (SQL id: 4b4wp0a8dvkf0) which is #1 in the top SQL by elapsed time was executed 27 times / second, which means (avg duration = 1 s) it was run in 27 sessions simultaneously. It this expected? Find out which sessions are running this query, and what you can do about it.
    3. Leave the rest of findings (hard parses, undersized SGA) for later. First get that CPU and I/O activity back to acceptable levels.

    Good luck!

  • JohnWatson2
    JohnWatson2 Member Posts: 4,228 Bronze Crown

    I would further than that. This is a virtual machine, not a real computer, so you need to consider what is happening at the physical layer. Your SA has given you 16 vCPUs but how many physical cores do you have?

  • user9540031
    user9540031 Member Posts: 87 Silver Badge

    Regarding the amount of waits on "control file sequential read" (704040 waits in the AWR report period), it's consistent with the count of executions (100000+) of that #1 SQL id (4b4wp0a8dvkf0): that query is a join of DBA_HIST_SQLSTAT, DBA_HIST_SNAPSHOT, and V$DATABASE, the latter being there solely to provide with the current dbid (database id), which is key to most AWR-related queries. And V$DATABASE does a cartesian product of 2 X$ fixed tables, X$KCCDI and X$KCCDI2, which happen to read from the control file.

    On my test DBs, tracing the following: select dbid from v$database shows exactly 7 waits on "control file sequential read", each reading a single block, whenever that query is executed (I suppose that's 3 reads for X$KCCDI and 4 for X$KCCDI2).

    Repeat that 100000 times in 1 hr, and you'd have 700000 waits. Or 700000 / 3600 = 194 single block reads per second, which is a large share of those 234 physical reads per second on behalf of database requests shown in the IO Profile section of the AWR report. Again, the huge count of executions of the above-mentioned SQL id does not seem normal at all. And if that happens on a CPU-starved host with slow I/Os...

    Jonathan Lewis
  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    edited February 13
    thank you all for taking the time to analyze my problem
    
    @user9540031 I had carefully read your article on the subject and googling on the sql id I came across your article. 
    This request is generated by Oracle because it is not an application request but is this te real issue for you ? 
    May I disabled the adaptive dynamic stats ?
    
    I do have a big performance problem with unusually long batch processing. 
    I see a lot of wait I / O at the system level, are they due to the queue waiting for access to the control file ? 
    
    
    
    
  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon


    comparing with the same period an other day to determine if my batch aplication is responsable or not for this control file sequential read wit event
    ,i saw that the DBMS SHEDULER is running
    
    
    
    


  • user9540031
    user9540031 Member Posts: 87 Silver Badge

    1/ As said above, the 1st thing to check is the CPU load on the host: is the system over-loaded or not? And if it is, which processes are the top consumers? So I suggest starting from the unix shell, with the uptime command in order to check the load average values. Anything below 16 (as there are 16 CPUs on this host) is acceptable; higher values mean that the system is operating with more demand for CPU than it can meet. Then, if the system load average is too high, use the top utility to find who the top-consuming processes are. This will enable to tell if the load is created by database processes or by something else.

    2/ Googling for that same SQL id (4b4wp0a8dvkf0) that is obviously executed way too often, I gather that it may have something to do with Adaptive Dynamic Sampling. What is the value of the OPTIMIZER_DYNAMIC_SAMPLING parameter? And are there sessions attempting to parse parallel queries? In any case, you should find the offending sessions easily in V$SESSION or V$ACTIVE_SESSION_HISTORY (where SQL_ID = '4b4wp0a8dvkf0' ...) and find out if that is expected activity or not.

    I see a lot of wait I / O at the system level, are they due to the queue waiting for access to the control file ?

    Your I/O subsystem is struggling, obviously, if we judge by the average duration of the "control file sequential read" and "db file sequential read" wait events: you'd expect at most 10-20 ms, not 144 ms—this is a an order of magnitude too slow. So it would most certainly help if those "ADS-time-limit-from-AWR" queries (SQL id: 4b4wp0a8dvkf0) did not happen in the first place. So yes, at this point I would certainly try to prevent that from happening if I could—not sure how, anyway... Maybe set OPTIMIZER_DYNAMIC_SAMPLING to 1, but that sounds a little bit radical. I don't have access to MOS these days, but if I had, I would try to search for articles about that SQL id, just in case. That said, checking system activity at the OS level, and making sure that the I/Os are not used heavily by some activity not related to the database would a be good thing, just to focus on the right problem.

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    i upload the sar resulat for this period , as you can see there is no CPU bound on the system by I/O waits. 
    optimizer_adaptive_features = TRUE
    optimizer_dynamic_sampling = 0
    


  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon

    It seems to be an application issue. Below an ASH for the SID 442. Control file seq read represent a large part of the session but the top DB Objects and Top DB Files traced back an direct path read temp event ?



  • evgenyg
    evgenyg Member Posts: 332 Bronze Badge

    Considering the average I/O waits over 100ms I would start with a call to Unix + storage friends (if you have no access/knowledge of it ).

    And ask them what's going on, for example on our systems anything bigger than 10ms per I/O would lead to an outage.

    Your's sar report only for CPU which is not sufficient, I would check queue size as well (sar -q). But even from CPU report seems you have periods when I/O wait is over 50%, I don't think such figures are acceptable on a database server.

    Two general notes:

    SAR: 10 minutes period for 16 cores = 160 minutes of work, no spikes can be observed most probably.

    AWR: 1 hour period = 160 hours of work, I would reduce the AWR to be 15 minutes, as well check out sysmetrics to try and catch up a realistic picture of behaviour.

    Good luck.

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    edited February 16


    I cannot determine if the querie carrying the SQL iD 4b4wp0a8dvkf0 is responsible for the slow disk accesses or if it is the slow disk which causes these waits event. And especially why Oracle executes these requests during my application process ?


    i run the program in another environment and the same event occurs

  • evgenyg
    evgenyg Member Posts: 332 Bronze Badge

    I think the issue is outside of oracle, I would start with O/S and IO subsystem.

    The query should do a really huge amount of I/O to "slowdown" I/O subsystem, the numbers you provided are tinny.

    And, despite the tiny numbers, the average wait time is way to large

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    edited February 16
    The program runs from 8 p.m. to 3 a.m.
    As an attachment, the run queue on the calling server and the database server
    


  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    I upload all the Sar of my 2 server, in case you see any interesting things there
    

    Thanks

  • evgenyg
    evgenyg Member Posts: 332 Bronze Badge

    runq on the client, in my opinion, a little bit high and maybe a sign of CPU starvation

    On the DB side, it also high (in my opinion).

    There are a few devices that seem constantly busy: for example dev253-12,  dev8-16

    I would start to review OS or/and I/O side, before starting with the database.

    As well as the layout of database files on the server.

    I don't see the correlation between very low "I/O" which is perhaps 200MB per minute and such a high I/O wait average time.

    Where the redolog located? not on devices I've listed above?

    What your storage/linux friends seeing?

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon

    The located of the redo log is on the Database

    I asked my storage/linux friends (but not a system admin 😪) the result for the command lsblk and a vmstat -D

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    edited February 16

    Result of the 2 command

    dev8-16 represent the disk hosted the database


    Important information that I haven't mentioned until now is that my processing executed 40 processes in parallel. 
    Each parallel processing a consistent volume of business data
    


  • evgenyg
    evgenyg Member Posts: 332 Bronze Badge

    Is the host hosting only oracle instance and no one else running in parallel?

    Is there any backup running in parallel? (not only database backup, but O/S perhaps?)

    I have no use for vmstat and/or lsblk outputs, in my opinion, the snapshot is not interesting we looking for a graph over time, where we may see spikes of IOPS, we looking for something that we can correlate with the picture we see in AWR.

    In my opinion we still need to see what's going on outside of the instance/database

  • Jérôme G
    Jérôme G Member Posts: 11 Green Ribbon
    OK thank you I'll look in that direction
    


Sign In or Register to comment.