8 Replies Latest reply: Dec 28, 2013 6:37 AM by Sujay Palavalli-Oracle

7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?

user13040671 Newbie

I was analyzing cell metrics for a 2-hour period (8PM-10PM); in particular, I was looking at consumer group throttling due to both DBRM (the intra-database plan) and the IORM inter-database plan.

I was puzzled by the following metric:

 

CG_IO_WT_LG         ORDERSDB.CG1     64,411,835 ms           2013-12-13T19:59:14-05:00

...

...

CG_IO_WT_LG         ORDERSDB.CG1     89,818,750 ms           2013-12-13T21:59:15-05:00


Got the above output with this command:

cellcli -e "list metrichistory where metricObjectName='ORDERSDB.CG1' and collectionTime > '2013-12-13T19:58:00-05:00' and collectionTime < '2013-12-13T22:00:00-05:00' and name ='CG_IO_WT_LG'"


Obviously I am missing something big here. I can't figure out how the cumulative wait for the CG1 consumer group in ORDERSDB can grow by 7 hours during 2 hours of data collection.


In other words:


 

at 8PM

CG_IO_WT_LG         ORDERSDB.CG1     64,411,835 ms           2013-12-13T19:59:14-05:00

at 10PM:

CG_IO_WT_LG         ORDERSDB.CG1     89,818,750 ms           2013-12-13T21:59:15-05:00


How can there be a cumulative wait of 89,818,750 - 64,411,835 = 25,406,915 ms ≈ 7.06 hours in the two hours between 8PM and 10PM?

Surely this CG cannot have spent 7.06 hours waiting to be scheduled by IORM within a 2-hour interval?
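Just to spell out the arithmetic, here is a quick Python sketch using the two samples above:

```python
# Delta of the two cumulative CG_IO_WT_LG samples quoted above
start_ms = 64_411_835   # sample at 2013-12-13T19:59:14-05:00
end_ms = 89_818_750     # sample at 2013-12-13T21:59:15-05:00

delta_ms = end_ms - start_ms
hours = delta_ms / 3_600_000  # milliseconds per hour

print(delta_ms, round(hours, 2))  # 25406915 ms, about 7.06 hours
```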


Thanks in advance

  • 1. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    robinsc Explorer

    Assuming you had at least 4 sessions in the same consumer group waiting for IO (actually many more than that), the cumulative wait can easily exceed the elapsed wall-clock time. Where is the confusion in that?
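    To illustrate the point with hypothetical numbers (the session count and waiting fraction are made up):

```python
# A cumulative wait metric sums the wait time of every concurrently
# waiting session, so it can exceed elapsed wall-clock time.
sessions = 4              # hypothetical: sessions queued in the same CG
wall_clock_hours = 2.0    # the observation window
waiting_fraction = 0.9    # hypothetical share of the window each spent queued

cumulative_hours = sessions * wall_clock_hours * waiting_fraction
print(cumulative_hours)   # 7.2 summed hours inside a 2-hour window
```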

  • 2. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    858739 Newbie

    I'd rather use the per-request metric (CG_IO_WT_LG_RQ) than the cumulative numbers.


    You can make use of the cell_iops.sh script here http://karlarao.wordpress.com/scripts-resources/ to get a characterization of the IOs across the databases at the cell level. It only has to be executed on one of the cells, and it gives per-minute detail.


    Whenever I need to characterize the IO profile of the databases for an IORM config I would:

    > pull the IO numbers from AWR
    > pull the top timed events from AWR (this will tell you whether the DB is IO- or CPU-bound)
    > get all these numbers in a consolidated view
    > then, depending on priority, criticality, and workload type, decide what makes sense for all of them (percentage allocation and IORM objective)


    Whenever I need to evaluate an existing config with IORM already in place I would:

    > pull the IO numbers from AWR
    > pull the AWR top events and look at the IO latency numbers (IO wait_class)
    > pull the cell_iops.sh output for the workload periods where I'm seeing high activity
    > get all these numbers in a consolidated view
    > get the different views of IO performance from AWR on all the databases http://goo.gl/YNUCEE
    > validate the IO capacity of both flash and hard disk against the workload numbers from both the AWR and cell data
    > for per-consumer-group detail, use the "Useful monitoring SQLs" here http://goo.gl/I1mjd
    > and if that's not enough, do even more fine-grained latency & IO monitoring using Snapper


    On cell_iops.sh I have yet to add latency by database and consumer group, as well as the IORM_MODE, but the methods I've listed above work very well.


    -Karl

  • 3. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    user13040671 Newbie

    Thank you Karl, this is amazingly extensive info w.r.t. IORM design... I will digest it, put it to use, and post back here.

     

    Robin, thanks for the pointer... but there were only 2 sessions:

     

    Here are the runtime CGs (during RMAN backups)... only 2 sessions using the CG1 consumer group:

          INST   SID SERIAL# RESOURCE_C PROGRAM         MACHINE    LOGON_TIME      SERVICE_NAME      OPNAME                            TIME_REMAINING      SOFAR ELAPSED

    ---------- ----- ------- ---------- --------------- ---------- --------------- ----------------- --------------------------------- -------------- ---------- ----------

             2   280 4253 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: aggregate input 12512 1436451450      14297

             2  1066 471 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: aggregate output                            207221115      14161

             2   287 833 CG1   rman@computenode computenode DEC-13 19:58:13 cg1_service  RMAN: incremental datafile backup 93425561      14370

             2   287 833 CG1   rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup           4096 311610685      14370

             2   288 5399 CG1   rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup           4927 297877339      14368

             2   288 5399 CG1   rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup                  82269013      14368

             2   804 347 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup              0      17758        215

             2   804 347 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup              0 400025082        215

             2  1315 6967 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup              0 399303544        214

             2  1315 6967 SYS_GROUP  rman@computenode computenode DEC-13 19:58:13 cg1_service RMAN: incremental datafile backup              0      48350        214

     

    10 rows selected.

  • 4. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    user13040671 Newbie

    Btw, purely as an academic point, here is the query used during the RMAN job to pull the runtime sessions:

    SQL> !cat check_rman_CG.sql
    set lines 200 pages 300
    col resource_consumer_group for a10
    col program for a15
    col machine for a10
    col service_name for a17
    col OPNAME for a33
    alter session set nls_date_format='MON-DD HH24:MI:SS';

    SELECT sess.inst_id INST, sess.sid, sess.serial#, resource_consumer_group,
           substr(program,1,15) PROGRAM, substr(machine,1,10) machine,
           logon_time, service_name, OPNAME, TIME_REMAINING, SOFAR,
           ELAPSED_SECONDS ELAPSED
    FROM gv$session sess, gv$session_longops longops
    WHERE sess.INST_ID = longops.INST_ID
      AND sess.SID = longops.SID
      AND sess.SERIAL# = longops.SERIAL#
      AND sess.username = 'SYS'
      AND sess.type <> 'BACKGROUND'
      AND program NOT LIKE 'oraagent%'
      AND program NOT LIKE 'oracle%'
      AND program NOT LIKE 'sqlplus%'
    ORDER BY LOGON_TIME
    /

  • 5. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    robinsc Explorer

    I was wondering whether asynchronous I/O might have something to do with it, since a single session can have multiple outstanding I/O requests pending at once, and each would contribute to the overall wait time.
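    A rough sketch of that idea with made-up numbers (the request count and per-request queue time are hypothetical; only the session count of 2 comes from the thread):

```python
# With asynchronous I/O a session keeps several requests outstanding at
# once; if IORM accounts the queue time of every request, the per-request
# waits sum well past wall-clock time even with only 2 sessions.
sessions = 2                    # the two CG1 RMAN channels from the thread
requests_per_session = 254_069  # hypothetical request count over ~2 hours
avg_wait_per_request_s = 0.05   # hypothetical 50 ms IORM queue time each

cumulative_wait_h = sessions * requests_per_session * avg_wait_per_request_s / 3600
wall_clock_h = 2
print(round(cumulative_wait_h, 2))  # about 7.06, far beyond the 2-hour window
```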

  • 6. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    KarlArao Oracle ACE

    BTW, if the backup IOs are killing your other processes, you can actually map the RMAN function to a consumer group and have it throttled.

    Check it here http://goo.gl/7akyRa and in the wiki link I sent you.


    -Karl

  • 7. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    user13040671 Newbie

    Karl, it's actually the opposite... the EVIL app DBAs :-) have used the mapping to deny the RMAN operation any kind of decent IO rationing, with the following mapping:

     

    ATTRIBUTE            VALUE       CONSUMER_GROUP     STATUS
    -------------------- ----------- ------------------ ------
    ORACLE_FUNCTION      BACKUP      CG1
    ORACLE_FUNCTION      COPY        CG1


    And CG1 was capped and placed inside a sub-plan inside a sub-plan inside a sub-plan (at the MGMT_P4 level)!

    I am making the case for how many hours the RMAN CG1 actually spent waiting to be scheduled by the cell IORM.

     

    AWR analysis showed the top timed event was the "Other: imm op" wait event.

    Since this is Exadata, that wait event does mean it was throttled, doesn't it?

     

    Or should we just take the AWR event at face value (as if tuning a non-Exadata system) and go the regular route of bumping dbwr_io_slaves up to 2?

  • 8. Re: 7 hours cumulative wait for CG_IO_WT_LG cell metric in 2 hours wall-clock time?
    Sujay Palavalli-Oracle Newbie

    This metric specifies the cumulative number of milliseconds that large I/O requests issued by the consumer group have waited to be scheduled by IORM. A large value indicates that the I/O workload from this consumer group is exceeding the allocation specified for it in the database resource plan.
