5 Replies Latest reply: Jan 14, 2010 6:10 PM by 807559 RSS

    iopending oddness

    gleng
      I've been using various Dtrace scripts to try and shed some light on a performance problem with a 6140, a couple of V490s and Oracle. This is the first time I've used Dtrace so maybe I'm missing something obvious ...

      The script iopending seems to work for awhile and then as the server gets busy it starts showing only >=32. Here's some resent output:
      2010 Jan 12 16:50:42,  load: 3.42,  disk_r:    392 KB,  disk_w:   1690 KB
      
                 value  ------------- Distribution ------------- count
                    28 |                                         0
                    29 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     35618
                    30 |@@@@                                     4157
                    31 |                                         86
                 >= 32 |                                         139
      
      2010 Jan 12 16:50:47,  load: 3.39,  disk_r:    320 KB,  disk_w:   1443 KB
      
                 value  ------------- Distribution ------------- count
                    28 |                                         0
                    29 |@@@@@@@@@@@@@@@                          14958
                    30 |@@@@@@@@@@@@@@@@@@@@@@@@                 24003
                    31 |@                                        699
                 >= 32 |                                         340
      
      2010 Jan 12 16:50:52,  load: 3.36,  disk_r:    280 KB,  disk_w:   5052 KB
      
                 value  ------------- Distribution ------------- count
                    29 |                                         0
                    30 |@                                        1387
                    31 |@@@@@@@@@@@@@@@@@@@@@@                   22333
                 >= 32 |@@@@@@@@@@@@@@@@                         16280
      
      2010 Jan 12 16:50:57,  load: 3.37,  disk_r:    208 KB,  disk_w:   2077 KB
      
                 value  ------------- Distribution ------------- count
                    31 |                                         0
                 >= 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 40000
      
      2010 Jan 12 16:51:02,  load: 3.38,  disk_r:    112 KB,  disk_w:   1364 KB
      
                 value  ------------- Distribution ------------- count
                    31 |                                         0
                 >= 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 40000
      
      2010 Jan 12 16:51:07,  load: 3.34,  disk_r:   1424 KB,  disk_w:   2531 KB
      
                 value  ------------- Distribution ------------- count
                    31 |                                         0
                 >= 32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 40000
      If I stop the script (cntl-c) and rerun it works for awhile.


      Is there some basic limitation with Dtrace?

      I'm I expecting too much by thinking Dtrace scripts should run indefinitely.

      I've run into a similar problem with one called process_io_top.



      Thanks for any insight
      Glen
        • 1. Re: iopending oddness
          807559
          What value are you lquantizing? A look at the aggregation you're using would be helpful.
          • 2. Re: iopending oddness
            gleng
            Here's the lquantize statement:
             @out = lquantize(pending, 0, MAX_PENDING, 1);
            Thanks,
            Glen

            I see from the form header
            IMPORTANT CHANGE: To get your questions answered and continue discussion on Solaris Dynamic Tracing Framework, please visit the OpenSolaris Forums
            that the Dtrace community should move to the OpenSolaris discussion area.

            Do you follow any of the OpenSolaris discussions?

            Edited by: sysglen on Jan 13, 2010 9:39 AM
            • 3. Re: iopending oddness
              807559
              Well, that's a peculiar thing.

              Ok, so you create a linear distribution from pending to MAX_PENDING. What you are plotting along that chart appears to be the number of probe hits for which a given pending value applies. It seems at some point that you exceed whatever MAX_PENDING is, and so your distribution literally goes off the chart.

              Can you show the whole probe clause, including the value of MAX_PENDING?

              The OpenSolaris DTrace discussion group isn't terribly active, but I do subscribe to it. Lots of high-powered DTrace talent listening to. You can get to it using this link:

              http://www.opensolaris.org/jive/forum.jspa?forumID=7

              Regards,

              Michael
              • 4. Re: iopending oddness
                gleng
                The script is from the example collection [DTraceToolkit-0.99.tar.gz|http://hub.opensolaris.org/bin/view/Community+Group+dtrace/dtracetoolkit] .


                MAX_PENDING is defined and referenced once:

                :g/MAX_PENDING/nu
                134 inline int MAX_PENDING = 32; /* max pending value */
                212 @out = lquantize(pending, 0, MAX_PENDING, 1);

                This is the parts that update pending:
                 io:genunix::start
                 /this->ok/
                 {
                        /* track bytes */
                        disk_r += args[0]->b_flags & B_READ ? args[0]->b_bcount : 0;
                        disk_w += args[0]->b_flags & B_READ ? 0 : args[0]->b_bcount;
                
                        /* increase event pending count */
                        pending++;
                 }
                
                 /*
                  * Process and Print completion
                  */
                 io:genunix::done
                 /this->ok/
                 {
                        /* decrease event pending count */
                        pending--;
                 }
                
                 /*
                  * Prevent pending from underflowing
                  * this can happen if this program is started during disk events.
                  */
                 io:genunix::done
                 /pending < 0/
                 {
                        pending = 0;
                 }
                Thanks,
                Glen
                • 5. Re: iopending oddness
                  807559
                  Hm, interesting. So, the script increments the variable pending when io:genunix::start fires, and decrements when io:genunix::done fires.

                  As that number is bounced by I/O events, the script samples over time. In your script, it appears you sample 40k times over 5-second interval. The quantization then distributes pending over the range 0-31. That is, if the number of pending events at 4 sampling points is 12, you'll get a

                  value ------------------Distribution---------------- count
                  ...
                  12 | 4
                  ...

                  in one of your graphs. You won't get an "@" for hitting .0001 of total though.

                  Using this script and a "find / -type f" command, I was able to exceed MAX_PENDING several times over the life of the run. So MAX_PENDING is not a hard limit. The variable pending is set to zero when the script starts, so that explains why restarting "clears" the symptom you're seeing.

                  So it looks rather like an "I/O leak," if you will. Over time, the value of pending gets above 32 and -- I would guess -- keeps rising. Something that increments your count isn't getting decremented.

                  I have some thoughts on what's causing that; unfortunately, no time to investigate. But it would be a pretty good question to pose on the OpenSolaris discussion group. I wouldn't be surprised if one of the heavyweights chimed in on the problem in less than a few hours.