6 Replies Latest reply: May 5, 2010 9:48 AM by 807559 RSS

    Dtrace and looping processes

    gleng
      I'm looking for some tools to help explain why a process is looping. Is the following case something that Dtrace could help explain?
      ps -e -o pid,ppid,nlwp,lwp,time,etime,stime,pcpu,comm | egrep 'PPID|BIB'
        PID  PPID NLWP    LWP        TIME     ELAPSED    STIME %CPU COMMAND
      22864 21691   22      1       00:03    18:03:29 16:20:46  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      2       00:01    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      3       00:07    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      4       00:02    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      5       00:01    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      6       00:17    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      7       00:17    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      8       00:00    18:03:28 16:20:47  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22      9       00:03    17:58:59 16:25:16  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     10       00:01    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     11       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     12       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     13       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     14    17:32:45    17:58:58 16:25:17 11.5 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     15       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     16       00:17    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     17       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     18       07:59    17:58:58 16:25:17 11.6 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     19    17:46:55    17:58:58 16:25:17 11.6 /u01/app/cognos/c83/bin/BIBusTKServerMain
      22864 21691   22     20       00:16    17:58:58 16:25:17  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      1       00:03    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      2       00:00    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      3       00:17    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      4       00:00    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      5       00:16    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      6       00:01    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      7       00:16    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      8       00:18    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24      9       00:16    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     10       00:16    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     11       00:16    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     12       00:01    18:11:45 16:12:30  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     13       00:01    18:07:40 16:16:35  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     14       00:00    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     15       00:17    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     16       00:17    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     17       00:17    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     18       00:16    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     19       00:23    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     20       00:20    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     21       00:17    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     22       00:17    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     23       00:21    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      21787 21691   24     24       00:16    18:07:39 16:16:36  0.0 /u01/app/cognos/c83/bin/BIBusTKServerMain
      prstat -amL
         PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
       22864 cognos8  100 0.0 0.0 0.0 0.0 0.0 0.0 0.0  31 118 276   0 BIBusTKServe/19
       22864 cognos8  100 0.0 0.0 0.0 0.0 0.3 0.0 0.0 246 193 276   0 BIBusTKServe/14
        [snip]
      
      Total: 166 processes, 1684 lwps, load averages: 2.24, 2.25, 2.24
      Thanks for any insight,
      Glen
        • 1. Re: Dtrace and looping processes
          807559
          Hi Glen -

          What do you mean by 'looping'?

          It seems the processes listed here each have several LWPs. For some reason LWPs 22864/21 & 22 aren't listed. The LWPs 22864/14 and 22864/19 consume a lion's share of CPU attention, followed at a distance by 22864/18.

          So, do you want more information on what they're doing?
          • 2. Re: Dtrace and looping processes
            gleng
            Michael_Ernest wrote:
            Hi Glen -

            What do you mean by 'looping'?

            It seems the processes listed here each have several LWPs. For some reason LWPs 22864/21 & 22 aren't listed. The LWPs 22864/14 and 22864/19 consume a lion's share of CPU attention, followed at a distance by 22864/18.

            So, do you want more information on what they're doing?
            Yes.

            They are not behaving "normally". The only way we have found to recover is to stop and start the application. I was hoping to gain some insight into why they are consuming all the CPU (USR is 100%)

            Thanks,
            Glen
            • 3. Re: Dtrace and looping processes
              807559
              If the system is pegging out on usr utilization, it suggests there's some runaway compute-intensive thing going on. One way to attack that is by profiling the stack traces of the process to find out what's spinning so intensely.

              sched:::on-cpu
              / execname == "IBBusTKConfig" /
              {
              @[ustack()] = count();
              }

              profile:::tick-30sec
              {
              trunc(@,10)l
              exit(0);
              }
              • 4. Re: Dtrace and looping processes
                gleng
                Thanks,

                The problem seems to happen just once or twice a year, so not sure when I'll get a chance to try your suggestion.


                have a good,
                Glen
                • 5. Re: Dtrace and looping processes
                  807559
                  Hi,
                  I too got a similar situation . Is it possible with dtrace to find out the frequency of each of the function calls made at the application level and also probably the time spent in the function calls ?

                  Thanks.
                  • 6. Re: Dtrace and looping processes
                    807559
                    Sure, although it's difficult to get clean output -- to a console, anyway -- on both metrics over a large number of functions. A standard profiling exercise determines the frequency of each call or the total time spent in each call. From either result you can then determine 5-10 calls to focus on, then run the other test.

                    For frequency of calls on an application you might try:
                    pid$target:::entry
                    {
                       @fcns[probefunc] = count();
                    }
                    dtrace:::END
                    {
                       trunc(@fcns, 10);
                    }
                    This will output the ten most-frequently called functions in your program. You can execute it either by starting the process you want to instrument or by attaching to a live one, as follows:
                    # dtrace -q -s myscript.d -c <program_to_start>
                    
                    # dtrace -q -s myscript.d -p <running_pid>
                    To measure the aggregate time spent in each function:
                    pid$target:::entry
                    {
                       self->interested = timestamp;
                    }
                    pid$target:::return
                    /self->interested/
                    {
                       @duration[probefunc] = sum(timestamp - self->interested);   
                    }
                    dtrace:::END
                    {
                       trunc(@duration,10);
                    }
                    I've rattled these above scripts off the top of my head so there may be an error somewhere in there. I'll leave the fixing as an exercise to the reader. :-)