2 Replies Latest reply: Nov 8, 2011 12:58 AM by 898265 RSS

    5-seconds spikes inside sbrk

    898265
      I've been investigating performance problem with my application and narrowed it down to the
      following test case: a single-threaded 64bits application allocates memory in a loop(that's all it
      does) and it experiences spikes during those allocations. The weird part is that these
      spikes occur exactly every 60 seconds, and this spike interval does not depend on the
      allocation rate, does not depend on allocation library I use (libc/libumem /libmtmalloc).
      So it looks like some process performs fishy activity every 60 seconds.
      Spikes starts at 20 msec and grow with time, reaching *5 seconds* by the time when the
      application consumes 15G.

      I run it on 16-cores box with 64G memory under SonOS 5.10. The box is practically unoccupied with lots of free memory.

      I used dtrace to drill my application and found that the time was spent inside libc.so.1`sbrk call.
      I measured the time between acquiring __sbr_lock and releasing it, so the time is not spent spinning on __sbr_lock.
      Here is a stack trace at the moment of releasing the lock:

      libc.so.1`lmutex_unlock+0xa3
      libc.so.1`sbrk+0x2f
      libc.so.1`_morecore+0x12b
      libc.so.1`_malloc_unlocked+0x10d
      libc.so.1`_smalloc+0x6d
      libc.so.1`_malloc_unlocked+0x1ad
      libc.so.1`malloc+0x2d
      libCrun.so.1`void*operator new(unsigned long)+0x26
      spike_test`void iter()+0x2d
      spike_test`main+0x34
      spike_test`0x401fdc
      tid=1 time_ms=543, lock_ptr=fffffd7ffeff1cf0d


      What happens inside libc.so.1`sbrk is that after grabing the lock and it makes sbrk call system,
      and that's where the problem occurs. Unfortunately, I do not have privileges to trace the
      kernel, so I cannot say that happens there.

      I ran monitoring tools during the spikes (vmstat, mpstat, iostat, netstat), almost everything
      looks normal, a couple thing stands out thought:

      1. vmstat 1 shows that during the spike a small burst of reclaimed pages.
      (In this example spike happens in lines# 2-5)

      kthr memory page disk faults cpu
      r b w swap free re mf pi po fr de sr s0 s1 s3 -- in sy cs us sy id
      0 0 0 58341712 48717536 0 580 0 0 0 0 0 0 4 0 0 2048 3722 1484 0 0 99
      1 0 0 58339404 48715216 0 583 0 0 0 0 0 0 0 0 0 2109 3240 1548 0 0 100
      2 0 0 58337040 48712848 1367 5152 0 8 8 0 0 0 0 0 0 2219 8471 1862 0 1 98
      1 0 0 58332608 48708460 75 1057 0 0 0 0 0 0 0 0 0 2045 2904 1426 0 7 93
      0 0 0 58328508 48708448 0 0 0 0 0 0 0 0 0 0 0 2047 1746 1419 0 7 93
      0 0 0 58328364 48708304 107 1681 0 0 0 0 0 0 0 0 0 2284 5542 1795 0 2 98
      0 0 0 58328132 48708012 65 1014 0 0 0 0 0 0 0 0 0 2232 5042 1590 0 0 99

      2. mpstat -a 1 shows one contention event in the kernel on R/W mutex(whis is not rare
      actually), and a small burst of system calls.

      SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl sze
      0 582 0 4 2383 1391 1440 0 21 7 0 3139 0 0 0 99 16
      0 4758 0 26 2119 1033 1546 0 33 53 0 6210 1 1 0 99 16
      0 582 0 2 2099 1058 1515 0 25 11 0 3742 0 1 0 99 16
      0 6130 0 43 2292 1033 1896 0 60 137 1 9264 0 2 0 97 16
      0 0 0 15 2026 975 1301 0 17 8 0 1926 0 7 0 92 16
      0 433 0 74 2315 1074 1551 0 22 30 0 3575 0 7 0 93 16

      3. And the most unusual thing is that prstat and top completely freeze during the spikes,
      so I cannot see what other processing are doing at that moment.

      It's clear that further analysis will require more privileges(dtrace and administrator's),
      which may or may not be granted to me. Either way I need to find out what causes the
      problem.

      I appreciate any help on how I approach the problem and any thoughts on what's going on.

      Edited by: 895262 on 06.11.2011 0:25

      Edited by: 895262 on 06.11.2011 0:27

      Edited by: 895262 on 06.11.2011 0:28
        • 1. Re: 5-seconds spikes inside sbrk
          800381
          Let me guess....

          ZFS root file system, with the machine's swap on ZFS devices....
          • 2. Re: 5-seconds spikes inside sbrk
            898265
            Investigation results:
            There was a script, which periodically (once in 60 seconds) started prstat with -Z option, which aggregates resource utilization per-zone.
            The implementation of this prstat functionality(-Z) appears to be very expensive since it holds some rw-lock inside the kernel for a few seconds in my case, causing contention, blocking our application during memory allocation.

            Edited by: 895262 on Nov 7, 2011 10:57 PM