3 Replies Latest reply: Jan 25, 2012 10:17 AM by 800381

    System does not respond anymore

    895623
      Hi,

      I'm running a Solaris 10 update 8 system on VMware. For some reason the OS freezes from time to time. Login is no longer possible, neither via ssh nor at the console (when it happens, I no longer get a prompt at the console at all). The system still responds to ICMP requests.
      I've enabled the deadman timer to force a crash, but it does not trigger; it seems the timer can still be updated. So I forced a panic from mdb ( $<systemdump ), but no dump was written at all. When the system was back online, the only interesting thing I could find was this part of the sar output:


                pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf

      09:34:00     0.00     0.00     0.00     0.00     0.00
      09:35:00     0.00     0.00     0.00     0.00     0.00
      09:36:01     0.05     0.70     1.43    85.15     0.00
      09:37:00     0.00     0.00     0.00     0.00     0.00
      09:38:00     0.00     0.00     0.00     0.00     0.00
      09:39:00     0.30     3.98     4.24   116.81     0.00
      09:40:00     0.83    11.64    13.17   221.03     0.00
      09:41:00     0.00     0.00     0.00     0.00     0.00
      09:19:27 unix restarts
      10:20:00     0.00     0.00     0.00     0.00     0.00
      10:21:00     0.00     0.00     0.00     0.00     0.00
      10:22:00     0.00     0.00     0.00     0.00     0.00


      The page scanner kicked in, but in the last minute reported before the hang the values are 0,
      and there is a jump in time at "unix restarts".

      So I have no clue where to start, since the lowest level for analysis, the crash dump, is not there.
      Does anyone have an idea how to deal with this?

      Thank you,

      Aaron

      Edited by: 892620 on Nov 10, 2011 9:44 AM
        • 1. Re: System does not respond anymore
          Soory
          I suggest you do an OS reinstall.
          • 2. Re: System does not respond anymore
            902756
            Reinstall is not the approach you would normally use with Solaris or other *nix operating systems as it is for popular desktop operating systems.

            The only times I've seen anything like this were either a hardware failure or a resource (particularly memory or disk space) being entirely consumed. In my case it was disk space. My problem was caused by time-slider taking snapshots that it wasn't able to destroy once the zpool exceeded its usage threshold (typically 80%), because of a zfs bug with destroying legacy zpool snapshots.

            I had symptoms similar to yours: ssh failed, but I happened to have a terminal that was still alive. Removing a large file won't necessarily free the space immediately, because zfs must keep the difference between the active file system and its last snapshot. So you must manually run zfs destroy -R {snapshot}.
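            Before destroying anything, it can help to see exactly which snapshots would go. The following is only a sketch, not a tested procedure: print_destroy_cmds is a hypothetical helper name, and it assumes tab-separated "name, used" lines as produced by zfs list -H -t snapshot -o name,used. It prints the destroy commands for one dataset instead of running them, and it deliberately prints a plain zfs destroy; add -R yourself if dependent clones should be removed as well. On Solaris 10 the awk call may need nawk or /usr/xpg4/bin/awk.

```shell
# Hypothetical helper: read "name<TAB>used" lines on stdin and print,
# without executing anything, one destroy command per snapshot that
# belongs to the dataset given as the first argument.
print_destroy_cmds() {
    awk -F '\t' -v ds="$1" '
        # A snapshot of the dataset is named <dataset>@<snapname>.
        index($1, ds "@") == 1 { print "zfs destroy " $1 " # used " $2 }
    '
}

# Example (review the output before running any of it):
#   zfs list -H -t snapshot -o name,used | print_destroy_cmds rpool/export
```

            The dataset name rpool/export in the example is a placeholder.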

            If you are using time-slider and might have some legacy snapshots, keep an eye on your system log to make sure there aren't errors in destroying them. Also use df -h / to make sure you aren't approaching 100% disk usage.

            Another problem can be caused by runaway applications filling up /tmp and thereby consuming all of your swap space. Run df -h /tmp and swap -l to see if this is your problem. Good luck!
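            The disk and /tmp checks above could be wrapped in a small filter. This is a rough sketch under assumptions: check_capacity is a made-up name, the 80% default simply mirrors the threshold mentioned earlier, and the input is assumed to have the usual df -h layout with the capacity in the fifth column and the mount point in the sixth, with no wrapped lines. It also assumes /tmp is the swap-backed tmpfs mount. On Solaris 10 the awk call may need nawk or /usr/xpg4/bin/awk.

```shell
# Hypothetical helper: read df -h style output on stdin and print the
# mount point and capacity of every file system at or above a limit
# (first argument, percent, default 80).
check_capacity() {
    awk -v limit="${1:-80}" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%/, "", pct)         # "85%" -> "85"
            if (pct + 0 >= limit + 0) print $6, $5
        }
    '
}

# Example:
#   df -h / /tmp | check_capacity 80
```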

            Edited by: bnitz on Jan 20, 2012 7:51 AM
            • 3. Re: System does not respond anymore
              800381
              bnitz wrote:
              Reinstall is not the approach you would normally use with Solaris or other *nix operating systems as it is for popular desktop operating systems.
              Ouch. But too true. :)

              The page scanner kicking off is the big clue - something used up all the box's memory.
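              To spot those scanner bursts without reading the whole report, something like the filter below might help. It is a sketch only: flag_pgscan is a hypothetical name, and it assumes the column order shown in the original post (time, pgout/s, ppgout/s, pgfree/s, pgscan/s, %ufs_ipf), so pgscan/s is the fifth field. Lines such as "unix restarts" or the averages do not match the pattern and are skipped. On Solaris 10 the awk call may need nawk or /usr/xpg4/bin/awk.

```shell
# Hypothetical helper: read sar -g output on stdin and print the time
# and pgscan/s value of every sample whose scan rate is above a limit
# (first argument, default 0).
flag_pgscan() {
    awk -v limit="${1:-0}" '
        # Data rows start with HH:MM:SS and have six fields;
        # field 5 is pgscan/s in the assumed column order.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && NF == 6 && $5 + 0 > limit + 0 {
            print $1, $5
        }
    '
}

# Example:
#   sar -g | flag_pgscan 50
```

              Sustained non-zero pgscan/s values right before a hang would be consistent with memory running out, but they do not say which process consumed it.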