3 Replies - Latest reply: Aug 23, 2011 12:41 AM by User236473

System Hang, kernel panic.

User236473 Newbie
Hello list,

Doing some independent research into an issue we are having here. We have 21 x4540s running 5.10 Generic_141445-09. They are used as NFSv4 servers backing our front-end servers. The x4540s that are used for mail (postfix+dovecot) periodically hang; the others do not.

Since they hang, it has been difficult to diagnose what the issue could be. We have "set pcplusmp:apic_panic_on_nmi=1" in /etc/system, and I try to issue "set generate_host_nmi = true" in ILOM, but by then the machine appears to be too far gone to manage even that.
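For reference, the two pieces look roughly like this (the exact ILOM property path is from memory and varies by firmware version, so treat that part as an assumption to be verified):

In /etc/system (comment lines start with "*"):

* Panic, and therefore take a crash dump, when an NMI arrives
set pcplusmp:apic_panic_on_nmi=1

From the ILOM CLI (on our firmware the property seems to live under /SP/diag):

-> set /SP/diag generate_host_nmi=true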

The only symptoms we have in the logs are:

Jul 24 07:50:49 x4500-11.unix inetd[292]: [ID 702911 daemon.error] Unable to fork inetd_start method of instance svc:/network/nfs/rquota:default: Resource temporarily unavailable
Jul 24 07:51:09 x4500-11.unix last message repeated 1 time
Jul 24 07:51:15 x4500-11.unix sshd[343]: [ID 800047 auth.error] error: fork: Error 0
Jul 24 07:54:25 x4500-11.unix last message repeated 4 times

A few times I have had a working shell connected when it happened, still running top, which shows it is not out of memory, not even close. No process has gone wild (pretty much just nfsd running). It would appear that it can't start new processes, but not due to lack of memory. Eventually it grinds to a halt, including the console. If I try to run "reboot" (or even reboot -d) it attempts to reboot but fails (just hangs). An ILOM "reset /SYS" will bring it all back again.

Currently this happens 2-3 times a month and, funnily enough, always at night. What does run at night is a recursive zfs snapshot, and the machine usually dies sometime after that. The process that always gets the first error is rquotad, but that could just be because it is constantly queried.
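The nightly job itself is nothing exotic; it boils down to a cron entry along these lines (the pool name is real, but the schedule and snapshot naming here are just illustrative):

# Recursive snapshot of the whole pool, run from root's crontab
0 3 * * * /usr/sbin/zfs snapshot -r zpool1@`date +\%Y\%m\%d`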

Not much to go on, alas.

Then one day, we got a panic dump. Is it the same problem, or something unrelated? Hard to know, but IF it were to be related...
::status
debugging crash dump vmcore.0 (64-bit) from x4500-13.unix
operating system: 5.10 Generic_144489-10 (i86pc)
panic message: BAD TRAP: type=8 (#df Double fault) rp=ffffffffc4abff10 addr=0
dump content: kernel pages only
::showrev
Hostname: x4500-13.unix
Release: 5.10
Kernel architecture: i86pc
Application architecture: amd64
Kernel version: SunOS 5.10 i86pc Generic_144489-10
Platform: i86pc
$<msgbuf
MESSAGE
fffffe80010a2ca8 dtrace:dtrace_dynvar+730 ()
  >> warning! 8-byte aligned %fp = fffffe80010a2d78
fffffe80010a2d78 dtrace:dtrace_dynvar+730 ()
  >> warning! 8-byte aligned %fp = fffffe80010a2e48
......snip...
  >> warning! 8-byte aligned %fp = fffffe80010a5948
fffffe80010a5948 dtrace:dtrace_probe+606 ()
  >> warning! 8-byte aligned %fp = fffffe80010a5968
fffffe80010a5968 fbt:fbt_invop+a8 ()
  >> warning! 8-byte aligned %fp = fffffe80010a5998
fffffe80010a5998 unix:dtrace_invop+3b ()
fffffe80010a5ae0 unix:invoptrap+108 ()
fffffe80010a5b30 genunix:fop_getpage+47 ()
fffffe80010a5d00 genunix:segvn_fault+8b0 ()
fffffe80010a5dc0 genunix:as_fault+205 ()
fffffe80010a5e20 unix:pagefault+8b ()
fffffe80010a5f00 unix:trap+3d7 ()
fffffe80010a5f10 unix:cmntrap+140 ()
::panicinfo
             cpu                9
          thread fffffe88e5265180
         message BAD TRAP: type=8 (#df Double fault) rp=ffffffffc4abff10 addr=0
             rdi fffffe8f5496fa70
             rsi                2
             rdx fffffe80010a5768
fffffe88e5265180::walk thread
fffffe8946f791e8 mdb: failed to read thread at 100000000: no mapping for address
::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R      0      0      0      0      0 0x00000001 fffffffffbc273c0 sched
R     77      0      0      0      0 0x00020001 ffffffffc9e73c80 zpool-zpool1
R      3      0      0      0      0 0x00020001 ffffffffc1f86e10 fsflush
R      2      0      0      0      0 0x00020001 ffffffffc1f87a78 pageout
R      1      0      0      0      0 0x4a004000 ffffffffc1f886e0 init
R  17651      1  17651  17635      0 0x4a014000 fffffe88d4b8e030 bash
R  14059  17651  17651  17635      0 0x4a004000 fffffe8ab1817e60 head
R  14058  17651  17651  17635      0 0x4a004000 fffffe8946f791e8 sort
R  14057  17651  17651  17635      0 0x4a004000 fffffe88dc9aa040 awk
R  14056  17651  17651  17635      0 0x4a004000 ffffffffc5a9f6e8 ps
R   3534      1   3534   3508      0 0x4a014000 fffffe8ab126dcb8 bash
R   3535   3534   3535   3508      0 0x4a004000 fffffe88dc9b6710 dtrace
R    431      1    431    431      0 0x42000000 fffffe88dc9b2908 fmd
R    458      1    458    458  60001 0x52000000 fffffe88d4b95028 nrpe
R    441      1    441    441     25 0x52010000 fffffe88dc9b1ca0 sendmail
R    440      1    439    439      0 0x42000000 fffffe88dc9aeab0 snmpd
R    438      1    438    438      0 0x42000000 fffffe88dc9ad1e0 dmispd
R    424      1    424    424      1 0x42000000 fffffe88d4b968f8 nfsd
R    415      1    415    415      0 0x42000000 ffffffffc5a9ea80 mountd
R    406      1    406    406      0 0x42010000 fffffe88c3d40020 snmpdx
R    390      1    390    390      0 0x4a014000 ffffffffc1f861a8 vold
R    381      1    381    381      0 0x42000000 fffffe88c3d456f8 sshd
R  15695    381    381    381      0 0x42010000 fffffe8946f77918 sshd
R  15696  15695    381    381    219 0x52010000 fffffe8ab1819730 sshd
R  15702  15696  15702  15702    219 0x4a014000 fffffe88d4b90568 bash
R   6927    381    381    381      0 0x42010000 ffffffffc9e748e8 sshd
R   6928   6927    381    381    159 0x52010000 fffffe8946f76048 sshd
R   6934   6928   6934   6934    159 0x4a014000 ffffffffc5a9ac78 bash
R    312      1    312    312      0 0x42000000 fffffe88dc9b3570 syslogd
R    306      1    306    306      1 0x42000000 fffffe88dc9b4e40 lockd
R    297      1    297    297      0 0x42000000 ffffffffc5a9c548 utmpd
R    270      1    270    270      0 0x42000000 ffffffffc9e75550 inetd
R    254      1    250    250      1 0x42000000 ffffffffc1f848d8 nfs4cbd
R    253      1    253    253      1 0x42000000 fffffe88c3d46360 statd
R    252      1    252    252      1 0x52000000 fffffe88d4b911d0 nfsmapid
R    247      1    247    247      1 0x42000000 fffffe88d4b94370 rpcbind
R    246      1    246    246      0 0x42010000 fffffe88c3d40c88 cron
R    225      1    225    225      0 0x42010000 fffffe88d4b9b368 xntpd
R    180      1    178    178      0 0x42000000 fffffe88d4b981c8 iscsid
R    147      1    147    147      0 0x42000000 fffffe88d4b97560 picld
R    143      1    143    143      1 0x42000000 fffffe88d4b9a700 kcfd
R    134      1    134    134      0 0x42000000 fffffe88c3d42558 syseventd
R    141      1    141    141      0 0x42000000 fffffe88d4b98e30 nscd
R     54      1     54     54      0 0x42000000 ffffffffc9e77a88 devfsadm
R     11      1     11     11      0 0x42000000 ffffffffc1f83c70 svc.configd
R      9      1      9      9      0 0x42000000 ffffffffc1f85540 svc.startd
R    286      9    286    286      0 0x4a004000 fffffe88dc9b7378 sh
R    261      9    261    261      0 0x4a014000 fffffe88d4b8f900 sac
R    305    261    261    261      0 0x4a014000 fffffe88d4b91e38 ttymon
R      5      0      0      0      0 0x00020001 ffffffffc1f89348 zpool-zboot
A double fault could perhaps indicate that we have a stack overflow in the kernel? Can I use mdb to dump all of the currently used stack memory, and collect that data to see if there is a leak? I could also double the stack size and see whether the period between failures also doubles.
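In case it helps anyone reading along, my rough plan for the stack question is the following (the dcmds are standard mdb; the stack-size tunable name is my assumption and needs verifying before we touch /etc/system):

Walk every kernel thread in the dump and print its stack, looking for unusually deep traces:

::walk thread | ::findstack

And the "double the stack size" experiment would be an /etc/system entry along these lines:

* Assumed tunable name; value roughly double the current default
set default_stksize=0xa000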


Any other suggestions that might help in locating the trouble we are seeing?
  • 1. Re: System Hang, kernel panic.
    User236473 Newbie
    Same thing again; this time we managed to push the NMI early enough to get a dump:
    Aug 14 05:22:43 x4500-19.unix sshd[458]: [ID 800047 auth.error] error: fork: Error 0
    
    mdb: $c
    vpanic()
    apic_nmi_intr+0x65()
    av_dispatch_nmivect+0x1f()
    
    mdb: $<msgbuf
    
    WARNING: nfsauth upcall failed: RPC: Operation in progress
    WARNING: nfsauth upcall failed: RPC: Operation in progress
    WARNING: nfsauth upcall failed: RPC: Operation in progress
    
    panic[cpu2]/thread=fffffe8000470c60: 
    pcplusmp: NMI received
    
    
    mdb: ::memstat                           
    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                    8121785             31725   97%
    ZFS File Data                7935                30    0%
    Anon                         3017                11    0%
    Exec and libs                 529                 2    0%
    Page cache                     42                 0    0%
    Free (cachelist)            15210                59    0%
    Free (freelist)            237862               929    3%
    
    Total                     8386380             32759
    Physical                  8177485             31943
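    Given that ::memstat shows the kernel itself holding ~31 GB (97%) of RAM, the obvious next step while poking at this dump is a kmem cache breakdown to see where that memory sits. These are standard mdb dcmds, nothing exotic:

    mdb: ::kmastat
    (per-cache allocation statistics; look for caches with very large memory-in-use totals)

    mdb: ::kmem_cache
    (lists the kmem caches, to pick candidates for closer inspection)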
  • 2. Re: System Hang, kernel panic.
    SteveS Pro
    869403 wrote:
    Same thing again, this time we managed to push NMI early enough to get a dump:
    Doc ID:1334490.1 - "Solaris 10 and Solaris Express x86 Servers With Certain Patches Installed May Exhibit Clock or I/O Latencies or System Hang When ACPI Power Management is Enabled" is a common reason why hangs occur on x86 and can easily be identified by looking at the dispatch queue to see if the CPU's are running the (idle) thread whilst we have runnable threads in the dispatch queue. The following shows a healthy system:
    ::cpuinfo -v
    ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
     0 fffffffffbc3ebc0  1b    0    0  -1   no    no t-6    ffffff0006405c20 (idle)
                      |
           RUNNING <--+
             READY
            EXISTS
            ENABLE

    ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
     1 ffffff019de47ac0  1b    0    0  55   no    no t-0    ffffff019f83d060 mdb
                      |
           RUNNING <--+
             READY
            EXISTS
            ENABLE
    From what you describe, this could also be a system resource exhaustion issue, caused either by running out of memory or by hitting the maximum number of processes/LWPs for the current project/system.
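    A quick way to sanity-check those limits on the running host (standard commands; the project name below is only an example) is:

    # Rough current process count (header line included) vs. the system-wide limit (v_proc)
    echo "::ps" | mdb -k | wc -l
    kstat -p unix:0:var:v_proc

    # Any per-project or per-zone LWP caps in effect
    prctl -n project.max-lwps -i project default
    prctl -n zone.max-lwps -i zone global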

    It's probably best if you can open a service request and provide the crash dump and an explorer; we can then review the data and hopefully advise on why the system hung.

    When creating the SR make sure you use either the CSI for Solaris support or the system Serial Number. Choose Product/Component "Solaris SPARC Operating System"/"System Hang" and it'll route to the Kernel group.
  • 3. Re: System Hang, kernel panic.
    User236473 Newbie
    SteveS,

    Thanks for the reply. I have handed it off to local support, and I believe they have forwarded it to Oracle Japan. Possibly it goes through a lot of language translation along the way, so it is a little like playing Telephone.

    I was trying to decipher your ::cpuinfo -v information to see whether I could determine if we are having this issue, but I am unsure what to look for in an unhealthy system.

    This is the output I get:
     ::cpuinfo -v                  
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      0 fffffffffbc29320  1f    1    0  -1   no    no t-3    fffffe8000005c60
     (idle)
                      |    |
           RUNNING <--+    +-->  PRI THREAD      PROC
             READY                60 fffffe80000b9c60 sched
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      1 ffffffffc3af2000  1f    1    0  -1   no    no t-3    fffffe80003bec60
     (idle)
                      |    |
           RUNNING <--+    +-->  PRI THREAD      PROC
             READY                60 fffffe800014fc60 sched
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      2 fffffffffbc310a0  1b    1    0  -1   no    no t-81   fffffe8000470c60
     (idle)
                      |    |
           RUNNING <--+    +-->  PRI THREAD      PROC
             READY                60 fffffe80001a3c60 sched
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      3 ffffffffc4270800  1f    0    0  -1   no    no t-123  fffffe80004cac60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      4 ffffffffc4270000  1f    0    0  -1   no    no t-77   fffffe8000561c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      5 ffffffffc44be800  1f    1    0  -1   no    no t-0    fffffe80005a9c60
     (idle)
                      |    |
           RUNNING <--+    +-->  PRI THREAD      PROC
             READY                60 fffffe8000bbdc60 sched
          QUIESCED         
            EXISTS         
            ENABLE         
                                          
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      6 ffffffffc44be000  1f    0    0  -1   no    no t-41   fffffe8000445c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      7 ffffffffc4640000  1f    0    0  -1   no    no t-24   fffffe8000638c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      8 ffffffffc4633800  1f    0    0  -1   no    no t-57   fffffe80006c9c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE                        
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
      9 ffffffffc4633000  1f    0    0  -1   no    no t-23   fffffe8000739c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
     10 ffffffffc46ae800  1f    0    0  -1   no    no t-28   fffffe80007cac60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS         
            ENABLE         
    
     ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
     11 ffffffffc46ae000  1f    0    0  -1   no    no t-7    fffffe8000866c60
     (idle)
                      |    
           RUNNING <--+    
             READY         
          QUIESCED         
            EXISTS                        
            ENABLE         
    As for "ps" we only have 42 procs, and threads the biggest process is "nfsd" (can use up to 1024, but hovers around 300). We are not hanging (or even considering the fork error) due to normal resource limits.

    While waiting, I am poking around in the dump to see if we can find anything interesting. Last time, Telka managed to find out about the OpenOwner lock leaks, which was encouraging.
