1 Reply Latest reply: Oct 21, 2011 5:31 PM by SteveS

    bad trap after halting a zone on solaris s10u10

    895708
      I get this panic shortly after stopping a zone on Solaris 10 Update 10:

      ===
      panic[cpu0]/thread=fffffe80001d9c60: BAD TRAP: type=e (#pf Page fault) rp=fffffe80001d91f0 addr=61ea84ef occurred in module "<unknown>" due to an illegal access to a user address

      sched: #pf Page fault
      Bad kernel fault at addr=0x61ea84ef
      pid=0, pc=0x61ea84ef, sp=0xfffffe80001d92e8, eflags=0x10283
      cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6b0<xmme,fxsr,pge,pae,pse>
      cr2: 61ea84ef cr3: ad2f000 cr8: c
      rdi: ffffffff8b4cc10c rsi: ffffffff8e52a72c rdx: fffffffffffffffc
      rcx: 6666e2ff r8: fffffe80001d9368 r9: fffffe80001d936c
      rax: ffffffff8b4cc080 rbx: fffffffffffffffc rbp: fffffe80001d9310
      r10: 61ea84ef r11: 1 r12: ffffffff8e52b740
      r13: ffffffff8dc68180 r14: 2c r15: 28
      fsb: 0 gsb: fffffffffbc29580 ds: 43
      es: 43 fs: 0 gs: 1c3
      trp: e err: 10 rip: 61ea84ef
      cs: 28 rfl: 10283 rsp: fffffe80001d92e8
      ss: 30

      fffffe80001d9100 unix:die+da ()
      fffffe80001d91e0 unix:trap+5e6 ()
      fffffe80001d91f0 unix:cmntrap+140 ()
      fffffe80001d9310 61ea84ef ()
      fffffe80001d93a0 ipf:fr_pullup+124 ()
      fffffe80001d93f0 ipf:frpr_ipv4hdr+8d0 ()
      fffffe80001d9430 ipf:fr_makefrip+178 ()
      fffffe80001d9570 ipf:fr_check+149 ()
      fffffe80001d95d0 ipf:ipf_hook+88 ()
      fffffe80001d95e0 ipf:ipf_hook4_out+16 ()
      fffffe80001d9620 hook:hook_run+6c ()
      fffffe80001d9750 ip:ip_wput_frag+360 ()
      fffffe80001d97b0 ip:ip_wput_ire_fragmentit+140 ()
      fffffe80001d9930 ip:ip_wput_ire+2e3c ()
      fffffe80001d9a20 ip:ip_output_options+1dd2 ()
      fffffe80001d9a30 ip:ip_output+10 ()
      fffffe80001d9b00 ip:tcp_send_data+df ()
      fffffe80001d9b60 ip:tcp_timer+52f ()
      fffffe80001d9b80 ip:tcp_timer_handler+28 ()
      fffffe80001d9bf0 ip:squeue_drain+f0 ()
      fffffe80001d9c40 ip:squeue_worker+11d ()
      fffffe80001d9c50 unix:thread_start+8 ()

      syncing file systems... 125 65 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 done (not all i/o completed)
      dumping to /dev/dsk/c1t0d0s1, offset 806092800, content: kernel + curproc
      0:03 100% done
      100% done: 81427 pages dumped, dump succeeded
      rebooting...
      ===

      I have a syslog daemon running in the global zone that receives messages forwarded from another syslog daemon running in the zone. Interestingly, after the zone is halted, an ESTABLISHED TCP connection is still visible in the global zone (scrooge-ext is the name of the halted zone):
      ===
      root@scrooge:/[188] # lsof | grep scrooge-ext
      syslogp 1276 root 21u IPv4 0xffffffff8dd2fcc0 0t2693 TCP 127.1.0.2:shell->scrooge-ext:32777 (ESTABLISHED)
      ===

      As soon as I kill the global-zone syslog process shown above, the panic occurs.

      Interestingly, if I remove the TCP connection with tcpdrop before killing the syslog process, the panic does not occur.

      Has anyone seen such a panic before?

      BTW: The same setup with s10u9 did not cause this problem.
        • 1. Re: bad trap after halting a zone on solaris s10u10
          SteveS
          Hello Nada,

          This isn't something we can easily investigate or answer in a forum. The crash dump needs to be analysed, and further investigation is necessary. I'm not finding any bugs based on the stack alone.
          occurred in module "<unknown>" due to an illegal access to a user address
          The fact that we see module "<unknown>" means that whatever module this is, it's either been modunloaded or swapped out. It's also the reason why we can't decode the module:function in the stack:
          fffffe80001d91f0 unix:cmntrap+140 ()
          fffffe80001d9310 61ea84ef ()    <--- this is the unknown module
          fffffe80001d93a0 ipf:fr_pullup+124 ()
          fffffe80001d93f0 ipf:frpr_ipv4hdr+8d0 ()
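          The trap registers in the panic output give one more hint about that frame. The following is a minimal sketch in Python, not part of the original posts, that decodes the x86 page-fault error code (err: 10) using the architectural bit layout and compares rip with cr2. The table and the helper name decode_pf_error are illustrative assumptions; the register values are copied from the panic message above.

```python
# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
# The helper and table names are hypothetical; values come from the panic output.
PF_FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read or fetch"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", "data access"),
]


def decode_pf_error(err):
    """Return a list of human-readable meanings for a page-fault error code."""
    meanings = []
    for mask, if_set, if_clear in PF_FLAGS:
        text = if_set if err & mask else if_clear
        if text is not None:
            meanings.append(text)
    return meanings


# Values taken from the panic message ("err: 10", "rip: 61ea84ef", "cr2: 61ea84ef").
err = 0x10
rip = 0x61EA84EF
cr2 = 0x61EA84EF

print(decode_pf_error(err))
print("faulting address equals instruction pointer:", rip == cr2)
```

          With err=0x10 this reports a kernel-mode instruction fetch from a page that is not present, and rip equals cr2, so the CPU faulted while trying to execute at 0x61ea84ef itself. That is consistent with the undecodable frame, but it does not identify the cause on its own; that still needs the crash dump.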
          From what you've said, this is reproducible. That makes it much easier to set up debugging instrumentation and collect the data we need to find the root cause.

          Assuming you have a Premium contract for either the software or the system itself, please raise a software service request with the network group and provide us with an explorer from the global zone and the crash dump(s). One of the network team can then investigate further.

          Regards,
          Steve