6 Replies Latest reply on Feb 6, 2012 4:57 AM by 897085

    v480 server got autoreboot

    897085
      I have a v480 server that rebooted on its own, and I need to know why it went down. Please find the /var/adm/messages output below.

      test@v480#tail -500 /var/adm/messages
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 215022 kern.warning] WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU0 in Privileged mode at TL=0, errID 0x00a1c490.a57f6240
      Oct 14 13:06:29 v480 AFSR 0x00700004<DUE,ME,PRIV,UE>.000001fd AFAR 0x000000b0.03e65780 AMBIGUOUS
      Oct 14 13:06:29 v480 Fault_PC 0x11737d8 Esynd 0x01fd AMBIGUOUS
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 688199 kern.notice] [AFT1] errID 0x00a1c490.a57f6240 Two Bits were in error
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 628304 kern.info] [AFT2] errID 0x00a1c490.a57f6240 PA=0x000000b0.03e65780
      Oct 14 13:06:29 v480 E$tag 0x000002c0.0f920024 E$state_6 Modified
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00000000.00000000 0x00000300.1b20d520 ECC 0x04f
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x000002a1.0165dec0 0x000002a1.0165df80 ECC 0x1c7
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00000000.07f80000 0xc00000b1.f2000676 ECC 0x0a0
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00000300.04394000 0x00000300.10af7328 ECC 0x03f
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 319484 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00a1c490.a57f6240
      Oct 14 13:06:29 v480 AFSR 0x00700004<DUE,ME,PRIV,UE>.000001fd AFAR 0x000000b0.03e65780 AMBIGUOUS
      Oct 14 13:06:29 v480 Fault_PC 0x11737d8 Esynd 0x01fd AMBIGUOUS
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 688199 kern.notice] [AFT1] errID 0x00a1c490.a57f6240 Two Bits were in error
      Oct 14 13:06:29 v480 unix: [ID 321153 kern.notice] NOTICE: Scheduling clearing of error on page 0x000000b0.03e64000
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 678432 kern.notice] NOTICE: [AFT1] Invalid AFSR on Fast ECC Trap taken by CPU0 in Privileged mode at TL=0, errID 0x00a1c490.a57f6240
      Oct 14 13:06:29 v480 AFSR 0x00500000<DUE,PRIV>.00000071 AFAR 0x000000b0.03e65480
      Oct 14 13:06:29 v480 Fault_PC 0x1176dd8 Esynd 0x0071 Slot B: J3100 J3101 J3201 J3200
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 341500 kern.notice] [AFT1] errID 0x00a1c490.a58b3408 Two Bits in error, likely from E$ WDU/CPU
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 746322 kern.info] [AFT2] errID 0x00a1c490.a58b3408 PA=0x000000b0.03e65480
      Oct 14 13:06:29 v480 E$tag 0x000002c0.0f000120 E$state_2 Modified
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x00) 0xc0000000.00000000 0x00000000.00000000 ECC 0x000 Bad Esynd=0x071
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x10) 0xc0000000.00000000 0x00000000.00000000 ECC 0x000 Bad Esynd=0x071
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x20) 0xc0000000.00000000 0x00000000.00000000 ECC 0x000 Bad Esynd=0x071
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0xc0000000.00000000 0x00000000.00000000 ECC 0x000 Bad Esynd=0x071
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 929717 kern.info] [AFT2] D$ data not available
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 335345 kern.info] [AFT2] I$ data not available
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 789567 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x00a1c490.a58e43a0
      Oct 14 13:06:29 v480 AFSR 0x00000008<EDU>.00000071 AFAR 0x000000b0.03e65380
      Oct 14 13:06:29 v480 Fault_PC 0x100c6b4 Esynd 0x0071
      Oct 14 13:06:29 v480 Fault_PC 0x1176dd0 Esynd 0x0071 Slot B: J3100 J3101 J3201 J3200
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 802669 kern.notice] [AFT1] errID 0x00a1c490.a58e5aac Two Bits in error, likely from E$ WDU/CPU
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 312155 kern.warning] WARNING: [AFT1] First Error EDU:ST Event detected by CPU0 at TL=0, errID 0x00a1c490.a58f55d8
      Oct 14 13:06:29 v480 AFSR 0x00000008<EDU>.00000003 AFAR 0x000000b0.03e65490
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 634480 kern.notice] [AFT1] errID 0x00a1c490.a5a18ac8 Two Bits in error, likely from E$ EDU:ST
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 922131 kern.info] [AFT2] errID 0x00a1c490.a5a18ac8 E$tag PA=0x00000000.00665480 does not match AFAR=0x000000b0.03e65480
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 915500 kern.info] [AFT2] errID 0x00a1c490.a5a18ac8 PA=0x00000000.00665480
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 208903 kern.warning] WARNING: [AFT1] First Error UCU Event detected by CPU0 in Privileged mode at TL=0, errID 0x00a1c490.a5a21060
      Oct 14 13:06:29 v480 AFSR 0x00100200<PRIV,UCU>.00000003 AFAR 0x000000b0.03e65390
      Oct 14 13:06:29 v480 Fault_PC 0x117d3f4 Esynd 0x0003
      Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 306583 kern.notice] [AFT1] errID 0x00a1c490.a5a21060 Two Bits in error, likely from E$ EDU:ST
      Oct 14 13:06:30 v480 unix: [ID 836849 kern.notice]
      Oct 14 13:06:30 v480 ^Mpanic[cpu0]/thread=3001b20d520:
      Oct 14 13:06:30 v480 unix: [ID 100107 kern.notice] [AFT1] errID 0x00a1c490.a5ad4070 UE Error(s)
      Oct 14 13:06:30 v480 See previous message(s) for details
      Oct 14 13:06:30 v480 unix: [ID 100000 kern.notice]
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165c520 SUNW,UltraSPARC-III+:cpu_aflt_log+5c0 (2a10165c62b, 1, 2a10165c838, 10, 117f298, 117f2c0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000001262a64 0000000000000010 0000000000000003 000002a10165c838
      Oct 14 13:06:30 v480 %l4-7: 000000b003e65380 0000000000000000 000002a10165c768 000002a10165c5de
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165c770 SUNW,UltraSPARC-III+:cpu_deferred_error+4d4 (0, 1, 10000400000071, 100004, b0, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 000002a10165c838 0000000400000000 0010000400000071 000003000437c928
      Oct 14 13:06:30 v480 %l4-7: 0000000000000000 000002a10165cd80 000000000000000f 0000000080000000
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165ccd0 unix:ktl0+48 (1, 58fffff, 0, 30006611c28, 400000000, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000004 0000000000001400 0000000080001603 0000000001173700
      Oct 14 13:06:30 v480 %l4-7: 00000300003b5b60 0000000000000016 000000000000000f 000002a10165cd80
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165ce20 SUNW,UltraSPARC-III+:cpu_flt_in_memory+34 (2a10165d348, 30003a88440, 328, 2a10165d348, 2a10165d671, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000080001602 0000000000000016 0000000080001602 00000000011735b0
      Oct 14 13:06:30 v480 %l4-7: 000000b003e65490 000002a10165cb58 0000000000000001 000002a10165cf40
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165cfe0 genunix:errorq_dispatch+68 (300003b5a50, 2a10165d208, 468, 1, b0, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 000003000028fbc8 0000000000000001 00000300003b5a50
      Oct 14 13:06:30 v480 %l4-7: 000000000144a050 00000300003b5b68 0000030024c71ce0 000002a10115dba0
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d090 SUNW,UltraSPARC-III+:cpu_queue_events+ec (400000000, 2a10165d670, 4010000403200087, 3000437c928, 75ad2a00, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000400000000 0000000000000001 0000000001492090 0010000400000087
      Oct 14 13:06:30 v480 %l4-7: 000000b003e65f80 000002a10165d208 000000000000000f 000003000437c928
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d140 SUNW,UltraSPARC-III+:cpu_deferred_error+378 (0, 1, 4010000403200087, 40100004, b0, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 000002a10165d208 0000000000000000 4010000403200087 000003000437c928
      Oct 14 13:06:30 v480 %l4-7: 0000000000000001 000002a10165d750 0000000000000000 0000000080000000
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d6a0 unix:ktl0+48 (2a10165dec0, 2a10165df80, 7f80000, c00000b1f2000676, 30004394000, 30010af7328)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000007 0000000000001400 0000000000001606 0000000001173700
      Oct 14 13:06:30 v480 %l4-7: 0000030004394090 000000000142e9a8 000000000000000b 000002a10165d750
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d7f0 unix:_resume_from_idle+d0 (3001b20d520, 30004394000, 1438a78, 3014f5361b0, 16, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000030004394000 000003001b20d520 000002a100013d40 000002a100013d40
      Oct 14 13:06:30 v480 %l4-7: 0000000001438800 0000000000000000 0000000000000000 0000000000000000
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d8a0 genunix:cv_timedwait_sig+1b8 (3fffffefa787ab11, 3001b20d520, 149b800, fffffffffffffc18, ad0, 3b9aca00)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000030006611c28 000003014f5361b0 0000000000000000 0000000000000000
      Oct 14 13:06:30 v480 %l4-7: 000003001b20d69e 000003001b20d6a0 000000010f66b2f5 0000000000000000
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d950 genunix:cv_waituntil_sig+98 (3001b20d69e, 3001b20d6a0, 2a10165dac8, 1e, 0, 3b9aca00)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 000000000000001e 0000000000000001 0000000080000000 000002a10165dac8
      Oct 14 13:06:30 v480 %l4-7: 0000000001497400 0000000001000000 000003014f5361b0 000002a10165d970
      Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165da10 genunix:poll+13c (fbd7fd58, 0, 32, 32, 0, 0)
      Oct 14 13:06:30 v480 genunix: [ID 179002 kern.notice] %l0-3: 0000000000000001 0000030006611c28 0000000000000057 0000000000000000
      Oct 14 13:06:30 v480 %l4-7: 0000000000000000 0000000000000001 000002a10165dac8 000000000000001e
      Oct 14 13:06:30 v480 unix: [ID 100000 kern.notice]
      Oct 14 13:06:30 v480 genunix: [ID 672855 kern.notice] syncing file systems...
      Oct 14 13:06:30 v480 unix: [ID 836849 kern.notice]
      Oct 14 13:06:30 v480 ^Mpanic[cpu0]/thread=3001b20d520:
      Oct 14 13:06:30 v480 unix: [ID 799565 kern.notice] BAD TRAP: type=31 rp=14382b0 addr=30090b961c0 mmu_fsr=0
      Oct 14 13:06:30 v480 unix: [ID 100000 kern.notice]
      Oct 14 13:06:30 v480 genunix: [ID 111219 kern.notice] dumping to /dev/dsk/c1t0d0s1, offset 6711672832, content: kernel
      Oct 14 13:07:12 v480 genunix: [ID 409368 kern.notice] ^M100% done: 170860 pages dumped, compression ratio 3.04,
        • 1. Re: v480 server got autoreboot
          Use your service contract and open a Support Request (SR).
          Show them:
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 789567 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x00a1c490.a58e43a0
          Oct 14 13:06:29 v480 AFSR 0x00000008<EDU>.00000071 AFAR 0x000000b0.03e65380
          Oct 14 13:06:29 v480 Fault_PC 0x100c6b4 Esynd 0x0071
          Oct 14 13:06:29 v480 Fault_PC 0x1176dd0 Esynd 0x0071 Slot B: J3100 J3101 J3201 J3200
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 802669 kern.notice] [AFT1] errID 0x00a1c490.a58e5aac Two Bits in error, likely from E$ WDU/CPU
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 312155 kern.warning] WARNING: [AFT1] First Error EDU:ST Event detected by CPU0 at TL=0, errID
          Then tell them you have a core file for the Kernel team to review.
          They will likely tell you to ...
          (1) Patch the system to a full current Recommended Patch Bundle,
          (2) Then monitor the system for a week in case there is a repeat of the panic reboot.

          They can decide whether any hardware needs to be replaced, but it's probably a patching issue.

          Let Technical Support explain to you what each of the following may mean for your system (I have emphasized various text in bold):
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 208903 kern.warning] WARNING: [AFT1] First Error UCU Event detected by CPU0 in Privileged mode at TL=0, errID
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 319484 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00a1c490.a57f6240
          Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 789567 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x00a1c490.a58e43a0
          Oct 14 13:06:30 v480 genunix: [ID 723222 kern.notice] 000002a10165d090 SUNW,UltraSPARC-III+:cpu_queue_events+ec (400000000, 2a10165d670, 4010000403200087, 3000437c928, 75ad2a00, 0)
          They have documents that are internal-only to paying support customers and/or internal-only to Oracle employees that give good descriptions for each of those system events I have emphasized in bold text.
          You could also have them glance at this forum post.

          You need proper assistance from Technical Support. This is more than a simple user-to-user forum can remedy.
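
Before opening the SR, it can help to pull the [AFT1] warning lines out of /var/adm/messages so the distinct event types are easy to show. The following is only a minimal sketch: the regular expression is inferred from the log lines quoted in this thread, and the function name `summarize_aft_warnings` is made up for illustration. It says nothing about what each event means; that part is still for Technical Support.

```python
import re
from collections import Counter

# Pattern inferred from the WARNING lines quoted in this thread (an assumption,
# not an official log grammar). "First Error " is treated as an optional prefix.
AFT_RE = re.compile(
    r"WARNING: \[AFT1\] (?:First Error )?(.+?) Event detected by (CPU\d+)"
)

def summarize_aft_warnings(lines):
    """Count [AFT1] warning events per (cpu, event type)."""
    counts = Counter()
    for line in lines:
        m = AFT_RE.search(line)
        if m:
            counts[(m.group(2), m.group(1))] += 1
    return counts

sample = [
    "Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 215022 kern.warning] WARNING: [AFT1] Uncorrectable system bus (UE) Event detected by CPU0 in Privileged mode at TL=0, errID 0x00a1c490.a57f6240",
    "Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 319484 kern.warning] WARNING: [AFT1] DUE Event detected by CPU0 at TL=0, errID 0x00a1c490.a57f6240",
    "Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 789567 kern.warning] WARNING: [AFT1] EDU:ST Event detected by CPU0 at TL=0, errID 0x00a1c490.a58e43a0",
    "Oct 14 13:06:29 v480 SUNW,UltraSPARC-III+: [ID 208903 kern.warning] WARNING: [AFT1] First Error UCU Event detected by CPU0 in Privileged mode at TL=0, errID",
    "Oct 14 13:06:29 v480 Fault_PC 0x100c6b4 Esynd 0x0071",
]

for (cpu, event), n in summarize_aft_warnings(sample).items():
    print(f"{cpu}: {event} x{n}")
```

To run it against the real file, pass `open("/var/adm/messages")` instead of `sample`.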
          • 2. Re: v480 server got autoreboot
            805789
            Fayaz;

            This box rebooted because it crashed due to memory DIMM issues. You're receiving AFT1 messages, which indicate uncorrectable memory errors. The server panics in order to preserve data integrity.

            One DIMM on the system board in Slot B has failed: Slot B: J3100 J3101 J3201 J3200. You need to run hardware diagnostics or log a Service Request with Oracle in order to narrow down the faulty DIMM and replace it.

            Regards.
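
To see how often the log points at that DIMM group, the "Slot X: Jnnnn ..." annotations can be counted. This is a rough sketch, assuming the annotation always has the form seen in the lines above; `suspect_dimm_groups` is an illustrative name, not an existing tool. It only narrows things to the group the kernel printed and cannot single out one DIMM.

```python
import re
from collections import Counter

# Matches the trailing "Slot B: J3100 J3101 J3201 J3200" annotation that some
# Fault_PC lines carry. The format is assumed from this thread's log excerpt.
SLOT_RE = re.compile(r"Slot\s+([A-Z]):\s+((?:J\d+\s*)+)")

def suspect_dimm_groups(lines):
    """Count how often each (slot, DIMM group) annotation appears."""
    counts = Counter()
    for line in lines:
        m = SLOT_RE.search(line)
        if m:
            dimms = tuple(m.group(2).split())
            counts[(m.group(1), dimms)] += 1
    return counts

sample = [
    "Oct 14 13:06:29 v480 Fault_PC 0x1176dd8 Esynd 0x0071 Slot B: J3100 J3101 J3201 J3200",
    "Oct 14 13:06:29 v480 Fault_PC 0x100c6b4 Esynd 0x0071",
    "Oct 14 13:06:29 v480 Fault_PC 0x1176dd0 Esynd 0x0071 Slot B: J3100 J3101 J3201 J3200",
]

for (slot, dimms), n in suspect_dimm_groups(sample).items():
    print(f"Slot {slot}: {' '.join(dimms)} reported {n} time(s)")
```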

            • 3. Re: v480 server got autoreboot
              Sergio is probably correct that you will eventually need to replace a set of DIMMs, but in my past incarnation as a Sun support engineer (in the previous millennium), a system that was unpatched, or at least a couple of years down-rev, could also produce these errors. A properly patched OS will have better self-healing and a lot less downtime.

              Again, only a proper analysis by Technical Support will figure it out.
              • 4. Re: v480 server got autoreboot
                897085
                Thanks for your reply Sergio & Rukbat.

                Well, I checked the server physically and the service LED is not lit. So my conclusion is that there is no DIMM failure. Is my conclusion correct?

                I didn't log an Oracle service request at that time because I thought it was not a critical server. As per your suggestion, I plan to log a service request now. I need to discuss it with my technical head first, as it has been a long time since the server crashed.

                Also, I tried reaching the RSC but it is not reachable. Will an RSC reset from the Solaris OS fix the issue?
                • 5. Re: v480 server got autoreboot
                  user300462
                  Running a max POST will give you a detailed diagnostic display in sequence on your screen, and you can see which POST subtest reports the very first failure. That really helps with failure analysis (FA) of the hardware fault.

                  If you have ever removed or installed a CPU/Memory board, make sure that the VHDM connectors on the CPU/Memory board and the backplane have no bent or damaged pins and that the male/female connectors are fully engaged.

                  When you try to isolate the problem to a faulty DIMM displayed as Jxxxx, be aware of the memory bank (0 or 1) and side (front or rear). Since this is a multiple-bit ECC error, it is difficult to use the ECC syndrome table to map it to a specific data bit.
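
If you want the decoded flags and fault addresses side by side while reading the POST output, the AFSR lines can be pulled apart with a few lines of Python. This is only a sketch under assumptions: the regular expression is inferred from the AFSR lines in this thread, the dotted `0xhigh.low` notation is assumed to be a 64-bit value printed as two 32-bit words, and `parse_afsr` is a made-up helper name. It does not interpret the syndrome.

```python
import re

# Layout assumed from lines such as:
#   AFSR 0x00700004<DUE,ME,PRIV,UE>.000001fd AFAR 0x000000b0.03e65780 AMBIGUOUS
AFSR_RE = re.compile(
    r"AFSR 0x([0-9a-f]{8})<([^>]*)>\.([0-9a-f]{8}) "
    r"AFAR 0x([0-9a-f]{8})\.([0-9a-f]{8})"
)

def parse_afsr(line):
    """Return the flag names, raw AFSR words and AFAR from one log line, or None."""
    m = AFSR_RE.search(line)
    if m is None:
        return None
    afsr_hi, flags, afsr_lo, afar_hi, afar_lo = m.groups()
    return {
        "flags": flags.split(","),
        "afsr_words": (afsr_hi, afsr_lo),
        # Assumption: "hi.lo" is one 64-bit address split into 32-bit halves.
        "afar": int(afar_hi + afar_lo, 16),
    }

sample = "Oct 14 13:06:29 v480 AFSR 0x00700004<DUE,ME,PRIV,UE>.000001fd AFAR 0x000000b0.03e65780 AMBIGUOUS"
info = parse_afsr(sample)
print(info["flags"], hex(info["afar"]))
```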

                  Personally, I don't think your RSC card is faulty. Resetting the system through RSC will not solve the problem if this is a hardware failure.

                  Good Luck.
                  • 6. Re: v480 server got autoreboot
                    897085
                    I will run a max POST as per your suggestion. RSC was not reachable due to a network issue. Thanks to all for the help.