7 Replies Latest reply: Jan 29, 2013 10:42 AM by rukbat

Analysing show error buffer output E2900 server

midhu Newbie
Hi guys, please help me find the faulty memory in an E2900 server using the showerrorbuffer output. I am continuously getting "incoming read" messages, which denote memory errors, but no memory error shows up in the MEM2 POST or in the show outputs. I can work out the corresponding CPU from the port. A sample of the output is below.

ErrorData[2]
Date: Thu Apr 26 03:03:01 PDT 2012
Device: /SB4/dx3
ErrorID: 0x33061ff1
Port: 1
Syndrome: 0x86(CE bit 17)
Direction: incoming read
First error: true
TargetAid: 0x10
Transid: 0x1

As per Oracle doc 1002710.1, the memory pair is found using the data bit (CE bit 17), as data bit 41 (ESYN 0xd) translates to the DIMM pair (J16500/J16501). So please help me with how to translate a data bit to a memory pair.
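
Before any bit-to-DIMM lookup, the record above has to be broken into its fields. A minimal Python sketch of that step follows, assuming every record starts with an `ErrorData[n]` line followed by `Key: value` lines exactly as in the sample. The helper names `parse_records` and `ce_bit` are made up for illustration. The sketch only extracts the CE bit number; it does not map it to a DIMM pair, because that table lives in the Oracle documents and is not reproduced here.

```python
import re

# The sample record from the post, copied verbatim.
SAMPLE = """\
ErrorData[2]
Date: Thu Apr 26 03:03:01 PDT 2012
Device: /SB4/dx3
ErrorID: 0x33061ff1
Port: 1
Syndrome: 0x86(CE bit 17)
Direction: incoming read
First error: true
TargetAid: 0x10
Transid: 0x1
"""


def parse_records(text):
    """Return one dict per ErrorData[n] block (hypothetical helper)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"ErrorData\[(\d+)\]", line)
        if header:
            # A new record starts; remember its index.
            current = {"Index": int(header.group(1))}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so the time in "Date" survives.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def ce_bit(record):
    """Return the bit number from a 'Syndrome: 0x..(CE bit N)' field, or None."""
    match = re.search(r"CE bit (\d+)", record.get("Syndrome", ""))
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(rec["Device"], "port", rec["Port"], "CE bit", ce_bit(rec))
```

The board (`/SB4`), the port, and the bit number are the three values that the decoding documents ask for, so pulling them out in one pass makes the manual lookup less error-prone.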
  • 1. Re: Analysing show error buffer output E2900 server
    Nik Expert
    Hi.

    Generally, /var/adm/messages provides more readable information.

    For decoding, try to find this document:
    Document ID: ID70962 Synopsis: Sun Fire[TM] High/Midrange Servers: Decoding data bit to a pair of DIMMs

    Regards.
  • 2. Re: Analysing show error buffer output E2900 server
    midhu Newbie
    Thank you very much for the link ... :)
  • 3. Re: Analysing show error buffer output E2900 server
    rukbat Guru Moderator
    Device: /SB4/dx3
    and
    translates to the DIMM pair(J16500/J16501)
    Go to SB4 and you will see that the DIMM slots on the Sunfire MidRange Systems have printed labels already in place on each board. If you are running multiple cpu/memory boards at the same time and your firmware version is new enough to let you disable the board, you would then be able to pull it from the running chassis and swap out the questionable RAM.

    Since you have Metalink access, you can find a drawing of the board layout in the System Handbook.

    The E2900 documentation is at:
    http://docs.oracle.com/cd/E19095-01/sfe2900.srvr/index.html
    You can glance in the E2900 Service Manual (23MB PDF file):
    http://docs.oracle.com/cd/E19095-01/sfe2900.srvr/817-4054-15/817-4054-15.pdf
    for service procedures.

    (Of course it will ALWAYS be safer to just schedule a proper maintenance window when you can shut the whole chassis down to a power-off state.)

    For those other people that might read this thread in the future and NOT have access to the MOS version of the System Handbook, here's a diagram of a Sunfire MidRange cpu/memory board as hosted on a non-Oracle archive of the old Sun System Handbook:
    http://www.sunshack.org/data/sh/2.1.8/infoserver.central/data/syshbk/Devices/System_Board/SYSBD_SunFire_V1280_CPU.html
  • 4. Re: Analysing show error buffer output E2900 server
    805789 Explorer
    Sorry Rukbat, you're wrong. This "error" points to DIMMs J14400 or J14401 @ SB4.

    However, I'd suggest first counting how many "incoming read" errors you have in the "showerrorbuffer" output. Check /var/adm/messages for any AFT message, such as AFT2 or AFT0. You may be experiencing Correctable Errors. If you have an AFT1, it means you're experiencing "Uncorrectable Errors", so the DIMMs will need to be replaced.

    If you are still not confident, please open a ticket with Oracle Support.
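
    The two checks suggested above can be sketched in Python. This assumes the showerrorbuffer output has been saved as text in the format shown in the original post, with `Device:` and `Port:` appearing before `Direction:` in each record. The function names are made up for illustration, and the AFT scan simply looks for the literal tags `AFT0` to `AFT9` anywhere in a line, without assuming any particular message layout.

```python
import re
from collections import Counter

# The record from the original post, copied verbatim.
SAMPLE = """\
ErrorData[2]
Date: Thu Apr 26 03:03:01 PDT 2012
Device: /SB4/dx3
ErrorID: 0x33061ff1
Port: 1
Syndrome: 0x86(CE bit 17)
Direction: incoming read
First error: true
TargetAid: 0x10
Transid: 0x1
"""


def count_incoming_reads(text):
    """Count 'incoming read' records per (device, port) pair."""
    counts = Counter()
    device = port = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("ErrorData["):
            # New record: forget the previous record's device and port.
            device = port = None
        elif line.startswith("Device:"):
            device = line.split(":", 1)[1].strip()
        elif line.startswith("Port:"):
            port = line.split(":", 1)[1].strip()
        elif line.startswith("Direction:"):
            if line.split(":", 1)[1].strip() == "incoming read":
                counts[(device, port)] += 1
    return counts


def count_aft_tags(lines):
    """Count occurrences of AFT0..AFT9 tags in an iterable of log lines."""
    tags = Counter()
    for line in lines:
        for tag in re.findall(r"\bAFT[0-9]\b", line):
            tags[tag] += 1
    return tags


if __name__ == "__main__":
    print(dict(count_incoming_reads(SAMPLE)))
    # On the Solaris host itself, the log could be scanned like this:
    #   with open("/var/adm/messages", errors="replace") as log:
    #       print(dict(count_aft_tags(log)))
```

    A high count on one board and port narrows down where to look, and the tag totals show at a glance whether any AFT1 entries are present at all.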


    Regards.

    </SQ>
    SPARC Engineer
  • 5. Re: Analysing show error buffer output E2900 server
    rukbat Guru Moderator
    Sergio,
    I hadn't done any analysis of the original poster's errors.
    Their initial post was a bit too cryptic for me to attempt anything.
    My reply to their inquiry was to quote their conclusion that it was going to be that particular pair of modules.

    Yes, they most definitely need to log a ticket for proper analysis.
    The errors could require service to the system, or they could simply be an indication that the box is far underpatched and thus mean nothing.
  • 6. Re: Analysing show error buffer output E2900 server
    805789 Explorer
    Haa !! Now that I read carefully, you're right, Rukbat. Please accept my apologies! User "midhu" was the one pointing to J16500/J16501.

    Cheers.

    </SQ>
    SPARC Engineer
  • 7. Re: Analysing show error buffer output E2900 server
    rukbat Guru Moderator
    Sergio Quiroz wrote:
    Haa !! Now that I read carefully ...
    No problem.
    A very long, long time ago in a galaxy far, far away and in a previous incarnation, I enjoyed a few years as a hardware support engineer for Sun. We handled many DIMM error support calls on a daily basis. The vast majority of those inquiries needed no parts swapped, just OS and firmware patches applied.

    My Sunfire Midrange training was so long ago that the documentation for the training classes was in what might be described as an alpha version. The E2900 is nothing more than a slightly newer UltraSPARC-III / UltraSPARC-IV iteration of the US-II V1280.

    :)
