Boot process discover variable memory size — oracle-tech


Boot process discover variable memory size

user13524501
user13524501 Member Posts: 4 Red Ribbon
edited October 2019 in Oracle x86 Servers

Hi, I have an X4450, and at each reboot it discovers a different memory size. There are no apparent errors and no fault LEDs are lit. I am at the latest firmware version:

Version 3.0.6.15.f r101655

Property                  Value
SP Firmware Version       3.0.6.15.f
SP Firmware Build Number  101655
SP Firmware Date          Fri Aug 14 14:15:22 CST 2015
SP Filesystem Version     0.1.22

The server is populated with 128G (32 x 4G DIMMs). Sometimes it offers 40G, 64G, or 72G. I got the full 128G only once or twice.
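To put the reported sizes in terms of DIMMs, here is a minimal Python sketch that converts each reported total into a count of DIMMs seen and missing. The DIMM size (4G) and DIMM count (32) come from the post; the function name `dimms_seen` is only illustrative.

```python
DIMM_GB = 4            # size of each DIMM, as stated in the post
INSTALLED_DIMMS = 32   # number of DIMMs installed, as stated in the post


def dimms_seen(reported_gb):
    """Return (DIMMs counted, DIMMs missing) for a reported memory size in GB."""
    counted, remainder = divmod(reported_gb, DIMM_GB)
    if remainder:
        raise ValueError(f"{reported_gb}G is not a multiple of {DIMM_GB}G")
    return counted, INSTALLED_DIMMS - counted


if __name__ == "__main__":
    for size in (40, 64, 72, 128):
        counted, missing = dimms_seen(size)
        print(f"{size}G -> {counted} DIMMs seen, {missing} missing")
```

Every reported size works out to an even number of DIMMs (10, 16, 18, 32), which would be consistent with DIMMs dropping out in pairs, although the post alone does not prove that.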

I reseated all the sticks and the memory riser board, and I tested the memory with pc-check. What else can I do? Is there something I can do to fix this?

Thanks !

Michel Jean

Answers

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited September 2019

    Hello user13524501,

    Do you see any memory errors during POST, or is that simply the amount of memory being initialized and presented to the OS (sometimes 40G, 64G, 72G, etc.)?

    Mechanically, the first thing to do is check that the memory mezzanine card is firmly locked into place. We have seen cases where the arms were not secured by the green levers, and making sure they are locked down has resolved this issue. If that is not the cause, proceed to isolate bad DIMMs by populating the system to the minimum configuration and testing the DIMMs in pairs: install the first DIMM pair in slots A0/B0, then the second pair in slots C0/D0 (a lot of work there...).
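The pair-by-pair isolation described above can be tracked with a small bookkeeping helper. This is only a sketch under stated assumptions: each pair adds 8G (two 4G DIMMs, per the original post), a pair that fails to show up is pulled before the next one is added, and the memory size after each boot is entered by hand. The slot labels A0/B0 and C0/D0 come from the reply; `pair-3` and `pair-4` are hypothetical placeholders.

```python
DIMM_GB = 4             # DIMM size from the original post
PAIR_GB = 2 * DIMM_GB   # each pair should add 8G


def find_suspect_pairs(observations):
    """Flag pairs whose addition did not raise the reported memory as expected.

    observations: list of (pair_label, reported_gb) tuples, in the order
    the pairs were added. Assumes a suspect pair is removed before the
    next pair is installed, so only good pairs count toward the total.
    """
    good_pairs = 0
    suspects = []
    for label, reported_gb in observations:
        expected_gb = (good_pairs + 1) * PAIR_GB
        if reported_gb == expected_gb:
            good_pairs += 1
        else:
            suspects.append(label)
    return suspects


if __name__ == "__main__":
    # Hypothetical boot results, entered by hand after each POST.
    results = [("A0/B0", 8), ("C0/D0", 16), ("pair-3", 16), ("pair-4", 24)]
    print(find_suspect_pairs(results))
```

Because the fault in this thread is intermittent, a pair that passes one boot may still fail later, so it is probably worth booting more than once per step before trusting a result.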

    Out of curiosity, are all the DIMMs from the same manufacturer (same part number)?

    It is also a good idea to visually inspect the slots for any signs of contamination or damage (dust, burn marks, bent pins), which can cause such intermittent behavior.

    Best regards,

    Claudiu

  • Nik
    Nik Member Posts: 2,732 Bronze Crown
    edited September 2019

    Hi.

    You can check the status of all DIMMs via ILOM.

    Compare what ILOM sees with what is really installed.

    Do you see the correct memory size in the POST output?

    Regards,

      Nik

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited September 2019

    Hello Nik,

    This is what Michel stated:

    "the server got 128G populated with 32X4G. Sometime it offers 40G, 64G, 72G. I got 128G only one or two time"

    If POST disables part of the memory during memory initialization, then of course the OS will pick up only the amount of memory that passed POST.

    Best regards,
    Claudiu

  • user13524501
    user13524501 Member Posts: 4 Red Ribbon
    edited September 2019

    Hi, I double-checked last evening and, from what I could see, all the DIMMs are from the same manufacturer (Samsung) with the same Sun FRU part number (371-3069-01). I also checked the levers, and they look good. Last summer I populated the DIMMs one by one, and that gave me 128G, but I have rebooted since then and lost that figure. I also inspected the sockets and DIMM connectors, and they look good. Perhaps I should blow them out with compressed air. As I said, the graphical console reports a certain amount of RAM and the OS works with that. At the moment I do not have the serial console connected; should I? Are there more messages on the serial console?

    Thanks!

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited September 2019

    Hello user13524501,

    Thanks for the details.

    The part number is good. When I mentioned pins, I also meant the pins on the DIMMs, as they can also show signs of contamination (burnt pins, dust, etc.). Are they OK as well?

    Blowing compressed air on the sockets is good maintenance to carry out from time to time, especially if the server is not located in a clean data center/environment.

    Try rebooting the system let's say 4-5 times in a row and let me know the amount of memory initialized and visible to the ILOM/OS.

    Out of curiosity what OS do you have installed on this server? Have you experienced any panic/crash/BSOD/PSOD depending on the OS?

    The serial console will display the same thing as the graphical console when it comes to POST, no difference there.

    You can also check the output of -> show /SP/logs/event/list/ from ILOM and browse through it for any memory errors during system initialization/reboots.

    Best regards,

    Claudiu

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited September 2019

    Hello user13524501,

    Have you managed to make any progress with the memory issue?

    Best regards,

    Claudiu

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited October 2019

    Hello user13524501,

    Have you managed to make any progress with the memory issue?

    Best regards,

    Claudiu

  • user13524501
    user13524501 Member Posts: 4 Red Ribbon
    edited October 2019

    Hi, thanks for the follow-up, and sorry about the delay. I did what was asked and re-tested each pair of DIMMs, booting the server after each new pair was added. By doing this I found three defective DIMM pairs (which left my server with 104G of RAM). The week after, I worked on the server to reinstall the OS and managed to get VM Server working. Many reboots were involved before I hit another defective pair of DIMMs. So I have now been running with 96G (minus 4 pairs of DIMMs) for a week. I hope this was the last pair to replace.
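As a quick consistency check on the figures above, here is a small Python sketch that computes the capacity left after pulling a number of DIMM pairs. It assumes the configuration from the original post (32 DIMMs of 4G each); the function name `remaining_gb` is illustrative only.

```python
def remaining_gb(pairs_removed, total_dimms=32, dimm_gb=4):
    """Capacity in GB left after removing the given number of DIMM pairs."""
    dimms_left = total_dimms - 2 * pairs_removed
    if dimms_left < 0:
        raise ValueError("cannot remove more DIMMs than are installed")
    return dimms_left * dimm_gb


if __name__ == "__main__":
    for pairs in (0, 3, 4):
        print(f"{pairs} pairs removed -> {remaining_gb(pairs)}G")
```

With three pairs removed this gives 104G, and with four pairs removed 96G, matching the sizes reported in the post.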

    Thanks again

    Michel

  • user13524501
    user13524501 Member Posts: 4 Red Ribbon
    edited October 2019

    I have installed Oracle VM Server release 3.4.6

    kernel 4.1.12-124.21.1.el6uek.x86_64

    To answer the other question, I get this kind of message from "show /SP/logs/event/list/":

    9585   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DD7

    9584   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DD6

    9583   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DC7

    but nothing about memory errors.

    I got a panic, but it came at boot time when the DIMMs failed; the OS booted in a loop, and since I have only a limited console view I could not capture it.
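For a long event list, the hot-removal entries can be pulled out with a short script. The Python sketch below assumes the log text has exactly the two-line layout pasted above (a header line with ID, timestamp, class, type, and severity, followed by a message line); other ILOM versions may format events differently, and the function name `hot_removals` is illustrative only.

```python
import re

# Header line, e.g. "9585   Wed Sep 18 19:58:14 2019  Chassis   Action    major"
EVENT_HEADER = re.compile(
    r"^\s*(\d+)\s+(\w{3} \w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4})"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)
# Message line, e.g. "Hot removal of /SYS/MB/MCH/DD7"
HOT_REMOVAL = re.compile(r"^\s*Hot removal of (\S+)\s*$")

SAMPLE = """\
9585   Wed Sep 18 19:58:14 2019  Chassis   Action    major

       Hot removal of /SYS/MB/MCH/DD7

9584   Wed Sep 18 19:58:14 2019  Chassis   Action    major

       Hot removal of /SYS/MB/MCH/DD6

9583   Wed Sep 18 19:58:14 2019  Chassis   Action    major

       Hot removal of /SYS/MB/MCH/DC7
"""


def hot_removals(log_text):
    """Return a list of (event_id, timestamp, device_path) for hot removals."""
    events = []
    header = None
    for line in log_text.splitlines():
        header_match = EVENT_HEADER.match(line)
        if header_match:
            header = header_match
            continue
        removal = HOT_REMOVAL.match(line)
        if removal and header:
            events.append((int(header.group(1)), header.group(2), removal.group(1)))
    return events


if __name__ == "__main__":
    for event_id, timestamp, device in hot_removals(SAMPLE):
        print(event_id, timestamp, device)
```

Grouping the resulting device paths by timestamp can help show which DIMMs dropped out together during a given boot.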

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 49 Employee
    edited October 2019

    Hello user13524501,

    Thanks for the follow-up. Did you find any signs of contamination/dust in the DIMM sockets?

    For the following lines in event list:

    9585   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DD7

    9584   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DD6

    9583   Wed Sep 18 19:58:14 2019  Chassis   Action    major

           Hot removal of /SYS/MB/MCH/DC7

    I hope that the power cords were removed during the removal/insertion of the DIMMs, weren't they?

    Is the server stable now with 96 GB of memory?

    Best regards,
    Claudiu

  • Nik
    Nik Member Posts: 2,732 Bronze Crown
    edited October 2019

    Hi.

    The /SYS/MB/MCH/D* devices are memory DIMMs.

    So the system detected hot removal of memory DIMMs.

    It can be dust, a faulty DIMM, or a socket problem.

    Regards,

      Nik
