Replying to my own question, I am seeing more in the error log, specifically this:
20472 IPMI Log critical Thu Mar 16 19:33:36 2017 ID = 4ff4 : 03/16/2017 : 19:33:36 : System Event : BIOS : Undetermined system hardware failure
20471 IPMI Log critical Thu Mar 16 19:33:35 2017 ID = 4ff3 : 03/16/2017 : 19:33:35 : OEM sensor : BIOS : Hyper-Transport Sync Flood Error
20470 IPMI Log critical Thu Mar 16 19:33:35 2017 ID = 4ff2 : 03/16/2017 : 19:33:35 : System Boot Initiated : BIOS : Automatic boot to diagnostic
20469 IPMI Log critical Thu Mar 16 19:33:35 2017 ID = 4ff1 : 03/16/2017 : 19:33:35 : Processor : BIOS : Presence detected
20468 IPMI Log critical Thu Mar 16 19:33:35 2017 ID = 4ff0 : 03/16/2017 : 19:33:35 : System Boot Initiated : BIOS : Initiated by warm reset
I can't seem to get into the BIOS to change any settings; the machine just cycles through this and my previous error over and over. I have two machines, and suspiciously both are doing the same thing.
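In case it helps anyone triaging a similar reset loop, here is a rough Python sketch that pulls the sensor, source and message out of log lines shaped like the ones above and counts how often each message repeats. The field layout is inferred only from the five entries quoted here, so treat it as an assumption; `parse_entry` and `count_messages` are hypothetical helper names, not part of any ILOM tooling.

```python
import re
from collections import Counter

# Layout assumed from the quoted entries:
# <seq> IPMI Log <severity> <timestamp> ID = <hex> : <date> : <time> : <sensor> : <source> : <message>
LINE_RE = re.compile(
    r"^(?P<seq>\d+)\s+IPMI\s+Log\s+(?P<severity>\w+)\s+.*?"
    r"\bID\s*=\s*(?P<id>[0-9a-fA-F]+)\s*:\s*(?P<rest>.*)$"
)


def parse_entry(line):
    """Return a dict for one log line, or None if it does not fit the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The tail is separated by " : "; limit the split so the message may contain " : ".
    fields = [f.strip() for f in match.group("rest").split(" : ", 4)]
    if len(fields) != 5:
        return None
    date, time, sensor, source, message = fields
    return {
        "seq": int(match.group("seq")),
        "severity": match.group("severity"),
        "id": match.group("id"),
        "date": date,
        "time": time,
        "sensor": sensor,
        "source": source,
        "message": message,
    }


def count_messages(lines):
    """Count how many times each event message appears in the given lines."""
    entries = (parse_entry(line) for line in lines)
    return Counter(e["message"] for e in entries if e is not None)
```

Feeding the whole event log through `count_messages` should make it obvious whether the Sync Flood entry accompanies every warm reset or only some of them.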
Good day. Could you kindly share the full ILOM snapshot from the host?
This symptom is common amongst legacy AMD servers from all vendors of that vintage.
The issue is that AMD developed the Opteron processor with a single power plane which drove both the CPU cores and the memory / I/O hub at the same time.
In summary, when the BIOS is allowed to power-manage the CPU cores, the voltage driving the memory controller degrades, causing correctable and uncorrectable errors.
Sun (which is now Oracle) recommends disabling power management in the BIOS for AMD servers, as this issue is directly caused by the Opteron processor changing power states.
AMD attempted to fix the issue in late F stepping dual-core CPUs but then broke the design in 10 stepping quad-core CPUs.
They designed the core with a split power plane which drove the CPU cores and the memory / I/O hub separately.
It took AMD time to iron out the bugs associated with driving that split plane with multiple voltages on system boards.
Note: From community notes.
Note: For further investigation, kindly open an SR with Oracle and share the full ILOM snapshot. Be aware that collecting a full snapshot may reboot your node.
Thank you for your response. I have discovered that I can successfully get into the BIOS only if I have the minimum number of DIMMs installed - any additional and I get the above behaviour.
I am unable to find a power management setting in the BIOS. I did locate the setting for AMD PowerNow, which was enabled; I have since disabled it, but that does not seem to affect my issue.
Are you able to point me to where in the BIOS I can disable power management?
Here is a link to a full ILOM snapshot from one of the machines:
I am getting the issue below when opening the link.
Could you please send the logs to email@example.com?
https://www.dropbox.com/s/jh2jt6hnt2cree8/SUNSP00144FD35E3F_10.248.42.43_2017-03-17T18-22-43.zip?dl=0 Peer's Certificate issuer is not recognized. HTTP Strict Transport Security: true HTTP Public Key Pinning: true Certificate chain: -----BEGIN CERTIFICATE----- MIIEATCCAumgAwIBAgIUQyGYCPTIg7CVW6FeED4iwqKxlXQwDQYJKoZIhvcNAQEL BQAweTEbMBkGA1UECgwST3JhY2xlIENvcnBvcmF0aW9uMQwwCgYDVQQLDANHSVQx FTATBgNVBAcMDFJlZHdvb2QgQ2l0eTELMAkGA1UECAwCQ0ExCzAJBgNVBAYTAlVT MRswGQYDVQQDDBJPcmFjbGUgV2ViIEdhdGV3YXkwHhcNMTcwMzE3MDkzODQxWhcN MTgwMzE3MDkzODQxWjAaMRgwFgYDVQQDDA93d3cuZHJvcGJveC5jb20wggEiMA0G CSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCq2NBSLqIXiUFanRfh2xf9kE8OOgdM uG9lpWmphaLsKbUBO6ckfTNVx5Cf7rktQxDUwINiBwQUdxvPCEIXrcPQncvjr5uB YefLOFv51DAmlJ198yt/PSebvb3NK9tOmq/lqNDOCJFnvrhdcLPm9x31T5iz+Rlf kn0d5jBgDHUwwfHQm/uk2TXGV/ePDIaE1zRjkr4QO4LkX9K9X60MByUR9unbsZiL rngzIAOYnLDzRc6Cp9BXRYKH7+oTLf/vKTqlMLSDxh4PTXkTo4knxcN+Pu8TUjCz Zh+F/0LU/DIVLudbYltOAB1rAi6pVB/CU7NX3vPfAiqo/A9UormSklYZAgMBAAGj gd8wgdwwCQYDVR0TBAIwADAdBgNVHQ4EFgQU88h+phHPgcgDih3eUvzKfp80qGEw gY0GA1UdIwSBhTCBgqF9pHsweTEbMBkGA1UECgwST3JhY2xlIENvcnBvcmF0aW9u MQwwCgYDVQQLDANHSVQxFTATBgNVBAcMDFJlZHdvb2QgQ2l0eTELMAkGA1UECAwC Q0ExCzAJBgNVBAYTAlVTMRswGQYDVQQDDBJPcmFjbGUgV2ViIEdhdGV3YXmCAQEw CwYDVR0PBAQDAgWgMBMGA1UdJQQMMAoGCCsGAQUFBwMBMA0GCSqGSIb3DQEBCwUA A4IBAQBZOwFnVw/YA7+wV9VDBL0GAA6eYgkHlyac7QoZKa9RV4OUAUHhDEwPkKe1 ZpEFoGKqHaUDUDeii8MiK9ZlnBS+4HbN0dewwUncIBEbfmnIiYNNHL0dV0187xkI yTJrYX9qAEoNhhv3Nv4mfx1BHrnaReTL0DdwaokFDR4ffy/JHC5zc97U5BtFkpEv nXK2Ot2Oo1dNoNV+70iAB//olu8asrHQHS1LdTjE9GjjVOcolTwNZOfqgzVTGVos 2nEzESMBY+jiwcvJXEbwsSbJKvEku0+ZiESxFnR14DwSiKhBMRfYptuT9Whhshd3 wb0VEYB+0G2i//g9mHle8F0YDazc -----END CERTIFICATE-----
ILOM snapshot has been sent
Have you removed any CPUs from the host?
hdtDiag: HT test passed !!
hdtDiag: Power cycling for clean start
hdtDiag: Power on 00
no dbdry cpu 00
hdtDiag: Error, HDT command failed, no CFF cpu 0
hdtDiag: Error exiting HDT mode
hdtDiag: Error, HDT command failed, no CFF cpu 0 =====================>
hdtDiag: Error exiting HDT mode
hdtDiag: exit HDT mode cpu 00
I am unable to find any useful information in the given logs apart from the HDT failure messages above.
Kindly share the output of the following command: ipmitool -H <ILOM IP address> -U root fru
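If it is easier to collect the FRU data from a script (for example, from both affected machines in turn), a minimal Python sketch along these lines should work, assuming `ipmitool` is installed on the machine running it and can reach the ILOM over the network. The function names `fru_command` and `read_fru` are made up for illustration. No password is placed on the command line, so ipmitool is expected to prompt for it interactively.

```python
import subprocess


def fru_command(ilom_host, user="root"):
    """Build the same ipmitool invocation requested above, as an argument list."""
    return ["ipmitool", "-H", ilom_host, "-U", user, "fru"]


def read_fru(ilom_host, user="root"):
    """Run ipmitool against the ILOM and return the FRU listing as text.

    Raises subprocess.CalledProcessError if ipmitool exits with a non-zero status.
    """
    result = subprocess.run(
        fru_command(ilom_host, user),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

Calling `read_fru` with the ILOM address of each host and saving the returned text to a file gives something that can be attached to the SR.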