Is oem 13.2 able to monitor for hardware faults (memory chips) in a exadata Machine? — oracle-tech



Kasimirtlw Member Posts: 8 Blue Ribbon
edited August 2020 in Exadata

I have a question about the monitoring capabilities of OEM 13.2 for an Exadata target that is configured in that OEM.

We recently had one of the memory chips in the Exadata fail, but I could not find any trace of this in the OEM alerts or incidents.

When I look at the Exadata target, I don't see any indication that it can see anything hardware-related beyond its discovered targets. In other words, the memory chips are not a target it checks.

Currently it lists the Exadata cells, hosts, IB switches, virtual platforms, Ethernet switches and PDUs.

So there is nothing that, in my mind, is able to follow up on the health status of the DIMMs in the Exadata machine.

Is OEM able to monitor this?

If so, what needs to be done to accomplish this? Or should it have seen this out of the box?

Answers

  • Nishant Baurai Member Posts: 215
    edited August 2020

    Yes, it does. Check whether your Exadata and OEM versions are certified. Also check that the relevant metric (Component Faults, as in the example here) is enabled. The following is an example email notification for a memory fault.

    Target Component name=/System/Memory/DIMMs/DIMM_15

    Target Component owner=SYSMAN

    Host=exadb01.xyz.com

    Target type=Systems Infrastructure Server

    Target name=exadbadm01-ilom.xyz.com

    Categories=Fault

    Message=Fault found in P1/D3 (CPU 1 DIMM 3) @ Sun Oct 14 20:42:08 2018. Description: Multiple correctable ECC errors on a memory DIMM have been detected., Probability: 100, PartNumber: 07075400,M393A4K40BB1-CRC, SerialNumber: 00CE02161031C64BD3

    Severity=Critical

    Event reported time=Oct 14, 2018 7:42:09 PM GMT-05:00

    Operating System=Linux

    Platform=x86_64

    Event Type=Metric Alert

    Event name=ComponentFaults:OpenProblemStatus

    Metric Group=Component Faults

    Metric=Open Problem Status

    Metric value=1

    Key Value=/System/Memory/DIMMs/DIMM_15
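If you want to act on these emails automatically (for example, to open a ticket), the notification body can be parsed as simple `key=value` lines. The following is only a minimal sketch, not anything OEM ships: the function names are hypothetical, the sample is trimmed from the notification above, and it assumes that `Metric value=1` means the problem is still open and that DIMM faults always carry a `Key Value` under `/System/Memory/DIMMs/`.

```python
import re

# Trimmed copy of the example notification quoted in this thread.
SAMPLE = """\
Target Component name=/System/Memory/DIMMs/DIMM_15
Target type=Systems Infrastructure Server
Target name=exadbadm01-ilom.xyz.com
Categories=Fault
Message=Fault found in P1/D3 (CPU 1 DIMM 3) @ Sun Oct 14 20:42:08 2018. Description: Multiple correctable ECC errors on a memory DIMM have been detected., Probability: 100, PartNumber: 07075400,M393A4K40BB1-CRC, SerialNumber: 00CE02161031C64BD3
Severity=Critical
Event name=ComponentFaults:OpenProblemStatus
Metric value=1
Key Value=/System/Memory/DIMMs/DIMM_15
"""


def parse_notification(text):
    """Split a notification body into a dict, one entry per key=value line."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not key=value
        key, _, value = line.partition("=")  # split on the first '=' only
        fields[key.strip()] = value.strip()
    return fields


def is_dimm_fault(fields):
    """Heuristic (assumption): open component fault whose key is a DIMM path."""
    return (
        fields.get("Event name") == "ComponentFaults:OpenProblemStatus"
        and "/Memory/DIMMs/" in fields.get("Key Value", "")
        and fields.get("Metric value") == "1"
    )


def serial_number(fields):
    """Pull the DIMM serial number out of the free-text Message field, if any."""
    match = re.search(r"SerialNumber:\s*(\S+)", fields.get("Message", ""))
    return match.group(1) if match else None


if __name__ == "__main__":
    parsed = parse_notification(SAMPLE)
    if is_dimm_fault(parsed):
        print(parsed["Target name"], parsed["Key Value"], serial_number(parsed))
```

The field names are taken from the one example above, so check them against a real notification from your own OEM before relying on them.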
