Hello,

I'd like to share a recent experience of having to replace an InfiniBand (IB) switch in an Exalogic machine after a hardware failure.

It started as fairly simple maintenance: a hardware failure on a redundant component, in this case an IB switch in an Exalogic machine.

 

Hardware failures do happen, and IB switches are no exception to that rule. Also, IB switches in Exalogic machines aren't customer-serviceable; they are not Customer Replaceable Units (CRUs).

So everything looks fairly simple, right? ASR, OEM, or something else reports a hardware failure, an SR gets created, initial troubleshooting and RCA are done, and an Oracle Field Engineer (FE) is scheduled to replace the failed IB switch.

So far so good.

 

At the agreed-upon time, the Oracle FE shows up at your datacenter, replaces the IB switch, matches the firmware on the replacement switch to that of the healthy one, configures ILOM and management on the new IB switch, declares "Mission Accomplished" (remember that banner on the carrier?), and leaves.

 

You also run an Exachk report and it comes back clean. Life is good, right? Wrong. Very wrong, unfortunately.

 

Even though the IB switch itself isn't customer-serviceable, restoring the configuration on it is.

Essentially, if you, the customer, haven't followed this MOS document immediately after an Exalogic IB switch replacement, you are exposed to a big problem:

Exalogic Infiniband Switch Replacement - Follow-up Actions (Restoration) (Doc ID 2218689.1)

 

How big a problem, one may ask? Think of a complete outage of all VMs running on that Exalogic machine, all at once.

This is what just happened to one of our customers.

 

The moral of the story is very simple: don't accept the FE's report that everything is good, because everything is good only from the hardware perspective.

Your freshly replaced IB switch still has no configuration on it, and in the event of an IB fail-over, planned or unplanned, your Exalogic customers could get agitated very quickly.

 

Make sure to follow all the steps in the MOS document above and perform a fail-over test for at least one of the VMs.

I hope this article saves you from a major Exalogic outage, or at least helps you recover from one as quickly as possible.

Thank you for reading, Slava Urbanovich