OLVM: High Availability / Fencing issues with central FC Storage
Hi,
We've been tasked with running some failover tests on a newly set up OLVM cluster (two KVM nodes / central FC-SAN storage). Provoking a node failure results in the VMs being started on the other, working node, but unplugging the Ethernet cables did not.
Fencing the offline node via the baseboard management controller (iDRAC) worked well, but the cluster was not able to start the VMs afterwards. Digging through the vdsm log revealed that there was still a storage lock on the VMs at that point, which prevented the cluster from starting them.
Starting the VMs by hand worked. So it seems to me that the cluster tries to start the VMs too early and gives up when it hits the storage lock. I looked through the engine config options but found none that seems to control when the cluster tries to start an HA VM on another node. What I was able to find is an RH bug report (1670339) saying that this issue will be fixed in ovirt-4.4.2.