In our environment we have a 2-node cluster.
The 2 nodes and 2 SAN storages are in different rooms.
Voting files for Clusterware are in ASM.
Additionally we have a third voting disk on an NFS server (configured as in this description: http://www.oracle.com/technetwork/products/clusterware/overview/grid-infra-thirdvoteonnfs-131158.pdf&ei=lzJXUPvJMsn-4QTJ8YDoDg&usg=AFQjCNGxaRWhwfTehOml-KgGGeRkl4yOGw)
The Quorum flag is on the disk that is on NFS.
The diskgroup uses normal redundancy.
Clusterware keeps running when one of the VDs gets lost (e.g. storage failure).
So far so good.
But when I have to restart Clusterware (e.g. reboot of a node) while the VD is still missing, Clusterware does not come up.
I did not find any indication whether this is planned behaviour of Clusterware or whether I missed a detail.
From my point of view it should be possible to start Clusterware as long as the majority of VDs are available.
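The majority rule referred to here can be illustrated with a small sketch. The function name has_vote_majority is hypothetical and only models the arithmetic (a node must see strictly more than half of the configured voting files). It is not Clusterware's actual logic and says nothing about whether ASM can mount the diskgroup.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: has_vote_majority TOTAL ONLINE
# Succeeds (exit status 0) if ONLINE is a strict majority of TOTAL
# configured voting files.
has_vote_majority() {
    total=$1
    online=$2
    [ $((online * 2)) -gt "$total" ]
}

# Three voting files configured, one lost: 2 of 3 is still a majority.
if has_vote_majority 3 2; then
    echo "3 configured, 2 online: majority"
fi

# Three voting files configured, two lost: 1 of 3 is not a majority.
if ! has_vote_majority 3 1; then
    echo "3 configured, 1 online: no majority"
fi
```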
@phaeus: Sorry, I forgot to mention it. The version is 18.104.22.168
@damorgan: I think it should restart with a missing voting disk because an HA solution should be able to
handle a hardware fault of this dimension. Why do I mirror disks to different locations? Not because I enjoy wasting money
on redundant hardware. I do it because I want to make my infrastructure more resilient against hardware failures and the like.
Back to topic:
In the meantime I did some further investigation.
When I restarted Clusterware with one disk missing, CSSD.bin started. The ASM instance also started (or tried to start) but
logged some error messages:
ERROR: diskgroup POSTOCW was not mounted
WARNING: Disk Group POSTOCW containing spfile for this instance is not mounted
WARNING: Disk Group POSTOCW containing configured OCR is not mounted
WARNING: Disk Group POSTOCW containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "1"
After "alter diskgroup POSTOCW online force;" the disk group was mounted and Clusterware "continued" the startup process.
Is there a way to get Clusterware to restart correctly without this hack?
Normally the cluster can relocate a voting disk to another failgroup if one fails. In my cluster installations I do not use just 2 disks with normal redundancy: my diskgroup layout has 5 disks with normal or high redundancy. If I use normal redundancy because I have no preferred data center, I use the high mirror template for the files (you may have to recreate your diskgroup and recreate initasm.ora and the OCR). With the high mirror file template, the extents are written not just to one partner disk but to every partner disk. I have run several tests with customers, including split-brain scenarios, and the cluster came back online after a reboot.
Actually, what you see is expected (especially in a stretched cluster environment with 2 failgroups and 1 quorum failgroup).
It has to do with how ASM handles a disk failure and how it does the mirroring (and the strange "issue" that you need a third failgroup for the voting disks).
So before looking at this special case, let's look at how ASM normally treats a diskgroup:
A diskgroup can only be mounted in normal mode if all disks of the diskgroup are online. If a disk is missing, ASM will not allow you to mount the diskgroup "normally" until the error situation is resolved. If a disk is lost whose contents can be mirrored to other disks, ASM will be able to restore full redundancy and will allow you to mount the diskgroup. If this is not the case, ASM expects the user to tell it what to do: the administrator can issue an "alter diskgroup mount force" to tell ASM that it should mount with disks missing even though it cannot maintain the required redundancy. This then allows the administrator to correct the error (or replace failed disks/failgroups). While ASM has the diskgroup mounted, the loss of a failgroup will not result in a dismount of the diskgroup.
The same holds true for the diskgroup containing the voting disks. So what you see (it continues to run but cannot restart) is pretty much the same as for a normal diskgroup: if a disk is lost and its contents are not relocated (for example, if the quorum failgroup fails, the third vote cannot be relocated because there are no more failgroups to relocate it to), the cluster will continue to run, but it will not be able to automatically remount the diskgroup in normal mode.
To bring the cluster back online, manual intervention is required: Start the cluster in exclusive mode:
crsctl start crs -excl
Then connect to ASM and run:
alter diskgroup <dgname> mount force
Then resolve the error (for example, add another disk to another failgroup so that the data can be remirrored and the failed disk can be dropped).
After that a normal startup will be possible again.
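The steps above can be put together in a rough sketch like the following. It is only an illustration under assumptions: the function names (run, run_asm_sql, recover_cluster) and the DRY_RUN switch are made up for this example, the "crsctl query css votedisk" check and the final stop/start are additions not spelled out above, and it assumes the Grid Infrastructure environment is set so that "sqlplus / as sysasm" reaches the local ASM instance. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the manual recovery sequence described above.
# Assumptions (not from the thread): the caller may run crsctl, and
# ORACLE_HOME / ORACLE_SID point at the local ASM instance.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Send one SQL statement to the local ASM instance.
run_asm_sql() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ sqlplus / as sysasm: $1"
    else
        printf '%s\n' "$1" | sqlplus -S "/ as sysasm"
    fi
}

# recover_cluster DGNAME: force-mount the diskgroup holding the voting files.
recover_cluster() {
    dg=$1
    run crsctl start crs -excl                      # start the stack in exclusive mode
    run_asm_sql "alter diskgroup $dg mount force;"  # mount despite the missing disk
    run crsctl query css votedisk                   # extra check: list visible voting files
    echo "# repair or replace the failed disk/failgroup before continuing"
    run crsctl stop crs -f                          # leave exclusive mode
    run crsctl start crs                            # normal startup
}

# POSTOCW is the diskgroup name from the error log earlier in the thread.
recover_cluster "${1:-POSTOCW}"
```

To actually execute the commands rather than print them, the script would be run with DRY_RUN=0, after checking each step against your own environment.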