How the cluster handles voting and OCR disks is not the same in 11.1 and 11.2.
Can you post your version?
Why do you think it should restart with a missing voting disk?
Human nature being what it is ... if Oracle made it work that way ... no one would ever fix the problem until they completely destroyed the grid infrastructure.
It may be that there is a workaround, but I would sincerely hope that there is not.
@phaeus: sorry. forgot to mention. version is 126.96.36.199
@damorgan: I think it should restart with a missing voting disk because an HA solution should be able to
handle a hardware fault of this magnitude. Why do I mirror disks to different locations? Not because I have fun wasting money
on redundant hardware. I do it because I want to make my infrastructure more robust against hardware failures and the like.
Back to topic:
In the meantime I did some further investigation.
When I restarted Clusterware with one disk missing, CSSD.bin started. The ASM instance also started (or tried to start) but
logged some error messages:
ERROR: diskgroup POSTOCW was not mounted
WARNING: Disk Group POSTOCW containing spfile for this instance is not mounted
WARNING: Disk Group POSTOCW containing configured OCR is not mounted
WARNING: Disk Group POSTOCW containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "4" is missing from group number "1"
After "alter diskgroup POSTOCW online force;" the disk group was mounted and Clusterware "continued" the startup process.
Is there a way to get Clusterware to restart correctly without this hack?
Normally the cluster can relocate a voting disk to another failgroup if one fails. In my cluster installation I did not use just 2 disks with normal redundancy; my diskgroup layout has 5 disks with normal/high redundancy. If I use normal redundancy because I have no preferred datacenter, I use the high mirror template for the files (you may have to recreate your diskgroup and recreate initasm.ora and the OCR). With the high mirror file template, the extents on disk are written not just to one disk partner but to any disk partner. I have done several tests with customers, including split brain, and the cluster came back online after reboot.
It does work with a loss of infrastructure ... but restart? If it did no one would ever correct the problem.
Actually, what you see is expected (especially in a stretched cluster environment with 2 failgroups and 1 quorum failgroup).
It has to do with how ASM handles a disk failure and does the mirroring (and the strange "issue" that you need a third failgroup for the voting disks).
So before looking at this special case, let's look at how ASM normally treats a diskgroup:
A diskgroup can only be mounted in normal mode if all disks of the diskgroup are online. If a disk is missing, ASM will not allow you to mount the diskgroup "normally" until the error situation is resolved. If a disk is lost whose contents can be mirrored to other disks, then ASM will be able to restore full redundancy and will allow you to mount the diskgroup. If this is not the case, ASM expects the user to tell it what to do: the administrator can issue an "alter diskgroup mount force" to tell ASM that it should mount with disks missing even though it cannot maintain the required redundancy. This then allows the administrator to correct the error (or replace failed disks/failgroups). While ASM has the diskgroup mounted, the loss of a failgroup will not result in a dismount of the diskgroup.
The same holds true for the diskgroup containing the voting disks. So what you see (it will continue to run, but cannot restart) is pretty much the same as for a normal diskgroup: if a disk is lost and its contents do not get relocated (for example, if the quorum failgroup fails, ASM cannot relocate the third vote because there are no more failgroups to relocate it to), the cluster will continue to run, but it will not be able to automatically remount the diskgroup in normal mode after a restart.
To bring the cluster back online, manual intervention is required. Start the cluster in exclusive mode:
crsctl start crs -excl
Then connect to ASM and do a
alter diskgroup <dgname> mount force
Then resolve the error (for example by adding another disk in another failgroup, so that the data can be remirrored and the failed disk can be dropped).
After that a normal startup will be possible again.
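For reference, the manual recovery described in this reply can be laid out as a checklist. The sketch below only prints the commands and does not execute anything, so it can be reviewed before being run by hand. The helper name print_recovery_steps is hypothetical, POSTOCW is simply the diskgroup name from the error log earlier in the thread, and the verification queries and the final stop/start are additions that were not part of the original steps. Exact behaviour may differ between 11.2 patch levels, so treat this as an outline under those assumptions rather than a tested procedure.

```shell
#!/bin/sh
# Hypothetical helper: prints the recovery steps for one diskgroup.
# It does NOT run crsctl or sqlplus; it only writes the commands to stdout.
print_recovery_steps() {
    dg="$1"   # diskgroup name, e.g. POSTOCW from the log in this thread
    cat <<EOF
# 1. As root: start the stack on one node in exclusive mode
crsctl start crs -excl
# 2. As the Grid owner, in "sqlplus / as sysasm": force-mount and check state
alter diskgroup ${dg} mount force;
select name, state from v\$asm_diskgroup;
# 3. As root: check which voting files are visible (added verification step)
crsctl query css votedisk
# 4. After the failed disk/failgroup has been repaired: normal restart
crsctl stop crs -f
crsctl start crs
EOF
}

print_recovery_steps POSTOCW
```

Because the function only prints text, it is safe to run on any machine; the commands themselves still have to be issued manually on the cluster node, in the order shown.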