6 Replies Latest reply: Sep 18, 2012 9:08 AM by Sebastian Solbach -Dba Community-Oracle

    startup of Clusterware with missing voting disk

    953736
      Hello,

      in our environment we have a 2 node cluster.
      The 2 nodes and 2 SAN storages are in different rooms.
      Voting files for Clusterware are in ASM.
      Additionally we have a third voting disk on an NFS server (configured as in this description: http://www.oracle.com/technetwork/products/clusterware/overview/grid-infra-thirdvoteonnfs-131158.pdf)
      The quorum flag is set on the disk that is on NFS.
      The diskgroup has normal redundancy.
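      Roughly how we set this up, following that paper (the file path, size, and failgroup name below are illustrative, not our literal values):

      # on the NFS mount: create the zero-padded file backing the quorum disk
      # (the asm_diskstring also has to include this path)
      dd if=/dev/zero of=/voting_disk/vote_3 bs=1M count=500

      -- in ASM: add the file as a quorum failgroup to the existing diskgroup
      ALTER DISKGROUP POSTOCW ADD QUORUM FAILGROUP nfs_fg DISK '/voting_disk/vote_3';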

      Clusterware keeps running when one of the VDs gets lost (e.g. storage failure).
      So far so good.

      But when I have to restart Clusterware (e.g. after a node reboot) while the VD is still missing, Clusterware does not come up.

      I did not find an indication of whether this is intended behaviour of Clusterware or whether I have missed a detail.
      From my point of view, starting Clusterware should work as long as the majority of the VDs is available.

      Thanks.
        • 1. Re: startup of Clusterware with missing voting disk
          phaeus
          Hello,
          the way the cluster handles voting and OCR disks is not the same in 11.1 and 11.2.

          Can you post your version?

          regards
          Peter
          • 2. Re: startup of Clusterware with missing voting disk
            damorgan
            Why do you think it should restart with a missing voting disk?

            Human nature being what it is ... if Oracle made it work that way ... no one would ever fix the problem until they completely destroyed the grid infrastructure.

            It may be that there is a workaround, but I would sincerely hope that there is not.
            • 3. Re: startup of Clusterware with missing voting disk
              953736
              @phaeus: sorry, I forgot to mention: the version is 11.2.0.3

              @damorgan: I think it should restart with a missing voting disk because an HA solution should be able to
              handle a hardware fault of this dimension. Why do I mirror disks to different locations? Not because I enjoy wasting money
              on redundant hardware. I do it because I want to make my infrastructure more resilient to hardware failures etc.

              Back to topic:
              In the meantime I did some further investigation.
              When I restarted Clusterware with one disk missing, CSSD.bin started. The ASM instance also started (or tried to start) but
              logged some error messages:
              ERROR: diskgroup POSTOCW was not mounted
              WARNING: Disk Group POSTOCW containing spfile for this instance is not mounted
              WARNING: Disk Group POSTOCW containing configured OCR is not mounted
              WARNING: Disk Group POSTOCW containing voting files is not mounted
              ORA-15032: not all alterations performed
              ORA-15040: diskgroup is incomplete
              ORA-15042: ASM disk "4" is missing from group number "1"


              after "alter diskgroup POSTOCW online force;" the disk group was mounted and the Clusterware "continued" the startup process.

              Is there a possibility to get Clusterware restart correctly without this hack?
              • 4. Re: startup of Clusterware with missing voting disk
                phaeus
                Hello,
                 normally the cluster can relocate a voting disk to another failgroup if one fails. What I did in my cluster installations was to use more than just 2 disks with normal redundancy: my diskgroup layout has 5 disks with normal/high redundancy. If I use normal redundancy because there is no preferred datacenter, I use the high mirror template for the files (you may have to recreate your diskgroup and recreate the ASM init.ora and OCR). With the high mirror file template the extents on a disk are written not only to one disk partner but to every disk partner. I have done several tests with customers, including split-brain scenarios, and the cluster came back online after a reboot.
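
                 As a sketch (the disk paths and names here are just examples), a high redundancy diskgroup with 5 failgroups, which gives you 5 voting files, could look like:

                 CREATE DISKGROUP OCRVOTE HIGH REDUNDANCY
                   FAILGROUP fg1 DISK '/dev/asm-ocr1'
                   FAILGROUP fg2 DISK '/dev/asm-ocr2'
                   FAILGROUP fg3 DISK '/dev/asm-ocr3'
                   FAILGROUP fg4 DISK '/dev/asm-ocr4'
                   FAILGROUP fg5 DISK '/dev/asm-ocr5'
                   ATTRIBUTE 'compatible.asm' = '11.2';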

                regards
                Peter
                • 5. Re: startup of Clusterware with missing voting disk
                  damorgan
                   It does keep working through a loss of infrastructure ... but restart? If it did, no one would ever correct the problem.
                  • 6. Re: startup of Clusterware with missing voting disk
                    Sebastian Solbach -Dba Community-Oracle
                    Hi,

                     actually what you see is expected (especially in a stretched cluster environment with 2 failgroups and 1 quorum failgroup).
                     It has to do with how ASM handles a disk failure and does the mirroring (and the strange "issue" that you need a third failgroup for the votedisks).

                     So before looking at this special case, let's look at how ASM normally treats a diskgroup:

                     A diskgroup can only be mounted in normal mode if all disks of the diskgroup are online. If a disk is missing, ASM will not allow you to mount the diskgroup "normally" before the error situation is resolved. If a lost disk's contents can be mirrored to other disks, ASM will be able to restore full redundancy and will allow you to mount the diskgroup. If this is not the case, ASM expects the user to tell it what to do => the administrator can issue an "alter diskgroup mount force" to tell ASM to mount with disks missing even though it cannot maintain the required redundancy. This then allows the administrator to correct the error (or replace failed disks/failgroups). While ASM has the diskgroup mounted, the loss of a failgroup will not result in a dismount of the diskgroup.
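
                     Illustrated with the diskgroup from this thread (a sketch; the error lines are the ones already posted above, and the force mount assumes enough disks survive):

                     SQL> alter diskgroup POSTOCW mount;
                     ORA-15032: not all alterations performed
                     ORA-15040: diskgroup is incomplete
                     ORA-15042: ASM disk "4" is missing from group number "1"

                     SQL> alter diskgroup POSTOCW mount force;
                     Diskgroup altered.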

                     The same holds true for the diskgroup containing the voting disks. So what you see (the cluster continues to run, but cannot restart) is pretty much the same as for a normal diskgroup: if a disk is lost and its contents do not get relocated (e.g. if the quorum failgroup fails, the third vote cannot be relocated, since there is no further failgroup to relocate it to), the cluster will continue to run, but after a restart it will not be able to automatically mount the diskgroup in normal mode while the disk is missing.

                     To bring the cluster back online, manual intervention is required: start the cluster in exclusive mode:
                     crsctl start crs -excl
                     Then connect to ASM and do a
                     alter diskgroup <dgname> mount force
                     Then resolve the error (e.g. add another disk to another failgroup so that the data can be remirrored, after which the failed disk can be dropped).

                    After that a normal startup will be possible again.
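
                     Put together, the whole sequence looks roughly like this (a sketch; the failgroup name, disk path, and failed disk name are placeholders for your environment):

                     # start the stack in exclusive mode on one node
                     crsctl start crs -excl

                     # connect to the ASM instance
                     sqlplus / as sysasm
                     SQL> alter diskgroup POSTOCW mount force;
                     -- add a replacement disk in a new failgroup so the data can be remirrored,
                     -- then drop the failed disk
                     SQL> alter diskgroup POSTOCW add failgroup fg_new disk '/dev/asm-new1';
                     SQL> alter diskgroup POSTOCW drop disk POSTOCW_0004 force;

                     # then restart the stack normally
                     crsctl stop crs
                     crsctl start crs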

                    Regards
                    Sebastian