14 Replies Latest reply on Feb 23, 2010 3:11 PM by 807567

    CCR Initialization Failure

    807567
I'm trying to install Sun Cluster on LDoms 1.3. A two-node configuration isn't working, so now I'm trying to get just one node up.

      Config:
      2 Public Vnets
      1 Private Vnet

1 shared quorum disk (EMC CX-500, 1 Gb): the whole disk (EMC PowerPath device /dev/dsk/emcpower15c) is given to the LDOM.

      VFSTAB:
      #/dev/dsk/c0d1s0 /dev/rdsk/c0d1s0 /globaldevices ufs 1 no -
      /dev/dsk/c0d1s0 /dev/rdsk/c0d1s0 /global/.devices/node@1 ufs 2 no global
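
For reference, a minimal sketch of sanity-checking that global-devices slice before booting into the cluster (the device name is taken from the vfstab above; this is not from the original post):

```shell
# Sketch, assuming the global-devices slice is c0d1s0 as in the vfstab:
# check the filesystem and confirm it mounts cleanly at the global path
fsck -y /dev/rdsk/c0d1s0
mount /global/.devices/node@1
df -k /global/.devices/node@1
```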

ifconfig -a output:
      [root@fsdev2w]# ifconfig -a
      lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
      inet 127.0.0.1 netmask ff000000
      vnet0: flags=9000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 2
      inet 10.25.23.92 netmask ff000000 broadcast 10.255.255.255
      groupname sc_ipmp0
      ether 0:14:4f:fb:74:7f
      vnet0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
      inet 0.0.0.0 netmask ff000000 broadcast 0.255.255.255
      vnet1: flags=69000842<BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER,STANDBY,INACTIVE> mtu 0 index 4
      inet 0.0.0.0 netmask 0
      groupname sc_ipmp0
      ether 0:14:4f:f9:c:13
      clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 3
      inet 192.168.204.33 netmask fffffff0 broadcast 192.168.204.47
      ether 0:0:0:0:0:1

(I brought up vnet1 myself; the install didn't. Bringing it up made no difference either.)

The install goes fine with /globaldevices mounted on the quorum disk. (This disk is shared between the 2 nodes; the second node's LDOM, on a separate physical server, is shut down.)

      The LOFI method doesn't work either.

      On reboot I get the following:
      Sun Blade T6340 Server Module, No Keyboard
      Copyright 2009 Sun Microsystems, Inc. All rights reserved.
      OpenBoot 4.30.6, 8192 MB memory available, Serial #83403189.
      Ethernet address 0:14:4f:f8:a1:b5, Host ID: 84f8a1b5.



      Boot device: ch1bl1gldm3wbdsk File and args:
      SunOS Release 5.10 Version Generic_141444-09 64-bit
      Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
      Use is subject to license terms.
      Hostname: fsdev2w
      NOTICE: mddb: unable to get devid for 'vdc', 0x7
      NOTICE: mddb: unable to get devid for 'vdc', 0x7
      NOTICE: mddb: unable to get devid for 'vdc', 0x7
      Configuring devices.
      Reading ZFS config: done.

      fsdev2w console login: root
      Password:
      Last login: Sun Feb 14 23:00:57 on console
      Sun Microsystems Inc. SunOS 5.10 Generic January 2005
      [root@fsdev2w]# Booting in cluster mode
      NOTICE: CMM: Node fsdev2w (nodeid = 1) with votecount = 1 added.
      NOTICE: CMM: Node fsdev2w: attempting to join cluster.
      NOTICE: CMM: Cluster has reached quorum.
      NOTICE: CMM: Node fsdev2w (nodeid = 1) is up; new incarnation number = 1266206912.
      NOTICE: CMM: Cluster members: fsdev2w.
      NOTICE: CMM: node reconfiguration #1 completed.
      Feb 14 23:08:35 fsdev2w cl_runtime: NOTICE: CMM: Node fsdev2w: joined cluster.
      Feb 14 23:08:35 fsdev2w ip: ip: joining multicasts failed (18) on clprivnet0 - will use link layer broadcasts for multicast
      Configuring DID devices
      obtaining access to all attached disks
      Configuring the /dev/global directory (global devices)
      Feb 14 23:08:52 fsdev2w Cluster.CCR: /usr/cluster/bin/scgdevs: Cannot register devices as HA.
      Feb 14 23:08:57 fsdev2w : ccr_initialize failure
      Feb 14 23:09:02 fsdev2w last message repeated 8 times
      Feb 14 23:09:03 fsdev2w svc.startd[8]: system/cluster/scdpm:default failed repeatedly: transitioned to maintenance (see 'svcs -xv' for details)



      fsdev2w console login:


I understand the CCR & CMM are crucial to a functioning cluster (found out the hard way when I rebooted the second node in a 2-node config and the first node panicked :( ).

      Can someone PLEASE PLEASE HELP!

Please help! This is driving me nuts. I thought it would be pretty straightforward.

      Edited by: nvmurali on Feb 14, 2010 8:15 PM
        • 1. Re: CCR Initialization Failure
          807816
          Which networks are which? I only see vnet0 and vnet1. clprivnet0 is constructed on some underlying network.

          I think it would be easier if you started with 4 networks, 2 for public and 2 for private.

          Tim
          ---
          • 2. Re: CCR Initialization Failure
            807567
            Hi Tim,

            The 192.168.204 is the private network and the 10.25.23.x is the public network.

I have 2 private vsws and could give the LDOM 2 private vnets, but both of them are connected to the same switch, and scinstall doesn't like that: it stops configuring.

            The only thing I'd add is this:
            [root@fsdev2w]# scgdevs
            Configuring DID devices
            Configuring the /dev/global directory (global devices)
            Feb 15 19:39:16 fsdev2w Cluster.CCR: scgdevs: Cannot register devices as HA.
            [root@fsdev2w]# pkgchk -n SUNWcsd
            ERROR: /devices/pseudo/clone@0:hme
            pathname does not exist
            ERROR: /devices/pseudo/cvc@0:cvc
            pathname does not exist
            ERROR: /devices/pseudo/cvcredir@0:cvcredir
            pathname does not exist
            ERROR: /etc/devlink.tab
            group name <sys> expected <other> actual
            ERROR: /etc/iu.ap
            group name <sys> expected <other> actual


I wonder if this has anything to do with the error.

Unfortunately, the workaround suggested by Sun has no effect:
            [root@fsdev2w]# pkgchk -nf SUNWcsd
            ERROR: /devices/pseudo/clone@0:hme
            pathname does not exist
            unable to create character-special device
            ERROR: /devices/pseudo/cvc@0:cvc
            pathname does not exist
            unable to create character-special device
            ERROR: /devices/pseudo/cvcredir@0:cvcredir
            pathname does not exist
            unable to create character-special device
            [root@fsdev2w]#
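
For what it's worth, the two group-ownership mismatches could in principle be corrected by hand (a sketch only; the missing /devices pseudo nodes are another matter):

```shell
# Sketch: fix the group ownership pkgchk complains about, then re-check
chgrp sys /etc/devlink.tab /etc/iu.ap
pkgchk -n SUNWcsd   # the /devices pseudo-node errors may well remain
```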

            Edited by: nvmurali on Feb 15, 2010 4:40 PM

            • 3. Re: CCR Initialization Failure
              807816
              I would expect to see 4 separate vnets plumbed in. Can you confirm that if you do not install cluster you can plumb in these NICs from within the LDOM? Before cluster is installed the two private networks should not be plumbed in nor have any addresses assigned to them.
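
Something along these lines, run from within the LDOM, should confirm the NICs are usable (a sketch; the interface names are assumed from the earlier output):

```shell
# Sketch: plumb each candidate interface by hand, check it appears,
# then unplumb it again so the cluster install starts from a clean state
ifconfig vnet2 plumb
ifconfig vnet2
ifconfig vnet2 unplumb
dladm show-link   # every vnet assigned to the guest should be listed
```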

              Thanks

              Tim
              ---
              • 4. Re: CCR Initialization Failure
                807567
                Tim,

I can plumb 3 vnets: 2 for public and 1 for private. I choose the custom configuration during scinstall, with vnet0 and vnet1 for public and vnet2 for the private interconnect.

While I do have 4 vnets I could give the LDOM, both private Ethernet interfaces are connected to the same switch, so scinstall fails (by default it expects them to go to different switches). This is our development environment, so everything is connected to the same switch, with different VLANs for public and private traffic.

The private heartbeat (192.168.204.x) is on an unrouted VLAN on the switch, so no external VLAN traffic can get to it.
                • 5. Re: CCR Initialization Failure
                  807567
                  Tim,

Also, this is weird: I have now reinstalled the OS (full jumpstart install), applied the latest recommended patch cluster for Solaris 10, installed Sun Cluster 3.2, and patched it with 126106-40.

                  I then tried to create a single node cluster.
When it boots up, it comes up with only 1 public vnet & 1 private vnet:

                  [root@fsdev2w]# ifconfig -a
                  lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
                  inet 127.0.0.1 netmask ff000000
                  vnet0: flags=9000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,NOFAILOVER> mtu 1500 index 2
                  inet 10.25.23.92 netmask fffffc00 broadcast 10.25.23.255
                  groupname sc_ipmp0
                  ether 0:14:4f:fb:bb:db
                  clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 3
                  inet 192.168.204.33 netmask fffffff0 broadcast 192.168.204.47
                  ether 0:0:0:0:0:1
                  [root@fsdev2w]#


clprivnet0 also comes up with the wrong netmask: it should be 255.255.255.0 (which I specified during install).
                  [root@fsdev2w]# dladm show-link
                  vnet0 type: non-vlan mtu: 1500 device: vnet0
                  vnet1 type: non-vlan mtu: 1500 device: vnet1
                  vnet2 type: non-vlan mtu: 1500 device: vnet2
                  clprivnet0 type: legacy mtu: 1486 device: clprivnet0
                  [root@fsdev2w]#

                  Shouldn't it be using vnet2 which is what I said should be used for private interconnect?
                  • 6. Re: CCR Initialization Failure
                    807816
                    Unfortunately, I've not set up an LDoms cluster and I don't have access to one right now to check anything. All I can suggest is to try and follow the example http://blogs.sun.com/Maddy/entry/how_to_configure_ldom_guest as closely as possible and then try your particular variation.

Have you missed off the 'mode=SC' setting by any chance?
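
For reference, a sketch of what I mean, run on the control domain (the switch and backing-device names here are assumptions, not from your config):

```shell
# Sketch: create the private virtual switch with mode=sc, which is
# needed for Sun Cluster heartbeats over LDoms virtual networking,
# then give the guest a vnet on it
ldm add-vsw mode=sc net-dev=e1000g1 private-vsw1 primary
ldm add-vnet vnet2 private-vsw1 <ldomname>
```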

                    Tim
                    ---
                    • 7. Re: CCR Initialization Failure
                      807567
mode=sc is set.

The only difference between the article and my setup is that I have a switch instead of a cable between the 2 physical servers for the private interconnect, and I have 2 public interfaces and 1 private interface.

What I don't get is why even a single-node cluster fails the scgdevs command with "Cannot register devices as HA".

Surely I'm missing something else? Do I have to set up Solaris Volume Manager or anything before I start the cluster?
                      • 8. Re: CCR Initialization Failure
                        807816
                        As I have no experience of this (and I've not seen others chime in) there isn't much I can suggest. Can your switch not create separate VLANs for the private networks so that you can have two private networks going through one switch?

                        The other option is to start by building a single node cluster. That does not need private networks. That would limit the possibilities for errors. If you can get that working, then you can add more nodes.

                        Tim
                        ---
                        • 9. Re: CCR Initialization Failure
                          807567
I had the same/similar problem years ago when installing SC... scinstall just would not create the private network (i.e. the interconnects?), and trying to hack it in manually was to no avail.

                          We had two separate sets of cisco switches (4 in total), one set for public network and the other for private. Public fine, private just would not work.

It turned out that the switch was being too "clever"... it was running spanning tree and all sorts of "extras". The network guys switched everything off on the switch, and then scinstall worked straight away and could set up the private network no problem?!?!?!

(Sorry about the lack of exact info, this was a few years ago.)
                          • 10. Re: CCR Initialization Failure
                            807567
                            Tim / darko15

                            I tried creating a single node with no switch. I let it use the LOFI for /globaldevices. Everything installs, and reboots.

                            On reboot I still get the error:

                            Configuring DID devices
Configuring the /dev/global directory (global devices)
Feb 17 14:34:18 fsdev2w Cluster.CCR: scgdevs: Cannot register devices as HA.
                            • 11. Re: CCR Initialization Failure
                              807816
                              Please log a service call. This is clearly something we cannot resolve on this forum.

                              Thanks,

                              Tim
                              ---
                              • 12. Re: CCR Initialization Failure
                                807567
I guess I'll have to. I don't have Sun Cluster support, though, so I don't know how far I'll get, and this is the development environment.

I tried calling to get a quote for Sun Cluster, and apparently, because of the merger with Oracle, they're not even selling Sun Cluster support for a few weeks! Ha!
                                • 13. Re: CCR Initialization Failure
                                  807567
Turns out it's a bug from Solaris 11 that sneaked into 10, or so they think. Looks like it'll have to get escalated and I may need an IDR. If any of you are facing this, you will have to call Sun support.
                                  • 14. Re: CCR Initialization Failure
                                    807567
Figured out the solution. It apparently has to do with LDoms and EMC PowerPath devices. The way to provide disks to an LDOM for quorum or shared storage is via the c#t#d# device, and not directly via the EMC PowerPath device.

So you have to create an mpgroup when adding the vdsdev, and then add the vdisk to the LDOM.

E.g.:
                                    ldm add-vdsdev mpgroup=cludatagrp /dev/dsk/c3t50060160302129A5d10s2 nfscludatadsk@primary-vds0
                                    ldm add-vdsdev mpgroup=cludatagrp /dev/dsk/c3t50060160302129A5d10s2 nfscludatadsk@primary-vds1

                                    ldm add-vdisk id=2 nfscludatadsk nfscludatadsk@primary-vds0 <ldomname>

You need to do this for all disks that are going to be managed by Sun Cluster software within an LDOM (and probably even without it), and obviously on both nodes.
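
A rough way to verify the result from within the guest (a sketch, not from the original fix):

```shell
# Sketch: after remapping the disks via their c#t#d# paths, confirm
# Sun Cluster can build its device IDs and register global devices
scdidadm -L   # list DID device mappings across nodes
scgdevs       # should now complete without "Cannot register devices as HA"
```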

                                    Hopefully this will save others time.