How to Build a Cassandra Multinode Database Cluster on Oracle Solaris 11.3 with LUN Mirroring and IP Multipathing

Version 3

    by Antonis Tsavdaris

     

    This article describes how to build a Cassandra single-rack database cluster on Oracle Solaris 11.3 and extend its overall availability with LUN mirroring and IP network multipathing.

     

    The Cassandra database is a popular distributed database management system from the Apache Software Foundation. It is highly scalable and has a masterless architecture: there is no primary node to which the other nodes are subservient. Every node in the cluster is equal, and any node can service any request.

     

    Oracle Solaris 11 is an enterprise-class operating system known for its reliability, availability, and serviceability (RAS) features. Its wealth of integrated features helps administrators build redundancy into every part of the system they deem critical, including the network, storage, and so on.

     

    This how-to article describes how to build a Cassandra single-rack database cluster on Oracle Solaris 11.3 and extend its overall availability with LUN mirroring and IP network multipathing (IPMP). LUN mirroring will provide extended availability at the storage level and IPMP will add redundancy to the network.

     

    In this scenario, the one-rack cluster is composed of six Oracle Solaris server instances. Three of them—dbnode1, dbnode2, and dbnode3—will be the database nodes and the other three—stgnode1, stgnode2, and stgnode3—will provide highly available storage. The highly available storage will be constructed from nine LUNs, three in each storage node.

     

    At the end of the construction, the one-rack cluster will have a fully operational database even if two of the storage nodes are not available. Furthermore, the networks—the public network and the iSCSI network—will be immune to hardware failures through IPMP groups consisting of an active and a standby network card.

     

    Cluster Topology

     

    All servers have the Oracle Solaris 11.3 operating system installed. Table 1 depicts the cluster architecture.

     

    In this design, the Cassandra binaries as well as the data reside on the storage nodes; the database nodes run the database instances.

     

    Table 1. Oracle Solaris servers and their role in the cluster.

     

    Node Name  Role in the Cluster  Contains
    dbnode1    Database node        Running instance
    dbnode2    Database node        Running instance
    dbnode3    Database node        Running instance
    stgnode1   Storage node         Binaries and data
    stgnode2   Storage node         Binaries and data
    stgnode3   Storage node         Binaries and data

     

    Network Interface Cards

     

    As shown in Table 2, every server in the cluster has four network interface cards (NICs) installed, named net0 through net3. Redundancy is required at the network level, and it will be provided by IPMP groups. IP multipathing requires that the DefaultFixed network profile be activated and that static IP addresses be assigned to every network interface.

     

    Table 2. NICs and IPMP group configuration.

                                                                                                                  

    Node Name  NIC   Primary/Standby NIC  IP/Subnet        IPMP Group Name  IPMP IP Address  Role
    dbnode1    net0  primary              192.168.2.10/24  IPMP0            192.168.2.22/24  Public network
               net1  standby              192.168.2.11/24
               net2  primary              10.0.1.1/27      IPMP1            10.0.1.13/27     iSCSI initiator
               net3  standby              10.0.1.2/27
    dbnode2    net0  primary              192.168.2.12/24  IPMP2            192.168.2.23/24  Public network
               net1  standby              192.168.2.13/24
               net2  primary              10.0.1.3/27      IPMP3            10.0.1.14/27     iSCSI initiator
               net3  standby              10.0.1.4/27
    dbnode3    net0  primary              192.168.2.14/24  IPMP4            192.168.2.24/24  Public network
               net1  standby              192.168.2.15/24
               net2  primary              10.0.1.5/27      IPMP5            10.0.1.15/27     iSCSI initiator
               net3  standby              10.0.1.6/27
    stgnode1   net0  primary              192.168.2.16/24  IPMP6            192.168.2.25/24  Public network
               net1  standby              192.168.2.17/24
               net2  primary              10.0.1.7/27      IPMP7            10.0.1.16/27     iSCSI target
               net3  standby              10.0.1.8/27
    stgnode2   net0  primary              192.168.2.18/24  IPMP8            192.168.2.26/24  Public network
               net1  standby              192.168.2.19/24
               net2  primary              10.0.1.9/27      IPMP9            10.0.1.17/27     iSCSI target
               net3  standby              10.0.1.10/27
    stgnode3   net0  primary              192.168.2.20/24  IPMP10           192.168.2.27/24  Public network
               net1  standby              192.168.2.21/24
               net2  primary              10.0.1.11/27     IPMP11           10.0.1.18/27     iSCSI target
               net3  standby              10.0.1.12/27

     

    First, ensure that the network service is up and running. Then check whether the network profile is set to DefaultFixed.

     

    root@dbnode1:~# svcs network/physical

    STATE          STIME    FMRI

    online         1:25:45  svc:/network/physical:upgrade

    online         1:25:51  svc:/network/physical:default

     

    root@dbnode1:~# netadm list

    TYPE        PROFILE        STATE

    ncp         Automatic      disabled

    ncp         DefaultFixed   online

    loc         DefaultFixed   online

    loc         Automatic      offline

    loc         NoNet          offline
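
    If netadm list shows the Automatic NCP enabled instead of DefaultFixed, the fixed profile can be activated before continuing. A minimal sketch (switching profiles briefly reconfigures the network, so run it from the console rather than over a remote connection):

    root@dbnode1:~# netadm enable -p ncp DefaultFixed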

     

     

    Because the network profile is set to DefaultFixed, review the network interfaces and the data link layer.

     

    root@dbnode1:~# dladm show-phys
    LINK              MEDIA                STATE      SPEED  DUPLEX    DEVICE
    net0              Ethernet             unknown    1000   full      e1000g0
    net1              Ethernet             unknown    1000   full      e1000g1
    net3              Ethernet             unknown    1000   full      e1000g3
    net2              Ethernet             unknown    1000   full      e1000g2

     

    Create the IP interface for net0 and then configure a static IPv4 address.

     

    root@dbnode1:~# ipadm create-ip net0

    root@dbnode1:~# ipadm create-addr -T static -a 192.168.2.10/24 net0/v4

    root@dbnode1:~# ipadm show-addr

    ADDROBJ        TYPE     STATE      ADDR

    lo0/v4         static   ok         127.0.0.1/8

    net0/v4        static   ok         192.168.2.10/24

    lo0/v6         static   ok         ::1/128

     

    Following this, create the IP interfaces and assign the relevant IP addresses and subnets for each of the NICs, net0–net3, for each of the servers according to Table 2.
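
    For example, on dbnode1 the three remaining interfaces would be configured as follows, using the addresses from Table 2 (a sketch; adjust the interface names and addresses for each node):

    root@dbnode1:~# ipadm create-ip net1
    root@dbnode1:~# ipadm create-addr -T static -a 192.168.2.11/24 net1/v4
    root@dbnode1:~# ipadm create-ip net2
    root@dbnode1:~# ipadm create-addr -T static -a 10.0.1.1/27 net2/v4
    root@dbnode1:~# ipadm create-ip net3
    root@dbnode1:~# ipadm create-addr -T static -a 10.0.1.2/27 net3/v4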

     

    Note: There is an exceptional article by Andrew Walton on how to configure an Oracle Solaris network along with making it internet-facing: "How to Get Started Configuring Your Network in Oracle Solaris 11."

     

    IPMP Groups

     

    After the NICs have been configured and the IP addresses have been assigned, the IPMP groups can be configured. IPMP aggregates separate physical network interfaces into a group and thereby provides physical interface failure detection, network access failover, and network load spreading. Here, each IPMP group will be made of two NICs in an active/standby configuration. So, when an interface that is a member of an IPMP group is brought down for maintenance, or when a NIC fails due to a hardware fault, a failover process takes place: the remaining NIC and its related IP interface step in to ensure that the node is not isolated from the cluster.

     

    According to the planned scenario, two IPMP groups are going to be created in each server, one for every two NICs configured earlier. Each IPMP group will have its own IP interface, and one of the underlying NICs will be active, while the other will remain a standby. Table 2 summarizes the IPMP group configurations that must be completed on each node.

     

    First, create the IPMP group IPMP0. Then, bind interfaces net0 and net1 to this group and create an IP address for the group.

     

    root@dbnode1:~# ipadm create-ipmp ipmp0
    root@dbnode1:~# ipadm add-ipmp -i net0 -i net1 ipmp0
    root@dbnode1:~# ipadm create-addr -T static -a 192.168.2.22/24 ipmp0
    ipmp0/v4

     

    Now that IPMP0 has been created successfully, declare net1 as the standby interface.

     

    root@dbnode1:~# ipadm set-ifprop -p standby=on -m ip net1
    root@dbnode1:~# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp0       ipmp0       ok        10.00s    net0 (net1)

     

    The ipmpstat command reports that the IPMP0 group has been built successfully and that it operates over two NICs, net0 and net1. The parentheses denote a standby interface.

     

    Follow the above-mentioned approach to build the IPMP groups for the rest of the servers in the cluster, as shown in Table 2.
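
    For example, the iSCSI-side group on dbnode1 (IPMP1 over net2 and net3 in Table 2) is built the same way; a sketch:

    root@dbnode1:~# ipadm create-ipmp ipmp1
    root@dbnode1:~# ipadm add-ipmp -i net2 -i net3 ipmp1
    root@dbnode1:~# ipadm create-addr -T static -a 10.0.1.13/27 ipmp1
    root@dbnode1:~# ipadm set-ifprop -p standby=on -m ip net3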

     

    Local Storage

     

    As shown in Table 3, each of the storage servers has nine additional 10 GB disks on which ZFS pools are to be created. The pools are built in a RAID 1 (mirror) and hot-spare configuration. Following this, ZFS volumes and LUNs can be constructed.

     

    Table 3. Additional disk storage configuration.

                                                                                                            

    Node Name  ZFS Pool Name  Disk Name  Size   Role in Mirror  ZFS File System
    stgnode1   zpool1         c1t2d0     10 GB  member          zfslun1
                              c1t3d0     10 GB  member
                              c1t4d0     10 GB  spare
               zpool2         c1t5d0     10 GB  member          zfslun2
                              c1t6d0     10 GB  member
                              c1t7d0     10 GB  spare
               zpool3         c1t8d0     10 GB  member          zfslun3
                              c1t9d0     10 GB  member
                              c1t10d0    10 GB  spare
    stgnode2   zpool4         c1t2d0     10 GB  member          zfslun4
                              c1t3d0     10 GB  member
                              c1t4d0     10 GB  spare
               zpool5         c1t5d0     10 GB  member          zfslun5
                              c1t6d0     10 GB  member
                              c1t7d0     10 GB  spare
               zpool6         c1t8d0     10 GB  member          zfslun6
                              c1t9d0     10 GB  member
                              c1t10d0    10 GB  spare
    stgnode3   zpool7         c1t2d0     10 GB  member          zfslun7
                              c1t3d0     10 GB  member
                              c1t4d0     10 GB  spare
               zpool8         c1t5d0     10 GB  member          zfslun8
                              c1t6d0     10 GB  member
                              c1t7d0     10 GB  spare
               zpool9         c1t8d0     10 GB  member          zfslun9
                              c1t9d0     10 GB  member
                              c1t10d0    10 GB  spare

     

    Starting with stgnode1, run the format command, which reports the additional, unconfigured disks.

     

    root@stgnode1:~# format
    Searching for disks...done

    AVAILABLE DISK SELECTIONS:
           0. c1t0d0 <ATA-VBOX HARDDISK-1.0-20.00GB>
              /pci@0,0/pci8086,2829@d/disk@0,0
           1. c1t2d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@2,0
           2. c1t3d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@3,0
           3. c1t4d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@4,0
           4. c1t5d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@5,0
           5. c1t6d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@6,0
           6. c1t7d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@7,0
           7. c1t8d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@8,0
           8. c1t9d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@9,0
           9. c1t10d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
              /pci@0,0/pci8086,2829@d/disk@a,0
    Specify disk (enter its number): ^C
    root@stgnode1:~#

     

    Create the zpools zpool1, zpool2, and zpool3 in a RAID 1 with hot-spare configuration.

     

    root@stgnode1:~# zpool create zpool1 mirror c1t2d0 c1t3d0 spare c1t4d0

    root@stgnode1:~# zpool status zpool1

      pool: zpool1

    state: ONLINE

      scan: none requested

    config:

     

        NAME        STATE     READ WRITE CKSUM

        zpool1      ONLINE       0     0     0

          mirror-0  ONLINE       0     0     0

            c1t2d0  ONLINE       0     0     0

            c1t3d0  ONLINE       0     0     0

        spares

          c1t4d0    AVAIL  

     

    errors: No known data errors

     

    root@stgnode1:~# zpool create zpool2 mirror c1t5d0 c1t6d0 spare c1t7d0

    root@stgnode1:~# zpool status zpool2

      pool: zpool2

    state: ONLINE

      scan: none requested

    config:

     

        NAME        STATE     READ WRITE CKSUM

        zpool2      ONLINE       0     0     0

          mirror-0  ONLINE       0     0     0

            c1t5d0  ONLINE       0     0     0

            c1t6d0  ONLINE       0     0     0

        spares

          c1t7d0    AVAIL  

     

    errors: No known data errors

     

    root@stgnode1:~# zpool create zpool3 mirror c1t8d0 c1t9d0 spare c1t10d0

    root@stgnode1:~# zpool status zpool3

      pool: zpool3

    state: ONLINE

      scan: none requested

    config:

     

        NAME        STATE     READ WRITE CKSUM

        zpool3      ONLINE       0     0     0

          mirror-0  ONLINE       0     0     0

            c1t8d0  ONLINE       0     0     0

            c1t9d0  ONLINE       0     0     0

        spares

          c1t10d0   AVAIL  

     

    errors: No known data errors

     

    Running the format command again shows that the disks have been formatted.

     

    root@stgnode1:~# format
    Searching for disks...done

    AVAILABLE DISK SELECTIONS:
           0. c1t0d0 <ATA-VBOX HARDDISK-1.0-20.00GB>
              /pci@0,0/pci8086,2829@d/disk@0,0
           1. c1t2d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@2,0
           2. c1t3d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@3,0
           3. c1t4d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@4,0
           4. c1t5d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@5,0
           5. c1t6d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@6,0
           6. c1t7d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@7,0
           7. c1t8d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@8,0
           8. c1t9d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@9,0
           9. c1t10d0 <ATA-VBOX HARDDISK-1.0-10.00GB>
              /pci@0,0/pci8086,2829@d/disk@a,0
    Specify disk (enter its number): ^C

     

    Use the zpool list command to get a report on the newly created ZFS pools.

     

    root@stgnode1:~# zpool list

    NAME        SIZE     ALLOC    FREE     CAP    DEDUP    HEALTH  ALTROOT

    rpool      19.6G     8.01G    11.6G    40%     1.00x   ONLINE  -

    zpool1     9.94G       88K    9.94G     0%     1.00x   ONLINE  -

    zpool2     9.94G       88K    9.94G     0%     1.00x   ONLINE  -

    zpool3     9.94G       88K    9.94G     0%     1.00x   ONLINE  -

     

    Create 8 GB ZFS volumes on the ZFS pools; these volumes will back the LUNs.

     

    root@stgnode1:~# zfs create -V 8g zpool1/zfslun1
    root@stgnode1:~# zfs create -V 8g zpool2/zfslun2
    root@stgnode1:~# zfs create -V 8g zpool3/zfslun3

     

    Use the zfs list command to get a report on the newly created ZFS volumes.

     

    root@stgnode1:~# zfs list -r /zpool*

    NAME             USED  AVAIL  REFER  MOUNTPOINT

    zpool1          8.25G  1.53G    31K  /zpool1

    zpool1/zfslun1  8.25G  9.78G    16K  -

    zpool2          8.25G  1.53G    31K  /zpool2

    zpool2/zfslun2  8.25G  9.78G    16K  -

    zpool3          8.25G  1.53G    31K  /zpool3

    zpool3/zfslun3  8.25G  9.78G    16K  -

     

    Perform the same work on the second and third storage nodes.
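
    For example, on stgnode2 the commands follow Table 3 directly; a sketch (stgnode3 is analogous, using zpool7 through zpool9 and zfslun7 through zfslun9):

    root@stgnode2:~# zpool create zpool4 mirror c1t2d0 c1t3d0 spare c1t4d0
    root@stgnode2:~# zpool create zpool5 mirror c1t5d0 c1t6d0 spare c1t7d0
    root@stgnode2:~# zpool create zpool6 mirror c1t8d0 c1t9d0 spare c1t10d0
    root@stgnode2:~# zfs create -V 8g zpool4/zfslun4
    root@stgnode2:~# zfs create -V 8g zpool5/zfslun5
    root@stgnode2:~# zfs create -V 8g zpool6/zfslun6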

     

    iSCSI Targets

     

    As shown in Table 4, three more ZFS pools are to be constructed, this time on the database nodes and mirrored across the network with a hot-spare configuration. ZFS pool datapool1 will be built on host dbnode1 from LUNs c0t600144F03B268F00000055F33BB10001d0, c0t600144F06A174000000055F5D8F50001d0, and c0t600144F0BBB5C300000055F5DB370001d0, each coming from a different storage node.

     

    Similarly, ZFS pool datapool2 will be built on host dbnode2 from LUNs c0t600144F03B268F00000055F33BCC0002d0, c0t600144F06A174000000055F5D90D0002d0, and c0t600144F0BBB5C300000055F5DB4D0002d0, each coming from a different storage node.

     

    Finally, pool datapool3 will be built on host dbnode3 from LUNs c0t600144F03B268F00000055F33BFE0003d0, c0t600144F06A174000000055F5D9350003d0, and c0t600144F0BBB5C300000055F5DB690003d0.

     

    Table 4. Structure and constituents of the three LUN mirrors.

                        

    Cross-Platform ZFS Pool  Node Name  ZFS Volume (Storage Node)  LUN                                     ZFS File System (Database Node)
    datapool1                stgnode1   zfslun1                    c0t600144F03B268F00000055F33BB10001d0   /datapool1/zfsnode1
                             stgnode2   zfslun4                    c0t600144F06A174000000055F5D8F50001d0
                             stgnode3   zfslun7                    c0t600144F0BBB5C300000055F5DB370001d0
    datapool2                stgnode1   zfslun2                    c0t600144F03B268F00000055F33BCC0002d0   /datapool2/zfsnode2
                             stgnode2   zfslun5                    c0t600144F06A174000000055F5D90D0002d0
                             stgnode3   zfslun8                    c0t600144F0BBB5C300000055F5DB4D0002d0
    datapool3                stgnode1   zfslun3                    c0t600144F03B268F00000055F33BFE0003d0   /datapool3/zfsnode3
                             stgnode2   zfslun6                    c0t600144F06A174000000055F5D9350003d0
                             stgnode3   zfslun9                    c0t600144F0BBB5C300000055F5DB690003d0

     

    In order to be able to create iSCSI targets and LUNs, the storage server group of packages must be installed on each of the storage servers.

     

    root@stgnode1:~# pkg install storage-server

               Packages to install:  21

                Services to change:   1

           Create boot environment:  No

    Create backup boot environment: Yes

     

    DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED

    Completed                              21/21     3644/3644  111.6/111.6  586k/s

     

    PHASE                                          ITEMS

    Installing new actions                     4640/4640

    Updating package state database                 Done

    Updating package cache                           0/0

    Updating image state                            Done

    Creating fast lookup database                   Done

    Updating package cache                           1/1

     

    Verify that the group of packages has been installed by reviewing the output of the pkg info command, as follows:

     

    root@stgnode1:~# pkg info storage-server

                          Name: group/feature/storage-server

           Summary: Multi protocol storage server group package

          Category: Drivers/Storage (org.opensolaris.category.2008)

                    Meta Packages/Group Packages (org.opensolaris.category.2008)

             State: Installed

         Publisher: solaris

           Version: 0.5.11

    Build Release: 5.11

            Branch: 0.175.3.0.0.25.0

    Packaging Date: June 21, 2015 10:57:56 PM

              Size: 5.46 kB

              FMRI: pkg://solaris/group/feature/storage-server@0.5.11,5.11-0.175.3.0.0.25.0:20150621T225756Z

     

    Perform the same action on the second and third storage nodes.

     

    Enable the Oracle Solaris Common Multiprotocol SCSI TARget (COMSTAR) SCSI Target Mode Framework (STMF) service and verify that it is online. Then, create logical units for all the ZFS LUNs from the storage nodes on which they were created. Start from stgnode1.

     

    root@stgnode1:~# svcadm enable stmf
    root@stgnode1:~# svcs stmf
    STATE          STIME    FMRI
    online         22:48:39 svc:/system/stmf:default

    root@stgnode1:~# stmfadm create-lu /dev/zvol/rdsk/zpool1/zfslun1
    Logical unit created: 600144F03B268F00000055F33BB10001
    root@stgnode1:~# stmfadm create-lu /dev/zvol/rdsk/zpool2/zfslun2
    Logical unit created: 600144F03B268F00000055F33BCC0002
    root@stgnode1:~# stmfadm create-lu /dev/zvol/rdsk/zpool3/zfslun3
    Logical unit created: 600144F03B268F00000055F33BFE0003

     

    Confirm that the LUNs have been created successfully.

     

    root@stgnode1:~#  stmfadm list-lu

    LU Name: 600144F03B268F00000055F33BB10001

    LU Name: 600144F03B268F00000055F33BCC0002

    LU Name: 600144F03B268F00000055F33BFE0003

     

    Create the LUN view for each of the LUNs and verify the LUN configuration.

     

    root@stgnode1:~# stmfadm add-view 600144F03B268F00000055F33BB10001
    root@stgnode1:~# stmfadm add-view 600144F03B268F00000055F33BCC0002
    root@stgnode1:~# stmfadm add-view 600144F03B268F00000055F33BFE0003
    root@stgnode1:~# stmfadm list-view -l 600144F03B268F00000055F33BB10001
    View Entry: 0
        Host group   : All
        Target Group : All
        LUN          : Auto

    root@stgnode1:~# stmfadm list-view -l 600144F03B268F00000055F33BCC0002
    View Entry: 0
        Host group   : All
        Target Group : All
        LUN          : Auto

    root@stgnode1:~# stmfadm list-view -l 600144F03B268F00000055F33BFE0003
    View Entry: 0
        Host group   : All
        Target Group : All
        LUN          : Auto

     

    Enable the iSCSI target service on the first storage node and verify it is online.

     

    root@stgnode1:~# svcadm enable -r svc:/network/iscsi/target:default
    root@stgnode1:~# svcs iscsi/target
    STATE        STIME     FMRI
    online       22:53:44  svc:/network/iscsi/target:default

     

    Create the iSCSI target:

     

    root@stgnode1:~# itadm create-target
    Target iqn.1986-03.com.sun:02:ae4d3c15-8f1c-4098-9d07-8d2c619516e4 successfully created

     

    Verify that the target has been created.

     

    root@stgnode1:~# itadm list-target -v
    TARGET NAME                                                  STATE    SESSIONS
    iqn.1986-03.com.sun:02:ae4d3c15-8f1c-4098-9d07-8d2c619516e4  online   0
            alias:              -
            auth:               none (defaults)
            targetchapuser:     -
            targetchapsecret:   unset
            tpg-tags:           default

     

    Follow the same steps to create logical units for the rest of the ZFS LUNs and to enable the iSCSI target service on the second and third storage servers.
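
    For example, on stgnode2 the sequence looks like the following sketch. The logical unit GUIDs are generated at creation time, so they will differ from the values shown earlier; pass the GUIDs reported by stmfadm list-lu to stmfadm add-view.

    root@stgnode2:~# svcadm enable stmf
    root@stgnode2:~# stmfadm create-lu /dev/zvol/rdsk/zpool4/zfslun4
    root@stgnode2:~# stmfadm create-lu /dev/zvol/rdsk/zpool5/zfslun5
    root@stgnode2:~# stmfadm create-lu /dev/zvol/rdsk/zpool6/zfslun6
    root@stgnode2:~# stmfadm list-lu
    root@stgnode2:~# stmfadm add-view <LU name reported by list-lu>
    root@stgnode2:~# svcadm enable -r svc:/network/iscsi/target:default
    root@stgnode2:~# itadm create-target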

     

    After the iSCSI targets have been successfully created, the iSCSI initiators must be created on the database nodes.

     

    Enable the iSCSI initiator service.

     

    root@dbnode1:~# svcadm enable network/iscsi/initiator

     

    Configure the targets to be statically discovered. The initiator will discover targets from all three storage servers.

     

    root@dbnode1:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:ae4d3c15-8f1c-4098-9d07-8d2c619516e4,10.0.1.16
    root@dbnode1:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:ae65e6de-dfb1-4a77-9940-dabf68709f5d,10.0.1.17
    root@dbnode1:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:f4e68b9d-26ca-484a-8d85-d2c8275da0eb,10.0.1.18

     

    Verify the configuration with the iscsiadm list command.

     

    root@dbnode1:~# iscsiadm list static-config
    Static Configuration Target: iqn.1986-03.com.sun:02:ae4d3c15-8f1c-4098-9d07-8d2c619516e4,10.0.1.16:3260
    Static Configuration Target: iqn.1986-03.com.sun:02:ae65e6de-dfb1-4a77-9940-dabf68709f5d,10.0.1.17:3260
    Static Configuration Target: iqn.1986-03.com.sun:02:f4e68b9d-26ca-484a-8d85-d2c8275da0eb,10.0.1.18:3260

     

    Enable the static target discovery method.

     

    root@dbnode1:~# iscsiadm modify discovery --static enable

     

    Perform the same actions to configure the iSCSI initiator on dbnode2 and dbnode3 and enable the static target discovery method.
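
    For example, on dbnode2 the commands mirror those run on dbnode1; a sketch using the same three target IQNs and target portal addresses:

    root@dbnode2:~# svcadm enable network/iscsi/initiator
    root@dbnode2:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:ae4d3c15-8f1c-4098-9d07-8d2c619516e4,10.0.1.16
    root@dbnode2:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:ae65e6de-dfb1-4a77-9940-dabf68709f5d,10.0.1.17
    root@dbnode2:~# iscsiadm add static-config \
    iqn.1986-03.com.sun:02:f4e68b9d-26ca-484a-8d85-d2c8275da0eb,10.0.1.18
    root@dbnode2:~# iscsiadm modify discovery --static enable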

     

    LUN Mirroring and Storage

     

    From the first database node (dbnode1) verify the available disks. Nine LUNs should be available.

     

    root@dbnode1:~# format
    Searching for disks...done

    AVAILABLE DISK SELECTIONS:
         0. c0t600144F0BBB5C300000055F5DB4D0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f0bbb5c300000055f5db4d0002
         1. c0t600144F0BBB5C300000055F5DB370001d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f0bbb5c300000055f5db370001
         2. c0t600144F0BBB5C300000055F5DB690003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f0bbb5c300000055f5db690003
         3. c0t600144F03B268F00000055F33BB10001d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f03b268f00000055f33bb10001
         4. c0t600144F03B268F00000055F33BCC0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f03b268f00000055f33bcc0002
         5. c0t600144F03B268F00000055F33BFE0003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f03b268f00000055f33bfe0003
         6. c0t600144F06A174000000055F5D8F50001d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f06a174000000055f5d8f50001
         7. c0t600144F06A174000000055F5D90D0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f06a174000000055f5d90d0002
         8. c0t600144F06A174000000055F5D9350003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>
            /scsi_vhci/disk@g600144f06a174000000055f5d9350003
         9. c1t0d0 <ATA-VBOX HARDDISK-1.0-20.00GB>
            /pci@0,0/pci8086,2829@d/disk@0,0
    Specify disk (enter its number): ^C

     

    Build the first ZFS pool from LUNs c0t600144F03B268F00000055F33BB10001d0, c0t600144F06A174000000055F5D8F50001d0, and c0t600144F0BBB5C300000055F5DB370001d0. These all come from different storage servers to ensure the storage has high availability.

     

    root@dbnode1:~# zpool create datapool1 mirror c0t600144F03B268F00000055F33BB10001d0 \
    c0t600144F06A174000000055F5D8F50001d0 spare c0t600144F0BBB5C300000055F5DB370001d0

     

    Create the zfsnode1 ZFS file system on the zpool.

     

    root@dbnode1:~# zfs create datapool1/zfsnode1

    root@dbnode1:~# zpool list

    NAME        SIZE    ALLOC   FREE    CAP   DEDUP   HEALTH  ALTROOT

    datapool1   7.94G    128K   7.94G    0%   1.00x   ONLINE  -

    rpool       19.6G   7.53G   12.1G   38%   1.00x   ONLINE  -

     

    Verify the ZFS creation recursively.

     

    root@dbnode1:~# zfs list -r datapool1
    NAME                 USED  AVAIL  REFER   MOUNTPOINT
    datapool1            128K  7.81G  32K     /datapool1
    datapool1/zfsnode1   31K   7.81G  31K     /datapool1/zfsnode1

     

    From the second database node (dbnode2), execute the format utility to verify the available disks. Check that three of the LUNs have been formatted.

     

    root@dbnode2:~# format

    Searching for disks...done

    AVAILABLE DISK SELECTIONS:

         0. c0t600144F0BBB5C300000055F5DB4D0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f0bbb5c300000055f5db4d0002

         1. c0t600144F0BBB5C300000055F5DB370001d0 <SUN-COMSTAR-1.0-8.00GB>

            /scsi_vhci/disk@g600144f0bbb5c300000055f5db370001

         2. c0t600144F0BBB5C300000055F5DB690003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f0bbb5c300000055f5db690003

         3. c0t600144F03B268F00000055F33BB10001d0 <SUN-COMSTAR-1.0-8.00GB>

            /scsi_vhci/disk@g600144f03b268f00000055f33bb10001

         4. c0t600144F03B268F00000055F33BCC0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f03b268f00000055f33bcc0002

         5. c0t600144F03B268F00000055F33BFE0003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f03b268f00000055f33bfe0003

         6. c0t600144F06A174000000055F5D8F50001d0 <SUN-COMSTAR-1.0-8.00GB>

            /scsi_vhci/disk@g600144f06a174000000055f5d8f50001

         7. c0t600144F06A174000000055F5D90D0002d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f06a174000000055f5d90d0002

         8. c0t600144F06A174000000055F5D9350003d0 <SUN-COMSTAR-1.0 cyl 4094 alt 2 hd 128 sec 32>

            /scsi_vhci/disk@g600144f06a174000000055f5d9350003

         9. c1t0d0 <ATA-VBOX HARDDISK-1.0-20.00GB>

            /pci@0,0/pci8086,2829@d/disk@0,0

    Specify disk (enter its number): ^C

     

    Build the rest of the ZFS pools from the remaining available LUNs, as shown in Table 4.
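
    For example, following the same mirror-plus-spare pattern used for datapool1, the pools on dbnode2 and dbnode3 would be built from the LUNs listed in Table 4; a sketch:

    root@dbnode2:~# zpool create datapool2 mirror c0t600144F03B268F00000055F33BCC0002d0 \
    c0t600144F06A174000000055F5D90D0002d0 spare c0t600144F0BBB5C300000055F5DB4D0002d0
    root@dbnode2:~# zfs create datapool2/zfsnode2

    root@dbnode3:~# zpool create datapool3 mirror c0t600144F03B268F00000055F33BFE0003d0 \
    c0t600144F06A174000000055F5D9350003d0 spare c0t600144F0BBB5C300000055F5DB690003d0
    root@dbnode3:~# zfs create datapool3/zfsnode3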

     

    Database Installation and Configuration

     

    Before Cassandra can be built on the database nodes, Apache Ant must be installed. Apache Ant is a tool for building Java applications. Because Ant requires Java in order to run, Java Development Kit 8 (JDK 8) must also be installed.

     

    Use the pkg utility to install Ant.

     

    root@dbnode1:~# pkg install ant

                            Packages to install:  1

           Create boot environment: No

    Create backup boot environment: No

     

    DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED

    Completed                                1/1     1594/1594      7.6/7.6  216k/s

     

    PHASE                                          ITEMS

    Installing new actions                     1617/1617

    Updating package state database                 Done

    Updating package cache                           0/0

    Updating image state                            Done

    Creating fast lookup database                   Done

    Updating package cache                           1/1

     

    root@dbnode1:~# pkg info ant

              Name: developer/build/ant

           Summary: Apache Ant

       Description: Apache Ant is a Java-based build tool

          Category: Development/Distribution Tools

             State: Installed

         Publisher: solaris

           Version: 1.9.3

    Build Release: 5.11

            Branch: 0.175.3.0.0.25.3

    Packaging Date: June 21, 2015 11:51:03 PM

              Size: 35.66 MB

              FMRI: pkg://solaris/developer/build/ant@1.9.3,5.11-0.175.3.0.0.25.3:20150621T235103Z

     

    Install the Java Development Kit.

     

    root@dbnode1:~# pkg install jdk-8

               Packages to install:  2

           Create boot environment: No

    Create backup boot environment: No

     

    DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED

    Completed                                2/2       625/625    46.3/46.3  274k/s

     

    PHASE                                          ITEMS

    Installing new actions                       735/735

    Updating package state database                 Done

    Updating package cache                           0/0

    Updating image state                            Done

    Creating fast lookup database                   Done

    Updating package cache                           1/1

     

    Verify that JDK 8 is on the database node.

     

    root@dbnode1:~# java -version
    java version "1.8.0_45"
    Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

     

    On all the database nodes, download the source code for Cassandra version 2.1.9 (apache-cassandra-2.1.9-src.tar.gz) from http://cassandra.apache.org/, and install the software as follows.

     

    Extract the Cassandra source code and place it into the relevant /datapoolx file system. Create the db_files directory where the data and log files are to reside.

     

    root@dbnode1:~# cd Downloads
    root@dbnode1:~/Downloads# ls
    apache-cassandra-2.1.9-src.tar.gz
    root@dbnode1:~/Downloads# tar -zxvf apache-cassandra-2.1.9-src.tar.gz
    root@dbnode1:~/Downloads# mv apache-cassandra-2.1.9-src cassandra
    root@dbnode1:~/Downloads# ls
    apache-cassandra-2.1.9-src.tar.gz  cassandra
    root@dbnode1:~/Downloads# mv cassandra /datapool1/zfsnode1
    root@dbnode1:~/Downloads# cd /datapool1/zfsnode1
    root@dbnode1:/datapool1/zfsnode1# mkdir db_files

     

    Make the cassandra directory the current working directory and build the Cassandra application with Ant.

     

    root@dbnode1:/datapool1/zfsnode1# cd cassandra

    root@dbnode1:/datapool1/zfsnode1/cassandra# ant

    ...

    BUILD SUCCESSFUL

    Total time: 8 minutes 37 seconds

     

    The application has been built. Open .profile with a text editor and add the following entries. Then source the file.

     

    export  CASSANDRA_HOME=/datapool1/zfsnode1/cassandra
    export  PATH=$CASSANDRA_HOME/bin:$PATH
    root@dbnode1:~/# source .profile

     

    One at a time, move to the /datapool1/zfsnode1/cassandra/bin and /datapool1/zfsnode1/cassandra/tools/bin directories, and use a text editor to open the shell scripts that are shown in Table 5. In the first line of each file, change #!/bin/sh to #!/bin/bash and then save the file.

     

    Table 5. Shell scripts to change.

     

    Cassandra Directory          Shell Scripts to Change
    $CASSANDRA_HOME/bin          cassandra.sh, cassandra-cli.sh, cqlsh.sh, debug-cql, nodetool.sh, sstablekeys.sh, sstableloader.sh, sstablescrub.sh, sstableupgrade.sh
    $CASSANDRA_HOME/tools/bin    cassandra-stress.sh, cassandra-stressd.sh, json2sstable.sh, sstable2json.sh, sstableexpiredblockers.sh, sstablelevelreset.sh, sstablemetadata.sh, sstableofflinerelevel.sh, sstablerepairedset.sh, sstablesplit.sh
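
    Editing each script by hand works, but it is tedious. As an alternative, the change can be batched with the GNU sed that Oracle Solaris 11 ships under /usr/gnu/bin (assuming the text/gnu-sed package is installed; the file list below is illustrative, so substitute the scripts from Table 5 that are actually present in your build):

    root@dbnode1:~# cd $CASSANDRA_HOME/bin
    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# /usr/gnu/bin/sed -i '1s|^#!/bin/sh$|#!/bin/bash|' cassandra.sh cqlsh.sh nodetool.sh

    Repeat the same command in $CASSANDRA_HOME/tools/bin for the scripts listed in the second row of Table 5.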

     

    In the cassandra/conf directory, the shell script cassandra-env.sh uses grep with the -A option. The default /usr/bin/grep on Oracle Solaris does not support -A, which causes an illegal-option warning when starting Cassandra or running the other utilities. The GNU grep under /usr/gnu/bin does support -A, so the warning can be avoided by declaring its absolute path in cassandra-env.sh.

     

    root@dbnode1:~/# which grep
    /usr/bin/grep

     

    Open $CASSANDRA_HOME/conf/cassandra-env.sh and change grep -A to /usr/gnu/bin/grep -A. Then save the file to commit the change.

     

    Move to /datapool1/zfsnode1/cassandra/conf/, open cassandra.yaml with a text editor, and make the following adjustments.

     

    cluster_name: 'MyCluster'
    num_tokens: 5
    data_file_directories:
        - /datapool1/zfsnode1/db_files/data
    commitlog_directory: /datapool1/zfsnode1/db_files/commitlog
    saved_caches_directory: /datapool1/zfsnode1/db_files/saved_caches
    seed_provider:
          - seeds: "192.168.2.22"
    listen_address: 192.168.2.22
    rpc_address: localhost
    rpc_keepalive: true
    endpoint_snitch: GossipingPropertyFileSnitch

     

    Perform the same steps to build Cassandra on dbnode2 and dbnode3: place the source code in the relevant ZFS file system and apply the same script modifications made earlier. Then configure the cassandra.yaml file for the second and third database nodes as shown below.

     

    The cassandra.yaml configuration for dbnode2:

     

    cluster_name: 'MyCluster'
    num_tokens: 5
    data_file_directories:
        - /datapool2/zfsnode2/db_files/data
    commitlog_directory: /datapool2/zfsnode2/db_files/commitlog
    saved_caches_directory: /datapool2/zfsnode2/db_files/saved_caches
    seed_provider:
          - seeds: "192.168.2.22"
    listen_address: 192.168.2.23
    rpc_address: localhost
    rpc_keepalive: true
    endpoint_snitch: GossipingPropertyFileSnitch

     

    The cassandra.yaml configuration for dbnode3:

     

    cluster_name: 'MyCluster'
    num_tokens: 5
    data_file_directories:
        - /datapool3/zfsnode3/db_files/data
    commitlog_directory: /datapool3/zfsnode3/db_files/commitlog
    saved_caches_directory: /datapool3/zfsnode3/db_files/saved_caches
    seed_provider:
          - seeds: "192.168.2.22"
    listen_address: 192.168.2.24
    rpc_address: localhost
    rpc_keepalive: true
    endpoint_snitch: GossipingPropertyFileSnitch

     

    Some Notes About the cassandra.yaml File

     

    In order for the database servers to belong to the same cluster, they must share the same cluster name. The cluster_name setting fulfills this purpose. Seed servers are one or more database servers that currently belong to the cluster and are to be contacted by a new server when it first joins the cluster. This new server will contact the seed servers for information about the rest of the servers in the cluster, that is, their names, their IP addresses, the racks and data centers they belong to, and so on.

     

    When a cluster is initialized for the first time, a token ring is created; its values range from -2^63 to 2^63 - 1. The num_tokens setting controls how many tokens are created per database server, and in that way a map of token ranges is built for the distribution of data. As data is inserted, the primary key (or a part of the primary key) is hashed. The hash value falls within a token range, which determines the server to which the data will be sent. Every server can have a different num_tokens setting based on its hardware; more powerful servers can be assigned a larger number of tokens than older or less powerful servers. The data_file_directories, commitlog_directory, and saved_caches_directory parameters set the paths where data and logs will reside.
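
    Once the instances are running (see the next section), the effect of num_tokens can be observed by listing the tokens a node owns; a sketch, assuming the --tokens option of nodetool info is available in your Cassandra version:

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool info --tokens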

     

    Cassandra Operation and Data Distribution

     

    Initiate the Cassandra databases on the database nodes.

     

    root@dbnode1:~/# ./cassandra -f
    root@dbnode2:~/# ./cassandra -f
    root@dbnode3:~/# ./cassandra -f

     

    The database cluster has been initiated.

     

    From any database node, execute the nodetool utility to verify the database cluster. The same members will be reported regardless of which database node the utility is run on.

     

    root@dbnode1:~/# ./nodetool status
    Datacenter: DC1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
    UN  192.168.2.24   72.62 KB   5       69.1%             6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84  RAC1
    UN  192.168.2.22   184.55 KB  5       42.9%             26cc69f8-767e-4b1a-8da4-18d556a718a9  RAC1
    UN  192.168.2.23   56.11 KB   5       88.0%             af955565-4535-4dfb-b5f5-e15190a1ee28  RAC1

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool describecluster
    Cluster Information:
       Name: MyCluster
       Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
       Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
       Schema versions:
          6403a0ff-f93b-3b1f-8c35-0a8dc85a5b66: [192.168.2.24, 192.168.2.22, 192.168.2.23]

     

    Start the cqlsh utility to create a keyspace and begin adding and querying data. A keyspace is analogous to a schema in the relational database world. The replication factor (RF) is set to 2, so each row will reside on two servers. There is no master/slave or primary/secondary notion; both replicas are equal peers.

     

    root@dbnode1:~/# ./cqlsh
    Connected to MyCluster at localhost:9160.
    [cqlsh 4.1.1 | Cassandra 2.0.14-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 19.39.0]
    Use HELP for help.
    cqlsh>
    cqlsh> create keyspace myfirstkeyspace with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 2};
    cqlsh> use myfirstkeyspace;
    cqlsh:myfirstkeyspace> create table greek_locations ( loc_id int PRIMARY KEY, loc_name text, description text);
    cqlsh:myfirstkeyspace> describe tables;
    greek_locations

    cqlsh:myfirstkeyspace> insert into greek_locations (loc_id, loc_name, description) values (1,'Thessaloniki','North Greece');
    cqlsh:myfirstkeyspace> insert into greek_locations (loc_id, loc_name, description) values (2,'Larissa','Central Greece');
    cqlsh:myfirstkeyspace> insert into greek_locations (loc_id, loc_name, description) values (3,'Athens','Central Greece - Capital');
    cqlsh:myfirstkeyspace> select * from greek_locations;
     loc_id | description              | loc_name
    --------+--------------------------+--------------
          1 |             North Greece | Thessaloniki
          2 |           Central Greece |      Larissa
          3 | Central Greece - Capital |       Athens

    (3 rows)

     

    Connecting from any other database server should report the same results.

     

    root@dbnode2:/datapool2/zfsnode2/cassandra/bin# ./cqlsh
    Connected to MyCluster at 127.0.0.1:9042.
    [cqlsh 5.0.1 | Cassandra 2.1.9-SNAPSHOT | CQL spec 3.2.0 | Native protocol v3]
    Use HELP for help.
    cqlsh> use myfirstkeyspace;
    cqlsh:myfirstkeyspace> select * from greek_locations;
     loc_id | description              | loc_name
    --------+--------------------------+--------------
          1 |             North Greece | Thessaloniki
          2 |           Central Greece |      Larissa
          3 | Central Greece - Capital |       Athens

    (3 rows)
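
    Because ownership percentages depend on the replication factor of a specific keyspace, passing the keyspace name to nodetool status reports the effective ownership for myfirstkeyspace with its replication factor of 2; a sketch:

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool status myfirstkeyspace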

     

    The ring parameter of the nodetool utility reports the token range limits for each of the servers. The num_tokens parameter was set to 5 in the cassandra.yaml file, so there are 15 token ranges in total for the three servers.

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool ring
    Datacenter: DC1
    ==========
    Address        Rack    Status  State   Load       Owns    Token
                                                               5554128420332708557
    192.168.2.22   RAC1    Up      Normal  122.3 KB   ?       -9135243804612957495
    192.168.2.23   RAC1    Up      Normal  76.37 KB   ?       -8061157299090260986
    192.168.2.22   RAC1    Up      Normal  122.3 KB   ?       -7087501046371881693
    192.168.2.24   RAC1    Up      Normal  78.8 KB    ?       -6454951218299078731
    192.168.2.22   RAC1    Up      Normal  122.3 KB   ?       -5793299020697319351
    192.168.2.22   RAC1    Up      Normal  122.3 KB   ?       -5588273793487800091
    192.168.2.23   RAC1    Up      Normal  76.37 KB   ?       -3763306950618271982
    192.168.2.23   RAC1    Up      Normal  76.37 KB   ?       -3568767174854581436
    192.168.2.23   RAC1    Up      Normal  76.37 KB   ?       -1113375360465059283
    192.168.2.24   RAC1    Up      Normal  78.8 KB    ?       -682327379305650352
    192.168.2.24   RAC1    Up      Normal  78.8 KB    ?       112278302282739678
    192.168.2.23   RAC1    Up      Normal  76.37 KB   ?       4952728554160670447
    192.168.2.24   RAC1    Up      Normal  78.8 KB    ?       5093621811617287602
    192.168.2.22   RAC1    Up      Normal  122.3 KB   ?       5342254592921898323
    192.168.2.24   RAC1    Up      Normal  78.8 KB    ?       5554128420332708557

      Warning: "nodetool ring" is used to output all the tokens of a node.
      To view status related info of a node use "nodetool status" instead.

     

    The describering parameter of the nodetool utility reports the token ranges and the endpoints in detail.

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool describering myfirstkeyspace
    Schema Version:155131ce-b922-37aa-a635-68e6fa96597c
    TokenRange:
        TokenRange(start_token:5342254592921898323, end_token:5554128420332708557, endpoints:[192.168.2.24, 192.168.2.22], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:112278302282739678, end_token:4952728554160670447, endpoints:[192.168.2.23, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:5554128420332708557, end_token:-9135243804612957495, endpoints:[192.168.2.22, 192.168.2.23], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-8061157299090260986, end_token:-7087501046371881693, endpoints:[192.168.2.22, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:4952728554160670447, end_token:5093621811617287602, endpoints:[192.168.2.24, 192.168.2.22], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:5093621811617287602, end_token:5342254592921898323, endpoints:[192.168.2.22, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-7087501046371881693, end_token:-6454951218299078731, endpoints:[192.168.2.24, 192.168.2.22], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-3763306950618271982, end_token:-3568767174854581436, endpoints:[192.168.2.23, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-3568767174854581436, end_token:-1113375360465059283, endpoints:[192.168.2.23, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-6454951218299078731, end_token:-5793299020697319351, endpoints:[192.168.2.22, 192.168.2.23], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-682327379305650352, end_token:112278302282739678, endpoints:[192.168.2.24, 192.168.2.23], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-5588273793487800091, end_token:-3763306950618271982, endpoints:[192.168.2.23, 192.168.2.24], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-1113375360465059283, end_token:-682327379305650352, endpoints:[192.168.2.24, 192.168.2.23], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.24, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-5793299020697319351, end_token:-5588273793487800091, endpoints:[192.168.2.22, 192.168.2.23], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1)])
        TokenRange(start_token:-9135243804612957495, end_token:-8061157299090260986, endpoints:[192.168.2.23, 192.168.2.22], rpc_endpoints:[127.0.0.1, 127.0.0.1], endpoint_details:[EndpointDetails(host:192.168.2.23, datacenter:DC1, rack:RAC1), EndpointDetails(host:192.168.2.22, datacenter:DC1, rack:RAC1)])

     

    The table created previously has a one-column primary key, so that column is the partition key. The primary key value is hashed, and the row is stored on the servers whose token range the hash value falls under. The token function in the select statement reports the hash value.

     

    cqlsh:myfirstkeyspace> select token(loc_id), loc_id, loc_name, description from greek_locations;
     token(loc_id)        | loc_id | loc_name     | description
    ----------------------+--------+--------------+--------------------------
     -4069959284402364209 |      1 | Thessaloniki |             North Greece
     -3248873570005575792 |      2 |      Larissa |           Central Greece
      9010454139840013625 |      3 |       Athens | Central Greece - Capital

    (3 rows)

     

    Using the getendpoints parameter of the nodetool utility causes Cassandra to report the particular servers on which the row with a given primary key value is stored (here, the primary key [PK] equals 2), taking the replication factor into account as well.

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool getendpoints myfirstkeyspace greek_locations 2
    192.168.2.23
    192.168.2.24

     

    It is also possible to calculate where a row will be stored for a given primary key value (here, the PK equals 1521), even if the row has not yet been inserted. The getendpoints subcommand requires that the keyspace and the PK be declared.

     

    root@dbnode3:/datapool3/zfsnode3/cassandra/bin# ./nodetool getendpoints myfirstkeyspace greek_locations 1521
    192.168.2.22
    192.168.2.24

    root@dbnode3:/datapool3/zfsnode3/cassandra/bin# ./nodetool getendpoints myfirstkeyspace greek_locations 4
    192.168.2.23
    192.168.2.24

     

    The nodetool command reports that the row with ID 1521 will be stored on servers dbnode3 and dbnode1, whereas the row with ID 4 will be stored on servers dbnode2 and dbnode3.

     

    When a database node is down or an instance is terminated, nodetool status will report the node's unavailability in the cluster.

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool status

    Datacenter: DC1

    ===============

    Status=Up/Down

    |/ State=Normal/Leaving/Joining/Moving

    --  Address        Load       Tokens  Owns    Host ID                               Rack

    DN  192.168.2.24   ?          5          ?    6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84  r1

    UN  192.168.2.22   84.94 KB   5          ?    26cc69f8-767e-4b1a-8da4-18d556a718a9  RAC1

    UN  192.168.2.23   92.36 KB   5          ?    af955565-4535-4dfb-b5f5-e15190a1ee28  RAC1

     

    Even with dbnode3 unavailable, inserting a row with an ID equal to 4 will still be successful. Along with the data that is written to dbnode2, the coordinator populates a system table called system.hints with the pending write for dbnode3. When host dbnode3 comes back online, Cassandra ensures that this row is replicated by replaying the relevant information from the system.hints table. The target_id column shows the node to which the hinted data will be applied.

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./cqlsh
    Connected to MyCluster at 127.0.0.1:9042.
    [cqlsh 5.0.1 | Cassandra 2.1.9-SNAPSHOT | CQL spec 3.2.0 | Native protocol v3]
    Use HELP for help.
    cqlsh> use myfirstkeyspace;
    cqlsh:myfirstkeyspace> insert into greek_locations (loc_id, loc_name, description) values (4,'Agios Efstratios','Islands');
    cqlsh:myfirstkeyspace> select * from greek_locations;
     loc_id | description              | loc_name
    --------+--------------------------+------------------
          1 |             North Greece |     Thessaloniki
          2 |           Central Greece |          Larissa
          4 |                  Islands | Agios Efstratios
          3 | Central Greece - Capital |           Athens

    cqlsh:myfirstkeyspace> select * from system.hints;

     target_id                            | hint_id                              | message_version | mutation
    --------------------------------------+--------------------------------------+-----------------+----------
     6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84 | 9820a250-5e23-11e5-80c3-f1ac672b1227 |               8 | 0x0004000000040000000101b78db9605d7a11e58ea4f1ac672b12277fffffff800000000000000000000000000000030003000000000005200826dbca5c00000000000e000b6465736372697074696f6e00000005200826dbca5c0000000749736c616e6473000b00086c6f635f6e616d6500000005200826dbca5c000000104167696f732045667374726174696f73

     

    The target_id value of 6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84 corresponds to dbnode3.
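
    The mapping between host IDs and nodes can be confirmed from the Cassandra system tables; a sketch (system.local describes the node cqlsh is connected to, and system.peers lists the others):

    cqlsh> select peer, host_id from system.peers;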

     

    Note that system.hints is a local, nonreplicated table that exists separately on every database node. Querying system.hints from dbnode2 will not return any results.

     

    root@dbnode2:/datapool2/zfsnode2/cassandra/bin# ./cqlsh
    Connected to MyCluster at 127.0.0.1:9042.
    [cqlsh 5.0.1 | Cassandra 2.1.9-SNAPSHOT | CQL spec 3.2.0 | Native protocol v3]
    Use HELP for help.
    cqlsh> select * from system.hints;
     target_id | hint_id | message_version | mutation
    -----------+---------+-----------------+----------

    (0 rows)

     

    When dbnode3's instance comes back online, nodetool will report its availability to the cluster.

     

    Datacenter: DC1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address        Load      Tokens  Owns    Host ID                               Rack
    UN  192.168.2.24   95.23 KB  5       ?       6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84  RAC1
    UN  192.168.2.22   84.94 KB  5       ?       26cc69f8-767e-4b1a-8da4-18d556a718a9  RAC1
    UN  192.168.2.23   92.36 KB  5       ?       af955565-4535-4dfb-b5f5-e15190a1ee28  RAC1

     

    Querying system.hints from dbnode1 again shows that the table is now empty, revealing that the hinted replication has taken place successfully.

     

    cqlsh:myfirstkeyspace> select * from system.hints;
     target_id | hint_id | message_version | mutation
    -----------+---------+-----------------+----------

    (0 rows)

    cqlsh:myfirstkeyspace>

     

    System High Availability with IPMP and LUN Mirroring

     

    During normal operation, ipmpstat reports the state of the IPMP groups as well as the status of the individual interfaces. As described earlier, two networks are configured per server: one is dedicated to the storage traffic and the other to the public network. If a NIC goes down for maintenance or due to a hardware failure, the standby interface steps in to ensure that the IPMP IP address remains available (that is, no packet loss occurs).

     

    root@stgnode1:~# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT        INTERFACES
    ipmp7       ipmp7       ok        10.00s     net2 (net3)
    ipmp6       ipmp6       ok        10.00s     net0 (net1)

    root@stgnode1:~# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    net3        no      ipmp7       is-----   up        ok        ok
    net2        yes     ipmp7       --mbM--   up        ok        ok
    net1        no      ipmp6       is-----   up        ok        ok
    net0        yes     ipmp6       --mbM--   up        ok        ok

     

    The loss of an interface can be simulated by using the ipadm disable-if command.

     

    root@stgnode1:~# ipadm disable-if -t net0

     

    As shown below, ipmpstat reports that the failover interface has stepped in. There is no packet loss, and the server continues to work normally.

     

    root@stgnode1:~# ipmpstat -g
    GROUP       GROUPNAME   STATE     FDT       INTERFACES
    ipmp7       ipmp7       ok        10.00s    net2 (net3)
    ipmp6       ipmp6       ok        10.00s    net1

    root@stgnode1:~# ipmpstat -i
    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE
    net3        no      ipmp7       is-----   up        ok        ok
    net2        yes     ipmp7       --mbM--   up        ok        ok
    net1        yes     ipmp6       -smbM--   up        ok        ok

     

    Re-enabling the interface causes the NICs to take their normal active/standby status within the IPMP group. Note that bringing an interface back online will cause network unavailability (packet loss) for a few seconds.

     

    root@stgnode1:~# ipadm enable-if -t net0

    root@stgnode1:~# ipmpstat -g

    GROUP       GROUPNAME   STATE     FDT       INTERFACES

    ipmp7       ipmp7       ok        10.00s    net2 (net3)

    ipmp6       ipmp6       ok        10.00s    net0 (net1)

     

    root@stgnode1:~# ipmpstat -i

    INTERFACE   ACTIVE  GROUP       FLAGS     LINK      PROBE     STATE

    net3        no      ipmp7       is-----   up        ok        ok

    net2        yes     ipmp7       --mbM--   up        ok        ok

    net0        yes     ipmp6       --mbM--   up        ok        ok

    net1        no      ipmp6       is-----   up        ok        ok

     

    At the storage level, the three storage servers have been constructed to provide high availability to the database instances. Each instance has access to its own ZFS pool. The pools are mirrored across the network, and each has a hot spare, so data remains available in the cluster even if two of the storage servers are unavailable, as long as at least one current mirror member of each pool stays online. In such a situation, the zpool list and zpool status commands show that the pools are operating in a degraded state, while at the same time the nodetool utility reports that all database instances are up and running.

     

    root@dbnode1:~# zpool list

    NAME        SIZE  ALLOC   FREE  CAP  DEDUP    HEALTH  ALTROOT

    datapool1  7.94G   148M  7.79G   1%  1.00x  DEGRADED  -

    rpool      19.6G  8.02G  11.6G  40%  1.00x    ONLINE  -

     

    root@dbnode1:~# zpool status

       pool: datapool1

      state: DEGRADED

    status: One or more devices are unavailable in response to persistent errors.

       Sufficient replicas exist for the pool to continue functioning in a

       degraded state.

    action: Determine if the device needs to be replaced, and clear the errors

       using 'zpool clear' or 'fmadm repaired', or replace the device

       with 'zpool replace'.

       Run 'zpool status -v' to see device specific details.

      scan: none requested

    config:

     

       NAME                                       STATE        READ WRITE CKSUM

       datapool1                                  DEGRADED     0     0     0

         mirror-0                                 DEGRADED     0     0     0

           c0t600144F03B268F00000055F33BB10001d0  ONLINE       0     0     0

           c0t600144F06A174000000055F5D8F50001d0  UNAVAIL      0     0     0

       spares

         c0t600144F0BBB5C300000055F5DB370001d0    UNAVAIL

     

    errors: No known data errors

     

      pool: rpool

    state: ONLINE

      scan: none requested

    config:

     

       NAME        STATE     READ WRITE CKSUM

       rpool       ONLINE      0    0     0

         c1t0d0s1  ONLINE      0    0     0

     

    errors: No known data errors

     

    root@dbnode2:~# zpool list

    NAME        SIZE  ALLOC   FREE  CAP  DEDUP    HEALTH  ALTROOT

    datapool2  7.94G   147M  7.79G   1%  1.00x  DEGRADED  -

    rpool      19.6G  8.03G  11.6G  40%  1.00x    ONLINE  -

     

    root@dbnode2:~# zpool status

        pool: datapool2

    state: DEGRADED

    status: One or more devices are unavailable in response to persistent errors.

        Sufficient replicas exist for the pool to continue functioning in a

        degraded state.

    action: Determine if the device needs to be replaced, and clear the errors

        using 'zpool clear' or 'fmadm repaired', or replace the device

        with 'zpool replace'.

        Run 'zpool status -v' to see device specific details.

    scan: resilvered 930K in 1s with 0 errors on Sun Sep 20 22:55:58 2015

    config:

     

        NAME                                       STATE     READ WRITE CKSUM

        datapool2                                  DEGRADED     0     0     0

          mirror-0                                 DEGRADED     0     0     0

            c0t600144F03B268F00000055F33BCC0002d0  ONLINE       0     0     0

            c0t600144F06A174000000055F5D90D0002d0  UNAVAIL      0     0     0

        spares

          c0t600144F0BBB5C300000055F5DB4D0002d0    UNAVAIL

     

    errors: No known data errors

     

      pool: rpool

    state: ONLINE

      scan: none requested

    config:

     

        NAME        STATE     READ WRITE CKSUM

        rpool       ONLINE       0     0     0

          c1t0d0s1  ONLINE       0     0     0

     

    errors: No known data errors

     

    root@dbnode3:~# zpool list

    NAME        SIZE  ALLOC   FREE  CAP  DEDUP    HEALTH  ALTROOT

    datapool3  7.94G   147M  7.79G   1%  1.00x  DEGRADED  -

    rpool      19.6G  8.53G  11.1G  43%  1.00x    ONLINE  -

     

    root@dbnode3:~# zpool status

      pool: datapool3

    state: DEGRADED

    status: One or more devices are unavailable in response to persistent errors.

        Sufficient replicas exist for the pool to continue functioning in a

        degraded state.

    action: Determine if the device needs to be replaced, and clear the errors

        using 'zpool clear' or 'fmadm repaired', or replace the device

        with 'zpool replace'.

        Run 'zpool status -v' to see device specific details.

      scan: resilvered 1.05M in 1s with 0 errors on Sun Sep 20 22:56:43 2015

    config:

     

        NAME                                       STATE     READ WRITE CKSUM

        datapool3                                  DEGRADED     0     0     0

          mirror-0                                 DEGRADED     0     0     0

            c0t600144F03B268F00000055F33BFE0003d0  ONLINE       0     0     0

            c0t600144F06A174000000055F5D9350003d0  UNAVAIL      0     0     0

        spares

          c0t600144F0BBB5C300000055F5DB690003d0    UNAVAIL

     

    errors: No known data errors

     

      pool: rpool

    state: ONLINE

      scan: none requested

    config:

     

        NAME        STATE     READ WRITE CKSUM

        rpool       ONLINE       0     0     0

          c1t0d0s1  ONLINE       0     0     0

     

    errors: No known data errors

     

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./nodetool status

    Datacenter: DC1

    ===============

    Status=Up/Down

    |/ State=Normal/Leaving/Joining/Moving

    --  Address       Load       Tokens  Owns    Host ID                               Rack

    UN  192.168.2.24  108.81 KB  5       ?       6fdc0ead-a6c7-4e70-9a48-c9d0ef99fd84  RAC1

    UN  192.168.2.22  151.29 KB  5       ?       26cc69f8-767e-4b1a-8da4-18d556a718a9  RAC1

    UN  192.168.2.23  138.58 KB  5       ?       af955565-4535-4dfb-b5f5-e15190a1ee28  RAC1

     

    Note: Non-system keyspaces don't have the same replication settings; effective ownership information is meaningless.
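
    Beyond nodetool, a quick sanity check is to open a cqlsh session against one of the database nodes and run a read while the pools are degraded. The sketch below connects to dbnode1's public IPMP address; mykeyspace.mytable is a placeholder for whatever keyspace and table were created earlier in this article.

    root@dbnode1:/datapool1/zfsnode1/cassandra/bin# ./cqlsh 192.168.2.22
    cqlsh> -- mykeyspace.mytable is a placeholder for the schema created earlier
    cqlsh> SELECT * FROM mykeyspace.mytable LIMIT 3;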

     

    Bringing a storage server back online is handled in the same way as bringing a failed disk back into a pool.

     

    root@dbnode1:/# zpool clear datapool1
    root@dbnode1:/# zpool status
      pool: datapool1
     state: ONLINE
      scan: resilvered 2.09M in 1s with 0 errors on Sun Sep 20 22:49:40 2015
    config:

       NAME                                       STATE     READ WRITE CKSUM
       datapool1                                  ONLINE       0     0     0
         mirror-0                                 ONLINE       0     0     0
           c0t600144F03B268F00000055F33BB10001d0  ONLINE       0     0     0
           c0t600144F06A174000000055F5D8F50001d0  ONLINE       0     0     0
       spares
         c0t600144F0BBB5C300000055F5DB370001d0    UNAVAIL

    errors: No known data errors

      pool: rpool
     state: ONLINE
      scan: none requested
    config:

       NAME        STATE     READ WRITE CKSUM
       rpool       ONLINE       0     0     0
         c1t0d0s1  ONLINE       0     0     0

    errors: No known data errors

     

    root@dbnode1:~# zpool list

    NAME        SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT

    datapool1  7.94G   148M  7.79G   1%  1.00x  ONLINE  -

    rpool      19.6G  8.02G  11.6G  40%  1.00x  ONLINE  -

     

    The same applies to the other ZFS pools. Had there been just three servers in the cluster with the data residing in local storage, the cluster would have failed completely when the second server went offline. Instead, with the configuration implemented in this article, the cluster remains fully operational even with two of the three storage nodes down, and the data remains intact. However, data corruption or downtime on the last remaining storage node would result in complete failure.
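
    For reference, each data pool in this layout is a two-way mirror whose halves are LUNs from two different storage nodes, with a hot spare drawn from the third. The pools were already created in an earlier step of this article; re-creating datapool1 from scratch would look roughly like the following, using the LUN device names shown in its status output above.

    # Two-way mirror across two storage nodes, plus a hot spare on the third
    # (device names taken from the datapool1 status output above):
    root@dbnode1:~# zpool create datapool1 \
        mirror c0t600144F03B268F00000055F33BB10001d0 \
               c0t600144F06A174000000055F5D8F50001d0 \
        spare c0t600144F0BBB5C300000055F5DB370001d0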

     

    Summary

     

    This article explained how to build and run the popular Apache Cassandra distributed database management system on Oracle Solaris. Some of its inner mechanics were described to show how the DBMS distributes and stores data across multiple servers according to the replication strategy.

     

    At the same time, several Oracle Solaris built-in technologies, such as ZFS, IPMP, and iSCSI COMSTAR, were applied to provide additional availability to the cluster. iSCSI COMSTAR can substitute for expensive SAN storage while still ensuring data availability, ZFS provides redundancy at the local storage level (and, in this configuration, across the network), and IPMP keeps each node reachable when a NIC fails.

     


    Acknowledgments

     

    The author would like to thank Logan Rosenstein and Glynn Foster for their assistance in getting this article posted as well as Karen Perkins for making it readable.

     

    About the Author

     

    Antonis Tsavdaris is a level-2 IT support engineer who has been supporting Oracle technologies, mainly Oracle Database, for quite some time.

     

     

    Revision 1.0, 11/20/2015

     
