How to Build a Hadoop 2.6 Cluster Using OpenStack

Version 9

    by Ekine Akuiyibo and Orgad Kimchi

     

    How to set up a multinode Apache Hadoop 2.6 (YARN) cluster using Oracle OpenStack for Oracle Solaris and other Oracle Solaris technologies such as Oracle Solaris Zones and Oracle Solaris Unified Archives. The created cluster is a virtual cluster that uses only one physical machine, which enables efficient vertical scaling and promotes optimized resource utilization.

     

    Table of Contents
    Introduction to Hadoop
    Introduction to OpenStack
    The Benefits of Using OpenStack for a Hadoop Cluster
    OpenStack Sahara
    Architecture We Will Use
    Hadoop Zones Information
    Tasks We Will Perform
    Prerequisites
    Performing the Tasks
    Summary
    Acknowledgments
    See Also
    About the Authors
    Appendix—OpenStack hadoop_setup.sh Configuration Script

     

    This article starts with a brief overview of Hadoop and OpenStack, and follows with an example of setting up a Hadoop cluster that has two NameNodes, a Resource Manager, a History Server, and three DataNodes. As a prerequisite, you should have a basic understanding of Oracle Solaris Zones and network administration.

     

    Introduction to Hadoop

     

    Apache Hadoop is an open source distributed computing framework designed to process very large unstructured data sets. It is composed of two main subsystems: data storage and data analysis. Apache Hadoop was developed to address four system principles: the ability to reliably scale processing across multiple physical (or virtual) nodes, moving code to data, dealing gracefully with node failures, and abstracting the complexity of distributed and concurrent applications.

     

    Introduction to OpenStack

     

    OpenStack is free and open source cloud software typically deployed as an infrastructure as a service (IaaS) solution. The OpenStack platform exists as a combination of several interrelated subprojects that control the provisioning of compute, storage, and network resources in a data center. OpenStack technology simplifies the deployment of data center resources while at the same time providing a unified resource management tool. Oracle Solaris 11 includes a complete OpenStack distribution called Oracle OpenStack for Oracle Solaris.

     

    The Benefits of Using OpenStack for a Hadoop Cluster

     

    In addition to being two of the most active open source community projects, Hadoop and OpenStack are complementary technologies. This is especially evident in long-term enterprise adoption of both of these greenfield technologies.

     

    Hadoop, the more mature of the two technologies, still faces significant operational challenges. A representative Hadoop adoption journey would, for example, comprise pilot projects, quality assurance, testing, performance validation, and production. Each activity requires its own Hadoop cluster. Moreover, each business unit evaluating separate workloads or using a different software and hardware stack will have different Hadoop cluster requirements. Supporting these multiple Hadoop cluster environments can be an operational nightmare, and spinning up virtual machines (VMs) on a public cloud does not guarantee that the solution will work in-house.

     

    OpenStack alleviates these operational complexities. In addition, OpenStack reduces provisioning and deployment time through template-based provisioning. Templates can be specified at the cluster and node level, allowing self-provisioning of Hadoop clusters while eliminating typical configuration errors. Templates also allow flexibility in defining cluster types, for example, Hadoop-specific or not. Other operational benefits include more efficient cluster timesharing as well as the basic infrastructure for supporting varying service level agreements (SLAs) through resource and access isolation.

     

    OpenStack, the "newer" technology of the two, can use Hadoop as its "killer" application. Several architectural characteristics of Hadoop make this the case, including its scale flexibility (vertical and horizontal), its independence from legacy applications and workloads, and the ability for multiple users (departments) to share the same platform. In other words, Hadoop is the ideal cloud application for OpenStack proof of concept.

     

    Oracle OpenStack for Oracle Solaris, Oracle Solaris Zones, Oracle Virtual Networking, and Oracle Solaris Unified Archives provide the fundamental building blocks for the Hadoop and OpenStack integration discussed in this article. Using Unified Archives, virtual Hadoop clusters can be provisioned in the time it takes to boot Oracle Solaris Zones. Oracle Solaris Zones technology provides zero-overhead virtualization making zones highly efficient. Furthermore, Oracle Solaris Kernel Zones extend the basic zones functionality to include operating system–level isolation and independence. The ability to rapidly provision zones coupled with the flexibility of Unified Archives enables template-based provisioning both at the cluster and node levels. The entire infrastructure is monitored through one pane of glass—OpenStack Horizon.

     

    OpenStack Sahara

     

    Initially named Savanna, Sahara is the data-processing component of OpenStack. Incubated in the OpenStack Icehouse release and integrated in the OpenStack Juno release, Sahara provides single-click (push-button) provisioning of Hadoop clusters and elastic data processing (EDP) capabilities analogous to Amazon Elastic MapReduce. Sahara's integration with core OpenStack services, including Horizon, gives operators the ability to manage Hadoop clusters from the OpenStack dashboard. OpenStack Sahara is a work in progress for Oracle OpenStack for Oracle Solaris.

     

    Note: In this article, we are not going to use OpenStack Sahara.

     

    Architecture We Will Use

     

    Figure 1 shows the architecture used in this article.

     

    f1.gif

     

    Figure 1. Architecture

     

    The Hadoop cluster building blocks are as follows:

     

    • NameNode: The centerpiece of the Hadoop Distributed File System (HDFS), which stores file system metadata and is responsible for all client operations
    • Secondary NameNode: Periodically checkpoints the NameNode's metadata by merging the edit log into the file system image, which enables faster recovery if the active NameNode goes down
    • ResourceManager: The global YARN resource scheduler, which allocates cluster resources and directs the per-node NodeManager daemons that launch and monitor application containers
    • DataNodes: Nodes that store data in HDFS and are also known as slaves; these nodes also run the NodeManager process, which communicates with the ResourceManager
    • History Server: Provides REST APIs that allow the user to get the status of, and information about, finished jobs

     

    Figure 2 shows an example Hadoop cluster.

     

    f2.gif

     

    Figure 2. Example Hadoop cluster

     

    Hadoop Zones Information

     

    We will leverage the integration between Oracle Solaris Zones virtualization technology and the OpenStack framework that is built into Oracle Solaris.

     

    Table 1 shows a summary of the Hadoop zones we will create:

     

                    

    Function              Zone Name           IP Address
    NameNode              name-node1          192.168.1.2/24
    Secondary NameNode    name-node2          192.168.1.3/24
    ResourceManager       resource-manager    192.168.1.4/24
    DataNode              data-node1          192.168.1.5/24
    DataNode              data-node2          192.168.1.6/24
    DataNode              data-node3          192.168.1.7/24

     

    Tasks We Will Perform

     

    In the next subsections, we will perform the following operations in order to build the architecture:

    • Create OpenStack client environment scripts
    • Create the hadoop user
    • Create the tenant network
    • Create the Glance image
    • Download Hadoop
    • Log in to Horizon
    • Launch the name-node1 instance
    • Install Hadoop and create more scripts
    • Launch name-node2, resource-manager, and the three DataNode instances
    • Check the network topology
    • Verify the SSH setup
    • Format HDFS
    • Start the HDFS services
    • Start the Hadoop cluster
    • Run a MapReduce job

     

     

    Prerequisites

     

    You need to have a working OpenStack Juno environment running Oracle Solaris 11.2.10.5.0 or later. Refer to Installing and Configuring OpenStack in Oracle Solaris 11.2 in order to build an OpenStack environment. You will also need to download from Apache a copy of Hadoop 2.6.

     

    Important: In the examples presented in this article, the command prompt indicates which user needs to run each command in addition to indicating the environment where the command should be run. For example, the command prompt root@global:~# indicates that user root needs to run the command from the global zone.

     

    Performing the Tasks

     

    Create OpenStack Client Environment Scripts

     

    To increase the efficiency of client operations, OpenStack supports simple client environment scripts, which are also known as OpenRC files. The scripts include the location of the Identity service and the admin and hadoop user credentials. Future portions of this article reference these scripts to load appropriate credentials for client operations.

     

    Create the client environment scripts for the admin and hadoop users by running the following commands.

     

    root@global:~# vi admin-openrc.sh

    export OS_AUTH_URL=http://localhost:5000/v2.0

    export OS_PASSWORD=neutron

    export OS_TENANT_NAME=service

    export OS_USERNAME=neutron

     

    root@global:~# vi hadoop-openrc.sh

    export OS_AUTH_URL=http://localhost:5000/v2.0

    export OS_PASSWORD=secrete

    export OS_TENANT_NAME=demo

    export OS_USERNAME=hadoop

     

    To run clients as a specific user, you can simply load the associated client environment script prior to running the clients. This will load the environment variables for the location of the Identity service and the admin user credentials, for example:

     

    root@global:~# source admin-openrc.sh

     

    Verify that the environment variables have been applied:

     

    root@global:~# env | grep -i os

    OS_PASSWORD=neutron

    OS_AUTH_URL=http://localhost:5000/v2.0

    OS_USERNAME=neutron

    OS_TENANT_NAME=service

     

    Create the hadoop User

     

    Keystone is an OpenStack service that provides authentication and authorization services between users, administrators, and OpenStack services.

     

    Create the hadoop user using the following command:

     

    code_a.png

     

    Where:

     

    • name is the user name
    • tenant is the tenant name
    • pass is the user password
    • email is the e-mail address

     

    Note: OpenStack generates IDs dynamically, so you will see different values in the example command output.
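
    As a hedged sketch, the Keystone call looks something like the following; the demo tenant and secrete password come from the hadoop-openrc.sh script created earlier, and the e-mail address is only a placeholder:

    root@global:~# source admin-openrc.sh
    root@global:~# keystone user-create --name hadoop --tenant demo \
        --pass secrete --email hadoop@example.com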

     

    The newly created hadoop account will be used for future management of the OpenStack environment.

     

    Create the Tenant Network

     

    Neutron provides networking capabilities in OpenStack, enabling VMs to talk to each other within the same tenants and subnets, and enabling them to talk directly to the outside world.

     

    The tenant network provides internal network access for instances. The architecture isolates this type of network from other tenants.

     

    Load the location of the Identity service and the hadoop user credentials:

     

    root@global:~# source hadoop-openrc.sh

     

    Create the network:

     

    code_b.png
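
    As a hedged sketch, the Neutron call looks something like this, assuming the network is named hadoop_net as it appears later in Horizon:

    root@global:~# neutron net-create hadoop_net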

     

    Your network also requires a subnet that is attached to it.

     

    Some of the Hadoop services rely on static IP addresses, so DHCP is disabled on this subnet and static IP addresses are assigned to the instances instead.

     

    code_c.png

     

    From the command output, we can see the following:

     

    • The subnet IP address range will be 192.168.1.2 through 192.168.1.254 (allocation_pools)
    • DHCP is disabled on this network (enable_dhcp | False)
    • The IP gateway for this subnet will be 192.168.1.1 (gateway_ip).
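
    As a hedged sketch, the subnet creation command that produces the values described above looks something like this; the subnet name hadoop_subnet is an assumption, while the CIDR, gateway, and disabled DHCP follow from the values just listed:

    root@global:~# neutron subnet-create --name hadoop_subnet --disable-dhcp \
        --gateway 192.168.1.1 hadoop_net 192.168.1.0/24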

     

    Create the Glance Image

     

    Glance is a service that provides image management in OpenStack. It is responsible for managing the images that are installed on the compute nodes when you create new VM instances.

     

    The next step will be to populate Glance with an image that we can use for our instances.

     

    In the Oracle Solaris implementation, we take advantage of a new archive type called Unified Archives. Therefore, we will create a Unified Archive.

     

    The following shows how to capture a Unified Archive of a newly created non-global zone called myzone, and then upload it to the Glance repository.

     

    Create the zone:

     

    root@global:~# zonecfg -z myzone create

     

    Install and boot the zone:

     

    root@global:~# zoneadm -z myzone install

    root@global:~# zoneadm -z myzone boot

     

    We need to prepare the zone to be used as a cloud image. Cloud images are preinstalled bootable disk images that have had their identifiable host-specific metadata—such as SSH host keys, MAC addresses, and static IP addresses—removed.

     

    When we deploy instances using OpenStack, we typically provide an SSH public keypair that's used as the primary authentication mechanism to our instance.

     

    Modify the /etc/ssh/sshd_config file in order to enable root access without a password:

     

    root@global:~# zlogin myzone 'sed /^PermitRootLogin/s/no$/without-password/ \

    < /etc/ssh/sshd_config > /system/volatile/sed.$$ ; \

    cp /system/volatile/sed.$$ /etc/ssh/sshd_config'

     

    Download Hadoop

     

    This article uses Apache Hadoop Release 2.6.0.

     

    Download the Hadoop tarball hadoop-2.6.0.tar.gz.

     

    Copy the tarball into the zone's /var/tmp directory:

     

    root@global:~# cp hadoop-2.6.0.tar.gz /system/zones/myzone/root/var/tmp

     

    Shut down the zone:

     

    root@global:~# zoneadm -z myzone halt

     

    Create the Unified Archive (UAR):

     

    root@global:~# archiveadm create -z myzone /var/tmp/myzone.uar

     

    Create the following image-upload.ksh script in order to upload the UAR into Glance. The script will perform the following actions:

     

    • Get the system architecture (SPARC or x86).
    • Load the Glance user credentials.
    • Upload the image into the Glance image repository.

     

    root@global:~# vi image-upload.ksh

    #!/bin/ksh

     

    # Upload Unified Archive image to glance with proper Solaris decorations

     

    arch=$(archiveadm info -p $1|grep ^archive|cut -d '|' -f 4)

     

    if [[ "$arch" == "i386" ]]; then

            imgarch=x86_64

    else

            imgarch=sparc64

    fi

     

    name=$(basename $1 .uar)

    export OS_USERNAME=glance

    export OS_PASSWORD=glance

    export OS_TENANT_NAME=service

    export OS_AUTH_URL=http://localhost:5000/v2.0

     

    glance \

    image-create \

    --name $name \

    --container-format bare \

    --disk-format raw \

    --owner service \

    --file "$1" \

    --is-public True \

    --property architecture="$imgarch" \

    --property hypervisor_type=solariszones \

    --property vm_mode=solariszones \

    --progress

     

    Change the script's permissions and upload the UAR into Glance by running the following commands:

     

    code_d.png
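
    As a hedged sketch, those two commands look roughly like the following, assuming the script and the UAR are in the locations used above:

    root@global:~# chmod +x image-upload.ksh
    root@global:~# ./image-upload.ksh /var/tmp/myzone.uar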

     

    Log In to Horizon

     

    Within the host environment, open up a browser and navigate to the IP address you allocated to the global zone:

     

    http://<IP_address>/horizon

     

    Use hadoop/secrete as the user/password combination in the login screen.

     

    f3.gif

    Figure 3. The OpenStack Horizon login screen

     

    After you have successfully logged in, navigate to the Access & Security screen, where you can create a new SSH keypair:

     

    f4.gif

    Figure 4. Access & Security screen

     

    There are no keypairs currently defined, so click the Import Key Pair button to open the Import Keypair screen, which is shown in Figure 5.

     

    In our case, let's use the SSH public key of our global zone.

     

    First, run the following command to generate the SSH key. If you are prompted to overwrite an existing key, enter yes.

     

    root@global:~# ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

    Generating public/private rsa key pair.

    /root/.ssh/id_rsa already exists.

    Overwrite (yes/no)? yes

     

    Next, get the key using the following command, and then enter the key into the Public Key field of the Import Key Pair screen (see Figure 5).

     

    root@global:~# cat .ssh/id_rsa.pub

    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA0Khp4Th5VcKQW4LttqzKAR8O60gj43cB0CbdpiizEhXEbVgjI7IlnZlo9i

    SEFpJlnZrFQC8MU2L7Hn+CD5nXLT/uK90eAEVXVqwc4Y7IVbEjrABQyB74sGnJy+SHsCGgetjwVrifR9fkxFHg

    jxXkOunXrPMe86hDJRpZLJFGYZZezJRtd1eRwVNSHhJdZmUac7cILFJen/wSsM8TOSAkh+ZWEhwY3o08nZg2IW

    dMImPbwPwtRohJSH3W7XkDE85d7UZebNJpD9kDAw6OmXsY5CLgV6gEoUExZ/J4k29WOrr1XKR3jiRqQlf3Kw4Y

    uK9JUi/gPhg2lTOhISgJoelorQ== root@global

     

    In the Key Pair Name field, enter hadoopkey.

     

    f5.gif

     

    Figure 5. Import Keypair screen

     

    Launch the name-node1 Instance

     

    Nova is the compute service in OpenStack, and it is responsible for scheduling and deploying new instances.

     

    Navigate to the Instances screen.

     

    f6.gif

     

    Figure 6. Instances screen

     

    Let's launch a new instance by clicking the Launch Instance button.

     

    We will call our instance name-node1. We will give it an Oracle Solaris non-global zone flavor called tiny; a flavor defines the resources allocated to the instance. We can see in Figure 7 that this flavor provides a 10 GB root disk and 2,048 MB of RAM. We will boot the instance from the myzone (1.4 GB) image that we uploaded to Glance in the previous section.

     

    f7.gif

     

    Figure 7. Launch Instance screen

     

    When we are happy with the Details tab, we can move on to the Access & Security tab. There, you can see that our keypair (hadoopkey) has been preselected.

     

    f8.gif

     

    Figure 8. Access & Security tab

     

    Move on to the Networking tab, where you can see that our network, hadoop_net, has been preselected. Then click the Launch button.

     

    f9.gif

     

    Figure 9. Networking tab

     

    After a little bit of time, we can see that our instance has successfully booted, which is indicated by its Active status (see Figure 10). You can see that the instance has the IP address 192.168.1.2.

     

    f10.gif

     

    Figure 10. Screen showing the instance's status is "active"
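
    As an optional alternative to the dashboard, a hedged nova CLI sketch that would launch an equivalent instance is shown below; the image, flavor, and keypair names match those used above, and <hadoop_net_ID> is a placeholder for the hadoop_net network ID:

    root@global:~# nova boot --image myzone \
        --flavor "Oracle Solaris non-global zone - tiny" \
        --key-name hadoopkey \
        --nic net-id=<hadoop_net_ID>,v4-fixed-ip=192.168.1.2 \
        name-node1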

     

    Install Hadoop and Create More Scripts

     

    From the global zone, get the zone name using the zoneadm command:

     

    root@global:~# zoneadm list

    global

    instance-00000001

     

    Note: Your zone name might be different.

     

    Log in to the VM using the zlogin command:

     

    root@global:~# zlogin instance-00000001

    [Connected to zone 'instance-00000001' pts/2]

    Oracle Corporation      SunOS 5.11      11.2    December 2014

     

    Verify that you have the Hadoop tarball in /var/tmp:

     

    root@name-node1:~# ls /var/tmp

    hadoop-2.6.0.tar.gz

     

    Next, set up the Hadoop environment using the hadoop_setup.sh script provided in the section "Appendix—OpenStack hadoop_setup.sh Configuration Script." (For a full description of the script, see the article "How to Set Up a Hadoop 2.2 Cluster From the Unified Archive.")

     

    First, create the /usr/local/Scripts directory; we will use this directory for our scripts.

     

    root@name-node1:~# mkdir -p /usr/local/Scripts

     

    Then, copy the hadoop_setup.sh script content from the Appendix to create the hadoop_setup.sh script, and then set the permissions:

     

    root@name-node1:~# vi /usr/local/Scripts/hadoop_setup.sh

    root@name-node1:~# chmod +x /usr/local/Scripts/hadoop_setup.sh

     

    Run the script. The script will prompt you for the passwords of the following users: hdfs, yarn, mapred, and bob.

     

    root@name-node1:~# /usr/local/Scripts/hadoop_setup.sh

    80 blocks

    Enter the password for the hdfs user

    New Password:

    Re-enter new Password:

    ...

     

    Next, create the testssh script. We will use this script to verify the SSH setup.

     

    root@name-node1:~# vi /usr/local/Scripts/testssh

     

    #!/bin/ksh

     

    for zone in name-node1 name-node2 resource-manager data-node1 data-node2 data-node3

      do

     

      ssh -o StrictHostKeyChecking=no $zone exit

     

      done

     

    Create the startcluster script. We will use this script to start all the services on the Hadoop cluster.

     

    root@name-node1:~# vi /usr/local/Scripts/startcluster

     

    #!/bin/ksh

     

    su - hdfs -c "start-dfs.sh"

    su - yarn -c "start-yarn.sh"

    su - yarn -c 'ssh yarn@resource-manager /usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager'

    su - mapred -c 'ssh mapred@resource-manager /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver'

     

    Create the stopcluster script. We will use this script to stop all the services on the Hadoop cluster.

     

    root@name-node1:~# vi /usr/local/Scripts/stopcluster

     

    #!/bin/ksh

     

    su - hdfs -c "stop-dfs.sh"

    su - yarn -c "stop-yarn.sh"

    su - yarn -c 'ssh yarn@resource-manager /usr/local/hadoop/sbin/yarn-daemon.sh stop resourcemanager'

    su - mapred -c 'ssh mapred@resource-manager /usr/local/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver'

     

    Create the verify-hadoop script, which will verify that the Hadoop processes are up and running.

     

    root@name-node1:~# vi /usr/local/Scripts/verify-hadoop

    #!/bin/ksh

     

    su - hdfs -c "ssh -q hdfs@name-node1 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep NameNode'" > /tmp/hadoop_output

    su - hdfs -c "ssh hdfs@name-node2 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep SecondaryNameNode'" >> /tmp/hadoop_output

    su - hdfs -c "ssh hdfs@data-node1 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep DataNode'" >> /tmp/hadoop_output

    su - hdfs -c "ssh hdfs@data-node2 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep DataNode'" >> /tmp/hadoop_output

    su - hdfs -c "ssh hdfs@data-node3 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep DataNode'" >> /tmp/hadoop_output

    su - yarn -c "ssh yarn@data-node1 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep NodeManager'"  >> /tmp/hadoop_output

    su - yarn -c "ssh yarn@data-node2 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep NodeManager'"  >> /tmp/hadoop_output

    su - yarn -c "ssh yarn@data-node3 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep NodeManager'"  >> /tmp/hadoop_output

    su - yarn -c "ssh yarn@resource-manager 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep ResourceManager'" >> /tmp/hadoop_output

    su - mapred -c "ssh mapred@resource-manager 'hostname ; /usr/jdk/instances/jdk1.7.0/bin/jps | grep HistoryServer'"  >> /tmp/hadoop_output

    cat /tmp/hadoop_output | grep -v Oracle | awk  '{print $1,$2}'

     

    rm /tmp/hadoop_output

     

    Change the scripts' permissions:

     

    root@name-node1:~# chmod -R +x /usr/local/Scripts

     

    Log out from the zone:

     

    root@name-node1:~# logout

    [Connection to zone 'instance-00000001' pts/3 closed]

     

    Now that the name-node1 instance is ready with the Hadoop configuration, we can create a snapshot of it, which produces a Glance image. We will use this image to provision the other Hadoop nodes.

     

    From the Instances menu, select Create Snapshot.

     

    f11.gif

     

    Figure 11. Instances menu screen

     

    The Create Snapshot window will appear.

     

    f12.gif

     

    Figure 12. The Create Snapshot window

     

    In the Snapshot Name field, enter hadoop-base-image, and then click the Create Snapshot button.

     

    After a few seconds, the Images screen will open, as shown in Figure 13.

     

    f13.gif

     

    Figure 13. The Images window

     

    Wait a few minutes until the image is created. Once the image is ready, you will see that its status has been changed to Active, as shown in Figure 14.

     

    f14.gif

     

    Figure 14. Screen showing the image's status is active

     

    Launch name-node2, resource-manager, and the Three DataNode Instances

     

    You can launch the instances using the Horizon dashboard; alternatively, you can use Heat in order to automate the Hadoop cluster deployment.

     

    Heat is the main project in the OpenStack Orchestration program. It implements an orchestration engine that launches multiple composite cloud applications based on templates in the form of text files that can be treated like code. For more information about Heat, see https://wiki.openstack.org/wiki/Heat.

     

    In Heat terminology, a stack is the collection of objects—or resources—that will be created by Heat. This collection might include instances (VMs), networks, subnets, routers, ports, and so on. Heat uses the notion of a template to define a stack.

     

    In order to automate our Hadoop cluster deployment, we can create a Heat template that will include the instances' information, such as host name, image name, flavor name, key name, and network ID. But first, we need to get this information.

     

    Load the hadoop user credentials:

     

    root@global:~# source hadoop-openrc.sh

     

    Get the Glance images' names:

     

    code_e.png
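
    As a hedged sketch, the image listing comes from the standard Glance client call:

    root@global:~# glance image-list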

     

    We can see the two Glance images that we have; we will use the hadoop-base-image image.

     

    Get the flavors' names:

     

    code_f.png
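
    Similarly, the flavor listing presumably comes from the standard Nova client call:

    root@global:~# nova flavor-list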

     

    Table 2 shows which flavor we are going to use for each instance.

     

                    

    Function              Zone Name           Flavor
    NameNode              name-node1          Oracle Solaris non-global zone - tiny
    Secondary NameNode    name-node2          Oracle Solaris non-global zone - tiny
    ResourceManager       resource-manager    Oracle Solaris non-global zone - small
    DataNode              data-node1          Oracle Solaris non-global zone - small
    DataNode              data-node2          Oracle Solaris non-global zone - small
    DataNode              data-node3          Oracle Solaris non-global zone - small

     

    Note: We will use the Oracle Solaris non-global zone - small flavor for the ResourceManager and the DataNodes, because they need more storage capacity for HDFS. This flavor provides a 20 GB root disk, 3 GB of RAM, and four virtual CPUs.

     

    Get the network list:

     

    code_g.png
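
    The network listing presumably comes from the standard Neutron client call:

    root@global:~# neutron net-list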

     

    We will use the hadoop_net ID.

     

    Now, let's create the Heat template based on the information we have gathered. For each instance, we need to define the following properties.

     

    • name: host name
    • image: image name
    • flavor: flavor name
    • key_name: key name
    • networks:
      + network_id: the UUID of the network (hadoop_net)
      + fixed_ips: the fixed IP address of the instance

       

     

    Edit the Heat template:

     

    Note: You need to change the network_id values in the template using the hadoop_net ID value that we got from the earlier command.

     

    root@global:~# vi hadoop-stack.yml

    heat_template_version: 2013-05-23

    description: Hadoop cluster Template

    resources:

      name-node2_server_port:

        type: OS::Neutron::Port

        properties:

          network_id: "e4a00424-70fe-4b1e-851b-dc53fba0f13d"

          fixed_ips: [ { 'ip_address': '192.168.1.3' } ]

     

      name-node2:

        type: OS::Nova::Server

        properties:

          name: "name-node2"

          image: "hadoop-base-image"

          flavor: "Oracle Solaris non-global zone - tiny"

          key_name: "hadoopkey"

          networks:

          - port: { get_resource: name-node2_server_port }

      resource-manager_server_port:

        type: OS::Neutron::Port

        properties:

          network_id: "e4a00424-70fe-4b1e-851b-dc53fba0f13d"

          fixed_ips: [ { 'ip_address': '192.168.1.4' } ]

     

      resource-manager:

        type: OS::Nova::Server

        properties:

          name: "resource-manager"

          image: "hadoop-base-image"

          flavor: "Oracle Solaris non-global zone - small"

          key_name: "hadoopkey"

          networks:

          - port: { get_resource: resource-manager_server_port }

     

      data-node1_server_port:

        type: OS::Neutron::Port

        properties:

          network_id: "e4a00424-70fe-4b1e-851b-dc53fba0f13d"

          fixed_ips: [ { 'ip_address': '192.168.1.5' } ]

      data-node1:

        type: OS::Nova::Server

        properties:

          name: "data-node1"

          image: "hadoop-base-image"

          flavor: "Oracle Solaris non-global zone - small"

          key_name: "hadoopkey"

          networks:

          - port: { get_resource: data-node1_server_port }

     

      data-node2_server_port:

        type: OS::Neutron::Port

        properties:

          network_id: "e4a00424-70fe-4b1e-851b-dc53fba0f13d"

          fixed_ips: [ { 'ip_address': '192.168.1.6' } ]

     

      data-node2:

        type: OS::Nova::Server

        properties:

          name: "data-node2"

          image: "hadoop-base-image"

          flavor: "Oracle Solaris non-global zone - small"

          key_name: "hadoopkey"

          networks:

          - port: { get_resource: data-node2_server_port }

     

      data-node3_server_port:

        type: OS::Neutron::Port

        properties:

          network_id: "e4a00424-70fe-4b1e-851b-dc53fba0f13d"

          fixed_ips: [ { 'ip_address': '192.168.1.7' } ]

     

      data-node3:

        type: OS::Nova::Server

        properties:

          name: "data-node3"

          image: "hadoop-base-image"

          flavor: "Oracle Solaris non-global zone - small"

          key_name: "hadoopkey"

          networks:

          - port: { get_resource: data-node3_server_port }

     

     

    You can validate the Hadoop Heat template syntax using the following command:

     

    root@global:~# heat template-validate --template-file hadoop-stack.yml

    {

      "Description": "Hadoop cluster Template",

      "Parameters": {}

    }

     

    Use the following heat stack-create command to create a stack from the template. The command will launch the instances that we defined in the template.

     

    code_h.png
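
    As a hedged sketch, and assuming the stack name HadoopStack that appears in the Horizon screens below, the command looks something like this:

    root@global:~# heat stack-create --template-file hadoop-stack.yml HadoopStack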

     

    Use the heat stack-list command to verify successful creation of the stack:

     

    code_i.png
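
    The stack listing presumably comes from the standard Heat client call:

    root@global:~# heat stack-list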

     

    You can see a graphical representation of the Heat stack by navigating to the Orchestration menu, selecting the Stacks menu option, and then choosing HadoopStack, which displays the following screen:

     

    f15.gif

     

    Figure 15. The graphical representation of the Heat stack

     

    Once the instances finish the boot process, you will have six instances, as shown in Figure 16.

     

    f16.gif

     

    Figure 16. The Instances window

     

    Check the Network Topology

     

    The OpenStack Dashboard can show the Hadoop cluster network topology.

     

    Navigate to the Network Topology screen and then click Normal.

     

    f17.gif

     

    Figure 17. The Network topology

     

    You can see the network name, hadoop_net, and its address in addition to the IP address for each instance.

     

    Verify the SSH Setup

     

    On each zone, we need to add the Hadoop node names to /etc/hosts.

     

    Create a temporary file with the host names.

     

    root@global:~# vi /tmp/hosts

    ::1 localhost

    127.0.0.1 localhost loghost

    192.168.1.2 name-node1 

    192.168.1.3 name-node2

    192.168.1.4 resource-manager

    192.168.1.5 data-node1

    192.168.1.6 data-node2

    192.168.1.7 data-node3

     

    For each zone, copy the host names into /etc/hosts using the following command:

     

    root@global:~# for zone in `zoneadm list | grep -v global`; do echo \

    $zone ; cat /tmp/hosts | zlogin $zone 'cat -  > /etc/hosts' ; done

    instance-00000001

    instance-00000002

    instance-00000003

    instance-00000004

    instance-00000005

    instance-00000006

     

    Log in to the name-node1 zone:

     

    root@global:~# zlogin instance-00000001

     

    Run the testssh script to log in to the cluster nodes using the ssh command:

     

    root@name-node1:~# su - hdfs -c "/usr/local/Scripts/testssh"

    Warning: Permanently added 'name-node1' (RSA) to the list of known hosts.

    ...

    root@name-node1:~# su - yarn -c "/usr/local/Scripts/testssh"

    root@name-node1:~# su - mapred -c "/usr/local/Scripts/testssh"

     

    Format HDFS

     

    Before starting the Hadoop cluster, we need to format HDFS.

     

    To format HDFS, switch to user hdfs and then run the hdfs namenode -format command:

     

    root@name-node1:~# su - hdfs

     

    hdfs@name-node1:~$ hdfs namenode -format

     

    Look for the following output, which indicates HDFS has been set up:

     

    ... INFO common.Storage: Storage directory /var/data/1/dfs/nn has been successfully formatted ....

     

    Start the HDFS Services

     

    Run the following script to start the HDFS services:

     

    hdfs@name-node1:~$ start-dfs.sh

     

    Note: You might get the warning message WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. This message indicates Hadoop is unable to use native platform libraries that accelerate the Hadoop suite. These native libraries are optional; the port of the Oracle Solaris Hadoop 2.x native libraries is a work in progress.

     

    Create a /tmp directory in HDFS and set its permissions to 1777 (drwxrwxrwt) using the hadoop fs command:

     

    hdfs@name-node1:~$ hadoop fs -mkdir /tmp

    hdfs@name-node1:~$ hadoop fs -chmod -R 1777 /tmp

     

    Create a history directory and set permissions and ownership:

     

    hdfs@name-node1:~$ hadoop fs -mkdir /user

    hdfs@name-node1:~$ hadoop fs -mkdir /user/history

    hdfs@name-node1:~$ hadoop fs -chmod -R 1777 /user/history

    hdfs@name-node1:~$ hadoop fs -chown yarn /user/history

     

    Create the log directories:

     

    hdfs@name-node1:~$ hadoop fs -mkdir /var

    hdfs@name-node1:~$ hadoop fs -mkdir /var/log

    hdfs@name-node1:~$ hadoop fs -mkdir /var/log/hadoop-yarn

    hdfs@name-node1:~$ hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

     

    Create a directory for user bob and set ownership:

     

    hdfs@name-node1:~$ hadoop fs -mkdir /user/bob

    hdfs@name-node1:~$ hadoop fs -chown bob /user/bob

     

    Verify the HDFS directory structure:

     

    hdfs@name-node1:~$ hadoop fs -ls -R /

    drwxrwxrwt  - hdfs supergroup        0 2014-02-26 10:43 /tmp

    drwxr-xr-x  - hdfs supergroup        0 2014-02-26 10:58 /user

    drwxr-xr-x  - bob  supergroup        0 2014-02-26 10:58 /user/bob

    drwxrwxrwt  - yarn supergroup        0 2014-02-26 10:50 /user/history

    drwxr-xr-x  - hdfs supergroup        0 2014-02-26 10:53 /var

    drwxr-xr-x  - hdfs supergroup        0 2014-02-26 10:53 /var/log

    drwxr-xr-x  - yarn mapred            0 2014-02-26 10:53 /var/log/hadoop-yarn

     

    Run the following script in order to stop the HDFS services:

     

    hdfs@name-node1:~$ stop-dfs.sh

     

    Log out from the hdfs user.

     

    hdfs@name-node1:~$ logout

     

    Start the Hadoop Cluster

     

    Run the following script in order to start the Hadoop cluster:

     

    root@name-node1:~# /usr/local/Scripts/startcluster

    Starting namenodes on [name-node1]

    name-node1: starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-name-node1.out

    data-node1: starting datanode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-datanode-data-node1.out

    data-node3: starting datanode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-datanode-data-node3.out

    data-node2: starting datanode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-datanode-data-node2.out

    ...

     

    Verify that the Hadoop cluster started successfully by using the following script, which prints the zone name along with each Hadoop process ID and process name.

     

    root@name-node1:~# /usr/local/Scripts/verify-hadoop

    name-node1

    11730 NameNode

    name-node2

    11990 SecondaryNameNode

    data-node1

    11858 DataNode

    data-node2

    11838 DataNode

    data-node3

    11841 DataNode

    data-node1

    12196 NodeManager

    data-node2

    9679 NodeManager

    data-node3

    12266 NodeManager

    resource-manager

    12335 ResourceManager

    resource-manager

    12387 JobHistoryServer

     

    Use the following commands to switch to user hdfs and show the cluster topology:

     

    root@name-node1:~# su - hdfs

    hdfs@name-node1:~$ hdfs dfsadmin -printTopology

     

    13/11/26 05:19:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform...

    using builtin-java classes where applicable

    Rack: /default-rack

      192.168.1.5:50010 (data-node1)

      192.168.1.6:50010 (data-node2)

      192.168.1.7:50010 (data-node3)

     

    Run a MapReduce Job

     

    Switch to user bob:

     

    root@name-node1:~# su - bob

     

    Password: <enter bob password>

     

    Next, run a simple MapReduce job.

     

    The MapReduce example program used here is included in the Hadoop distribution. It is a straightforward estimation of the value of Pi using a quasi-Monte Carlo method.

     

    bob@name-node1:~$ hadoop jar \

    /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar \

    pi 10 20

     

    Where:

     

    • hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi specifies the Hadoop .jar file and the pi example program.
    • 10 specifies the number of maps.
    • 20 specifies the number of samples.

     

    Note: If you need connectivity to the outside world to import data into the Hadoop cluster, you can define external network connectivity. Refer to Installing and Configuring OpenStack in Oracle Solaris 11.2 for more information.

     

    Summary

     

    In this article, we saw how we can leverage Oracle Solaris technologies such as Oracle OpenStack for Oracle Solaris, Oracle Solaris Zones, and the Unified Archive feature of Oracle Solaris 11.2 to build a multinode Hadoop 2.6 cluster. Notice that the cluster in this example is a virtual cluster utilizing only one physical machine. Virtual clusters allow for efficient vertical scaling, which in turn promotes optimized resource utilization.

     

    Acknowledgments

     

    The authors would like to thank Girish Moodalbail, Debabrata Sarkar, and Glynn Foster for their contributions to this article.

     

    See Also

     

    See the OpenStack on Oracle Solaris Technology Spotlight web page.

     

    Also see these additional resources:

     

     

    Also see these additional publications by Orgad Kimchi:

     

     

    About the Authors

     

    Ekine Akuiyibo is a software engineer in the DPA Technology Office at Oracle, where he works on big data and cloud computing technologies. His current focus is investigating algorithms and implementations for machine learning at scale and optimized resource allocation in cloud computing.

     

    Orgad Kimchi is a principal software engineer on the ISV Engineering team at Oracle. For seven years he has specialized in virtualization, big data, and cloud computing technologies.

     

    Appendix—OpenStack hadoop_setup.sh Configuration Script

     

    #!/usr/bin/ksh

    beadm create before_hadoop_setup

    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

    groupadd -g 200 hadoop

    useradd -u 100 -m -g hadoop hdfs

    echo "Enter the password for the hdfs user"

    passwd hdfs

    useradd -u 101 -m -g hadoop yarn

    echo "Enter the password for the yarn user"

    passwd yarn

    useradd -u 102 -m -g hadoop mapred

    echo "Enter the password for the mapred user"

    passwd mapred

    useradd -m -u 1000 bob

    echo "Enter the password for the bob user"

    passwd bob

    cp /var/tmp/hadoop-2.6.0.tar.gz /usr/local

    (cd /usr/local ; tar -xfz /usr/local/hadoop-2.6.0.tar.gz)

    ln -s /usr/local/hadoop-2.6.0 /usr/local/hadoop

    chown -R root:hadoop /usr/local/hadoop-2.6.0

    chmod -R 755 /usr/local/hadoop-2.6.0

     

    echo "export JAVA_HOME=/usr/java" >> $HADOOP_CONF_DIR/hadoop-env.sh

    echo "export HADOOP_LOG_DIR=/var/log/hadoop/hdfs" >> $HADOOP_CONF_DIR/hadoop-env.sh

    cat << EOF > $HADOOP_CONF_DIR/yarn-env.sh

    export JAVA_HOME=/usr/java

    export YARN_LOG_DIR=/var/log/hadoop/yarn

    export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop

    export HADOOP_HOME=/usr/local/hadoop

    export HADOOP_MAPRED_HOME=/usr/local/hadoop

    export HADOOP_COMMON_HOME=/usr/local/hadoop

    export HADOOP_HDFS_HOME=/usr/local/hadoop

    export YARN_HOME=/usr/local/hadoop

    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

    export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop

    EOF

     

    cat << EOF > $HADOOP_CONF_DIR/mapred-env.sh

    JAVA_HOME=/usr/java

    HADOOP_MAPRED_LOG_DIR=/var/log/hadoop/mapred

    HADOOP_MAPRED_IDENT_STRING=mapred


    EOF

     

    cat << EOF > $HADOOP_CONF_DIR/slaves

    data-node1

    data-node2

    data-node3

    EOF

     

    cat << EOF > $HADOOP_CONF_DIR/core-site.xml

    <?xml version="1.0"?>

    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

      <property>

        <name>fs.defaultFS</name>

        <value>hdfs://name-node1</value>

      </property>

    </configuration>

    EOF

     

    cat << EOF > $HADOOP_CONF_DIR/hdfs-site.xml

    <?xml version="1.0"?>

    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>

        <name>dfs.namenode.secondary.http-address</name>

        <value>name-node2:50090</value>

    </property>

      <property>

        <name>dfs.datanode.data.dir</name>

        <value>/var/data/1/dfs/dn</value>

      </property>

      <property>

        <name>dfs.namenode.name.dir</name>

        <value>/var/data/1/dfs/nn</value>

      </property>

      <property>

        <name>dfs.replication</name>

        <value>3</value>

      </property>

      <property>

        <name>dfs.permissions.superusergroup</name>

        <value>hadoop</value>

      </property>

    </configuration>

    EOF

     

    cat << EOF > $HADOOP_CONF_DIR/mapred-site.xml

    <?xml version="1.0"?>

    <configuration>

      <property>

        <name>mapreduce.framework.name</name>

        <value>yarn</value>

      </property>

      <property>

        <name>mapreduce.jobhistory.address</name>

        <value>resource-manager:10020</value>

      </property>

      <property>

        <name>mapreduce.jobhistory.webapp.address</name>

        <value>resource-manager:19888</value>

      </property>

      <property>

        <name>yarn.app.mapreduce.am.staging-dir</name>

        <value>/user</value>

      </property>

    </configuration>

    EOF

     

     

    cat << EOF > $HADOOP_CONF_DIR/yarn-site.xml

    <?xml version="1.0"?>

    <configuration>

    <!-- Site specific YARN configuration properties -->

      <property>

        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

      </property>

      <property>

        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

        <value>org.apache.hadoop.mapred.ShuffleHandler</value>

      </property>

      <property>

        <name>yarn.resourcemanager.hostname</name>

        <value>resource-manager</value>

      </property>

      <property>

        <name>yarn.nodemanager.local-dirs</name>

        <value>file:///var/data/1/yarn/local</value>

      </property>

      <property>

        <name>yarn.nodemanager.log-dirs</name>

        <value>file:///var/data/1/yarn/logs</value>

      </property>

      <property>

        <name>yarn.log.aggregation.enable</name>

        <value>true</value>

      </property>

      <property>

        <description>Where to aggregate logs</description>

        <name>yarn.nodemanager.remote-app-log-dir</name>

        <value>hdfs://name-node1/var/log/hadoop-yarn/apps</value>

      </property>

    </configuration>

    EOF

     

    pkg install --accept jdk-7

    pkg install  pkg://solaris/network/ssh

    mkdir -p /var/log/hadoop/yarn

    chown yarn:hadoop /var/log/hadoop/yarn

    mkdir -p /var/log/hadoop/hdfs

    chown hdfs:hadoop /var/log/hadoop/hdfs

    mkdir -p /var/log/hadoop/mapred

    chown mapred:hadoop /var/log/hadoop/mapred

    mkdir -p /var/data/1/dfs/nn

    chmod 700 /var/data/1/dfs/nn

    chown -R hdfs:hadoop /var/data/1/dfs/nn

    mkdir -p /var/data/1/dfs/dn

    chown -R hdfs:hadoop /var/data/1/dfs/dn

    mkdir -p /var/data/1/yarn/local

    mkdir -p /var/data/1/yarn/logs

    chown -R yarn:hadoop /var/data/1/yarn/local

    chown -R yarn:hadoop /var/data/1/yarn/logs

    mkdir -p /var/hadoop/run/yarn

    chown yarn:hadoop /var/hadoop/run/yarn

    mkdir -p /var/hadoop/run/hdfs

    chown hdfs:hadoop /var/hadoop/run/hdfs

    mkdir -p /var/hadoop/run/mapred

    chown mapred:hadoop /var/hadoop/run/mapred

     

    cat << EOF >> /etc/profile

    # Set JAVA_HOME

    export JAVA_HOME=/usr/java

    # Add Hadoop bin/ directory to PATH

    export PATH=\$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin

    export HADOOP_HOME=/usr/local/hadoop

    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

    EOF

     

    echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /export/home/hdfs/.profile

    echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /export/home/yarn/.profile

    echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /export/home/mapred/.profile

    echo 'export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin' >> /export/home/bob/.profile

     

    cat << EOF >> /etc/hosts

    192.168.1.2 name-node1

    192.168.1.3 name-node2

    192.168.1.4 resource-manager

    192.168.1.5 data-node1

    192.168.1.6 data-node2

    192.168.1.7 data-node3

    EOF

    su - hdfs -c 'ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa'

    su - hdfs -c 'cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys'

    su - yarn -c 'ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa'

    su - yarn -c 'cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys'

     

    su - mapred -c 'ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa'

    su - mapred -c 'cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys'

     

     

    Revision 1.0, 07/07/2015

     
