by Victor Castillo, Kevin Klapak, Pedro Lay, and Suman Somasundar
This article provides instructions for installing Apache Spark on Oracle Solaris.
In today's real-time business environments, the rapid growth of big data is driving new workloads of compute-intensive analytic applications that require in-memory processing. Systems must analyze data quickly, while at the same time being able to handle sophisticated searches and more-complex machine-learning algorithms over large data sets.
Oracle's new SPARC S7 processor–based servers include the Data Analytics Accelerator (DAX) capability, which provides hardware acceleration for a variety of big data analytic workloads. Handing this work to DAX reduces data transfers and leaves the CPU cores free to focus on other tasks.
Apache Spark is a popular open source cluster computing framework from the Apache Software Foundation that allows user programs to load data into a cluster's memory and then query the memory repeatedly. Apache Spark is well-suited to machine-learning algorithms and iterative analytic processes.
Apache Spark consists of Spark Core, which provides the basic functionality of Apache Spark, and a set of libraries (including MLlib, Spark SQL, Spark Streaming, and GraphX) that enable the processing of a wider range of data and perform complex analytics. Spark Core includes components for task scheduling, memory management, fault recovery, and so on that enable the quick development of parallel applications. Apache Spark applications can run with the standalone scheduler that comes with Spark Core or with other resource managers, such as Hadoop YARN or Apache Mesos.
Apache Spark deployed on SPARC S7 processor–based servers with DAX capability through Open DAX APIs provides unique advantages for scale-out and addresses the challenge of larger and larger amounts of data. It offers unprecedented efficiency for processing a variety of complex analytical workloads, such as complex decision trees, K-means clustering, outlier detection, K-Nearest Neighbor pattern recognition, and so on.
Regardless of how much data there is or how complex the algorithm is, it is easy to get started with Apache Spark on Oracle Solaris by following the installation steps in the next section.
Installing Apache Spark
Installing Apache Spark on a SPARC S7 processor–based server follows the same steps as those documented at http://spark.apache.org/docs/latest/building-spark.html for a standard Spark build; however, the environment has to be properly set up in order for the installation to complete correctly. The steps to prepare the environment and install Apache Spark are as follows:
1. Verify that Java is installed; it should be at least version 7 (which will show as version 1.7):
# java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
2. Download Apache Maven and Spark from their respective websites. The versions current at the time of writing are apache-maven-3.3.9 and spark-1.6.1.
3. Move the downloaded archives to /opt and extract them by running the following commands:
# mv apache-maven-3.3.9-bin.tar.gz /opt
# mv spark-1.6.1.tgz /opt
# cd /opt
# tar -xvf apache-maven-3.3.9-bin.tar.gz
# tar -xvf spark-1.6.1.tgz
4. If you are behind a corporate firewall, set the proxy host for Maven in the apache-maven-3.3.9/conf/settings.xml file so that Maven can download JAR files:
<host>"your proxy direction here"</host>
<host>"your proxy direction here"</host>
5. Set up the Oracle Solaris environment for Spark:
- Run the following commands:
# vi .profile
- Insert the following at the end of the file and then save the file:
# . .profile
- Change the Spark compression library from Snappy (the default) to LZ4 by adding
spark.io.compression.codec lz4 to the
6. As recommended in the standard Spark build instructions, increase the memory available for Maven:
# export MAVEN_OPTS="-Xmx5g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
7. Build Apache Spark using Apache Maven by running the following commands:
# cd /opt/spark-1.6.1
# build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
8. Verify the Spark installation was successful by executing the following command to bring up the Spark shell:
If Spark is installed successfully, the following output will appear:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
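The Java version check in step 1 can also be automated. The following is a minimal Python sketch, not part of the original procedure: the helper names parse_java_version and meets_minimum are hypothetical, and the sketch assumes the version line has the form shown in step 1 (java version "1.8.0_92"). It only parses text that has already been captured from the java -version command.

```python
import re

def parse_java_version(version_output):
    """Return (major, minor) parsed from `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor

def meets_minimum(version_output, minimum=(1, 7)):
    """True if the reported version is at least `minimum` (Java 7 shows as 1.7)."""
    return parse_java_version(version_output) >= minimum

sample = 'java version "1.8.0_92"'
print(parse_java_version(sample))  # (1, 8)
print(meets_minimum(sample))       # True
```

Comparing tuples keeps the check correct for both the 1.x numbering used here and for versions that report a single leading number.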
A Simple Spark Example Running Standalone
One important concept in Spark is the Resilient Distributed Dataset (RDD): a data set that is split into partitions, which are stored in memory on the different nodes of the cluster.
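To make the partitioning idea concrete, here is a plain-Python sketch that does not use Spark at all. It splits a list into contiguous slices, reduces each slice independently, and then combines the partial results, which is roughly what happens when a reduce runs over a partitioned data set. The function names and the slicing scheme are assumptions chosen for illustration, not Spark's actual implementation, and the sketch assumes a non-empty input.

```python
from functools import reduce

def split_into_partitions(values, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(values)
    return [values[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def partitioned_sum(values, num_partitions):
    """Sum each partition on its own, then combine the partial sums."""
    partitions = split_into_partitions(values, num_partitions)
    # Empty partitions are skipped because they contribute nothing.
    partial_sums = [reduce(lambda a, b: a + b, p) for p in partitions if p]
    return reduce(lambda a, b: a + b, partial_sums)

numbers = list(range(1, 1001))
parts = split_into_partitions(numbers, 4)
print([len(p) for p in parts])      # [250, 250, 250, 250]
print(partitioned_sum(numbers, 4))  # 500500
```

Because addition is associative, the combined result does not depend on how many partitions are used.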
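The following Scala program, executed from a Spark shell, creates an RDD with a set of 1,000 numbers and calculates the mean of the set of numbers. Because the sum and the count are both integer values, the division truncates the result to 500; the exact mean is 500.5.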
scala> val disdata=sc.parallelize(1 to 1000)
disdata: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val sum = disdata.reduce(_+_)
sum: Int = 500500
scala> val count = disdata.count()
count: Long = 1000
scala> val avg = sum / count
avg: Long = 500
The program above can be written in a Python file and executed using spark-submit, as follows:
1. Run the following command:
# vi Mean.py
2. Copy the following lines of code to Mean.py:
import sys
from pyspark import SparkContext, SparkConf

# Expect exactly two arguments: the first and last values of the range.
if len(sys.argv) != 3:
    print("\nUsage: Mean.py <startValue> <endValue>\n")
    sys.exit(1)

conf = SparkConf().setAppName("Mean")
sc = SparkContext(conf=conf)
start = int(sys.argv[1])
end = int(sys.argv[2])
# Distribute the inclusive range start..end as an RDD.
data = sc.parallelize(range(start, end + 1))
total = data.reduce(lambda a, b: a + b)
count = data.count()
# Adding 0.0 forces floating-point division.
mean = (total + 0.0) / count
print("\nMean of values from " + str(start) + " to " + str(end) + " : " + str(mean) + "\n")
3. Run the program using the following command:
# spark-submit --master local <path to Mean.py> 1 1000
For example:
# spark-submit --master local /Workspace/Mean.py 1 1000
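As a sanity check on the value the program prints, the mean of a run of consecutive integers has a closed form: (start + end) / 2. The following plain-Python sketch, which runs without Spark, compares that formula with a brute-force calculation. The function names expected_mean and brute_force_mean are hypothetical and are not part of Mean.py.

```python
def expected_mean(start, end):
    """Mean of the consecutive integers start..end inclusive: (start + end) / 2."""
    if end < start:
        raise ValueError("end must be >= start")
    return (start + end) / 2.0

def brute_force_mean(start, end):
    """Same mean computed the long way, mirroring the sum/count approach."""
    values = range(start, end + 1)
    return (sum(values) + 0.0) / len(values)

print(expected_mean(1, 1000))     # 500.5
print(brute_force_mean(1, 1000))  # 500.5
```

For the arguments 1 and 1000 used above, both functions give 500.5, which is the value the Spark job should report.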
Examples of Applications Using DAX
To learn more about the SPARC S7 processor's DAX capability, go to the Developer Access to Oracle's Software in Silicon Technology web page (requires registration). The page provides sample programs that use Apache Spark and Open DAX APIs to build analytic cubes and to run K-Nearest Neighbors classification, an algorithm that is widely used in industry applications such as movie recommenders, data center analysis, and stock portfolio optimization.
Once you register at that site, you will have access to the code for these sample programs and you will be able to create your own DAX Developer VM.
About the Authors
Victor Castillo is a senior sales consultant with over eight years of field experience working with customer IT architectures from various industries. He is currently a member of Oracle's Elite Engineering Exchange (EEE) program, a focused global group of elite sales consultants. Through the program, members exchange information and experiences directly with Oracle's systems engineers in order to keep product development and business needs aligned. Currently, he is leading an EEE working group focused on the big data ecosystem on SPARC platforms.
Kevin Klapak is a product manager on the Oracle Optimized Solutions team. He joined the team in January 2015 after completing his master's degree at Carnegie Mellon University. He has a background in computer science and over five years of IT experience. Since joining Oracle, he has been working on database migration, Apache Spark/big data analytics, and systems security.
Pedro Lay is a senior solution manager in the Systems Applications Solutions group at Oracle. His current focus is on performance, scalability and security of Oracle E-Business Suite, data warehousing and big data on SPARC servers, and Oracle engineered systems. He has over 20 years of industry experience covering application development, database and systems administration, and information technology solutions from architecture design to deployment.
Suman Somasundar graduated from Cornell University with a master's degree in computer science. He joined Oracle in March 2014 and started working on various big data technologies, initially on the Apache Mahout machine-learning library. His main focus for the last two years has been on optimizing open source big data technologies for Oracle Solaris and SPARC. More recently, he has been working to make Apache Spark and the MLlib component of Apache Spark use DAX on Oracle's SPARC M7 processor.