How to install Apache Spark and Cassandra Stack on Ubuntu 14.04

February 16, 2015

This article will cover installing Apache Spark and Apache Cassandra on Ubuntu 14.04. Spark is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. Together, Spark and Cassandra offer a powerful solution for data processing.

STEP 1 INSTALL APACHE SPARK:

First, set up some prerequisites such as NTP and Java.

Login as a user with sudo:

sudo apt-get update

sudo ntpdate pool.ntp.org

sudo apt-get install ntp

sudo apt-get install python-software-properties 

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update 

sudo apt-get install oracle-java7-installer

sudo apt-get install git 
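
To confirm the prerequisites installed cleanly, check the Java version the installer put on the PATH:

java -version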

Second, download and build Apache Spark:

cd ~

mkdir spark

cd spark/

wget http://mirror.tcpdiag.net/apache/spark/spark-1.2.0/spark-1.2.0.tgz

gunzip -c spark-1.2.0.tgz | tar -xvf -

cd spark-1.2.0/

sudo sbt/sbt assembly

cd ..

sudo cp -Rp spark-1.2.0 /usr/local/

cd /usr/local/

sudo ln -s spark-1.2.0 spark
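
As a quick sanity check, confirm the symlink points at the build you just copied in:

ls -l /usr/local/spark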

Third, create a spark user with the proper privileges and SSH keys:

sudo addgroup spark

sudo useradd -g spark spark

sudo adduser spark sudo

sudo mkdir /home/spark

sudo chown spark:spark /home/spark

Add the following line to the sudoers file (one way to do this safely is shown just below):

spark ALL=(ALL) NOPASSWD:ALL
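
One option for adding this without editing /etc/sudoers directly (an assumption, adjust to your own policy) is to drop the line into a file under /etc/sudoers.d, which Ubuntu includes by default:

echo 'spark ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/spark

sudo chmod 0440 /etc/sudoers.d/spark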

sudo chown -R spark:spark /usr/local/spark/

sudo su spark

ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

ssh localhost

exit

Fourth, set up the Apache Spark working directories with the proper user permissions:

sudo mkdir -p /srv/spark/{logs,work,tmp,pids}

sudo chown -R spark:spark /srv/spark

sudo chmod 4755 /srv/spark/tmp

Fifth, let’s do a quick test:

cd /usr/local/spark

bin/run-example SparkPi 10
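
If the build is healthy, the example should end with a line similar to the following (the exact value varies slightly from run to run):

Pi is roughly 3.14...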

Now let’s adjust some Spark configuration files:

cd /usr/local/spark/conf/

cp -p spark-env.sh.template spark-env.sh

vim spark-env.sh  

========================================================
SPARK-ENV.SH (ADD BELOW)
*Make sure to change SPARK_PUBLIC_DNS="PUBLIC IP" to your public IP

export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="1g"
export SPARK_DRIVER_MEMORY="1g"
export SPARK_REPL_MEM="2g"
export SPARK_WORKER_PORT=9000
export SPARK_CONF_DIR="/usr/local/spark/conf"
export SPARK_TMP_DIR="/srv/spark/tmp"
export SPARK_PID_DIR="/srv/spark/pids"
export SPARK_LOG_DIR="/srv/spark/logs"
export SPARK_WORKER_DIR="/srv/spark/work"
export SPARK_LOCAL_DIRS="/srv/spark/tmp"
export SPARK_COMMON_OPTS="$SPARK_COMMON_OPTS -Dspark.kryoserializer.buffer.mb=32 "
LOG4J="-Dlog4j.configuration=file://$SPARK_CONF_DIR/log4j.properties"
export SPARK_MASTER_OPTS=" $LOG4J -Dspark.log.file=/srv/spark/logs/master.log "
export SPARK_WORKER_OPTS=" $LOG4J -Dspark.log.file=/srv/spark/logs/worker.log "
export SPARK_EXECUTOR_OPTS=" $LOG4J -Djava.io.tmpdir=/srv/spark/tmp/executor "
export SPARK_REPL_OPTS=" -Djava.io.tmpdir=/srv/spark/tmp/repl/\$USER "
export SPARK_APP_OPTS=" -Djava.io.tmpdir=/srv/spark/tmp/app/\$USER "
export PYSPARK_PYTHON="/usr/bin/python"
SPARK_PUBLIC_DNS="PUBLIC IP"
export SPARK_WORKER_INSTANCES=2
=========================================================
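
The LOG4J line above points Spark at a log4j.properties file in the conf directory; Spark ships a template you can copy into place so that file actually exists:

cp -p log4j.properties.template log4j.properties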

cp -p spark-defaults.conf.template spark-defaults.conf

vim spark-defaults.conf

=========================================================
SPARK-DEFAULTS (ADD BELOW)

spark.master            spark://hostnamepub:7077
spark.executor.memory   512m
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer

================================================================
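
With spark.eventLog.enabled set to true, Spark writes application event logs to a directory that must already exist (by default /tmp/spark-events). If you want to keep everything under /srv/spark instead, one option is to also add a line like this to spark-defaults.conf:

spark.eventLog.dir      file:///srv/spark/logs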

Time to start up Apache Spark:

cd /usr/local/spark/sbin

./start-master.sh

./start-slaves.sh
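
To confirm the master and workers came up, check the running Java processes (you should see one Master plus the Worker processes) and browse to the master web UI, which listens on port 8080 of the public IP you set above:

jps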

Note: to stop the processes later, run:

./stop-slaves.sh

./stop-master.sh

STEP 2 INSTALL APACHE CASSANDRA:

First, let’s install the Cassandra binaries.

Login as a user with sudo:

cd ~

mkdir cassandra

cd cassandra

wget http://www.eng.lsu.edu/mirrors/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz

gunzip -c apache-cassandra-2.1.2-bin.tar.gz | tar -xvf -

sudo mv apache-cassandra-2.1.2 /usr/local/

cd /usr/local

sudo ln -s apache-cassandra-2.1.2 cassandra

sudo mkdir /var/lib/cassandra

sudo mkdir /var/log/cassandra

Second, let’s create a Cassandra user:

sudo addgroup cassandra

sudo useradd -g cassandra cassandra

sudo mkdir /home/cassandra

sudo chown cassandra:cassandra /home/cassandra

sudo chown -R cassandra:cassandra /usr/local/cassandra/
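
If you would rather run Cassandra as the cassandra user later instead of root, also give it ownership of the data and log directories created earlier (optional):

sudo chown -R cassandra:cassandra /var/lib/cassandra /var/log/cassandra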

If you don’t want to listen on 127.0.0.1, let’s change that in the cassandra.yaml file.
*If you’re using AWS, this will be your internal IP.

cd /usr/local/cassandra/conf/

vim cassandra.yaml
*Change the sections below to the IP you want to listen on

rpc_address: YOUR-IP

listen_address: YOUR-IP

- seeds: "YOUR-IP"
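
A quick way to double-check that the edits took:

grep -nE 'rpc_address|listen_address|seeds' cassandra.yaml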

Now let’s start up Apache Cassandra:

sudo /usr/local/cassandra/bin/cassandra
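
Once Cassandra has started, confirm the node is up with nodetool; the local node should report a status of UN (Up/Normal):

/usr/local/cassandra/bin/nodetool status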

STEP 3 TEST SPARK AND CASSANDRA:
Reference: Koeninger’s spark-cassandra-example on GitHub

sudo su spark

cd ~

mkdir cassandra-spark-test

cd cassandra-spark-test

git clone https://github.com/koeninger/spark-cassandra-example.git

sudo chown -R spark:spark spark-cassandra-example

cd spark-cassandra-example

sudo vim /etc/apt/sources.list.d/sbt.list

Add: deb http://dl.bintray.com/sbt/debian /

sudo apt-get update

sudo apt-get install sbt

sbt test

vim cassandra-example.conf

spark.cassandra.connection.host YOUR-CASSANDRA-IP <--change this

sbt assembly

/usr/local/spark/bin/spark-submit --properties-file cassandra-example.conf --class org.koeninger.BroadcastJoinExample target/scala-2.10/cassandra-example-assembly-0.1-SNAPSHOT.jar

STEP 4 BUILD YOUR OWN SAMPLE SCALA PROJECT THAT WORKS WITH SPARK

Note: this particular example does not use Cassandra, but it gives you an idea of how to build a project.

sudo su spark

cd ~

mkdir Sample-Spark-Scala-Project

cd Sample-Spark-Scala-Project

mkdir -p src/main/scala

Create the sbt build file:

vim sampleproject.sbt

name := "Sample Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

Write your Scala code. This particular example parses the Spark /usr/local/spark/README.md file and counts the lines containing the letter "a" and the lines containing the letter "b":

vim src/main/scala/SampleProject.scala

/* SampleProject.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SampleProject {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // File to Parse
    val conf = new SparkConf().setAppName("Sample Project")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

sbt package

/usr/local/spark/bin/spark-submit --class "SampleProject" --master local[4] target/scala-2.10/sample-project_2.10-1.0.jar
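
Buried in the Spark log output on stdout you should see the single summary line from the println above, in this format (the actual counts depend on the contents of README.md):

Lines with a: <count>, Lines with b: <count>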

Hope this helps get the groundwork set up to use Spark and Cassandra.
