This article covers installing Apache Spark and Apache Cassandra on Ubuntu 14.04. Spark is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. Together, Spark and Cassandra offer a powerful solution for data processing.
STEP 1 INSTALL APACHE SPARK:
First, set up some prerequisites such as NTP, Java, and git.
Log in as a user with sudo privileges:
sudo apt-get update
sudo ntpdate pool.ntp.org
sudo apt-get install ntp
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
sudo apt-get install git
Second, download and build Apache Spark:
cd ~
mkdir spark
cd spark/
wget http://mirror.tcpdiag.net/apache/spark/spark-1.2.0/spark-1.2.0.tgz
gunzip -c spark-1.2.0.tgz | tar -xvf -
cd spark-1.2.0/
sudo sbt/sbt assembly
cd ..
sudo cp -Rp spark-1.2.0 /usr/local/
cd /usr/local/
sudo ln -s spark-1.2.0 spark
Third, create a spark user with the proper privileges and SSH keys.
sudo addgroup spark
sudo useradd -g spark spark
sudo adduser spark sudo
sudo mkdir /home/spark
sudo chown spark:spark /home/spark
Add the following line to the sudoers file (sudo visudo):
spark ALL=(ALL) NOPASSWD:ALL
sudo chown -R spark:spark /usr/local/spark/
sudo su spark
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
exit
Fourth, set up some Apache Spark working directories with the proper user permissions:
sudo mkdir -p /srv/spark/{logs,work,tmp,pids}
sudo chown -R spark:spark /srv/spark
sudo chmod 4755 /srv/spark/tmp
Fifth, let's run a quick test:
cd /usr/local/spark
bin/run-example SparkPi 10
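For context, the SparkPi example estimates Pi by scattering random points in a square and counting how many land inside the unit circle. The snippet below is a minimal sketch of the same idea (the object name PiEstimate and the sample count are just illustrative, not part of the Spark distribution); you could package and submit it exactly like the project in STEP 4.
/* PiEstimate.scala -- rough sketch of what the SparkPi example computes */
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object PiEstimate {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Pi Estimate")
    val sc = new SparkContext(conf)
    val samples = 100000
    // Count the random points that fall inside the unit circle
    val inside = sc.parallelize(1 to samples).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / samples)
    sc.stop()
  }
}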
Now let's adjust some Spark configuration files:
cd /usr/local/spark/conf/
cp -p spark-env.sh.template spark-env.sh
vim spark-env.sh
========================================================
SPARK-ENV.SH (ADD BELOW)
*Make sure to change SPARK_PUBLIC_DNS="PUBLIC IP" below to your public IP
export SPARK_WORKER_CORES="2"
export SPARK_WORKER_MEMORY="1g"
export SPARK_DRIVER_MEMORY="1g"
export SPARK_REPL_MEM="2g"
export SPARK_WORKER_PORT=9000
export SPARK_CONF_DIR="/usr/local/spark/conf"
export SPARK_TMP_DIR="/srv/spark/tmp"
export SPARK_PID_DIR="/srv/spark/pids"
export SPARK_LOG_DIR="/srv/spark/logs"
export SPARK_WORKER_DIR="/srv/spark/work"
export SPARK_LOCAL_DIRS="/srv/spark/tmp"
export SPARK_COMMON_OPTS="$SPARK_COMMON_OPTS -Dspark.kryoserializer.buffer.mb=32 "
LOG4J="-Dlog4j.configuration=file://$SPARK_CONF_DIR/log4j.properties"
export SPARK_MASTER_OPTS=" $LOG4J -Dspark.log.file=/srv/spark/logs/master.log "
export SPARK_WORKER_OPTS=" $LOG4J -Dspark.log.file=/srv/spark/logs/worker.log "
export SPARK_EXECUTOR_OPTS=" $LOG4J -Djava.io.tmpdir=/srv/spark/tmp/executor "
export SPARK_REPL_OPTS=" -Djava.io.tmpdir=/srv/spark/tmp/repl/\$USER "
export SPARK_APP_OPTS=" -Djava.io.tmpdir=/srv/spark/tmp/app/\$USER "
export PYSPARK_PYTHON="/usr/bin/python"
SPARK_PUBLIC_DNS="PUBLIC IP"
export SPARK_WORKER_INSTANCES=2
=========================================================
cp -p spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
=========================================================
SPARK-DEFAULTS.CONF (ADD BELOW)
*Change hostnamepub to your Spark master's hostname or IP
spark.master spark://hostnamepub:7077
spark.executor.memory 512m
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
================================================================
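Since spark-defaults.conf enables the Kryo serializer, you can also register the classes your jobs serialize most often, which Kryo handles more efficiently than unregistered classes. Below is a hedged sketch; MyRecord is a made-up application class used only for illustration, and registerKryoClasses is available as of Spark 1.2.
import org.apache.spark.SparkConf

// Hypothetical class your job ships between executors
case class MyRecord(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("Kryo Registration Example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))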
Time to start up Apache Spark:
cd /usr/local/spark/sbin
./start-master.sh
./start-slaves.sh
Note: to stop the processes, run:
./stop-slaves.sh
./stop-master.sh
STEP 2 INSTALL APACHE CASSANDRA:
First, let's install the Cassandra binaries.
Log in as a user with sudo privileges:
cd ~
mkdir cassandra
cd cassandra
wget http://www.eng.lsu.edu/mirrors/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz
gunzip -c apache-cassandra-2.1.2-bin.tar.gz | tar -xvf -
sudo mv apache-cassandra-2.1.2 /usr/local/
cd /usr/local
sudo ln -s apache-cassandra-2.1.2 cassandra
sudo mkdir /var/lib/cassandra
sudo mkdir /var/log/cassandra
Second, let's create a Cassandra user:
sudo addgroup cassandra
sudo useradd -g cassandra cassandra
sudo mkdir /home/cassandra
sudo chown cassandra:cassandra /home/cassandra
sudo chown -R cassandra:cassandra /usr/local/cassandra/
If you don't want Cassandra to listen on 127.0.0.1, change that in the cassandra.yaml file.
*If you're using AWS, this will be your internal IP
cd /usr/local/cassandra/conf/
vim cassandra.yaml
*Change the sections below to the IP you want to listen on
rpc_address: YOUR-IP
listen_address: YOUR-IP
- seeds: "YOUR-IP"
Now let's start up Apache Cassandra:
sudo /usr/local/cassandra/bin/cassandra
STEP 3 TEST SPARK AND CASSANDRA:
Reference: the koeninger/spark-cassandra-example project on GitHub.
sudo su spark
cd ~
mkdir cassandra-spark-test
cd cassandra-spark-test
git clone https://github.com/koeninger/spark-cassandra-example.git
sudo chown -R spark:spark spark-cassandra-example
cd spark-cassandra-example
sudo vim /etc/apt/sources.list.d/sbt.list
Add: deb http://dl.bintray.com/sbt/debian /
sudo apt-get update
sudo apt-get install sbt
sbt test
vim cassandra-example.conf
spark.cassandra.connection.host YOUR-CASSANDRA-IP <-- change this to your Cassandra IP
sbt assembly
/usr/local/spark/bin/spark-submit --properties-file cassandra-example.conf --class org.koeninger.BroadcastJoinExample target/scala-2.10/cassandra-example-assembly-0.1-SNAPSHOT.jar
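For your own code, the DataStax spark-cassandra-connector that this example project pulls in lets you read and write Cassandra tables directly as RDDs. Here is a rough sketch, assuming the connector is on your classpath and that a keyspace and table (test.kv with columns key and value, created beforehand in cqlsh) already exist; adjust the names and the Cassandra IP to your setup.
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // adds cassandraTable and saveToCassandra

object CassandraSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Cassandra Sketch")
      .set("spark.cassandra.connection.host", "YOUR-CASSANDRA-IP")
    val sc = new SparkContext(conf)

    // Write a couple of rows, then read the table back and count it
    sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
      .saveToCassandra("test", "kv", SomeColumns("key", "value"))
    val rows = sc.cassandraTable("test", "kv")
    println("Rows in test.kv: " + rows.count())
    sc.stop()
  }
}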
STEP 4 BUILD YOUR OWN SAMPLE SCALA PROJECT THAT WORKS WITH SPARK
Note: this particular example does not use Cassandra, but it gives you an idea of how to put a project together.
sudo su spark
cd ~
mkdir Sample-Spark-Scala-Project
cd Sample-Spark-Scala-Project
mkdir -p src/main/scala
Create the sbt build file:
vim sampleproject.sbt
name := "Sample Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
Write your Scala code. This particular example parses the Spark README at /usr/local/spark/README.md and counts the lines containing the letter "a" and the lines containing the letter "b".
vim src/main/scala/SampleProject.scala
/* SampleProject.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SampleProject {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // File to Parse
    val conf = new SparkConf().setAppName("Sample Project")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
sbt package
/usr/local/spark/bin/spark-submit --class "SampleProject" --master local[4] target/scala-2.10/sample-project_2.10-1.0.jar
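If you want to experiment before packaging a jar, the same logic can be run line by line in the Spark shell, which provides a ready-made SparkContext as sc. A quick sketch:
// Inside /usr/local/spark/bin/spark-shell (sc is already defined)
val logData = sc.textFile("/usr/local/spark/README.md").cache()
val numAs = logData.filter(_.contains("a")).count()
val numBs = logData.filter(_.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))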
Hopefully this helps you get the groundwork set up to use Spark and Cassandra.