How to install Apache Spark and Cassandra Stack on Ubuntu 14.04

February 16, 2015

This article will cover installing Apache Spark and Apache Cassandra on Ubuntu 14.04. Spark is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. Together, Spark and Cassandra offer a powerful solution for data processing.

STEP 1 INSTALL APACHE SPARK:

First, set up some prerequisites, such as installing NTP and Java.

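On Ubuntu 14.04 that might look like the following (a minimal sketch; OpenJDK 7 is an assumption, any Java 7 JDK will do):

sudo apt-get update
 
sudo apt-get install ntp
 
sudo apt-get install openjdk-7-jdk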

Second, download and install Apache Spark.

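The rest of this article assumes Spark lives at /usr/local/spark, mirroring the Cassandra layout in Step 2. A sketch of the install, assuming the Spark 1.2.0 pre-built binaries (the same version used in the sbt files below; pick a mirror near you):

cd ~
 
wget http://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
 
gunzip -c spark-1.2.0-bin-hadoop2.4.tgz | tar -xvf -
 
sudo mv spark-1.2.0-bin-hadoop2.4 /usr/local/
 
cd /usr/local
 
sudo ln -s spark-1.2.0-bin-hadoop2.4 spark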

Third, create a spark user with the proper privileges and SSH keys.

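A sketch of the user setup, mirroring the Cassandra user created in Step 2 (the passwordless key to localhost is an assumption, so the standalone scripts can SSH into the workers):

sudo addgroup spark
 
sudo useradd -g spark spark
 
sudo mkdir /home/spark
 
sudo chown spark:spark /home/spark
 
sudo chown -R spark:spark /usr/local/spark/
 
sudo su spark
 
ssh-keygen -t rsa
 
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys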

Fourth, set up some Apache Spark working directories with the proper user permissions.

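A sketch of the working directories (the paths are an assumption; point SPARK_LOG_DIR and SPARK_LOCAL_DIRS at them in spark-env.sh if you use different ones):

sudo mkdir /var/log/spark
 
sudo mkdir /var/run/spark
 
sudo chown -R spark:spark /var/log/spark /var/run/spark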

Fifth, let's do a quick test.

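A quick smoke test, assuming the layout above, is to run one of the examples bundled with Spark in local mode:

/usr/local/spark/bin/run-example SparkPi 10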

Now let's adjust some Spark configuration files.

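For the standalone cluster started below, a minimal single-node configuration might look like this (the values are assumptions, not part of the original steps):

cd /usr/local/spark/conf
 
cp spark-env.sh.template spark-env.sh
 
Add to spark-env.sh: SPARK_MASTER_IP=YOUR-IP
 
cp slaves.template slaves
 
Add to slaves: one worker hostname per line (localhost is fine for a single node)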

Time to start Apache Spark up

cd /usr/local/spark/sbin
 
./start-master.sh
 
./start-slaves.sh
 
Note: to stop the processes, do:
 
./stop-slaves.sh
 
./stop-master.sh

STEP 2 INSTALL APACHE CASSANDRA:

First, let's install the Cassandra binaries.

Login as a user with sudo:
 
cd ~
 
mkdir cassandra
 
cd cassandra
 
wget http://www.eng.lsu.edu/mirrors/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz
 
gunzip -c apache-cassandra-2.1.2-bin.tar.gz | tar -xvf -
 
sudo mv apache-cassandra-2.1.2 /usr/local/
 
cd /usr/local
 
sudo ln -s apache-cassandra-2.1.2 cassandra
 
sudo mkdir /var/lib/cassandra
 
sudo mkdir /var/log/cassandra

Second, let's create a Cassandra user.

sudo addgroup cassandra
sudo useradd -g cassandra cassandra
sudo mkdir /home/cassandra
sudo chown cassandra:cassandra /home/cassandra
sudo chown -R cassandra:cassandra /usr/local/cassandra/

If you don't want to listen on 127.0.0.1, let's change that in the cassandra.yaml file.
*If you're using AWS, this will be your internal IP.

cd /usr/local/cassandra/conf/
 
vim cassandra.yaml
*Change the sections below to the IP you want to listen on:
 
rpc_address: YOUR-IP
 
listen_address: YOUR-IP
 
- seeds: "YOUR-IP"

Now let's start up Apache Cassandra.

sudo /usr/local/cassandra/bin/cassandra
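To verify the node came up (a quick check, not in the original steps):

/usr/local/cassandra/bin/nodetool status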

STEP 3 TEST SPARK AND CASSANDRA:
Reference: koeninger's spark-cassandra-example on GitHub.

sudo su spark
 
cd ~
 
mkdir cassandra-spark-test
 
cd cassandra-spark-test
 
git clone https://github.com/koeninger/spark-cassandra-example.git
 
sudo chown -R spark:spark spark-cassandra-example
 
cd spark-cassandra-example
 
sudo vim /etc/apt/sources.list.d/sbt.list
 
Add: deb http://dl.bintray.com/sbt/debian /
 
sudo apt-get update
 
sudo apt-get install sbt
 
sbt test
 
vim cassandra-example.conf
 
spark.cassandra.connection.host YOUR-CASSANDRA-IP <-- change this to your Cassandra IP
 
sbt assembly
 
/usr/local/spark/bin/spark-submit --properties-file cassandra-example.conf --class org.koeninger.BroadcastJoinExample target/scala-2.10/cassandra-example-assembly-0.1-SNAPSHOT.jar

STEP 4 BUILD YOUR OWN SAMPLE SCALA PROJECT THAT WORKS WITH SPARK

Note: this particular example does not use Cassandra, but it gives you an idea of how to build a Spark project (see the note after the sbt file for adding Cassandra support).

sudo su spark
 
cd ~
 
mkdir Sample-Spark-Scala-Project
 
cd Sample-Spark-Scala-Project
 
mkdir -p src/main/scala

Create the sbt build file

vim sampleproject.sbt
 
name := "Sample Project"
 
version := "1.0"
 
scalaVersion := "2.10.4"
 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
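If you later want this project to talk to Cassandra like the Step 3 example, you would typically add the DataStax connector to the sbt file as well (a sketch; the 1.2.0 connector version is an assumption chosen to match Spark 1.2.0, so use a connector release matching your Spark version):

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.2.0"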

Write your Scala code. This particular example parses the Spark /usr/local/spark/README.md document and counts the lines containing the letter "a" and the lines containing the letter "b".

vim src/main/scala/SampleProject.scala
 
/* SampleProject.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
 
object SampleProject {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // File to Parse
    val conf = new SparkConf().setAppName("Sample Project")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Now build the jar and submit it to Spark:

sbt package
 
/usr/local/spark/bin/spark-submit --class "SampleProject" --master local[4] target/scala-2.10/sample-project_2.10-1.0.jar

Hopefully this helps you get the groundwork set up to use Spark and Cassandra.
