This article covers installing Apache Spark and Apache Cassandra on Ubuntu 14.04. Spark is designed to perform both batch processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. Together, Spark and Cassandra offer a powerful solution for data processing.
STEP 1 INSTALL APACHE SPARK:
First, set up some prerequisites like NTP and Java.
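A minimal sketch of those prerequisites, assuming the stock Ubuntu 14.04 packages (OpenJDK 7 is shown here; Oracle Java works too):

sudo apt-get update
sudo apt-get install ntp
sudo apt-get install openjdk-7-jdk
java -version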
Second, download and install Apache Spark.
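A sketch of a binary install that mirrors the Cassandra layout used later in this article; the mirror URL and the 1.2.0/Hadoop 2.4 build are assumptions, so substitute whatever release you need:

cd ~
mkdir spark
cd spark
wget http://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
gunzip -c spark-1.2.0-bin-hadoop2.4.tgz | tar -xvf -
sudo mv spark-1.2.0-bin-hadoop2.4 /usr/local/
cd /usr/local
sudo ln -s spark-1.2.0-bin-hadoop2.4 spark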
Third, create a spark user with the proper privileges and SSH keys.
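A sketch that mirrors the Cassandra user setup in STEP 2; the passwordless SSH key is what lets the master log into its workers when start-slaves.sh runs (the key type and paths are assumptions):

sudo addgroup spark
sudo useradd -g spark spark
sudo mkdir /home/spark
sudo chown spark:spark /home/spark
sudo chown -R spark:spark /usr/local/spark/
sudo su spark
cd ~
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys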
Fourth, set up some Apache Spark working directories with the proper user permissions.
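By default the standalone daemons write scratch data to SPARK_HOME/work and logs to SPARK_HOME/logs, so a sketch that pre-creates those with the right owner (the paths are Spark defaults, but verify them against your install):

sudo mkdir -p /usr/local/spark/work /usr/local/spark/logs
sudo chown -R spark:spark /usr/local/spark/work /usr/local/spark/logs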
Fifth, let’s do a quick test.
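One quick sanity check is the SparkPi example that ships with the binary distribution; the trailing argument is the number of partitions, and a successful run prints a line like "Pi is roughly 3.14...":

sudo su spark
/usr/local/spark/bin/run-example SparkPi 10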
Now let’s adjust some Spark configuration files.
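A sketch built from the templates Spark ships in its conf directory; YOUR-IP and the worker list are placeholders to change for your cluster:

cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
Add: export SPARK_MASTER_IP=YOUR-IP
cp slaves.template slaves
vim slaves
Add: YOUR-WORKER-IP (one worker hostname or IP per line)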
Time to start up Apache Spark.
cd /usr/local/spark/sbin
./start-master.sh
./start-slaves.sh

Note: to stop the processes, run ./stop-slaves.sh and then ./stop-master.sh.
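To confirm everything came up, jps should show a Master and a Worker process, and the standalone master serves a web UI on port 8080 by default (both are stock Spark behaviors, but verify against your install):

jps

You can also browse to http://YOUR-IP:8080 to see the master’s status page.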
STEP 2 INSTALL APACHE CASSANDRA:
First, let’s install the Cassandra binaries.
Login as a user with sudo:

cd ~
mkdir cassandra
cd cassandra
wget http://www.eng.lsu.edu/mirrors/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz
gunzip -c apache-cassandra-2.1.2-bin.tar.gz | tar -xvf -
sudo mv apache-cassandra-2.1.2 /usr/local/
cd /usr/local
sudo ln -s apache-cassandra-2.1.2 cassandra
sudo mkdir /var/lib/cassandra
sudo mkdir /var/log/cassandra
Second, let’s create a Cassandra user.
sudo addgroup cassandra
sudo useradd -g cassandra cassandra
sudo mkdir /home/cassandra
sudo chown cassandra:cassandra /home/cassandra
sudo chown -R cassandra:cassandra /usr/local/cassandra/
If you don’t want to listen on 127.0.0.1, change that in the cassandra.yaml file.
*If you’re using AWS, this will be your internal IP.
cd /usr/local/cassandra/conf/
vim cassandra.yaml

*Change the sections below to the IP you want to listen on:
rpc_address: YOUR-IP
listen_address: YOUR-IP
- seeds: "YOUR-IP"
Now let’s start up Apache Cassandra.
sudo /usr/local/cassandra/bin/cassandra
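To verify the node came up, nodetool ships in the same bin directory and should report the node as UN (Up/Normal):

/usr/local/cassandra/bin/nodetool status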
STEP 3 TEST SPARK AND CASSANDRA:
Reference: Koeninger’s spark-cassandra-example project on GitHub.
sudo su spark
cd ~
mkdir cassandra-spark-test
cd cassandra-spark-test
git clone https://github.com/koeninger/spark-cassandra-example.git
sudo chown -R spark:spark spark-cassandra-example
cd spark-cassandra-example
sudo vim /etc/apt/sources.list.d/sbt.list
Add: deb http://dl.bintray.com/sbt/debian /
sudo apt-get update
sudo apt-get install sbt
sbt test
vim cassandra-example.conf
spark.cassandra.connection.host YOUR-CASSANDRA-IP <-- change this
sbt assembly
/usr/local/spark/bin/spark-submit --properties-file cassandra-example.conf --class org.koeninger.BroadcastJoinExample target/scala-2.10/cassandra-example-assembly-0.1-SNAPSHOT.jar
STEP 4 BUILD YOUR OWN SAMPLE SCALA PROJECT THAT WORKS WITH SPARK:
Note: this particular example does not use Cassandra, but it gives you an idea of how to put a project together.
sudo su spark
cd ~
mkdir Sample-Spark-Scala-Project
cd Sample-Spark-Scala-Project
mkdir -p src/main/scala
Create the sbt build file:
vim sampleproject.sbt

name := "Sample Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
Write your Scala code. This particular example parses the /usr/local/spark/README.md document that ships with Spark and counts the lines containing the letter "a" and the lines containing the letter "b".
vim src/main/scala/SampleProject.scala

/* SampleProject.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SampleProject {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // File to parse
    val conf = new SparkConf().setAppName("Sample Project")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache() // Read the file as an RDD with 2 partitions and cache it
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
sbt package
/usr/local/spark/bin/spark-submit --class "SampleProject" --master local[4] target/scala-2.10/sample-project_2.10-1.0.jar
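If the job runs successfully, the application output should include a line like "Lines with a: <count>, Lines with b: <count>" mixed in with Spark’s INFO logging.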
Hopefully this helps you get the groundwork set up to use Spark and Cassandra.