How to set up an Apache Spark cluster on your local machine

Over the past few days I grew interested in Apache Spark and thought I'd play around with it a little. If you haven't heard of it, go and take a look. It's a pretty cool project that claims to be around 40x faster than Hadoop in some situations. That increase in performance comes from leveraging in-memory computing. I won't go into the details of Apache Spark here; if you want a better look at Spark, just check out their web site - Apache Spark.

In this post we will go through the steps to set up an Apache Spark cluster on your local machine, with one master node and two worker nodes. If you are completely new to Spark, I recommend going through First Steps with Spark - Screencast #1. It will get you started with Spark and show you how to install Scala and the other things you need.

We will use the launch scripts provided by Spark to make our lives easier. First of all, there are a couple of configuration files we need to set up.

conf/slaves

When using the launch scripts, this file identifies the host names of the machines that the slave nodes will run on. All you have to do is list the host names, one per line. Since we are setting everything up on our own machine, we only need to add "localhost" to this file.
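As a minimal sketch, assuming the commands are run from the top of the Spark directory (that location is my assumption, adjust it to your install), the file could be written like this:

```shell
# Assumes the current directory is the Spark installation directory.
mkdir -p conf                    # does nothing if conf/ already exists
echo "localhost" > conf/slaves   # one host name per line; overwrites any existing list
cat conf/slaves                  # prints: localhost
```

Note that the redirect replaces whatever was in the file before, which is what we want for a single-machine setup.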

conf/spark-env.sh

Spark has a set of variables that you can set to override the default values. This is done by putting the values in the "spark-env.sh" file. A template is available at "conf/spark-env.sh.template", which you can copy to create the spark-env.sh file. The variables that can be added are described in the template itself. We will add the following lines to the file.

export SCALA_HOME=/home/pulasthi/work/spark/scala-2.9.3
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/home/pulasthi/work/sparkdata

Here SPARK_WORKER_MEMORY specifies the amount of memory you want to allocate to a worker node. If this value is not given, the default is the total memory available minus 1 GB. Since we are running everything on our local machine, we wouldn't want the slaves to use up all our memory. I am running on a machine with 8 GB of RAM, and since we are creating 2 slave nodes, we will give each of them 2 GB of RAM.

SPARK_WORKER_INSTANCES specifies the number of worker instances to run on the machine. Here it is set to 2 since we will only create 2 slave nodes.
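The memory budget implied by these two settings can be checked with a bit of shell arithmetic. This is just a sketch using the numbers from this post; the variable names are made up for the illustration and are not read by Spark.

```shell
# Figures from this post: an 8 GB machine, 2 workers, 2 GB each.
# These names are illustrative only; Spark does not read them.
TOTAL_GB=8
WORKER_INSTANCES=2
WORKER_MEMORY_GB=2

WORKERS_TOTAL_GB=$((WORKER_INSTANCES * WORKER_MEMORY_GB))
REMAINING_GB=$((TOTAL_GB - WORKERS_TOTAL_GB))
echo "Workers: ${WORKERS_TOTAL_GB} GB, left for everything else: ${REMAINING_GB} GB"
# prints: Workers: 4 GB, left for everything else: 4 GB
```

So the two workers together take half of the machine, which leaves room for the master, the shell and whatever else is running.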

SPARK_WORKER_DIR is the location where applications will run, and it will hold both logs and scratch space. Make sure the directory is writable by the application, that is, that its permissions are set properly.
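One way to create the directory and check that it is writable is sketched below. The /tmp/sparkdata fallback is only a placeholder I picked for the example; use the same path you set for SPARK_WORKER_DIR in spark-env.sh.

```shell
# Placeholder fallback path; use the value you set for SPARK_WORKER_DIR in conf/spark-env.sh.
WORKER_DIR="${SPARK_WORKER_DIR:-/tmp/sparkdata}"

mkdir -p "$WORKER_DIR"   # create the directory (and parents) if missing
if [ -w "$WORKER_DIR" ]; then
  echo "OK: $WORKER_DIR is writable"
else
  echo "Not writable: $WORKER_DIR (fix ownership or permissions)" >&2
fi
```

The check runs as the current user, so run it as the same user that will start the workers.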

With these configurations ready, we are good to go. Now let's start by running the master node.
Just execute the launch script for the master, "start-master.sh".


./bin/start-master.sh

Once the master is started, you should be able to access the web UI at http://localhost:8080.

Now you can proceed to start the slaves. This can be done by running the "start-slaves.sh" launch script.

Note: In order to start the slaves, the master needs to be able to access the slave machines through SSH. Since we are running everything on the same machine, your own machine must be accessible through SSH. To make sure you have an SSH server installed, run "which sshd". If you don't have it installed, install it with the following command.

sudo apt-get install openssh-server
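If you prefer a check that prints a clear message either way, something like the following sketch does the same job as "which sshd". The check_sshd function name is made up for this example.

```shell
# check_sshd is a hypothetical helper name; it reports whether an sshd binary is on the PATH.
check_sshd() {
  if command -v sshd >/dev/null 2>&1; then
    echo "sshd found at $(command -v sshd)"
  else
    echo "sshd not found; install openssh-server first"
  fi
}

check_sshd
# Manual follow-up once sshd is installed (prompts for a password unless keys are set up):
#   ssh localhost exit
```

This only tells you that the binary is on the PATH, not that the server is running, so the manual "ssh localhost" test is still worth doing.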

You will also need to set a password for root, since it will be requested when starting the slaves. If you do not have a root password set, use the following command to set one.

sudo passwd
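With SSH in place, the "start-slaves.sh" script mentioned earlier can be run. The sketch below assumes you are in the Spark directory and that the script sits under bin/ next to start-master.sh, as in the version used for this post; other Spark versions may lay this out differently.

```shell
# Assumes the current directory is the Spark installation directory and that
# start-slaves.sh lives under bin/, next to the start-master.sh used above.
SLAVES_SCRIPT=./bin/start-slaves.sh

if [ -x "$SLAVES_SCRIPT" ]; then
  "$SLAVES_SCRIPT"   # starts workers on every host listed in conf/slaves
else
  echo "$SLAVES_SCRIPT not found; cd into the Spark directory first" >&2
fi
```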


With the slaves successfully started, you now have a Spark cluster up and running. If everything went according to plan, the web UI for the master should show the two slave nodes.


Now let's connect to the cluster from the interactive shell by executing the following command.

MASTER=spark://IP:PORT ./spark-shell

You can find the IP and the PORT in the top left corner of the web UI for the master. Once you are connected, the web UI will show that there is an active task.
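If you left the master settings at their defaults, the URL is usually the machine's host name with port 7077. The sketch below builds it that way, but treat it as a guess under those assumptions: the web UI is the authoritative source, and the variable names here are only for illustration.

```shell
# Assumptions: the master runs on this machine and uses the default port 7077.
# If the web UI shows a different spark:// URL, use that one instead.
MASTER_HOST="$(hostname 2>/dev/null || echo localhost)"
MASTER_PORT=7077
MASTER_URL="spark://${MASTER_HOST}:${MASTER_PORT}"
echo "$MASTER_URL"

# Then start the shell the same way as above:
#   MASTER="$MASTER_URL" ./spark-shell
```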

I hope to write more posts about Spark in the future. If you want to learn a bit more about Spark, there is some great documentation on the Spark site itself here. Go and check it out.
