How to set up an Apache Spark cluster on your local machine
Over the past few days I grew interested in Apache Spark and thought of playing around with it a little. If you haven't heard of it, go and take a look; it's a pretty cool project that claims to be around 40x faster than Hadoop in some situations. The increase in performance is gained by leveraging in-memory computing. I won't go into detail about Apache Spark here; if you want a better look at Spark, just check out its web site - Apache Spark.
In this post we will go through the steps to set up an Apache Spark cluster on your local machine. We will set up one master node and two worker nodes. If you are completely new to Spark, I recommend going through First Steps with Spark - Screencast #1; it will get you started with Spark and show you how to install Scala and the other things you need.
We will be using the launch scripts provided by Spark to make our lives easier. First of all, there are a couple of configuration files we need to set up.
conf/slaves
When using the launch scripts, this file identifies the host names of the machines that the slave nodes will run on. All you have to do is provide the host names of the machines, one per line. Since we are setting up everything on one machine, we only need to add "localhost" to this file.
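If you are in the Spark home directory, a quick way to create this file is the one-liner below.
# conf/slaves: one worker host name per line; for a local cluster, just localhost
echo "localhost" > conf/slaves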
conf/spark-env.sh
There is a set of variables you can set to override the default values; this is done by putting them in the "spark-env.sh" file. A template is available at "conf/spark-env.sh.template", and you can use it to create the spark-env.sh file. The variables that can be set are described in the template itself. We will add the following lines to the file.
export SCALA_HOME=/home/pulasthi/work/spark/scala-2.9.3
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/home/pulasthi/work/sparkdata
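If the file does not exist yet, one way to create it (assuming you are in the Spark home directory) is to copy the template and then append the lines above:
# create spark-env.sh from the bundled template, then add the settings shown above
cp conf/spark-env.sh.template conf/spark-env.sh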
Here SPARK_WORKER_MEMORY specifies the amount of memory you want to allocate to a worker node; if this value is not given, the default is the total memory available minus 1 GB. Since we are running everything on our local machine, we don't want the slaves to use up all our memory. I am running on a machine with 8 GB of RAM, and since we are creating 2 slave nodes we will give each of them 2 GB of RAM.
SPARK_WORKER_INSTANCES specifies the number of worker instances to run; here it is set to 2 since we want to create 2 slave nodes.
SPARK_WORKER_DIR is the directory in which applications will be run and which will contain both logs and scratch space. Make sure the directory can be written to, that is, the permissions are set properly.
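It can help to create this directory up front; the path below is simply the one used in the configuration above.
# create the worker directory ahead of time so the daemons can write logs and scratch data to it
mkdir -p /home/pulasthi/work/sparkdata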
Once these configurations are in place we are good to go. Now let's start by running the master node.
Just execute the launch script for the master, "start-master.sh":
./bin/start-master.sh
Once the master has started, you should be able to access the web UI at http://localhost:8080.
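If you prefer to check from the terminal, a quick sanity check (assuming the default web UI port of 8080) could look like this:
# report whether the master web UI responds on the default port
curl -sf http://localhost:8080 > /dev/null && echo "master web UI is up" || echo "master web UI is not reachable"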
Now you can proceed to start the slaves. This is done by running the "start-slaves.sh" launch script.
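Assuming it lives in the same bin directory as start-master.sh, the command is:
./bin/start-slaves.sh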
Note: In order to start the slaves, the master needs to be able to reach the slave machines through SSH. Since we are running everything on the same machine, your own machine needs to be accessible through SSH. To make sure you have SSH installed, run "which sshd". If you don't have it installed, install it with the following command:
sudo apt-get install openssh-server
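Once the server is installed, you can confirm that SSH to localhost actually works, since that is how the launch scripts connect. The commands below are just a suggested check; the optional key setup assumes you do not already have an SSH key you want to keep.
# quick check: should print "ssh ok" (you may be prompted for a password)
ssh localhost echo "ssh ok"
# optional: set up key-based SSH to localhost so the launch scripts do not prompt for a password
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys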
You will also need a password for root, since it will be requested when starting the slaves. If you do not have a root password set, use the following command to set one:
sudo passwd
With the slaves successfully started, you now have a Spark cluster up and running. If everything went according to plan, the web UI for the master should show the two slave nodes.
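Another way to confirm this, assuming you have a JDK with the jps tool on your PATH, is to list the running JVM processes; you should see one Master and two Worker entries.
# list Spark daemon JVMs; expect one "Master" and two "Worker" processes
jps | grep -E 'Master|Worker'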
Now let's connect to the cluster from the interactive shell by executing the following command:
MASTER=spark://IP:PORT ./spark-shell
You can find the IP and the PORT in the top-left corner of the web UI for the master. When you have successfully connected, the web UI will show that there is an active application.
I hope to write more posts about Spark in the future. If you want to learn a bit more about Spark, there is some great documentation on the Spark site itself. Go and check it out.