Posts

Showing posts from 2016

Shell script for tree-structured copying of large data files to a large number of nodes with scp

Sometimes you need to copy a large file to a number of remote hosts. I recently had such a situation, where I had to copy a 56GB data file to around 30 compute nodes in an HPC cluster. I did not have the option of copying it to the shared disk (since it was nearly full), so I had to copy the file to the private scratch area of each node. Having the data in the private scratch area is better for the application, since you get better read performance (at least in the system I was working on).

Copying to each node from my machine or from the head node would take a very long time because of network bandwidth limitations. So I came up with a small shell script that does the copy in a tree-like structure. Once the script is provided with the set of nodes, the data file, and the destination, it first copies the data to the first node in the file, say node1. Then it starts copying from both the head node and node1 to node2 and node3 respectively. Likewise…
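The doubling pattern in this excerpt can be sketched as a copy schedule. This is not the original shell script, only a minimal Python illustration of the idea, assuming that every host that already holds the file sends it to exactly one new host per round. The function name `plan_tree_copy` and the host names are hypothetical, and each (source, destination) pair stands for one scp transfer that the real script would run.

```python
def plan_tree_copy(head, nodes):
    """Return a list of rounds; each round is a list of (source, destination) pairs.

    Assumption: in every round, each host that already has the file copies it
    to one host that does not, so the number of holders roughly doubles.
    """
    sources = [head]          # hosts that already hold the file
    pending = list(nodes)     # hosts still waiting for the file
    rounds = []
    while pending:
        batch = pending[:len(sources)]
        pending = pending[len(sources):]
        rounds.append(list(zip(sources, batch)))
        sources.extend(batch)  # new holders become sources in the next round
    return rounds


# Example: one head node and 30 compute nodes (hypothetical host names).
nodes = ["node%d" % i for i in range(1, 31)]
for number, pairs in enumerate(plan_tree_copy("headnode", nodes), start=1):
    print("round %d: %d parallel copies" % (number, len(pairs)))
```

With 30 nodes the schedule needs 5 rounds (1, 2, 4, 8, and 15 parallel copies) instead of 30 sequential copies from a single source.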

Setting up Heron Cluster with Apache Aurora Locally

In this post we will look at how to set up the Heron stream processing engine on Apache Aurora on our local machine. Oh boy, this is going to be a long post :D. I am doing this on Ubuntu 14.04, and the steps should be similar on any Linux machine. Heron supports deployment on Apache Aurora out of the box. Apache Aurora will act as the Scheduler for Heron once the setup is complete. To do this, you first have to set up Apache Zookeeper and allow Heron to communicate with it. Here Apache Zookeeper will act as the State Manager of the Heron deployment. If you just want to set up a local cluster without the hassle of installing Aurora, take a look at my previous blog post - Getting started with Heron stream processing engine in Ubuntu 14.04

Setting Up Apache Aurora Cluster locally
The first thing we need to do is set up Apache Aurora locally. I will try to explain as much of the configuration as I can as we go. First let's get an Apache Aurora cluster running on our loca…

Getting started with Heron stream processing engine in Ubuntu 14.04

I was trying to get started with Heron, a stream processing engine from Twitter, and faced some problems during the initial setup on Ubuntu. I am using Ubuntu 14.04, so these problems might not happen in other Ubuntu versions. The steps below simply follow the steps in the Heron documentation, but since I am working on Ubuntu, only the Ubuntu steps are shown.

Step 1.a: Download installation script files
You can download the script files for Ubuntu from https://github.com/twitter/heron/releases/tag/0.14.0

For the 0.14.0 release, you need to download the following files.

heron-client-install-0.14.0-ubuntu.sh
heron-tools-install-0.14.0-ubuntu.sh

Optional - You won't need the following for the steps in this blog post

heron-api-install-0.14.0-ubuntu.sh
heron-core-0.14.0-ubuntu.tar.gz

Step 1.b: Execute the client and tools shell scripts
$ chmod +x heron-client-install-VERSION-PLATFORM.sh
$ ./heron-client-install-VERSION-PLATFORM.sh --user …

Apache Hadoop MapReduce - Detailed word count example from scratch

In this post we will look at how to create and run a word count program in Apache Hadoop. To make it easy for a beginner, we will cover most of the setup steps as well. Please note that this blog entry is for a Linux-based environment. I am running Ubuntu 14.04 LTS on my machine. For Windows users the steps might be a little different. Information on running Hadoop on Windows is available at Build and Install Hadoop 2.x or newer on Windows.

Prerequisites

1. You need to have Java installed (preferably a newer Java version such as 1.7 or 1.8)

Download Oracle JDK 8 from http://www.oracle.com/technetwork/java/javase/downloads/index.html
Extract the archive to a folder named jdk1.8.0
Set the following environment variables. (You can set the variables in the .bashrc file)
JAVA_HOME=
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
2. SSH. If you do not have ssh installed on your machine, use the following command to install ssh and rsync, which is also needed

$ sudo apt-get install ssh…

Reading and Writing Binary files in Java - Converting Endianness (between Big-endian byte order and Little-endian byte order)

In this post we will look into working with binary files in Java. The end goal of the post is to create a simple Java application that reads a binary file and writes it back to a different binary file. The difference between the two files will be the byte order. If we read in a file with Big-endian byte order, we will write it back with Little-endian byte order, and vice versa. Let's first understand what is meant by byte order and what Big-endian and Little-endian are. If you are already familiar with this, you can skip this section of the post.

You can find the complete code for the binary format converter explained in this post on GitHub - https://github.com/pulasthi/binary-format-converter

Understanding byte order and Endianness
Endianness refers to the order in which bytes are stored for multi-byte values such as Integers and Doubles. This is also known as the byte order. Endianness does not have any meaning when you consider a single stored byte, and it …
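To make the byte order idea concrete, here is a minimal sketch that converts a buffer of 32-bit integers from one byte order to the other. It uses Python's standard `struct` module, not the Java code from the linked repository, and the function name `convert_int32_byte_order` is hypothetical. It assumes the data consists only of 4-byte integers.

```python
import struct


def convert_int32_byte_order(data, source_is_big_endian=True):
    """Re-encode a buffer of 32-bit integers in the opposite byte order."""
    if len(data) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    # ">" means big-endian and "<" means little-endian in struct format strings.
    src, dst = (">", "<") if source_is_big_endian else ("<", ">")
    count = len(data) // 4
    values = struct.unpack("%s%di" % (src, count), data)
    return struct.pack("%s%di" % (dst, count), *values)


# The integer 0x01020304 stored with the most significant byte first.
big = struct.pack(">i", 0x01020304)
little = convert_int32_byte_order(big, source_is_big_endian=True)
print(big.hex())     # 01020304
print(little.hex())  # 04030201
```

The integer value is the same in both buffers. Only the order of the four bytes differs, which is why a single byte has no byte order of its own.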

Access Apache Spark Web UI when cluster is running on closed port server machines

When you have an Apache Spark cluster running on a server where ports are closed, you cannot simply access the Spark master web UI at localhost:8080. The solution is to use SSH tunnels, which is pretty straightforward.

Note: You can check out my blog post on how to set up a Spark standalone cluster locally (the steps are pretty much the same when you are setting it up on a server) - How to set up a Apache Spark cluster in your local machine

Scenario 1:

The first and most basic scenario is when you have direct ssh access to the server where the Apache Spark master is running. Then all you have to do is run the following command in a terminal window on your local machine (the laptop or desktop that you use) after you start the master on the server machine.

$ ssh -L 8080:localhost:8080 username@your.server.name
Once you have run this command, you can access the Spark Web UI by simply going to "http://localhost:8080/" in your web browser. Likewise you might want to create S…