Note: If you want to know about setting up Hadoop on an Ubuntu server, please refer to my previous posts. In fact, this post assumes that you have already read them and are working with a similar setup.
1. Install Hadoop 2.5.x on Ubuntu Trusty (14.04.1 LTS)
2. Basic setup of the Hadoop file system
The table below lists the example jobs available in Hadoop 2.5.2; a sample invocation follows the table.
| Job Name | Description |
|---|---|
| aggregatewordcount | An Aggregate-based map/reduce program that counts the words in the input files. |
| aggregatewordhist | An Aggregate-based map/reduce program that computes the histogram of the words in the input files. |
| bbp | A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi. |
| dbcount | An example job that counts the pageviews stored in a database. |
| distbbp | A map/reduce program that uses a BBP-type formula to compute exact bits of Pi. |
| grep | A map/reduce program that counts the matches of a regex in the input. |
| join | A job that effects a join over sorted, equally partitioned datasets. |
| multifilewc | A job that counts words from several files. |
| pentomino | A map/reduce tile-laying program that finds solutions to pentomino problems. |
| pi | A map/reduce program that estimates Pi using a quasi-Monte Carlo method. |
| randomtextwriter | A map/reduce program that writes 10 GB of random textual data per node. |
| randomwriter | A map/reduce program that writes 10 GB of random data per node. |
| secondarysort | An example defining a secondary sort to the reduce. |
| sort | A map/reduce program that sorts the data written by the random writer. |
| sudoku | A sudoku solver. |
| teragen | Generates data for the terasort. |
| terasort | Runs the terasort. |
| teravalidate | Checks the results of the terasort. |
| wordcount | A map/reduce program that counts the words in the input files. |
| wordmean | A map/reduce program that computes the average length of the words in the input files. |
| wordmedian | A map/reduce program that computes the median length of the words in the input files. |
| wordstandarddeviation | A map/reduce program that computes the standard deviation of the length of the words in the input files. |
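All of these jobs ship in the examples jar bundled with the Hadoop distribution. As a minimal sketch, assuming the default tarball layout under `$HADOOP_HOME` (adjust the jar path if your installation differs, and substitute your own HDFS input/output paths):

```bash
# List the bundled example programs; running the jar with no arguments
# prints a list much like the table above.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar

# Run the 'pi' estimator with 10 map tasks and 100 samples per map.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 10 100

# Run 'wordcount' against an HDFS input directory. The paths below are
# placeholders; the output directory must not already exist.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /user/hduser/input /user/hduser/output
```

Invoking a job name with missing or wrong arguments prints that job's usage, which is a handy way to see exactly what parameters each example expects.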