Saturday, November 29, 2014

Running Hadoop MapReduce example jobs

The official Hadoop installation comes with some example MapReduce jobs. In this post, let us get to know them and try running them on our local server.
Note: If you want to know about setting up Hadoop on an Ubuntu server, please refer to my previous posts. In fact, in this post I assume that you have already read them and are working with a similar setup.

1. Install Hadoop 2.5.x on Ubuntu Trusty (14.04.1 LTS)
2. Basic setup of the Hadoop file system

The list below describes the example jobs available in Hadoop 2.5.2:

aggregatewordcount: An Aggregate-based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate-based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that counts the pageviews from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile-laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generates data for terasort.
terasort: Runs terasort.
teravalidate: Checks the results of terasort.
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that computes the average length of the words in the input files.
wordmedian: A map/reduce program that computes the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that computes the standard deviation of the lengths of the words in the input files.
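
For example, here is a minimal sketch of running one of these jobs (assuming Hadoop lives under /usr/local/hadoop as in my previous posts, and the examples jar is in its default location for 2.5.2):

:/usr/local/hadoop$ # estimate Pi with 16 map tasks and 1000 samples per map
:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 16 1000

Running the jar without any arguments prints this same list of jobs, and invoking a job name without its arguments prints the usage for that job.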

Friday, November 28, 2014

Basic setup of the Hadoop file system

In my last post, I described the simplest way of setting up Hadoop 2.5.2 on a fresh Ubuntu Trusty installation. In this post I am going to set up the Hadoop file system (and its DataNode) in my local installation.

First, let us get the storage layer ready. DataNodes are used to store data on Hadoop’s file system; typically, multiple DataNodes are used to achieve RAID-like redundancy, but in this case we will configure only one node. (Note: in RAID, we replicate the data across multiple disks of the same server, but in Hadoop, the DataNodes can be on multiple servers!) Since there is only one DataNode, the block replication factor should be set to 1, as in the sketch below.
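
A minimal hdfs-site.xml sketch for this single-node setup (assuming the install path from my previous post, so the file lives at /usr/local/hadoop/etc/hadoop/hdfs-site.xml):

<configuration>
  <property>
    <!-- keep one copy of each block, since we run a single DataNode -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

With the configuration in place, we can format the NameNode, which initializes the metadata store for the file system: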

:/usr/local/hadoop$ bin/hdfs namenode -format

Typically, a DataNode connects to the NameNode and responds to requests coming from it. Once we locate the DataNodes, we can even talk to them directly. In certain scenarios (e.g. data replication), a DataNode may also communicate directly with other DataNodes. More on this later, but for now let us focus on our simple use case!
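
As an aside, here is a sketch of how you can see this for yourself later, once the daemons are running and a file has been written (the path /example.txt is hypothetical):

:/usr/local/hadoop$ # asks the NameNode which DataNodes hold each block of the (hypothetical) file
:/usr/local/hadoop$ bin/hdfs fsck /example.txt -files -blocks -locations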

After formatting the NameNode, let’s start the daemons related to the DataNode and NameNode.

:/usr/local/hadoop$ sbin/start-dfs.sh

If everything goes well, you should be able to verify the running daemons by visiting the web interface on port 50070 of the Ubuntu installation. In my case, the output was like below:

> http://<my_server_IP>:50070/ 

[Image: the NameNode web interface overview page on port 50070]
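
You can also verify the daemons from the shell (a sketch; jps ships with the JDK, and dfsadmin queries the NameNode for a cluster report):

:/usr/local/hadoop$ jps                          # should list NameNode, DataNode and SecondaryNameNode
:/usr/local/hadoop$ bin/hdfs dfsadmin -report    # shows the registered DataNodes and their capacities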

Tuesday, November 25, 2014

Install Hadoop 2.5.x on Ubuntu Trusty (14.04.1 LTS)

With the release of a new version of Hadoop (2.5.2), I thought of checking out its features by installing it on one of my VPCs running Ubuntu 14.04. The idea is to set up a simple installation as quickly as possible. The installation guide on the Hadoop web site looks straightforward, but setting up Hadoop on a fresh Ubuntu installation involves a few extra steps. Therefore, the next few sections will focus on setting up a single-node Hadoop installation in pseudo-distributed mode on a fresh Ubuntu Trusty installation. To make things simple, I will be working on the command line (yes, it’s much more productive than using the GUI!). For other modes and advanced setup scenarios, please refer to the Hadoop documentation pages.
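
As a preview of those extra steps, here is a minimal sketch of the usual prerequisites on a fresh Trusty box (assumptions: OpenJDK 7 from the standard repositories is an acceptable Java for this setup, and the dedicated hduser account is my own convention, not a requirement):

$ sudo apt-get update
$ sudo apt-get install -y openjdk-7-jdk ssh rsync   # Java and SSH are required by Hadoop
$ sudo adduser hduser                               # optional: a dedicated user for Hadoop
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa          # the start scripts need passwordless SSH to localhost
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys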