Saturday, November 29, 2014

Running Hadoop MapReduce example jobs

The official Hadoop distribution ships with a set of example MapReduce jobs. In this post, let us get to know them and try running a few on our local server.
Note: If you want to know how to set up Hadoop on an Ubuntu server, please refer to my previous posts. In fact, this post assumes that you have already read them and are working with a similar setup.

1. Install Hadoop 2.5.x on Ubuntu Trusty (14.04.1 LTS)
2. Basic setup of the Hadoop file system

The table below lists the example jobs available in Hadoop 2.5.2:
Job Name               Description
aggregatewordcount     An Aggregate-based map/reduce program that counts the words in the input files.
aggregatewordhist      An Aggregate-based map/reduce program that computes the histogram of the words in the input files.
bbp                    A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount                An example job that counts the pageview counts from a database.
distbbp                A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep                   A map/reduce program that counts the matches of a regex in the input.
join                   A job that effects a join over sorted, equally partitioned datasets.
multifilewc            A job that counts words from several files.
pentomino              A map/reduce tile-laying program to find solutions to pentomino problems.
pi                     A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter       A map/reduce program that writes 10 GB of random textual data per node.
randomwriter           A map/reduce program that writes 10 GB of random data per node.
secondarysort          An example defining a secondary sort to the reduce.
sort                   A map/reduce program that sorts the data written by the random writer.
sudoku                 A Sudoku solver.
teragen                Generates data for the terasort.
terasort               Runs the terasort.
teravalidate           Checks the results of the terasort.
wordcount              A map/reduce program that counts the words in the input files.
wordmean               A map/reduce program that computes the average length of the words in the input files.
wordmedian             A map/reduce program that computes the median length of the words in the input files.
wordstandarddeviation  A map/reduce program that computes the standard deviation of the length of the words in the input files.
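Incidentally, running the examples jar without any arguments prints this same list of valid program names, so you can always check what is available in your version:
:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar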
Each of the above jobs can be executed using the following command pattern:
:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar <Job Name> <Job Arguments>
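For instance, running the pi job with 10 map tasks and 100 samples per map would look like the following (the two numbers are arbitrary picks; more samples give a better estimate):
:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 10 100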
Wait… Before we can execute any of the samples, we need to create a home directory for the current user on the distributed file system. Having this directory is a prerequisite for running any MapReduce job.
:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user
:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user/<username>   (use the user name of the current shell)
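We can verify that the directory was created by listing it:
:/usr/local/hadoop$ bin/hdfs dfs -ls /user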
As an example, let us execute the wordcount job, which analyzes a set of text files and outputs the count of each word. To keep things simple, let us use the Hadoop configuration files as the input. First, we have to copy the input files to our distributed file system as below.
:/usr/local/hadoop$ bin/hdfs dfs -put etc/hadoop input
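To confirm that the configuration files made it across, list the newly created input directory:
:/usr/local/hadoop$ bin/hdfs dfs -ls input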
Now let us run the job:
:/usr/local/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount input output
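A note in passing: MapReduce refuses to overwrite an existing output directory, so if you want to re-run the job you first have to remove the previous output:
:/usr/local/hadoop$ bin/hdfs dfs -rm -r output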
The wordcount command above runs the MapReduce job and saves the result into the output directory on the distributed file system. We can view the file contents (i.e. the word counts for our input files) as follows:
:/usr/local/hadoop$ bin/hdfs dfs -cat output/*
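The output directory typically holds a _SUCCESS marker plus one part file per reducer (part-r-00000 for a single-reducer job like this one), so you can also cat an individual part file:
:/usr/local/hadoop$ bin/hdfs dfs -cat output/part-r-00000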
We can copy the result files to our local file system as follows:
:/usr/local/hadoop$ bin/hdfs dfs -get output result
Similarly, we can view the result contents from the local file system as follows:
:/usr/local/hadoop$ cat result/*
If the unsorted output is hard to read, try the command below; it sorts the results in descending order of the count and lets us view them page by page.
:/usr/local/hadoop$ cat result/* | sort -k2,2nr | less
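And if you are after the count of one particular word, a quick grep does the trick ('configuration' here is just an arbitrary pick):
:/usr/local/hadoop$ grep -w 'configuration' result/*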
We can play with the rest of the example jobs in the same way, but that is a topic for another day. If you can't wait, please feel free to experiment!
