Tuesday, November 25, 2014

Install Hadoop 2.5.x on Ubuntu Trusty (14.04.1 LTS)

With the release of a new version of Hadoop (2.5.2), I thought of checking out its features by installing it on one of my VPCs running Ubuntu 14.04. The idea is to set up a simple installation as quickly as possible. The installation guide on the Hadoop web site looks straightforward, but setting up Hadoop on a fresh Ubuntu installation involves a few extra steps. Therefore, the next few sections will focus on setting up a Hadoop single node in pseudo-distributed mode on a fresh Ubuntu Trusty installation. To make things simple, I will be working on the command line (yes, it's much more productive than using the GUI!). For other modes and advanced setup scenarios, please refer to the Hadoop documentation pages.

Installing the Pre-requisites

Before we start installing Hadoop, the following packages should be installed:

SSH server


:~$ sudo apt-get install openssh-server

SSH Client 

:~$ sudo apt-get install ssh

rsync

:~$ sudo apt-get install rsync

Java JRE

:~$ sudo apt-get install openjdk-7-jre-headless

Note: Although I am installing the OpenJDK version of Java, other editions (e.g. Oracle) should be fine as well. For further information regarding supported (and unsupported) Java versions, please refer to the Hadoop Java Versions page.
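
If you are not sure what ended up on your system, a quick check of the installed Java runtime (assuming the package installed cleanly) is:

:~$ java -version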

Installing Hadoop 2.5.2

Download the latest Hadoop package from the Apache mirrors. In my case, I executed the command below, but feel free to use a different mirror site (in fact, it's quite possible that the one I am using now won't be there forever, so always get an available site from the Apache mirrors).

:~$ wget http://apache.mirror.nexicom.net/hadoop/common/stable/hadoop-2.5.2.tar.gz


This will eventually download the Hadoop package to your current directory (in my case, my home directory). Now extract the compressed archive as below:

:~$ tar xvf hadoop-2.5.2.tar.gz
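
As an optional extra step, you may want to confirm the archive was not corrupted in transit by comparing its checksum against the one Apache publishes for the release (the exact checksum file and format vary by mirror, so treat the command below as a sketch and check the value manually against the download page):

:~$ sha256sum hadoop-2.5.2.tar.gz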

Since we usually keep user-installed applications under /usr/local, I've moved the extracted package there and created a symlink for easy access.

:~$ cd /usr/local

:/usr/local$ sudo mv ~/hadoop-2.5.2 .

:/usr/local$ sudo ln -s hadoop-2.5.2 hadoop
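
As a quick sanity check (optional), you can run the bundled hadoop script through the new symlink; at this stage it may just complain that JAVA_HOME is not set, which is fine since we will configure that shortly:

:/usr/local$ hadoop/bin/hadoop version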

Configure SSH Server

In order to enable the master node to start the daemons on slave nodes, we should enable passwordless SSH logins as below (otherwise we would have to start them manually by logging into each slave!).

First, generate an SSH key pair to be used by the SSH client. Technically, this is for the master node.

:/usr/local$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa

Now we should add the public key of the master node generated above to the authorized list of keys on the slave nodes. Since, in this case, our slaves also reside on the same machine, doing the following will work for all slave nodes.

:/usr/local$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now check that you can log in via SSH without a password. Technically, this will open a separate console session to the same machine, so if successful, exit from the SSH session.

:/usr/local$ ssh localhost

:~$ exit

:/usr/local$
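
If SSH still asked for a password, a common culprit is overly permissive file modes on ~/.ssh; sshd ignores the authorized_keys file in that case. Tightening the permissions (only needed if you hit this problem) usually fixes it:

:/usr/local$ chmod 700 ~/.ssh
:/usr/local$ chmod 600 ~/.ssh/authorized_keys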

Configure Hadoop

Before we can start the Hadoop processes, we need to update a few configuration files so that Hadoop knows how to find certain dependencies (e.g. Java). The configuration files that we need to update are located inside the etc directory under the extracted Hadoop package. If you've followed my steps up to now, then you should be able to access them as follows. I am going to use my favorite editor, nano, but you can use whatever editor you like. First, I am going to set some environment variables inside hadoop-env.sh.

:/usr/local$ cd hadoop

:/usr/local/hadoop$ nano etc/hadoop/hadoop-env.sh

Put the following at the top of the file (i.e. make sure these get executed before anything else!)

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
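
Note that the JAVA_HOME path above points to the 32-bit (i386) OpenJDK directory; on a 64-bit Ubuntu install it is typically /usr/lib/jvm/java-7-openjdk-amd64 instead. If you are unsure which one you have, list the available JVM directories and adjust the export accordingly:

:/usr/local/hadoop$ ls /usr/lib/jvm/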

Update the core-site.xml and add the following configuration:

:/usr/local/hadoop$ nano etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Update the hdfs-site.xml and add the following configuration:

:/usr/local/hadoop$ nano etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
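
As an optional check that Hadoop is actually picking up these files (and that the environment variables from hadoop-env.sh are in effect), you can ask it to echo back the value we just set; if it prints hdfs://localhost:9000, the configuration is being read:

:/usr/local/hadoop$ bin/hdfs getconf -confKey fs.defaultFS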

With this, the primary configuration of Hadoop is complete. Stay tuned for my next post, where we will see how to run a MapReduce job locally!
