Tuesday, December 29, 2015

How to install Hadoop on Linux - Single node setup (Pseudo-Distributed Mode)

This blog explains how to install and configure Hadoop in pseudo-distributed mode on Linux (Oracle Enterprise Linux), and covers common errors encountered during setup.

Install Sun JDK

Download JDK 6 (latest update) from http://www.oracle.com/technetwork/java/javase/overview/index.html
 I downloaded update 37 from the URL below:
 http://www.oracle.com/technetwork/java/javase/downloads/jdk6u37-downloads-1859587.html

cd /softwares/hadoop
./jdk-6u37-linux-x64.bin
The JDK will be installed under the same folder (jdk1.6.0_37).
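To confirm the JDK works, run its java binary (path as per the install location above):

$ /softwares/hadoop/jdk1.6.0_37/bin/java -version

This should report java version "1.6.0_37".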


Download the latest stable Hadoop version

Download hadoop-1.0.4-bin.tar.gz from the Apache mirror http://www.motorlogy.com/apache/hadoop/common/stable/

Extract Hadoop:
 tar -xvf hadoop-1.0.4-bin.tar.gz

Hadoop is now extracted under /scratch/rajiv/softwares/hadoop/hadoop-1.0.4


Set JAVA_HOME for Hadoop

 vi hadoop-1.0.4/conf/hadoop-env.sh

Uncomment the line below and update the path to JDK 1.6:

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

I updated it to:
 export JAVA_HOME=/softwares/hadoop/jdk1.6.0_37


Configure SSH

You will need an SSH server (daemon) and client on your box. Install them (as root) with:

$ yum -y install openssh-server openssh-clients

If the yum repository is not configured, refer to this post.



Check whether the SSH server daemon sshd is running:
$ /sbin/service sshd status
If sshd is not running, start it using the command below:

$ /sbin/service sshd start


Alternatively, to determine whether SSH is running, enter the following command:
$ pgrep sshd

If SSH is running, this command returns one or more process IDs.


Run "which ssh" to check if you have ssh client installed.

Generate an SSH key (passphraseless)
$ ssh-keygen -t rsa -P ""

This creates an RSA key pair under the home folder with an empty passphrase.
Verify that the id_rsa and id_rsa.pub files are present under the ~/.ssh folder.

Copy the public key to authorized keys

Append id_rsa.pub to authorized_keys:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
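Note: if ssh still prompts for a password after this, check the permissions; sshd typically ignores an authorized_keys file that is group- or world-writable. The usual fix is:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys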

To test the SSH configuration, run "ssh localhost". The output below confirms that SSH is working:
    The authenticity of host 'localhost (127.0.0.1)' can't be established.

    RSA key fingerprint is 89:20:58:b4:06:c6:c6:5a:08:1e:43:eb:cc:e5:45:49.

    Are you sure you want to continue connecting (yes/no)? yes

    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Modify Hadoop config files

1) Modify core-site.xml to add two properties under the configuration tag.

a) hadoop.tmp.dir - if this property is not configured, HDFS data goes under the default "/tmp/hadoop-${user.name}/dfs", which may be cleared on reboot.
b) fs.default.name - if this property is not configured, startup of the secondarynamenode will fail.

Sample conf/core-site.xml
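A minimal version would look like the snippet below; the hadoop.tmp.dir path is just an example location, so use any writable folder:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/scratch/rajiv/softwares/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>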

2) Modify hdfs-site.xml
Add the dfs.replication property and set its value to 1 (a single-node setup has only one DataNode, so blocks cannot be replicated further).


Sample hdfs-site.xml
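A minimal version for this single-node setup:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>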

3) Modify mapred-site.xml
Add the mapred.job.tracker property and set its value to "localhost:9001".

Sample mapred-site.xml
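A minimal version:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>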
Format the HDFS file system

Run the command below:

$ ./hadoop-1.0.4/bin/hadoop namenode -format

Start Hadoop

/scratch/rajiv/softwares/hadoop/hadoop-1.0.4/bin/start-all.sh


Make sure all Hadoop processes are running

Run "$JAVA_HOME/bin/jps" and make sure the following processes are running: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker.


Stop Hadoop

./bin/stop-all.sh  

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode


Common errors and troubleshooting steps

1) If SSH is not configured, the following errors are displayed while starting Hadoop:

$ ./start-all.sh

starting namenode, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-rajiv-namenode-myhostname.out
rajiv@localhost's password:
rajiv@localhost's password: localhost: Permission denied, please try again.
localhost: Permission denied, please try again.
rajiv@localhost's password:
localhost: starting datanode, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/bin/../logs/hadoop-rajiv-datanode-myhostname.out

2) If HDFS is not configured, starting Hadoop fails with the following message:

$ ./start-all.sh

starting namenode, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-rajiv-namenode-myhostname.out
localhost: starting datanode, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/bin/../logs/hadoop-rajiv-datanode-myhostname.out
localhost: starting secondarynamenode, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/bin/../logs/hadoop-rajiv-secondarynamenode-myhostname.out
localhost: Exception in thread "main" java.lang.IllegalArgumentException: Does not contain a valid host:port authority: file:///
localhost: at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:162)
localhost: at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:198)
localhost: at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:228)
localhost: at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:222)
localhost: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:161)
localhost: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:129)
localhost: at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:567)
starting jobtracker, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-rajiv-jobtracker-myhostname.out
localhost: starting tasktracker, logging to /scratch/rajiv/softwares/hadoop/hadoop-1.0.4/bin/../logs/hadoop-rajiv-tasktracker-myhostname.out

To fix this, make sure core-site.xml has the snippet below:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>

3) "hadoop namenode -format" fails with "Cannot create directory":

$ ./bin/hadoop namenode -format

13/03/26 21:49:33 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = myhostname/myipaddress
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.4
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
13/03/26 21:49:33 INFO util.GSet: VM type = 64-bit
13/03/26 21:49:33 INFO util.GSet: 2% max memory = 17.77875 MB
13/03/26 21:49:33 INFO util.GSet: capacity = 2^21 = 2097152 entries
13/03/26 21:49:33 INFO util.GSet: recommended=2097152, actual=2097152
13/03/26 21:49:34 INFO namenode.FSNamesystem: fsOwner=rajiv
13/03/26 21:49:34 INFO namenode.FSNamesystem: supergroup=supergroup
13/03/26 21:49:34 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/03/26 21:49:34 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/03/26 21:49:34 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/03/26 21:49:34 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/03/26 21:49:34 ERROR namenode.NameNode: java.io.IOException: Cannot create directory /u01/hadoop/tmp/dfs/name/current
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:297)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1320)
at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:1339)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1164)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)

Cause: an invalid tmp folder is specified. To fix this, check core-site.xml and ensure the hadoop.tmp.dir element has a correct, writable value.

4) While running the basic example, the following error is displayed:

$ ./bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

13/05/07 23:04:22 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/05/07 23:04:22 WARN snappy.LoadSnappy: Snappy native library not loaded
13/05/07 23:04:22 INFO mapred.JobClient: Cleaning up the staging area file:/scratch/softwares/hadoop/tmp/mapred/staging/rajiv762105165/.staging/job_local_0001
13/05/07 23:04:22 ERROR security.UserGroupInformation: PriviledgedActionException as:rajiv cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/rajiv/input
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/rajiv/input
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.hadoop.examples.Grep.run(Grep.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.examples.Grep.main(Grep.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Cause:

The input folder is not present in HDFS:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:54310/user/rajiv/input


Solution:


Check whether the input folder exists:

./bin/hadoop dfs -ls hdfs:/user/rajiv

If the input folder is not present, copy it from the local file system:

./bin/hadoop dfs -copyFromLocal ./input hdfs:/user/rajiv/input

The command above assumes that the input folder is present under the current directory.
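If the local input folder itself does not exist yet, a quick way to create one (following the Apache single-node guide) is to copy the Hadoop XML config files into it:

$ mkdir input
$ cp conf/*.xml input
$ ./bin/hadoop dfs -copyFromLocal ./input hdfs:/user/rajiv/input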


5) If Hadoop is not started, running the example displays the error below:

$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
13/05/07 23:09:05 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 0 time(s).
13/05/07 23:09:06 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 1 time(s).
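Solution: start Hadoop first (bin/start-all.sh) and verify with jps that the NameNode is up before re-running the example.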



For a complete MapReduce example (WordCount v1.0), see the official tutorial:
http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v1.0