Hadoop can be configured in three different modes:
- Local (Standalone) Mode
- Pseudo-Distributed Mode
- Fully-Distributed Mode
This blog explains Local (Standalone) Mode, which is the easiest to configure. If you just want to get started and run a MapReduce job, this is a good place to begin.
Install Sun JDK
Download JDK 6 (latest update) from http://www.oracle.com/technetwork/java/javase/overview/index.html. I downloaded update 37 from the URL below:
http://www.oracle.com/technetwork/java/javase/downloads/jdk6u37-downloads-1859587.html
cd /scratch/rajiv/hadoop/hadoop-1.0.4
./jdk-6u37-linux-x64.bin
The JDK will be installed under the same folder (jdk1.6.0_37).
Download the latest stable Hadoop version
Download hadoop-1.0.4-bin.tar.gz from an Apache mirror, e.g. http://www.motorlogy.com/apache/hadoop/common/stable/. Then extract Hadoop:
tar -xvf hadoop-1.0.4-bin.tar.gz
Hadoop is now extracted under /scratch/rajiv/hadoop/hadoop-1.0.4.
Set JAVA_HOME for Hadoop
vi hadoop-1.0.4/conf/hadoop-env.sh
Uncomment the line below and update the path to point to JDK 1.6:
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
I updated it to:
export JAVA_HOME=/scratch/rajiv/hadoop/jdk1.6.0_37
Run a sample MapReduce program
hadoop-1.0.4-bin.tar.gz ships with sample programs, packaged in hadoop-examples-1.0.4.jar. Let's try Grep.java from this jar.
The source code of this class is not included in the Hadoop distribution, but it can be viewed in Hadoop's version control system; the Hadoop SVN repository lets you browse the source code online. To view the source for Hadoop version 1.0.4, go to branch-1.0, click on Grep.java, and view the revision.
This job has three steps:
- Mapper - the Mapper class is set to RegexMapper
- Combiner - this is set to LongSumReducer
- Reducer - the Reducer class is set to LongSumReducer
Job configuration - prepares job parameters such as the input and output folders, and specifies the Mapper and Reducer classes
Job client - used to submit the job
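The steps above correspond to how Grep.java wires the job together. A minimal sketch of that configuration, assuming hadoop-core-1.0.4.jar (old org.apache.hadoop.mapred API) is on the classpath, could look like the following. This is not a copy of the real Grep.java, which additionally runs a second job to sort the counts:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.RegexMapper;

// Hypothetical abridged sketch of the Grep job configuration;
// requires hadoop-core-1.0.4.jar on the classpath to compile.
public class GrepSketch {
    public static void main(String[] args) throws Exception {
        JobConf grepJob = new JobConf(GrepSketch.class);
        grepJob.setJobName("grep-search");

        // Job configuration: input/output folders and the regex to match.
        FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));
        grepJob.set(RegexMapper.PATTERN, args[2]);

        // Mapper emits (matched string, 1); combiner and reducer sum the counts.
        grepJob.setMapperClass(RegexMapper.class);
        grepJob.setCombinerClass(LongSumReducer.class);
        grepJob.setReducerClass(LongSumReducer.class);
        grepJob.setOutputKeyClass(Text.class);
        grepJob.setOutputValueClass(LongWritable.class);

        // Job client: submits the job and waits for completion.
        JobClient.runJob(grepJob);
    }
}
```

This is a configuration fragment rather than a standalone program; it only runs inside a Hadoop installation, as in the commands below.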
$cd /scratch/rajiv/hadoop/hadoop-1.0.4
$mkdir inputfiles
$cd inputfiles
$wget http://hadoop.apache.org/index.html
$cd ..
$./bin/hadoop jar hadoop-examples-*.jar grep inputfiles outputfiles 'Apache'
The wget command downloads the index.html file into the inputfiles folder.
The hadoop command above reads index.html from the inputfiles folder, greps for occurrences of the string 'Apache', and writes the matched string and its count to the outputfiles folder.
Now examine the output folder
ls outputfiles/
_SUCCESS part-00000
The file part-00000 contains the output of the MapReduce job:
cat outputfiles/part-00000
46 Apache
In this mode, Hadoop runs as a single process.
While the above command is running, execute "ps -ef | grep RunJar" from another terminal; it shows a single Java process that invokes:
"org.apache.hadoop.util.RunJar hadoop-examples-1.0.4.jar grep inputfiles"
The source code of org.apache.hadoop.util.RunJar can also be browsed in the same repository.
The RunJar class essentially loads Grep.class from hadoop-examples-1.0.4.jar and executes its main method.
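The essence of what RunJar does can be sketched with plain reflection. In the sketch below, FakeGrep is a hypothetical stand-in for the jar's main class; the real RunJar unpacks the jar, builds a class loader over its contents, determines the main class from the jar manifest or the first argument, and invokes its static main with the remaining arguments:

```java
import java.lang.reflect.Method;

public class RunJarSketch {
    // Hypothetical stand-in for the example jar's main class (e.g. Grep).
    public static class FakeGrep {
        static String lastArgs = null;
        public static void main(String[] args) {
            lastArgs = String.join(" ", args);
        }
    }

    // Core of RunJar: resolve the main class by name and invoke its
    // static main(String[]) via reflection with the remaining arguments.
    static void invokeMain(ClassLoader loader, String mainClass, String[] args)
            throws Exception {
        Class<?> cls = Class.forName(mainClass, true, loader);
        Method main = cls.getMethod("main", String[].class);
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        // The real RunJar uses a URLClassLoader over the unpacked jar;
        // here we reuse the current classpath to keep the demo self-contained.
        invokeMain(RunJarSketch.class.getClassLoader(),
                   "RunJarSketch$FakeGrep",
                   new String[] {"inputfiles", "outputfiles", "Apache"});
        System.out.println("invoked with: " + FakeGrep.lastArgs);
    }
}
```

Running this prints "invoked with: inputfiles outputfiles Apache", mirroring how RunJar hands the command-line arguments through to the example program.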
Note that in Standalone mode the HDFS file system is not configured and the MapReduce program runs as a single Java process. To see Hadoop in full action you would need to configure Pseudo-Distributed Mode or Fully-Distributed Mode.