Thursday, December 19, 2013

How to configure 'Ubuntu Software Center' behind a proxy

Go to the "System Settings" menu (top right icon), select "Network" and then select "Network Proxy" from the popup.

Add the http, https and ftp proxy server details and click the "Apply System Wide" button.

To verify the changes were saved, open "/etc/apt/apt.conf" as root; you should see lines starting with:

Acquire::http::proxy...
Acquire::https::proxy...
Acquire::ftp::proxy...


Now these need to be exported as environment variables. Open /etc/bash.bashrc and add the lines below.

export http_proxy=http://fully-qualified-proxy-hostname:port/
export https_proxy=https://fully-qualified-proxy-hostname:port/
export ftp_proxy=ftp://fully-qualified-proxy-hostname:port/



Now run "source /etc/bash.bashrc" from a terminal and then open "Ubuntu Software Center". It should work now.


Thursday, August 22, 2013

How to build hadoop source code using ant & ivy (hadoop release-1.0.4)


Download hadoop source code

Use anonymous svn checkout to get the hadoop source code.

Create a folder to check out the source code into, for example:
mkdir -p /scratch/rajiv/softwares/hadoop/source


Now check out from the svn repository.

$svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.4/ release-1.0.4
svn: OPTIONS of 'http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.4': could not connect to server (http://svn.apache.org)
I got the above error initially because I am behind a proxy server.

Edit the ~/.subversion/servers file, un-comment the lines below and provide your proxy server hostname (preferably the fully qualified hostname) and port. If you don't know the proxy server name, just run "wget google.com" from a terminal; it prints the proxy server hostname.

Make sure you edit these properties under the "[global]" section. The same properties are present under the "group" section, but modifying them there has no effect; you will get the same exception as above.

#http-proxy-host = my.proxy.server.name.here
#http-proxy-port = 80
#http-compression = no

Now run svn checkout again
$svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.4/ release-1.0.4

Now the hadoop source code is checked out under the current folder.
$ls hadoop-common-1.0.4
 


Build hadoop source code
Hadoop 1.0.4 source doesn't have a maven project defined.
There is no pom.xml present, and trying to build using maven will fail with the exception below.
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4). Please verify you invoked Maven from the correct directory. -> [Help 1]
A maven project was already available in hadoop release-0.23.7.
Versions from release-0.3.0 to release-0.9.2 use ant.
Versions from release-1.0.0 to release-1.2.0-rc1 use ant and ivy (I noticed that during the build, ivy downloads maven2 artifacts).

From hadoop 2.0 onwards, ant and ivy are removed and only a maven project is present.

 
Refer 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.23.7/
 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-1.0.4/

http://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.1-alpha/


Install ant and Ivy

Download and extract ivy
 wget http://apache.osuosl.org//ant/ivy/2.3.0/apache-ivy-2.3.0-bin.tar.gz

Download and extract ant:

wget http://archive.apache.org/dist/ant/binaries/apache-ant-1.8.4-bin.tar.gz

Running "ant jar" at this point fails:

$ant jar

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/tools/ant/launch/Launcher
Caused by: java.lang.ClassNotFoundException: org.apache.tools.ant.launch.Launcher
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.tools.ant.launch.Launcher.  Program will exit

Set JAVA_HOME and ANT_HOME to fix the above error.

$ setenv JAVA_HOME /scratch/rajiv/softwares/hadoop/jdk
$ setenv ANT_HOME /scratch/rajiv/softwares/hadoop/ant

$/scratch/rajiv/softwares/hadoop/ant/bin/ant jar 
Buildfile: /scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4/build.xml

clover.setup:

clover.info:
     [echo]
     [echo]      Clover not found. Code coverage reports disabled.
     [echo]

clover:

ivy-download:
      [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
      [get] To: /scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4/ivy/ivy-2.1.0.jar
      [get] Error getting http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar to /scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4/ivy/ivy-2.1.0.jar

BUILD FAILED
/scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4/build.xml:2419: java.net.NoRouteToHostException: No route to host
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:529)
        at java.net.Socket.connect(Socket.java:478)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:388)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:523)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:227)
        at sun.net.www.http.HttpClient.New(HttpClient.java:300)
        at sun.net.www.http.HttpClient.New(HttpClient.java:317)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
        at org.apache.tools.ant.taskdefs.Get$GetThread.openConnection(Get.java:660)
        at org.apache.tools.ant.taskdefs.Get$GetThread.get(Get.java:579)
        at org.apache.tools.ant.taskdefs.Get$GetThread.run(Get.java:569)

Total time: 10 seconds
[rajiv@myhostname hadoop-common-1.0.4]$
Ant is not able to get the ivy jars. Pass the http proxy host and port as -D arguments to ant to fix this:

$ /scratch/rajiv/softwares/hadoop/ant/bin/ant jar -Dhttp.proxyHost=your-proxy-server-name-here -Dhttp.proxyPort=80

This also did not work, so set ANT_OPTS instead:

$ setenv ANT_OPTS "-Dhttp.proxyHost=your-proxy-server-name-here -Dhttp.proxyPort=80"
$ /scratch/rajiv/softwares/hadoop/ant/bin/ant jar
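
The -Dhttp.proxyHost and -Dhttp.proxyPort settings work because ant's Get task (visible in the stack trace above) opens the download through the JVM's standard HttpURLConnection, which honours these system properties. Below is a small sketch, with an illustrative class name and test URL, that can be run with the same -D flags to confirm the proxy settings are actually reaching the JVM:

import java.net.HttpURLConnection;
import java.net.URL;

public class ProxyCheck {
    public static void main(String[] args) throws Exception {
        // The standard JVM proxy system properties that ANT_OPTS / the -D flags set.
        System.out.println("http.proxyHost = " + System.getProperty("http.proxyHost"));
        System.out.println("http.proxyPort = " + System.getProperty("http.proxyPort"));

        // HttpURLConnection picks up the proxy properties automatically,
        // the same way ant's Get task does when it downloads the ivy jar.
        URL url = new URL("http://repo2.maven.org/maven2/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}

Run it as "java -Dhttp.proxyHost=your-proxy-server-name-here -Dhttp.proxyPort=80 ProxyCheck"; a 200 response means the JVM can reach the repository through the proxy.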
Now cd to the hadoop source code folder and make sure the build.xml file is present under this folder:
 

$cd /scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4
 
run ant:
 
$/scratch/rajiv/softwares/hadoop/ant/bin/ant jar



 
Sample output (trimmed):
 

$/scratch/rajiv/softwares/hadoop/ant/bin/ant jar
Buildfile: /scratch/rajiv/softwares/hadoop/source/hadoop-common-1.0.4/build.xml

clover.setup:

clover.info:
[echo]
[echo] Clover not found. Code coverage reports disabled.
[echo]
 ........................... 
 
BUILD SUCCESSFUL
Total time: 8 minutes 34 seconds
[rajiv@myhostname hadoop-common-1.0.4]$
 






Friday, May 10, 2013

How MapReduce Works - Stock Quote Example


A MapReduce program has two phases: a Map phase and a Reduce phase. Between the two, the framework also performs a Shuffle (sort and group) step.

  • Map task
  • Shuffle - sort & group
  • Reduce Task



Stock Quote Example

Download historical data for any listed company from google finance.
For example, Google Finance provides historical data for the Google stock for the last one year in CSV format.
This CSV file contains fields like Date, Opening Price, High, Low and Closing Price for each day.
Sample data below:



Now let's use Hadoop to find the highest stock price for a given year.

Mapper - This function extracts the required fields from the above data and produces (month, price) pairs as shown below:


May, 873.88
May, 863.87
May, 861.85
May, 846.8
May, 834.55
May, 824.72
April, 783.75
April, 779.55
April, 786.99
April, 805.75
April, 814.2
April, 814.83
April, 802.25


The Shuffle step sorts and groups the data for each month as shown below:
It sorts the (key, value) pairs by key, so the input to the Reducer function looks like:

(May, [ 873.88, 863.87, 861.85, 846.8, 834.55, 824.72 ] )
(April, [ 783.75, 779.55, 786.99, 805.75, 814.2, 814.83, 802.25 ] )

Now the Reducer function loops through this data and finds the highest price for the year:
(May, 873.88)

Data flow:
input data > Mapper function > Shuffle (sort/group) > Reducer function > output file
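
Below is a minimal sketch of this job using the hadoop 1.x mapred API. The class names, the CSV column positions and the date format are assumptions made for illustration; they are not taken from an actual implementation.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MaxStockPrice {

    // Map: extract (month, high price) from each CSV line of the historical data.
    public static class HighPriceMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, FloatWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, FloatWritable> out, Reporter reporter)
                throws IOException {
            String[] fields = value.toString().split(",");
            if (fields.length < 3 || fields[0].startsWith("Date")) {
                return; // skip the header and malformed lines
            }
            String month = fields[0].split("-")[1];   // assumes a d-MMM-yy date such as 10-May-13
            float high = Float.parseFloat(fields[2]); // assumes High is the third column
            out.collect(new Text(month), new FloatWritable(high));
        }
    }

    // Reduce: for each month, keep the maximum of the grouped prices.
    public static class MaxPriceReducer extends MapReduceBase
            implements Reducer<Text, FloatWritable, Text, FloatWritable> {
        public void reduce(Text month, Iterator<FloatWritable> values,
                           OutputCollector<Text, FloatWritable> out, Reporter reporter)
                throws IOException {
            float max = Float.MIN_VALUE;
            while (values.hasNext()) {
                max = Math.max(max, values.next().get());
            }
            out.collect(month, new FloatWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MaxStockPrice.class);
        conf.setJobName("max-stock-price");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input folder with the CSV file
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output folder
        conf.setMapperClass(HighPriceMapper.class);
        conf.setReducerClass(MaxPriceReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(FloatWritable.class);
        JobClient.runJob(conf); // submit the job and wait for completion
    }
}

The reducer is called once per month, so the output contains the highest price for each month; the largest of those entries, (May, 873.88) in the sample above, is the highest price for the year.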


Wednesday, May 8, 2013

How to install hadoop on Linux - Local (Standalone) Mode

 Hadoop can be configured in three different modes:

  1. Local (Standalone) Mode
  2. Pseudo-Distributed Mode
  3. Fully-Distributed Mode

This post explains Local (Standalone) Mode, which is the easiest to configure. If you just want to get started and run a MapReduce job, you can try this.

Install sun jdk

Download JDK 6 (latest update) from http://www.oracle.com/technetwork/java/javase/overview/index.html
I have downloaded update 37 from the URL below:
 http://www.oracle.com/technetwork/java/javase/downloads/jdk6u37-downloads-1859587.html

cd /scratch/rajiv/hadoop
./jdk-6u37-linux-x64.bin
The JDK will be installed under the same folder (jdk1.6.0_37).


Download latest stable hadoop version 

Download hadoop-1.0.4-bin.tar.gz from apache mirror http://www.motorlogy.com/apache/hadoop/common/stable/

Extract hadoop:
 tar -xvf hadoop-1.0.4-bin.tar.gz

Now hadoop is extracted under /scratch/rajiv/hadoop/hadoop-1.0.4


set JAVA_HOME for hadoop

 vi hadoop-1.0.4/conf/hadoop-env.sh

Uncomment the line below and update the path to JDK 1.6:

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

I have updated it to  
 export JAVA_HOME=/scratch/rajiv/hadoop/jdk1.6.0_37

Run sample MapReduce program

hadoop-1.0.4-bin.tar.gz ships sample programs in hadoop-examples-1.0.4.jar. Let's try Grep.java from this jar.

The source code of this class is not present in the hadoop distribution, but it can be viewed in the Hadoop version control system. The Hadoop SVN repository provides an option to browse the source code online.


To view the source code of hadoop version 1.0.4, go to branch-1.0, click on Grep.java and view the revision.

This job has three steps:
  • Mapper - Mapper class is set to RegexMapper
  • Combiner - this is set to LongSumReducer
  • Reducer - Reducer class is set to LongSumReducer

Job configuration - prepares job parameters like the input & output folders and specifies the Mapper and Reducer classes

Job client - used to submit the job
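
Putting those pieces together, a grep-style job for hadoop 1.0.4 would be wired up roughly as in the sketch below. This is a simplified, illustrative version, not the actual Grep.java source; "mapred.mapper.regex" is the property the old-API RegexMapper reads, but verify it against your hadoop version.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.RegexMapper;

public class SimpleGrep {
    public static void main(String[] args) throws Exception {
        // Job configuration: input/output folders and the pattern to grep for.
        JobConf conf = new JobConf(SimpleGrep.class);
        conf.setJobName("simple-grep");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // e.g. inputfiles
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // e.g. outputfiles
        conf.set("mapred.mapper.regex", args[2]);                // e.g. Apache

        // Mapper emits (matched string, 1); the combiner and reducer sum the counts.
        conf.setMapperClass(RegexMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Job client: submit the job and wait for it to finish.
        JobClient.runJob(conf);
    }
}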



$cd /scratch/rajiv/hadoop/hadoop-1.0.4
$mkdir inputfiles
$cd inputfiles
$wget http://hadoop.apache.org/index.html
$cd ..
$./bin/hadoop jar hadoop-examples-*.jar grep inputfiles outputfiles 'Apache'

The wget command downloads the index.html file to the inputfiles folder. The hadoop command above reads index.html from the inputfiles folder, greps for occurrences of the string 'Apache', and writes the string and its count to the outputfiles folder.


Now examine the output folder

ls outputfiles/
_SUCCESS  part-00000

The file part-00000 contains the output of the MapReduce job.


cat outputfiles/part-00000
46      Apache



In this mode, hadoop runs as a single process.
While the above command is running, run "ps -ef | grep RunJar" from another terminal; it shows a java process running that invokes the following:

"org.apache.hadoop.util.RunJar hadoop-examples-1.0.4.jar grep inputfiles"

The source code of org.apache.hadoop.util.RunJar can be viewed in the Hadoop SVN repository.

The RunJar class basically loads the Grep class from hadoop-examples-1.0.4.jar and executes its main method.
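
A simplified sketch of that mechanism is shown below. The real RunJar also unpacks the jar and reads the Main-Class attribute from the manifest; this trimmed, illustrative version only shows the class loading and the reflective main() call:

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class RunJarSketch {
    public static void main(String[] args) throws Exception {
        // args[0] = jar (e.g. hadoop-examples-1.0.4.jar), args[1] = main class, rest = program args.
        URLClassLoader loader = new URLClassLoader(
                new URL[] { new File(args[0]).toURI().toURL() },
                RunJarSketch.class.getClassLoader());

        // Load the requested class (Grep in this example) from the jar.
        Class<?> mainClass = Class.forName(args[1], true, loader);
        Method main = mainClass.getMethod("main", String[].class);

        // Pass the remaining command-line arguments to its main method.
        String[] programArgs = new String[args.length - 2];
        System.arraycopy(args, 2, programArgs, 0, programArgs.length);
        main.invoke(null, (Object) programArgs);
    }
}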

Note that in Standalone mode the HDFS file system is not configured and the MapReduce program runs as a single java process. To see hadoop in full action you would need to configure Pseudo-Distributed Mode or Fully-Distributed Mode.





Friday, April 19, 2013

MapReduce - Part I


MapReduce is a paradigm for distributed computing. It provides a framework for parallel computing, enables large-scale distributed data processing, and can be applied to many large-scale computing problems. The name MapReduce is inspired by the Map and Reduce functions in the LISP programming language, but it is not an implementation of those LISP functions.

From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.

Typical problems solved by MapReduce:

  • Read a lot of data.
  • Map: extract something you care about from each record.
  • Shuffle and sort: the MapReduce framework reads the output of the Map phase and performs sorting and grouping.
  • Reduce: aggregate, summarize, filter or transform the data and write the results to a file.


Why do we need distributed computing like MapReduce?

Because otherwise some problems are too big to solve.

Example:

  • 20+ billion web pages x 20KB = 400+ terabytes
  • One computer can read 30-35 MB/sec from disk: ~four months just to read the web
  • ~1,000 hard drives just to store the web, and even more to do something with the data

Using MapReduce, the same problem can be solved with 1,000 machines in less than 3 hours.


The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is complete, the Reduce function is called once for each unique key that was generated by Map and is given the key and a list of all values that were generated for that key as a parameter. The keys are presented in sorted order.
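
A tiny single-machine sketch of that contract, using the classic word-count example (toy code, not any particular framework's API):

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("a b", "b c", "c c");

        // Map: parse each record into intermediate (key, value) pairs - here (word, 1).
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<Map.Entry<String, Integer>>();
        for (String record : records) {
            for (String word : record.split(" ")) {
                intermediate.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }

        // Shuffle and sort: group the values by key; a TreeMap keeps the keys in sorted order.
        SortedMap<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            if (!grouped.containsKey(pair.getKey())) {
                grouped.put(pair.getKey(), new ArrayList<Integer>());
            }
            grouped.get(pair.getKey()).add(pair.getValue());
        }

        // Reduce: called once per unique key with the list of all values for that key.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            System.out.println(entry.getKey() + "\t" + sum); // prints: a 1, b 2, c 3
        }
    }
}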

Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). 

MapReduce is not a general-purpose framework for all forms of parallel programming. Rather, it is designed specifically for problems that can be broken up into the map-reduce paradigm.

Credit:
Much of this information is from the articles below:
Distributed Systems course - www.cs.rutgers.edu/~pxk/417/notes

Saturday, April 13, 2013

The Google File System (GFS)

A brief overview on GFS.

GFS development was motivated by the need for a scalable distributed file system. GFS supports large-scale data processing workloads on commodity hardware. In GFS, files are divided into fixed-size chunks, which are replicated across chunkservers to deliver aggregate performance and fault tolerance. Each chunk has a unique 64-bit chunk handle.

GFS has a single master for simplicity and multiple chunkservers (replicas). The master and chunkservers coordinate using heartbeat messages. GFS is fault tolerant and supports terabytes of space.

Here is the architecture diagram from the GFS paper.



In the above diagram, the GFS client contacts the GFS master to obtain the chunk location, and then contacts one of the chunkservers to obtain the data.
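
A minimal sketch of that read path is below; the interfaces and method names are illustrative, not the real GFS API:

public class GfsReadSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // GFS uses fixed 64 MB chunks

    // Metadata the master returns for one chunk.
    static class ChunkInfo {
        long chunkHandle;       // the unique 64-bit chunk handle
        Chunkserver[] replicas; // chunkservers holding a copy of this chunk
    }

    interface Master {
        ChunkInfo lookup(String fileName, long chunkIndex);
    }

    interface Chunkserver {
        byte[] read(long chunkHandle, long offsetInChunk, int length);
    }

    // Client-side read: only metadata goes through the master; the data itself
    // is fetched directly from one of the chunkserver replicas.
    static byte[] read(Master master, String fileName, long offset, int length) {
        long chunkIndex = offset / CHUNK_SIZE;                // which chunk holds this offset
        ChunkInfo info = master.lookup(fileName, chunkIndex); // ask the single master
        long offsetInChunk = offset % CHUNK_SIZE;
        return info.replicas[0].read(info.chunkHandle, offsetInChunk, length);
    }
}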

Reference: