Installation and Configuration of Hadoop 1.2.0 and Nutch 1.7 (Single and Multi Node Cluster)
This content is a modified version of the post
Links
NutchHadoopTutorial
HDFS Shell Commands
Nutch Wiki
Hadoop Wiki
Running-hadoop-on-ubuntu-linux-single-node-cluster
Running-hadoop-on-ubuntu-linux-multi-node-cluster/
Assumptions
You must first install Java 1.7+, Ant, and OpenSSH.
Preparation
First of all, download the Apache Nutch 1.7 source code and the Hadoop 1.2.0 binary from the Apache download pages.
Create the user "nutch" (adduser should create the /home/nutch directory), then switch to that user and generate an SSH key so Hadoop can log in to localhost without a password.
$ sudo adduser nutch
$ su - nutch
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
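Optionally, verify that passwordless login to localhost now works (a quick check, not part of the original steps), then return to your admin account before running the sudo commands below:
$ ssh localhost
$ exit    # leave the ssh session
$ exit    # return from su to your admin user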
Unpack the downloaded hadoop-1.2.0.tar.gz to /home/nutch/hadoop and give the nutch user ownership of it (run these as your admin user):
$ sudo tar -xvf ~/Downloads/hadoop-1.2.0.tar.gz -C /home/nutch
$ sudo mv /home/nutch/hadoop-1.2.0 /home/nutch/hadoop
$ sudo chown -R nutch /home/nutch/hadoop
$ su - nutch
$ cd ~/hadoop
$ export HADOOP_HOME=/home/nutch/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin
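These exports only last for the current shell; to make them persistent you could append them to the nutch user's ~/.bashrc (an addition to the original steps):
$ echo 'export HADOOP_HOME=/home/nutch/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> ~/.bashrc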
Edit conf/hadoop-env.sh and point it at your Java installation by adding the following line:
export JAVA_HOME=/your_java_home_path
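For example, with OpenJDK 7 on Ubuntu the line might look like the following; the exact path depends on your distribution and JDK (an illustrative value, not from the original post):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64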
Now change the contents of conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/nutch/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/nutch/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch/filesystem/mapreduce/local</value>
</property>
</configuration>
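Hadoop creates the directories referenced above on its own (the name directory is created by the format step below, the data and mapreduce directories by the daemons), but pre-creating the parent paths as the nutch user is an optional way to catch permission problems early:
$ mkdir -p /home/nutch/hadoop/tmp
$ mkdir -p /home/nutch/filesystem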
Skip ahead to the Install Nutch 1.7 section if you only have one server.
(Multi node cluster)
Hadoop Map/Reduce
Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs.
Architecture
The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.
Hadoop DFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations like opening, closing, renaming etc. of files and directories available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
Assume that you have two nodes to run this configuration.
First, edit /etc/hosts on all nodes and add the following lines:
x.x.x.x master1
y.y.y.y slave1
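As a quick check (not part of the original steps), each node should now be able to resolve the others by name:
$ ping -c 1 master1
$ ping -c 1 slave1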
Assuming you configured the master machine as described in the single-node Install Hadoop part above, you will only have to change a few variables.
Important: You have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.
First, we have to change the fs.default.name variable (in conf/core-site.xml) which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.
Now change the contents of conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/nutch/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master1:9000</value>
</property>
</configuration>
conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/nutch/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master1:9001</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch/filesystem/mapreduce/local</value>
</property>
</configuration>
Edit conf/masters
master1
Edit conf/slaves
slave1
Create the user nutch on all servers.
$ sudo adduser nutch
We use master1 as the master and slave1 as the slave. We configure the working environment on master1 and then copy /home/nutch/hadoop to slave1 into the same directory.
$ scp -r ~/hadoop nutch@slave1:/home/nutch/
Now configure SSH. Log in to master1 as the user "nutch", create an SSH key if you have not done so already, then copy the public key to slave1.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ scp ~/.ssh/authorized_keys nutch@slave1:/home/nutch/.ssh/authorized_keys
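To confirm the key exchange worked (an optional check), you should now be able to log in to slave1 from master1 without a password:
$ ssh slave1
$ exit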
Install Nutch 1.7
Download the Nutch source code and unpack it to /home/nutch/nutch (on the master server only, if you are configuring a multi-node cluster).
$ tar -xvzf apache-nutch-1.7-src.tar.gz -C ~
$ mv ~/apache-nutch-1.7 ~/nutch
Copy conf/nutch-default.xml to conf/nutch-site.xml and edit conf/nutch-site.xml: search for the http.agent.name key and set its value to your crawler name. Then copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves from hadoop/conf to nutch/conf.
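A sketch of the first part of that step, assuming the source was unpacked to ~/nutch as above (nano is just an example editor):
$ cd ~/nutch/conf
$ cp nutch-default.xml nutch-site.xml
$ nano nutch-site.xml    # set the value of the http.agent.name property to your crawler name
The Hadoop configuration files are then copied over: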
$ cd ~/hadoop/conf
$ cp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml masters slaves ~/nutch/conf/
Now go to ~/nutch, edit default.properties and change the name from apache-nutch to nutch, so that the build produces nutch-1.7.job rather than apache-nutch-1.7.job.
Build Nutch and add ~/nutch/runtime/local/lib to the CLASSPATH.
$ cd ~/nutch
$ ant runtime
$ export CLASSPATH=$CLASSPATH:/home/nutch/nutch/runtime/local/lib
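The build should leave a job file under runtime/deploy; a quick, optional way to confirm:
$ ls ~/nutch/runtime/deploy/
The listing should include nutch-1.7.job (the exact name depends on the name set in default.properties above).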
Preparing Hadoop to Run (master only, if you have a multi-node configuration)
Go to ~/hadoop and format the namenode.
$ bin/hadoop namenode -format
Make a directory urls in ~/nutch and add the hosts you want to crawl to the file ~/nutch/urls/seed.txt.
$ cd ~/nutch && mkdir urls
$ nano urls/seed.txt
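For example, seed.txt could contain one URL per line, such as:
http://nutch.apache.org/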
Now start Hadoop and put the urls directory into Hadoop's HDFS.
$ cd ~/hadoop
$ bin/start-all.sh
$ bin/hadoop dfs -put ~/nutch/urls urls
You can verify with:
$ bin/hadoop dfs -ls
$ bin/hadoop dfs -cat urls/seed.txt
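You can also confirm that the daemons are running with jps (an optional check; on a multi-node cluster the NameNode and JobTracker run on the master, while the DataNode and TaskTracker run on the slaves):
$ jps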
Run Nutch Job
Then submit the Nutch job to Hadoop
$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 1 -topN 5
You can monitor Hadoop through its web interfaces (replace localhost with master1 on a multi-node cluster):
Hadoop jobs (JobTracker) at http://localhost:50030
Hadoop HDFS (NameNode) at http://localhost:50070
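When the job finishes, the crawl output lands in HDFS under the directory given with -dir; a sketch for checking it and shutting the cluster down afterwards (assuming -dir crawl as above):
$ bin/hadoop dfs -ls crawl
$ bin/stop-all.sh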