Sunday, June 30, 2013

Installation and Configuration of Hadoop 1.2.0 and Nutch 1.7 (Single and Multi-Node Cluster)
This content is a modified version of an earlier post.


You must first install Java 1.7+, Ant and OpenSSH.
Preparation
First of all, download the Apache Nutch 1.7 source code and the Hadoop 1.2.0 binary release from the Apache download archives.



Create the user "nutch", switch to it and generate an SSH key. Creating the user should also create the /home/nutch directory.

$ sudo adduser nutch 
$ su - nutch 
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
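
You can verify that passwordless login now works (it should not ask for a password):

$ ssh localhost
$ exit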



Install Hadoop 1.2.0

(Single-node cluster)


Download the Hadoop binary tar.gz and unpack it to /home/nutch/hadoop:

$ sudo tar -xvf ~/Downloads/hadoop-1.2.0.tar.gz -C /home/nutch 
$ sudo mv /home/nutch/hadoop-1.2.0 /home/nutch/hadoop 
$ sudo chown -R nutch /home/nutch/hadoop 
$ su - nutch 
$ cd ~/hadoop 
$ export HADOOP_HOME=/home/nutch/hadoop 
$ export PATH=$PATH:$HADOOP_HOME/bin


Edit conf/hadoop-env.sh and set JAVA_HOME to your Java installation path:

export JAVA_HOME=/your_java_home_path
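
If you are unsure of the path, the following usually reveals it on Linux (assuming java is on the PATH; strip the trailing /bin/java from the output to get the home directory):

$ readlink -f $(which java)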


Now change the contents of conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml
conf/core-site.xml



<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/home/nutch/hadoop/tmp</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
<name>dfs.name.dir</name>
<value>/home/nutch/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/nutch/filesystem/data</value>
</property>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

</configuration>

conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>

<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/home/nutch/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/home/nutch/filesystem/mapreduce/local</value>
</property>

</configuration>
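
These local paths must be writable by the nutch user. Hadoop creates most of them on its own, but creating them up front is a harmless way to make sure the permissions are right:

$ mkdir -p ~/hadoop/tmp ~/filesystem/name ~/filesystem/data
$ mkdir -p ~/filesystem/mapreduce/local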


If you only have one server, skip ahead to the Install Nutch 1.7 section.

(Multi-node cluster)


Hadoop Map/Reduce

Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs.
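
For a minimal taste of the paradigm (not part of the Nutch setup; this assumes the streaming contrib jar that ships in the Hadoop 1.2.0 tarball and an input_dir that already exists in HDFS), Hadoop Streaming lets ordinary Unix commands act as the map and reduce functions:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.0.jar \
    -input input_dir -output output_dir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Each input line is passed to the mapper, the framework sorts the mapper output by key, and the reducer (here wc) consumes the sorted stream.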



Architecture



The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.


Hadoop DFS


Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.


Architecture



Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations like opening, closing and renaming of files and directories available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
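
Once the daemons are up (the start-up steps come later in this post), you can see this architecture in action by asking the Namenode for a cluster report, which lists every live Datanode with its capacity and usage:

$ bin/hadoop dfsadmin -report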


Assume that you have two nodes to run this configuration.

First, edit /etc/hosts on all nodes and add the following lines (replace x.x.x.x and y.y.y.y with the real IP addresses):
x.x.x.x master1 
y.y.y.y slave1
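
A quick check that the names resolve on each node:

$ ping -c 1 master1
$ ping -c 1 slave1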


Assuming you configured the master machine as described in the single-node Install Hadoop section above, you will only have to change a few variables.
Important: you have to change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.


First, we have to change the fs.default.name variable (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.


Now change the contents of conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml
conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/home/nutch/hadoop/tmp</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://master1:9000</value>
</property>

</configuration>

conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
<name>dfs.name.dir</name>
<value>/home/nutch/filesystem/name</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/home/nutch/filesystem/data</value>
</property>

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

</configuration>

conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>

<property>
<name>mapred.job.tracker</name>
<value>master1:9001</value>
</property>

<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/home/nutch/filesystem/mapreduce/system</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/home/nutch/filesystem/mapreduce/local</value>
</property>

</configuration>



Edit conf/masters (in Hadoop 1.x this file lists the host that runs the secondary namenode):
master1

Edit conf/slaves (the hosts that run a datanode and a tasktracker):
slave1




Create the user nutch on all servers.

$ sudo adduser nutch



We use master1 as the master and slave1 as the slave. Configure the working environment on master1 and copy /home/nutch/hadoop to slave1 into the same directory:



$ scp -r ~/hadoop nutch@slave1:/home/nutch/



Now configure SSH. Log in to master1 as the user "nutch", create an SSH key if you have not done so already, then copy the public key to slave1.

$ ssh-keygen -t rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ scp ~/.ssh/authorized_keys nutch@slave1:/home/nutch/.ssh/authorized_keys
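
Verify that master1 can now reach slave1 without a password (it should print the slave's hostname without prompting):

$ ssh slave1 hostname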



Install Nutch 1.7


Download the Nutch source code and unpack it to /home/nutch/nutch (on the master server only, if you are configuring a multi-node cluster):


$ tar -xvzf apache-nutch-1.7-src.tar.gz -C ~ 
$ mv ~/apache-nutch-1.7 ~/nutch


Copy conf/nutch-default.xml to conf/nutch-site.xml and edit conf/nutch-site.xml: search for the http.agent.name key and set its value to your crawler's name (see the example after the commands below). Then copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves from hadoop/conf to nutch/conf.


$ cd ~/hadoop/conf 
$ cp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml masters slaves ~/nutch/conf/
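
For illustration, the edited property in conf/nutch-site.xml might look like this ("MyNutchCrawler" is only a placeholder; pick your own agent name):

<property>
<name>http.agent.name</name>
<value>MyNutchCrawler</value>
</property>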


Now go to ~/nutch, edit default.properties and change the name property from apache-nutch to nutch.

Build Nutch and add ~/nutch/runtime/local/lib to the CLASSPATH:


$ cd ~/nutch 
$ ant runtime 
$ export CLASSPATH=$CLASSPATH:/home/nutch/nutch/runtime/local/lib
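
After a successful build, ~/nutch/runtime/deploy should contain the nutch-1.7.job file that is submitted to Hadoop below (the exact listing may vary):

$ ls ~/nutch/runtime/deploy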




Preparing Hadoop to Run (master only, if you have a multi-node configuration)


 Go to ~/hadoop and format the namenode.


$ bin/hadoop namenode -format


Make a directory urls in ~/nutch and add the hosts you want to crawl to the file ~/nutch/urls/seed.txt.

$ cd ~/nutch && mkdir urls 
$ nano urls/seed.txt
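
For example, seed.txt could contain a single start URL per line (this one is just a placeholder):

http://nutch.apache.org/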



Now start Hadoop and put the urls directory into HDFS:

$ cd ~/hadoop 
$ bin/start-all.sh 
$ bin/hadoop dfs -put ~/nutch/urls urls


You can verify with:
$ bin/hadoop dfs -ls
$ bin/hadoop dfs -cat urls/seed.txt


Run Nutch Job


Then submit the Nutch crawl job to Hadoop (the output goes to the crawl directory in HDFS):

$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 1 -topN 5
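
When the job finishes, you can print summary statistics of the resulting crawldb (CrawlDbReader is the class behind the nutch readdb command; this assumes the -dir crawl output directory used above):

$ bin/hadoop jar /home/nutch/nutch/runtime/deploy/nutch-1.7.job org.apache.nutch.crawl.CrawlDbReader crawl/crawldb -stats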



You can monitor Hadoop through its web interfaces (on a multi-node cluster, replace localhost with master1):
Hadoop jobs: http://localhost:50030
Hadoop HDFS: http://localhost:50070
