1. Download Nutch 0.7.2. Version 0.9 has a compatibility problem with Lucene that raises
ArrayIndexOutOfBoundsException.
A fix for that problem is described at http://blog.sina.com.cn/s/blog_537c07f6010009t9.html
2. Create a new file that holds the seed URL
$vi urls/site.txt
Enter a URL, for example "http://www.cnn.com"
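The seed file can also be created without opening an editor. This is a minimal sketch, assuming the current directory is the Nutch install root and that one URL per line is all the file needs:

```shell
# Create the urls directory if it does not exist yet.
mkdir -p urls
# Write the single seed URL used in this note into the seed file.
echo "http://www.cnn.com" > urls/site.txt
# Print the file back to confirm its content.
cat urls/site.txt
```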
3. Modify the URL filter configuration file
$vi conf/crawl-urlfilter.txt
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
===== change to =====
+^http://([a-z0-9]*\.)*cnn.com/
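To see which URLs the pattern lets through, it can be tried outside Nutch. The sketch below uses grep -E as a stand-in for Nutch's own regex engine, which is an assumption (the two agree on a pattern this simple, but they are not the same engine). The helper name matches and the sample URLs are made up for illustration. The leading + of the filter line means "accept" and is not part of the regex.

```shell
# The regex part of the filter line, without the leading "+".
pattern='^http://([a-z0-9]*\.)*cnn.com/'

# Hypothetical helper: succeeds if the URL matches the pattern.
matches() { printf '%s\n' "$1" | grep -Eq "$pattern"; }

for url in "http://www.cnn.com/" "http://edition.cnn.com/world" \
           "http://www.bbc.co.uk/" "http://www.cnn.com"; do
  if matches "$url"; then echo "accept $url"; else echo "reject $url"; fi
done
# accept http://www.cnn.com/
# accept http://edition.cnn.com/world
# reject http://www.bbc.co.uk/
# reject http://www.cnn.com
```

The last case shows that the pattern requires a slash after the host name, so the bare string "http://www.cnn.com" does not match on its own; whether Nutch adds the slash before filtering is not checked here. The dot in cnn.com is also unescaped, so it matches any character, as in the MY.DOMAIN.NAME template line.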
4. Run Nutch
$bin/nutch crawl urls/site.txt -dir cnn.com -depth 3 -threads 4 >& cnn.log
Then wait for the program to finish.
5. View the result
$bin/nutch readdb cnn.com/db -stats
It may show something like the following:
---------------
071026 174805 parsing file:/root/nutch-0.7.2/conf/nutch-default.xml
071026 174805 parsing file:/root/nutch-0.7.2/conf/nutch-site.xml
071026 174805 No FS indicated, using default:local
Stats for org.apache.nutch.db.WebDBReader@1a16869
-------------------------------
Number of pages: 1096
Number of links: 5023
------------------
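As a quick sanity check on the crawl, the two counters can be pulled out of the stats output and turned into an average number of links per page. This is only a sketch: the sample text is copied from the output above, and in practice the readdb command would be piped in instead of the stats variable.

```shell
# Sample lines copied from the readdb output shown above.
stats='Number of pages: 1096
Number of links: 5023'

# Split each line on ": " and keep the numeric value of each counter.
ratio=$(printf '%s\n' "$stats" | awk -F': ' '
  /^Number of pages/ { pages = $2 }
  /^Number of links/ { links = $2 }
  END { if (pages > 0) printf "%.2f", links / pages }')

echo "links per page: $ratio"
# links per page: 4.58
```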
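6. Configure Tomcat
Edit /webapps/nutch/WEB-INF/classes/nutch-site.xml
Add the following path (presumably as the value of the searcher.dir property, which tells the Nutch search webapp where the crawl data lives):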
/home/nic/dev/nutch-0.7.2/testsite/
7. In many cases it throws ClassNotFoundException, ClassNotInistialException, NutchBeanException, etc. I think these are caused by version mismatches.
