What is a good Java-based crawler for an academic project regarding building a search engine?


I have spent the last two days looking for a crawler that suits my needs. I want to build a search engine and do the indexing myself, as part of an academic project. Although I do not have the processing power to crawl the entire web, I would like to use a crawler that is in principle capable of doing so. What I am looking for is a crawler that:

  1. supports multithreading
  2. doesn't miss many links
  3. lets me access the content of the crawled pages (for example by overriding a method), so that I can save it, parse it, etc.
  4. obeys robots.txt files
  5. crawls HTML pages (including pages served by PHP, JSP, etc.).
  6. recognizes pages with the same content and returns only one of them.
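
Requirement 6 can be illustrated with a minimal sketch, assuming that "same content" means identical text once whitespace and letter case are normalized. The class `ContentDeduplicator` and its methods are hypothetical names made up for this illustration and are not part of crawler4j, Nutch, or any other crawler mentioned here. Near-duplicates (pages that differ only in a timestamp or an ad block) would need a different technique, such as shingling, which this sketch does not attempt.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not an API of any crawler library.
final class ContentDeduplicator {
    // SHA-256 fingerprints (hex strings) of every page text seen so far.
    private final Set<String> seenFingerprints = new HashSet<>();

    /** Returns true the first time a given normalized text is seen, false for repeats. */
    synchronized boolean isNew(String pageText) {
        return seenFingerprints.add(fingerprint(pageText));
    }

    /** Normalizes whitespace and case, then hashes the result with SHA-256. */
    static String fingerprint(String pageText) {
        String normalized = pageText.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

A crawler callback would call `isNew(text)` on each fetched page and skip saving or parsing when it returns false. Storing only the fingerprint keeps memory use independent of page size.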

What it does not (necessarily) have to do is:

  1. support PageRank-style ranking.
  2. index the results.
  3. crawl images, audio, video, PDF files, etc.

I found a few libraries/projects that came very close to my needs, but as far as I know they don't support everything I need:

  1. First I came across crawler4j. Its only problem is that it does not support a politeness interval per host. Setting the politeness delay to a decent value of 1000 ms therefore makes the crawler terribly slow.
  2. I also found flaxcrawler. It does support multithreading, but it appears to have problems finding and following links in web pages.
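
To make "politeness interval per host" concrete, here is a minimal sketch of the idea, assuming a fixed delay that applies separately to each host, so that requests to different hosts do not slow each other down. `HostPolitenessTracker` and `reserveSlot` are hypothetical names invented for this illustration; they do not describe how crawler4j, flaxcrawler, or any other library actually schedules requests.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper for illustration only; not an API of any crawler library.
final class HostPolitenessTracker {
    private final long delayMillis;
    // Maps a lower-cased host name to the earliest time (ms) its next fetch may start.
    private final Map<String, Long> nextAllowedFetch = new ConcurrentHashMap<>();

    HostPolitenessTracker(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Reserves the next fetch slot for the URL's host and returns how many
     * milliseconds the caller should wait before fetching. The current time is
     * passed in (e.g. System.currentTimeMillis()) so the logic is easy to test.
     */
    long reserveSlot(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        long[] wait = new long[1];
        // compute() runs atomically per key, so worker threads cannot both
        // claim the same slot for one host.
        nextAllowedFetch.compute(host.toLowerCase(Locale.ROOT), (h, next) -> {
            long start = (next == null) ? nowMillis : Math.max(nowMillis, next);
            wait[0] = start - nowMillis;
            return start + delayMillis;
        });
        return wait[0];
    }
}
```

With a 1000 ms delay, two URLs on different hosts both get a wait of 0, while a second URL on the same host gets a wait of 1000 ms. A multithreaded crawler built this way stays polite to each individual server while its overall throughput grows with the number of distinct hosts in the queue.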

I also looked at more complete and complex 'crawlers' such as Heritrix and Nutch. I am not that experienced with more complex tools, but I am definitely willing to use one if I can be sure it will do what I need: crawl the web and give me all the pages so that I can read them.

Long story short: I am looking for a crawler that goes through pages on the web very quickly and lets me do something with them.


There is 1 solution below


As far as I know, Apache Nutch meets most of your requirements. Nutch also has a plugin architecture, which helps if you need to write your own extensions. You can go through the wiki [0] and ask on the mailing list if you have any questions.

[0] http://wiki.apache.org/nutch/FrontPage