Crawling a site's HTML and JavaScript in a similar fashion to Googlebot


I'm trying to automatically crawl a given site by following all internal links. To do this I've been playing with Python's mechanize library, although it doesn't let me work with JavaScript and AJAX content.

How do Googlebot and other major search engine spiders handle this? Is there another tool out there that can complement mechanize in this scenario?

I'm aware I could reverse engineer the JavaScript to work out what it's doing and then mimic that, but I want to automate the crawl, so it wouldn't be practical if I first had to comb through each site's JavaScript.


BEST ANSWER

To implement such a spider, there are some questions to settle before you start:

  • Just want to follow all the links on a page automatically?
    This is easy. When you fetch a page, parse it, collect the href values from all <a> tags, and then emit requests for those new URLs.
    If you don't want to hand-roll it, Scrapy's CrawlSpider will do this work automatically. It's also easy to do with requests and lxml (see the first sketch after this list).
    This is a simple problem to solve.
  • Want to execute the JavaScript on each page?
    This is a bigger problem, but there are good tools for it, such as PhantomJS, QtWebKit, and Selenium (see the second sketch after this list).
    I don't know how Google handles this problem, but a more advanced approach is to modify the core of Chromium or Firefox. It's a little harder but may improve the efficiency of your spider to a large degree.
  • What's your purpose in building such a spider?
    Crawling pages to build a search engine like Google? Crawling articles, books, or videos for personal use? Once you know what you want the spider to do, you'll know how to implement it.
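
For the first point, here is a minimal sketch of a breadth-first crawler using requests and lxml. The start URL is a placeholder, and the netloc comparison is just one simple way to keep the crawl internal; a real spider would also respect robots.txt and rate-limit itself:

    import requests
    from lxml import html
    from urllib.parse import urljoin, urlparse

    def crawl(start_url, max_pages=100):
        """Breadth-first crawl that only follows links on the start URL's domain."""
        domain = urlparse(start_url).netloc
        seen, queue, fetched = {start_url}, [start_url], 0
        while queue and fetched < max_pages:
            url = queue.pop(0)
            fetched += 1
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip pages that fail to load
            if "text/html" not in resp.headers.get("Content-Type", ""):
                continue  # skip images, PDFs, etc.
            tree = html.fromstring(resp.content)
            # Collect every <a href> and resolve relative links against this page.
            for href in tree.xpath("//a/@href"):
                link = urljoin(url, href).split("#")[0]  # drop fragments
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            yield url

    for page in crawl("https://example.com/"):  # placeholder start URL
        print(page)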
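
For the JavaScript problem, a headless browser lets you read the DOM after scripts have run. Here is a minimal Selenium sketch, assuming Chrome and a matching chromedriver are installed; the URL is again a placeholder:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)

    driver.get("https://example.com/")  # placeholder URL
    # find_elements runs against the DOM *after* JavaScript has executed,
    # so links injected by AJAX are visible here too.
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")]
    driver.quit()
    print(links)

You could feed the links Selenium discovers back into a queue like the one in the first sketch, rendering only the pages that need it, since a headless browser is much slower than a plain HTTP fetch.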

There are some common problems when crawling a site, and being aware of them will help you implement a robust spider.