I'm trying to automatically crawl a given site by following all of its internal links. To do this I've been playing with Python's mechanize library, although it doesn't let me work with JavaScript and AJAX content.

How do Googlebot and the other major search engine spiders/bots do this? Is there another tool out there that can complement mechanize in this scenario?
I'm aware I could reverse engineer the JavaScript to work out what it's doing and then mimic that, but I want to automate the crawl, so it wouldn't be practical if I first had to comb through each site's JavaScript.
To implement such a big spider, there are a few problems to solve first:
Following links is the easy part. When you fetch a page, parse it, get the href values of all <a> tags, and then emit requests for those new URLs. If you don't want to hardcode it, the CrawlSpider class in Scrapy will do the work automatically, and it's just as easy with requests and lxml. This is a simple problem to solve (see the sketch below).
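Here is a minimal sketch of the requests + lxml approach, assuming a breadth-first crawl that stays on the start URL's domain; the start URL, the page limit, and the same-domain check are just illustrative choices:

```python
# Minimal breadth-first crawler using requests + lxml.
# The start URL below is a placeholder; swap in the site you want to crawl.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from lxml import html

def crawl(start_url, max_pages=100):
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])

    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to load

        tree = html.fromstring(response.content)
        # Collect the href of every <a> tag and resolve it against the page URL.
        for href in tree.xpath('//a/@href'):
            link = urljoin(url, href)
            # Follow only internal links we haven't already queued.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
        yield url, response

if __name__ == '__main__':
    for page_url, _ in crawl('https://example.com'):
        print(page_url)
```

Scrapy's CrawlSpider with a LinkExtractor rule gives you roughly the same behaviour, plus scheduling, duplicate filtering, and politeness settings out of the box.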
Handling JavaScript and AJAX content is the big problem, but there are some good tools for it, such as PhantomJS (and similar headless browsers), QtWebKit, and Selenium. I don't know how Google handles this, but another, more advanced way is to modify the core of Chromium or Firefox; it's a little harder, but it can improve the efficiency of your spider to a large degree.
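As a rough sketch of the Selenium route (the URL, the wait time, and the choice of Chrome are placeholder assumptions), you let a real browser execute the JavaScript and then hand the rendered HTML back to the same parsing code as above:

```python
# Render a JavaScript-heavy page with Selenium, then parse the resulting DOM.
# Assumes Chrome + chromedriver are installed; Firefox works the same way via webdriver.Firefox().
import time

from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')      # run without opening a browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://example.com')   # placeholder URL
    time.sleep(2)                       # crude wait for AJAX content; a WebDriverWait
                                        # on a specific element is more robust
    rendered = driver.page_source       # HTML after the JavaScript has run
finally:
    driver.quit()

# From here it's the same as with requests/lxml: pull the links out of the rendered DOM.
tree = html.fromstring(rendered)
print(tree.xpath('//a/@href'))
```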
Finally, what do you want the spider for? Crawling pages to build a search engine like Google? Crawling some articles, books, or videos for personal use? Once you know what you want to do with the spider, you know how to implement it.
These are the main problems that come up when crawling a site, and thinking them through should help you implement a robust spider.