I am trying to implement a crawler in Colly that collects articles from a news source. I have given it two start URLs. The behaviour I want is that, through pagination and outlinks, Colly keeps crawling for web pages (potentially indefinitely).
However, the behaviour I am observing is that Colly stops visiting one level down: it never paginates past the first page. I am certain the links, XPath expressions, etc. are correct. I also set the maximum depth to 10 when initializing the collector, but it does not seem to work as expected.
Here is my code:
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Instantiate default collector
    c := colly.NewCollector(
        // Restrict crawling to the target news domain
        colly.AllowedDomains("www.zerohedge.com"),
        // Cache responses to prevent multiple downloads of pages
        // even if the collector is restarted
        colly.CacheDir("./cache"),
        colly.MaxDepth(10),
        colly.Async(true),
    )
    // On every article title link, follow the link to the article page
    c.OnXML("//h2[contains(@class,'Article_title')]//a[@href]", func(e *colly.XMLElement) {
        link := e.Attr("href")
        // start scraping the page under the link found
        e.Request.Visit(link)
    })
c.OnXML("//div[contains(@class, 'SimplePaginator')]//a[@href]", func(e *colly.XMLElement) {
link := e.Attr("href")
// fmt.Println(link)
// start scraping the page under the link found
e.Request.Visit(link)
})
c.OnXML("//header[contains(@class, 'ArticleFull')]//h1/text()", func(e *colly.XMLElement) {
// fmt.Println(e.Text)
// start scraping the page under the link found
})
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })
c.Visit("https://www.zerohedge.com/covid-19")
c.Visit("https://www.zerohedge.com/medical")
c.Wait()
}
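For reference, colly's Request.Visit returns an error when a visit is filtered out (for example by the depth limit, the already-visited check, or the domain filter), and I am silently dropping that error above. A minimal sketch of logging it in the paginator callback, to see why links stop being followed:

c.OnXML("//div[contains(@class, 'SimplePaginator')]//a[@href]", func(e *colly.XMLElement) {
    link := e.Attr("href")
    // Print the reason when a visit is skipped (depth limit, revisit filter, disallowed domain, ...).
    if err := e.Request.Visit(link); err != nil {
        fmt.Println("skipped", link, ":", err)
    }
})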
Any help in this would be appreciated.
After a bit of investigation, I found out that the SimplePaginator element does not always exist. The website is doing some trickery to stop crawlers.
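If the paginator markup cannot be relied on, a possible workaround is to enqueue the listing pages directly instead of following SimplePaginator links. This is only a sketch under an assumption: it presumes the section pages accept a page query parameter such as ?page=N, which may not match how the site actually paginates.

sections := []string{
    "https://www.zerohedge.com/covid-19",
    "https://www.zerohedge.com/medical",
}
for _, section := range sections {
    // Hypothetical ?page=N parameter; adjust to the site's real pagination scheme.
    for page := 1; page <= 20; page++ {
        c.Visit(fmt.Sprintf("%s?page=%d", section, page))
    }
}
c.Wait()

Generating the listing URLs up front leaves the article callback to do the actual scraping, so the crawl no longer depends on the paginator element being present in the page.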