Parsing nested elements using go-colly scraper


I'm using go-colly to scrape data from a webpage.

I'm unable to extract the image's src attribute from this nested HTML element.

    c.OnHTML(".result-row", func(e *colly.HTMLElement) {
        goquerySelection := e.DOM
        fmt.Println(goquerySelection.Find("img").Attr("src"))
...

The .result-row selector works for other things, such as:

    link := e.ChildAttrs("a", "href")

and

    e.ChildText(".result-price")

How can I get the nested image src value?


There is 1 best solution below


If I understood correctly, the following solution should meet your needs. First, the code:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(colly.AllowedDomains(
        "santabarbara.craigslist.org",
    ))

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Response Code:", r.StatusCode)
    })

    // Runs once for every <img> element found on the page.
    c.OnHTML("img", func(h *colly.HTMLElement) {
        imgSrc := h.Attr("src")
        // Swap whichever thumbnail size token is present for the largest size.
        imgSrc = strings.Replace(imgSrc, "50x50c", "1200x900", 1)
        imgSrc = strings.Replace(imgSrc, "300x300", "1200x900", 1)
        imgSrc = strings.Replace(imgSrc, "600x450", "1200x900", 1)
        fmt.Println(imgSrc)
    })

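    // Visit returns an error (e.g. for a disallowed domain or a failed request), so report it.
    if err := c.Visit("https://santabarbara.craigslist.org/apa/7570100710.html"); err != nil {
        fmt.Println("Visit failed:", err)
    }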
}

After selecting every image on the page, you have to replace the thumbnail size token in each URL with the largest one (1200x900 in this case). I found these sizes in a script tag near the bottom of the page.
The rest should be pretty straightforward. Let me know if this solves your issue or if you need anything else, thanks!
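
If it helps, the size replacement can be pulled out into a small standalone function so it can be checked without making any requests. This is only a minimal sketch: the helper name `largestImageURL`, the `sizeTokens` slice, and the example URL are made up for illustration, and the size tokens are just the three mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// sizeTokens holds the thumbnail size markers mentioned in the answer.
// The list is an assumption; the page may use other sizes as well.
var sizeTokens = []string{"50x50c", "300x300", "600x450"}

// largestImageURL is a hypothetical helper: it replaces the first known
// size token found in src with "1200x900" and returns src unchanged
// when none of the tokens is present.
func largestImageURL(src string) string {
	for _, tok := range sizeTokens {
		if strings.Contains(src, tok) {
			return strings.Replace(src, tok, "1200x900", 1)
		}
	}
	return src
}

func main() {
	// Made-up URL, used only to show the substitution.
	fmt.Println(largestImageURL("https://images.example.com/00a0a_abc_50x50c.jpg"))
}
```

Inside the OnHTML callback you could then write `fmt.Println(largestImageURL(h.Attr("src")))` instead of the three strings.Replace calls.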