Unable to select an option from a dropdown when web scraping with gocolly/colly

I want to scrape data from the public website below using Go and gocolly/colly:

https://eds.ospi.k12.wa.us/BusDepreciation/default.aspx?pageName=busSearch

For the above website, I want to select all the "School District" options available in the dropdown one by one and scrape all the data. So far I am able to scrape only the HTML of the page but I am not able to find any way to select the dropdown options to get the data for different options.

My Go code

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    // Instantiate default collector
    c := colly.NewCollector()

    c.OnHTML("tbody tr", func(e *colly.HTMLElement) {
        fmt.Printf("BODY----%+v\n", e)

    })

    c.Visit("https://eds.ospi.k12.wa.us/BusDepreciation/default.aspx?pageName=busSearch")

}

I would appreciate it if anyone could point me to the relevant documentation. If this is not possible with gocolly/colly, please suggest another option in Go or Python for selecting the dropdown options.

I would also like to know whether Selenium is a reasonable alternative for scraping a large amount of data, as in this scenario. If so, how can it be done in Go or Python? Or should we use Scrapy instead?

There is 1 best solution below


I was able to achieve what you're struggling with through the following code:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Instantiate default collector
    c := colly.NewCollector(colly.AllowedDomains(
        "eds.ospi.k12.wa.us",
    ))

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Response Code:", r.StatusCode)
    })

    c.OnHTML("#Content_ctl00_organizationDropDowns_lstDistrict", func(e *colly.HTMLElement) {
        selectDiv := e.DOM
        options := selectDiv.Children()
        fmt.Println(strings.Repeat("#", 50))
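        for _, v := range options.Nodes {
            // An <option> with no text has no child node; skip it to avoid a nil dereference.
            if v.FirstChild == nil {
                continue
            }
            fmt.Println(v.FirstChild.Data)
        }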
    })

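    // Visit returns an error (for example on a network failure), so report it.
    if err := c.Visit("https://eds.ospi.k12.wa.us/BusDepreciation/default.aspx?pageName=busSearch"); err != nil {
        fmt.Println("Visit failed:", err)
    }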
}

The relevant changes can be summarized in the following list:

  1. I restricted the collector to the target domain with AllowedDomains when initializing it, which keeps the crawl from following links to other sites.
  2. In the OnRequest callback, I set the User-Agent header. Some sites reject requests that do not look like they come from a browser, so this helps when scraping sites with such restrictions.
  3. In the OnHTML callback, I selected the node by its id, "Content_ctl00_organizationDropDowns_lstDistrict". I then used the DOM field to get the underlying selection, and the Children method to get all of its child nodes, which are the options you're interested in.
  4. Lastly, I print the Data field of each option's first child node, which is the text node holding the option's visible label.
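
The code above only lists the options; it does not select one. On a form-based page, selecting an option usually amounts to sending a POST request that contains the dropdown's value together with the page's hidden form fields. The following is a minimal sketch of how that form data could be assembled, not a verified solution for this site. The function name buildFormData, the field name "districtSelect", and all values are hypothetical, and the presence of __VIEWSTATE and __EVENTVALIDATION fields is an assumption based on the page being an .aspx page; the real names and values have to be read from the fetched HTML.

```go
package main

import "fmt"

// buildFormData returns the fields for one form submission: a copy of the
// hidden inputs scraped from the page plus the chosen dropdown value.
// selectName is the <select> element's name attribute (not its id).
// This helper is hypothetical and not part of colly.
func buildFormData(hidden map[string]string, selectName, optionValue string) map[string]string {
	data := make(map[string]string, len(hidden)+1)
	// Copy so that the caller's map can be reused for the next option.
	for k, v := range hidden {
		data[k] = v
	}
	data[selectName] = optionValue
	return data
}

func main() {
	// Placeholder values (assumptions); real ones must come from the page.
	hidden := map[string]string{
		"__VIEWSTATE":       "example-viewstate",
		"__EVENTVALIDATION": "example-validation",
	}
	form := buildFormData(hidden, "districtSelect", "123")
	fmt.Println(len(form), form["districtSelect"])
}
```

With colly, a map like this can be passed to the collector's Post method, which takes a URL and a map[string]string. Whether the site accepts the request depends on which fields its form actually requires, so compare against the request your browser sends when you pick an option.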

Undoubtedly, the code can still be improved but it should be a good starting point to scrape what you need. Let me know if this solves your issue!