I'm scraping this page and only want the library links, but the code I wrote extracts every link and I haven't managed to filter them. (I'm parsing the URLs for later use with the GitHub API, so I only need the path part.
I don't want to parse links that are of no use to me, to avoid unnecessary work, so I only need the library links.)
	type repo struct {
		Link string `json:"link"`
		Name string `json:"name"`
	}

	allRepos := make([]repo, 0)

	collector := colly.NewCollector(
		colly.AllowedDomains("github.com"))

	collector.OnHTML("ul", func(e *colly.HTMLElement) {
		r := repo{}
		r.Link = e.ChildAttr("a", "href")
		url, _ := url.Parse(r.Link)
		repos := repo{
			Link: url.Path,
		}
		allRepos = append(allRepos, repos)
	})

	collector.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Sends HTTP requests to the server
	collector.Visit("https://github.com/avelino/awesome-go/blob/main/README.md")

	fmt.Println(allRepos)
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "\t")

	//githubApi := "https://api.github.com/repos"
	for _, repos := range allRepos {
		fmt.Println(repos.Link)
	}
I managed to get what you need. Let me share my code with you:
In the above code snippet you can see how I set up the callbacks to scrape the GitHub repo.
The relevant changes were made in the OnHTML method. Here, we used a jQuery-style selector to get all of the li elements below the article and ul tags. Then you have to range over the underlying nodes and get each one's FirstChild, which will always be an a tag. Grab its href attribute and append it to the repos variable.

I hope that this solves your issue. Let me know, or share your solution if you've already found another one yourself!
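Since the goal is to keep only library links and their path parts, an extra guard on each href can help regardless of which selector you end up with. Below is a minimal sketch using only the standard library. The helper name repoPath is hypothetical (it is not part of colly), and the rule it applies, "host is github.com and the path has exactly two segments", is an assumption about what counts as a library link. It is a heuristic, so other two-segment GitHub URLs would pass it too.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// repoPath is a hypothetical helper (not part of colly). It reports whether
// href looks like https://github.com/<owner>/<repo> and, if so, returns the
// path part "/<owner>/<repo>" for later use with the GitHub API.
func repoPath(href string) (string, bool) {
	u, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	// In-page anchors (e.g. "#web-frameworks") and relative links have no
	// host, and links to other sites have a different one.
	if u.Host != "github.com" {
		return "", false
	}
	// Assumption: a repository link has exactly two path segments.
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", false
	}
	return "/" + parts[0] + "/" + parts[1], true
}

func main() {
	// Sample hrefs of the kinds that can appear in a README.
	hrefs := []string{
		"https://github.com/gin-gonic/gin",
		"#web-frameworks",
		"https://golang.org/",
		"https://github.com/avelino/awesome-go/blob/main/CONTRIBUTING.md",
	}
	for _, h := range hrefs {
		if p, ok := repoPath(h); ok {
			fmt.Println(p)
		}
	}
}
```

Inside the OnHTML callback, each href you grab could go through repoPath first, and only the ones that return true would be appended to the slice. That also avoids ignoring the error from url.Parse.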