Gocolly scraping only certain links


While scraping this page (https://github.com/avelino/awesome-go/blob/main/README.md), I only want to scrape the library links, but the code I wrote extracts all the links and I couldn't manage to filter them. (I'm parsing the URLs for later use with the GitHub API,

http://api.github.com/repos/[username]/[reponame]

so I only need the path parts. I don't want to parse links that are useless to me, to avoid unnecessary work, so I only need the library links.)
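To make the later API step concrete, here is a minimal sketch of the path-to-API-URL conversion I have in mind. The apiURL helper is a hypothetical name of my own, not part of the scraper yet; it assumes a library link is exactly an owner/repo path on github.com:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// apiURL converts a repository link such as
// https://github.com/gocolly/colly into the matching GitHub API URL
// https://api.github.com/repos/gocolly/colly.
// It returns false for links that are not a plain owner/repo path.
func apiURL(link string) (string, bool) {
	u, err := url.Parse(link)
	if err != nil || u.Host != "github.com" {
		return "", false
	}
	// Trim the surrounding slashes, then require exactly two path segments.
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 2 || parts[0] == "" {
		return "", false // e.g. /avelino/awesome-go/blob/main/README.md is rejected
	}
	return "https://api.github.com/repos/" + parts[0] + "/" + parts[1], true
}

func main() {
	for _, link := range []string{
		"https://github.com/gocolly/colly",
		"https://github.com/avelino/awesome-go/blob/main/README.md",
		"https://pkg.go.dev/net/url",
	} {
		if api, ok := apiURL(link); ok {
			fmt.Println(api)
		}
	}
}
```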

type repo struct {
    Link string `json:"link"`
    Name string `json:"name"`
}

allRepos := make([]repo, 0)
collector := colly.NewCollector(
    colly.AllowedDomains("github.com"))

collector.OnHTML("ul", func(e *colly.HTMLElement) {
    link := e.ChildAttr("a", "href")
    // Name the variable "u" so it doesn't shadow the net/url package,
    // and skip links that fail to parse instead of ignoring the error.
    u, err := url.Parse(link)
    if err != nil {
        return
    }
    allRepos = append(allRepos, repo{Link: u.Path})
})

collector.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
})
// Sends HTTP requests to the server
collector.Visit("https://github.com/avelino/awesome-go/blob/main/README.md")

fmt.Println(allRepos)
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", "\t")
//githubApi := "https://api.github.com/repos"
for _, repos := range allRepos {
    fmt.Println(repos.Link)
}
1 Answer
I managed to do what you need. Let me share my code with you:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

type Repo struct {
    Link string `json:"link"`
    Name string `json:"name"`
}

func main() {
    repos := []Repo{}
    c := colly.NewCollector(colly.AllowedDomains(
        "github.com",
    ))

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Response Code:", r.StatusCode)
    })

    // Grab each list item under the rendered README; its first child is the "a" tag.
    c.OnHTML("article>ul>li", func(h *colly.HTMLElement) {
        listItem := h.DOM
        for _, v := range listItem.Nodes {
            for _, a := range v.FirstChild.Attr {
                if a.Key == "href" && strings.Contains(a.Val, "github.com") {
                    repos = append(repos, Repo{Link: a.Val, Name: v.FirstChild.FirstChild.Data})
                }
            }
        }
    })

    c.Visit("https://github.com/avelino/awesome-go/blob/main/README.md")

    for _, v := range repos {
        fmt.Printf("%v\t%v\n", v.Name, v.Link)
    }
}

In the code snippet above you can see how I set up the callbacks to scrape the GitHub page.
The relevant change is in the OnHTML callback. Here, we use a CSS selector (goquery under the hood) to get every li element below the article and ul tags. Then you range over the underlying nodes and take the FirstChild, which will always be an a tag. Finally, you grab its href attribute and append the link you just found to the repos slice.

Note: as you are only interested in GitHub repos, I added a condition to the if statement to exclude the irrelevant links. If you plan to remove this check, pay attention to the links, as you also have to deal with the page's in-page navigation links such as page#section-1.
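If you want something stricter than strings.Contains, here is a hedged sketch that filters with net/url instead. The isRepoLink name is mine, and it assumes a library link is exactly an owner/repo path with no fragment:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isRepoLink reports whether href points at a GitHub repository page
// (exactly /owner/repo). It rejects in-page anchors such as #section-1,
// other hosts, and deeper paths like /owner/repo/blob/main/README.md.
func isRepoLink(href string) bool {
	u, err := url.Parse(href)
	if err != nil {
		return false
	}
	if u.Host != "github.com" || u.Fragment != "" {
		return false
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	return len(parts) == 2 && parts[0] != ""
}

func main() {
	for _, h := range []string{
		"https://github.com/gocolly/colly",
		"#contents",
		"https://github.com/avelino/awesome-go/blob/main/README.md",
	} {
		fmt.Println(h, isRepoLink(h))
	}
}
```

You could call isRepoLink(a.Val) inside the OnHTML callback in place of the strings.Contains check.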

I hope this solves your issue. Let me know, or share your solution if you've already found another one by yourself!