How to properly scrape 1 URL at a time with an attached attribute?


I am looking to scrape multiple website domains for various hrefs within their careers pages.

I only want the links to the jobs and nothing else, and the easiest way I have found to do that is to parse the Scrapy response and pull the hrefs from a specific CSS path.

So far my solution is to create a dictionary for each site with two generic keys, URL and Attribute, which hold the careers page URL and a pre-identified CSS path.

I am going to create multiple dictionaries automatically in the future from a file of data.

I am storing all of these dictionaries in a list in Python and my plan was to call each dictionary, one at a time, from the list and to use the associated URL and attribute as the required input for scrapy.

# Each list contains two strings:
# the website's careers URL, and
# the CSS location of the jobs container on that page.
# The below is an example, but I will name the lists 1, 2, 3, etc. so I can refer to them more easily from a database.
List1= ["https://exampleurl.com/careers", ".joblist a::attr(href)"]
List2 = ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"]
Dicti = {"URL" : List1[0], "Att" : List1[1]}

This is essentially how I have the list of dictionaries set up.

I am then using

start_urls = [
        List1[Dicti["URL"]],
        List2[Dicti["URL"]]
    ]

I am then also parsing the data like so,

jobs = response.css(Dicti["Att"]).extract()

I think this is potentially where I am going wrong. Although it does load each URL and scrape the HTML from each one, it then isn't extracting from the attributes correctly.

I tried scraping the lists one at a time, with only one list in start_urls. This works perfectly; it's when I try to put more than one list into start_urls that it breaks.

What exactly am I doing wrong? Maybe I misunderstand how the spider works. I essentially want to run List1, then stop the spider and run a new instance for List2, all while saving the extracted data.

Any advice on how to overcome this would be massively appreciated.

There is 1 best solution below.


Either organize your data as a list of short lists

urls = [
    ["https://exampleurl.com/careers", ".joblist a::attr(href)"],
    ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"],
    ...
]

and then iterate over urls and access the components like

for entry in urls:
    url = entry[0]
    attribute = entry[1]

or shorter

for url, attribute in urls:
    ...

or make a list of small dictionaries

urls = [
    {'URL': "https://exampleurl.com/careers", 'ATT': ".joblist a::attr(href)"},
    {'URL': "https://exampleurl.com/en/Company/Career-Opportunities", 'ATT': ".content a::attr(href)"},
    ...
] 

and then iterate over urls and access the components like

for dict_ in urls:
    url = dict_['URL']
    attribute = dict_['ATT']
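
For completeness, here is a minimal (untested) sketch of how the second structure could be fed into a single Scrapy spider, so that each response is parsed with its own selector instead of one shared Dicti. It assumes Scrapy 1.7+ for cb_kwargs; the spider name and the yielded item fields are placeholders:

import scrapy


class CareersSpider(scrapy.Spider):
    # Hypothetical spider name -- adjust to your project.
    name = "careers"

    # One small dictionary per site, as in the second option above.
    urls = [
        {"URL": "https://exampleurl.com/careers", "ATT": ".joblist a::attr(href)"},
        {"URL": "https://exampleurl.com/en/Company/Career-Opportunities", "ATT": ".content a::attr(href)"},
    ]

    def start_requests(self):
        # Yield one request per entry and send its CSS selector along
        # to the callback via cb_kwargs (available in Scrapy 1.7+).
        for entry in self.urls:
            yield scrapy.Request(
                entry["URL"],
                callback=self.parse,
                cb_kwargs={"attribute": entry["ATT"]},
            )

    def parse(self, response, attribute):
        # Each response is parsed with the selector that was paired
        # with its own URL, so nothing leaks between sites.
        for href in response.css(attribute).getall():
            yield {"page": response.url, "job_link": response.urljoin(href)}

Because the selector travels with each request, there is no need to stop the spider and start a new instance per list.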