How to properly scrape 1 URL at a time with attached attribute?


I am looking to scrape the careers pages of multiple website domains for the hrefs they contain.

I only want the links to the jobs and nothing else, and the easiest way I have found to do that is to parse the scrapy response and pull the hrefs from a specific CSS path.

So far my solution is to create dictionaries that each have two generic keys, URL and Att. The values are the careers page URL and a pre-identified CSS path.

I am going to create multiple dictionaries automatically in the future from a file of data.

I am storing all of these dictionaries in a list in Python and my plan was to call each dictionary, one at a time, from the list and to use the associated URL and attribute as the required input for scrapy.

# Each list contains two strings:
# the website's careers URL,
# and the CSS path to the jobs container on that page.
# This is an example; I will name the lists 1, 2, 3, etc. so I can look them up in a database more easily.
List1 = ["", ".joblist a::attr(href)"]
List2 = ["", ".content a::attr(href)"]
Dicti = {"URL": List1[0], "Att": List1[1]}

This is essentially how I have the list of dictionaries set up.

I am then using

start_urls = [

I am then also parsing the data like so,

jobs = response.css(Dicti["Att"]).extract()

I think this is potentially where I am going wrong. Although it does load each URL and scrape the HTML from each one, it then isn't parsing the hrefs from the attributes correctly.

I tried scraping the lists one at a time, with only 1 list in the start URLs, and this works perfectly. The problem only appears when I try to put more than 1 list into the start URLs.

What exactly am I doing wrong? Maybe I misunderstand how the spider works after reading the information. I essentially want to run list1, then stop the spider and run a new instance for list2, all while saving the extracted data.

Any advice on how to overcome this would be massively appreciated.


There is 1 solution below


Either organize your data as a list of short lists

urls = [
    ["", ".joblist a::attr(href)"],
    ["", ".content a::attr(href)"],
]

and then iterate over urls and access the components like

for entry in urls:
    url = entry[0]
    attribute = entry[1]

or shorter

for url, attribute in urls:
    print(url, attribute)

or make a list of small dictionaries

urls = [
    {'URL': "", 'ATT': ".joblist a::attr(href)"},
    {'URL': "", 'ATT': ".content a::attr(href)"},
]

and then iterate over urls and access the components like

for dict_ in urls:
    url = dict_['URL']
    attribute = dict_['ATT']