I am looking to scrape multiple website domains for various hrefs within their careers pages.
I only want the links to the jobs and nothing else, and the easiest way I have found to do that is to parse the Scrapy response and pull the hrefs from a specific CSS path.
So far my solution is to create dictionaries that each use the same two generic keys, URL and Att (attribute). The values are the careers page URL and a pre-identified CSS path.
I am going to create multiple dictionaries automatically in the future from a file of data.
I am storing all of these dictionaries in a list in Python and my plan was to call each dictionary, one at a time, from the list and to use the associated URL and attribute as the required input for scrapy.
# Each list contains two items:
# the website's careers URL,
# and the CSS selector locating the jobs container on that page.
# The below is an example, but I will name the lists 1, 2, 3 etc. so I can look them up more easily in a database.
List1 = ["https://exampleurl.com/careers", ".joblist a::attr(href)"]
List2 = ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"]
Dicti = {"URL" : List1[0], "Att" : List1[1]}
This is essentially how I have the list of dictionaries set up.
I am then using
start_urls = [
List1[Dicti["URL"]],
List2[Dicti["URL"]]
]
I am then also parsing the data like so,
jobs = response.css(Dicti["Att"]).extract()
I think this is potentially where I am going wrong. Although it does load each URL and scrape the HTML from each one, it then isn't parsing the data from the attributes correctly.
I tried scraping the lists one at a time, with only 1 list in the start URLs. This works perfectly; the problem appears when I try to put more than 1 list into the start URLs.
What exactly am I doing wrong? Maybe I misunderstand how the spider works after reading the information. I essentially want to run list1, then stop the spider and run a new instance for list2, all while saving the extracted data.
Any advice on how to overcome this would be massively appreciated.
Either organize your data as a list of short lists,
then iterate over urls
and access the components by index or, shorter, by unpacking each pair in the loop header.
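A minimal sketch of the list-of-lists layout, reusing the example URLs and selectors from the question. The names `item`, `pairs_by_index`, and `pairs_unpacked` are illustrative assumptions and only exist to make the two access styles easy to compare.

```python
# Each inner list pairs a careers-page URL with the CSS selector for its job links.
# The values are the example URLs and selectors from the question.
urls = [
    ["https://exampleurl.com/careers", ".joblist a::attr(href)"],
    ["https://exampleurl.com/en/Company/Career-Opportunities", ".content a::attr(href)"],
]

# Access the components by index...
pairs_by_index = []
for item in urls:
    url = item[0]
    att = item[1]
    pairs_by_index.append((url, att))

# ...or, shorter, unpack each pair directly in the loop header.
pairs_unpacked = []
for url, att in urls:
    pairs_unpacked.append((url, att))
```

Both loops visit the same (URL, selector) pairs; the second simply avoids the explicit indexing.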
Or make a list of small dictionaries,
then iterate over urls
and access the components by key.
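A minimal sketch of the list-of-dictionaries layout, assuming the "URL" and "Att" keys from the question. The `selectors_by_url` mapping is a hypothetical helper added here only to show that each URL stays paired with its own selector.

```python
# Each dictionary keeps a careers-page URL together with its CSS selector.
# The keys "URL" and "Att" follow the question; the values are its example data.
urls = [
    {"URL": "https://exampleurl.com/careers", "Att": ".joblist a::attr(href)"},
    {"URL": "https://exampleurl.com/en/Company/Career-Opportunities", "Att": ".content a::attr(href)"},
]

# Iterate over the dictionaries and read both components by key.
selectors_by_url = {}
for item in urls:
    url = item["URL"]
    att = item["Att"]
    selectors_by_url[url] = att
```

In a Scrapy spider, one possible use of this pairing is to send each request with its own selector attached, for example through the `cb_kwargs` argument of `scrapy.Request`, so that the callback applies the selector belonging to the page it received rather than a single shared `Dicti["Att"]`.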