Python: How to scrape a page to get information that is then used to scrape another one, and so on?


I need to build a Python script that scrapes a web page to retrieve a number from a "Show More" button.

This number is then used as a parameter to request a URL that returns JSON containing data plus a number. That number is in turn used as a parameter for the next request, which again returns JSON containing data plus a number, and so on. The process continues until the JSON contains empty data plus a number. When the data is empty, the scraper should stop.

I tried Scrapy, but it doesn't work for this. Scrapy is asynchronous, and in my case I need to wait for the first JSON result to get the information needed to scrape the second URL, and so on.

Which Python library would you suggest? I have read that Selenium does the job, but it is much slower than Scrapy.

Best answer

Scrapy's asynchronous behaviour matters most when you have multiple URLs to scrape at the same time. In your case, you would enqueue each new request only after parsing the previous response, so it shouldn't be a problem.

I don't know the exact structure of your JSON response, so let's assume it has two keys, data and number. You could write a Scrapy spider with a parsing method similar to this:

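import json

from scrapy import Request


def parse(self, response):
    # Decode the JSON body of the response
    result = json.loads(response.body)
    # do something with the data

    # Request the next page only while the data is non-empty
    if result['data']:
        next_url = ...  # construct URL using result['number']
        # No callback given, so Scrapy calls parse() again on the response
        yield Request(next_url)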
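
If you want to check the stop condition before pointing a spider at the real site, the same chain can be sketched without Scrapy or any network access. This is only an illustration under the assumption made above (two keys, data and number). The names follow_chain, fetch_page and FAKE_PAGES are hypothetical and not part of Scrapy, and the fake pages hold made-up values.

```python
from typing import Any, Callable, Dict, Iterator, List


def follow_chain(
    first_number: int,
    fetch_page: Callable[[int], Dict[str, Any]],
) -> Iterator[List[Any]]:
    """Yield each page's data, following the number returned by every page.

    fetch_page is a hypothetical callable that takes a number and returns
    the decoded JSON as a dict with the assumed keys "data" and "number".
    """
    number = first_number
    while True:
        result = fetch_page(number)
        if not result["data"]:
            # Empty data marks the end of the chain, as in the question.
            return
        yield result["data"]
        # The next request can only be built once this response is parsed.
        number = result["number"]


# Fake in-memory pages standing in for the real JSON endpoint (made-up values).
FAKE_PAGES = {
    10: {"data": ["a", "b"], "number": 20},
    20: {"data": ["c"], "number": 30},
    30: {"data": [], "number": 40},
}

if __name__ == "__main__":
    for chunk in follow_chain(10, FAKE_PAGES.__getitem__):
        print(chunk)
```

In a real script, fetch_page would perform the HTTP request and decode the JSON. In the spider above, that role is played by Scrapy itself, which downloads next_url and hands the response back to parse.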