How to send thousands of HTTP requests using grequests?


I need to request all the review pages for a company on Glassdoor, and in some cases there can be thousands of pages. I am trying to use grequests for this, but I found that when I sent more than about 100 requests at once I would start to receive 403 errors.

I came up with this code to batch the requests into blocks of 100:

"reviews_url": "https://www.glassdoor.com/Reviews/Apple-Reviews-E1138.htm?"

import math
import time
from random import uniform

import grequests

batch = 100  # more than ~100 simultaneous requests triggered 403s
responses = []
for j in range(math.ceil(num_pages / batch)):
    print("Batching requests: {}/{}".format(min(num_pages, (j + 1) * batch), num_pages))
    # Build one batch of page URLs: ..._P<k+1>.htm
    rs = (
        grequests.get(
            reviewsUrl.replace(".htm", "_P" + str(k + 1) + ".htm"),
            headers=DEFAULT_HEADERS,  # DEFAULT_HEADERS is defined elsewhere
        )
        for k in range(j * batch, min(num_pages, (j + 1) * batch))
    )
    responses += grequests.map(rs)
    time.sleep(uniform(10, 15))  # pause between batches to avoid 403s

This works and I get what I need, but it is far too slow, and I need to do this for ~8000 companies. Is there a better way? I tried reducing the sleep time between batches and began getting 403s again.

Accepted answer:

HTTP 403 means that the server understood your request but refuses to fulfill it; in your case, because you are making too many requests in a short time.

Web scraping without pauses between requests (the time.sleep(uniform(10, 15))) abuses server resources and can degrade the service for other users, so most sites limit the number of requests you can make within some (short) timeframe. The server tells you that you went over this limit by returning 403; some servers use 420 or 429 (the standard "Too Many Requests" code) instead, often with a Retry-After header saying how long to wait. Ignoring this signal is at best impolite and is commonly against the terms of service. A cooperative approach is to retry only the rate-limited pages after backing off, as in the sketch below.
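A minimal sketch of what that could look like with grequests, assuming num_pages, reviewsUrl, and DEFAULT_HEADERS from the question; the pool size, retry count, and delays are illustrative values, not tuned ones:

import time

import grequests

def fetch_with_backoff(urls, headers, max_retries=3, base_delay=15):
    """Fetch urls concurrently, retrying rate-limited ones with growing delays."""
    pending = list(urls)
    results = {}
    for attempt in range(max_retries + 1):
        rs = (grequests.get(u, headers=headers) for u in pending)
        # size caps concurrent connections; a small pool is gentler on the server
        responses = grequests.map(rs, size=10)
        failed = []
        delay = base_delay * (2 ** attempt)  # exponential backoff between rounds
        for url, resp in zip(pending, responses):
            if resp is not None and resp.status_code == 200:
                results[url] = resp
            else:
                failed.append(url)
                # Honor Retry-After (in seconds) if the server sent one
                if resp is not None:
                    retry_after = resp.headers.get("Retry-After", "")
                    if retry_after.isdigit():
                        delay = max(delay, int(retry_after))
        if not failed:
            break
        print("{} pages rate-limited; sleeping {}s".format(len(failed), delay))
        time.sleep(delay)
        pending = failed
    return results

urls = [reviewsUrl.replace(".htm", "_P{}.htm".format(k + 1)) for k in range(num_pages)]
responses = fetch_with_backoff(urls, DEFAULT_HEADERS)

Note that the small connection pool (size=10) already throttles the crawl, so the per-batch sleeps from the question become unnecessary; the loop only sleeps when the server actually pushes back.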

You can try:

  1. Do you really need to fetch the data every time? A longer download time doesn't matter if you only need to do it once, for example by saving the data to a CSV file and reading from it later (see the caching sketch after this list).
  2. Increase the delay between requests.
  3. Check whether the website offers a way to download the data in bulk (have you tried the Glassdoor API?).
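
For point 1, a minimal sketch of caching fetched pages on disk so that a re-run never hits Glassdoor again; the cache directory and filename scheme are illustrative choices:

import hashlib
import os

import requests

CACHE_DIR = "page_cache"  # illustrative location

def get_page(url, headers=None):
    """Return page HTML, reading from disk if this URL was fetched before."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(CACHE_DIR, name)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text

With a cache like this, the slow, polite crawl only has to happen once per company, and you can fix parsing bugs later without re-downloading anything.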