I need to request all the review pages for a company on Glassdoor, and in some cases there can be thousands of pages. I am trying to use grequests to do this, but I found that when I sent more than about 100 requests at once I would start to receive 403 errors.
I came up with this code to batch the requests into blocks of 100:
"reviews_url": "https://www.glassdoor.com/Reviews/Apple-Reviews-E1138.htm?"
batch = 100
responses = []
for j in range(math.ceil(num_pages/batch)):
print("Batching requests: {}/{}".format(min(num_pages, (j+1)*batch),num_pages))
rs = (
grequests.get(
reviewsUrl.replace(".htm", "_P" + str(k + 1) + ".htm"),
headers=DEFAULT_HEADERS,
)
for k in range(min(num_pages, (j)*batch), min(num_pages, (j+1)*batch))
)
responses += grequests.map(rs)
time.sleep(uniform(10,15))
This works and I get what I need, but it is far too slow, and I need to do this for ~8000 companies. Is there a better way? I tried reducing the sleep time between batches and began getting 403s again.
Error 403 means that your request is well formed, but the server refuses to fulfill it. In your case it is because you are making too many requests in a short period of time.
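One way to at least avoid losing pages when this happens is to inspect each response's status code and retry only the refused pages after a longer pause. The sketch below is an illustration under assumptions, not Glassdoor-specific: fetch_batch is a hypothetical helper, and the retry count and backoff delays are placeholders you would tune yourself.

import time
from random import uniform
import grequests

def fetch_batch(urls, headers, max_retries=3):
    # Send one batch of URLs; re-send any that were refused (e.g. 403/420/429),
    # waiting longer before each new attempt (simple linear backoff).
    responses = {url: None for url in urls}
    pending = list(urls)
    for attempt in range(max_retries):
        rs = (grequests.get(u, headers=headers) for u in pending)
        results = grequests.map(rs)
        still_pending = []
        for url, resp in zip(pending, results):
            if resp is not None and resp.status_code == 200:
                responses[url] = resp
            else:
                still_pending.append(url)
        pending = still_pending
        if not pending:
            break
        # Back off before retrying the pages the server refused
        time.sleep(uniform(10, 15) * (attempt + 1))
    return [responses[u] for u in urls]

This does not make the crawl faster by itself; it only recovers refused pages. The overall request rate still has to respect the server's limits, which is what the rest of this answer is about.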
Web scraping without delays between requests (time.sleep(uniform(10, 15))) abuses server resources and can degrade service for other users, so most sites limit the number of requests you can make within some (short) timeframe. The server communicates that you went over this limit by sending you error 403; some servers use 420 or 429 instead. Ignoring this signal is at best impolite and commonly against the terms of service. You can try: