Connection error during image scraping from Craigslist


As part of a project to scrape data from Craigslist, I'm also scraping images. In testing I've noticed that the connection is sometimes refused. Is there a way around this, or do I need to add error handling for it in my code? I recall that the Twitter API rate-limits queries, so a sleep timer is typically incorporated; I'm curious whether I face the same situation with Craigslist. See code and error below.

import requests
from bs4 import BeautifulSoup


# Loop through each image thumbnail and save it to a local folder.
# soup_test, pathname, and motoid are defined earlier in the script.
imgcount = 0
for img in soup_test.select('a.thumb'):
    imgcount += 1
    filename = pathname + "/" + motoid + " - " + str(imgcount) + ".jpg"
    response = requests.get(img['href'])
    with open(filename, 'wb') as f:
        f.write(response.content)

ConnectionError: HTTPSConnectionPool(host='images.craigslist.org', port=443): Max retries exceeded with url: /00707_fbsCmug4hfR_600x450.jpg (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))

I have 2 questions about this behavior.

  1. Do CL servers have any rules or protocols, such as blocking the nth request within a certain time frame?

  2. Is there a way to pause the loop after a connection has been denied? Or do I just incorporate error catching so that it doesn't halt my program?
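For reference, here's a minimal sketch of what I mean by pausing and error catching. The `fetch_with_retry` helper is hypothetical (my own naming, not from any library); it catches a refused connection, backs off, and retries instead of letting the exception halt the loop:

```python
import time

import requests


def fetch_with_retry(url, retries=3, pause=5.0):
    """Return the response body, pausing and retrying on connection errors.

    Returns None if every attempt fails, so the calling loop can skip
    this image rather than crash.
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.content
        except requests.exceptions.ConnectionError:
            # Connection refused -- back off a little longer each attempt
            time.sleep(pause * (attempt + 1))
    return None
```

The image loop would then call `data = fetch_with_retry(img['href'])` and skip the file write when it gets `None`, possibly with a short `time.sleep()` between images to throttle the request rate.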
