I am trying to download PDF files from a list of URLs, and some of the URLs raise an error. When I open those URLs manually in a browser over a VPN connection, they take a long time to load but do load correctly.
What should I do to make this work? Also, one PDF downloads but turns out to be empty, so if possible please explain that as well.
URL "https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf" gives an error:
Failed to download PDF from https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf. Skipping to the next URL. Error: SOCKSHTTPSConnectionPool(host='www.lisbonct.com', port=443): Max retries exceeded with url: /sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x00000271AACE00D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
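For what it's worth, the requests documentation says that a `socks5://` proxy URL resolves hostnames on my machine, while `socks5h://` hands DNS resolution to the proxy, and `[Errno 11001] getaddrinfo failed` looks like a local Windows DNS lookup failure. I have not confirmed this is my problem; this is just the minimal connectivity check I am experimenting with (a standalone sketch, separate from my script; check.torproject.org is only a test URL):

```python
# Standalone connectivity test: 'socks5h://' asks the proxy (Tor) to resolve
# hostnames instead of resolving them locally on my machine.
import requests

session = requests.Session()
session.proxies = {
    'http': 'socks5h://127.0.0.1:9150',   # Tor Browser's default SOCKS port
    'https': 'socks5h://127.0.0.1:9150',
}
response = session.get('https://check.torproject.org/', timeout=60)
print(response.status_code)
```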
The URL "https://web.franklintn.gov/FlippingBook/FranklinZoningOrdinance/index.html" downloads, but the resulting PDF file is empty.
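My guess (not verified) is that this URL points to an HTML flipbook page rather than an actual PDF, so my script saves HTML bytes under a .pdf name, which a PDF viewer then treats as empty or corrupt. A quick header check along these lines should confirm it (my own diagnostic sketch):

```python
# Diagnostic sketch: if the server reports text/html, the downloaded bytes
# are a web page, not a PDF document.
import requests

response = requests.head(
    'https://web.franklintn.gov/FlippingBook/FranklinZoningOrdinance/index.html',
    timeout=60,
    allow_redirects=True,
)
print(response.headers.get('Content-Type'))
```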
Here is the code:
```python
from pathlib import Path
import requests
import time


def get_tor_session():
    session = requests.session()
    session.proxies = {'http': 'socks5://127.0.0.1:9150',
                       'https': 'socks5://127.0.0.1:9150'}
    return session


folder_path = Path("J:/Magma Systems Data/bilal-code/PDF Files/New PDFs/testing_single_pdf")

urls = [
    "https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf",
    "https://web.franklintn.gov/FlippingBook/FranklinZoningOrdinance/index.html",
    # Add other URLs here
]

i = 1
session = get_tor_session()
for url in urls:
    print(f'Downloading PDF from URL: {url}')
    filename = folder_path / (str(i) + '.pdf')
    try:
        response = session.get(url, timeout=300)  # Increased timeout to 5 minutes (300 seconds)
        response.raise_for_status()  # Check for any HTTP errors
        # Check if the response content is empty (PDF not yet loaded)
        while len(response.content) < 1000:  # Modify the condition based on expected content size
            time.sleep(30)  # Wait for 30 seconds before rechecking
            response = session.get(url, timeout=300)  # Get the updated response
            response.raise_for_status()  # Check for errors again
        filename.write_bytes(response.content)
        i += 1
        print(f'PDF file downloaded for: {url}')
    except requests.exceptions.RequestException as e:
        print(f"Failed to download PDF from {url}. Skipping to the next URL.")
        print(f"Error: {e}")
        continue

# print('PDF files downloaded.')
```
I tried increasing the timeout to 300 seconds, but the error still appears within about 5 seconds; the request never waits for the PDF to load.
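If it matters, my understanding is that the timeout argument only bounds the connect and read phases of a request; a DNS lookup that fails locally raises almost immediately, which would explain why raising the timeout changes nothing. Separately, I am considering validating the payload before writing it to disk, something like this helper (a sketch of my own, the function name is hypothetical):

```python
from pathlib import Path
import requests


def save_if_pdf(session: requests.Session, url: str, filename: Path) -> bool:
    """Download url and write it to disk only if the body is a real PDF."""
    response = session.get(url, timeout=300)
    response.raise_for_status()
    # Genuine PDF files start with the magic bytes b'%PDF'; anything else
    # (e.g. an HTML flipbook page) produces a broken or "empty" .pdf file.
    if response.content.startswith(b'%PDF'):
        filename.write_bytes(response.content)
        return True
    print(f'Not a PDF ({response.headers.get("Content-Type")}): {url}')
    return False
```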