I am trying to download PDF files from different URLs, and some of them give an error. When I checked those URLs manually by opening them in a browser over a VPN connection, they took a long time to load but eventually loaded correctly.

What should I do to make this work? Also, one PDF is downloaded but the file is empty; if possible, please explain that as well.

URL "https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf" gives an error:

Failed to download PDF from https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf. Skipping to the next URL. Error: SOCKSHTTPSConnectionPool(host='www.lisbonct.com', port=443): Max retries exceeded with url: /sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x00000271AACE00D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
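
From the "[Errno 11001] getaddrinfo failed" part, I suspect the hostname is being resolved on my own machine instead of through the Tor proxy. I read that using the "socks5h://" scheme (instead of "socks5://") makes requests hand DNS resolution to the proxy, but I am not sure this is the right fix here. A minimal sketch of the variant I mean (the function name is my own; it assumes the Tor Browser SOCKS port 9150, as in my code below):

import requests

def get_tor_session_remote_dns():
    session = requests.session()
    # "socks5h" (note the trailing h) asks the proxy to resolve hostnames,
    # instead of calling getaddrinfo locally, where it currently fails.
    session.proxies = {'http':  'socks5h://127.0.0.1:9150',
                       'https': 'socks5h://127.0.0.1:9150'}
    return session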

Meanwhile, the URL "https://web.franklintn.gov/FlippingBook/FranklinZoningOrdinance/index.html" downloads without an error, but the resulting PDF file is empty.
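
My guess about the empty file: that second URL ends in index.html, so the server returns an HTML viewer page rather than PDF bytes, and saving those bytes with a .pdf extension produces a file that PDF readers show as empty or corrupt. If that guess is right, a check like the following sketch should catch it before writing the file (looks_like_pdf is a helper I made up for illustration):

import requests

def looks_like_pdf(response: requests.Response) -> bool:
    # A real PDF response should advertise application/pdf and its body
    # should start with the %PDF- magic bytes; an HTML page fails both checks.
    content_type = response.headers.get('Content-Type', '')
    return 'application/pdf' in content_type and response.content[:5] == b'%PDF-'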

Here is the code:

from pathlib import Path
import requests
import time

def get_tor_session():
    session = requests.session()
    session.proxies = {'http':  'socks5://127.0.0.1:9150',
                       'https': 'socks5://127.0.0.1:9150'}
    return session

folder_path = Path("J:/Magma Systems Data/bilal-code/PDF Files/New PDFs/testing_single_pdf")

urls = [
    "https://www.lisbonct.com/sites/g/files/vyhlif791/f/uploads/pzc_zoningregs_2022.pdf",
     "https://web.franklintn.gov/FlippingBook/FranklinZoningOrdinance/index.html"
    # Add other URLs here
]

i = 1
session = get_tor_session()
for url in urls:
    print(f'Downloading PDF from URL: {url}')
    filename = folder_path / f'{i}.pdf'

    try:
        response = session.get(url, timeout=300)  # Increased timeout to 5 minutes (300 seconds)
        response.raise_for_status()  # Check for any HTTP errors

        # Check if the response content is empty (PDF not yet loaded)
        while len(response.content) < 1000:  # Modify the condition based on expected content size
            time.sleep(30)  # Wait for 30 seconds before rechecking
            response = session.get(url, timeout=300)  # Get the updated response
            response.raise_for_status()  # Check for errors again

        filename.write_bytes(response.content)
        i += 1
        print(f'PDF file downloaded for: {url}')

    except requests.exceptions.RequestException as e:
        print(f"Failed to download PDF from {url}. Skipping to the next URL.")
        print(f"Error: {e}")
        continue

#print(f'PDF files downloaded.')

I tried increasing the timeout to 300 seconds, but the request still fails within about 5 seconds; it won't wait for the PDF to load.
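
In case some URLs really are just slow, I was also thinking of retrying each download a few times instead of relying on one long timeout, since (as far as I understand) timeout only bounds a single attempt and never retries a failed connection. A rough sketch of what I have in mind (fetch_with_retries and the attempt/delay numbers are my own, picked arbitrarily):

import time
import requests

def fetch_with_retries(session, url, attempts=3, delay=30, timeout=300):
    # Repeat the request ourselves with a pause between attempts;
    # timeout only limits how long one attempt may take.
    last_error = None
    for attempt in range(attempts):
        try:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error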
