How to download thousands of excel files from website?


This is the website in question.

I want to download every year's Excel files from 2009 onwards.

I don't want to click all of them manually — any help? I don't have a lot of coding experience, so please keep it simple. I know Python is an option, but I have no experience with it; VBA could work, but I'm not having any luck there.

For now, all I want is to download them all, but if there is also a way to consolidate the files, that would be very helpful too. In full honesty, I can't code at all, but it seems that is the only way to do this task. It might take too long to learn a whole programming language, and I need these files downloaded as soon as possible.

If this is too complicated, please advise if I should bite the bullet and manually do it.

Thanks.


There is 1 answer below.

Federicofkt (accepted answer)

Assuming that you already have an environment in which you can run the code (if not, install Anaconda and use Jupyter Notebook), this code does the trick.

The first time, you have to install the libraries (run this only once; in a plain terminal, drop the leading %):

%pip install bs4
%pip install requests

Then you can run the code:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def download_excel_files(url, download_path):
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Collect every link on the page and keep only the Excel files
        links = soup.find_all('a', href=True)

        for link in links:
            file_url = link['href']
            if file_url.endswith(('.xls', '.xlsx')):
                # Links may be relative, so resolve them against the page URL
                full_file_url = urljoin(url, file_url)
                file_response = requests.get(full_file_url)
                if file_response.status_code == 200:
                    file_name = os.path.join(download_path, os.path.basename(file_url))
                    with open(file_name, 'wb') as f:
                        f.write(file_response.content)
                    print(f"File '{file_name}' successfully downloaded")
                else:
                    print(f"Could not download '{full_file_url}'")
    else:
        print("Page error")

Then loop over the years and call the function like so:

base_url = 'https://clientportal.jse.co.za/downloadable-files?RequestNode=/Safex/agriculture.stats/'


for year in range(2009, 2025):  # 2009 through 2024
    url = base_url + str(year)
    download_folder = str(year)
    os.makedirs(download_folder, exist_ok=True)
    download_excel_files(url, download_folder)
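You also asked about consolidating the files. A minimal sketch with pandas (install it with `%pip install pandas openpyxl` first) — assuming every workbook shares the same column layout, and using a helper name of my own choosing:

```python
import glob
import os

import pandas as pd


def consolidate_excel_files(root_folder, output_file):
    """Stack every .xls/.xlsx found under root_folder into one sheet."""
    frames = []
    # '**' with recursive=True also descends into the per-year subfolders
    pattern = os.path.join(root_folder, '**', '*.xls*')
    for path in sorted(glob.glob(pattern, recursive=True)):
        df = pd.read_excel(path)  # needs openpyxl (.xlsx) or xlrd (.xls)
        df['source_file'] = os.path.basename(path)  # remember where each row came from
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    combined.to_excel(output_file, index=False)
    return combined
```

For example, `consolidate_excel_files('.', 'all_years.xlsx')` run from the folder containing the year subfolders would produce a single workbook (write the output outside that folder, or to a distinct name, so a rerun doesn't re-read it). If the yearly files have different columns, `pd.concat` will still work but will fill missing columns with NaN.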

Since I assume you're only using this code for this one task, I won't explain every step. If you need anything more, please ask.