I'm a student currently working on a Python project. I need to download a CSV file (13.8 MB) called 'prix-des-carburants-en-france-flux-instantane-v2' from the web (https://data.economie.gouv.fr/explore/dataset/prix-des-carburants-en-france-flux-instantane-v2/) through Firefox.
So I wrote a script that retrieves it. First, I go to the site and retrieve the CSV name and last update time, then I go to the 'Export' section. After that I click on the CSV download link, which starts my download. But sometimes (in about 1% of cases) my download stops in the middle. There are two cases:
- in the first, it stops as if the file were fully downloaded, but the file size is not 13.8 MB; it can be any size below 13.8 MB;
- in the second, a pop-up appears in my browser saying something like 'Cancel all downloads? If you exit now, a download in progress will be canceled. Do you really want to leave?' and nothing happens without human intervention.
I've read some similar questions on Stack Overflow, HOWEVER I was not satisfied with the answers. They recommended using 'os' (which is not the most Pythonic way to proceed, if I'm not wrong) or were about the 'requests' module.
I like my approach: I'm just clicking on the link like a human would, which means I use the Firefox download manager (and it normally works fine, so there's no need for a 'try/except' block around it; we rely on something that is already developed).
Note that I haven't really found the source of the problem, because this bug doesn't appear often!
So I've tried to implement a 'wait_for_fully_downloaded' method. If I'm correct, the file appears in the folder before the download completes, so my method checks every second whether the file size is still growing. When the size stops changing, the file should be fully downloaded.
To be honest, it doesn't change anything; I still have the same problems.
Do you think it could be the 'with'? For the second problem, maybe it tries to close my browser before the download ends? If so, should I use quit() instead? Does quit() still close my browser if the script crashes during execution?
For the first problem, maybe I could try a bigger check_interval parameter... but that means my program is running while doing nothing during that time!
Tell me if I should abandon this approach because there are better ways to do this.
Thanks for your help!
Here is my code :
"""
Module which provides methods for scraping data using Firefox webdriver.
"""
import time
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
class FirefoxScraperHolder:
"""
Class for scraping data using Firefox webdriver.
"""
def __init__(self, target_url):
"""
Initialize a FirefoxScraperHolder instance.
:param target_url: The URL to scrape data from.
"""
self.cwf = Path(__file__).resolve().parent
self.options = webdriver.FirefoxOptions()
self.driver = webdriver.Firefox(options=self.set_preferences())
self.target_url = target_url
self._updated_data_date = None
self._csv_id = None
def set_preferences(self):
"""
Set Firefox webdriver preferences
(Here only for downloading files)
:return: The configured Firefox options.
"""
# 2 means we use chosen directory as download folder
self.options.set_preference("browser.download.folderList", 2)
self.options.set_preference("browser.download.dir", str(self.cwf))
return self.options
@property
def updated_data_date(self):
"""
Get the last updated data date.
:return: The last updated data date.
:rtype: str
"""
return self._updated_data_date
@property
def csv_id(self):
"""
Get the filename of the downloaded CSV.
:return: The filename of the downloaded CSV.
:rtype: str
"""
return self._csv_id
def perform_scraping(self, aria_label, ng_if):
"""
Perform the scraping process.
:param aria_label: ARIA label for the CSV element.
:param ng_if: NG-if attribute for updated data date element.
"""
try:
with self.driver:
self.driver.maximize_window()
self.driver.get(self.target_url)
# Retrieves csv information
self.click_on(By.LINK_TEXT, "Informations")
self._updated_data_date = self.retrieve_text_info(
By.CSS_SELECTOR,
f"[ng-if='{ng_if}']")
self._csv_id = self.retrieve_text_info(
By.CLASS_NAME,
'ods-dataset-metadata-block__metadata-value'
) + '.csv'
# Download csv
self.click_on(By.LINK_TEXT, "Export")
self.click_on(By.CSS_SELECTOR, f"[aria-label='{aria_label}']")
self.wait_for_fully_downloaded()
except WebDriverException as exception:
print(f"An error occurred during the get operation: {exception}")
def click_on(self, find_by, value):
"""
Click on a web element identified by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.LINK_TEXT).
:param value: The value to search for.
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
element = wait.until(EC.element_to_be_clickable((find_by, value)))
element.click()
def remove_cwf_existing_csvs(self):
"""
Remove existing CSV files from the current working folder.
"""
for file in self.cwf.glob('*.csv'):
file.unlink(missing_ok=True)
def retrieve_text_info(self, find_by, value):
"""
Retrieve text information of a web element identified
by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.CSS_SELECTOR).
:param value: The value to search for.
:return: The text information of the web element.
:rtype: str
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
info = wait.until(EC.visibility_of_element_located((find_by, value)))
return info.text
def wait_for_fully_downloaded(self, timeout=60, check_interval=1):
"""
Wait for a file to be fully downloaded.
:param timeout: Maximum time to wait in seconds.
Default is 60 seconds.
:param check_interval: Interval for checking file size
in seconds. Default is 1 second.
"""
file_path = self.cwf / self._csv_id
start_time = time.time()
while time.time() - start_time < timeout:
if file_path.is_file():
initial_size = file_path.stat().st_size
time.sleep(check_interval)
# Checks if the file size changes during check_interval
if file_path.stat().st_size == initial_size:
return
return
# Here my main code :
# target_url = ('https://data.economie.gouv.fr/explore/dataset/prix-des-'
# 'carburants-en-france-flux-instantane-v2/')
# csv_aria_label = 'Dataset export (CSV)'
# updated_data_date_ng_if = 'ctx.dataset.metas.data_processed'
#
# # Retrieves datas
# firefox_scraper = FirefoxScraperHolder(target_url)
# firefox_scraper.remove_cwf_existing_csvs()
# firefox_scraper.perform_scraping(csv_aria_label, updated_data_date_ng_if)
Call the following function after you start downloading the file.
DOWNLOAD_FILE_PATH is the path where you are downloading the file.
Every file that Firefox is still downloading ends with '.part'; a '.part' file is the indicator that Firefox is downloading it. So when there is no .part file left in DOWNLOAD_FILE_PATH, it means that your file is fully downloaded.