I'm a student currently working on a Python project. I need to download a CSV file (13.8 MB) called 'prix-des-carburants-en-france-flux-instantane-v2' from the web (https://data.economie.gouv.fr/explore/dataset/prix-des-carburants-en-france-flux-instantane-v2/) through Firefox.
So I wrote a script that retrieves it. First, I go to the site and retrieve the CSV name and last update time, then I go to the 'Export' section. After that I click on the CSV download link, which starts my download. But sometimes (in about 1% of cases) my download stops in the middle. There are two cases:
- in the first, it stops as if the file were fully downloaded, but the file size is not 13.8 MB; it can be any size below 13.8 MB;
- in the second, a pop-up appears in my browser saying something like 'Cancel all downloads? If you exit now, a download in progress will be canceled. Do you really want to leave?' and nothing happens without human intervention.
I've read some similar questions on Stack Overflow, HOWEVER I was not satisfied with the answers. They recommended using 'os' (which is not the most Pythonic way to proceed, if I'm not wrong) or were about the 'requests' module.
I like my approach: I'm just clicking on the link like a human would, which means I use the Firefox download manager (and it normally works fine, so there's no need for a 'try/except' block around it; we rely on something that is already developed).
Note that I haven't really found the source of the problem, because this bug doesn't appear often!
So I've tried to implement a 'wait_for_fully_downloaded' method. If I'm correct, the file appears in the folder before the download completes, so my method checks every second whether the file size is still growing. When the size stops changing, the file should be fully downloaded.
To be honest, it doesn't change anything; I still have the same problems.
Do you think it could be the 'with'? For the second problem, maybe it tries to close my browser before the download ends? If so, should I use quit() instead? Does quit() still close my browser if the script crashes during execution?
For the first problem, maybe I could try a bigger check_interval parameter... but that means my program is running while doing nothing during that time!
Tell me if I should abandon this approach because there are better ways to do this.
Thanks for your help!
Here is my code :
"""
Module which provides methods for scraping data using Firefox webdriver.
"""
import time
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import WebDriverException
class FirefoxScraperHolder:
"""
Class for scraping data using Firefox webdriver.
"""
def __init__(self, target_url):
"""
Initialize a FirefoxScraperHolder instance.
:param target_url: The URL to scrape data from.
"""
self.cwf = Path(__file__).resolve().parent
self.options = webdriver.FirefoxOptions()
self.driver = webdriver.Firefox(options=self.set_preferences())
self.target_url = target_url
self._updated_data_date = None
self._csv_id = None
def set_preferences(self):
"""
Set Firefox webdriver preferences
(Here only for downloading files)
:return: The configured Firefox options.
"""
# 2 means we use chosen directory as download folder
self.options.set_preference("browser.download.folderList", 2)
self.options.set_preference("browser.download.dir", str(self.cwf))
return self.options
@property
def updated_data_date(self):
"""
Get the last updated data date.
:return: The last updated data date.
:rtype: str
"""
return self._updated_data_date
@property
def csv_id(self):
"""
Get the filename of the downloaded CSV.
:return: The filename of the downloaded CSV.
:rtype: str
"""
return self._csv_id
def perform_scraping(self, aria_label, ng_if):
"""
Perform the scraping process.
:param aria_label: ARIA label for the CSV element.
:param ng_if: NG-if attribute for updated data date element.
"""
try:
with self.driver:
self.driver.maximize_window()
self.driver.get(self.target_url)
# Retrieves csv information
self.click_on(By.LINK_TEXT, "Informations")
self._updated_data_date = self.retrieve_text_info(
By.CSS_SELECTOR,
f"[ng-if='{ng_if}']")
self._csv_id = self.retrieve_text_info(
By.CLASS_NAME,
'ods-dataset-metadata-block__metadata-value'
) + '.csv'
# Download csv
self.click_on(By.LINK_TEXT, "Export")
self.click_on(By.CSS_SELECTOR, f"[aria-label='{aria_label}']")
self.wait_for_fully_downloaded()
except WebDriverException as exception:
print(f"An error occurred during the get operation: {exception}")
def click_on(self, find_by, value):
"""
Click on a web element identified by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.LINK_TEXT).
:param value: The value to search for.
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
element = wait.until(EC.element_to_be_clickable((find_by, value)))
element.click()
def remove_cwf_existing_csvs(self):
"""
Remove existing CSV files from the current working folder.
"""
for file in self.cwf.glob('*.csv'):
file.unlink(missing_ok=True)
def retrieve_text_info(self, find_by, value):
"""
Retrieve text information of a web element identified
by 'find_by' and 'value'.
:param find_by: The method used to find the element
(e.g., By.CSS_SELECTOR).
:param value: The value to search for.
:return: The text information of the web element.
:rtype: str
"""
# Here 'wait' and 'EC' avoid error due to the loading of the website
wait = WebDriverWait(self.driver, 20)
info = wait.until(EC.visibility_of_element_located((find_by, value)))
return info.text
def wait_for_fully_downloaded(self, timeout=60, check_interval=1):
"""
Wait for a file to be fully downloaded.
:param timeout: Maximum time to wait in seconds.
Default is 60 seconds.
:param check_interval: Interval for checking file size
in seconds. Default is 1 second.
"""
file_path = self.cwf / self._csv_id
start_time = time.time()
while time.time() - start_time < timeout:
if file_path.is_file():
initial_size = file_path.stat().st_size
time.sleep(check_interval)
# Checks if the file size changes during check_interval
if file_path.stat().st_size == initial_size:
return
return
# Here my main code :
# target_url = ('https://data.economie.gouv.fr/explore/dataset/prix-des-'
# 'carburants-en-france-flux-instantane-v2/')
# csv_aria_label = 'Dataset export (CSV)'
# updated_data_date_ng_if = 'ctx.dataset.metas.data_processed'
#
# # Retrieves datas
# firefox_scraper = FirefoxScraperHolder(target_url)
# firefox_scraper.remove_cwf_existing_csvs()
# firefox_scraper.perform_scraping(csv_aria_label, updated_data_date_ng_if)
Call the following function after you start downloading the file.
DOWNLOAD_FILE_PATH is the path where you are downloading the file.
Every file that Firefox is still downloading ends with '.part'; a '.part' file is the indicator that Firefox is downloading it. So when there is no .part file left in DOWNLOAD_FILE_PATH, it means that your file is fully downloaded.