I wrote a small downloader in R to download some log files from a remote server in one run:
    file_remote <- fun_to_list_URLs()
    file_local <- fun_to_gen_local_paths()
    credentials <- "usr/pwd"

    downloader <- function(file_remote, file_local, credentials) {
      data_bin <- RCurl::getBinaryURL(
        file_remote,
        userpwd = credentials,
        ftp.use.epsv = FALSE,
        forbid.reuse = TRUE
      )
      writeBin(data_bin, file_local)
    }

    purrr::walk2(
      file_remote,
      file_local,
      ~ downloader(
        file_remote = .x,
        file_local = .y,
        credentials = credentials
      )
    )
This works, but slowly, especially compared to FTP clients like WinSCP: downloading 64 log files of 2 KB each takes minutes.
Is there a faster way to download a lot of files in R?
The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files, this should give you a large boost in performance. Here is a barebones function that does that (since version 5.0.0, the curl package has a native version of this function, also called multi_download):

Now we need some test files to compare it to your baseline approach. I use COVID data from the Johns Hopkins University GitHub page, as it contains many small CSV files, which should be similar to your files.
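To illustrate the async approach described above, here is a minimal sketch built on curl's multi interface (new_pool, curl_fetch_multi, multi_run). It assumes the curl package is installed. The name multi_download_sketch, its arguments, and the connection limits are illustrative choices, not necessarily the exact function used for the numbers in this answer. Each request gets its own callback, so a response is matched to its local path without relying on the URL the server reports back.

```r
library(curl)

# Hypothetical helper: fetch many URLs concurrently and save each to disk.
# total_con / host_con are assumed limits; tune them for your server.
multi_download_sketch <- function(file_remote, file_local,
                                  total_con = 100L, host_con = 6L,
                                  userpwd = NULL) {
  stopifnot(length(file_remote) == length(file_local))
  pool <- curl::new_pool(total_con = total_con, host_con = host_con)

  for (i in seq_along(file_remote)) {
    local({
      dest <- file_local[[i]]  # captured per request by the callback below
      h <- curl::new_handle()
      # Credentials in libcurl's "user:password" form, if needed (e.g. FTP)
      if (!is.null(userpwd)) curl::handle_setopt(h, userpwd = userpwd)
      curl::curl_fetch_multi(
        file_remote[[i]],
        done = function(res) writeBin(res$content, dest),
        fail = function(msg) message("Failed: ", msg),
        pool = pool,
        handle = h
      )
    })
  }

  # Nothing is transferred until multi_run() is called;
  # it returns counts of successful, failed and pending requests.
  curl::multi_run(pool = pool)
}
```

With the vectors from the question, a call might look like multi_download_sketch(file_remote, file_local, userpwd = "user:password"), where the credentials string is a placeholder.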
We could also infer the file names from the URLs, but I assume that is not what you want. So now let's compare the approaches for these 821 files:
The new approach is 13.3 times faster than the original one. I would assume that the difference will be bigger the more files you have. Note though, that this benchmark is not perfect as my internet speed fluctuates quite a bit.
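A comparison like the one above could be timed along these lines. This is only a sketch: compare_downloaders is a hypothetical helper, and it assumes each downloader is a function taking the remote URLs and the local paths as its first two arguments. Because network speed fluctuates, a single run gives a rough indication at best, so repeating the measurement a few times is advisable.

```r
# Hypothetical helper: time each downloader once on the same set of files.
# `downloaders` is a named list of functions f(file_remote, file_local).
compare_downloaders <- function(downloaders, file_remote, file_local) {
  vapply(
    downloaders,
    function(f) system.time(f(file_remote, file_local))[["elapsed"]],
    numeric(1)
  )
}

# Example call (function names are placeholders for your own implementations):
# timings <- compare_downloaders(
#   list(baseline = baseline_fun, async = async_fun),
#   file_remote, file_local
# )
# timings[["baseline"]] / timings[["async"]]  # speed-up factor
```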
The function should also be improved in terms of error handling (currently you get a message saying how many requests succeeded and how many errored, but no indication of which files failed). My understanding is also that multi_run holds each file in memory before save_download writes it to disk. With small files this is fine, but it might be an issue with larger ones.
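One possible way to address both points is the native curl::multi_download() (curl >= 5.0.0), which takes destination paths directly and returns a data frame with one row per file. The wrapper below is a sketch under that assumption. download_with_report is a hypothetical name, and extra arguments such as userpwd are passed through as handle options. Note that the success column only says the request completed, so for HTTP you may also want to inspect status_code.

```r
library(curl)

# Hypothetical wrapper around curl::multi_download() (requires curl >= 5.0.0).
# Returns the full result table plus the rows that did not succeed.
download_with_report <- function(file_remote, file_local, ...) {
  res <- curl::multi_download(
    urls = file_remote,
    destfiles = file_local,
    progress = FALSE,
    ...  # extra handle options, e.g. userpwd = "user:password"
  )

  # success is FALSE on a transfer error and NA if the download was interrupted
  ok <- !is.na(res$success) & res$success
  failed <- res[!ok, c("url", "destfile", "status_code", "error")]

  if (nrow(failed) > 0) {
    warning(nrow(failed), " download(s) failed; see the 'failed' element")
  }
  list(result = res, failed = failed)
}
```

The failed element then shows exactly which URLs need another attempt, for example by passing failed$url and failed$destfile to a second call.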
Created on 2022-06-05 by the reprex package (v2.0.1)