I have a .csv file containing the transaction IDs of nearly 1 million transactions associated with a Bitcoin wallet (both sent and received), which I read into RStudio as a tibble. Now I am trying to add a column to the table with the fee for each transaction, which has to be fetched via an API call.
For example, to get the fee for the txid 73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f, I have to open: https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f and read the data there directly.
This is my current code to record fees for the first 500 transactions:
library(readr)
library(curl)
# Read the first 500 transactions (txid and amount) and add an empty fee column
tx <- read_csv("transactions.csv", col_names = c("txid", "amount"), skip = 0, n_max = 500)
tx$fee <- 0
# Query blockchain.info once per transaction and store the returned fee
for (i in 1:nrow(tx)) {
  tx$fee[i] <- scan(paste0("https://blockchain.info/q/txfee/", tx$txid[i]))
}
write_csv(tx, "tx_with_fees.csv")
Clearly, my biggest bottleneck is the time taken to access the webpage. The method used to read the data hardly seems to matter (I tried curl, GET, and scan). With the above code, it takes around 0.4 seconds per transaction to fetch and record the fee.
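For reference, a curl-based lookup for a single transaction looks roughly like this (a minimal sketch using curl_fetch_memory from the curl package, with the example txid from above):

library(curl)
# Fetch the fee for one txid via curl instead of scan()
url <- "https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f"
res <- curl_fetch_memory(url)              # raw HTTP response
fee <- as.numeric(rawToChar(res$content))  # the endpoint returns a plain number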
What I did next was simply open five instances of RStudio and run the code on a different set of 100 rows in each instance. This way I have been able to process each row in about 0.1 seconds on average. That is roughly a 4x speedup, but I am sure there are more efficient ways to parallelize this than opening multiple RStudio instances.
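For instance, I suspect something like a socket cluster from the base parallel package could achieve the same thing inside a single R session. Here is a rough, untested sketch of what I have in mind (the worker count is an arbitrary guess and there is no error handling for failed requests):

library(parallel)
library(readr)
tx <- read_csv("transactions.csv", col_names = c("txid", "amount"), skip = 0, n_max = 500)
# Spread the per-transaction HTTP requests over several local workers
cl <- makeCluster(8)   # arbitrary number of workers
tx$fee <- parSapply(cl, tx$txid, function(id) {
  scan(paste0("https://blockchain.info/q/txfee/", id), quiet = TRUE)
})
stopCluster(cl)
write_csv(tx, "tx_with_fees.csv")

The same pattern would presumably work with mclapply on a Unix-alike, but I don't know whether spawning R workers is actually the right tool here compared with issuing asynchronous HTTP requests from a single process (e.g. curl's multi interface).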
What would be the best way to do that?