Best way to add a column to data.table by reading data from a variable url


I have a .csv file containing the transaction IDs of nearly 1 million transactions (both sent and received) associated with a bitcoin wallet, which I read into R as a data.table. Now I am trying to add another column to the table that lists the fee for each transaction, which can be looked up with an API call.
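
For reference, the read step is roughly this (the file name is just a placeholder):

library(data.table)
transactions <- fread("transactions.csv")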

For example, to get the fee for the txid 73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f, I have to open: https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f and read the data there directly.
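
In R, a single lookup amounts to building that URL from the txid and reading back the one number it returns, for example:

txid <- "73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f"
fee  <- scan(paste0("https://blockchain.info/q/txfee/", txid), quiet = TRUE)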

What I have done: first I edited the .csv file in Excel to add a column with the URL for each row, then I wrote the following code in R:

for (i in 1:nrow(transactions)) {
  transactions$fee[i] <- scan(transactions$url[i])  # one HTTP request per row
}

But this way it updates only about 2-3 rows per second. Since I am a novice, I am sure there are much more efficient ways of doing this.

1 Answer


We can do a lot better (~15x) than scan() by using curl::curl_fetch_memory, e.g. with your URL:

URL <- "https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f"

microbenchmark::microbenchmark(
  times = 50L,
  scan = scan(URL, what = integer(), quiet = TRUE),
  GET = as.integer(httr::content(httr::GET(URL))),
  curl = as.integer(rawToChar(curl::curl_fetch_memory(URL)$content))
)
# Unit: microseconds
#  expr      min       lq       mean    median        uq       max neval
#  scan 9388.292 9885.680 10216.9262 10164.120 10502.839 11016.553    50
#   GET 7195.900 7611.485  8342.2855  7832.446  7948.521 22781.104    50
#  curl  511.834  565.067   611.4956   610.391   642.799   790.482    50

identical(
  scan(URL, what = integer(), quiet = TRUE),
  as.integer(rawToChar(curl::curl_fetch_memory(URL)$content))
)
# [1] TRUE

NB: I used integer since the value at your particular URL fits, but as.numeric may be more appropriate in general.
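
Dropping that into your table could look roughly like this (using the url and fee columns from your question):

fetch_fee <- function(u) as.integer(rawToChar(curl::curl_fetch_memory(u)$content))
transactions[, fee := vapply(url, fetch_fee, integer(1))]  # still one request per row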

That said, I still think hitting the web is the biggest bottleneck, and you may find some payoff in requesting a payload with more than one transaction at a time. If not, your biggest performance improvement will come from parallelizing the requests.
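
If batching isn't possible, here is a rough sketch of issuing the requests concurrently with curl's multi interface (column names as in your question; for ~1M rows you would want to queue in chunks and respect the API's rate limits):

library(curl)

fees <- integer(nrow(transactions))
pool <- new_pool()
for (i in seq_len(nrow(transactions))) {
  local({
    j <- i  # freeze the row index for this callback
    curl_fetch_multi(
      transactions$url[j], pool = pool,
      done = function(res) fees[j] <<- as.integer(rawToChar(res$content)),
      fail = function(msg) fees[j] <<- NA_integer_
    )
  })
}
multi_run(pool = pool)  # performs all queued requests concurrently
transactions[, fee := fees]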