I am currently trying to download a large database using RPostgres. However, when I try to write the results to CSV files in batches, it writes the same data over and over (always the first 1M records). Here is the code I am currently using:
library(DBI)
library(RPostgres)

wrds <- dbConnect(Postgres(),
                  host = 'wrds-pgdata.wharton.upenn.edu',
                  port = 9737,
                  dbname = 'wrds',
                  sslmode = 'require',
                  user = 'username')
res <- dbSendQuery(wrds, "SELECT VARNAMES FROM DATABASE WHERE CONDITIONS")
i <- 0
while (!dbHasCompleted(res)) {
  i <- i + 1
  chunk <- dbFetch(res, 1000000)  # Fetch 1,000,000 rows at a time
  filename <- paste("path\\EuropeanSMEs_", i)
  write.csv(data, filename, row.names = FALSE)
}
dbClearResult(res)
dbDisconnect(wrds)
How do I amend the code so that it fetches a new set of 1M rows on every iteration?
I don't think there is a simple way to do this. Using PROC EXPORT from SAS would work (you would then need to download the .csv file from WRDS). If I really needed to have a CSV file of a large WRDS data set, I might adapt my function here to write to a local CSV file rather than a PostgreSQL database.

Another option might be to look for a variable that roughly partitions the data into chunks of about one million rows, for example year or similar, and use the WHERE part of the SQL to get rows in chunks. Using dbplyr (see examples here), you could probably write a small function to collect() data by year or some other partitioning variable and write each chunk to a CSV.
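The partition-by-variable idea could be sketched roughly as follows. This is only a sketch: `make_chunk_query`, `export_by_year`, the `year` column, and the `SELECT VARNAMES FROM DATABASE` base query are placeholders standing in for your actual table and partitioning variable, and it assumes each year's slice fits comfortably in memory.

```r
library(DBI)

# Build a query restricted to one value of the partitioning column.
# (Placeholder names; adapt base_query and partition_col to your table.)
make_chunk_query <- function(base_query, partition_col, value) {
  sprintf("%s WHERE %s = %d", base_query, partition_col, value)
}

# Loop over years, fetch one partition per round trip, write one CSV each.
# `con` is assumed to be an open connection as in the question.
export_by_year <- function(con, years, out_prefix) {
  for (y in years) {
    q <- make_chunk_query("SELECT VARNAMES FROM DATABASE", "year", y)
    chunk <- dbGetQuery(con, q)  # fetches the whole partition at once
    write.csv(chunk, paste0(out_prefix, y, ".csv"), row.names = FALSE)
  }
}
```

For example, `export_by_year(wrds, 2010:2020, "path\\EuropeanSMEs_")` would produce one file per year, each holding only that year's rows.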