Data formats for very large data while preserving data.table functionality

I have a script which produces several intermediate data sets that would significantly exceed the maximum number of rows R can index with standard integers (2^31 - 1). My system has enough memory to hold the data (e.g. I can store matrices of that size, but not reshape them to long format), but I don't know which file formats can handle data of that size. I want to achieve two things simultaneously: (1) store data with more than 2^31 rows, and (2) keep using data.table (or similar) functionality while processing the data.
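To make the problem concrete, here is a rough illustration of where the limit bites (the matrix dimensions are made up for the example): a wide matrix that fits in memory can already imply more long-format rows than R can index.

```r
## Illustrative only: hypothetical dimensions chosen so that the
## long-format row count exceeds R's integer row limit.
n_rows <- 50000
n_cols <- 50000

## The wide matrix has 2.5e9 cells; melting it to long format needs one
## row per cell, which is more than 2^31 - 1 (~2.147e9) rows.
n_long_rows <- as.numeric(n_rows) * n_cols
n_long_rows > 2^31 - 1
#> [1] TRUE
```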
I know there are methods for achieving (1), such as the arrow package, but my understanding is that those file formats then require a whole other way of processing the data, which prevents (2). From what I understand, the bit64 package cannot be used to 'cheat' R into providing row indices beyond that limit.
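For reference, this is the kind of arrow workflow I mean (a minimal sketch; the file path and column names are made up), where intermediate chunks are written into one partitioned Parquet dataset and only a filtered subset is pulled back into a data.table:

```r
library(arrow)
library(data.table)
library(dplyr)

## Hypothetical example: write each intermediate chunk as its own Parquet
## file inside one dataset directory, so no single R object ever needs
## more than 2^31 - 1 rows.
dt_chunk <- data.table(id = 1:1e6, value = rnorm(1e6), part = 1L)
write_dataset(dt_chunk, "intermediate_data", partitioning = "part")

## Reading it back goes through arrow/dplyr verbs rather than data.table
## syntax; only the collected subset becomes a data.table again.
ds <- open_dataset("intermediate_data")
subset_dt <- ds |>
  filter(value > 0) |>
  collect() |>
  as.data.table()
```

That solves the storage side, but the processing step is then expressed in dplyr/arrow verbs rather than data.table syntax, which is exactly the rewrite I am trying to avoid.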
Basically, I have already written a lot of code built on data.table functionality, and I would prefer to keep using it rather than rewriting everything. Is there a solution for that?
Sorry, no reproducible example (I'm not sure one is appropriate for this question).