Load first N rows from an .RData file


I googled around but could not find an answer to my question. Functions like scan (base package) and fread (data.table package) do a very good job of reading just the first N lines, as specified by the user, from a .txt or .csv file. With .RData, however, load reads the entire file, and there is no way to specify how many rows should be read from it.

I have .RData files that are over 3 GB in size, each containing a single data.frame or data.table. I don't always need to load the entire file, just, say, the first 100 or 1,000 rows of the object. Is there a way to do this?


There are 3 best solutions below


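Try read_lines_raw from the readr package. Note that it returns the first lines of the file as raw bytes rather than rows of the stored object, so for a compressed .RData file the result is not directly usable as a data frame:

library(readr)
first_1000 <- read_lines_raw(rdata_filename, skip = 0, n_max = 1000)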

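What about this simple workaround? It assumes the object was saved with saveRDS() (readRDS() cannot read .RData files written by save()), and it still reads the whole object into memory before head() trims it.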

my_data <- head(readRDS("my_data.RDS"), n = 1000)

Set the n parameter of head() to whatever you need.

You could even make yourself a little function if you plan to do this a lot.

read_rds <- function(file, n) {
  # file can be either a connection object or a character string containing a path
  head(readRDS(file), n)
}
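
Since the question is about .RData rather than .RDS, the same idea can be sketched with load(). This is only an illustration: load_head is a hypothetical helper name, it assumes each file holds exactly one object (as the question states), and like the readRDS() version it still reads the whole file before trimming.

```r
# Hypothetical helper (not part of the original answer): return the first n rows
# of the single object stored in an .RData file.
load_head <- function(file, n = 1000) {
  env <- new.env()                      # keep loaded objects out of the global environment
  obj_names <- load(file, envir = env)  # load() returns the names of the restored objects
  stopifnot(length(obj_names) == 1)     # assumption: exactly one object per file
  head(env[[obj_names]], n)
}

# Usage with a throwaway file
tmp <- tempfile(fileext = ".RData")
cars_df <- mtcars
save(cars_df, file = tmp)
first_rows <- load_head(tmp, n = 3)
```

Loading into a separate environment avoids overwriting objects of the same name in the workspace.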

My guess is there isn't an out-of-the-box solution for this.

If we look at a sample ASCII-encoded, uncompressed RDS file, we see that it is stored in column-major order:

saveRDS(mtcars[1:5, 1:2], "testrds.rds", ascii = TRUE, compress = FALSE)

This yields the following file (with comments inserted by me):

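A        ## ASCII format marker
3        ## serialization format version
262146   ## R version that wrote the file (4.0.2, encoded as 4*65536 + 0*256 + 2)
197888   ## minimal R version that can read this format (3.5.0)
6        ## length of the native-encoding string that follows
CP1252   ## native encoding of the writing session
787      ## type/flags word: a list (type 19) with attributes and the object bit set
2        ## length of the list: 2 columns
14       ## type word: numeric (double) vector
5        ## length: 5 items in this vector (column)
21       ## first column starts here (you know only from the type and length lines above)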
21
22.8
21.4
18.7    ## first column ends here
14
5       ## Again, This seems to indicate 5 items in this vector (column)
6       ## second column starts here
6
4
6
8       ## second column ends here
1026      ## attributes start here (names, row.names, class), stored as a tagged pairlist
1
262153
5
names                ## col names
16
2
262153
3
mpg                  ### first col name
262153
3
cyl                  ### second col name
1026
1
262153
9
row.names            ## 2nd attribute: row.names 
16
5
262153
9
Mazda\040RX4         ### first row name
262153
13
Mazda\040RX4\040Wag  ### second row name
262153
10
Datsun\040710        ### ...
262153
14
Hornet\0404\040Drive
262153
17
Hornet\040Sportabout ### last row name
1026
1
262153
5
class                ## 3rd attribute: class
16
1
262153
10
data.frame           ### value of class
254

As you can see with this simple RDS file, the values of each column are stored contiguously, so reading the first few rows of data still requires parsing the whole file and knowing which values to skip over in every column. You'd also want more documentation of the RDS format than is in the R Internals doc.
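
The column-major layout can be checked directly by reading the ASCII file back as text. The following is a rough sketch, not a general parser: the line offsets are assumptions that hold only for this particular two-column numeric example written with serialization version 3, and the variable names are made up for illustration.

```r
# Write the same sample as above to a temporary file, forcing format version 3
path <- tempfile(fileext = ".rds")
saveRDS(mtcars[1:5, 1:2], path, ascii = TRUE, compress = FALSE, version = 3)
lines <- readLines(path)

# Assumed layout for this specific file: an 8-line header, then for each column
# a type line, a length line, and the values themselves.
mpg_values <- as.numeric(lines[11:15])  # all of column 1 comes first
cyl_values <- as.numeric(lines[18:22])  # column 2 only starts after column 1 ends

# Line 3 packs the writing R version as major*65536 + minor*256 + patch
writer <- as.integer(lines[3])
writer_version <- c(writer %/% 65536L, (writer %/% 256L) %% 256L, writer %% 256L)
```

Getting the first row therefore means picking one value out of each column block, which is why a partial read cannot simply stop after the first few lines.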

Based on this simple example, one could probably make some guesses and get a rough draft function working for RDS files you know are data frames, but it would take a bit of work, and a lot more work if you wanted to make it robust enough to handle more complex data frames (e.g., with factor and Date columns). RData files have a similar but slightly more complex format, since they can hold multiple objects.

All in all, I think RDS and RData are poor choices for data you might want to load partially. You'd do better with a CSV or TSV file; then you could use the standard options you mention in your question (or vroom::vroom) to load only the data you want into memory.
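
As a minimal sketch of that suggestion using only base R (the file here is a temporary stand-in for your real data), you would convert once and then read just the rows you need:

```r
# One-time conversion: write the data frame out as CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Later: read only the first 10 data rows instead of the whole file
first_rows <- read.csv(csv_path, nrows = 10)
```

data.table::fread() accepts an nrows argument as well, which may be faster on files of several gigabytes.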