Memory-mapped file format for operating on bigger-than-memory files. Arrow?


I have a matrix with 50K columns and 500K rows that I would like to subset by column names/indices pretty fast and without using much memory (e.g. memory-mapped). Most columns are {NA,1,2}; a minority (1%) of the columns are quantitative or strings. What file formats/frameworks in R are most suitable for this?

I thought I could use Feather for this, but it seems to load the entire file and uses almost as much memory as the data.table equivalent, even though I set as_data_frame = FALSE:

  f="/path/to/matrix.50Kcolums.500Krows.tsv"
  df <- data.table::fread(f) #
  arrow::write_feather(df,paste0(f,".feather"))  
  df <- read_feather(f.arrow, as_data_frame = FALSE) # uses almost as much memory as fread()
  df <- as.data.frame(df[,grep("columns_with_some_name", names(df))]) # this is what I need it to do fast and without using much memory. 

Any thoughts?

1 Answer


@Jon Keane is correct. Using col_select should allow you to achieve this.

(conbench2) pace@pace-desktop:~/dev/arrow/r$ /usr/bin/time -v Rscript -e "print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
Table
500000 rows x 3 columns
$f0 <int32>
$f7000 <int32>
$f32000 <int32>
    Command being timed: "Rscript -e print(arrow::read_feather('/home/pace/dev/data/feather/big/data.feather', col_select=c('f0', 'f7000', 'f32000'), as_data_frame = FALSE))"
    User time (seconds): 1.16
    System time (seconds): 0.51
    Percent of CPU this job got: 150%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.11
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 262660
    Average resident set size (kbytes): 0
    ...
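For comparison with the original snippet, here is a minimal sketch of that call from R, with a placeholder path and name pattern; col_select accepts either a character vector of names or a tidy selection, so you should be able to match columns by name without pulling in the rest of the file:

  library(arrow)

  feather_path <- "/path/to/matrix.50Kcolumns.500Krows.tsv.feather"  # placeholder path

  # Read only the columns whose names match the pattern; unselected
  # columns should never be materialised in memory.
  tbl <- read_feather(feather_path,
                      col_select = tidyselect::matches("columns_with_some_name"),
                      as_data_frame = FALSE)

  # Convert just that small selection to a regular data.frame.
  df <- as.data.frame(tbl)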

That being said, Feather may not be the best format when your entire file does not fit into memory. In that case, even with memory mapping, you still have to perform I/O. If you are repeatedly accessing the same small set of columns again and again you should be fine: they will quickly be loaded into the page cache and the I/O cost will disappear.

On the other hand, if you are accessing different columns each time, or you expect large gaps of time between runs (in which case the pages won't be in the page cache), you may want to consider Parquet. Parquet requires more CPU time to compress/decompress, but it should reduce the amount of data you need to read. Of course, for relatively small selections (e.g. 0.2% of the dataset) the difference in performance will probably be pretty small. Even then, Parquet may spare your hard disk: the table you describe takes up ~100 GB as Feather, and since "Most columns are {NA,1,2}" I would expect the data to be highly compressible.
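If you go that route, here is a rough sketch of the round trip, with placeholder paths and column names; read_parquet takes the same col_select argument as read_feather, and write_parquet applies compression by default:

  library(arrow)

  # One-off conversion (placeholder paths); this step still needs enough
  # RAM to hold the full table, just like the fread() call above.
  df <- data.table::fread("/path/to/matrix.50Kcolumns.500Krows.tsv")
  write_parquet(df, "/path/to/matrix.parquet")

  # Later runs: read only the columns you need.
  sub <- read_parquet("/path/to/matrix.parquet",
                      col_select = c("col_a", "col_b"),  # placeholder column names
                      as_data_frame = TRUE)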