R data.table interface to on-disk fst files: fst_table


For a large dataset, I want to use the fst_table function from the fsttable package (from the "fstpackage" GitHub organization), found here: https://github.com/fstpackage/fsttable.

devtools::install_github("fstpackage/fsttable")
library(fsttable)
nr_of_rows <- 1e6
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
ft <- fst_table("1.fst")

I can extract rows and columns from the created file. However, is it possible to do operations like:

ft[X == 1,]

as in a standard data.table? Or can I create a key on this data.table for fast serialization? My goal is to extract data based on the values of its columns without loading the whole dataset into memory.


Original

Unfortunately, fsttable only supports loading the dataset and selecting columns/rows, even though the package documentation says:

This fst_table can be used as a regular data.table object

In reality, regular data.table operations such as the one you mentioned cannot be performed (at least with version 0.1.3). The main reason is that we are not actually working with a data.table object, but with a data.table interface:

> class(ft)
[1] "datatableinterface" "data.table"         "data.frame" 

However, the data in the fsttable object can be "pulled" as a vector and then filtered. Following your example, any of these returns column X as a vector (the last form needs dplyr for %>% and pull()):

ft[,list(X)]$X
ft[,list(X)][['X']]
ft[,list(X)] %>% pull()

And then filtered, for example:

> ft[,list(X)]$X[ft[,list(X)]$X==1]
[1] 1

I presume there should be an easy way to convert an fsttable object to a genuine data.table by pulling each variable and then binding them all together.

Edit

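Actually, read_fst() from the fst package (available on CRAN, by the same author) has an argument to load datasets as a data.table, so there is no need for the fsttable package. Note that read_fst() loads whatever it reads into memory; its columns, from and to arguments limit how much that is.

library(fst)
library(dplyr)  # only needed for %>% and as_tibble()

# the first argument is the path to the file, "1.fst" in the question
ft <- read_fst("1.fst", as.data.table = TRUE)

# if only a few columns are desired
ft <- read_fst("1.fst", columns = c("X"), as.data.table = TRUE)

# if a tibble is needed
ft <- read_fst("1.fst", as.data.table = TRUE) %>% as_tibble()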
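To get closer to the goal of filtering on column values without loading the whole dataset, the columns, from and to arguments of read_fst() can be combined. The following is only a minimal sketch, not something fst or fsttable provides: read_fst_where() is a hypothetical helper name, the temporary file path is for illustration, and the approach assumes the filter column alone fits in memory.

```r
library(fst)
library(data.table)

# Recreate the example file from the question (temporary path, for illustration)
path <- tempfile(fileext = ".fst")
nr_of_rows <- 1e6
x <- data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
write_fst(x, path)

# Hypothetical helper: return the rows where `column` equals `value`
read_fst_where <- function(path, column, value) {
  # Step 1: read only the column used in the filter
  keys <- read_fst(path, columns = column)[[column]]

  # Step 2: find the positions of the matching rows
  idx <- which(keys == value)
  if (length(idx) == 0L) {
    # No match: return an empty data.table with the file's columns
    return(read_fst(path, from = 1, to = 1, as.data.table = TRUE)[0])
  }

  # Step 3: read the contiguous block of rows that spans all matches,
  # then keep only the matching rows inside that block
  first <- min(idx)
  block <- read_fst(path, from = first, to = max(idx), as.data.table = TRUE)
  block[idx - first + 1L]
}

res <- read_fst_where(path, "X", 1L)
print(res)
```

This keeps memory use low when the matches sit close together in the file. If they are spread from the first row to the last, step 3 still reads nearly everything, so sorting the data by the filter column before writing it with write_fst() would help.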