My data (df) contains ~2,000K (2M) rows and ~5K unique names. For each unique name, I want to select all the rows of df whose 'names' column contains that specific name. For example, the data frame df looks as follows:
id names
1 A,B,D
2 A,B
3 A
4 B,D
5 C,E
6 C,D,E
7 A,E
I want to select all the rows which contain 'A' ('A' is among the 5K unique names) in the column 'names'. So the output will be:
id names
1 A,B,D
2 A,B
3 A
7 A,E
I am trying to do this in parallel with mclapply, using 20 nodes and 80 GB of memory, but I still run into out-of-memory issues.
Here is my code to select the rows containing a specific name:
subset_select <- function(x, df) {
  # grepl() over as.matrix(df) returns one logical per cell (fixed = TRUE
  # means plain substring matching); `dim<-` reshapes that flat vector back
  # into a matrix so rowSums() can count the matching cells per row
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed = TRUE), dim(df))
    ) > 0
  )
  # keep the rows with at least one match
  df[indx, ]
}
df_subset <- subset_select(name, df)
My question is: is there a more efficient way (in terms of runtime and memory consumption) to get the subset of data for each of the 5K unique names? TIA.
Here is a parallelized way with package parallel.
First, the data set has 2M rows. The following code is meant to show it, not more; see the commented scan line in the sketch below.
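The answer's original code block did not survive formatting, so this is a minimal sketch of how a 2M-row test set of this shape could be simulated. The seed, the name pool, and the file name are assumptions for illustration; the scan() call that would read the real file is left as a comment.

# build a simulated data set with the shape described in the question
set.seed(2021)
n <- 2e6                                  # 2M rows, as stated above
pool <- c(LETTERS, letters)               # stand-in for the ~5K unique names
names_col <- replicate(
  n,
  paste(sample(pool, sample(1:4, 1)), collapse = ",")
)
# names_col <- scan("names.txt", what = character(), sep = "\n")  # real data
df <- data.frame(id = seq_len(n), names = names_col)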
Now the code. The parallelized mclapply loop breaks the data into chunks of N rows each and processes them independently; the return value inx2 must then be unlisted. A sketch of that loop follows.
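The answer's loop itself was also lost in formatting, so this is a minimal sketch of such a chunked search, keeping the N and inx2 names from the text. The chunk size, the core count, and restricting the search to the names column are assumptions.

library(parallel)

x <- "A"     # the name being searched for
N <- 1e5     # rows per chunk
# split the row indices into chunks of N rows each
chunks <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / N))
# each worker returns the global row indices in its chunk that contain x
inx2 <- mclapply(chunks, function(k) {
  k[grepl(x, df$names[k], fixed = TRUE)]
}, mc.cores = 20)
inx2 <- sort(unlist(inx2))   # flatten per-chunk results, restore row order
df_subset <- df[inx2, ]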
The mclapply version took less than half the time the straightforward grep took. R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.
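For reference, the single-threaded grep baseline behind that timing comparison would be something like the following (again assuming only the names column is searched):

inx1 <- grep(x, df$names, fixed = TRUE)   # one sequential pass over the column
df_subset <- df[inx1, ]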