Parallel processing: select all rows that contain a specific keyword in R


My data frame df contains ~2,000K (2 million) rows and ~5K unique names. For each unique name, I want to select all rows of df whose names column contains that name. For example, df looks as follows:

id  names
1   A,B,D
2   A,B
3   A
4   B,D
5   C,E
6   C,D,E
7   A,E

I want to select all rows that contain 'A' (A is one of the 5K unique names) in the column 'names'. So, the output will be:

id  names
1   A,B,D
2   A,B
3   A
7   A,E

I am trying to do this with parallel processing via mclapply, using 20 cores and 80 GB of memory, but I still run into out-of-memory errors.

Here is my code to select the rows containing specific name:

subset_select <- function(x, df) {
  # Flag every cell of df that contains the string x, then keep rows
  # where at least one cell matched. Note: as.matrix(df) copies the
  # whole data frame to a character matrix on every call.
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed = TRUE), dim(df))
    ) > 0
  )
  df[indx, ]
}

df_subset <- subset_select(name, df)

My question is: is there any other way to get the subset of the data for each of the 5K unique names more efficiently (in terms of runtime and memory consumption)? TIA.
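For reference, a minimal sketch of a cheaper variant (subset_select2 is a hypothetical name, not from the question): searching only the names column avoids the full as.matrix(df) character copy that the function above makes on every call.

```r
# A sketch: search only the 'names' column instead of as.matrix(df),
# which copies the entire data frame to a character matrix each call.
df <- data.frame(id = 1:7,
                 names = c("A,B,D", "A,B", "A", "B,D", "C,E", "C,D,E", "A,E"))

subset_select2 <- function(x, df) {
  df[grepl(x, df$names, fixed = TRUE), ]
}

subset_select2("A", df)$id
# [1] 1 2 3 7
```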


Answer by Rui Barradas (accepted):

Here is a parallelized way with package parallel.
First, the data set has 2M rows. The following code only shows how it was read in, nothing more; see the commented output line after scan.

x <- scan(file = "~/tmp/temp.txt")
#Read 2000000 items
df1 <- data.frame(id = seq_along(x), names = x)

Now the code.
The parallelized mclapply loop breaks the data into chunks of N rows each and processes them independently. The return value inx2 then has to be unlisted.

library(parallel)

ncores <- detectCores() - 1L
pat <- "A"

t1 <- system.time({
  inx1 <- grep(pat, df1$names)
})

t2 <- system.time({
  N <- 10000L
  iters <- seq_len(ceiling(nrow(df1) / N))
  inx2 <- mclapply(iters, function(k){
    i <- seq_len(N) + (k - 1L)*N
    j <- grep(pat, df1[i, "names"])
    i[j]
  }, mc.cores = ncores)
  inx2 <- unlist(inx2)
})

identical(df1[inx1, ], df1[inx2, ])  
#[1] TRUE

rbind(t1, t2)
#   user.self sys.self elapsed user.child sys.child
#t1     5.325    0.001   5.371      0.000     0.000
#t2     0.054    0.093   2.446      3.688     0.074

The mclapply version took less than half the elapsed time of the straightforward grep.
(Timed with R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.)
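One caveat worth noting (my addition, not part of the answer above): with 5K real-world names, fixed substring matching can over-match, because searching for one name also hits any longer name that contains it. Splitting on the comma and testing membership gives an exact-token match; "AB,C" below is a hypothetical row illustrating the problem.

```r
x <- c("A,B,D", "AB,C", "A")

grepl("A", x, fixed = TRUE)
# [1] TRUE TRUE TRUE          <- "AB,C" matches too

# Exact-token match: split on commas, then test membership
vapply(strsplit(x, ",", fixed = TRUE), function(s) "A" %in% s, logical(1))
# [1]  TRUE FALSE  TRUE
```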

Answer by HenrikB:

If you need to repeat this for multiple "names", then base::by() might be useful to pre-group the data, e.g.

data <- read.table(header=TRUE, text=
"id  names
1   A,B,D
2   A,B
3   A
4   B,D
5   C,E
6   C,D,E
7   A,E
8   A
9   A,B"
)

groups <- by(data, INDICES = data$names, FUN = function(x) x$id)
print(groups)
#> data$names: A
#> [1] 3 8
#> ------------------------------------------------------------ 
#> data$names: A,B
#> [1] 2 9
#> ------------------------------------------------------------ 
#> data$names: A,B,D
#> [1] 1
#> ------------------------------------------------------------ 
#> data$names: A,E
#> [1] 7
#> ------------------------------------------------------------ 
#> data$names: B,D
#> [1] 4
#> ------------------------------------------------------------ 
#> data$names: C,D,E
#> [1] 6
#> ------------------------------------------------------------ 
#> data$names: C,E
#> [1] 5

print(groups$A)
#> [1] 3 8

Then one can find all groups containing A, and their ids, as:

name <- "A"
groups_subset <- groups[grep(name, names(groups))]
idxs <- sort(unlist(groups_subset, use.names = FALSE))
data_subset <- data[idxs, ]
rownames(data_subset) <- NULL  ## optional
print(data_subset)
#>   id names
#> 1  1 A,B,D
#> 2  2   A,B
#> 3  3     A
#> 4  7   A,E
#> 5  8     A
#> 6  9   A,B
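Since groups only needs to be computed once, it can be reused for every distinct name; the sketch below (all_names and subsets are illustrative helpers, not from the answer) builds one subset per name. Note this works here because id equals the row number, and the same substring-matching caveat from grep applies.

```r
# Same example data as above
data <- read.table(header = TRUE, text = "id names
1 A,B,D
2 A,B
3 A
4 B,D
5 C,E
6 C,D,E
7 A,E
8 A
9 A,B")

# Pre-group once, keyed by the full comma-separated 'names' string
groups <- by(data, INDICES = data$names, FUN = function(x) x$id)

# Every distinct single name appearing anywhere in the 'names' column
all_names <- sort(unique(unlist(strsplit(as.character(data$names), ",",
                                         fixed = TRUE))))

# One subset per name, reusing the precomputed 'groups'
subsets <- lapply(all_names, function(nm) {
  idxs <- sort(unlist(groups[grep(nm, names(groups), fixed = TRUE)],
                      use.names = FALSE))
  data[idxs, ]
})
names(subsets) <- all_names

subsets[["D"]]$id
# [1] 1 4 6
```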

Does that look correct to you? If so, you can try future.apply and its future_by() to run this in parallel (disclaimer: I'm the author of future.apply):

library(future.apply)

## Run in parallel using forked ("multicore") processing
## All cores by default, otherwise add 'workers = 20' 
plan(multicore)

data <- ... as above ...

groups <- future_by(data, INDICES = data$names, FUN = function(x) x$id)

name <- "A"
groups_subset <- groups[grep(name, names(groups))]
idxs <- sort(unlist(groups_subset, use.names = FALSE))
data_subset <- data[idxs, ]
rownames(data_subset) <- NULL  ## optional
print(data_subset)
#>   id names
#> 1  1 A,B,D
#> 2  2   A,B
#> 3  3     A
#> 4  7   A,E
#> 5  8     A
#> 6  9   A,B