How to make mclapply in Rscript maximize use of all available Linux cores?


I'm reading in a parquet file with ~1 million rows, wrangling each row, and writing out CSVs. The data wrangling itself is quite simple: for each UserID I select all of its rows (there are several per UserID, in random order within the dataframe) and write them out to that UserID's individual CSV. But since there are so many rows, the script runs for ~5 hours, and I have hundreds of parquet files overall, so I need to parallelize. I used the mclapply() function to parallelize by UserID. The script runs successfully, but is barely faster than when I run it with a single core.

I opened the command line and ran htop and confirmed that each core is at 5% or less utilization while running this script. When I initially run the script, each core is 100% utilized, but a few minutes later the utilization plummets. How can I ensure CPUs are used efficiently with mclapply? I've tried increasing the mc.cores argument from 16 to 100 and I get the same problem every time. I'm on a Linux Ubuntu VM with 16 cores and 128 GB of RAM, but I can adjust the settings to give myself more cores and/or memory.
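For reference, here is a minimal, self-contained sketch of the kind of approach described above (this is an illustration, not the actual script: it uses toy inline data in place of the real parquet file, and only the UserID column name comes from the question; the output directory and other names are hypothetical). The key idea is to split the data.frame once by UserID up front, so each mclapply() worker only writes its chunk rather than re-scanning the full dataframe:

```r
library(parallel)  # mclapply() forks the R process; Linux/macOS only

# Toy stand-in for the real parquet data: several rows per UserID,
# in random order within the dataframe.
df <- data.frame(UserID = c("a", "b", "a", "c", "b", "a"),
                 value  = 1:6,
                 stringsAsFactors = FALSE)

out_dir <- tempdir()

# Split ONCE into a named list of per-user data.frames,
# instead of filtering the full df inside each worker.
by_user <- split(df, df$UserID)

# Each worker writes one user's rows to its own CSV.
invisible(mclapply(names(by_user), function(uid) {
  write.csv(by_user[[uid]],
            file = file.path(out_dir, paste0(uid, ".csv")),
            row.names = FALSE)
}, mc.cores = 2))
```

Note that a job like this is often disk-I/O-bound rather than CPU-bound, which would be consistent with cores sitting near-idle in htop even though mclapply() has forked the requested number of workers.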
