Process memory size with furrr, globals and packages is too high


I just noticed that when launching multiple sessions, each one is ~100 MB.


This gets even worse when launching from Shiny: the size jumps to ~200 MB.

I tried to limit the memory by disabling globals and restricting the packages loaded on the workers:

library(future)
library(furrr)

packages_to_load <- c("paws", "jsonlite")
plan(multisession, workers = 10)

results_calc_rds <- future_map(.x = tokens,
                               .f = my_fun,
                               .options = furrr::furrr_options(seed = NULL,
                                                               globals = FALSE,
                                                               packages = packages_to_load))

But it doesn't seem to have an impact.

Does anyone have an idea how to make these sessions as small as possible?

All I need are the packages paws and jsonlite to do some AWS invokes.

Thank you!

Answer by VonC:

Instead of specifying packages to load in future_map(), you could try loading them before planning your workers.
Use explicit namespacing for functions to avoid attaching entire packages, e.g., jsonlite::fromJSON() instead of loading jsonlite.
Make sure only the necessary global variables and data are being sent to the workers; avoid passing large datasets if they are not required.

# Pre-load the necessary packages in the parent session
library(future)
library(furrr)
library(paws)
library(jsonlite)

# Plan the workers without specifying packages to load
plan(multisession, workers = 10)

# Use explicit namespacing and minimize the data sent to workers
results_calc_rds <- future_map(.x = tokens,
                               .f = function(x) {
                                 # Call functions directly from their namespaces
                                 result <- paws::some_function(x)  # placeholder for your actual paws call
                                 json <- jsonlite::fromJSON(result)
                                 # Process your data here
                                 json
                               },
                               .options = furrr::furrr_options(seed = NULL,
                                                               globals = FALSE))

By pre-loading the packages and using explicit namespacing, you should avoid the overhead of loading packages for each worker.

But R's memory management may not always reflect the savings immediately, as garbage collection is not instantaneous.
However, as noted in "Advanced R / Memory usage and garbage collection" by Hadley Wickham, you do not need to call gc() yourself to prompt R to clean up unused memory:

Despite what you might have read elsewhere, there’s never any need to call gc() yourself. R will automatically run garbage collection whenever it needs more space; if you want to see when that is, call gcinfo(TRUE). The only reason you might want to call gc() is to ask R to return memory to the operating system. However, even that might not have any effect: older versions of Windows had no way for a program to return memory to the OS.
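
If you want to check what each worker actually holds, a rough sketch (reusing the multisession plan above; how iterations are chunked across workers is only approximate) is to run gc() inside each worker and return what it reports. Note this only measures R's own heap, not the full OS process size, which also includes the base R session itself.

# Rough sketch: ask each worker to garbage-collect and report its R heap usage
worker_mem <- future_map(seq_len(10), function(i) {
  usage <- gc()  # runs a collection and returns a usage matrix
  data.frame(pid = Sys.getpid(),
             used_mb = sum(usage[, 2]))  # column 2 is the "(Mb)" used column
}, .options = furrr::furrr_options(seed = NULL, globals = FALSE))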


I still need to load other packages and data within my scripts...

If you need to load other packages and data within your scripts and are concerned about memory usage, you can consider lazy-loading your packages: requireNamespace() checks that a package is available and loads its namespace without attaching it to the search path. Attach a package only when it is certain to be used.

If you need to pass large datasets to your workers, consider compressing them before sending and decompressing within the worker function.
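
A minimal sketch using only base R serialization, where big_df stands in for whatever large object your workers need:

# 'big_df' is a placeholder for your large object.
# Compress it once in the parent, then decompress it inside each worker.
compressed_df <- memCompress(serialize(big_df, NULL), type = "gzip")

results <- future_map(tokens, function(token) {
  df <- unserialize(memDecompress(compressed_df, type = "gzip"))
  # ... work with df and token here ...
  nrow(df)
}, .options = furrr::furrr_options(seed = NULL))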

Also, convert data frames to more memory-efficient structures such as data.table, or use the fst package for fast serialization of data frames.
And use the future package's ability to identify and transfer only the globals the workers actually need, as sketched below.
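
A hedged sketch combining both ideas, assuming the fst package is installed; big_df, the file name, and the column names are placeholders:

library(fst)

# Write the data once to disk instead of shipping it to every worker
write_fst(big_df, "big_df.fst")

small_lookup <- c(a = 1, b = 2)  # a small object the workers genuinely need

results <- future_map(tokens, function(token) {
  # Each worker reads only the columns it needs from disk
  chunk <- fst::read_fst("big_df.fst", columns = c("token", "payload"))
  # ... work with chunk, token and small_lookup here ...
  nrow(chunk) + small_lookup[["a"]]
}, .options = furrr::furrr_options(
     seed = NULL,
     globals = "small_lookup"  # export only this named global to the workers
))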

If running on a server, use Docker containers to set a memory limit for each R session.

Your code would be:

# Check that the packages are installed, without attaching them
if (!requireNamespace("paws", quietly = TRUE)) {
  install.packages("paws")
}
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}

library(future)
library(furrr)

# Plan the workers without specifying packages to load
plan(multisession, workers = 10)

# Use a custom function to load namespaces and data only as needed
load_packages_and_data <- function() {
  # Load the namespaces without attaching them to the search path
  requireNamespace("paws", quietly = TRUE)
  requireNamespace("jsonlite", quietly = TRUE)
  # Load or process your data here
}

results_calc_rds <- future_map(
  .x = tokens,
  .f = function(x) {
    # Call the function to load packages and data within the worker
    load_packages_and_data()
    # Use the necessary functions from the packages
    result <- paws::some_function(x)  # placeholder for your actual paws call
    json <- jsonlite::fromJSON(result)
    # Process your data here
    json
  },
  # Export only the helper function; nothing else is sent to the workers
  .options = furrr::furrr_options(seed = NULL, globals = "load_packages_and_data")
)

By conditionally loading packages and data only when needed, you can minimize the memory footprint of each worker.