Python Ray out of memory (OOM) with RAY_memory_monitor_refresh_ms set to 0

I'm using Modin to run pandas code on data that doesn't fit in memory. It works very well locally with a 30 GB dataset and 16 GB of RAM. Now I want to speed this up, so I decided to run a cluster on GCP with Modin on Ray. My dataset is ~50 GB, stored on GCS as multiple files of ~40 MB each, and I'm setting up 7 n1-standard-2 machines with ~7 GB of RAM each. I want to test this setup before moving on to TB-scale datasets. But when I try to create the dataset, my workers get killed with the following error, despite setting all the environment variables:

(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

I'm using the code below to initialize ray:

import modin.pandas as pd
import ray
import os
pd.DEFAULT_NPARTITIONS=280
os.environ["MODIN_ENGINE"] = "ray"

runtime_env = {
    'env_vars': {
        "RAY_memory_monitor_refresh_ms": "0",
        "RAY_memory_usage_threshold": "3"
    }
}
ray.init(runtime_env=runtime_env, _plasma_directory="/tmp")
df = pd.read_parquet("gs://test-data-set/parquets/")

Any advice would be appreciated.

There is 1 best solution below

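The default value of RAY_memory_usage_threshold is 0.95, which means that the raylet will start killing worker processes when the combined memory usage of the worker heap, the object store, and the raylet itself exceeds 95% of the available memory on the node. The threshold is a fraction between 0 and 1, so the value "3" in your code is outside the valid range.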
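As a rough illustration of what that threshold means on the ~7 GB nodes from the question, here is a minimal sketch of the comparison described above. The function name and the byte figures are hypothetical, and this is not Ray's actual implementation, only the arithmetic behind the 95% rule.

```python
GIB = 1024 ** 3  # bytes in one GiB


def exceeds_threshold(used_bytes: int, total_bytes: int, threshold: float = 0.95) -> bool:
    """Return True when the used fraction of node memory is above the threshold.

    Hypothetical helper for illustration only; Ray's memory monitor does its
    own accounting inside the raylet.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes > threshold


# Assumed node size taken from the question (~7 GB per machine).
total = 7 * GIB

# With the 0.95 threshold, usage above roughly 6.65 GiB crosses the line.
kill_point_gib = 0.95 * total / GIB
```

With these numbers, only about 0.35 GiB of headroom remains per node before the threshold is crossed, which is why small nodes tend to hit it quickly.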

If you set the threshold too low, the raylet may kill tasks or actors unnecessarily, which can degrade the performance of your application. If you set it too high, the node may run out of memory before the raylet intervenes, and your application may crash.

You can start with the default threshold of 0.95, monitor the memory usage of your application, and adjust the threshold as needed. Be aware of the threshold's limitations, though: it only controls when workers get killed and does not reduce how much memory your workload needs.
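The error message says the variables must be set "when starting Ray". One way to read that is sketched below, under assumptions: the memory monitor runs inside the raylet, so the variables have to be in the environment of the process that launches Ray, whereas runtime_env env_vars, as far as I understand, only reach the worker processes. The helper names memory_monitor_env and start_local_ray are hypothetical, and the values are examples rather than recommendations.

```python
import os


def memory_monitor_env(threshold: float, refresh_ms: int) -> dict:
    """Build the two environment variables named in the error message.

    Hypothetical helper: it only validates the values and formats them as strings.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a fraction between 0 and 1")
    if refresh_ms < 0:
        raise ValueError("refresh_ms must be zero (monitor disabled) or positive")
    return {
        "RAY_memory_usage_threshold": str(threshold),
        "RAY_memory_monitor_refresh_ms": str(refresh_ms),
    }


def start_local_ray(threshold: float = 0.95, refresh_ms: int = 250):
    """Export the variables, then start a local Ray instance.

    Assumption: the raylet launched by ray.init() inherits this process's
    environment, so the variables must be set before the call.
    """
    os.environ.update(memory_monitor_env(threshold, refresh_ms))
    import ray  # imported late so the environment is already in place

    return ray.init()
```

Calling start_local_ray(threshold=0.9) would apply this on a single machine. On a multi-node cluster like the one in the question, the equivalent would presumably be exporting the same variables in the shell that runs `ray start` on every node.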

For more information about OOM prevention, see the document linked in the error message you posted: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html