Out-of-memory problem submitting a TensorFlow 2 training job on Google AI Platform


I'm trying to submit a TensorFlow 2 training job (fine-tuning an object detection model) with gcloud on Google AI Platform. My dataset is not big (the raccoon dataset, roughly 10 MB), but I've tried many configurations and each time I get the same error:

The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL)

My command:

gcloud ai-platform jobs submit training OD_ssd_fpn_large \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-east1 \
--config cloud.yml \
--  \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

My last attempt used a cloud.yml file with large_model machine types:

trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 5
  workerType: large_model
  parameterServerCount: 3
  parameterServerType: large_model

but I always get the same error. Any hint or help is greatly appreciated.

1 Answer

Reading all the data into memory consumes RAM, which is why you are running out of memory. You need a larger machine type (e.g. large_model or complex_model_l; see the AI Platform machine types documentation for details). For example:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1

Or you need to reduce the size of your dataset, for example by training on a subset of your TFRecord examples (see the sketch below).
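If you want to try a smaller dataset first, one option is to copy only the first N examples from your existing TFRecord into a new, smaller file and point the pipeline config's input path at it. A minimal sketch, assuming a single-file TFRecord dataset; the paths and example count below are placeholders, not your actual files:

import tensorflow as tf

SRC = "gs://my-bucket/data/train.record"        # assumed source TFRecord (placeholder path)
DST = "gs://my-bucket/data/train_small.record"  # assumed smaller output (placeholder path)
KEEP = 100                                       # number of examples to keep (adjust as needed)

# Read the serialized examples and write only the first KEEP of them.
dataset = tf.data.TFRecordDataset(SRC).take(KEEP)
with tf.io.TFRecordWriter(DST) as writer:
    for raw_example in dataset:
        writer.write(raw_example.numpy())

After writing the smaller file, update the tf_record_input_reader input_path in your pipeline config to point at it before resubmitting the job.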