Out-of-memory problem submitting a TensorFlow 2 training job on Google AI Platform


I'm trying to submit a TensorFlow 2 training job (fine-tuning an object detection model) with gcloud on Google AI Platform. My dataset is not big (the raccoon dataset, roughly 10 MB), but I've tried many configurations and each time I get the same error:

The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL)

My command:

gcloud ai-platform jobs submit training OD_ssd_fpn_large \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-east1 \
--config cloud.yml \
--  \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

My last attempt used a cloud.yml file with large_model machine types:

trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 5
  workerType: large_model
  parameterServerCount: 3
  parameterServerType: large_model

but I always get the same error. Any hint or help is greatly appreciated.

1 Answer

Reading all the data into memory consumes RAM, which is why you are running out of memory. You need a larger machine type (e.g. large_model or complex_model_l; see the AI Platform machine types documentation for details). For example:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1

Or you need to reduce the size of your dataset, for example by training on a subset of your TFRecord examples (see the sketch below).
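If you want to try a smaller dataset first, one option is to copy only the first N examples from your existing TFRecord into a new, smaller file and point the pipeline config's input path at it. A minimal sketch, assuming a single-file TFRecord dataset; the paths and example count below are placeholders, not your actual files:

import tensorflow as tf

SRC = "gs://my-bucket/data/train.record"        # assumed source TFRecord (placeholder path)
DST = "gs://my-bucket/data/train_small.record"  # assumed smaller output (placeholder path)
KEEP = 100                                       # number of examples to keep (adjust as needed)

# Read the serialized examples and write only the first KEEP of them.
dataset = tf.data.TFRecordDataset(SRC).take(KEEP)
with tf.io.TFRecordWriter(DST) as writer:
    for raw_example in dataset:
        writer.write(raw_example.numpy())

After writing the smaller file, update the tf_record_input_reader input_path in your pipeline config to point at it before resubmitting the job.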