Why is a pod on a GKE cluster OOMKilled when trying to run a very simple Kubeflow pipeline using TFX?


I'm following the TFX on Cloud AI Platform Pipelines tutorial to implement a Kubeflow orchestrated pipeline on Google Cloud. The main difference is that I'm trying to implement an Object Detection solution instead of the Taxi application proposed by the tutorial.

For this reason I (locally) created a dataset of images labelled via labelImg, converted it to a .tfrecord using this script, and uploaded the result to a GCS bucket. I then followed the TFX tutorial, creating the GKE cluster (the default one, with this configuration) and the Jupyter notebook needed to run the code, importing the same template.

The main difference is in the first component of the pipeline, where I changed the CSVExampleGen component to an ImportExampleGen one:

def create_pipeline(
    pipeline_name: Text,
    pipeline_root: Text,
    data_path: Text,
    # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
    # query: Text,
    preprocessing_fn: Text,
    run_fn: Text,
    train_args: tfx.proto.TrainArgs,
    eval_args: tfx.proto.EvalArgs,
    eval_accuracy_threshold: float,
    serving_model_dir: Text,
    metadata_connection_config: Optional[
        metadata_store_pb2.ConnectionConfig] = None,
    beam_pipeline_args: Optional[List[Text]] = None,
    ai_platform_training_args: Optional[Dict[Text, Text]] = None,
    ai_platform_serving_args: Optional[Dict[Text, Any]] = None,
) -> tfx.dsl.Pipeline:
  """Implements the chicago taxi pipeline with TFX."""

  components = []

  # Brings data into the pipeline or otherwise joins/converts training data.
  example_gen = tfx.components.ImportExampleGen(input_base=data_path)
  # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
  # example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
  #     query=query)
  components.append(example_gen)

No other components are inserted in the pipeline and the data path points to the location of the folder on the bucket containing the .tfrecord:

DATA_PATH = 'gs://(project bucket)/(dataset folder)'

This is the runner code (basically identical to the one of the TFX tutorial):

def run():
  """Define a kubeflow pipeline."""

  # Metadata config. The defaults work with the installation of
  # KF Pipelines using Kubeflow. If installing KF Pipelines using the
  # lightweight deployment option, you may need to override the defaults.
  # If you use Kubeflow, metadata will be written to MySQL database inside
  # Kubeflow cluster.
  metadata_config = tfx.orchestration.experimental.get_default_kubeflow_metadata_config(
  )

  runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
      kubeflow_metadata_config=metadata_config,
      tfx_image=configs.PIPELINE_IMAGE)
  pod_labels = {
      'add-pod-env': 'true',
      tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: 'tfx-template'
  }
  tfx.orchestration.experimental.KubeflowDagRunner(
      config=runner_config, pod_labels_to_attach=pod_labels
  ).run(
      pipeline.create_pipeline(
          pipeline_name=configs.PIPELINE_NAME,
          pipeline_root=PIPELINE_ROOT,
          data_path=DATA_PATH,
          # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
          # query=configs.BIG_QUERY_QUERY,
          preprocessing_fn=configs.PREPROCESSING_FN,
          run_fn=configs.RUN_FN,
          train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
          eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
          eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
          serving_model_dir=SERVING_MODEL_DIR,
          # TODO(step 7): (Optional) Uncomment below to provide GCP-related
          #               config for BigQuery with Beam DirectRunner.
          # beam_pipeline_args=configs
          # .BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS,
          # TODO(step 8): (Optional) Uncomment below to use Dataflow.
          # beam_pipeline_args=configs.DATAFLOW_BEAM_PIPELINE_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
          # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
          # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
      ))


if __name__ == '__main__':
  logging.set_verbosity(logging.INFO)
  run()

The pipeline is then created and a run is invoked with the following code from the Notebook:

!tfx pipeline create  --pipeline-path=kubeflow_runner.py --endpoint={ENDPOINT} --build-image
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}

The problem is that, while the pipeline from the example runs without problems, this pipeline always fails, with the pod on the GKE cluster exiting with code 137 (OOMKilled).

This is a snapshot of the cluster workload status and this is a full log dump of the run that crashes.

I've already tried reducing the dataset size (it is now about 6 MB for the whole .tfrecord) and splitting it locally into two sets (training and validation), since the crash seems to happen when the component splits the dataset, but neither change helped.
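To rule out a corrupt file before blaming memory limits, the .tfrecord can be checked at the framing level in pure Python, with no TensorFlow dependency. This is only a sketch of a structural check: it walks the record framing and counts records, but does not verify the CRCs.

```python
import struct

def count_tfrecord_records(path):
    """Walk a TFRecord file's framing and count its records.

    Each record is stored as: 8-byte little-endian payload length,
    4-byte length CRC, the payload, 4-byte payload CRC. The CRCs
    are skipped, not verified, so this is only a structural check.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                return count  # clean end of file
            if len(header) != 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            f.seek(4, 1)  # skip the length CRC
            payload = f.read(length)
            if len(payload) != length:
                raise ValueError("truncated record payload")
            f.seek(4, 1)  # skip the payload CRC
            count += 1
```

If this raises on the uploaded file, or returns a count that does not match the number of labelled images, the problem is the data rather than the cluster.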

Do you have any idea why it goes out of memory, and what steps I could take to solve it?

Thank you very much.

1 Answer

If an application has a memory leak or tries to use more memory than its configured limit, Kubernetes terminates it with an "OOMKilled: Container limit reached" event and Exit Code 137.
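The 137 itself can be decoded: container exit codes above 128 mean "killed by signal (code − 128)", and signal 9 is SIGKILL, which is what the kernel's OOM killer sends. A quick check with the standard library:

```python
import signal

# Container exit codes above 128 encode "killed by signal N",
# where N = exit_code - 128.
exit_code = 137
sig = signal.Signals(exit_code - 128)
print(exit_code - 128, sig.name)  # 9 SIGKILL
```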

When you see a message like this, you have two choices: increase the pod's memory limit or start debugging. If, for example, your workload was experiencing an increase in load, raising the limit would make sense. On the other hand, if the memory use was sudden or unexpected, it may indicate a memory leak, and you should start debugging immediately.
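For the first option in this TFX setup, the memory limit of each step's pod can be raised through the runner config. This is a sketch under the assumption that you are on TFX 1.x, where `KubeflowDagRunnerConfig` accepts `pipeline_operator_funcs` that receive each step's KFP `ContainerOp`; the `'2G'`/`'4G'` values are arbitrary examples, not recommendations, and should be sized to your node pool:

```python
from tfx.orchestration.kubeflow import kubeflow_dag_runner

def _raise_memory_limit(container_op):
  # Hypothetical values: pick a request/limit that fits your GKE nodes.
  container_op.set_memory_request('2G')
  container_op.set_memory_limit('4G')

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    kubeflow_metadata_config=metadata_config,
    tfx_image=configs.PIPELINE_IMAGE,
    # Keep the default operator funcs (pod labels etc.) and append ours.
    pipeline_operator_funcs=(
        kubeflow_dag_runner.get_default_pipeline_operator_funcs()
        + [_raise_memory_limit]),
)
```

This `runner_config` would replace the one built in `kubeflow_runner.py`; if the node itself has too little allocatable memory, you would also need a larger machine type for the GKE node pool.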

Remember, Kubernetes killing a pod like that is a good thing: it protects the other pods running on the same node.

Also refer to the similar issues link1 and link2; hope it helps. Thanks.