TFX Pipeline stopped working due to Dataflow job workers getting stuck on startup

I have a TFX Pipeline running in GCP AI Platform Pipelines (managed Kubeflow). It was running fine for some time but suddenly stopped working properly during the BigQuery ExampleGen step.

The BigQuery ExampleGen uses Dataflow to read the data from BigQuery and save it as TFRecords. The Dataflow job starts but does nothing: it is stuck while starting/preparing a worker.
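For context, a Dataflow-backed ExampleGen is usually driven by a list of Apache Beam pipeline flags. The following is a minimal sketch of assembling such flags; the helper name `dataflow_beam_args` and the project, region, and bucket values are hypothetical and not taken from the pipeline described here.

```python
from typing import List


def dataflow_beam_args(project: str, region: str, bucket: str) -> List[str]:
    """Build Beam flags that send a component's Beam work to Dataflow."""
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/tmp",
    ]


# Hypothetical values, for illustration only.
args = dataflow_beam_args("my-gcp-project", "us-central1", "my-bucket")
print(args)
```

In TFX 0.26 a list like this is typically passed as `beam_pipeline_args` when the pipeline object is constructed; the exact wiring depends on the template in use.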

The worker log shows that Python dependencies are being installed with pip. The problem is that pip keeps downloading different versions of the same package in order to resolve dependency conflicts, but it does not show what the conflict is. I connected to the worker VM while it was starting, and it showed pip running constantly and consuming 100% CPU. It had still not finished after more than an hour, at which point I stopped the job.

TFX version: 0.26.3 (tried 0.26.4 with the same result)

Apache Beam SDK: 2.28 (tried 2.29 with the same result)

I have even tried doing pip install of TFX 0.26.3 in an Apache Beam docker image (the same one used by Dataflow workers) and it was also stuck trying to install it.
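A reproduction like this can be made repeatable by running the install under a time limit, so a stuck resolver is reported instead of hanging the shell. This is only a sketch: the helper names `run_with_timeout` and `pip_install_cmd` are hypothetical, and the 15-minute budget in the comment is an arbitrary assumption.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[int]:
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


def pip_install_cmd(requirement: str) -> List[str]:
    """Command that installs one requirement with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", requirement]


# Example call (not executed here, since it would install packages):
# code = run_with_timeout(pip_install_cmd("tfx==0.26.3"), timeout_s=900)
# print("timed out" if code is None else f"pip exited with {code}")
```

Run inside the same Beam image, a `None` result would correspond to the behaviour described above, where pip keeps resolving without finishing.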

I have tried installing TFX 0.30.0 in the Apache Beam docker image, and it installed fine, but I cannot use TFX 0.30 in my AI Platform Pipeline as it seems only TFX 0.26 is supported.

Has anyone else experienced the same issue and perhaps resolved it?

There is 1 best solution below

I finally resolved the issue by setting the TFX container version to 0.26.1 instead of 0.26.3, which was the default from the TFX template.
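As a rough illustration of the pin, the sketch below builds an image reference with an exact version tag. It assumes the container is the public `tensorflow/tfx` image; the helper `pinned_tfx_image` is hypothetical, and where the tag is actually set (for example the template's Dockerfile base image or the `tfx_image` argument of `KubeflowDagRunnerConfig`) depends on how the template was generated.

```python
import re

TFX_IMAGE_REPO = "tensorflow/tfx"  # assumed: the public TFX image repository


def pinned_tfx_image(version: str) -> str:
    """Return an image reference pinned to an exact TFX release."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"expected an exact X.Y.Z version, got {version!r}")
    return f"{TFX_IMAGE_REPO}:{version}"


tfx_image = pinned_tfx_image("0.26.1")
print(tfx_image)  # tensorflow/tfx:0.26.1
```

Rejecting anything other than an exact `X.Y.Z` tag keeps the pipeline from silently moving to a different image when a floating tag is updated.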