ModuleNotFoundError when running a GCP Dataflow pipeline with Python

I'm trying to install dependencies in a Dataflow pipeline. First I used the requirements_file flag, but I get (ModuleNotFoundError: No module named 'unidecode' [while running 'Map(wordcleanfn)-ptransform-54']); the only package added is unidecode (the command for that first attempt is sketched further below). As a second option I configured a Docker image following the Google documentation:

FROM apache/beam_python3.10_sdk:2.52.0

ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1

RUN pip install unidecode

RUN apt-get update && apt-get install -y

ENTRYPOINT ["/opt/apache/beam/boot"]
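
For reference, the build and push step was roughly the following (a sketch of the commands, reconstructed; the tag is the same one I pass to --sdk_container_image later):

gcloud auth configure-docker us-central1-docker.pkg.dev  # one-time auth for Artifact Registry
docker build -t us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 .
docker push us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1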

The image was built on the GCP project VM and pushed to Artifact Registry. Then I generated the pipeline template with:

python -m mytestcode \
    --project myprojectid \
    --region us-central1 \
    --temp_location gs://mybucket/beam_test/tmp/ \
    --runner DataflowRunner \
    --staging_location gs://mybucket/beam_test/stage_output/ \
    --template_name mytestcode_template \
    --customvariable 500 \
    --experiments use_runner_v2 \
    --sdk_container_image us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 \
    --sdk_location container
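
For completeness, the first attempt (the one using the requirements_file flag instead of a custom container) was launched roughly like this (a sketch; requirements.txt contains the single line "unidecode"):

python -m mytestcode \
    --project myprojectid \
    --region us-central1 \
    --temp_location gs://mybucket/beam_test/tmp/ \
    --runner DataflowRunner \
    --staging_location gs://mybucket/beam_test/stage_output/ \
    --template_name mytestcode_template \
    --customvariable 500 \
    --requirements_file requirements.txt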

After all that I created the job from the template through the UI, but the error is the same. Can someone please help me? I understand that the workers are still using the default Beam SDK image; is that correct? How can I fix it?
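
For context, the transform that fails is along these lines (a simplified sketch, not my exact code; only the names wordcleanfn and unidecode come from the error message):

import apache_beam as beam
from unidecode import unidecode  # the import that the workers cannot resolve

def wordcleanfn(word):
    # strip accents/diacritics, e.g. "café" -> "cafe"
    return unidecode(word).lower().strip()

# ... and in the pipeline:
# cleaned = lines | beam.Map(wordcleanfn)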
