Argument "--python-modules-installer-option" not working in pythonshell Glue Jobs

I am trying to have a setup similar to that of this article: https://aws.amazon.com/blogs/big-data/simplify-and-optimize-python-package-management-for-aws-glue-pyspark-jobs-with-aws-codeartifact/

I would like to install some packages using a custom --index-url <my-index-url>. To do this, I am following the AWS Glue job documentation here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html. According to that guide, I should add a parameter to the job like this:

--python-modules-installer-option with value --index-url <my-index-url>.

However, this argument is not picked up at all; the logs show no sign that it is ever used.

When I try to install something from my custom index, it fails, as the parameter is not picked up.

Even trying with a simple value like --upgrade does not work.

Other options such as --additional-python-modules do get picked up, but the module installation then goes through the default pip3 index set by the Python environment rather than the one I specified, so the job fails whenever the package I specify is only available in my custom index.

To reproduce this issue:

  • go to AWS Glue Jobs
  • create a new job with the "Python Shell script editor" option and select the boilerplate code (the code inside doesn't matter for reproducing this issue)
  • create and select an appropriate AWS Glue IAM role to run the job with
  • add any valid pip3 option as a job parameter, e.g. Key: "--python-modules-installer-option", Value: "<valid-pip3-option>" (a boto3 sketch of an equivalent job definition follows this list)
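
For reference, here is a minimal boto3 sketch of an equivalent job definition; the job name, role, script location, and index URL are placeholders, not values from an actual setup:

import boto3

glue = boto3.client("glue")

# Hypothetical names and paths, for illustration only.
glue.create_job(
    Name="my-pythonshell-job",
    Role="MyGlueJobRole",
    Command={
        "Name": "pythonshell",  # Python Shell job, not a Spark ETL job
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3.9",
    },
    DefaultArguments={
        # The parameter that is never picked up:
        "--python-modules-installer-option": "--index-url <my-index-url>",
        # This one is picked up, but pip still uses its default index:
        "--additional-python-modules": "mypackage",
    },
    MaxCapacity=0.0625,
)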

Thanks!


There are 2 answers below.

Answer 1:

That flag and the blog post are for Glue ETL jobs.
For Python Shell jobs, the value of --additional-python-modules is passed directly to pip, so you can specify your options inside that value (as if you were passing arguments on the pip command line).
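
For example, a single job parameter might look like this (package name and index URL are placeholders):

--additional-python-modules with value mypackage --index-url=<my-index-url>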

Answer 2:

After struggling to get this working for nearly two hours, I want to share what finally worked for me.

NOTE: This answer is specifically for jobs of type Python Shell in AWS Glue. Spark jobs require a slightly different approach.

The answer from @GonzaloHerreros is basically what is needed, but here is more detail:

Assumptions

  • you have already published a valid Python package (say, mypackage) to CodeArtifact
  • your CodeArtifact repository permissions allow the IAM role used by your Glue job to access and pull packages from the repo
  • your job can access the Internet (if not, the general approach described below will still work, but you'll need the additional step of setting up a VPC endpoint for CodeArtifact)

Setting it up

You use the --additional-python-modules job parameter to specify your package as well as the URL to your CodeArtifact repo. The URL must include a token needed for authentication.

The URL is of the form https://aws:<codeartifact token>@<codeartifact domain>-<account ID>.d.codeartifact.<aws region>.amazonaws.com/pypi/<repository name>/simple

You can get <codeartifact token> with this command from a shell that has the AWS CLI installed:

aws codeartifact get-authorization-token --domain-owner <account id> --domain <domain name> --query 'authorizationToken' --output text
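
If you prefer to fetch the token from Python instead of the CLI, a minimal boto3 sketch looks like this; the domain, account ID, region, and repository name are placeholders:

import boto3

codeartifact = boto3.client("codeartifact")

# Hypothetical domain/account values, for illustration only.
token = codeartifact.get_authorization_token(
    domain="mydomain",
    domainOwner="999999999999",
)["authorizationToken"]

index_url = (
    "https://aws:" + token
    + "@mydomain-999999999999.d.codeartifact.us-east-1.amazonaws.com"
    + "/pypi/myrepo/simple"
)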

So, in the end, the value of --additional-python-modules will look something like mypackage --index-url=https://aws:eyJ2ZXIiOjEsImlzdSI6MTcw...@mydomain-999999999999.d.codeartifact.us-east-1.amazonaws.com/pypi/myrepo/simple

Gotchas and pain points

  • The token is really long, and if you're using the AWS Console the UI will tell you (in red lettering) that the parameter can only be 256 characters. Just ignore this and save; the field accepts the entire value.
  • Enter the value of the job parameter exactly as described above: no quotes, and no trailing / after simple.
  • The token is valid for just 12 hours, which is a real pain. You could use the Step Functions approach described in the article linked in the question to refresh the token before running the job, but honestly that's a bit of a pain as well; a lighter-weight alternative is sketched below.
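
Since arguments passed to start_job_run override the job's default arguments, one lighter-weight alternative (a sketch, assuming you launch runs programmatically; the job, domain, and repo names are placeholders) is to fetch a fresh token just before each run instead of storing one in the job definition:

import boto3

glue = boto3.client("glue")
codeartifact = boto3.client("codeartifact")

# Fetch a fresh token (valid for 12 hours by default) right before the run.
token = codeartifact.get_authorization_token(
    domain="mydomain",
    domainOwner="999999999999",
)["authorizationToken"]

index_url = (
    f"https://aws:{token}@mydomain-999999999999"
    ".d.codeartifact.us-east-1.amazonaws.com/pypi/myrepo/simple"
)

# Run arguments override DefaultArguments for this run only.
glue.start_job_run(
    JobName="my-pythonshell-job",
    Arguments={
        "--additional-python-modules": f"mypackage --index-url={index_url}",
    },
)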