How to upload packages to an instance in a Processing step in SageMaker?


I have to do large-scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I need some packages that are not installed on SageMaker instances by default, so I want to install them from .whl files.

Another hurdle is that the SageMaker instances do not have internet access.

import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
bucket = sess.default_bucket()

region = boto3.session.Session().region_name
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type='ml.t3.medium',
                                     instance_count=1)

sklearn_processor.run(code='script.py')

Attempted resolutions:

  1. Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. Failed with the error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
  2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based on this post. But the packages never get copied, and no error message is thrown.
  3. I also looked into using the s3fs package, but it didn't seem suitable for copying the wheel files.

Alternatives

My client is hesitant to spin up containers from custom Docker images. Are there any alternatives?


There are 2 best solutions below

BEST ANSWER

2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based on this post. But the packages never get copied, and no error message is thrown.

This approach seems sound.

You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script such as install_and_run_my_python_code.sh that installs the wheel containing your Python dependencies and then runs your main Python entry point script.
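As a sketch of that idea, the bash entry point could be generated like this. The paths under /opt/ml/processing/ are where ProcessingInputs are typically mounted; the exact directory names and the script name are illustrative assumptions, not something prescribed by SageMaker:

```python
# Sketch: generate the bash entry point that installs the wheels staged
# onto the instance, then hands off to the real processing code.
entry_point = """#!/bin/bash
set -e
# install the wheels staged by a ProcessingInput at this destination
pip install /opt/ml/processing/input/wheels/*.whl
# then run the actual processing script
python3 /opt/ml/processing/input/code/script.py
"""

with open("install_and_run_my_python_code.sh", "w") as f:
    f.write(entry_point)
```

You would then pass this .sh file as the code= argument of a processor whose command is set to /bin/bash (for example a ScriptProcessor built on the same sklearn image), since the default command runs the code file with Python.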

Additionally, instead of downloading your code with AWS CLI calls in a bash script, you could use a ProcessingInput to download it; this is how the SKLearnProcessor distributes your entry point script.py code across all the instances.
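A minimal sketch of that, assuming the wheels have already been uploaded to an S3 prefix (the bucket name and paths below are illustrative, not from the question):

```python
from sagemaker.processing import ProcessingInput

# Stage wheel files from S3 onto every processing instance before the
# job's entry point runs. Bucket and destination path are assumptions.
wheel_input = ProcessingInput(
    source="s3://my-bucket/wheels/",                # prefix holding the .whl files
    destination="/opt/ml/processing/input/wheels",  # path inside each container
)

# sklearn_processor.run(code="script.py", inputs=[wheel_input])
```

The entry point can then pip install from /opt/ml/processing/input/wheels, and the download happens on every instance without any CLI calls in your script.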

SECOND ANSWER

There are a couple of other options. Here are all the options I could think of (some have already been mentioned):

  1. Use sagemaker.processing.*Processor.run(code=<bash script>) with a bash script that pulls from a repo: commit to a CodeCommit/GitHub/Bitbucket repo and have your run bash script do setup by cloning from that same repo.
  2. Hijack ProcessingInput: point a ProcessingInput to a local directory, or do pre-setup by uploading files to S3 and point the ProcessingInput at that same S3 location.
  3. Make your library into a package and push it to a local repository (Nexus, an EC2-hosted index, and CodeArtifact support this), then install it like any Python package from a bash run file.
  4. Hijack sagemaker.sklearn.estimator and use source_dir to specify your entire source directory.
  5. Compile a wheel from your package, push it to the container with a ProcessingInput, and install it.

None of these are great. I didn't do 1 because I didn't want to push the latest code each time just to test; I didn't want to build a package (3) or compile a wheel (5); and using an estimator to run a processing job seems wrong. I went with 2. In my case, pre-setup was necessary to get individual files in the right place (ProcessingInput doesn't support individual files, and I had a pyproject.toml in a higher directory), so the script that launches the job first orders and uploads the files to S3, and I specify that S3 URI in the ProcessingInput.
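That pre-setup might be sketched as follows. The bucket, prefix, and file names are invented for illustration; the actual upload call is commented out since it needs an AWS account:

```python
# Sketch of the pre-setup for option 2: order individual files into the
# layout the job expects, upload them to S3, then point the job's
# ProcessingInput at that prefix. All names here are assumptions.
bucket, prefix = "my-bucket", "processing-setup"

# files that must land in specific places, e.g. a pyproject.toml that
# sits above the source directory locally
manifest = [
    ("pyproject.toml", f"{prefix}/pyproject.toml"),
    ("src/mylib/core.py", f"{prefix}/mylib/core.py"),
]

def upload_manifest(s3_client):
    for local_path, key in manifest:
        s3_client.upload_file(local_path, bucket, key)

# import boto3; upload_manifest(boto3.client("s3"))
# then, in the job definition:
# ProcessingInput(source=f"s3://{bucket}/{prefix}",
#                 destination="/opt/ml/processing/input/setup")
```

Because the whole prefix is downloaded as one ProcessingInput, the files arrive on every instance in exactly the layout you uploaded, which is how individual files end up in the right place.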