I have to do large-scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I need some packages that are not installed on SageMaker instances by default, so I want to install them from .whl files.
Another hurdle is that the SageMaker role does not have internet access.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
bucket = sess.default_bucket()
region = boto3.session.Session().region_name
role = get_execution_role()

# Processor backed by the prebuilt scikit-learn container
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type='ml.t3.medium',
                                     instance_count=1)

# Runs script.py inside the container on every instance
sklearn_processor.run(code='script.py')
Attempted resolutions:
1. Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. Failed with the error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and no error message is thrown (a sketch of what this step is meant to do follows this list).
3. Looked into using the s3fs package, but it didn't seem suitable for copying the wheel files.
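For reference, this is roughly what the second attempt is trying to achieve, written as a minimal Python sketch (rather than bash) that could run at the top of script.py. The bucket name, prefix, and wheel location are placeholders; the only assumption is that the processing job's role can read the wheels from S3.

import subprocess
import sys
from pathlib import Path

import boto3

def install_wheels_from_s3(bucket="my-bucket", prefix="wheels/"):
    # Hypothetical bucket/prefix; adjust to wherever the .whl files are stored.
    wheel_dir = Path("/tmp/wheels")
    wheel_dir.mkdir(parents=True, exist_ok=True)

    s3 = boto3.client("s3")
    wheels = []
    # Download every .whl under the prefix (needs S3 access only, no internet)
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        key = obj["Key"]
        if key.endswith(".whl"):
            target = wheel_dir / Path(key).name
            s3.download_file(bucket, key, str(target))
            wheels.append(str(target))

    # Install offline: resolve any dependencies from the downloaded wheels only
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--no-index",
         "--find-links", str(wheel_dir)] + wheels
    )

install_wheels_from_s3()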
Alternatives
My client is hesitant to spin up containers from custom Docker images. Any alternatives?
2. Use a bash script to copy the packages from s3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
This approach seems sound.

You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script like install_and_run_my_python_code.sh that installs the wheel containing your Python dependencies and then runs your main Python entry point script.

Additionally, instead of downloading your code with AWS CLI calls in a bash script, you could use a ProcessingInput to download it. This is the same mechanism the SKLearnProcessor uses to distribute your entry point script.py across all the instances. A sketch combining both ideas follows.
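A minimal sketch of how that could be wired together, using a ScriptProcessor (which exposes command directly) instead of the SKLearnProcessor. The image URI, S3 paths, and the wrapper script install_and_run_my_python_code.sh are placeholders/assumptions, not a drop-in solution:

from sagemaker.processing import ScriptProcessor, ProcessingInput

# ScriptProcessor lets you set `command` explicitly; image_uri is a placeholder
# for the prebuilt scikit-learn container in your region.
processor = ScriptProcessor(
    image_uri="<sklearn-container-image-uri-for-your-region>",
    command=["/bin/bash"],          # run the entry point with bash instead of python
    role=role,
    sagemaker_session=sess,
    instance_type="ml.t3.medium",
    instance_count=1,
)

processor.run(
    code="install_and_run_my_python_code.sh",   # bash wrapper: install wheels, then run script.py
    inputs=[
        # SageMaker copies these S3 objects onto every instance before the job starts,
        # so no CLI calls or internet access are needed inside the container.
        ProcessingInput(source="s3://my-bucket/wheels/",
                        destination="/opt/ml/processing/input/wheels"),
        ProcessingInput(source="s3://my-bucket/code/script.py",
                        destination="/opt/ml/processing/input/code"),
    ],
)

Inside the wrapper, the bash script would typically run something like pip install --no-index --find-links /opt/ml/processing/input/wheels <your-package> and then python /opt/ml/processing/input/code/script.py.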