I am working on deploying a full ML pipeline with SageMaker and Airflow, and I would like to separate the training and processing parts of the pipeline.
I have a question concerning the SageMakerProcessingOperator
(source_code). This operator relies on the create_processing_job() function. When using this operator, I would like to extend the base Docker image used for processing so that it runs a home-made script. Currently, the processing works fine when I push my container to AWS ECR. However, I would prefer to use part of the script stored inside my packaged model (in tar.gz format).
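For context, here is roughly how the operator is wired into the DAG (a minimal sketch; job, image, and role names are hypothetical, and the import path assumes a recent amazon provider version):

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerProcessingOperator

# Hypothetical minimal config; the dict is forwarded as-is to create_processing_job().
processing_config = {
    "ProcessingJobName": "my-processing-job",
    "AppSpecification": {
        "ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/my-processing-image:latest",
    },
    "ProcessingResources": {
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    "RoleArn": "arn:aws:iam::<account>:role/SageMakerRole",
}

with DAG(dag_id="ml_pipeline", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    process = SageMakerProcessingOperator(
        task_id="process",
        config=processing_config,
        wait_for_completion=True,
    )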
For training and for registering the model, we can extend the base image by pointing the SAGEMAKER_SUBMIT_DIRECTORY environment variable at the packaged code and SAGEMAKER_PROGRAM at the entry point (cf. aws_doc). However, it looks like this is not possible with the SageMakerProcessingOperator.
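For comparison, this is the pattern aws_doc describes when registering a model: the framework containers read these two variables to fetch the code archive and run the entry point. A minimal sketch with hypothetical bucket, image, and role names:

import boto3

sm = boto3.client("sagemaker")

sm.create_model(
    ModelName="my-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/my-framework-image:latest",
        "ModelDataUrl": "s3://my-bucket/model/model.tar.gz",
        "Environment": {
            # The framework container downloads this archive and runs the entry point.
            "SAGEMAKER_SUBMIT_DIRECTORY": "s3://my-bucket/code/sourcedir.tar.gz",
            "SAGEMAKER_PROGRAM": "inference.py",
        },
    },
    ExecutionRoleArn="arn:aws:iam::<account>:role/SageMakerRole",
)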
Below is an extract of the config used in the operator, so far with no success.
"Environment": {
"sagemaker_enable_cloudwatch_metrics": "false",
"SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
"SAGEMAKER_REGION": f"{self.region_name}",
"SAGEMAKER_SUBMIT_DIRECTORY": f"{self.train_code_path}",
"SAGEMAKER_PROGRAM": f"{self.processing_entry_point}",
"sagemaker_job_name": f"{self.process_job_name}",
},
Did anyone manage to use these parameters with SageMaker's create_processing_job()? Or is extending the image limited to pushing a full container to AWS ECR?
SageMaker processing jobs and SageMaker training jobs are different services, so the underlying architecture is different and we cannot combine the two: nothing in a processing container reads SAGEMAKER_SUBMIT_DIRECTORY or SAGEMAKER_PROGRAM, so those environment variables are simply ignored there.
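That said, you can get script-mode-like behaviour in a processing job by staging the script as a ProcessingInput and overriding the container entrypoint; this is essentially what the SageMaker Python SDK's ScriptProcessor does. Below is a sketch of the relevant keys to merge into the operator's config (bucket, image, and script names are hypothetical; the remaining required keys such as ProcessingJobName, ProcessingResources, and RoleArn stay as in the question). Note that the script has to sit uncompressed in S3, since a processing job will not unpack a tar.gz archive for a File-mode input:

# Fragment to merge into the config dict passed to SageMakerProcessingOperator.
script_mode_config = {
    "ProcessingInputs": [
        {
            "InputName": "code",
            "S3Input": {
                # Prefix containing processing_entry_point.py (hypothetical name).
                "S3Uri": "s3://my-bucket/code/",
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "S3DataDistributionType": "FullyReplicated",
            },
        }
    ],
    "AppSpecification": {
        "ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/my-base-image:latest",
        # Override the image's default entrypoint to run the staged script.
        "ContainerEntrypoint": [
            "python3",
            "/opt/ml/processing/input/code/processing_entry_point.py",
        ],
    },
}

With this approach the base image stays generic in ECR, and the script itself lives in S3 next to your packaged model.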