I am currently trying to use the AutoMLStep to train a machine learning model, register it in the workspace, and use it for inference as a deserialized model. My current project folder/file structure is the following:
project/
│
├── src/
│   ├── data_prep.py
│   ├── register_model.py
│
├── pipeline.py
I am mostly basing my work on https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-use-automlstep-in-pipelines. In the pipeline.py script, I create the pipeline's PythonScriptStep objects (two in this case) as well as the AutoMLStep. The AutoMLStep is defined as follows:
train_step = AutoMLStep(name='AutoML_Classification',
                        automl_config=automl_config,
                        passthru_automl_config=False,
                        outputs=[metrics_data, model_data],
                        allow_reuse=True)
Where:
metrics_data = PipelineData(name='metrics_data',
                            datastore=blobstore,
                            pipeline_output_name=metrics_output_name,
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=blobstore,
                          pipeline_output_name='best_model_ticketing',
                          training_output=TrainingOutput(type='Model'))
For the register_model.py script, which is the last step in my pipeline sequence, I want to register the model, and use it to make predictions. I've tried the following:
from azureml.core.model import Model, Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse
import os
import pickle
from azureml.pipeline.core import PipelineRun
from azureml.pipeline.steps.automl_step import AutoMLStepRun

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

run = Run.get_context()
ws = Workspace.from_config() if type(run) == _OfflineRun else run.experiment.workspace

pipeline_run_id = run.parent.id
pipeline_run = PipelineRun(experiment=run.experiment, run_id=pipeline_run_id)  # The pipeline run that orchestrates the overall pipeline

best_model_output = pipeline_run.get_pipeline_output('best_model_ticketing')
num_file_downloaded = best_model_output.download('.', show_progress=True)
model_filename = best_model_output._path_on_datastore

with open(model_filename, "rb") as f:
    best_model = pickle.load(f)

file_name = f"../outputs/model/{args.model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
# pickle.dump takes the object and an open binary file, not value=/filename= keywords
with open(file_name, "wb") as f:
    pickle.dump(best_model, f)
print("Pickling of model complete")

# Register model in AzureML
model = Model.register(model_path=file_name,
                       model_name=args.model_name,
                       description="Model, with Hyperparameters Tuned",
                       workspace=ws)
Which leads to
Traceback (most recent call last):
  File "src/register_model.py", line 26, in <module>
    best_model = pickle.load(f)
EOFError: Ran out of input
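For context on the traceback: pickle.load raises EOFError when the file it is given contains no bytes at all. Whether that is what happens in this pipeline is an assumption, but it is cheap to check before unpickling. The sketch below reproduces the error with an empty temporary file and shows a guarded loader; load_pickle_checked is a hypothetical helper name, not part of the Azure ML SDK.

```python
import os
import pickle
import tempfile


def load_pickle_checked(path):
    """Unpickle `path`, failing with a descriptive error if it is a
    directory or a zero-byte file (hypothetical helper for debugging)."""
    if os.path.isdir(path):
        raise IsADirectoryError(f"{path!r} is a directory, not a pickle file")
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path!r} is empty (0 bytes); pickle.load would raise EOFError")
    with open(path, "rb") as f:
        return pickle.load(f)


# Reproduce the EOFError from the traceback using an empty file.
reproduced_eof = False
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "model_data")
    open(empty_path, "wb").close()  # create a zero-byte file
    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)
    except EOFError:
        reproduced_eof = True
```

Printing os.path.getsize(model_filename) in register_model.py just before the pickle.load call would show whether the downloaded file is actually empty.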
Ideally, to integrate this with my current project script, I'd like to use a similar approach to this:
# Begin pickling the model
# non AutoML training done prior to this to create best_xgb_model in same script
print("Begin pickling the model")
model_name = args.registered_model_name

# save model in ./model
print("Exporting model as a .pkl")
import os
file_name = f"../outputs/model/{model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
joblib.dump(value=best_xgb_model, filename=file_name)
print("Pickling of model complete")

# Register model in AzureML
print("Registering Model with AzureML")
model = Model.register(
    model_path=file_name,
    model_name=model_name,
    description="Model, with Hyperparameters Tuned",
    workspace=ws
)
Which allows the model to be used this way:
model_path = Model.get_model_path(model_name = args.registered_model_name, _workspace=ws) # get path of *latest* model
# Deserialize the model file back into xgb model
best_xgb_model = joblib.load(model_path)
The bottom line is: how can I retrieve the best fitted model from the AutoMLStep in the following step (register_model.py) in such a way that I can serialize it with joblib.dump, register it, and load it for predictions? I've tried registering the model directly, but that doesn't save the model as a .pkl file, and I wasn't able to use it for inference via get_model_path.
Help would be greatly appreciated.
AutoML trains several candidate models and picks the best one based on the metrics, so retrieving its pickle file works differently from loading a single model you pickled yourself. Using the run ID, download the full set of metrics and model files from the datastore, then use the path of the downloaded files to load the model for inference.
First, download the metrics and the model.
Then use the downloaded files for inference.
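The two steps above can be sketched roughly as follows, for the part that happens after the download. This is a minimal sketch under stated assumptions, not the SDK's own method: find_model_file and stage_model are hypothetical helper names, the download is assumed to have put a single pickled model file somewhere under a local directory, and the standard-library pickle module stands in for joblib so the sketch has no third-party dependency.

```python
import os
import pickle


def find_model_file(root):
    """Return the path of a non-empty file at or under `root`, or None.
    If several files exist, the first one encountered is returned."""
    if os.path.isfile(root):
        return root if os.path.getsize(root) > 0 else None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            candidate = os.path.join(dirpath, name)
            if os.path.getsize(candidate) > 0:
                return candidate
    return None


def stage_model(source_root, model_name, output_dir):
    """Load the pickled model found under `source_root` and re-save it as
    <output_dir>/<model_name>.pkl. Returns the path of the new file."""
    source = find_model_file(source_root)
    if source is None:
        raise FileNotFoundError(f"no non-empty model file under {source_root!r}")
    with open(source, "rb") as f:
        model = pickle.load(f)
    os.makedirs(output_dir, exist_ok=True)
    target = os.path.join(output_dir, f"{model_name}.pkl")
    with open(target, "wb") as f:
        pickle.dump(model, f)
    return target
```

In register_model.py, source_root would be the directory passed to download (or the step's --model_path argument, if the model output is wired in as a step input), and the returned path would be passed as model_path to Model.register, as in the question's own registration code.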