Azure ML - Retrieve an AutoMLStep model and use for inference


I am currently trying to use the AutoMLStep to train a machine learning model, register it in the workspace, and use it for inference as a deserialized model. My current project folder/file structure is the following:

project/
│
├── src/
│   ├── data_prep.py
│   └── register_model.py
└── pipeline.py

I am mostly basing my work on https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-use-automlstep-in-pipelines. In the pipeline.py script, I create my pipeline's PythonScriptStep objects (2 in this case) as well as the AutoMLStep. The AutoMLStep is defined as follows:

train_step = AutoMLStep(name='AutoML_Classification',
    automl_config=automl_config,
    passthru_automl_config=False,
    outputs=[metrics_data, model_data],
    allow_reuse=True)

Where:

metrics_data = PipelineData(name='metrics_data',
                           datastore=blobstore,
                           pipeline_output_name=metrics_output_name,
                           training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                           datastore=blobstore,
                           pipeline_output_name='best_model_ticketing',
                           training_output=TrainingOutput(type='Model'))

For the register_model.py script, which is the last step in my pipeline sequence, I want to register the model, and use it to make predictions. I've tried the following:

from azureml.core.model import Model, Dataset
from azureml.core.run import Run, _OfflineRun
from azureml.core import Workspace
import argparse
import os
import pickle

from azureml.pipeline.core import PipelineRun
from azureml.pipeline.steps.automl_step import AutoMLStepRun

parser = argparse.ArgumentParser()
parser.add_argument("--model_name", required=True)
parser.add_argument("--model_path", required=True)
args = parser.parse_args()

run = Run.get_context()
ws = Workspace.from_config() if type(run) == _OfflineRun else run.experiment.workspace

pipeline_run_id = run.parent.id
pipeline_run = PipelineRun(experiment=run.experiment, run_id=pipeline_run_id)  # This is the Pipeline run, that orchestrates the overall pipeline
best_model_output = pipeline_run.get_pipeline_output('best_model_ticketing')
num_file_downloaded = best_model_output.download('.', show_progress=True)

model_filename = best_model_output._path_on_datastore
with open(model_filename, "rb" ) as f:
    best_model = pickle.load(f)

file_name = f"../outputs/model/{args.model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
pickle.dump(value = best_model, filename = file_name)
print("Pickeling of model complete")


# Register model in AzureML
model = Model.register(model_path = file_name,
                       model_name = args.model_name,
                        description = "Model, with Hyperparameters Tuned",
                        workspace = ws)

Which leads to

Traceback (most recent call last):
  File "src/register_model.py", line 26, in <module>
    best_model = pickle.load(f)
EOFError: Ran out of input

Ideally, to integrate this with my current project script, I'd like to use a similar approach to this:

# Begin pickling the model
# non AutoML training done prior to this to create best_xgb_model in same script
print("Begin pickling the model")
model_name = args.registered_model_name

# save model in ./model
print("Exporting model as a .pkl")

import os
file_name = f"../outputs/model/{model_name}.pkl"
os.makedirs(os.path.dirname(file_name), exist_ok=True)
joblib.dump(value = best_xgb_model, filename = file_name)
print("Pickeling of model complete")

# Register model in AzureML
print("Registering Model with AzureML")
model = Model.register(
                        model_path = file_name,
                        model_name = model_name,
                        description = "Model, with Hyperparameters Tuned",
                        workspace = ws
                    )

Which allows the model to be used this way:

model_path = Model.get_model_path(model_name = args.registered_model_name, _workspace=ws) # get path of *latest* model
# Deserialize the model file back into xgb model
best_xgb_model = joblib.load(model_path)

The bottom line is: how can I retrieve the best fitted model from the AutoMLStep in the following step (register_model.py), in such a way that I can call joblib.dump on it, register the model, and load it for predictions? I've tried registering the model directly, but that doesn't save the model as a .pkl file, and I wasn't able to use it for inference with get_model_path.

Help would be greatly appreciated.

1 Answer

Retrieving the best model from an AutoMLStep works differently from loading a pickle file that you wrote yourself. You need to download the full set of metrics and model files that the AutoML step wrote to the datastore, then use the local paths of the downloaded files for inference. Start by using the run ID to look up the run whose outputs you want to download.

# Run ID retrieved from the Azure Machine Learning web UI
run_id = 'runID'
experiment = ws.experiments['name']
run = next(run for run in experiment.get_runs() if run.id == run_id)


Download the metrics and model.

automl_run = next(r for r in run.get_children() if r.name == 'name of model')
outputs = automl_run.get_outputs()
metrics = outputs['metrics_AutoML_Classification']
model = outputs['model_AutoML_Classification']

metrics.get_port_data_reference().download('.')
model.get_port_data_reference().download('.')


import pandas as pd
import json

# Path of the downloaded metrics file, relative to the download directory
metrics_filename = metrics.get_port_data_reference()._path_on_datastore
with open(metrics_filename) as f:
    metrics_output_result = f.read()

deserialized_metrics_output = json.loads(metrics_output_result)
df = pd.DataFrame(deserialized_metrics_output)
df
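
If you only need to know which child run scored best, the deserialized metrics can also be scanned without pandas. This is a minimal sketch that assumes the JSON maps each child run ID to a dictionary of metric names, each holding a list of logged values; that layout, the helper name best_run_by_metric, and the sample run IDs are assumptions for illustration, so check them against your own downloaded file.

```python
import json


def best_run_by_metric(metrics, metric_name, higher_is_better=True):
    """Return (run_id, score) for the run with the best value of metric_name.

    `metrics` is assumed to look like {run_id: {metric_name: [value, ...]}}.
    """
    scores = {}
    for run_id, run_metrics in metrics.items():
        values = run_metrics.get(metric_name)
        if values is None:
            continue  # this run did not log the metric
        if isinstance(values, list):
            if not values:
                continue  # metric key present but nothing logged
            score = values[-1]  # use the most recently logged value
        else:
            score = values
        scores[run_id] = float(score)
    if not scores:
        raise ValueError(f"No run logged metric {metric_name!r}")
    pick = max if higher_is_better else min
    best_id = pick(scores, key=scores.get)
    return best_id, scores[best_id]


# Hypothetical sample in the assumed layout, standing in for the downloaded file
sample = json.loads(
    '{"run_0": {"AUC_weighted": [0.91]},'
    ' "run_1": {"AUC_weighted": [0.95]},'
    ' "run_2": {"accuracy": [0.99]}}'
)
best_id, best_score = best_run_by_metric(sample, "AUC_weighted")
```

With your own data, pass deserialized_metrics_output in place of sample and the name of your primary metric.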

Then use the downloaded files for inference.
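
For the model file itself, here is a rough sketch of the load and re-save step, assuming the download has already finished and you know the local path of the model file. The helper names load_pickled_model and save_model_as_pkl are made up for this example. An `EOFError: Ran out of input` from pickle.load means the file being read contains no data, so the loader checks the file size first and fails with a clearer message. Note also that pickle.dump expects an open file object, unlike joblib.dump, which accepts a filename.

```python
import os
import pickle


def load_pickled_model(path):
    """Unpickle the file at `path`, failing early if it is missing or empty."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    if os.path.getsize(path) == 0:
        # An empty file is what makes pickle.load raise "Ran out of input".
        raise ValueError(f"Model file is empty: {path}")
    with open(path, "rb") as f:
        return pickle.load(f)


def save_model_as_pkl(model, file_name):
    """Write `model` to `file_name` with pickle, creating parent folders."""
    os.makedirs(os.path.dirname(file_name) or ".", exist_ok=True)
    with open(file_name, "wb") as f:  # pickle.dump needs a file object
        pickle.dump(model, f)
    return file_name


# Intended use inside register_model.py (not executed here):
#   best_model = load_pickled_model(model_filename)
#   file_name = save_model_as_pkl(best_model, f"../outputs/model/{args.model_name}.pkl")
#   ...then pass file_name to Model.register as in the question.
```

Unpickling an AutoML model also needs the packages that define its classes to be installed in the environment where the step runs; if they are missing, pickle.load will fail with an import error rather than an EOFError.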