Access df_loaded and/or run_id in Load Data section of best trial notebook of Databricks AutoML run


The block of code below is part of the best trial notebook that is auto-generated by executing a Databricks AutoML run.

import mlflow
import os
import uuid
import shutil
import pandas as pd

# Create temp directory to download input data from MLflow
input_temp_dir = os.path.join(os.environ["SPARK_LOCAL_DIRS"], "tmp", str(uuid.uuid4())[:8])
os.makedirs(input_temp_dir)


# Download the artifact and read it into a pandas DataFrame
input_data_path = mlflow.artifacts.download_artifacts(run_id="e2a4a93aafb24aa9956e83f6b7ab3e28", artifact_path="data", dst_path=input_temp_dir)

df_loaded = pd.read_parquet(os.path.join(input_data_path, "training_data"))
# Delete the temp data
shutil.rmtree(input_temp_dir)

# Preview data
df_loaded.head(5)
  1. Can I get the run_id in the code block above, e2a4a93aafb24aa9956e83f6b7ab3e28, from the AutoMLSummary returned by automl.regress? summary.best_trial.mlflow_run_id returns a different value. Which run does this run_id refer to, and how do I retrieve it?

  2. Aside from the code block above, is there a way to get the dataset that is loaded into df_loaded? It is essentially the input dataset that I passed to automl.regress, with an added column indicating whether each row belongs to the training, validation, or test subset.

I am fairly new to Databricks AutoML, so I am not sure what the best way to do this is.

Thanks ahead of time.

As mentioned, I tried taking the run_id from summary.best_trial.mlflow_run_id, but the values do not match. I have also read the documentation for automl and mlflow, without success.
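One way to investigate which run holds the "data" artifact is to scan every run in the AutoML experiment and check its top-level artifacts. The following is a minimal sketch rather than a confirmed answer: the helper names find_runs_with_artifact and find_automl_data_runs are hypothetical, and it assumes the data artifact is logged to some run inside the same experiment as the trials, which I have not verified. It uses only mlflow.search_runs and MlflowClient.list_artifacts.

```python
from typing import Callable, Iterable, List


def find_runs_with_artifact(
    run_ids: Iterable[str],
    list_artifact_paths: Callable[[str], List[str]],
    artifact_name: str = "data",
) -> List[str]:
    """Return the run ids whose top-level artifact paths include artifact_name."""
    return [rid for rid in run_ids if artifact_name in list_artifact_paths(rid)]


def find_automl_data_runs(experiment_id: str, artifact_name: str = "data") -> List[str]:
    """Hypothetical helper: look for runs in one experiment that logged artifact_name.

    Assumption: the input data is logged to a run in the same experiment
    as the AutoML trials.
    """
    # Imported lazily so the pure helper above works without mlflow installed.
    import mlflow
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    # By default search_runs returns a pandas DataFrame with a "run_id" column.
    runs = mlflow.search_runs(experiment_ids=[experiment_id])
    return find_runs_with_artifact(
        runs["run_id"].tolist(),
        lambda rid: [f.path for f in client.list_artifacts(rid)],
        artifact_name,
    )
```

If summary is the AutoMLSummary from automl.regress, calling find_automl_data_runs(summary.experiment.experiment_id) should list candidate run ids to compare against the one hard-coded in the notebook. Any id it returns can then be passed to mlflow.artifacts.download_artifacts exactly as in the generated code.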
