Azure ML Pipeline: Specify input-path on compute


I have a Python script that should be executed as an Azure ML pipeline step. The script expects to find several file sets in a certain tree structure, e.g.:

data/
├─ project_A/
│  ├─ data.csv
│  ├─ config.toml
├─ project_B/
│  ├─ data.csv
│  ├─ config.toml
├─ project_.../
│  ├─ ...

The script receives the base path, e.g. ./data/, as a command-line argument and walks the subdirectories. Each file set, e.g. project_A, is made available as a URI_FOLDER Azure ML data asset. I am using the Azure ML Python SDK v2, and the component definition looks like this:

prep = command(
    name="preprocessing",
    code="./src/preprocessing.py",
    # actual paths passed at pipeline level
    inputs=dict(
        project_A=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT),  
        project_B=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT), 
        project_...),
    command="python preprocessing.py <base_dir>",
)

I would like to know how to ensure a certain tree structure on the compute so that I can pass <base_dir> to the script.

Answer by JayashankarGS:

Whatever you give in inputs is passed as command-line arguments to the Python file. Here is a sample:

command_job = command(
    code="./src",
    command="python main.py --iris-csv ${{inputs.iris_csv}} --learning-rate ${{inputs.learning_rate}} --boosting ${{inputs.boosting}}",
    environment="AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu@latest",
    inputs={
        "iris_csv": Input(
            type="uri_file",
            path="https://azuremlexamples.blob.core.windows.net/datasets/iris.csv",
        ),
        "learning_rate": 0.9,
        "boosting": "gbdt",
    },
    compute="cpu-cluster",
)

If you observe iris_csv, learning_rate, and boosting, they are declared in the inputs parameter and then referenced in the command as ${{inputs.iris_csv}}, ${{inputs.learning_rate}}, and ${{inputs.boosting}}. It does not work the other way around: you cannot define arguments in the command and then consume them from the inputs parameter.
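
For completeness, the receiving main.py might read those arguments with a standard parser. A minimal sketch, assuming argparse; the argument names mirror the command string above, everything else is illustrative:

# main.py (sketch, not part of the original sample)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--iris-csv", type=str)        # Azure ML substitutes the mount/download path here
parser.add_argument("--learning-rate", type=float)
parser.add_argument("--boosting", type=str)
args = parser.parse_args()

print(args.iris_csv, args.learning_rate, args.boosting)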

In your case, if you are passing only base_path, declare it in the inputs parameter as a uri_folder and reference it in the command. Then build the paths to project_A, project_B, etc. inside your Python file, like below:

project_A_path = os.path.join(base_path, "project_A")
project_B_path = os.path.join(base_path, "project_B")

Command definition:

prep = command(
    name="preprocessing",
    code="./src/preprocessing.py",
    # actual paths passed at pipeline level
    inputs=dict(
        base_path=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT)
        ),
    command="python preprocessing.py --base-path ${{inputs.base_path}}",
)
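
A minimal sketch of the matching preprocessing.py, assuming argparse and the data.csv/config.toml layout from the question; the loop body is illustrative:

# preprocessing.py (sketch)
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--base-path", type=str, required=True)
args = parser.parse_args()

# Each direct subdirectory of the mounted folder is one project file set.
for name in sorted(os.listdir(args.base_path)):
    project_dir = os.path.join(args.base_path, name)
    if os.path.isdir(project_dir):
        data_csv = os.path.join(project_dir, "data.csv")
        config_toml = os.path.join(project_dir, "config.toml")
        print(f"found {name}: {data_csv}, {config_toml}")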

Or, if there are only a few project folders, you can pass each project folder directly along with the base path:

prep = command(
    name="preprocessing",
    code="./src/preprocessing.py",
    # actual paths passed at pipeline level
    inputs=dict(
        base_path=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT),
        project_A=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT),  
        project_B=Input(type=AssetTypes.URI_FOLDER, mode=InputOutputModes.RO_MOUNT), 
        project_...),
    command="python preprocessing.py --base-path ${{inputs.base_path}} --project_a_path ${{inputs.project_A}} --project_b_path ${{inputs.project_B}}",
)
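
In that case each URI_FOLDER input is mounted at its own, independent location on the compute, so the script receives them as separate arguments rather than as siblings under one directory. A sketch of the receiving side, with argument names mirroring the command string and the rest assumed:

# preprocessing.py (sketch for the multi-input variant)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--base-path", type=str)
parser.add_argument("--project_a_path", type=str)  # independent mount point
parser.add_argument("--project_b_path", type=str)  # independent mount point
args = parser.parse_args()

# Note: these mounts are not guaranteed to sit under base_path.
print(args.project_a_path, args.project_b_path)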

I would still recommend passing only the base path in the command arguments and constructing the required project paths inside the Python file.

Refer to this notebook for more about the command job.