How to split a larger-than-disk file using Azure ML?


Using Azure ML components and pipelines: how can I split a larger-than-disk (PGN) file into shards and save the output files to a designated uri_folder on blob storage? Feel free to share any best practices for achieving this.

I set up a component and a pipeline with the following yml configuration files:

Component

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: split_file_to_shards
display_name: Split file to shards
version: 0.0.9
type: command

inputs:
  input_data_file:
    type: uri_file
    mode: ro_mount

outputs:
  output_data_dir:
    type: uri_folder
    mode: rw_mount

environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest

code: ./
command: >-
  split -u -n r/100 --verbose ${{inputs.input_data_file}} ${{outputs.output_data_dir}}
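
Python alternative (sketch)

For reference, the behaviour I am after could equally come from a small Python script instead of coreutils split: read the mounted file sequentially and write each shard straight into the mounted output folder, without staging anything on local disk. This is only a rough sketch mirroring split -n r/100; the shard naming and count are my own placeholders:

import os
import sys

def split_to_shards(input_path: str, output_dir: str, n_shards: int = 100) -> None:
    """Distribute input lines round-robin across n_shards files, streaming so the
    whole file is never held in memory or staged on local disk."""
    os.makedirs(output_dir, exist_ok=True)
    shards = [
        open(os.path.join(output_dir, f"shard_{i:03d}"), "w", encoding="utf-8")
        for i in range(n_shards)
    ]
    try:
        with open(input_path, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                shards[i % n_shards].write(line)
    finally:
        for shard in shards:
            shard.close()

if __name__ == "__main__":
    split_to_shards(sys.argv[1], sys.argv[2])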

Pipeline

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: sample-experiment

compute: azureml:vm-cluster-cpu

inputs:
  input_data_file:
    type: uri_file
    path: azureml:larger-than-disk-file@latest

outputs:
  output_data_dir:
    type: uri_folder
    path: azureml://datastores/<blob_storage_name>/paths/<path_to_folder>/

jobs:
  split_pgn_to_shards:
    type: command
    component: azureml:split_file_to_shards@latest
    inputs:
      input_data_file: ${{parent.inputs.input_data_file}}
    outputs:
      output_data_dir: ${{parent.outputs.output_data_dir}}

Run commands

> az ml component create -f component.yml
> az ml job create -f pipeline.yml

I expect Azure ML to mount the input file as ro_mount and write the processed files to the rw_mount output. My understanding is that the alternative modes, download and upload, actively copy the input file to the VM's local disk before processing and upload the output files to storage afterwards, respectively, which is not what I want.
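
For completeness, here is how I understand the same input/output mode declarations when expressed with the Python SDK v2 (azure-ai-ml); the paths are the same placeholders as in the YAML above:

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Input: stream the PGN file from blob storage via a read-only mount,
# rather than downloading it to the node's local disk first.
job_input = Input(
    type=AssetTypes.URI_FILE,
    path="azureml:larger-than-disk-file@latest",
    mode=InputOutputModes.RO_MOUNT,
)

# Output: write the shards directly back to the datastore via a read/write mount,
# rather than collecting them locally and uploading at the end.
job_output = Output(
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/<blob_storage_name>/paths/<path_to_folder>/",
    mode=InputOutputModes.RW_MOUNT,
)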

The -u argument to split enables unbuffered writes to the output files.

From monitoring the Network I/O metrics I unexpectedly see the file being downloaded to disk. In addition, the component fails with the following error:

Disk full while running job. Please consider reducing amount of data accessed, or upgrading VM SKU.
Total space: 6958 MB, available space: 1243 MB (under AZ_BATCH_NODE_ROOT_DIR).