Create a dataset in Azure ML with a dynamic substring in the file name


Is it possible to create a dataset in AzureML with a dynamic substring?

I have this:

data_paths = f'/raw/folder_files/data_a01923das-djed.parquet'
x = Dataset.File.from_files(path = [(adls_storage, data_paths)])

The above works, but every so often another file is placed there whose name starts the same way while the rest (the date part) changes. Something like this:

data_paths = f'/raw/folder_files/data_bjdidoe-9323.parquet'
x = Dataset.File.from_files(path = [(adls_storage, data_paths)])

The point is, this is not predictable.

Is there a way to read it with some kind of regular expression or wildcard, for example:

data_paths = f'/raw/folder_files/data_*.parquet'
x = Dataset.File.from_files(path = [(adls_storage, data_paths)])

So that I can always access the file regardless of this substring?

The name always starts with "data_", the rest changes.

There is 1 solution below


One possible solution is to use AzureMachineLearningFileSystem (from the azureml-fsspec package) to list the files on the datastore and shortlist the required ones with a glob pattern. Note that glob patterns use wildcards such as *, not regular expressions. Below is sample code for a datastore:

import pandas as pd
from azureml.fsspec import AzureMachineLearningFileSystem

# Fill in the details of your workspace and datastore
subscription_id = ''
resource_group = ''
workspace_name = ''
input_datastore_name = ''
target_datastore_name = 'tds'
path_on_datastore = 'folder'

# Long-form URI that points at the folder on the datastore
uri = f'azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{input_datastore_name}/paths/{path_on_datastore}'

fs = AzureMachineLearningFileSystem(uri)

# glob() needs a pattern; it returns the paths of the files that match it
f_list = fs.glob("folder/data_*.parquet")

With the above code snippet you get the shortlisted data files; you can then create a dataset for each file.
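To make the matching step concrete without needing a workspace, here is a minimal sketch that applies the same wildcard to a plain list of paths using Python's standard fnmatch module. The helper name select_data_files and the sample listing are made up for illustration; in practice the list would come from the file system call above.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def select_data_files(paths, pattern="data_*.parquet"):
    """Return, sorted, the paths whose file name matches the wildcard pattern."""
    return sorted(p for p in paths if fnmatchcase(basename(p), pattern))


# Hypothetical listing, standing in for what a datastore listing might return.
listing = [
    "raw/folder_files/data_a01923das-djed.parquet",
    "raw/folder_files/data_bjdidoe-9323.parquet",
    "raw/folder_files/readme.txt",
]

matches = select_data_files(listing)
print(matches)
```

Each matched path could then be used in place of the hard-coded data_paths value from the question, for example as the second element of the (adls_storage, data_paths) tuple.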


For more details, please check this documentation.