Loading huggingface dataset with load_dataset() from Azure Blob Storage

I need to modify existing code so that it can use the huggingface load_dataset() function with remote data. It currently works on local files, but I need to migrate the dataset to Azure Blob Storage and load it from there. I'm not sure how to do this.

The documentation mentions support via fsspec, but it doesn't show what to do with the fs variable once I have it. So far I have:

import adlfs

# Pick ONE of the following, depending on how you authenticate:
storage_options = {"anon": True}  # anonymous access
storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}  # account key
storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}  # service principal

fs = adlfs.AzureBlobFileSystem(**storage_options)

But what next? Do I just do load_dataset(fs)? Can I pass the fs object instead of a path? Or how do I actually load the dataset from the blob storage with load_dataset?

UPDATE:

I'm doing:

from datasets import load_dataset, load_from_disk  
storage_options = {"connection_string": "DefaultEndpointsProtocol=https;AccountName=acname;AccountKey=key;EndpointSuffix=core.windows.net"} 
data = load_dataset("abfs://ctnr-invoices", storage_options=storage_options) 

This still tries to find the files locally, as the error message shows: FileNotFoundError: Couldn't find a dataset script at C:\Users\me\Documents\abfs:\ctnr-invoices\ctnr-invoices.py or any data file in the same directory. I haven't found any documentation about "az://".

There is 1 best solution below

The docstring for load_dataset includes:

    storage_options (`dict`, *optional*, defaults to `None`):
        **Experimental**. Key/value pairs to be passed on to the dataset file-system backend, if any.

(datasets = 2.14.6)

So you need to pass storage options like the ones you are trying (with correct credentials) and make sure that the paths begin with "az://".
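As a rough sketch of what such a call might look like: the version below passes a builder name as the first argument and the "az://" URL through data_files, instead of passing the URL as the dataset name (which, judging by the error in the question, gets treated as a local path). It assumes the container holds Parquet files, that adlfs is installed, and that your datasets version resolves remote data_files through fsspec. The helper names azure_data_files and load_from_azure are hypothetical, and I haven't run this against a real Azure account.

```python
def azure_data_files(container: str, pattern: str, protocol: str = "az") -> str:
    """Build an fsspec-style URL such as az://<container>/<pattern>."""
    return f"{protocol}://{container.strip('/')}/{pattern.lstrip('/')}"


def load_from_azure(container, pattern, storage_options, builder="parquet"):
    """Hypothetical wrapper: load remote files through load_dataset()."""
    # Imported lazily so the URL helper above works without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        builder,  # assumption: a packaged builder such as "parquet", "csv" or "json"
        data_files=azure_data_files(container, pattern),
        storage_options=storage_options,  # same dict you would give to adlfs
    )


print(azure_data_files("ctnr-invoices", "*.parquet"))
```

You would then call something like load_from_azure("ctnr-invoices", "*.parquet", storage_options) with the connection-string dict from the question. The "*.parquet" pattern is only a placeholder for whatever files are actually in the container.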

The fs object is a filesystem instance that lets you list or find files inside containers, get their information, perform operations (copy, delete, ...) and open file-like objects. I believe HF does this internally, so for loading alone you shouldn't need it.
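If you do want to use fs yourself, for example to check what is in the container before loading, something along these lines should work with any fsspec filesystem. The function names list_data_files and read_head are made up for illustration, and the ".parquet" suffix is an assumption; ls() and open() are the standard fsspec filesystem methods.

```python
def list_data_files(fs, container: str, suffix: str = ".parquet") -> list:
    """Return the paths in `container` that end with `suffix`, using fs.ls()."""
    # detail=False asks fsspec for plain path strings instead of info dicts.
    return sorted(p for p in fs.ls(container, detail=False) if p.endswith(suffix))


def read_head(fs, path: str, n: int = 64) -> bytes:
    """Open a remote file as a file-like object and read its first n bytes."""
    with fs.open(path, "rb") as f:
        return f.read(n)
```

With the fs from the question this would be called as list_data_files(fs, "ctnr-invoices"), which is a quick way to confirm that the credentials and container name are right before involving load_dataset at all.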