I need to modify existing code that uses the Hugging Face load_dataset() function. It currently works on local files, but I need to migrate the dataset to Azure Blob Storage and load it from there, and I'm confused about how to do this.
The documentation does mention support via fsspec, but it doesn't show what to do with the fs variable once you have it. So basically I'm doing:
import adlfs

# anonymous access:
storage_options = {"anon": True}
# or with an account key:
storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}
# or with a service principal:
storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}

fs = adlfs.AzureBlobFileSystem(**storage_options)
But what next? Do I just call load_dataset(fs)? Can I pass the fs object instead of a path? How do I actually load the dataset from blob storage with load_dataset?
UPDATE:
I'm doing:
from datasets import load_dataset, load_from_disk
storage_options = {"connection_string": "DefaultEndpointsProtocol=https;AccountName=acname;AccountKey=key;EndpointSuffix=core.windows.net"}
data = load_dataset("abfs://ctnr-invoices", storage_options=storage_options)
This still tries to find the files locally, as the error message shows: FileNotFoundError: Couldn't find a dataset script at C:\Users\me\Documents\abfs:\ctnr-invoices\ctnr-invoices.py or any data file in the same directory.
I haven't found documentation about "az://".
The docstring for load_dataset (datasets = 2.14.6) includes:
So you need to use options like the ones you are trying (with correct credentials), making sure that the paths begin with "az://".
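For example, a minimal sketch, assuming the container from your update holds CSV files (the builder name and glob pattern are assumptions; adjust them to your actual layout). Note that the builder name goes first and the remote path goes in data_files — the fs object is never passed to load_dataset:

```python
def load_from_blob(connection_string: str):
    # Lazy import: requires `pip install datasets adlfs`
    from datasets import load_dataset

    storage_options = {"connection_string": connection_string}
    # The protocol prefix ("az://", or equivalently "abfs://") routes the
    # path through adlfs; credentials travel via storage_options.
    return load_dataset(
        "csv",  # assumption: pick the builder matching your files (csv, json, parquet, ...)
        data_files="az://ctnr-invoices/**",  # container name taken from your snippet
        storage_options=storage_options,
    )
```

Calling load_from_blob with your connection string should then resolve the files remotely instead of looking for a local dataset script.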
The fs object is a filesystem instance: it lets you list/find files inside containers and get their information, perform operations (copy, delete, ...), and open file-like objects. I believe HF does this internally, so for just loading you shouldn't need it.