When I use load_dataset() to load a Mozilla Common Voice (v11) dataset, the resulting dataset (ds) returns each audio['array'] as a numpy array. I don't know how to reproduce this with my own dataset.
How do you set just one feature as ndarrays?
In examining Common Voice:
> tt = load_dataset(
      data_args.dataset_name,
      data_args.dataset_config_name,
      split=f'{data_args.train_split_name}[:15%]',  # Load only the first %
      cache_dir=model_args.cache_dir,
      token=model_args.token,
  )
> type(tt.select([0])['audio'][0]['path'])
<class 'str'>
> type(tt.select([0])['audio'][0]['array'])
<class 'numpy.ndarray'>
> type(tt.select([0])['path'][0]) # They repeat paths as a top level feature
<class 'str'>
But in my own code I can't store numpy arrays. The one exception I found is ds = ds.with_format('np'), which does give numpy arrays in re-loaded datasets, but ALL top-level features end up as numpy data types (see full code to test/reproduce below):
> type(test_ds['path'][0])
<class 'numpy.str_'>
I only need the "audio -> array" data to be 1d numpy arrays.
Here's test code to create a dataset and reload it to examine types:
#!/usr/bin/env python
# Trying to save and reload a numpy array to/from a huggingface dataset
# The type of the loaded array must be a numpy array()
from datasets import Dataset, Features, Array2D, Sequence, Value
import numpy as np
audio_arrays = [np.random.rand(16000), np.random.rand(16000)]
features = Features({
    # Each audio contains a np array of audio data, and a path to the src audio file
    'audio': Sequence({
        #'array': Sequence(feature=Array2D(shape=(None,), dtype="float32")),
        'array': Sequence(feature=Value('float32')),
        'path': Value('string'),
    }),
    'path': Value('string'),  # Path is redundant in common voice set also
})
ddata = {
    'path': [],   # This will be a list of strings
    'audio': [],  # This will be a list of dictionaries
}
ddata['path'] = ['/foo0/', '/bar0/']  # Different from the audio paths, so we see the storage difference
ddata['audio'] = [
    {'array': audio_arrays[0], 'path': '/foo1/'},
    {'array': audio_arrays[1], 'path': '/bar1/'},
]
ds = Dataset.from_dict(ddata)
ds = ds.with_format('np')
ds.save_to_disk('/tmp/ds.ds')
loaded_dataset = Dataset.load_from_disk('/tmp/ds.ds')
ld = loaded_dataset
au = ld['audio'][0]
ar = ld['audio'][0]['array']
print("Type of audio array:", type(ar))
print("Type of path:", type(ld['path'][0]))
print("Type of au path:", type(ld['audio'][0]['path']))
import ipdb; ipdb.set_trace(context=16); pass
Got it. Output is:
Note: audio arrays stored as raw bytes will raise an invalid-format error when accessed. We're using soundfile to rewrite them as proper, complete WAV file byte representations.
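To illustrate what "complete WAV file bytes" means, here is a rough sketch that wraps float samples in a WAV container in memory. It deliberately uses Python's standard-library wave module as a stand-in for soundfile, so it is not the code we actually run. The helper name floats_to_wav_bytes, the 16 kHz rate, and the mono 16-bit PCM layout are all assumptions for the example.

```python
import io
import struct
import wave

def floats_to_wav_bytes(samples, sampling_rate=16000):
    """Pack float samples in [-1.0, 1.0] into a mono 16-bit PCM WAV file in memory."""
    # Clip each sample, then scale it to the signed 16-bit integer range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    # The RIFF/WAVE header is written by the wave module, so these bytes
    # form a full file rather than bare sample data.
    return buffer.getvalue()

wav_bytes = floats_to_wav_bytes([0.0, 0.5, -0.5, 1.0])
print(wav_bytes[:4], len(wav_bytes))
```

The point is that a decoder needs the header (sample rate, channel count, sample width) in front of the samples; bare array bytes carry none of that, which would explain an invalid-format error.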
Code to test/reproduce: