I created a Parquet dataset partitioned as follows:
2019-taxi-trips/
    - month=1/
        - data.parquet
    - month=2/
        - data.parquet
    ...
    - month=12/
        - data.parquet
This organization follows the Parquet dataset partitioning convention used by Hive Metastore. This partitioning scheme was generated by hand, so there is no _metadata file anywhere in the directory tree.
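(For reference, an equivalent layout can also be produced programmatically. The sketch below is hypothetical and assumes the trip data is already in a pandas/Dask DataFrame with a month column; note that the per-partition file names it writes will differ from the hand-named data.parquet.)

import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-in for the real trip data: a tiny frame with a "month" column.
pdf = pd.DataFrame({"month": [1, 2, 12], "fare": [5.0, 7.5, 9.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

ddf.to_parquet(
    "2019-taxi-trips/",
    engine="fastparquet",
    partition_on=["month"],        # writes month=1/, month=2/, ... subdirectories
    write_metadata_file=False      # skip the _metadata file, matching the hand-built layout
)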
I would like to now read this dataset into Dask.
With data located on local disk, the following code works:
import dask.dataframe as dd
dd.read_parquet(
    "/Users/alekseybilogur/Desktop/2019-taxi-trips/*/data.parquet",
    engine="fastparquet"
)
I copied these files to an S3 bucket via s3 sync, so the partition folders are top-level keys in the bucket. I then attempted to read them off of cloud storage using the same basic function:
import dask.dataframe as dd
dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="fastparquet"
)
This raises IndexError: list index out of range. Full stack trace here.
Is it not currently possible to read such a dataset directly from AWS S3?
There is currently a bug in fastparquet that prevents this code from working. See Dask GH#6713 for details. In the meantime, until that bug is resolved, one easy workaround is to use the pyarrow backend instead.
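For example, assuming the same bucket path and credentials as in the question (and that pyarrow is installed, e.g. via pip install pyarrow), the call would look like this:

import dask.dataframe as dd
dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="pyarrow"  # the only change: swap the fastparquet engine for pyarrow
)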