Using fsspec with the pandas.DataFrame.to_csv command

I want to write a CSV file from a pandas DataFrame to a remote machine, connecting via SFTP over SSH. Does anybody know how to add the "storage_options" parameter correctly?

The pandas documentation says that I have to pass some dict as the parameter's value, but I don't understand exactly which keys it expects.

hits_df.to_csv('hits20.tsv', compression='gzip', index=False, chunksize=1000000, storage_options={???})

Every time, I get ValueError: storage_options passed with file object or non-fsspec file path.

What am I doing wrong?

There are 3 answers below.

If you do not have cloud storage credentials, you can still access public data by specifying an anonymous connection, like this:

pd.read_csv('name',<other fields>, storage_options={"anon": True})

Otherwise, pass storage_options as a dict; you get the account name and key from your cloud storage provider (Amazon S3, Google Cloud, Azure, etc.). The example below uses Azure-style account_name/account_key, and a Google Cloud variant is sketched after it.

pd.read_csv('name', <other fields>,
            storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY})
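
The exact keys differ per backend. For example, for Google Cloud Storage via gcsfs the dict might look like this (a sketch; the bucket path and credentials file are placeholders):

pd.read_csv('gs://my-bucket/name', <other fields>,
            storage_options={'token': 'path/to/service-account.json'})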

You will find the set of values to use by experimenting directly with the implementation backend, SFTPFileSystem. Whatever kwargs you use there are the same ones that would go into storage_options. Short story: paramiko is not the same as command-line SSH, so some trial and error will be required.
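
For the original question (writing a CSV over SFTP), a minimal sketch of what the call might look like; the host, username and password are placeholders, and the accepted keys are whatever paramiko's connect() accepts:

hits_df.to_csv(
    "sftp://example-host/remote/path/hits20.tsv",
    compression="gzip",
    index=False,
    chunksize=1000000,
    # placeholder credentials; use whichever kwargs worked for SFTPFileSystem
    storage_options={"username": "myuser", "password": "mypassword"},
)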

If you have things working via the file system class, you can use the alternative route

import fsspec.implementations.sftp
import pandas as pd

fs = fsspec.implementations.sftp.SFTPFileSystem(...)  # same kwargs as storage_options
# same as fs = fsspec.filesystem("ssh", ...)
with fs.open("my/file/path", "rb") as f:
    pd.read_csv(f, other_kwargs)  # other_kwargs: whatever read_csv arguments you need
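
Since the question is about writing rather than reading, the same alternative route also works in the other direction; a sketch, with placeholder host, credentials and remote path:

import fsspec

# hits_df is the DataFrame from the question; host and credentials are placeholders
fs = fsspec.filesystem("sftp", host="example-host",
                       username="myuser", password="mypassword")
with fs.open("/remote/path/hits20.tsv", "w") as f:  # "w" gives a text-mode handle
    hits_df.to_csv(f, index=False)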

pandas supports fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (as well as other backends such as (S)FTP, SSH or HDFS). In particular, s3fs is very handy for simple file operations in S3, because boto3 is often subtly complex to use.

The storage_options argument lets you pass s3fs arguments through pandas.

You can pass AWS credentials explicitly using storage_options, which takes a dict (this route requires s3fs to be installed). An example below:

import os

import pandas as pd

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

df.to_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
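
Reading the file back works the same way: the same storage_options dict is passed to pd.read_csv (a sketch reusing the variables above):

df = pd.read_csv(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)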