Not seeing column names when reading csv from s3 in pandas


I am using the following bit of code to read the iris dataset from an s3 bucket.

import pandas as pd
import s3fs

s3_path = 's3://h2o-public-test-data/smalldata/iris/iris.csv'

s3 = s3fs.S3FileSystem(anon=True)
with s3.open(s3_path, 'rb') as f:
    df = pd.read_csv(f, header = True)

However, the column names are just the contents of the first row of the dataset. How do I fix that?

There are 2 best solutions below

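The following changes are needed:

  1. s3_path can omit the s3:// prefix when it is passed to s3.open (this is optional, since s3fs strips the prefix itself).
  2. iris.csv has no header row. If you need a file with a header, use iris_wheader.csv from the same directory.
  3. In read_csv, header does not accept a boolean. Pass the row number that holds the column names (header=0 for the first row), or header=None if the file has no header.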

Your final code should look something like this

import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=True)

with s3.open('h2o-public-test-data/smalldata/iris/iris_wheader.csv', 'rb') as f:
    df = pd.read_csv(f, header=0)
    print(df.head())

Edit: You can directly read the file in pandas as follows:

import pandas as pd

df = pd.read_csv('s3://h2o-public-test-data/smalldata/iris/iris_wheader.csv', header=0, storage_options={
    "anon": True
})
print(df.head())

You still need to have s3fs installed, because pandas uses it behind the scenes for s3:// URLs. You just no longer need to open the file yourself.
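
To see what header does without touching S3, here is a minimal sketch using small in-memory CSV strings. The sample rows and the column names sepal_len, sepal_wid and class are made up for illustration; they are not the contents of the files in the bucket.

```python
import io
import pandas as pd

# Hypothetical sample data, not the real bucket contents.
with_header = "sepal_len,sepal_wid,class\n5.1,3.5,Iris-setosa\n4.9,3.0,Iris-setosa\n"
without_header = "5.1,3.5,Iris-setosa\n4.9,3.0,Iris-setosa\n"

# header=0: the first line supplies the column names.
df_named = pd.read_csv(io.StringIO(with_header), header=0)

# header=None: every line is data; columns are numbered 0, 1, 2.
df_plain = pd.read_csv(io.StringIO(without_header), header=None)

# Default header on a headerless file: the first data row is
# consumed as column names, which is the symptom in the question.
df_wrong = pd.read_csv(io.StringIO(without_header))

print(list(df_named.columns))
print(list(df_plain.columns))
print(list(df_wrong.columns))
```

The third case loses one row of data, so it is worth checking len(df) as well as the column names when a file's header status is unknown.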

See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html for all the parameters.

If your CSV does not contain the column names, you can use the names parameter to specify the names you want. In that case you do not need to set header at all: when names is given and header is left out, pandas treats the file as having no header row.

df = pd.read_csv(file_path, names=['yan', 'tan', 'tetherer', 'mether', 'pip'])
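
As a rough sketch of how names interacts with header, the example below reads two in-memory CSV strings. The data rows and the a,b,c,d,e header are invented for illustration; only the names list comes from the line above.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a five-column file.
csv_no_header = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
csv_with_header = "a,b,c,d,e\n" + csv_no_header

names = ['yan', 'tan', 'tetherer', 'mether', 'pip']

# names alone: pandas assumes there is no header row, so both rows are data.
df = pd.read_csv(io.StringIO(csv_no_header), names=names)

# names plus header=0: the existing header row is dropped and replaced.
df_replaced = pd.read_csv(io.StringIO(csv_with_header), names=names, header=0)

print(df)
print(df_replaced)
```

If the file does have a header row and you pass names without header=0, that header row ends up as the first row of data, so pick the combination that matches the file.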