Do I need to check integrity when using pandas to upload and download files from S3?

I use pandas to upload and download files from S3 in the following style (pandas uses s3fs behind the scenes):

import pandas as pd
pd.read_csv("s3://bucket/path/to/file.csv")

If the file is large, there is a concern that the download (or upload) may not complete, leaving a partial file to be processed.

Do I need to implement an MD5 check here to ensure data integrity, or is that already handled by s3fs?

Best answer:

In short, yes. When people upload large quantities of data to an external bucket, they generally provide an md5sum alongside the data, but unfortunately that isn't always the case. Without validating a checksum, you have no way of knowing whether the data changed between the bucket and your local machine. s3fs has a checksum method (derived from the object's S3 ETag, which for a single-part upload is the MD5 of the object's bytes), and I wrote a small helper to compute the MD5 of the downloaded bytes, so you can compare the checksum of the object in S3 against the checksum of the data you actually read:

import io
from hashlib import md5

import pandas as pd
from s3fs import S3FileSystem

path = 's3://fun_bucket/check_df.csv'

fs = S3FileSystem(anon=False)

# s3fs derives this value from the object's ETag, which for a single-part
# upload is the MD5 of the object's bytes.
checksum = fs.checksum(path)
print("S3FS checksum is: %i" % checksum)

def tokenize(data):
    """Return the hex MD5 digest of a bytes object."""
    return md5(data).hexdigest()

with fs.open(path) as f:
    data = f.read()
    # Convert the hex digest to an int so it compares with fs.checksum().
    hash_checksum = int(tokenize(data), 16)
    print("Read data checksum is: %i" % hash_checksum)
    if checksum == hash_checksum:
        df = pd.read_csv(io.BytesIO(data), encoding='utf8')
    else:
        raise IOError("Checksum mismatch: the download may be incomplete")

print(df)

When I run this I get:

S3FS checksum is: 185552205801727997486039422858559195205
Read data checksum is: 185552205801727997486039422858559195205
   one  two  three
0    1    2      3
1    1    2      3
2    1    2      3

This prints both checksums so you can inspect them manually, and the conditional only builds the DataFrame when they match; the else branch raises an error rather than silently leaving df undefined.
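
If you read verified CSVs often, it may be worth wrapping the pattern in a small helper. Below is a minimal sketch; the function name read_csv_verified is my own, not part of s3fs or pandas, and it assumes the object was uploaded in a single part so that the ETag-based fs.checksum() equals the MD5 of the content:

import io
from hashlib import md5

import pandas as pd
from s3fs import S3FileSystem

def read_csv_verified(path, **read_csv_kwargs):
    """Read a CSV from S3, verifying its MD5 against the s3fs checksum.

    Assumes a single-part upload, where the S3 ETag (and hence
    fs.checksum) is the MD5 of the object's bytes.
    """
    fs = S3FileSystem(anon=False)
    expected = fs.checksum(path)
    with fs.open(path) as f:
        data = f.read()
    actual = int(md5(data).hexdigest(), 16)
    if actual != expected:
        raise IOError("checksum mismatch for %s" % path)
    return pd.read_csv(io.BytesIO(data), **read_csv_kwargs)

# Usage (hypothetical bucket/key):
# df = read_csv_verified("s3://fun_bucket/check_df.csv")

For objects uploaded via multipart upload, the ETag is not an MD5 of the whole object, so this comparison would fail even for intact data; in that case you would need a checksum supplied out of band by the uploader.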