I use pandas to upload and download files from S3 in the following style (pandas uses s3fs under the hood):
import pandas as pd
pd.read_csv("s3://bucket/path/to/file.csv")
If the file is large, a common concern is that the download (or upload) does not complete and a partial file ends up being processed.
Do I need to implement some md5 check here to ensure the integrity of the data, or is that already handled by s3fs?
In short, yes. When people upload large quantities of data to an external bucket they generally provide an md5sum alongside the data, but unfortunately that isn't always the case. Without validating the md5sum you have no way of knowing whether the data changed between the bucket and your local computer. s3fs exposes the object's checksum, and with a small custom function that computes the md5 of the bytes you read, you can verify the md5 of the file object in S3 against the md5 calculated after reading it locally, like so:
Running this prints both checksums for you to inspect manually, and of course the conditional won't produce a df unless they are equal.