How do I download an S3 file only if it has changed?


I have a 900 MB file that I'd like to download from S3 to disk only if it hasn't already been downloaded. Is there an easy way to do this? I know S3 supports querying the MD5 checksum of a file, but I'm hoping not to have to build this logic myself.


There are 2 best solutions below

1
On

You can use AWS CLI's s3 sync command.

Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination.

According to this forum thread, you can use sync to synchronize only one file:

    aws s3 sync s3://bucket/path/ local/path/ --exclude "*" --include "File.txt"

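This command syncs the given paths, excludes all files, and then includes only "File.txt", so only "File.txt" under those paths is synchronized. By default, sync decides whether a file needs copying by comparing size and last-modified time rather than an MD5 checksum.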


Or with the Java SDK:

According to the javadoc, there is a getObjectMetadata method which returns information about an S3 object (file) without downloading its contents.

The method returns an ObjectMetadata object which can give you some useful information:

getLastModified(): Gets the value of the Last-Modified header, indicating the date and time at which Amazon S3 last recorded a modification to the associated object.

getContentMD5(): Gets the base64 encoded 128-bit MD5 digest of the associated object (content - not including headers) according to RFC 1864.

getETag(): Gets the hex encoded 128-bit MD5 digest of the associated object according to RFC 1864.
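To make the checksum route concrete, here is a minimal sketch of the local half of the comparison, using only the JDK. It assumes the ETag string was already fetched, for example via getObjectMetadata(...).getETag(). The class and method names (EtagCheck, md5Hex, needsDownload) are made up for illustration. One caveat: for objects uploaded in multiple parts (likely for a 900 MB file), the ETag is not a plain MD5 of the content and contains a "-", so this sketch simply reports that a download is needed in that case.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the AWS SDK.
final class EtagCheck {

    /** Streams the file through MD5 and returns the lowercase hex digest. */
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md5)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading drives the digest; nothing else to do here.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Returns true when the local copy is missing or its MD5 differs from the ETag.
     * An ETag containing '-' comes from a multipart upload and is not an MD5,
     * so this sketch conservatively answers "needs download" for it.
     */
    static boolean needsDownload(Path localFile, String eTag)
            throws IOException, NoSuchAlgorithmException {
        if (!Files.isRegularFile(localFile)) {
            return true;
        }
        String normalized = eTag.replace("\"", ""); // tolerate a quoted header value
        if (normalized.contains("-")) {
            return true;
        }
        return !md5Hex(localFile).equalsIgnoreCase(normalized);
    }
}
```

Hashing a 900 MB file still means reading all of it from disk, so this only saves the network transfer, not local I/O.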

0
On

I used the code below to download only those S3 files whose timestamp is later than the local folder's timestamp. It checks each file under the S3 prefix and downloads it only if it was modified after the local folder was.

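    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;
    import com.amazonaws.services.s3.transfer.Download;
    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import java.io.File;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.LinkOption;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Date;

    // The statements below belong inside a method; LOG is the enclosing class's logger.
    String bucket = "bucket";
    AmazonS3 amazonS3 = AmazonS3ClientBuilder.standard().build();
    TransferManager transferManager = TransferManagerBuilder.standard().withS3Client(amazonS3).build();

    // Read the local folder's timestamp; fail fast instead of continuing with a null value.
    Path location = Paths.get("/data/test/");
    Date lastUpdatedTime;
    try {
        lastUpdatedTime = new Date(Files.getLastModifiedTime(location, LinkOption.NOFOLLOW_LINKS).toMillis());
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }

    try {
        ObjectListing listing = amazonS3.listObjects(bucket, "test-folder");
        while (true) {
            for (S3ObjectSummary os : listing.getObjectSummaries()) {
                // Download only objects modified after the local folder's timestamp.
                if (os.getLastModified().after(lastUpdatedTime)) {
                    Download download = transferManager.download(bucket, os.getKey(), new File("/data/test/" + os.getKey()));
                    download.waitForCompletion(); // blocks until done; replaces the sleep loop
                }
            }
            if (!listing.isTruncated()) break;   // listings are paged; fetch the next page if any
            listing = amazonS3.listNextBatchOfObjects(listing);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        LOG.error("Exception occurred while downloading the file", e);
    } finally {
        transferManager.shutdownNow(false); // stop transfer threads, keep the shared S3 client open
    }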
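For the single-file case in the question, the same timestamp idea can be applied to the file itself rather than to the folder. The sketch below shows only that decision, using only the JDK. It assumes the remote timestamp has already been obtained (for instance from getLastModified() on the object metadata, converted with Date.toInstant()). TimestampCheck and isStale are illustrative names, not SDK APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// Hypothetical helper, not part of the AWS SDK.
final class TimestampCheck {

    /**
     * Returns true when the local file is missing, or when the remote object
     * was modified after the local file's last-modified time.
     */
    static boolean isStale(Path localFile, Instant remoteLastModified) throws IOException {
        if (!Files.isRegularFile(localFile)) {
            return true;
        }
        Instant localModified = Files.getLastModifiedTime(localFile).toInstant();
        return remoteLastModified.isAfter(localModified);
    }
}
```

A timestamp check like this is cheap but only as reliable as the clocks involved; where that matters, a checksum comparison is the stricter test.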