How to prevent duplicate keys with S3?


How is it possible to implement, in application logic, a way to prevent duplicate keys, given the eventually consistent nature of S3?

One way to check whether a key exists (treating only an HTTP 404 as "does not exist") is:

public boolean exists(String path, String name) {
    try {
        // HEAD request; succeeds only if the key is currently visible
        s3.getObjectMetadata(bucket, getS3Path(path) + name);
        return true;
    } catch (AmazonServiceException e) {
        if (e.getStatusCode() == 404) {
            return false; // key not found
        }
        throw e; // 403, 500, etc. must not be mistaken for "absent"
    }
}

Is there a guarantee that, when we gate our application logic with this check, it will always correctly report whether the key exists, given (again) the eventual consistency of S3? Say two requests arrive with exactly the same key/path: would one of them be told that the key already exists (i.e. exists() == true), or would both objects be stored, just as different versions?

I would like to point out that I am using S3 as document storage (similar to a JSON store).

There are 2 answers below.

Answer 1 (0 votes):

Using another "S3-compatible" service such as Wasabi solves this problem, as stated in this article:

Wasabi also uses a data consistency model that means any operation followed by another operation will always give the same results. This Wasabi data consistency approach is in contrast to the Amazon S3 model which is "eventually consistent" in that you may get different results in two requests.

Answer 2 (8 votes):

That code won't work as intended.

The first time you ever call s3.getObjectMetadata(...) on a key that S3 has never seen before, it will correctly tell you that there is no such key. However, if you then upload an object under that key and call s3.getObjectMetadata(...) again, S3 may still tell you that there is no such key.

This is documented on the Introduction to Amazon S3: Amazon S3 data consistency model page:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.

There's no way to do exactly what you describe with S3 alone. You need a strongly consistent data store for that kind of query: something like DynamoDB (with strongly consistent reads), RDS, etc.
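
For illustration, here is a minimal sketch of one way this could look with DynamoDB: a conditional put that "claims" a key before the S3 write, so the existence check and the reservation happen in a single atomic step. The KeyRegistry class, the table name, and the objectKey attribute are all hypothetical names for this sketch, not anything the answer or AWS prescribes.

import java.util.Collections;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class KeyRegistry {
    private final AmazonDynamoDB dynamo;
    private final String table;

    public KeyRegistry(AmazonDynamoDB dynamo, String table) {
        this.dynamo = dynamo;
        this.table = table;
    }

    /** Atomically claims a key; exactly one caller per key ever gets true. */
    public boolean tryClaim(String objectKey) {
        try {
            dynamo.putItem(new PutItemRequest()
                    .withTableName(table)
                    .withItem(Collections.singletonMap("objectKey", new AttributeValue(objectKey)))
                    .withConditionExpression("attribute_not_exists(objectKey)"));
            return true;  // we own the key; now safe to write the S3 object
        } catch (ConditionalCheckFailedException e) {
            return false; // some other writer claimed this key first
        }
    }
}

Only the caller whose tryClaim(key) returns true goes on to write the S3 object; every other caller treats the key as a duplicate.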

Alternatively, if you want to use just S3, there's one thing you might be able to do, depending on the specifics of your problem. If you have the liberty to choose the key under which you'll write the object to S3, and if you know the full contents of the object up front, you could use the hash of the contents as the key. Hash collisions aside, a given key will then only ever exist in S3 if that exact piece of data is there, because for a given piece of data there is only one possible name.
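
As a sketch, deriving such a content-addressed key might look like this (assuming a SHA-256 hex digest is an acceptable key name; the helper class is made up for this example):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ContentAddressedKey {
    /** Derives the S3 key from the object's bytes: identical content yields an identical key. */
    public static String keyFor(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is mandatory on every JVM", e);
        }
    }
}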

The write operation then becomes idempotent. Here's why: if you check for existence and it returns false, you write the object. If that false was due to eventual consistency, it probably isn't an issue, because all you'll be doing is overwriting an object with byte-identical contents, which is almost a no-op (the exception being if you trigger jobs whenever objects are written; you'd need to check the idempotency of those, too).
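
Putting it together, a sketch of the resulting idempotent write could look like the following; DocumentStore and save(...) are hypothetical names, and keyFor(...) is the helper sketched above.

import java.io.ByteArrayInputStream;
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class DocumentStore {
    private final AmazonS3 s3;
    private final String bucket;

    public DocumentStore(AmazonS3 s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    /**
     * Idempotent write: the key is derived from the content, so a stale
     * "does not exist" answer only leads to rewriting identical bytes.
     */
    public String save(byte[] content) {
        String key = ContentAddressedKey.keyFor(content);
        if (!exists(key)) {
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(content.length);
            s3.putObject(bucket, key, new ByteArrayInputStream(content), meta);
        }
        return key;
    }

    private boolean exists(String key) {
        try {
            s3.getObjectMetadata(bucket, key);
            return true;
        } catch (AmazonServiceException e) {
            if (e.getStatusCode() == 404) {
                return false;
            }
            throw e;
        }
    }
}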

However, that solution may not be applicable to your case. If it isn't, then you'll need to use a strongly consistent storage system for metadata.