s3fs local filecache of versioned files


I want to use s3fs, which is based on fsspec, to access files on S3, mainly because of two neat features:

  1. local caching of files to disk, with a check for remote changes, i.e. a file gets re-downloaded if the local and remote copies differ
  2. file version id support for versioned S3 buckets, i.e. the ability to open different versions of the same remote file based on their version id

I don't need this for high-frequency use, and the files don't change often. It is mainly for unit/integration test data stored on S3, which changes only when tests and the related test data get updated (hence the versions!).

I got both of the above working separately just fine, but I can't seem to get the combination of the two working. That is, I want to be able to cache different versions of the same file locally. As soon as you use a filecache, the version id disambiguation is lost.
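For reference, each feature on its own looks roughly like this (a minimal sketch; my_bucket, my_file.txt, and version_id stand in for real values):

import fsspec

# Feature 1: locally cached reads that re-download when the remote file changes
fs_cached = fsspec.filesystem("filecache", target_protocol="s3", cache_storage="/tmp/aws", check_files=True)

# Feature 2: version-aware reads from a versioned bucket
fs_versioned = fsspec.filesystem("s3", version_aware=True)
with fs_versioned.open("s3://my_bucket/my_file.txt", "r", version_id=version_id) as f:
    text = f.read()  # contents of the requested version

The combination of the two, however, does not behave as hoped: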

import fsspec

fs = fsspec.filesystem(
    "filecache",
    target_protocol="s3",
    target_options={"version_aware": True},  # options for the underlying s3fs filesystem
    cache_storage="/tmp/aws",
    check_files=True,
)
with fs.open("s3://my_bucket/my_file.txt", "r", version_id=version_id) as f:
    text = f.read()

No matter which version_id I pass, I always get the most recent version of the file from S3, and that is also the version that gets cached locally.

What I expect is that I always get the correct file version and the local cache either keeps separate files for each version (preferred) or just updates the local file whenever I request a version different from the cached one.

Is there a way to achieve this with the current state of the libraries, or is it simply not possible yet? I am using s3fs and fsspec, both at version 2022.3.0.

1 Answer


After checking with the developers, this combination is not possible with the current state of the libraries: the cache hash of the target file is computed from the file path alone, disregarding any other kwargs such as version_id. Different versions of the same key therefore all map to the same local cache entry.
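Until the caching layer learns about version_id, one workaround is to skip filecache and cache versioned objects manually, keyed by path and version id. Below is a minimal sketch under that assumption; open_versioned is a hypothetical helper, not part of s3fs or fsspec, and it relies on the fact that a given S3 version id always refers to the same immutable object, so a cached copy never needs a freshness check:

import hashlib
import os

import fsspec

def open_versioned(path, version_id, cache_dir="/tmp/aws", mode="r"):
    # Hypothetical helper: cache S3 objects locally, keyed by (path, version_id).
    fs = fsspec.filesystem("s3", version_aware=True)
    # Unlike filecache, include the version id in the cache key so that
    # different versions of the same key get separate local files.
    key = hashlib.sha256(f"{path}@{version_id}".encode()).hexdigest()
    local = os.path.join(cache_dir, key)
    if not os.path.exists(local):
        # A given version id is immutable, so the download happens at most once.
        os.makedirs(cache_dir, exist_ok=True)
        with fs.open(path, "rb", version_id=version_id) as src, open(local, "wb") as dst:
            dst.write(src.read())
    return open(local, mode)

with open_versioned("s3://my_bucket/my_file.txt", version_id) as f:
    text = f.read()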