Querying Parquet files on S3 using the Bloom filter

I have some data in an S3 bucket in Parquet format. The data consists of various datasets, each containing a UUID key followed by values. I need to look up individual UUIDs.

My question is whether it is possible to use the metadata stored in each Parquet file (specifically the Bloom filter) to check whether a specific UUID might be in that file, and only then query it. The idea is to avoid querying every single file in the hope of finding the required data, as this would take far too long.

Ideally, I would go through each file in the bucket, read its metadata, and check whether the requested UUID has been hashed into that file's Bloom filter. When I find a file that may contain the UUID, I would query it (e.g. with S3 Select).
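
To make the idea concrete, here is a minimal Python sketch of the pruning step I have in mind. It is an illustration under several assumptions, not a working Parquet reader: `SimpleBloomFilter` is a toy filter built on SHA-256, not the split-block Bloom filter that Parquet actually stores, and the file names and the `candidate_files` helper are hypothetical. In real Parquet files the Bloom filter is optional, is written per column chunk within each row group, and exists only if the writer enabled it, so whether it can be read at all depends on the library used. A Bloom filter can only answer "definitely not present" or "possibly present", so the surviving files still have to be queried.

```python
import hashlib
import uuid


class SimpleBloomFilter:
    """Toy Bloom filter for illustration; NOT Parquet's on-disk format."""

    def __init__(self, num_bits=1 << 16, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: bytes):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def candidate_files(filters, key):
    """Return the files whose filter cannot rule out the key (hypothetical helper)."""
    return [name for name, bf in filters.items() if bf.might_contain(key.bytes)]


# Stand-in data: five hypothetical files, each holding 1000 random UUID keys.
files = {f"part-{i}.parquet": [uuid.uuid4() for _ in range(1000)] for i in range(5)}

# Build one filter per file (in reality this would be read from file metadata).
filters = {}
for name, keys in files.items():
    bf = SimpleBloomFilter()
    for k in keys:
        bf.add(k.bytes)
    filters[name] = bf

wanted = files["part-3.parquet"][42]
to_query = candidate_files(filters, wanted)
print(to_query)  # only these files would then be queried, e.g. with S3 Select
```

The list returned by `candidate_files` always includes the file that really holds the key, and may occasionally include others because of false positives.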
