programmatically deleting parquet partitions from S3 bucket using pyspark


I have a parquet dataset in S3 (written via s3fs), partitioned like so:

STATE='DORMANT'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
             ....
-----> DATE=2020-11-01

STATE='ACTIVE'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
             ....
-----> DATE=2020-11-01

Every day new data is appended to this dataset and partitioned accordingly.
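
For context, the daily append looks roughly like this (the bucket path, the value column, and the sample rows are simplified placeholders, not my real job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical day's records; in reality these come from the upstream job.
    daily_df = spark.createDataFrame(
        [("ACTIVE", "2020-11-01", 123), ("DORMANT", "2020-11-01", 456)],
        ["STATE", "DATE", "value"],
    )

    # Append today's slice; Spark lays it out under STATE=.../DATE=... prefixes.
    daily_df.write.partitionBy("STATE", "DATE").mode("append").parquet(
        "s3a://my-bucket/my_table"
    )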

I would like to keep only the last 90 days of data and delete the rest. So when the 91st day of data comes in, it is appended and then day 1 is deleted from the DATE partitions. When day 92 comes in, day 2 is deleted, and so on.

Is this possible via pyspark?
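
What I had in mind is something along these lines: a rough sketch that reaches the Hadoop FileSystem behind the s3a:// path through Spark's JVM gateway and recursively deletes expired DATE=... partitions. The s3a://my-bucket/my_table root, the 90-day constant, and the ISO date parsing are assumptions about my layout, not a tested solution:

    import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    ROOT = "s3a://my-bucket/my_table"   # placeholder for the real dataset root
    RETENTION_DAYS = 90
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)

    # Reach the Hadoop FileSystem behind the s3a:// path through the JVM gateway.
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = Path(ROOT).getFileSystem(spark._jsc.hadoopConfiguration())

    # Walk STATE=... directories, then their DATE=... sub-directories,
    # and recursively delete any partition older than the cutoff.
    for state_status in fs.listStatus(Path(ROOT)):
        if not state_status.isDirectory():
            continue
        for date_status in fs.listStatus(state_status.getPath()):
            name = date_status.getPath().getName()      # e.g. "DATE=2020-01-01"
            if not name.startswith("DATE="):
                continue
            partition_date = datetime.date.fromisoformat(name.split("=", 1)[1])
            if partition_date < cutoff:
                fs.delete(date_status.getPath(), True)  # True => recursive

Is this a reasonable approach, or is there a more idiomatic way to handle partition retention in pyspark?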
