How do I change how AWS Glue Jobs format partitions in S3?


I'm running Glue Jobs for a bunch of related tables with a ts (timestamp) partition. By default, each Glue job writes the output files in S3 using this folder structure (for a given table and timestamp):

s3://someBucket/someFolder/table1/ts=2023-03-08T21:20:17Z/data*.parquet
s3://someBucket/someFolder/table2/ts=2023-03-08T21:20:17Z/data*.parquet
s3://someBucket/someFolder/table3/ts=2023-03-08T21:20:17Z/data*.parquet

Since all these tables will share the same timestamp, I would much rather they were in this folder structure instead:

s3://someBucket/someFolder/2023-03-08T21:20:17Z/table1/data*.parquet
s3://someBucket/someFolder/2023-03-08T21:20:17Z/table2/data*.parquet
s3://someBucket/someFolder/2023-03-08T21:20:17Z/table3/data*.parquet

Is this possible, and if so, how?

Thanks in advance!

Edit: In case you're interested in the code, it's not much different from what AWS generates for simple Glue jobs in Python:

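import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup, as found in AWS-generated scripts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# s3Source and s3Destination are defined elsewhere in the job (not shown)

# Script generated for node S3 bucket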
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [s3Source],
    },
    format_options={
        "multiline": False
    },
    transformation_ctx="S3bucket_node1",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.getSink(
    path=s3Destination,
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["ts"],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
S3bucket_node3.setCatalogInfo(
    catalogDatabase="some_data_catalog", catalogTableName="table1"
)
S3bucket_node3.setFormat("glueparquet")
S3bucket_node3.writeFrame(S3bucket_node1)
job.commit()

As far as I can tell, there is no way to change how partitionKeys formats partitions in the S3 folder structure, ugh.
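
One possible workaround, sketched here under the assumption that each job run handles a single ts value (as the question suggests): skip partitionKeys and put the timestamp and table name into the sink path directly. The helper build_destination below is hypothetical, not part of the Glue API. Without partitionKeys the catalog table would no longer be partitioned by ts, and how enableUpdateCatalog behaves when the path changes on every run is something to verify, so treat this as a trade-off rather than a drop-in fix.

```python
def build_destination(base: str, ts: str, table: str) -> str:
    """Return an S3 prefix laid out as <base>/<ts>/<table>/ (hypothetical helper)."""
    return f"{base.rstrip('/')}/{ts}/{table}/"


# Example: the three tables from the question, sharing one timestamp.
ts = "2023-03-08T21:20:17Z"
for table in ("table1", "table2", "table3"):
    print(build_destination("s3://someBucket/someFolder", ts, table))

# In the Glue job, the result would replace s3Destination and the
# partitionKeys argument would be dropped (assumption, untested):
#
#   sink = glueContext.getSink(
#       path=build_destination(base, ts, "table1"),
#       connection_type="s3",
#       compression="snappy",
#       transformation_ctx="S3bucket_node3",
#   )
```

If a single run can contain several ts values, this approach would need one write per distinct value, since the path is fixed for the whole frame.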
