I'm running Glue jobs for a bunch of related tables that are partitioned by ts (timestamp). By default, each Glue job writes its output files to S3 using this folder structure (for a given table and timestamp):
s3://someBucket/someFolder/table1/ts=2023-03-08T21:20:17Z/data*.parquet
s3://someBucket/someFolder/table2/ts=2023-03-08T21:20:17Z/data*.parquet
s3://someBucket/someFolder/table3/ts=2023-03-08T21:20:17Z/data*.parquet
Since all these tables will share the same timestamp, I would much rather they were in this folder structure instead:
s3://someBucket/someFolder/2023-03-08T21:20:17Z/table1/data*.parquet
s3://someBucket/someFolder/2023-03-08T21:20:17Z/table2/data*.parquet
s3://someBucket/someFolder/2023-03-08T21:20:17Z/table3/data*.parquet
Is this possible, and if so, how?
Thanks in advance!
Edit: In case you're interested in the code, it's not much different from what AWS generates for simple Glue jobs in Python:
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [s3Source],
    },
    format_options={
        "multiline": False
    },
    transformation_ctx="S3bucket_node1",
)

# Script generated for node S3 bucket
S3bucket_node3 = glueContext.getSink(
    path=s3Destination,
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["ts"],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
S3bucket_node3.setCatalogInfo(
    catalogDatabase="some_data_catalog", catalogTableName="table1"
)
S3bucket_node3.setFormat("glueparquet")
S3bucket_node3.writeFrame(S3bucket_node1)

job.commit()
There seems to be no way to change how partitions are formatted in the S3 folder structure, ugh.
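One possible workaround, as a rough sketch rather than a confirmed solution: since the ts=... folder level comes from partitionKeys, the job could drop partitionKeys and put the timestamp into the destination path itself. This assumes each run handles exactly one ts value and that the job already knows it (for example from a job argument). The helper name run_prefix is made up for illustration. The commented-out sink call reuses the names from the script above and leaves out the Data Catalog options, because how the catalog should treat a layout like this is a separate question that the sketch does not address.

```python
def run_prefix(base: str, ts: str, table: str) -> str:
    """Return base/ts/table/ with exactly one slash between the parts.

    Hypothetical helper, not part of the Glue API.
    """
    parts = [base.rstrip("/"), ts.strip("/"), table.strip("/")]
    return "/".join(parts) + "/"


# Example values taken from the paths in the question.
s3Destination = run_prefix(
    "s3://someBucket/someFolder", "2023-03-08T21:20:17Z", "table1"
)
print(s3Destination)
# prints s3://someBucket/someFolder/2023-03-08T21:20:17Z/table1/

# In the job, this path would be handed to the sink without partitionKeys,
# so no ts=... folder is created below it (assumption: one ts per run):
#
# S3bucket_node3 = glueContext.getSink(
#     path=s3Destination,
#     connection_type="s3",
#     compression="snappy",
#     transformation_ctx="S3bucket_node3",
# )
# S3bucket_node3.setFormat("glueparquet")
# S3bucket_node3.writeFrame(S3bucket_node1)
```

With this approach the ts column would stay inside the Parquet files instead of being encoded in the folder name, and anything that relies on Hive-style ts=... partitions would no longer see them.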