AWS Glue - Writing File Takes A Very Long Time

Hi, I have an ETL job in AWS Glue that takes a very long time to write. It reads data from S3, performs a few transformations (not all of them are listed below, but the transformations do not seem to be the issue), and finally writes the data frame back to S3. This write operation takes approximately 30 minutes for a file of about 20 MB, even with 10 workers (worker type G.1X). I have used print statements to narrow down where the time is spent, and it appears to be the final step of writing the file to S3. I have not had this issue before with the same kind of setup.

I'm using Glue version 3.0, Python version 3, and Spark version 3.1.

The source contains almost 50,000 files spread out over many folders, and new files are generated automatically every day. The average file size is about 10 KB.

Any suggestions on this issue?

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lag

#Glue context & spark session
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

#Works around issues with legacy datetimes in the new version of Spark
spark_conf = SparkConf()
spark_conf.setAll([
    ('spark.sql.legacy.parquet.int96RebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')
])

session = SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()
glueContext = GlueContext(session.sparkContext)  #GlueContext expects a SparkContext
spark = glueContext.spark_session

#Source(/s) - create dynamic frame
dy = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [
            "s3://.../files/abc/"
        ],
        "recurse": True,
        "groupFiles": "inPartition"
    },
    transformation_ctx="dy",
)
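
(Side note: with roughly 50,000 files of ~10 KB each, the read itself can also be slow. Glue's file grouping accepts a "groupSize" connection option alongside "groupFiles". A minimal sketch below; the ~128 MB target is an assumed value, not one from the original job:)

#Sketch: same source, with an explicit target group size (value assumed)
dy = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://.../files/abc/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728"  #~128 MB per group; assumed value
    },
    transformation_ctx="dy",
)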

df = dy.toDF()

#Transformation(/s)
#Sort by ID and timestamp, then add each row's previous timestamp per ID
df_ready = df\
    .sort(['ID', 'timestamp'], ascending=False)\
    .withColumn("timestamp_prev",
                lag(df.timestamp)
                .over(Window
                      .partitionBy("ID").orderBy('timestamp')))

df_ready.repartition(1).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

1 Answer

You are repartitioning to a single partition at the end, which funnels all the data through one task and prevents Glue from writing in parallel. If you remove the repartition(1), you should see much faster writes.
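
A minimal sketch of the fix, keeping everything else from the question unchanged (the output path is the asker's placeholder):

#Let each task write its own part file in parallel
df_ready.write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

#If fewer output files are wanted, coalesce to a small number of partitions
#instead of collapsing everything down to one (10 here is an assumption,
#chosen to match the 10 workers):
df_ready.coalesce(10).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

coalesce avoids a full shuffle, so it is usually cheaper than repartition when reducing the partition count.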