AWS Glue - Writing File Takes A Very Long Time

Hi, I have an ETL job in AWS Glue that takes a very long time to write. It reads data from S3, performs a few transformations (not all of them are listed below, but the transformations do not seem to be the issue), and finally writes the data frame back to S3. This write operation takes approximately 30 minutes for a file of about 20 MB, even with 10 workers (worker type G.1X). I have used print statements to narrow down where the time is spent, and it appears to be the final step of writing the file to S3. I have not had this issue before with the same kind of setup.

I'm using Glue version 3.0, Python version 3, and Spark version 3.1.

The source contains almost 50,000 files spread out over many folders, and new files are generated automatically every day. The average file size is about 10 KB.

Any suggestions on this issue?

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lag

#Glue context & spark session
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

#Works around issues with legacy datetimes in the new version of Spark
spark_conf = SparkConf()
spark_conf.setAll([
    ('spark.sql.legacy.parquet.int96RebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')
])

session = SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()
glueContext = GlueContext(session.sparkContext)  #GlueContext expects a SparkContext
spark = glueContext.spark_session

#Source(/s) - create dynamic frame
dy = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [
            "s3://.../files/abc/"
        ],
        "recurse": True,
        "groupFiles": "inPartition"
    },
    transformation_ctx="dy",
)
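
(Side note: with roughly 50,000 files of ~10 KB each, the read itself can also be slow. Glue's file grouping accepts a "groupSize" connection option alongside "groupFiles". A minimal sketch below; the ~128 MB target is an assumed value, not one from the original job:)

#Sketch: same source, with an explicit target group size (value assumed)
dy = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://.../files/abc/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728"  #~128 MB per group; assumed value
    },
    transformation_ctx="dy",
)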

df = dy.toDF()

#Transformation(/s)
#Sort by ID and timestamp, then add each row's previous timestamp per ID
df_ready = df\
    .sort(['ID', 'timestamp'], ascending=False)\
    .withColumn("timestamp_prev",
                lag(df.timestamp)
                .over(Window
                      .partitionBy("ID").orderBy('timestamp')))

df_ready.repartition(1).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

1 Answer

You are repartitioning to a single partition at the end, which funnels all the data through one task and prevents Glue from writing in parallel. If you remove the repartition(1), you should see much faster writes.
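
A minimal sketch of the fix, keeping everything else from the question unchanged (the output path is the asker's placeholder):

#Let each task write its own part file in parallel
df_ready.write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

#If fewer output files are wanted, coalesce to a small number of partitions
#instead of collapsing everything down to one (10 here is an assumption,
#chosen to match the 10 workers):
df_ready.coalesce(10).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

coalesce avoids a full shuffle, so it is usually cheaper than repartition when reducing the partition count.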