dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)
It takes about 45 seconds to read from S3. Is there any way to optimize the read time?
You could try the optimizePerformance option if you're using Glue 3.0 or later. It enables a vectorized CSV reader that batches records to reduce I/O; the AWS Glue CSV format documentation covers it in more detail.

Also, could you convert the CSV to something like Parquet upstream of the read? A columnar format is generally much faster to scan than raw CSV.
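For example, a minimal sketch reusing the options from your snippet (optimizePerformance is a csv format option and requires Glue 3.0+):

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ",",
        # enable the vectorized (SIMD) CSV reader, Glue 3.0+ only
        "optimizePerformance": True
    }
)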