How can I optimize the read from S3?

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

It takes 45 seconds to read from S3. Is there any way to optimize the read time?

There is 1 answer below.
You could try the optimizePerformance option if you're using Glue 3.0. It enables a vectorized CSV reader that batches rows to reduce I/O. See the AWS Glue documentation on CSV format options for more details.

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": True
    }
)

Also, could you convert the CSV to a columnar format like Parquet upstream of this read? Parquet is compressed and supports column pruning, so repeated reads scan far fewer bytes than raw CSV.