I need to query an S3 inventory report file (some_name.csv.gz), filter its contents, change some columns, and then replace the object in S3 with the updated version under the same name (some_name.csv.gz).
All of this has to be done in a Python Lambda function.
For the query/filter part I am using s3.select_object_content:
import boto3

s3 = boto3.client('s3')
resp = s3.select_object_content(
    Bucket=bucket_name,
    Key=filename,
    ExpressionType='SQL',
    Expression="select * from s3object s where s._6 like '%NOT-SSE%'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}, 'CompressionType': 'GZIP'},
    OutputSerialization={'CSV': {}}
)
For the write part I am looping over the returned records and writing each one to a file (in /tmp, since that is the only writable path in Lambda), after potentially modifying it:
with open('/tmp/csvfile.csv', 'w') as file:
    leftover = ''
    for event in resp['Payload']:
        if 'Records' in event:
            # a Records event carries an arbitrary chunk of bytes, not a single row,
            # so split on newlines and carry any trailing partial row into the next chunk
            chunk = leftover + event['Records']['Payload'].decode('utf-8')
            lines = chunk.split('\n')
            leftover = lines.pop()
            for rec in lines:
                if not rec:
                    continue
                row = rec.split(',')
                # change one of the columns if it is empty
                if row[2] == '':
                    row[2] = 'NOT_EMPTY'
                file.write(','.join(row))
                file.write('\n')
# gzip file
# upload it back to s3
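For those last two steps I am planning on something roughly like this (reusing the s3 client, bucket_name and filename from above, and the /tmp file written by the loop):

import gzip
import shutil

# compress the filtered file written above
with open('/tmp/csvfile.csv', 'rb') as f_in, gzip.open('/tmp/csvfile.csv.gz', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# overwrite the original object with the updated version
s3.upload_file('/tmp/csvfile.csv.gz', bucket_name, filename)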
Is it a bad idea to write the file line by line (what if the file is very large)? Are there better/more scalable alternatives?