AWS Wrangler (pandas layer): problem with path to S3 bucket


Here is the Python code in my Lambda layer. Shout-out to John R for some of this paginator code. From API Gateway, I pass in a path parameter (bucket) and query string parameters (fmt and date), such as:

https://3snk9o61.execute-api.us-east-1.amazonaws.com/v1/br-candles?fmt=json&date=today

This code is probably overly convoluted, but it works. My problem is on this line: raw_df = wr.s3.read_csv(path1,path2, use_threads=True). The commented-out line above it is the original and works fine, but I don't want to parse the whole bucket's contents. I want the dataframe to be limited to just the specific objects taken from "object_list". The error that I get, "no files found on s3://br-candles/br4.csv", implies that it's not seeing multiple files: it only looks at the first path, when it is supposed to parse a list of files. It is probably a very simple fix, but I would appreciate any advice. Thanks.

import json
import base64
import awswrangler as wr
import boto3

def lambda_handler(event, context):
    
    s3 = boto3.client('s3')
    object_list = []
    bucket_name = event['pathParameters']['bucket']
    
    format = event['queryStringParameters']['fmt']
    day = event['queryStringParameters']['date']
    print(day)
    paginator = s3.get_paginator("list_objects_v2")
    page_iterator = paginator.paginate(Bucket=bucket_name)
    for result in page_iterator:
      object_list += filter(lambda obj: obj['Key'].endswith('.csv'), result['Contents'])
    object_list.sort(key=lambda x: x['LastModified'])
    
    A = (object_list[-1]['Key'])
    B = (object_list[-4]['Key'])
    full_path = f"s3://{bucket_name}"
    path1 = f"s3://{bucket_name}/{A}"
    path2 = f"s3://{bucket_name}/{B}"
    #raw_df = wr.s3.read_csv(path=full_path, path_suffix=['.csv'], use_threads=True)
    raw_df = wr.s3.read_csv(path1,path2, use_threads=True)
    
    for df in raw_df:
      if day == 'today':
       
        # etc. etc.; no issues below
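A side note on the listing step above: a page returned by list_objects_v2 carries no 'Contents' key when it lists no objects, so result['Contents'] can raise a KeyError on an empty bucket. The following is only a sketch of a more defensive selection, assuming the goal is the newest few .csv files (the code above picks indices -1 and -4 instead). The helper name latest_csv_keys and the sample pages are made up for illustration; in the Lambda, the pages would come from paginator.paginate(Bucket=bucket_name).

```python
from datetime import datetime, timezone
from typing import Iterable, List


def latest_csv_keys(pages: Iterable[dict], count: int) -> List[str]:
    """Return the keys of the `count` most recently modified .csv objects, oldest first."""
    objects = []
    for page in pages:
        # Pages that list no objects have no 'Contents' key, so default to [].
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                objects.append(obj)
    objects.sort(key=lambda obj: obj["LastModified"])
    # objects[-0:] would return everything, so guard against count <= 0.
    return [obj["Key"] for obj in objects[-count:]] if count > 0 else []


# Hypothetical stand-in for the pages a list_objects_v2 paginator yields.
sample_pages = [
    {"Contents": [
        {"Key": "br1.csv", "LastModified": datetime(2021, 1, 1, tzinfo=timezone.utc)},
        {"Key": "notes.txt", "LastModified": datetime(2021, 1, 5, tzinfo=timezone.utc)},
    ]},
    {},  # a page with no objects
    {"Contents": [
        {"Key": "br3.csv", "LastModified": datetime(2021, 1, 3, tzinfo=timezone.utc)},
        {"Key": "br2.csv", "LastModified": datetime(2021, 1, 2, tzinfo=timezone.utc)},
    ]},
]

print(latest_csv_keys(sample_pages, 2))  # ['br2.csv', 'br3.csv']
```

Because the helper only needs an iterable of page dictionaries, it can be tried locally without touching S3.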

There is 1 best solution below

bob On

I solved it with this syntax:

 raw_df = wr.s3.read_csv(path=[f'{full_path}/{A}', f'{full_path}/{B}'], use_threads=True)

This way, only the few objects that I want are read into the dataframe.
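As far as I can tell, the reason the two-argument call failed is that wr.s3.read_csv takes a single path argument, which can be either one string or a list of strings. A second positional argument is not treated as another file; in recent awswrangler versions it appears to land in path_suffix, which would explain why nothing matched under the first path. A minimal sketch of the list form follows, assuming awswrangler is available in the Lambda layer. The helper names build_s3_paths and read_selected_csvs, and the key br1.csv, are made up for illustration.

```python
from typing import List, Sequence


def build_s3_paths(bucket_name: str, keys: Sequence[str]) -> List[str]:
    """Turn object keys into full s3:// URIs."""
    return [f"s3://{bucket_name}/{key}" for key in keys]


def read_selected_csvs(bucket_name: str, keys: Sequence[str]):
    """Read only the listed objects; with no chunksize this should give one DataFrame."""
    # Imported lazily so build_s3_paths can be used without awswrangler installed.
    import awswrangler as wr

    # A list passed as `path` names explicit objects instead of a prefix to scan.
    return wr.s3.read_csv(path=build_s3_paths(bucket_name, keys), use_threads=True)


# Hypothetical keys; in the Lambda these would come from object_list.
paths = build_s3_paths("br-candles", ["br4.csv", "br1.csv"])
print(paths)  # ['s3://br-candles/br4.csv', 's3://br-candles/br1.csv']
```

Passing the paths by keyword (path=[...]) also avoids the positional mix-up altogether.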