How can I use Amazon Comprehend efficiently over a dataset?

I need to perform sentiment analysis over a CSV dataset using AWS Comprehend, and I would like to know the fastest way to run this analysis and save the results of every analysis in a single JSON file.

As of now, I have a server that reads each row (text) of my dataset and, for each row, triggers a Lambda function that performs the analysis on that row and sends the result back to the server. The result is then appended to a JSON list.

Server code snippet:

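    import csv
    import json

    import boto3

    # Assumed: client is a boto3 Lambda client (the original snippet does not show its creation).
    client = boto3.client('lambda')

    # tmp_csv, result_json_data and json_data are defined elsewhere in the server code.
    # Perform the analysis row by row and append the results.
    with open(tmp_csv) as csv_data:
        csv_reader = csv.DictReader(csv_data)

        for csv_row in csv_reader:
            # Synchronous invocation: blocks until the Lambda function returns.
            result = client.invoke(FunctionName='SentimentAnalysis',
                                   InvocationType='RequestResponse',
                                   Payload=json.dumps(csv_row))
            payload = json.loads(result['Payload'].read().decode())
            analysis = json.loads(payload['body'])
            result_json_data.append(analysis)
            json_data.append(csv_row)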

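Lambda function code (sentiment analysis):

import json

import boto3

# Assumed: comprehend is a boto3 Comprehend client created outside the handler.
comprehend = boto3.client('comprehend')

# Language codes this function passes to detect_sentiment; anything else falls back to "en".
SUPPORTED_LANGUAGES = ["ar", "hi", "ko", "zh-TW", "ja", "zh", "de", "pt", "en", "it", "fr", "es"]


def lambda_handler(event, context):
    text = event["text"]
    if text == "":
        text = "No text"  # placeholder, since Text must not be empty

    # Detect the dominant language first, then run sentiment analysis in that language.
    language_analysis = comprehend.detect_dominant_language(Text=text)
    language = language_analysis['Languages'][0]['LanguageCode']
    if language not in SUPPORTED_LANGUAGES:
        language = "en"
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language)

    response = {
        'Sentiment': sentiment['Sentiment'],
        'id': event['id'],
        'SentimentScore': sentiment['SentimentScore'],
    }
    json_response = json.dumps(response)
    print(json_response)
    return {
        'body': json_response
    }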

As you can see, the server sends a text to the Lambda function (client.invoke), which first detects the language of the text (detect_dominant_language) and then performs the analysis with the detect_sentiment function. It finally sends the result back to the server.

The problem with this implementation is that it takes too long to execute, because my dataset has more than a hundred thousand entries and all the analyses are performed sequentially.

What should I do? Should I keep using the invoke method inside a loop to call my Lambda function multiple times?

Maybe one solution would be to trigger a single Lambda function that reads the CSV dataset and performs the complete analysis, but in that case how could I take full advantage of the Lambda memory (approximately 10 GB)?

Thank you.

There is 1 best solution below

You have a few different options.

  1. If your use case is latency-sensitive and you need a response within seconds, you can use Comprehend's BatchDetectSentiment API. It accepts a batch of up to 25 documents per request and parallelizes them internally.

  2. If your use case is not latency-sensitive, look into the asynchronous StartSentimentDetectionJob API. It can take up to a million documents from S3 in a single request and produce a single output file. You will need some trivial pre- and post-processing for the JSON format.
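
As a rough illustration of both options, here is a minimal sketch using the boto3 calls batch_detect_sentiment and start_sentiment_detection_job. The helper names (chunks, batch_sentiment, start_async_job), the assumption that every row is in the same language, and the S3 URIs and IAM role passed in are all hypothetical and not taken from the question. Rows are assumed to be dicts with "id" and "text" keys, as in the Lambda handler above. The Comprehend client is passed in as an argument so the functions can be tried with a stub.

```python
from itertools import islice

BATCH_SIZE = 25  # BatchDetectSentiment accepts at most 25 documents per call


def chunks(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batch_sentiment(comprehend, rows, language_code="en"):
    """Option 1: analyze rows in batches of up to 25 documents.

    Assumption: every row is written in `language_code`, because the
    batch call takes a single LanguageCode for the whole batch.
    """
    results, errors = [], []
    for batch in chunks(rows, BATCH_SIZE):
        # Same empty-text placeholder as in the question's handler.
        texts = [row["text"] or "No text" for row in batch]
        response = comprehend.batch_detect_sentiment(
            TextList=texts, LanguageCode=language_code
        )
        # Each result carries the index of its document within the batch.
        for item in response["ResultList"]:
            results.append({
                "id": batch[item["Index"]]["id"],
                "Sentiment": item["Sentiment"],
                "SentimentScore": item["SentimentScore"],
            })
        for item in response.get("ErrorList", []):
            errors.append({
                "id": batch[item["Index"]]["id"],
                "ErrorCode": item["ErrorCode"],
            })
    return results, errors


def start_async_job(comprehend, input_s3_uri, output_s3_uri, role_arn,
                    language_code="en"):
    """Option 2: start an asynchronous job over a file already in S3.

    Assumption: the input file holds one document per line.
    """
    response = comprehend.start_sentiment_detection_job(
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=role_arn,
        LanguageCode=language_code,
    )
    return response["JobId"]
```

With a real client created through boto3.client("comprehend"), something like results, errors = batch_sentiment(comprehend, csv.DictReader(csv_data)) followed by json.dump(results, out_file) would write every result to a single JSON file. For a multilingual dataset, one possible approach is to detect languages first (boto3 also offers batch_detect_dominant_language) and then group rows by language before batching.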