How Do You Filter Documents by Size Before Sending to AWS Comprehend via boto3?


I'm currently attempting to use the boto3 library to perform batch sentiment analysis on a collection of documents with AWS's Comprehend service. The service limits document size (each document cannot exceed 5000 bytes), so I'm attempting to pre-filter documents before calling the boto3 API. See the code snippet below:

...
batch = []
for doc in docs:
    # Keep only non-empty strings that (supposedly) fit under the 5000-byte limit
    if isinstance(doc, str) and len(doc) > 0 and sys.getsizeof(doc) < 5000:
        batch.append(doc)

data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
...

My assumption was that filtering with sys.getsizeof would drop any strings that exceed the service's 5000-byte limit. However, I'm still receiving the following exception even with this filtering in place:

botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the BatchDetectSentiment operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 5523 bytes

Is there a more accurate way to calculate the size of a document before sending it to Comprehend, so that I avoid hitting the maximum document size limit?

1 Answer

There are 2 approaches here:

  1. As Daniel mentioned, you can use len(doc.encode('utf-8')) to determine the size of the string that Comprehend will actually see, since it counts the UTF-8 encoded bytes. By contrast, sys.getsizeof reports the in-memory size of the Python str object, which includes interpreter overhead and depends on the internal string representation, so it can be smaller than the encoded size for text containing multi-byte characters. See the sketch after the code below.

  2. You can handle the exception whenever it occurs, like this:

try:
    data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)
except self.client.exceptions.TextSizeLimitExceededException:
    # At least one document in the batch exceeded Comprehend's 5000-byte limit
    print('The batch was too long')
else:
    print(data)
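
For reference, here is a minimal sketch of the first approach, assuming docs, language, and self.client are defined as in the question; MAX_DOC_BYTES and utf8_size are illustrative names, not part of the original code:

MAX_DOC_BYTES = 5000  # Comprehend's per-document limit, in bytes (inclusive, per the error message)

def utf8_size(doc):
    # Byte count of the UTF-8 encoded text, which is what Comprehend measures
    return len(doc.encode('utf-8'))

batch = []
for doc in docs:
    if isinstance(doc, str) and 0 < utf8_size(doc) <= MAX_DOC_BYTES:
        batch.append(doc)

data = self.client.batch_detect_sentiment(TextList=batch, LanguageCode=language)

Note that BatchDetectSentiment also caps a single request at 25 documents, so you may still need to split batch into chunks of 25 before calling the API.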