Can Amazon Comprehend extract and categorize data from classifieds?

I have a large dataset from which I would like to extract and categorize specific elements. Below is a typical example:

[Image: example entry from the dataset (not available)]

I would like to know whether this is possible using Amazon Comprehend, or whether there are better tools for the job. I am not a developer and am looking to hire someone to program this for me, but I would like to understand conceptually whether something like this is feasible before I do.

There is 1 best solution below

Yes, this is feasible. Amazon Comprehend can extract and categorize text from your documents using its Custom Entity Recognition feature.

To do this, you provide annotated training data as input. You can use Ground Truth in Amazon SageMaker to create the annotations and pass the Ground Truth output directly to a Comprehend entity recognizer training job. Alternatively, you can provide your own annotations file for the training job: https://docs.aws.amazon.com/comprehend/latest/dg/API_EntityRecognizerInputDataConfig.html

The relevant Amazon Comprehend APIs are:

  1. Training - https://docs.aws.amazon.com/comprehend/latest/dg/API_CreateEntityRecognizer.html
  2. Async Inference - https://docs.aws.amazon.com/comprehend/latest/dg/API_StartEntitiesDetectionJob.html OR Sync Inference Over Custom Endpoint - https://docs.aws.amazon.com/comprehend/latest/dg/API_CreateEndpoint.html
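
As a rough illustration of how the training and synchronous inference calls fit together, here is a minimal Python sketch using boto3. The recognizer name, IAM role ARN, and S3 locations are placeholders you would supply yourself, and the entity type names are copied from the annotation example in this answer (check the Comprehend documentation for naming constraints on entity types). The request builder is separated from the AWS calls so that it can be inspected without credentials.

```python
# Entity types taken from the annotation example in this answer.
ENTITY_TYPES = ["Width", "Ratio", "Diameter", "Brand",
                "Quantity", "Price", "Condition", "Season"]


def build_training_request(name, role_arn, docs_s3_uri, annotations_s3_uri):
    """Build keyword arguments for CreateEntityRecognizer (no AWS call is made).

    All four arguments are placeholders chosen by the caller.
    """
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",  # assumption: the listings are in English
        "InputDataConfig": {
            "EntityTypes": [{"Type": t} for t in ENTITY_TYPES],
            "Documents": {"S3Uri": docs_s3_uri},
            "Annotations": {"S3Uri": annotations_s3_uri},
        },
    }


def train_recognizer(request):
    """Submit the training job. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it
    client = boto3.client("comprehend")
    response = client.create_entity_recognizer(**request)
    return response["EntityRecognizerArn"]


def detect_with_endpoint(text, endpoint_arn):
    """Run synchronous detection against an endpoint that hosts the trained model."""
    import boto3
    client = boto3.client("comprehend")
    response = client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [(entity["Type"], entity["Text"]) for entity in response["Entities"]]
```

Training is asynchronous, so in practice you would wait for the recognizer to finish training before creating an endpoint for it or starting a batch detection job.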

Here is a detailed example of how to train custom entity recognizers with Amazon Comprehend - https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html

An example annotations file for this use case:

File,Line,Begin Offset,End Offset,Type
doc1,3,0,2,Width
doc1,3,5,6,Ratio
doc1,3,9,10,Diameter
doc1,0,12,20,Brand
doc1,0,6,6,Quantity
doc1,6,8,10,Price
doc1,1,20,22,Condition
doc1,0,42,48,Season
doc2,0,45,48,Quantity
doc2,1,78,79,Price

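The files doc1 and doc2 should contain the text that you want to extract entities from. Each row of the annotations file names a document, the line in that document where an entity appears, the character offsets of the entity within that line, and the entity's type.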
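
To make the offsets concrete, here is a minimal Python sketch of how such rows could be computed from a document's text. The helper names `find_annotation` and `to_csv` and the sample listing text are invented for illustration. The sketch assumes zero-based line numbers and an end offset that is exclusive (Python slice convention); verify both against the Comprehend annotations documentation before preparing real training data.

```python
import csv
import io

FIELDS = ["File", "Line", "Begin Offset", "End Offset", "Type"]


def find_annotation(file_name, text, entity_text, entity_type):
    """Return one annotation row for the first line that contains entity_text.

    Assumes zero-based line numbers and an exclusive end offset.
    """
    for line_no, line in enumerate(text.split("\n")):
        begin = line.find(entity_text)
        if begin != -1:
            return {
                "File": file_name,
                "Line": line_no,
                "Begin Offset": begin,
                "End Offset": begin + len(entity_text),
                "Type": entity_type,
            }
    raise ValueError(f"{entity_text!r} not found in {file_name}")


def to_csv(rows):
    """Serialize annotation rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical listing text, made up for this example.
sample = "Set of 4 Michelin winter tyres\n205/55 R16, used, 200 EUR"
rows = [
    find_annotation("doc1", sample, "Michelin", "Brand"),
    find_annotation("doc1", sample, "205", "Width"),
]
print(to_csv(rows))
```

In a real project the entity positions would come from human annotation (for example via Ground Truth) rather than from string search; the search here only serves to show how line numbers and offsets relate to the text.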