How to highlight custom extractions using a2i's crowd-textract-analyze-document?

I would like to create a human review loop for images that have undergone OCR with Amazon Textract and entity extraction with Amazon Comprehend.

My process is:

  1. Send the image to Textract to extract the text.
  2. Send the text to Comprehend to extract entities.
  3. Find the Block IDs, in Textract's output, of the entities extracted by Comprehend.
  4. Add new Blocks of type KEY_VALUE_SET to Textract's JSON output, per the docs.
  5. Create a Human Task with the crowd-textract-analyze-document element in the template and feed it the modified Textract output.
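
Step 3 of the list above might look roughly like the following Python sketch. It assumes that the `Blocks` list from Textract contains its WORD blocks in reading order, and that the entity text returned by Comprehend splits on whitespace into exactly the same tokens as those WORD blocks. Real documents may need looser matching (case, punctuation, entities that wrap across lines). The function name `find_word_ids` is hypothetical and is not part of any AWS SDK.

```python
def find_word_ids(blocks, entity_text):
    """Return the Ids of the first run of consecutive WORD blocks whose
    Text values equal the whitespace-separated tokens of entity_text.

    Returns an empty list when no such run exists.
    """
    # Only WORD blocks carry single tokens; PAGE and LINE blocks are skipped.
    words = [b for b in blocks if b.get("BlockType") == "WORD"]
    tokens = entity_text.split()
    if not tokens:
        return []

    # Slide a window of len(tokens) over the words and compare texts exactly.
    for start in range(len(words) - len(tokens) + 1):
        window = words[start:start + len(tokens)]
        if [w.get("Text") for w in window] == tokens:
            return [w["Id"] for w in window]
    return []
```

The returned Ids are the ones a new block would list under a CHILD relationship in step 4.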

Step 5 is the part that fails: my custom entities are not rendered properly. By that I mean the entities are not highlighted on the image when I click them in the sidebar. There is no error in the browser's console.

Has anyone tried such a thing?

Sorry for not including examples. I will remove secrets/PII from my files and attach them to the question.

Best answer

I used the AWS documentation of the a2i-crowd-textract-detection human task element to generate the value of the initialValue attribute. It appears the documentation for that attribute is incorrect. While the doc shows that the value should be in the same format as the output of Textract, namely:

[
    {
        "BlockType": "KEY_VALUE_SET",
        "Confidence": 38.43309020996094,
        "Geometry": { ... },
        "Id": "8c97b240-0969-4678-834a-646c95da9cf4",
        "Relationships": [
            { "Type": "CHILD", "Ids": [...] },
            { "Type": "VALUE", "Ids": [...] }
        ],
        "EntityTypes": ["KEY"],
        "Text": "Foo bar"
    }
]

the a2i-crowd-textract-detection element actually expects the input to have lowerCamelCase attribute names (rather than UpperCamelCase). For example:

[
    {
        "blockType": "KEY_VALUE_SET",
        "confidence": 38.43309020996094,
        "geometry": { ... },
        "id": "8c97b240-0969-4678-834a-646c95da9cf4",
        "relationships": [
            { "Type": "CHILD", "ids": [...] },
            { "Type": "VALUE", "ids": [...] }
        ],
        "entityTypes": ["KEY"],
        "text": "Foo bar"
    }
]

I opened a support case with AWS about this documentation error.
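
The key renaming described above can be sketched in Python as a recursive walk that lowercases the first letter of every dictionary key. This is only a sketch under assumptions: the helper names `lower_camel` and `to_lower_camel_keys` are hypothetical, and it is assumed that nested objects (such as the contents of the geometry field) need the same treatment. Because the lowerCamelCase example above still shows "Type" capitalized inside relationships, the optional `keep` argument lets specific keys be left untouched; which variant the element accepts would need to be checked against a real task.

```python
def lower_camel(key):
    """Lowercase only the first character: 'BlockType' -> 'blockType'."""
    return key[:1].lower() + key[1:]


def to_lower_camel_keys(node, keep=frozenset()):
    """Recursively rename dict keys from UpperCamelCase to lowerCamelCase.

    Keys listed in `keep` are copied unchanged. Values are never modified,
    so strings such as "KEY_VALUE_SET" or "CHILD" stay as they are.
    """
    if isinstance(node, dict):
        return {
            (k if k in keep else lower_camel(k)): to_lower_camel_keys(v, keep)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [to_lower_camel_keys(item, keep) for item in node]
    return node  # numbers, strings, booleans, None
```

For example, `json.dumps(to_lower_camel_keys(blocks, keep={"Type"}))` (with `import json`, and `blocks` being the modified Textract block list) would produce a string shaped like the second sample above, which could then be passed to the task template as the initial value.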