AWS Textract ignores some text fragments


I have a document in which some sensitive information is masked with asterisks:

"All undisputed payments will be due ***** (term omitted) from the date of invoice."

Textract does not include these asterisks in the AnalyzeDocument response:

"page": "1",
        "text": "according to Phyto-Source's customary commercial procedures. All undisputed payments will be"
      },
      {
        "blockType": "LINE",
        "confidence": "99.86083",
        "geometry": {
          "boundingBox": {
            "width": 0.028914347,
            "height": 0.010553688,
            "left": 0.09513216,
            "top": 0.7369265
          }
        },
        "page": "1",
        "text": "due"
      },
      {
        "blockType": "LINE",
        "confidence": "99.91259",
        "geometry": {
          "boundingBox": {
            "width": 0.6898532,
            "height": 0.013386101,
            "left": 0.18501168,
            "top": 0.73675185
          }
        },
        "page": "1",
        "text": "(term omitted) from the date of invoice. The invoice date shall be the date that Forbes"
      },

I suspect Textract is smart enough to recognize that "*****" marks confidential information and deliberately hides it. Is that correct? Is there a way to get the asterisks in the response? Thanks.

I tried sending the document with different fonts and DPI settings, but it made no difference.
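
In the response above, the "due" line ends at roughly 0.124 (left + width) and the next line on the same row starts at 0.185, so the dropped asterisks leave a visible horizontal gap. As a possible workaround for at least locating such spots, here is a minimal sketch that scans LINE blocks for gaps like this. The function name `find_horizontal_gaps` and the `row_tolerance` and `min_gap` thresholds are my own assumptions, not part of Textract. The key names follow the snippet above; a raw boto3 response uses capitalized keys (`BlockType`, `Geometry`, `BoundingBox`, `Text`) and would need adapting.

```python
from typing import Dict, List, Tuple


def find_horizontal_gaps(
    blocks: List[Dict],
    row_tolerance: float = 0.005,  # assumed: max difference in "top" for the same row
    min_gap: float = 0.03,         # assumed: smallest gap (fraction of page width) to report
) -> List[Tuple[str, str, float]]:
    """Return (left_text, right_text, gap_width) for wide gaps between LINE blocks on one row."""
    lines = [b for b in blocks if b.get("blockType") == "LINE"]
    lines.sort(key=lambda b: b["geometry"]["boundingBox"]["top"])

    # Group lines into rows: a line joins the current row if its top is close
    # to the top of the first line in that row.
    rows: List[List[Dict]] = []
    for block in lines:
        top = block["geometry"]["boundingBox"]["top"]
        if rows and abs(top - rows[-1][0]["geometry"]["boundingBox"]["top"]) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])

    gaps = []
    for row in rows:
        row.sort(key=lambda b: b["geometry"]["boundingBox"]["left"])
        for left_block, right_block in zip(row, row[1:]):
            box = left_block["geometry"]["boundingBox"]
            right_edge = box["left"] + box["width"]
            gap = right_block["geometry"]["boundingBox"]["left"] - right_edge
            if gap >= min_gap:
                gaps.append((left_block["text"], right_block["text"], gap))
    return gaps


# The two LINE blocks from the response above.
sample_lines = [
    {
        "blockType": "LINE",
        "geometry": {"boundingBox": {"width": 0.028914347, "height": 0.010553688,
                                     "left": 0.09513216, "top": 0.7369265}},
        "text": "due",
    },
    {
        "blockType": "LINE",
        "geometry": {"boundingBox": {"width": 0.6898532, "height": 0.013386101,
                                     "left": 0.18501168, "top": 0.73675185}},
        "text": "(term omitted) from the date of invoice. The invoice date shall be the date that Forbes",
    },
]

gaps = find_horizontal_gaps(sample_lines)
for left_text, right_text, gap in gaps:
    print(f"gap of {gap:.3f} between {left_text!r} and {right_text[:20]!r}")
```

This does not recover the asterisks themselves; it only flags where something on the page was skipped, so that region could be cropped and re-examined separately.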
