Moving away from simple regex extraction to NER?


We have a relatively "simple" project from the business: digitize some scanned contracts (PDF files) with OCR and extract entities from the text.

Entities can be something as simple as a specific price located in a certain subsection of the contract, or a generic definition of a process which can be found e.g. somewhere around section 5. For the same entity, different formulations and languages are used interchangeably in different contracts.

We have a limited amount of examples (10 to 20 per entity) to develop the extraction algorithm.

Given the specific nature of every entity, for the moment we have created many functions that act on the strings extracted from the PDFs by amazon-textract, applying regex rules plus some additional tinkering with the results to get what we need.
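To make the setup concrete, here is a minimal sketch of what one such extraction function could look like (the entity, the pattern, and the currencies are made up for illustration, not taken from our codebase):

```python
import re
from typing import Optional

# Hypothetical pattern: a contract price such as "EUR 12,500.00"
PRICE_RE = re.compile(r"(?:EUR|USD|€|\$)\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

def extract_price(text: str) -> Optional[str]:
    """Return the first price found in the OCR'd text, or None if absent."""
    match = PRICE_RE.search(text)
    return match.group(0) if match else None
```

Each entity currently has its own function in this spirit, with extra post-processing bolted on.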

This is the best solution so far for immediate results, but it's quite hard to modify when something is not working. Furthermore, only someone with knowledge of the code can improve the results, essentially by introducing a new alternation (`|`) into the regex rules. This is still quite annoying, because we have to go back to the code and find where things break. Of course, this is far from ideal.
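One incremental improvement that keeps the regex approach but reduces the maintenance pain is to move the patterns out of the logic into a data structure (or a JSON/YAML file), so adding a new formulation means editing data rather than code. A hypothetical sketch (entity name and patterns invented for illustration):

```python
import re
from typing import Optional

# Hypothetical layout: one list of alternative patterns per entity, kept as
# data so that a new formulation is a new list entry, not a code change.
ENTITY_PATTERNS = {
    "notice_period": [
        r"notice period of (\d+) (?:days|months)",
        r"upon (\d+) (?:days|months)'? (?:prior )?notice",
    ],
}

def extract_entity(entity: str, text: str) -> Optional[str]:
    """Try each alternative pattern in order; return the first capture or None."""
    for pattern in ENTITY_PATTERNS[entity]:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None
```

With this layout, a failing contract only requires appending one more pattern string to the relevant list, which is easier to review and test than an ever-growing regex.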

I thought about using a Named Entity Recognition (NER) model trained on input from users who could highlight the entities directly in the text, but given the limited training set, is such a method even feasible? I'm under the impression that a consistent model needs at least 100 examples per entity.
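Checking whether 10–20 highlighted examples per entity get anywhere is cheap to try. A minimal spaCy training loop over user-style annotations (the texts, offsets, and the `PRICE` label below are invented) would look roughly like this:

```python
import random
import spacy
from spacy.training import Example

# Hypothetical training data: (text, character-offset spans) of the kind a
# user could produce by highlighting entities directly in the OCR'd text.
TRAIN_DATA = [
    ("The annual fee shall be EUR 12,500.00.", {"entities": [(24, 37, "PRICE")]}),
    ("A price of EUR 800.00 applies per unit.", {"entities": [(11, 21, "PRICE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(20):  # a few epochs are enough for a quick feasibility check
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
```

A model trained on so few examples will overfit, but evaluating it on a handful of held-out contracts at least gives a signal on whether collecting more annotations is worth the effort.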

Is there any cleverer alternative to using just regex? Or, in general, how do you think our pipeline could be improved?


1 Answer


Caveat - Hacky way!!

1. Duplicate the dataset and annotate until you reach 100 examples, since that's the minimum for AWS.
2. Create a CSV file and feed it to Textract.
3. Train the model.
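Assuming the answer means AWS Comprehend custom entity recognition (Textract itself does OCR, not NER training), the annotation file it expects is a CSV with `File`, `Line`, `Begin Offset`, `End Offset`, and `Type` columns. A sketch of the "duplicate until you hit the minimum" trick (file names, offsets, and the exact minimum are assumptions; check the current AWS docs):

```python
import csv

# Hypothetical annotations in the column layout AWS Comprehend custom entity
# recognition expects: File, Line, Begin Offset, End Offset, Type.
annotations = [
    ("contracts.txt", 0, 24, 37, "PRICE"),
    ("contracts.txt", 3, 11, 21, "PRICE"),
]

MIN_ANNOTATIONS = 100  # assumed AWS minimum per entity type; verify in the docs

# The "hacky" part: repeat the same rows until the minimum is reached.
rows = []
while len(rows) < MIN_ANNOTATIONS:
    rows.extend(annotations)

with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerows(rows)
```

Be aware that duplicating identical annotations satisfies the count check without giving the model any new information, so results from this shortcut should be treated with caution.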