For my project, I am given large PDFs and currently have to extract one specific value (the commission) manually. I am looking for any machine learning or AI model that could automate this process. The structure of the PDFs varies, so ideally the model would scan a PDF and return the commission percentage regardless of layout. For example, the value can appear in any of these forms:
Commission Rate = 20%
The commission rate for this transaction is 20%.
Premium Commission Net
50000 20% 40000
I think your case is quite specific and you will be hard pressed to find a model that does exactly what you want without prior work. In my opinion you should perform the following tasks:
Annotate a representative sample of your dataset that covers the different PDF layouts.
Run an OCR tool (for example pytesseract) on the documents, then use regexes to locate the desired value in the extracted text. Try this approach on a portion of the annotated set.
Finally, test on the rest of the annotated data to evaluate your approach.
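To make the regex and evaluation steps concrete, here is a minimal sketch that works on text that has already been extracted (for instance with `pytesseract.image_to_string` on page images). The function names `extract_commission` and `accuracy` and the two patterns are my own assumptions for illustration; they only cover the three forms shown in the question, and real documents will probably need more patterns.

```python
import re

# A percentage following the word "commission" on the same line,
# e.g. "Commission Rate = 20%" or "The commission rate ... is 20%."
INLINE = re.compile(r"commission[^\n%]*?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE)
# A table cell that is nothing but a percentage, e.g. "20%".
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def extract_commission(text):
    """Return the commission percent as a float, or None if not found."""
    match = INLINE.search(text)
    if match:
        return float(match.group(1))

    # Fallback for whitespace-separated tables: find a header line with a
    # "Commission" column and read the same column from the next line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for header, row in zip(lines, lines[1:]):
        columns = header.lower().split()
        cells = row.split()
        if "commission" in columns and len(cells) == len(columns):
            cell = PERCENT.fullmatch(cells[columns.index("commission")])
            if cell:
                return float(cell.group(1))
    return None


def accuracy(annotated):
    """Fraction of (text, expected_percent) pairs extracted correctly."""
    hits = sum(extract_commission(text) == expected for text, expected in annotated)
    return hits / len(annotated)


# Hypothetical annotated samples, taken from the forms in the question.
samples = [
    ("Commission Rate = 20%", 20.0),
    ("The commission rate for this transaction is 20%.", 20.0),
    ("Premium Commission Net\n50000 20% 40000", 20.0),
]
print(accuracy(samples))
```

The table fallback assumes single-word column headers and that OCR keeps each row on one line, which will not always hold. Running `accuracy` on the held-out part of the annotated set shows which layouts still need their own pattern.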