I would like to parse form fields, for example checkboxes, from signed PDFs. I have already tried several Python libraries (PyPDF2, pikepdf and even pdfminer), but I only get the text out, not the form fields. I'm already thinking about trying OCR, but it seems very complicated to me and there might be an easier way.
Does anyone have an idea how I can parse the form fields out of a signed PDF?
Thanks in advance!
Disclaimer: I am the author of borb, the library used in this answer.
It's unclear precisely what you want: to read the values of the form fields, or to apply OCR to the PDF. Either option is possible using borb.
If you want to extract information from the form fields, I would recommend you look at section 4.4 of the examples repository. I'll post the example here for the sake of completeness.
This example reads an input PDF, and then fetches the values of the form fields.
You can also do more low-level manipulations: borb represents the PDF as a JSON-like data structure (nested arrays, dictionaries and primitives), so you can get the information relatively easily.
If you want to apply OCR to a PDF, I would recommend yet another example in the examples repository, this time in section 7.2.
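To give a rough idea of what the low-level route involves, here is a minimal sketch of walking a form field tree once the PDF has been parsed into nested dictionaries and lists. This is not borb's API: plain Python dicts stand in for the parsed PDF objects, and the `collect_fields` helper and the sample data are made up for illustration. The key names (`Fields`, `Kids`, `T`, `FT`, `V`) follow the interactive-form entries in the PDF specification, where a checkbox is a `Btn` field whose value is the name of its on state (often `Yes`) or `Off`.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

PdfDict = Dict[str, Any]


def collect_fields(
    fields: List[PdfDict],
    parent_name: str = "",
    inherited_type: Optional[str] = None,
    inherited_value: Any = None,
) -> Iterator[Tuple[str, Optional[str], Any]]:
    """Yield (fully qualified name, field type, value) for each terminal field."""
    for field in fields:
        partial = field.get("T")
        # Fully qualified names join the partial names with periods.
        if parent_name and partial:
            name = f"{parent_name}.{partial}"
        else:
            name = partial or parent_name
        # FT and V are inheritable: fall back to the parent's entry if absent.
        field_type = field.get("FT", inherited_type)
        value = field.get("V", inherited_value)
        # Kids with a T entry are child fields; kids without one are widgets.
        child_fields = [kid for kid in field.get("Kids", []) if "T" in kid]
        if child_fields:
            yield from collect_fields(child_fields, name, field_type, value)
        else:
            yield name, field_type, value


# Hypothetical stand-in for the AcroForm dictionary of a parsed, signed PDF.
acro_form: PdfDict = {
    "Fields": [
        {"T": "name", "FT": "Tx", "V": "Jane Doe"},
        {"T": "newsletter", "FT": "Btn", "V": "Yes"},  # a ticked checkbox
        {
            "T": "address",
            "Kids": [
                {"T": "city", "FT": "Tx", "V": "Ghent"},
                {"T": "country", "FT": "Ch", "V": "Belgium"},
            ],
        },
        # The signature is itself a form field (type Sig) with a widget kid.
        {"T": "signature", "FT": "Sig", "Kids": [{"Subtype": "Widget"}]},
    ]
}

form_values = {
    name: (field_type, value)
    for name, field_type, value in collect_fields(acro_form["Fields"])
}

for name, (field_type, value) in form_values.items():
    print(name, field_type, value)
```

With a real library you would replace `acro_form` with the AcroForm dictionary found under the document catalog; how you reach it depends on the library and version you use.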