I have 40+ forms and I want to make sure that common fields have the same names. This would greatly help me in storing and sorting the entries.
Apart from that, I want to make sure the formatting is consistent across all the forms. Is there a way I can extract this data from the PDF forms, preferably to an Excel file, so that I can check it and make corrections?
The properties I am looking for are:
- Field ID
- Field Name
- Field Type
- Font
- Font Size
- Font Color
- Alignment
- Multiline
- Date Format
I'd say you can definitely do part 1, extracting field names.
The easiest way is to open each form and use the 'Prepare Form' tool and 'Export Data'. You can pick a "text file", which is a tab-delimited text file you can open/import in Excel. Do that once per file, 40+ times in all.
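Once the exports exist, comparing field names across forms can be scripted instead of eyeballed. This is only a sketch: it assumes the first line of each exported text file is the tab-separated list of field names, and the function names (indexFieldNames, findNameVariants) are made up for illustration. It takes the file contents as plain strings, so reading the files from disk is left out.

```javascript
// exports: object mapping a form's file name to the text of its exported file.
// Assumption: the first line of each export holds the tab-separated field names.
function indexFieldNames(exports) {
  var index = {};
  for (var file in exports) {
    if (!Object.prototype.hasOwnProperty.call(exports, file)) continue;
    var header = exports[file].split(/\r?\n/)[0];
    var names = header.split("\t");
    for (var i = 0; i < names.length; i++) {
      var name = names[i];
      if (name === "") continue;
      if (!Object.prototype.hasOwnProperty.call(index, name)) index[name] = [];
      index[name].push(file);
    }
  }
  return index; // field name -> list of files that use it
}

// Group names that differ only in case, spaces or punctuation,
// e.g. "First Name" and "first_name", so they can be unified by hand.
function findNameVariants(index) {
  var groups = {};
  for (var name in index) {
    if (!Object.prototype.hasOwnProperty.call(index, name)) continue;
    // The "k:" prefix keeps keys clear of built-in object property names.
    var key = "k:" + name.toLowerCase().replace(/[^a-z0-9]/g, "");
    if (!groups[key]) groups[key] = [];
    groups[key].push(name);
  }
  var variants = [];
  for (var k in groups) {
    if (groups[k].length > 1) variants.push(groups[k]);
  }
  return variants;
}
```

Any group returned by findNameVariants is a candidate for renaming to one spelling.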
After that, there is JavaScript, and you can probably get form fields out, though I'm not sure. I've kind of done the opposite, using JavaScript to import data into a form, so I'm guessing the reverse is possible. I wrote a Gist on this a while back; it shows how to export form data through the UI and import data into the form through JS. It also has the link to Adobe's PDF SDK for understanding what can be done in JavaScript, and I included other links that helped me understand the JS "environment" inside of Acrobat.
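If the JavaScript route works for you, a sketch along these lines might pull most of the properties from the question. It relies on the Doc and Field members listed in Adobe's JavaScript API reference (numFields, getNthFieldName, getField, type, textFont, textSize, textColor, alignment, multiline). The function names readProp and collectFieldRows are made up for illustration. As far as I know there is no Field property that exposes the date format, so that column is left out, and the field name doubles as the ID.

```javascript
// Read one property, returning "" if it does not apply to this field type.
function readProp(field, name) {
  try {
    var value = field[name];
    if (value === undefined || value === null) return "";
    // Colors are arrays such as ["G", 0] or ["RGB", 1, 0, 0].
    return (value instanceof Array) ? value.join(" ") : String(value);
  } catch (e) {
    // Some properties may throw on field types they do not apply to.
    return "";
  }
}

// Build tab-delimited rows, one per field, with a header row first.
// doc is the Acrobat Doc object (pass `this` from the console).
function collectFieldRows(doc) {
  var props = ["type", "textFont", "textSize", "textColor", "alignment", "multiline"];
  var rows = [["name"].concat(props).join("\t")];
  for (var i = 0; i < doc.numFields; i++) {
    var name = doc.getNthFieldName(i);
    var field = doc.getField(name);
    var cells = [name];
    for (var j = 0; j < props.length; j++) {
      cells.push(readProp(field, props[j]));
    }
    rows.push(cells.join("\t"));
  }
  return rows;
}
```

In Acrobat's JavaScript console you would run console.println(collectFieldRows(this).join("\n")) and paste the output into Excel, once per form.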
After JavaScript come very custom solutions involving free and paid tools from open- and closed-source vendors. On the free-or-pay side, UniDoc has its UniPDF product. You'll have to know how to read/write Go, but it can be done, even the second part of your question, getting properties on the fields. They have a free tier that lets you process 100 documents per month before having to pay.
I made a very simple PDF with two form fields: one date, and one multiline text. I used their analysis example, pdf_all_objects.go, to get a dump of those two fields. Spotting the date format is pretty straightforward.
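Seeing whether the text field was multiline or not is harder. There's no clear word "multiline". Instead, it's a single bit in the field's "Ff" number, OR'd together with other flags. I had to save the field as multiline and non-multiline and then diff the object streams to spot the subtle difference:
"Ff": 12582912
vs
"Ff": 4096
is the difference between not multiline and multiline. 4096 is 1 << 12, which is bit 13 of the field flags, and the PDF specification defines that bit as Multiline. 12582912 (0xC00000) sets only bits 23 and 24, so the multiline bit is off.
Font, size, and color come packed into one string:
/Helv 12 Tf 0
is "12-pt Helvetica, black".
So... doable, but hard.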
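Both checks above can be automated once you have the raw values. The sketch below decodes an "Ff" number using the text-field bit positions from the PDF specification's field-flags table, and splits an appearance string into font, size, and color. The function names are made up for illustration, and the parser assumes the full string ends with a color operator such as "0 g" (gray) or "1 0 0 rg" (RGB), which is not visible in the truncated dump above.

```javascript
// Text-field flag bits from the PDF specification's field-flags table.
// Bit positions in the spec are 1-based, so bit 13 is 1 << 12.
var TEXT_FIELD_FLAGS = {
  Multiline: 1 << 12,        // bit 13
  Password: 1 << 13,         // bit 14
  DoNotSpellCheck: 1 << 22,  // bit 23
  DoNotScroll: 1 << 23,      // bit 24
  Comb: 1 << 24              // bit 25
};

// Return the names of all known flags set in an Ff value.
function decodeFieldFlags(ff) {
  var set = [];
  for (var name in TEXT_FIELD_FLAGS) {
    if ((ff & TEXT_FIELD_FLAGS[name]) !== 0) set.push(name);
  }
  return set;
}

// Parse an appearance string such as "/Helv 12 Tf 0 g" (assumed shape).
function parseDefaultAppearance(da) {
  var result = { font: null, size: null, color: null };
  var tf = /\/(\S+)\s+([\d.]+)\s+Tf/.exec(da);
  if (tf) {
    result.font = tf[1];
    result.size = parseFloat(tf[2]);
  }
  var rgb = /([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+rg\b/.exec(da);
  var gray = /([\d.]+)\s+g\b/.exec(da);
  if (rgb) {
    result.color = ["RGB", parseFloat(rgb[1]), parseFloat(rgb[2]), parseFloat(rgb[3])];
  } else if (gray) {
    result.color = ["G", parseFloat(gray[1])]; // 0 is black, 1 is white
  }
  return result;
}
```

With the two values from the diff, decodeFieldFlags(4096) reports Multiline and decodeFieldFlags(12582912) reports DoNotSpellCheck and DoNotScroll.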