How to extract field names and properties from a PDF form?

I have 40+ forms and I want to make sure that common fields have the same names. This would greatly help me in storing and sorting the entries.

Apart from that, I want to make sure the formatting is consistent across all the forms. Is there a way I can extract this data from the PDF forms, preferably to an Excel file, so that I can check it and make corrections?

The properties I am looking for are:

  1. Field ID
  2. Field Name
  3. Field Type
  4. Font
  5. Font Size
  6. Font Color
  7. Alignment
  8. Multiline
  9. Date Format

There are 2 best solutions below

0
On

I'd say you can definitely do part 1, extracting field names.

The easiest way is to open each form in Acrobat and use the 'Prepare Form' tool's 'Export Data' option. You can pick "text file", which is a tab-delimited text file you can open/import in Excel. Do that once for each of the 40+ files.
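Once every form has been exported, the field names can be compared across forms to spot names that only appear in some of them. This is a minimal sketch, assuming each export is tab-delimited text whose first row holds the field names (check one of your own exports first). The function name and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def fields_by_form(exports):
    """Map each field name to the sorted list of forms that contain it.

    `exports` maps a form label to the text of its tab-delimited export.
    Assumption: the first row of each export holds the field names.
    """
    seen = defaultdict(set)
    for form, text in exports.items():
        rows = csv.reader(io.StringIO(text), delimiter="\t")
        header = next(rows, [])
        for name in header:
            seen[name.strip()].add(form)
    return {name: sorted(forms) for name, forms in seen.items()}


# Hypothetical exports from two forms; note the inconsistent name spelling.
exports = {
    "form_a": "First Name\tDate\nAnn\t01/02/2024\n",
    "form_b": "FirstName\tDate\nBob\t03/04/2024\n",
}
usage = fields_by_form(exports)
for name in sorted(usage):
    print(name, usage[name])
```

Any name that is listed for only a few forms ("First Name" vs "FirstName" here) is a candidate for renaming. For real files, the export text would be read from disk instead of the inline strings.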

After that, there is JavaScript, and you can probably get form fields out that way, though I'm not sure. I've kind of done the opposite, using JavaScript to import data into a form, so I'm guessing the reverse is possible. I wrote a Gist on this a while back; it shows how to export form data through the UI and import data into the form through JS. It also has the link to Adobe's PDF SDK for understanding what can be done in JavaScript, and I included other links that helped me understand the JS "environment" inside of Acrobat.

After JavaScript come very custom solutions involving free and paid tools from open- and closed-source vendors. On the free-or-pay side, UniDoc has its UniPDF product. You'll have to know how to read/write Go, but it can be done, even the second part of your question, getting the properties of the fields. They have a free tier that lets you process 100 documents per month before having to pay.

I made a very simple PDF with two form fields: one date, and one multiline text. I used their analysis example, pdf_all_objects.go, to get a dump of those two fields. Spotting the date format is pretty straightforward:

.../Subtype/Widget/T(My_date_field)/Type/Annot>><</JS(AFDate_FormatEx\("mm/dd/yyyy"\);)/S/JavaScript>><</JS(AFDate_KeystrokeEx\("mm/dd/yyyy"\);)/S/JavaScript>>
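If the dump is saved as text, the format string can be pulled out with a regular expression. A rough sketch, assuming the format action always calls AFDate_FormatEx with a double-quoted argument as in the dump above; the helper name is hypothetical.

```python
import re

# Matches AFDate_FormatEx("...") with or without the backslash that escapes
# parentheses inside a PDF literal string.
DATE_FORMAT_RE = re.compile(r'AFDate_FormatEx\\?\(\s*"([^"]*)"')


def date_format(action_text):
    """Return the date format named in a field's format action, or None."""
    match = DATE_FORMAT_RE.search(action_text)
    return match.group(1) if match else None


dump = r'<</JS(AFDate_FormatEx\("mm/dd/yyyy"\);)/S/JavaScript>>'
print(date_format(dump))  # mm/dd/yyyy
```

Fields without a date format action simply yield None, which makes it easy to list the date fields per form and compare their formats.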

Seeing whether the text field was multiline or not is harder. There's no clear word "multiline". Instead, it's a bitwise value that's OR'd together with other values. I had to save the field as multiline and non-multiline and then diff the object streams to spot the subtle difference:

@@ -150,7 +150,7 @@ Decoded:
 =========================================================
  26: 27 0 *core.PdfIndirectObject
 *core.PdfObjectDictionary
-Dict("AA": Dict(), "DA": /Helv 12 Tf 0 g, "F": 4, "FT": Tx, "Ff": 4096, "MK": Dict(), "P": Ref(17 0), "Rect": [159.063000, 604.518000, 309.063000, 626.518000], "Subtype": Widget, "T": My_multiline, "Type": Annot, )
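+Dict("AA": Dict(), "DA": /Helv 12 Tf 0 g, "F": 4, "FT": Tx, "Ff": 12582912, "MK": Dict(), "P": Ref(17 0), "Rect": [159.063000, 604.518000, 309.063000, 626.518000], "Subtype": Widget, "T": My_multiline, "Type": Annot, )

"Ff": 4096 vs "Ff": 12582912 is the difference between multiline and not. 4096 is 1 << 12, which is bit 13 of the field flags, the Multiline flag in the PDF specification. 12582912 has bits 23 and 24 set (DoNotSpellCheck and DoNotScroll) but not bit 13, so that is the non-multiline version.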

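/Helv 12 Tf 0 g is the default appearance (DA) string, meaning "12-pt Helvetica, black": "/Helv 12 Tf" selects the font and size, and "0 g" sets the fill colour to gray level 0, i.e. black.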
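Both values can be decoded mechanically instead of by diffing. The following is a sketch only: the flag table lists the text-field flag bits from the PDF specification (other field types use some of the same bits differently), and the DA parser handles just the Tf, g, rg and k operators. Function names are made up for illustration.

```python
# Text-field flag bits (1-based positions) as listed in the PDF specification.
TEXT_FIELD_FLAGS = {
    1: "ReadOnly",
    2: "Required",
    3: "NoExport",
    13: "Multiline",
    14: "Password",
    21: "FileSelect",
    23: "DoNotSpellCheck",
    24: "DoNotScroll",
    25: "Comb",
    26: "RichText",
}

# Colour operators that may appear in a DA string: name -> (space, operand count).
COLOR_OPERATORS = {"g": ("gray", 1), "rg": ("rgb", 3), "k": ("cmyk", 4)}


def decode_ff(ff):
    """Return the names of the flags set in a text field's Ff value."""
    return [
        name
        for bit, name in sorted(TEXT_FIELD_FLAGS.items())
        if ff & (1 << (bit - 1))
    ]


def parse_da(da):
    """Split a default appearance string into font, size and colour."""
    info = {"font": None, "size": None, "color": None}
    operands = []
    for token in da.split():
        if token == "Tf" and len(operands) >= 2:
            info["font"] = operands[-2].lstrip("/")
            info["size"] = float(operands[-1])
            operands = []
        elif token in COLOR_OPERATORS and len(operands) >= COLOR_OPERATORS[token][1]:
            space, count = COLOR_OPERATORS[token]
            info["color"] = (space, tuple(float(v) for v in operands[-count:]))
            operands = []
        else:
            operands.append(token)
    return info


print(decode_ff(4096))      # ['Multiline']
print(decode_ff(12582912))  # ['DoNotSpellCheck', 'DoNotScroll']
print(parse_da("/Helv 12 Tf 0 g"))
```

Note that "Helv" is a font resource name rather than the font itself; it usually maps to Helvetica, but the real mapping lives in the form's resource dictionary.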

So... doable, but hard.

1
On

The Forms Data Format file (FDF, or its XML variant XFDF) is easily exported from the PDF, and a blank one can be filled in and imported back into a PDF to populate the fields automatically. Like a PDF, it can contain binary media, but it is predominantly text-based and thus easy to parse.

It could look something like this, so it is very easy to import into other applications:

%FDF-1.4
%âãÏÓ
1 0 obj
<<
/FDF <<
/F (BrunnoFormExample.pdf)
/Fields [<<
/T (Address 1 Text Box)
/V (questions)
>> <<
/T (Address 2 Text Box)
/V (stackoverflow.com)
>> <<
/T (City Text Box)
/V (My Capitol @ WWW)
>> <<
/T (Country Combo Box)
/V (Austria)
>> <<
/T (Driving License Check Box)
/V /Yes
>> <<
/T (Family Name Text Box)
/V (Miranda Marques)
>> <<
/T (Favourite Colour List Box)
/V (Violet)
>> <<
/T (Gender List Box)
/V (Man)
>> <<
/T (Given Name Text Box)
/V (Brunno)
>> <<
/T (Height Formatted Field)
/V (150)
>> <<
/T (House nr Text Box)
/V (70970330)
>> <<
/T (Language 1 Check Box)
/V /Off
>> <<
/T (Language 2 Check Box)
/V /Yes
>> <<
/T (Language 3 Check Box)
/V /Off
>> <<
/T (Language 4 Check Box)
/V /Yes
>> <<
/T (Language 5 Check Box)
/V /Off
>>
<<
/T (Postcode Text Box)
/V (HTTPS2)
>>]
/ID [<5E0A553555622A0516E9877CA55217A6> <90A86CDE1915E44BE48046FECF63C769>]
/UF (BrunnoFormExample.pdf)
>>
/Type /Catalog
>>
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF
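Because the field entries are plain /T (name) and /V (value) pairs, a few lines of code can turn an FDF like the one above into CSV for Excel. This is a simplified sketch: it assumes flat fields as shown, and it does not handle escaped parentheses inside strings or nested /Kids. The function names are hypothetical.

```python
import csv
import io
import re

# A /T (name) followed by a /V that is either a (string) or a /Name.
FIELD_RE = re.compile(r"/T\s*\((.*?)\)\s*/V\s*(?:\((.*?)\)|/(\w+))", re.S)


def fdf_fields(fdf_text):
    """Return (field name, value) pairs found in FDF text."""
    pairs = []
    for name, text_value, name_value in FIELD_RE.findall(fdf_text):
        pairs.append((name, text_value or name_value))
    return pairs


def to_csv(pairs):
    """Render the pairs as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "value"])
    writer.writerows(pairs)
    return out.getvalue()


sample = """/Fields [<<
/T (Address 1 Text Box)
/V (questions)
>> <<
/T (Driving License Check Box)
/V /Yes
>>]"""
pairs = fdf_fields(sample)
print(to_csv(pairs))
```

When reading a real .fdf file, decoding it as latin-1 is a reasonable starting point, since the header contains high-bit bytes that are not valid UTF-8.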

However, the FDF holds only the field data, so you can't read or change style or colour there; those are part of the PDF page data. To change a page's content you need a full-blown editor or API with analysis and modification abilities, rather than an import/export function. There are many GUI PDF editors with API or SDK abilities, such as Foxit Phantom on Windows, but you should pick whatever suits your platform or programming language, for example iText, Aspose, or Spire, depending on the language you work in.