I'm looking for either a code snippet or another solution capable of converting a high volume (thousands) of PDFs into .html or .doc while:
- maintaining hierarchical structure of headings
- capturing images in the document, uploading them to an image server, and creating an absolute link to each
- maintaining table formatting
Does such a tool exist and if so, who makes it? If not, who are some of the thought leaders in the space that I can connect with?
Check out pdftohtml (it ships with poppler-utils).
You can then add some scripting around it to do a batch conversion.
The results aren’t that great, though.
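
For the batch scripting mentioned above, here is a minimal sketch in Python, assuming the Poppler build of pdftohtml is on your PATH. The function names (`build_command`, `convert_all`) and the directory arguments are made up for illustration. It only covers the conversion loop; uploading the extracted images to an image server and rewriting the links would be a separate step that is not shown here.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdftohtml argument list for one PDF (hypothetical helper)."""
    out_base = Path(out_dir) / Path(pdf_path).stem
    # -c: "complex" output that tries to keep the page layout
    # -s: a single HTML document that includes all pages
    return ["pdftohtml", "-c", "-s", str(pdf_path), str(out_base)]


def convert_all(src_dir, out_dir, runner=subprocess.run):
    """Convert every *.pdf directly inside src_dir; return the PDFs that failed.

    `runner` defaults to subprocess.run and is only a parameter so the loop
    can be exercised without pdftohtml installed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = runner(build_command(pdf, out))
        # A non-zero exit code means pdftohtml reported a problem with this file.
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `convert_all("pdfs", "html")` would run pdftohtml once per file in `pdfs` and return the list of files that could not be converted, so a run over thousands of documents does not stop at the first bad PDF.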