I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML that is produced when you display a PDF inside an iframe.
I tried several things so far:
- the `pdfminer.six` library, which produced messy HTML
- grabbing the HTML produced when rendering a PDF with pdf.js, which is apparently hidden in a Shadow DOM with no access to its inner HTML
- finally, I came across `pdf2htmlEX` (https://github.com/pdf2htmlEX/pdf2htmlEX), which produced exactly what I wanted.
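For reference, here is a minimal sketch of how the `pdf2htmlEX` command-line tool could be called from Python code such as a Flask view. It assumes the `pdf2htmlEX` binary is installed and on the `PATH`. The helper names `build_pdf2htmlex_command` and `convert_pdf_to_html` are hypothetical, and the 120-second timeout is an arbitrary choice.

```python
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, out_dir, binary="pdf2htmlEX"):
    """Return the argument list for converting one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    # Name the output after the input file, e.g. report.pdf -> report.html.
    out_name = pdf_path.with_suffix(".html").name
    # --dest-dir selects the output directory; the positional arguments
    # are the input PDF followed by the output file name.
    return [binary, "--dest-dir", str(out_dir), str(pdf_path), out_name]


def convert_pdf_to_html(pdf_path, out_dir, timeout=120):
    """Run pdf2htmlEX and return the path of the generated HTML file.

    Assumes the pdf2htmlEX binary is installed and on PATH.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_pdf2htmlex_command(pdf_path, out_dir)
    # check=True raises CalledProcessError if the converter exits non-zero;
    # the timeout guards against a PDF that makes the converter hang.
    subprocess.run(cmd, check=True, timeout=timeout)
    return out_dir / cmd[-1]
```

Passing the arguments as a list (rather than a shell string) avoids shell-quoting problems with user-supplied file names.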
Locally, this solution worked great, but in production (Heroku) I was unable to install it correctly. The project is deprecated and the documentation is limited and terrible. The problem has something to do with broken dependencies.
So, how can I convert PDFs to HTML effectively, without losing any formatting, using Python or any other tool?
Thanks a lot.
If anyone is willing to help me get `pdf2htmlEX` to work on Heroku, leave a comment and I will post more details in a separate post.
This is not going to be trivial. But I'll give some pointers.
You need an `app.json` in which you define your buildpacks: https://devcenter.heroku.com/articles/app-json-schema#buildpacks
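As a rough illustration of what such an `app.json` might contain, the following Python sketch builds one and prints it as JSON. The app name is a placeholder, and the choice of the Apt and Python buildpacks is an assumption based on this being a Flask app that needs an extra system package. The Apt buildpack is listed first so that its packages are installed before the Python buildpack runs.

```python
import json


def build_app_json(name):
    """Return a minimal app.json structure as a dict (illustrative only)."""
    return {
        "name": name,
        "buildpacks": [
            # Assumed buildpacks: Apt for system packages, then Python for the app.
            {"url": "https://github.com/heroku/heroku-buildpack-apt"},
            {"url": "https://github.com/heroku/heroku-buildpack-python"},
        ],
    }


def render_app_json(name):
    """Serialize the structure as indented JSON text."""
    return json.dumps(build_app_json(name), indent=2) + "\n"


if __name__ == "__main__":
    # "pdf-to-html-demo" is a placeholder app name.
    print(render_app_json("pdf-to-html-demo"), end="")
```

Saving the printed output as `app.json` in the repository root gives the file the paragraph above refers to.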
If this project is available via `apt`, it's going to be easy. You just use Heroku's Apt buildpack and define an `Aptfile` that lists the packages it needs to install. The buildpack then installs them automatically and you are done.
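To make the `Aptfile` idea concrete: it is a plain text file with one package name per line. The sketch below renders such a file and adds a small check that the app could run at startup to confirm the binary actually got installed. The package name `pdf2htmlex` is hypothetical, since whether the Heroku stack's apt repositories provide it is exactly what needs verifying. Both helper names are made up for this example.

```python
import shutil


def render_aptfile(packages):
    """Return Aptfile contents: one package name per line, duplicates dropped."""
    seen = []
    for pkg in packages:
        pkg = pkg.strip()
        if pkg and pkg not in seen:
            seen.append(pkg)
    return "\n".join(seen) + "\n"


def binary_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # "pdf2htmlex" is a hypothetical package name; check that it exists
    # in the apt repositories of the stack before relying on it.
    print(render_aptfile(["pdf2htmlex"]), end="")
    print("pdf2htmlEX on PATH:", binary_available("pdf2htmlEX"))
```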
If it is not available as a package you will need to create your own buildpack.
https://devcenter.heroku.com/articles/buildpack-api
Example used here.
Another solution is to dockerize your project and run it as a Docker container.