I'm using pdf2htmlEX to convert a PDF to HTML, and the output displays correctly when it's generated locally on a Mac, but not when it's generated in production on Amazon Linux. Multiple pages have this issue, but I'll use page 22 of this PDF as a specific example.
For the incorrect html output (generated on linux):
- while certain text is not visible when it's rendered in the browser, the correct text is in the underlying html upon inspection with chrome dev tools
- which is caused by the element's CSS visibility property (specified by class name ff13) being set to hidden, where in the correct conversion it is set to visible
- and I can see in dev tools, under the CSS styles computed tab for rendered fonts, that the correct font is DejaVu Sans and the incorrect font is Helvetica
I checked and confirmed that DejaVuSans.ttf (and other DejaVu fonts) is installed on the Linux machine at /usr/share/fonts/dejavu/, so my best guess is that for some reason the pdf2htmlEX program can't find the font file when it does the conversion, so it marks the CSS visibility property as hidden. I also tried installing the core Mac (source here) and Microsoft fonts, rebooting the machine, and trying again, but it didn't seem to help.
Does anyone know either how to fix this or troubleshoot from here? Thanks in advance for any help!
You need to ensure the font files for all unembedded PDF fonts are in the fontconfig path. You can see the path list in the fontconfig config file (usually /etc/fonts/fonts.conf). Look at the top of this file for the list of directories. If your font file is not in one of these, it will not be found.
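If reading the config file by eye is awkward, the directory list can also be pulled out with a few lines of Python. This is only a rough sketch: it assumes the config lives at /etc/fonts/fonts.conf (the path varies by distribution), and the helper names `dirs_in` and `font_dirs_from_file` are made up for illustration. It only reads the `<dir>` entries from that one file and does not follow any included config files.

```python
import xml.etree.ElementTree as ET


def dirs_in(root):
    """Return the text of every <dir> element under a parsed fontconfig root."""
    return [d.text.strip() for d in root.iter("dir") if d.text and d.text.strip()]


def font_dirs_from_file(path="/etc/fonts/fonts.conf"):
    """Parse a fontconfig file and list its <dir> entries.

    The default path is an assumption; adjust it for your system.
    """
    return dirs_in(ET.parse(path).getroot())


# Example usage (run on the machine that does the conversion):
#   for d in font_dirs_from_file():
#       print(d)
```

Some `<dir>` entries carry a `prefix` attribute and are relative rather than absolute paths, so treat the printed values as a starting point for checking, not a complete resolution of the search path.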
In your case I would move the font files into /usr/share/fonts rather than leaving them in a subdirectory.
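To check whether the move helped, you can ask fontconfig which font it actually resolves for a given name, using the `fc-match` tool that ships with fontconfig. The sketch below assumes `fc-match` is on the PATH of the production machine; `parse_match` and `resolve_font` are hypothetical helper names. If the returned family is not DejaVu Sans, fontconfig is still substituting another font.

```python
import subprocess


def parse_match(output):
    """Split 'family|file' text into a (family, file) tuple."""
    family, _, path = output.strip().partition("|")
    return family, path


def resolve_font(pattern):
    """Ask fontconfig which font it would pick for a pattern.

    Assumes the fc-match command-line tool is installed.
    """
    result = subprocess.run(
        ["fc-match", "-f", "%{family}|%{file}", pattern],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_match(result.stdout)


# Example usage:
#   family, path = resolve_font("DejaVu Sans")
#   print(family, path)
```

Running `fc-cache -f` after moving font files may be needed so fontconfig picks up the change before you check again.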