pdf2htmlEX - Text in the HTML is different from the source PDF


I am using pdf2htmlEX to convert PDF files to HTML. I also extract the text from the file afterwards.

The Problem:

I came across a file whose text is unreadable in the converted HTML: https://dspace.mit.edu/openaccess-disseminate/1721.1/101159

The command I use:

pdf2htmlEX --tounicode 1 ./file.pdf

The text in the HTML has many spaces and many quotes:

[2]"M."Ha h n ,"O ."B ar bie ri,"F.P ."C a m p a na ,"R ."K öt z,"R ."G alla y,"A p p l."Ph ys ."A :"M a te r."S ci."P ro ce ss."8 2 "(2 00 6 )"

Setting other values for the --tounicode argument turns the text into gibberish.

There is an online tool that uses this library, and the HTML it produces is just fine, which suggests this is not a pdf2htmlEX bug but a configuration or version problem. It may be related to poppler or fontforge.

Versions:

pdf2htmlEX version 0.14.6
Copyright 2012-2015 Lu Wang <[email protected]> and other contributors
Libraries: 
  poppler 0.54.0
  libfontforge 20180906
  cairo 1.14.6
Default data-dir: /usr/local/share/pdf2htmlEX
Supported image format: png jpg svg

I also tried the new repository that maintains this project and got the same results; see issue: https://github.com/pdf2htmlEX/pdf2htmlEX/issues/92

Note that pdf2htmlEX uses a wide range of characters in place of spaces, such as " ' ( ) +, so replacing them all is not an option.

Is there any way to make pdf2htmlEX not use these characters?


There is 1 best solution below


I think the following two steps will work:

  1. Remove the unnecessary spaces and quotes using a regular expression.
  2. Wrap every reference in a paragraph tag, like below:
<div>
::before
<p>[2] something </p>
::after
</div>
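
The two steps could be sketched in Python roughly as follows. This is only a minimal sketch, not something pdf2htmlEX provides: the helper names clean_line and to_paragraph are made up for illustration, and it assumes that within a given line a single known character (the double quote in the sample from the question) stands in for every real space, while every literal space is spurious.

```python
import html
import re


def clean_line(text, space_char='"'):
    """Undo the garbling seen in the sample line from the question.

    Assumption (hypothetical, for illustration): every literal whitespace
    character is spurious, and `space_char` marks where a real space belongs.
    """
    # Step 1a: drop all literal whitespace.
    no_ws = re.sub(r"\s+", "", text)
    # Step 1b: turn the stand-in character back into a real space.
    spaced = no_ws.replace(space_char, " ")
    # Collapse repeated spaces and trim the ends.
    return re.sub(r" {2,}", " ", spaced).strip()


def to_paragraph(text, space_char='"'):
    """Step 2: wrap the cleaned reference in a paragraph tag."""
    return "<p>" + html.escape(clean_line(text, space_char), quote=False) + "</p>"


if __name__ == "__main__":
    sample = '[2]"M."Ha h n ,"O ."B ar bie ri,"F.P ."C a m p a na ,"R ."K öt z'
    print(to_paragraph(sample))
```

If a file mixes several stand-in characters (the question mentions ' ( ) + as well), this assumption no longer holds, because a legitimate parenthesis or quote cannot be told apart from a stand-in. In that case the sketch would need per-line handling or a different approach.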