I am using PDFDomTree with pdfbox-2.0.9 in my java application to convert a pdf file to html file. Following code I have used to convert a pdf.
try {
PDDocument document = PDDocument.load(new File("some path"));
PDFDomTree parser = new PDFDomTree(PDFDomTreeConfig.createDefaultConfig());
Writer output = new PrintWriter(new File("some output path"), "utf-8");
parser.writeText(document, output);
output.close();
document.close();
} catch (IOException | ParserConfigurationException e) {
throw e;
}
Now my issue is when I tried to analyse output html, I realised that the converter was not able to detect whitespace between two words due to which I got some words concatenated.
Corresponding pdf file can be accessed from here if needed.
Can anyone please help me with this?
The text extractor at hand, Pdf2Dom's
PDFDomTree
, is based on PDFBox'PDFTextStripper
but only uses it to parse the PDF drawing instructions into characters with style and position while it does all the analysis of these rich characters itself.In particular it ignores all incoming white space characters in its
PDFBoxTree
parent class:(
org.fit.pdfdom.PDFBoxTree
overrideprocessTextPosition
)In that
[...process character...]
block it tries to recognize word gaps by hard coded distances:(inside the
[...process character...]
block above)As the text in your PDF is small to start with (9pt determined by Pdf2Dom) and in many lines very tightly set, gaps between words usually are smaller than the
1.0
assumed above (distx > 1.0f
).In my eyes there a 2 issues here:
dropping white spaces means throwing away information; (In some situations this might be advantageous, I've seen PDFs with the same line drawn twice with either drawing string argument containing spaces where the other contains visible characters; but these are exceptions.)
having hard-coded distance limits
distx > 1.0f
,distx < -6.0f
, etc. even though the font sizes (and with them the gap sizes) can vary much.These issues should be fixed in the code. Two possible work-arounds for PDFs like your demo.pdf:
Choosing different distance limits
A true fix should try and make the distance limits dynamic, depending on the font size and probably even the average character distance in the current line up to the current position. A work-around for your PDF would be to replace the hard-coded distance by a smaller hard-coded one.
E.g. using
.5f
instead of the1.0f
as word distance, i.e. replacing the test above byThis results in Pdf2Dom recognizing the word gaps in your document (or at least many more, I have not checked all of them).
Interpreting white spaces as splits
Instead of ignoring white spaces, you can explicitly interpret them as word gaps, e.g. by enhancing the
processTextPosition
override like thisI have not analyzed the code in depth, so I can only call this a work-around. To make it a real fix, you have to test it for side effects and also extend it to look into the exact nature of the white space: There are other white space characters than the normal space, some of them zero-width, some non-breaking, etc. All these different types of white space deserve special treatment.
PS: As many
PDFBoxTree
members are protected (and not private), it is easily possible to apply the second work-around without having to patch Pdf2Dom:(ExtractText test
testDemoImproved
)