I'm using PDF Clown to analyze a PDF document. In many documents, PDF Clown reports different heights for characters that obviously have the same height. Is there a workaround?
This is the code:
    while(_level.moveNext()) {
        ContentObject content = _level.getCurrent();
        if(content instanceof Text) {
            ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
            for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
                List<CharInfo> chars = new ArrayList<>();
                for(TextChar textChar : textString.getTextChars()) {
                    chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
                }
            }
        }
        else if(content instanceof XObject) {
            // Scan the external level
            if(((XObject)content).getScanner(_level)!=null){
                getContentLines(((XObject)content).getScanner(_level));
            }
        }
        else if(content instanceof ContainerObject){
            // Scan the inner level
            if(_level.getChildLevel()!=null){
                getContentLines(_level.getChildLevel());
            }
        }
    }
Here is an example PDF document:
In this document I marked two text chunks which both contain the word "million". When analyzing the size of each character in both occurrences of "million", I get the following:
- "m" in the first mark has height 14,50 and width 8,5
- "i" in the first mark has height 14,50 and width 3,0
- "l" in the first mark has height 14,50 and width 3,0
- "m" in the second mark has height 10,56 and width 6,255
- "i" in the second mark has height 10,56 and width 2,23
- "l" in the second mark has height 10,56 and width 2,23
Even though all characters of the two text chunks obviously have the same size, PDF Clown reports different sizes.
The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as nested blocks, with one kind entirely inside the other (q ... BDC ... EMC ... Q, or BDC ... q ... Q ... EMC), but never as partially overlapping blocks (q ... BDC ... Q ... EMC, or BDC ... q ... EMC ... Q).
Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.
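To make the difference between properly nested and overlapping blocks concrete, here is a minimal sketch of a check over a list of operator names. It is only an illustration: the class `NestingCheck` and its method are hypothetical names, not part of PDF Clown, and operands such as `/P <</MCID 0 >>` are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class NestingCheck {
    /**
     * Returns true if every Q closes a q and every EMC closes a BDC/BMC in strict
     * last-opened-first-closed order, i.e. the two block kinds never partially overlap.
     */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "q":
                case "BDC":
                case "BMC":
                    open.push(op);
                    break;
                case "Q":
                    // Q must close the most recently opened block, and that must be a q
                    if (open.isEmpty() || !open.pop().equals("q")) return false;
                    break;
                case "EMC":
                    // EMC must close the most recently opened block, and that must be BDC/BMC
                    if (open.isEmpty() || open.pop().equals("q")) return false;
                    break;
                default:
                    // any other operator neither opens nor closes a block
                    break;
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isProperlyNested(List.of("q", "BDC", "EMC", "Q"))); // true
        System.out.println(isProperlyNested(List.of("BDC", "q", "Q", "EMC"))); // true
        System.out.println(isProperlyNested(List.of("q", "BDC", "Q", "EMC"))); // false
        System.out.println(isProperlyNested(List.of("BDC", "q", "EMC", "Q"))); // false
    }
}
```

The last two sequences are perfectly legal in a PDF content stream; they are merely the cases a parser relying on strict nesting cannot represent.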
E.g. in the document at hand there are sequences of this form: q, then [...1...], then /P <</MCID 0 >>BDC, then Q, then [...2...], then EMC.
Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q, and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.
Due to the wrong assumption, though, and the way /P <</MCID 0 >>BDC and Q are arranged, PDF Clown parses the above as [...1...], an empty marked content block, and [...2...], all being contained in a save/restore graphics state block.
Thus, if there are changes in the graphics state inside [...2...], PDF Clown assumes them to be limited to the sequence above, while they actually are not.
The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.
To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:
In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent option.
In parseContentObject I removed the if(operation instanceof BeginMarkedContent) branch.
With these changes in place, the character sizes are properly extracted.
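The mis-parse and the effect of the workaround can be reproduced with a toy model. This is a sketch under assumptions, not PDF Clown's actual code: `ToyContentParser` is a hypothetical class in which any end operator closes the innermost open container, `[1]` and `[2]` stand in for the elided content, and the `BDC` operands are dropped. With `ignoreMarkedContent` set, `BDC` and `EMC` are kept as plain operations and no longer structure the tree.

```java
import java.util.List;

class ToyContentParser {
    private final List<String> tokens;
    private final boolean ignoreMarkedContent;
    private int pos = 0;

    private ToyContentParser(List<String> tokens, boolean ignoreMarkedContent) {
        this.tokens = tokens;
        this.ignoreMarkedContent = ignoreMarkedContent;
    }

    /** Parses tokens until the input ends or a token closes the current container. */
    private String parseLevel() {
        StringBuilder out = new StringBuilder();
        while (pos < tokens.size()) {
            String t = tokens.get(pos++);
            boolean closes = t.equals("Q") || (!ignoreMarkedContent && t.equals("EMC"));
            boolean opens = t.equals("q") || (!ignoreMarkedContent && t.equals("BDC"));
            if (closes) {
                // The flawed assumption: ANY end operator closes the innermost container,
                // without checking which kind of container is actually open.
                break;
            }
            if (opens) {
                String label = t.equals("q") ? "state{" : "marked{";
                out.append(label).append(parseLevel().trim()).append("} ");
            } else {
                out.append(t).append(' ');
            }
        }
        return out.toString();
    }

    static String parse(List<String> tokens, boolean ignoreMarkedContent) {
        return new ToyContentParser(tokens, ignoreMarkedContent).parseLevel().trim();
    }

    public static void main(String[] args) {
        List<String> stream = List.of("q", "[1]", "BDC", "Q", "[2]", "EMC");
        System.out.println(parse(stream, false)); // state{[1] marked{} [2]}
        System.out.println(parse(stream, true));  // state{[1] BDC} [2] EMC
    }
}
```

In the first output `[2]` wrongly ends up inside the graphics state block next to an empty marked content block; in the second, the graphics state block ends where `Q` actually is and `[2]` sits outside it.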
As an aside, while the returned individual character boxes seem to imply that the box is completely custom to the character in question, that is not true: merely the width of the box is character-specific; the height is calculated from overall font properties (and the current font size) but not specifically for the character, cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char). Individual character height calculation still is a TODO.
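A small sketch of what this means for the boxes you get back. `ToyFont` and its metrics are made up for illustration (they are not PDF Clown's Font class and not values from any real font); the point is only that the width depends on the character while the height depends on the font and the font size alone.

```java
import java.util.Map;

class ToyFont {
    // Hypothetical metrics in 1/1000 of the font size; not taken from any real font.
    private final Map<Character, Integer> glyphWidths;
    private final int lineHeight;

    ToyFont(Map<Character, Integer> glyphWidths, int lineHeight) {
        this.glyphWidths = glyphWidths;
        this.lineHeight = lineHeight;
    }

    /** Width is looked up per character. */
    double getWidth(char c, double fontSize) {
        return glyphWidths.getOrDefault(c, 0) * fontSize / 1000.0;
    }

    /** Height ignores the character: it only depends on font-wide metrics and the size. */
    double getHeight(char c, double fontSize) {
        return lineHeight * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        ToyFont font = new ToyFont(Map.of('m', 778, 'i', 278), 1200);
        for (char c : new char[] {'m', 'i'}) {
            System.out.println(c + ": width=" + font.getWidth(c, 10) + " height=" + font.getHeight(c, 10));
        }
    }
}
```

This matches the pattern in the question: within one text chunk, "m", "i" and "l" all report the same height and differ only in width.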