I'm using PDF Clown to analyze a PDF document. In many documents, PDF Clown reports different heights for characters that obviously have the same height. Is there a workaround?
This is the code:

    while(_level.moveNext()) {
        ContentObject content = _level.getCurrent();
        if(content instanceof Text) {
            ContentScanner.TextWrapper text = (ContentScanner.TextWrapper)_level.getCurrentWrapper();
            for(ContentScanner.TextStringWrapper textString : text.getTextStrings()) {
                List<CharInfo> chars = new ArrayList<>();
                for(TextChar textChar : textString.getTextChars()) {
                    chars.add(new CharInfo(textChar.getBox(), textChar.getValue()));
                }
            }
        }
        else if(content instanceof XObject) {
            // Scan the external level
            if(((XObject)content).getScanner(_level)!=null){
                getContentLines(((XObject)content).getScanner(_level));
            }
        }
        else if(content instanceof ContainerObject){
            // Scan the inner level
            if(_level.getChildLevel()!=null){
                getContentLines(_level.getChildLevel());
            }
        }
    }

Here is an example PDF document:
In this document I marked two text chunks which both contain the word "million". When analyzing the size of each char in both occurrences of "million", the following happens:
- "m" in the first mark has the height 14.50 and the width 8.5
- "i" in the first mark has the height 14.50 and the width 3.0
- "l" in the first mark has the height 14.50 and the width 3.0
- "m" in the second mark has the height 10.56 and the width 6.255
- "i" in the second mark has the height 10.56 and the width 2.23
- "l" in the second mark has the height 10.56 and the width 2.23
Even though all chars of the two text chunks obviously have the same size, PDF Clown reports different sizes for them.
The issue is caused by a bug in PDF Clown: it assumes that marked content sections and save/restore graphics state blocks are properly contained in each other and don't overlap. I.e. it assumes that these structures only intermingle as

    q ... BMC ... EMC ... Q

or

    BMC ... q ... Q ... EMC

but never as

    q ... BMC ... Q ... EMC

or

    BMC ... q ... EMC ... Q

Unfortunately this assumption is wrong: marked content sections and save/restore graphics state blocks can intermingle any way they like.
E.g. in the document at hand there are sequences like this:

    q
    [...1...]
    /P <</MCID 0 >>BDC
    Q
    [...2...]
    EMC

Here [...1...] is contained in the save/restore graphics state block enveloped by q and Q, and [...2...] is contained in the marked content block enveloped by /P <</MCID 0 >>BDC and EMC.

Due to the wrong assumption, though, and the way /P <</MCID 0 >>BDC and Q are arranged, PDF Clown parses the above as [...1...] and an empty marked content block and [...2...] being contained in a save/restore graphics state block.

Thus, if there are changes in the graphics state inside [...2...], PDF Clown assumes them limited to the lines above while they actually are not.

The only easy way I found to repair this was to disable the marked content parsing in PDF Clown.
To do this I changed org.pdfclown.documents.contents.tokens.ContentParser as follows:

In parseContentObjects() I disabled the contentObject instanceof EndMarkedContent option.

In parseContentObject I removed the if(operation instanceof BeginMarkedContent) branch.

With these changes in place, the character sizes are properly extracted.
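
To make the effect of the nesting assumption more tangible, here is a small self-contained toy model. It is not PDF Clown code, and all names in it are hypothetical. It walks the operator sequence q, [...1...], BDC, Q, [...2...], EMC (with [1] and [2] as placeholders) and reports which placeholders a parser would consider to be inside the save/restore graphics state block, once with marked content treated as a container that any end operator may close, and once with marked content operators ignored, which roughly corresponds to the workaround described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: not PDF Clown code, all names are hypothetical.
class NestingSketch {
    // "[1]" and "[2]" stand for the content [...1...] and [...2...].
    static final List<String> SEQUENCE = List.of("q", "[1]", "BDC", "Q", "[2]", "EMC");

    // Returns the placeholders the parser believes to be inside the q/Q block.
    // markedContentAsContainer == true mimics the faulty assumption: BDC opens
    // a container and the next end operator (Q or EMC) closes the innermost one.
    // markedContentAsContainer == false ignores BDC/EMC for nesting purposes.
    static List<String> insideGraphicsStateBlock(List<String> ops, boolean markedContentAsContainer) {
        Deque<String> open = new ArrayDeque<>();
        List<String> inside = new ArrayList<>();
        for (String op : ops) {
            switch (op) {
                case "q":
                    open.push("q");
                    break;
                case "BDC":
                    if (markedContentAsContainer) open.push("BDC");
                    break;
                case "Q":
                    if (!open.isEmpty()) open.pop();
                    break;
                case "EMC":
                    if (markedContentAsContainer && !open.isEmpty()) open.pop();
                    break;
                default:
                    // Any other token is content; record it if a q block is still open.
                    if (open.contains("q")) inside.add(op);
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // Faulty assumption: [2] wrongly ends up inside the q/Q block.
        System.out.println(insideGraphicsStateBlock(SEQUENCE, true));  // prints [[1], [2]]
        // Marked content ignored: only [1] is inside the q/Q block.
        System.out.println(insideGraphicsStateBlock(SEQUENCE, false)); // prints [[1]]
    }
}
```

In the first run the Q operator closes the (empty) marked content container instead of the graphics state block, so the restore is effectively applied too late. A more thorough fix would presumably track the graphics state stack independently of the marked content structure, but that is beyond this sketch.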
As an aside, while the returned individual character boxes seem to imply that the box is completely custom to the character in question, that is not true: merely the width of the box is character-specific. The height is calculated from overall font properties (and the current font size) but not specifically for the character, cf. the org.pdfclown.documents.contents.fonts.Font method getHeight(char). Individual character height calculation is still a TODO.
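
To illustrate what this means for the boxes, here is a minimal sketch with a hypothetical stand-in class. It is not the PDF Clown Font class, the metric values are invented, and it assumes that the font-wide height is simply ascent minus descent scaled by the font size. The point is only that the width depends on the character while the height does not.

```java
import java.util.Map;

// Hypothetical stand-in for a font; not the PDF Clown Font class.
class FontMetricsSketch {
    // Metrics in 1/1000 of a text space unit; the values are invented.
    final double ascent = 750;
    final double descent = -250;
    final Map<Character, Double> glyphWidths = Map.of('m', 778.0, 'i', 278.0, 'l', 278.0);

    // Width is looked up per character (500 is an arbitrary fallback).
    double getWidth(char c, double fontSize) {
        return glyphWidths.getOrDefault(c, 500.0) * fontSize / 1000;
    }

    // Height ignores the character: it uses font-wide metrics only
    // (assumption: ascent minus descent, scaled by the font size).
    double getHeight(char c, double fontSize) {
        return (ascent - descent) * fontSize / 1000;
    }

    public static void main(String[] args) {
        FontMetricsSketch font = new FontMetricsSketch();
        for (char c : "mil".toCharArray()) {
            System.out.println(c + ": width=" + font.getWidth(c, 10) + " height=" + font.getHeight(c, 10));
        }
    }
}
```

With such a model, "m", "i" and "l" get different widths but one common height at a given font size, so a height difference between two occurrences of the same word points to a different font size or transformation being applied, not to the glyphs themselves.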