How can I extract text information such as positional coordinates, width, height, etc.? I tried this with the PDF Clown library and it works perfectly fine for normal text, but for rotated text (90/-90 degrees) it outputs width/height as 0 (zero).
The scaling factors (scaleX, scaleY) for text rotated by 90/-90 degrees are reported as (0, 0), whereas for inverted text (rotated by 180 degrees) they are (-1, -1).
I want this information for rotated text in order to highlight it (as the width value is zero, I am unable to highlight it). Please help me. I'm working in a .NET environment.
File I am using: https://nofile.io/f/Kvf2DkXvfj4/edit9.pdf
Code: TextInfoExtractionSample.cs from the PDF Clown samples
Output (for the three differently oriented texts in the file above):
Text [x:283,y:104,w:126,h:-23] [font size:-24 , font sytle : ArialMT]: inverted_text
Text [x:265,y:244,w:0,h:121] [font size:0 , font sytle : ArialMT]: vertical_text
Text [x:347,y:131,w:0,h:167] [font size:0 , font sytle : ArialMT]: vertical_minus90
As I'm more at home with Java than .Net, I analyzed the problem and created a first workaround in PDF Clown / Java; I'll try and port it to .Net later. It shouldn't be too difficult, though, to do it yourself.
The issue
The sample file you provided makes the issue pretty clear when running it through the PDF Clown TextInfoExtractionSample.
[Screenshot of edit9.pdf]
[Screenshot of edit9.pdf after applying TextInfoExtractionSample]
Upright text
Everything looks ok.
Upside down text
The individual character boxes (green) look ok but the box for the whole string "inverted_text" (dashed black) excludes the outermost characters.
Vertical text
The individual character boxes are reduced to 0x0 rectangles (invisible in the screenshot but apparent in content stream analysis). The box for the whole string is reduced to a line (dashed black) on the base line of the string, and that line is a bit too short.
Text at angles in-between
The character boxes are upright, parallel to the page borders, with their base line segment inside the box. As the text is at an angle, though, the upper and lower parts of the characters partially are outside their respective character box while neighboring characters are partially inside.
The boxes for the whole strings also are parallel to the page.
In a nutshell
The text character and string boxes only work properly for upright text.
In the sources
This matches what one finds in the source code:
The Java Rectangle2D and .Net RectangleF classes used for the character boxes by design are meant for rectangles parallel to the coordinate system axes and are used in that manner in PDF Clown. Thus, they cannot properly represent width and height of characters at arbitrary angles.
PDF Clown classes don't include an Angle attribute to represent the rotation of the character.
The calculation of the character box dimensions only takes the values on the main diagonal of the aggregated transformation matrix into account, i.e. ScaleX and ScaleY, and ignores ShearX and ShearY. For text which is not upright or upside down, though, ShearX and ShearY are important; for vertical text ScaleX and ScaleY are 0.
The transition from baseline (native PDF way of positioning text) to top-of-character (PDF Clown text positioning) is done by change of y coordinate alone and, therefore, only works properly for upright and upside down text.
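To illustrate the third point, here is a minimal Java sketch. It is not PDF Clown code: java.awt.geom.AffineTransform merely stands in for the aggregated text matrix, and the class and method names (TextMatrixDemo, naiveScaleX, fullScaleX, fullScaleY, angle) are made up for this illustration. It compares the diagonal-only scale with the length of the transformed unit vectors, and derives the angle from the same matrix entries.

```java
import java.awt.geom.AffineTransform;

class TextMatrixDemo {
    /** Horizontal scale read from the main diagonal only; collapses to 0 for vertical text. */
    static double naiveScaleX(AffineTransform m) {
        return m.getScaleX();
    }

    /** Length of the transformed unit x vector (ScaleX, ShearY); valid at any angle. */
    static double fullScaleX(AffineTransform m) {
        return Math.hypot(m.getScaleX(), m.getShearY());
    }

    /** Length of the transformed unit y vector (ShearX, ScaleY); valid at any angle. */
    static double fullScaleY(AffineTransform m) {
        return Math.hypot(m.getShearX(), m.getScaleY());
    }

    /** Direction of the transformed base line, in radians. */
    static double angle(AffineTransform m) {
        return Math.atan2(m.getShearY(), m.getScaleX());
    }

    public static void main(String[] args) {
        // Assumed example matrix: rotation by 90 degrees combined with a scale of 24
        AffineTransform vertical = AffineTransform.getRotateInstance(Math.PI / 2);
        vertical.scale(24, 24);

        System.out.println("diagonal only: " + naiveScaleX(vertical));
        System.out.println("with shear:    " + fullScaleX(vertical) + " x " + fullScaleY(vertical));
        System.out.println("angle (deg):   " + Math.toDegrees(angle(vertical)));
    }
}
```

For a quarter-turn matrix the diagonal entries vanish, which is consistent with the w:0 and font size 0 values in the question's output, while the shear entries still carry the full scale.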
A work-around
A real fix of the issue would require using a completely different class for character and string boxes, a class that models rectangles at arbitrary angles.
A quicker work-around, though, can be to add an angle member to the TextChar class and to ITextString and implementations, and then to consider that angle when processing the boxes. This work-around is implemented here.
As already mentioned above, the work-around is first implemented in Java.
In Java
First we add an angle member to TextChar, calculate correct values for box dimensions and the angle in the ShowText operation class, and correctly set these values in the ContentScanner.TextStringWrapper.
Then we add an angle getter to TextStringWrapper (and ITextString in general) which returns the angle of the first text char of the string. And we improve the TextStringWrapper method getBox to take the angle of the text chars into account when determining the string box.
Finally we'll extend the TextInfoExtractionSample to take the angle values into account when drawing the boxes.
I named that angle member Alpha as I named that angle α in my sketches. In hindsight Theta or simply Angle would have been more appropriate.
TextChar
New member variable alpha
A new and a changed constructor
A getter for the angle
(TextChar.java)
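The original listing is not reproduced here, so the following is only a rough sketch of what those three changes might look like. AngledTextChar is a hypothetical, simplified stand-in and not the PDF Clown TextChar class (the text style member is left out, for instance). The sketch assumes that the box holds the unrotated width and height anchored at its (x, y) corner, and that alpha is the rotation about that corner in radians.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical, simplified stand-in for a text character that carries an angle. */
final class AngledTextChar {
    private final char value;
    private final Rectangle2D box;  // assumed: unrotated width/height, anchored at (x, y)
    private final double alpha;     // new member: rotation in radians about the anchor
    private final boolean virtual;

    /** New constructor that receives the angle explicitly. */
    AngledTextChar(char value, Rectangle2D box, double alpha, boolean virtual) {
        this.value = value;
        this.box = box;
        this.alpha = alpha;
        this.virtual = virtual;
    }

    /** Changed constructor: keeps the old signature and assumes upright text. */
    AngledTextChar(char value, Rectangle2D box, boolean virtual) {
        this(value, box, 0, virtual);
    }

    /** Getter for the angle. */
    double getAlpha() { return alpha; }

    Rectangle2D getBox() { return box; }

    char getValue() { return value; }

    boolean isVirtual() { return virtual; }
}
```

Keeping the old constructor signature means existing callers that only deal with upright text would not need to change.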
ShowText
Update inner interface IScanner method scanChar to transport the angle
(ShowText.java inner interface IScanner)
Update scan method to correctly calculate rectangle dimensions and angle and forward them to the IScanner implementation
(ShowText.java)
ContentScanner inner class TextStringWrapper
Update TextStringWrapper constructor ShowText.IScanner callback to accept the angle argument and use it for constructing the TextChar
A getter for the angle
A getBox implementation that takes the angle into account
(ContentScanner.java inner class TextStringWrapper)
ITextString
New angle getter
(ITextString.java)
TextExtractor inner class TextString
New angle getter
(TextExtractor.java)
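How an angle-aware getBox could combine character boxes may be sketched as follows. This is an illustration under assumptions and not the code from ContentScanner.java: AngledBoxes, corners and stringBox are hypothetical names, each character box is assumed to hold its unrotated width and height anchored at (x, y), and all characters of a string are assumed to share the angle alpha. The idea is to rotate every character corner back into the text's own frame, build an ordinary axis-parallel union there, and keep the angle alongside the result.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

class AngledBoxes {
    /** Page-space corners of a box rotated by alpha (radians) about its (x, y) anchor. */
    static Point2D[] corners(Rectangle2D box, double alpha) {
        AffineTransform rot = AffineTransform.getRotateInstance(alpha, box.getX(), box.getY());
        Point2D[] pts = {
            new Point2D.Double(box.getMinX(), box.getMinY()),
            new Point2D.Double(box.getMaxX(), box.getMinY()),
            new Point2D.Double(box.getMaxX(), box.getMaxY()),
            new Point2D.Double(box.getMinX(), box.getMaxY())
        };
        for (int i = 0; i < pts.length; i++) {
            pts[i] = rot.transform(pts[i], null);
        }
        return pts;
    }

    /** Union of the character boxes, measured in the frame of the first character. */
    static Rectangle2D stringBox(List<Rectangle2D> charBoxes, double alpha) {
        Rectangle2D first = charBoxes.get(0);
        // Inverse rotation about the first character's anchor
        AffineTransform unrot = AffineTransform.getRotateInstance(-alpha, first.getX(), first.getY());
        Rectangle2D union = null;
        for (Rectangle2D box : charBoxes) {
            for (Point2D corner : corners(box, alpha)) {
                Point2D p = unrot.transform(corner, null);
                if (union == null) {
                    union = new Rectangle2D.Double(p.getX(), p.getY(), 0, 0);
                } else {
                    union.add(p);
                }
            }
        }
        return union;
    }

    public static void main(String[] args) {
        // Two made-up 10 x 20 character boxes of a string rotated by 90 degrees
        double alpha = Math.PI / 2;
        List<Rectangle2D> chars = Arrays.asList(
            new Rectangle2D.Double(100, 100, 10, 20),
            new Rectangle2D.Double(100, 110, 10, 20));
        Rectangle2D box = stringBox(chars, alpha);
        System.out.println("string box: " + box);
        System.out.println("highlight corners: " + Arrays.toString(corners(box, alpha)));
    }
}
```

Passing the resulting box and the same angle to corners yields the four page-space points of the string outline, which is the shape one would need for drawing or highlighting rotated text.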
TextInfoExtractionSample
Changes to extract to properly use the angle in outlining the boxes
(TextInfoExtractionSample method extract)
The result
Both character boxes and string boxes now are as intended:
So width and height outputs now also are ok: