How can I extract text information such as positional coordinates, width, height, etc.? I tried this with the PDF Clown library and it works perfectly fine for normal text, but for rotated text (90/-90 degrees) it outputs the width/height as 0 (zero).
The scaling factors (scaleX, scaleY) for text rotated by 90/-90 degrees are reported as (0, 0), whereas for inverted text (rotated by 180 degrees) they are (-1, -1).
I want this information for rotated text in order to highlight it (as the width value is zero, I am unable to highlight it). Please help me. I'm working in a .NET environment.
File I am using: https://nofile.io/f/Kvf2DkXvfj4/edit9.pdf
Code: Using TextInfoExtractionSample.cs from pdfclown samples
Output (for the three different text orientations in the file above):
Text [x:283,y:104,w:126,h:-23] [font size:-24 , font sytle : ArialMT]: inverted_text
Text [x:265,y:244,w:0,h:121] [font size:0 , font sytle : ArialMT]: vertical_text
Text [x:347,y:131,w:0,h:167] [font size:0 , font sytle : ArialMT]: vertical_minus90
As I'm more at home with Java than .Net, I analyzed the problem and created a first workaround in PDF Clown / Java; I'll try and port it to .Net later. It shouldn't be too difficult, though, to do it yourself.
The issue
The sample file you provided makes the issue pretty clear when running it through the PDF Clown TextInfoExtractionSample.
[Screenshot of edit9.pdf]
[Screenshot of edit9.pdf after applying TextInfoExtractionSample]
Upright text
Everything looks ok.
Upside down text
The individual character boxes (green) look ok but the box for the whole string "inverted_text" (dashed black) excludes the outermost characters.
Vertical text
The individual character boxes are reduced to 0x0 rectangles (invisible in the screenshot but apparent in content stream analysis). The box for the whole string is reduced to a line (dashed black) on the baseline of the string, and that line is slightly too short.
Text at angles in-between
The character boxes are upright, parallel to the page borders, with their base line segment inside the box. As the text is at an angle, though, the upper and lower parts of the characters partially are outside their respective character box while neighboring characters are partially inside.
The boxes for the whole strings also are parallel to the page.
In a nutshell
The text character and string boxes only work properly for upright text.
In the sources
This matches what one finds in the source code:
The Java Rectangle2D and .Net RectangleF classes used for the character boxes by design are meant for rectangles parallel to the coordinate system axes and are used in that manner in PDF Clown. Thus, they cannot properly represent width and height of characters at arbitrary angles.
PDF Clown classes don't include an Angle attribute to represent the rotation of the character.
The calculation of the character box dimensions only takes the values on the main diagonal of the aggregated transformation matrix into account, i.e. ScaleX and ScaleY, and ignores ShearX and ShearY. For text which is not upright or upside down, though, ShearX and ShearY are important; for vertical text ScaleX and ScaleY are 0.
The transition from baseline (native PDF way of positioning text) to top-of-character (PDF Clown text positioning) is done by change of y coordinate alone and, therefore, only works properly for upright and upside down text.
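To illustrate the point about the main diagonal, here is a minimal sketch using the plain java.awt.geom.AffineTransform class. It is not PDF Clown code; the class MatrixMetrics and its method names are made up for this illustration. It derives the scale factors and the angle from the full matrix columns instead of from ScaleX and ScaleY alone.

```java
import java.awt.geom.AffineTransform;

/** Hypothetical helper for illustration only; not part of PDF Clown. */
final class MatrixMetrics {
    /** Length of the transformed unit x vector, i.e. the scale along the (possibly rotated) baseline. */
    static double scaleAlongBaseline(AffineTransform m) {
        return Math.hypot(m.getScaleX(), m.getShearY());
    }

    /** Length of the transformed unit y vector, i.e. the scale perpendicular to the baseline. */
    static double scaleAcrossBaseline(AffineTransform m) {
        return Math.hypot(m.getShearX(), m.getScaleY());
    }

    /** Rotation of the baseline in radians, taken from the first matrix column. */
    static double angle(AffineTransform m) {
        return Math.atan2(m.getShearY(), m.getScaleX());
    }
}
```

For AffineTransform.getRotateInstance(Math.PI / 2), getScaleX() and getScaleY() are 0, which is all the diagonal-only calculation gets to see, while both helper scales above are 1 and the angle is a quarter turn.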
A work-around
A real fix of the issue would require using a completely different class for character and string boxes, a class that models rectangles at arbitrary angles.
A quicker work-around, though, can be to add an angle member to the TextChar class and to ITextString and implementations, and then to consider that angle when processing the boxes. This work-around is implemented here.
As already mentioned above, the work-around is first implemented in Java.
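To give an idea of what a class modelling rectangles at arbitrary angles could look like, here is a rough sketch. The class RotatedBox is hypothetical and not part of PDF Clown; it assumes the rectangle is anchored at (x, y) and rotated by alpha radians around that anchor.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Point2D;

/** Hypothetical rectangle anchored at (x, y) and rotated by alpha around that anchor. */
final class RotatedBox {
    final double x, y, width, height, alpha;

    RotatedBox(double x, double y, double width, double height, double alpha) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.alpha = alpha;
    }

    /** Corners in order: anchor, end of the width edge, opposite corner, end of the height edge. */
    Point2D[] corners() {
        AffineTransform rotation = AffineTransform.getRotateInstance(alpha, x, y);
        Point2D[] upright = {
            new Point2D.Double(x, y),
            new Point2D.Double(x + width, y),
            new Point2D.Double(x + width, y + height),
            new Point2D.Double(x, y + height)
        };
        Point2D[] rotated = new Point2D[upright.length];
        for (int i = 0; i < upright.length; i++) {
            rotated[i] = rotation.transform(upright[i], null);
        }
        return rotated;
    }

    /** Closed outline through the four rotated corners, e.g. for drawing a highlight. */
    Path2D outline() {
        Point2D[] c = corners();
        Path2D path = new Path2D.Double();
        path.moveTo(c[0].getX(), c[0].getY());
        for (int i = 1; i < c.length; i++) {
            path.lineTo(c[i].getX(), c[i].getY());
        }
        path.closePath();
        return path;
    }
}
```

Width and height here stay the real dimensions of the text regardless of the angle, so they do not collapse to 0 for vertical text.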
In Java
First we add an angle member to TextChar, calculate correct values for box dimensions and the angle in the ShowText operation class, and correctly set these values in the ContentScanner.TextStringWrapper.
Then we add an angle getter to TextStringWrapper (and ITextString in general) which returns the angle of the first text char of the string. And we improve the TextStringWrapper method getBox to take the angle of the text chars into account when determining the string box.
Finally we'll extend the TextInfoExtractionSample to take the angle values into account when drawing the boxes.
I named that angle member Alpha as I named that angle α in my sketches. In hindsight Theta or simply Angle would have been more appropriate.
TextChar
New member variable alpha
A new and a changed constructor
A getter for the angle
(TextChar.java)
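The shape of these three changes can be sketched roughly as follows. This is a simplified stand-in, not the actual PDF Clown TextChar: the class name AngledTextChar is made up, the text style member is left out, and having the old constructor signature default to an angle of 0 is merely my assumption of how a "changed constructor" could look.

```java
import java.awt.geom.Rectangle2D;

/** Simplified, hypothetical stand-in for a text character record with an angle. */
final class AngledTextChar {
    private final char value;
    private final Rectangle2D box;
    private final double alpha; // new member: rotation of the character in radians
    private final boolean virtual;

    /** Old signature, changed to delegate and to assume upright text (angle 0). */
    AngledTextChar(char value, Rectangle2D box, boolean virtual) {
        this(value, box, 0, virtual);
    }

    /** New constructor which additionally accepts the angle. */
    AngledTextChar(char value, Rectangle2D box, double alpha, boolean virtual) {
        this.value = value;
        this.box = box;
        this.alpha = alpha;
        this.virtual = virtual;
    }

    char getValue() {
        return value;
    }

    Rectangle2D getBox() {
        return box;
    }

    /** Getter for the angle. */
    double getAlpha() {
        return alpha;
    }

    boolean isVirtual() {
        return virtual;
    }
}
```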
ShowText
Update inner interface IScanner method scanChar to transport the angle
(ShowText.java inner interface IScanner)
Update scan method to correctly calculate rectangle dimensions and angle and forward them to the IScanner implementation
(ShowText.java)
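The idea behind these two changes can be sketched as below. This is not the PDF Clown ShowText code: the class AngleAwareScan, the callback interface CharSink and the parameters are hypothetical, character widths and height are assumed to be given in text space, and the reported box is assumed to be anchored at the character origin on the baseline with its real (un-rotated) width and height.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Hypothetical sketch of a scan loop that forwards an angle along with each character box. */
final class AngleAwareScan {
    /** Hypothetical stand-in for the scanner callback, extended by an angle argument. */
    interface CharSink {
        void scanChar(char c, Rectangle2D box, double alpha);
    }

    static void scan(String text, double[] widths, double height, AffineTransform m, CharSink sink) {
        // Angle and scale factors come from the full matrix columns, not only the diagonal.
        double alpha = Math.atan2(m.getShearY(), m.getScaleX());
        double scaleAlong = Math.hypot(m.getScaleX(), m.getShearY());
        double scaleAcross = Math.hypot(m.getShearX(), m.getScaleY());
        double advance = 0;
        for (int i = 0; i < text.length(); i++) {
            // Character origin on the baseline, moved along the rotated text direction.
            Point2D origin = m.transform(new Point2D.Double(advance, 0), null);
            Rectangle2D box = new Rectangle2D.Double(
                origin.getX(), origin.getY(), widths[i] * scaleAlong, height * scaleAcross);
            sink.scanChar(text.charAt(i), box, alpha);
            advance += widths[i];
        }
    }
}
```

With a matrix that translates to (100, 200) and then rotates by 90 degrees, the second character of a string starts at (100, 200 + width of the first character), and both characters keep their nonzero width and height.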
ContentScanner inner class TextStringWrapper
Update TextStringWrapper constructor ShowText.IScanner callback to accept the angle argument and use it for constructing the TextChar
A getter for the angle
A getBox implementation that takes the angle into account
(ContentScanner.java inner class TextStringWrapper)
ITextString
New angle getter
(ITextString.java)
TextExtractor inner class TextString
New angle getter
(TextExtractor.java)
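The string-level angle getter and an angle-aware getBox, as described for TextStringWrapper above, might be sketched like this. Again this is an illustration under assumptions, not the actual implementation: the classes AngledStringBox and Ch are made up, every character box is assumed to be anchored at its origin with its real width and height, all characters of a string are assumed to share one angle, and the string must not be empty.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.List;

/** Hypothetical sketch of a string box built from rotated character boxes. */
final class AngledStringBox {
    /** Hypothetical minimal character record: box anchored at its (x, y), rotated by alpha. */
    static final class Ch {
        final Rectangle2D box;
        final double alpha;

        Ch(Rectangle2D box, double alpha) {
            this.box = box;
            this.alpha = alpha;
        }
    }

    /** Angle of the string: the angle of its first character. */
    static double alpha(List<Ch> chars) {
        return chars.get(0).alpha;
    }

    /** Union of the character boxes, computed in the un-rotated frame of the first character. */
    static Rectangle2D box(List<Ch> chars) {
        double alpha = alpha(chars);
        Rectangle2D first = chars.get(0).box;
        AffineTransform unrotate = AffineTransform.getRotateInstance(-alpha, first.getX(), first.getY());
        Rectangle2D union = null;
        for (Ch ch : chars) {
            Point2D anchor = unrotate.transform(new Point2D.Double(ch.box.getX(), ch.box.getY()), null);
            Rectangle2D upright = new Rectangle2D.Double(
                anchor.getX(), anchor.getY(), ch.box.getWidth(), ch.box.getHeight());
            union = (union == null) ? upright : union.createUnion(upright);
        }
        // Rotate the anchor of the union back; width and height stay the real string dimensions.
        AffineTransform rotate = AffineTransform.getRotateInstance(alpha, first.getX(), first.getY());
        Point2D anchor = rotate.transform(new Point2D.Double(union.getX(), union.getY()), null);
        return new Rectangle2D.Double(anchor.getX(), anchor.getY(), union.getWidth(), union.getHeight());
    }
}
```

For two vertical characters of widths 10 and 12 and height 20 this yields a string box of width 22 and height 20 instead of a box with width 0.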
TextInfoExtractionSample
Changes to extract to properly use the angle in outlining the boxes
(TextInfoExtractionSample method extract)
The result
Both character boxes and string boxes now are as intended:
So width and height outputs now also are ok: