Text info extraction from pdf

How can I extract text information such as positional coordinates, width, height, etc.? I tried this with the 'PDF Clown' library and it works perfectly fine for normal text, but for rotated text (90/-90 degrees) it outputs width/height as 0 (zero).

Also, the scaling factors (scaleX, scaleY) for text rotated by 90/-90 degrees are displayed as (0, 0), whereas for inverted text (rotated by 180 degrees) they are (-1, -1).

I need this information for rotated text in order to highlight it (as the width value is zero, I am unable to highlight it). Please help me. I'm working in a .NET environment.

File I am using: https://nofile.io/f/Kvf2DkXvfj4/edit9.pdf

Code: Using TextInfoExtractionSample.cs from the PDF Clown samples

Output (for the three differently oriented texts in the file above)

Text [x:283,y:104,w:126,h:-23] [font size:-24 , font sytle : ArialMT]: inverted_text

Text [x:265,y:244,w:0,h:121] [font size:0 , font sytle : ArialMT]: vertical_text

Text [x:347,y:131,w:0,h:167] [font size:0 , font sytle : ArialMT]: vertical_minus90

Best answer

As I'm more at home with Java than .NET, I analyzed the problem and created a first work-around in PDF Clown / Java; I'll try to port it to .NET later. It shouldn't be too difficult, though, to do it yourself.

The issue

The sample file you provided makes the issue pretty clear when running it through the PDF Clown TextInfoExtractionSample.

Screenshot of edit9.pdf (original)

Screenshot of edit9.pdf after applying TextInfoExtractionSample

Upright text

Everything looks ok.

Upside down text

The individual character boxes (green) look ok but the box for the whole string "inverted_text" (dashed black) excludes the outermost characters.

Vertical text

The individual character boxes are reduced to 0x0 rectangles (invisible in the screenshot but apparent in content stream analysis). The box for the whole string is reduced to a line (dashed black) on the baseline of the string, and that line is slightly shorter than the string.

Text at angles in-between

The character boxes are upright, parallel to the page borders, with their base line segment inside the box. As the text is at an angle, though, the upper and lower parts of the characters partially are outside their respective character box while neighboring characters are partially inside.

The boxes for the whole strings also are parallel to the page.

In a nutshell

The text character and string boxes only work properly for upright text.

In the sources

This matches what one finds in the source code:

  • The Java Rectangle2D and .Net RectangleF classes used for the character boxes by design are meant for rectangles parallel to the coordinate system axes and are used in that manner in PDF Clown. Thus, they cannot properly represent width and height of characters at arbitrary angles.

  • PDF Clown classes don't include an Angle attribute to represent the rotation of the character.

  • The calculation of the character box dimensions only takes the values on the main diagonal of the aggregated transformation matrix into account, i.e. ScaleX and ScaleY, and ignores ShearX and ShearY. For text which is neither upright nor upside down, though, ShearX and ShearY are important; for vertical text, ScaleX and ScaleY are even 0, so the shear values carry all the size information.

  • The transition from baseline (native PDF way of positioning text) to top-of-character (PDF Clown text positioning) is done by change of y coordinate alone and, therefore, only works properly for upright and upside down text.
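
The observation about the main diagonal can be reproduced without PDF Clown. The following minimal sketch uses only java.awt.geom.AffineTransform; the class and method names are hypothetical and chosen for illustration. It prints the scale and shear entries of pure rotations: at 90 and -90 degrees the scale entries vanish while the shear entries carry the whole length, and at 180 degrees both scale entries are -1, which matches the (0, 0) and (-1, -1) values reported in the question.

```java
import java.awt.geom.AffineTransform;

// Illustrative helper, not part of PDF Clown.
final class RotationMatrixEntries {

    /** Returns {scaleX, scaleY, shearX, shearY} of a pure rotation by the given angle in degrees. */
    static double[] entries(double degrees) {
        AffineTransform t = AffineTransform.getRotateInstance(Math.toRadians(degrees));
        return new double[] { t.getScaleX(), t.getScaleY(), t.getShearX(), t.getShearY() };
    }

    public static void main(String[] args) {
        // The four orientations that occur in the sample file.
        for (double deg : new double[] { 0, 90, 180, -90 }) {
            double[] e = entries(deg);
            System.out.printf("%6.0f deg: scale=(%.3f, %.3f) shear=(%.3f, %.3f)%n",
                    deg, e[0], e[1], e[2], e[3]);
        }
    }
}
```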

A work-around

A real fix of the issue would require using a completely different class for character and string boxes, a class that models rectangles at arbitrary angles.

A quicker work-around, though, can be to add an angle member to the TextChar class and to ITextString and implementations, and then to consider that angle when processing the boxes. This work-around is implemented here.

As already mentioned above, the work-around is first implemented in Java.

In Java

First we add an angle member to TextChar, calculate correct values for box dimensions and the angle in ShowText operation class, and correctly set these values in the ContentScanner.TextStringWrapper.

Then we add an angle getter to TextStringWrapper (and ITextString in general) which returns the angle of the first text char of the string. And we improve the TextStringWrapper method getBox to take the angle of the text chars into account when determining the string box.

Finally we'll extend the TextInfoExtractionSample to take the angle values into account when drawing the boxes.

I named that angle member Alpha as I named that angle α in my sketches. In hindsight, Theta or simply Angle would have been more appropriate.

TextChar

New member variable alpha

  private final double alpha;

A new and a changed constructor

  // <constructors>
  public TextChar(
    char value,
    Rectangle2D box,
    TextStyle style,
    boolean virtual
    )
  {
      this(value, box, 0, style, virtual);
  }

  public TextChar(
    char value,
    Rectangle2D box,
    double alpha,
    TextStyle style,
    boolean virtual
    )
  {
    this.value = value;
    this.box = box;
    this.alpha = alpha;
    this.style = style;
    this.virtual = virtual;
  }
  // </constructors>

A getter for the angle

  public double getAlpha() {
      return alpha;
  }

(TextChar.java)

ShowText

Update inner interface IScanner method scanChar to transport the angle

void scanChar(
  char textChar,
  Rectangle2D textCharBox,
  double alpha
  );

(ShowText.java inner interface IScanner)

Update scan method to correctly calculate rectangle dimensions and angle and forward them to the IScanner implementation

[...]
for(char textChar : textString.toCharArray())
{
  double charWidth = font.getWidth(textChar) * scaledFactor;

  if(textScanner != null)
  {
    /*
      NOTE: The text rendering matrix is recomputed before each glyph is painted
      during a text-showing operation.
    */
    AffineTransform trm = (AffineTransform)ctm.clone(); trm.concatenate(tm);
    double charHeight = font.getHeight(textChar,fontSize);

    // vvv--- changed
    double ascent = font.getAscent(fontSize);
    double x = trm.getTranslateX() + ascent * trm.getShearX();
    double y = contextHeight - trm.getTranslateY() - ascent * trm.getScaleY();
    double dx = charWidth * trm.getScaleX();
    double dy = charWidth * trm.getShearY();
    double alpha = Math.atan2(dy, dx);
    double w = Math.sqrt(dx*dx + dy*dy);
    dx = charHeight * trm.getShearX();
    dy = charHeight * trm.getScaleY();
    double h = Math.sqrt(dx*dx + dy*dy);
    Rectangle2D charBox = new Rectangle2D.Double(x, y, w, h);

    textScanner.scanChar(textChar,charBox, alpha);
    // ^^^--- changed
  }

  /*
    NOTE: After the glyph is painted, the text matrix is updated
    according to the glyph displacement and any applicable spacing parameter.
  */
  tm.translate(charWidth + charSpace + (textChar == ' ' ? wordSpace : 0), 0);
}
[...]

(ShowText.java)
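
The changed calculation boils down to taking the length of the transformed width and height vectors instead of a single matrix entry. A stand-alone sketch of just that step, assuming a plain AffineTransform in place of the text rendering matrix and made-up glyph dimensions (RotatedGlyphMetrics and measure are illustrative names, not PDF Clown API):

```java
import java.awt.geom.AffineTransform;

// Illustrative helper, not part of PDF Clown.
final class RotatedGlyphMetrics {

    /**
     * Returns {width, height, alpha} of a glyph whose untransformed size is
     * charWidth x charHeight after applying trm. alpha is in radians.
     */
    static double[] measure(AffineTransform trm, double charWidth, double charHeight) {
        // Image of the width vector (charWidth, 0) under the linear part of trm.
        double dx = charWidth * trm.getScaleX();
        double dy = charWidth * trm.getShearY();
        double alpha = Math.atan2(dy, dx);
        double w = Math.hypot(dx, dy);
        // Image of the height vector (0, charHeight) under the linear part of trm.
        double h = Math.hypot(charHeight * trm.getShearX(), charHeight * trm.getScaleY());
        return new double[] { w, h, alpha };
    }

    public static void main(String[] args) {
        // Assumed example values: a glyph 12 units wide and 23 units high, rotated by 90 degrees.
        double[] m = measure(AffineTransform.getRotateInstance(Math.PI / 2), 12, 23);
        System.out.println("w=" + m[0] + " h=" + m[1] + " alpha=" + Math.toDegrees(m[2]));
    }
}
```

With this approach the width no longer collapses to 0 for vertical text, because the shear entries contribute to the vector length.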

ContentScanner inner class TextStringWrapper

Update TextStringWrapper constructor ShowText.IScanner callback to accept the angle argument and use it for constructing the TextChar

getBaseDataObject().scan(
  state,
  new ShowText.IScanner()
  {
    @Override
    public void scanChar(
      char textChar,
      Rectangle2D textCharBox,
      double alpha
      )
    {
      textChars.add(
        new TextChar(
          textChar,
          textCharBox,
          alpha,
          style,
          false
          )
        );
    }
  }
  );

A getter for the angle

public double getAlpha() {
    return textChars.isEmpty() ? 0 : textChars.get(0).getAlpha();
}

A getBox implementation that takes the angle into account

public Rectangle2D getBox(
  )
{
  if(box == null)
  {
    AffineTransform rot = null;
    Rectangle2D tempBox = null;
    for(TextChar textChar : textChars)
    {
      Rectangle2D thisBox = textChar.getBox();
      if (rot == null) {
          rot = AffineTransform.getRotateInstance(textChar.getAlpha(), thisBox.getX(), thisBox.getY());
          tempBox = (Rectangle2D)thisBox.clone();
      } else {
          Point2D corner = new Point2D.Double(thisBox.getX(), thisBox.getY());
          rot.transform(corner, corner);
          tempBox.add(new Rectangle2D.Double(corner.getX(), corner.getY(), thisBox.getWidth(), thisBox.getHeight()));
      }
    }
    if (tempBox != null) {
        try {
            Point2D corner = new Point2D.Double(tempBox.getX(), tempBox.getY());
            rot.invert();
            rot.transform(corner, corner);
            box = new Rectangle2D.Double(corner.getX(), corner.getY(), tempBox.getWidth(), tempBox.getHeight());
        } catch (NoninvertibleTransformException e) {
            e.printStackTrace();
        }
    }
  }
  return box;
}

(ContentScanner.java inner class TextStringWrapper)
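
The idea behind this getBox variant can be tried out in isolation: rotate the top-left corners of all character boxes into the frame of the first character, build an ordinary axis-parallel union there, and rotate the resulting corner back. The sketch below mirrors that logic on a plain list of Rectangle2D objects; RotatedStringBox and union are hypothetical names, and the sample coordinates are invented.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.NoninvertibleTransformException;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Illustrative helper, not part of PDF Clown.
final class RotatedStringBox {

    /** Unions character boxes that all share the rotation angle alpha (radians). */
    static Rectangle2D union(List<Rectangle2D> charBoxes, double alpha) {
        if (charBoxes.isEmpty()) {
            return null;
        }
        Rectangle2D first = charBoxes.get(0);
        // Rotation about the corner of the first box; that corner stays fixed.
        AffineTransform rot = AffineTransform.getRotateInstance(alpha, first.getX(), first.getY());
        Rectangle2D temp = (Rectangle2D) first.clone();
        for (Rectangle2D b : charBoxes.subList(1, charBoxes.size())) {
            Point2D c = rot.transform(new Point2D.Double(b.getX(), b.getY()), null);
            temp.add(new Rectangle2D.Double(c.getX(), c.getY(), b.getWidth(), b.getHeight()));
        }
        try {
            // Map the corner of the union back into page coordinates.
            Point2D corner = rot.createInverse()
                    .transform(new Point2D.Double(temp.getX(), temp.getY()), null);
            return new Rectangle2D.Double(corner.getX(), corner.getY(), temp.getWidth(), temp.getHeight());
        } catch (NoninvertibleTransformException e) {
            // Not expected for a pure rotation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two invented 10x5 character boxes of vertical text, the second 10 units further up the page.
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(100, 200, 10, 5),
                new Rectangle2D.Double(100, 190, 10, 5));
        System.out.println(union(boxes, Math.PI / 2));
    }
}
```

As in the patched method, the returned rectangle keeps its width and height in the text's own frame; only its corner is expressed in page coordinates, so it has to be drawn together with the angle.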

ITextString

New angle getter

  public double getAlpha();

(ITextString.java)

TextExtractor inner class TextString

New angle getter

public double getAlpha() {
    return textChars.isEmpty() ? 0 : textChars.get(0).getAlpha();
}

(TextExtractor.java)

TextInfoExtractionSample

Changes to extract to properly use the angle in outlining the boxes

[...]
for (ContentScanner.TextStringWrapper textString : text.getTextStrings())
{
    Rectangle2D textStringBox = textString.getBox();
    System.out.println("Text [" + "x:" + Math.round(textStringBox.getX()) + "," + "y:" + Math.round(textStringBox.getY()) + "," + "w:"
            + Math.round(textStringBox.getWidth()) + "," + "h:" + Math.round(textStringBox.getHeight()) + "] [font size:"
            + Math.round(textString.getStyle().getFontSize()) + "]: " + textString.getText());

    // Drawing text character bounding boxes...
    colorIndex = (colorIndex + 1) % textCharBoxColors.length;
    composer.setStrokeColor(textCharBoxColors[colorIndex]);
    for (TextChar textChar : textString.getTextChars())
    {
        // vvv--- changed
        Rectangle2D box = textChar.getBox();
        composer.beginLocalState();
        AffineTransform rot = AffineTransform.getRotateInstance(textChar.getAlpha());
        composer.applyMatrix(rot.getScaleX(), rot.getShearY(), rot.getShearX(), rot.getScaleY(),
                box.getX(), composer.getScanner().getContextSize().getHeight() - box.getY());
        composer.add(new DrawRectangle(0, - box.getHeight(), box.getWidth(), box.getHeight()));

        composer.stroke();
        composer.end();
        // ^^^--- changed
    }

    // Drawing text string bounding box...
    composer.beginLocalState();
    composer.setLineDash(new LineDash(new double[] { 5 }));
    composer.setStrokeColor(textStringBoxColor);
    // vvv--- changed
    AffineTransform rot = AffineTransform.getRotateInstance(textString.getAlpha());
    composer.applyMatrix(rot.getScaleX(), rot.getShearY(), rot.getShearX(), rot.getScaleY(),
            textStringBox.getX(), composer.getScanner().getContextSize().getHeight() - textStringBox.getY());
    composer.add(new DrawRectangle(0, - textStringBox.getHeight(), textStringBox.getWidth(), textStringBox.getHeight()));
    // ^^^--- changed
    composer.stroke();
    composer.end();
}
[...]

(TextInfoExtractionSample method extract)
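
For highlighting, which is what the question is after, the same geometry yields the four corners of the rotated box in PDF user space (origin at the bottom left). A minimal sketch, assuming the box and angle conventions used above (the box x/y is the top-left corner in top-down page coordinates, alpha is measured counter-clockwise in PDF coordinates); RotatedBoxCorners is an illustrative helper, not part of PDF Clown:

```java
// Illustrative helper, not part of PDF Clown.
final class RotatedBoxCorners {

    /**
     * Returns the corners {x, y} of a rotated box in PDF user space, in the order
     * top-left, top-right, bottom-right, bottom-left as seen in the text's own orientation.
     */
    static double[][] corners(double x, double y, double w, double h, double alpha, double pageHeight) {
        double cos = Math.cos(alpha);
        double sin = Math.sin(alpha);
        // Flip the y coordinate, as the sample does via getContextSize().getHeight() - box.getY().
        double ox = x;
        double oy = pageHeight - y;
        return new double[][] {
            { ox, oy },
            { ox + w * cos, oy + w * sin },
            { ox + w * cos + h * sin, oy + w * sin - h * cos },
            { ox + h * sin, oy - h * cos }
        };
    }

    public static void main(String[] args) {
        // Invented values: a 30x5 box at (10, 20) on a page 100 units high, rotated by 90 degrees.
        for (double[] p : corners(10, 20, 30, 5, Math.PI / 2, 100)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

These are the same points the sample obtains by applying the rotation matrix and drawing a rectangle from (0, -height) with the box width and height. Whether a given highlight API expects the corners in this order is something to check against that API.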

The result

Both character boxes and string boxes now are as intended:

Screenshot with the work-around in action

So width and height outputs now also are ok:

Text [x:415,y:104,w:138,h:23] [font size:-24]: inverted_text
Text [x:247,y:365,w:128,h:23] [font size:0]: vertical_text
Text [x:364,y:131,w:180,h:23] [font size:0]: vertical_minus90