PDFTron: converting pixels to fontsize


I have some text in a PDF that has been OCR'ed. The OCR returns the bounding boxes of the words to me. I'm able to draw the bounding boxes (wordRect) on the PDF, and they all appear in the correct places.

But when I set the font size to the height of these bounding boxes, it all goes wrong. The text appears much smaller than it should and doesn't match the height of the boxes.

There's some conversion I am missing. How can I make sure the text is as high as the bounding boxes?

pdftron.PDF.Font font = pdftron.PDF.Font.Create(convertedPdf.GetSDFDoc(), pdftron.PDF.Font.StandardType1Font.e_helvetica);
for (int j = 0; j < ocrStream.pr_WoordList.Count; j++)
{
    wordRect = (Rectangle) ocrStream.pr_Rectangles[j];

    Element textBegin = elementBuilder.CreateTextBegin();
    gStateTextRun = textBegin.GetGState();
    gStateTextRun.SetTextRenderMode(GState.TextRenderingMode.e_stroke_text);
    elementWriter.WriteElement(textBegin);

    fontSize = wordRect.Height;
    double descent;

    if (hasColorImg)
    {
        descent = (-1 * font.GetDescent() / 1000d) * fontSize;
        textRun = elementBuilder.CreateTextRun((string)ocrStream.pr_WoordList[j], font, fontSize);

        // Translate the word to its correct position on the PDF.

        // The bottom line of the word rectangle is the baseline for the font; that's why we need the descender.
        textRun.SetTextMatrix(1, 0, 0, 1, wordRect.Left, wordRect.Bottom + descent);

Best answer

"How can I make sure the text is as high as the bounding boxes?"

The font size is just a scaling factor. In most cases one unit of it maps to 1/72 inch (1 pt) on the page, but not always.

The transformations are: GlyphSpace -> TextSpace -> UserSpace (where UserSpace is essentially the page space, with a default unit of 1/72 inch).

The glyphs in the font are defined in GlyphSpace, and there is a font matrix that maps them to TextSpace. Typically, 1000 units map to 1 unit in text space, but not always.

Then the text matrix (element.SetTextMatrix), the font size (the variable in question here), and some additional parameters transform TextSpace coordinates to UserSpace.
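As a rough illustration of that chain, the sketch below maps a vertical extent given in glyph units to UserSpace. It is plain arithmetic, not PDFTron API: the class and method names are hypothetical, and it assumes the typical 1000-units-per-em font matrix, no rotation or skew, and that the only vertical scaling comes from the font size and the "d" entries of the text matrix and the current transformation matrix.

```csharp
using System;

static class GlyphToUserSpace
{
    // Assumption: the font matrix maps 1000 glyph units to 1 text-space unit.
    const double UnitsPerEm = 1000.0;

    // Height in UserSpace of a vertical extent given in glyph units.
    // textMatrixD and ctmD are the vertical scale factors (the "d" entries)
    // of the text matrix and the current transformation matrix; both are
    // assumed to be free of rotation and skew.
    public static double HeightInUserSpace(double glyphUnits, double fontSize,
                                           double textMatrixD = 1.0, double ctmD = 1.0)
    {
        return glyphUnits / UnitsPerEm * fontSize * textMatrixD * ctmD;
    }

    static void Main()
    {
        // Helvetica's cap height is listed as 718 glyph units in its AFM metrics.
        double capHeight = HeightInUserSpace(718, 20);
        Console.WriteLine($"Cap height at font size 20: {capHeight}");
    }
}
```

Under these assumptions a capital letter set at font size 20 is only about 14.4 units tall, which is consistent with the text looking too small when the font size is set equal to the box height.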

In the end, though, the exact height also depends on the glyph.
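One way to approximate this without inspecting individual glyphs is to choose the font size so that the font's full ascent-to-descent span fills the box. The sketch below is only that approximation, with hypothetical helper names. It assumes metrics in 1/1000 em (as in the question's use of font.GetDescent()), identity scaling in the text matrix and CTM, and an ascent value obtained from the font in a similar way. OCR boxes that hug the actual glyphs of a word (for example a word with no ascenders or descenders) will still not match exactly.

```csharp
using System;

static class FontSizeFromBox
{
    // Assumption: glyph metrics are expressed in 1/1000 em.
    const double UnitsPerEm = 1000.0;

    // Font size at which the span from descent to ascent (glyph units,
    // descent negative) is exactly boxHeight tall in UserSpace.
    public static double FontSizeForBox(double boxHeight, double ascent, double descent)
    {
        double extent = ascent - descent;
        if (extent <= 0)
            throw new ArgumentException("ascent must be greater than descent");
        return boxHeight * UnitsPerEm / extent;
    }

    // Distance from the bottom of the box up to the baseline at that font size.
    public static double BaselineOffset(double fontSize, double descent)
    {
        return -descent / UnitsPerEm * fontSize;
    }

    static void Main()
    {
        // Helvetica's AFM metrics: ascender 718, descender -207. They are
        // hard-coded here as an assumption rather than read from a font object.
        double size = FontSizeForBox(18.5, 718, -207);
        double offset = BaselineOffset(size, -207);
        Console.WriteLine($"fontSize={size}, baselineOffset={offset}");
    }
}
```

In the question's loop, the computed size would take the place of wordRect.Height as the font size, and the baseline offset corresponds to the existing descent variable.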

This forum post shows how to go from the glyph data to UserSpace; see the ProcessElements function there: https://groups.google.com/d/msg/pdfnet-sdk/eOATUHGFyqU/6tsUF0BHukkJ