Calculating the length of text using fontsize (npm - pdf2json library)


I am using the pdf2json library to parse a pdf.

It is returning the parsed data in a json and I've attached some sample data.

The main variables to keep note of are:

Height - The height of the pdf in PAGE_UNITS

Width - The width of the pdf in PAGE_UNITS

sw - (space width of the font) Defined in the README.md of the pdf2json library

TS at index 1 - font size in pt

w - This is where my confusion lies. w is supposed to represent the width of the line of text. However, my line of text has a w (96.648) greater than the Width of the page (38.25), which doesn't make any sense.

I need to get the length of the text. I've tried computing (number of chars in text * sw) / pagewidth to get the ratio of the line's length to the width of the pdf. To test this, I used that ratio in my frontend to draw over the specific line on an image of the same pdf.

But this doesn't seem to be giving me the correct length of the line. Usually it is too short.
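
For reference, the estimate described above can be sketched as follows, using the values from the sample data. The helper name estimateWidthRatio is hypothetical and not part of pdf2json. The calculation assumes every character is exactly as wide as a space (sw); in a proportional font most glyphs are wider than a space, which would likely explain why the result comes out too short.

```javascript
// Sketch of the character-count estimate from the question.
// estimateWidthRatio is a hypothetical helper, not part of pdf2json.
function estimateWidthRatio(textItem, pageWidth) {
  // T is URI-encoded, so decode each run before counting characters
  const chars = textItem.R.reduce(
    (count, run) => count + decodeURIComponent(run.T).length,
    0
  );
  // Assumption: every character is as wide as a space (sw)
  return (chars * textItem.sw) / pageWidth;
}

// Values copied from the sample data in this question
const sampleText = {
  sw: 0.32553125,
  R: [{ T: "Hello%20World%20", TS: [0, 15, 0, 0] }],
};

console.log(estimateWidthRatio(sampleText, 38.25)); // roughly 0.102
```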

If anyone could help, that would be super appreciated. I've been going through the pdf2json issues searching for something similar, but there have been no answers and the library doesn't appear to be supported all that well.

"Pages": [
  {
    "Height": 49.5,
    "HLines": [],
    "VLines": [],
    "Fills": [
      {
        "x": 0,
        "y": 0,
        "w": 0,
        "h": 0,
        "clr": 1
      },
      {
        "x": 9.001,
        "y": 19.271,
        "w": 5.372,
        "h": 0.038,
        "clr": 35
      }
    ],
    "Texts": [
      {
        "x": 4.252,
        "y": 45.981,
        "w": 96.648,
        "sw": 0.32553125,
        "clr": 0,
        "A": "left",
        "R": [
          {
            "T": "Hello%20World%20",
            "S": -1,
            "TS": [
              0,
              15,
              0,
              0
            ]
          }
        ]
      },
 "Width": 38.25
...

There are 2 best solutions below

0
On

Coming here a bit late and working through the same issue. I believe the w property is the width of the text's bounding rectangle and not indicative of the text length, as you've already noted.

Unless the font is monospace, I don't think we can accurately obtain the text width using the number of chars.

Since the HTML5 Canvas element has a built-in measureText method, I decided to try the following using node-canvas:

import { createCanvas } from 'canvas';
...
const canvas = createCanvas(900, 200); //for text length calcs
const ctx = canvas.getContext('2d');//for text length calcs
const pdfUnitToPx = 22.2281951; //magic scale factor from https://github.com/modesty/pdf2json/issues/123
...
/**
 * getTextWidth
 * Calculate the width of a pdf2json text item using canvas.
 * @param {pdf2jsonParsedItem} item
 * @param {CanvasRenderingContext2D} ctx
 * @returns {number} width of the text in pdf2json page units
 */
const getTextWidth = (item, ctx) => {
    // sum the measured width of every text run in R
    return item.R.reduce((acc, cur) => {
        // decode the URI-encoded text
        const text = decodeURIComponent(cur.T);

        // parse the TS components per the pdf2json documentation
        const [fontFaceID, fontSize, fontBold, fontItalic] = cur.TS;

        // determine the font face from PDFParser.fontFaceDict, which I call pdf2jsonFontFaceDict here
        const fontFace = fontFaceID in pdf2jsonFontFaceDict ? pdf2jsonFontFaceDict[fontFaceID] : 'Arial';

        // construct the canvas font string, skipping empty style parts
        ctx.font = [fontItalic ? 'italic' : '', fontBold ? 'bold' : '', `${fontSize}px`, fontFace].filter(Boolean).join(' ');

        // measure in pixels and convert to pdf units, adding to the accumulator
        return acc + ctx.measureText(text).width / pdfUnitToPx;
    }, 0);
};

This has not gone through unit testing so I can't stand by its accuracy. But it is working appropriately for my needs so far. And the overhead isn't as big as I thought.

I hope this can help someone else.

0
On

In Will R's answer, pdfUnitToPx is set to 22.2281951.

I ended up taking the pdf2json text w field and dividing it by this number, which gives a fairly decent width for the text in the same units as the x and y values:

const pdfUnitToPx = 22.2281951; //magic scale factor from https://github.com/modesty/pdf2json/issues/123

const someTextPart = someData.Pages[0].Texts[0]

const textPdfWidth = someTextPart.w / pdfUnitToPx

const rightXofText = someTextPart.x + textPdfWidth
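
To connect this back to the original goal of drawing over an image of the page, here is a minimal sketch that turns the same heuristic into fractions of the page size. The helper toOverlayBox is hypothetical, the scale factor is the "magic" number quoted above rather than a documented pdf2json constant, and the sketch assumes x and y mark the top-left corner of the text, which I have not verified.

```javascript
// Sketch only: applies the w / 22.2281951 heuristic from this answer.
// toOverlayBox is a hypothetical helper, not part of pdf2json.
const PDF_UNIT_TO_PX = 22.2281951;

function toOverlayBox(textItem, page) {
  // text width converted to the same page units as x and y
  const width = textItem.w / PDF_UNIT_TO_PX;
  return {
    left: textItem.x / page.Width,  // fraction of page width
    top: textItem.y / page.Height,  // fraction of page height (assumes y is the top edge)
    width: width / page.Width,      // fraction of page width
  };
}

// Values copied from the sample data in the question
const page = { Width: 38.25, Height: 49.5 };
const text = { x: 4.252, y: 45.981, w: 96.648 };
const box = toOverlayBox(text, page);

// For an image rendered 800px wide, scale the fractions by the image width
console.log(box.left * 800, box.width * 800);
```

Working in fractions keeps the overlay independent of the size at which the page image is rendered.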