I am using the pdf2json library to parse a pdf.
It returns the parsed data as JSON, and I've attached some sample data below.
The main variables to note are:
Height - The height of the pdf in PAGE_UNITS
Width - The width of the pdf in PAGE_UNITS
sw - The space width of the font, as defined in the README.md of the pdf2json library
TS at index 1 - font size in pt
w - This is where my confusion is. w is supposed to represent the width of the line of text. However, my line of text has a greater w (96.648) than the width of the page (38.25), which doesn't make any sense.
I need to get the length of the text. I've tried (number of chars in text * sw) / page width to get the ratio of the line's length relative to the PDF. To test, I then used that ratio in my frontend to draw over the specific line on an image of the same PDF.
But this doesn't seem to be giving me the correct length of the line. Usually it is too short.
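For concreteness, here is a minimal sketch of that estimate using the values from the sample data further down. The function name `estimateWidthRatio` is only illustrative, and the calculation assumes that sw is expressed in the same units as Width and that every character is as wide as a space. The library's README does not confirm either assumption.

```javascript
// Values copied from the sample data in this post.
const text = decodeURIComponent("Hello%20World%20"); // "Hello World " (12 characters)
const sw = 0.32553125;   // "sw" of the text block (assumed to be in page units)
const pageWidth = 38.25; // "Width" of the page in page units

// Hypothetical helper: the character-count estimate described above.
function estimateWidthRatio(text, sw, pageWidth) {
  return (text.length * sw) / pageWidth;
}

const ratio = estimateWidthRatio(text, sw, pageWidth);
console.log(ratio.toFixed(4)); // prints 0.1021
```

So this estimate puts the line at roughly 10% of the page width.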
If anyone could help, that would be super appreciated. I've been going through the pdf2json issues searching for something similar, but there have been no answers, and the library doesn't appear to be supported all that well.
"Pages": [
{
"Height": 49.5,
"HLines": [],
"VLines": [],
"Fills": [
{
"x": 0,
"y": 0,
"w": 0,
"h": 0,
"clr": 1
},
{
"x": 9.001,
"y": 19.271,
"w": 5.372,
"h": 0.038,
"clr": 35
}
],
"Texts": [
{
"x": 4.252,
"y": 45.981,
"w": 96.648,
"sw": 0.32553125,
"clr": 0,
"A": "left",
"R": [
{
"T": "Hello%20World%20",
"S": -1,
"TS": [
0,
15,
0,
0
]
}
]
},
"Width": 38.25
...
Coming here a bit late and working through the same issue. I believe the w property is the width of the text's bounding rectangle and not indicative of the text length, as you've already noted. Unless the font is monospace, I don't think we can accurately obtain the text width using the number of chars.
Since the HTML5 Canvas element has a built-in measureText, I decided to try measuring the text with node-canvas. This has not gone through unit testing, so I can't vouch for its accuracy, but it is working well enough for my needs so far, and the overhead isn't as big as I expected.
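A minimal sketch of the measureText idea, not the exact code I run. The helper names `measureRunWidthPx` and `ratioOfPage` are made up for illustration. The font family is an assumption you have to supply yourself (pdf2json only gives a font face ID and the size in TS), and the page width in points is passed in rather than derived from Width. The helpers take any object with a `font` property and a `measureText` method, so they work with a node-canvas 2D context.

```javascript
// Sketch under stated assumptions: measure a pdf2json text run with a canvas
// 2D context and express it as a fraction of the page width.
//
// With node-canvas (the "canvas" npm package) a context can be created like this:
//   const { createCanvas } = require("canvas");
//   const ctx = createCanvas(1, 1).getContext("2d");

const PX_PER_PT = 96 / 72; // CSS pixels per point

// Hypothetical helper: width of one text run in CSS pixels.
// encodedText is the URI-encoded "T" value, fontSizePt is TS[1].
// fontFamily is a guess supplied by the caller; it must be a font the canvas can use.
function measureRunWidthPx(ctx, encodedText, fontSizePt, fontFamily) {
  const text = decodeURIComponent(encodedText);
  ctx.font = `${fontSizePt}pt ${fontFamily}`;
  return ctx.measureText(text).width;
}

// Hypothetical helper: fraction of the page width covered by the run.
// pageWidthPt is the page width in points (612 for US Letter).
function ratioOfPage(widthPx, pageWidthPt) {
  return widthPx / (pageWidthPt * PX_PER_PT);
}
```

For the sample above you would call `measureRunWidthPx(ctx, "Hello%20World%20", 15, "Helvetica")` and pass the result to `ratioOfPage(widthPx, 612)`. The 612 assumes a US Letter page, which the sample's 38.25 x 49.5 proportions match, and "Helvetica" is only a placeholder. If the measured font differs from the one embedded in the PDF, the result will be off.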
I hope this can help someone else.