Consider https://github.com/modesty/pdf2json/files/13788866/test.pdf .

When I do pdf2json -f test.pdf I get the following as the coordinates / dimensions of the "yyy" field:

"x": 11.924,
"y": 7.027,
"w": 9.375,
"h": 1.292

But when I do qpdf test.pdf --qdf test.qdf I get this:

  /Rect [
    190.784
    658.903
    340.784
    680.903
  ]

My question is... why aren't they the same and how do I convert one to the other and vice versa?

I note the page sizes are different as well. qdf gives this as the size:

%% Page 1
%% Original object ID: 18 0
21 0 obj
<<
  /Annots 6 0 R
  /Contents 22 0 R
  /CropBox [
    0.0
    0.0
    612.0
    792.0
  ]
  /MediaBox [
    0.0
    0.0
    612.0
    792.0
  ]
  /Parent 15 0 R
  /Resources <<
  >>
  /Rotate 0
  /Type /Page
>>
endobj

So 612x792 (which is consistent with what pdfbox -box test.pdf tells me) whereas pdf2json gives 38.25x49.5. Now in the case of the page size I note that you can transform one to the other by either multiplying by 16 or dividing by 16. But for the position and dimensions of the PDF field that is not the case.

So presumably 658.903 in the qpdf output corresponds to the 11.924 (x) in the pdf2json output. 658.903/11.924 is about 55.26. Likewise it stands to reason that 190.784 corresponds to the 7.027 (y) in the pdf2json output, but 190.784/7.027 gives me 27.15. So there's no constant multiplier that I can use to transform one set of coordinates into the other.

For good measure I also tried 680.903/11.924 (57.10) and 340.784/7.027 (48.49) and those don't match either.

So how do pdf2json's coordinates / dimensions relate to the numbers in /Rect? Do they relate at all?

On BEST ANSWER

The source PDF's /Type /Page has a size of /CropBox [0 0 612 792], and the linked /Type /Annot has the dimensions /Rect [190.784 658.903 340.784 680.903]. That gives a field box 150 units wide (340.784 - 190.784) by 22 units high (680.903 - 658.903). With no transformations in play, these units can be treated as plain points at 1/72" per unit.
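
As a quick check of that arithmetic, here is a minimal Python sketch. The helper name rect_size is made up for illustration; the only assumption is that /Rect is ordered [left bottom right top], which is how the numbers above read.

```python
def rect_size(rect):
    """Return (width, height) in points for a PDF /Rect.

    Assumes the order is (left, bottom, right, top), as in the qpdf output.
    """
    left, bottom, right, top = rect
    return right - left, top - bottom


width, height = rect_size((190.784, 658.903, 340.784, 680.903))
# Round to hide floating-point noise in the subtraction.
print(round(width, 3), round(height, 3))  # 150.0 22.0
```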


For whatever reason, the JSON output currently expresses the same values scaled by 1/16 (that is, 4.5/72, or 4.5 page units per inch). This can be different on different devices and pages.
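
To make the 4.5/72 relationship concrete, here is a small sketch applied to the page size from the question. The constant and function names are hypothetical, and the 4.5 figure is only what is observed for this file, not a documented guarantee.

```python
POINTS_PER_INCH = 72.0
# Assumption: 4.5 pdf2json page units per inch, as observed for this PDF.
PAGE_UNITS_PER_INCH = 4.5


def points_to_page_units(points):
    """Scale a length in PDF points to pdf2json page units (factor 4.5/72 = 1/16)."""
    return points * PAGE_UNITS_PER_INCH / POINTS_PER_INCH


# The 612 x 792 point page from the qpdf output.
print(points_to_page_units(612.0), points_to_page_units(792.0))  # 38.25 49.5
```

That reproduces the 38.25 x 49.5 page size pdf2json reports.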

Page units are relative units which depend on the size, resolution and dpi of the system and the PDF. For more info, see the issues below (pdf2json Page Unit: What is it?).

https://github.com/search?q=repo%3Amodesty%2Fpdf2json+units&type=issues

https://github.com/modesty/pdf2json/issues/136#issuecomment-1129033826 likewise notes that converting page units to points is simply a matter of multiplying the page units by 16.

So to convert the JSON (HTML-style) units you describe into points, just multiply by 16:

"x": 11.924  x 16 = 190.784  the left edge in /Rect (HTML and PDF both measure x from the left)
"y": 7.027   x 16 = 112.432  the top edge flipped into HTML orientation (measured down from the top of the page), i.e. 792 - 680.903 = 111.097 plus a smidgen (1.335)
"w": 9.375   x 16 = 150      the width of the field, 340.784 - 190.784, whatever the system
"h": 1.292   x 16 =  20.672  the 22 points of height, less the same smidgen (here 1.328)

So, allowing for rounding errors, we can say the smidgen (call it a shim if you prefer) is about 1.333 points.
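
Putting the whole thing together, here is a hedged Python sketch of the round trip. Everything named here (POINTS_PER_PAGE_UNIT, TOP_INSET_PT, field_to_rect, rect_to_field) is hypothetical, and both constants are inferred from this one field in this one PDF, so treat them as assumptions to verify against your own files. Note that with these numbers the bottom edge (y + h) already lines up with /Rect; only the top edge is shifted by the shim.

```python
# Assumption: 16 points per pdf2json page unit, as observed for this file.
POINTS_PER_PAGE_UNIT = 16.0
# Assumption: the ~1.333 pt shim seen on the top edge of this one field.
TOP_INSET_PT = 1.333


def field_to_rect(x, y, w, h, page_height_pt, inset_pt=TOP_INSET_PT):
    """pdf2json x/y/w/h (page units, origin top-left) -> /Rect in points
    (left, bottom, right, top; origin bottom-left)."""
    left = x * POINTS_PER_PAGE_UNIT
    right = left + w * POINTS_PER_PAGE_UNIT
    # Undo the shim to get the distance from the page top to the field top.
    top_from_page_top = y * POINTS_PER_PAGE_UNIT - inset_pt
    height = h * POINTS_PER_PAGE_UNIT + inset_pt
    top = page_height_pt - top_from_page_top  # flip the y axis
    bottom = top - height
    return left, bottom, right, top


def rect_to_field(rect, page_height_pt, inset_pt=TOP_INSET_PT):
    """Inverse of field_to_rect: /Rect in points -> pdf2json x/y/w/h."""
    left, bottom, right, top = rect
    x = left / POINTS_PER_PAGE_UNIT
    w = (right - left) / POINTS_PER_PAGE_UNIT
    y = (page_height_pt - top + inset_pt) / POINTS_PER_PAGE_UNIT
    h = (top - bottom - inset_pt) / POINTS_PER_PAGE_UNIT
    return x, y, w, h


print(field_to_rect(11.924, 7.027, 9.375, 1.292, page_height_pt=792.0))
print(rect_to_field((190.784, 658.903, 340.784, 680.903), page_height_pt=792.0))
```

For the "yyy" field this lands within about 0.01 point of the /Rect values and reproduces x, y, w and h to three decimal places. Passing inset_pt=0.0 gives the plain multiply-by-16 conversion with the y axis flipped.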
