Incorrect y coordinate of text matrix (tm) using pypdf 3.7.0

I'm using pypdf 3.7.0 to extract text from a pdf file. I need the text's location to do subsequent operations, so I extract the text along with its x and y coordinates from the text matrix. However, while there is no issue with the x coordinate, the y coordinates are incorrect.

I checked the page size to make sure the file wasn't scaled, and it is as expected (the page size is 612x792).

I think one way to solve this is to combine the transformation matrix (cm) with the text matrix (tm), but I haven't figured out how to do that.

Note: the reason I suspect the transformation matrix (cm) is that for other PDF files its value is [1,0,0,1,0,0] (the identity matrix if you write it as 3x3). For this PDF file, however, the values of cm keep changing (especially the last 2 elements of the matrix).

Link to the pdf file: https://drive.google.com/file/d/10KMQVAJPB2hQSOOT6OrnF0RGg82k6i31/view?usp=sharing

Below is a code example for the first page. (The issue happens on all the pages.)

from pypdf import PdfReader

def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm[4] and tm[5] are the translation (x, y) components of the text matrix
    x, y = tm[4], tm[5]
    print('This is text', text)
    print('This is tm', tm)
    print('This is cm', cm)

py_reader = PdfReader("Typhoon Merbok PVRR.pdf")
py_page = py_reader.pages[0]

print('This is page size',py_page.mediabox)
py_page.extract_text(visitor_text=visitor_body)

This is the result:

This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
This is text Typhoon Merbok
This is tm [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 17 September, 2022
This is tm [1.0, 0.0, 0.0, -1.0, 127.91241470000003, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 683.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 127.91241470000003, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text Released:
This is tm [1.0, 0.0, 0.0, -1.0, 64.36441049999999, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 73.317032, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 31 October, 2022
This is tm [1.0, 0.0, 0.0, -1.0, 177.6134188, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 669.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 177.6134188, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text NHERI DesignSafe Project ID:
This is tm [1.0, 0.0, 0.0, -1.0, 201.9814454, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 656.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 201.9814454, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 0.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text PRJ-
This is tm [1.0, 0.0, 0.0, -1.0, 27.6880035, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 3737
This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 642.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 36.640625, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 31.672241, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text PRELIMINARY VIRTUAL RECONNAISSANCE REPORT (PVRR)
This is tm [1.0, 0.0, 0.0, -1.0, 573.4407955, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 76.5, 615.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 573.4407955, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 95.317818, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text Virtual Assessment Structural Team (VAST) Lead
This is tm [1.0, 0.0, 0.0, -1.0, 516.5935141, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 522.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 516.5935141, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 155.507813, 15.89333344]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text Mohammad Alam, University of Notre Dame
This is tm [1.0, 0.0, 0.0, -1.0, 243.546876, 15.89333344]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 505.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 243.546876, 15.89333344]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 81.33474, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text Virtual Assessment Structural Team (VAST) Authors
This is tm [1.0, 0.0, 0.0, -1.0, 530.9040763, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 480.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 530.9040763, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text (in alphabetical order)
This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 463.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 234.625, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text  

This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text Janise Rodgers, GeoHazards International
This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 449.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 160.804688, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 186.91797, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text Prateek Arora, New York University
This is tm [1.0, 0.0, 0.0, -1.0, 339.003908, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 435.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 339.003908, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 420.75]
This is text  

This is tm [1.0, 0.0, 0.0, -1.0, 182.59766, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 420.75]
This is text Stephanie Pilkington, UNC Charlotte
This is tm [1.0, 0.0, 0.0, -1.0, 182.59766, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 420.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 182.59766, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 396.0]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 251.35808, 1.02156258]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 396.0]
This is text PVRR Editors
This is tm [1.0, 0.0, 0.0, -1.0, 362.2662068, 17.92000058]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 396.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 362.2662068, 17.92000058]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 379.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 241.12329, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 379.5]
This is text (in alphabetical order)
This is tm [1.0, 0.0, 0.0, -1.0, 377.9960624999998, 14.079999970000001]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 379.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 377.9960624999998, 14.079999970000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 366.0]
This is text  

This is tm [1.0, 0.0, 0.0, -1.0, 187.05859, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 366.0]
This is text Ian Robertson, University of Hawaii
This is tm [1.0, 0.0, 0.0, -1.0, 187.05859, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 366.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 187.05859, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 351.75]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 197.00391, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 351.75]
This is text Kurt Gurley , University of Florida
This is tm [1.0, 0.0, 0.0, -1.0, 276.73047299999996, 15.35999966]
This is cm [0.75, 0.0, 0.0, -0.75, 78.75, 351.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 276.73047299999996, 15.35999966]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 71.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 232.0, 0.65671921]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 71.25]
This is text PVRR: 17 September , 2022 T yphoon Merbok
This is tm [1.0, 0.0, 0.0, -1.0, 393.847657, 11.520000510000001]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 71.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 393.847657, 11.520000510000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 60.75]
This is text  

This is tm [1.0, 0.0, 0.0, -1.0, 232.0, 11.520000510000001]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 60.75]
This is text PRJ-3737 | Released: 31 October , 2022
This is tm [1.0, 0.0, 0.0, -1.0, 411.22069999999997, 11.520000510000001]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 60.75]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 411.22069999999997, 11.520000510000001]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 232.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text Building Resilience through Reconnaissance
This is tm [1.0, 0.0, 0.0, -1.0, 232.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text 
This is tm [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text 1
This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [0.75, 0.0, 0.0, -0.75, 72.0, 50.25]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [290.39792, 0.0, 0.0, 66.75, 67.78627, 631.5]
This is text 

This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [290.39792, 0.0, 0.0, 66.75, 67.78627, 631.5]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [1.0, 0.0, 0.0, -1.0, 0.0, 792.0]
This is text 
This is tm [1.0, 0.0, 0.0, -1.0, 576.0, 15.3599997]
This is cm [157.5, 0.0, 0.0, 36.0, 79.5, 36.0]

Answer by K J

It is unclear why you think the cm values are not correct, so let's pick the first letter of your minimal sample, the T in "Typhoon":

This is cm [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
This is text Typhoon Merbok

Let's see where it is placed. My GUI editor shows rounded values, so we need to allow a small leeway when comparing: the x value you report as 363.75 is shown here rounded to 363.8.

[Image: screenshot from the GUI editor showing the position of the text]

You will need to enlarge the image to see the other value, shown as 92.3 (= 92.25) from the top. However, that is not the way PDF uses coordinates: it calculates in a chart-like Cartesian coordinate system with the origin at the lower left.

So to compare with the PDF we need to subtract that value from the previously defined page height of 792.0. The T is therefore at 792.0 - 92.25 = 699.75, and we can see that this is the cm location given by [..., 363.75, 699.75].

Likewise, the top of "17 September" is 108.75 from the top of the page, which is 792.0 - 108.75 = 683.25 in Y: [..., 363.75, 683.25].
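
The subtraction used in the two comparisons above can be written as a small helper. This is only a sketch: the name top_to_pdf_y is made up for illustration, and the page height of 792.0 is taken from the page size reported in the question.

```python
PAGE_HEIGHT = 792.0  # page height from the question's mediabox (612x792)


def top_to_pdf_y(distance_from_top, page_height=PAGE_HEIGHT):
    """Convert a distance measured down from the top edge of the page
    into a PDF y value (origin at the lower left)."""
    return page_height - distance_from_top


print(top_to_pdf_y(92.25))   # 699.75, the top of the T in "Typhoon"
print(top_to_pdf_y(108.75))  # 683.25, the top of "17 September"
```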

As we move down the "Page" the Y value will naturally decrease.
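
If you want a single page position per text fragment, one option is to push the tm translation through cm. The sketch below assumes the usual PDF convention that a text-space point is mapped by the text matrix first and then by the current transformation matrix, with both given as [a, b, c, d, e, f]. The helper name to_page_coords is hypothetical and not part of pypdf, and whether tm describes the start or the end of a fragment is something to verify against your own file.

```python
def to_page_coords(tm, cm):
    """Map the text-matrix translation (tm[4], tm[5]) through cm.

    Both matrices use the PDF 6-element form [a, b, c, d, e, f].
    Returns (x, y) in page space, origin at the lower left.
    """
    tx, ty = tm[4], tm[5]
    a, b, c, d, e, f = cm
    x = a * tx + c * ty + e
    y = b * tx + d * ty + f
    return x, y


# Values copied from the question's output for "Typhoon Merbok"
tm = [1.0, 0.0, 0.0, -1.0, 138.5182341, 17.92000058]
cm = [0.75, 0.0, 0.0, -0.75, 363.75, 699.75]
x, y = to_page_coords(tm, cm)
print(round(x, 2), round(y, 2))  # 467.64 686.31
```

Inside the visitor function from the question, the same call would be to_page_coords(tm, cm). When cm is the identity matrix, the result is simply (tm[4], tm[5]), which would explain why reading tm alone works for other files.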