I am using the PyPDF2 library to read PDF files and convert them to text. I have a number of PDF files to process with the following code:
def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)
This works for some PDF files but not others. Is there a way to determine the header and footer size of each PDF when it is first read, and use that instead of the constants 50 and 720? I noticed that solutions exist for other libraries, like this post, but I would like to learn how to do this with PyPDF2.
from PyPDF2 import PdfReader
# Replace the file name below to try different files
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")
page = reader.pages[3]
parts = []
def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)
page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)
print(text_body)
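One direction I have been considering, as a rough sketch only: scale the limits to the page height instead of hard-coding 50 and 720. The helper names `body_band` and `make_visitor` and the 8% header and footer shares are my own assumptions, not something PyPDF2 provides or reads from the file. The sketch also assumes the page's vertical coordinates start at 0. The example drives the callback with hand-made text matrices so it runs without a PDF.

```python
def body_band(page_height, header_share=0.08, footer_share=0.08):
    """Return (low, high) y limits of the body area for a page of this height.

    header_share and footer_share are guessed fractions of the page height
    (an assumption for illustration), not values stored in the PDF.
    """
    low = page_height * footer_share
    high = page_height * (1.0 - header_share)
    return low, high


def make_visitor(parts, low, high):
    """Build a visitor_text-style callback keeping text with low < y < high."""
    def visitor_body(text, cm, tm, fontDict, fontSize):
        y = tm[5]  # vertical offset taken from the text matrix, as in my code above
        if low < y < high:
            parts.append(text)
    return visitor_body


# Example: a US Letter page is 792 points tall.
low, high = body_band(792)
parts = []
visitor = make_visitor(parts, low, high)

# Hand-made fragments standing in for what extract_text would pass in.
visitor("Header", None, [1, 0, 0, 1, 72, 770], None, 10)
visitor("Body text", None, [1, 0, 0, 1, 72, 400], None, 10)
visitor("Footer", None, [1, 0, 0, 1, 72, 30], None, 10)

print("".join(parts))
```

With PyPDF2 itself, the height could presumably come from `float(page.mediabox.height)` and the callback would be passed as `page.extract_text(visitor_text=visitor)`. This still guesses the margins rather than measuring the real header and footer, so it may cut off body text or let header text through on some files.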