Remove PDF header and footer with PyPDF2


I am using the PyPDF2 library to read PDF files and convert them to text. I have a number of PDF files, and I use the following visitor function to skip the header and footer:

def visitor_body(text, cm, tm, fontDict, fontSize):
    y = tm[5]
    if y > 50 and y < 720:
        parts.append(text)

This works for some PDF files but not for others. Is there a way to determine the header and footer size of each PDF when it is first read, and use those values instead of the constants 50 and 720? I noticed that some solutions exist for other libraries, like this post, but I am interested in learning how to do this with PyPDF2.

from PyPDF2 import PdfReader

# Replace the file name below to try different files
reader = PdfReader("GeoBase_NHNC1_Data_Model_UML_EN.pdf")

page = reader.pages[3]

parts = []


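def visitor_body(text, cm, tm, fontDict, fontSize):
    # tm is the text matrix; tm[5] is its vertical (y) translation
    y = tm[5]
    # Keep only text between the assumed footer (y <= 50) and header (y >= 720)
    if 50 < y < 720:
        parts.append(text)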


page.extract_text(visitor_text=visitor_body)
text_body = "".join(parts)

print(text_body)
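
One possible way to replace the fixed 50 and 720 is to estimate the bounds per file from text that repeats at the same height on most pages. The following is only a rough sketch under that assumption, not a built-in PyPDF2 feature: the helper names `collect_fragments` and `estimate_body_bounds` and the `min_share` parameter are hypothetical, and it assumes a PyPDF2 version that provides `PdfReader` and `page.mediabox`. Digits are masked so that page numbers such as "Page 3" and "Page 4" count as the same line.

```python
import re
from collections import Counter


def collect_fragments(pdf_path):
    """Return [(page_height, [(y, text), ...]), ...] for every page of the PDF."""
    # Imported here so estimate_body_bounds can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        fragments = []

        # The default argument binds this page's list to the visitor.
        def visitor(text, cm, tm, font_dict, font_size, out=fragments):
            if text.strip():
                out.append((tm[5], text))

        page.extract_text(visitor_text=visitor)
        pages.append((float(page.mediabox.height), fragments))
    return pages


def estimate_body_bounds(pages, min_share=0.6):
    """Guess (footer_top, header_bottom) from text repeated across pages."""
    counts = Counter()
    for _, fragments in pages:
        # One vote per page for each (rounded y, digit-masked text) pair.
        seen = {
            (round(y), re.sub(r"\d+", "#", text.strip()))
            for y, text in fragments
        }
        counts.update(seen)

    # A line must appear on at least two pages and on min_share of all pages.
    threshold = max(2, min_share * len(pages))
    repeated_ys = [y for (y, _), n in counts.items() if n >= threshold]

    height = max(h for h, _ in pages)
    mid = height / 2
    # Highest repeated line in the lower half, lowest one in the upper half.
    footer_top = max((y for y in repeated_ys if y < mid), default=0)
    header_bottom = min((y for y in repeated_ys if y >= mid), default=height)
    return footer_top, header_bottom
```

With these values, the condition in `visitor_body` could become `footer_top < y < header_bottom`, where `footer_top, header_bottom = estimate_body_bounds(collect_fragments("GeoBase_NHNC1_Data_Model_UML_EN.pdf"))`. If nothing repeats (for example, in a single-page file), the sketch falls back to the full page height, so no text is filtered out.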