Trying to extract text from PDF gives me this error: "TypeError: can only concatenate str (not "NoneType") to str"

I am trying to extract the text from an entire PDF. Extracting the text from individual pages works properly, but when I try to extract the whole PDF, I get this error:

Traceback (most recent call last):
  File "D:/PDF_extract_1/main.py", line 35, in <module>
    extract_whole_pdf()
  File "D:/PDF_extract_1/main.py", line 26, in extract_whole_pdf
    final = final + "\n" + data
TypeError: can only concatenate str (not "NoneType") to str

For reference, this is the code I use when extracting from single pages:

import pdfplumber

def extract_first():
    pdf = pdfplumber.open("pdftest2.pdf")
    page = pdf.pages[6]  # just for example; pages is zero-indexed, so index 6 is the 7th page
    text = page.extract_text()

    print("Page data : {}".format(text))

    with open("pdf_pages.txt", "w", encoding='utf-8') as f:
        f.write(text)

    pdf.close()

and this is the code I use to extract the whole PDF:

def extract_whole_pdf():
    pdf = pdfplumber.open("pdftest2.pdf")
    n = len(pdf.pages)

    final = ""
    for page in range(n):
        data = pdf.pages[page].extract_text()
        final = final + "\n" + data

    print("Whole document data : {}".format(final))

    with open("pdf_extract.txt", "w", encoding='utf-8') as f:
        f.write(final)

    pdf.close()

I know similar questions have been asked many times, but the answers don't seem to apply to my problem. One question had a similar error, but the situation was different from mine.

There is 1 solution below

The problem is most likely that extract_text() returns None for a page with no extractable text, such as a blank or image-only page, and None cannot be concatenated to a str. You can solve this by checking the returned data before concatenating:

import pdfplumber

def extract_whole_pdf():
    # The context manager closes the PDF even if extraction raises an error
    with pdfplumber.open("pdftest2.pdf") as pdf:
        final = ""
        for page in pdf.pages:
            data = page.extract_text()
            # Skip pages where no text was found (None or an empty string)
            if data:
                final = final + "\n" + data

    print(f"Whole document data : {final}")

    with open("pdf_extract.txt", "w", encoding='utf-8') as f:
        f.write(final)

As a side note, I also recommend using f-strings (available since Python 3.6) for string formatting, as they are more concise and easier to read than str.format().
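
If you also want to know which pages came back empty, here is a minimal sketch of a variation on the same check. The helper name join_page_texts and the path and out_path parameters are my own inventions for illustration, not part of pdfplumber. It collects the page texts in a list and joins them once at the end, which also avoids the leading newline that repeated concatenation produces:

```python
def join_page_texts(pages):
    """Join the text of every page, skipping pages with no extractable text.

    `pages` is any iterable of objects with an extract_text() method,
    such as pdf.pages from pdfplumber. Returns a tuple of
    (joined_text, empty_page_numbers), with page numbers starting at 1.
    """
    texts = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        data = page.extract_text()
        if data:  # false for both None and ""
            texts.append(data)
        else:
            empty_pages.append(number)
    return "\n".join(texts), empty_pages


def extract_whole_pdf(path="pdftest2.pdf", out_path="pdf_extract.txt"):
    # Imported here so join_page_texts() can be used without pdfplumber
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        final, empty_pages = join_page_texts(pdf.pages)

    print(f"Pages with no extractable text: {empty_pages}")

    with open(out_path, "w", encoding="utf-8") as f:
        f.write(final)
    return final
```

If the list of empty pages is longer than you expect, those pages may be scanned images, in which case no amount of checking will recover text from them without OCR.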