I am currently trying to extract the text from this whole PDF. Extracting the text from a single page works properly, but when I try to extract the whole PDF, I get this error:
Traceback (most recent call last):
  File "D:/PDF_extract_1/main.py", line 35, in <module>
    extract_whole_pdf()
  File "D:/PDF_extract_1/main.py", line 26, in extract_whole_pdf
    final = final + "\n" + data
TypeError: can only concatenate str (not "NoneType") to str
For reference, this is the code I use when extracting from single pages:
import pdfplumber

def extract_first():
    pdf = pdfplumber.open("pdftest2.pdf")
    page = pdf.pages[6]  # just for example; index 6 is the seventh page (zero-based)
    text = page.extract_text()
    print("First page data : {}".format(text))
    with open("pdf_pages.txt", "w", encoding='utf-8') as f:
        f.write(text)
    pdf.close()
and this is the code I use to extract the whole PDF:
def extract_whole_pdf():
    pdf = pdfplumber.open("pdftest2.pdf")
    n = len(pdf.pages)
    final = ""
    for page in range(n):
        data = pdf.pages[page].extract_text()
        final = final + "\n" + data
    print("Whole document data : {}".format(final))
    with open("pdf_extract.txt", "w", encoding='utf-8') as f:
        f.write(final)
    pdf.close()
I notice this question has been asked a lot, but the existing answers don't seem to apply to my problem. One of the questions had a similar error, but its situation was different from mine.
The problem seems to be the method extract_text() returning None when it finds an empty page. Concatenating None to a str is what raises the TypeError. You can solve this by testing the returned data before concatenating it.
As a side note, I also recommend using f-strings (available since Python 3.6) for string formatting, as they are the more modern and readable approach.