How to return all extracted text from multiple PDFs in python?

220 Views Asked by Swechha Ghimire At 19 July 2020 at 15:07

This is my code. So far, it'll print all the content of the pdfs to the pages variable. However, I cannot seem to return the same extracted text. I've been testing it by pulling information from random pdfs and placing it in the folder I'm calling. How do I get it to return the extracted text the same way it prints it?

import os
import PyPDF2 as pdf
import pandas as pd

def scan_files(root):
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                for p in range(0, numPages):
                        pages = ''
                        page = pdf.getPage(p)
                        pages += page.extractText()
                        pages = pages.replace('\n', '')
                        #print(pages)
                        return pages

Original Q&A

There are 1 best solutions below

iustin On 19 July 2020 at 15:28 BEST ANSWER

Printing the text will allow the last for loop to iterate(using the "print(pages)" you mentioned). However, returning pages will terminate the loops running and will spit out the text it covered so far. Try using something like:

def scan_files(root):
    pdftext = ''
    for path, subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.pdf'):
                #print(name)
                pdf = PyPDF2.PdfFileReader(os.path.join(path,name))
                numPages = pdf.getNumPages()
                
                pages = ''                    

                for p in range(0, numPages):
                    page = pdf.getPage(p)
                    pages += page.extractText()
                    pages = pages.replace('\n', '')

                pdftext += pages

    return pdftext

How to return all extracted text from multiple PDFs in python?

There are 1 best solutions below

Related Questions in PYTHON

Related Questions in PDF

Related Questions in MACHINE-LEARNING

Related Questions in NLP

Related Questions in PDF-SCRAPING

Trending Questions

Popular # Hahtags

Popular Questions