I am looking for a method to extract only the core text of a scientific paper. The papers are structured in paragraphs, and I only want to keep the body text without any email addresses, websites, tables, or pictures. My goal is to create a clean txt file for a language model.
Which methods are available to filter the data (e.g. by font size, by searching for keywords, or by including spaCy, etc.)?
Thank you in advance!
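One simple post-processing option, independent of the PDF library, is to strip obvious noise such as email addresses and URLs with regular expressions after extraction. A minimal sketch follows; the patterns are illustrative assumptions and would need tuning against real papers:

```python
import re

def clean_text(text: str) -> str:
    """Remove common non-body noise from extracted text (illustrative patterns)."""
    # Drop email addresses (anything token-like around an @)
    text = re.sub(r"\S+@\S+\.\S+", "", text)
    # Drop URLs starting with http(s):// or www.
    text = re.sub(r"(https?://\S+|www\.\S+)", "", text)
    # Collapse the runs of spaces left behind by the removals
    text = re.sub(r"[ \t]+", " ", text)
    return text

print(clean_text("Contact: jane.doe@example.org or see https://example.org/paper for details."))
```

This will not touch tables or figure captions, which usually survive text extraction as plain lines, so it is only a first cleaning pass.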
```python
from langchain.document_loaders import PyPDFLoader  # for loading the PDFs
import glob
import os

# Path to the folder containing the PDFs
folder_path = "C:/Users/faenkaya/Desktop/Language Models/documents/Scientific Data eng"

# Path to the output txt file
output_file = "C:/Users/faenkaya/Desktop/Language Models/documents/Scientific Data eng/Full_text.txt"

# Write the txt file
with open(output_file, "w", encoding="utf-8") as file:
    # Loop over each PDF in the folder
    for file_path in glob.glob(os.path.join(folder_path, "*.pdf")):
        # PyPDFLoader opens the file itself, so no extra open() is needed
        loader = PyPDFLoader(file_path)
        pages = loader.load_and_split()
        text = ""
        for page in pages:
            text += page.page_content
            text += "\n"
        print(file_path)
        # Append the extracted text to the output file
        file.write(text)
        file.write("\n")
```
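Regarding the keyword approach mentioned in the question, one possible follow-up step is to split the extracted text into paragraphs and drop those that start like captions or are too short to be body text. This is only a sketch under stated assumptions: the prefix list and the length threshold are hypothetical and would need adjusting per journal layout:

```python
# Keyword-based paragraph filter (prefix list and threshold are assumptions,
# not taken from any particular journal's layout)
CAPTION_PREFIXES = ("Table", "Figure", "Fig.", "Supplementary")

def keep_paragraph(paragraph: str) -> bool:
    """Heuristic: keep a paragraph unless it looks like a caption or stray line."""
    stripped = paragraph.strip()
    if len(stripped) < 40:  # likely a header, page number, or stray line
        return False
    if stripped.startswith(CAPTION_PREFIXES):
        return False
    return True

def filter_body(text: str) -> str:
    """Keep only paragraphs that pass the heuristic above."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(p for p in paragraphs if keep_paragraph(p))
```

Note that PyPDFLoader does not reliably preserve blank lines between paragraphs, so in practice you may need a different paragraph-splitting rule, or a layout-aware extractor (e.g. one that exposes font sizes) to separate body text from captions.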