I'm trying to load multiple documents with LangChain's PyPDFLoader in the usual way:
import os
from langchain.document_loaders import PyPDFLoader

documents = []
for file in os.listdir("docs"):
    if file.endswith(".pdf"):
        pdf_path = "./docs/" + file
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())
My PDFs contain a lot of images and symbols in the text (they are manuals for technical devices). The file names contain spaces and numbers.
The code finishes without raising an error, but the kernel prints messages like these:
could not convert string to float: '.0.038' : FloatObject (b'.0.038') invalid; use 0.0 instead
NumberObject(b'--') invalid; use 0 instead
Should I ignore these messages, or is there another way to process these files?
The message comes from this part of pypdf. There is not a lot you can do if you cannot or do not want to fix the PDF manually.
See "Exceptions, Warnings, and Log messages" in the pypdf documentation.
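If you decide the messages are safe to ignore (as the text of each message says, pypdf substitutes 0.0 or 0 for the invalid number and carries on), one option is to raise the log level of the library's logger before loading. This is a minimal sketch, assuming the messages are emitted through Python's standard logging module under a logger named "pypdf"; older installations may use "PyPDF2" instead. The helper name quiet_pdf_warnings is made up for this example and is not part of pypdf or LangChain.

```python
import logging


def quiet_pdf_warnings(logger_name: str = "pypdf") -> logging.Logger:
    """Hide WARNING-level messages from the given logger and its children.

    The default logger name "pypdf" is an assumption; pass "PyPDF2" if your
    environment still uses the older package.
    """
    logger = logging.getLogger(logger_name)
    # Only ERROR and above will be emitted; warnings are dropped.
    logger.setLevel(logging.ERROR)
    return logger


# Call this once, before running the loading loop.
quiet_pdf_warnings()
```

This only hides the messages. The malformed numbers in the PDFs are still replaced with defaults, so it is worth spot-checking the extracted text of a few affected manuals before silencing the warnings for good.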