PyPDF loader could not convert string to float

I'm trying to load multiple documents with langchain's PyPDFLoader in the usual way:

import os

from langchain.document_loaders import PyPDFLoader

documents = []
for file in os.listdir("docs"):
    # Only process the PDF files in the docs directory
    if file.endswith(".pdf"):
        pdf_path = os.path.join("docs", file)
        loader = PyPDFLoader(pdf_path)
        # load() returns one Document per page
        documents.extend(loader.load())

My PDFs contain a lot of images and symbols in the text (they are manuals for technical devices). The file names contain spaces and numbers.

The code finishes without throwing an error, but the kernel outputs messages like these:

could not convert string to float: '.0.038' : FloatObject (b'.0.038') invalid; use 0.0 instead

NumberObject(b'--') invalid; use 0 instead

Should I ignore these messages, or is there another way to process these files?

There is 1 answer below

The message comes from this part of pypdf:

class FloatObject(float, PdfObject):
    def __new__(
        cls, value: Union[str, Any] = "0.0", context: Optional[Any] = None
    ) -> "FloatObject":
        try:
            value = float(str_(value))
            return float.__new__(cls, value)
        except Exception as e:
            # If this isn't a valid decimal (happens in malformed PDFs)
            # fallback to 0
            logger_warning(
                f"{e} : FloatObject ({value}) invalid; use 0.0 instead", __name__
            )
            return float.__new__(cls, 0.0)

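These are warnings, not errors: the PDF contains malformed numbers, and pypdf substitutes 0.0 (or 0 for a NumberObject) and carries on. If you cannot or do not want to fix the PDF manually, there is not much else you can do.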
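
If you do want to repair the affected files, it helps to know which PDFs trigger the warnings. Below is a minimal sketch, assuming a recent pypdf whose loggers live under the name "pypdf" (the older PyPDF2 package would use "PyPDF2"). The helper name collect_pypdf_warnings is hypothetical and not part of pypdf or langchain:

```python
import logging
from contextlib import contextmanager


class _ListHandler(logging.Handler):
    """Logging handler that stores warning messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def collect_pypdf_warnings(logger_name="pypdf"):
    """Collect warnings logged under logger_name while the block runs.

    Hypothetical helper; "pypdf" as the logger name is an assumption
    about the installed package.
    """
    logger = logging.getLogger(logger_name)
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        logger.removeHandler(handler)


# Usage with the loop from the question (not run here):
# with collect_pypdf_warnings() as msgs:
#     pages = PyPDFLoader(pdf_path).load()
# if msgs:
#     print(pdf_path, "->", len(msgs), "warnings")
```

The handler only records the messages; they are still printed as before unless you also change the logger's level.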

See "Exceptions, Warnings, and Log messages" in the pypdf documentation.
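
If you decide the substituted zeros are acceptable, the messages can be silenced through Python's standard logging module, which pypdf uses to emit them. Below is a short sketch, assuming the installed package is pypdf, so the logger name is "pypdf" (for the older PyPDF2 package it would be "PyPDF2"):

```python
import logging

# pypdf logs through loggers named after its modules
# (for example "pypdf.generic._base"), which are children of "pypdf".
# Raising the parent's level hides warnings but keeps errors visible.
logging.getLogger("pypdf").setLevel(logging.ERROR)
```

Run this once before creating the loaders. It only hides the messages; the malformed values are still replaced with zeros.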