Tabula font error in reading table from PDF

I have seen many people report similar issues, but not this exact one, and most of those threads have no solution I can apply.

I am getting the warning below from tabula. When I inspect the result or check its length, nothing has been extracted. Here is the message:

Got stderr: Apr 12, 2022 5:34:12 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'Helvetica-Oblique' for 'CenturyGothic-Italic'

All I am using is:

    table = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)

Any ideas?

There is 1 solution below

The correct approach would be to install the missing fonts, as recommended in the answer to "Using fallback font while parsing file content using pdfbox - can it cause mistakes?"

However, for my application, which reads PDF files inside a Docker container, installing extra fonts in the OS seemed unnecessary. What you see in the logs is only a warning: PDFBox substitutes a fallback font, and in my case the missing fonts did not affect the parsing of the PDF.

To keep these warnings out of the tabula-py output, I just added silent=True to the arguments of the method call, as follows:

import tabula

# silent=True suppresses the Java-side messages, including the fallback-font warning
table_df = tabula.read_pdf(
    input_path=pdf_file,
    output_format="dataframe",
    pages="all",
    silent=True,
)
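
If the result is still empty after silencing the warning, the extraction mode may be worth checking as well. The following is only a sketch under that assumption, not a confirmed fix for the question: it retries the same call with tabula-py's lattice and stream options and returns the first non-empty result. The helper name read_tables and its reader parameter are hypothetical and exist only so the logic can be run without Java; by default the helper calls tabula.read_pdf.

```python
from typing import Any, Callable, List, Optional


def read_tables(
    pdf_path: str,
    pages: Any = "all",
    reader: Optional[Callable[..., List[Any]]] = None,
) -> List[Any]:
    """Try several extraction modes and return the first non-empty result."""
    if reader is None:
        # Imported lazily so the helper can be exercised without Java installed.
        import tabula

        reader = tabula.read_pdf

    # lattice=True targets tables drawn with ruling lines; stream=True targets
    # tables whose columns are separated by whitespace.
    attempts = ({}, {"lattice": True}, {"stream": True})
    for extra in attempts:
        tables = reader(
            pdf_path, pages=pages, multiple_tables=True, silent=True, **extra
        )
        # With multiple_tables=True the reader returns a list of tables;
        # drop the ones that contain no rows.
        non_empty = [t for t in tables if len(t) > 0]
        if non_empty:
            return non_empty
    return []
```

Calling read_tables(pdf_path, pages=page) returns a list of DataFrames, or an empty list if none of the modes finds a table.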