The CodeCache issue while parsing a pdf using the tabula module


First of all, I should mention that I'm new to coding. I wrote a small Python program to parse tables from PDF files. For that purpose I use the tabula module (pip install tabula-py), which requires Java 8+ and Python 3.8+. The code is below.

import tabula
pdf_path = "data.pdf"

dfs = tabula.read_pdf(pdf_path, pages="1")

print(len(dfs))
dfs[0].to_csv("first.csv")

The data.pdf file contains a fairly large table that stretches over 15 pages. As you can see, the script reads only one page at a time. Even so, I ran into a problem related to the Java code cache. When I run the script in the terminal, I get the following messages.

Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=131072Kb used=7204Kb max_used=7204Kb free=123867Kb
 bounds [0x0000000300000000, 0x0000000300720000, 0x0000000308000000]
 total_blobs=2538 nmethods=2043 adapters=410
 compilation: disabled (not enough contiguous free space left)
/Users/mago/Work/pdf/venv/lib/python3.12/site-packages/tabula/io.py:1045: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
  df[c] = pd.to_numeric(df[c], errors="ignore")

I would be grateful for any insights on how to deal with or work around the issue. Thank you in advance!

P.S. At first I tried to parse the whole PDF. Then I limited the read to one page at a time, but the issue is still there.
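
For reference, here is a minimal sketch of the page-by-page approach combined with the flag that the JVM warning itself suggests. The helper names (java_options_for, read_pages), the 256 MB value, and the output file names are assumptions made for illustration. The only tabula-py feature relied on is the java_options argument of tabula.read_pdf. Since the log reports only about 7 MB of the 128 MB code cache in use, a larger cache may well not change anything, so this is an experiment rather than a confirmed fix. Depending on how tabula-py starts Java, the option may also only take effect on the first read_pdf call in a process.

```python
def java_options_for(code_cache_mb=256):
    """Build the JVM flag named in the HotSpot warning (the size is a guess)."""
    return [f"-XX:ReservedCodeCacheSize={code_cache_mb}m"]


def read_pages(pdf_path, first_page, last_page, reader=None, code_cache_mb=256):
    """Read pages one at a time and return {page_number: [DataFrame, ...]}."""
    if reader is None:
        # Imported lazily; requires tabula-py and a Java runtime.
        import tabula
        reader = tabula.read_pdf

    options = java_options_for(code_cache_mb)
    tables = {}
    for page in range(first_page, last_page + 1):
        # Each call extracts the tables found on a single page.
        tables[page] = reader(pdf_path, pages=page, java_options=options)
    return tables


# Hypothetical usage for the 15-page data.pdf:
#
# for page, dfs in read_pages("data.pdf", 1, 15).items():
#     for i, df in enumerate(dfs):
#         df.to_csv(f"page{page}_table{i}.csv", index=False)
```

The reader parameter exists only so the loop can be tried without Java installed. In normal use it is left as None and tabula.read_pdf is called.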
