First of all, I'd like to let you know that I'm new to coding. The problem I'm dealing with right now is the following. I developed a small Python program to parse tables in PDF files. For that purpose I use the tabula Python module (pip install tabula-py), which requires Java 8+ and Python 3.8+. Below is the code of the program.
import tabula
pdf_path = "data.pdf"
dfs = tabula.read_pdf(pdf_path, pages="1")
print(len(dfs))
dfs[0].to_csv("first.csv")
The data.pdf file contains a fairly large table that stretches over 15 pages. As you can see, the script reads only one page. However, I ran into a problem related to the Java code cache. When running the script in the terminal, I get the following messages.
Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
CodeCache: size=131072Kb used=7204Kb max_used=7204Kb free=123867Kb
bounds [0x0000000300000000, 0x0000000300720000, 0x0000000308000000]
total_blobs=2538 nmethods=2043 adapters=410
compilation: disabled (not enough contiguous free space left)
/Users/mago/Work/pdf/venv/lib/python3.12/site-packages/tabula/io.py:1045: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
df[c] = pd.to_numeric(df[c], errors="ignore")
I would be grateful for any insights on how to deal with or work around this issue. Thank you in advance!
P.S. At first I tried to parse the whole PDF. Then I limited the reading to one page at a time, but the issue is still there.
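For reference, here is a minimal sketch of the page-by-page approach mentioned in the P.S., combined with the flag that the JVM warning itself suggests. It relies on the java_options and pages arguments of tabula.read_pdf. The helper names and the 256 MB value are illustrative assumptions, and it is untested whether a larger code cache actually makes the warning go away.

```python
def build_java_options(code_cache_mb):
    """Return JVM flags requesting a larger code cache (hypothetical helper)."""
    # The flag name comes from the JVM warning; the size is an assumed value.
    return [f"-XX:ReservedCodeCacheSize={code_cache_mb}m"]


def read_pages_one_by_one(pdf_path, first_page, last_page, code_cache_mb=256):
    """Read each page separately and collect every extracted table."""
    import tabula  # imported here so the helper above works without Java

    tables = []
    for page in range(first_page, last_page + 1):
        # read_pdf returns a list of DataFrames for the requested page
        dfs = tabula.read_pdf(
            pdf_path,
            pages=page,
            java_options=build_java_options(code_cache_mb),
        )
        tables.extend(dfs)
    return tables
```

Calling read_pages_one_by_one("data.pdf", 1, 15) would go through all 15 pages of the table, after which each DataFrame in the returned list could be written out with to_csv as in the script above.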