Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)?
I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as follows, the errors go away:
C:>java -Xmx1G -jar tika-app-2.1.0.jar
The -Xmx1G specifies a maximum heap size of 1 GByte (much larger than the default).
I've seen several answers for other languages, but none specific to Python with tika-python.
I've tried:
os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
from tika import parser as tika_parser
and:
def main():
global MODEL_LIST
os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
start_time = time.time()
... rest of code ...
and from the Windows command line:
C:\<path>\findEm>set TIKA_JAVA_ARGS="-Xmx1G"
C:\<path>\findEm>python3 findEmv1.52.py
All 3 methods result in the same error, something like
2021-10-19 14:43:55,782 [MainThread ] [WARNI] Tika server returned status: 500
I think the main problem is that the Java tika process is already running when I try to change the maximum heap size. Somehow I need to kill that process, set the maximum heap size, and restart the Java tika server. How?
Your suspicion about the process already running is correct. Leaving tika running in the background means that when your script starts, it doesn't restart the Java process with the new flag, so there is no heap increase.
As to solving that issue, we can do it completely in Python on Windows with the help of psutil: find the Java process that is running the tika server, terminate it, and then set the heap flag before the first parse call so that tika-python starts a fresh server with it.
I'm directly appending to tika_server.TikaJavaArgs, as the environment variable is parsed when tika_server is imported. You can replace that with setting the environment variable if you delay the import (as in the first attempt in the question).
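The kill-then-restart approach described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (is_tika_server, kill_running_tika, parse_with_heap) are made up for illustration, the "java process whose command line mentions tika" test is only a heuristic, and it assumes that tika-python reads TikaJavaArgs when it starts the server.

```python
def is_tika_server(name, cmdline):
    """Heuristic (an assumption): a Java process whose command line mentions tika."""
    is_java = "java" in (name or "").lower()
    mentions_tika = any("tika" in arg.lower() for arg in (cmdline or []))
    return is_java and mentions_tika


def kill_running_tika(timeout=10):
    """Terminate leftover Tika server processes; return how many were stopped."""
    import psutil  # third-party (pip install psutil), imported lazily

    stopped = 0
    for proc in psutil.process_iter(["name", "cmdline"]):
        try:
            if is_tika_server(proc.info["name"], proc.info["cmdline"]):
                proc.terminate()
                proc.wait(timeout=timeout)
                stopped += 1
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired):
            continue  # process already gone, not ours, or slow to exit
    return stopped


def parse_with_heap(path, heap_flag="-Xmx1G"):
    """Stop any old server, add the heap flag, then parse one file."""
    kill_running_tika()
    # tika-python is imported locally so the helpers above work without it.
    from tika import tika as tika_server
    from tika import parser as tika_parser

    # Append the flag only once, so repeated calls do not duplicate it.
    if heap_flag not in tika_server.TikaJavaArgs:
        tika_server.TikaJavaArgs = f"{tika_server.TikaJavaArgs} {heap_flag}".strip()
    return tika_parser.from_file(path)
```

Calling parse_with_heap on a large Word file (the path is whatever your script already uses) should then make tika-python start a new server with the larger heap on the first request.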
You can definitely improve this (for instance, by checking whether the args are already the same and skipping the termination if they are), but this should get you going again at least.
Additionally, you should look into adding a call to tika.tika.killServer() at the end of your script to stop the server when you're done with it.
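One way to make sure that shutdown call always runs, even when parsing raises, is a try/finally wrapper. The sketch below is only an illustration: run_then_stop_tika and its kill parameter are hypothetical names, and the parameter exists only so the shutdown step can be swapped out or exercised without a running server.

```python
def run_then_stop_tika(work, kill=None):
    """Run work(), then stop the Tika server whether or not work() failed."""
    try:
        return work()
    finally:
        if kill is None:
            # Default to tika-python's own shutdown function.
            from tika import tika as tika_server
            kill = tika_server.killServer
        kill()
```

You would pass your script's main function as work, so the server is stopped once at the very end rather than after every file.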