We have a couple of SGE clusters running various versions of RHEL at my work, and we're testing a new one running a newer release. On the old cluster ("CentOS release 5.4"), I'm able to submit a job like the following and it runs fine:
echo "java -Xms8G -Xmx8G -jar blah.jar ..." |qsub ... -l h_vmem=10G,virtual_free=10G ...
On the new cluster ("CentOS release 6.2 (Final)"), a job with those parameters fails by running out of memory, and I have to raise the limit to h_vmem=17G for it to succeed. The new nodes have about 3x the RAM of the old nodes, and in testing I'm only submitting a couple of jobs at a time.
On the old cluster, if I set -Xms/-Xmx to N, I could use N+1 or so for h_vmem. On the new cluster, jobs crash unless I set h_vmem to 2N+1 or so.
I wrote a tiny Perl script that does nothing but progressively consume memory and periodically print the amount used, until it either crashes or reaches a limit. The h_vmem parameter makes it crash at the expected memory usage.
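For reference, a minimal sketch of that kind of test script (not my exact script; the chunk size and default cap are arbitrary):

#!/usr/bin/env perl
# Grow memory usage in fixed-size chunks and report progress,
# so you can see where a limit kicks in.
use strict;
use warnings;

my $chunk_mb = 100;                        # grow by 100 MB per step
my $limit_mb = @ARGV ? $ARGV[0] : 20_000;  # optional cap from the command line
my @blocks;
my $used = 0;

while ($used < $limit_mb) {
    push @blocks, 'x' x ($chunk_mb * 1024 * 1024);  # allocate and hold a 100 MB string
    $used += $chunk_mb;
    print "allocated ~$used MB\n";
    sleep 1;
}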
I've tried multiple versions of the JVM (1.6 and 1.7). If I omit h_vmem entirely, the job runs, but then there's nothing to stop a runaway job from exhausting a node's memory.
I've googled and found others reporting similar issues, but no resolutions.
The problem here appears to be a combination of the following factors:

- CentOS/RHEL 6 ships a glibc (2.12) with the per-thread malloc arenas introduced in glibc 2.10, which can greatly inflate a process's virtual memory footprint (see the first and third links below).
- The JVM starts many threads (parallel GC threads among them), and each thread can get its own arena.
- SGE enforces h_vmem against the virtual address size, not resident memory.

To fix the problem I've used a combination of the following:
export MALLOC_ARENA_MAX=1
java -XX:ParallelGCThreads=1 ...
qsub -pe pthreads 2
Note that it's not clear that 1 is the right value for MALLOC_ARENA_MAX, just that low values seem to work well in my testing.
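Putting those together, a submission on the new cluster looks roughly like this (the jar, resource values, and the pthreads PE name are placeholders carried over from my examples above; your parallel environment name may differ):

echo "export MALLOC_ARENA_MAX=1; java -XX:ParallelGCThreads=1 -Xms8G -Xmx8G -jar blah.jar ..." | qsub -pe pthreads 2 -l h_vmem=10G,virtual_free=10G ...

MALLOC_ARENA_MAX caps the number of per-thread glibc malloc arenas, -XX:ParallelGCThreads caps the GC threads that would otherwise each claim an arena, and requesting two slots via -pe pthreads 2 gives the job extra h_vmem headroom, since SGE typically multiplies the per-slot h_vmem by the number of slots requested.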
Here are the links that led me to these conclusions:
https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en
What would cause a java process to greatly exceed the Xmx or Xss limit?
http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-glibc/