Java performance issue on Oracle Linux


I'm running a very "simple" JMH test:

@Fork(value = 1, jvmArgs = {
        "--illegal-access=permit",
        "-Xms10G", "-Xmx10G",
        "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
        "-XX:DisableIntrinsic=_currentTimeMillis,_nanoTime",
        "-XX:ActiveProcessorCount=7",
        "-XX:+UseNUMA",
        "-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC",
        "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10",
        "-XX:+UsePerfData",
        "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M"
})
@Benchmark
public String generateRandom() {
    return UUID.randomUUID().toString();
}

Maybe it's not that simple, because it uses a random generator, but I see the same issue in all my other Java tests.

On my home desktop

Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads (hyperthreading enabled), 64 GB RAM, Ubuntu 20.04.2 LTS (Focal Fossa)
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Performance with 7 threads:

Benchmark                                            Mode  Cnt        Score       Error   Units
RulesBenchmark.generateRandom                       thrpt    5  1312295.357 ± 27853.707   ops/s

Flame graph (async-profiler), 7 threads, home desktop: [image]

I have an issue on Oracle Linux

Linux  5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyperthreading disabled; the result is the same with it enabled, which gives 112 CPU threads), 1 TB RAM, Oracle Linux Server 8.4

On this machine I get about half the performance, even when I increase the number of threads.

With 1 thread, performance is very good:

Benchmark                                            Mode  Cnt        Score      Error   Units
RulesBenchmark.generateRandom                       thrpt    5  2377471.113 ± 8049.532   ops/s

Flame graph (async-profiler), 1 thread: [image]

But with 7 threads:

Benchmark                                            Mode  Cnt       Score       Error   Units
RulesBenchmark.generateRandom                       thrpt    5  688612.296 ± 70895.058   ops/s

Flame graph (async-profiler), 7 threads: [image]

Maybe it's a NUMA issue, because there are 2 sockets but the system is configured with only 1 NUMA node. Output of numactl --hardware:

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1030835 MB
node 0 free: 1011029 MB
node distances:
node   0 
  0:  10 

But after taking some CPU threads offline with:

for i in {12..55}
do
  # take CPU $i offline
  echo '0' | sudo tee /sys/devices/system/cpu/cpu$i/online
done

performance improved only slightly.

This is just a very "simple" test. On a complex test with real code it's even worse: it spends a lot of time in .annobin___pthread_cond_signal.start

I also deployed a Vagrant image with the same Oracle Linux and kernel version on my home desktop and ran it with 10 CPU threads. Performance was nearly the same (~1M ops/s) as on my desktop itself. So it's not about the OS or the kernel, but about some configuration.

I tested with several JDK versions and vendors (JDK 11 and above). There is a very small performance difference when using OpenJDK 11 from the YUM distribution, but nothing significant.

Can you suggest anything? Thanks in advance.


There is 1 best solution below


In essence, your benchmark tests the throughput of SecureRandom. The default implementation is synchronized (more precisely, the default implementation mixes the input from /dev/urandom and the above provider).

The paradox is, more threads result in more contention, and thus lower overall performance, as the main part of the algorithm is under a global lock anyway. Async-profiler indeed shows that the bottleneck is the synchronization on a Java monitor: __lll_unlock_wake, __pthread_cond_wait, __pthread_cond_signal - all come from that synchronization.
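To see the effect without JMH, a rough sketch like the following may help. It is not a substitute for a proper benchmark (no forking, only a crude warm-up), and the class name `UuidContentionDemo` and the operation counts are made up for illustration. It runs a fixed number of `UUID.randomUUID()` calls per thread and reports the aggregate throughput. On an affected machine I would expect the number to stall or drop, rather than grow, as threads are added.

```java
import java.util.UUID;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not part of the original benchmark.
final class UuidContentionDemo {

    /** Starts `threads` workers that each generate `opsPerThread` UUIDs; returns total ops/s. */
    static double throughput(int threads, int opsPerThread) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same moment
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int n = 0; n < opsPerThread; n++) {
                    UUID.randomUUID().toString();
                }
            });
            workers[i].start();
        }
        long t0 = System.nanoTime();
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        double seconds = (System.nanoTime() - t0) / 1e9;
        return (double) threads * opsPerThread / seconds;
    }

    public static void main(String[] args) {
        throughput(1, 200_000); // crude warm-up, result ignored
        for (int threads : new int[] {1, 2, 4, 7}) {
            System.out.printf("%d thread(s): %.0f ops/s%n", threads, throughput(threads, 200_000));
        }
    }
}
```

Comparing the printed numbers on the desktop and on the server should show whether the drop is specific to one machine.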

The contention overhead definitely depends on the hardware, the firmware, and the OS configuration. Instead of trying to reduce this overhead (which can be hard: some day yet another security patch will arrive that makes syscalls 2x slower, for example), I'd suggest getting rid of the contention in the first place.
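One way to picture "no contention" at the application level is a sketch like the one below, which gives each thread its own generator and builds a version 4 UUID by hand. This is my own illustration, not something taken from the question: the class name `PerThreadUuid` is hypothetical, and the choice of `SHA1PRNG` rests on the assumption that it keeps its state per instance, whereas the default Linux implementation, as far as I understand, shares state between instances. Whether `SHA1PRNG` meets your security requirements is a separate question.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical helper: per-thread generators instead of one shared instance.
final class PerThreadUuid {

    // Assumption: "SHA1PRNG" holds its state per instance, so separate
    // instances in separate threads never wait on the same monitor.
    private static final ThreadLocal<SecureRandom> RNG = ThreadLocal.withInitial(() -> {
        try {
            return SecureRandom.getInstance("SHA1PRNG");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA1PRNG is not available", e);
        }
    });

    /** Builds a random (version 4, IETF variant) UUID from the calling thread's generator. */
    static UUID randomUuid() {
        byte[] b = new byte[16];
        RNG.get().nextBytes(b);
        b[6] &= 0x0f; // clear the version bits
        b[6] |= 0x40; // set version 4
        b[8] &= 0x3f; // clear the variant bits
        b[8] |= 0x80; // set the IETF variant
        long msb = 0;
        long lsb = 0;
        for (int i = 0; i < 8; i++) {
            msb = (msb << 8) | (b[i] & 0xff);
        }
        for (int i = 8; i < 16; i++) {
            lsb = (lsb << 8) | (b[i] & 0xff);
        }
        return new UUID(msb, lsb);
    }
}
```

The benchmark body would then call `PerThreadUuid.randomUuid().toString()` instead of `UUID.randomUUID().toString()`. That changes the calling code, whereas removing the contention underneath `UUID.randomUUID()` itself needs a change at the JDK level.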

This can be achieved by installing a different, non-blocking SecureRandom provider, as shown in this answer. I won't recommend a particular SecureRandomSpi, as the choice depends on your specific requirements (throughput/scalability/security). I will just mention that an implementation can be based on