I am testing manual vectorization in Java with the jdk.incubator.vector API (JDK 19), and have come across some unexpected behavior with respect to garbage collection, exemplified here with a toy memset implementation.
In the following code, the functions memsetLoop1 and memsetLoop2 are functionally identical; the only difference is whether the LongVector.broadcast call takes place inside or outside the inner loop. Because the broadcast value is the same for every iteration of the inner loop, hoisting it out is a correct "optimization." However, my measurements (in the main function) indicate that memsetLoop2 takes about 50% longer to complete, on average.
// Run with `--source=19 --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector`
package main;

import java.lang.foreign.*;
import java.nio.ByteOrder;

import jdk.incubator.vector.*;

public class VectorGCBugDemo {
    // I am aware that the below line of code is generally a Very Bad Idea.
    // It has no bearing on the issue at hand, AFAICT.
    final static MemorySegment ADDRESS_SPACE = MemorySegment.ofAddress(MemoryAddress.NULL, Long.MAX_VALUE, MemorySession.global());
    final static ByteOrder NATIVE_BYTE_ORDER = ByteOrder.nativeOrder();
    final static long SPECIES256_VECTOR_SIZE = LongVector.SPECIES_256.vectorByteSize();

    // Something that fits in L1 cache, for maximum speed.
    static final long ALLOCATION_BYTES = 512L * SPECIES256_VECTOR_SIZE;
    // Big number to make unexpected GC behavior visible.
    static final long ITERATIONS = 50_000_000L;

    static void memsetLoop1(final long address, final long bytes, final long iterations) {
        final long start = address;
        final long end = address + bytes;
        for (long outer = 0; outer < iterations; outer++) {
            // vvvvvvvv INNER LOOP 1 vvvvvvvv
            for (long i = start; i < end; i += SPECIES256_VECTOR_SIZE) {
                LongVector.broadcast(LongVector.SPECIES_256, outer).intoMemorySegment(ADDRESS_SPACE, i, NATIVE_BYTE_ORDER);
            }
        }
    }

    static void memsetLoop2(final long address, final long bytes, final long iterations) {
        final long start = address;
        final long end = address + bytes;
        for (long outer = 0; outer < iterations; outer++) {
            // vvvvvvvv INNER LOOP 2 vvvvvvvv
            final LongVector broadcastedValue = LongVector.broadcast(LongVector.SPECIES_256, outer);
            for (long i = start; i < end; i += SPECIES256_VECTOR_SIZE) {
                broadcastedValue.intoMemorySegment(ADDRESS_SPACE, i, NATIVE_BYTE_ORDER);
            }
        }
    }

    public static void main(String... args) {
        // Allocate vector-aligned memory.
        final MemorySegment allocation = MemorySegment.allocateNative(ALLOCATION_BYTES, SPECIES256_VECTOR_SIZE, MemorySession.global());
        final long allocationAddress = allocation.address().toRawLongValue();

        System.out.print("Running memsetLoop1... ");
        final long t1 = System.nanoTime();
        memsetLoop1(allocationAddress, ALLOCATION_BYTES, ITERATIONS);
        final long t2 = System.nanoTime();
        System.out.printf("memset loop took %.2fns on average.\n", (double)(t2 - t1) / ITERATIONS);

        System.out.print("Running memsetLoop2... ");
        final long t3 = System.nanoTime();
        memsetLoop2(allocationAddress, ALLOCATION_BYTES, ITERATIONS);
        final long t4 = System.nanoTime();
        System.out.printf("memset loop took %.2fns on average.\n", (double)(t4 - t3) / ITERATIONS);
    }
}
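(One caveat about my own methodology: the averages above include JIT warmup, so they blend interpreted and compiled iterations. A minimal warmup-then-measure harness sketch is below; the class name and the plain-array workload are stand-ins of mine, not part of the demo above, just to keep the pattern visible without the Vector API.)

```java
// Sketch: separate JIT warmup from the measured run, so the reported
// average reflects steady-state compiled code rather than a mix of
// interpreted and compiled iterations. The fill() workload is a stand-in.
public class WarmupTimingSketch {
    static long fill(long[] buffer, long value) {
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = value;
        }
        return buffer[0];
    }

    static double measure(long[] buffer, long iterations) {
        long sink = 0;
        final long t0 = System.nanoTime();
        for (long outer = 0; outer < iterations; outer++) {
            sink += fill(buffer, outer);
        }
        final long t1 = System.nanoTime();
        // Consume 'sink' so the loop cannot be optimized away entirely.
        if (sink == Long.MIN_VALUE) System.out.println(sink);
        return (double) (t1 - t0) / iterations;
    }

    public static void main(String... args) {
        final long[] buffer = new long[2048];          // 16 KiB, as in the demo
        measure(buffer, 20_000);                       // warmup pass: let C2 compile
        final double avgNs = measure(buffer, 200_000); // measured run only
        System.out.printf("steady-state average: %.2fns%n", avgNs);
    }
}
```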
Looking at the x86 code generated by the JIT compiler for each function (via -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and the hsdis library) doesn't reveal any differences between them: the compiler is more than smart enough to hoist the vpbroadcastq instruction out of the loop in memsetLoop1, and both functions end up looking pretty much identical to my (somewhat inexpert) eye.
However, running it with -verbose:gc and default heap sizes reveals significant GC activity during the execution of memsetLoop2, while memsetLoop1 shows none:
[0.002s][info][gc] Using Serial
WARNING: Using incubator modules: jdk.incubator.vector
Running memsetLoop1... memset loop took 271.30ns on average.
Running memsetLoop2... [15.305s][info][gc] GC(0) Pause Young (Allocation Failure) 268M->1M(968M) 2.190ms
[17.037s][info][gc] GC(1) Pause Young (Allocation Failure) 268M->1M(968M) 1.836ms
[18.808s][info][gc] GC(2) Pause Young (Allocation Failure) 268M->1M(968M) 0.862ms
[20.607s][info][gc] GC(3) Pause Young (Allocation Failure) 268M->1M(968M) 0.856ms
[22.369s][info][gc] GC(4) Pause Young (Allocation Failure) 268M->1M(968M) 0.955ms
[24.128s][info][gc] GC(5) Pause Young (Allocation Failure) 268M->1M(968M) 0.750ms
[25.887s][info][gc] GC(6) Pause Young (Allocation Failure) 268M->1M(968M) 0.928ms
[27.646s][info][gc] GC(7) Pause Young (Allocation Failure) 268M->1M(968M) 0.755ms
[29.405s][info][gc] GC(8) Pause Young (Allocation Failure) 268M->1M(968M) 0.750ms
[31.213s][info][gc] GC(9) Pause Young (Allocation Failure) 268M->1M(968M) 0.858ms
[32.972s][info][gc] GC(10) Pause Young (Allocation Failure) 268M->1M(968M) 0.752ms
memset loop took 405.50ns on average.
Using higher outer loop counts just produces even more GC activity for memsetLoop2.
Lowering the heap sizes to their minimum values shows that memsetLoop1 does initially have some light GC activity, but the allocations produced by direct interpretation of the loop code appear to get optimized out once the JIT kicks in, while in memsetLoop2 this is not the case, for whatever reason.
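To quantify this, I sketched a small probe around HotSpot's per-thread allocated-bytes counter (com.sun.management.ThreadMXBean is a HotSpot-specific extension, so this is not portable; the small-array workload is my stand-in for the LongVector object returned by broadcast(), not the actual demo code):

```java
import java.lang.management.ManagementFactory;

// Sketch: use HotSpot's per-thread allocation counter to compare how many
// bytes a loop variant actually allocates. A fresh small array per
// iteration stands in for the LongVector returned by broadcast().
public class AllocationProbe {
    // HotSpot-specific extension of the standard ThreadMXBean.
    static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    static long allocatedBytes(Runnable task) {
        final long id = Thread.currentThread().getId();
        final long before = THREADS.getThreadAllocatedBytes(id);
        task.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String... args) {
        final long insideLoop = allocatedBytes(() -> {
            long sink = 0;
            for (int outer = 0; outer < 100_000; outer++) {
                long[] v = {outer, outer, outer, outer}; // fresh allocation each iteration
                sink += v[0];
            }
            if (sink == Long.MIN_VALUE) System.out.println(sink);
        });
        final long hoisted = allocatedBytes(() -> {
            long sink = 0;
            long[] v = {0, 0, 0, 0}; // single allocation, reused
            for (int outer = 0; outer < 100_000; outer++) {
                v[0] = outer;
                sink += v[0];
            }
            if (sink == Long.MIN_VALUE) System.out.println(sink);
        });
        System.out.printf("inside loop: %d bytes, hoisted: %d bytes%n", insideLoop, hoisted);
    }
}
```

Wrapping the two memset loops in allocatedBytes() the same way should show whether memsetLoop2's vector allocations really survive into compiled code while memsetLoop1's get scalar-replaced.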
I am quite puzzled by all this, as I would expect both functions to be optimized essentially identically once the iteration-invariant code is hoisted out of the loop. If anyone has any clues as to why this behavior might be occurring, an explanation would be very much appreciated. Thank you!