Why does moving an iteration-invariant vector out of a loop induce unnecessary GC activity?


I am testing manual vectorization in Java with the jdk.incubator.vector API (JDK 19), and have come across some unexpected behavior with respect to garbage collection, exemplified here with a toy memset implementation.

In the following code, the functions memsetLoop1 and memsetLoop2 are functionally identical; the only difference is whether the LongVector.broadcast happens inside or outside the inner loop. Because the broadcast value is the same for every iteration of the inner loop, hoisting it out is a correct "optimization."

However, my measurements (in the main function) indicate memsetLoop2 takes about 50% longer to complete, on average.

// Run with `--source=19 --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --enable-preview --enable-native-access=ALL-UNNAMED --add-modules jdk.incubator.vector`
package main;

import java.lang.foreign.*;
import java.nio.ByteOrder;
import jdk.incubator.vector.*;

public class VectorGCBugDemo {
    // I am aware that the below line of code is generally a Very Bad Idea.
    // It has no bearing on the issue at hand, AFAICT.
    final static MemorySegment ADDRESS_SPACE = MemorySegment.ofAddress(MemoryAddress.NULL, Long.MAX_VALUE, MemorySession.global());
    final static ByteOrder NATIVE_BYTE_ORDER = ByteOrder.nativeOrder();
    final static long SPECIES256_VECTOR_SIZE = LongVector.SPECIES_256.vectorByteSize();

    // Something that fits in L1 cache, for maximum speed.
    static final long ALLOCATION_BYTES = 512L * SPECIES256_VECTOR_SIZE;
    // Big number to make unexpected GC behavior visible.
    static final long ITERATIONS = 50_000_000L;

    static void memsetLoop1(final long address, final long bytes, final long iterations) {
        final long start = address;
        final long end = address + bytes;
        for (long outer = 0; outer < iterations; outer++) {

            // vvvvvvvv INNER LOOP 1 vvvvvvvv
            for (long i = start; i < end; i += SPECIES256_VECTOR_SIZE) {
                LongVector.broadcast(LongVector.SPECIES_256, outer).intoMemorySegment(ADDRESS_SPACE, i, NATIVE_BYTE_ORDER);
            }

        }
    }

    static void memsetLoop2(final long address, final long bytes, final long iterations) {
        final long start = address;
        final long end = address + bytes;
        for (long outer = 0; outer < iterations; outer++) {

            // vvvvvvvv INNER LOOP 2 vvvvvvvv
            final LongVector broadcastedValue = LongVector.broadcast(LongVector.SPECIES_256, outer);
            for (long i = start; i < end; i += SPECIES256_VECTOR_SIZE) {
                broadcastedValue.intoMemorySegment(ADDRESS_SPACE, i, NATIVE_BYTE_ORDER);
            }

        }
    }
    
    public static void main(String... args) {
        // Allocate vector-aligned memory.
        final MemorySegment allocation = MemorySegment.allocateNative(ALLOCATION_BYTES, SPECIES256_VECTOR_SIZE, MemorySession.global());
        final long allocationAddress = allocation.address().toRawLongValue();

        System.out.print("Running memsetLoop1... ");
        final long t1 = System.nanoTime();
        memsetLoop1(allocationAddress, ALLOCATION_BYTES, ITERATIONS);
        final long t2 = System.nanoTime();
        System.out.printf("memset loop took %.2fns on average.\n", (double)(t2 - t1) / ITERATIONS);
        
        System.out.print("Running memsetLoop2... ");
        final long t3 = System.nanoTime();
        memsetLoop2(allocationAddress, ALLOCATION_BYTES, ITERATIONS);
        final long t4 = System.nanoTime();
        System.out.printf("memset loop took %.2fns on average.\n", (double)(t4 - t3) / ITERATIONS);
    }
}

Looking at the x86 code generated by the JIT compiler for each function (via -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and the hsdis library) does not reveal any differences between them. The compiler is more than smart enough to move the vpbroadcastq instruction out of the loop in memsetLoop1 on its own, and both functions end up looking essentially identical to my (somewhat inexpert) eye.

However, running it with -verbose:gc and default heap sizes reveals significant GC activity during the execution of memsetLoop2, while memsetLoop1 triggers none:

[0.002s][info][gc] Using Serial
WARNING: Using incubator modules: jdk.incubator.vector
Running memsetLoop1... memset loop took 271.30ns on average.
Running memsetLoop2... [15.305s][info][gc] GC(0) Pause Young (Allocation Failure) 268M->1M(968M) 2.190ms
[17.037s][info][gc] GC(1) Pause Young (Allocation Failure) 268M->1M(968M) 1.836ms
[18.808s][info][gc] GC(2) Pause Young (Allocation Failure) 268M->1M(968M) 0.862ms
[20.607s][info][gc] GC(3) Pause Young (Allocation Failure) 268M->1M(968M) 0.856ms
[22.369s][info][gc] GC(4) Pause Young (Allocation Failure) 268M->1M(968M) 0.955ms
[24.128s][info][gc] GC(5) Pause Young (Allocation Failure) 268M->1M(968M) 0.750ms
[25.887s][info][gc] GC(6) Pause Young (Allocation Failure) 268M->1M(968M) 0.928ms
[27.646s][info][gc] GC(7) Pause Young (Allocation Failure) 268M->1M(968M) 0.755ms
[29.405s][info][gc] GC(8) Pause Young (Allocation Failure) 268M->1M(968M) 0.750ms
[31.213s][info][gc] GC(9) Pause Young (Allocation Failure) 268M->1M(968M) 0.858ms
[32.972s][info][gc] GC(10) Pause Young (Allocation Failure) 268M->1M(968M) 0.752ms
memset loop took 405.50ns on average.

Using higher outer loop counts just results in even more GC activity for memsetLoop2.

Lowering the heap sizes to their minimum values shows that memsetLoop1 does initially have some light GC activity as well, but the allocations produced by straightforward interpretation of the loop code appear to be optimized away once the JIT compiles it; in memsetLoop2, for whatever reason, they are not.
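To quantify the allocations directly rather than inferring them from GC logs, per-thread allocation can be sampled with the HotSpot-specific com.sun.management.ThreadMXBean. The sketch below is mine (the class name, and the 8 MB test array standing in for the memset loops, are illustrative and do not use the Vector API); wrapping calls to memsetLoop1 and memsetLoop2 in the same probe should show which version actually allocates per iteration.

```java
import java.lang.management.ManagementFactory;

// Sketch: measure a thread's cumulative heap allocation using the
// HotSpot-specific com.sun.management.ThreadMXBean extension.
public class AllocProbe {
    static long allocatedBytes() {
        // On HotSpot, the platform ThreadMXBean also implements the
        // com.sun.management extension that exposes allocation counters.
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long before = allocatedBytes();
        long[] sink = new long[1_000_000]; // ~8 MB; large enough to register reliably
        sink[0] = 1;                       // keep the array alive
        long after = allocatedBytes();
        System.out.println("allocated roughly " + (after - before) + " bytes");
    }
}
```

The counter is cumulative for the thread's lifetime, so taking a delta around the region of interest isolates its allocations.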

I am quite puzzled by this, as I would expect both functions to be optimized essentially identically once the iteration-invariant code has been hoisted out of the loop. If anyone has any clues as to why this behavior might be occurring, an explanation would be very much appreciated. Thank you!
