In Linux v5.10, when handling the SVE access exception in the do_sve_acc() function, why zero out the thread's SVE state?
I think it should not zero out the SVE state before restoring the SVE state. Am I right?
https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L513
My doubt is this: when the process takes the SVE trap and it is not the first time, the SVE state already contains the context to restore. After zeroing out the SVE state, what is left to restore?
This trap can only happen when there was no SVE state, only ASIMD.
System calls are allowed to discard the SVE state and return to FP/ASIMD-only mode for cheaper context switches. From the big block comment I quote below: "During any syscall, the kernel may optionally [footnote 1] clear TIF_SVE and discard the vector state except for the FPSIMD subset."
"Discarding" means there isn't still architectural state that user-space can be expecting to read later. It will read zeros for parts of vector registers outside the low 128 bits.
Comments in the file you linked describe the design. When SVE isn't being used, it uses cheaper FP/ASIMD context switching. Many processes won't use SVE at all because it's still pretty new, so it definitely makes sense to have this even on hardware that does support SVE.
Specifically https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L229
The key part being: bits [max : 128] for each of `Z0`-`Z31` are logically zero but not stored anywhere in the TIF_SVE-clear state. That's why it's zeroing stuff when leaving that state.
Also, another comment says:
So before that, SVE regs might have stale garbage from another process, but we need to prevent data leaks. Same reason fresh pages from `mmap(MAP_ANONYMOUS)` are zeroed.
Also same reason `execve` zeros the integer registers (and non-SVE SIMD registers). The ABI allows garbage, but for security the kernel chooses fixed values, and zero is convenient.
Other comments like https://elixir.bootlin.com/linux/v5.10.205/source/arch/arm64/kernel/fpsimd.c#L501 have similar stuff. This seems like a lazy init of SVE state, on the assumption that most processes won't use SVE at all. So instead of always allocating and zeroing space for it on `execve`, do it on first use in the task, which triggers this trap.
I know Linux used to do lazy context-switching for x86 SIMD/FP state, but now only does "eager" context switching, and has for so long that support for "lazy" has been dropped. On x86-64, pretty much every process will use some SSE instructions in compiler-generated code and in library functions like `strlen` and `memcpy`, so most timeslices would involve a trap if the kernel left vector/FP instructions disabled.
It looks like that's the same for AArch64 FP/ASIMD, only eager is implemented. `do_fpsimd_acc` is a stub that just warns, because nothing ever calls it. (It still tries to avoid unnecessary swaps when just changing current context inside the kernel, only restoring when actually returning to user-space if the values in regs aren't the values for the user-space context we're returning to. But it doesn't leave ASIMD instructions set to trap on first use.)
AArch64 SVE on the other hand is quite new, and not widely available, so many programs might not use it at all. (Unless libc detects and uses it.) This isn't lazy context-switching for SVE for processes that are using it, only lazy init on first use. (Or on use after a system call if the kernel guessed that it might be done with SVE for a while.)
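The discard-on-syscall and rebuild-on-trap cycle described above can be sketched as a toy user-space model. This is not kernel code: `struct toy_task`, `toy_syscall_discard`, `toy_sve_access_trap`, `toy_demo` and the 512-bit `TOY_VL` are all names and values I made up for illustration, and the real kernel's data layout and save/restore paths differ. The point it tries to show is that the trap handler only runs when there is no live SVE state, so zeroing the buffer cannot destroy anything the task could still expect to read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_VREGS 32
#define VQ_BYTES  16   /* 128 bits: the FP/ASIMD view of each register */
#define TOY_VL    64   /* hypothetical 512-bit SVE vector length, in bytes */

/* Toy task; field names are illustrative, not the kernel's layout. */
struct toy_task {
    bool    tif_sve;                       /* models the TIF_SVE thread flag */
    uint8_t fpsimd[NUM_VREGS][VQ_BYTES];   /* V0-V31, authoritative while !tif_sve */
    uint8_t sve[NUM_VREGS][TOY_VL];        /* Z0-Z31, authoritative while tif_sve */
};

/* A syscall may discard SVE state: only the low 128 bits survive. */
static void toy_syscall_discard(struct toy_task *t)
{
    if (!t->tif_sve)
        return;
    for (int i = 0; i < NUM_VREGS; i++)
        memcpy(t->fpsimd[i], t->sve[i], VQ_BYTES);
    t->tif_sve = false;
    /* t->sve is now stale; nothing reads it until the next trap rebuilds it. */
}

/* An SVE instruction executed while !tif_sve lands here
 * (the role do_sve_acc plays in this toy model). */
static void toy_sve_access_trap(struct toy_task *t)
{
    assert(!t->tif_sve);                   /* no live SVE state to lose */
    memset(t->sve, 0, sizeof(t->sve));     /* upper bits are logically zero */
    for (int i = 0; i < NUM_VREGS; i++)
        memcpy(t->sve[i], t->fpsimd[i], VQ_BYTES);
    t->tif_sve = true;
}

/* Returns true if, after use -> syscall -> use again, Z0 keeps its low
 * 128 bits and reads as zero above them, even with garbage in the buffer. */
static bool toy_demo(void)
{
    struct toy_task t;
    memset(&t, 0, sizeof(t));
    toy_sve_access_trap(&t);               /* first SVE use in the task */
    memset(t.sve[0], 0xAB, TOY_VL);        /* user code fills all of Z0 */
    toy_syscall_discard(&t);               /* kernel drops the upper bits */
    memset(t.sve, 0xEE, sizeof(t.sve));    /* pretend the buffer went stale */
    toy_sve_access_trap(&t);               /* next SVE use traps again */
    for (int i = 0; i < TOY_VL; i++) {
        uint8_t want = (i < VQ_BYTES) ? 0xAB : 0x00;
        if (t.sve[0][i] != want)
            return false;
    }
    return true;
}
```

Calling `toy_demo()` from a small `main` should return true under this model: the second trap rebuilds Z0 from the FP/ASIMD copy plus zeros, which is exactly what user-space is allowed to observe after a syscall discarded the upper bits.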
All the comments are consistent with the idea that `do_sve_acc` is only called to migrate state from FP/ASIMD to SVE on the first use of SVE, when there is no existing state, e.g. before `do_sve_acc` itself:
Footnote 1: "may optionally"?
I don't know if there's a heuristic or if in practice it always chooses to reset back to FPASIMD whenever possible. Having syscall-clobbered extended vector regs seems like a good design; most vectorized code wants lots of big vector regs for loops that don't involve system-calls or function-calls, maybe keeping some vector constants around between loops in the same function but usually without a system call in between. In a rare function that did make a system call between loops, perhaps futex for synchronization, code would have to assume SVE regs were destroyed, so either reload SVE constants or only vectorize with ASIMD.
The standard AArch64 calling convention does have some call-preserved vector regs, allowing some scalar or 128-bit vector values to stay in registers across user-level function calls, too. (e.g. https://godbolt.org/z/vrsn5n6d7). But I'm assuming the upper parts (SVE state beyond 128 bits) are call-clobbered so even if the kernel didn't clear uppers on system calls, you could only take advantage of it by manually inlining a system call in asm, not letting a C compiler call a libc wrapper function that follows the user-space calling convention.
So the cost is in the time it takes to trap the next time SVE is used. The kernel might try to notice that a process is frequently causing these traps, and/or that it spends little time in non-SVE context switches, and decide not to reset back to FPASIMD state on future system calls. That could avoid the worst-case situations for an always-reset strategy.
For many processes, SVE is never used, so it's a pure win to only do FPASIMD context switching, with no traps. (But resetting from SVE to FPASIMD wouldn't be needed either.) Threads that don't make system calls and spend all their time doing SVE number crunching won't take any of these traps.
An adaptive strategy would be good for threads that have some phases of heavy use of SVE, but other phases of not using SVE, like running non-vectorized code or only ASIMD. (Like perhaps the SVE code was only in a library function, and other phases of computation don't use that library.) But only if they have a system call between phases. For threads that's probably not rare if they sleep and wait for notification from other threads. And in fact right as a thread goes to sleep is a great place to clear its SVE state, unless it's about to use SVE when it wakes up.
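To make the trade-off concrete, here is a toy trap counter comparing an always-discard policy with one possible adaptive policy. Everything in it is hypothetical: the event-string encoding, the `count_traps` name and the `keep_after` threshold are my own inventions for illustration, and I'm not claiming Linux implements anything like this.

```c
#include <stdbool.h>

/* Count SVE-access traps for a string of events:
 *   'S' = the task executes SVE code, 'C' = the task makes a system call.
 * keep_after is a hypothetical threshold: once the task has trapped that
 * many times, this toy kernel stops discarding SVE state on syscalls.
 * keep_after == 0 means "always discard". */
static unsigned count_traps(const char *events, unsigned keep_after)
{
    bool sve_live = false;   /* models TIF_SVE being set */
    unsigned traps = 0;

    for (const char *e = events; *e; e++) {
        if (*e == 'S') {
            if (!sve_live) {         /* SVE disabled: trap and enable it */
                traps++;
                sve_live = true;
            }
        } else if (*e == 'C') {
            if (keep_after == 0 || traps < keep_after)
                sve_live = false;    /* discard, back to FP/ASIMD only */
        }
    }
    return traps;
}
```

With the alternating workload `"SCSCSCSC"`, the always-discard policy (`keep_after` of 0) traps on every SVE phase, while a threshold of 2 stops after two traps. A thread that never makes a system call (`"SSSS"`) traps once either way, and a thread that never uses SVE never traps.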
I'm just speculating here; hopefully there's some profiling data to back up whatever strategy Linux actually uses. It may change over time if glibc starts using SVE in `strlen` and `memcmp`, for example, so more tasks will use SVE every timeslice. (If they don't do that already?)