I have a qemu-kvm process that suspiciously dumped core with SIGFPE:
Program terminated with signal 8, Arithmetic exception.
#0 bdrv_exceed_io_limits (bs=0x7f75916b7270, is_write=false, nb_sectors=1)
at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
3730 elapsed_time /= (NANOSECONDS_PER_SECOND);
Where elapsed_time is a double (its value is in the gdb output below) and NANOSECONDS_PER_SECOND is a macro:
#define NANOSECONDS_PER_SECOND 1000000000.0
I can't think of any reason why this could cause SIGFPE. Any clue?
Scenario: I'm using RHEL-6.5 as the host and trying to start a Windows guest. It is reliably reproducible with the same command.
Full backtrace:
(gdb) bt
#0 bdrv_exceed_io_limits (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
#1 bdrv_io_limits_intercept (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:181
#2 0x00007ffff7e0bf6d in bdrv_co_do_readv (bs=0x7ffff86f9270, sector_num=0, nb_sectors=1, qiov=0x7fffe8000ab8, flags=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:2136
#3 0x00007ffff7e0c293 in bdrv_co_do_rw (opaque=0x7fffe8000b00) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3880
#4 0x00007ffff7e125eb in coroutine_trampoline (i0=<value optimized out>, i1=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.12.1.2/coroutine-ucontext.c:129
#5 0x00007ffff5718ba0 in ?? () from /lib64/libc.so.6
#6 0x00007fffffffbf60 in ?? ()
#7 0x0000000000000000 in ?? ()
(gdb) disass
0x00007ffff7e0b6ae <+190>: mov 0x8a0(%rbx),%rax
0x00007ffff7e0b6b5 <+197>: test %rax,%rax
=> 0x00007ffff7e0b6b8 <+200>: divsd 0x170660(%rip),%xmm0 # 0x7ffff7f7bd20
0x00007ffff7e0b6c0 <+208>: je 0x7ffff7e0b950 <bdrv_io_limits_intercept+864>
0x00007ffff7e0b6c6 <+214>: mov 0x888(%rbx),%rsi
(gdb) x/gf 0x7ffff7f7bd20
0x7ffff7f7bd20: 1000000000
(gdb) p elapsed_time
$3 = 919718
(gdb) p $_siginfo
$1 = {si_signo = 8, si_errno = 0, si_code = 6, _sifields = {_pad = {-136186690, 32767, 4244976, 0, -560757824, 32767, -
-560757344, 32767, 0, 0, 0, 0, 0, 0, 34884976, 0, -136186690, 32767, 34884976, 0, 4258127, 0, 0, 0, -55876128, 3265
-136186690, si_uid = 32767}, _timer = {si_tid = -136186690, si_overrun = 32767, si_sigval = {sival_int = 4244976, s
_rt = {si_pid = -136186690, si_uid = 32767, si_sigval = {sival_int = 4244976, sival_ptr = 0x40c5f0}}, _sigchld = {s
si_uid = 32767, si_status = 4244976, si_utime = -2408436515056123904, si_stime = -584917379700457473}, _sigfault
0x7ffff7e1f4be}, _sigpoll = {si_band = 140737352168638, si_fd = 4244976}}}
So, what could be wrong with this divsd instruction? Any suggestion on how to debug it?
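Answering my own question: this is a kernel bug that accidentally sets MXCSR to a bad value. When the precision (inexact) exception bit in MXCSR is not masked, the inexact result of the divsd traps, and the Linux kernel delivers SIGFPE with the inexact-result code (si_code = 6 in the $_siginfo output above).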
Note that SIGFPE is not necessarily seen at the instruction that caused it; it may only show up some time later, which is of course confusing.
See https://stackoverflow.com/a/2219339/1442050
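
To illustrate the unmasked-inexact behaviour described above, here is a minimal sketch in C. It is not taken from qemu-kvm or the kernel; it assumes an x86-64 Linux host with GCC or Clang, where double division compiles to divsd and `<xmmintrin.h>` provides `_mm_getcsr`/`_mm_setcsr`. The helper names `inexact_is_unmasked` and `signal_from_inexact_division` are hypothetical. The division runs in a forked child so that the trap does not kill the calling process.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <xmmintrin.h>

/* MXCSR bit 12 is the precision (inexact) exception mask; 1 means masked. */
#define MXCSR_PRECISION_MASK 0x1000u

/* Returns nonzero if an inexact SSE result would trap with this MXCSR value. */
int inexact_is_unmasked(unsigned int mxcsr)
{
    return (mxcsr & MXCSR_PRECISION_MASK) == 0;
}

/*
 * Hypothetical helper: forks a child, clears the precision mask in the
 * child's MXCSR, and performs the same kind of division as block.c:3730.
 * Returns the signal that terminated the child, 0 if it exited normally,
 * or -1 if fork or waitpid failed.
 */
int signal_from_inexact_division(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* volatile keeps the compiler from folding the division away */
        volatile double elapsed_time = 919718.0;   /* value from the gdb session */
        volatile double divisor = 1000000000.0;    /* NANOSECONDS_PER_SECOND */

        /* Simulate the bad MXCSR value: unmask the precision exception. */
        _mm_setcsr(_mm_getcsr() & ~MXCSR_PRECISION_MASK);

        elapsed_time = elapsed_time / divisor;     /* inexact result, so this traps */
        _exit(0);                                  /* only reached if no trap occurred */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

With the default MXCSR value (0x1F80) all SSE exceptions are masked, so `inexact_is_unmasked(_mm_getcsr())` should return 0 in a healthy process. Calling `signal_from_inexact_division()` is expected to return SIGFPE on such a host, matching the crash in the question. In gdb, `p $mxcsr` at the faulting frame is one way to check whether the mask bit was cleared.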