Bug 217457
| Summary: | Persistent rt_sigreturn segfaults on KVM VMs after upgrade to 5.15 | | |
|---|---|---|---|
| Product: | Linux | Reporter: | Theodor Milkov (tm) |
| Component: | Kernel | Assignee: | Virtual assignee for kernel bugs (linux-kernel) |
| Status: | RESOLVED OBSOLETE | | |
| Severity: | normal | CC: | bagasdotme, qhu, rockyramchandani02 |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Theodor Milkov
2023-05-18 08:43:13 UTC
(In reply to Theodor Milkov from comment #0)
> I'm experiencing sporadic but persistent segmentation faults on the KVM VMs
> I manage. These faults began appearing after upgrading from Linux Kernel 4.x
> to 5.15.59. I further upgraded to 5.15.91 and transitioned the userspace
> from Debian 10 (buster) to Debian 11 (bullseye), yet the issues persist.
> Notably, the libc has also changed in the process as seen in the following
> error logs:
>

I guess before upgrading you used v4.19 (from your distro), right?

(In reply to Theodor Milkov from comment #0)
> Switching to the 6.x kernel isn't immediately feasible as these are
> production systems with specific requirements. The transition is planned but
> will likely take several months.
>

If you already have a testing system (which should be identical to your production ones), can you try the latest mainline there?

We have compiled both our previous 4.14 kernel and our current 5.15 kernel ourselves, without module support. These kernels are used in the guest systems. Meanwhile, the host systems are using the 5.10 kernel from the distribution (Debian). The problem started right after upgrading the guest kernel to 5.15.59.

I am currently looking into trying the 6.1 kernel on a subset of approximately 150 guest systems. I will update you on the results in the coming days.

@Theodor Milkov, is there an equivalent core dump generated for your program/script? Probably under /var/log/core. If yes, can you provide excerpts from its gdb analysis?

PS - https://stackoverflow.com/questions/5115613/core-dump-file-analysis

The only way I currently have to trigger this segmentation fault is with the following Python code:

https://gist.github.com/z-image/762691ee7a67ffdeb88318c47d9ebf0c

The actual code is much lengthier, as it involves monitoring /proc/stat & /proc/pressure and regulating the rate of os.kill() to avoid overloading the servers. These are, after all, production machines. However, the most crucial component is this.
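
The gist itself is not reproduced in this report. Purely as an illustration of the approach described (delivering signals with os.kill() while backing off when the machine is under load), a minimal sketch might look like the following. Everything in it is an assumption rather than the reporter's code: the function names, the choice of SIGUSR1, the pressure threshold, and the child that blocks in a 1-byte pipe read (picked only because it resembles the read seen in the backtrace). It also reads only /proc/pressure/cpu, not /proc/stat.

```python
import os
import signal
import time


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Return the 'some avg10' PSI value, or 0.0 if PSI is unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("some"):
                    # Line looks like: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
                    fields = dict(kv.split("=") for kv in line.split()[1:])
                    return float(fields["avg10"])
    except (OSError, KeyError, ValueError):
        pass
    return 0.0


def child_loop(ready_fd):
    """Hypothetical victim: install a no-op handler, then block in read()."""
    signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    r, _w = os.pipe()          # write end stays open, so read() never returns
    os.write(ready_fd, b"1")   # tell the parent the handler is installed
    while True:
        os.read(r, 1)          # each signal interrupts this syscall


def stress(n_signals=200, max_pressure=50.0, delay=0.001, backoff=1.0):
    """Send n_signals to a forked child; return (signals_sent, wait_status)."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(ready_r)
        try:
            child_loop(ready_w)
        finally:
            os._exit(0)
    os.close(ready_w)
    os.read(ready_r, 1)        # wait until the child is ready
    os.close(ready_r)

    sent = 0
    for _ in range(n_signals):
        if read_cpu_pressure() > max_pressure:
            time.sleep(backoff)            # assumed throttle: pause under load
        os.kill(pid, signal.SIGUSR1)
        sent += 1
        time.sleep(delay)
        done, status = os.waitpid(pid, os.WNOHANG)
        if done:                           # child died on its own, e.g. SIGSEGV
            return sent, status
    os.kill(pid, signal.SIGKILL)           # clean up the surviving child
    _, status = os.waitpid(pid, 0)
    return sent, status


if __name__ == "__main__":
    sent, status = stress()
    if os.WIFSIGNALED(status):
        print(f"sent {sent} signals, child ended by signal {os.WTERMSIG(status)}")
```

On a healthy kernel the child should survive every signal and be ended only by the final SIGKILL; a child terminated by SIGSEGV would be the symptom discussed in this report.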

I've managed to obtain a core dump, but, as anticipated, it doesn't contain much useful information. This is because a) this is the non-debug version of Python 3, and b) the issue lies deeper within the kernel. Here's the backtrace:

    (gdb) bt full
    #0  0x000060f0e5c4904e in __libc_read (fd=5, buf=0x60f0e5701620, nbytes=1) at ../sysdeps/unix/sysv/linux/read.c:26
            resultvar = 0
            __arg3 = <optimized out>
            _a2 = <optimized out>
            sc_ret = <optimized out>
            __arg1 = <optimized out>
            _a3 = <optimized out>
            sc_cancel_oldtype = <optimized out>
            resultvar = <optimized out>
            resultvar = <optimized out>
            __arg2 = <optimized out>
            _a1 = <optimized out>
    #1  0x0000000000561ec7 in _Py_read ()
    No symbol table info available.

I may try with python3-dbg if you think that is going to be more useful.

I'd like to report that the persistent rt_sigreturn segmentation faults on KVM VMs are no longer reproducible after upgrading to kernel version 5.15.131. It appears the issue was fixed in one of the updates post 5.15.91.