Bug 204339
Summary: | BUG: Bad rss-counter state | | |
---|---|---|---|
Product: | File System | Reporter: | icytxw (icytxw) |
Component: | Other | Assignee: | fs_other |
Status: | NEW | | |
Severity: | normal | CC: | Ulrich.Windl |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | v5.2-rc6 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Re-modern code; Stack traces from three different machines; Two more kdumps (5.3.18-150300.59.49-default); Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008) | | |
Description
icytxw
2019-07-27 08:41:11 UTC
We had multiple kernel crashes using 5.3.18-150300.59.49-default from SLES15 SP3 running under xen-4.14.3_06-150300.3.18.2.x86_64. The effect is that at some point processes start dumping core due to SIGSEGV, and there are messages regarding "BUG: Bad RSS-counter..." usually combined with "Code: Bad RIP value." The bug was not seen with the SLES15 SP2 kernel (5.3.18-24.99-default, xen-4.14.4_02-3.40).
Created attachment 300503 [details]
Stack traces from three different machines
The file contains related syslog messages as well as the final stack trace when the kernel panicked on three different machines. Before the panic, a large number of core dumps were written.
The servers were all Dell PowerEdge R7415 with one AMD EPYC 7401P 24-Core Processor (latest Firmware Updates applied).
Created attachment 300546 [details]
Two more kdumps (5.3.18-150300.59.49-default)
Two more kdumps. Maybe filesystem-related (BtrFS). The system is using snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly random processes see SIGSEGV before the kernel panics.
(In reply to Ulrich.Windl from comment #3)
> Two more kdumps. Maybe filesystem-related (BtrFS). The system is using
> snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly
> random processes see SIGSEGV before the kernel panics.
Maybe the issue is also caused by Xen. At least the RAM had no problems on the machines.
Created attachment 300578 [details]
Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008)
Another kernel panic for 5.3.18-150300.59.49-default #1 SLE15-SP3, kcompactd-related, possibly also related to BtrFS "qgroup scan completed (inconsistency flag cleared)". Uptime before the panic was multiple days.
Here I can easily trigger the bug by doing some I/O, such as running "rear backup" to NFS. There will be multiple core dumps while sending the data to NFS. I've since updated to kernel 5.3.18-150300.59.63, but the problem is still there.
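For anyone comparing logs across the affected machines, here is a minimal sketch that tallies the "Bad rss-counter" reports in a saved syslog or dmesg dump. It is not taken from the attachments. The line format in the regular expression is an assumption (`mm:<ptr>` followed by either `idx:<n>` or `type:<name>`, then `val:<n>`), and the function name `summarize_rss_bugs` is hypothetical. Adjust the pattern if your kernel prints the message differently.

```python
import re
from collections import Counter

# Assumed message shapes (not copied from this report's attachments):
#   BUG: Bad rss-counter state mm:<ptr> idx:<n> val:<n>
#   BUG: Bad rss-counter state mm:<ptr> type:<name> val:<n>
RSS_RE = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>\S+) "
    r"(?:idx|type):(?P<counter>\S+) val:(?P<val>-?\d+)",
    re.IGNORECASE,
)


def summarize_rss_bugs(lines):
    """Return (reports per counter, set of mm pointers) for matching log lines."""
    per_counter = Counter()
    mms = set()
    for line in lines:
        match = RSS_RE.search(line)
        if match is None:
            continue  # unrelated log line
        per_counter[match.group("counter")] += 1
        mms.add(match.group("mm"))
    return per_counter, mms
```

Passing an open log file (or any iterable of lines) to `summarize_rss_bugs` gives a quick view of which counters go bad and how many distinct address spaces are affected, which may help tell a single corrupted process apart from system-wide corruption.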