Created attachment 283997 [details] Re-modern code

Dear All

This bug was found in Linux Kernel v5.2-rc6

Syzkaller hit 'BUG: Bad rss-counter state' bug.

BUG: Bad rss-counter state mm:00000000ff222ec1 idx:0 val:1
BUG: Bad rss-counter state mm:00000000ff222ec1 idx:1 val:2
BUG: non-zero pgtables_bytes on freeing mm: 4096

Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 Leak:false EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:true HandleSegv:true Repro:false Trace:false}
ioctl$TIOCLINUX7(0xffffffffffffffff, 0x541c, 0x0)
r0 = syz_open_dev$sg(&(0x7f0000000380)='/dev/sg#\x00', 0x0, 0x0)
ioctl$SCSI_IOCTL_SEND_COMMAND(r0, 0x1, &(0x7f00000000c0)={0x1, 0x0, 0x8, "be"})

You can run QEMU through this
qemu-system-x86_64 -m 2048 -smp 2 \
  -net nic,model=e1000 -net user,host=10.0.2.10,hostfwd=tcp::1111-:22 \
  -display none -serial stdio -no-reboot \
  -enable-kvm -cpu host,migratable=off \
  -hda /home/icy/gopath/src/github.com/google/syzkaller/image/wheezy.img -snapshot \
  -kernel /home/icy/gopath/src/github.com/google/syzkaller/linux/arch/x86/boot/bzImage \
  -append "earlyprintk=serial oops=panic nmi_watchdog=panic panic_on_warn=1 panic=1 ftrace_dump_on_oops=orig_cpu rodata=n vsyscall=native net.ifnames=0 biosdevname=0 root=/dev/sda console=ttyS0 kvm-intel.nested=1 kvm-intel.unrestricted_guest=1 kvm-intel.vmm_exclusive=1 kvm-intel.fasteoi=1 kvm-intel.ept=1 kvm-intel.flexpriority=1 kvm-intel.vpid=1 kvm-intel.emulate_invalid_guest_state=1 kvm-intel.eptad=1 kvm-intel.enable_shadow_vmcs=1 kvm-intel.pml=1 kvm-intel.enable_apicv=1 "

You can download Wheezy.img from this
wget https://storage.googleeapi.com/syzkaller/wheezy.img
from this url:192.168.44.128
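For triage, the BUG lines quoted in the report can be pulled out of a console log or syslog with a short script. The following is only a minimal sketch in Python, not part of the original report: the function name `parse_bad_rss` and the `SAMPLE` text are made up for illustration, and the mapping from `idx` to a counter name is an assumption based on the enum in include/linux/mm_types_task.h for kernels of this era, so it should be checked against the tree actually in use.

```python
import re
from collections import defaultdict

# ASSUMPTION: idx -> counter name, following the enum in
# include/linux/mm_types_task.h for kernels of this era. Verify for your tree.
RSS_COUNTER_NAMES = {
    0: "MM_FILEPAGES",
    1: "MM_ANONPAGES",
    2: "MM_SWAPENTS",
    3: "MM_SHMEMPAGES",
}

# Patterns for the two message formats quoted in the report.
RSS_RE = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-fA-F]+) "
    r"idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)
PGT_RE = re.compile(
    r"BUG: non-zero pgtables_bytes on freeing mm: (?P<bytes>-?\d+)"
)


def parse_bad_rss(log_text):
    """Return ({mm: {counter_name: val}}, [pgtables_bytes, ...]).

    Hypothetical helper: groups rss-counter complaints by the (hashed)
    mm pointer and collects leaked page-table byte counts.
    """
    counters = defaultdict(dict)
    leaked = []
    for line in log_text.splitlines():
        m = RSS_RE.search(line)
        if m:
            idx = int(m.group("idx"))
            # Fall back to a generic label for indices not in the table.
            name = RSS_COUNTER_NAMES.get(idx, f"idx{idx}")
            counters[m.group("mm")][name] = int(m.group("val"))
            continue
        m = PGT_RE.search(line)
        if m:
            leaked.append(int(m.group("bytes")))
    return dict(counters), leaked


# The three lines quoted in the original report.
SAMPLE = """\
BUG: Bad rss-counter state mm:00000000ff222ec1 idx:0 val:1
BUG: Bad rss-counter state mm:00000000ff222ec1 idx:1 val:2
BUG: non-zero pgtables_bytes on freeing mm: 4096
"""

if __name__ == "__main__":
    counters, leaked = parse_bad_rss(SAMPLE)
    print(counters)
    print(leaked)
```

Under the assumed mapping, the lines from the report would correspond to one file-backed page, two anonymous pages and one 4096-byte page table still accounted to the mm when it was freed. Grouping by the mm value makes it easier to see whether several complaints belong to the same exiting process.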
We had multiple kernel crashes using 5.3.18-150300.59.49-default from SLES15 SP3 running under xen-4.14.3_06-150300.3.18.2.x86_64. The effect is that at some point processes start dumping core due to SIGSEGV, and there are messages regarding "BUG: Bad RSS-counter...", usually combined with "Code: Bad RIP value." The bug was not seen with the SLES15 SP2 kernel (5.3.18-24.99-default, xen-4.14.4_02-3.40).
Created attachment 300503 [details] Stack traces from three different machines The file contains related syslog messages as well as the final stack trace from when the kernel panicked on three different machines. Before the panic, a large number of core dumps were written. The servers were all Dell PowerEdge R7415 with one AMD EPYC 7401P 24-Core Processor (latest firmware updates applied).
Created attachment 300546 [details] Two more kdumps (5.3.18-150300.59.49-default) Two more kdumps. Maybe filesystem-related (BtrFS). The system is using snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly random processes see SIGSEGV before the kernel panics.
(In reply to Ulrich.Windl from comment #3)
> Two more kdumps. Maybe filesystem-related (BtrFS). The system is using
> snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly
> random processes see SIGSEGV before the kernel panics.

Maybe also the issue is caused by Xen. At least the RAM had no problems on the machines.
Created attachment 300578 [details] Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008) Another kernel panic for 5.3.18-150300.59.49-default #1 SLE15-SP3, kcompactd-related, possibly also related to BtrFS "qgroup scan completed (inconsistency flag cleared)". Uptime before was multiple days.
Here I can easily trigger the bug by doing some I/O, such as running "rear backup" to NFS. There will be multiple core dumps while sending the data to NFS. I've updated to kernel 5.3.18-150300.59.63 in the meantime, but the problem is still there.