Bug 204339

Summary: BUG: Bad rss-counter state
Product: File System
Reporter: icytxw (icytxw)
Component: Other
Assignee: fs_other
Status: NEW
Severity: normal
CC: Ulrich.Windl
Priority: P1
Hardware: All
OS: Linux
Kernel Version: v5.2-rc6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Re-modern code
Stack traces from three different machines
Two more kdumps (5.3.18-150300.59.49-default)
Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008)

Description icytxw 2019-07-27 08:41:11 UTC
Created attachment 283997 [details]
Re-modern code

Dear all,
This bug was found in Linux kernel v5.2-rc6.
Syzkaller hit a 'BUG: Bad rss-counter state' bug.

BUG: Bad rss-counter state mm:00000000ff222ec1 idx:0 val:1
BUG: Bad rss-counter state mm:00000000ff222ec1 idx:1 val:2
BUG: non-zero pgtables_bytes on freeing mm: 4096
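
For anyone triaging similar logs, the three messages above can be pulled out of a dmesg or syslog dump with a small script. This is only a minimal sketch: the function name `parse_mm_bug_lines` is hypothetical, the regular expressions are modelled on the exact lines quoted in this report, and the mapping from `idx` to a counter name assumes the enum order used by the v5.2 kernel (other kernel versions may differ).

```python
import re

# Assumed mapping of idx values to rss counter names, based on the enum
# order in the v5.2 kernel sources; treat it as an assumption elsewhere.
RSS_COUNTER_NAMES = {
    0: "MM_FILEPAGES",
    1: "MM_ANONPAGES",
    2: "MM_SWAPENTS",
    3: "MM_SHMEMPAGES",
}

# Patterns modelled on the messages quoted in this report.
RSS_RE = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-f]+) "
    r"idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)
PGT_RE = re.compile(
    r"BUG: non-zero pgtables_bytes on freeing mm: (?P<bytes>-?\d+)"
)


def parse_mm_bug_lines(lines):
    """Return one dict per matching log line, in input order."""
    findings = []
    for line in lines:
        m = RSS_RE.search(line)
        if m:
            idx = int(m.group("idx"))
            findings.append({
                "kind": "rss-counter",
                "mm": m.group("mm"),
                "idx": idx,
                "counter": RSS_COUNTER_NAMES.get(idx, "unknown"),
                "val": int(m.group("val")),
            })
            continue
        m = PGT_RE.search(line)
        if m:
            findings.append({
                "kind": "pgtables_bytes",
                "bytes": int(m.group("bytes")),
            })
    return findings


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python3 this_script.py
    for finding in parse_mm_bug_lines(sys.stdin):
        print(finding)
```

Under that assumed mapping, the two rss-counter lines in this report would correspond to one leftover file-backed page and two leftover anonymous pages on the freed mm.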


Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 Leak:false EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:true HandleSegv:true Repro:false Trace:false}
ioctl$TIOCLINUX7(0xffffffffffffffff, 0x541c, 0x0)
r0 = syz_open_dev$sg(&(0x7f0000000380)='/dev/sg#\x00', 0x0, 0x0)
ioctl$SCSI_IOCTL_SEND_COMMAND(r0, 0x1, &(0x7f00000000c0)={0x1, 0x0, 0x8, "be"})

You can run QEMU with the following command:
qemu-system-x86_64 -m 2048 -smp 2 -net nic,model=e1000 -net user,host=10.0.2.10,hostfwd=tcp::1111-:22 -display none -serial stdio -no-reboot -enable-kvm -cpu host,migratable=off -hda /home/icy/gopath/src/github.com/google/syzkaller/image/wheezy.img -snapshot -kernel /home/icy/gopath/src/github.com/google/syzkaller/linux/arch/x86/boot/bzImage -append "earlyprintk=serial oops=panic nmi_watchdog=panic panic_on_warn=1 panic=1 ftrace_dump_on_oops=orig_cpu rodata=n vsyscall=native net.ifnames=0 biosdevname=0 root=/dev/sda console=ttyS0 kvm-intel.nested=1 kvm-intel.unrestricted_guest=1 kvm-intel.vmm_exclusive=1 kvm-intel.fasteoi=1 kvm-intel.ept=1 kvm-intel.flexpriority=1 kvm-intel.vpid=1 kvm-intel.emulate_invalid_guest_state=1 kvm-intel.eptad=1 kvm-intel.enable_shadow_vmcs=1 kvm-intel.pml=1 kvm-intel.enable_apicv=1 "

You can download wheezy.img with:
wget https://storage.googleeapi.com/syzkaller/wheezy.img

from this url:192.168.44.128
Comment 1 Ulrich.Windl 2022-02-22 10:10:32 UTC
We had multiple kernel crashes using 5.3.18-150300.59.49-default from SLES15 SP3 running under xen-4.14.3_06-150300.3.18.2.x86_64.
The effect is that at some point processes start dumping core due to SIGSEGV, and there are messages regarding "BUG: Bad RSS-counter...", usually combined with "Code: Bad RIP value."
The bug was not seen with the SLES15 SP2 kernel (5.3.18-24.99-default, xen-4.14.4_02-3.40).
Comment 2 Ulrich.Windl 2022-02-22 10:29:18 UTC
Created attachment 300503 [details]
Stack traces from three different machines

The file contains related syslog messages as well as the final stack trace when the kernel panicked on three different machines. Before the panic, a large number of core dumps were written.
The servers were all Dell PowerEdge R7415 with one AMD EPYC 7401P 24-Core Processor (latest Firmware Updates applied).
Comment 3 Ulrich.Windl 2022-03-08 09:21:48 UTC
Created attachment 300546 [details]
Two more kdumps (5.3.18-150300.59.49-default)

Two more kdumps. Maybe filesystem-related (BtrFS). The system is using snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly random processes see SIGSEGV before the kernel panics.
Comment 4 Ulrich.Windl 2022-03-08 09:23:38 UTC
(In reply to Ulrich.Windl from comment #3)
> Two more kdumps. Maybe filesystem-related (BtrFS). The system is using
> snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly
> random processes see SIGSEGV before the kernel panics.

The issue may also be caused by Xen. At least the RAM on the machines showed no problems.
Comment 5 Ulrich.Windl 2022-03-16 12:49:55 UTC
Created attachment 300578 [details]
Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008)

Another kernel panic for 5.3.18-150300.59.49-default #1 SLE15-SP3, kcompactd-related, possibly also related to BtrFS "qgroup scan completed (inconsistency flag cleared)". Uptime before the panic was multiple days.
Comment 6 Ulrich.Windl 2022-04-21 06:41:24 UTC
Here I can easily trigger the bug by doing some I/O, such as running "rear backup" to NFS. There will be multiple core dumps while the data is being sent to NFS.
In the meantime I've updated to kernel 5.3.18-150300.59.63, but the problem is still there.