Bug 204339 - BUG: Bad rss-counter state
Summary: BUG: Bad rss-counter state
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-27 08:41 UTC by icytxw
Modified: 2022-04-21 06:41 UTC
CC List: 1 user

See Also:
Kernel Version: v5.2-rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Re-modern code (9.24 KB, text/x-csrc)
2019-07-27 08:41 UTC, icytxw
Stack traces from three different machines (18.70 KB, text/plain)
2022-02-22 10:29 UTC, Ulrich.Windl
Two more kdumps (5.3.18-150300.59.49-default) (11.94 KB, text/plain)
2022-03-08 09:21 UTC, Ulrich.Windl
Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008) (6.71 KB, text/plain)
2022-03-16 12:49 UTC, Ulrich.Windl

Description icytxw 2019-07-27 08:41:11 UTC
Created attachment 283997 [details]
Re-modern code

Dear all,
This bug was found in Linux kernel v5.2-rc6.
Syzkaller hit a 'BUG: Bad rss-counter state' bug.

BUG: Bad rss-counter state mm:00000000ff222ec1 idx:0 val:1
BUG: Bad rss-counter state mm:00000000ff222ec1 idx:1 val:2
BUG: non-zero pgtables_bytes on freeing mm: 4096
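
For anyone triaging similar logs, here is a minimal sketch that decodes the "Bad rss-counter state" lines above. It assumes the idx value indexes the kernel's per-mm RSS counters in the v5.2 order (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMEMPAGES); that mapping is taken from the kernel sources as I understand them, not from this report, and it can differ in other kernel versions. The function name decode_rss_line is made up for this example. The mm value is most likely a hashed pointer, so it is only useful for grouping lines that belong to the same address space.

```python
import re

# Counter names by index. ASSUMPTION: v5.2 enum order; other kernels may differ.
RSS_COUNTER_NAMES = ["MM_FILEPAGES", "MM_ANONPAGES", "MM_SWAPENTS", "MM_SHMEMPAGES"]

# Matches the message format shown in this report.
RSS_LINE = re.compile(
    r"BUG: Bad rss-counter state "
    r"mm:(?P<mm>[0-9a-fA-F]+) idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)


def decode_rss_line(line):
    """Return (mm, counter_name, value) for a matching log line, else None."""
    match = RSS_LINE.search(line)
    if match is None:
        return None
    idx = int(match.group("idx"))
    if idx < len(RSS_COUNTER_NAMES):
        name = RSS_COUNTER_NAMES[idx]
    else:
        name = f"unknown({idx})"
    return match.group("mm"), name, int(match.group("val"))


if __name__ == "__main__":
    # The two counter lines from the description, plus one non-matching line.
    sample = [
        "BUG: Bad rss-counter state mm:00000000ff222ec1 idx:0 val:1",
        "BUG: Bad rss-counter state mm:00000000ff222ec1 idx:1 val:2",
        "BUG: non-zero pgtables_bytes on freeing mm: 4096",
    ]
    for line in sample:
        decoded = decode_rss_line(line)
        if decoded is not None:
            mm, name, value = decoded
            print(f"mm {mm}: {name} left at {value} page(s)")
```

Under that assumption, the two lines in this report would mean one file-backed page and two anonymous pages were still accounted to the mm when it was freed.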


Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:true RepeatTimes:0 Procs:1 Sandbox: Fault:false FaultCall:-1 FaultNth:0 Leak:false EnableTun:false EnableNetDev:false EnableNetReset:false EnableCgroups:false EnableBinfmtMisc:false EnableCloseFds:false UseTmpDir:true HandleSegv:true Repro:false Trace:false}
ioctl$TIOCLINUX7(0xffffffffffffffff, 0x541c, 0x0)
r0 = syz_open_dev$sg(&(0x7f0000000380)='/dev/sg#\x00', 0x0, 0x0)
ioctl$SCSI_IOCTL_SEND_COMMAND(r0, 0x1, &(0x7f00000000c0)={0x1, 0x0, 0x8, "be"})

You can run QEMU with this command:
qemu-system-x86_64 -m 2048 -smp 2 -net nic,model=e1000 -net user,host=10.0.2.10,hostfwd=tcp::1111-:22 -display none -serial stdio -no-reboot -enable-kvm -cpu host,migratable=off -hda /home/icy/gopath/src/github.com/google/syzkaller/image/wheezy.img -snapshot -kernel /home/icy/gopath/src/github.com/google/syzkaller/linux/arch/x86/boot/bzImage -append "earlyprintk=serial oops=panic nmi_watchdog=panic panic_on_warn=1 panic=1 ftrace_dump_on_oops=orig_cpu rodata=n vsyscall=native net.ifnames=0 biosdevname=0 root=/dev/sda console=ttyS0 kvm-intel.nested=1 kvm-intel.unrestricted_guest=1 kvm-intel.vmm_exclusive=1 kvm-intel.fasteoi=1 kvm-intel.ept=1 kvm-intel.flexpriority=1 kvm-intel.vpid=1 kvm-intel.emulate_invalid_guest_state=1 kvm-intel.eptad=1 kvm-intel.enable_shadow_vmcs=1 kvm-intel.pml=1 kvm-intel.enable_apicv=1 "

You can download wheezy.img with:
wget https://storage.googleeapi.com/syzkaller/wheezy.img

from this url:192.168.44.128
Comment 1 Ulrich.Windl 2022-02-22 10:10:32 UTC
We had multiple kernel crashes using 5.3.18-150300.59.49-default from SLES15 SP3 running under xen-4.14.3_06-150300.3.18.2.x86_64.
The effect is that at some moment processes start dumping core due to SIGSEGV, and there are messages regarding "BUG: Bad RSS-counter..." usually combined with "Code: Bad RIP value."
The bug was not seen with the SLES15 SP2 kernel (5.3.18-24.99-default, xen-4.14.4_02-3.40).
Comment 2 Ulrich.Windl 2022-02-22 10:29:18 UTC
Created attachment 300503 [details]
Stack traces from three different machines

The file contains related syslog messages as well as the final stack trace when the kernel panicked on three different machines. Before the panic, a large number of core dumps were written.
The servers were all Dell PowerEdge R7415 with one AMD EPYC 7401P 24-Core Processor (latest Firmware Updates applied).
Comment 3 Ulrich.Windl 2022-03-08 09:21:48 UTC
Created attachment 300546 [details]
Two more kdumps (5.3.18-150300.59.49-default)

Two more kdumps. Maybe filesystem-related (BtrFS). The system is using snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly random processes see SIGSEGV before the kernel panics.
Comment 4 Ulrich.Windl 2022-03-08 09:23:38 UTC
(In reply to Ulrich.Windl from comment #3)
> Two more kdumps. Maybe filesystem-related (BtrFS). The system is using
> snapshots (BtrFS and OCFS2 reflink-snapshots). Symptom is that seemingly
> random processes see SIGSEGV before the kernel panics.

The issue may also be caused by Xen. At least the RAM on the machines showed no problems.
Comment 5 Ulrich.Windl 2022-03-16 12:49:55 UTC
Created attachment 300578 [details]
Another kernel panic (BUG: kernel NULL pointer dereference, address: 0000000000000008)

Another kernel panic for 5.3.18-150300.59.49-default #1 SLE15-SP3, kcompactd-related, possibly also related to BtrFS "qgroup scan completed (inconsistency flag cleared)". Uptime before was multiple days.
Comment 6 Ulrich.Windl 2022-04-21 06:41:24 UTC
Here I can easily trigger the bug by doing some I/O, like doing "rear backup" to NFS. There will be multiple core dumps while sending the data to NFS.
I've updated to kernel 5.3.18-150300.59.63 in the meantime, but the problem is still there.
