With 5.15.2 and latest stable-queue patches applied (including new get_wchan stuff), I got the following BUG and eventually a system crash (with nothing saved on disk, sadly): BUG: unable to handle page fault for address: ffffb305c4efbd58 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 100000067 P4D 100000067 PUD 1001b7067 PMD 127965067 PTE 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 8954 Comm: ThreadPoolForeg Tainted: G S 5.15.3-arch1-0.2 #34 34806430f7aa311a8381e00d1e077ee548b9b809 Hardware name: Dell Inc. XPS 13 9300/0PP9G2, BIOS 1.8.0 10/07/2021 RIP: 0010:__unwind_start+0x10b/0x1e0 Code: 9f fb ff 85 c0 75 d2 eb c0 65 48 8b 04 25 c0 bb 01 00 48 39 c6 0f 84 86 00 00 00 48 8b 86 98 1b 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 89 RSP: 0018:ffffb305c4787b38 EFLAGS: 00010002 RAX: ffffb305c4efbd30 RBX: ffffb305c4efbd30 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff90233d552640 RDI: ffffb305c4efbd68 RBP: ffffb305c4787b58 R08: 0000000000010000 R09: 0000000000001004 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffffb305c4787b58 R14: 0000000000000a09 R15: ffffffffa20530a0 FS: 00007f12ee91d640(0000) GS:ffff9028b0840000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffb305c4efbd58 CR3: 00000002fc812002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> __get_wchan+0x54/0xb0 get_wchan+0x5c/0x70 do_task_stat+0xd06/0xe10 proc_single_show+0x47/0xa0 seq_read_iter+0x114/0x470 seq_read+0x12a/0x170 vfs_read+0x92/0x190 ksys_read+0x67/0xe0 do_syscall_64+0x56/0x80 ? __audit_syscall_exit+0x24d/0x2a0 ? syscall_exit_to_user_mode+0x23/0x40 ? do_syscall_64+0x63/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f1301a8c89c Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 a9 56 f9 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 RSP: 002b:00007f12ee91b940 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 00001ff2028e9600 RCX: 00007f1301a8c89c RDX: 0000000000001000 RSI: 00001ff2000c3c00 RDI: 000000000000011b RBP: 00007f1301b5f300 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000001000 R11: 0000000000000246 R12: 00001ff2000c3c00 R13: 0000000000000d68 R14: 00007f1301b5e700 R15: 0000000000001000 </TASK> Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 libcurve25519_generic libchacha libblake2s_generic fcomm snd_seq_dummy snd_hrtimer snd_seq typec_displayport nft_reject_inet nf_rejec> intel_tcc_cooling x86_pkg_temp_thermal dell_wmi_sysman dell_wmi firmware_attributes_class intel_powerclamp btrfs coretemp wmi_bmof intel_wmi_thunderbolt dell_smbios ptor xor raid6_pq libcrc32c dcdbas kvm_intel kvm snd_sof_pci_intel_icl irqbypass snd_sof_i> intel_ishtp tpm_crb processor_thermal_device_pci_legacy processor_thermal_device processor_thermal_rfim processor_thermal_mbox igb processor_thermal_rapl intel_rapl_common a typec thunderbolt intel_soc_dts_iosf roles tpm_tis_core wmi ov8856 v4l2_fwnode v4> CR2: ffffb305c4efbd58 ---[ end trace 5324ec285bb4a456 ]--- RIP: 0010:__unwind_start+0x10b/0x1e0 Code: 9f fb ff 85 c0 75 d2 eb c0 65 48 8b 04 25 c0 bb 01 00 48 39 c6 0f 84 86 00 00 00 48 8b 86 98 1b 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 89 RSP: 0018:ffffb305c4787b38 EFLAGS: 00010002 RAX: ffffb305c4efbd30 RBX: ffffb305c4efbd30 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff90233d552640 RDI: ffffb305c4efbd68 RBP: ffffb305c4787b58 R08: 0000000000010000 R09: 0000000000001004 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffffb305c4787b58 R14: 0000000000000a09 R15: ffffffffa20530a0 FS: 00007f12ee91d640(0000) GS:ffff9028b0840000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffb305c4efbd58 CR3: 00000002fc812002 CR4: 0000000000770ee0 PKRU: 55555554 note: ThreadPoolForeg[8954] exited with preempt_count 1 $ cat /proc/cmdline cryptdevice=UUID=a38849e4-f207-4d4c-a16f-0a64ef515be7:lvm:allow-discards root=UUID=bd65a79e-bc5e-4359-8cd3-6d5bcee2e948 resume=UUID=703abfcf-4913-4fdb-9687-9037e3965401 rw mitigations=off fbcon=font:TER16x32 msr.allow_writes=on splash i915.fastboot=1 quiet rd.systemd.show_status=auto rd.udev.log-priority=3 kfence.sample_interval=100 This happened when joining a Google Meet call, so it might come from opening mic / webcam.
Created attachment 299577 [details] dmesg with KASAN enabled I built a KASAN version of my kernel and it finds a use after free in the ACPI code [ 5.199776] BUG: KASAN: use-after-free in acpi_ex_system_memory_space_handler+0x4af/0x500
So I can reproduce the BUG 100% of the time by starting a google meet call on Chrome (but not on Firefox). Also, I can't hit the BUG with KASAN enabled. Running Arch Linux, Gnome 42-dev, Wayland, on a Dell XPS 9300.
Does 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 fix this for you?
No, it's already applied. I have all the patches from the 5.15 stable queue applied up to 03af7745988ef53819414e427afce4cb4185dcc0
What does your "./scripts/faddr2line vmlinux __get_wchan+0x54/0xb0" produce? Mine isn't matching the size.
The backtrace posted this morning was with a slightly earlier rev (179756d16045dc3812227354f3432c061f4403aa). With 03af7745988ef53819414e427afce4cb4185dcc0 the backtrace is [ 275.170276] __get_wchan+0x44/0xa0 [ 275.170279] get_wchan+0x5c/0x70 [ 275.170282] do_task_stat+0xcdf/0xdf0 [ 275.170286] proc_single_show+0x47/0xa0 [ 275.170289] seq_read_iter+0x114/0x470 [ 275.170291] seq_read+0xfd/0x140 [ 275.170293] vfs_read+0x92/0x190 [ 275.170296] ksys_read+0x5f/0xe0 [ 275.170298] do_syscall_64+0x56/0x80 [ 275.170301] ? syscall_exit_to_user_mode+0x23/0x40 [ 275.170303] ? do_syscall_64+0x63/0x80 [ 275.170304] ? exc_page_fault+0x72/0x170 [ 275.170306] entry_SYSCALL_64_after_hwframe+0x44/0xae and resolves to: get_wchan+0x5c/0x70: get_wchan at kernel/sched/core.c:1978 do_task_stat+0xcdf/0xdf0: do_task_stat at fs/proc/array.c:544 proc_single_show+0x47/0xa0: put_task_struct at include/linux/sched/task.h:113 (inlined by) proc_single_show at fs/proc/base.c:780 seq_read_iter+0x114/0x470: seq_read_iter at fs/seq_file.c:230 seq_read+0xfd/0x140: seq_read at fs/seq_file.c:163 vfs_read+0x92/0x190: vfs_read at fs/read_write.c:483 ksys_read+0x5f/0xe0: ksys_read at fs/read_write.c:623
Apologies, I missed the first line of the stack. __get_wchan+0x44/0xa0: __get_wchan at arch/x86/kernel/process.c:952 get_wchan+0x5c/0x70: get_wchan at kernel/sched/core.c:1978 do_task_stat+0xcdf/0xdf0: do_task_stat at fs/proc/array.c:544 proc_single_show+0x47/0xa0: put_task_struct at include/linux/sched/task.h:113 (inlined by) proc_single_show at fs/proc/base.c:780 seq_read_iter+0x114/0x470: seq_read_iter at fs/seq_file.c:230 seq_read+0xfd/0x140: seq_read at fs/seq_file.c:163 vfs_read+0x92/0x190: vfs_read at fs/read_write.c:483 ksys_read+0x5f/0xe0: ksys_read at fs/read_write.c:623
That's this line for me: for (unwind_start(&state, p, NULL, NULL); !unwind_done(&state); "state" is local stack. Did "p" get corrupted? Weird...
Yes, that's the line too in my working directory. And it's very reproducible and does not happen in 5.15.2 with the same build configuration.
are you able to bisect the offending patch? I wonder if 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 is _causing_ the problem when not combined with other (missing?) patches...
I can confirm that I cannot trigger the BUG with 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 removed (and keeping all the other patches)
Created attachment 299611 [details] BUG with KASAN enabled I have reproduced the bug with fb7bf8982aa7e07a73a2f9d0c0f02191b28e40bd and KASAN, once more by launching Chrome. Sadly, I'm not sure it gives more insights.
Confirmed. I can reproduce this bug at will. In my case, it's a regression between 5.14.18 and 5.14.19, mainline, untainted. (I am not yet running mainline 5.15). This bug occurs on every boot of the machine (it's a file server) during initialization of NFS. Specifically, here's the (harmless looking) line of shell script that triggers the oops: /sbin/pidof nfsd rpc.idmapd rpc.nfsd rpcbind > /dev/null If I disable NFS (so that it is not started by /etc/init.d) but then start it manually once the machine is up, it works fine (no oops). I have another server very similar to the first that also serves NFS; the oops has not yet shown up there. It's consistent enough to be a good candidate for "git bisect" if need be. See bugzilla bug 209739 for details of how my /etc/init.d/nfs is crafted. Here is a representative OOPS, looking very much like the one from the OP, except that I'm running a mainline (hand-compiled, non-tainted) kernel, here under Slackware: NFSD: Using UMH upcall client tracking operations. NFSD: starting 90-second grace period (net f0000098) BUG: unable to handle page fault for address: ffffc900081b7de8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 100001067 P4D 100001067 PUD 10005a067 PMD 10df42067 PTE 0 Oops: 0000 [#1] SMP CPU: 5 PID: 1421 Comm: pidof Not tainted 5.14.19 #1 Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015 RIP: 0010:__unwind_start+0x105/0x1d0 Code: ff 85 c0 75 d2 eb c0 65 48 8b 04 25 00 ad 01 00 48 39 c6 66 90 0f 84 87 00 00 00 48 8b 86 d8 09 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 3d f0 10 00 81 48 89 45 48 RSP: 0018:ffffc9000830fbe8 EFLAGS: 00010087 RAX: ffffc900081b7dc0 RBX: ffffc900081b7dc0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff888101b0aa00 RDI: ffffc900081b7df8 RBP: ffffc9000830fc08 R08: 00000000080046eb R09: 0000000001001000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000467 R14: 0000000000000467 R15: 0000000000000001 FS: 00007fa1e7a6b740(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc900081b7de8 CR3: 000000014195b000 CR4: 00000000000406e0 Call Trace: <TASK> __get_wchan+0x41/0xa0 get_wchan+0x64/0x70 do_task_stat+0xcd1/0xdd0 proc_single_show+0x57/0xc0 seq_read_iter+0xfd/0x470 seq_read+0xf9/0x150 vfs_read+0xa7/0x190 ? do_sys_openat2+0x90/0x160 ksys_read+0x68/0xf0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa1e792c3ce Code: c0 e9 e6 fe ff ff 50 48 8d 3d 7e 53 0a 00 e8 29 ea 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 RSP: 002b:00007ffedea27f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 00007fa1e7a2c970 RCX: 00007fa1e792c3ce RDX: 0000000000000400 RSI: 00000000019a8580 RDI: 0000000000000004 RBP: 0000000000000004 R08: 000000007fffffff R09: 00007ffedea27e20 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00000000019a8d20 R15: 00000000019a9490 </TASK> Modules linked in: netconsole dm_crypt encrypted_keys nouveau uas usb_storage drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops f2fs crc32_generic CR2: ffffc900081b7de8 ---[ end trace e6e8d8331667cbf6 ]--- RIP: 0010:__unwind_start+0x105/0x1d0 Code: ff 85 c0 75 d2 eb c0 65 48 8b 04 25 00 ad 01 00 48 39 c6 66 90 0f 84 87 00 00 00 48 8b 86 d8 09 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 3d f0 10 00 81 48 89 45 48 RSP: 0018:ffffc9000830fbe8 EFLAGS: 00010087 RAX: ffffc900081b7dc0 RBX: ffffc900081b7dc0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff888101b0aa00 RDI: ffffc900081b7df8 RBP: ffffc9000830fc08 R08: 00000000080046eb R09: 0000000001001000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000467 R14: 0000000000000467 R15: 0000000000000001 FS: 00007fa1e7a6b740(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc900081b7de8 CR3: 000000014195b000 CR4: 00000000000406e0
Just confirmed that this also reproduces on my desktop/development machine. It's quite a bit different from the server system mentioned in comment 13, with a different startup sequence. But it, too, oopses on exactly the same "pidof" line within the NFS start code. This may be NFS related and not specifically a "memory management" bug. Perhaps the OP can update the metadata for bug 215031 to reflect that: Regression: Yes Kernel version: 5.14.19 as well as 5.15.2