Bug 215031
| Summary: | BUG: unable to handle page fault in get_wchan | | |
|---|---|---|---|
| Product: | Memory Management | Reporter: | François Guerraz (kubrick) |
| Component: | Other | Assignee: | Andrew Morton (akpm) |
| Status: | NEW | | |
| Severity: | normal | CC: | bugs-a21, kees, keescook, peterz |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 5.15.3 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg with KASAN enabled; BUG with KASAN enabled | | |
Description
François Guerraz
2021-11-15 11:46:10 UTC
Created attachment 299577: dmesg with KASAN enabled
I built a KASAN version of my kernel and it finds a use after free in the ACPI code
[ 5.199776] BUG: KASAN: use-after-free in acpi_ex_system_memory_space_handler+0x4af/0x500
So I can reproduce the BUG 100% of the time by starting a Google Meet call on Chrome (but not on Firefox). Also, I can't hit the BUG with KASAN enabled. Running Arch Linux, GNOME 42-dev, Wayland, on a Dell XPS 9300.

Does 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 fix this for you?

No, it's already applied. I have all the patches from the 5.15 stable queue applied up to 03af7745988ef53819414e427afce4cb4185dcc0.

What does your "./scripts/faddr2line vmlinux __get_wchan+0x54/0xb0" produce? Mine isn't matching the size.

The backtrace posted this morning was with a slightly earlier rev (179756d16045dc3812227354f3432c061f4403aa). With 03af7745988ef53819414e427afce4cb4185dcc0 the backtrace is:
[ 275.170276] __get_wchan+0x44/0xa0
[ 275.170279] get_wchan+0x5c/0x70
[ 275.170282] do_task_stat+0xcdf/0xdf0
[ 275.170286] proc_single_show+0x47/0xa0
[ 275.170289] seq_read_iter+0x114/0x470
[ 275.170291] seq_read+0xfd/0x140
[ 275.170293] vfs_read+0x92/0x190
[ 275.170296] ksys_read+0x5f/0xe0
[ 275.170298] do_syscall_64+0x56/0x80
[ 275.170301] ? syscall_exit_to_user_mode+0x23/0x40
[ 275.170303] ? do_syscall_64+0x63/0x80
[ 275.170304] ? exc_page_fault+0x72/0x170
[ 275.170306] entry_SYSCALL_64_after_hwframe+0x44/0xae
and resolves to:
get_wchan+0x5c/0x70:
get_wchan at kernel/sched/core.c:1978
do_task_stat+0xcdf/0xdf0:
do_task_stat at fs/proc/array.c:544
proc_single_show+0x47/0xa0:
put_task_struct at include/linux/sched/task.h:113
(inlined by) proc_single_show at fs/proc/base.c:780
seq_read_iter+0x114/0x470:
seq_read_iter at fs/seq_file.c:230
seq_read+0xfd/0x140:
seq_read at fs/seq_file.c:163
vfs_read+0x92/0x190:
vfs_read at fs/read_write.c:483
ksys_read+0x5f/0xe0:
ksys_read at fs/read_write.c:623

Apologies, I missed the first line of the stack.
__get_wchan+0x44/0xa0:
__get_wchan at arch/x86/kernel/process.c:952
get_wchan+0x5c/0x70:
get_wchan at kernel/sched/core.c:1978
do_task_stat+0xcdf/0xdf0:
do_task_stat at fs/proc/array.c:544
proc_single_show+0x47/0xa0:
put_task_struct at include/linux/sched/task.h:113
(inlined by) proc_single_show at fs/proc/base.c:780
seq_read_iter+0x114/0x470:
seq_read_iter at fs/seq_file.c:230
seq_read+0xfd/0x140:
seq_read at fs/seq_file.c:163
vfs_read+0x92/0x190:
vfs_read at fs/read_write.c:483
ksys_read+0x5f/0xe0:
ksys_read at fs/read_write.c:623

That's this line for me:
for (unwind_start(&state, p, NULL, NULL); !unwind_done(&state);
"state" is local stack. Did "p" get corrupted? Weird...

Yes, that's the line too in my working directory. And it's very reproducible and does not happen in 5.15.2 with the same build configuration.

Are you able to bisect the offending patch? I wonder if 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 is _causing_ the problem when not combined with other (missing?) patches...

I can confirm that I cannot trigger the BUG with 5d1ceb3969b6b2e47e2df6d17790a7c5a20fcbb4 removed (and keeping all the other patches).

Created attachment 299611: BUG with KASAN enabled
I have reproduced the bug with fb7bf8982aa7e07a73a2f9d0c0f02191b28e40bd and KASAN, once more by launching Chrome.
Sadly, I'm not sure it gives any more insight.
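For anyone trying to narrow down a trigger: the resolved backtrace above enters the kernel through an ordinary read() on a procfs file and reaches get_wchan from do_task_stat (fs/proc/array.c), which is the handler behind /proc/<pid>/stat. Below is a minimal user-space sketch of that read path in Python. The helper names parse_stat_line, read_stat and scan_all are hypothetical, the position of the wchan field (35) is taken from proc(5), and this only illustrates what a process scanner does; it is not a confirmed reproducer.

```python
import os

# 1-based position of the wchan field in /proc/<pid>/stat, per proc(5).
WCHAN_FIELD = 35


def parse_stat_line(line):
    """Parse one /proc/<pid>/stat line into a small dict (hypothetical helper)."""
    # comm is wrapped in parentheses and may itself contain spaces or ")",
    # so split on the LAST ")" instead of splitting on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state), rest[k] is field 3 + k
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": rest[0],
        "wchan": int(rest[WCHAN_FIELD - 3]),
    }


def read_stat(pid):
    """Read and parse /proc/<pid>/stat; each read runs the kernel's stat handler."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat_line(f.read())


def scan_all():
    """Read the stat file of every numeric entry in /proc, as a process scanner would."""
    results = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            results.append(read_stat(entry))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            # The task exited or became unreadable between listdir() and open().
            continue
    return results
```

Calling scan_all() in a loop while the suspect workload starts is one way to exercise this path from user space. What the wchan field actually contains differs between kernel versions, so treat the value only as an indication that the code path was reached.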
Confirmed. I can reproduce this bug at will. In my case, it's a regression between 5.14.18 and 5.14.19, mainline, untainted. (I am not yet running mainline 5.15).

This bug occurs on every boot of the machine (it's a file server) during initialization of NFS. Specifically, here's the (harmless looking) line of shell script that triggers the oops:
/sbin/pidof nfsd rpc.idmapd rpc.nfsd rpcbind > /dev/null

If I disable NFS (so that it is not started by /etc/init.d) but then start it manually once the machine is up, it works fine (no oops). I have another server very similar to the first that also serves NFS; the oops has not yet shown up there. It's consistent enough to be a good candidate for "git bisect" if need be. See bugzilla bug 209739 for details of how my /etc/init.d/nfs is crafted.

Here is a representative OOPS, looking very much like the one from the OP, except that I'm running a mainline (hand-compiled, non-tainted) kernel, here under Slackware:
NFSD: Using UMH upcall client tracking operations.
NFSD: starting 90-second grace period (net f0000098)
BUG: unable to handle page fault for address: ffffc900081b7de8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 100001067 P4D 100001067 PUD 10005a067 PMD 10df42067 PTE 0
Oops: 0000 [#1] SMP
CPU: 5 PID: 1421 Comm: pidof Not tainted 5.14.19 #1
Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 R2.0, BIOS 2603 06/26/2015
RIP: 0010:__unwind_start+0x105/0x1d0
Code: ff 85 c0 75 d2 eb c0 65 48 8b 04 25 00 ad 01 00 48 39 c6 66 90 0f 84 87 00 00 00 48 8b 86 d8 09 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 3d f0 10 00 81 48 89 45 48
RSP: 0018:ffffc9000830fbe8 EFLAGS: 00010087
RAX: ffffc900081b7dc0 RBX: ffffc900081b7dc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff888101b0aa00 RDI: ffffc900081b7df8
RBP: ffffc9000830fc08 R08: 00000000080046eb R09: 0000000001001000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000467 R14: 0000000000000467 R15: 0000000000000001
FS: 00007fa1e7a6b740(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc900081b7de8 CR3: 000000014195b000 CR4: 00000000000406e0
Call Trace:
<TASK>
__get_wchan+0x41/0xa0
get_wchan+0x64/0x70
do_task_stat+0xcd1/0xdd0
proc_single_show+0x57/0xc0
seq_read_iter+0xfd/0x470
seq_read+0xf9/0x150
vfs_read+0xa7/0x190
? do_sys_openat2+0x90/0x160
ksys_read+0x68/0xf0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa1e792c3ce
Code: c0 e9 e6 fe ff ff 50 48 8d 3d 7e 53 0a 00 e8 29 ea 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
RSP: 002b:00007ffedea27f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 00007fa1e7a2c970 RCX: 00007fa1e792c3ce
RDX: 0000000000000400 RSI: 00000000019a8580 RDI: 0000000000000004
RBP: 0000000000000004 R08: 000000007fffffff R09: 00007ffedea27e20
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00000000019a8d20 R15: 00000000019a9490
</TASK>
Modules linked in: netconsole dm_crypt encrypted_keys nouveau uas usb_storage drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops f2fs crc32_generic
CR2: ffffc900081b7de8
---[ end trace e6e8d8331667cbf6 ]---
RIP: 0010:__unwind_start+0x105/0x1d0
Code: ff 85 c0 75 d2 eb c0 65 48 8b 04 25 00 ad 01 00 48 39 c6 66 90 0f 84 87 00 00 00 48 8b 86 d8 09 00 00 48 8d 78 38 48 89 7d 38 <48> 8b 50 28 48 89 55 40 48 8b 40 30 48 3d f0 10 00 81 48 89 45 48
RSP: 0018:ffffc9000830fbe8 EFLAGS: 00010087
RAX: ffffc900081b7dc0 RBX: ffffc900081b7dc0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff888101b0aa00 RDI: ffffc900081b7df8
RBP: ffffc9000830fc08 R08: 00000000080046eb R09: 0000000001001000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000467 R14: 0000000000000467 R15: 0000000000000001
FS: 00007fa1e7a6b740(0000) GS:ffff88881ed40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc900081b7de8 CR3: 000000014195b000 CR4: 00000000000406e0

Just confirmed that this also reproduces on my desktop/development machine. It's quite a bit different from the server system mentioned in comment 13, with a different startup sequence. But it, too, oopses on exactly the same "pidof" line within the NFS start code.

This may be NFS related and not specifically a "memory management" bug. Perhaps the OP can update the metadata for bug 215031 to reflect that:
Regression: Yes
Kernel version: 5.14.19 as well as 5.15.2
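Going back to the register dump in the 5.14.19 oops: the byte marked <48> in the "Code:" line starts the faulting instruction, and the four bytes 48 8b 50 28 decode as a 64-bit load from a base register plus an 8-bit displacement. The sketch below, in Python, decodes only that one instruction form and checks that base + displacement matches CR2. The decoder name decode_mov_load_disp8 is hypothetical and covers just this encoding (REX.W 8B /r with mod=01); the register values are copied from the dump. It says where the bad address came from, not why the page was gone.

```python
# Names of the low eight 64-bit registers, indexed by their 3-bit encoding.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load_disp8(code):
    """Decode 'REX.W 8B /r' with mod=01: mov disp8(base), dest (hypothetical helper)."""
    rex, opcode, modrm, disp = code  # iterating bytes yields ints
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit mov load")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only base+disp8 without a SIB byte is handled")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REGS64[reg], REGS64[rm], disp


# Faulting instruction bytes, starting at the <48> marker in the Code: line.
faulting = bytes.fromhex("488b5028")
dest, base, disp = decode_mov_load_disp8(faulting)

# Values copied from the register dump in the oops.
regs = {"rax": 0xFFFFC900081B7DC0, "rsi": 0xFFFF888101B0AA00}
cr2 = 0xFFFFC900081B7DE8

fault_addr = (regs[base] + disp) & 0xFFFFFFFFFFFFFFFF
print(f"mov {disp:#x}(%{base}),%{dest} reads {fault_addr:#x}; CR2 is {cr2:#x}")
```

So the not-present page is the one RAX points into, and RAX holds the same value as RBX here. Whether that pointer is the target task's saved stack pointer is an assumption on my part that fits the unwind_start(&state, p, NULL, NULL) call discussed earlier; the dump alone does not prove it.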