Bug 217823
| Summary: | kernel bug when performing heavy IO operations | | |
|---|---|---|---|
| Product: | File System | Reporter: | dianlujitao |
| Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | bagasdotme, m.seyfarth |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | incomplete dmesg | | |
Description
dianlujitao
2023-08-25 06:24:47 UTC
Created attachment 304941
incomplete dmesg
Could you run memtest86+ 6.20/memtest86 10.6 for a cycle or two?

(In reply to Artem S. Tashkinov from comment #2)
> Could you run memtest86+ 6.20/memtest86 10.6 for a cycle or two?

I ran memtest86+ 6.20 for a cycle last night and it passed.

(In reply to dianlujitao from comment #0)
> When the IO load is heavy (compiling AOSP in my case), there's a chance to crash the kernel, the only way to recover is to perform a hard reset. Logs look like follows:
>
> Aug 25 13:52:23 arch-pc kernel: BUG: Bad page map in process tmux: client pte:8000000462500025 pmd:b99c98067
> Aug 25 13:52:23 arch-pc kernel: page:00000000460fa108 refcount:4 mapcount:-256 mapping:00000000612a1864 index:0x16 pfn:0x462500
> Aug 25 13:52:23 arch-pc kernel: memcg:ffff8a1056ed0000
> Aug 25 13:52:23 arch-pc kernel: aops:btrfs_aops [btrfs] ino:9c4635 dentry name:"locale-archive"
> Aug 25 13:52:23 arch-pc kernel: flags: 0x2ffff5800002056(referenced|uptodate|lru|workingset|private|node=0|zone=2|lastcpupid=0xffff)
> Aug 25 13:52:23 arch-pc kernel: page_type: 0xfffffeff(offline)
> Aug 25 13:52:23 arch-pc kernel: raw: 02ffff5800002056 ffffe6e210c05248 ffffe6e20e714dc8 ffff8a10472a8c70
> Aug 25 13:52:23 arch-pc kernel: raw: 0000000000000016 0000000000000001 00000003fffffeff ffff8a1056ed0000
> Aug 25 13:52:23 arch-pc kernel: page dumped because: bad pte
> Aug 25 13:52:23 arch-pc kernel: addr:00007f5fc9816000 vm_flags:08000071 anon_vma:0000000000000000 mapping:ffff8a10472a8c70 index:16
> Aug 25 13:52:23 arch-pc kernel: file:locale-archive fault:filemap_fault mmap:btrfs_file_mmap [btrfs] read_folio:btrfs_read_folio [btrfs]
> Aug 25 13:52:23 arch-pc kernel: CPU: 40 PID: 2033787 Comm: tmux: client Tainted: G OE 6.4.11-zen2-1-zen #1 a571467d6effd6120b1e64d2f88f90c58106da17
> Aug 25 13:52:23 arch-pc kernel: Hardware name: JGINYUE X99-8D3/2.5G Server/X99-8D3/2.5G Server, BIOS 5.11 06/30/2022
> Aug 25 13:52:23 arch-pc kernel: Call Trace:
> Aug 25 13:52:23 arch-pc kernel: <TASK>
> Aug 25 13:52:23 arch-pc kernel: dump_stack_lvl+0x47/0x60
> Aug 25 13:52:23 arch-pc kernel: print_bad_pte+0x194/0x250
> Aug 25 13:52:23 arch-pc kernel: ? page_remove_rmap+0x8d/0x260
> Aug 25 13:52:23 arch-pc kernel: unmap_page_range+0xbb1/0x20f0
> Aug 25 13:52:23 arch-pc kernel: unmap_vmas+0x142/0x220
> Aug 25 13:52:23 arch-pc kernel: exit_mmap+0xe4/0x350
> Aug 25 13:52:23 arch-pc kernel: mmput+0x5f/0x140
> Aug 25 13:52:23 arch-pc kernel: do_exit+0x31f/0xbc0
> Aug 25 13:52:23 arch-pc kernel: do_group_exit+0x31/0x80
> Aug 25 13:52:23 arch-pc kernel: __x64_sys_exit_group+0x18/0x20
> Aug 25 13:52:23 arch-pc kernel: do_syscall_64+0x60/0x90
> Aug 25 13:52:23 arch-pc kernel: entry_SYSCALL_64_after_hwframe+0x77/0xe1
> Aug 25 13:52:23 arch-pc kernel: RIP: 0033:0x7f5fca0da14d
> Aug 25 13:52:23 arch-pc kernel: Code: Unable to access opcode bytes at 0x7f5fca0da123.
> Aug 25 13:52:23 arch-pc kernel: RSP: 002b:00007fff54a44358 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
> Aug 25 13:52:23 arch-pc kernel: RAX: ffffffffffffffda RBX: 00007f5fca23ffa8 RCX: 00007f5fca0da14d
> Aug 25 13:52:23 arch-pc kernel: RDX: 00000000000000e7 RSI: fffffffffffffeb8 RDI: 0000000000000000
> Aug 25 13:52:23 arch-pc kernel: RBP: 0000000000000002 R08: 00007fff54a442f8 R09: 00007fff54a4421f
> Aug 25 13:52:23 arch-pc kernel: R10: 00007fff54a44130 R11: 0000000000000206 R12: 0000000000000000
> Aug 25 13:52:23 arch-pc kernel: R13: 0000000000000000 R14: 00007f5fca23e680 R15: 00007f5fca23ffc0
> Aug 25 13:52:23 arch-pc kernel: </TASK>
> Aug 25 13:52:23 arch-pc kernel: Disabling lock debugging due to kernel taint
>
> Full log is available at https://fars.ee/HJw3
> Notice that the issue is introduced by linux kernel released in recent months.

What kernel version do you have this issue? And last known good version that doesn't have it?

(In reply to Bagas Sanjaya from comment #4)
> What kernel version do you have this issue? And last known good version that doesn't have it?

My kernel was 6.4.11 zen kernel.
Following the suggestion from https://lore.kernel.org/linux-btrfs/a21684a4-ee7e-4404-85a2-2ab1f4a1623a@gmx.com/, I've switched to the less ricing `linux` kernel from Arch repo for now. I'll keep an eye on it and report here once the issue happens again. I don't know the last good version, sorry.

FWIW, I've seen probably the same issue on my X99 system, too: https://bugzilla.kernel.org/show_bug.cgi?id=216688

But I've seen this for a very long time now over the past two years. And my CPU recently had issues with the CPU memory controller and spits out errors when using all 8 memory channels, 4 memory channels are fine though.

Memtest was also running fine on my system. From my observations, this problem could only be triggered by heavy compile jobs; not with games. While I've still seen this with the default CachyOS-Kernel recently, I haven't seen it with my custom Kernel for some time now that also carries some experimental patches from the LKML and elsewhere around. Maybe something in there actually helps to either fix or mitigate this for dianlujitao, too? It is available in my github repo if someone wants to test it (and applies on top of 6.4.12): https://github.com/ms178/archpkgbuilds/blob/main/packages/linux-cachyos/0001-ms178.patch

(In reply to Marcus Seyfarth from comment #6)
> FWIW, I've seen probably the same issue on my X99 system, too:
> https://bugzilla.kernel.org/show_bug.cgi?id=216688
>
> But I've seen this for a very long time now over the past two years. And my CPU recently had issues with the CPU memory controller and spits out errors when using all 8 memory channels, 4 memory channels are fine though.
>
> Memtest was also running fine on my system. From my observations, this problem could only be triggered by heavy compile jobs; not with games. While I've still seen this with the default CachyOS-Kernel recently, I haven't seen it with my custom Kernel for some time now that also carries some experimental patches from the LKML and elsewhere around. Maybe something in there actually helps to either fix or mitigate this for dianlujitao, too? It is available in my github repo if someone wants to test it (and applies on top of 6.4.12):
> https://github.com/ms178/archpkgbuilds/blob/main/packages/linux-cachyos/0001-ms178.patch

Thank you for the info. Our logs indeed look quite similar, but page_type is missing from yours. According to https://lore.kernel.org/linux-btrfs/ZOs5j93aAmZhrA%2FG@casper.infradead.org/, it is unclear why PG_offline got set on mine. Do you get newer crash logs containing that field?

(In reply to Marcus Seyfarth from comment #6)
> FWIW, I've seen probably the same issue on my X99 system, too:
> https://bugzilla.kernel.org/show_bug.cgi?id=216688
>
> But I've seen this for a very long time now over the past two years. And my CPU recently had issues with the CPU memory controller and spits out errors when using all 8 memory channels, 4 memory channels are fine though.
>

BTW I have only 4 memory sticks installed.

This is another one I came across at the end of 2022, but can't see anything about the page_type in there either.
[ +0,018472] BUG: Bad page map in process clang++ pte:80000009bad84025 pmd:ad85e1067
[ +0,000009] page:000000006b030a03 refcount:17 mapcount:-241 mapping:00000000c8719a71 index:0x1333 pfn:0x9bad84
[ +0,000005] memcg:ffff94f7670c9000
[ +0,000002] aops:ext4_da_aops [ext4] ino:27d0482 dentry name:"mold.profdata"
[ +0,000021] flags: 0xa600000000020056(referenced|uptodate|lru|workingset|mappedtodisk|zone=2)
[ +0,000005] raw: a600000000020056 ffffc564511e12c8 ffffc5646be3a208 ffff94f75c2f51c0
[ +0,000003] raw: 0000000000001333 0000000000000000 00000011ffffff0e ffff94f7670c9000
[ +0,000001] page dumped because: bad pte
[ +0,000001] addr:00007f09e1533000 vm_flags:00200071 anon_vma:0000000000000000 mapping:ffff94f75c2f51c0 index:1333
[ +0,000003] file:mold.profdata fault:filemap_fault mmap:ext4_file_mmap [ext4] read_folio:ext4_read_folio [ext4]
[ +0,000033] CPU: 0 PID: 353494 Comm: clang++ Tainted: G B D W O 6.1.2-rc2-3.1-cachyos-bore-lto #1 4b7e0b805b20530f271e911adf9f4d9fd738cbbe
[ +0,000004] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[ +0,000002] Call Trace:
[ +0,000002] <TASK>
[ +0,000001] ? print_bad_pte+0x1eb/0x280
[ +0,000005] ? unmap_page_range+0xb16/0x1240
[ +0,000004] ? unmap_vmas+0x126/0x260
[ +0,000005] ? unmap_region+0x120/0x220
[ +0,000004] ? do_mas_align_munmap+0x5f9/0x8e0
[ +0,000004] ? enqueue_task_rt+0x37d/0x580
[ +0,000005] ? __vm_munmap+0x169/0x1c0
[ +0,000003] ? __x64_sys_munmap+0x12/0x20
[ +0,000004] ? do_syscall_64+0x2b/0x60
[ +0,000004] ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ +0,000004] </TASK>
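
For comparing the two reports, the mapcount and page_type values can probably be read straight out of the second `raw:` line of each dump. The sketch below is not taken from the kernel sources; it assumes an x86-64 little-endian `struct page` in which the seventh raw word holds `_refcount` in its upper 32 bits and the `_mapcount`/`page_type` union in its lower 32 bits, and that the printed mapcount is the stored value plus one. The helper name `decode_count_word` is made up for illustration, and reading the cleared bit `0x100` as PG_offline is likewise an assumption about kernels of that era.

```python
def decode_count_word(raw_word: str) -> tuple[int, int, int]:
    """Split one hex word from a 'raw:' dump line into
    (refcount, mapcount, cleared_bits).

    Assumed layout (x86-64, little-endian): upper 32 bits are _refcount,
    lower 32 bits are the _mapcount/page_type union.
    """
    value = int(raw_word, 16)
    refcount = value >> 32
    low = value & 0xFFFFFFFF

    # Interpret the low half as a signed 32-bit integer.
    stored = low - (1 << 32) if low & 0x80000000 else low

    # Assumption: the dump prints the stored _mapcount plus one.
    mapcount = stored + 1

    # Bits that are cleared in the low half; a page type would show up
    # here as a single cleared bit.
    cleared_bits = ~low & 0xFFFFFFFF
    return refcount, mapcount, cleared_bits


# Seventh raw word from the btrfs report and from the ext4 report.
for word in ("00000003fffffeff", "00000011ffffff0e"):
    refcount, mapcount, cleared = decode_count_word(word)
    print(f"{word}: refcount={refcount} mapcount={mapcount} cleared=0x{cleared:x}")
```

Under those assumptions the btrfs dump decodes to mapcount -256 with exactly one cleared bit (0x100), which lines up with `mapcount:-256` and `page_type: 0xfffffeff(offline)` in that log. The ext4 dump decodes to mapcount -241 with several cleared bits (0xf1), which would be consistent with no page_type line being printed there.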