Bug 217823 - kernel bug when performing heavy IO operations
Summary: kernel bug when performing heavy IO operations
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs
Hardware: All Linux
Importance: P3 normal
Assignee: BTRFS virtual assignee
Reported: 2023-08-25 06:24 UTC by dianlujitao
Modified: 2023-08-28 11:24 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
incomplete dmesg (758.51 KB, application/octet-stream)
2023-08-25 08:48 UTC, Artem S. Tashkinov

Description dianlujitao 2023-08-25 06:24:47 UTC
Under heavy IO load (compiling AOSP in my case), the kernel sometimes crashes, and the only way to recover is a hard reset. The logs look like this:

Aug 25 13:52:23 arch-pc kernel: BUG: Bad page map in process tmux: client  pte:8000000462500025 pmd:b99c98067
Aug 25 13:52:23 arch-pc kernel: page:00000000460fa108 refcount:4 mapcount:-256 mapping:00000000612a1864 index:0x16 pfn:0x462500
Aug 25 13:52:23 arch-pc kernel: memcg:ffff8a1056ed0000
Aug 25 13:52:23 arch-pc kernel: aops:btrfs_aops [btrfs] ino:9c4635 dentry name:"locale-archive"
Aug 25 13:52:23 arch-pc kernel: flags: 0x2ffff5800002056(referenced|uptodate|lru|workingset|private|node=0|zone=2|lastcpupid=0xffff)
Aug 25 13:52:23 arch-pc kernel: page_type: 0xfffffeff(offline)
Aug 25 13:52:23 arch-pc kernel: raw: 02ffff5800002056 ffffe6e210c05248 ffffe6e20e714dc8 ffff8a10472a8c70
Aug 25 13:52:23 arch-pc kernel: raw: 0000000000000016 0000000000000001 00000003fffffeff ffff8a1056ed0000
Aug 25 13:52:23 arch-pc kernel: page dumped because: bad pte
Aug 25 13:52:23 arch-pc kernel: addr:00007f5fc9816000 vm_flags:08000071 anon_vma:0000000000000000 mapping:ffff8a10472a8c70 index:16
Aug 25 13:52:23 arch-pc kernel: file:locale-archive fault:filemap_fault mmap:btrfs_file_mmap [btrfs] read_folio:btrfs_read_folio [btrfs]
Aug 25 13:52:23 arch-pc kernel: CPU: 40 PID: 2033787 Comm: tmux: client Tainted: G           OE      6.4.11-zen2-1-zen #1 a571467d6effd6120b1e64d2f88f90c58106da17
Aug 25 13:52:23 arch-pc kernel: Hardware name: JGINYUE X99-8D3/2.5G Server/X99-8D3/2.5G Server, BIOS 5.11 06/30/2022
Aug 25 13:52:23 arch-pc kernel: Call Trace:
Aug 25 13:52:23 arch-pc kernel:  <TASK>
Aug 25 13:52:23 arch-pc kernel:  dump_stack_lvl+0x47/0x60
Aug 25 13:52:23 arch-pc kernel:  print_bad_pte+0x194/0x250
Aug 25 13:52:23 arch-pc kernel:  ? page_remove_rmap+0x8d/0x260
Aug 25 13:52:23 arch-pc kernel:  unmap_page_range+0xbb1/0x20f0
Aug 25 13:52:23 arch-pc kernel:  unmap_vmas+0x142/0x220
Aug 25 13:52:23 arch-pc kernel:  exit_mmap+0xe4/0x350
Aug 25 13:52:23 arch-pc kernel:  mmput+0x5f/0x140
Aug 25 13:52:23 arch-pc kernel:  do_exit+0x31f/0xbc0
Aug 25 13:52:23 arch-pc kernel:  do_group_exit+0x31/0x80
Aug 25 13:52:23 arch-pc kernel:  __x64_sys_exit_group+0x18/0x20
Aug 25 13:52:23 arch-pc kernel:  do_syscall_64+0x60/0x90
Aug 25 13:52:23 arch-pc kernel:  entry_SYSCALL_64_after_hwframe+0x77/0xe1
Aug 25 13:52:23 arch-pc kernel: RIP: 0033:0x7f5fca0da14d
Aug 25 13:52:23 arch-pc kernel: Code: Unable to access opcode bytes at 0x7f5fca0da123.
Aug 25 13:52:23 arch-pc kernel: RSP: 002b:00007fff54a44358 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
Aug 25 13:52:23 arch-pc kernel: RAX: ffffffffffffffda RBX: 00007f5fca23ffa8 RCX: 00007f5fca0da14d
Aug 25 13:52:23 arch-pc kernel: RDX: 00000000000000e7 RSI: fffffffffffffeb8 RDI: 0000000000000000
Aug 25 13:52:23 arch-pc kernel: RBP: 0000000000000002 R08: 00007fff54a442f8 R09: 00007fff54a4421f
Aug 25 13:52:23 arch-pc kernel: R10: 00007fff54a44130 R11: 0000000000000206 R12: 0000000000000000
Aug 25 13:52:23 arch-pc kernel: R13: 0000000000000000 R14: 00007f5fca23e680 R15: 00007f5fca23ffc0
Aug 25 13:52:23 arch-pc kernel:  </TASK>
Aug 25 13:52:23 arch-pc kernel: Disabling lock debugging due to kernel taint

Full log is available at https://fars.ee/HJw3
Note that the issue appears to have been introduced by a Linux kernel released in recent months.
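
As a sanity check on the pte value reported in the log above, here is a minimal Python sketch that splits it into a page frame number and flag bits. It assumes the standard x86-64 layout for a 4 KiB page-table entry (bit 0 present, bit 1 writable, bit 2 user, bit 5 accessed, bit 6 dirty, bit 63 no-execute, bits 12-51 frame number). The name decode_pte is hypothetical and not part of any kernel tooling.

```python
# Flag bits of an x86-64 page-table entry (4 KiB page), by bit position.
PTE_FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
    5: "accessed",
    6: "dirty",
    63: "no-execute",
}

# Bits 12..51 hold the physical page frame number.
PFN_MASK = 0x000FFFFFFFFFF000


def decode_pte(pte: int):
    """Return (pfn, [flag names]) for a raw 64-bit PTE value."""
    pfn = (pte & PFN_MASK) >> 12
    flags = [name for bit, name in PTE_FLAGS.items() if (pte >> bit) & 1]
    return pfn, flags


if __name__ == "__main__":
    pfn, flags = decode_pte(0x8000000462500025)  # value from the log
    print(hex(pfn), flags)
```

Under these assumptions the decoded frame number agrees with the pfn:0x462500 field printed on the next log line, so the PTE itself points at the page that was dumped; the flags describe a present, read-only, user-accessible, non-executable mapping.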
Comment 1 Artem S. Tashkinov 2023-08-25 08:48:28 UTC
Created attachment 304941 [details]
incomplete dmesg
Comment 2 Artem S. Tashkinov 2023-08-25 11:12:47 UTC
Could you run memtest86+ 6.20/memtest86 10.6 for a cycle or two?
Comment 3 dianlujitao 2023-08-26 05:23:33 UTC
(In reply to Artem S. Tashkinov from comment #2)
> Could you run memtest86+ 6.20/memtest86 10.6 for a cycle or two?

I ran memtest86+ 6.20 for a cycle last night and it passed.
Comment 4 Bagas Sanjaya 2023-08-27 02:46:04 UTC
(In reply to dianlujitao from comment #0)

Which kernel version shows this issue? And what is the last known good version that
doesn't have it?
Comment 5 dianlujitao 2023-08-27 03:58:29 UTC
(In reply to Bagas Sanjaya from comment #4)
> What kernel version do you have this issue? And last known good version that
> doesn't have it?

I was running the 6.4.11 zen kernel. Following the suggestion from https://lore.kernel.org/linux-btrfs/a21684a4-ee7e-4404-85a2-2ab1f4a1623a@gmx.com/, I've switched to the less heavily customized `linux` kernel from the Arch repo for now. I'll keep an eye on it and report here if the issue happens again.

I don't know the last good version, sorry.
Comment 6 Marcus Seyfarth 2023-08-27 23:50:39 UTC
FWIW, I've probably seen the same issue on my X99 system, too: https://bugzilla.kernel.org/show_bug.cgi?id=216688

But I have been seeing it for a very long time, over the past two years. My CPU also recently developed problems with its memory controller: it reports errors when all 8 memory channels are in use, although 4 memory channels are fine.

Memtest also ran fine on my system. From my observations, this problem is only triggered by heavy compile jobs, not by games. While I've still seen it recently with the default CachyOS kernel, I haven't seen it for some time with my custom kernel, which also carries some experimental patches from the LKML and elsewhere. Maybe something in there helps to fix or mitigate this for dianlujitao, too? The patch is available in my GitHub repo if someone wants to test it (it applies on top of 6.4.12): https://github.com/ms178/archpkgbuilds/blob/main/packages/linux-cachyos/0001-ms178.patch
Comment 7 dianlujitao 2023-08-28 09:41:55 UTC
(In reply to Marcus Seyfarth from comment #6)
> FWIW, I've seen probably the same issue on my X99 system, too:
> https://bugzilla.kernel.org/show_bug.cgi?id=216688
> 
> But I've seen this for a very long time now over the past two years. And my
> CPU recently had issues with the CPU memory controller and spits out errors
> when using all 8 memory channels, 4 memory channels are fine though. 
> 
> Memtest was also running fine on my system. From my observations, this
> problem could only be triggered by heavy compile jobs; not with games. While
> I've still seen this with the default CachyOS-Kernel recently, I haven't
> seen it with my custom Kernel for some time now that also carries some
> experimental patches from the LKML and elsewhere around. Maybe something in
> there actually helps to either fix or mitigate this for dianlujitao, too? It
> is available in my github repo if someone wants to test it (and applies on
> top of 6.4.12):
> https://github.com/ms178/archpkgbuilds/blob/main/packages/linux-cachyos/0001-ms178.patch

Thank you for the info. Our logs do look quite similar, but the page_type field is missing from yours. According to https://lore.kernel.org/linux-btrfs/ZOs5j93aAmZhrA%2FG@casper.infradead.org/, it is unclear why PG_offline got set on mine. Do you have any newer crash logs that contain that field?
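
For context on how mapcount:-256 and page_type: 0xfffffeff in my log relate, here is a minimal Python sketch. It assumes, as I understand the struct page layout in kernels of this era, that _mapcount and page_type share one 32-bit field and that the printed mapcount is the stored value plus one. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def stored_field(mapcount: int) -> int:
    """32-bit value stored in the shared field for a reported mapcount.

    Assumption: reported mapcount == stored _mapcount + 1.
    """
    return (mapcount - 1) & MASK32


def differing_bits(mapcount: int) -> int:
    """Bits that differ from the all-ones value an unmapped page would hold."""
    return stored_field(mapcount) ^ MASK32


if __name__ == "__main__":
    print(hex(stored_field(-256)))    # compare with page_type in the log
    print(hex(differing_bits(-256)))  # which bits are cleared
```

Under that assumption the two fields in the log are the same 32 bits viewed two ways, and the value differs from all-ones in exactly one bit (0x100), which is why the dump labels it "offline". This does not say how that bit came to be cleared.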
Comment 8 dianlujitao 2023-08-28 09:47:42 UTC
(In reply to Marcus Seyfarth from comment #6)
> FWIW, I've seen probably the same issue on my X99 system, too:
> https://bugzilla.kernel.org/show_bug.cgi?id=216688
> 
> But I've seen this for a very long time now over the past two years. And my
> CPU recently had issues with the CPU memory controller and spits out errors
> when using all 8 memory channels, 4 memory channels are fine though. 
> 
BTW I have only 4 memory sticks installed.
Comment 9 Marcus Seyfarth 2023-08-28 11:24:39 UTC
This is another one I came across at the end of 2022, but I can't see anything about the page_type in it either.

[  +0,018472] BUG: Bad page map in process clang++  pte:80000009bad84025 pmd:ad85e1067
[  +0,000009] page:000000006b030a03 refcount:17 mapcount:-241 mapping:00000000c8719a71 index:0x1333 pfn:0x9bad84
[  +0,000005] memcg:ffff94f7670c9000
[  +0,000002] aops:ext4_da_aops [ext4] ino:27d0482 dentry name:"mold.profdata"
[  +0,000021] flags: 0xa600000000020056(referenced|uptodate|lru|workingset|mappedtodisk|zone=2)
[  +0,000005] raw: a600000000020056 ffffc564511e12c8 ffffc5646be3a208 ffff94f75c2f51c0
[  +0,000003] raw: 0000000000001333 0000000000000000 00000011ffffff0e ffff94f7670c9000
[  +0,000001] page dumped because: bad pte
[  +0,000001] addr:00007f09e1533000 vm_flags:00200071 anon_vma:0000000000000000 mapping:ffff94f75c2f51c0 index:1333
[  +0,000003] file:mold.profdata fault:filemap_fault mmap:ext4_file_mmap [ext4] read_folio:ext4_read_folio [ext4]
[  +0,000033] CPU: 0 PID: 353494 Comm: clang++ Tainted: G    B D W  O       6.1.2-rc2-3.1-cachyos-bore-lto #1 4b7e0b805b20530f271e911adf9f4d9fd738cbbe
[  +0,000004] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[  +0,000002] Call Trace:
[  +0,000002]  <TASK>
[  +0,000001]  ? print_bad_pte+0x1eb/0x280
[  +0,000005]  ? unmap_page_range+0xb16/0x1240
[  +0,000004]  ? unmap_vmas+0x126/0x260
[  +0,000005]  ? unmap_region+0x120/0x220
[  +0,000004]  ? do_mas_align_munmap+0x5f9/0x8e0
[  +0,000004]  ? enqueue_task_rt+0x37d/0x580
[  +0,000005]  ? __vm_munmap+0x169/0x1c0
[  +0,000003]  ? __x64_sys_munmap+0x12/0x20
[  +0,000004]  ? do_syscall_64+0x2b/0x60
[  +0,000004]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  +0,000004]  </TASK>
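
Although this dump has no page_type line, the same information appears to be recoverable from the raw words. The Python sketch below assumes that the third word of the second raw line (00000011ffffff0e) packs the 32-bit _mapcount/page_type field in its low half and the refcount in its high half, and that the printed mapcount is the stored value plus one. The function name is hypothetical.

```python
def split_raw_word(word: int):
    """Split a 64-bit raw word into (refcount, field, mapcount).

    Assumptions: low 32 bits = shared _mapcount/page_type field,
    high 32 bits = refcount, reported mapcount = signed field + 1.
    """
    field = word & 0xFFFFFFFF
    refcount = word >> 32
    # Interpret the field as a signed 32-bit integer.
    signed = field - (1 << 32) if field & 0x80000000 else field
    return refcount, field, signed + 1


if __name__ == "__main__":
    refcount, field, mapcount = split_raw_word(0x00000011FFFFFF0E)
    print(refcount, hex(field), mapcount)
    # Bits that differ from the all-ones value of an unmapped page.
    print(hex(field ^ 0xFFFFFFFF))
```

With these assumptions the word decodes to the refcount:17 and mapcount:-241 shown in the log, and the field differs from all-ones in several bits (0xf1) rather than in a single page-type bit.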
