Bug 217864 - [6.5.1] Slow boot and opening files (RIP: kvfree_rcu_bulk)
Summary: [6.5.1] Slow boot and opening files (RIP: kvfree_rcu_bulk)
Status: RESOLVED INVALID
Alias: None
Product: Linux
Classification: Unclassified
Component: Kernel (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: Virtual assignee for kernel bugs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-02 17:12 UTC by Marcus Seyfarth
Modified: 2023-09-05 19:11 UTC (History)
0 users

See Also:
Kernel Version: 6.5.0
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg.txt (91.36 KB, text/plain)
2023-09-02 17:12 UTC, Marcus Seyfarth
Details
Diff between 6.4.14 and 6.5 RC1 (403.42 KB, image/png)
2023-09-03 11:41 UTC, Marcus Seyfarth
Details
journalctl_65.log (178.07 KB, text/plain)
2023-09-04 21:20 UTC, Marcus Seyfarth
Details
journalctl_6414.log (189.18 KB, text/plain)
2023-09-04 21:20 UTC, Marcus Seyfarth
Details
config_65 (166.42 KB, text/plain)
2023-09-04 21:21 UTC, Marcus Seyfarth
Details

Description Marcus Seyfarth 2023-09-02 17:12:39 UTC
Created attachment 305023 [details]
dmesg.txt

I've just made the transition from 6.4.14 to 6.5.1 and my Haswell-EP X99 machine took way longer to boot (55 seconds instead of 16 seconds). The following trace was seen in dmesg which was also not present on 6.4.14 (and might be the cause for the long boot time); this is on bare metal.

[  +0,000021] CPU: 13 PID: 338 Comm: kworker/13:1 Not tainted 6.5.1-3.1-cachyos-lto #1 c414458bd5e5db6e6f9addca639c3a78811b24e7
[  +0,000003] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[  +0,000002] Workqueue: events kfree_rcu_work
[  +0,000004] RIP: 0010:kvfree_rcu_bulk+0x13b/0x160
[  +0,000004] Code: 80 04 00 00 80 bf 89 04 00 00 00 75 24 85 c0 75 20 31 f6 ba 02 00 00 00 e8 72 50 bd ff 5b 41 5c 41 5e 41 5f 5d e9 05 df ba ff <0f> 0b e9 54 ff ff ff a9 ff ff ff 7f 74 e5 80 bf 88 04 00 00 >
[  +0,000002] RSP: 0018:ffff8fe4611cbd90 EFLAGS: 00010206
[  +0,000002] RAX: 0000000000000048 RBX: ffff8fe8e04f7000 RCX: fffffffffffffffc
[  +0,000002] RDX: 0000000000000000 RSI: ffff8fe8e04f7000 RDI: ffff8fe9df95cac8
[  +0,000001] RBP: ffff8fe4611cbe40 R08: 8080808080808080 R09: fefefefefefefeff
[  +0,000002] R10: 000073746e657665 R11: 8080000000000000 R12: 0000000000000000
[  +0,000001] R13: ffff8fe4611cbde0 R14: ffff8fe9df95cac8 R15: ffff8fe4611cbdd0
[  +0,000001] FS:  0000000000000000(0000) GS:ffff8fe9df940000(0000) knlGS:0000000000000000
[  +0,000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0,000002] CR2: 00007f8287bff008 CR3: 00000005e8f73001 CR4: 00000000001706e0
[  +0,000001] Call Trace:
[  +0,000003]  <TASK>
[  +0,000001]  ? __warn+0x9e/0x160
[  +0,000004]  ? kvfree_rcu_bulk+0x13b/0x160
[  +0,000004]  ? report_bug+0x112/0x180
[  +0,000003]  ? handle_bug+0x3d/0x80
[  +0,000003]  ? exc_invalid_op+0x16/0x40
[  +0,000003]  ? asm_exc_invalid_op+0x16/0x20
[  +0,000005]  ? kvfree_rcu_bulk+0x13b/0x160
[  +0,000003]  kfree_rcu_work+0xcd/0x200
[  +0,000005]  process_one_work+0x21a/0x620
[  +0,000004]  ? wake_up_process+0x1d3/0x1720
[  +0,000004]  worker_thread+0x12b/0x4c0
[  +0,000003]  ? compat_get_bitmap+0xa0/0xa0
[  +0,000003]  kthread+0xf1/0x160
[  +0,000004]  ? getreg32+0x1e0/0x1e0
[  +0,000003]  ret_from_fork+0x30/0x40
[  +0,000005]  ? getreg32+0x1e0/0x1e0
[  +0,000003]  ret_from_fork_asm+0x11/0x20
[  +0,000005]  </TASK>
[  +0,000001] ---[ end trace 0000000000000000 ]---

More details on the Kernel and extra-patches used can be found in my public repo: https://github.com/ms178/archpkgbuilds/tree/main/packages/linux-cachyos
Comment 1 Marcus Seyfarth 2023-09-02 17:45:28 UTC
This trace was seen with a self-compiled 6.5.1 Kernel, but I've just verified with the CachyOS 6.5.0-2 default Kernel that the same sympthoms show up, the system also cannot shutdown any longer. The only difference is that there is no trace in dmesg with the default Kernel.
Comment 2 Marcus Seyfarth 2023-09-02 17:56:43 UTC
I just saw a patch from Hugh Dickins on the LKML (https://www.spinics.net/lists/kernel/msg4919906.html) and indeed, with my self-compiled 6.5.1 Kernel, the trace is now downgraded to a warning (see below). However, the slow boot still remains and also my games won't load up due to missing a rendering device. But that might be a different issue.


[  +0,000227] ------------[ cut here ]------------
[  +0,000002] WARNING: CPU: 21 PID: 345 at kernel/rcu/tree.c:2952 kvfree_rcu_bulk+0x13b/0x160
[  +0,000011] Modules linked in: pkcs8_key_parser crypto_user fuse loop zram bpf_preload ip_tables x_tables ext4 crc32c_generic mbcache crc16 jbd2 usbhid amdgpu mfd_core drm_buddy drm_suballoc_helper crc32c_i>
[  +0,000027] CPU: 21 PID: 345 Comm: kworker/21:1 Not tainted 6.5.1-3.1-cachyos-lto #1 de6495663682da00bbe0d80bdc163dd768b25681
[  +0,000004] Hardware name: LENOVO GAMING TF/X99-TF Gaming, BIOS CX99DE26 10/10/2020
[  +0,000002] Workqueue: events kfree_rcu_work
[  +0,000006] RIP: 0010:kvfree_rcu_bulk+0x13b/0x160
[  +0,000006] Code: 80 04 00 00 80 bf 89 04 00 00 00 75 24 85 c0 75 20 31 f6 ba 02 00 00 00 e8 52 50 bd ff 5b 41 5c 41 5e 41 5f 5d e9 e5 de ba ff <0f> 0b e9 54 ff ff ff a9 ff ff ff 7f 74 e5 80 bf 88 04 00 00 >
[  +0,000003] RSP: 0018:ffff8df57a5ffd90 EFLAGS: 00010206
[  +0,000003] RAX: 0000000000000020 RBX: ffff8df44cb40000 RCX: fffffffffffffffc
[  +0,000003] RDX: 0000000000000000 RSI: ffff8df44cb40000 RDI: ffff8df91fb5cac8
[  +0,000002] RBP: ffff8df57a5ffe40 R08: 8080808080808080 R09: fefefefefefefeff
[  +0,000002] R10: 000073746e657665 R11: 8080000000000000 R12: 0000000000000000
[  +0,000002] R13: ffff8df57a5ffde0 R14: ffff8df91fb5cac8 R15: ffff8df57a5ffdd0
[  +0,000002] FS:  0000000000000000(0000) GS:ffff8df91fb40000(0000) knlGS:0000000000000000
[  +0,000003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0,000002] CR2: 000055fe51d3f700 CR3: 000000065de7d002 CR4: 00000000001706e0
[  +0,000002] Call Trace:
[  +0,000003]  <TASK>
[  +0,000002]  ? __warn+0x9e/0x160
[  +0,000006]  ? kvfree_rcu_bulk+0x13b/0x160
[  +0,000005]  ? report_bug+0x112/0x180
[  +0,000005]  ? handle_bug+0x3d/0x80
[  +0,000005]  ? exc_invalid_op+0x16/0x40
[  +0,000003]  ? asm_exc_invalid_op+0x16/0x20
[  +0,000007]  ? kvfree_rcu_bulk+0x13b/0x160
[  +0,000006]  kfree_rcu_work+0xcd/0x200
[  +0,000006]  process_one_work+0x21a/0x620
[  +0,000006]  ? wake_up_process+0x1d3/0x1720
[  +0,000005]  worker_thread+0x12b/0x4c0
[  +0,000005]  ? compat_get_bitmap+0xa0/0xa0
[  +0,000004]  kthread+0xf1/0x160
[  +0,000006]  ? getreg32+0x1e0/0x1e0
[  +0,000004]  ret_from_fork+0x30/0x40
[  +0,000007]  ? getreg32+0x1e0/0x1e0
[  +0,000003]  ret_from_fork_asm+0x11/0x20
[  +0,000009]  </TASK>
[  +0,000001] ---[ end trace 0000000000000000 ]---
Comment 3 Artem S. Tashkinov 2023-09-03 09:59:13 UTC
I've not seen any similar reports so your best bet will be to bisect:

https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 4 Marcus Seyfarth 2023-09-03 11:41:50 UTC
Created attachment 305026 [details]
Diff between 6.4.14 and 6.5 RC1

I've added a graphical diff with changes between 6.4.14 and 6.5 RC1 on kernel/rcu/tree.c in that area of the file. I suspect one of the changes from May 10th 2023 from Qiang Zhang from that day (e.g. rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking bnode->gp_snap).
Comment 5 Marcus Seyfarth 2023-09-03 12:14:29 UTC
For people not wanting to open the attachement provided above, here is the graphical diff of the full file between 6.4.14 and 6.5 RC1: https://www.diffchecker.com/3RzTkguz/
Comment 6 Marcus Seyfarth 2023-09-03 15:21:47 UTC
New insights from an off-list discussion with Paul E. McKenney, I took the old /kernel/rcu/tree.c file from 6.4.14 and copied it into the 6.5.1 sources. I also left out a problematic patch that was shipped with CachyOS (https://lore.kernel.org/linux-mm/20230703184928.GB4378@monkey/T/#m36fa1f2a52341d57a8ac39f5bd2d64376d26bfe5).

The only improvement I saw is that the system can now shutdown and reboot again while the reported dmesg warning is also gone. The boot process is still slow (54 seconds) and I've seen a lot of warnings from varous linux-tools (e.g. bpftune, powerdevil) that seem to fail with that newer Kernel, but that might come from something else then. Thanks a lot still for the pointers.
Comment 7 Marcus Seyfarth 2023-09-04 21:19:20 UTC
A small update from the off-list discussion:

I've now also tested the Arch vanilla 6.5.1 Kernel which shows a long boot sequence but no warning in dmesg. I will attach the journalctl output of the Arch 6.5.1 Kernel and my custom 6.4.14 Kernel for a comparison. Systemd-analyze tells me something suspicious going on with the firmware though: "Startup finished in 18min 27.940s (firmware) + 6.158s (loader) + 9.094s (kernel) + 2.499s (userspace) = 18min 45.693s graphical.target reached after 2.499s in userspace."

6.4.14 loads in 16 seconds total.

Testing WebGL Aquarium also shows significantly lower numbers with 15.000 fishes on 6.5.1 and the CPU does not go beyond 2.3 Ghz while it should boost up to 3.8 Ghz as reported via powertop. While the logs do not show errors in this regard, I suspect something might went wrong with the Microcode loading for the CPU?! The powerdevil and dbus errors in the 6.5 log also look suspicious.

I should note that my hardware is very special, it is one of these modded Chinese X99 motherboards with a Turbo Boost-Unlock and comes without microcode to make use of a hardware bug to maximize the all-core turbo frequency of my Haswell-Xeon. Maybe some work went into 6.5 that broke the previous behavior? At least that system is highly dependant on a successful loading of the CPU microcode at a later stage for the desired boosting behavior.
Comment 8 Marcus Seyfarth 2023-09-04 21:20:17 UTC
Created attachment 305030 [details]
journalctl_65.log
Comment 9 Marcus Seyfarth 2023-09-04 21:20:43 UTC
Created attachment 305031 [details]
journalctl_6414.log
Comment 10 Marcus Seyfarth 2023-09-04 21:21:41 UTC
Created attachment 305032 [details]
config_65
Comment 11 Marcus Seyfarth 2023-09-05 19:11:16 UTC
The RCU-related warning/trace turned out to come from an older patch that I carried around, mea culpa. I will drop it. I will file a new report for the unrelated slow boot/shutdown/reboot problem.

Note You need to log in before you can comment on or make changes to this bug.