Bug 218651
Summary: | kernel 6.8.2 - Bluetooth bug/dump at boot | ||
---|---|---|---|
Product: | Drivers | Reporter: | jb (jb.1234abcd) |
Component: | Bluetooth | Assignee: | linux-bluetooth (linux-bluetooth) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | agurenko, ethanboxx, fedux, ferdi, kbugreports, lilydjwg, luiz.dentz, makiftasova, peter.weber, pmenzel+bugzilla.kernel.org, regressions, sugaraddicted, vasyl.demin, wolf.seifert |
Priority: | P3 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.8.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | journalctl, lsusb, lsmod, dmesg log from several boots |
Description
jb
2024-03-28 02:10:12 UTC
Created attachment 306050
lsusb

Created attachment 306051
lsmod

Per https://bbs.archlinux.org/viewtopic.php?id=294292

Same here with ThinkPad T14 Gen1 and Intel Corp. AX200 Bluetooth. Therefore I doubt that the Qualcomm-related commit from v6.8-rc7 is the cause; furthermore, a downgrade to linux-6.8.1 fixes this.

Created attachment 306052
dmesg log from several boots
Same here on my MSI Tomahawk X570 WiFi with AX200. 6.8.1 works fine; 6.8.2 *and* 6.9.0-0.rc1 both log "Bluetooth: hci0: command <command> tx timeout":
kernel: Bluetooth: hci0: command 0xfc01 tx timeout
or
kernel: Bluetooth: hci0: command 0xfc05 tx timeout
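The timeout lines above and the "BUG: kernel NULL pointer dereference" message discussed elsewhere in this report are separate symptoms, and it helps to know which of them a given boot actually produced. The following is a minimal sketch in Python, assuming the log has been saved as plain text (for example from `journalctl -k` or `dmesg`). The function name `classify_kernel_log` and the patterns are illustrative; the patterns cover only the message texts quoted in this report.

```python
import re

# Patterns built from the messages quoted in this report (assumption: the
# kernel prints them in exactly this form on the affected systems).
TX_TIMEOUT = re.compile(r"Bluetooth: (hci\d+): command (0x[0-9a-fA-F]{4}) tx timeout")
NULL_DEREF = re.compile(r"BUG: kernel NULL pointer dereference")


def classify_kernel_log(lines):
    """Return (timeouts, has_null_deref) for an iterable of log lines.

    timeouts is a list of (controller, opcode) tuples, one per timeout line.
    has_null_deref is True if any line reports the NULL pointer dereference.
    """
    timeouts = []
    has_null_deref = False
    for line in lines:
        match = TX_TIMEOUT.search(line)
        if match:
            timeouts.append((match.group(1), match.group(2)))
        if NULL_DEREF.search(line):
            has_null_deref = True
    return timeouts, has_null_deref


# The two timeout lines quoted above, used as sample input.
sample = [
    "kernel: Bluetooth: hci0: command 0xfc01 tx timeout",
    "kernel: Bluetooth: hci0: command 0xfc05 tx timeout",
]
timeouts, has_null_deref = classify_kernel_log(sample)
print(timeouts, has_null_deref)
```

A log that shows only timeouts and no NULL pointer dereference may point to a different problem than the one tracked here.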
Okay, the command timeout is probably related to https://bugzilla.kernel.org/show_bug.cgi?id=218416, but I did indeed miss the null pointer, which is also present in my log.

Per https://bbs.archlinux.org/viewtopic.php?id=294292

ThinkPad X13 Gen3 with Qualcomm WiFi/Bluetooth is working properly with Linux "6.8.2". Therefore this issue is limited to Intel-based Bluetooth.

I've cross-checked with a ThinkPad X13 Gen 3 (AMD + Qualcomm WiFi/Bluetooth), and it works properly with Linux "6.8.2". This issue is limited to devices with Intel WiFi/Bluetooth.

I'm too slow :)

Can somebody please bisect? You should also be able to use QEMU for this and pass the Bluetooth device into the virtual machine.

Here is the bisect.

$ git bisect good
b53e5ef62fe9853648b4478bd6cb3aba970a6f1f is the first bad commit
commit b53e5ef62fe9853648b4478bd6cb3aba970a6f1f
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Tue Jan 9 13:45:40 2024 -0500

    Bluetooth: hci_core: Cancel request on command timeout

    [ Upstream commit 63298d6e752fc0ec7f5093860af8bc9f047b30c8 ]

    If command has timed out call __hci_cmd_sync_cancel to notify the
    hci_req since it will inevitably cause a timeout.

    This also rework the code around __hci_cmd_sync_cancel since it was
    wrongly assuming it needs to cancel timer as well, but sometimes the
    timers have not been started or in fact they already had timed out in
    which case they don't need to be cancel yet again.

    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Stable-dep-of: 2615fd9a7c25 ("Bluetooth: hci_sync: Fix overwriting request callback")
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 include/net/bluetooth/hci_sync.h |  2 +-
 net/bluetooth/hci_core.c         | 84 +++++++++++++++++++++++++++-------------
 net/bluetooth/hci_request.c      |  2 +-
 net/bluetooth/hci_sync.c         | 20 +++++-----
 net/bluetooth/mgmt.c             |  2 +-
 5 files changed, 71 insertions(+), 39 deletions(-)

$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [03a22b591c5443ba269e8570c6fef411251fe1b8] Linux 6.8.2
git bisect bad 03a22b591c5443ba269e8570c6fef411251fe1b8
# status: waiting for good commit(s), bad commit known
# good: [8a8b2a057ed9684704792b5d4b333616769002c2] Linux 6.8.1
git bisect good 8a8b2a057ed9684704792b5d4b333616769002c2
# bad: [da2d94af7ba950b33ce7dfd326894460c5536988] drm: Don't treat 0 as -1 in drm_fixp2int_ceil
git bisect bad da2d94af7ba950b33ce7dfd326894460c5536988
# good: [116cc80f47b29edcba609ad92be1ad83d1cedcd0] arm64: dts: qcom: sm6115: drop pipe clock selection
git bisect good 116cc80f47b29edcba609ad92be1ad83d1cedcd0
# good: [57662cd437c052595711bc733574e6895e074ee5] gpiolib: Pass consumer device through to core in devm_fwnode_gpiod_get_index()
git bisect good 57662cd437c052595711bc733574e6895e074ee5
# bad: [b08bd8f02a24e2b82fece5ac51dc1c3d9aa6c404] Bluetooth: btusb: Fix memory leak
git bisect bad b08bd8f02a24e2b82fece5ac51dc1c3d9aa6c404
# good: [4a09d0236854360d0c33fec01d3c7d9703cca570] PCI: Make pci_dev_is_disconnected() helper public for other drivers
git bisect good 4a09d0236854360d0c33fec01d3c7d9703cca570
# good: [da0de50013c160f76b0d4c1869be25875f48015b] Bluetooth: mgmt: Remove leftover queuing of power_off work
git bisect good da0de50013c160f76b0d4c1869be25875f48015b
# bad: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] Bluetooth: hci_core: Cancel request on command timeout
git bisect bad b53e5ef62fe9853648b4478bd6cb3aba970a6f1f
# good: [54db3630deff566224de6cfb0767d2d398e68ed5] Bluetooth: Remove BT_HS
git bisect good 54db3630deff566224de6cfb0767d2d398e68ed5
# good: [d8c7785e8104359f139cdfa99e2511575c4d4823] Bluetooth: hci_qca: don't use IS_ERR_OR_NULL() with gpiod_get_optional()
git bisect good d8c7785e8104359f139cdfa99e2511575c4d4823
# first bad commit: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] Bluetooth: hci_core: Cancel request on command timeout

From the bisection and the oops it's pretty likely a duplicate of https://lore.kernel.org/all/08275279-7462-4f4a-a0ee-8aa015f829bc@leemhuis.info/

Then this patch should help (which might only get to Linus next Thursday): https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=1c3366abdbe884

I asked the stable team to pick up the patch: https://lore.kernel.org/all/bf267566-c18c-4ad9-9263-8642ecfdef1f@leemhuis.info/

Fix now queued for the next release of all affected stable/longterm series.

I can confirm this bug in kernel v6.6.23. Not that I care about Bluetooth, but apparently it also affects USB. The issue occurs on most but not all boots. The real-world consequence on my system (ThinkPad, Kaby Lake) is that the external USB keyboard and mouse are not recognized, so the system is unusable. The built-in touchpad works, so I can initiate a reboot, but the system doesn't power down properly and still needs a hard reset to complete the "reboot". If the system happens to boot fine (i.e. the new, unusual output on the boot screen is missing), those problems don't appear. Going back to kernel v6.6.22 solves this issue for me. I hope and assume the fix will also be included in the 6.6.24 kernel.

The bug is also in kernel v6.7.11, affecting Bluetooth and USB. It also causes the system to hang at reboot or shutdown. This is on a Lenovo ThinkPad T560 (Intel Skylake).

Also confirmed on a Lenovo ThinkPad P52s laptop. Also causing a hang on reboot/shutdown.
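The `git bisect log` output posted earlier in this thread follows a regular shape: each tested commit appears as a `# good:` or `# bad:` comment, and the result as `# first bad commit:`. As a minimal sketch in Python, a saved log can be summarized like this. The function name `parse_bisect_log` is hypothetical, and the sample is an excerpt of the log from this report, not the full log.

```python
import re

# Matches the comment lines git writes into a bisect log, e.g.
# "# good: [<40-hex sha>] <subject>".
MARK = re.compile(r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$")


def parse_bisect_log(text):
    """Collect good/bad commits and the first bad commit from a bisect log."""
    result = {"good": [], "bad": [], "first_bad": None}
    for line in text.splitlines():
        match = MARK.match(line.strip())
        if not match:
            continue  # skip "git bisect ..." commands and status comments
        kind, sha, subject = match.groups()
        if kind == "first bad commit":
            result["first_bad"] = (sha, subject)
        else:
            result[kind].append((sha, subject))
    return result


# Excerpt of the bisect log posted in this report.
SAMPLE = """\
git bisect start
# status: waiting for both good and bad commits
# bad: [03a22b591c5443ba269e8570c6fef411251fe1b8] Linux 6.8.2
git bisect bad 03a22b591c5443ba269e8570c6fef411251fe1b8
# good: [8a8b2a057ed9684704792b5d4b333616769002c2] Linux 6.8.1
git bisect good 8a8b2a057ed9684704792b5d4b333616769002c2
# first bad commit: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] Bluetooth: hci_core: Cancel request on command timeout
"""

summary = parse_bisect_log(SAMPLE)
print(summary["first_bad"])
```

Comparing the `first_bad` entry across reports is a quick way to check whether two people bisected to the same commit, although, as the comments below show, the same commit can be involved in more than one symptom.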
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #12)
> From the bisection and the oops it's pretty likely a duplicate of
> https://lore.kernel.org/all/08275279-7462-4f4a-a0ee-8aa015f829bc@leemhuis.info/
>
> Then this patch should help (which might only get to Linus next Thursday):
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=1c3366abdbe884

Although the git bisect gave the same git commit, the problem is probably different and the suggested fix did not work.

See https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135 for details.

If the kernel without the patch

(In reply to wolf.seifert from comment #17)
>
> Although the git bisect gave the same git commit, the problem is probably
> different and the suggested fix did not work.
>
> See
> https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135
> for details.

Spreading feedback over multiple places makes things hard.

And journalctl -k / dmesg would be helpful.

Did your kernel throw that "kernel: BUG: kernel NULL pointer dereference" before the fix? If it did not, it was a different problem to begin with and worth its own ticket, as things otherwise get confusing.

Or did the kernel throw that error and it's gone now, but things are not working? Then the patch helped -- but there might be another problem or the fix is not enough. Building a 6.8.2 kernel with the culprit removed could help to narrow things down.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #18)
> If the kernel without the patch (In reply to wolf.seifert from comment #17)
> >
> > Although the git bisect gave the same git commit, the problem is probably
> > different and the suggested fix did not work.
> >
> > See
> > https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135
> > for details.
>
> Spreading feedback over multiple places makes things hard.
>
> And journalctl -k / dmesg would be helpful.
>
> Did your kernel throw that "kernel: BUG: kernel NULL pointer dereference"
> before the fix? If it did not, it was a different problem to begin with and
> worth its own ticket, as things otherwise get confusing.
>
> Or did the kernel throw that error and it's gone now, but things are not
> working? Then the patch helped -- but there might be another problem or the
> fix is not enough. Building a 6.8.2 kernel with the culprit removed could
> help to narrow things down.

Sorry for the confusion! In fact I never had this "kernel: BUG: kernel NULL pointer dereference", but other people who did commented on my original post, so things got mixed up. Anyway, the git bisection is probably OK; only the problem is different. I will try to clarify and open my own ticket.

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #14)
> Fix now queued for the next release of all affected stable/longterm series

Hmm, was the original change backported to stable kernels? AFAIK I didn't mark it to Cc stable:

commit 63298d6e752fc0ec7f5093860af8bc9f047b30c8
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Tue Jan 9 13:45:40 2024 -0500

    Bluetooth: hci_core: Cancel request on command timeout

    If command has timed out call __hci_cmd_sync_cancel to notify the
    hci_req since it will inevitably cause a timeout.

    This also rework the code around __hci_cmd_sync_cancel since it was
    wrongly assuming it needs to cancel timer as well, but sometimes the
    timers have not been started or in fact they already had timed out in
    which case they don't need to be cancel yet again.

    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

I wonder why it got selected to be backported. In any case, I don't think it is a good idea to attempt backporting without at least a Fixes tag to begin with; otherwise we risk spreading problems like this to people who are not running the latest kernel, where this sort of problem is somewhat expected during the early rc phase. So instead of having these 2 patches backported, we could just remove the above from the stable trees.

(In reply to Luiz Von Dentz from comment #20)
> Hmm, was the original change backported to stable kernels,

You won't get an answer to that here, so I brought this to the lists: https://lore.kernel.org/all/84da1f26-0457-451c-b4fd-128cb9bd860d@leemhuis.info/

Tested stable 6.8.3: boot and lsusb -v are OK. Will keep it open for a few days and then close if there are no problems.

I can confirm this as fixed in 6.8.3 and 6.7.12.

So far so good on my end with 6.8.3, thank you everyone.

Kernel v6.6.25 works fine on my system regarding the bug that was in v6.6.23. Thanks for the fix!

Closed. Fixed in stable 6.8.3.
commit b0a3738c0b3bcb5760ff4db1f22b9b0e1725d1d2
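For readers checking their own systems, the per-version results reported in this thread can be collected into a small lookup. This is only a sketch in Python: the table records what individual commenters reported, the names `REPORTED`, `base_version` and `reported_status` are hypothetical, and the release string `6.8.2-arch2-1` in the usage line is an illustrative example, not taken from the report. Versions nobody mentioned, including 6.6.24 and the 6.9 release candidates, are deliberately left as "unknown".

```python
import re

# Status of each stable release as reported by commenters in this thread.
# Hardware and symptoms varied between reports, so this is a rough guide.
REPORTED = {
    "6.6.22": "ok",
    "6.6.23": "affected",
    "6.6.25": "ok",
    "6.7.11": "affected",
    "6.7.12": "ok",
    "6.8.1": "ok",
    "6.8.2": "affected",
    "6.8.3": "ok",
}


def base_version(release):
    """Extract the leading 'X.Y.Z' from a kernel release string, or None."""
    match = re.match(r"(\d+\.\d+\.\d+)", release)
    return match.group(1) if match else None


def reported_status(release):
    """Return 'ok', 'affected', or 'unknown' for a kernel release string."""
    return REPORTED.get(base_version(release), "unknown")


# Hypothetical distribution-style release string for illustration.
print(reported_status("6.8.2-arch2-1"))
```

On a running system, the string returned by `platform.release()` from the Python standard library could be passed to `reported_status`.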