Bug 218651 - kernel 6.8.2 - Bluetooth bug/dump at boot
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Bluetooth
Hardware: Intel Linux
Importance: P3 high
Assignee: linux-bluetooth@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-28 02:10 UTC by jb
Modified: 2024-04-06 06:54 UTC

See Also:
Kernel Version: 6.8.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
journalctl (109.53 KB, text/plain), 2024-03-28 02:10 UTC, jb
lsusb (9.95 KB, text/plain), 2024-03-28 02:12 UTC, jb
lsmod (632 bytes, text/plain), 2024-03-28 02:13 UTC, jb
dmesg log from several boots (132.21 KB, text/plain), 2024-03-28 11:03 UTC, Gurenko Alex

Description jb 2024-03-28 02:10:12 UTC
Created attachment 306049 [details]
journalctl

kernel: BUG: kernel NULL pointer dereference
lsusb also shows related dumps.
See the attachments.

Possibly related:
https://lore.kernel.org/all/20240314084412.1127-1-johan%2Blinaro@kernel.org/
Comment 1 jb 2024-03-28 02:12:44 UTC
Created attachment 306050 [details]
lsusb
Comment 2 jb 2024-03-28 02:13:38 UTC
Created attachment 306051 [details]
lsmod
Comment 3 jb 2024-03-28 11:00:30 UTC
Per
https://bbs.archlinux.org/viewtopic.php?id=294292

Same here with ThinkPad T14 Gen1 and Intel Corp. AX200 Bluetooth.
Therefore I doubt that the Qualcomm-related commit from v6.8-rc7 is the cause. Furthermore, a downgrade to linux-6.8.1 fixes this.
Comment 4 Gurenko Alex 2024-03-28 11:03:11 UTC
Created attachment 306052 [details]
dmesg log from several boots

Same here on my MSI Tomahawk X570 WiFi with AX200. 6.8.1 works fine; 6.8.2 *and* 6.9.0-0.rc1 show "Bluetooth: hci0: command <command> tx timeout":

kernel: Bluetooth: hci0: command 0xfc01 tx timeout
or
kernel: Bluetooth: hci0: command 0xfc05 tx timeout
Comment 5 Gurenko Alex 2024-03-28 11:08:04 UTC
Okay, the command timeout is probably related to this one: https://bugzilla.kernel.org/show_bug.cgi?id=218416, but I had indeed missed the NULL pointer dereference, which is also present in my log.
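
Since two different symptoms are being discussed here (the NULL pointer dereference and the command tx timeouts), a small script can help tell which ones a saved log contains. This is only a sketch: the function name `classify_log` is made up for illustration, and the patterns are taken from the messages quoted in this report, so other kernels may word them differently.

```python
import re

# Signatures copied from the messages quoted in this report.
NULL_DEREF = re.compile(r"BUG: kernel NULL pointer dereference")
TX_TIMEOUT = re.compile(
    r"Bluetooth: (hci\d+): command (0x[0-9a-fA-F]{4}) tx timeout"
)


def classify_log(text):
    """Report which of the two symptoms appear in a kernel log dump.

    `text` is the content of a saved `dmesg` or `journalctl -k` output.
    (Hypothetical helper, not part of any kernel tooling.)
    """
    result = {"null_deref": False, "tx_timeouts": []}
    for line in text.splitlines():
        if NULL_DEREF.search(line):
            result["null_deref"] = True
        match = TX_TIMEOUT.search(line)
        if match:
            # Record (controller, opcode) for every timeout line.
            result["tx_timeouts"].append((match.group(1), match.group(2)))
    return result


if __name__ == "__main__":
    sample = (
        "kernel: Bluetooth: hci0: command 0xfc01 tx timeout\n"
        "kernel: BUG: kernel NULL pointer dereference\n"
        "kernel: Bluetooth: hci0: command 0xfc05 tx timeout\n"
    )
    print(classify_log(sample))
```

A log that only shows the tx timeouts and never the NULL pointer dereference may be a different problem and is probably better tracked separately.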
Comment 6 jb 2024-03-28 12:48:34 UTC
Per
https://bbs.archlinux.org/viewtopic.php?id=294292

ThinkPad X13 Gen3 with Qualcomm WiFi/Bluetooth is working properly with Linux "6.8.2". Therefore this issue is limited to Intel based Bluetooth.
Comment 7 Peter Weber 2024-03-28 12:49:34 UTC
I've cross-checked with a ThinkPad X13 Gen 3 (AMD + Qualcomm WiFi/Bluetooth), and it works properly with Linux "6.8.2". This issue is limited to devices with Intel WiFi/Bluetooth.
Comment 8 Peter Weber 2024-03-28 12:50:28 UTC
I'm too slow :)
Comment 9 Paul Menzel 2024-03-28 12:52:58 UTC
Can somebody please bisect? You should also be able to use QEMU for this and pass the Bluetooth device through to the virtual machine.
Comment 10 sugaraddicted 2024-03-30 01:49:57 UTC
Here is the bisect.

$ git bisect good
b53e5ef62fe9853648b4478bd6cb3aba970a6f1f is the first bad commit
commit b53e5ef62fe9853648b4478bd6cb3aba970a6f1f
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Tue Jan 9 13:45:40 2024 -0500

    Bluetooth: hci_core: Cancel request on command timeout
    
    [ Upstream commit 63298d6e752fc0ec7f5093860af8bc9f047b30c8 ]
    
    If command has timed out call __hci_cmd_sync_cancel to notify the
    hci_req since it will inevitably cause a timeout.
    
    This also rework the code around __hci_cmd_sync_cancel since it was
    wrongly assuming it needs to cancel timer as well, but sometimes the
    timers have not been started or in fact they already had timed out in
    which case they don't need to be cancel yet again.
    
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
    Stable-dep-of: 2615fd9a7c25 ("Bluetooth: hci_sync: Fix overwriting request callback")
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 include/net/bluetooth/hci_sync.h |  2 +-
 net/bluetooth/hci_core.c         | 84 +++++++++++++++++++++++++++-------------
 net/bluetooth/hci_request.c      |  2 +-
 net/bluetooth/hci_sync.c         | 20 +++++-----
 net/bluetooth/mgmt.c             |  2 +-
 5 files changed, 71 insertions(+), 39 deletions(-)
Comment 11 sugaraddicted 2024-03-30 01:50:41 UTC
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [03a22b591c5443ba269e8570c6fef411251fe1b8] Linux 6.8.2
git bisect bad 03a22b591c5443ba269e8570c6fef411251fe1b8
# status: waiting for good commit(s), bad commit known
# good: [8a8b2a057ed9684704792b5d4b333616769002c2] Linux 6.8.1
git bisect good 8a8b2a057ed9684704792b5d4b333616769002c2
# bad: [da2d94af7ba950b33ce7dfd326894460c5536988] drm: Don't treat 0 as -1 in drm_fixp2int_ceil
git bisect bad da2d94af7ba950b33ce7dfd326894460c5536988
# good: [116cc80f47b29edcba609ad92be1ad83d1cedcd0] arm64: dts: qcom: sm6115: drop pipe clock selection
git bisect good 116cc80f47b29edcba609ad92be1ad83d1cedcd0
# good: [57662cd437c052595711bc733574e6895e074ee5] gpiolib: Pass consumer device through to core in devm_fwnode_gpiod_get_index()
git bisect good 57662cd437c052595711bc733574e6895e074ee5
# bad: [b08bd8f02a24e2b82fece5ac51dc1c3d9aa6c404] Bluetooth: btusb: Fix memory leak
git bisect bad b08bd8f02a24e2b82fece5ac51dc1c3d9aa6c404
# good: [4a09d0236854360d0c33fec01d3c7d9703cca570] PCI: Make pci_dev_is_disconnected() helper public for other drivers
git bisect good 4a09d0236854360d0c33fec01d3c7d9703cca570
# good: [da0de50013c160f76b0d4c1869be25875f48015b] Bluetooth: mgmt: Remove leftover queuing of power_off work
git bisect good da0de50013c160f76b0d4c1869be25875f48015b
# bad: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] Bluetooth: hci_core: Cancel request on command timeout
git bisect bad b53e5ef62fe9853648b4478bd6cb3aba970a6f1f
# good: [54db3630deff566224de6cfb0767d2d398e68ed5] Bluetooth: Remove BT_HS
git bisect good 54db3630deff566224de6cfb0767d2d398e68ed5
# good: [d8c7785e8104359f139cdfa99e2511575c4d4823] Bluetooth: hci_qca: don't use IS_ERR_OR_NULL() with gpiod_get_optional()
git bisect good d8c7785e8104359f139cdfa99e2511575c4d4823
# first bad commit: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] Bluetooth: hci_core: Cancel request on command timeout
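
For anyone who wants to compare bisect results across machines, the log above can be summarized mechanically. The following is a minimal sketch that assumes the comment-line format shown in this log (`# good: [sha] subject`, `# bad: [sha] subject`, `# first bad commit: [sha] subject`); the function name `summarize_bisect_log` is hypothetical.

```python
import re

# Comment lines written by `git bisect log`, in the format shown above.
STEP = re.compile(r"^# (good|bad): \[([0-9a-f]{7,40})\] (.*)$")
FIRST_BAD = re.compile(r"^# first bad commit: \[([0-9a-f]{7,40})\] (.*)$")


def summarize_bisect_log(text):
    """Return (steps, first_bad) parsed from `git bisect log` output.

    steps     : list of (verdict, sha, subject) in the order they were marked
    first_bad : (sha, subject), or None if the bisection did not finish
    """
    steps = []
    first_bad = None
    for line in text.splitlines():
        match = STEP.match(line)
        if match:
            steps.append(match.groups())
            continue
        match = FIRST_BAD.match(line)
        if match:
            first_bad = match.groups()
    return steps, first_bad


if __name__ == "__main__":
    excerpt = (
        "git bisect start\n"
        "# bad: [03a22b591c5443ba269e8570c6fef411251fe1b8] Linux 6.8.2\n"
        "# good: [8a8b2a057ed9684704792b5d4b333616769002c2] Linux 6.8.1\n"
        "# first bad commit: [b53e5ef62fe9853648b4478bd6cb3aba970a6f1f] "
        "Bluetooth: hci_core: Cancel request on command timeout\n"
    )
    steps, first_bad = summarize_bisect_log(excerpt)
    print(len(steps), "verdicts; first bad commit:", first_bad)
```

Two reporters who end up with the same `first_bad` SHA have bisected to the same commit, although, as later comments show, that does not by itself prove they are hitting the same symptom.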
Comment 12 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-30 13:59:24 UTC
From the bisection and the oops it's pretty likely a duplicate of https://lore.kernel.org/all/08275279-7462-4f4a-a0ee-8aa015f829bc@leemhuis.info/

Then this patch should help (which might only get to Linus next Thursday): https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=1c3366abdbe884
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-30 15:50:34 UTC
I asked the stable team to pick up the patch:
https://lore.kernel.org/all/bf267566-c18c-4ad9-9263-8642ecfdef1f@leemhuis.info/
Comment 14 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-30 17:27:01 UTC
Fix now queued for the next release of all affected stable/longterm series
Comment 15 kbugreports 2024-03-30 19:34:01 UTC
I can confirm this bug in kernel v6.6.23.

Not that I care about bluetooth, but apparently it also affects usb.

This issue is present on most but not all boots; sometimes it does not occur.

The real-world consequence on my system (ThinkPad, Kaby Lake) is that the external USB keyboard and mouse are not recognized, so the system is unusable. The built-in touchpad works, so I can initiate a reboot, but the system doesn't power down properly and still needs a hard reset to complete the "reboot".

If the system happens to boot fine (i.e. the new, unusual output on the boot screen is missing), those problems don't appear.

Going back to kernel v6.6.22 solves this issue for me.

Hoping / assuming the fix will also be included in the 6.6.24 kernel.
Comment 16 Ferdi Scholten 2024-03-30 21:52:00 UTC
The bug is also in kernel v6.7.11, affecting Bluetooth and USB. It also causes the system to hang at reboot or shutdown. This is on a Lenovo ThinkPad T560 (Intel Skylake).

Also confirmed on a Lenovo ThinkPad P52s laptop, where it also causes a hang on reboot/shutdown.
Comment 17 wolf.seifert 2024-03-31 15:48:48 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #12)
> From the bisection and the oops it's pretty likely a duplicate of
> https://lore.kernel.org/all/08275279-7462-4f4a-a0ee-8aa015f829bc@leemhuis.info/
> 
> Then this patch should help (which might only get to Linus next Thursday):
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=1c3366abdbe884

Although the git bisect gave the same git commit, the problem is probably different and the suggested fix did not work.

See
https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135
for details.
Comment 18 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-03-31 16:21:11 UTC
(In reply to wolf.seifert from comment #17)
>
> Although the git bisect gave the same git commit, the problem is probably
> different and the suggested fix did not work.
> 
> See
> https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135
> for details.

Spreading feedback over multiple places makes things hard.

And journalctl -k / dmesg would be helpful. 

Did your kernel throw that "kernel: BUG: kernel NULL pointer dereference" before the fix? If it did not, it was a different problem to begin with and worth its own ticket, as things otherwise get confusing. 

Or did the kernel throw that error and it's gone now, but things are not working? Then the patch helped -- but there might be another problem or the fix is not enough. Building a 6.8.2 kernel with the culprit removed could help to narrow things down.
Comment 19 wolf.seifert 2024-03-31 16:56:09 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #18)
> (In reply to wolf.seifert from comment #17)
> >
> > Although the git bisect gave the same git commit, the problem is probably
> > different and the suggested fix did not work.
> > 
> > See
> > https://bbs.archlinux.org/viewtopic.php?pid=2161135#p2161135
> > for details.
> 
> Spreading feedback over multiple places makes things hard.
> 
> And journalctl -k / dmesg would be helpful. 
> 
> Did your kernel throw that "kernel: BUG: kernel NULL pointer dereference"
> before the fix? If it did not, it was a different problem to begin with and
> worth its own ticket, as things otherwise get confusing. 
> 
> Or did the kernel throw that error and it's gone now, but things are not
> working? Then the patch helped -- but there might be another problem or the
> fix is not enough. Building a 6.8.2 kernel with the culprit removed could
> help to narrow things down.

Sorry for the confusion! In fact I never had this "kernel: BUG: kernel NULL pointer dereference", but other people who did have it commented on my original post, so things got mixed up.

Anyway, the git bisection is probably OK; only the problem is different. I will try to clarify and open a separate ticket.
Comment 20 Luiz Von Dentz 2024-04-01 20:33:54 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #14)
> Fix now queued for the next release of all affected stable/longterm series

Hmm, was the original change backported to stable kernels? AFAIK I didn't mark it to Cc stable:

commit 63298d6e752fc0ec7f5093860af8bc9f047b30c8
Author: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Date:   Tue Jan 9 13:45:40 2024 -0500

    Bluetooth: hci_core: Cancel request on command timeout
    
    If command has timed out call __hci_cmd_sync_cancel to notify the
    hci_req since it will inevitably cause a timeout.
    
    This also rework the code around __hci_cmd_sync_cancel since it was
    wrongly assuming it needs to cancel timer as well, but sometimes the
    timers have not been started or in fact they already had timed out in
    which case they don't need to be cancel yet again.
    
    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>

I wonder why it got selected to be backported. In any case, I don't think it is a good idea to attempt backporting without at least a Fixes tag to begin with; otherwise we risk spreading problems like this to people who are not running the latest kernel, where this sort of problem is somewhat expected during the early rc phase. So instead of having these 2 patches backported, we could just remove the above from the stable trees.
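
One way to see how a commit reached a stable tree is to look at the trailers in the stable version of its commit message. The stable commit quoted in comment 10 carries an "[ Upstream commit ... ]" marker and a "Stable-dep-of:" line, which suggests it was pulled in as a dependency of another fix rather than through a Fixes tag or Cc: stable. The sketch below extracts those hints from a commit message; the function name `backport_hints` is hypothetical and the parsing is deliberately simplistic.

```python
import re

# "Key: value" trailer lines, optionally indented as in `git log` output.
TRAILER = re.compile(r"^\s*([A-Za-z][A-Za-z-]*):\s+(.+)$")
# Marker that stable backports carry to reference the mainline commit.
UPSTREAM = re.compile(r"\[ Upstream commit ([0-9a-f]{7,40}) \]")


def backport_hints(message):
    """Collect trailers that hint at why a commit is in a stable tree."""
    hints = {"upstream": None, "fixes": [], "stable_dep_of": [],
             "cc_stable": False}
    for line in message.splitlines():
        match = UPSTREAM.search(line)
        if match:
            hints["upstream"] = match.group(1)
            continue
        match = TRAILER.match(line)
        if not match:
            continue
        key, value = match.group(1).lower(), match.group(2).strip()
        if key == "fixes":
            hints["fixes"].append(value)
        elif key == "stable-dep-of":
            hints["stable_dep_of"].append(value)
        elif key == "cc" and "stable@" in value:
            hints["cc_stable"] = True
    return hints


if __name__ == "__main__":
    # Excerpt of the stable commit message quoted in comment 10.
    msg = (
        "    Bluetooth: hci_core: Cancel request on command timeout\n"
        "\n"
        "    [ Upstream commit 63298d6e752fc0ec7f5093860af8bc9f047b30c8 ]\n"
        "\n"
        "    Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>\n"
        "    Stable-dep-of: 2615fd9a7c25 (\"Bluetooth: hci_sync: Fix "
        "overwriting request callback\")\n"
        "    Signed-off-by: Sasha Levin <sashal@kernel.org>\n"
    )
    print(backport_hints(msg))
```

For the message above, the result has no Fixes entries and no Cc: stable, but one Stable-dep-of entry pointing at 2615fd9a7c25.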
Comment 21 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-04-02 06:47:42 UTC
(In reply to Luiz Von Dentz from comment #20)
> Hmm, was the original change backported to stable kernels,

You won't get an answer to that here, so I brought this to the lists:
https://lore.kernel.org/all/84da1f26-0457-451c-b4fd-128cb9bd860d@leemhuis.info/
Comment 22 jb 2024-04-04 04:45:47 UTC
Tested stable 6.8.3 (boot, lsusb -v): it is OK.
Will keep this open for a few days and then close it if no problems appear.
Comment 23 Ferdi Scholten 2024-04-04 05:38:12 UTC
I can confirm this as fixed in 6.8.3 and 6.7.12
Comment 24 Gurenko Alex 2024-04-04 08:56:05 UTC
So far so good on my end with 6.8.3, thank you everyone
Comment 25 kbugreports 2024-04-05 16:35:21 UTC
Kernel v6.6.25 works fine on my system regarding the bug that was in v6.6.23. Thanks for the fix!
Comment 26 jb 2024-04-06 06:54:54 UTC
Closed. Fixed in stable 6.8.3.
commit b0a3738c0b3bcb5760ff4db1f22b9b0e1725d1d2
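
To check a given kernel release against what was reported in this thread, the versions mentioned by commenters can be put in a small lookup. This is a sketch built only from the comments above (6.6.23, 6.7.11 and 6.8.2 reported broken; 6.6.25, 6.7.12 and 6.8.3 confirmed working); it is not an official list of affected releases, the function names are made up, and treating later releases in a series as fixed is an assumption.

```python
import re

# Versions as reported by commenters in this thread (series -> patch level).
REPORTED_BAD = {(6, 6): 23, (6, 7): 11, (6, 8): 2}
# First release in each series that a commenter confirmed as working again.
REPORTED_GOOD = {(6, 6): 25, (6, 7): 12, (6, 8): 3}


def parse_release(release):
    """Split a string such as '6.8.2-arch2-1' into ((6, 8), 2)."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, patch = (int(g) for g in match.groups())
    return (major, minor), patch


def reported_status(release):
    """Classify a release using only the reports in this thread."""
    series, patch = parse_release(release)
    if series not in REPORTED_BAD:
        return "not covered by this thread"
    if patch == REPORTED_BAD[series]:
        return "reported broken"
    if patch >= REPORTED_GOOD[series]:
        # Assumption: later releases in the series keep the fix.
        return "reported fixed"
    if patch < REPORTED_BAD[series]:
        return "predates the reported breakage"
    # For example 6.6.24, which nobody in this thread tested.
    return "unconfirmed"


if __name__ == "__main__":
    for rel in ("6.8.1", "6.8.2-arch2-1", "6.8.3", "6.6.24", "6.7.12"):
        print(rel, "->", reported_status(rel))
```

The input can be the string printed by `uname -r`; anything after the third numeric component is ignored.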
