Starting in 5.7-rc1, my Dell XPS 9300 fails to suspend about 60% of the time. In 5.6.0, it runs thousands of suspend cycles without error. The following hci0 message is present in dmesg for all of the failures, and none of the successes: PM: Preparing system for sleep (s2idle) Bluetooth: hci0: command 0x2042 tx timeout PM: suspend exit This may be related to bug 207169, though that mentions a different command.
With Linux-5.7-rc4, this system sees no errors after hundreds of suspend/resume cycles, if Bluetooth is disabled via BIOS SETUP.
With Linux-5.7-rc4 and Bluetooth enabled in BIOS SETUP... # rmmod btusb is sufficient to allow 20 suspend/resume cycles without error. # modprobe btusb results in 8 of 20 cycles failing.
# hciconfig hci0 down is sufficient to allow 20 suspend/resume cycles without error. # hciconfig hci0 up returns us to failing 13 of 20 cycles.
dd522a7429b07e4441871ae75ebbfcf53635bdd4 is the first bad commit commit dd522a7429b07e4441871ae75ebbfcf53635bdd4 Author: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Date: Wed Mar 11 08:54:02 2020 -0700 Bluetooth: Handle LE devices during suspend To handle LE devices, we must first disable passive scanning and disconnect all connected devices. Once that is complete, we update the whitelist and re-enable scanning Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Unfortunately this commit, by itself, does not cleanly revert from upstream.
5.7-rc5 also fails. It works with the following two commits reverted: # "Bluetooth: Pause discovery and advertising during suspend" git revert 4867bd007d25a8dfd4ffc558534f7aec8b361789 # "Bluetooth: Handle LE devices during suspend" git revert dd522a7429b07e4441871ae75ebbfcf53635bdd4
Created attachment 289077 [details] issue.def -- for recording known suspend/resume stress test issues
Does your system rfkill Bluetooth on suspend? I think that's what was happening on https://bugzilla.kernel.org/show_bug.cgi?id=207169 Without rfkill (directly entering suspend), the issue seemed to disappear: https://bugzilla.kernel.org/show_bug.cgi?id=207169#c14 The bluetooth commits you've identified added a suspend hook in bluetooth so it gracefully disconnects active connections and sets up wakeup hooks on reconnections. If Bluetooth is rfkill-ed, it won't be able to do this and will cmd timeout as seen.
I'm invoking suspend via sleepgraph, which does "echo freeze > /sys/power/state" without any prior interaction with any user-space distro hooks. The distro in this case is Ubuntu 18.04, which comes pre-installed on this box. Exactly what would you like me to observe, and how?
Could you collect grab dmesg w/ dynamic debug enabled? This will give us some confirmation about what's going on in Bluetooth (especially since 0x0c1a is sent before 0x2042 and that should be timing out first). Dyndebug instructions: https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html Please enable debug for files hci_core.c, hci_request.c and hci_event.c: (for example: echo 'file hci_core.c +p' > <debugfs>/dynamic_debug/control) Collecting this is probably the most useful observation of this bunch. -- I'm not familiar with what power hooks Ubuntu uses but it might be worth checking the systemd-sleep programs to see if any of them are going to rfkill bluetooth: https://wiki.archlinux.org/index.php/Power_management#Hooks_in_/usr/lib/systemd/system-sleep -- You can also try running `rfkill event` in shell while suspending. This might capture any service that is rfkill-ing without your knowledge. http://manpages.ubuntu.com/manpages/bionic/man8/rfkill.8.html
Created attachment 289101 [details] dmesg 5.7-rc5 # file 'hci_core.c +p' > /sys/kernel/debug/dynamic_debug/control # file 'hci_request.c +p' > /sys/kernel/debug/dynamic_debug/control # file 'hci_event.c +p' > /sys/kernel/debug/dynamic_debug/control # sleepgraph -m freeze -multi 10 0 where the 1st and 5th suspend cycles succeeded, and the other 8 failed due to bluetooth. dmesg attached.
Created attachment 289103 [details] dmesg 5.7-rc5 oops, copy/paste typo in my script, did it right this time: # file 'hci_core.c +p' > /sys/kernel/debug/dynamic_debug/control # file 'hci_request.c +p' > /sys/kernel/debug/dynamic_debug/control # file 'hci_event.c +p' > /sys/kernel/debug/dynamic_debug/control # sleepgraph -m freeze -multi 10 0 This time the 1st, 4, 5, 6th and 8th failed. full dmesg attached.
Re: ubuntu, systemd etc... Note that the same system with Ubuntu 18.04 successfully runs thousands of suspend/resume cycles on top of the following kernels: 5.6.8+ 5.6.0+ 5.5.0+ 5.4.0+ 5.0.0-1050-oem-osp1 5.0.0-1050-oem-osp1 5.0.0-1031-oem-osp1 and it doesn't fail until the kernel includes the bisected commit. That said, it appears that Ubuntu 18.04's systemd invokes "tlp suspend" were it to be involved in the suspend process. So just in case, I replaced /usr/bin/tlp and /usr/bin/bluetooth with /bin/true, but the system fails the same way with those programs absent.
Re: rfkill $ sudo rfkill ID TYPE DEVICE SOFT HARD 0 wlan phy0 unblocked unblocked 1 bluetooth hci0 unblocked unblocked $ sudo rfkill event 2020-05-13 00:41:30,772265-0400: idx 0 type 1 op 0 soft 0 hard 0 2020-05-13 00:41:30,772289-0400: idx 1 type 2 op 0 soft 0 hard 0 Here I ran 10 suspend/resume attempts (4 succeeded), but rfkill didn't print any more lines. If I manually run "bluetooth off", rfkill prints this: 2020-05-13 00:49:00,003851-0400: idx 1 type 2 op 2 soft 1 hard 0 If I manually run "bluetooth on", rfkill prints this: 2020-05-13 00:49:15,297390-0400: idx 1 type 2 op 2 soft 0 hard 0 I get the same two lines by turning bluetooth off/on in the system settings GUI, no matter if the bluetooth program is available or not.
Ok -- this one is not a systemic problem as I predicted. This looks to be the controller not responding to a specific command. In the logs below, the controller is responding to everything except for 0x2042 and even that is intermittent (responds to some 0x2042 but not others). So this is increasingly starting to look like a controller firmware problem. Looks like the Dell XPS 9300 uses an Intel AX wifi+bt chipset. You're using firmware from 2019 and there's been several updates in the past few months. Could you try a newer firmware to see if this is still happening? The latest update was in April: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=c5ac1add86be7a90d0c573c957e37c610f3d7f25 If you can still repro it w/ new firmware, could you collect an hci trace when it occurs? `btmon -w <file location>` will collect the trace. -- [ 855.188312] PM: Preparing system for sleep (s2idle) [ 855.188792] hci0 opcode 0x0c1a plen 1 [ 855.188793] skb len 4 [ 855.188794] hci0 opcode 0x2042 plen 6 [ 855.188795] skb len 9 [ 855.188795] length 2 [ 855.188799] hci0 cmd_cnt 1 cmd queued 2 [ 855.188801] hci0 type 1 len 4 [ 855.327410] hci0 [ 855.327412] hci0 Event packet [ 855.327414] hci0 status 0x00 [ 855.327415] hci0 opcode 0x0c1a [ 855.327416] opcode 0x0c1a status 0x00 [ 855.327417] hci0 opcode 0x0c1a [ 855.327421] hci0 cmd_cnt 1 cmd queued 1 [ 855.327422] hci0 type 1 len 9 [ 855.352390] hci0 [ 855.352391] hci0 Event packet [ 855.352392] hci0 status 0x00 [ 855.352392] hci0 opcode 0x2042 [ 855.352393] opcode 0x2042 status 0x00 [ 855.352394] hci0 opcode 0x2042 [ 855.352394] hci0: Request complete opcode=0x2042, status=0x0 # Notice that 0x2042 completes successfully [ 855.352407] hci0 opcode 0x0c05 plen 1 [ 855.352408] skb len 4 [ 855.352409] hci0 opcode 0x0c1a plen 1 [ 855.352410] skb len 4 [ 855.352411] hci0 opcode 0x2042 plen 6 [ 855.352411] skb len 9 [ 855.352412] hci0: LE passive scan with whitelist = 1 [ 855.352412] hci0 opcode 0x2041 plen 8 [ 855.352413] skb len 11 [ 855.352413] hci0 opcode 0x2042 plen 6 [ 855.352413] skb len 9 [ 855.352414] length 5 [ 855.352416] hci0 cmd_cnt 1 cmd queued 5 [ 855.352417] hci0 type 1 len 4 [ 855.353392] hci0 [ 855.353392] hci0 Event packet [ 855.353394] hci0 opcode 0x0c05 [ 855.353394] opcode 0x0c05 status 0x00 [ 855.353395] hci0 opcode 0x0c05 [ 855.353398] hci0 cmd_cnt 1 cmd queued 4 [ 855.353398] hci0 type 1 len 4 [ 855.354342] hci0 [ 855.354343] hci0 Event packet [ 855.354343] hci0 status 0x00 [ 855.354344] hci0 opcode 0x0c1a [ 855.354344] opcode 0x0c1a status 0x00 [ 855.354345] hci0 opcode 0x0c1a [ 855.354346] hci0 cmd_cnt 1 cmd queued 3 [ 855.354347] hci0 type 1 len 9 [ 857.354012] hci0: Timed out waiting for suspend # An early timeout in the suspend path [ 857.354014] hci0: Bit 5 is set [ 857.354096] hci0 opcode 0x0c05 plen 1 [ 857.354098] skb len 4 [ 857.354099] hci0 opcode 0x2042 plen 6 [ 857.354099] skb len 9 [ 857.354100] hci0: LE passive scan with whitelist = 1 [ 857.354101] hci0 opcode 0x2041 plen 8 [ 857.354101] skb len 11 [ 857.354101] hci0 opcode 0x2042 plen 6 [ 857.354102] skb len 9 [ 857.354102] length 4 [ 857.354107] hci0 cmd_cnt 0 cmd queued 6 [ 857.385937] Bluetooth: hci0: command 0x2042 tx timeout [ 857.385938] Bluetooth: hci0: No way to reset. Ignoring and continuing # Enabling passive scan failed [ 857.385941] hci0 cmd_cnt 1 cmd queued 6 [ 857.385943] hci0 type 1 len 11 [ 857.386342] hci0 [ 857.386342] hci0 Event packet [ 857.386344] hci0 status 0x00 [ 857.386344] hci0 opcode 0x2041 [ 857.386345] opcode 0x2041 status 0x00 [ 857.386345] hci0 opcode 0x2041 [ 857.386347] hci0 cmd_cnt 1 cmd queued 5 [ 857.386348] hci0 type 1 len 9 [ 857.388421] hci0 [ 857.388422] hci0 Event packet [ 857.388423] hci0 status 0x00 [ 857.388424] hci0 opcode 0x2042 [ 857.388425] opcode 0x2042 status 0x00 [ 857.388425] hci0 opcode 0x2042 [ 857.388428] hci0: Request complete opcode=0x2042, status=0x0 # 0x2042 completes successfully again [ 857.388431] hci0 cmd_cnt 1 cmd queued 4 [ 857.388431] hci0 type 1 len 4 [ 857.389383] hci0 [ 857.389384] hci0 Event packet [ 857.389385] hci0 opcode 0x0c05 [ 857.389386] opcode 0x0c05 status 0x00 [ 857.389386] hci0 opcode 0x0c05 [ 857.389389] hci0 cmd_cnt 1 cmd queued 3 [ 857.389390] hci0 type 1 len 9 [ 857.400353] hci0 [ 857.400355] hci0 Event packet [ 857.400357] hci0 status 0x00 [ 857.400358] hci0 opcode 0x2042 [ 857.400359] opcode 0x2042 status 0x00 [ 857.400359] hci0 opcode 0x2042 [ 857.400363] hci0 cmd_cnt 1 cmd queued 2 [ 857.400365] hci0 type 1 len 11 [ 857.401330] hci0 [ 857.401332] hci0 Event packet [ 857.401333] hci0 status 0x00 [ 857.401334] hci0 opcode 0x2041 [ 857.401334] opcode 0x2041 status 0x00 [ 857.401335] hci0 opcode 0x2041 [ 857.401337] hci0 cmd_cnt 1 cmd queued 1 [ 857.401338] hci0 type 1 len 9 [ 857.403349] hci0 [ 857.403349] hci0 Event packet [ 857.403350] hci0 status 0x00 [ 857.403351] hci0 opcode 0x2042 [ 857.403351] opcode 0x2042 status 0x00 [ 857.403352] hci0 opcode 0x2042 [ 857.403353] hci0: Request complete opcode=0x2042, status=0x0 [ 857.429909] PM: suspend exit
This same Linux Bluetooth regression is seen on additional platforms: man:Dell Inc. | plat:XPS 13 7390 | cpu:Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz | bios:1.5.1 | biosdate:03/09/2020 man:Dell Inc. | plat:Inspiron 7386 | cpu:Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz | bios:1.2.0 | biosdate:12/05/2018 man:Intel Corporation | plat:NUC7JYB | cpu:Intel(R) Celeron(R) J4005 CPU @ 2.00GHz | bios:JYGLKCPX.86A.0053.2019.1015.1510 | biosdate:10/15/2019 Their failure rates are slighly higher than the Dell 9300: 5.7.0-rc4+ otcpl-dell-7386-whl freeze-x3547 3261 91.94% 5.7.0-rc4+ otcpl-nuc-kbl freeze-x3207 2821 87.96% 5.7.0-rc4+ otcpl-dell-7390-cmlu freeze-x2879 2275 79.02% 5.7.0-rc4+ lenb-Dell-XPS-13-9300 freeze-x2426 1429 58.90%
re: firmware dmesg showed this: [ 7.833706] Bluetooth: hci0: Minimum firmware build 1 week 10 2014 [ 7.835192] Bluetooth: hci0: Found device firmware: intel/ibt-19-32-4.sfi [ 9.324793] Bluetooth: hci0: Waiting for firmware download to complete [ 9.325473] Bluetooth: hci0: Firmware loaded in 1459099 usecs [ 9.355517] Bluetooth: hci0: Firmware revision 0.0 build 62 week 31 2019 even though there are apparently higher numbered of ibt in /lib/firmware/intel. I downloaded linux-firmware-c5ac1add86be7a90d0c573c957e37c610f3d7f25.tar.gz and I replaced /lib/firmware/intel/ with the new copy, but it appears that the driver did not find a newer version for this device: [ 7.766736] Bluetooth: hci0: Firmware revision 0.0 build 62 week 31 2019 What am I missing? In any case, a bunch of machines are failing with the firmware that they are running, when they used to work before this regression, and so this appears to be more of a Linux regression, than a device-specific firmware issue?
I found another patch in Chromium repository that seems to work around this issue. It comes down to conditional cmd execution that is specific to the passive scanning code. It explains why the issue occurs specifically on the passive scanning and why it only happens sometimes. https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1636009 Could you try this patch and see if it fixes your issue? If yes, I'll try to get it merged upstream.
I looked at that patch again and it looks like it works around something that is not upstreamed so please disregard comment #17. -- re comment #16 - Checking on a recent build, the firmware revision I see is build 128 week 11 2020 (I think this is slightly older than the one I linked). I'm not sure why you're seeing an older firmware.
My test case went away (Dell 9300) After screwing around with /lib/firmware/intel the failure went away, and restoring that directory, the failure doesn't come back.
(In reply to Abhishek Pandit-Subedi from comment #14) > Ok -- this one is not a systemic problem as I predicted. This looks to be > the controller not responding to a specific command. In the logs below, the > controller is responding to everything except for 0x2042 and even that is > intermittent (responds to some 0x2042 but not others). > > So this is increasingly starting to look like a controller firmware problem. Which totally doesn't matter. Evidently, there are multiple cases in which the controller stops responding and that must not be a reason for failing the system suspend (that may mean refusing to suspend after a user has closed a laptop lid or similar). The controller may not be able to suspend cleanly, but that doesn't matter. There still is time to recover during system resume and if that fails, it will just not work going forward and so be it. Failing system suspend is not an option. You can do that only in critical situations.
re #19 - Chromium had an internal patch that was conditionally sending the passive scanning command. After removing it, we were getting command disallowed. This is very likely what you were seeing. There's a patch fixing this on the mailing list now: https://patchwork.kernel.org/patch/11582831/ re #20 - You make a good point. A second problem I encountered with higher usage of this code is that if the controller stops responding (i.e. it crashes) and we can't reset it, it will also prevent suspend. This one is unacceptable. Likely, the fix from Bluetooth end would be just to always allow suspend but log an error here: https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/tree/net/bluetooth/hci_core.c#n3368 Expect a patch today on the mailing list with a fix for this.
Patch to test: https://patchwork.kernel.org/patch/11586221/
I tried running this patch over the 5.8.0-rc2 upstream kernel and there was no appreciable difference in the occurrence of this error. I have 5 machines (in addition to len's) that have been getting this error since 5.7.0-rc1. KERNEL(w patch) HOST TESTS FAILS 5.8.0-rc2+ otcpl-dell-7386-whl freeze-x1141 1073 94.04% 5.8.0-rc2+ otcpl-cml-s-1 freeze-x1064 983 92.39% 5.8.0-rc2+ otcpl-dell-7390-cmlu freeze-x1631 1419 87.00% 5.8.0-rc2+ otcpl-nuc-glk freeze-x1083 929 85.78% 5.8.0-rc2+ otcpl-cml-s-2 freeze-x1172 869 74.15% Without the patch on 5.8.0-rc1 I get this: KERNEL(w/o patch) HOST TESTS FAILS 5.8.0-rc1+ otcpl-dell-7386-whl freeze-x2876 2698 93.81% 5.8.0-rc1+ otcpl-cml-s-1 freeze-x2359 2128 90.21% 5.8.0-rc1+ otcpl-dell-7390-cmlu freeze-x3655 3268 89.41% 5.8.0-rc1+ otcpl-nuc-glk freeze-x1089 911 83.65% 5.8.0-rc1+ otcpl-cml-s-2 freeze-x2748 1991 72.45% Len you'll have to try it on your machine.
Here's a comprehensive list of all the machines in our power lab that are seeing this issue. It started in 5.7.0-rc1: 5.7.0-rc1+ otcpl-dell-7386-whl freeze-x4344 3892 89.59% 5.7.0-rc1+ otcpl-nuc-kbl freeze-x984 850 86.38% 5.7.0-rc1+ lenb-Dell-XPS-13-9300 freeze-x607 359 59.14% 5.7.0-rc2+ otcpl-dell-7386-whl freeze-x4341 3850 88.69% 5.7.0-rc2+ otcpl-cml-u freeze-x2070 1797 86.81% 5.7.0-rc2+ otcpl-nuc-kbl freeze-x3697 3164 85.58% 5.7.0-rc2+ lenb-Dell-XPS-13-9300 freeze-x402 214 53.23% 5.7.0-rc3+ otcpl-cml-u freeze-x4020 3667 91.22% 5.7.0-rc3+ otcpl-dell-7386-whl freeze-x4404 3979 90.35% 5.7.0-rc3+ otcpl-nuc-kbl freeze-x3777 3288 87.05% 5.7.0-rc3+ lenb-Dell-XPS-13-9300 freeze-x612 359 58.66% 5.7.0-rc4+ otcpl-cml-u freeze-x3460 3185 92.05% 5.7.0-rc4+ otcpl-dell-7386-whl freeze-x3547 3261 91.94% 5.7.0-rc4+ otcpl-nuc-kbl freeze-x3207 2821 87.96% 5.7.0-rc4+ otcpl-dell-7390-cmlu freeze-x2879 2275 79.02% 5.7.0-rc4+ lenb-Dell-XPS-13-9300 freeze-x2426 1429 58.90% 5.7.0-rc4+ lenb-Dell-XPS-13-9300 freeze-x931 487 52.31% 5.7.0-rc5+ otcpl-nuc-kbl freeze-x3254 2950 90.66% 5.7.0-rc5+ otcpl-dell-7386-whl freeze-x3769 3400 90.21% 5.7.0-rc5+ otcpl-cml-u freeze-x551 479 86.93% 5.7.0-rc5+ otcpl-dell-3493-icl freeze-x3628 2992 82.47% 5.7.0-rc5+ otcpl-cml-s-1 freeze-x2949 1675 56.80% 5.7.0-rc5+ lenb-Dell-XPS-13-9300 freeze-x20 11 55.00% 5.7.0-rc6+ otcpl-dell-7386-whl freeze-x3762 3417 90.83% 5.7.0-rc6+ otcpl-cml-u freeze-x2420 2181 90.12% 5.7.0-rc6+ otcpl-nuc-glk freeze-x1049 881 83.98% 5.7.0-rc6+ otcpl-cml-s-1 freeze-x2862 1581 55.24% 5.7.0-rc7+ otcpl-dell-7386-whl freeze-x3818 3529 92.43% 5.7.0-rc7+ otcpl-nuc-glk freeze-x3101 2637 85.04% 5.7.0-rc7+ otcpl-dell-7390-cmlu freeze-x2971 2508 84.42% 5.7.0-rc7+ otcpl-cml-s-1 freeze-x1958 1649 84.22% 5.7.0-rc7+ otcpl-cml-s-2 freeze-x2920 1634 55.96% 5.7.0+ otcpl-dell-7386-whl freeze-x3836 3550 92.54% 5.7.0+ otcpl-dell-7386-whl freeze-x3804 3499 91.98% 5.7.0+ otcpl-cml-u freeze-x1145 1044 91.18% 5.7.0+ otcpl-nuc-glk freeze-x3160 2815 89.08% 5.7.0+ otcpl-nuc-glk freeze-x3247 2854 87.90% 5.7.0+ otcpl-dell-7390-cmlu freeze-x2966 2530 85.30% 5.7.0+ otcpl-dell-7390-cmlu freeze-x2911 2395 82.27% 5.7.0+ otcpl-cml-s-2 freeze-x2879 1619 56.23%
Todd, can you confirm you picked up both https://patchwork.kernel.org/patch/11582831/ and https://patchwork.kernel.org/patch/11586221/? Also, what kind of failure are you seeing? Is it failing to enter suspend completely? Is it waking spuriously? Is it not entering S0ix? (sent up a fix for this recently) It would be helpful to have logs for a failure case. Could you collect some logs with dynamic debug for BT (see comment #11) and collect a btmon trace (btmon -w /tmp/btsnoop) when reproducing this issue?
Hi Pandit, I only used one patch on the 5.8.0-rc2 build, sorry. This time around a applied both patches and ran on 5.8.0-rc3 and 5.8.0-rc4 and am very happy to say the bug has completely disappeared! :) These are the machines that are now working (all that were affected): otcpl-dell-7386-whl otcpl-nuc-glk otcpl-cml-s-1 otcpl-dell-7390-cmlu otcpl-nuc-glk otcpl-cml-s-2 otcpl-cml-u So this fix is confirmed, thanks!
(In reply to Abhishek Pandit-Subedi from comment #25) > Todd, can you confirm you picked up both > https://patchwork.kernel.org/patch/11582831/ and > https://patchwork.kernel.org/patch/11586221/? > > Also, what kind of failure are you seeing? Is it failing to enter suspend > completely? Is it waking spuriously? Is it not entering S0ix? (sent up a fix > for this recently) > > It would be helpful to have logs for a failure case. Could you collect some > logs with dynamic debug for BT (see comment #11) and collect a btmon trace > (btmon -w /tmp/btsnoop) when reproducing this issue? Why haven't these patches been merged with stable? Linux 5.7.15 has neither of them.
Hi Artem, 5.7.16 got one of the fixes but not the other. I've sent a follow-up email to netdev@ to add the missing one and cc-ed you.
Fixed in 5.8.
*** Bug 208975 has been marked as a duplicate of this bug. ***