Bug 207629

Summary: BISECTED Bluetooth: hci0: command 0x2042 tx timeout - suspend fails - Dell XPS 9300, Dell XPS 7390, Dell Inspiron 7386, Intel NUC7JYB
Product: Drivers Reporter: Len Brown (lenb)
Component: BluetoothAssignee: linux-bluetooth (linux-bluetooth)
Status: RESOLVED CODE_FIX    
Severity: normal CC: abhishekpandit, aros, rjw, todd.e.brandt
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.7-rc1, 5.7-rc2, 5.7-rc3, 5.7-rc4, 5.7-rc5 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 208975    
Attachments: issue.def -- for recording known suspend/resume stress test issues
dmesg 5.7-rc5
dmesg 5.7-rc5

Description Len Brown 2020-05-08 21:12:56 UTC
Starting in 5.7-rc1, my Dell XPS 9300 fails to suspend about 60% of the time.
In 5.6.0, it runs thousands of suspend cycles without error.

The following hci0 message is present in dmesg for all of the failures, and none of the successes:

PM: Preparing system for sleep (s2idle)
Bluetooth: hci0: command 0x2042 tx timeout
PM: suspend exit

This may be related to bug 207169, though that mentions a different command.
Comment 1 Len Brown 2020-05-08 21:16:20 UTC
With Linux-5.7-rc4, this system sees no errors after hundreds of suspend/resume cycles, if Bluetooth is disabled via BIOS SETUP.
Comment 2 Len Brown 2020-05-09 17:28:58 UTC
With Linux-5.7-rc4 and Bluetooth enabled in BIOS SETUP...

# rmmod btusb
is sufficient to allow 20 suspend/resume cycles without error.

# modprobe btusb
results in 8 of 20 cycles failing.
Comment 3 Len Brown 2020-05-09 19:31:24 UTC
# hciconfig hci0 down
is sufficient to allow 20 suspend/resume cycles without error.

# hciconfig hci0 up

returns us to failing 13 of 20 cycles.
Comment 4 Len Brown 2020-05-11 03:05:54 UTC
dd522a7429b07e4441871ae75ebbfcf53635bdd4 is the first bad commit
commit dd522a7429b07e4441871ae75ebbfcf53635bdd4
Author: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Date:   Wed Mar 11 08:54:02 2020 -0700

    Bluetooth: Handle LE devices during suspend
    
    To handle LE devices, we must first disable passive scanning and
    disconnect all connected devices. Once that is complete, we update the
    whitelist and re-enable scanning
    
    Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
    Signed-off-by: Marcel Holtmann <marcel@holtmann.org>


Unfortunately this commit, by itself, does not cleanly revert from upstream.
Comment 5 Len Brown 2020-05-11 04:59:54 UTC
5.7-rc5 also fails.

It works with the following two commits reverted:

# "Bluetooth: Pause discovery and advertising during suspend"
git revert 4867bd007d25a8dfd4ffc558534f7aec8b361789

# "Bluetooth: Handle LE devices during suspend"
git revert dd522a7429b07e4441871ae75ebbfcf53635bdd4
Comment 6 Len Brown 2020-05-11 13:31:46 UTC
Created attachment 289077 [details]
issue.def -- for recording known suspend/resume stress test issues
Comment 7 Abhishek Pandit-Subedi 2020-05-11 19:11:54 UTC
Does your system rfkill Bluetooth on suspend? I think that's what was happening on https://bugzilla.kernel.org/show_bug.cgi?id=207169

Without rfkill (directly entering suspend), the issue seemed to disappear: https://bugzilla.kernel.org/show_bug.cgi?id=207169#c14

The bluetooth commits you've identified added a suspend hook in bluetooth so it gracefully disconnects active connections and sets up wakeup hooks on reconnections. If Bluetooth is rfkill-ed, it won't be able to do this and will cmd timeout as seen.
Comment 8 Len Brown 2020-05-12 16:00:50 UTC
I'm invoking suspend via sleepgraph, which does
"echo freeze > /sys/power/state" without any prior interaction
with any user-space distro hooks.

The distro in this case is Ubuntu 18.04, which comes pre-installed on this box.

Exactly what would you like me to observe, and how?
Comment 9 Abhishek Pandit-Subedi 2020-05-12 22:12:44 UTC
Could you collect grab dmesg w/ dynamic debug enabled? This will give us some confirmation about what's going on in Bluetooth (especially since 0x0c1a is sent before 0x2042 and that should be timing out first).

Dyndebug instructions:
https://www.kernel.org/doc/html/latest/admin-guide/dynamic-debug-howto.html

Please enable debug for files hci_core.c, hci_request.c and hci_event.c:
(for example: echo 'file hci_core.c +p' > <debugfs>/dynamic_debug/control)

Collecting this is probably the most useful observation of this bunch.

--

I'm not familiar with what power hooks Ubuntu uses but it might be worth checking the systemd-sleep programs to see if any of them are going to rfkill bluetooth:

https://wiki.archlinux.org/index.php/Power_management#Hooks_in_/usr/lib/systemd/system-sleep

--

You can also try running `rfkill event` in shell while suspending. This might capture any service that is rfkill-ing without your knowledge.

http://manpages.ubuntu.com/manpages/bionic/man8/rfkill.8.html
Comment 10 Len Brown 2020-05-13 04:00:17 UTC
Created attachment 289101 [details]
dmesg 5.7-rc5

# file 'hci_core.c +p' > /sys/kernel/debug/dynamic_debug/control
# file 'hci_request.c +p' > /sys/kernel/debug/dynamic_debug/control
# file 'hci_event.c +p' > /sys/kernel/debug/dynamic_debug/control

# sleepgraph -m freeze -multi 10 0

where the 1st and 5th suspend cycles succeeded, and the other 8 failed due to bluetooth.

dmesg attached.
Comment 11 Len Brown 2020-05-13 04:11:25 UTC
Created attachment 289103 [details]
dmesg 5.7-rc5

oops, copy/paste typo in my script, did it right this time:

# file 'hci_core.c +p' > /sys/kernel/debug/dynamic_debug/control
# file 'hci_request.c +p' > /sys/kernel/debug/dynamic_debug/control
# file 'hci_event.c +p' > /sys/kernel/debug/dynamic_debug/control

# sleepgraph -m freeze -multi 10 0

This time the 1st, 4, 5, 6th and 8th failed.

full dmesg attached.
Comment 12 Len Brown 2020-05-13 04:38:50 UTC
Re: ubuntu, systemd etc...

Note that the same system with Ubuntu 18.04 successfully runs thousands of
suspend/resume cycles on top of the following kernels:

5.6.8+
5.6.0+
5.5.0+
5.4.0+
5.0.0-1050-oem-osp1
5.0.0-1050-oem-osp1
5.0.0-1031-oem-osp1

and it doesn't fail until the kernel includes the bisected commit.

That said, it appears that Ubuntu 18.04's systemd invokes "tlp suspend"
were it to be involved in the suspend process.  So just in case,
I replaced /usr/bin/tlp and /usr/bin/bluetooth with /bin/true,
but the system fails the same way with those programs absent.
Comment 13 Len Brown 2020-05-13 04:56:43 UTC
Re: rfkill

$ sudo rfkill
ID TYPE      DEVICE      SOFT      HARD
 0 wlan      phy0   unblocked unblocked
 1 bluetooth hci0   unblocked unblocked

$ sudo rfkill event
2020-05-13 00:41:30,772265-0400: idx 0 type 1 op 0 soft 0 hard 0
2020-05-13 00:41:30,772289-0400: idx 1 type 2 op 0 soft 0 hard 0

Here I ran 10 suspend/resume attempts (4 succeeded),
but rfkill didn't print any more lines.

If I manually run "bluetooth off", rfkill prints this:

2020-05-13 00:49:00,003851-0400: idx 1 type 2 op 2 soft 1 hard 0

If I manually run "bluetooth on", rfkill prints this:

2020-05-13 00:49:15,297390-0400: idx 1 type 2 op 2 soft 0 hard 0

I get the same two lines by turning bluetooth off/on in the
system settings GUI, no matter if the bluetooth program is available or not.
Comment 14 Abhishek Pandit-Subedi 2020-05-13 16:51:57 UTC
Ok -- this one is not a systemic problem as I predicted. This looks to be the controller not responding to a specific command. In the logs below, the controller is responding to everything except for 0x2042 and even that is intermittent (responds to some 0x2042 but not others).

So this is increasingly starting to look like a controller firmware problem. Looks like the Dell XPS 9300 uses an Intel AX wifi+bt chipset. You're using firmware from 2019 and there's been several updates in the past few months.

Could you try a newer firmware to see if this is still happening? The latest update was in April: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/?id=c5ac1add86be7a90d0c573c957e37c610f3d7f25

If you can still repro it w/ new firmware, could you collect an hci trace when it occurs?
`btmon -w <file location>` will collect the trace.

--
[  855.188312] PM: Preparing system for sleep (s2idle)
[  855.188792] hci0 opcode 0x0c1a plen 1
[  855.188793] skb len 4
[  855.188794] hci0 opcode 0x2042 plen 6
[  855.188795] skb len 9
[  855.188795] length 2
[  855.188799] hci0 cmd_cnt 1 cmd queued 2
[  855.188801] hci0 type 1 len 4
[  855.327410] hci0
[  855.327412] hci0 Event packet
[  855.327414] hci0 status 0x00
[  855.327415] hci0 opcode 0x0c1a
[  855.327416] opcode 0x0c1a status 0x00
[  855.327417] hci0 opcode 0x0c1a
[  855.327421] hci0 cmd_cnt 1 cmd queued 1
[  855.327422] hci0 type 1 len 9
[  855.352390] hci0
[  855.352391] hci0 Event packet
[  855.352392] hci0 status 0x00
[  855.352392] hci0 opcode 0x2042
[  855.352393] opcode 0x2042 status 0x00
[  855.352394] hci0 opcode 0x2042
[  855.352394] hci0: Request complete opcode=0x2042, status=0x0

# Notice that 0x2042 completes successfully

[  855.352407] hci0 opcode 0x0c05 plen 1
[  855.352408] skb len 4
[  855.352409] hci0 opcode 0x0c1a plen 1
[  855.352410] skb len 4
[  855.352411] hci0 opcode 0x2042 plen 6
[  855.352411] skb len 9
[  855.352412] hci0: LE passive scan with whitelist = 1
[  855.352412] hci0 opcode 0x2041 plen 8
[  855.352413] skb len 11
[  855.352413] hci0 opcode 0x2042 plen 6
[  855.352413] skb len 9
[  855.352414] length 5
[  855.352416] hci0 cmd_cnt 1 cmd queued 5
[  855.352417] hci0 type 1 len 4
[  855.353392] hci0
[  855.353392] hci0 Event packet
[  855.353394] hci0 opcode 0x0c05
[  855.353394] opcode 0x0c05 status 0x00
[  855.353395] hci0 opcode 0x0c05
[  855.353398] hci0 cmd_cnt 1 cmd queued 4
[  855.353398] hci0 type 1 len 4
[  855.354342] hci0
[  855.354343] hci0 Event packet
[  855.354343] hci0 status 0x00
[  855.354344] hci0 opcode 0x0c1a
[  855.354344] opcode 0x0c1a status 0x00
[  855.354345] hci0 opcode 0x0c1a
[  855.354346] hci0 cmd_cnt 1 cmd queued 3
[  855.354347] hci0 type 1 len 9
[  857.354012] hci0: Timed out waiting for suspend

# An early timeout in the suspend path

[  857.354014] hci0: Bit 5 is set
[  857.354096] hci0 opcode 0x0c05 plen 1
[  857.354098] skb len 4
[  857.354099] hci0 opcode 0x2042 plen 6
[  857.354099] skb len 9
[  857.354100] hci0: LE passive scan with whitelist = 1
[  857.354101] hci0 opcode 0x2041 plen 8
[  857.354101] skb len 11
[  857.354101] hci0 opcode 0x2042 plen 6
[  857.354102] skb len 9
[  857.354102] length 4
[  857.354107] hci0 cmd_cnt 0 cmd queued 6
[  857.385937] Bluetooth: hci0: command 0x2042 tx timeout
[  857.385938] Bluetooth: hci0: No way to reset. Ignoring and continuing

# Enabling passive scan failed

[  857.385941] hci0 cmd_cnt 1 cmd queued 6
[  857.385943] hci0 type 1 len 11
[  857.386342] hci0
[  857.386342] hci0 Event packet
[  857.386344] hci0 status 0x00
[  857.386344] hci0 opcode 0x2041
[  857.386345] opcode 0x2041 status 0x00
[  857.386345] hci0 opcode 0x2041
[  857.386347] hci0 cmd_cnt 1 cmd queued 5
[  857.386348] hci0 type 1 len 9
[  857.388421] hci0
[  857.388422] hci0 Event packet
[  857.388423] hci0 status 0x00
[  857.388424] hci0 opcode 0x2042
[  857.388425] opcode 0x2042 status 0x00
[  857.388425] hci0 opcode 0x2042
[  857.388428] hci0: Request complete opcode=0x2042, status=0x0

# 0x2042 completes successfully again

[  857.388431] hci0 cmd_cnt 1 cmd queued 4
[  857.388431] hci0 type 1 len 4
[  857.389383] hci0
[  857.389384] hci0 Event packet
[  857.389385] hci0 opcode 0x0c05
[  857.389386] opcode 0x0c05 status 0x00
[  857.389386] hci0 opcode 0x0c05
[  857.389389] hci0 cmd_cnt 1 cmd queued 3
[  857.389390] hci0 type 1 len 9
[  857.400353] hci0
[  857.400355] hci0 Event packet
[  857.400357] hci0 status 0x00
[  857.400358] hci0 opcode 0x2042
[  857.400359] opcode 0x2042 status 0x00
[  857.400359] hci0 opcode 0x2042
[  857.400363] hci0 cmd_cnt 1 cmd queued 2
[  857.400365] hci0 type 1 len 11
[  857.401330] hci0
[  857.401332] hci0 Event packet
[  857.401333] hci0 status 0x00
[  857.401334] hci0 opcode 0x2041
[  857.401334] opcode 0x2041 status 0x00
[  857.401335] hci0 opcode 0x2041
[  857.401337] hci0 cmd_cnt 1 cmd queued 1
[  857.401338] hci0 type 1 len 9
[  857.403349] hci0
[  857.403349] hci0 Event packet
[  857.403350] hci0 status 0x00
[  857.403351] hci0 opcode 0x2042
[  857.403351] opcode 0x2042 status 0x00
[  857.403352] hci0 opcode 0x2042
[  857.403353] hci0: Request complete opcode=0x2042, status=0x0
[  857.429909] PM: suspend exit
Comment 15 Len Brown 2020-05-13 21:55:19 UTC
This same Linux Bluetooth regression is seen on additional platforms:

man:Dell Inc. | plat:XPS 13 7390 | cpu:Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz | bios:1.5.1 | biosdate:03/09/2020

man:Dell Inc. | plat:Inspiron 7386 | cpu:Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz | bios:1.2.0 | biosdate:12/05/2018

man:Intel Corporation | plat:NUC7JYB | cpu:Intel(R) Celeron(R) J4005 CPU @ 2.00GHz | bios:JYGLKCPX.86A.0053.2019.1015.1510 | biosdate:10/15/2019

Their failure rates are slighly higher than the Dell 9300:

5.7.0-rc4+	otcpl-dell-7386-whl	freeze-x3547	3261	91.94%
5.7.0-rc4+	otcpl-nuc-kbl	        freeze-x3207	2821	87.96%
5.7.0-rc4+	otcpl-dell-7390-cmlu	freeze-x2879	2275	79.02%
5.7.0-rc4+	lenb-Dell-XPS-13-9300	freeze-x2426	1429	58.90%
Comment 16 Len Brown 2020-05-13 22:59:08 UTC
re: firmware

dmesg showed this:

[    7.833706] Bluetooth: hci0: Minimum firmware build 1 week 10 2014
[    7.835192] Bluetooth: hci0: Found device firmware: intel/ibt-19-32-4.sfi
[    9.324793] Bluetooth: hci0: Waiting for firmware download to complete
[    9.325473] Bluetooth: hci0: Firmware loaded in 1459099 usecs
[    9.355517] Bluetooth: hci0: Firmware revision 0.0 build 62 week 31 2019

even though there are apparently higher numbered of ibt in /lib/firmware/intel.

I downloaded linux-firmware-c5ac1add86be7a90d0c573c957e37c610f3d7f25.tar.gz
and I replaced /lib/firmware/intel/ with the new copy, but it appears
that the driver did not find a newer version for this device:

[    7.766736] Bluetooth: hci0: Firmware revision 0.0 build 62 week 31 2019

What am I missing?

In any case, a bunch of machines are failing with the firmware that they are running, when they used to work before this regression, and so this appears to be more of a Linux regression, than a device-specific firmware issue?
Comment 17 Abhishek Pandit-Subedi 2020-05-13 23:15:35 UTC
I found another patch in Chromium repository that seems to work around this issue. It comes down to conditional cmd execution that is specific to the passive scanning code. It explains why the issue occurs specifically on the passive scanning and why it only happens sometimes.

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1636009

Could you try this patch and see if it fixes your issue? If yes, I'll try to get it merged upstream.
Comment 18 Abhishek Pandit-Subedi 2020-05-14 01:47:03 UTC
I looked at that patch again and it looks like it works around something that is not upstreamed so please disregard comment #17.

--

re comment #16 - 

Checking on a recent build, the firmware revision I see is build 128 week 11 2020 (I think this is slightly older than the one I linked). I'm not sure why you're seeing an older firmware.
Comment 19 Len Brown 2020-06-03 14:32:55 UTC
My test case went away (Dell 9300)
After screwing around with /lib/firmware/intel
the failure went away, and restoring that directory, the failure doesn't come back.
Comment 20 Rafael J. Wysocki 2020-06-03 14:37:13 UTC
(In reply to Abhishek Pandit-Subedi from comment #14)
> Ok -- this one is not a systemic problem as I predicted. This looks to be
> the controller not responding to a specific command. In the logs below, the
> controller is responding to everything except for 0x2042 and even that is
> intermittent (responds to some 0x2042 but not others).
> 
> So this is increasingly starting to look like a controller firmware problem.

Which totally doesn't matter.

Evidently, there are multiple cases in which the controller stops responding and that must not be a reason for failing the system suspend (that may mean refusing to suspend after a user has closed a laptop lid or similar).

The controller may not be able to suspend cleanly, but that doesn't matter.

There still is time to recover during system resume and if that fails, it will just not work going forward and so be it.

Failing system suspend is not an option.  You can do that only in critical situations.
Comment 21 Abhishek Pandit-Subedi 2020-06-03 16:17:56 UTC
re #19 - Chromium had an internal patch that was conditionally sending the passive scanning command. After removing it, we were getting command disallowed. This is very likely what you were seeing. There's a patch fixing this on the mailing list now: https://patchwork.kernel.org/patch/11582831/

re #20 - You make a good point. A second problem I encountered with higher usage of this code is that if the controller stops responding (i.e. it crashes) and we can't reset it, it will also prevent suspend. This one is unacceptable.

Likely, the fix from Bluetooth end would be just to always allow suspend but log an error here: https://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git/tree/net/bluetooth/hci_core.c#n3368

Expect a patch today on the mailing list with a fix for this.
Comment 22 Rafael J. Wysocki 2020-06-04 14:53:52 UTC
Patch to test: https://patchwork.kernel.org/patch/11586221/
Comment 23 Todd Brandt 2020-06-25 14:47:40 UTC
I tried running this patch over the 5.8.0-rc2 upstream kernel and there was no appreciable difference in the occurrence of this error. I have 5 machines (in addition to len's) that have been getting this error since 5.7.0-rc1.

KERNEL(w patch) HOST                    TESTS           FAILS
5.8.0-rc2+      otcpl-dell-7386-whl     freeze-x1141	1073	94.04%
5.8.0-rc2+      otcpl-cml-s-1           freeze-x1064	983	92.39%
5.8.0-rc2+      otcpl-dell-7390-cmlu    freeze-x1631	1419	87.00%
5.8.0-rc2+      otcpl-nuc-glk           freeze-x1083	929	85.78%
5.8.0-rc2+      otcpl-cml-s-2           freeze-x1172	869	74.15%

Without the patch on 5.8.0-rc1 I get this:

KERNEL(w/o patch) HOST                   TESTS          FAILS
5.8.0-rc1+      otcpl-dell-7386-whl      freeze-x2876	2698	93.81%
5.8.0-rc1+      otcpl-cml-s-1            freeze-x2359	2128	90.21%
5.8.0-rc1+      otcpl-dell-7390-cmlu     freeze-x3655	3268	89.41%
5.8.0-rc1+      otcpl-nuc-glk            freeze-x1089	911	83.65%
5.8.0-rc1+      otcpl-cml-s-2            freeze-x2748	1991	72.45%

Len you'll have to try it on your machine.
Comment 24 Todd Brandt 2020-06-25 14:53:45 UTC
Here's a comprehensive list of all the machines in our power lab that are seeing this issue. It started in 5.7.0-rc1:

5.7.0-rc1+	otcpl-dell-7386-whl	freeze-x4344	3892	89.59%
5.7.0-rc1+	otcpl-nuc-kbl		freeze-x984	850	86.38%
5.7.0-rc1+	lenb-Dell-XPS-13-9300	freeze-x607	359	59.14%
5.7.0-rc2+	otcpl-dell-7386-whl	freeze-x4341	3850	88.69%
5.7.0-rc2+	otcpl-cml-u		freeze-x2070	1797	86.81%
5.7.0-rc2+	otcpl-nuc-kbl		freeze-x3697	3164	85.58%
5.7.0-rc2+	lenb-Dell-XPS-13-9300	freeze-x402	214	53.23%
5.7.0-rc3+	otcpl-cml-u		freeze-x4020	3667	91.22%
5.7.0-rc3+	otcpl-dell-7386-whl	freeze-x4404	3979	90.35%
5.7.0-rc3+	otcpl-nuc-kbl		freeze-x3777	3288	87.05%
5.7.0-rc3+	lenb-Dell-XPS-13-9300	freeze-x612	359	58.66%
5.7.0-rc4+	otcpl-cml-u		freeze-x3460	3185	92.05%
5.7.0-rc4+	otcpl-dell-7386-whl	freeze-x3547	3261	91.94%
5.7.0-rc4+	otcpl-nuc-kbl		freeze-x3207	2821	87.96%
5.7.0-rc4+	otcpl-dell-7390-cmlu	freeze-x2879	2275	79.02%
5.7.0-rc4+	lenb-Dell-XPS-13-9300	freeze-x2426	1429	58.90%
5.7.0-rc4+	lenb-Dell-XPS-13-9300	freeze-x931	487	52.31%
5.7.0-rc5+	otcpl-nuc-kbl		freeze-x3254	2950	90.66%
5.7.0-rc5+	otcpl-dell-7386-whl	freeze-x3769	3400	90.21%
5.7.0-rc5+	otcpl-cml-u		freeze-x551	479	86.93%
5.7.0-rc5+	otcpl-dell-3493-icl	freeze-x3628	2992	82.47%
5.7.0-rc5+	otcpl-cml-s-1		freeze-x2949	1675	56.80%
5.7.0-rc5+	lenb-Dell-XPS-13-9300	freeze-x20	11	55.00%
5.7.0-rc6+	otcpl-dell-7386-whl	freeze-x3762	3417	90.83%
5.7.0-rc6+	otcpl-cml-u		freeze-x2420	2181	90.12%
5.7.0-rc6+	otcpl-nuc-glk		freeze-x1049	881	83.98%
5.7.0-rc6+	otcpl-cml-s-1		freeze-x2862	1581	55.24%
5.7.0-rc7+	otcpl-dell-7386-whl	freeze-x3818	3529	92.43%
5.7.0-rc7+	otcpl-nuc-glk		freeze-x3101	2637	85.04%
5.7.0-rc7+	otcpl-dell-7390-cmlu	freeze-x2971	2508	84.42%
5.7.0-rc7+	otcpl-cml-s-1		freeze-x1958	1649	84.22%
5.7.0-rc7+	otcpl-cml-s-2		freeze-x2920	1634	55.96%
5.7.0+		otcpl-dell-7386-whl	freeze-x3836	3550	92.54%
5.7.0+		otcpl-dell-7386-whl	freeze-x3804	3499	91.98%
5.7.0+		otcpl-cml-u		freeze-x1145	1044	91.18%
5.7.0+		otcpl-nuc-glk		freeze-x3160	2815	89.08%
5.7.0+		otcpl-nuc-glk		freeze-x3247	2854	87.90%
5.7.0+		otcpl-dell-7390-cmlu	freeze-x2966	2530	85.30%
5.7.0+		otcpl-dell-7390-cmlu	freeze-x2911	2395	82.27%
5.7.0+		otcpl-cml-s-2		freeze-x2879	1619	56.23%
Comment 25 Abhishek Pandit-Subedi 2020-06-26 17:31:00 UTC
Todd, can you confirm you picked up both https://patchwork.kernel.org/patch/11582831/ and https://patchwork.kernel.org/patch/11586221/?

Also, what kind of failure are you seeing? Is it failing to enter suspend completely? Is it waking spuriously? Is it not entering S0ix? (sent up a fix for this recently)

It would be helpful to have logs for a failure case. Could you collect some logs with dynamic debug for BT (see comment #11) and collect a btmon trace (btmon -w /tmp/btsnoop) when reproducing this issue?
Comment 26 Todd Brandt 2020-07-14 00:20:38 UTC
Hi Pandit, I only used one patch on the 5.8.0-rc2 build, sorry.

This time around a applied both patches and ran on 5.8.0-rc3 and 5.8.0-rc4 and am very happy to say the bug has completely disappeared! :)

These are the machines that are now working (all that were affected):

otcpl-dell-7386-whl
otcpl-nuc-glk
otcpl-cml-s-1
otcpl-dell-7390-cmlu
otcpl-nuc-glk
otcpl-cml-s-2
otcpl-cml-u

So this fix is confirmed, thanks!
Comment 27 Artem S. Tashkinov 2020-08-20 13:38:54 UTC
(In reply to Abhishek Pandit-Subedi from comment #25)
> Todd, can you confirm you picked up both
> https://patchwork.kernel.org/patch/11582831/ and
> https://patchwork.kernel.org/patch/11586221/?
> 
> Also, what kind of failure are you seeing? Is it failing to enter suspend
> completely? Is it waking spuriously? Is it not entering S0ix? (sent up a fix
> for this recently)
> 
> It would be helpful to have logs for a failure case. Could you collect some
> logs with dynamic debug for BT (see comment #11) and collect a btmon trace
> (btmon -w /tmp/btsnoop) when reproducing this issue?

Why haven't these patches been merged with stable? Linux 5.7.15 has neither of them.
Comment 28 Abhishek Pandit-Subedi 2020-08-20 16:59:29 UTC
Hi Artem,

5.7.16 got one of the fixes but not the other. I've sent a follow-up email to netdev@ to add the missing one and cc-ed you.
Comment 29 Artem S. Tashkinov 2020-10-13 14:45:45 UTC
Fixed in 5.8.
Comment 30 Artem S. Tashkinov 2020-10-13 14:46:10 UTC
*** Bug 208975 has been marked as a duplicate of this bug. ***