Bug 196649

Summary: Dell XPS 9360: ath10k: reboot required to restore wifi and unblock Low Power Idle - firmware crashed! failed to receive control response completion, polling...
Product: Drivers
Reporter: Anton Kochkov (anton.kochkov)
Component: network-wireless
Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED INVALID
Severity: normal
CC: bimokn97, bj.cardon, danjared, dfasre, henrique.ferreiro, lenb, mieabby, rockdreamer, smit17av, superm1, todd.e.brandt
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.13_rc4
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 178231
Attachments: full dmesg of linux kernel 4.13_rc4
Part of dmesg for Linux 4.14.36-1-MANJARO
dmesg with failure in 5.3.0-23-generic
dmesg with failure in 5.4-rc7
issue.def
lspci -vvnnnxxxx output
issue.def
sleepgraph for ath10k death in Linux-5.6-rc4
sleepgraph for ath10k death in Linux-5.6-rc3 ACPI/S3

Description Anton Kochkov 2017-08-13 07:40:42 UTC
Created attachment 257905 [details]
full dmesg of linux kernel 4.13_rc4

Laptop: Dell XPS 15 9560

Linux kernel version - 4.13_rc4
linux-firmware-20170622

While I was running scp to a remote server over the wifi connection, the firmware crashed:

[ 1952.185717] ath10k_pci 0000:02:00.0: firmware crashed! (uuid n/a)
[ 1952.185726] ath10k_pci 0000:02:00.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:1535
[ 1952.185730] ath10k_pci 0000:02:00.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 0
[ 1952.187328] ath10k_pci 0000:02:00.0: firmware ver WLAN.RM.4.4-00022-QCARMSWPZ-2 api 6 features wowlan,ignore-otp crc32 4d458559
[ 1952.188401] ath10k_pci 0000:02:00.0: board_file api 2 bmi_id N/A crc32 07ee144e
[ 1952.188406] ath10k_pci 0000:02:00.0: htt-ver 3.32 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
[ 1952.201429] ath10k_pci 0000:02:00.0: failed to get memcpy hi address for firmware address 4: -16
[ 1952.201430] ath10k_pci 0000:02:00.0: failed to read firmware dump area: -16
[ 1952.201430] ath10k_pci 0000:02:00.0: Copy Engine register dump:
[ 1952.201440] ath10k_pci 0000:02:00.0: [00]: 0x00034400   0   0   3   3
[ 1952.201448] ath10k_pci 0000:02:00.0: [01]: 0x00034800  15  15 187 188
[ 1952.201457] ath10k_pci 0000:02:00.0: [02]: 0x00034c00  18  18  80  81
[ 1952.201466] ath10k_pci 0000:02:00.0: [03]: 0x00035000   4   4   6   4
[ 1952.201475] ath10k_pci 0000:02:00.0: [04]: 0x00035400 1485 1483  38 230
[ 1952.201484] ath10k_pci 0000:02:00.0: [05]: 0x00035800   0   0  64   0
[ 1952.201492] ath10k_pci 0000:02:00.0: [06]: 0x00035c00   5   4   4   2
[ 1952.201501] ath10k_pci 0000:02:00.0: [07]: 0x00036000   0   0   0   1
[ 1952.265988] ieee80211 phy0: Hardware restart was requested
[ 1952.951516] ath10k_pci 0000:02:00.0: Unknown eventid: 90118
[ 1953.055767] ath10k_pci 0000:02:00.0: device successfully recovered
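To triage a saved dmesg such as the one above, the crash and recovery lines can be counted. This is only a minimal sketch: it assumes the message strings match the ones quoted in this report, and the function name ath10k_crash_summary is hypothetical.

```shell
# Hypothetical helper: summarize ath10k firmware crashes in a saved dmesg file.
# Assumption: the kernel prints the same strings as in the excerpt above.
ath10k_crash_summary() {
    # $1: path to a saved dmesg log
    # grep -c exits non-zero when the count is 0, hence the "|| true"
    crashes=$(grep -c 'ath10k_pci .*firmware crashed!' "$1" || true)
    recovered=$(grep -c 'ath10k_pci .*device successfully recovered' "$1" || true)
    echo "crashes=$crashes recovered=$recovered"
}
```

For example, `dmesg > /tmp/dmesg.txt; ath10k_crash_summary /tmp/dmesg.txt`. More crashes than recoveries would suggest the device did not come back, as in the later comments.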
Comment 1 dfasre 2018-05-02 12:44:18 UTC
Created attachment 275733 [details]
Part of dmesg for Linux 4.14.36-1-MANJARO

I also have Dell XPS 15 9560 and faced a very similar issue after resuming from suspend. My version of linux-firmware is 20180402.8c1e439-1 from Manjaro Linux repository. Attached is the latest part of dmesg, right before & after the suspend.
Comment 2 Len Brown 2019-11-19 03:23:56 UTC
*** Bug 203389 has been marked as a duplicate of this bug. ***
Comment 3 Len Brown 2019-11-19 03:24:23 UTC
*** Bug 201743 has been marked as a duplicate of this bug. ***
Comment 4 Len Brown 2019-11-19 03:48:26 UTC
On a Dell XPS 9360 running the latest BIOS (2.12.0, 05/26/2019)
Running latest Ubuntu (19.10) and its latest kernel (Linux-5.0.0-23-generic)

This issue can be reproduced by running overnight suspend/resume endurance tests, e.g.:

sudo sleepgraph -m freeze -rtcwake 3 -multi 3500 0

If the ath10k driver is loaded and the network is UP before the tests,
the driver will crash during resume before completing the tests.
After the ath10k fails, WIFI is not accessible, and all subsequent
system s2idle cycles fail to reach the deepest power saving state.

The tests will complete if the ath* drivers are unloaded before testing,
and after the tests, the driver can be re-loaded and the network works.
However, unloading the driver prevents the Dell from getting into
the deepest power saving state.

The tests will also complete and the system will successfully reach
the deepest power saving state if the driver is kept loaded,
but the WIFI interface is taken DOWN before the tests.
The interface can be brought back UP after the tests
and WIFI works.

This workaround is not necessary with competing WIFI adapters...
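The interface DOWN/UP workaround described above could be wrapped around the endurance run roughly as follows. This is a sketch under assumptions: the comment does not say how the interface was taken down, so `ip link` is a guess, and the interface name is whatever the system uses (wlp58s0 below is hypothetical). With DRY_RUN=1 the commands are only printed.

```shell
# Run a command, or just print it when DRY_RUN=1 (useful for checking the sequence).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: take the wifi interface down, run the sleepgraph
# endurance test quoted above, then bring the interface back up.
endurance_with_wifi_down() {
    iface=$1    # e.g. wlp58s0 (assumed name, check with "ip link")
    run ip link set dev "$iface" down
    run sleepgraph -m freeze -rtcwake 3 -multi 3500 0
    run ip link set dev "$iface" up
}
```

Usage (as root): `endurance_with_wifi_down wlp58s0`, or `DRY_RUN=1 endurance_with_wifi_down wlp58s0` to see the commands first.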
Comment 5 Len Brown 2019-11-19 04:03:13 UTC
Created attachment 285971 [details]
dmesg with failure in 5.3.0-23-generic

The attached dmesg shows the failure, complete with multiple oops and ATH register dump.
Comment 6 Len Brown 2019-11-19 04:14:04 UTC
Created attachment 285973 [details]
dmesg with failure in 5.4-rc7

The symptom and the workaround appear to be the same in Linux 5.4-rc7;
however, the output has changed slightly since Linux-5.3:

[21374.171831] ath10k_pci 0000:3a:00.0: failed to receive control response completion, polling..
[21374.201988] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x00000001 at 0x00034430: -110
...
[21390.255193] Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue.
[21390.255258] WARNING: CPU: 1 PID: 30231 at net/mac80211/util.c:2197 ieee80211_reconfig+0x9b/0x1500 [mac80211]
Comment 7 Todd Brandt 2019-11-22 15:38:20 UTC
Created attachment 286025 [details]
issue.def
Comment 8 Mario Limonciello 2019-12-02 16:57:44 UTC
Can you please confirm whether L0s is disabled?  On the PCIe root port offset 0x50 should be set to 042 (Enable L1, disable L0s).
Comment 9 Mario Limonciello 2019-12-02 16:58:18 UTC
To clarify: I meant 0x42 not 042.
Comment 10 Anton Kochkov 2019-12-02 17:02:05 UTC
Created attachment 286153 [details]
lspci -vvnnnxxxx output

Attaching the verbose lspci output, in case you need other registers too.
Comment 11 Mario Limonciello 2019-12-02 17:07:41 UTC
Yeah it looks like root port #9 0x50 is set to 0x42, so that's not the problem on this system.
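For anyone repeating this check, the value can be decoded from the ASPM Control field, which is bits 1:0 of the PCIe Link Control register. A small sketch, assuming (as stated above) that Link Control sits at config offset 0x50 on this root port; the offset differs between devices, and the function name is hypothetical.

```shell
# Decode the ASPM Control field (bits 1:0 of PCIe Link Control).
# Assumption: the value passed in was read from the Link Control register,
# which the comments above place at offset 0x50 on this root port.
aspm_decode() {
    # $1: register value in hex, with or without a 0x prefix (e.g. 42 or 0x42)
    val=$(( 0x${1#0x} ))
    case $(( val & 3 )) in
        0) echo "L0s disabled, L1 disabled" ;;
        1) echo "L0s enabled, L1 disabled" ;;
        2) echo "L0s disabled, L1 enabled" ;;
        3) echo "L0s enabled, L1 enabled" ;;
    esac
}
```

On a live system (root, pciutils installed) this could be fed from setpci, e.g. `aspm_decode "$(setpci -s 00:1c.4 50.b)"`, where 00:1c.4 is the root port seen in the later logs of this report.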
Comment 12 Len Brown 2020-03-04 15:58:54 UTC
Created attachment 287797 [details]
issue.def
Comment 13 Len Brown 2020-03-04 20:12:46 UTC
Sometimes the ath10k death does not print "firmware crashed"; instead, dmesg looks like the excerpt below from Linux-5.6-rc3.  The impact is the same.  The system must be rebooted to get the network back, and until that reboot, the system is also blocked from sleeping in its deepest Low Power Idle state.


[ 4671.137325] PM: noirq resume of devices complete after 49.448 msecs
[ 4671.140716] PM: early resume of devices complete after 1.051 msecs
[ 4671.335270] usb 1-3: reset full-speed USB device number 2 using xhci_hcd
[ 4672.358779] ath10k_pci 0000:3a:00.0: failed to receive control response completion, polling..
[ 4672.388942] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x00000001 at 0x00034430: -110
[ 4672.419048] ath10k_pci 0000:3a:00.0: failed to wake target for read32 at 0x00034444: -110
[ 4672.449450] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x0000001e at 0x00034430: -110
[ 4672.479562] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x00000001 at 0x00034830: -110
[ 4672.509668] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x00000001 at 0x00035430: -110
[ 4672.539771] ath10k_pci 0000:3a:00.0: failed to wake target for read32 at 0x00035444: -110
[ 4672.569885] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x0000001e at 0x00035430: -110
[ 4672.599991] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x0000001e at 0x00034830: -110
[ 4672.630098] ath10k_pci 0000:3a:00.0: failed to wake target for write32 of 0x00000001 at 0x00034c30: -110
[ 4674.022216] ath10k_pci 0000:3a:00.0: ctl_resp never came in (-110)
[ 4674.022222] ath10k_pci 0000:3a:00.0: failed to connect to HTC: -110
[ 4677.369488] ath10k_warn: 122 callbacks suppressed
Comment 14 Mario Limonciello 2020-03-04 20:17:59 UTC
I wonder if this is the right time for proposing a bandaid - does it respond to PCI config space writes?  Maybe a PCI reset?  Or if not, maybe a bigger bandaid of a slot reset?
Comment 15 Len Brown 2020-03-04 20:25:37 UTC
I have an example death where PCI accesses do not respond, let me dig that one up.

The only bandaid that I know works, so far, is taking the interface DOWN before suspend and bringing it back UP after resume.  However, I have tested this only before/after a series of multiple suspend/resume cycles, and have not yet tested it around every single suspend/resume.
Comment 16 Len Brown 2020-03-04 20:28:48 UTC
Created attachment 287799 [details]
sleepgraph for ath10k death in Linux-5.6-rc4

In this example, ath10k caused resume to take 17,000 ms.

Many users would give up and power-cycle their machine if it took that long to resume.
Comment 17 Mario Limonciello 2020-03-04 20:33:27 UTC
Looking through that, though: if a bandaid is an option, I would think that counting a few "failed to wake target for write32" failures returning -110 is enough to bail.  Dropping the hammer at that point should hopefully also mean not 17 seconds of failed writes, but maybe 1-2 seconds?
Comment 18 Len Brown 2020-03-04 20:38:27 UTC
After the failure in comment 16, subsequent s2idle attempts saw this:

[ 2324.332355] ath10k_pci 0000:3a:00.0: can't change power state from D0 to D3hot (config space inaccessible)

...
[ 2326.228463] ath10k_pci 0000:3a:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Comment 19 Len Brown 2020-03-04 20:42:25 UTC
Re: 17 seconds of failure
I'd hope we could recognize a failure 100x faster, rather than just 10x.
i.e., is 170 ms of retries not sufficient to decide that it is dead?
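To put numbers on this, the span covered by the wake failures can be measured from a saved dmesg. In the excerpt from comment 13 the failures arrive roughly 30 ms apart, so a handful of them already fits well inside 170 ms. A rough sketch, assuming the default dmesg timestamp format "[ seconds.micros]"; the function name is hypothetical.

```shell
# Hypothetical helper: count "failed to wake target" lines in a saved dmesg
# and report the time between the first and the last one.
wake_failure_span() {
    # $1: path to a saved dmesg log with "[ 1234.567890]" style timestamps
    awk '/failed to wake target/ {
        ts = $0
        sub(/^\[ */, "", ts)    # drop the leading bracket and padding
        sub(/\].*/, "", ts)     # drop everything after the timestamp
        if (n == 0) first = ts
        last = ts
        n++
    }
    END { printf "%d failures over %.0f ms\n", n, (last - first) * 1000 }' "$1"
}
```

This only measures what the log shows; it says nothing about where a driver-side threshold should actually be set.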
Comment 20 Mario Limonciello 2020-03-04 20:47:46 UTC
OK so PCI device reset isn't viable, but slot reset might still be.

As a proof of concept, you can probably try it without even a kernel patch.  In the failed state, running:
echo 0 > /sys/bus/pci/slots/$NUMBER/power
echo 1 > /sys/bus/pci/slots/$NUMBER/power
should be enough, I'd think.
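A slightly more defensive version of the two echo commands above might look like this. It is a sketch only: it assumes the slot exposes a writable power attribute, which is not the case on every machine, and the one-second pause is an arbitrary choice. The second argument exists only so the function can be pointed at a scratch directory for testing.

```shell
# Hypothetical helper: power-cycle a PCI hotplug slot through sysfs.
slot_power_cycle() {
    # $1: slot number; $2: slots directory (defaults to the real sysfs path)
    slot_dir="${2:-/sys/bus/pci/slots}/$1"
    if [ ! -w "$slot_dir/power" ]; then
        echo "no writable power control for slot $1" >&2
        return 1
    fi
    echo 0 > "$slot_dir/power"
    sleep 1    # arbitrary pause before powering the slot back on
    echo 1 > "$slot_dir/power"
}
```

Usage (as root): `slot_power_cycle "$NUMBER"`. If the slots directory is empty, the function returns an error instead of writing anywhere.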
Comment 21 Len Brown 2020-03-04 21:01:33 UTC
Created attachment 287801 [details]
sleepgraph for ath10k death in Linux-5.6-rc3 ACPI/S3

This example is from an ACPI/S3, rather than s2idle.
(ACPI S3 is not the default suspend on the 9360, but when LPI doesn't work in s2idle, it is a viable backup)

[21603.645369] ath10k_pci 0000:3a:00.0: firmware crashed! (guid 5043c362-0b7f-4f25-9c0c-8437209e510b)
[21603.645372] ath10k_pci 0000:3a:00.0: qca6174 hw3.2 target 0x05030000 chip_id 0x00340aff sub 1a56:1535
[21603.645373] ath10k_pci 0000:3a:00.0: kconfig debug 0 debugfs 1 tracing 1 dfs 0 testmode 0
[21603.645759] ath10k_pci 0000:3a:00.0: firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features wowlan,ignore-otp,mfp crc32 29eb8ca1
[21603.646069] ath10k_pci 0000:3a:00.0: board_file api 2 bmi_id N/A crc32 4ed3569e
[21603.646070] ath10k_pci 0000:3a:00.0: htt-ver 3.60 wmi-op 4 htt-op 3 cal otp max-sta 32 raw 0 hwcrypto 1
[21603.656199] ath10k_pci 0000:3a:00.0: failed to get memcpy hi address for firmware address 4: -16
[21603.656200] ath10k_pci 0000:3a:00.0: failed to read firmware dump area: -16
[21603.656201] ath10k_pci 0000:3a:00.0: Copy Engine register dump:
[21603.656204] ath10k_pci 0000:3a:00.0: [00]: 0x00034400 4294967295 4294967295 4294967295 4294967295
[21603.656208] ath10k_pci 0000:3a:00.0: [01]: 0x00034800 4294967295 4294967295 4294967295 4294967295
[21603.656213] ath10k_pci 0000:3a:00.0: [02]: 0x00034c00 4294967295 4294967295 4294967295 4294967295
[21603.656218] ath10k_pci 0000:3a:00.0: [03]: 0x00035000 4294967295 4294967295 4294967295 4294967295
[21603.656221] ath10k_pci 0000:3a:00.0: [04]: 0x00035400 4294967295 4294967295 4294967295 4294967295
[21603.656224] ath10k_pci 0000:3a:00.0: [05]: 0x00035800 4294967295 4294967295 4294967295 4294967295
[21603.656227] ath10k_pci 0000:3a:00.0: [06]: 0x00035c00 4294967295 4294967295 4294967295 4294967295
[21603.656231] ath10k_pci 0000:3a:00.0: [07]: 0x00036000 4294967295 4294967295 4294967295 4294967295

It appears that the WIFI actually came back to life two suspend/resume cycles later -- at least as measured by a non-empty /proc/net/wireless file --
but all subsequent ACPI/S3 suspend/resume cycles produced many of these:

[21695.368075] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.4
[21695.368089] pcieport 0000:00:1c.4: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[21695.368092] pcieport 0000:00:1c.4: AER:   device [8086:9d14] error status/mask=00100000/00010000
[21695.368094] pcieport 0000:00:1c.4: AER:    [20] UnsupReq               (First)
[21695.368095] pcieport 0000:00:1c.4: AER:   TLP Header: 34000000 3a000010 00000000 00008c60
[21695.368101] ath10k_pci 0000:3a:00.0: AER: can't recover (no error_detected callback)
[21695.368145] pcieport 0000:00:1c.4: AER: device recovery failed
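The "non-empty /proc/net/wireless" check mentioned above can be scripted. A minimal sketch, assuming the usual layout of that file (two header lines, then one line per wireless interface); the function name is hypothetical.

```shell
# Hypothetical helper: list interfaces that currently appear in /proc/net/wireless.
wireless_ifaces() {
    # $1: optional path, defaults to /proc/net/wireless
    # Lines 1-2 are column headers; each later line starts with "iface:"
    awk 'NR > 2 { sub(/:.*/, "", $1); print $1 }' "${1:-/proc/net/wireless}"
}
```

An empty result after resume would indicate that no wireless interface is reporting statistics, which is how the dead state shows up in this test.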
Comment 22 Len Brown 2020-03-04 21:07:57 UTC
(In reply to Mario Limonciello from comment #20)
> OK so PCI device reset isn't viable, but slot reset might still be.

Worth a try -- I'll do it on next opportunity.

I will also test whether big band-aid #1 (taking the interface DOWN/UP) actually works when invoked around every suspend/resume.
Comment 23 Len Brown 2020-03-09 06:25:21 UTC
/sys/bus/pci/slots/ exists but is empty, both with the Ubuntu binary kernel and with my upstream build-from-source kernel.
Comment 24 Mario Limonciello 2020-03-09 13:37:17 UTC
On a different machine here I have that directory populated, but it seems that it's not populated everywhere.

Perhaps using setpci instead as shown here may be a viable option for checking if that recovers it:
https://unix.stackexchange.com/questions/73908/how-to-reset-cycle-power-to-a-pcie-device
Comment 25 Len Brown 2020-03-09 19:00:14 UTC
I've confirmed that running this before/after the suspend/resume
is NOT sufficient to prevent the ath10k failure:

nmcli connection down ssid
nmcli connection up ssid

It failed the same way as it does without this (attempted) workaround, and in under 200 tests -- so it didn't measurably enhance the resilience of the system.
Comment 26 Len Brown 2020-03-09 19:46:20 UTC
echo 1 > reset in the PCI sysfs directory for the device causes a time-out.

echo 1 > remove in the PCI sysfs directory causes the sysfs attributes for this device to vanish.

Issuing a reset of the port (1c.4), as per the script at the URL above, fails with permission denied -- even in the working case.

In the failing case, after the remove,

echo 1 > rescan

does not find the device and provokes no messages in dmesg.  It is as if the device is permanently powered off.

rmmod/modprobe of ath10k_pci loads the driver, but it doesn't find any hardware and prints nothing in dmesg.
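The remove/rescan sequence described above can be written out as one function. This is a sketch, assuming the standard sysfs layout (devices/<address>/remove and a bus-level rescan file); the function name and the one-second pause are my own choices, and the second argument exists only for testing against a scratch directory.

```shell
# Hypothetical helper: remove a PCI device via sysfs, rescan the bus,
# and report whether the device node is present afterwards.
pci_remove_rescan() {
    # $1: PCI address, e.g. 0000:3a:00.0
    # $2: PCI bus directory (defaults to /sys/bus/pci)
    bus="${2:-/sys/bus/pci}"
    dev="$bus/devices/$1"
    if [ -e "$dev/remove" ]; then
        echo 1 > "$dev/remove"
    fi
    sleep 1    # arbitrary pause before rescanning
    echo 1 > "$bus/rescan"
    if [ -e "$dev" ]; then echo "present"; else echo "missing"; fi
}
```

In the failing case reported here, `pci_remove_rescan 0000:3a:00.0` (as root) would be expected to print "missing", since the rescan did not find the device again.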
Comment 27 Len Brown 2020-03-09 19:53:16 UTC
I sometimes see AERs on the pcieport (1c.4) that holds the ath10k -- this one is upon reboot:

[   12.987007] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   12.987018] pcieport 0000:00:1c.4: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   12.987026] pcieport 0000:00:1c.4: AER:   device [8086:9d14] error status/mask=00001000/00002000
[   12.987032] pcieport 0000:00:1c.4: AER:    [12] Timeout
[   13.001838] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:00:1c.4
[   13.001852] pcieport 0000:00:1c.4: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   13.001862] pcieport 0000:00:1c.4: AER:   device [8086:9d14] error status/mask=00001000/00002000
[   13.001868] pcieport 0000:00:1c.4: AER:    [12] Timeout
Comment 28 Len Brown 2020-03-11 02:29:04 UTC
I've confirmed that running this before/after the suspend/resume
is NOT sufficient to prevent the ath10k failure:

nmcli networking off
nmcli networking on

ath10k crashed after about 150 cycles, and took the network and Low-Power-Idle with it, requiring system reboot.
Comment 29 Len Brown 2020-03-11 04:38:23 UTC
I've confirmed that running this before/after the suspend/resume
is NOT sufficient to prevent the ath10k failure:

rmmod ath10k_pci
modprobe ath10k_pci

ath10k again crashed after about 150 cycles, and took the network and Low-Power-Idle with it, requiring system reboot.

Is there ath10k firmware newer than this?

ath10k_pci 0000:3a:00.0: firmware ver WLAN.RM.4.4.1-00140-QCARMSWPZ-1 api 6 features wowlan,ignore-otp,mfp crc32 29eb8ca1

If not, I'm out of ideas.
Comment 30 Len Brown 2020-03-12 17:36:40 UTC
Verified that disabling the ath10k in BIOS SETUP
also prevents the Dell 9360 from reaching deep Idle Power States.
Comment 31 Len Brown 2020-03-13 03:37:09 UTC
Confirmed that physically removing the ath10k card from the Dell 9360 allows the system to reach Low Power Idle states during suspend.
Comment 32 Len Brown 2020-03-13 03:42:45 UTC
Confirmed that installing an Intel wireless card in the Dell 9360 allows it to reliably access the network across suspend/resume cycles to both ACPI S3 and suspend-to-idle.

Confirmed that in the suspend-to-idle case, the Dell 9360 reliably achieves good residency in Low Power Idle.

Since the platform is working well without the ath10k, and the ath10k driver prints "firmware crashed" and all methods to subsequently access the ath10k hardware fail, closing this as an ath10k hardware bug.
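For readers who want to repeat the Low Power Idle residency check, one possible way is to compare a residency counter before and after a suspend cycle. This is a sketch under assumptions: it relies on the kernel exposing /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us on the platform (not every system does), it is not necessarily how the results above were measured, and the function names are hypothetical.

```shell
# Assumed location of the system-level LPI residency counter (microseconds).
LPI_FILE=${LPI_FILE:-/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us}

# Residency gained between two counter readings.
lpi_delta() {
    # $1: counter before suspend; $2: counter after resume
    echo $(( $2 - $1 ))
}

# Succeeds only if the counter advanced, i.e. the system spent time in LPI.
lpi_reached() {
    [ "$(lpi_delta "$1" "$2")" -gt 0 ]
}
```

Typical use would be to read `$LPI_FILE` into a variable, run one s2idle cycle, read it again, and pass both values to `lpi_reached`. A counter that does not move across the cycle corresponds to the blocked state described in this report.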
Comment 33 bj.cardon 2020-09-21 03:06:28 UTC
I seem to have this issue on an XPS 13 9360 as well, on kernel 5.4; however, I haven't always had this issue. Is the suggestion Len made above meant to imply that this is a hardware issue that can't be fixed in the driver? Is the system capable of entering Low Power Idle in Windows, and, if so, wouldn't that suggest that it is a driver issue and not a hardware issue? I found it odd that switching to a physically different wireless adapter, and thus a different driver, was seen as evidence of an unresolvable hardware bug...
Comment 34 bimokn97 2020-12-23 17:54:54 UTC
I have the same issue. My laptop is an ASUS A442UR running Fedora 33 with kernel 5.9, but to get the hardware working again I must open the laptop and disconnect the internal battery for a while.
Comment 35 smit17av 2021-07-08 16:50:53 UTC
Same issue on Arch Linux with kernel 5.12.14 on a Dell Vostro 3583, but with a qca9377 card. The only option is to reboot.