Bug 207851 - ath10k firmware crash requiring a reboot
Summary: ath10k firmware crash requiring a reboot
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-22 05:09 UTC by Luis Chamberlain
Modified: 2021-01-27 10:18 UTC
CC List: 3 users

See Also:
Kernel Version: 5.6.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
ath10k firmware crash requires a full reboot (196.52 KB, text/plain)
2020-05-22 05:09 UTC, Luis Chamberlain
ath10k firmware crashed again (64.78 KB, text/plain)
2020-05-27 18:30 UTC, Luis Chamberlain

Description Luis Chamberlain 2020-05-22 05:09:05 UTC
Created attachment 289221
ath10k firmware crash requires a full reboot

I keep hitting an ath10k firmware crash after a few days of use. After the crash the device is completely unusable, and only a full reboot brings it back.

I've tried removing the module and loading it again, but that does not resolve it. My only option is to do a full reboot...

Attached is my full kernel log. I am on 5.6.0, which as of today is both the latest Debian kernel and the latest stable kernel.

Unfortunately, users are not informed of this in any way today, so only someone who already suspects what is wrong would look under the hood and see that it was a firmware crash.
Comment 1 Luis Chamberlain 2020-05-27 18:30:15 UTC
Created attachment 289355
ath10k firmware crashed again

It crashed again today on 5.6.7. I tried removing the module and loading it again, and once more that did not work; a full system reboot was required.
Comment 2 Brian Norris 2020-06-02 20:34:30 UTC
FYI, this doesn't really look like a firmware crash at all. It looks like the entire PCI endpoint is just not responding. [1] To confirm that, you might want to trace the values of all those ioread32() calls (e.g., in ath10k_pci_is_awake()). I believe they tend to be 0xffffffff when the endpoint isn't responding at all.

And in fact, for the "register dumps", that's exactly what I see:

[67103.491095] ath10k_pci 0000:01:00.0: Copy Engine register dump:
[67103.613636] ath10k_pci 0000:01:00.0: [00]: 0x00034400 4294967295 4294967295 4294967295 4294967295
...

$ printf '%#x\n' 4294967295
0xffffffff
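
The same check can be run over a whole saved kernel log. Below is a minimal sketch in Python, assuming the "Copy Engine register dump" lines keep the layout quoted above (an index in brackets, a hex base address, then decimal register values). The names parse_ce_dump_line and endpoint_unresponsive are hypothetical helpers written for illustration; they are not part of ath10k or any existing tool.

```python
import re

# Matches the tail of a dump line such as:
#   "... [00]: 0x00034400 4294967295 4294967295 4294967295 4294967295"
# Assumption: index in brackets, hex base address, then decimal values.
DUMP_LINE = re.compile(r"\[(\d+)\]:\s+(0x[0-9a-fA-F]+)((?:\s+\d+)+)\s*$")

# A 32-bit read that comes back as all ones.
ALL_ONES = 0xFFFFFFFF


def parse_ce_dump_line(line):
    """Return (index, base_address, values), or None if the line is not a dump line."""
    match = DUMP_LINE.search(line)
    if match is None:
        return None
    index = int(match.group(1))
    base = int(match.group(2), 16)
    values = [int(v) for v in match.group(3).split()]
    return index, base, values


def endpoint_unresponsive(values):
    """True when every register value read back as 0xffffffff."""
    return bool(values) and all(v == ALL_ONES for v in values)


sample = (
    "[67103.613636] ath10k_pci 0000:01:00.0: [00]: 0x00034400 "
    "4294967295 4294967295 4294967295 4294967295"
)
parsed = parse_ce_dump_line(sample)
```

Feeding each line of the attached log through parse_ce_dump_line and then endpoint_unresponsive would flag every dump entry whose registers all read as 0xffffffff, which is the pattern expected when the endpoint does not respond at all.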

As to how to debug this...it might help if you know this used to work on some previous kernel. Then there might be something you could track down in the core platform area (e.g., PCI? power management?).

It does seem telling that at least one of your failures is seen immediately after system suspend/resume. That's definitely a key area where I've seen platform instability in other contexts...


[1] It actually seems rather similar to the "EVENT=INACCESSIBLE" stuff I was talking about in iwlwifi land here:
https://lore.kernel.org/linux-wireless/CA+ASDXOg9oKeMJP1Mf42oCMMM3sVe0jniaWowbXVuaYZ=ZpDjQ@mail.gmail.com/
It's not a WiFi firmware issue, but a platform/hardware issue.
Comment 3 Luis Chamberlain 2020-08-14 18:35:01 UTC
(In reply to Brian Norris from comment #2)
> It does seem telling that at least one of your failures is seen immediately
> after system suspend/resume. That's definitely a key area where I've seen
> platform instability in other contexts...

FWIW I saw this now on a fresh boot, without any suspend cycles kicking in at all.
Comment 4 Luis Chamberlain 2021-01-27 10:18:18 UTC
I've also seen this when booting with:

pcie_aspm=off
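
When reporting a result like this, it can help to confirm that the parameter actually reached the running kernel. A minimal sketch in Python, assuming a Linux system where /proc/cmdline holds the boot command line; kernel_param and running_kernel_param are hypothetical helper names used only for illustration.

```python
def kernel_param(cmdline, name):
    """Return the value of a boot parameter, "" for a bare flag, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


def running_kernel_param(name, path="/proc/cmdline"):
    """Look up a parameter on the running kernel (assumes /proc is mounted)."""
    with open(path) as f:
        return kernel_param(f.read(), name)
```

Calling running_kernel_param("pcie_aspm") on the affected machine should return "off" if the option was picked up by the bootloader configuration, and None if it was not passed at all.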
