Created attachment 289221
ath10k firmware crash requires a full reboot
I keep hitting an ath10k firmware crash after a few days of uptime. After the crash the device is unusable, and only a full reboot brings it back.
I've tried removing the module and loading it again, but that does not resolve it. My only option is to do a full reboot...
Attached is my full kernel log. I am on 5.6.0, which as of today is both the latest Debian kernel and the latest stable kernel.
Unfortunately, users are not informed of this in any way today, so you would only look under the hood and see that it was a firmware crash if you already suspected something was wrong.
Created attachment 289355
ath10k firmware crashed again
Crashed again today on 5.6.7. I tried removing the module and adding it back and, again, that didn't work; a full system reboot was required.
FYI, this doesn't really look like a firmware crash at all. It looks like the entire PCI endpoint is just not responding.  To confirm that, you might want to trace the values of all those ioread32() calls (e.g., in ath10k_pci_is_awake()). I believe they tend to be 0xffffffff when the endpoint isn't responding at all.
And in fact, for the "register dumps", that's exactly what I see:
[67103.491095] ath10k_pci 0000:01:00.0: Copy Engine register dump:
[67103.613636] ath10k_pci 0000:01:00.0: : 0x00034400 4294967295 4294967295 4294967295 4294967295
$ printf '%#x\n' 4294967295
0xffffffff
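The same check can be run over a whole kernel log instead of one value at a time. This is a minimal sketch, assuming the dump lines look like the one quoted above (a hex base address followed by decimal register values after the last ": "). The helper names `parse_dump_values` and `looks_unresponsive` are made up for illustration and are not part of ath10k.

```python
ALL_ONES = 0xFFFFFFFF  # value a 32-bit read tends to return when the endpoint does not respond


def parse_dump_values(line):
    """Return the integer fields that follow the last ': ' in a log line.

    Assumed layout (taken from the quoted dump): a hex base address
    followed by decimal register values.
    """
    payload = line.rsplit(": ", 1)[-1]
    values = []
    for token in payload.split():
        try:
            values.append(int(token, 0))  # base 0 accepts both 0x... and plain decimal
        except ValueError:
            pass  # skip non-numeric tokens such as header text
    return values


def looks_unresponsive(line):
    """True if every register value after the base address is all ones."""
    registers = parse_dump_values(line)[1:]
    return bool(registers) and all(v == ALL_ONES for v in registers)


sample = ("[67103.613636] ath10k_pci 0000:01:00.0: : "
          "0x00034400 4294967295 4294967295 4294967295 4294967295")
print(looks_unresponsive(sample))
```

Feeding each line of the attached log through `looks_unresponsive` would show whether every Copy Engine entry reads back as all ones, or only some of them.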
As to how to debug this... it might help if you know whether this used to work on some previous kernel. Then there might be something you could track down in the core platform area (e.g., PCI? power management?).
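One cheap check from userspace, independent of the driver, is whether the device still answers PCI config-space reads. This is a different technique from tracing the driver's ioread32() calls: it only looks at the vendor ID at offset 0 of config space, which normally reads as 0xffff when the endpoint has dropped off the bus. A rough sketch, assuming the standard sysfs `config` file and the 0000:01:00.0 address from the log; the function names are hypothetical.

```python
from pathlib import Path


def vendor_id_from_config(config_bytes):
    """Vendor ID is the little-endian 16-bit value at offset 0 of PCI config space."""
    if len(config_bytes) < 2:
        raise ValueError("need at least 2 bytes of config space")
    return int.from_bytes(config_bytes[:2], "little")


def endpoint_looks_absent(config_bytes):
    """0xffff is not a valid vendor ID; it is what reads return when nothing responds."""
    return vendor_id_from_config(config_bytes) == 0xFFFF


def read_config(bdf="0000:01:00.0"):
    """Read raw config space from sysfs (assumed path; address taken from the log)."""
    return Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()


# Example bytes only: 0x168c is the Qualcomm Atheros vendor ID.
healthy = bytes([0x8C, 0x16, 0x3E, 0x00])
dead = b"\xff" * 64
print(endpoint_looks_absent(healthy), endpoint_looks_absent(dead))
```

On the affected machine, calling `endpoint_looks_absent(read_config())` right after a failure would distinguish "the device is gone from the bus" from "the device is present but its firmware is wedged".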
It does seem telling that at least one of your failures is seen immediately after system suspend/resume. That's definitely a key area where I've seen platform instability in other contexts...
It actually seems rather similar to the "EVENT=INACCESSIBLE" stuff I was talking about in iwlwifi land here:
It's not a WiFi firmware issue, but a platform/hardware issue.
(In reply to Brian Norris from comment #2)
> It does seem telling that at least one of your failures is seen immediately
> after system suspend/resume. That's definitely a key area where I've seen
> platform instability in other contexts...
FWIW I saw this now on a fresh boot, without any suspend cycles kicking in at all.