Bug 109681 - mwifiex_pcie on surface pro 4 is unstable
Summary: mwifiex_pcie on surface pro 4 is unstable
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: verdre
URL:
Keywords:
: 195183 (view as bug list)
Depends on:
Blocks: 195183
  Show dependency tree
 
Reported: 2015-12-20 03:39 UTC by Weng Xuetian
Modified: 2022-01-04 18:34 UTC (History)
30 users (show)

See Also:
Kernel Version: 4.3.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
a more complete log when problem happens. (12.82 KB, application/octet-stream)
2015-12-21 06:39 UTC, Weng Xuetian
Details
log with firmware 15.68.7.p53 (8.14 KB, application/octet-stream)
2015-12-22 00:14 UTC, Weng Xuetian
Details
A more complete log with debug flag set to 0xffffff (246.63 KB, text/plain)
2016-01-23 02:25 UTC, Weng Xuetian
Details
attachment-2572-0.html (1.40 KB, text/html)
2016-03-15 16:35 UTC, Anton Anikin
Details
attachment-7733-0.html (1.24 KB, text/html)
2016-03-16 09:35 UTC, Anton Anikin
Details
kernel 4.5 log with debug mask 0xfffffff (529.00 KB, application/octet-stream)
2016-03-24 17:29 UTC, Weng Xuetian
Details
Log of the failure to connect. (79.87 KB, application/octet-stream)
2016-04-04 19:17 UTC, Jeremy White
Details
Log of the cmd size 0 failure mode. (650.68 KB, application/x-bzip)
2016-04-04 19:18 UTC, Jeremy White
Details
Log of a failure to connect, with debug mask set from boot. (52.96 KB, application/x-bzip)
2016-04-13 14:45 UTC, Jeremy White
Details
Log of a 'working' session, which has poor latency. debug_mask ffffffff from start. (153.52 KB, application/x-bzip)
2016-04-13 14:47 UTC, Jeremy White
Details
Hack to prevent power saving from ever starting (1.45 KB, patch)
2016-04-13 15:45 UTC, Jeremy White
Details | Diff
Debug patch for sleep issue (1.81 KB, application/octet-stream)
2016-04-25 15:16 UTC, Amitkumar Karwar
Details
Log with new debug patch; this is a case where we cannot connect. (58.12 KB, application/x-bzip)
2016-04-25 16:05 UTC, Jeremy White
Details
Case where we get just a few slow pings, but it otherwise works. (177.78 KB, application/x-bzip)
2016-04-25 16:07 UTC, Jeremy White
Details
debug sleep issue changes (2.35 KB, patch)
2016-04-25 18:47 UTC, Amitkumar Karwar
Details | Diff
Run with dbg3 (200.36 KB, application/octet-stream)
2016-04-25 19:41 UTC, Jeremy White
Details
mwifiex on Surface Pro 3 (80.44 KB, application/zip)
2016-04-26 17:14 UTC, Hieu Bui
Details
dmesg log of a failure to connect (23.30 KB, application/x-bzip)
2016-04-27 02:39 UTC, Jeremy White
Details
Screenshot of sample kernel panic with dbg3 (2.31 MB, image/jpeg)
2016-04-27 02:39 UTC, Jeremy White
Details
FW in bad state (15.66 KB, application/x-bzip)
2016-04-27 23:51 UTC, Hieu Bui
Details
Multiple "FW in bad state" (20.82 KB, application/x-bzip)
2016-04-29 05:57 UTC, Hieu Bui
Details
Diff that shows problem, and has current working solution (5.63 KB, patch)
2016-05-11 14:20 UTC, Jeremy White
Details | Diff
Log of failing session (123.47 KB, application/x-bzip)
2016-05-12 13:31 UTC, Jeremy White
Details
attachment-8673-0.html (3.49 KB, text/html)
2016-05-12 17:08 UTC, Anton Anikin
Details
attachment-11994-0.html (4.60 KB, text/html)
2016-05-14 19:36 UTC, Anton Anikin
Details
Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux) (6.53 KB, text/plain)
2016-05-29 17:46 UTC, David López
Details
Firmware in bad state after sleep on Surface pro 3 (45.46 KB, application/x-targz)
2016-08-02 05:13 UTC, Hieu Bui
Details
no-power-save (177 bytes, application/octet-stream)
2017-05-20 12:48 UTC, Mikael Djurfeldt
Details
dmesg log and data file from firmware dump (2.07 MB, application/x-compressed-tar)
2017-11-22 14:44 UTC, Jeremy White
Details
dmesg of command timeout issue (204.69 KB, text/plain)
2019-10-03 18:01 UTC, verdre
Details
systemd journal kernel log from boot until command timeout issue (3.47 MB, text/plain)
2019-10-03 18:03 UTC, verdre
Details
[PATCH] mwifiex: debugfs: add entry for device_dump (1.86 KB, patch)
2019-10-07 12:40 UTC, Tsuchiya Yuto (kitakar5525)
Details | Diff
dmesg of command timeout issue (67.19 KB, text/plain)
2019-10-10 13:48 UTC, verdre
Details
firmware dump of command timeout issue (1.08 MB, text/plain)
2019-10-10 13:49 UTC, verdre
Details
0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume (794 bytes, patch)
2019-10-18 04:29 UTC, Ganapathi Bhat
Details | Diff
attachment-14890-0.html (244 bytes, text/html)
2019-12-26 13:41 UTC, Ganapathi Bhat
Details
attachment-14943-0.html (246 bytes, text/html)
2019-12-26 13:41 UTC, Amitkumar Karwar
Details
attachment-16293-0.html (244 bytes, text/html)
2020-01-06 12:15 UTC, Ganapathi Bhat
Details
attachment-30418-0.html (246 bytes, text/html)
2020-04-18 00:10 UTC, Amitkumar Karwar
Details
attachment-26841-0.html (246 bytes, text/html)
2020-07-01 23:38 UTC, Amitkumar Karwar
Details
mwifiex crash log on SB1 with ps_on, 5GHzAP (1.47 MB, application/x-xz)
2020-07-02 12:00 UTC, Tsuchiya Yuto (kitakar5525)
Details

Description Weng Xuetian 2015-12-20 03:39:56 UTC
Wifi looks fine at first after boot, but after hours of use or even idle, it goes into a unrecoverable state (unable to send and receive) until next reboot. dmesg logs shows lots of following messages:

[ 6822.586704] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586731] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6822.586899] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6822.586904] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6827.925487] mwifiex_pcie 0000:02:00.0: 4296924600 : Tx timeout(#40), bss_type-num = 0-0
[ 6828.584671] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584687] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6828.584755] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6828.584763] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587608] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587623] mwifiex_pcie 0000:02:00.0: failed to get signal information
[ 6834.587698] mwifiex_pcie 0000:02:00.0: PREP_CMD: FW in reset state
[ 6834.587706] mwifiex_pcie 0000:02:00.0: failed to get signal information
Comment 1 Weng Xuetian 2015-12-21 06:39:55 UTC
Created attachment 197921 [details]
a more complete log when problem happens.

Attach a more complete log when problem happens.
Comment 2 Amitkumar Karwar 2015-12-21 06:45:09 UTC
Thanks for reporting the problem. This is a command timeout problem caused mostly due to a firmware bug.
Which firmware is being used here?
Comment 3 Weng Xuetian 2015-12-21 06:47:45 UTC
It's commit bbe4917 from linux-firmware git repository, not quite sure about the actual version.
Comment 4 Amitkumar Karwar 2015-12-21 06:53:15 UTC
Could you try following latest firmware?
http://git.marvell.com/?p=mwifiex-firmware.git;a=commit;h=495e89368ea4f59bf6f8e54e94b5e8448fadd541
Comment 5 Weng Xuetian 2015-12-22 00:14:21 UTC
Created attachment 197961 [details]
log with firmware 15.68.7.p53

At first I though it's fixed, but I just got into similar situation. Log attached.

BTW, I also seems to have unstable connection when wifi is working. E.g. I open google.com in firefox with google's instant search feature enabled. While typing, The result usually not displayed and I have to refresh page several times to get web page displayed. The other laptop I have doesn't show the same problem so it's not network issue.
Comment 6 Anton Anikin 2016-01-03 16:27:33 UTC
I have the same problem with mwifiex_pcie (latest git firmware) on Surface Pro 3
Comment 7 Weng Xuetian 2016-01-23 02:25:07 UTC
Created attachment 201031 [details]
A more complete log with debug flag set to 0xffffff

4.3.3 with firmware 15.68.7.p53
Comment 8 Mikael Djurfeldt 2016-01-25 22:42:37 UTC
I have the same problem. In addition, using mwiflex_pcie on Surface Pro 4 seems to cause intermittent system crashes.

Using kernel 4.3.3-5 compiled from linux-source.
Comment 9 Andrea Gottardo 2016-01-26 12:39:25 UTC
I have the same problem on my Microsoft Surface 3 (not Pro model), same issues (even though the rest of system is stable) and same dmesg printout.
Lspci lists the card on my tablet as: Marvell Technology Group Ltd. 88W8897 [AVASTAR] 802.11ac Wireless.
Comment 10 Andrea Gottardo 2016-01-26 12:41:12 UTC
I have also noticed that the issues appear as soon as I start using the wireless card heavily (downloading files, apt update, etc.). As long as I just use ping, everything is fine.
Comment 11 Weng Xuetian 2016-03-14 21:25:40 UTC
It becomes much worse in linux 4.5 comparing with 4.4.2.

Is there any update on this issue?
Comment 12 Anton Anikin 2016-03-15 16:35:20 UTC
Created attachment 209311 [details]
attachment-2572-0.html

The driver was updated recently:

Из git://git.marvell.com/mwifiex-firmware
   e92f8b3..a96efa0  master     -> origin/master
Обновление e92f8b3..a96efa0
Fast-forward
 mrvl/pcie8897_uapsta.bin | Bin 803884 -> 816772 bytes
 mrvl/sd8897_uapsta.bin   | Bin 780760 -> 794148 bytes


2016-03-15 0:25 GMT+03:00 <bugzilla-daemon@bugzilla.kernel.org>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #11 from Weng Xuetian <wengxt@gmail.com> ---
> It becomes much worse in linux 4.5 comparing with 4.4.2.
>
> Is there any update on this issue?
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>
Comment 13 Weng Xuetian 2016-03-16 03:52:19 UTC
Actually I'm aware of that.

However, with latest firmware + linux 4.5, my system would simply freeze in a few minutes that I can't get any useful information for this bug.

With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi also enters bad state quite fast comparing with 4.4.2 or 4.3.3.
Comment 14 Weng Xuetian 2016-03-16 03:55:05 UTC
The only thing I notice so far is that:

With latest firmware + linux 4.3.3, I notice a lot of "mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0" when wifi enters bad state, which doesn't happen with 15.68.7.p53.
Comment 15 Amitkumar Karwar 2016-03-16 06:05:09 UTC
Hi Weng,

Can you provide complete log for the issue mentioned in comment#14?
Comment 16 Anton Anikin 2016-03-16 09:35:54 UTC
Created attachment 209441 [details]
attachment-7733-0.html

My system freezes too with 4.5 kernel and latest and previous versions of
firmware.

2016-03-16 6:52 GMT+03:00 <bugzilla-daemon@bugzilla.kernel.org>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #13 from Weng Xuetian <wengxt@gmail.com> ---
> Actually I'm aware of that.
>
> However, with latest firmware + linux 4.5, my system would simply freeze
> in a
> few minutes that I can't get any useful information for this bug.
>
> With 15.68.7.p53 my kernel could live longer, but with kernel 4.5, wifi
> also
> enters bad state quite fast comparing with 4.4.2 or 4.3.3.
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>
Comment 17 Weng Xuetian 2016-03-24 17:29:41 UTC
Created attachment 210651 [details]
kernel 4.5 log with debug mask 0xfffffff

The complete log is too large(286MB), so I cut it to last 1 minute before PREP_CMD: FW is in bad state happens.

mwifiex_pcie 0000:02:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p66)
Comment 18 Jeremy White 2016-04-04 19:17:14 UTC
I'm experiencing a variety of problems with this driver as well.

Using Linux v4.6-rc1-64-g6ddf37d, firmware 15.68.7.p66.

The first failure is an inability to connect.  This is sporadic; it seems to happen about 1 time in 4 boots, but one time I had it three or four times in a row.  I will attach a dmesg log with debug increased part way through.  It seems as though this may be similar to the problem reported here:
  https://www.mail-archive.com/linux-wireless@vger.kernel.org/msg19911.html

The second failure is horrific performance, accompanied by constantly repeated mwifiex_process_sleep_confirm_resp: cmd size is 0 messages.  The user experience is harder to quantify; you get absolutely terrible network performance (7-10% packet loss), and general instability.  I'll attach a dmesg log for that as well, again with debug increased part way through.  This is similar to what Weng Xuetian reported in Comment 14.

Finally, as I use this device, I find one consistent and persistent oddity.  That is, even when everything is 'working', I get poor latency connecting to the device.  That is, my latency when I ping 'out' is fine (< 10 ms, not much variation).  But when I ping the device from a different system, I get highly variable latency.   (Ranging from 10 ms to 4000 ms).  No dropped packets, but horrible latency.  You really notice it when you're ssh'd in :-/.
Comment 19 Jeremy White 2016-04-04 19:17:51 UTC
Created attachment 211711 [details]
Log of the failure to connect.
Comment 20 Jeremy White 2016-04-04 19:18:46 UTC
Created attachment 211721 [details]
Log of the cmd size 0 failure mode.
Comment 21 Jeremy White 2016-04-13 14:44:56 UTC
I've spent some time looking at the source code and trying to debug what is happening.  I'm using the kvalo wireless-drivers-next pending branch, and I've applied recent patches sent to lkml that seem valuable; but the problems persist.

It seems as though in the 'normal' case we get an unusual number of 'max count reached while accessing sleep cookie' messages.  The usleep calls in that case make me suspicious, and being slow to wake up would seem to be a good explanation of the performance we see.  (That is, sending a packet is fast, but responding is slow).  It also seems like we're in a rather startling loop of awake/sleep events; doing multiple spins within a millisecond, as far as I can tell.

In the disconnect case, I see an 'invalid cmd resp' error followed by 'There is no command but got a command response'.

I have now started booting with debug_mask=0xffffffff; I'll replace my logs with those, as they appear to be more complete.

For reference, here are related areas of the code:

Sleep message:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#377

There is no command:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#1637
(although intriguingly to my naive mind, that code block is related to the sleep check as well).
Comment 22 Jeremy White 2016-04-13 14:45:42 UTC
Created attachment 212611 [details]
Log of a failure to connect, with debug mask set from boot.
Comment 23 Jeremy White 2016-04-13 14:47:18 UTC
Created attachment 212621 [details]
Log of a 'working' session, which has poor latency.  debug_mask ffffffff from start.
Comment 24 Jeremy White 2016-04-13 15:44:26 UTC
Disabling power management (via a Network Manager dispatcher.d script) seems to help the first problem (latency issues).  But it leaves the sporadic failures to connect.

A hack to prevent power management from ever starting seems to resolve those failures as well, although it's hard to claim that as true; it could be I was simply not patient enough to reboot enough times to reproduce the failure.

ping latency and dropped packets can still suffer, so there is clearly something else still wrong, but I'm hoping this is a clue to help understand the issue.

I'll attach the hack.
Comment 25 Jeremy White 2016-04-13 15:45:07 UTC
Created attachment 212631 [details]
Hack to prevent power saving from ever starting
Comment 26 Amitkumar Karwar 2016-04-25 15:16:03 UTC
Created attachment 214061 [details]
Debug patch for sleep issue
Comment 27 Amitkumar Karwar 2016-04-25 15:16:38 UTC
We could not recreate below error message on our reference platform.

[   16.506296] mwifiex_pcie 0000:02:00.0: mwifiex_process_sleep_confirm_resp: cmd size is 0
[   16.506513] mwifiex_pcie 0000:02:00.0: max count reached while accessing sleep cookie

Could you apply attached debug patch and share dmesg log for the issue?
Comment 28 Jeremy White 2016-04-25 16:05:36 UTC
Created attachment 214081 [details]
Log with new debug patch; this is a case where we cannot connect.
Comment 29 Jeremy White 2016-04-25 16:07:20 UTC
Created attachment 214091 [details]
Case where we get just a few slow pings, but it otherwise works.

Interestingly, the constant problem I was having (most packets delayed) seems to have softened, for no discernable reason.  But I do still see a few slow packets (e.g. replies in 80-300 ms range), and in the log, I do see a few of the 'max count reached while accessing sleep cookie' messages.
Comment 30 Amitkumar Karwar 2016-04-25 18:47:21 UTC
Created attachment 214141 [details]
debug sleep issue changes
Comment 31 Amitkumar Karwar 2016-04-25 18:48:27 UTC
Please try attached patch. It includes a fix and debug messages.
Comment 32 Jeremy White 2016-04-25 19:41:36 UTC
Created attachment 214161 [details]
Run with dbg3

This does seem to fix the lack of connection, although again, it's tough to claim that for sure.  But I did get 8 or 9 clean connections.

I do still get hangs at reboot and shutdown sporadically; I don't have a strong sense of whether my hack eliminated those or not.  (And they seem to happen after file systems are unmounted so I don't have an easy way to capture them.).
Comment 33 Amitkumar Karwar 2016-04-26 15:01:38 UTC
Thanks for the tests.
Please share the logs for the observed issues.
Comment 34 Hieu Bui 2016-04-26 17:14:08 UTC
Created attachment 214431 [details]
mwifiex on Surface Pro 3

I have a Surface pro 3 model and I had the same problems. I started booting my system with debug flag for mwifiex module with wireless-next kernel yesterday. I have some logs and I hope that they will help you find the problems.
Comment 35 Jeremy White 2016-04-27 02:39:00 UTC
My apologies; I'm traveling for work, and am away from my normal work environment, which is making it harder for me to be thorough and responsive.  However, what I can tell with the tablet and my laptop is that the DBG3 patch seems to improve matters, but it does not appear to be a fix.

I spent some time rebooting the tablet again and again.  I eventually was able to get a failure to connect.  That was without the debug_mask=0xffffffff; but I will attach that log regardless.  I will try to reproduce it with debug on.

I also get frequent kernel panics at reboot or shutdown.  They seem to come after the file system is unmounted; I can't seem to find the logs in /var/log or using journalctl -k -b -1.  Forgive me if there is kernel dev 101 I am forgetting to properly capture those.

I will try to attach a photograph of one of the kernel panics; the stack trace, at least, appears to be legible.

I do not seem to get the kernel panics on reboot with a 4.4 kernel using my hack.

Full disclosure:  I do not have a 'pure' 4.6 tree; I'm using the wireless-next tree, and I have one or two of the interesting mwifiex patches from lkml applied.  I will also try to reproduce these tests with a pure 4.6 tree.
Comment 36 Jeremy White 2016-04-27 02:39:25 UTC
Created attachment 214471 [details]
dmesg log of a failure to connect
Comment 37 Jeremy White 2016-04-27 02:39:51 UTC
Created attachment 214481 [details]
Screenshot of sample kernel panic with dbg3
Comment 38 Hieu Bui 2016-04-27 23:51:58 UTC
Created attachment 214581 [details]
FW in bad state

With the wireless-next kernel and the power saving patch and the debug sleep patch, sometimes I got this error very often. The driver says the firmware is in bad state.
Comment 39 Hieu Bui 2016-04-29 05:57:01 UTC
Created attachment 214681 [details]
Multiple "FW in bad state"

I got multiple "FW in bad state" errors with the latest wifi-next kernel.
Comment 40 fc 2016-05-04 01:32:33 UTC
I am also experiencing `PREP_CMD: FW in reset state` from mwifiex_pcie on a Surface Book, often after long idle periods.
Comment 41 Jeremy White 2016-05-09 16:33:27 UTC
Studying some of the logs, and the logic change in DBG3, a potential clue is that in pcie.c, around line 1610, there is logic something like:
  skb_trim(skb, rx_len)
  skb_pull(skb, INTF_HEADER_LEN)
  if (...) {
    skb_push(skb, INTF_HEADER_LEN)
  }

It appears logical that the skb_push and skb_pull are meant to balance each other.  However, I see from studying logs that it we get cases where rx_len is 0.    In that case, the skb_pull will be a NOP, and the skb_push will move the data pointer, potentially incorrectly.

I have one skb_under_crash that seems likely to be related to that.
Comment 42 Jeremy White 2016-05-10 21:48:44 UTC
I have a hypothesis as to what I am seeing.

That is, with a fair bit of debug logging, I found that two interrupts in a row could arrive, even if the handler for the first one sent a mask of 0 to the PCIE_HOST_INT_MASK port.  That combined with a unique situation involving the while loop in mwifiex_process_pcie_int().  That is, it seems that the flow can be:
  Interrupt #1 (set int_status)
  Start processing interrupt #1.  Start by clearing int_status.  While processing, interrupt #2 arrives, once again setting int_status.  After interrupt #1 is processed, the while loop manually checks the interrupt status on the card and reloops.  That causes it to fully process the second interrupt *without clearing int_status*.
  Now we re-enter, see that int_status is set to 0x4, and try to process a command that is no longer there.  That is a 0 length command, and we get into a lot of trouble.

I have found that if I comment out the recheck (and effectively disable the while loop) here:
  https://kernel.googlesource.com/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next/+/pending/drivers/net/wireless/marvell/mwifiex/pcie.c#2238

I get stable results.  It's not clear if that's a correct fix.  And, frankly, my results are sporadic enough that I'm going to need some time before I believe it's a real result.  But I thought I'd share what seems like a promising lead.
Comment 43 Jeremy White 2016-05-11 14:20:48 UTC
Created attachment 215931 [details]
Diff that shows problem, and has current working solution

The attached diff is the patch set I was running with (against wireless-drivers-next/master; HEAD is e1ca790c8a32c0c77b9d89089ac7e73b72c2adfc).

Parts of it are the hack to prevent power saving.  Parts of it are debug messages that seem to exacerbate the problem.  That is, with these messages in, I get a fairly reproducible problem set.

Finally, there is an #ifdef; JPW_EXPERIMENT_1 that 'fixes' the problems for me.
Comment 44 Viz 2016-05-11 18:50:12 UTC
I have tried the patch from Jeremy White (but without disabling power-saving feature) on Linux 4.5.3 (Gentoo Source and latest firmware 15.68.7.p66) and it runs for several hours without any problem, thanks.
Comment 45 fc 2016-05-12 10:38:41 UTC
Thanks Jeremy, I applied the patch on top of tigerite's kernel (https://launchpad.net/~tigerite/+archive/ubuntu/kernel) on top of 4.4.6-3-surface. I will run it for a while on my Surface Book and see if that solves the problem.

It's here in case someone wants to save some time: https://github.com/EiNSTeiN-/mwifiex
Comment 46 Amitkumar Karwar 2016-05-12 11:28:39 UTC
Hi Jeremy,

I got the race condition you described in comment #42. It’s not expected as per our design. Here is the flow.

1) Interrupt is received over PCIe interface
2) Interrupt handler does following things.
    a) Read and clear the interrupt register, copy the interrupt bitmap to "adapter->int_status"
    b) Disable interrupts and exit.

3) Other thread will call mwifiex_process_int_status() to serve the interrupts in while loop.

4) At the end of loop, we read interrupt register for any further interrupts and serve them.

5) We exit from while loop if there are no more interrupts.

6) Interrupts are enabled while exiting mwifiex_process_int_status()

Basically we don't expect interrupt handler being called while serving the interrupts. The reason is interrupts are disabled at that time.

Looks like mwifiex_pcie_disable_host_int() call in mwifiex_interrupt_status() didn't take an effect for some reason.

Can you confirm by adding debug message if you receive an interrupt when it's disabled?
Comment 47 Amitkumar Karwar 2016-05-12 11:35:29 UTC
The fix from Jeremy is ok and doesn't have any side-effects. It removes the code which was added for improving the performance.

I am curious to know why interrupt is received even after disabling it which caused a race condition.
Comment 48 Jeremy White 2016-05-12 13:31:19 UTC
Created attachment 216101 [details]
Log of failing session

The attached log should reflect the wireless next tree, with the patch I provided applied.  (Except that I've left the bottom part of the while loop enabled, so as to reproduce the failure).

I believe the problem case starts at time index 30.893034.

(I had instrumented the 0 byte command failure; so I generally search for '0 byte cmd' and then look backwards to hunt for the reason we got a 0 byte command).

Again, my sense of the events is:
  30.893034 Event (0x00000008) interrupt arrives, we disable
  30.893300 We detect a command response, reloop, and start processing
  30.893300 Command (0x00000004) interrupt arrives, we stash it into
            adapter->int_status
  30.893314 We re-enable (but too late; damage is done)
  30.893374 We 'pop' the already processed command from adapter->int_status,
            and think it's a 0 byte cmd response
Comment 49 Anton Anikin 2016-05-12 17:08:47 UTC
Created attachment 216161 [details]
attachment-8673-0.html

I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
After some time of usage I am unable to turn of wifi or change the AP.
The dmesg log:

[ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432
[ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
[ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
skb->tail 76, skb->end 2432
[ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
[ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw: mwifiex_pcie_cmdrsp_complete
skb ffff880244ee1700, skb->len 8, skb->data_len 0, skb->head
ffff880242b2a940, skb->data ffff880242b2a984, skb->tail 76, skb->end 2432


2016-05-12 16:31 GMT+03:00 <bugzilla-daemon@bugzilla.kernel.org>:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> --- Comment #48 from Jeremy White <jwhite@codeweavers.com> ---
> Created attachment 216101 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
> Log of failing session
>
> The attached log should reflect the wireless next tree, with the patch I
> provided applied.  (Except that I've left the bottom part of the while loop
> enabled, so as to reproduce the failure).
>
> I believe the problem case starts at time index 30.893034.
>
> (I had instrumented the 0 byte command failure; so I generally search for
> '0
> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
> command).
>
> Again, my sense of the events is:
>   30.893034 Event (0x00000008) interrupt arrives, we disable
>   30.893300 We detect a command response, reloop, and start processing
>   30.893300 Command (0x00000004) interrupt arrives, we stash it into
>             adapter->int_status
>   30.893314 We re-enable (but too late; damage is done)
>   30.893374 We 'pop' the already processed command from
> adapter->int_status,
>             and think it's a 0 byte cmd response
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>
Comment 50 Anton Anikin 2016-05-14 19:36:02 UTC
Created attachment 216261 [details]
attachment-11994-0.html

Well, I have tested it for 3 days and it seems to work fine

2016-05-12 20:08 GMT+03:00 Anton Anikin <Anton@anikin.name>:

> I have applied the patch above to the 4.6-rc7 kernel on my Surface Pro 3.
> After some time of usage I am unable to turn of wifi or change the AP.
> The dmesg log:
>
> [ 8155.017776] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017781] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017784] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017789] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
> [ 8155.017867] mwifiex_pcie 0000:01:00.0: disable_host_int
> [ 8155.017873] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_process_cmd_complete skb ffff880244ee1700, skb->len 12,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a980,
> skb->tail 76, skb->end 2432
> [ 8155.017875] mwifiex_pcie 0000:01:00.0: enable_host_int
> [ 8155.017879] mwifiex_pcie 0000:01:00.0: jpw:
> mwifiex_pcie_cmdrsp_complete skb ffff880244ee1700, skb->len 8,
> skb->data_len 0, skb->head ffff880242b2a940, skb->data ffff880242b2a984,
> skb->tail 76, skb->end 2432
>
>
> 2016-05-12 16:31 GMT+03:00 <bugzilla-daemon@bugzilla.kernel.org>:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>>
>> --- Comment #48 from Jeremy White <jwhite@codeweavers.com> ---
>> Created attachment 216101 [details]
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=216101&action=edit
>> Log of failing session
>>
>> The attached log should reflect the wireless next tree, with the patch I
>> provided applied.  (Except that I've left the bottom part of the while
>> loop
>> enabled, so as to reproduce the failure).
>>
>> I believe the problem case starts at time index 30.893034.
>>
>> (I had instrumented the 0 byte command failure; so I generally search for
>> '0
>> byte cmd' and then look backwards to hunt for the reason we got a 0 byte
>> command).
>>
>> Again, my sense of the events is:
>>   30.893034 Event (0x00000008) interrupt arrives, we disable
>>   30.893300 We detect a command response, reloop, and start processing
>>   30.893300 Command (0x00000004) interrupt arrives, we stash it into
>>             adapter->int_status
>>   30.893314 We re-enable (but too late; damage is done)
>>   30.893374 We 'pop' the already processed command from
>> adapter->int_status,
>>             and think it's a 0 byte cmd response
>>
>> --
>> You are receiving this mail because:
>> You are on the CC list for the bug.
>>
>
>
>
>
>
Comment 51 fc 2016-05-17 16:31:45 UTC
I have been running a version of the patch with the power management changes removed (i.e. power management turned on) to see if it made a difference. I had several instances of unusable wifi (until next reboot) in 2 days. So disabling power management definitely seems to be part of the solution.
Comment 52 David López 2016-05-29 17:46:17 UTC
Created attachment 218101 [details]
Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)

I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I use the package linux-surfacepro3 4.6-1 (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error message in dmesg which didn't appear in previous versions of the package.

I don't know if this error message could be of interest.
Comment 53 Jeremy White 2016-06-03 15:50:32 UTC
(In reply to David López from comment #52)
> Created attachment 218101 [details]
> Trace error Surface Pro 3 (kernel 4.6, Jeremy White's patch, archlinux)
> 
> I use archlinux in a Surface Pro 3, I think is the same wifi card as SP4. I
> use the package linux-surfacepro3 4.6-1
> (https://aur.archlinux.org/packages/linux-surfacepro3/), which includes
> kernel 4.6 and Jeremy White's patch. Although wifi works, I receive an error
> message in dmesg which didn't appear in previous versions of the package.
> 
> I don't know if this error message could be of interest.

That's not related or (as far as I can tell) especially harmful.  If you look at the referenced line of code (net/wireless/core.c, line 363) you see it's a message generated because we have a set_tx_power function without a get_tx_power function.  (The second message is the same; but for get_antenna).  I suspect you're seeing it because you've updated your driver and/or firmware.
Comment 54 Stephen Just 2016-06-17 19:58:02 UTC
There is some new firmware available from Marvell [1]. So far, I've had no issues with it, but I have only tried it for a few days. Perhaps this new firmware makes any patch unnecessary?

I've actually managed to dig up some release notes in ChromiumOS's commit logs that suggest the newer firmware may help with stability [2].

[1] http://git.marvell.com/?p=mwifiex-firmware.git;a=tree;f=mrvl;hb=HEAD
[2] https://chromium.googlesource.com/chromiumos/third_party/marvell/
Comment 55 Christoph 2016-07-12 21:59:26 UTC
The Marvell firmware from link [1] in (previous) comment #54 does not help avoiding the random disconnects on my Surface Pro 3 with openSUSE Tumbleweed kernel 4.6.3. Not sure if this kernel carries any relevant patches compared to upstream.
Comment 56 Amitkumar Karwar 2016-07-15 13:42:09 UTC
I have just submitted patches to resolve original issue reported in this bug. i.e. below error messages followed by command timeout.

[   11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[   11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp

https://patchwork.kernel.org/patch/9232091/
https://patchwork.kernel.org/patch/9232093/
Comment 57 Hieu Bui 2016-08-02 02:40:20 UTC
(In reply to Amitkumar Karwar from comment #56)
> I have just submitted patches to resolve original issue reported in this
> bug. i.e. below error messages followed by command timeout.
> 
> [   11.522123] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
> [   11.680412] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
> 
> https://patchwork.kernel.org/patch/9232091/
> https://patchwork.kernel.org/patch/9232093/

It seems more stable but the 'CMD_RESP' problem persists on my system (Surface Pro 3 running Linux 4.7.0).

    7.720645] mwifiex_pcie 0000:01:00.0: WLAN FW is active
[    7.808716] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.7.p77) 
[    7.808721] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 (15.68.7.p77) 
[    7.812614] mwifiex_pcie 0000:01:00.0 eth0: renamed from mlan0
[    9.696401] mwifiex_pcie 0000:01:00.0: info: trying to associate to 'n0wifi' bssid 4c:e6:76:71:25:1b
[    9.726026] mwifiex_pcie 0000:01:00.0: info: associated to bssid 4c:e6:76:71:25:1b successfully
[ 1145.239295] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1146.012946] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 1624.849858] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1625.236366] mwifiex_pcie 0000:01:00.0: CMD_RESP: invalid cmd resp
[ 1626.009534] mwifiex_pcie 0000:01:00.0: There is no command but got cmdrsp
[ 3988.474123] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 4c:e6:76:71:25:1b: reason code 3
[ 3988.474151] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:00:00:00:00:00:	reason code 3
Comment 58 Hieu Bui 2016-08-02 05:13:19 UTC
Created attachment 227261 [details]
Firmware in bad state after sleep on Surface pro 3

I got the previous errors (firmware in bad state, 'CMD_RESP') after sleep with the new patch. I tried to reset the pci card/rmmod and modprobe the module again without success.
Comment 59 Mikael Djurfeldt 2016-09-25 09:27:57 UTC
I'm now running a 4.7 kernel which has Amitkumar's patches from 2016-07-15 included. I get errors like this:

Sep 25 11:22:14 hat kernel: [ 1058.526154] mwifiex_pcie 0000:02:00.0: CMD_RESP: cmd 0x10f error, result=0x2

and the driver crashes after a couple of hours.

With Jeremy White's patch from 2016-05-11 above, the driver is completely stable.
Comment 60 Mikael Djurfeldt 2016-09-25 09:36:56 UTC
To clarify my last comment:

With JW:s patch, I *still* get the CMD_RESP error (which I assume is harmless) but the driver doesn't crash.

Could this have something to do with the following change?:

--- a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c	
+++ a/drivers/net/wireless/marvell/mwifiex/sta_cmd.c	
@@ -2237,7 +2237,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
 		if (ret)
 			return -1;
 
-		if (priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
+		if (0 && priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
 			/* Enable IEEE PS by default */
 			priv->adapter->ps_mode = MWIFIEX_802_11_POWER_MODE_PSP;
 			ret = mwifiex_send_cmd(priv,
@@ -2300,7 +2300,7 @@ int mwifiex_sta_init_cmd(struct mwifiex_private *priv, u8 first_sta, bool init)
 	if (ret)
 		return -1;
 
-	if (!disable_auto_ds &&
+	if (0 && !disable_auto_ds &&
 	    first_sta && priv->adapter->iface_type != MWIFIEX_USB &&
 	    priv->bss_type != MWIFIEX_BSS_TYPE_UAP) {
 		/* Enable auto deep sleep */
Comment 61 Mikael Djurfeldt 2016-10-06 07:39:51 UTC
This is the behavior for 4.8.0 (without any patches):

The driver works for 25 s. Then I get the message:

Oct  6 09:34:15 hat wpa_supplicant[1959]: wlp2s0: CTRL-EVENT-SCAN-FAILED ret=-16 retry=1

every second.

If I shut down network-manager and reload the module, it again works for 25 s, followed by the above message. I also cannot scan for any wireless networks when the message starts, although network communication works.

No sign, yet, of the CMD_RESP error which I've reported about above for previous kernels.

Should I, perhaps, report this as a new bug?
Comment 62 Mikael Djurfeldt 2016-10-06 11:23:59 UTC
I can now report that I've tested the kernel from Linus master branch. It seems like the patches by Amitkumar Karwar from July which went into the mainline kernel yesterday finally makes the mwifiex driver stable on MS Surface Pro 4, so I suggest that this bug report can now be closed.
Comment 63 Kevin 2017-02-17 17:24:59 UTC
Mikael,

What version of both the kernel and mwifiex are you using.

I am using mwifiex 1.0 (15.68.7.p77)
and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)

if the console ever goes into a suspend state, I end up with what you were seeing: "PREP_CMD: FW in reset state"

Thanks, 
Kevin
Comment 64 Christoph 2017-05-11 11:48:58 UTC
I can confirm that with newest firmware and kernel 4.10, the connection is stable on a SP3. I do get constant 5% CPU usage in a kworker/uN:N process caused by the mwifiex driver, though. Not sure if this is expected, or worth a separate bug report.
Comment 65 Christoph 2017-05-19 16:37:19 UTC
With kernel 4.11.0, the issue is back.

[    7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to '<routername>' bssid <ID>
[    7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID> successfully
[ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
[ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
Comment 66 Mikael Djurfeldt 2017-05-20 12:48:19 UTC
Created attachment 256641 [details]
no-power-save

Dear Kevin,

Sorry for taking so long to respond.

I have got this error in many different kernels. I still get it in 4.11.0.
But I also know what causes it: The current linux driver doesn't handle
power management properly. So, it is necessary to insert if-up rules which
switches off power management. In my case, where the interface is governed
by NetworkManager, I wrote a file
/etc/NetworkManager/dispatcher.d/no-power-save which I attach to this email.

With this fix, my mwifiex driver is now fully stable under kernel 4.11.0.

Best regards,
Mikael

On Fri, Feb 17, 2017 at 6:24 PM, <bugzilla-daemon@bugzilla.kernel.org>
wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=109681
>
> Kevin (kj.kuhns@gmail.com) changed:
>
>            What    |Removed                     |Added
> ------------------------------------------------------------
> ----------------
>                  CC|                            |kj.kuhns@gmail.com
>
> --- Comment #63 from Kevin (kj.kuhns@gmail.com) ---
> Mikael,
>
> What version of both the kernel and mwifiex are you using.
>
> I am using mwifiex 1.0 (15.68.7.p77)
> and kernel 4.8.6-300.fc25 (I am trying to install Fedora25)
>
> if the console ever goes into a suspend state, I end up with what you were
> seeing: "PREP_CMD: FW in reset state"
>
> Thanks,
> Kevin
>
> --
> You are receiving this mail because:
> You are on the CC list for the bug.
>
Comment 67 Ganapathi Bhat 2017-11-02 15:06:59 UTC
Hi Christoph

(In reply to Christoph from comment #65)
> With kernel 4.11.0, the issue is back.
> 
> [    7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [    7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state

Can you please share the firmware dump after the above issue occur. 

1. To manually trigger the dump you can use below command:

cat /sys/kernel/debug/mwifiex/mlan0/device_dump

2. When the dump is complete dmesg will log "mwifiex firmware dump end".

3. Collect the dump with below script:
-------------------------------------------------------------
#!/bin/bash
/sbin/ethtool --set-dump mlan0 0
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/ITCM.log

/sbin/ethtool --set-dump mlan0 1
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/DTCM.log

/sbin/ethtool --set-dump mlan0 2
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/SQRAM.log

/sbin/ethtool --set-dump mlan0 3
/sbin/ethtool --get-dump mlan0
/sbin/ethtool --get-dump mlan0 data /tmp/IRAM.log
-------------------------------------------------------------

4. Share the /tmp/*.log files collected above for further debugging.


5. Also, please share the dmesg logs generated after you have run the above command.

Thanks,
Ganapathi
Comment 68 Jeremy White 2017-11-21 16:42:58 UTC
I see this issue on a newer Surface Pro 2017, with kernel 4.14, firmware 15.68.7.p119, and the symptoms are very familiar.  Unfortunately, disabling power save does not seem to be as persistent a cure; it improves matters, but does not resolve the issue.

I've attempted to capture a dump, but I am being told that --set-dump mlan0 0 has no such device, and --set-dump wlp1s0 0 returns no such device.
Comment 69 Ganapathi Bhat 2017-11-22 05:00:30 UTC
Hi Jeremy,

Thanks for the intimation.

>>I've attempted to capture a dump, but I am being told that --set-dump mlan0 0
>>has no such device, and --set-dump wlp1s0 0 returns no such device.

Ok. Firmware dump is written into below file:
/sys/devices/virtual/devcoredump/devcd1/data

Can you please check if this file is generated for you. If generated kindly share the same.

Regards,
Ganapathi
Comment 70 Jeremy White 2017-11-22 14:44:26 UTC
Created attachment 260791 [details]
dmesg log and data file from firmware dump

Attached is a tarball with a dmesg log and the /sys/devices/virtual/devcoredump/devcd1/data
 file resulting after requesting a firmware dump.

Note that this is a 4.14 kernel, modified for Surface Pro use, but with the mwifiex module identical to the current tip of the 4.14 tree.  (This tree to be precise https://github.com/jakeday/linux-surface).

I also get this result with a variety of other kernels; it seems specific to this hardware + firmware, and perhaps to this particular use case.  (It is an odd case; a modified Fedora Core 23 stack.)

Nicely, I now get fairly consistent behavior.  That is, upon first launch, the wifi stack is very laggy - you'll get ping times of 43, 100, 243, 79, and so on (when you'd expect steady 1.43).  That shows up in interactive ssh sessions.  Further, after a modest amount of use (usually just a few meg will do it), the wireless will stop working.  You also can no longer successfully reboot or poweroff; you get a hang at kernel shutdown time.

The attached came from a session with a clean boot, debug_mask 0xffffffff, and about 3M of the kernel source downloaded before the wifi stopped functioning.
Comment 71 Andy Shevchenko 2018-01-04 21:11:00 UTC
Any updates on this from Marvell side?
Comment 72 Ganapathi Bhat 2018-01-05 15:19:17 UTC
Hi Jeremy,

Sorry for the delay. I checked the logs you shared. Here it is a scan command(0x107) timeout. But the firmware dump did not give any hints for firmware going to bad state. Moreover the scratch registers dumps tell us that the timed out command 0x107 did not reach the firmware at all. We are checking firmware dumps and will share the updates soon.


On the other hand, are you not getting below error:

> With kernel 4.11.0, the issue is back.
> 
> [    7.890616] mwifiex_pcie 0000:01:00.0: info: trying to associate to
> '<routername>' bssid <ID>
> [    7.920245] mwifiex_pcie 0000:01:00.0: info: associated to bssid <ID>
> successfully
> [ 7834.643967] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed
> [ 8947.336070] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state



Thanks,
Ganapathi
Comment 73 Jeremy White 2018-01-05 17:48:40 UTC
(In reply to Ganapathi Bhat from comment #72)
> Hi Jeremy,
> 
> Sorry for the delay. I checked the logs you shared. Here it is a scan
> command(0x107) timeout. But the firmware dump did not give any hints for
> firmware going to bad state. Moreover the scratch registers dumps tell us
> that the timed out command 0x107 did not reach the firmware at all. We are
> checking firmware dumps and will share the updates soon.

Thanks, I look forward to the update.

> 
> 
> On the other hand, are you not getting below error:

No, I don't recall seeing that error recently.  Note that was Christoph that reported that, not I. I haven't tried with 4.11.

Cheers,

Jeremy
Comment 74 HughC 2018-02-27 07:30:34 UTC
I am seeing that error message on a Surface Pro 2017. The connection appears stable regardless of powersaving but after suspending I get:
[  361.092570] mwifiex_pcie 0000:01:00.0: Firmware wakeup failed

Followed by an endless loop of:
[  366.098914] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state
[  366.098926] mwifiex_pcie 0000:01:00.0: scan failed: -1
Comment 75 Chatty 2018-12-08 01:19:44 UTC
I owned a Surface Book and now own a Surface Book 2. And both were/are afffected by this. Still with kernel 4.20-rc5 the disconnect problem persists after long heavy-usage. Does Ganapathi have some updates yet?
Comment 76 Ganapathi Bhat 2018-12-19 07:53:07 UTC
Hi Chatty,


Do you mean you see the error in #Comment74? (after suspending)?

Regards,
Ganapathi
Comment 77 verdre 2019-09-26 22:10:20 UTC
Hi,

this still seems to be an issue on a lot of Surface devices and I'm not sure if it was ever solved.

I own a Surface Pro 2017 and I'm seeing the same issues as in #Comment74, the issue happens randomly after some time (sometimes hours, sometimes minutes) and is easily reproducible by suspending and resuming multiple times. As long as you're not suspending the device, a workaround is to manually set PCIe ASPM to L0-only for the wifi card.

I'm running a 5.3 kernel right now, are there any patches I could try?
Comment 78 Andy Shevchenko 2019-09-27 06:23:22 UTC
(In reply to verdre from comment #77)
> Hi,
> 
> this still seems to be an issue on a lot of Surface devices and I'm not sure
> if it was ever solved.
> 
> I own a Surface Pro 2017 and I'm seeing the same issues as in #Comment74,
> the issue happens randomly after some time (sometimes hours, sometimes
> minutes) and is easily reproducible by suspending and resuming multiple
> times. As long as you're not suspending the device, a workaround is to
> manually set PCIe ASPM to L0-only for the wifi card.
> 
> I'm running a 5.3 kernel right now, are there any patches I could try?

It seems a firmware issue and Marvell simple doesn't care. Perhaps power of social networks can help.
Comment 79 Andy Shevchenko 2019-09-27 06:28:23 UTC
Reassign directly to developer from Marvell.
Comment 80 verdre 2019-09-30 09:30:10 UTC
I've been looking into this for the last few days and it turns out the issue is definitely caused by PCI link power management: Using the CONFIG_PCIEASPM_DEBUG config option of the kernel I was able to track the issue down to the PCI ASPM L1.2 power saving state (set link_state of the parent PCI bridge to 15 or 111 to enable everything except L1.2 [1]).

Not sure how to continue at this point, since PCIe ASPM is implemented in hardware and debugging would require special equipment, I really hope someone at Marvell steps up and has a look into this. It might also be possible that L1.2 is disabled on Windows, but I haven't found tools to read the L1 substates from the PCI configuration on Windows yet.

Another interesting fact is that the random command timeouts I'm looking into here only seem to happen on some Surface devices using the 8897 chip via PCI, that is the Pro 4, Pro 2017, the Pro 6, Laptop 2 (see comments in [2]), but not on the Surface Book 2 (according to owners of the device). There are also multiple reports of the same issue on completely different devices to be found on the internet ([3], [4]), but those never got any attention by Marvell developers.

[1] https://github.com/torvalds/linux/blob/master/drivers/pci/pcie/aspm.c#L28
[2] https://github.com/jakeday/linux-surface/issues/163
[3] https://www.spinics.net/lists/linux-wireless/msg159776.html
[4] https://www.spinics.net/lists/linux-wireless/msg175943.html
Comment 81 Ganapathi Bhat 2019-09-30 09:39:58 UTC
Hi Andy/Verdre,

Last time I started working on this, I failed to fix due to different reasons; From past few days I see still the issue is same as given in comment#74;

>Using the CONFIG_PCIEASPM_DEBUG config option of the kernel I was able to
>track the issue down to the PCI ASPM L1.2 power saving state (set link_state
>of the parent PCI bridge to 15 or 111 to enable everything except L1.2 [1]).

This is very good help for us to continue; I will try recreate this again by flashing latest Fedora on MS - Surface 4 and share the updates here soon;

Thanks for this;

Regards,
Ganapathi
Comment 82 verdre 2019-09-30 14:40:54 UTC
Hi Ganapathi,

thank you for picking this up again. To reproduce the issue it should be enough to make sure L1.2 is enabled and can be entered (all other L-states are enabled), that should be the case by default when using Fedora 30. The issue might then appear at any time, although it seems to happen more quickly when downloading a file.

Additionally, the reason why the "pcie_aspm=off" kernel cmdline parameter doesn't work for people is that the kernel follows ACPI rules here [1] and refuses to take over control of the ASPM configuration because ACPI requires ASPM to be enabled.

Also I assume the reason why some people were only seeing this issue after suspending is that the Surface firmware seems to set the ASPM bits to enabled after the device was suspended for some time (I assume it happens when the EC was in deep sleep) by itself, so if the distro or kernel version handles ASPM differently and maybe disables it by default, it would definitely be enabled after suspending and the issue would start to appear.

Note that the errors don't have to begin with "Firmware wakeup failed" as described in Comment#74. Since it seems like we're dealing with a complete loss of communication with the card here, the error might as well be a command timeout.

If there's anything else I can do to help resolving this issue let me know!

[1] https://github.com/torvalds/linux/blob/master/drivers/acpi/pci_root.c#L464

Thanks,
Jonas
Comment 83 Ganapathi Bhat 2019-10-01 08:56:09 UTC
Hi Verdre,

Could you give inputs on how did you get Fedora 30 installed on Surface 4; I created the image using 'dd' and tried to boot from it;It would get to grub, after I was getting the _ cursor forever;

Regards,
Ganapathi
Comment 84 verdre 2019-10-01 09:12:00 UTC
> Could you give inputs on how did you get Fedora 30 installed on Surface 4; I
> created the image using 'dd' and tried to boot from it;It would get to grub,
> after I was getting the _ cursor forever;

That's weird, not sure why that wouldn't be working, if you get to grub you probably did everything correctly... I'm using a Surface Pro 2017, which is the Pro 4's successor though. Maybe make sure secure-boot (and maybe the TPM, too) is disabled?

Also it seems like someone had the same problem as you with a Pro 4 and Fedora, so you might just want to try Ubuntu: https://www.reddit.com/r/SurfaceLinux/comments/7y4k29/success_fedora_27_on_a_surface_pro_4/

Or you could try out Fedora 31 beta, which should have a 5.3 kernel: https://download.fedoraproject.org/pub/fedora/linux/releases/test/31_Beta/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-31_Beta-1.1.iso
Comment 85 Ganapathi Bhat 2019-10-01 19:45:51 UTC
Hi Verdre,

>Or you could try out Fedora 31 beta, which should have a 5.3 kernel
Thanks for this, it worked for me;

Now, I try to recreate like below(power save enabled):
1. connect to a known AP
2. ping to AP for few seconds
3. suspend(use rtcwake) and resume
4. goto #1

I do not see the error in comment#74(or a command/TX timeout yet), but let me try it for longer time;

>To reproduce the issue it should be enough to make sure L1.2 is enabled and
>can be entered
I could not yet confirm this; Do you have any inputs on how to confirm this setting is as per above?

Regards,
Ganapathi
Comment 86 verdre 2019-10-01 20:21:37 UTC
Hi Ganapathi,

I'm making progress with figuring out potential workarounds for this issue right now, seems like the "Firmware wakeup failed" error on resume indeed is caused by another issue (that might still be related to PCI connection/ASPM issues) than the command timeout: I wasn't seeing the "Firmware wakeup failed" error at all so far, it only started appearing after I went back to a clean kernel and tried disabling L1.2 to fix the command timeout. While that indeed fixed the command timeout issue, the "Firmware wakeup failed" issue wasn't fixed. Turns out I had at least two changes in my modified kernel sources that (combined) fix the "Firmware wakeup failed" issue, I'm trying to isolate those changes right now.

Your testing setup sounds correct and should trigger the "Firmware wakeup failed" issue (it happens after only two suspend cycles here), maybe make sure that S0Ix sleep is actually reached during suspend by checking if /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us and /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us are not 0.

The command timeout should happen quite quickly (less than 5 minutes) when downloading data, although I can't say anything for devices other than my Pro 2017, the occurrences might be a lot rarer on other devices.

You can check which L-states and substates are enabled by executing "lspci -vvv" as root. The L0s and L1 state can be found under "Capabilities: Express (v2) Endpoint -> LnkCtl" and the L1-substates can be found under "Capabilities: L1 PM Substates -> L1SubCtl1" (enabled substates are marked with a "+" after the state, disabled ones with a "-"). Or you can compile your kernel with CONFIG_PCIEASPM_DEBUG and read the value from /sys/bus/pci/devices/PARENT_BRIDGE/power/link_state

Thanks,
Jonas
Comment 87 Ganapathi Bhat 2019-10-01 20:53:48 UTC
Hi Verdre,

>I'm making progress with figuring out potential workarounds for this issue
>right now, seems like the "Firmware wakeup failed" error on resume indeed is
>caused by another issue (that might still be related to PCI connection/ASPM
>issues) than the command timeout
OK; Thanks for this; I read it as 'these two issues have different root causes';

>The L0s and L1 state can be found under "Capabilities: Express (v2) Endpoint
>-> LnkCtl"
---
LnkCtl:	ASPM L0s L1 Enabled;
---

>the L1-substates can be found under "Capabilities: L1 PM Substates ->
>L1SubCtl1" 
---
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
---

>make sure L1.2 is enabled and can be entered
it is enabled(above output of lspci -vvv); but is there a seperate check to know if it could be *entered*;


>maybe make sure that S0Ix sleep is actually reached during suspend by checking
cat /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
cat: /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us: No such file or directory

cat /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us 
3565055

>Your testing setup sounds correct and should trigger the "Firmware wakeup
>failed" issue
>The command timeout should happen quite quickly

I am doing different try outs, I will update as soon I could recreate these;

Regards,
Ganapathi
Comment 88 verdre 2019-10-01 22:41:53 UTC
Hi Ganapathi,

> OK; Thanks for this; I read it as 'these two issues have different root
> causes';

Not sure if that's the case since we enable link power saving after every suspend again.

> it is enabled(above output of lspci -vvv); but is there a seperate check to
> know if it could be *entered*;

I don't think there is, maybe there are unofficial debugging tools by Intel or related manufacturers to read statistics about those states from the chipset.

> cat: /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us: No
> such file or directory

Sounds like the processor used in the Pro 4 isn't new enough, from the cpu residency it sounds like S0Ix is working. You can read more about those states here: https://01.org/blogs/qwang59/2018/how-achieve-s0ix-states-linux


Also I figured out what the fix for the "Firmware wakeup failed" errors I wrote about was: Sleeping for about a second before writing to the PCI power state register, so not really a fix, just a workaround...

If you fail to reproduce the issue on the Pro 4 maybe you can get a Pro 2017 and try it on that?

Thanks,
Jonas
Comment 89 Ganapathi Bhat 2019-10-03 15:29:13 UTC
Hi Verdre,

Could you kindly share the dmesg logs, when the issue is recreated; You could use below method to collect the log:

1. dmesg -c
2. echo 0xffffffff > /sys/kernel/debug/mwifiex/mlan0/debug_mask
3. while true; do dmesg -c >> dmesg.txt; done &
4. run your use case
5. share dmesg.txt once you could hit the issue

Regards,
Ganapathi
Comment 90 verdre 2019-10-03 18:01:01 UTC
Hi Ganapathi,

getting the logs turned out to be harder than I expected, with the while-loop method the issue didn't happen (while the download speed of wget, which I used to reproduce the issue, sometimes dropped to near 0, it always managed to recover and went back up again).

Since the systemd-journal stores the kernel messages anyway, I'll attach a complete journald kernel log and the dmesg output after the issue happened.

Thanks,
Jonas
Comment 91 verdre 2019-10-03 18:01:57 UTC
Created attachment 285333 [details]
dmesg of command timeout issue
Comment 92 verdre 2019-10-03 18:03:33 UTC
Created attachment 285335 [details]
systemd journal kernel log from boot until command timeout issue
Comment 93 Ganapathi Bhat 2019-10-05 18:56:27 UTC
Hi Verdre,

Thanks for above log;

1. Could you confirm above issue is seen only in connected state:
[a] say, you keep scanning for AP but do not connect to any AP;
[b] along with #[a], you do suspend and resume; 
[c] after a specific time(you could try different duration, based on how fast you use to see the issue) you try to connect to any AP;
[d] do we see #[c] pass(could connect) or fail; if it does pass, then would it fail after few more suspend and resume;

2. could you kindly try the same use case by disabling power_save(if not tried already); You could just use below command before connection: 

iw dev mlan0 set power_save off

Let me know, if am not clear;

Regards,
Ganapathi
Comment 94 Ganapathi Bhat 2019-10-06 01:34:00 UTC
>2. could you kindly try the same use case by disabling power_save
I mean, your use case(not #1)

Regards,
Ganapathi
Comment 95 Tsuchiya Yuto (kitakar5525) 2019-10-06 12:02:49 UTC
Hi Verdre and Ganapathi,

I own Surface Book 1 (Skylake, same as Surface Pro 4) and Surface 3 (non-pro, Intel Atom Cherry Trail) for daily usage
and recently I got Surface Pro 4 (with broken sensor) for debugging purposes.

env:
- freshly installed Fedora 31 Beta on an external USB drive and use it on SP4
- run `sudo dnf upgrade` and now `uname -r` shows `5.3.2-300.fc31.x86_64`

@Ganapathi
> I am doing different try outs, I will update as soon I could recreate these;

You mean you're having trouble recreating issues?
On my SP4:
- "cmd_timeout"
     This issue will occur after a second suspend like the dmesg log on Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off. So, I'm not sure why you cannot recreate this issue.

- "Firmware wakeup failed"
     This issue will randomly occur after a while when using with  5GHz AP+power_save on

- "WLAN FW already running! Skip FW dnld" after card reset
     As you can see on the dmesg log (Comment#91), after card reset, the firmware will not be re-downloaded on Surface devices. I think that re-downloading the firmware is the expected behavior.
You may recreate this issue by this command:

echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

After resetting, it shows these messages repeatedly on dmesg and wifi is not usable anymore:

kern  :info  : [  175.478500] mwifiex_pcie 0000:02:00.0: cmd_wait_q terminated: -110
kern  :info  : [  175.478510] mwifiex_pcie 0000:02:00.0: deleting the crypto keys

Maybe this is the simplest one to fix? and at least if resetting works, we can continue to use wifi.

To reset the card with firmware re-downloading, currently I have to call an ACPI method using acpi_call and remove/rescan the parent bridge of wifi (need to be installed) like this:

WIFI_PARENT="0000:00:1d.0" # SP4 specific
ACPI_WIFI_RST_PATH="\_SB.PCI0.RP09.PXSX.PRWF._RST" # SP4 specific
acall(){ echo "$1" | tee /proc/acpi/call >/dev/null && cat /proc/acpi/call;echo;} # wrapper

modprobe acpi_call
acall $ACPI_WIFI_RST_PATH
echo 1 | tee "/sys/bus/pci/devices/$WIFI_PARENT/remove"
sleep 1
echo 1 | tee /sys/bus/pci/rescan

---

By the way, I find 2.4GHz AP is much more stable with even power_save on. 5GHz AP with power_save on causes wifi stop without any message on dmesg.

It seems that showing APs (e.g. `nmcli d wifi list`) will temporarily fix the wifi stop.

This is another issue from what we're discussing now, though.

Best regards,
Tsuchiya, Yuto (kitakar5525)
Comment 96 Tsuchiya Yuto (kitakar5525) 2019-10-06 12:20:14 UTC
> acpi_call and remove/rescan the parent bridge of wifi (need to be installed)

Sorry, I made a mistake
acpi_call (need to be installed) and remove/rescan the parent bridge of wifi
Comment 97 Ganapathi Bhat 2019-10-07 08:36:30 UTC
Hi Tsuchiya/Verdre,

>- "cmd_timeout"
>     This issue will occur after a second suspend like the dmesg log on
>     >Comment#91 regardless of 2.4GHz/5GHz AP or power_save on/off


1. Command timeout after second suspend could be recreated in my Surface 4;

2. Logs were similar to the one in #91 & #92 by Verdre;

3. It is found that the command, which timed out did not reach the firmware at all and am trying to understand more;

4. Once a command times out, driver dumps few scratch registers and also it triggers the firmware dump; But I fail to get these in my device(I don't know why, it is a different problem); So, I still fail to confirm we are hitting same issue;

Note that, when the dump is uploaded to user-space, dmesg will say:
"== mwifiex dump information to /sys/class/devcoredump end"

Could you please share the dump(along with dmesg), which is saved in below file(it is automatically deleted after 5 minutes):
********************************************
/sys/devices/virtual/devcoredump/devcd1/data
********************************************

>- "Firmware wakeup failed"
This could also be recreated for the first time now; I kept Surface 4 connected to AP and PING overnight; But I have no much detail yet;

I will comeback soon on remaining two concerns you raised in comment#95;


Regards,
Ganapathi

Regards,
Ganapathi
Comment 98 Tsuchiya Yuto (kitakar5525) 2019-10-07 12:35:21 UTC
> But I fail to get these in my device(I don't know why, it is a different
> problem); So, I still fail to confirm we are hitting same issue;

Yeah, the device_dump is also a problematic one on Surface devices.
It's reported that if_ops.device_dump() will cause lockups [1] (e.g. prevent system power off)

I also confirmed that by
1. adding debugfs entry for calling that function
2. and called the function via debugfs (echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump)
Sometimes I can get the device_dump log, sometimes I can't and it prevents system power off. I will post the patch which adds the debugfs entry to the next comment.

I will post the dump log (along with the dmesg log) as soon as I manage to get the logs after wifi malfunctioning.

By the way, the user manual on drivers/net/wireless/marvell/mwifiex/README describes that device_dump can be obtained by `cat fw_dump` but the debugfs entry does not exist anymore.



References:
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730924/comments/99
Comment #99 :  Bug #1730924 “Wifi does down “crash” in Surface Pro 4” : Bugs : linux package : Ubuntu
Comment 99 Tsuchiya Yuto (kitakar5525) 2019-10-07 12:40:35 UTC
Created attachment 285391 [details]
[PATCH] mwifiex: debugfs: add entry for device_dump

This is a patch to add debugfs entry for device_dump.
Just to test device_dump functionality on Surface devices.

Usage:

  echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/device_dump
Comment 100 verdre 2019-10-10 13:47:54 UTC
Hi Ganapathi and Tsuchiya,

here's a dmesg and a firmware dump of the command timeout issue. It wasn't hard to get it, the devcoredump entry appeared as soon as "mwifiex dump information to /sys/class/devcoredump end" was written to the log.

Regards,
Jonas
Comment 101 verdre 2019-10-10 13:48:39 UTC
Created attachment 285445 [details]
dmesg of command timeout issue
Comment 102 verdre 2019-10-10 13:49:47 UTC
Created attachment 285447 [details]
firmware dump of command timeout issue
Comment 103 Ganapathi Bhat 2019-10-16 17:35:02 UTC
>> 1. Command timeout after second suspend could be recreated in my Surface 4;
more specific observation on this(you could cross check, if possible):
1. timeout happens after a resume, if we see below message in dmesg:
   "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

2. if device is connected before the 1st suspend, no issue is seen after 1st resume;

3. if device is connected before 2nd suspend(i.e after 1st suspend_resume), then issue is seen after 2nd resume
    
4. if device is not at all connected during suspend_resume stress test, no issue is seen

Regards,
Ganapathi
Comment 104 Ganapathi Bhat 2019-10-17 02:50:28 UTC
Hi verdre,

>> went back to a clean kernel and tried disabling L1.2 to fix the command
>> timeout. While that indeed fixed the command timeout issue, the "Firmware
>> wakeup 
failed" issue wasn't fixed.

could you share your changes, with which command timeout is not seen;

Regards,
Ganapathi
Comment 105 Ganapathi Bhat 2019-10-18 04:29:29 UTC
Created attachment 285537 [details]
0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume

Hi Tsuchiya/Verdre,

Could please give a try with attached change;

Regards,
Ganapathi
Comment 106 Andy Shevchenko 2019-10-18 07:42:29 UTC
(In reply to Ganapathi Bhat from comment #105)
> Created attachment 285537 [details]
> 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume
> 
> Hi Tsuchiya/Verdre,
> 
> Could please give a try with attached change;
> 
> Regards,
> Ganapathi


I'm wondering if it has any effect on any system. PCI core *does* the same (*) for you already. Does it mean we have some race there?

*) See implementation of pci_pm_resume().
Comment 107 Ganapathi Bhat 2019-10-18 10:10:50 UTC
Hi Andy,

>PCI core *does* the same (*) for you already. 

OK; I see, with this change, command timeout is not recreated; without this change, just after second iteration I see the issue(as given in comment#103);

>Does it mean we have some race there?
I need to understand more on this; I was just trying this change because, different open source wireless drivers have this in their resume handler;

Regards,
Ganapathi
Comment 108 Tsuchiya Yuto (kitakar5525) 2019-10-19 15:15:08 UTC
> 1. timeout happens after a resume, if we see below message in dmesg:
>    "mwifiex_pci 0000:02:00.0: Refused to change power state, currently in D0"

For me, it says instead:
`mwifiex_pcie 0000:02:00.0: Refused to change power state, currently in D3`
 
> 2. if device is connected before the 1st suspend, no issue is seen after 1st
> resume;
> 3. if device is connected before 2nd suspend(i.e after 1st suspend_resume),
> then issue is seen after 2nd resume

Yes. However, the "cmd_timeout" will sometimes also occur after 1st resume for me. Also in this case, the message from mwifiex is the same as No. 1.
     
> 4. if device is not at all connected during suspend_resume stress test, no
> issue is seen

For me, the "cmd_timeout" will occur regardless of AP connection status/rfkill status/power_save status.
What changed by No. 4 is that, it will not cause the "lockup" (e.g. not preventing system shutdown)

Also, I tried the patch 0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume on comment#105.
- So far, the wifi is still working after several suspend.
- I remember I've done exactly the same thing before and it did not
  work before. Hmm…?
- The same message as No. 1 will still be printed on dmesg after 
  the patch.
- It may still not connect to AP automatically because it fails to find APs
  (`nmcli d wifi list` shows empty).  This is easily fixed by reloading "mwifiex_pcie" module
Comment 109 verdre 2019-12-01 09:29:54 UTC
Hi Ganapathi,

since you can now reproduce the issues, are there any updates on a proper fix in the firmware or the kernel driver?

Regards,
Jonas
Comment 110 Ganapathi Bhat 2019-12-02 03:49:05 UTC
Hi Verdre,

I am yet to try on my side, the reason for why change in comment#105 don't cause the issue;

Please let know your inputs on the change shared in comment#105; We will improve the change considering different observations(comment#106, #108)

Regards,
Ganapathi
Comment 111 verdre 2019-12-26 13:40:47 UTC
Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from https://bugzilla.kernel.org/show_bug.cgi?id=109681#c105 for some time now,
Comment 112 Ganapathi Bhat 2019-12-26 13:41:01 UTC
Created attachment 286459 [details]
attachment-14890-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at ganapathi.bhat@nxp.com
Comment 113 Amitkumar Karwar 2019-12-26 13:41:02 UTC
Created attachment 286461 [details]
attachment-14943-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at amitkumar.karwar@nxp.com
Comment 114 verdre 2019-12-26 13:48:07 UTC
Sorry, last comment was sent accidentally...


Hi Ganapathi,

so I've been running a 5.4 kernel with your changes from comment#105 for some time now, and I'm still seeing the "Refused to change power state, currently in D3" messages after resuming from the second suspend.

It seems like the "Firmware wakeup failed" issue is not happening anymore with this change, but the "cmd timeout" issue is definitely still happening, also without suspending at all, so I'm seeing all the same things as Tsuchiya in comment#108.

Thank you,
Jonas
Comment 115 ganapathi.bhat@nxp.com 2020-01-06 12:15:17 UTC
Hi Verdre,

>>but the "cmd timeout" issue is definitely still happening

you could share the dmesg + firmware dump;

also, we find below discussions; could you give a try with the script below:
https://github.com/jakeday/linux-surface/issues/456

Regards,
Ganapathi
Comment 116 Ganapathi Bhat 2020-01-06 12:15:29 UTC
Created attachment 286651 [details]
attachment-16293-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at ganapathi.bhat@nxp.com
Comment 117 Tsuchiya Yuto (kitakar5525) 2020-01-10 12:24:56 UTC
Thank you for looking into this issue. Reporting current status.

== "mwifiex crash after suspend"

This issue might be the Surface devices specific problem? Not sure.
I know some ways to prevent at least mwifiex crashing:

1. Disabling d3cold for wifi device (even doable on mainline kernels):

        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.3/0000:03:00.0/d3cold_allowed # on SB1
        echo 0 | sudo tee /sys/devices/pci0000:00/0000:00:1d.0/0000:02:00.0/d3cold_allowed # on SP4

2. Your patch "0001-mwifiex-set-PCIE-to-PCI_D0-power-state-on-resume"

3. Disable bridge_d3 on mwifiex_pcie_probe()

        pdev->bus->self->bridge_d3 = false;

but with those ways, as I mentioned before (Comment 108), it sometimes not connect to AP automatically because it fails to find APs (`nmcli d wifi list` shows empty).  This is easily fixed by reloading the "mwifiex_pcie" module.

Patch for this issue we (linux-surface community) are currently using is, based this patch from sebanc:
- https://github.com/jakeday/linux-surface/issues/420#issuecomment-544615371
Actual commit is here:
- https://github.com/linux-surface/kernel/commit/c8da40c0538a1528f3efd85cd91d9e314c85dd19

The patch from sebanc does the following things:
- modify mwifiex_pcie_suspend/mwifiex_pcie_resume functions to fix crash after suspend. 
  It seems that the function mwifiex_pcie_suspend2 is from mwifiex_pcie_remove
  and the function mwifiex_pcie_resume2 is largely from mwifiex_pcie_probe.
  So, I think that it behaves in the same way as we unload modules before
  suspend and re-load modules after suspend.
- disable bridge_d3 to fix crash after suspend
- disable "auto deep sleep" (I think not needed for this issue, but introduced
  hoping to fix other issues)

I think the patch is too aggressive to be upstreamed as they are, but please use it as a reference.

The patch also fixes S0ix on KBL/KBL-R devices (such as Surface Book 2, Surface Pro 5, Surface Laptop 1 or later). SKL devices such as Surface Pro 4 or Surface Book 1 cannot achieve S0ix anyway because of other reasons outside of mwifiex.
Comment 118 Tsuchiya Yuto (kitakar5525) 2020-01-10 12:28:17 UTC
== "mwifiex crash on idle/using"

I heard this issue only from Surface 3 and SP5 owners.
It can be fixed by disabling ASPM L1 state. I think this is the hardest issue to fix. So, I hope at least we could fix the mwifiex reset feature (see below).

== "mwifiex reset feature broken"

I think this issue is common on all devices that use the 88W8897 chip.
You can reproduce this issue by the following command:

        echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset

Expected: mwifiex is usable after this reset.
Actual: mwifiex is broke after this reset with following dmesg output

        $ echo 1 | sudo tee /sys/kernel/debug/mwifiex/mlan0/reset
        mwifiex_pcie 0000:03:00.0: Resetting per request
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 3
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex...
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed
        mwifiex_pcie 0000:03:00.0: WLAN FW already running! Skip FW dnld
        mwifiex_pcie 0000:03:00.0: WLAN FW is active
        mwifiex_pcie 0000:03:00.0: Unknown api_id: 4
        mwifiex_pcie 0000:03:00.0: info: MWIFIEX VERSION: mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: driver_version = mwifiex 1.0 (15.68.19.p21)
        mwifiex_pcie 0000:03:00.0: info: trying to associate to '[SSID]' bssid [BSSID]
        mwifiex_pcie 0000:03:00.0: info: associated to bssid [BSSID] successfully
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: info: successfully disconnected from [BSSID]:
        reason code 15
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        mwifiex_pcie 0000:03:00.0: cmd_wait_q terminated: -110
        mwifiex_pcie 0000:03:00.0: deleting the crypto keys
        [...]

Also, does anyone know what is the expected behavior on function level reset? Should firmware be re-downloaded? As you can see on the dmesg log, fw is still active after the reset. ("WLAN FW already running! Skip FW dnld")

== "power_save causes connection instability at least on 5GHz APs"

"connection instability" here means not crash, but networking stop, like ping not responding.

I think this issue is common on all devices that use the 88W8897 chip.

- You can mitigate this issue by showing APs continuously (nmcli d wifi list)
- Or you can prevent this issue by disabling power_save
  (sudo iw dev mlan0 set power_save off)
Comment 119 Tsuchiya Yuto (kitakar5525) 2020-01-10 12:35:54 UTC
Also, ff someone knows devices (other than Surface) that use the same 88W8897 chip (or at least similar PCIe chip like 88W8997), let me know. I want to get the device as a reference.

Thank you.
Comment 120 Tsuchiya Yuto (kitakar5525) 2020-01-10 12:36:32 UTC
*if someone knows devices
Comment 121 ganapathi.bhat@nxp.com 2020-01-13 04:12:18 UTC
Hi Tsuchiya,

>>comment#118:

>>Also, does anyone know what is the expected behavior on function level reset?

note that FLR does require support in the device; 88W8897(PCIE) do not support this;

[we will respond to remaining comments soon]

Regards,
Ganapathi
Comment 122 verdre 2020-04-18 00:09:48 UTC
Hi everyone,

a few months have passed, any news about this bug?

Regards,
Jonas
Comment 123 Ganapathi Bhat 2020-04-18 00:10:07 UTC
Created attachment 288579 [details]
attachment-30359-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at ganapathi.bhat@nxp.com
Comment 124 Amitkumar Karwar 2020-04-18 00:10:09 UTC
Created attachment 288581 [details]
attachment-30418-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at amitkumar.karwar@nxp.com
Comment 125 verdre 2020-04-21 16:34:38 UTC
Hi everyone,

so Tsuchiya and I have been looking into this quite a lot for the last few weeks and we've identified several issues and tried to find the underlying reasons as far as possible.

We hope this information will be helpful for you in indentifying issues with the firmware and the driver so you don't have to take a shot in the dark. Because most issues and their reproducers are already extensively explained in this bugreport, we've listed what we suspect to be some of the root causes for those issues here:

1) The LTR (Latency Tolerance Reporting) messages sent by the firmware might be incorrect. With the newest firmware, the card always reports the maximum latency, which prevents the System from entering the lowest C-States (pc10 and slp_s0). I'm not sure whether that actually prevents ASPM L1.2 from working, because the command timeouts I see when ASPM L1.2 is enabled (other devices see them only with (Surface 3) L1 or (Surface Pro 6) L0s) are still happening.

2) Performing a PCI function level reset of the card with firmware redownloading only works when power-cycling the card by switching it to D3cold and then to D0 again, see [1], which implements this. As mentioned in that commit, this is a quirk that's also done by Windows and seems to be specific to only Surface devices.

3) For avoid a crash of the firmware when resuming from suspend, it's important to disable D3 for the PCI bridge, see [2]. This is another quirk that also seems to be applied by Windows.

4) There's also a bluetooth powersaving issue, which makes the bluetooth device never enter USB the suspend state as long as no LE device is connected: That's because (assuming the linux bluetooth stack works correctly) the firmware never stops sending interrupts with LE advertising reports, which can easily be confirmed by using the btmon utility and by disabling Bluetooth LE functionality in /etc/bluettoth/main.conf.

[1] https://github.com/kitakar5525/linux-surface-kernel/commit/6eba36073e85a64200f2c28e2a6ba0dbeb120bf2

[2] https://github.com/kitakar5525/linux-surface-kernel/commit/4606706679bf81ef41e701fc5e4d5b87adb983dd

Thank you!
Comment 126 Andy Shevchenko 2020-04-21 17:16:01 UTC
(In reply to verdre from comment #125)
> Hi everyone,
> 
> so Tsuchiya and I have been looking into this quite a lot for the last few
> weeks and we've identified several issues and tried to find the underlying
> reasons as far as possible.

I think, especially after acquisition, Marvell is not capable to fix this. Perhaps we simple have to move on, i.e. promote your excellent work to upstream. I can help to assure wireless maintainer that this can be accepted even w.o. Marvell involvement.

Let's simple proceed to fix this very annoying issues.
Comment 127 Pali Rohár 2020-07-01 23:38:36 UTC
Hello!

Tsuchiya & Verdre: Do you have some fixes for instability issues when card is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO wifi chips.

I described some problems related to power save mode, nl80211/cfg80211 layer and mwifiex driver in email which I sent to linux-wireless mailing list: https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

But I have not got any response about it yet and I think that without fixing those issues at cfg80211 layer we cannot fix mwifiex driver for power save mode.

Andy, are you still able to help us with (generic/cfg80211) wireless part?
Comment 128 Amitkumar Karwar 2020-07-01 23:38:52 UTC
Created attachment 290043 [details]
attachment-26841-0.html

As of 6-DEC-2019, NXP has acquired Marvell’s Wireless business unit. You can now reach me at amitkumar.karwar@nxp.com
Comment 129 Tsuchiya Yuto (kitakar5525) 2020-07-02 11:57:18 UTC
(In reply to Pali Rohár from comment #127)
> Hello!

Hello!

> Tsuchiya & Verdre: Do you have some fixes for instability issues when card
> is in 5GHz AP mode? This is issue which we are observing on 88W8997 SDIO
> wifi chips.

Unfortunately, not yet. I'm using 5GHz AP with just power_save off. Maybe
all what we can do about power_save and firmware crashing is submitting
a journal log and devcoredump. I'll attach them in the next comment.

> I described some problems related to power save mode, nl80211/cfg80211 layer
> and mwifiex driver in email which I sent to linux-wireless mailing list:
> https://lore.kernel.org/linux-wireless/20200609111544.v7u5ort3yk4s7coy@pali/

Thank you for letting us know!
Comment 130 Tsuchiya Yuto (kitakar5525) 2020-07-02 12:00:33 UTC
Created attachment 290067 [details]
mwifiex crash log on SB1 with ps_on, 5GHzAP

- power_save on
- connected to 5GHz AP
- connection instability (ping not responding)
- firmware crashing eventually

Journal log and devcoredump are included in the archive. I kept sending ping to 192.168.1.1 and the output was redirected to kmsg to show connection stability
state.

    $ ping 192.168.1.1 | sudo tee /dev/kmsg
    [...]
    64 bytes from 192.168.1.1: icmp_seq=5798 ttl=255 time=3.81 ms
    64 bytes from 192.168.1.1: icmp_seq=5799 ttl=255 time=3.50 ms
    64 bytes from 192.168.1.1: icmp_seq=5800 ttl=255 time=7.16 ms
    #
    # connection stopped suddenly
    #
    From 192.168.1.59 icmp_seq=5830 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5831 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=5832 Destination Host Unreachable
    [...]
    From 192.168.1.59 icmp_seq=6019 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6020 Destination Host Unreachable
    From 192.168.1.59 icmp_seq=6025 Destination Host Unreachable
    #
    # connection started working again
    #
    64 bytes from 192.168.1.1: icmp_seq=5801 ttl=255 time=230664 ms
    64 bytes from 192.168.1.1: icmp_seq=5802 ttl=255 time=229660 ms
    64 bytes from 192.168.1.1: icmp_seq=5803 ttl=255 time=228647 ms
    [...]
    64 bytes from 192.168.1.1: icmp_seq=6082 ttl=255 time=7.24 ms
    64 bytes from 192.168.1.1: icmp_seq=6083 ttl=255 time=80.0 ms
    64 bytes from 192.168.1.1: icmp_seq=6084 ttl=255 time=6.73 ms
    #
    # firmware crashed suddenly
    #
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    ping: sendmsg: Network is unreachable
    [...]
Comment 131 Tsuchiya Yuto (kitakar5525) 2020-07-02 12:04:12 UTC
Hi Ganapathi,

We can't do anything about power_save and firmware crashing. So, I
attached journal log and devcoredump when using with power_save on and
5GHz AP, containing 1) connection instability and 2) firmware crashing
eventually.

I hope it helps in debugging.

Thank you.
Comment 132 Andy Shevchenko 2021-08-06 15:39:07 UTC
*** Bug 195183 has been marked as a duplicate of this bug. ***
Comment 133 Andy Shevchenko 2022-01-04 11:14:22 UTC
I have reassigned this to Jonas because it seems he knows much better the state of affairs with this crappy hardware/firmware than vendor does. Jonas, if you are not comfortable, tell me what to do with this bug report (and AFAIU the v5.15+ behaves much more stable nowadays, correct?).
Comment 134 verdre 2022-01-04 18:34:10 UTC
Thanks Andy, indeed with new kernels all the stability issues should be gone.

What remains unsolved are all the firmware bugs without workarounds (see the list I put together in this email: https://lore.kernel.org/linux-wireless/771f1f07-882b-2eeb-36fa-126c9b8d5fd8@v0yd.nl/) and the patch to allow s0ix sleep during runtime, I'm not sure how/if I should follow up on that: https://lore.kernel.org/linux-wireless/20211011134238.16551-1-verdre@v0yd.nl/T/#u

Anyway, given that this bug is specifically about stability, I'd say we can finally close it.

Note You need to log in before you can comment on or make changes to this bug.