Bug 60815

Summary: Interface hangs in mwifiex_usb
Product: Drivers Reporter: Dmitry Khromov (icechrome)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: normal CC: akarwar, ardavespam, bzhao, cornogle, debbugs, gbhat, linville, suxpert, zwanem
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.10.9 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output after having to turn interface down and up again for 2 times
running kernel's .config
dmesg (DYNAMIC_DEBUG=y, +p in mwifiex and mwifiex_usb)
/var/log/messages
dmesg, Linux 3.11.0
/var/log/messages, Linux 3.11.0
Remove 4 bytes of redundant headroom for USB
dmesg, Linux 3.11.0, enable_tx_amsdu=Y
test results for scenarios in #10 and #11, Linux 3.11.0
Print pending Tx URBs
Test run 3, as in comment #13. Linux 3.11.0.
Suspend and resume, Linux 3.11.1
.config and suspend/resume test logs on x86 kernel 3.10.9
Suspend and resume, Linux 3.10.9
debug usb8797 NULL poinert dereference in suspend handler
Suspend, Linux 3.10.9 w/108751
debug usb8797 NULL poinert dereference in suspend handler v2
Suspend, Linux 3.10.9 w/108801
debug usb8797 NULL poinert dereference in suspend handler v3
Suspend, Linux 3.10.9 w/108821
0001-mwifiex-fix-hang-issue-for-USB-chipsets.patch
0002-mwifiex-fix-NULL-pointer-dereference-in-usb-suspend-.patch
tx timeout handling
Tx timeout, Linux 3.11.6 w/111991
dmesg on ubuntu 13.10
dmesg on ubuntu 13.10 livecd
Tx timeout, Linux 3.11.6, mlan0/info
Tx timeout, Linux 3.11.6, mlan0/info, system hang
3.15.0-rc5: whole dmesg before any mwifiex_usb errors
mlan0 debug file after a firmware in bad state
dmesg of 3.15.0-rc5 with mwifiex_usb firmare error (DYNDBG)

Description Dmitry Khromov 2013-08-29 18:39:59 UTC
Created attachment 107360 [details]
dmesg output after having to turn interface down and up again for 2 times

I'm running Linux 3.10.9 on Microsoft Surface Pro and have WiFi-related problems
with mwifiex_usb driver when associated with WPA2-PSK (hadn't tried others) APs:

1) Approximately every 30 minutes (pretty random) interface just hangs - no
frames in, no out. Trying to write something (e.g. use ping utility) returns
ENOBUFS. If I try to bring interface down, then up again sometimes it will start
again, sometimes - say that, suddenly, RF-kill is active
(/sys/class/rfkill/rfkillX/state is 0).
It may also hang integrated Bluetooth when this happens.
dmesg is silent.

2) Kernel says the following from time to time:

[ 9291.798173] usb 3-3: data: -ENOSR is returned
[10213.065880] usb 3-3: data: -ENOSR is returned

It may be a single message, a couple of messages in a hundreds of milliseconds
time frame, or a constant flood for tens of seconds. In my experience it is not
related with the current traffic rate.

3) Sometimes (rare) the module will cause panic on transition to STR
(unfortunately I can't figure a way to capture kernel messages in this case
since the only available to user bus in Surface Pro is USB).

4) Sometimes (rare) on resume the USB bus will just stop working. I suppose it
is caused by this module because I had never observed this with the module being
blacklisted. I should probably mention the fact that I have USB auto-suspend off
and power on for all ports in suspend (due to problems with other devices). Once
again - I don't know how to capture kernel messages in this case.

I do not think those are hardware errors - the device seems to work stable in
Windows, though I rarely use Windows, and never for more than an hour or two.

STEPS TO REPRODUCE:
Not sure. I have those problems since 3.10.3 and still don't see correlations
with anything, except, maybe, network activity - if I set iptables to drop all
traffic except DHCP it may work for hours.

I guess my description is not very detailed - but I can't track what causes the
problem.  Any help would be appreciated.
Comment 1 Dmitry Khromov 2013-08-29 18:42:51 UTC
Created attachment 107361 [details]
running kernel's .config
Comment 2 Bing Zhao 2013-08-30 18:04:26 UTC
Could you enable dynamic debug and collect the logs for me? Thanks.

1. CONFIG_DYNAMIC_DEBUG=y in kernel .config and rebuild the kernel
2. modprobe mwifiex_usb
3. echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control
4. echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control
5. start your test, and save the logs in dmesg and /var/log/messages as soon as the issue happens.
Comment 3 Dmitry Khromov 2013-09-03 10:32:49 UTC
Created attachment 107392 [details]
dmesg (DYNAMIC_DEBUG=y, +p in mwifiex and mwifiex_usb)

It's 3.10.10 and now it happens much faster - about 2 to 3 minutes after it gets associated with AP, Bluetooth hung every time. It starts at 726.007972 in attached dmesg output.

Thanks!
Comment 4 Dmitry Khromov 2013-09-03 10:34:22 UTC
Created attachment 107393 [details]
/var/log/messages

Captured at the same point as 107392.
Comment 5 Bing Zhao 2013-09-04 18:53:56 UTC
The Tx AMSDU logic seems to have some problem which causes USB bus hang during data traffic in certain scenarios. We are debugging the Tx AMSDU code.

Since Tx amsdu is by default disabled in latest mainline, you can cherry-pick this patch for 3.10.x and try again.

c9e2404 mwifiex: enable/disable tx_amsdu support via module parameter
Comment 6 Dmitry Khromov 2013-09-05 15:26:37 UTC
Created attachment 107432 [details]
dmesg, Linux 3.11.0

OK, 3.11.0 has c9e2404:

# cat /sys/module/mwifiex/parameters/enable_tx_amsdu
N

However, the issue is still here.
I have attached a tarball with several dmesg outputs:
dmesg_3.11_1 - a couple of seconds after the interface hangs, nothing special as I see.
dmesg_3.11_2 - after some time with the interface hung:

# grep Tx\ ti dmesg_3.11_2
[24268.907543] usb 3-3: 4318924992 : Tx timeout(#35), bss_type-num = 0-0
[24273.917852] usb 3-3: 4318930000 : Tx timeout(#36), bss_type-num = 0-0
[24278.928168] usb 3-3: 4318935008 : Tx timeout(#37), bss_type-num = 0-0

dmesg_3.11_3 - just after I have tried to take the interface down then up again (that didn't help - had to rmmod mwifiex_usb && modprobe mwifiex_usb)

Thanks in advance!
Comment 7 Dmitry Khromov 2013-09-05 15:32:32 UTC
Created attachment 107433 [details]
/var/log/messages, Linux 3.11.0
Comment 8 Bing Zhao 2013-09-05 18:28:00 UTC
Created attachment 107443 [details]
Remove 4 bytes of redundant headroom for USB

Apparently there is a bug in TX_AMSDU path for USB.
Please try this patch for the original "cmd timeout" issue (comment #3, with TX_AMSDU enabled).

The "Tx timeout" (comment #6, with TX_AMSDU disabled) could be a different problem. We will try to replicate this issue.
Comment 9 Dmitry Khromov 2013-09-05 21:44:14 UTC
Created attachment 107448 [details]
dmesg, Linux 3.11.0, enable_tx_amsdu=Y

# cat /sys/module/mwifiex/parameters/enable_tx_amsdu 
Y

I don't see cmd_timeouts in first attached dmesg output (right after hang).
BTW, ping says ENOBUFS again (it didn't in comment #6 test, maybe I hadn't waited enough).
Second dmesg output is made just after I had tried to reset interface and it turned RF kill on by itself (comment #6 had rfkill issue, too), and has Tx timeouts just in case you need it.

I have just encountered Bluetooth hang (309.189327). This one has cmd_timeouts. It's the third dmesg, but sorry, I had no +p set. I'll try to catch that one again with wired keyboard since BT hangs are often now. Later in this output you can see me trying to reinsert mwifiex_usb (+p this time) but mlan0 does not gets created.
Comment 10 Bing Zhao 2013-09-07 03:29:49 UTC
Could you please test this combination with 3.11.0? Thanks!

1. apply the patch in attachment 107743
2. apply commit 1211c961170cedb21c30d5bb7e2033c8720b38db (mainline)
3. enable_tx_amsdu=N
4. enable dynamic_debug for module: mwifiex, mwifiex_module, btusb
5. hcidump -XV (on the hci interface created by USB8797)
Comment 11 Bing Zhao 2013-09-07 05:58:08 UTC
Also, please run the test without Bluetooth to nail down the problem.
You can blacklist or rename the btusb module.
Comment 12 Dmitry Khromov 2013-09-07 20:15:51 UTC
Created attachment 107481 [details]
test results for scenarios in #10 and #11, Linux 3.11.0

run1/ is as in comment #10.
run2/ is as in comment #11.
Comment 13 Bing Zhao 2013-09-10 18:24:39 UTC
Created attachment 108061 [details]
Print pending Tx URBs

Dmitry, thanks for the tests.

In run1 (WLAN+BT), it seems no cmd/data timeout in the logs. There are some "data: -ENOSR is returned" messages though. Did you get the interface hang issue in this test?

In run2 (WLAN only), the data tx timeout still happens.
The debug logs between [17453.000967] and [17459.703120] are missing. And it seems that something has occurred in-between. Do you happen to have the logs for this time frame?

Could you run WLAN only case again with the followings? Thanks!

1. apply the patch in attachment 107743
2. apply the debug patch in this attachment
3. apply commit 1211c961170cedb21c30d5bb7e2033c8720b38db (mainline)
4. enable_tx_amsdu=N
5. enable dynamic_debug for modules: mwifiex, mwifiex_usb
6. when the problems (timeouts, hangs, etc.) happen again, collect mwifiex debug logs by "cat /sys/kernel/debug/mwifiex/mlan0/debug"
Comment 14 Dmitry Khromov 2013-09-12 19:15:33 UTC
Created attachment 108291 [details]
Test run 3, as in comment #13. Linux 3.11.0.

>In run1 (WLAN+BT), it seems no cmd/data timeout in the logs. There are some
>"data: -ENOSR is returned" messages though. Did you get the interface hang
>issue in this test?

Yes, I did. 

>In run2 (WLAN only), the data tx timeout still happens.
>The debug logs between [17453.000967] and [17459.703120] are missing. And it
>seems that something has occurred in-between. Do you happen to have the logs
>for this time frame?

Unfortunately, no, sorry.

>Could you run WLAN only case again with the followings? Thanks!

dmesg.0 was taken in about 6 seconds after the hang, next 4 every 20 seconds later, last one - after trying to take mlan0 down and then up again. The same applies to mlan0.*. I can grep no "time" in dmesg.*

Off timeouts topic: I should probably mention the fact that BT acts poor when USB bus is under load, e.g.: I have my homedir FS on the integrated USB 3.0 SD card reader, when Firefox parses a "heavy" web page and caches content BT mouse becomes very slow, cursor may even hang for a second or two (so do the keyboard). I start to guess it may also affect WiFi, at least such an assumption explains the fact that the hang almost never happens if the system is idle. However, in one of the previous tests I had tried to transmit a few GiBs from SD card via WiFi but hang happened more than an hour later.
By the way, the kernel still panics on resume from STR, it happened in some USB-related calls up from mwifiex_usb_recv() last time when the stack trace was visible.

Thanks in advance!
Comment 15 Bing Zhao 2013-09-14 00:00:59 UTC
When you get Wi-Fi or BT hang with USB bus under load, does WiFi/BT resume functioning (without interface down/up) if you stop the Firefox immediately?

Sequence of events in the logs:

dmesg.4
      1) All 6 Tx urbs are submitted (Time:6747.597074). No ack. we can not dequeue further Tx packets.

Dmesg.5
      2) We stopped netdev queue (Time: 6842.817214), when there are Tx 100 packets in our queue. 
      3) Deauth from app (Time: 6843.297352). Downloaded deauth command
      4) Received Rx packets
      5) Few mwifiex_usb_tx_complete() calls from USB driver(Time:6843.314217).

Now we have free Tx URBs, so started dequeueing packets until deauth command response in received.

The concern here is USB bus driver took so much time to ack submitted Tx packet.
6843.31* - 6747.59* = ~105 secs     


What is the AP model you are using? Are you able to get another AP to try?

For the suspend/resume panic, can you get the kernel/driver logs with dynamic debug enabled? Thanks.
Comment 16 Bing Zhao 2013-09-14 00:29:46 UTC
Also, could you try configuring your AP to run 11g mode?
Comment 17 Dmitry Khromov 2013-09-15 00:47:26 UTC
(In reply to Bing Zhao from comment #15)
> When you get Wi-Fi or BT hang with USB bus under load, does WiFi/BT resume
> functioning (without interface down/up) if you stop the Firefox immediately?

There are at least 3 kind of "hangs" in my experience:
1) It just slows down (cursor start to "jump", looks like packet losses), no hangs. Sometimes it lasts for seconds, sometimes - until I reconnect devices.
2) It may "freeze" for a 1~3 seconds.
3) It may hang completely. Devices will get disconnected by timeout, and usually
that means that WiFi is also going to die in the next few minutes (but that isn't always true). Reinserting mwifiex_usb won't help - firmware doesn't get downloaded, reboot helps.

SIGKILLing Firefox in cases like 1) and 2) doesn't help (it shouldn't, I think - disk caches are already flushing no matter if the writing process is still there). However, I tried to call sync() at random moments of my everyday activity, tried dd if=/dev/zero bs=8K oflag=sync of=${HOME}/zeroes and nothing happened - no WiFi or BT slowdowns, no hangs. Actually, I don't believe that slow SD card may exhaust USB bus (on the other hand, card reader's driver can possibly do). Maybe I should try an USB sniffer?

> What is the AP model you are using? Are you able to get another AP to try?

Usually it is quite old D-Link DIR-320 A2 running OpenWRT and b43 driver. This one was in all tests for this bug report. Less frequent it is OpenBSD/x86 with Ralink RT2790 in AP mode - the issue is there, too. I can also try with my cellphone in hotspot mode, if you like.

> For the suspend/resume panic, can you get the kernel/driver logs with
> dynamic debug enabled? Thanks.

I'll try in the coming days, I have to configure kexec first.

> Also, could you try configuring your AP to run 11g mode?

You mean, g only? It is bg now.

Thanks!
Comment 18 Dmitry Khromov 2013-09-15 17:28:38 UTC
Created attachment 108431 [details]
Suspend and resume, Linux 3.11.1

> For the suspend/resume panic, can you get the kernel/driver logs with dynamic
> debug enabled? Thanks.

So, there are two reproducible variants of suspend/resume problem on 3.11.1:
1) If I don't rmmod mwifiex_usb on suspend - I get NPDR (./dmesg_normmod_npdr)
2) If I do configure pm-utils to rmmod on suspend - system freezes shortly after being resumed if it had mlan0 UP on suspend. kexec to crashkernel doesn't get triggered so I can't collect the complete log (./kmsg_rmmod)
Comment 19 Bing Zhao 2013-09-16 22:53:56 UTC
I meant b/g (I was suspecting an 11n related bug). Since the AP is already running b/g you don't have to try your cell phone hotspot.

For the suspend/resume tests, please don't remove mwifiex_usb module on suspend.
Also we just simply use "echo mem > /sys/power/state" to enter suspend mode, instead of using pm-utils.

Could you try the same approach on your Surface Pro?

kernel: 3.10.9, and apply the following changes:
1. 1211c96 mwifiex: do not create AP and P2P interfaces upon driver loading
2. c9e2404 mwifiex: enable/disable tx_amsdu support via module parameter
3. apply the patch in attachment 107743
4. black list btusb driver

Test procedure:
1. modprobe mwifiex_usb
2. echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control
3. echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control
4. echo mem > /sys/power/state
5. Press Power button to wake up
6. save dmesg and /var/log/messages
Comment 20 Bing Zhao 2013-09-16 23:00:14 UTC
Created attachment 108641 [details]
.config and suspend/resume test logs on x86 kernel 3.10.9

Amitkumar Karwar tested suspend/resume on his x86 setup with kernel 3.10.9.
He mentioned that the kernel 3.11 seems have some issues. So, please switch back to 3.10.9+ for suspend/resume testing.

Attached are his .config and suspend/resume logs for your reference.
Comment 21 Dmitry Khromov 2013-09-17 00:40:32 UTC
Created attachment 108651 [details]
Suspend and resume, Linux 3.10.9

As in comment #19.

NPDR, too. The .config is what it was - I hadn't looked at Mr. Karwar's one yet. I'll try again later with a minimalistic .config.
Comment 22 Bing Zhao 2013-09-17 19:36:28 UTC
Created attachment 108751 [details]
debug usb8797 NULL poinert dereference in suspend handler

Based on the log, we have completed mwifiex_enable_hs() call and got a NULL pointer dereference before exiting mwifiex_usb_suspend().

[    1.514887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[    1.514904] IP: [<ffffffff8158ab70>] pfifo_fast_enqueue+0x90/0x90

I suspect the usb_kill_urb calls caused the NULL pointer dereference in pfifo_fast_enqueue().

Could you apply the attached debug patch to trace down the issue? Thanks!
Comment 23 Dmitry Khromov 2013-09-18 16:55:50 UTC
Created attachment 108791 [details]
Suspend, Linux 3.10.9 w/108751

Applied patch from comment #22.
Looks like the control doesn't reach patched lines.
Comment 24 Bing Zhao 2013-09-18 18:29:34 UTC
Created attachment 108801 [details]
debug usb8797 NULL poinert dereference in suspend handler v2

Could you please try again with this patch (replacing attachment 108751 [details])? Thanks!
Comment 25 Dmitry Khromov 2013-09-18 19:43:59 UTC
Created attachment 108811 [details]
Suspend, Linux 3.10.9 w/108801

This one caught it.
Comment 26 Bing Zhao 2013-09-18 21:51:00 UTC
Created attachment 108821 [details]
debug usb8797 NULL poinert dereference in suspend handler v3

Could you try again with the new debug patch that removes the carries on/off upon suspend/resume?
Comment 27 Dmitry Khromov 2013-09-19 06:28:40 UTC
Created attachment 108851 [details]
Suspend, Linux 3.10.9 w/108821

Yes, it works for both 3.10.9 and 3.11.1.
Comment 28 Bing Zhao 2013-09-20 22:47:51 UTC
Thanks for running all these tests.
I will generate patches based on attachment 107443 [details] (comment #8) and attachment 108821 [details] (comment #26) and send them upstream.
Comment 29 Bing Zhao 2013-09-20 23:19:50 UTC
Created attachment 109011 [details]
0001-mwifiex-fix-hang-issue-for-USB-chipsets.patch

I'm attaching the two patches here and next.
Dimtry, would you be able to test them before I send them upstream? Thanks!
Comment 30 Bing Zhao 2013-09-20 23:22:38 UTC
Created attachment 109021 [details]
0002-mwifiex-fix-NULL-pointer-dereference-in-usb-suspend-.patch

When you test patch 0001 and 0002, please remove all previous debug patches. Thanks!
Comment 31 Dmitry Khromov 2013-09-24 21:31:41 UTC
(In reply to Bing Zhao from comment #30)
> Created attachment 109021 [details]
> 0002-mwifiex-fix-NULL-pointer-dereference-in-usb-suspend-.patch
> 
> When you test patch 0001 and 0002, please remove all previous debug patches.
> Thanks!

Tested on vanilla 3.11.1, no unexpected behaviour.
Thanks, Bing!
Comment 32 Bing Zhao 2013-09-25 02:32:48 UTC
Thanks for testing. I sent them upstream with cc -stable.
Comment 33 Dmitry Khromov 2013-10-18 22:49:50 UTC
Back to the "Tx timeout" problem... I had switched to rw / on built-in SSD, so the card reader is not in use now, but the problem is still here. It happens randomly (e.g. WiFi worked for 4 hours, then hung, after reboot it worked for 7 mins and hung again, BT was off), but generally faster if BT is used. When it happens I can't kill wpa_supplicant if mwfiex_usb is loaded. If I rmmod mwifiex_usb and insert it again mlan0 won't appear - firmware doesn't get loaded until reboot. What testing should be done?
Comment 34 Bing Zhao 2013-10-22 18:44:06 UTC
Created attachment 111991 [details]
tx timeout handling

Attached is a patch that tries to handle the tx timeout case, although the root cause of tx timeout is unknown at this time.
Could you try this patch and send me the logs? Thanks.
Comment 35 Dmitry Khromov 2013-10-30 23:20:55 UTC
Created attachment 112841 [details]
Tx timeout, Linux 3.11.6 w/111991

Sorry it took so long.

Yes, it changes state to deauthenticated and wpa_supplicant can be killed now. Everything else is the same, firmware is still useless in this state. By the way, there is a sudden NPDR now. It comes shortly after Tx timeout #4 or #5. Should I dump kernel's log or was it planned?
Comment 36 Bing Zhao 2013-10-31 02:22:01 UTC
It looks like that the firmware is dead. You can try to issue a command to it after TX timeout. It's best to issue the command as soon as the TX timeout happens. Please run "iw mlan0 link" (assume it's still connected to AP) and "cat /sys/kernel/debug/mwifiex/mlan0/info" and check what happens. Thanks.

Please remove the patch 111991 for above test. And no, the sudden NPDR wasn't planned.
Comment 37 Jie Su 2013-11-01 15:02:13 UTC
Created attachment 113031 [details]
dmesg on ubuntu 13.10

I was running ubuntu on ``surface pro'', everything seems OK until yesterday, the wifi suddenly stops work. output of dmesg is attached, Google for ``mwifiex unable to handle kernel NULL pointer'' lead me here, although there is a slightly difference between Dmitry Khromov's and mine.

Before this, WIFI lost connect once, after reboot it works again. But now it refused to work any more (Of course it does work under windows), what is worse, the computer cannot shutdown. It stopped at ``will now halt'' and just halt without poweroff. I don't know if it was caused by wifi or not, but it happened after wifi stopped.

I haven't try to compile the kernel with this patch (no experience before). According to the dmesg, if you believe that my problem can be solved, I'll take a try. Or if there are any other way to make it work, please let me know. :)
Comment 38 Jie Su 2013-11-01 15:48:59 UTC
Created attachment 113041 [details]
dmesg on ubuntu 13.10 livecd

This is the output of dmesg on a live system for comparison. WIFI works quite well.
Comment 39 Bing Zhao 2013-11-04 20:36:33 UTC
@Comment 37: This issue should have been fixed in stable kernel 3.11.5+.
If you want to stick to 3.11.0-12-generic you need to apply this patch:

1d1f004 mwifiex: fix NULL pointer dereference in usb suspend handler

@Comment 38: The log file doesn't show any suspend attempt. If you try to suspend you will probably get the same problem @37.
Comment 40 Dmitry Khromov 2013-11-05 22:20:42 UTC
Created attachment 113541 [details]
Tx timeout, Linux 3.11.6, mlan0/info

Bing,
Attached are 2 mlan0/info captures (each after issuing iw mlan0 link). First has been taken before watchdog detected the timeout, last - shortly after Tx timeout #5.

last_cmd_resp_id differs in these two, that makes me think firmware is alive.
Reinserting mwifiex_usb made interface usable again without downloading firmware, looks like another sign of it being alive.

I hadn't yet succeed to capture the case when firmware is likely dead (the one, when mwifiex_usb fails to download firmware on reinsert, leaving the module unusable until power is cycled), still trying.
Comment 41 Dmitry Khromov 2013-11-05 22:44:21 UTC
Created attachment 113551 [details]
Tx timeout, Linux 3.11.6, mlan0/info, system hang

1) This is another hang, few minutes after the one in Comment #40.
2) 113541 had mlan0/debug instead of mlan0/info. Sorry for that.
3) I guess, this time the firmware was dead. At least, the whole system hung (sorry, the crash kernel was not loaded - no dmesg) shortly after Tx timeout #1. I guess I should retry with crash kernel. Is it possible to extract data of mlan0/info, mlan0/debug off vmcore?
Comment 42 Bing Zhao 2013-11-05 22:55:35 UTC
Dmitry, thanks for testing. In Comment 40 the firmware was still alive here, and the cmd path was still working. The tx timeout could have caused a task to hang, so you may hit the "hung task" kernel panic in Comment 41. One thing to confirm: the tx timeout happens with and without suspend/resume, right? I'm planning on getting a firmware update sometime next week.
Comment 43 Dmitry Khromov 2013-11-05 23:18:46 UTC
(In reply to Bing Zhao from comment #42)
> Dmitry, thanks for testing. In Comment 40 the firmware was still alive here,
> and the cmd path was still working. The tx timeout could have caused a task
> to hang, so you may hit the "hung task" kernel panic in Comment 41.

No, CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set. In my experience, the second hang (after module reinsertion) almost always leads to the firmware's death. Saying "death" here I mean inability of mwifiex_usb to download the fw on reinsertion. I wish it was possible to power cycle the port on the internal USB hub - that could be a good indicator, but looks like it's not.

> One thing to confirm: the tx timeout happens with and without suspend/resume,
> right?

Yes, though I usually run the tests up from the cold boot with no suspends.

> I'm planning on getting a firmware update sometime next week.

Ok, thanks, Bing!
Comment 44 Bing Zhao 2014-01-09 02:36:38 UTC
FYI, I've uploaded a new firmware for testing. I'm not sure if it helps.

git://git.marvell.com/mwifiex-firmware.git master
Comment 45 Dmitry Khromov 2014-01-25 17:52:50 UTC
> FYI, I've uploaded a new firmware for testing. I'm not sure if it helps.

Unfortunatelly, it doesn't. Tx timeout problem is still here, but it's 
sufficient to deauth and reconnect now.

By the way, there's some problem with CCMP - the traffic stops flowing 
after the first group rekeying (TKIP runs fine).
Comment 46 Bing Zhao 2014-02-04 22:57:09 UTC
Are you running the latest 3.13 kernel now? Based on comments from other people 3.13 seems work better than 3.11 and 3.12.

If you are already in 3.13 kernel, please enable dynamic debug as Comment #2 and attach the debug logs. Thanks!
Comment 47 Bing Zhao 2014-02-07 20:38:03 UTC
@Comment 45: we tested CCMP group rekey with two APs, Apple Extreme and Linksys WRT120N, issue has not been reproduced. Could you let me know your AP model?
Comment 48 Dmitry Khromov 2014-02-12 21:41:35 UTC
(In reply to Bing Zhao from comment #46)
Yes, the issue is gone with 3.13.2 and the firmware from comment #44.
Bing, thanks a lot for your help.

> we tested CCMP group rekey with two APs
It turned out that AP's hostapd was too ancient, sorry.
Comment 49 Bing Zhao 2014-02-13 05:07:59 UTC
You are welcome Dmitry and thanks for your testing. Glad to know that things are working for you now.
Comment 50 Christian Hartmann 2014-05-15 13:18:57 UTC
Hello,

I have read this bug entry carefully, also the bug with the id 69661  "mwifiex_usb on MS Surface Pro 1 is unstable" 

Currently I have issues with the latest updated Marvell USB8797-B0 firmware image
on the Microsoft Surface Pro2, the connection is suddenly broken, in the console (or messages) i got:

+May 15 12:40:35 1111130 NetworkManager[1252]: <info> (mlan0): supplicant connection state:  group handshake -> completed
+May 15 12:46:40 1111130 kernel: snd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj.
+May 15 12:47:16 1111130 kernel: snd_hda_intel 0000:00:1b.0: Unstable LPIB (131072 >= 8192); disabling LPIB delay counting
+May 15 12:47:36 1111130 kernel: usb 2-2: mwifiex_cmd_timeout_func: Timeout cmd id (1400150856.19309) = 0xa4, act = 0x0
+May 15 12:47:36 1111130 kernel: usb 2-2: num_data_h2c_failure = 0
+May 15 12:47:36 1111130 kernel: usb 2-2: num_cmd_h2c_failure = 0
+May 15 12:47:36 1111130 kernel: usb 2-2: is_cmd_timedout = 1
+May 15 12:47:36 1111130 kernel: usb 2-2: num_tx_timeout = 0
+May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_index = 4
+May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_id: 16 00 a4 00 7f 00 16 00 a4 00
+May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_act: 00 00 00 00 00 00 00 00 00 00
+May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_resp_index = 3
+May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_resp_id: 16 80 a4 80 7f 80 16 80 7f 80
+May 15 12:47:36 1111130 kernel: usb 2-2: last_event_index = 2
+May 15 12:47:36 1111130 kernel: usb 2-2: last_event: 58 00 58 00 58 00 58 00 58 00
+May 15 12:47:36 1111130 kernel: usb 2-2: data_sent=1 cmd_sent=1
+May 15 12:47:36 1111130 kernel: usb 2-2: ps_mode=1 ps_state=0
+May 15 12:47:36 1111130 kernel: usb 2-2: cmd timeout
+May 15 12:47:36 1111130 kernel: usb 2-2: failed to get signal information
+May 15 12:47:36 1111130 kernel: usb 2-2: PREP_CMD: FW is in bad state



firmware image Version: 14.68.29.p26

(see also
http://git.marvell.com/?p=mwifiex-firmware.git;a=commit;h=e24897b869455e9aea6b5892e21d72585e136376)

tested with kernel version(s): 3.13.9 till 3.15.0-rc5 

I have enabled DYNAMIC_DEBUG (as introduced in some comments above), and have some dumps (tcp, mlan0, dmesg before DYNDBG and after the firmware error)

% mount -t debugfs none /sys/kernel/debug
% echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control
% echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control


What works:

* starting and using mlan0 with the latest firmware.
* rebooting
* scp (tested with more than 1GB)
* youtube, streams

What does not work:
* using SIP (linphone)
whenever I start to talk with friends with my SIP client linphone, I got in a few seconds or minutes the "firmware is in bad state". Everytime. Its 100% reproduceable.

As mentioned above, a rmmod and modprobe will fail to reinit the device and the firmare state. A system reboot can also hang a long time. (watchdogs are barking loud...)
Comment 51 Christian Hartmann 2014-05-15 13:22:33 UTC
Created attachment 136171 [details]
3.15.0-rc5:  whole dmesg before any mwifiex_usb errors

The dmesg right after booting a fresh 3.15.0-rc5
Comment 52 Christian Hartmann 2014-05-15 13:24:58 UTC
Created attachment 136181 [details]
mlan0 debug file after a firmware in bad state

mlan0 debug file after a firmware in bad state
Comment 53 Christian Hartmann 2014-05-15 13:29:06 UTC
Created attachment 136191 [details]
dmesg of 3.15.0-rc5 with mwifiex_usb firmare error (DYNDBG)

whenever linphone is used and RTP session is active (talk with a friend), I got plenty of firmware errors
Comment 54 Dmitry Khromov 2014-05-15 18:44:38 UTC
> --- Comment #53 from Christian Hartmann <cornogle@googlemail.com> ---
> whenever linphone is used and RTP session is active (talk with a friend), I
> got
> plenty of firmware errors

Yes, I can confirm this with Surface Pro 1. The problem is still here in 
VoIP applications (at least some SIP-based, like Ekiga or linphone, 
Skype Linux client too). Symptoms are just the same as they were (the 
firmware hangs until the USB's VBus is cycled). I recall I saw some 
commits by Mr. Karwar recently that appear to me as relevant to the 
problem, not shure if they made it into 3.15.0-rc5. Hadn't tried them 
yet, though.
Comment 55 Christian Hartmann 2014-05-16 14:57:50 UTC
(In reply to Dmitry Khromov from comment #54)
> > --- Comment #53 from Christian Hartmann <cornogle@googlemail.com> ---

> commits by Mr. Karwar recently that appear to me as relevant to the 
> problem, not shure if they made it into 3.15.0-rc5. Hadn't tried them 
> yet, though.

yes, it they made it into 3.15.0-rc2, but that did not resolve the FIRMWARE IN BAD STATE.

https://git.kernel.org/cgit/linux/kernel/git/linville/wireless.git/commit/drivers/net/wireless/mwifiex?id=f8d2b9209ad648a8e9aa973d32fb64aa2ab01d00

Can this be snd_hda_intel related? My question seems silly, but at first I see a warning about "Unstable LPIB ... disabling LPIB delay counting". its than about seconds that the firmware will go into bad state again.

When I use Firefox and what streams, I got also the snd_hda_intel warning, but I get not the firmware bug. This only happends with the SIP application,

On the other hand, there is also a radio fm chip on the Wifi card, which will be another thread (no bug yet, cause there is no radio FM support over i2c yet ;) )

How can I help to debug it so somebody like Bing Zhao can make a new firmware to fix this issues as soon as possible ?? What can I do to come closer to the wrong code lines??
Comment 56 Bing Zhao 2014-07-01 22:23:06 UTC
Hi, we have recently found some missing memset (for tx_info/rx_info control buffers) which can cause random failures on both command and data path.
I'm not sure if these patches will solve this particular issue or not, but they are worth a try:

#1: https://git.kernel.org/cgit/linux/kernel/git/linville/wireless.git/commit/?id=d76744a93246eccdca1106037e8ee29debf48277

#2: http://marc.info/?l=linux-wireless&m=140425083601658&w=4
Comment 57 Ganapathi Bhat 2018-12-19 08:46:44 UTC
Hi Dmitry,

Can you kindly close the bug if it is not seen anymore.

Thanks,
Ganapathi