Bug 60815
Created attachment 107361 [details]
running kernel's .config
Could you enable dynamic debug and collect the logs for me? Thanks. 1. CONFIG_DYNAMIC_DEBUG=y in kernel .config and rebuild the kernel 2. modprobe mwifiex_usb 3. echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control 4. echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control 5. start your test, and save the logs in dmesg and /var/log/messages as soon as the issue happens. Created attachment 107392 [details]
dmesg (DYNAMIC_DEBUG=y, +p in mwifiex and mwifiex_usb)
It's 3.10.10 and now it happens much faster - about 2 to 3 minutes after it gets associated with AP, Bluetooth hung every time. It starts at 726.007972 in attached dmesg output.
Thanks!
Created attachment 107393 [details]
/var/log/messages
Captured at the same point as 107392.
The Tx AMSDU logic seems to have some problem which causes USB bus hang during data traffic in certain scenarios. We are debugging the Tx AMSDU code. Since Tx amsdu is by default disabled in latest mainline, you can cherry-pick this patch for 3.10.x and try again. c9e2404 mwifiex: enable/disable tx_amsdu support via module parameter Created attachment 107432 [details]
dmesg, Linux 3.11.0
OK, 3.11.0 has c9e2404:
# cat /sys/module/mwifiex/parameters/enable_tx_amsdu
N
However, the issue is still here.
I have attached a tarball with several dmesg outputs:
dmesg_3.11_1 - a couple of seconds after the interface hangs, nothing special as I see.
dmesg_3.11_2 - after some time with the interface hung:
# grep Tx\ ti dmesg_3.11_2
[24268.907543] usb 3-3: 4318924992 : Tx timeout(#35), bss_type-num = 0-0
[24273.917852] usb 3-3: 4318930000 : Tx timeout(#36), bss_type-num = 0-0
[24278.928168] usb 3-3: 4318935008 : Tx timeout(#37), bss_type-num = 0-0
dmesg_3.11_3 - just after I have tried to take the interface down then up again (that didn't help - had to rmmod mwifiex_usb && modprobe mwifiex_usb)
Thanks in advance!
Created attachment 107433 [details]
/var/log/messages, Linux 3.11.0
Created attachment 107443 [details] Remove 4 bytes of redundant headroom for USB Apparently there is a bug in TX_AMSDU path for USB. Please try this patch for the original "cmd timeout" issue (comment #3, with TX_AMSDU enabled). The "Tx timeout" (comment #6, with TX_AMSDU disabled) could be a different problem. We will try to replicate this issue. Created attachment 107448 [details] dmesg, Linux 3.11.0, enable_tx_amsdu=Y # cat /sys/module/mwifiex/parameters/enable_tx_amsdu Y I don't see cmd_timeouts in first attached dmesg output (right after hang). BTW, ping says ENOBUFS again (it didn't in comment #6 test, maybe I hadn't waited enough). Second dmesg output is made just after I had tried to reset interface and it turned RF kill on by itself (comment #6 had rfkill issue, too), and has Tx timeouts just in case you need it. I have just encountered Bluetooth hang (309.189327). This one has cmd_timeouts. It's the third dmesg, but sorry, I had no +p set. I'll try to catch that one again with wired keyboard since BT hangs are often now. Later in this output you can see me trying to reinsert mwifiex_usb (+p this time) but mlan0 does not gets created. Could you please test this combination with 3.11.0? Thanks! 1. apply the patch in attachment 107743 2. apply commit 1211c961170cedb21c30d5bb7e2033c8720b38db (mainline) 3. enable_tx_amsdu=N 4. enable dynamic_debug for module: mwifiex, mwifiex_module, btusb 5. hcidump -XV (on the hci interface created by USB8797) Also, please run the test without Bluetooth to nail down the problem. You can blacklist or rename the btusb module. Created attachment 107481 [details] test results for scenarios in #10 and #11, Linux 3.11.0 run1/ is as in comment #10. run2/ is as in comment #11. Created attachment 108061 [details]
Print pending Tx URBs
Dmitry, thanks for the tests.
In run1 (WLAN+BT), it seems no cmd/data timeout in the logs. There are some "data: -ENOSR is returned" messages though. Did you get the interface hang issue in this test?
In run2 (WLAN only), the data tx timeout still happens.
The debug logs between [17453.000967] and [17459.703120] are missing. And it seems that something has occurred in-between. Do you happen to have the logs for this time frame?
Could you run WLAN only case again with the followings? Thanks!
1. apply the patch in attachment 107743
2. apply the debug patch in this attachment
3. apply commit 1211c961170cedb21c30d5bb7e2033c8720b38db (mainline)
4. enable_tx_amsdu=N
5. enable dynamic_debug for modules: mwifiex, mwifiex_usb
6. when the problems (timeouts, hangs, etc.) happen again, collect mwifiex debug logs by "cat /sys/kernel/debug/mwifiex/mlan0/debug"
Created attachment 108291 [details] Test run 3, as in comment #13. Linux 3.11.0. >In run1 (WLAN+BT), it seems no cmd/data timeout in the logs. There are some >"data: -ENOSR is returned" messages though. Did you get the interface hang >issue in this test? Yes, I did. >In run2 (WLAN only), the data tx timeout still happens. >The debug logs between [17453.000967] and [17459.703120] are missing. And it >seems that something has occurred in-between. Do you happen to have the logs >for this time frame? Unfortunately, no, sorry. >Could you run WLAN only case again with the followings? Thanks! dmesg.0 was taken in about 6 seconds after the hang, next 4 every 20 seconds later, last one - after trying to take mlan0 down and then up again. The same applies to mlan0.*. I can grep no "time" in dmesg.* Off timeouts topic: I should probably mention the fact that BT acts poor when USB bus is under load, e.g.: I have my homedir FS on the integrated USB 3.0 SD card reader, when Firefox parses a "heavy" web page and caches content BT mouse becomes very slow, cursor may even hang for a second or two (so do the keyboard). I start to guess it may also affect WiFi, at least such an assumption explains the fact that the hang almost never happens if the system is idle. However, in one of the previous tests I had tried to transmit a few GiBs from SD card via WiFi but hang happened more than an hour later. By the way, the kernel still panics on resume from STR, it happened in some USB-related calls up from mwifiex_usb_recv() last time when the stack trace was visible. Thanks in advance! When you get Wi-Fi or BT hang with USB bus under load, does WiFi/BT resume functioning (without interface down/up) if you stop the Firefox immediately? Sequence of events in the logs: dmesg.4 1) All 6 Tx urbs are submitted (Time:6747.597074). No ack. we can not dequeue further Tx packets. Dmesg.5 2) We stopped netdev queue (Time: 6842.817214), when there are Tx 100 packets in our queue. 3) Deauth from app (Time: 6843.297352). Downloaded deauth command 4) Received Rx packets 5) Few mwifiex_usb_tx_complete() calls from USB driver(Time:6843.314217). Now we have free Tx URBs, so started dequeueing packets until deauth command response in received. The concern here is USB bus driver took so much time to ack submitted Tx packet. 6843.31* - 6747.59* = ~105 secs What is the AP model you are using? Are you able to get another AP to try? For the suspend/resume panic, can you get the kernel/driver logs with dynamic debug enabled? Thanks. Also, could you try configuring your AP to run 11g mode? (In reply to Bing Zhao from comment #15) > When you get Wi-Fi or BT hang with USB bus under load, does WiFi/BT resume > functioning (without interface down/up) if you stop the Firefox immediately? There are at least 3 kind of "hangs" in my experience: 1) It just slows down (cursor start to "jump", looks like packet losses), no hangs. Sometimes it lasts for seconds, sometimes - until I reconnect devices. 2) It may "freeze" for a 1~3 seconds. 3) It may hang completely. Devices will get disconnected by timeout, and usually that means that WiFi is also going to die in the next few minutes (but that isn't always true). Reinserting mwifiex_usb won't help - firmware doesn't get downloaded, reboot helps. SIGKILLing Firefox in cases like 1) and 2) doesn't help (it shouldn't, I think - disk caches are already flushing no matter if the writing process is still there). However, I tried to call sync() at random moments of my everyday activity, tried dd if=/dev/zero bs=8K oflag=sync of=${HOME}/zeroes and nothing happened - no WiFi or BT slowdowns, no hangs. Actually, I don't believe that slow SD card may exhaust USB bus (on the other hand, card reader's driver can possibly do). Maybe I should try an USB sniffer? > What is the AP model you are using? Are you able to get another AP to try? Usually it is quite old D-Link DIR-320 A2 running OpenWRT and b43 driver. This one was in all tests for this bug report. Less frequent it is OpenBSD/x86 with Ralink RT2790 in AP mode - the issue is there, too. I can also try with my cellphone in hotspot mode, if you like. > For the suspend/resume panic, can you get the kernel/driver logs with > dynamic debug enabled? Thanks. I'll try in the coming days, I have to configure kexec first. > Also, could you try configuring your AP to run 11g mode? You mean, g only? It is bg now. Thanks! Created attachment 108431 [details] Suspend and resume, Linux 3.11.1 > For the suspend/resume panic, can you get the kernel/driver logs with dynamic > debug enabled? Thanks. So, there are two reproducible variants of suspend/resume problem on 3.11.1: 1) If I don't rmmod mwifiex_usb on suspend - I get NPDR (./dmesg_normmod_npdr) 2) If I do configure pm-utils to rmmod on suspend - system freezes shortly after being resumed if it had mlan0 UP on suspend. kexec to crashkernel doesn't get triggered so I can't collect the complete log (./kmsg_rmmod) I meant b/g (I was suspecting an 11n related bug). Since the AP is already running b/g you don't have to try your cell phone hotspot. For the suspend/resume tests, please don't remove mwifiex_usb module on suspend. Also we just simply use "echo mem > /sys/power/state" to enter suspend mode, instead of using pm-utils. Could you try the same approach on your Surface Pro? kernel: 3.10.9, and apply the following changes: 1. 1211c96 mwifiex: do not create AP and P2P interfaces upon driver loading 2. c9e2404 mwifiex: enable/disable tx_amsdu support via module parameter 3. apply the patch in attachment 107743 4. black list btusb driver Test procedure: 1. modprobe mwifiex_usb 2. echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control 3. echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control 4. echo mem > /sys/power/state 5. Press Power button to wake up 6. save dmesg and /var/log/messages Created attachment 108641 [details]
.config and suspend/resume test logs on x86 kernel 3.10.9
Amitkumar Karwar tested suspend/resume on his x86 setup with kernel 3.10.9.
He mentioned that the kernel 3.11 seems have some issues. So, please switch back to 3.10.9+ for suspend/resume testing.
Attached are his .config and suspend/resume logs for your reference.
Created attachment 108651 [details] Suspend and resume, Linux 3.10.9 As in comment #19. NPDR, too. The .config is what it was - I hadn't looked at Mr. Karwar's one yet. I'll try again later with a minimalistic .config. Created attachment 108751 [details]
debug usb8797 NULL poinert dereference in suspend handler
Based on the log, we have completed mwifiex_enable_hs() call and got a NULL pointer dereference before exiting mwifiex_usb_suspend().
[ 1.514887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 1.514904] IP: [<ffffffff8158ab70>] pfifo_fast_enqueue+0x90/0x90
I suspect the usb_kill_urb calls caused the NULL pointer dereference in pfifo_fast_enqueue().
Could you apply the attached debug patch to trace down the issue? Thanks!
Created attachment 108791 [details] Suspend, Linux 3.10.9 w/108751 Applied patch from comment #22. Looks like the control doesn't reach patched lines. Created attachment 108801 [details] debug usb8797 NULL poinert dereference in suspend handler v2 Could you please try again with this patch (replacing attachment 108751 [details])? Thanks! Created attachment 108811 [details]
Suspend, Linux 3.10.9 w/108801
This one caught it.
Created attachment 108821 [details]
debug usb8797 NULL poinert dereference in suspend handler v3
Could you try again with the new debug patch that removes the carries on/off upon suspend/resume?
Created attachment 108851 [details]
Suspend, Linux 3.10.9 w/108821
Yes, it works for both 3.10.9 and 3.11.1.
Thanks for running all these tests. I will generate patches based on attachment 107443 [details] (comment #8) and attachment 108821 [details] (comment #26) and send them upstream. Created attachment 109011 [details]
0001-mwifiex-fix-hang-issue-for-USB-chipsets.patch
I'm attaching the two patches here and next.
Dimtry, would you be able to test them before I send them upstream? Thanks!
Created attachment 109021 [details]
0002-mwifiex-fix-NULL-pointer-dereference-in-usb-suspend-.patch
When you test patch 0001 and 0002, please remove all previous debug patches. Thanks!
(In reply to Bing Zhao from comment #30) > Created attachment 109021 [details] > 0002-mwifiex-fix-NULL-pointer-dereference-in-usb-suspend-.patch > > When you test patch 0001 and 0002, please remove all previous debug patches. > Thanks! Tested on vanilla 3.11.1, no unexpected behaviour. Thanks, Bing! Thanks for testing. I sent them upstream with cc -stable. Back to the "Tx timeout" problem... I had switched to rw / on built-in SSD, so the card reader is not in use now, but the problem is still here. It happens randomly (e.g. WiFi worked for 4 hours, then hung, after reboot it worked for 7 mins and hung again, BT was off), but generally faster if BT is used. When it happens I can't kill wpa_supplicant if mwfiex_usb is loaded. If I rmmod mwifiex_usb and insert it again mlan0 won't appear - firmware doesn't get loaded until reboot. What testing should be done? Created attachment 111991 [details]
tx timeout handling
Attached is a patch that tries to handle the tx timeout case, although the root cause of tx timeout is unknown at this time.
Could you try this patch and send me the logs? Thanks.
Created attachment 112841 [details]
Tx timeout, Linux 3.11.6 w/111991
Sorry it took so long.
Yes, it changes state to deauthenticated and wpa_supplicant can be killed now. Everything else is the same, firmware is still useless in this state. By the way, there is a sudden NPDR now. It comes shortly after Tx timeout #4 or #5. Should I dump kernel's log or was it planned?
It looks like that the firmware is dead. You can try to issue a command to it after TX timeout. It's best to issue the command as soon as the TX timeout happens. Please run "iw mlan0 link" (assume it's still connected to AP) and "cat /sys/kernel/debug/mwifiex/mlan0/info" and check what happens. Thanks. Please remove the patch 111991 for above test. And no, the sudden NPDR wasn't planned. Created attachment 113031 [details]
dmesg on ubuntu 13.10
I was running ubuntu on ``surface pro'', everything seems OK until yesterday, the wifi suddenly stops work. output of dmesg is attached, Google for ``mwifiex unable to handle kernel NULL pointer'' lead me here, although there is a slightly difference between Dmitry Khromov's and mine.
Before this, WIFI lost connect once, after reboot it works again. But now it refused to work any more (Of course it does work under windows), what is worse, the computer cannot shutdown. It stopped at ``will now halt'' and just halt without poweroff. I don't know if it was caused by wifi or not, but it happened after wifi stopped.
I haven't try to compile the kernel with this patch (no experience before). According to the dmesg, if you believe that my problem can be solved, I'll take a try. Or if there are any other way to make it work, please let me know. :)
Created attachment 113041 [details]
dmesg on ubuntu 13.10 livecd
This is the output of dmesg on a live system for comparison. WIFI works quite well.
@Comment 37: This issue should have been fixed in stable kernel 3.11.5+. If you want to stick to 3.11.0-12-generic you need to apply this patch: 1d1f004 mwifiex: fix NULL pointer dereference in usb suspend handler @Comment 38: The log file doesn't show any suspend attempt. If you try to suspend you will probably get the same problem @37. Created attachment 113541 [details]
Tx timeout, Linux 3.11.6, mlan0/info
Bing,
Attached are 2 mlan0/info captures (each after issuing iw mlan0 link). First has been taken before watchdog detected the timeout, last - shortly after Tx timeout #5.
last_cmd_resp_id differs in these two, that makes me think firmware is alive.
Reinserting mwifiex_usb made interface usable again without downloading firmware, looks like another sign of it being alive.
I hadn't yet succeed to capture the case when firmware is likely dead (the one, when mwifiex_usb fails to download firmware on reinsert, leaving the module unusable until power is cycled), still trying.
Created attachment 113551 [details] Tx timeout, Linux 3.11.6, mlan0/info, system hang 1) This is another hang, few minutes after the one in Comment #40. 2) 113541 had mlan0/debug instead of mlan0/info. Sorry for that. 3) I guess, this time the firmware was dead. At least, the whole system hung (sorry, the crash kernel was not loaded - no dmesg) shortly after Tx timeout #1. I guess I should retry with crash kernel. Is it possible to extract data of mlan0/info, mlan0/debug off vmcore? Dmitry, thanks for testing. In Comment 40 the firmware was still alive here, and the cmd path was still working. The tx timeout could have caused a task to hang, so you may hit the "hung task" kernel panic in Comment 41. One thing to confirm: the tx timeout happens with and without suspend/resume, right? I'm planning on getting a firmware update sometime next week. (In reply to Bing Zhao from comment #42) > Dmitry, thanks for testing. In Comment 40 the firmware was still alive here, > and the cmd path was still working. The tx timeout could have caused a task > to hang, so you may hit the "hung task" kernel panic in Comment 41. No, CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set. In my experience, the second hang (after module reinsertion) almost always leads to the firmware's death. Saying "death" here I mean inability of mwifiex_usb to download the fw on reinsertion. I wish it was possible to power cycle the port on the internal USB hub - that could be a good indicator, but looks like it's not. > One thing to confirm: the tx timeout happens with and without suspend/resume, > right? Yes, though I usually run the tests up from the cold boot with no suspends. > I'm planning on getting a firmware update sometime next week. Ok, thanks, Bing! FYI, I've uploaded a new firmware for testing. I'm not sure if it helps. git://git.marvell.com/mwifiex-firmware.git master > FYI, I've uploaded a new firmware for testing. I'm not sure if it helps.
Unfortunatelly, it doesn't. Tx timeout problem is still here, but it's
sufficient to deauth and reconnect now.
By the way, there's some problem with CCMP - the traffic stops flowing
after the first group rekeying (TKIP runs fine).
Are you running the latest 3.13 kernel now? Based on comments from other people 3.13 seems work better than 3.11 and 3.12. If you are already in 3.13 kernel, please enable dynamic debug as Comment #2 and attach the debug logs. Thanks! @Comment 45: we tested CCMP group rekey with two APs, Apple Extreme and Linksys WRT120N, issue has not been reproduced. Could you let me know your AP model? (In reply to Bing Zhao from comment #46) Yes, the issue is gone with 3.13.2 and the firmware from comment #44. Bing, thanks a lot for your help. > we tested CCMP group rekey with two APs It turned out that AP's hostapd was too ancient, sorry. You are welcome Dmitry and thanks for your testing. Glad to know that things are working for you now. Hello, I have read this bug entry carefully, also the bug with the id 69661 "mwifiex_usb on MS Surface Pro 1 is unstable" Currently I have issues with the latest updated Marvell USB8797-B0 firmware image on the Microsoft Surface Pro2, the connection is suddenly broken, in the console (or messages) i got: +May 15 12:40:35 1111130 NetworkManager[1252]: <info> (mlan0): supplicant connection state: group handshake -> completed +May 15 12:46:40 1111130 kernel: snd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj. +May 15 12:47:16 1111130 kernel: snd_hda_intel 0000:00:1b.0: Unstable LPIB (131072 >= 8192); disabling LPIB delay counting +May 15 12:47:36 1111130 kernel: usb 2-2: mwifiex_cmd_timeout_func: Timeout cmd id (1400150856.19309) = 0xa4, act = 0x0 +May 15 12:47:36 1111130 kernel: usb 2-2: num_data_h2c_failure = 0 +May 15 12:47:36 1111130 kernel: usb 2-2: num_cmd_h2c_failure = 0 +May 15 12:47:36 1111130 kernel: usb 2-2: is_cmd_timedout = 1 +May 15 12:47:36 1111130 kernel: usb 2-2: num_tx_timeout = 0 +May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_index = 4 +May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_id: 16 00 a4 00 7f 00 16 00 a4 00 +May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_act: 00 00 00 00 00 00 00 00 00 00 +May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_resp_index = 3 +May 15 12:47:36 1111130 kernel: usb 2-2: last_cmd_resp_id: 16 80 a4 80 7f 80 16 80 7f 80 +May 15 12:47:36 1111130 kernel: usb 2-2: last_event_index = 2 +May 15 12:47:36 1111130 kernel: usb 2-2: last_event: 58 00 58 00 58 00 58 00 58 00 +May 15 12:47:36 1111130 kernel: usb 2-2: data_sent=1 cmd_sent=1 +May 15 12:47:36 1111130 kernel: usb 2-2: ps_mode=1 ps_state=0 +May 15 12:47:36 1111130 kernel: usb 2-2: cmd timeout +May 15 12:47:36 1111130 kernel: usb 2-2: failed to get signal information +May 15 12:47:36 1111130 kernel: usb 2-2: PREP_CMD: FW is in bad state firmware image Version: 14.68.29.p26 (see also http://git.marvell.com/?p=mwifiex-firmware.git;a=commit;h=e24897b869455e9aea6b5892e21d72585e136376) tested with kernel version(s): 3.13.9 till 3.15.0-rc5 I have enabled DYNAMIC_DEBUG (as introduced in some comments above), and have some dumps (tcp, mlan0, dmesg before DYNDBG and after the firmware error) % mount -t debugfs none /sys/kernel/debug % echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control % echo "module mwifiex_usb +p" > /sys/kernel/debug/dynamic_debug/control What works: * starting and using mlan0 with the latest firmware. * rebooting * scp (tested with more than 1GB) * youtube, streams What does not work: * using SIP (linphone) whenever I start to talk with friends with my SIP client linphone, I got in a few seconds or minutes the "firmware is in bad state". Everytime. Its 100% reproduceable. As mentioned above, a rmmod and modprobe will fail to reinit the device and the firmare state. A system reboot can also hang a long time. (watchdogs are barking loud...) Created attachment 136171 [details]
3.15.0-rc5: whole dmesg before any mwifiex_usb errors
The dmesg right after booting a fresh 3.15.0-rc5
Created attachment 136181 [details]
mlan0 debug file after a firmware in bad state
mlan0 debug file after a firmware in bad state
Created attachment 136191 [details]
dmesg of 3.15.0-rc5 with mwifiex_usb firmare error (DYNDBG)
whenever linphone is used and RTP session is active (talk with a friend), I got plenty of firmware errors
> --- Comment #53 from Christian Hartmann <cornogle@googlemail.com> ---
> whenever linphone is used and RTP session is active (talk with a friend), I
> got
> plenty of firmware errors
Yes, I can confirm this with Surface Pro 1. The problem is still here in
VoIP applications (at least some SIP-based, like Ekiga or linphone,
Skype Linux client too). Symptoms are just the same as they were (the
firmware hangs until the USB's VBus is cycled). I recall I saw some
commits by Mr. Karwar recently that appear to me as relevant to the
problem, not shure if they made it into 3.15.0-rc5. Hadn't tried them
yet, though.
(In reply to Dmitry Khromov from comment #54) > > --- Comment #53 from Christian Hartmann <cornogle@googlemail.com> --- > commits by Mr. Karwar recently that appear to me as relevant to the > problem, not shure if they made it into 3.15.0-rc5. Hadn't tried them > yet, though. yes, it they made it into 3.15.0-rc2, but that did not resolve the FIRMWARE IN BAD STATE. https://git.kernel.org/cgit/linux/kernel/git/linville/wireless.git/commit/drivers/net/wireless/mwifiex?id=f8d2b9209ad648a8e9aa973d32fb64aa2ab01d00 Can this be snd_hda_intel related? My question seems silly, but at first I see a warning about "Unstable LPIB ... disabling LPIB delay counting". its than about seconds that the firmware will go into bad state again. When I use Firefox and what streams, I got also the snd_hda_intel warning, but I get not the firmware bug. This only happends with the SIP application, On the other hand, there is also a radio fm chip on the Wifi card, which will be another thread (no bug yet, cause there is no radio FM support over i2c yet ;) ) How can I help to debug it so somebody like Bing Zhao can make a new firmware to fix this issues as soon as possible ?? What can I do to come closer to the wrong code lines?? Hi, we have recently found some missing memset (for tx_info/rx_info control buffers) which can cause random failures on both command and data path. I'm not sure if these patches will solve this particular issue or not, but they are worth a try: #1: https://git.kernel.org/cgit/linux/kernel/git/linville/wireless.git/commit/?id=d76744a93246eccdca1106037e8ee29debf48277 #2: http://marc.info/?l=linux-wireless&m=140425083601658&w=4 Hi Dmitry, Can you kindly close the bug if it is not seen anymore. Thanks, Ganapathi |
Created attachment 107360 [details] dmesg output after having to turn interface down and up again for 2 times I'm running Linux 3.10.9 on Microsoft Surface Pro and have WiFi-related problems with mwifiex_usb driver when associated with WPA2-PSK (hadn't tried others) APs: 1) Approximately every 30 minutes (pretty random) interface just hangs - no frames in, no out. Trying to write something (e.g. use ping utility) returns ENOBUFS. If I try to bring interface down, then up again sometimes it will start again, sometimes - say that, suddenly, RF-kill is active (/sys/class/rfkill/rfkillX/state is 0). It may also hang integrated Bluetooth when this happens. dmesg is silent. 2) Kernel says the following from time to time: [ 9291.798173] usb 3-3: data: -ENOSR is returned [10213.065880] usb 3-3: data: -ENOSR is returned It may be a single message, a couple of messages in a hundreds of milliseconds time frame, or a constant flood for tens of seconds. In my experience it is not related with the current traffic rate. 3) Sometimes (rare) the module will cause panic on transition to STR (unfortunately I can't figure a way to capture kernel messages in this case since the only available to user bus in Surface Pro is USB). 4) Sometimes (rare) on resume the USB bus will just stop working. I suppose it is caused by this module because I had never observed this with the module being blacklisted. I should probably mention the fact that I have USB auto-suspend off and power on for all ports in suspend (due to problems with other devices). Once again - I don't know how to capture kernel messages in this case. I do not think those are hardware errors - the device seems to work stable in Windows, though I rarely use Windows, and never for more than an hour or two. STEPS TO REPRODUCE: Not sure. I have those problems since 3.10.3 and still don't see correlations with anything, except, maybe, network activity - if I set iptables to drop all traffic except DHCP it may work for hours. I guess my description is not very detailed - but I can't track what causes the problem. Any help would be appreciated.