Bug 70191 - TX Hang in mwifiex with SD8787 in AP Mode
Status: NEW
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: ARM Linux
Importance: P1 normal
Assigned To: drivers_network-wireless@kernel-bugs.osdl.org
Reported: 2014-02-06 20:22 UTC by Andrew Wiley
Modified: 2014-12-05 10:36 UTC
Kernel Version: 3.13.1
Tree: Mainline
Regression: No


Attachments
dmesg including mwifiex, mwifiex_sdio, mvsdio debug messages (1.60 MB, text/plain)
2014-02-06 20:22 UTC, Andrew Wiley
hostapd config (369 bytes, application/octet-stream)
2014-02-06 20:24 UTC, Andrew Wiley
Debugging output on running the Dreamplug 1001 in AP-mode (7.97 KB, text/plain)
2014-02-11 22:11 UTC, Linus Gasser
boot-time crash in mwifiex (583.88 KB, text/plain)
2014-09-22 23:30 UTC, Andrew Wiley
boot-time crash in mwifiex (218.69 KB, application/octet-stream)
2014-09-22 23:31 UTC, Andrew Wiley
boot-time crash in mwifiex (219.06 KB, application/octet-stream)
2014-09-22 23:31 UTC, Andrew Wiley

Description Andrew Wiley 2014-02-06 20:22:48 UTC
Created attachment 124911 [details]
dmesg including mwifiex, mwifiex_sdio, mvsdio debug messages

If I start up an access point using hostapd on my SD8787 with mwifiex_sdio, the transmit queue will hang within a few seconds to a few minutes. If WPA2 is disabled, it may last a few hours, but it still hangs eventually.

Eventually, a watchdog notices that the queue has hung, mwifiex tries to reset the card, and the card fails to load the firmware afterwards.

Platform is a Dreamplug (Marvell Kirkwood)
Kernel is 3.13.1 (as noted above)
Firmware version shows in dmesg as 14.66.9.p96

You can find a discussion of the issue at http://thread.gmane.org/gmane.linux.kernel.wireless.general/118990 and continued at http://thread.gmane.org/gmane.linux.kernel.wireless.general/119133

It's probably also worth mentioning that this isn't the only issue I've observed. On this platform, the card will sometimes fail to load the firmware on boot, the card will hang the entire platform randomly while in station mode (heartbeat LED stops blinking), and mwifiex never succeeds in loading the firmware after it resets the card when an issue occurs. Of these issues, this AP issue seems the most diagnosable because it can be easily reproduced and logged.

Given that Avinash could not reproduce the issue on an amd64 machine with the same kernel and firmware (see the gmane link), and that the driver seems to be stable on other platforms, I suspect that the issues I noted above and this one are actually caused by a glitch in the interaction of mvsdio and mwifiex_sdio.

Attached, you'll find a dmesg capture of the issue. It was captured using these commands:

====
echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control
echo "module mwifiex_sdio +p" > /sys/kernel/debug/dynamic_debug/control
echo "module mvsdio +p" > /sys/kernel/debug/dynamic_debug/control
iw dev mlan0 interface add uap0 type __ap
hostapd /etc/hostapd/test/hostapd.24.conf
====

The AP was tested with a Windows laptop that was constantly pinging a host on the bridged network (not the AP).

Timing-wise, this is *approximately* the timeline, according to the dmesg timestamps:
405 - hostapd launches
413 - wifi is enabled on test machine
419 - test machine shows a network connection (DHCP complete - pings start being returned)
430 - first ping timeout is observed
Comment 1 Andrew Wiley 2014-02-06 20:24:42 UTC
Created attachment 124921 [details]
hostapd config
Comment 2 Andrew Wiley 2014-02-06 20:33:28 UTC
Also, on this particular trace, I didn't wait for the watchdog to notice the TX hang. The command timeout at timestamp 451 occurred because I killed hostapd with ctrl-C, and it attempted to bring down the interface.
Comment 3 Bing Zhao 2014-02-07 20:53:50 UTC
Hi Andrew, is this your DreamPlug?

"Globalscale DreamPlug 036000291452 GHz Class Linux Server"

I found it on Amazon.
Comment 4 Andrew Wiley 2014-02-07 21:41:52 UTC
Yes, that's it, but there's one twist.

During production, Globalscale switched wireless chipsets from SD8688 to SD8787. They don't note the change (they even shipped the first few units with the wrong drivers), but the first user I've seen that noticed the change received their unit in November 2011.

Most likely, the plugs Amazon is selling will have the SD8787 because the hardware change was so long ago, but the only way to be completely sure would probably be to go to Globalscale directly. Then again, Amazon does have a nice return policy.

The serial number can be used to tell which chipset the plug will have. The number is of the form DS2-####-######, and units DS2-1139 and greater seem to have SD8787.
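The serial-number rule can be written down as a small shell check. This is only a sketch: it assumes the DS2-####-###### layout described here, treats the four-digit field as the value compared against 1139, and the function name is made up for illustration. The threshold is an observation from users, not something Globalscale has confirmed.

```shell
#!/bin/sh
# Guess the Wi-Fi chipset of a Dreamplug from its serial number.
# Assumption: serials look like DS2-####-###### and units with a four-digit
# field of 1139 or greater carry the SD8787 (observed, not guaranteed).
dreamplug_chipset() {
    serial=$1
    case $serial in
        DS2-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "unknown"; return 1 ;;
    esac
    batch=${serial#DS2-}   # drop the "DS2-" prefix
    batch=${batch%%-*}     # keep only the four-digit field
    if [ "$batch" -ge 1139 ]; then
        echo "SD8787"
    else
        echo "SD8688"
    fi
}

# Hypothetical serial, used only as an example.
dreamplug_chipset DS2-1139-000001   # prints SD8787
```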
Comment 5 Andrew Wiley 2014-02-08 03:13:42 UTC
I reverted my plug to the factory U-Boot and root filesystem this evening, and I can confirm that an AP with WPA2 is stable there. The factory setup runs kernel 2.6.39.4 with the mlan proprietary driver and uaputl userland tools.

I'll let this setup run for a few more hours to make sure it's completely stable, but it looks like I can rule out a hardware issue.
Comment 6 Bing Zhao 2014-02-08 03:22:41 UTC
What's firmware version from factory root filesystem?

Thanks for the information about the plug.
Amazon's Product Description says WiFi is b/g, which implies SD8688.
Globalscale website says it's b/g/n. I will order one from Globalscale directly.
Comment 7 Andrew Wiley 2014-02-08 03:27:08 UTC
The mlan module doesn't report firmware version in dmesg, but I did get this from a console:

# uaputl sys_info
System information = w8787-Ax, RF878X, FP57, 14.57.5.p85, BT_SDIO
Comment 8 Bing Zhao 2014-02-08 03:50:41 UTC
Can you back up this firmware image? You can test it with the mwifiex driver after you switch to the upstream 3.13 kernel again.
Comment 9 Linus Gasser 2014-02-08 19:34:22 UTC
(In reply to Bing Zhao from comment #6)
> What's firmware version from factory root filesystem?
> 
> Thanks for the information about the plug.
> Amazon's Product Description says WiFi is b/g, which implies SD8688.
> Globalscale website says it's b/g/n. I will order one from Globalscale
> directly.

There are (at least) two versions of the Dreamplug. According to

http://www.madore.org/~david/linux/dreamplug.html

They're 0801/0802 and 0901/1001 respectively.
Wifi is SD8688 for the first two and SD8787 for the latter two.

I have both at home here, so if you want me to test anything, I can do so on both.
Comment 10 Andrew Wiley 2014-02-09 03:16:53 UTC
I copied the firmware image over and switched back to 3.13. So far, the only difference I've observed is that WPA2 simply doesn't work with that firmware image and 3.13 - the client doesn't associate successfully. The driver also logs some errors when the AP goes up, but I don't have them copied down.

I did notice that the factory firmware uses dnsmasq and some iptables rules to set up a routed subnet with the AP interface as the gateway, whereas I was adding the AP interface to a bridge with an Ethernet adapter. It looks like the routed-AP setup doesn't have TX hangs even with the latest firmware, so at least the problem is narrowed down to an AP interface bridged with an Ethernet adapter.

The general instability is still around even with the old firmware. The plug still occasionally hangs when the interface is going up and down, and rarely while the AP is running.
Comment 11 Bing Zhao 2014-02-11 04:39:34 UTC
@Linus, if you have time, could you test the SD8787 uAP feature as per the bug description? I want to see whether you get the same problem with your plug. Thanks for helping.

@Andrew, let's move back to the p96 firmware (I just wanted to check if the stock wifi firmware could make any difference).

> so at least the problem is narrowed down to an AP interface bridged with an Ethernet adapter.

Could you share with us how exactly this bridge is configured in your setup?
Comment 12 Andrew Wiley 2014-02-11 06:18:09 UTC
(In reply to Bing Zhao from comment #11)
> Could you share with us how exactly this bridge is configured in your setup?

Just this:

brctl addbr br0
brctl addif br0 eth0
brctl addif br0 eth1

and then this line in hostapd.conf:
bridge=br0

The only other relevant detail I can think of is that I run dhcpcd on br0 so that the plug has an IP on the network in addition to forwarding traffic from wireless clients.

I believe the behavior is the same if only one interface is added to the bridge, but this is the most recent configuration I tested with.
Comment 13 Linus Gasser 2014-02-11 22:11:43 UTC
Created attachment 125661 [details]
Debugging output on running the Dreamplug 1001 in AP-mode

See comments inside about the commands that have been run
Comment 14 Linus Gasser 2014-02-11 22:33:33 UTC
I tried on the same Dreamplug but without the bridge, configuring a simple NAT between the Dreamplug and the Internet. I connected my laptop to the wireless network and again ran a ping to the Internet and a "ping -f" to the Dreamplug. After about 5 minutes, my laptop got disconnected. The only message I see is from hostapd:

uap0: STA 28:cf:da:df:d6:fa WPA: pairwise key handshake completed (RSN)
uap0: AP-STA-DISCONNECTED 28:cf:da:df:d6:fa
WPA: wpa_sm_step() called recursively
uap0: STA 28:cf:da:df:d6:fa IEEE 802.11: disassociated
uap0: STA 28:cf:da:df:d6:fa IEEE 802.11: associated
WPA: wpa_sm_step() called recursively
uap0: STA 28:cf:da:df:d6:fa IEEE 802.11: disassociated
uap0: STA 28:cf:da:df:d6:fa IEEE 802.11: associated
WPA: wpa_sm_step() called recursively
uap0: STA 28:cf:da:df:d6:fa IEEE 802.11: deauthenticated due to local deauth request

and nothing in "dmesg". Connecting again on the wireless is impossible. It asks for the passphrase, but then rejects. Even if I "ctrl-c" the hostapd-process and start it again, I can't connect.
Comment 15 Bing Zhao 2014-02-13 05:06:07 UTC
Thanks Linus and Andrew for the info.

> brctl addbr br0
> brctl addif br0 eth0
> brctl addif br0 eth1
> echo "module mwifiex +p" > /sys/kernel/debug/dynamic_debug/control
> echo "module mwifiex_sdio +p" > /sys/kernel/debug/dynamic_debug/control
> echo "module mvsdio +p" > /sys/kernel/debug/dynamic_debug/control
> iw dev mlan0 interface add uap0 type __ap

Did you run "brctl addif br0 uap0" after AP interface was created?

@Comment #14, this looks like a different issue, for which we may need a sniffer trace.
Comment 16 Linus Gasser 2014-02-13 15:39:17 UTC
I did not use "brctl addif br0 uap0", I think this is done automatically by hostapd, no? There is the line "bridge=br0" in hostapd.24.conf...

Shall I open a new bug for my @comment14? I would need some guidance about how to do a sniffer trace.
Comment 17 Bing Zhao 2014-02-14 05:17:17 UTC
You are right, bridge=br0 in hostapd.conf should add uap0 to br0.
Yes, you can open a new bug for Comment #14 if no TX timeout is found in the logs.
As for the sniffer application, you can try Wireshark (http://www.wireshark.org/), which is free. But you will have to find another Wi-Fi dongle that is compatible with Wireshark to use as the capturing device. I don't have much experience with Wireshark as I use OmniPeek only.
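For anyone who wants to double-check that hostapd really added uap0 to br0, a rough sketch follows. It relies on the kernel listing bridge ports under /sys/class/net/<bridge>/brif/ in sysfs; the function name and the optional sysfs-root argument are made up here, and the extra argument exists only so the check can be exercised against a scratch directory.

```shell
#!/bin/sh
# Report whether an interface is currently a port of a bridge.
# Bridge ports appear as entries under /sys/class/net/<bridge>/brif/.
# The optional third argument (sysfs root) is an assumption of this sketch
# and is only there to make the function testable.
bridge_has_port() {
    bridge=$1
    port=$2
    sysfs=${3:-/sys}
    [ -e "$sysfs/class/net/$bridge/brif/$port" ]
}

if bridge_has_port br0 uap0; then
    echo "uap0 is a port of br0"
else
    echo "uap0 is not a port of br0"
fi
```

Running this while hostapd is up with bridge=br0 should show whether the automatic add happened, without needing a manual "brctl addif br0 uap0".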
Comment 18 Andrew Wiley 2014-03-23 18:59:30 UTC
Bump.
Is there anything else I can look into to try to help this along, or are we just waiting on Globalscale to deliver a Dreamplug so you can repro this yourself?
Comment 19 Bing Zhao 2014-03-23 19:22:56 UTC
No, I've received the Dreamplug already. And I checked that it's an SD8787 chip inside and the Linux is Debian. Sorry for the delay. I'm just too busy on my projects currently. I hope I can get something back to you soon.
Comment 20 Bing Zhao 2014-03-29 19:10:50 UTC
Hi Andrew, I'm getting this error (below) on my Dreamplug when trying to apt-get install something. The major concern for me is the "attempt to access beyond end of device" error message as it sounds like a rootfs corruption.

How do you update your Dreamplug to the latest vanilla/stable kernel? Thanks.


-------------------------------------
Do you want to continue [Y/n]? y
Setting up libcurl3-gnutls (7.21.0-2.1+squeeze7) ...
[1371438.050876] attempt to access beyond end of device
[1371438.056420] sda2: rw=0, want=30282874928, limit=6291456
Bus error
dpkg: error processing libcurl3-gnutls (--configure):
 subprocess installed post-installation script returned error exit status 135
configured to not write apport reports
dpkg: dependency problems prevent configuration of git:
 git depends on libcurl3-gnutls (>= 7.16.2-1); however:
  Package libcurl3-gnutls is not configured yet.
dpkg: error processing git (--configure):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of git-core:
 git-core depends on git (>> 1:1.7.0.2); however:
  Package git is not configured yet.
dpkg: error processing git-core (--configure):
 dependency problems - leaving unconfigured
Comment 21 Linus Gasser 2014-03-29 20:41:20 UTC
On 29/03/14 20:10, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=70191
>
> --- Comment #20 from Bing Zhao <bzhao@marvell.com> ---
> Hi Andrew, I'm getting this error (below) on my Dreamplug when trying to
> apt-get install something. The major concern for me is the "attempt to access
> beyond end of device" error message as it sounds like a rootfs corruption.
>
> How do you update your Dreamplug to the latest vanilla/stable kernel? Thanks.

I had similar problems when using old kernels or bad SD-cards. Which 
kernel-version do you use? Perhaps a fresh install would be best. Once 
SD-cards get corrupted, a good old format can do wonders. Sometimes only 
a new card can help (and I went through many)!

Linus
Comment 22 Bing Zhao 2014-03-29 21:09:16 UTC
Linus, thanks for your reply.

The kernel is 2.6.39.4. It came with the Dreamplug's default kernel.
Linux dreamplug-debian 2.6.39.4 #110 PREEMPT Wed Sep 18 17:38:00 EDT 2013 armv5tel GNU/Linux

Could you point me to the procedure for a fresh install? Thanks!
Comment 23 Linus Gasser 2014-03-29 21:42:33 UTC
On 29/03/14 22:09, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=70191
>
> --- Comment #22 from Bing Zhao <bzhao@marvell.com> ---
> Linus, thanks for your reply.
>
> The kernel is 2.6.39.4. It came with the Dreamplug's default kernel.
> Linux dreamplug-debian 2.6.39.4 #110 PREEMPT Wed Sep 18 17:38:00 EDT 2013
> armv5tel GNU/Linux

For the tests in this thread we use the Archlinux-installation from

http://archlinuxarm.org/os/ArchLinuxARM-kirkwood-latest.tar.gz

but you'll also need to update u-boot. Better ask in the forums there 
for help about that, as here it's about the wifi-problem itself.

As said, the wifi is NOT STABLE USING LATEST ARCHLINUX, contrary to the 
kernel shipped with the Dreamplug...

Linus
Comment 24 Andrew Wiley 2014-04-22 22:59:37 UTC
As Linus said, we're both running Arch Linux ARM. As far as I can tell, Debian has no support for running a recent kernel on this board, and Arch does.

There are three approaches to running a recent kernel.

1) Run the kernel with a separate DTB. This is the ideal setup moving forward because it allows for easy kernel upgrades, but it requires an updated U-Boot because the factory one doesn't support DTBs at all. That isn't as bad as it sounds because the Dreamplug is supported in U-Boot mainline, so it's really just a matter of building the latest revision, but it is an extra step.

2) Run the kernel with the DTB appended to the zImage, which is then packaged into a combined uImage for U-Boot. This is the easiest setup to get working because it works with the factory U-Boot.

3) Run the kernel with a boardfile instead of a DTB. Arch Linux ARM still "supports" this approach, in that they maintain an ugly patch that adds a Dreamplug boardfile to the kernel. I've confirmed that this bug manifests exactly the same way with this approach, but I wouldn't recommend it because the boardfile patch may not be completely correct.
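As a rough illustration of approach 2, the appended-DTB image can be assembled along these lines. Treat everything specific here as an assumption: the file names, the image name, and the 0x00008000 load and entry address are typical values that were not stated in this thread, and the kernel has to be built with CONFIG_ARM_APPENDED_DTB for the appended blob to be picked up.

```shell
#!/bin/sh
# Sketch of approach 2: append the DTB to the zImage, then wrap the result
# as a uImage for the factory U-Boot.
# Assumptions: file names, image name, and the 0x00008000 load/entry address.

# Concatenate the kernel image and the device tree blob into one file.
append_dtb() {
    zimage=$1
    dtb=$2
    out=$3
    cat "$zimage" "$dtb" > "$out"
}

# Wrap the combined image with a legacy U-Boot header (needs mkimage).
make_uimage() {
    in=$1
    out=$2
    mkimage -A arm -O linux -T kernel -C none \
        -a 0x00008000 -e 0x00008000 \
        -n "Linux (appended DTB)" -d "$in" "$out"
}

# Possible use from the top of a built kernel tree (paths are assumptions):
#   append_dtb arch/arm/boot/zImage arch/arm/boot/dts/kirkwood-dreamplug.dtb zImage-dtb
#   make_uimage zImage-dtb uImage
```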


Note that with the DTB approaches, there's a minor difference in IRQ handling order that currently results in spurious "unhandled IRQ" warnings. You can find details and a patch that restores the original behavior at https://lkml.org/lkml/2013/11/15/276

I'm currently reconstructing how I set up my Dreamplug because I accidentally wiped the flash drive where I did all my kernel builds. If you'd like, I can do a writeup somewhere.
Comment 25 Bing Zhao 2014-04-24 03:16:59 UTC
My apologies to Andrew and Linus, as I really don't have time to work on my Dreamplug. I expect this situation will continue for another month or two.
Andrew, if you have a writeup, that would be great. I can try to find some time to set it up.
Comment 26 Bing Zhao 2014-06-20 18:49:39 UTC
Hi Andrew, Linus, could you please apply this patch?
http://marc.info/?l=linux-wireless&m=140328997917046&w=2
Comment 27 Linus Gasser 2014-06-23 14:25:24 UTC
Wow - looks great! A 1-liner, that beats a lot of my bugs ;) I'm in the process of trying to compile a kernel for the Dreamplug, as the kernels > 3.13.7 on my Dreamplug don't run anymore :( I hope to have a result in a day or two. But already a big thank you for searching the bug!
Comment 28 Andrew Wiley 2014-06-25 02:01:25 UTC
I've been testing this over the past few days, and while it does seem to have resolved the TX hang, I'm now seeing lockups after the AP is running for a while. Unfortunately, they don't seem to be triggering the lockup detectors, so I don't have a stack trace.

Since Michael on the other bug is reporting that his seemingly identical issue is resolved, I'm guessing the lockups are a separate platform-specific issue. I'll continue to try to get enough information for a separate bug report.

Linus, are you seeing the same thing?
Comment 29 Bing Zhao 2014-07-01 22:29:45 UTC
Hi, we found some other missing memset calls for tx_info which could cause random failures. Could you give it a try?

http://marc.info/?l=linux-wireless&m=140425083601658&w=4
Comment 30 Andrew Wiley 2014-09-22 17:03:48 UTC
I built a kernel from the wireless-testing tree last week, and the driver basically makes my Dreamplug unusable now. It looks like it's trying to load the firmware, timing out waiting for the card to respond, trying to restart the card and crashing, crashing while trying to print the crash report (repeats for a while), rebooting, and running through it again. Sometimes a different failure will happen during initialization and the driver won't try to restart the card, so the system boots with no wifi interface, and sometimes the system will just hang.

I'll try to get a clean-ish serial log up later today, but something is seriously wrong here.
Comment 31 Andrew Wiley 2014-09-22 23:30:58 UTC
Created attachment 151451 [details]
boot-time crash in mwifiex
Comment 32 Andrew Wiley 2014-09-22 23:31:13 UTC
Created attachment 151461 [details]
boot-time crash in mwifiex
Comment 33 Andrew Wiley 2014-09-22 23:31:25 UTC
Created attachment 151471 [details]
boot-time crash in mwifiex
Comment 34 Andrew Wiley 2014-09-22 23:32:58 UTC
I added three boot logs of crashes, taken from the serial console.

It's pretty clear that this is a new and separate issue (as it seems to be occurring without any userspace interaction with the card). Should we move this discussion to a separate bug?

Has something similar been seen on any other hardware platforms?
Comment 35 Avinash Patil 2014-11-04 10:04:06 UTC
Hi Andrew,


Recently we have uploaded new FW image to git:

http://git.marvell.com/?p=mwifiex-firmware.git;a=commit;h=3f45b8c4cc1eb1d102bc3486b19677332dd215ab

Could you please check if you see issue with this FW?
Comment 36 Andrew Wiley 2014-11-23 20:45:26 UTC
Hello Avinash,

With a build of yesterday's wireless-testing tree and the new firmware, the most I could get the card to do was to start a scan. I only ever got one result back before the driver would complain of an invalid rx_len, the command would time out, and the driver would try to reset the card and either hang on the mwifiex_sdio_work queue in mwifiex_sdio_remove or just crash the machine completely.

Most of the time, I didn't manage to get far enough to start a scan because the firmware would fail to load or the driver would report the firmware was loaded, then fail the first command.

The firmware seems to load successfully a bit more often if I blacklist btmrvl_sdio, but the boot/test/crash/reboot cycle takes so long that I don't have any useful data on that.

Is there a kernel tree that I should be testing rather than wireless-testing, or is that the best option?
Comment 37 Andrew Wiley 2014-11-24 01:20:22 UTC
I did manage to get the card to work once. I blacklisted mwifiex_sdio but let btmrvl_sdio load. The bluetooth driver failed to load the firmware, but the card started up when I loaded mwifiex (unfortunately, I don't know whether it loaded the firmware or it was already loaded). Client mode functioned perfectly across multiple disassociation/re-associations. I did not test AP mode.

Unfortunately, things broke again on reboot and I haven't managed to reproduce whatever made it work. It definitely seems like loading the firmware is the biggest problem on this platform.
Comment 38 Andrew Wiley 2014-11-30 05:30:10 UTC
I've found that disabling DMA on the mvsdio driver makes the card work very nicely, albeit with the high CPU overhead you'd expect from disabling DMA. This can be accomplished by putting "options mvsdio nodma=1" in a file in /etc/modprobe.d (or mvsdio.nodma=1 on the boot cmdline, if I understand correctly).

btmrvl_sdio still fails to load firmware, and I have it blacklisted.
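For reference, the two module settings mentioned here could be collected in one modprobe.d fragment. This is only a sketch: the file name dreamplug-wifi.conf and the helper function are made up, and the directory argument exists just so the helper can be pointed at a scratch directory instead of /etc/modprobe.d.

```shell
#!/bin/sh
# Write a modprobe.d fragment that disables DMA in mvsdio and blacklists
# btmrvl_sdio. The file name "dreamplug-wifi.conf" is hypothetical; the
# directory defaults to /etc/modprobe.d and can be overridden for testing.
write_modprobe_conf() {
    dir=${1:-/etc/modprobe.d}
    conf="$dir/dreamplug-wifi.conf"
    {
        echo "options mvsdio nodma=1"
        echo "blacklist btmrvl_sdio"
    } > "$conf"
    echo "$conf"   # report where the fragment was written
}

# Run as root to install it for real:
#   write_modprobe_conf
```

The settings only take effect the next time the modules are loaded, so a reboot (or unloading and reloading mvsdio) is needed afterwards.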

At this point, should I look into posting debug logs with DMA enabled and disabled, or is there someone at Marvell who's familiar with the Kirkwood SDIO controller that could take a look?
Comment 39 Andrew Wiley 2014-12-01 06:09:42 UTC
This patch seems to fix things such that I don't have to disable DMA (although most host-to-card transfers still won't be using DMA): http://www.spinics.net/lists/arm-kernel/msg376894.html
Is it possible that the performance loss due to using PIO instead of DMA in mwifiex could be avoided by making mwifiex store data it needs to transmit in 64-byte aligned buffers?

I'll continue to fiddle with the card, but this seems to resolve all my issues. Linus, do you see the same? If so, this bug should probably be closed.
Comment 40 Avinash Patil 2014-12-01 06:22:47 UTC
Hi Andrew,

>>Is it possible that the performance loss due to using PIO instead of DMA in mwifiex could be avoided by making mwifiex store data it needs to transmit in 64-byte aligned buffers?

Yes; this can be done. We need to ensure that we have enough headroom during hard_start_xmit() and align the DMA buffer to 64 bytes during TX.
We will work on this.
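To make the 64-byte alignment idea concrete, here is the arithmetic as a small shell sketch. It is not mwifiex code: the function names are made up, and it only shows how a length is rounded up to the next 64-byte boundary and how much padding (headroom) that costs.

```shell
#!/bin/sh
# Round a length (or address) up to the next multiple of an alignment.
# The 64-byte default comes from the discussion above; the function names
# are hypothetical and not part of mwifiex.
align_up() {
    value=$1
    align=${2:-64}
    echo $(( ($value + $align - 1) / $align * $align ))
}

# Number of padding bytes needed to reach the next aligned boundary.
pad_to_align() {
    value=$1
    align=${2:-64}
    echo $(( ($align - $value % $align) % $align ))
}

align_up 1500       # prints 1536
pad_to_align 1500   # prints 36
```

So a 1500-byte frame would need 36 bytes of padding to end on a 64-byte boundary, which is the kind of headroom the driver would have to reserve.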
Comment 41 Linus Gasser 2014-12-05 06:19:42 UTC
On 01/12/2014 07:09, bugzilla-daemon@bugzilla.kernel.org wrote:
> I'll continue to fiddle with the card, but this seems to resolve all my issues.
> Linus, do you see the same? If so, this bug should probably be closed.

I finally got around to testing it on the latest ArchLinux with a 3.17.4-1 
kernel on a 0802-Dreamplug. Client-mode seems fine so far, except with:

- one "ping -f -s 16384" I launched hung the wireless connection for a 
minute. The serial console didn't show anything, and everything was up 
and running once I killed the ping

I can't get the AP-mode running - am I missing something? I don't 
remember having had to do something special...

Dec 05 05:58:19 alarm hostapd[171]: Configuration file: /etc/hostapd/hostapd.conf
Dec 05 05:58:19 alarm hostapd[171]: nl80211: Could not configure driver mode
Dec 05 05:58:19 alarm hostapd[171]: nl80211 driver initialization failed.
Dec 05 05:58:19 alarm hostapd[171]: hostapd_free_hapd_data: Interface wlan0 wasn't started

It's like I'm not seeing any firmware loaded, which seems strange to me. 
Journalctl doesn't show up with anything, libertas* and btmrvl* both are 
loaded.

Linus
Comment 42 Linus Gasser 2014-12-05 10:36:17 UTC
On 01/12/2014 07:09, bugzilla-daemon@bugzilla.kernel.org wrote:
> I'll continue to fiddle with the card, but this seems to resolve all my issues.
> Linus, do you see the same? If so, this bug should probably be closed.

Tried the same thing on a 1001-Dreamplug, seems to work fine. Again tested

ping -f

for a couple of minutes, while copying a file to the Dreamplug and 
untarring it. So WLAN + microSD-card seem to work fine.

Again, ping -f -s 16384 seems to shut things down, but this might also 
be my router.

To summarize:

sd8787: 1001-client: OK
sd8787: 1001-ap: OK
sd8688: 0802-client: OK
sd8688: 0802-ap: couldn't test, I think libertas_sdio has no 
ap-functionality

So seems to be OK for me, too!

Linus
