Bug 48391

Summary: brcmsmac: random stalls
Product: Drivers Reporter: Stanislaw Gruszka (stf_xl)
Component: network-wirelessAssignee: brcm80211 driver developers (brcm80211-dev-list)
Status: CLOSED CODE_FIX    
Severity: normal CC: akb825, aspriel, eteran, pflugerama, phaber, swt
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: log from wpa_supplicant
output from dmesg
flush patch

Description Stanislaw Gruszka 2012-10-05 09:11:24 UTC
I can observe random connection stalls, which takes about 30 seconds and then automatically recovers. Unfortunetly I can not rally tell what are steps to reproduce that problem, it happens at random. In dmesg there are following messages, not sure if related through, I have some of them even if problem do not show up. 

ieee80211 phy0: brcms_c_sendampdu: No fifo space  !!
ieee80211 phy0: wl0: sendampdu: prep_xdu retry; seq 0x0
ieee80211 phy0: brcms_c_sendampdu: No fifo space  !!
ieee80211 phy0: wl0: sendampdu: prep_xdu retry; seq 0x0
ieee80211 phy0: brcms_c_ampdu_dotxstatus_complete: Pkt tx suppressed, illegal channel possibly 1

Feel free to ask me for more info, you think might be useful to track down this bug.
Comment 1 Stanislaw Gruszka 2012-10-05 09:12:10 UTC
This happen on version 3.5 and 3.6 .
Comment 2 Aaron Barany 2012-10-07 19:20:03 UTC
I have been experiencing the same problems on my laptop with a BCM4313 wireless card. dmesg has usually been silent when it happenes, but during a recent test with transferring a large file I got this warning:
 
[  471.789161] WARNING: at drivers/net/wireless/brcm80211/brcmsmac/main.c:7968 brcms_c_wait_for_tx_completion+0x99/0xb0 [brcmsmac]()
[  471.789164] Hardware name: Aspire E1-571
[  471.789167] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media fuse aes_x86_64 aes_generic ipt_REJECT ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables snd_hda_codec_hdmi snd_hda_codec_realtek arc4 joydev i915 brcmsmac cordic brcmutil mac80211 acpi_cpufreq mperf microcode ghash_clmulni_intel i2c_algo_bit drm_kms_helper drm snd_hda_intel snd_hda_codec snd_hwdep snd_pcm acer_wmi cryptd cfg80211 intel_agp i2c_i801 sdhci_pci sdhci serio_raw tg3 sparse_keymap snd_page_alloc snd_timer snd soundcore intel_gtt iTCO_wdt iTCO_vendor_support mmc_core mei lpc_ich i2c_core mfd_core kvm_intel libphy bcma psmouse processor pcspkr evdev kvm rfkill wmi coretemp crc32c_intel battery ac button video nfs nfs_acl lockd auth_rpcgss sunrpc fscache ext4 crc16 jbd2 mbcache sd_mod sr_mod cdrom ahci libahci libata scsi_mod ehci_hcd usbcore usb_common
[  471.789283] Pid: 1266, comm: kworker/u:30 Not tainted 3.5.5-1-ARCH #1
[  471.789284] Call Trace:
[  471.789291]  [<ffffffff81051a3f>] warn_slowpath_common+0x7f/0xc0
[  471.789293]  [<ffffffff81051a9a>] warn_slowpath_null+0x1a/0x20
[  471.789299]  [<ffffffffa04d3d29>] brcms_c_wait_for_tx_completion+0x99/0xb0 [brcmsmac]
[  471.789303]  [<ffffffffa04c61bb>] brcms_ops_flush+0x3b/0x60 [brcmsmac]
[  471.789312]  [<ffffffffa0478f69>] ieee80211_mgd_probe_ap_send+0xc9/0x230 [mac80211]
[  471.789318]  [<ffffffffa047974f>] ieee80211_mgd_probe_ap.part.22+0x10f/0x130 [mac80211]
[  471.789323]  [<ffffffffa04798fe>] ieee80211_sta_monitor_work+0x2e/0x30 [mac80211]
[  471.789327]  [<ffffffff8106ef1a>] process_one_work+0x12a/0x440
[  471.789332]  [<ffffffffa04798d0>] ? ieee80211_beacon_connection_loss_work+0x160/0x160 [mac80211]
[  471.789335]  [<ffffffff8106f77e>] worker_thread+0x12e/0x2d0
[  471.789337]  [<ffffffff8106f650>] ? manage_workers.isra.26+0x1f0/0x1f0
[  471.789340]  [<ffffffff81074e73>] kthread+0x93/0xa0
[  471.789344]  [<ffffffff8148aa24>] kernel_thread_helper+0x4/0x10
[  471.789346]  [<ffffffff81074de0>] ? kthread_freezable_should_stop+0x70/0x70
[  471.789348]  [<ffffffff8148aa20>] ? gs_change+0x13/0x13
Comment 3 Stanislaw Gruszka 2012-10-08 07:56:00 UTC
On my case there is no warning in dmesg. We have bug 42840 for tracking the warning, it was closed as obsolete, I'll reopen it.
Comment 4 Aaron Barany 2012-10-09 06:37:04 UTC
I ran another test with transferring a large file after upgrading to 3.5.6 and I got the same messages about no fifo space as the original reporter, the same warning I reported earlier, as well as this message printed many times:

[  297.918177] ieee80211 phy0: >brcms_c_prec_enq_head: No where to go, prec == 4

These messages popped up at different times during the transfer. The transfer ended up completing successfully, but it stalled several times in the middle of it. Not every stall produced output in dmesg, though.
Comment 5 Piotr Haber 2012-10-12 06:28:34 UTC
Could you provide some more info about hardware used (lspci -vnn)?

Do the stalls correlate with scanning (done by wpa_supplicant, enabling logs would be helpful)?
Trace from comment #2 might suggest so, as do messages from comment #1.
Comment 6 Aaron Barany 2012-10-12 09:36:29 UTC
This is the output from lspci -vnn:
03:00.0 Network controller [0280]: Broadcom Corporation BCM4313 802.11b/g/n Wireless LAN Controller [14e4:4727] (rev 01)
	Subsystem: Foxconn International, Inc. Device [105b:e042]
	Flags: bus master, fast devsel, latency 0, IRQ 17
	Memory at c0500000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
	Capabilities: [58] Vendor Specific Information: Len=78 <?>
	Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [d0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [13c] Virtual Channel
	Capabilities: [160] Device Serial Number 00-00-e6-ff-ff-4e-e0-06
	Capabilities: [16c] Power Budgeting <?>
	Kernel driver in use: bcma-pci-bridge

I'm not quite sure what you mean by whether or not the stalls are correlated with scanning. Do you mean if I'm searching for new networks during the transfer? The tests were run by transferring a large file from an NFS volume. I wasn't executing any other wireless activity, at least not manually. (I'm using NetworkManager, so it's possible if it was doing something in the background)
Comment 7 Piotr Haber 2012-10-12 10:24:45 UTC
Thanks for info.
Concerning correlation with scanning - wpa_supplicant (and NetworkManger uses it) does periodic scans, it would be helpful to make it log activity.

I think you can do that by exporting environment variables
IF_WPA_DEBUG_LEVEL to set log level
and maybe IF_WPA_LOGFILE to set destination file.

Can you also include dmesg?
Comment 8 Aaron Barany 2012-10-12 18:47:18 UTC
Created attachment 83031 [details]
log from wpa_supplicant
Comment 9 Aaron Barany 2012-10-12 18:52:45 UTC
Created attachment 83041 [details]
output from dmesg

networkmanager doesn't seem to respect those environment variables when it starts wpa_supplicant, but I was able to start it manually to enable logging.

I have attached the output from wpa_supplicant and dmesg. I believe the portion of dmesg that corresponds to the log from wpa_suplicant begins with the last deauthenticate by local choice on line 1894. Some of my previous runs are before that point, though, including some backtraces from the kernel warnings. It doesn't look like I got any kernel warnings from that run, though I did get a disconnect, which hasn't happened before.
Comment 10 Arend van Spriel 2012-10-12 19:03:47 UTC
Too bad the output does not contain time stamps. There is a DEAUTH with reason code 2 (Previous authentication no longer valid) although the rest of the log seems fine
----------
wlan0: State: GROUP_HANDSHAKE -> COMPLETED
dbus: flush_object_timeout_handler: Timeout - sending changed properties of object /fi/w1/wpa_supplicant1/Interfaces/1
RTM_NEWLINK: operstate=1 ifi_flags=0x1003 ([UP])
RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added
nl80211: if_removed already cleared - ignore event
nl80211: Event message available
nl80211: Delete station 00:18:e7:ef:ea:9e
nl80211: Event message available
nl80211: MLME event 39
nl80211: MLME event frame - hexdump(len=26): c0 00 3a 01 e0 06 e6 4e 7c ac 00 18 e7 ef ea 9e 00 18 e7 ef ea 9e 10 9b 02 00
wlan0: Event DEAUTH (12) received
wlan0: Deauthentication notification
wlan0:  * reason 2
wlan0:  * address 00:18:e7:ef:ea:9e
Deauthentication frame IE(s) - hexdump(len=0): [NULL]
wlan0: WPA: Auto connect enabled: try to reconnect (wps=0)
wlan0: Setting scan request: 0 sec 100000 usec
Added BSSID 00:18:e7:ef:ea:9e into blacklist
wlan0: Not rescheduling scan to ensure that specific SSID scans occur
wlan0: CTRL-EVENT-DISCONNECTED bssid=00:18:e7:ef:ea:9e reason=2
wlan0: Disconnect event - remove keys
Comment 11 Piotr Haber 2012-10-15 09:59:51 UTC
Could you elaborate a little bit more on the nature of the stalls?
How long do they last?
Do you see a total loss of transmission or a dip in throughput (how big)?
How long it takes to recover from stall (did you have a situation that it did not recover and you needed a module reload?
Is downlink (AP->STA) affected or uplink?

I was able to occasionally (most of the times it is just a dip it throughput, which is normal) reproduce a stall with 2 iperf sessions running in parallel and then initiating a scan (iw wlanx scan).
Comment 12 Aaron Barany 2012-10-15 18:49:32 UTC
As far as I can tell it's a complete loss of transfer. I've never had an instance where it failed to recover automatically. This also only seems to affect downloads.

The stalls last anywhere from a few seconds to around a minute. Likewise, the amount of time transferring can last anywhere from a few seconds to a minute. It seems to go through phases where sometimes it will spend more time downloading than stalling, while other times it will spend more time stalling than downloading.

I was only running one transfer at a time. However, I do have a few background processes that are running that may access the network: NetworkManager, wpa_supplicant, avahi, and ntpd.
Comment 13 Stanislaw Gruszka 2012-10-16 11:55:26 UTC
Piotr, do you want answers from me or Aaron information is enough?
Comment 14 Piotr Haber 2012-10-17 05:53:51 UTC
It would be helpful to have as many described cases as possible.
There is a related thread on redhat's bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=799168) where traces seem to suggest that this might happen in scenarios other than periodic scanning (there is even a report of a full hang, unfortunately without much details).

On my setup I was able minimize time without any transfer using a simple patch.
Could you try it? It's for 3.6.
Comment 15 Stanislaw Gruszka 2012-10-17 07:59:14 UTC
Yes, I can test the patch.
Comment 16 Piotr Haber 2012-10-18 06:08:42 UTC
Created attachment 83811 [details]
flush patch

Arend's original patch (http://marc.info/?l=linux-wireless&m=134683858909456&w=2)
plus kick of the send queue + some logging to get more insight of the problem nature
Comment 17 Stanislaw Gruszka 2012-10-18 09:46:47 UTC
Patch fixes stall issue for me (and also performance improve dramatically :-)

I have those warnings:

------------[ cut here ]------------
WARNING: at drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c:730 brcms_ops_flush+0x16f/0x180 [brcmsmac]()
Hardware name: HP xw8600 Workstation
Pid: 7030, comm: kworker/u:1 Tainted: G        W    3.6.0+ #5
Call Trace:
 [<ffffffff8105794f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff810579aa>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa068242f>] brcms_ops_flush+0x16f/0x180 [brcmsmac]
 [<ffffffff8107ef00>] ? wake_up_bit+0x40/0x40
 [<ffffffffa05ee717>] ieee80211_scan_state_resume+0x57/0x100 [mac80211]
 [<ffffffffa05ef103>] ieee80211_scan_work+0x1a3/0x2c0 [mac80211]
 [<ffffffff81076555>] process_one_work+0x215/0x530
 [<ffffffff810764e8>] ? process_one_work+0x1a8/0x530
 [<ffffffffa05eef60>] ? ieee80211_request_scan+0x60/0x60 [mac80211]
 [<ffffffff81078ebe>] worker_thread+0x12e/0x320
---[ end trace a14b60a491b872e9 ]---

on other network (11g only, WPA2-Eneterprice), but that different problem, so I'll open a separate bug report for it.
Comment 18 Stanislaw Gruszka 2012-10-18 09:57:24 UTC
> on other network (11g only, WPA2-Eneterprice), but that different problem, so
> I'll open a separate bug report for it.

Actually I think this is related with bug 48131, since environment and steps to reproduce are the same.
Comment 19 Piotr Haber 2012-10-18 11:30:33 UTC
just to clear thing up:
- warning at brcms_ops_flush means that we could not get rid of all outstanding packets in queue in specified time (currently 100ms)

- brcms_c_prec_enq_head: No where to go, prec == 4 - indicate packets being dropped because of insufficient queue space, we hope to address this issue with flow control between mac80211 and driver

both signify non critical issues and should be seen as an aid to debugging
Comment 20 Aaron Barany 2012-10-20 19:25:53 UTC
The patch fixed the random stalls for me as well.
Comment 21 Stanislaw Gruszka 2012-10-25 10:27:37 UTC
Since patch fixing this bug was not yet applied, Arend should repost it ...
Comment 22 Stanislaw Gruszka 2013-01-24 14:34:28 UTC
I can not reproduce this bug on recent kernels.