Bug 48921

Summary: iwlwifi triggers HW restart each 300 seconds
Product: Drivers
Reporter: szczarek (szczarek)
Component: network-wireless
Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED DUPLICATE
Severity: high
CC: akhan, alan, anarsoul, andrewd18, arthur.titeica, assertnull, camden.lindsay+kernel, david, djdjaa89, drivers_network-wireless, fab, hendry, ilw, johannes, kernel.org, kernel.org, linville, stf_xl, the.aidar
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.8
Subsystem:
Regression: No
Bisected commit-id:
Attachments: andrewd18_dmesg_output
lspci -vv output

Description szczarek 2012-10-16 11:59:21 UTC
The iwlwifi driver loses network connectivity every 300 seconds and triggers a WiFi HW restart.

[  732.466309] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues
[  732.966062] iwlwifi 0000:03:00.0: Error sending REPLY_ADD_STA: time out after 500ms.
[  732.966073] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 235
[  732.966084] wlan0: failed to remove key (0, d0:57:4c:56:be:47) from hardware (-110)
[  733.466092] iwlwifi 0000:03:00.0: Error sending REPLY_QOS_PARAM: time out after 500ms.
[  733.466104] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 236
[  733.466111] iwlwifi 0000:03:00.0: Failed to update QoS
[  733.966067] iwlwifi 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
[  733.966078] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 238
[  733.966087] iwlwifi 0000:03:00.0: Error clearing ASSOC_MSK on BSS (-110)
[  734.468036] iwlwifi 0000:03:00.0: Error sending REPLY_ADD_STA: time out after 500ms.
[  734.468042] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 239
[  734.468048] wlan0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-110)
[  734.968023] iwlwifi 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
[  734.968027] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 241
[  734.968030] iwlwifi 0000:03:00.0: Error clearing ASSOC_MSK on BSS (-110)
[  736.970274] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues
[  737.470044] iwlwifi 0000:03:00.0: Error sending REPLY_RXON: time out after 500ms.
[  737.470056] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 243
[  737.470064] iwlwifi 0000:03:00.0: Error clearing ASSOC_MSK on BSS (-110)
[  737.970300] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  737.970312] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 244
[  737.970880] cfg80211: All devices are disconnected, going to restore regulatory settings
[  737.970893] cfg80211: Restoring regulatory settings
[  737.970953] cfg80211: Calling CRDA to update world regulatory domain
[  737.982261] cfg80211: Ignoring regulatory request Set by core since the driver uses its own custom regulatory domain
[  737.982268] cfg80211: World regulatory domain updated:
[  737.982272] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[  737.982277] cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  737.982282] cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[  737.982287] cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
[  737.982291] cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  737.982296] cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
[  739.471301] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  739.471314] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 245
[  740.972062] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  740.972075] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 246
[  742.472047] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  742.472060] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 247
[  743.973323] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  743.973335] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 248
[  745.475320] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  745.475332] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 249
[  746.976334] iwlwifi 0000:03:00.0: Error sending REPLY_SCAN_CMD: time out after 500ms.
[  746.976347] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 250



Options used as a workaround (WA) do not help:
options iwlwifi 11n_disable=1 bt_coex_active=N 5ghz_disable=Y
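
In the log above, read_ptr stays at 230 while write_ptr climbs from 235 to 250, which suggests that commands are still being queued but no longer consumed. Below is a minimal sketch for checking a dmesg capture for that pattern. The function names are made up for illustration, and the regular expression assumes the exact message format shown in this report; the sample lines are copied from the log above.

```python
import re

# Matches lines such as:
# [  732.966073] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 235
PTR_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+iwlwifi \S+ Current CMD queue "
    r"read_ptr (\d+) write_ptr (\d+)"
)


def parse_cmd_queue(lines):
    """Return (timestamp, read_ptr, write_ptr) tuples for matching lines."""
    samples = []
    for line in lines:
        m = PTR_RE.search(line)
        if m:
            samples.append((float(m.group(1)), int(m.group(2)), int(m.group(3))))
    return samples


def read_ptr_stalled(samples):
    """True if read_ptr never changes while write_ptr changes at least once."""
    if len(samples) < 2:
        return False
    reads = {r for _, r, _ in samples}
    writes = {w for _, _, w in samples}
    return len(reads) == 1 and len(writes) > 1


# Three lines taken verbatim from the log in this report.
SAMPLE = [
    "[  732.966073] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 235",
    "[  733.466104] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 236",
    "[  733.966078] iwlwifi 0000:03:00.0: Current CMD queue read_ptr 230 write_ptr 238",
]

if __name__ == "__main__":
    samples = parse_cmd_queue(SAMPLE)
    print("samples:", len(samples))
    print("read_ptr stalled:", read_ptr_stalled(samples))
```

A stalled read_ptr only shows that the command queue stopped draining; it does not say why.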
Comment 1 szczarek 2012-10-16 12:06:20 UTC
HW: HP 8530w

WiFi HW:
03:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
Comment 2 Bassu 2012-11-06 22:14:37 UTC
I am experiencing the same problem with Centrino N 2200 series.
The options wd_disable and 11n_disable only delay the restarts; they still happen, and quite randomly.

But there's one thing common across all of the iwlwifi crashes I have seen so far -- they all start with "fail to flush all tx fifo queues" messages.
Comment 3 Aidar 2012-11-08 11:44:31 UTC
You guys are definitely not alone here with this 'feature'.

Take a look here: 
  https://bugzilla.redhat.com/show_bug.cgi?id=805285
  https://bugzilla.redhat.com/show_bug.cgi?id=833117 ( unfortunately, the reporter there settled on the mistaken idea that his RAM and cheap AP were at fault rather than iwlwifi )
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/984552 ( a lot of conjecture about 802.11n being the cause )
  http://www.linuxforums.org/forum/wireless-internet/188886-problem-connect-wifi-my-asus-laptop.html
  http://ubuntuforums.org/showthread.php?t=1941350-
  http://askubuntu.com/questions/153092/cant-find-intel-wireless-n-1000-after-waking-from-sleep
  https://bugzilla.redhat.com/show_bug.cgi?id=825491

I see the same "iwlwifi 0000:03:00.0: fail to flush all tx fifo queues" with a 5300AGN here, with all of the firmware packages (iwlwifi-5000-ucode-8.83.5.1-1.tar.gz, iwlwifi-5000-ucode-8.83.5.1-1.tgz, iwlwifi-5000-ucode-8.24.2.12.tgz and iwlwifi-5000-ucode-5.4.A.11.tar.gz) from intellinuxwireless.org.

So, yeah, wifi and linux is still apparently a bad idea.
I am waiting here for the shit to hit the fan and then, maybe the whiphy intel will in fact jump in.
Comment 4 Bassu 2012-11-08 13:03:53 UTC
I noticed *_idle errors in a few backtraces, so I went ahead and tried different CPU idling and power options. I tested with the BIOS's Adaptive Thermal Management turned to max on battery and then disabled the PCI Express Power Management features. It still happened, and quite randomly.

So I went ahead and started testing with different kernel patches. I stumbled upon the Brain Fuck Scheduler, and since patching with it I have so far not hit this problem of iwlwifi not getting any reply after 3000ms, reporting that it is unable to flush all tx fifo queues, or queues getting stuck or full.

To @Aidar,
It is not necessarily all the kernel's fault; rather, it is Intel's iwlwifi driver that is mostly broken. As far as I can tell, Intel and its developers are turning a deaf ear and a blind eye to all these bug reports. Well done, Intel: you had better not make any more networking hardware, as nothing tells me that you are capable of doing it.

Oh, and by the way, I set up a server machine with an on-board Intel network controller. You know what happened with its e1000e driver? It crashed after a few hours. Not cool.
Comment 5 Stanislaw Gruszka 2012-11-08 14:14:56 UTC
It looks like the problem is caused by NetworkManager triggering periodic scanning (every 5 minutes), so perhaps using disable_hw_scan=1 can work around this bug.
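
One way to check whether the failures really follow a 5-minute scan period is to measure the spacing between incidents in a dmesg capture. The following is a minimal sketch, not a diagnostic tool: the function names and the 60-second grouping gap are assumptions chosen for illustration, and the regular expression assumes the "fail to flush all tx fifo queues" message format quoted in this report.

```python
import re

# Matches lines such as:
# [  732.466309] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues
FLUSH_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+iwlwifi \S+ fail to flush all tx fifo queues"
)


def flush_timestamps(lines):
    """Extract the kernel timestamps of all flush-failure messages."""
    return [float(m.group(1)) for m in map(FLUSH_RE.search, lines) if m]


def incident_starts(timestamps, gap=60.0):
    """Collapse messages closer than `gap` seconds into one incident.

    Returns the timestamp of the first message of each incident.
    The 60 s default is an arbitrary assumption, not a driver constant.
    """
    starts = []
    last = None
    for t in sorted(timestamps):
        if last is None or t - last > gap:
            starts.append(t)
        last = t
    return starts


def intervals(starts):
    """Seconds between consecutive incident starts."""
    return [b - a for a, b in zip(starts, starts[1:])]


def report(lines):
    """Print the spacing between incidents found in the given log lines."""
    for delta in intervals(incident_starts(flush_timestamps(lines))):
        print(f"{delta:.1f} s since previous incident")
```

Calling report(sys.stdin) from a small wrapper and piping dmesg output into it would print one line per gap; values clustering around 300 s would be consistent with the periodic-scan theory, while irregular gaps would point elsewhere.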
Comment 6 Bassu 2012-11-08 17:05:21 UTC
@Stanislaw Gruszka 
Newer iwlwifi does not seem to provide that option anymore.
And see the references to other tx fifo issues mentioned by Aidar; not all of those crashes are caused by frequent scans. It is the Intel developers' lack of interest that is leading to such trashy behavior. Just look at those increasingly agitated bug reports and at their responses, if there are any!
Comment 7 Stanislaw Gruszka 2012-11-09 09:05:39 UTC
Oh, right, that option was removed. But there is one more possible workaround: periodic scanning can be disabled in NetworkManager by setting the BSSID field in the connection options to the MAC address of the AP.

The bug originally reported here indicates a problem that is triggered periodically; if you have a different problem, you should probably CC yourself on another bug or open a new one.
Comment 8 Aidar 2012-11-09 09:41:14 UTC
Stanislaw, unfortunately, your conjecture that specifying the BSSID MAC in NetworkManager is good enough is false.

I checked my NM and nm-applet configuration. It has had the BSSID MAC specified from day 1, but I nevertheless still see "fail to flush".

I appreciate it when each bug is very specific. I also appreciate it when you follow the rules to the letter, but in this particular case, the indirection you are suggesting by opening yet another new request for the same issue is just going to create another reference to a reference. Frankly, this all looks as bad as a Ponzi scheme. At some point somebody will have to man up and step in. ( I am looking at you, intel whiphy gang ).

Enough of this already, just face & deal with it, in this very context, here, already, please.

  :)


Just a thought: what if NASA had no choice but to use wifi chips from one huge chipmaker for its Curiosity rover? They would have to run a twisted pair from Earth to Mars, given that whiphy option from that co.
Comment 9 meat 2012-12-01 04:40:07 UTC
On I don't know how many different forums, blogs, and bug reports so far, I've seen people suggest that this is something specific to NetworkManager.

It isn't. And reading this is getting a touch frustrating, as it makes me believe folks are chasing false solutions, or thinking something is fixed that isn't, or that the problem is never going to be solved.

Two different iwl-1000 machines here. Neither using NetworkManager. You can remove that layer from the equation. I am using, quite simply, wpa_supplicant, (-Dnl80211), and dhcpcd. That's it. No UI of any sort. No networkmanager. Nada. 

I don't recall how far back this goes, but across 3.6.0, 3.6.2, 3.6.5, 3.6.6 and 3.6.8 the issue hasn't gone anywhere.

Taking time to figure out the correct fix is perfectly ok in my book. Pulling random solutions out of nowhere, is not - this is not NM, never has been, never will be. For anyone interested, this is a (crappy) HP DV4, PCI bus data as such:

02:00.0 Network controller [0280]: Intel Corporation Centrino Wireless-N 1000 [Condor Peak] [8086:0084]
        Subsystem: Intel Corporation Centrino Wireless-N 1000 BGN [8086:1315]
        Capabilities: [140] Device Serial Number 00-1e-64-ff-ff-2b-90-38

The cynic in me thinks that the way this is going to be "fixed" is by a future commit that simply removes the logging of this error - the error will still occur, but it simply won't be logged. I hope I am wrong about that...
Comment 10 djdjaa89 2013-01-17 20:28:19 UTC
I am experiencing the same issues with:

  Intel Corporation Centrino Wireless-N 1030 [Rainbow Peak] (rev 34)

I can provide debug info if needed or if it would help.
Comment 11 Kai Hendry 2013-01-21 03:19:55 UTC
I think I see the same problem on a stock 3.8.0-1rc4--mainline-dirty build from http://sakuscans.com/pacmanpkg/x86_64/

With lots of

    iwlwifi 0000:03:00.0: fail to flush all tx fifo queues 

messages.

IIUC Arch https://wiki.archlinux.org/index.php/ThinkPad_X230#Suspension seems to imply running an "optimized kernel" which tbh I'm not happy about doing.

x220:~$ lspci -s 03:00.0 -vvv
03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 [Taylor Peak] (rev 34)
        Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 47
        Region 0: Memory at f1500000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: <access denied>
        Kernel driver in use: iwlagn
Comment 12 Vasily Khoruzhick 2013-02-11 18:43:58 UTC
I'm getting this message here as well:
iwlwifi 0000:03:00.0: fail to flush all tx fifo queues

Hardware is 03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 [Taylor Peak] (rev 34)

Saw it on all 3.7.x kernels (I'm on 3.7.6 now) and on 3.6.11.
Comment 13 David Strauss 2013-03-26 18:54:36 UTC
I'm seeing this in 3.8 on Fedora. Specifically:
Linux athena 3.8.3-103.fc17.x86_64 #1 SMP Mon Mar 18 15:46:01 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Comment 14 Victor Engmark 2013-07-16 19:14:00 UTC
A quick Google search didn't come up with any bug reports on intel.com; does anyone know
1. whether this has been reported directly to them,
2. if so, how to bug them about it, and
3. if not, how to report it? I can't seem to find any references to a bug handling system on their site.
Comment 15 Alan 2013-11-19 22:45:21 UTC
2&3: See MAINTAINERS in the kernel tree.
Comment 16 Andrew Dorney 2014-01-13 06:23:21 UTC
Sent an e-mail direct to the maintainers and mentioned this bug report.

Having similar problems on Debian stable, using a vanilla kernel 3.12.6. Will attach a dmesg and lspci.
Comment 17 Andrew Dorney 2014-01-13 06:24:05 UTC
Created attachment 121731
andrewd18_dmesg_output
Comment 18 Andrew Dorney 2014-01-13 06:25:25 UTC
Created attachment 121741
lspci -vv output
Comment 19 Emmanuel Grumbach 2014-03-19 11:54:25 UTC
There is a W/A in 3.14 - please test 3.14.
Comment 20 Emmanuel Grumbach 2014-03-19 11:54:32 UTC
There is a W/A in 3.14 - please test 3.14.
Comment 21 Emmanuel Grumbach 2014-04-05 19:24:28 UTC
No information. Closing as duplicate of 56581

*** This bug has been marked as a duplicate of bug 56581 ***