Bug 56581 - iwlwifi iwldvm fail to flush Tx queue
Summary: iwlwifi iwldvm fail to flush Tx queue
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Emmanuel Grumbach
URL: https://bugs.launchpad.net/ubuntu/+so...
Keywords:
: 42686 43184 43186 43259 48921 49631 60658 69051 79711 83071 86101 (view as bug list)
Depends on:
Blocks:
 
Reported: 2013-04-13 13:16 UTC by Florian Reitmeir
Modified: 2016-11-01 19:10 UTC (History)
42 users (show)

See Also:
Kernel Version: 3.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (177.76 KB, text/plain)
2014-03-20 13:12 UTC, Martin Scherer
Details
Logs from vanilla 3.14.0 kernel (280.00 KB, application/x-tar)
2014-04-02 16:38 UTC, Victor Leschuk
Details
get more logs (2.57 KB, patch)
2014-04-06 08:22 UTC, Emmanuel Grumbach
Details | Diff
dmesg with debug enhanced logs (207.43 KB, text/plain)
2014-04-06 13:21 UTC, Martin Scherer
Details
dmesg buffer syslogged (2.20 MB, application/x-bzip)
2014-04-06 16:04 UTC, Martin Scherer
Details
add more slightly debug (3.37 KB, patch)
2014-04-07 05:57 UTC, Emmanuel Grumbach
Details | Diff
Reproducement with new patch by other user. (184.92 KB, text/plain)
2014-04-14 00:13 UTC, Tom Wijsman
Details
Reproducement with new patch and debug=0xC080000 by other user. (401.38 KB, text/plain)
2014-04-27 19:41 UTC, Tom Wijsman
Details
Reproducement with new patch and debug=0xc0800000 by other user. (362.17 KB, text/plain)
2014-05-01 15:21 UTC, Tom Wijsman
Details
expire packets after 50ms (581 bytes, patch)
2014-05-01 16:55 UTC, Emmanuel Grumbach
Details | Diff
dmesg output switch from repeating SUCCESS to repeating LIFE_EXPIRE with 50000 * 1024 (3.57 KB, text/plain)
2014-05-01 18:08 UTC, Tom Wijsman
Details
DD-WRT Basic/Security/Advanced settings + iw link/station/scan and wavemon output (3.74 KB, text/plain)
2014-05-02 09:06 UTC, Tom Wijsman
Details
disable power save (911 bytes, patch)
2014-05-29 05:08 UTC, Emmanuel Grumbach
Details | Diff
Logs after applying suggested patches (40.45 KB, text/x-log)
2014-08-01 02:58 UTC, Nikolay Martynov
Details
bits of syslog with stack trace (119.49 KB, text/plain)
2014-08-21 08:28 UTC, Mike
Details
iwlwifi log (23.85 KB, text/x-log)
2014-08-23 12:03 UTC, Martin
Details
dmesg output 3.16-1 (4.93 KB, text/plain)
2014-09-04 22:05 UTC, Bill Bluescreen
Details
drop frames as a work around. (3.50 KB, patch)
2014-10-05 06:29 UTC, Emmanuel Grumbach
Details | Diff
updated version of the fix (3.60 KB, patch)
2014-10-05 08:20 UTC, Emmanuel Grumbach
Details | Diff
fix (3.50 KB, patch)
2014-10-05 13:04 UTC, Emmanuel Grumbach
Details | Diff
syslog (17.89 KB, text/plain)
2014-10-06 13:09 UTC, Bill Bluescreen
Details
dmesg output (50.37 KB, text/plain)
2014-10-06 14:10 UTC, Bill Bluescreen
Details
dmesg output (4.13 KB, text/plain)
2014-10-06 14:51 UTC, Bill Bluescreen
Details
dmesg output (38.97 KB, text/plain)
2014-10-06 15:26 UTC, Bill Bluescreen
Details
dmesg out - version a (59.84 KB, text/plain)
2014-10-06 19:32 UTC, Cristian Aravena Romero
Details
dmesg output - version b (66.79 KB, text/plain)
2014-10-06 22:49 UTC, Cristian Aravena Romero
Details
less aggressive drop (2.38 KB, application/octet-stream)
2014-10-09 19:41 UTC, Emmanuel Grumbach
Details
dmesg output (18.28 KB, text/plain)
2014-10-10 09:44 UTC, Bill Bluescreen
Details
dmesg output (26.94 KB, text/plain)
2014-10-13 13:01 UTC, Bill Bluescreen
Details
syslog (200.06 KB, application/octet-stream)
2014-10-13 15:35 UTC, Bill Bluescreen
Details

Description Florian Reitmeir 2013-04-13 13:16:54 UTC
Hi,

i use 

> uname -a
Linux lade 3.9.0-rc4-iwlwifi #3 SMP Sun Mar 24 17:38:52 CET 2013 x86_64 GNU/Linux

without any patches, and sometimes i get the error below.

i have 2 access points, and i believe the problem is created while roaming between the access points.

greetings


[44524.547565] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44524.547601] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44524.547621] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44524.547641] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44524.547660] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44524.547680] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000d9
[44524.547700] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44524.547719] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44524.547739] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44524.547759] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709039
[44524.547830] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [218,218]
[44524.547899] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44524.547968] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44524.548037] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.548106] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.548175] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44524.548243] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44524.548312] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44524.548381] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44524.548450] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [58,58]
[44524.548550] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44524.548629] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44524.548698] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44524.548767] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.548837] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.548905] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.548975] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.549043] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.549112] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44524.549181] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.548929] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 12
[44526.548940] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44526.548975] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44526.548996] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44526.549017] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44526.549036] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44526.549056] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000da
[44526.549076] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44526.549095] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44526.549115] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44526.549135] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709041
[44526.549205] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [219,219]
[44526.549274] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44526.549343] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44526.549412] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.549481] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.549550] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44526.549619] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44526.549688] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44526.549756] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44526.549826] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [66,66]
[44526.549918] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44526.549992] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44526.550080] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44526.550149] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550220] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550293] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550362] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550431] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550500] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44526.550569] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.550438] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 12
[44528.550451] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44528.550487] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44528.550509] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44528.550529] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44528.550549] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44528.550568] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000db
[44528.550588] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44528.550608] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44528.550628] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44528.550648] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709047
[44528.550718] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [220,220]
[44528.550787] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44528.550856] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44528.550925] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.550994] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.551063] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44528.551132] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44528.551201] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44528.551269] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44528.551376] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [72,72]
[44528.551459] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44528.551528] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44528.551597] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44528.551666] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.551735] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.551804] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.551873] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.551942] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.552010] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44528.552079] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.551750] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 12
[44530.551761] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44530.551796] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44530.551817] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44530.551838] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44530.551857] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44530.551878] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000dc
[44530.551897] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44530.551917] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44530.551937] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44530.551956] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x0070904f
[44530.552027] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [221,221]
[44530.552097] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44530.552166] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44530.552235] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.552303] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.552372] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44530.552441] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44530.552510] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44530.552579] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44530.552648] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [80,80]
[44530.552733] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44530.552803] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44530.552882] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44530.552955] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553030] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553106] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553182] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553259] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553335] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44530.553407] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.553208] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 12
[44532.553219] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44532.553255] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44532.553276] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44532.553296] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44532.553316] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44532.553336] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000dd
[44532.553356] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44532.553376] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44532.553395] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44532.553415] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709056
[44532.553485] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [222,222]
[44532.553555] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44532.553626] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44532.553698] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.553771] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.553843] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44532.553914] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44532.553986] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44532.554059] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44532.554159] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [87,87]
[44532.554249] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44532.554326] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44532.554397] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44532.554478] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554548] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554617] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554686] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554755] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554824] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44532.554893] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.659314] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 12
[44538.659326] iwlwifi 0000:03:00.0: Current SW read_ptr 107 write_ptr 194
[44538.659360] iwl data: 00000000: f9 ff e1 ff ff ff ff ff 9c 03 20 e7 ff fb ff ff  .......... .....
[44538.659381] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[44538.659401] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020de
[44538.659421] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[44538.659441] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x803000de
[44538.659460] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[44538.659480] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[44538.659499] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[44538.659519] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00709060
[44538.659589] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [223,223]
[44538.659659] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44538.659729] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [223,223]
[44538.659798] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.659867] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.659936] iwlwifi 0000:03:00.0: Q 5 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44538.660004] iwlwifi 0000:03:00.0: Q 6 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[44538.660095] iwlwifi 0000:03:00.0: Q 7 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44538.660165] iwlwifi 0000:03:00.0: Q 8 is active and mapped to fifo 4 ra_tid 0x0000 [0,0]
[44538.660264] iwlwifi 0000:03:00.0: Q 9 is active and mapped to fifo 7 ra_tid 0x0000 [97,97]
[44538.660343] iwlwifi 0000:03:00.0: Q 10 is active and mapped to fifo 5 ra_tid 0x0000 [0,0]
[44538.660418] iwlwifi 0000:03:00.0: Q 11 is active and mapped to fifo 1 ra_tid 0x0000 [52,52]
[44538.660495] iwlwifi 0000:03:00.0: Q 12 is active and mapped to fifo 1 ra_tid 0x0000 [107,194]
[44538.660571] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660643] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660712] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660781] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660849] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660918] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[44538.660987] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Comment 1 Emmanuel Grumbach 2014-03-18 06:20:10 UTC
*** Bug 60658 has been marked as a duplicate of this bug. ***
Comment 2 Emmanuel Grumbach 2014-03-18 06:20:38 UTC
*** Bug 69051 has been marked as a duplicate of this bug. ***
Comment 3 Emmanuel Grumbach 2014-03-18 06:26:55 UTC
*** Bug 43259 has been marked as a duplicate of this bug. ***
Comment 4 Emmanuel Grumbach 2014-03-18 08:32:54 UTC
*** Bug 42686 has been marked as a duplicate of this bug. ***
Comment 5 Emmanuel Grumbach 2014-03-18 08:34:07 UTC
*** Bug 43184 has been marked as a duplicate of this bug. ***
Comment 6 Emmanuel Grumbach 2014-03-18 08:34:24 UTC
*** Bug 49631 has been marked as a duplicate of this bug. ***
Comment 7 Emmanuel Grumbach 2014-03-18 11:05:14 UTC
*** Bug 43186 has been marked as a duplicate of this bug. ***
Comment 8 Martin Scherer 2014-03-18 20:35:49 UTC
confirmed to be still present in 3.13.1. It seems to me, that the bug only occurs during high constant loads of the interface. If state is resumed from hibernate it occurs even without much bandwidth usage.

Network controller: Intel Corporation Centrino Ultimate-N 6300 (rev 35)
        Subsystem: Intel Corporation Centrino Ultimate-N 6300 3x3 AG
Comment 9 Emmanuel Grumbach 2014-03-19 06:09:38 UTC
TX AMPDU are disabled by default in 3.14. This workaround should help.
Comment 10 Martin Scherer 2014-03-19 22:05:33 UTC
The consequences of disabling this feature leads to huge slow down issues (ping response times of 5-8s instead of 30ms). They occur periodically nearly at same time (guess), like in the fifo tx queue flushing bug. Testing on 3.14rc7
Comment 11 Martin Scherer 2014-03-19 22:16:14 UTC
Sorry for the spam: the bug is still present in 3.14rc7. Didn't check dmesg output. So the proposed workaround does not work in my case. Is there a way to check, if TX AMPDU is really disabled?
Comment 12 Emmanuel Grumbach 2014-03-20 06:19:42 UTC
can you share your dmesg?
Comment 13 Martin Scherer 2014-03-20 13:12:29 UTC
Created attachment 130131 [details]
dmesg
Comment 14 Victor Leschuk 2014-04-02 16:37:30 UTC
Hello, I confirm that the issue is reproduced on vanilla 3.14.0 kernel. Attaching dmesg, lspci and config.
Comment 15 Victor Leschuk 2014-04-02 16:38:05 UTC
Created attachment 131241 [details]
Logs from vanilla 3.14.0 kernel
Comment 16 Emmanuel Grumbach 2014-04-05 19:24:28 UTC
*** Bug 48921 has been marked as a duplicate of this bug. ***
Comment 17 Emmanuel Grumbach 2014-04-06 08:22:38 UTC
Created attachment 131531 [details]
get more logs

this patch allows to get more logs. Please apply and send the logs.
note that this patch adds tons of logs - and I will need the syslog to get all the history.

Thanks.
Comment 18 Martin Scherer 2014-04-06 13:21:42 UTC
Created attachment 131541 [details]
dmesg with debug enhanced logs
Comment 19 Emmanuel Grumbach 2014-04-06 13:23:45 UTC
please send the syslog too. (as written above).
Thanks
Comment 20 Martin Scherer 2014-04-06 16:04:53 UTC
Created attachment 131561 [details]
dmesg buffer syslogged
Comment 21 Emmanuel Grumbach 2014-04-07 05:57:27 UTC
Created attachment 131611 [details]
add more slightly debug

Please replace the previous patch by this one and send again your syslog.
Unfortunely, from what I am seeing, I don't see any driver bugs, but the FW is really getting stuck for a long time. It does recover long later though.
Comment 22 Tom Wijsman 2014-04-14 00:13:30 UTC
Created attachment 132101 [details]
Reproducement with new patch by other user.

(In reply to Emmanuel Grumbach from comment #21)
> Created attachment 131611 [details]
> add more slightly debug
> 
> Please replace the previous patch by this one and send again your syslog.
> Unfortunely, from what I am seeing, I don't see any driver bugs, but the FW
> is really getting stuck for a long time. It does recover long later though.

Hello

Can someone reopen this bug or let me know if I need to file a new one?

Similar issues to the ones described in this bug I have been experiencing on my system; it is also concerning the "fail to flush all tx fifo queues Q 2" output, which you can be found in the middle of the attachment with 1000 lines of context. This attachment is with the patch of the previous comment included.

This Gentoo runs a 3.14.0-rc5 Linux kernel and linux-firmware 20131230, the NIC:

03:00.0 Network controller [0280]: Intel Corporation WiFi Link 5100 [8086:4232]

Around the moment that this happens, I experience huge latency and packet loss; here for example I have packet loss (10.0.0.1 is router, 8.8.8.8 is internet):

PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=1558 ms

--- 10.0.0.1 ping statistics ---
4 packets transmitted, 1 received, 75% packet loss, time 3001ms
rtt min/avg/max/mdev = 1558.746/1558.746/1558.746/0.000 ms, pipe 2

In another case I for example have a moment where it recovers:

PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=2979 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=2265 ms
64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=2005 ms
64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=1543 ms

--- 10.0.0.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3002ms
rtt min/avg/max/mdev = 1543.949/2198.716/2979.334/519.544 ms, pipe 3

PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=47 time=553 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=47 time=683 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=47 time=77.6 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=47 time=78.8 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3008ms
rtt min/avg/max/mdev = 77.680/348.382/683.868/274.029 ms

But this recovered state won't last very long. After these symptoms; I need to play a bit with rmmod and modprobe, reconnecting or even rebooting as well as wait in a hope to get it to fully recover.

If you need more information, let me know; if I discover more, I'll comment. TIA
Comment 23 Emmanuel Grumbach 2014-04-16 19:54:58 UTC
(In reply to Tom Wijsman from comment #22)
> Created attachment 132101 [details]
> Reproducement with new patch by other user.
> 
> (In reply to Emmanuel Grumbach from comment #21)
> > Created attachment 131611 [details]
> > add more slightly debug
> > 
> > Please replace the previous patch by this one and send again your syslog.
> > Unfortunely, from what I am seeing, I don't see any driver bugs, but the FW
> > is really getting stuck for a long time. It does recover long later though.
> 
> Hello
> 
> Can someone reopen this bug or let me know if I need to file a new one?
> 

You are suffering from the exact same problems as described in this bug.
You don't need to re-open anything - this bug is open.

From what I see from your logs - this is the exact same behavior as the other users. The firmware  takes a very long time to transmit the packets.
Can you please enable debug logs with the patch applied:

modprobe iwlwifi debug=0xC080000

I wonder if we are stuck in high rates while the link is bad. This can increase significantly the latency.
Comment 24 Tom Wijsman 2014-04-16 21:20:03 UTC
(In reply to Emmanuel Grumbach from comment #23)
> From what I see from your logs - this is the exact same behavior as the
> other users. The firmware  takes a very long time to transmit the packets.
> Can you please enable debug logs with the patch applied:
> 
> modprobe iwlwifi debug=0xC080000

Will do; I only experience this at certain hours of the day due to high rates happening then, but I can look into setting up a test case with Raspberry Pi.

> I wonder if we are stuck in high rates while the link is bad. This can
> increase significantly the latency.

Yes, I discovered that too, it happens when there is high rate Ethernet traffic.

Let's say I have two Ethernet connected computers and my laptop that is on WLAN; my laptop on WLAN gets affected when the two computers transfer, eg. a file of 1 GB, this problem then continues for a bit after the file is done transferring.

My router is a Linksys WRT610N running DD-WRT v24-sp2 (04/07/12) std-usb-nas - build 18946M NEWD-2; if needed, I can try to get that back to original firmware.

But before blaming the router, as it should work fine; I think other WLAN clients here are unaffected, I'll try to confirm that by having someone else test with a different NIC here on the network test it.

An early idea I get is that iwlwifi doesn't deal well with the high rates compared to other cards; I'll try to get more debug info, as well as try the rest I described here to see if I can get further confirmations and details.

This has worked fine in the past (let's say, a year ago), but somewhere in the last months this has regressed; I can't pinpoint when exactly and why, I'm also not sure if I can bisect it somehow unless I get to a clear isolated test case.

Thank you for the debug option and patch, set it and patched again; reboot. :)
Comment 25 Tom Wijsman 2014-04-27 19:41:37 UTC
Created attachment 133971 [details]
Reproducement with new patch and debug=0xC080000 by other user.

(In reply to Emmanuel Grumbach from comment #23)
> Can you please enable debug logs with the patch applied:
> 
> modprobe iwlwifi debug=0xC080000
> 
> I wonder if we are stuck in high rates while the link is bad. This can
> increase significantly the latency.

With that debug option, I've managed to capture it happen again now; this log with 2000 lines context captured it twice as well as disconnect(s) that occurred.

The "fail to flush all tx fifo queues Q 2" occurs at 00:44 and 00:49 in the log.

An Ethernet connected laptop is streaming from a Raspberry Pi over Ethernet (= cause, unaffected), another WiFi connected laptop can game fine (= unaffected); I'm on my WiFi connected laptop experiencing this while I am browsing.

Thank you very much in advance for looking into this.
Comment 26 Emmanuel Grumbach 2014-04-28 22:15:40 UTC
ouch - I meant 0xc0800000 (I missed a trailing 0)... sorry...
Comment 27 Tom Wijsman 2014-05-01 15:21:26 UTC
Created attachment 134591 [details]
Reproducement with new patch and debug=0xc0800000 by other user.

This time with new debug parameter, it happened twice close to a disconnect; I sometimes get random disconnects, unsure if it is caused by the same cause.

2000 lines of context; due to disconnect, NetworkManager / wpa_supplicant log is present too; "fail to flush all tx fifo queues Q 2" happens at 00:41 and 00:46.

The situation is the same as before. Thank you in advance.
Comment 28 Emmanuel Grumbach 2014-05-01 16:44:22 UTC
Thank you Tom, this was really helpful.

What I see here is that each packet is re-transmitted a lot times.
This certainly increases the latency. This points to the rate scaling algorithm.

I'll try to see what can be done.

Thanks again.
Comment 29 Emmanuel Grumbach 2014-05-01 16:55:50 UTC
Created attachment 134601 [details]
expire packets after 50ms

Please retry with the patch attached.
This will throw away packets that are too old.
I expect packet loss, but the firmware should be more responsive.

Let me know what happen with this.
Comment 30 Tom Wijsman 2014-05-01 17:37:39 UTC
Doesn't compile, but I suppose that

-tx_cmd->stop_time.life_time = iwl_read32(CSR_UCODE_DRV_GP2) + 50 * 1024;
+tx_cmd->stop_time.life_time = iwl_read32(priv->trans, CSR_UCODE_DRV_GP2) + 50 * 1024;

is what you intended; will try and see how it goes, let's wait it out... :)
Comment 31 Emmanuel Grumbach 2014-05-01 17:41:29 UTC
(In reply to Tom Wijsman from comment #30)
> Doesn't compile, but I suppose that
> 
> -tx_cmd->stop_time.life_time = iwl_read32(CSR_UCODE_DRV_GP2) + 50 * 1024;
> +tx_cmd->stop_time.life_time = iwl_read32(priv->trans, CSR_UCODE_DRV_GP2) +
> 50 * 1024;
> 
> is what you intended; will try and see how it goes, let's wait it out... :)

Yep - sorry, I couldn't really compile when I wrote the change ;)
Comment 32 Tom Wijsman 2014-05-01 17:55:23 UTC
(In reply to Emmanuel Grumbach from comment #29)
> Let me know what happen with this.

Well, it appears I can't even connect with this;
not with 50, not with 500 and not even with 5000.

With the patch, I get:

iwlwifi 0000:03:00.0: I iwlagn_rx_reply_tx TXQ 0 status LIFE_EXPIRE (0x00000087)

Without the patch, I get:

iwlwifi 0000:03:00.0: I iwlagn_rx_reply_tx TXQ 0 status SUCCESS (0x00000201)
Comment 33 Tom Wijsman 2014-05-01 18:00:27 UTC
Argh, just one more worked out; so, this one works:

iwl_read32(priv->trans, CSR_UCODE_DRV_GP2) + 50000 * 1024;

Are you perhaps off by another 1024? Do I need to try 50 * 1024 * 1024?

I've also noticed that while I can now connect, I get disconnected later; I guess that once you give me a range, I need to try out and find a stable value?
Comment 34 Tom Wijsman 2014-05-01 18:08:00 UTC
Created attachment 134631 [details]
dmesg output switch from repeating SUCCESS to repeating LIFE_EXPIRE with 50000 * 1024

Attached you can find dmesg output switching from a repeating SUCCESS to a repeating LIFE_EXPIRE with 50000 * 1024; at that point, pings to my router stop arriving after which after a timeout I get disconnected.
Comment 35 Tom Wijsman 2014-05-02 01:10:32 UTC
(In reply to Tom Wijsman from comment #33)
> Do I need to try 50 * 1024 * 1024?

50 * 1024 * 1024: Disconnects (LIFE_EXPIRE) after a few pings.
100 * 1024 * 1024: Get to YT video, set it Full HD, disconnects (LIFE_EXPIRE).
200 * 1024 * 1024: Watch a YT video, wait, disconnect.

From a first impression it controls how long it takes until it disconnects? :(
Comment 36 Emmanuel Grumbach 2014-05-02 05:27:09 UTC
It control how much to wait until the packet is discarded.

What I see here is that the rate scaling algorithm scaled down (intial rate = 0x8100) - you are using a very low rate because of the bad link, and even then, there are a lot of retransmissions. I guess this is a PHY / RF problem and these are tricky as you can guess...

One last thing I noticed is that we don't seem to go down to CCK rates, are you in 2.4GHz or 5.2GHz?
Comment 37 Tom Wijsman 2014-05-02 09:06:39 UTC
Created attachment 134681 [details]
DD-WRT Basic/Security/Advanced settings + iw link/station/scan and wavemon output

(In reply to Emmanuel Grumbach from comment #36)
> One last thing I noticed is that we don't seem to go down to CCK rates, are
> you in 2.4GHz or 5.2GHz?

2.4 GHz

Attached is are the DD-WRT router's wireless settings, as well as iw dump output; this is a lot of information, but it might contain something valuable.

Given that Basic Rate says "The default setting will ensure maximum compatibility with all devices. You may also choose to enable all data rates by selecting ALL." I'm going to try to change that from ALL to Default to see if that gives an improvement; perhaps it is incompatible with that, I'll report back if this is the case.

This information was captured with the infinite life_time put back and no traffic on Ethernet this morning; given that the given value doesn't work, and raising it a lot just seems to stall a disconnect I'll stay back on infinite life_time until we've come down to a better value or idea.
Comment 38 Emmanuel Grumbach 2014-05-02 14:56:58 UTC
can you paste the output of:

cat /sys/kernel/debug/ieee80211/phy0/netdev\:wlan0/stations/*/rate_scale_table

you can then,

echo 0x820a into that same file and see what happens. This will force to move to a fixed low rate. You seem to be stuck in OFDM rates while you should downgrading to CCK rates - PHY / RF issues.

In any case, please remove the lifetime settings, but you already did that it seems.
Comment 39 Tom Wijsman 2014-05-02 15:20:41 UTC
 # ls /sys/kernel/debug/ieee80211/phy0
fragmentation_threshold  ht40allow_map  long_retry_limit  rts_threshold  short_retry_limit

Hmm, no netdev\:* entries in there; do I need to enable some CONFIG option?
Comment 40 Emmanuel Grumbach 2014-05-02 15:59:15 UTC
CONFIG_MAC80211_DEBUGFS and CONFIG_IWLWIFI_DEBUGFS
Comment 41 Tom Wijsman 2014-05-02 20:20:46 UTC
With the Default setting (Comment #37), I still experience high pings; I'm not sure if I can reproduce the failure right now, we'll have to wait that out.

Here is the output you've requested:

sta_id 0
failed=121 success=121 rate=0FFF
fixed rate 0x0
valid_tx_ant ANT_B,
lq type HT
 SISO 20MHz SGI  
last tx rate=0xA103
general:flags=0x0 mimo-d=0 s-ant0x2 d-ant=0x3
agg:time_limit=4000 dist_start_th=3 frame_cnt_limit=63
Start idx [0]=0x0 [1]=0x0 [2]=0x0 [3]=0x0
 rate[0] 0xA104 36mbps (16QAM 3/4)
 rate[1] 0xA104 36mbps (16QAM 3/4)
 rate[2] 0xA103 24mbps (16QAM 1/2)
 rate[3] 0xA103 24mbps (16QAM 1/2)
 rate[4] 0xA103 24mbps (16QAM 1/2)
 rate[5] 0x8007 18mbps (QPSK 3/4)
 rate[6] 0x8005 12mbps (QPSK 1/2)
 rate[7] 0x826E 11mbps (QPSK CCK)
 rate[8] 0x800F 9mbps (BPSK 1/2)
 rate[9] 0x800D 6mbps (BPSK 1/2)
 rate[10] 0x8237 5.5mbps (BPSK CCK)
 rate[11] 0x8214 2mbps (QPSK DSSS)
 rate[12] 0x820A 1mbps (BPSK DSSS)
 rate[13] 0x820A 1mbps (BPSK DSSS)
 rate[14] 0x820A 1mbps (BPSK DSSS)
 rate[15] 0x820A 1mbps (BPSK DSSS)

After the echo, I see that all the rates repeat the same:

sta_id 0
failed=0 success=0 rate=0FFF
fixed rate 0x820A
valid_tx_ant ANT_B,
lq type HT
 SISO 20MHz SGI
last tx rate=0xA100
general:flags=0x0 mimo-d=0 s-ant0x2 d-ant=0x3
agg:time_limit=4000 dist_start_th=3 frame_cnt_limit=63
Start idx [0]=0x0 [1]=0x0 [2]=0x0 [3]=0x0
 rate[0] 0x820A 1mbps (BPSK DSSS)
 ... with the rest repeating the same up to rate[15] ...

It doesn't help with the pings, they still jump into seconds like before:

64 bytes from 10.0.0.1: icmp_seq=176 ttl=64 time=2.81 ms
64 bytes from 10.0.0.1: icmp_seq=177 ttl=64 time=5.65 ms
64 bytes from 10.0.0.1: icmp_seq=178 ttl=64 time=4.64 ms
64 bytes from 10.0.0.1: icmp_seq=179 ttl=64 time=1060 ms
64 bytes from 10.0.0.1: icmp_seq=180 ttl=64 time=1157 ms
64 bytes from 10.0.0.1: icmp_seq=181 ttl=64 time=1139 ms
64 bytes from 10.0.0.1: icmp_seq=182 ttl=64 time=487 ms
64 bytes from 10.0.0.1: icmp_seq=183 ttl=64 time=888 ms
64 bytes from 10.0.0.1: icmp_seq=184 ttl=64 time=433 ms
64 bytes from 10.0.0.1: icmp_seq=185 ttl=64 time=9.54 ms
64 bytes from 10.0.0.1: icmp_seq=186 ttl=64 time=32.4 ms

This is during the network situation; it just doesn't trigger a failure at the moment, it's not always reproducible.

I'll wait to see if I'll get a reproducement with Default setting (test #1); when it does, it isn't that setting and I'll then try to see what the echo does while the situation is occurring (test #2).

If it turns out the Default setting has fixed this up, I can later try again with the All setting; just to reproduce and provide more information as to why the All setting would break it in that case.

Stay tuned...

PS: Without the echo, it seems like it stays (when idle) in the higher rates, where it only very infrequently goes to the lower rates; I'm not sure how often it's supposed to go there, if needed I can try to capture a video in the future.
Comment 42 Emmanuel Grumbach 2014-05-04 15:36:03 UTC
What I fear is that the rate control doesn't downgrade to low CCK rates, where these might help you when you have a bad link.
If the rate control doesn't downgrade, you'll have tons of re-transmissions, and this can cause big delays. The echo command fixes the rate to a low rate. This is a debug step, just to see if it helps.
I came to the conclusion that this might help since you seemed to have a lot of re-transmissions and you seemed stuck in OFDM rates (higher than CCK).

To go back to a floating rate, you need to unload a reload the module.
Comment 43 Martin Scherer 2014-05-07 17:48:52 UTC
When setting rates statically to constant 1mbps, I still experience high response times and failing tx queue flushes.
Comment 44 Tom Wijsman 2014-05-07 18:15:18 UTC
What I further think is that the disconnects while idle have to do with the rate being too low for the high-rate-stuck iwlwifi to handle; therefore, I think it would timeout due to this mismatch. Whereas making my internet traffic a little less idle it manages to stay connected; eg., opening a few internet related apps, playing a game, ...

That observation put aside; I notice some interesting commits in iwlwifi-next:

iwlwifi: mvm: rs: revert changes to search cycle rules
iwlwifi: mvm: rs: overhaul rs_get_best_rate
iwlwifi: mvm: rs: don't allow TPC when power save is disabled
iwlwifi: mvm: rs: avoid TPC while in search cycle
iwlwifi: mvm: rs: choose correct expected tpt table

^ These are directly related to selecting the rate.

iwlwifi: mvm: implement mac80211's flush callback

^ "This allows mac80211 to flush the Tx queues before it sends critical management frames." says enough and seems quite relevant to this bug.

Are there other commits there that are of interest testing? Should and can I just cherrypick these 6 on top of my 3.15-rc4? Which kernel release will they go in?
Comment 45 Emmanuel Grumbach 2014-05-07 20:04:42 UTC
(In reply to Martin Scherer from comment #43)
> When setting rates statically to constant 1mbps, I still experience high
> response times and failing tx queue flushes.

Can I see the output of debug=0xC0800000 I'd like the see the number of retries - thanks
Comment 46 Emmanuel Grumbach 2014-05-07 20:05:54 UTC
(In reply to Tom Wijsman from comment #44)
>  
> iwlwifi: mvm: rs: revert changes to search cycle rules
> iwlwifi: mvm: rs: overhaul rs_get_best_rate
> iwlwifi: mvm: rs: don't allow TPC when power save is disabled
> iwlwifi: mvm: rs: avoid TPC while in search cycle
> iwlwifi: mvm: rs: choose correct expected tpt table
> 
> ^ These are directly related to selecting the rate.
> 
> iwlwifi: mvm: implement mac80211's flush callback
> 
> ^ "This allows mac80211 to flush the Tx queues before it sends critical
> management frames." says enough and seems quite relevant to this bug.
> 
> Are there other commits there that are of interest testing? Should and can I
> just cherrypick these 6 on top of my 3.15-rc4? Which kernel release will
> they go in?

These aren't related to your device - they are related to newer devices.
Comment 47 kernel 2014-05-12 21:59:36 UTC
Just want to make sure if this ever gets fixed, that the fix includes all affected cards. My lspci reports:

02:00.0 Network controller: Intel Corporation Centrino Wireless-N 1000 [Condor Peak]
	Subsystem: Intel Corporation Centrino Wireless-N 1000 BGN

The info from other reports seems to match for me (my details below), although I have a very basic situation: one router / access point. The problem frequently occurs during bandwidth-intensive video streaming, despite a very high available bandwidth and a (mostly) strong ap signal. However, there are literally dozens of other ap signals available, but I've never connected with any of them. Not sure if they are "jamming" my frequency, but I have tried every available channel for my own connection (most are saturated).

Here are my details:

uname -a:
Linux 3.14.3-200.fc20.x86_64 #1 SMP Tue May 6 19:00:18 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

dmesg snippet:

[  368.768448] iwlwifi 0000:02:00.0: fail to flush all tx fifo queues Q 2
[  368.768463] iwlwifi 0000:02:00.0: Current SW read_ptr 72 write_ptr 223
[  368.768518] iwl data: 00000000: 00 ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[  368.768561] iwlwifi 0000:02:00.0: FH TRBs(0) = 0x00000000
[  368.768603] iwlwifi 0000:02:00.0: FH TRBs(1) = 0x80102057
[  368.768643] iwlwifi 0000:02:00.0: FH TRBs(2) = 0x00000000
[  368.768682] iwlwifi 0000:02:00.0: FH TRBs(3) = 0x80300012
[  368.768726] iwlwifi 0000:02:00.0: FH TRBs(4) = 0x00000000
[  368.768764] iwlwifi 0000:02:00.0: FH TRBs(5) = 0x00000000
[  368.768803] iwlwifi 0000:02:00.0: FH TRBs(6) = 0x00000000
[  368.768842] iwlwifi 0000:02:00.0: FH TRBs(7) = 0x00704061
[  368.768930] iwlwifi 0000:02:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [19,19]
[  368.769016] iwlwifi 0000:02:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[  368.769103] iwlwifi 0000:02:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [72,223]
[  368.769189] iwlwifi 0000:02:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769333] iwlwifi 0000:02:00.0: Q 4 is active and mapped to fifo 7 ra_tid 0x0000 [98,98]
[  368.769480] iwlwifi 0000:02:00.0: Q 5 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769587] iwlwifi 0000:02:00.0: Q 6 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769674] iwlwifi 0000:02:00.0: Q 7 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769761] iwlwifi 0000:02:00.0: Q 8 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769848] iwlwifi 0000:02:00.0: Q 9 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.769964] iwlwifi 0000:02:00.0: Q 10 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770062] iwlwifi 0000:02:00.0: Q 11 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770151] iwlwifi 0000:02:00.0: Q 12 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770238] iwlwifi 0000:02:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770325] iwlwifi 0000:02:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770455] iwlwifi 0000:02:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770550] iwlwifi 0000:02:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770636] iwlwifi 0000:02:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770722] iwlwifi 0000:02:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[  368.770808] iwlwifi 0000:02:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
Comment 48 Tom Wijsman 2014-05-12 22:29:02 UTC
(In reply to kernel from comment #47)
> Very high available bandwidth and a (mostly) strong ap signal.

Same here, I sit ~2 meters or so from the router while this happens.

> However, there are literally dozens of other ap signals available, but I've
> never connected with any of them. Not sure if they are "jamming" my
> frequency, but I have tried every available channel for my own connection
> (most are saturated).

Same here, I count ~16 APs at the moment; divided over all 2.4GHz frequencies. Regarding strength, my AP and ~2 others are at 3 of the 4 bars. There are ~6 APs at 2 of the 4 bars, with another ~6 APs at 1 of the 4 bars.
Comment 49 Martin 2014-05-13 13:35:56 UTC
FWIW I'm seeing the "Q xx is inactive" messages also when coming back from standby on my Dell XPS13 (lspci: 01:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24)). And they also inhibit proper connecting after standby. I don't know if that helps, but if I can add anything to this thread I'd be happy to contribute. Kernel: 3.14.3, fw: 18.168.6.1.
Comment 50 Luis Colunga 2014-05-28 08:34:11 UTC
Emmanuel. I am having the same issue, but as an extra comment I have noticed this issue on networks that has 802.11h enabled. 


For example Ruckus Wireless APs with a feature called "ChannelFly" uses 802.11h to change the channel.
Comment 51 Emmanuel Grumbach 2014-05-28 12:14:00 UTC
Thanks for all the reports.

Can someone test this:

diff --git a/drivers/net/wireless/iwlwifi/dvm/power.c
b/drivers/net/wireless/iwlwifi/dvm/power.c
index f2c1439..e5a6b70 100644
--- a/drivers/net/wireless/iwlwifi/dvm/power.c
+++ b/drivers/net/wireless/iwlwifi/dvm/power.c
@@ -288,6 +288,9 @@ static void iwl_power_build_cmd(struct iwl_priv *priv,
        bool enabled = priv->hw->conf.flags & IEEE80211_CONF_PS;
        int dtimper;

+       iwl_power_sleep_cam_cmd(priv, cmd);
+       return;
+
        dtimper = priv->hw->conf.ps_dtim_period ?: 1;

        if (priv->wowlan)
Comment 52 Emmanuel Grumbach 2014-05-29 05:08:37 UTC
Created attachment 137661 [details]
disable power save

The same as above, but in a decent patch format
Comment 53 Martin Scherer 2014-05-30 18:24:07 UTC
On 05/29/2014 07:08 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=56581
> 
> --- Comment #52 from Emmanuel Grumbach <emmanuel.grumbach@intel.com> ---
> Created attachment 137661 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=137661&action=edit
> disable power save
> 
> The same as above, but in a decent patch format
> 
I've noticed, that since I switched the AP network from mixed b/g/n to
b/g the bug disappeared at all. The connection strengths is same like
before, because the AP has not been moved and the laptop is also in same
position.
Maybe it disappeared, because high rates of N-mode are disabled, but
this can be excluded since we have tested fixed rates of 1mbps.

And power saving should not be related to disabling AP mixed n-mode
neither, right?

Hope this gains a few new hints.

Best,
Martin
Comment 54 Tom Wijsman 2014-05-30 18:44:50 UTC
(In reply to Emmanuel Grumbach from comment #51)
> +       iwl_power_sleep_cam_cmd(priv, cmd);
> +       return;

Reproduced: Increased pings, slower network and disconnect happened again with similar traffic in the earlier described network environment; didn't get "fail to flush Tx queue", but that's just like before that I don't always get that.

Extra debugging information: My router shows a high number of "Transmitted (TX) errors"; from the last time I have around 21000 such errors over a 2 hours span.

Thought: An internet search about this reveals that this could have to do something with "Sensitivity Range (ACK Timing)" so I'll be increasing that value with each incident to see if this can help or not.

This reminds me that I still need to set up some kind of test scenario; like, have a Raspberry Pi flood data packets to and/or from my router.

(In reply to Martin Scherer from comment #53)
> I've noticed, that since I switched the AP network from mixed b/g/n to
> b/g the bug disappeared at all.

Well, it could be that; but I'm not willing to downgrade the experience of the network or my client here, I do have seen tricks like the module parameter 11n_disable=1 (disabling N from the client instead of the router) reported to help with cases like these, so that is a way to look out for.

When I experience trouble again, I might try this parameter and compare results.

> And power saving should not be related to disabling AP mixed n-mode
> neither, right?

I perceive it to help a small bit, in terms of disconnects I sometimes receive; but it doesn't seem like a resolution to "fail to flush Tx queue" or slowdowns.
Comment 55 Emmanuel Grumbach 2014-05-31 19:46:50 UTC
@Martin: can you please run with debug=0x00001000 ?
I'd like to see the working scenario (without 11N rates) and the failing scenario (with 11N rates).

Thanks
Comment 56 Martin Scherer 2014-05-31 21:50:50 UTC
On 05/31/2014 09:46 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=56581
> 
> --- Comment #55 from Emmanuel Grumbach <emmanuel.grumbach@intel.com> ---
> @Martin: can you please run with debug=0x00001000 ?
> I'd like to see the working scenario (without 11N rates) and the failing
> scenario (with 11N rates).
> 
> Thanks
> 
Unfortunately there are still queue drops with disabled 11n, but it
doesn't lead to the same unreliable behaviour like with 11n. I will
provide some logs nevertheless.
Comment 57 kernel 2014-06-01 14:35:48 UTC
The 11n comments reminded me of a bug that's been plaguing Linux systems for years. Both of these reports (linked below) are made via Ubuntu but they both seem to be traced back to the kernel. The specific logging messages have changed over the years, but the bug seems to be very much the same! (I'm sure more googling would turn up even more...)

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/836250

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1201470
Comment 58 Emmanuel Grumbach 2014-06-08 05:16:51 UTC
Can someone provide the info I requested in comment 55?

I think that the problem might come from the protection modes.

Or, can someone try this?

diff --git a/drivers/net/wireless/iwlwifi/dvm/rxon.c b/drivers/net/wireless/iwlwifi/dvm/rxon.c
index 15d6881..8be72ac 100644
--- a/drivers/net/wireless/iwlwifi/dvm/rxon.c
+++ b/drivers/net/wireless/iwlwifi/dvm/rxon.c
@@ -1072,8 +1072,8 @@ int iwlagn_commit_rxon(struct iwl_priv *priv, struct iwl_rxon_context *ctx)
         * force CTS-to-self frames protection if RTS-CTS is not preferred
         * one aggregation protection method
         */
-       if (!priv->hw_params.use_rts_for_aggregation)
-               ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
+//     if (!priv->hw_params.use_rts_for_aggregation)
+//             ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
 
        if ((ctx->vif && ctx->vif->bss_conf.use_short_slot) ||
            !(ctx->staging.flags & RXON_FLG_BAND_24G_MSK))
@@ -1480,9 +1480,9 @@ void iwlagn_bss_info_changed(struct ieee80211_hw *hw,
        else
                ctx->staging.flags &= ~RXON_FLG_TGG_PROTECT_MSK;
 
-       if (bss_conf->use_cts_prot)
-               ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
-       else
+//     if (bss_conf->use_cts_prot)
+//             ctx->staging.flags |= RXON_FLG_SELF_CTS_EN;
+//     else
                ctx->staging.flags &= ~RXON_FLG_SELF_CTS_EN;
 
        memcpy(ctx->staging.bssid_addr, bss_conf->bssid, ETH_ALEN);
Comment 59 Cristian Aravena Romero 2014-06-10 03:02:31 UTC
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1328345
Comment 60 Emmanuel Grumbach 2014-06-19 05:01:45 UTC
Can someone test the patch in comment 58?
Comment 61 Rafael Medina 2014-06-19 20:02:27 UTC
(In reply to Emmanuel Grumbach from comment #60)
> Can someone test the patch in comment 58?

I tried the patch in a 3.15.1 kernel and I've been using my machine for around 10-11 hours (including suspends). There have been no iwlwifi errors in my dmesg and my connection has been working perfectly fine since then (I did have those errors and connection drops in a 3.15.0 without the patch)
Comment 62 Emmanuel Grumbach 2014-06-19 21:04:54 UTC
Good - other users?
Comment 63 Rafael Medina 2014-06-19 22:43:46 UTC
I've suddenly gotten:

[23441.760658] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 2

So apparently it's a no-go. Weird thing is I am not doing any network intensive operations right now, which I have done throughout the day.
Comment 64 Tom Wijsman 2014-06-20 00:01:49 UTC
As far as disconnects go, I still have them; however, if I set up an infinite normal ping to my router it keeps the connection alive.

As far as the TX fifo flushes fails go, I don't think I've since reproduced them; but given it was infrequent for me I'm not willing to label this as fixed at all.

It makes me under the impression that increasing "Sensitivity Range (ACK Timing)" has helped. But again, I might just as well experience it later. So, I'm going to give this more time; I still don't have a test, I'll try to set one up in a week.

(In reply to Rafael Medina from comment #63)
> So apparently it's a no-go. Weird thing is I am not doing any network
> intensive operations right now, which I have done throughout the day.

When I experienced this I noticed that sometimes it comes after the intensive operations; afaik, if the intensive computer shuts down, the issue can continue.

It acts somewhat like a lawine...
Comment 65 Emmanuel Grumbach 2014-06-20 16:44:13 UTC
ok - thanks all for the testing.

I will apply the patch anyway. CTS-to-self is a small optimisation, and is hard to get it right in the firmware. I will disable it since it seems to help even if there are still other issues.
Comment 66 Emmanuel Grumbach 2014-07-09 12:29:37 UTC
*** Bug 79711 has been marked as a duplicate of this bug. ***
Comment 67 Emmanuel Grumbach 2014-07-25 10:02:53 UTC
Patch that disables CTS to Self landed in Linus's tree.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=43d826ca5979927131685cc2092c7ce862cb91cd
Comment 68 Nikolay Martynov 2014-07-31 03:42:34 UTC
Hi.

  I'm getting lots of these:

[   45.964346] iwlwifi 0000:03:00.0: fail to flush all tx fifo queues Q 2
[   45.964355] iwlwifi 0000:03:00.0: Current SW read_ptr 161 write_ptr 201
[   45.964406] iwl data: 00000000: 00 00 00 00 00 00 00 00 fe ff 01 00 00 00 00 00  ................
[   45.964443] iwlwifi 0000:03:00.0: FH TRBs(0) = 0x00000000
[   45.964479] iwlwifi 0000:03:00.0: FH TRBs(1) = 0x801020b0
[   45.964500] iwlwifi 0000:03:00.0: FH TRBs(2) = 0x00000000
[   45.964520] iwlwifi 0000:03:00.0: FH TRBs(3) = 0x80300008
[   45.964541] iwlwifi 0000:03:00.0: FH TRBs(4) = 0x00000000
[   45.964562] iwlwifi 0000:03:00.0: FH TRBs(5) = 0x00000000
[   45.964582] iwlwifi 0000:03:00.0: FH TRBs(6) = 0x00000000
[   45.964605] iwlwifi 0000:03:00.0: FH TRBs(7) = 0x00704068
[   45.964687] iwlwifi 0000:03:00.0: Q 0 is active and mapped to fifo 3 ra_tid 0x0000 [9,9]
[   45.964778] iwlwifi 0000:03:00.0: Q 1 is active and mapped to fifo 2 ra_tid 0x0000 [0,0]
[   45.964859] iwlwifi 0000:03:00.0: Q 2 is active and mapped to fifo 1 ra_tid 0x0000 [161,201]
[   45.964936] iwlwifi 0000:03:00.0: Q 3 is active and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965019] iwlwifi 0000:03:00.0: Q 4 is active and mapped to fifo 7 ra_tid 0x0000 [105,105]
[   45.965097] iwlwifi 0000:03:00.0: Q 5 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965182] iwlwifi 0000:03:00.0: Q 6 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965268] iwlwifi 0000:03:00.0: Q 7 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965354] iwlwifi 0000:03:00.0: Q 8 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965443] iwlwifi 0000:03:00.0: Q 9 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965529] iwlwifi 0000:03:00.0: Q 10 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965609] iwlwifi 0000:03:00.0: Q 11 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965691] iwlwifi 0000:03:00.0: Q 12 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965778] iwlwifi 0000:03:00.0: Q 13 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965863] iwlwifi 0000:03:00.0: Q 14 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.965943] iwlwifi 0000:03:00.0: Q 15 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.966030] iwlwifi 0000:03:00.0: Q 16 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.966116] iwlwifi 0000:03:00.0: Q 17 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.966198] iwlwifi 0000:03:00.0: Q 18 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   45.966288] iwlwifi 0000:03:00.0: Q 19 is inactive and mapped to fifo 0 ra_tid 0x0000 [0,0]
[   85.490930] wlan0: deauthenticating from 88:dc:96:05:73:fa by local choice (reason=3)

It looks like the problem happens when signal is (very) weak. Currently I'm at a country side, so there are not many devices - my card can 'see' only 2 APs including one I'm connected to.

The AP I'm connecting to doesn't have encryption and it looks like it's a TP-Link router. Unfortunately that is about all I know about hardware on that side. I have 'Intel Corporation Centrino Ultimate-N 6300 (rev 3e)'.

If I set '11n_disable=1' problem doesn't seem to appear (which is expected, I presume) and I get reasonably stable connection with reasonably good speeds.

Without '11n_disable=1' I get lots of those messages in logs and almost unusable connection.

I'm using Ubuntu Trusty (Mint). Patch from https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=43d826ca5979927131685cc2092c7ce862cb91cd doesn't seem to make any difference.

Is this the same problem? Is there anything I can try to help diagnose this issue?

Thanks!
Comment 69 Emmanuel Grumbach 2014-07-31 04:54:37 UTC
@Nikolay:

can you please try this:
https://bugzilla.kernel.org/attachment.cgi?id=143861&action=diff

thanks
Comment 70 Nikolay Martynov 2014-08-01 02:58:19 UTC
Created attachment 144831 [details]
Logs after applying suggested patches

Hi.

  I've tried a suggested patch and it looks like it made matters even worse.

  I'm attaching dmesg output with errors that happened just after boot. There are lots of errors and software restarts :).

  After those errors have happened reloading wifi related kernel modules produces one more SW restart and after that I get connection that seem to be limited to 1Mbit/s in both directions, without any aggregation, as far as I can see. And even-though it is relatively stable it is still worse than results I get with '11n_disable=1'. And I do not think this is intended result of this patch.

  Just for clarification: I'm using 'stock' Ubuntu kernel designated as 3.13.0-32-generic with two patches Emmanuel Grumbach have suggested.

  Please let me know if there's something I can test or if there's any other information/logs I can provide.

  Thanks.
Comment 71 Emmanuel Grumbach 2014-08-03 18:48:49 UTC
Yes - obviously this was not what I wanted to achieve...
I'll try to poke at it tomorrow - but I am not very optimistic.
Disabling protection can be ... a bad work around - but it is certainly not a viable solution that will work for everybody. I just want to know if it will work for you, and then I can make it a module parameter.
Comment 72 Nikolay Martynov 2014-08-08 03:02:28 UTC
I looked a little bit into how it behaves when problem occurs.

Indeed rate_scale_table shows that card tried to transmit at a very high bit rate (60Mbits) which probably doesn't reach destination because link quality is not good enough. And it doesn't looks like card/driver tries to scale rate down.

I looked a little bit into the code. To my very NOOBy view it seems that hardware/firmware part is not very intelligent: it is given a list of bitrates and it tries to transmit data using those bitrates one by one. Then it reports results of those attempts back to kernel. It looks like hardware/firmware doesn't chose those bitrates - kernel does.

So, assuming that original problem is indeed caused by card/driver sending data on too high bit rate - this seems like a driver bug, not firmware bug. Is it worth poking in driver's internals to figure out what may be causing this problem?

Thanks!
Comment 73 Szilveszter Nemeth 2014-08-10 14:23:15 UTC
The problem does not occur with kernel version 3.16.0-031600. The logs under /var/log are clear and no wireless malfunction has happend since I updated my kernel from version 3.13.9 two days ago.
Comment 74 Oscar Morante 2014-08-20 07:36:33 UTC
I'm still getting this on 3.16.1 (Arch Linux).  It happens less often though.
Comment 75 Mike 2014-08-21 08:28:09 UTC
Created attachment 147591 [details]
bits of syslog with stack trace

Maybe this log will be useful.
Kernel 3.16.1 on Ubuntu x86_64, Intel Corporation Centrino Wireless-N 2230 (rev c4) in AP mode.
Comment 76 Stanislav Graf 2014-08-22 08:03:24 UTC
(In reply to Szilveszter Nemeth from comment #73)

kernel version 3.16 solved the issue also for me.
kernel-3.16.1-300.fc21.x86_64
Comment 77 Emmanuel Grumbach 2014-08-22 13:34:20 UTC
Ok thanks.

Let's close the bug then.
Comment 78 Mike 2014-08-22 16:36:30 UTC
Emmanuel, kernel 3.16.1 is still affected by the issue.
See my message #75 and message #74.
Comment 79 Szilveszter Nemeth 2014-08-22 20:16:05 UTC
(In reply to Mike from comment #78)
> Emmanuel, kernel 3.16.1 is still affected by the issue.
> See my message #75 and message #74.

Could you please post logs about the issue?
Comment 80 Mike 2014-08-23 06:24:09 UTC
Szilveszter, see https://bugzilla.kernel.org/attachment.cgi?id=147591
Comment 81 Martin 2014-08-23 12:03:33 UTC
Created attachment 147841 [details]
iwlwifi log

I still see these errors in my dmesg log. I don't know if it's related, but the logs share the "Q ... is active and mapped to fifo ..." lines. Mostly the problems occur after coming out of standby.
Comment 82 Cristian Aravena Romero 2014-08-23 13:20:41 UTC
linux 3.16.1 problem:
https://bugzilla.kernel.org/show_bug.cgi?id=83071
Comment 83 Emmanuel Grumbach 2014-08-25 13:50:38 UTC
*** Bug 83071 has been marked as a duplicate of this bug. ***
Comment 84 Emmanuel Grumbach 2014-08-25 13:52:43 UTC
(In reply to Mike from comment #80)
> Szilveszter, see https://bugzilla.kernel.org/attachment.cgi?id=147591

This is a different issue which won't be fixed
Comment 85 Emmanuel Grumbach 2014-08-25 13:55:19 UTC
(In reply to Martin from comment #81)
> Created attachment 147841 [details]
> iwlwifi log
> 
> I still see these errors in my dmesg log. I don't know if it's related, but
> the logs share the "Q ... is active and mapped to fifo ..." lines. Mostly
> the problems occur after coming out of standby.

This is another issue - you have an NMI_WDG_INTERRUPT which is clearly a firmware issue and won't be fixed.
Comment 86 Mike 2014-08-25 15:36:36 UTC
(In reply to Emmanuel Grumbach from comment #84)
> (In reply to Mike from comment #80)
> > Szilveszter, see https://bugzilla.kernel.org/attachment.cgi?id=147591
> 
> This is a different issue which won't be fixed

Any link to that issue and why it wontfix?
Comment 87 Emmanuel Grumbach 2014-08-25 19:07:23 UTC
it won't be fixed because it is a firmware issue and unfortunately, the firmware team doesn't fix bugs for these old devices anymore.
Comment 88 Blask 2014-08-29 08:31:12 UTC
this issue won't be fixed or another issue that Mike replied at comment #80?
Comment 89 Emmanuel Grumbach 2014-08-31 17:30:05 UTC
comment #80's.
Comment 90 Bill Bluescreen 2014-09-04 22:03:42 UTC
I still experience this bug sporadically while roaming between access points. I use kernel 3.0.101 with wireless modules from linux-backports 3.16.1 on arm platform. 
I discovered that I can perfectly reproduce the failure when i plug off the access point during data transfer. In this scenario, there is no sporadic behaviour any more, it happens all the time. 

Any ideas how to further address this issue? It's a big problem for us since we need reliable roaming...

Below you can find the dmesg output, but it's very similar to the previous logs.
Comment 91 Bill Bluescreen 2014-09-04 22:05:22 UTC
Created attachment 149251 [details]
dmesg output 3.16-1
Comment 92 Bill Bluescreen 2014-09-17 12:47:58 UTC
I changed the IWL_FLUSH_WAIT_MS from 2000 to 200, just for testing. The problem obviously still exits but the driver doesn't hang for such a long time. 
Of course this is no reliable solution for our roaming problem... Any further ideas?
Comment 93 Emmanuel Grumbach 2014-09-17 19:07:09 UTC
I am afraid this is a FW bug, which makes it impossible to fix.
I guess I can take your suggestion to reduce IWL_FLUSH_WAIT_MS.
Comment 94 Bill Bluescreen 2014-09-17 20:25:40 UTC
But if it's a firmware issue, why does the Windows driver not have any problems at all and roams perfectly? Is the firmware different?
Comment 95 Emmanuel Grumbach 2014-09-18 05:09:14 UTC
(In reply to Bill Bluescreen from comment #94)
> But if it's a firmware issue, why does the Windows driver not have any
> problems at all and roams perfectly? Is the firmware different?

yes - and we have no way to unify them unfortunately.
This has changed starting from 7260.
Comment 96 Tom Wijsman 2014-09-18 14:25:11 UTC
(In reply to Bill Bluescreen from comment #94)
> But if it's a firmware issue, why does the Windows driver not have any
> problems at all and roams perfectly?

Uhm, no, that is an assumption; I do experience high pings trouble on Windows at a similar moment as described above, not sure how to start debugging this...
Comment 97 pdknsk 2014-09-22 09:37:02 UTC
> I am afraid this is a FW bug, which makes it impossible to fix.

It's not impossible. Someone needs to lobby whoever is responsible at Intel to do it. This bug affects many.

I don't have much else to add. On 3.16.2 and still occurs sporadically.
Comment 98 Nikolay Martynov 2014-09-22 12:58:09 UTC
The 6300 cards were sold in Lenovo laptops less then a year ago. Problem was already known back then. Problem still exists. Problem renders 802.11n unusable for many users. And yet Intel decides not to fix firmware for it? Doesn't sound too good for a brand.
Comment 99 piruthiviraj natarajan 2014-09-24 12:39:05 UTC
I am having 02:00.0 Network controller: Intel Corporation Centrino Wireless-N 2200 (rev c4) 
This problem didn't appear until the last month with the workaround of disabling 802.11N mode. But it doesn't appear to help anymore.
Comment 100 Nikolay Martynov 2014-09-25 01:08:28 UTC
I can confirm this.

I'm experiensing these errors (and 3-5s ping times) on  Intel Corporation Centrino Ultimate-N 6300 (rev 3e) connected to Hotel's AP mac of which suggests that it is made by HP.

Setting 11n_disable=1 doesn't seem to stop the problem from happening.
Comment 101 pdknsk 2014-09-29 12:22:57 UTC
> The 6300 cards were sold in Lenovo laptops less then a year ago.

The 6235 is still sold today.
Comment 102 Bill Bluescreen 2014-09-29 14:23:36 UTC
the 6300 as well...
Comment 103 marianitn 2014-10-01 18:39:11 UTC
The "it's a firmware bug, nothing can be done" it's a pretty annoying answer. I've had this bug for *years*. First on a T410 (6200), and now on T430 (6205). I know this is not a good place to complain, but maybe someone can point us to the right channel to ask for this to be fixed.
Comment 104 Emmanuel Grumbach 2014-10-02 08:18:05 UTC
An idea came through my mind.

What we do in the flush flow is that we wait until the driver finishes to transmit all the frames. If we are far away from the AP (and waiting to roam) this can be useless. It might be better to drop the frames.

I guess that this can be something worth trying:

diff --git a/drivers/net/wireless/iwlwifi/dvm/mac80211.c b/drivers/net/wireless/iwlwifi/dvm/mac80211.c
index d8ebc7e..fe1c1af 100644
--- a/drivers/net/wireless/iwlwifi/dvm/mac80211.c
+++ b/drivers/net/wireless/iwlwifi/dvm/mac80211.c
@@ -1108,6 +1108,8 @@ static void iwlagn_mac_flush(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
                goto done;
        }

+       drop = true;
+
        /*
         * mac80211 will not push any more frames for transmit
         * until the flush is completed

---

This will drop frames - but maybe the responsiveness of the firmware will be better. I am pretty sure that it won't solve all the problems though...
Comment 105 Oscar Morante 2014-10-02 10:50:48 UTC
I'm testing this right now. So far, so good.
Comment 106 Oscar Morante 2014-10-03 12:15:50 UTC
It seems to be working much better, knock on wood :)
Comment 107 Gui Iribarren 2014-10-04 16:29:54 UTC
(In reply to Emmanuel Grumbach from comment #104)

I'm testing Emmanuel's oneliner as well, in a scenario where the bug is consistently reproduceable:
*with unpatched driver*, received signal strength between -75dbm and -80dbm, it sometimes works, but below -80dbm (approx) it consistently hangs, spitting out on syslog each 2 seconds the "fail to flush" blurb, and on 'iw event' says "connection quality monitor event: RSSI went below threshold"
ping6 to router's fe80:: also starts to spike, giving from 2000ms to even 40000ms, and everything wifi-related becomes very unresponsive (even a "modprobe -r iwldvm" or any "iw" command hangs for several seconds)

*with patched driver*, same exact scenario (between -75 and -80dbm): not even one "fail to flush" so far, iw event is silent, and ping6 is (at last!) stable, negligible packet loss, only a few periods of latency, max 5 seconds, in the beginning of the test (30 minutes long ping6)

ping6 fe80::ea94:f6ff:fe47:c059%wlan0 -L -i0.2
[...]
9062/8985 packets, 0% loss, min/avg/ewma/max = 0.973/75.654/7.810/5451.022 ms

thanks a lot Emmanuel for making my laptop's wifi a lot less miserable :)

will report tomorrow how it behaved in a longer term.
Comment 108 Oscar Morante 2014-10-05 05:48:31 UTC
Same here, still working steady after more than 48 hours.  This is using my 5Ghz which was just impossible before.  I'm very happy :)

Thanks Emmanuel.
Comment 109 Emmanuel Grumbach 2014-10-05 06:28:18 UTC
thanks for your testing.
I am uploading now a final version of the fix.

In this version, I don't flush the VO frames, which is critical to ensure proper roaming.
With this version I'd expect roaming to work much better.
Comment 110 Emmanuel Grumbach 2014-10-05 06:29:57 UTC
Created attachment 152431 [details]
drop frames as a work around.

This is potential fix for this long living bug.

Please (re-)test this version.
This one should help for roaming issues.
Comment 111 Emmanuel Grumbach 2014-10-05 08:20:37 UTC
Created attachment 152441 [details]
updated version of the fix

small change in the fix
Comment 112 Emmanuel Grumbach 2014-10-05 11:07:16 UTC
@Oscar
@Gui

can you please test the final version of the patch (Comment 111)?

Thanks
Comment 113 Oscar Morante 2014-10-05 11:26:34 UTC
Sure, I'm testing it right now.  It looks good.
Comment 114 Emmanuel Grumbach 2014-10-05 13:04:17 UTC
Created attachment 152451 [details]
fix

I got it wrong again - here is the updated patch - minor changes.
Comment 115 Bill Bluescreen 2014-10-06 08:45:50 UTC
I will now test the latest version of the patch, especially in our roaming environment and report the results later
Comment 116 Bill Bluescreen 2014-10-06 13:09:59 UTC
Created attachment 152561 [details]
syslog

for me it became much worse! 
now i got the attached error, even without any roaming and low network traffic. Signal conditions are very good, access point is 4m away from the device. 

Intel Wireless 6300, linux-backports 3.16.2-1 with the latest patch
Comment 117 Emmanuel Grumbach 2014-10-06 13:13:49 UTC
@Bill Bluescreen,

this is really weird - I tested it with the same NIC as yours and with the same FW version.
Did the backport work smoothly?

can you capture tracing?
Comment 118 Emmanuel Grumbach 2014-10-06 13:15:28 UTC
Also can you please try with debug=0xf?

this will add data to the log.
Comment 119 Emmanuel Grumbach 2014-10-06 13:27:52 UTC
(In reply to Emmanuel Grumbach from comment #118)
> Also can you please try with debug=0xf?
> 

actually I need 0x8000000f
Comment 120 Bill Bluescreen 2014-10-06 13:29:40 UTC
ok i will try it now. the backport compiled without any problems
Comment 121 Bill Bluescreen 2014-10-06 13:37:50 UTC
I found out something: The error doesn't appear if I disable background scanning in my wpa_supplicant.conf

it was set to 10s, with the line commented out the problem doesn't exist anymore. But of course I can't do any roaming.
Comment 122 Emmanuel Grumbach 2014-10-06 13:45:13 UTC
works for me:

$ cat wpa_supplicant.conf 
ctrl_interface=/var/run/wpa_supplicant

bgscan="simple:10:-45:10"

network={
	ssid="Guest"
	key_mgmt=NONE
}


[  454.756007] iwlwifi 0000:02:00.0: U iwlagn_mac_flush Flushing SCD queues: 0xffdee
[  454.757427] iwlwifi 0000:02:00.0: U iwlagn_mac_flush wait transmit/flush all frames
[  454.757438] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Emptying queue 0...
[  454.764316] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Queue 0 is now empty.
[  454.764326] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Emptying queue 1...
[  454.764331] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Queue 1 is now empty.
[  454.764336] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Emptying queue 2...
[  454.764341] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Queue 2 is now empty.
[  454.764346] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Emptying queue 3...
[  454.764351] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Queue 3 is now empty.
[  454.764356] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Emptying queue 11...
[  454.764361] iwlwifi 0000:02:00.0: U iwl_trans_pcie_wait_txq_empty Queue 11 is now empty.

This to work fine...

I have: sudo ping X.X.X.X -s 5000 -f running in background
Comment 123 Emmanuel Grumbach 2014-10-06 13:47:15 UTC
also:

$ uname -r
3.16.1+

$ git log --oneline -2
2f32ded iwlwifi: dvm: drop non VO frames when flushing
9a35988 Linux 3.16.1
Comment 124 Bill Bluescreen 2014-10-06 14:10:47 UTC
Created attachment 152581 [details]
dmesg output

this is my dmesg output with background scan interval 10s

This row seems interessting:
[  213.162517] iwlwifi 0000:01:00.0: Error sending REPLY_SCAN_CMD: time out after 2000ms.
Comment 125 Emmanuel Grumbach 2014-10-06 14:14:06 UTC
this is clearly not related to my patch - nor to the bug discussed here.
didn't you have that before my patch?
Comment 126 Bill Bluescreen 2014-10-06 14:16:28 UTC
and no further scan requests from wpa_supplicant are possible: 

1412109239.265251: wlan0: Starting radio work 'scan'@0xb2238 after 0.000071 second wait
1412109239.265457: wlan0: nl80211: scan request
1412109241.630229: nl80211: Scan trigger failed: ret=-110 (Connection timed out)
1412109241.631613: wlan0: Radio work 'scan'@0xb2238 done in 2.366345 seconds
Comment 127 Bill Bluescreen 2014-10-06 14:19:21 UTC
no, I didn't have this before but I had to upgrade to 3.16 to apply your patch. Before I was using 3.15.1-1 but your patch did not apply since the signature of the method iwlagn_mac_flush changed
Comment 128 Emmanuel Grumbach 2014-10-06 14:24:13 UTC
(In reply to Bill Bluescreen from comment #127)
> no, I didn't have this before but I had to upgrade to 3.16 to apply your
> patch. Before I was using 3.15.1-1 but your patch did not apply since the
> signature of the method iwlagn_mac_flush changed

ok - go back to 3.15.1-1 and add 77be2c54c5bd26279abc13807398771d80cda37a before my patch.
Comment 129 Bill Bluescreen 2014-10-06 14:26:19 UTC
3.15.1-1:
static void iwlagn_mac_flush(struct ieee80211_hw *hw, u32 queues, bool drop)

3.16/patch
static void iwlagn_mac_flush(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
			     u32 queues, bool drop)



ok, I will try this, thanks in advance!
Comment 130 Emmanuel Grumbach 2014-10-06 14:28:38 UTC
(In reply to Bill Bluescreen from comment #129)
> 3.15.1-1:
> static void iwlagn_mac_flush(struct ieee80211_hw *hw, u32 queues, bool drop)
> 
> 3.16/patch
> static void iwlagn_mac_flush(struct ieee80211_hw *hw, struct ieee80211_vif
> *vif,
>                            u32 queues, bool drop)
> 
> 
> 
> ok, I will try this, thanks in advance!

Note that I have the exact same NIC (6300), FW and driver as you (I am currently running 3.16.1 + patch) and it works perfectly.
But the timeout issue you saw may be related to power state of the link (PCI) which may vary from platform to platform.
Comment 131 Bill Bluescreen 2014-10-06 14:35:24 UTC
btw: you are right, the problem exists also without your patch on 3.16.2... will try 3.15.1 again now with the patches...
Comment 132 Bill Bluescreen 2014-10-06 14:51:07 UTC
Created attachment 152591 [details]
dmesg output

with 3.15.1-1 I got the attached error, more or less once a minute and without any roaming
Comment 133 Bill Bluescreen 2014-10-06 15:26:34 UTC
Created attachment 152601 [details]
dmesg output

this is the dmesg output with debug=0x800000f 
3.15.1-1, Intel 6300.

No errors if background scanning is disabled.

Even without your patch I have the error. 
No problems however if I do not apply the 77be2c54c5bd26279abc13807398771d80cda37a from kernel git!
Comment 134 Bill Bluescreen 2014-10-06 15:53:31 UTC
I made it work by commenting out 

+	//if (sdata)
+		//check_sdata_in_driver(sdata);

in the 77be2c54c5bd26279abc13807398771d80cda37a patch. Now the patched driver works with enabled background-scanning and in a few simple and quick roaming tests I was not able to reproduce the problem anymore. I will do further testing during the next days but it does not look too bad at the moment! ;)
Comment 135 Emmanuel Grumbach 2014-10-06 16:11:08 UTC
(In reply to Bill Bluescreen from comment #134)
> I made it work by commenting out 
> 
> +     //if (sdata)
> +             //check_sdata_in_driver(sdata);
> 
> in the 77be2c54c5bd26279abc13807398771d80cda37a patch. Now the patched
> driver works with enabled background-scanning and in a few simple and quick
> roaming tests I was not able to reproduce the problem anymore. I will do
> further testing during the next days but it does not look too bad at the
> moment! ;)

This makes absolutely no sense to me.
Clearly these 2 lines don't affect the flow towards the NIC.

you'll agree with me that if sdata is NULL, these 2 lines are no-op.
If sdata is not NULL, these 2 lines just warn in a certain case... so ?
Comment 136 Emmanuel Grumbach 2014-10-06 16:25:04 UTC
@Bill,

please open a new bug - the discussion we are having isn't really related to this specific bug.
It'll be easier to handle it as a separate bug.
Comment 137 Emmanuel Grumbach 2014-10-06 19:03:04 UTC
@Oscar and @Gui

please share your NIC model.
I'd like to see how many NICs were covered.
I tested 6200 and 6300 which are very similar.

Other potential testers?
I added in the URL section a link to Ubuntu's database in which you can find a link to a build that includes the fix. In case you have Ubuntu and don't feel like compiling the kernel.
Comment 138 Cristian Aravena Romero 2014-10-06 19:08:18 UTC
@Emmanuel

I'm testing kernel in my Ubuntu 14.10.
Comment 139 Emmanuel Grumbach 2014-10-06 19:11:19 UTC
@Cristian

what NIC do you have?
Comment 140 Cristian Aravena Romero 2014-10-06 19:19:01 UTC
@Emmanuel

01:00.0 Network controller [0280]: Intel Corporation Centrino Advanced-N 6235 [8086:088e] (rev 24)
	Subsystem: Intel Corporation Centrino Advanced-N 6235 AGN [8086:4060]
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 29
	Region 0: Memory at f0500000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
Comment 141 Cristian Aravena Romero 2014-10-06 19:32:41 UTC
Created attachment 152681 [details]
dmesg out - version a

Now only hope.
Comment 142 Emmanuel Grumbach 2014-10-06 19:35:34 UTC
how fast did you get the "fail to flush" messages usually?
Comment 143 Cristian Aravena Romero 2014-10-06 19:51:53 UTC
@Emmanuel

I lie down to sleep and when I wake up, the message is :). About 8 hours. 

Is there any way to speed up the test?
Comment 144 Emmanuel Grumbach 2014-10-06 19:53:24 UTC
Don't think there is way to speed up things.

I guess I just need to be more patient :)
Comment 145 Martin Scherer 2014-10-06 20:14:32 UTC
> Is there any way to speed up the test?
> 
I experienced this error most of the time, if there is a lot of traffic.
Comment 146 pdknsk 2014-10-06 21:40:32 UTC
I have no clue, but could this be in any way related to channels 12 + 13 which are not used in the US, unlike in many other countries, particularly in Europe.

PS. I still think it's poor company policy from Intel to not update their firmware for "old" (still sold) devices. Rather than making you dance around the issue, trying to fix what your colleagues don't, it could possibly be fixed much easier directly in the firmware. It shows that an open source driver is only worth so much when the firmware is closed.
Comment 147 Bill Bluescreen 2014-10-06 22:42:06 UTC
@Emanuel tomorrow I'll further investigate this issue and maybe open a new bug. It seems strange to me as well, tomorrow I will try to reproduce it again. 

Anyhow, now it seems to work for me. I didn't see the error any more and during this week I will bring the device to our production environment were we discovered the roaming problems and do more testing there. In lab until now everything seems to be fine. 

@pdknsk I absolutely share your point of view. A firmware fix would be very desirable for a device that is still sold by a company big and famous as Intel... However, at the moment a workaround that will do the job is better than nothing. I am curious to see how it performs in our production environment...
Comment 148 Cristian Aravena Romero 2014-10-06 22:49:10 UTC
Created attachment 152711 [details]
dmesg output - version b

[ 9322.009112] iwlwifi 0000:01:00.0: Microcode SW error detected.  Restarting 0x2000000.
[ 9322.009125] iwlwifi 0000:01:00.0: CSR values:
[ 9322.009130] iwlwifi 0000:01:00.0: (2nd byte of CSR_INT_COALESCING is CSR_INT_PERIODIC_REG)
[ 9322.009139] iwlwifi 0000:01:00.0:        CSR_HW_IF_CONFIG_REG: 0X00484b00
[ 9322.009147] iwlwifi 0000:01:00.0:          CSR_INT_COALESCING: 0X0000ff40
[ 9322.009155] iwlwifi 0000:01:00.0:                     CSR_INT: 0X00000000
[ 9322.009163] iwlwifi 0000:01:00.0:                CSR_INT_MASK: 0X00000000
[ 9322.009170] iwlwifi 0000:01:00.0:           CSR_FH_INT_STATUS: 0X00000000
[ 9322.009178] iwlwifi 0000:01:00.0:                 CSR_GPIO_IN: 0X00000038
[ 9322.009186] iwlwifi 0000:01:00.0:                   CSR_RESET: 0X00000000
[ 9322.009194] iwlwifi 0000:01:00.0:                CSR_GP_CNTRL: 0X080403c5
[ 9322.009202] iwlwifi 0000:01:00.0:                  CSR_HW_REV: 0X000000b0
[ 9322.009210] iwlwifi 0000:01:00.0:              CSR_EEPROM_REG: 0Xc4540ffd
[ 9322.009217] iwlwifi 0000:01:00.0:               CSR_EEPROM_GP: 0X90000801
[ 9322.009225] iwlwifi 0000:01:00.0:              CSR_OTP_GP_REG: 0X00030001
[ 9322.009233] iwlwifi 0000:01:00.0:                 CSR_GIO_REG: 0X00080044
[ 9322.009241] iwlwifi 0000:01:00.0:            CSR_GP_UCODE_REG: 0X0000a963
[ 9322.009249] iwlwifi 0000:01:00.0:           CSR_GP_DRIVER_REG: 0X00000000
[ 9322.009256] iwlwifi 0000:01:00.0:           CSR_UCODE_DRV_GP1: 0X00000000
[ 9322.009273] iwlwifi 0000:01:00.0:           CSR_UCODE_DRV_GP2: 0X00000000
[ 9322.009279] iwlwifi 0000:01:00.0:                 CSR_LED_REG: 0X00000040
[ 9322.009285] iwlwifi 0000:01:00.0:        CSR_DRAM_INT_TBL_REG: 0X881189d2
[ 9322.009291] iwlwifi 0000:01:00.0:        CSR_GIO_CHICKEN_BITS: 0X27800200
[ 9322.009298] iwlwifi 0000:01:00.0:             CSR_ANA_PLL_CFG: 0X00000000
[ 9322.009303] iwlwifi 0000:01:00.0:      CSR_MONITOR_STATUS_REG: 0X43f7f7d7
[ 9322.009310] iwlwifi 0000:01:00.0:           CSR_HW_REV_WA_REG: 0X0001001a
[ 9322.009316] iwlwifi 0000:01:00.0:        CSR_DBG_HPET_MEM_REG: 0Xffff0010
[ 9322.009320] iwlwifi 0000:01:00.0: FH register values:
[ 9322.009360] iwlwifi 0000:01:00.0:         FH_RSCSR_CHNL0_STTS_WPTR_REG: 0X0cbb2400
[ 9322.009377] iwlwifi 0000:01:00.0:        FH_RSCSR_CHNL0_RBDCB_BASE_REG: 0X00cccf60
[ 9322.009394] iwlwifi 0000:01:00.0:                  FH_RSCSR_CHNL0_WPTR: 0X00000008
[ 9322.009412] iwlwifi 0000:01:00.0:         FH_MEM_RCSR_CHNL0_CONFIG_REG: 0X80801114
[ 9322.009429] iwlwifi 0000:01:00.0:          FH_MEM_RSSR_SHARED_CTRL_REG: 0X000000fc
[ 9322.009446] iwlwifi 0000:01:00.0:            FH_MEM_RSSR_RX_STATUS_REG: 0X07030000
[ 9322.009463] iwlwifi 0000:01:00.0:    FH_MEM_RSSR_RX_ENABLE_ERR_IRQ2DRV: 0X00000000
[ 9322.009479] iwlwifi 0000:01:00.0:                FH_TSSR_TX_STATUS_REG: 0X07ff0001
[ 9322.009497] iwlwifi 0000:01:00.0:                 FH_TSSR_TX_ERROR_REG: 0X00000000
[ 9322.009503] iwlwifi 0000:01:00.0: Loaded firmware version: 18.168.6.1
[ 9322.009628] iwlwifi 0000:01:00.0: Start IWL Error Log Dump:
[ 9322.009633] iwlwifi 0000:01:00.0: Status: 0x000002CC, count: 6
[ 9322.009639] iwlwifi 0000:01:00.0: 0x0000198A | ADVANCED_SYSASSERT          
[ 9322.009644] iwlwifi 0000:01:00.0: 0x00015920 | uPc
[ 9322.009648] iwlwifi 0000:01:00.0: 0x00015910 | branchlink1
[ 9322.009653] iwlwifi 0000:01:00.0: 0x00015910 | branchlink2
[ 9322.009657] iwlwifi 0000:01:00.0: 0x0000DBEA | interruptlink1
[ 9322.009661] iwlwifi 0000:01:00.0: 0x00000000 | interruptlink2
[ 9322.009666] iwlwifi 0000:01:00.0: 0x000014ED | data1
[ 9322.009670] iwlwifi 0000:01:00.0: 0x00000023 | data2
[ 9322.009674] iwlwifi 0000:01:00.0: 0x000001DC | line
[ 9322.009679] iwlwifi 0000:01:00.0: 0xD1402849 | beacon time
[ 9322.009683] iwlwifi 0000:01:00.0: 0x111267B7 | tsf low
[ 9322.009687] iwlwifi 0000:01:00.0: 0x00000060 | tsf hi
[ 9322.009692] iwlwifi 0000:01:00.0: 0x00000000 | time gp1
[ 9322.009696] iwlwifi 0000:01:00.0: 0x2B4EC6B7 | time gp2
[ 9322.009700] iwlwifi 0000:01:00.0: 0x00000000 | time gp3
[ 9322.009705] iwlwifi 0000:01:00.0: 0x754312A8 | uCode version
[ 9322.009709] iwlwifi 0000:01:00.0: 0x000000B0 | hw version
[ 9322.009714] iwlwifi 0000:01:00.0: 0x00484B00 | board version
[ 9322.009718] iwlwifi 0000:01:00.0: 0x0000001C | hcmd
[ 9322.009722] iwlwifi 0000:01:00.0: 0xAF863040 | isr0
[ 9322.009727] iwlwifi 0000:01:00.0: 0x1189F800 | isr1
[ 9322.009731] iwlwifi 0000:01:00.0: 0x00000E1A | isr2
[ 9322.009736] iwlwifi 0000:01:00.0: 0x0547FCC3 | isr3
[ 9322.009740] iwlwifi 0000:01:00.0: 0x00000000 | isr4
[ 9322.009744] iwlwifi 0000:01:00.0: 0x00800110 | isr_pref
[ 9322.009758] iwlwifi 0000:01:00.0: 0x000260A4 | wait_event
[ 9322.009762] iwlwifi 0000:01:00.0: 0x000000D4 | l2p_control
[ 9322.009765] iwlwifi 0000:01:00.0: 0x00000000 | l2p_duration
[ 9322.009768] iwlwifi 0000:01:00.0: 0x00000007 | l2p_mhvalid
[ 9322.009771] iwlwifi 0000:01:00.0: 0x00101042 | l2p_addr_match
[ 9322.009774] iwlwifi 0000:01:00.0: 0x00000005 | lmpm_pmg_sel
[ 9322.009777] iwlwifi 0000:01:00.0: 0x13011136 | timestamp
[ 9322.009780] iwlwifi 0000:01:00.0: 0x00000818 | flow_handler
[ 9322.009837] iwlwifi 0000:01:00.0: Start IWL Event Log Dump: nothing in log
[ 9322.010165] iwlwifi 0000:01:00.0: Fw not loaded - dropping CMD: 1e
[ 9322.010172] iwlwifi 0000:01:00.0: flush request fail
[ 9322.010178] ieee80211 phy0: Hardware restart was requested
[ 9322.010321] iwlwifi 0000:01:00.0: L1 Disabled; Enabling L0S
[ 9322.017008] iwlwifi 0000:01:00.0: Radio type=0x2-0x1-0x0
Comment 149 Emmanuel Grumbach 2014-10-07 05:38:11 UTC
@Cristian - I think this firmware error is not related to this specific bug.
I have seen it before.
Did the recovery succeed?
Comment 150 Oscar Morante 2014-10-07 12:55:32 UTC
Hi there, here's my NIC:


03:00.0 Network controller: Intel Corporation PRO/Wireless 5100 AGN [Shiloh] Network Connection
	Subsystem: Intel Corporation WiFi Link 5100 AGN
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0, IRQ 48
	Memory at f2500000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-1e-65-ff-ff-e6-0f-4c
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi


In my environment, I can't really tell the difference between your first patch and the last.  But they all make a huge difference compared to the previous behavior.
Comment 151 Cristian Aravena Romero 2014-10-07 13:00:30 UTC
@Emmanuel

I do not understand your question. You mean if I could surf the internet? If if I was able to navigate.
Comment 152 marianitn 2014-10-07 13:08:58 UTC
Hi,
   I'm testing the ubuntu build with this NIC:

03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 [Taylor Peak] (rev 34)
	Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN
	Flags: bus master, fast devsel, latency 0, IRQ 45
	Memory at f1c00000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [c8] Power Management version 3
	Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [e0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number a4-4e-31-ff-ff-41-4a-b8
	Kernel driver in use: iwlwifi

So far so good, but tomorrow will be able to test on a roaming environment.


PD: thanks @emmanuel for your work
Comment 153 Emmanuel Grumbach 2014-10-07 13:16:05 UTC
(In reply to Cristian Aravena Romero from comment #151)
> @Emmanuel
> 
> I do not understand your question. You mean if I could surf the internet? If
> if I was able to navigate.

Yes.
I want to know if you had any issues besides this fat error in the log.

@marianitn

thanks for your testing.
Comment 154 Emmanuel Grumbach 2014-10-07 14:39:04 UTC
I pushed the patch to my tree:

https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/iwlwifi-fixes.git/commit/?id=ff36e3756cc8abb42608ceed4d173d128ac4c9ad

I'll leave this bug open for a few more days to be able to get more reports and then I'll close it.
Comment 155 Nikolay Martynov 2014-10-07 15:06:28 UTC
@Emmanuel:

  Does this patch address those timeouts that were happening in non-roaming environments woth weak signal in any way?

  I'm currently testing your patch but my location but unfortunately signal here is too strong to trigger it. I'll try to do more testing in couple of days but all discussion above was concerning roaming scenario and I'm a little bit concerned about this bug being closed since originally it doesn't seem it was about roaming at all.

Thanks.
Nikolay.
Comment 156 Emmanuel Grumbach 2014-10-07 15:16:55 UTC
@Nikolay:

This patch is good for your scenario too.
I don't think that Oscar was testing roaming since my very first version (the one-liner) wasn't good for roaming and he said it improved the behavior on his setup.
Comment 157 Oscar Morante 2014-10-07 15:23:54 UTC
That's right, I'm using a single AP here and it works a lot better with Emmanuel's patch.
Comment 158 Martin 2014-10-08 06:37:04 UTC
My wakeup from standby problems seem to have vanished (or at least become less frequent) as well. Thx for the fix!
Comment 159 Martin 2014-10-08 06:38:25 UTC
This is the device:
01:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24)
        Subsystem: Intel Corporation Centrino Advanced-N 6235 AGN
        Flags: bus master, fast devsel, latency 0, IRQ 28
        Memory at d0400000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [c8] Power Management version 3
        Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [e0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number ************
        Kernel driver in use: iwlwifi
Comment 160 Nikolay Martynov 2014-10-08 16:51:05 UTC
I've given it a try on intel 6300. I didn't get any 'long pings' but after several hours of reasonably good work it just stopped receiving or sending anything. Like it went completelly deaf and mute, suddenly, no errors in logs as far as I can see.

Other clients of same AP do not exibit same behaviour.

I'm running 3.17 + latest patch.

I'm not sure if this is problem itself is in any way related to this bug. But I guess my question at this point is: what should I do to gather reasonable debugging information to open a new bug, considering that logs are silent?

Thanks.
Nikolay.
Comment 161 Emmanuel Grumbach 2014-10-09 16:25:50 UTC
@Nikolay,

This is obviously a different issue. To see what happens, I'd need you to run with debug=0xC0800000 given as parameter to iwlwifi. It'll be very chatty, but it will let us know what is happening.
Comment 162 Bill Bluescreen 2014-10-09 16:54:37 UTC
roaming seems to work much better and faster now. However, now we have a high rate of lost packages. When testing with iperf (UDP at 2Mbit/s) we have 1~2% packet loss, even at best signal conditions. 
Without the last patch, the same device has <0,02% lost packages. 

Does anyone else have similar effects?
Comment 163 Emmanuel Grumbach 2014-10-09 18:13:08 UTC
I can try to make the patch slightly less brutal - I'll come up with an updated version in the next 2 hours.
Comment 164 marianitn 2014-10-09 18:14:16 UTC
I'm experiencing exactly the same behavior as comment 160, although I can't blame the patch because the network is very bad here, and I recall having this behaviour before. But other devices work.

When I try to enable debug I get:

[    3.882957] iwlwifi: unknown parameter 'debug' ignored
[    3.883229] iwlwifi 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
[    3.883360] iwlwifi 0000:03:00.0: irq 45 for MSI/MSI-X
[    3.889803] iwlwifi 0000:03:00.0: loaded firmware version 18.168.6.1 op_mode iwldvm
[    3.916377] iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEBUG disabled
[    3.916381] iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEBUGFS enabled
[    3.916383] iwlwifi 0000:03:00.0: CONFIG_IWLWIFI_DEVICE_TRACING enabled
[    3.916386] iwlwifi 0000:03:00.0: Detected Intel(R) Centrino(R) Advanced-N 6205 AGN, REV=0xB0
[    3.916547] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    3.957977] ieee80211 phy0: Selected rate control algorithm 'iwl-agn-rs'
[    4.589261] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.595840] iwlwifi 0000:03:00.0: Radio type=0x1-0x2-0x0
[    4.863533] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
[    4.870060] iwlwifi 0000:03:00.0: Radio type=0x1-0x2-0x0

I think the ubuntu build I'm running (3.16.0-20.27+lp1328345v201410061136) doesn't have debug enabled. Is there other build I can try?
Comment 165 Emmanuel Grumbach 2014-10-09 18:21:34 UTC
(In reply to marianitn from comment #164)

> 
> I think the ubuntu build I'm running (3.16.0-20.27+lp1328345v201410061136)
> doesn't have debug enabled. Is there other build I can try?

exactly - you need CONFIG_IWLWIFI_DEBUG
Note that this is not the same bug.
Comment 166 Emmanuel Grumbach 2014-10-09 19:41:57 UTC
Created attachment 153041 [details]
less aggressive drop

This should be applied on top of the previous patch.
It will be less aggressive and will hopefully reduce the packet loss.

Note that you will probably see
"fail to flush all tx fifo queues Q 0"

ignore those and let me know how it behaves

If this is still not good enough - I will make mac80211 give a hint to the driver if we are disconnecting or not.
Comment 167 Bill Bluescreen 2014-10-09 20:18:13 UTC
@Emanuel 
thank you for the quick response. 
Wouldn't this lead to the same behaviour as before? 
In case of any errors (as before while roaming in our case) this will lead to the same situation: the device will be locked and the driver will wait for 2000ms at iwl_trans_wait_tx_queue_empty and only after this 2 seconds the queues will be flushed. Maybe this could work with shorter time-out values in iwl_trans_wait_tx_queue_empty?
Or do I miss something?

In any case, I will do some testing tomorrow and report as soon as possibile!
Comment 168 Bill Bluescreen 2014-10-09 20:19:53 UTC
ok, I missed the change in the second file, in fact the time out is shorter now ;)
Comment 169 Emmanuel Grumbach 2014-10-09 20:21:54 UTC
I reduced the time out to 100ms.
I first try to give a chance to the packets to be sent.
I that doesn't work - I flush.

Please also note that 1~2% packet loss is not nice I agree, but in TCP you shouldn't really feel it I think. Unless you need super high throughput.
Comment 170 Bill Bluescreen 2014-10-09 21:12:06 UTC
For me at home a loss of 2% doesn't matter, I do completely agree. Our customer however has higher expectations and requirements and won't accept values higher than 1%. (Which was no problem at all before applying the patch). Tomorrow I will try the second version.

PS: I just tried to reproduce the error at home with my laptop (Intel Wireless 6235) and I didn't loose any packets with yesterday's version of the patch. But with the 6235 card I never had any trouble, no "fail to flush all tx fifo queue" messages. Just to give feedback for another card...
Comment 171 Bill Bluescreen 2014-10-10 09:44:12 UTC
Created attachment 153091 [details]
dmesg output

This is the output with the latest patch on intel 6300 NIC :-(
Comment 172 Oscar Morante 2014-10-10 10:05:38 UTC
It feels like that the latest patches aren't perform as well as the early ones when in low-signal (no roaming).

There's a room here where I've always lost the connection, but then it started working perfectly with this patch (https://bugzilla.kernel.org/attachment.cgi?id=152451&action=diff) and now it fails again with the new one.

I'll try to do a more scientific test and post the results.  Which logs would be useful to help debug this?
Comment 173 Emmanuel Grumbach 2014-10-10 11:58:00 UTC
Yes - this is what I was afraid of - this kind of heuristics is ... problematic.
I'll try to think it over during the week end.
Comment 174 Emmanuel Grumbach 2014-10-10 13:42:44 UTC
Chatting with a colleague gave us an idea:

can you please load mac80211 with beacon_loss_count=1000 as a module parameter?

And of course - drop my patch from comment 166.
This should reduce the packet loss and will probably not affect roaming.

thanks
Comment 175 Bill Bluescreen 2014-10-10 14:10:08 UTC
Comment #171 might be a problem with our backport, with kernel 3.16.2 it seems ok, no more package loss with the second patch but I didn't test roaming yet.
Comment 176 Emmanuel Grumbach 2014-10-10 14:16:02 UTC
I think that my suggestion from Comment 174 is better than the second patch.
Comment 177 Emmanuel Grumbach 2014-10-12 19:53:15 UTC
*** Bug 86101 has been marked as a duplicate of this bug. ***
Comment 178 Bill Bluescreen 2014-10-13 13:01:04 UTC
Created attachment 153421 [details]
dmesg output

I removed the last patch and loaded mac80211 with the proposed option but now we have the well-known "fail to flush all tx fifi queues..."
Comment 179 Johannes Berg 2014-10-13 13:05:34 UTC
(In reply to Bill Bluescreen from comment #178)

> I removed the last patch and loaded mac80211 with the proposed option but
> now we have the well-known "fail to flush all tx fifi queues..."

Did you have any patch applied? That was intended to go with the patch from comment #154
Comment 180 Bill Bluescreen 2014-10-13 13:27:56 UTC
(In reply to Johannes Berg from comment #179)
> (In reply to Bill Bluescreen from comment #178)
> 
> > I removed the last patch and loaded mac80211 with the proposed option but
> > now we have the well-known "fail to flush all tx fifi queues..."
> 
> Did you have any patch applied? That was intended to go with the patch from
> comment #154

yes, we applied the patch from comment #154 and removed the one from #166
Comment 181 Emmanuel Grumbach 2014-10-13 13:39:03 UTC
Please reproduce with debug=0x8000000f.

This is very strange. Are sure you didn't have issues with the patches orbackport?
Comment 182 Bill Bluescreen 2014-10-13 15:35:32 UTC
Created attachment 153431 [details]
syslog

Hello,

we tested again with debug messages enabled. The txflush error did not appear any more but we still have packet loss.
In the debug output "iwlagn_mac_flush wait transmit/flush all frames" appears several times. Shouldn't these lines be vanished after setting missed_beacon_count=1000?
Comment 183 Bill Bluescreen 2014-10-14 12:23:12 UTC
Also here in our lab we still have "iwlagn_mac_flush wait transmit/flush all frames" messages in our system log.
Comment 184 Emmanuel Grumbach 2014-10-14 12:45:33 UTC
Please add dump_stack() in iwlagn_mac_flush() so that we can see what triggers the flush.

Thanks
Comment 185 Emmanuel Grumbach 2014-10-14 15:04:57 UTC
@Bill - we are really discussing a different issue.
If you want, you can open a new bug and CC ilw@linux.intel.com.
I don't want to continue spamming all the (many) people following this bug.
Comment 186 Emmanuel Grumbach 2014-10-14 15:55:28 UTC
I am closing this bug.

New / related issues should be discussed in a new bug that can point to this one.

Note You need to log in before you can comment on or make changes to this bug.