Bug 77491 - iwlwifi: 7260 packet loss / instability in 40MHZ
Summary: iwlwifi: 7260 packet loss / instability in 40MHZ
Status: CLOSED WILL_NOT_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-06-08 00:22 UTC by Nathan Schulte
Modified: 2014-08-22 07:22 UTC
CC List: 5 users

See Also:
Kernel Version: 3.14.x, 3.15-rc8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
grep 'Linux version\|iwl\|Intel.*Wireless\|80211\|wlan0' /var/log/kern.log > kern.log (895.25 KB, text/x-log)
2014-06-08 00:22 UTC, Nathan Schulte
latency w/ control, both directions (3.14.6, no MAC_PROT_FLG_SELF_CTS_EN fix) (1.87 MB, image/png)
2014-06-13 05:58 UTC, Nathan Schulte
latency w/ control, both directions (3.14.6, no MAC_PROT_FLG_SELF_CTS_EN fix) (3.04 MB, image/png)
2014-06-13 06:12 UTC, Nathan Schulte
trace with e1000 router set to 20MHz only BW (working) (323.33 KB, application/x-xz)
2014-06-25 11:56 UTC, dflogeras2
trace with e1000 router set to Auto (20MHz or 40MHz) (358.85 KB, application/x-xz)
2014-06-25 11:57 UTC, dflogeras2
Latest Core5 FW (665.56 KB, application/octet-stream)
2014-07-28 04:57 UTC, Emmanuel Grumbach
Network traffic graph during large file copy (28.64 KB, image/png)
2014-08-21 23:16 UTC, dflogeras2

Description Nathan Schulte 2014-06-08 00:22:30 UTC
Created attachment 138481 [details]
grep 'Linux version\|iwl\|Intel.*Wireless\|80211\|wlan0' /var/log/kern.log > kern.log

I'm experiencing issues with an Intel Dual Band Wireless AC 7260 in my new laptop.  The networking quickly becomes unstable, and at certain points seemingly no data is getting in or out.  If I'm patient enough, the system goes in and out of this state, and to varying extents: sometimes latency is just very high, sometimes traffic is bursty, and sometimes the link looks dead.  I'm usually not patient, and if I instruct Network Manager to cycle the WiFi iface off/on, things seem solid again for at least a short period.

At the office where I use the laptop most, we have just installed Ubiquiti Networks' Unifi AP AC (3x UAP-AC).  Previously, we had an Apple Airport Extreme (5th generation, I believe; it had 5 GHz 802.11n) and an Airport Express.  At my home, I have a Linksys WRT54GL running Tomato Firmware v1.28.1816.  I seem to experience similar issues at home as I do at the office.

I am running Debian unstable, presently kernel 3.14.5, but 3.15-rc7 and -rc8 are available in Debian experimental (note, 7260-9.ucode is not packaged by Debian; I haven't tried anything outside of Debian yet; I'm looking for direction in that regard).

I believe I've been having these issues since kernel 3.12/13 (which was around when I received the machine), but it's possible those issues were unrelated (we got new network/infrastructure hardware between now and then while I was not using the laptop).  However, from the perspective of ioquake3 Quake 3 Arena's networking graph and outputs from tests with mtr, traceroute, ping, dig and nslookup, the issues seem mightily similar.

This bug report seems to discuss issues similar to mine:
https://bugzilla.kernel.org/show_bug.cgi?id=72601

Please let me know if there's any other logging I should provide.  The attached log contains boots from both environments, and multiple kernels (3.14.4, 3.14.5, 3.15-rc7, 3.15-rc8).

As well, it looks like there's a stack trace going through ieee80211 at some point; let me know if you want that full trace for some reason, or if you want me to file another report based on it.
Comment 1 Emmanuel Grumbach 2014-06-08 05:04:36 UTC
Hello,

1) You have an issue with your Ubiquiti AP. I reported it to them about 3 months ago. They said they would fix their software. Here is the problem:
Jun  7 19:09:17 nms-debian kernel: [30350.093878] iwlwifi 0000:06:00.0 wlan0: disabling HT as WMM/QoS is not supported by the AP
Jun  7 19:09:17 nms-debian kernel: [30350.093886] iwlwifi 0000:06:00.0 wlan0: disabling VHT as WMM/QoS is not supported by the AP
Check you have the latest firmware for the AP.

2) There are quite a few fixes which just made their way to 3.14 (3.14.6) - I suggest using 3.14.6 + -9.ucode.
Comment 2 Nathan Schulte 2014-06-08 06:42:51 UTC
Actually, all of the WMM/QoS not supported issues are with the Linksys WRT54GL at home, not with the Ubiquiti APs at work.  I guess that means they've fixed their issue.  That said... this isn't really a bug then (and the notice in the log is not indicative of an issue; except that most applications of the Ubiquiti gear should probably be leveraging it), correct?  What are "HT" and "VHT"?

As for #2, I'm looking into it.  Trying to figure out how to build the kernels the Debian way, and applying the ABI v9 bump to 3.14.5 didn't seem to work.  I'll keep banging on it and report back regarding 3.14.6 + -9.ucode.
Comment 3 Emmanuel Grumbach 2014-06-08 06:47:04 UTC
3.14.5 + API bump isn't enough. There are other fixes in 3.14.6.

HT is 11n.
VHT is 11ac.
Comment 4 Nathan Schulte 2014-06-08 06:51:49 UTC
Noted; I planned to jump to latest 3.14 once I proved the procedure with just a single patch.

And for the record, HT is short for "high throughput" and VHT is short for "very high throughput" (11n and 11ac respectively, apparently).  WMM/QoS are required to be supported by the 11n and 11ac standards.  Apparently "required to be supported" also means "required to be enabled and active during use", not just verified working throughout the stack at time of certification.  Interesting indeed, though I suppose the benefit likely outweighs any harm.
Comment 5 Nathan Schulte 2014-06-08 11:22:08 UTC
Well, building 3.14.6 and dropping in the -9.ucode seems to have resolved things here at home.  I didn't use the laptop much at home, so I won't have proper feedback until I go into the office tomorrow.
Comment 6 Nathan Schulte 2014-06-10 18:56:01 UTC
I'm still experiencing issues with the network at the office.  Most folks in the Quake server receive sub 10 (all below 20) ms pings; mine is consistently 20+, sometimes spiking up to 200/300+.  We're all on the same LAN via the same WiFi infrastructure.

However, I am not constantly disconnecting/reconnecting like I was with 3.14.5 and -8.ucode, and I don't have long periods of instability as I did previously.

At home, sometimes I have been prompted to re-enter my WiFi passphrase by NetworkManager; sometimes multiple times in a row, which is something I was experiencing prior as well.  At the office, I haven't noticed any such issues.

What can I do to pinpoint my issues?  Anything I can test, or verbose logging I can enable and try to compare?
Comment 7 Emmanuel Grumbach 2014-06-10 18:58:39 UTC
Can you please try this?

diff --git a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
index bc57c27..a32dce9 100644
--- a/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
+++ b/drivers/net/wireless/iwlwifi/mvm/mac-ctxt.c
@@ -669,7 +669,7 @@ static void iwl_mvm_mac_ctxt_cmd_common(struct iwl_mvm *mvm,
 
        if (vif->bss_conf.use_cts_prot) {
                cmd->protection_flags |= cpu_to_le32(MAC_PROT_FLG_TGG_PROTECT);
-               cmd->protection_flags |= cpu_to_le32(MAC_PROT_FLG_SELF_CTS_EN);
+//             cmd->protection_flags |= cpu_to_le32(MAC_PROT_FLG_SELF_CTS_EN);
        }
        IWL_DEBUG_RATE(mvm, "use_cts_prot %d, ht_operation_mode %d\n",
                       vif->bss_conf.use_cts_prot,
Comment 8 Nathan Schulte 2014-06-11 16:42:45 UTC
The 3.14.6 I built looks like this:

        /* Don't use cts to self as the fw doesn't support it currently. */
        if (vif->bss_conf.use_cts_prot) {
                cmd->protection_flags |= cpu_to_le32(MAC_PROT_FLG_TGG_PROTECT);
                if (IWL_UCODE_API(mvm->fw->ucode_ver) >= 8)
                        cmd->protection_flags |=
                                cpu_to_le32(MAC_PROT_FLG_SELF_CTS_EN);
        }
        IWL_DEBUG_RATE(mvm, "use_cts_prot %d, ht_operation_mode %d\n",
                       vif->bss_conf.use_cts_prot,

That looks to mean that it's doing the opposite of what your patch tries to do; I will remove that nested conditional (ucode_ver >= 8) and try.  Please confirm if this is what you're after.
Comment 9 Emmanuel Grumbach 2014-06-11 19:20:06 UTC
Right - this code means: if the FW is recent enough, allow CTS-to-self.
I am afraid that this doesn't work even with recent FW - hence my patch.
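
For readers following the exchange in comments 7 to 9, the three code variants can be compared with a small Python model. This is only a sketch of the conditional logic, not driver code: the function name and variant labels are hypothetical, and the bit positions are illustrative placeholders rather than values taken from the iwlwifi headers.

```python
# Toy model of the protection-flag logic from comments 7-9; not driver code.
# Bit positions are illustrative placeholders, NOT the real iwlwifi values.
MAC_PROT_FLG_TGG_PROTECT = 1 << 0
MAC_PROT_FLG_SELF_CTS_EN = 1 << 1


def protection_flags(use_cts_prot, ucode_api, variant):
    """Return the flags set by one of the three code variants.

    variant "old":     the "-" line of the comment 7 diff (always set self-CTS)
    variant "3.14.6":  the tree quoted in comment 8 (self-CTS only if API >= 8)
    variant "patched": comment 7's proposal (never set self-CTS)
    """
    flags = 0
    if not use_cts_prot:
        return flags
    flags |= MAC_PROT_FLG_TGG_PROTECT
    if variant == "old":
        flags |= MAC_PROT_FLG_SELF_CTS_EN
    elif variant == "3.14.6" and ucode_api >= 8:
        flags |= MAC_PROT_FLG_SELF_CTS_EN
    elif variant not in ("3.14.6", "patched"):
        raise ValueError(f"unknown variant: {variant}")
    return flags
```

With the -9.ucode firmware (API 9), the "3.14.6" variant still enables CTS-to-self whenever CTS protection is in use, so applying the intent of the comment 7 patch to that tree means dropping the self-CTS flag entirely, as comment 8 proposes.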
Comment 10 Nathan Schulte 2014-06-12 00:39:38 UTC
It's difficult to tell, but that patch _may_ have improved stability a bit.

I tested quickly w/ Quake (and some AI bots, not real players w/ extra WiFi traffic), and I noticed similar feedback via its networking graph.  I wasn't receiving very large/long ping spikes (those seem to have gone away w/ the ucode update), but I didn't test for very long.

That brings me to my next point: is there a better approach to testing this?  Something I can recreate concretely?  https://en.wikipedia.org/wiki/Lagometer doesn't really help pinpoint the issue much.  I had a co-worker join w/ a MacBook Air, and his graph showed "lag" just one time upon his first death, and it looked completely different from mine; after that there were absolutely no issues for him.

The graphs I get in Quake (when there's an issue) are always saw-tooth w/ the same period: the server snapshot frequency seems to slow to about a 1/4 to 1/8 hertz rate.  That lasts for a short time, after which the issue subsides.
Comment 11 Emmanuel Grumbach 2014-06-12 06:59:42 UTC
It is really hard to say how to test this kind of thing.
We are testing latency a bit, but mostly in a controlled environment (which you can't do).
This kind of phenomenon can be caused by lots of things - hard to say it is because of the driver. Do you have anything in the kernel log during that time?
Comment 12 Nathan Schulte 2014-06-12 14:42:26 UTC
I see.

Well, the fact that the latency graph in Quake is a very consistent sawtooth means that the latency is building up at a constant rate, and being "remedied" at a constant rate.

I discovered I can view wireless related events with iw (iw event -t -f); I will start capturing the output of this, the kernel log (can I turn on verbose logging for the driver somehow?) and an `mtr --curses` output to a physical device on the wired network.  As well, I will try to capture the reverse `mtr --curses` trace, as well as a trace to another wireless device (a device running Android), and perhaps something on the WAN (kernel.org?).

Anything else?  Wireshark/tcpdump?  What filters should I use for those?
Comment 13 Emmanuel Grumbach 2014-06-12 16:02:21 UTC
Can you try to disable power save?

sudo iw wlan0 set power_save off

you can also disable bg_scan in the supplicant.

You can also try the low_latency debugfs hook we added - not sure you have it in your kernel though.
Comment 14 Nathan Schulte 2014-06-13 05:58:34 UTC
Created attachment 139541 [details]
latency w/ control, both directions (3.14.6, no MAC_PROT_FLG_SELF_CTS_EN fix)

shows latency over time (from the output of `ping -D -i 0.2`), graphed with `iw event -t -f` displayed on top (scan started, auth, deauth)

the first plot shows pings from the laptop (machine w/ issues) to server (on the wired network), as well as to a tablet (on the wireless network)

the second plot shows pings from the server to the laptop, as well as to the tablet

This doesn't include the removal of the MAC_PROT_FLG_SELF_CTS_EN protection flag for -9 microcode; I noticed that _after_ I had captured and processed all of the data
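
For anyone wanting to reproduce this kind of plot, a minimal Python sketch for turning `ping -D` output into a latency time series might look like the following. It is not the script used for the attachment (those scripts are linked from comment 16); the function names are hypothetical, and the regular expression assumes the reply-line format printed by iputils ping.

```python
import re
import statistics

# Matches reply lines printed by iputils `ping -D`, for example:
# [1402638000.123456] 64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=1.23 ms
REPLY = re.compile(r"^\[(\d+\.\d+)\].*icmp_seq=(\d+).*time=([\d.]+) ms")


def parse_ping_log(lines):
    """Return a list of (unix_time, icmp_seq, rtt_ms) for each reply line."""
    samples = []
    for line in lines:
        match = REPLY.match(line)
        if match:
            samples.append((float(match.group(1)),
                            int(match.group(2)),
                            float(match.group(3))))
    return samples


def summarize(samples):
    """Return (count, median_rtt_ms, max_rtt_ms) for parsed samples."""
    rtts = [rtt for _, _, rtt in samples]
    return len(rtts), statistics.median(rtts), max(rtts)
```

The Unix timestamps from `-D` make it possible to line the samples up against the timestamps that `iw event -t` prints for scan and auth events.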
Comment 15 Nathan Schulte 2014-06-13 06:12:31 UTC
Created attachment 139561 [details]
latency w/ control, both directions (3.14.6, no MAC_PROT_FLG_SELF_CTS_EN fix)
Comment 16 Nathan Schulte 2014-06-16 14:58:49 UTC
A side-by-side view of the latency data for both 3.14.6 w/out the protection flag fix, and w/ the protection flag fix, can be seen here:

(top is w/out, bottom is w/; see previous comment for more details)

http://loki.ist.unomaha.edu/~nmschulte/wifi.log/3.14.6-iwlwifi.png

In that same directory you will find the scripts and logs that I used to generate the graphs.

Is this helpful?

I haven't disabled background scanning for the w/out protection flag fix, but you'll notice I never switched APs.

I'm still having latency issues w/ Quake; I noted particularly troublesome times while playing, and I'll produce an annotated latency graph of those times.

Are these graphs helpful?  I think having the data available is better than a net_graph capture from a game, but perhaps not.
Comment 17 Emmanuel Grumbach 2014-06-17 05:27:32 UTC
well... Frankly, the graphs aren't very useful.
Did you try with the patch in comment 7?
Comment 18 Emmanuel Grumbach 2014-06-17 05:29:37 UTC
oh sorry - I just re-read your comment and realize I had misunderstood it.
kinda late here
Comment 19 Emmanuel Grumbach 2014-06-17 05:32:12 UTC
have you tried to disable powersave?
powersave can have a big impact on latency - especially in Rx when no Tx is happening.

I couldn't conclude much from your graphs - did the patch in comment 7 help?
Comment 20 Nathan Schulte 2014-06-17 16:21:35 UTC
It doesn't seem the MAC_PROT_FLG_SELF_CTS_EN patch helped.  On the second graph (the one w/ the patch) I notice that I never deauth/auth (switch APs); I'm guessing this is just coincidence.

What would explain the "directional" latency discrepancy?  The laptop has no issues pinging the server, but at the same time the server has issues pinging the laptop.  I find that strange, but I've never looked at these graphs before.

I'm trying at the moment with `iw dev wlan0 set power_save off`; I'll report back with results later.  I still need to annotate a zoomed in view with the data from quake; the detailed ping curve may shed some light on what is going on as well?
Comment 21 Emmanuel Grumbach 2014-06-17 16:35:54 UTC
AP-to-client traffic takes more time because of how the WiFi spec handles powersave.
Disabling powersave can help very much in Rx latency.
Comment 22 dflogeras2 2014-06-25 02:44:51 UTC
This is all on a Sony Vaio Pro 13, with 7260ac.  Running 3.15.1 kernel and the latest -9.ucode.

I have noted similar behaviour.  At one place I use my laptop the connection is very dodgy: it works fine for a bit, then works intermittently or sometimes stops altogether, requiring a module reload.

Another place I use my laptop is rock solid for days.  Both are very similar Linksys routers; working is an e1200, non-working is e1000.

So I manually "diffed" the router settings, and noted that the working setup had channel width selection set to "20 MHz only", while the non-working one was set to "Auto (20MHz or 40MHz)"

I changed the non-working one to "20 MHz only" and now it has been stable (+ or - regular wifi variations), and I've hammered it for an hour (it would have failed by now from previous tests I think).

The router at my parents' house is dodgy as well, but unfortunately it is owned by the ISP and they only give the customer a crippled interface which doesn't include the channel bandwidth setting.

Maybe this sheds some light?
Comment 23 Emmanuel Grumbach 2014-06-25 04:56:42 UTC
(In reply to dflogeras2 from comment #22)
> 
> So I manually "diffed" the router settings, and noted that the working setup
> had channel width selection set to "20 MHz only", while the non-working one
> was set to "Auto (20MHz or 40Mhz)"
> 
> I changed the non-working one to "20 MHz only" and now it has been stable (+
> or - regular wifi variations), and I've hammered it for an hour (it would
> have failed by now from previous tests I think).
> 

This is interesting... Can you please run (as root):

trace-cmd record -e iwlwifi -e iwlwifi_dbg -e mac80211 -e cfg80211

during the association in both cases (working and non-working)?
Just use the rfkill switch to trigger re-association.

> The router at my parents house is dodgy as well, but unfortunately it is
> owned by the ISP and they only give the customer a crippled interface which
> doesn't include the channel bandwidth setting.
> 
> Maybe this sheds some light?
Comment 24 dflogeras2 2014-06-25 11:56:47 UTC
Created attachment 140921 [details]
trace with e1000 router set to 20MHz only BW (working)
Comment 25 dflogeras2 2014-06-25 11:57:27 UTC
Created attachment 140931 [details]
trace with e1000 router set to Auto (20MHz or 40MHz)
Comment 26 dflogeras2 2014-06-25 12:00:30 UTC
Emmanuel,

This is my first time using trace-cmd; I run Gentoo with a pretty lean kernel normally, so I had to enable the tracer support etc in my kernel. Basically:

- Enabled tracer support & related in Kernel Hacking section
- Enabled anything to do with debug in the cfg80211 and mac80211 areas of Networking->Wireless
- Enabled the debug/debugfs stuff in the IWLWIFI driver in Drivers->Networking->Wireless

If anything looks amiss, let me know what kernel features to enable and I can re-run the traces.
Comment 27 Emmanuel Grumbach 2014-06-27 07:46:10 UTC
I couldn't find time for this yet - Will update next week I hope.
Comment 28 Emmanuel Grumbach 2014-06-27 07:46:17 UTC
I couldn't find time for this yet - Will update next week I hope.
Comment 29 Nathan Schulte 2014-06-27 15:26:56 UTC
(In reply to Emmanuel Grumbach from comment #19)
> have you tried to disable powersave?

I have, and it didn't seem to help any.

> I couldn't conclude much from your graphs - did the patch in comment 7 help?

The patch does not seem to help.

--

I tried looking into the lead Dave gave, but I'm unable to tell if that's the case here.  For one, I don't know how to find that information as a client of the AP.  Two, info from `iw dev wlan0 info` shows 20 MHz bandwidths both at home and at the office.  It seems the driver prefers the 2.4 GHz band for my setup at the office (has the best SNR/signal quality, even over 5 GHz).

I plan to poke more at this with recent kernels over the weekend.
Comment 30 snoozerman 2014-06-28 21:53:07 UTC
Hi,
I'm struggling with seemingly the same issue. My system is Debian. Some system info below

*******
$ uname -v
#1 SMP Debian 3.14.7-1~bpo70+1 (2014-06-21)

$ dmesg | egrep "Dual|firmware"
[    1.370976] iwlwifi 0000:02:00.0: firmware: direct-loading firmware iwlwifi-7260-9.ucode
[    1.371086] iwlwifi 0000:02:00.0: loaded firmware version 23.214.9.0 op_mode iwlmvm
[    1.381741] iwlwifi 0000:02:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[ 2294.211430] (NULL device *): firmware: direct-loading firmware iwlwifi-7260-9.ucode
*******

My router is a TP-Link Archer C7 with the latest firmware. I am testing using my 30 Mbit ISP fiber connection. These are my observations and test setups:

#1 Concurrent 2.4 GHz and 5 GHz enabled and "iw wlan set power_save on":
I can connect to it. I normally get results < 1 Mbit/s tx and < 0.5 Mbit/s rx.

#2 Concurrent 2.4 GHz and 5 GHz enabled and "iw wlan set power_save off":
The result seems to get somewhat better, like 1-3 Mbit/s rx and 1-2 Mbit/s tx.

#3 Router 2.4 GHz band only (5 GHz disabled):
Significantly better speeds! Usually I get > 8 Mbit/s rx and > 10 Mbit/s tx and sometimes also much more (> 20 rx and > 25 tx). I cannot tell for sure if there is a difference between power save on or off.

I don't know if this adds any value. Let me know if I can help any further. Unfortunately my Linux experience is limited so please be clear about what commands I should attempt.
Comment 31 snoozerman 2014-06-28 22:00:12 UTC
I forgot to mention that with the router set to use the 5 GHz band only, I get virtually no speed. I can connect, but speed is ~0.

I should also add that I don't have perfect link quality. Network manager usually shows ~60% during mentioned tests.
Comment 32 Emmanuel Grumbach 2014-06-30 13:44:46 UTC
(In reply to Emmanuel Grumbach from comment #28)
> I couldn't find time for this yet - Will update next week I hope.

I looked at the traces - and they look similar.
IOW - it smells like a firmware bug...
I'll discuss this with firmware people.
Comment 33 Nathan Schulte 2014-06-30 14:13:05 UTC
Bad news from my end:

I tried using a USB WiFi card, and I was receiving similar results here at the office.  The experience in Quake was quite similar.  Details on the card below.

At home (WRT54GL, 2.4 GHz), with the Intel DB W AC 7260, I have serious connection issues now: constantly asking to re-auth (prompted for password), and almost no throughput.  I don't even bother anymore and always plug in some Ethernet.

(a TP-Link TL-WN822N "300 Mbps High Gain Wireless N USB adapter"; ID 0cf3:7015 Atheros Communications, Inc. TP-Link TL-WN821N v3 802.11n [Atheros AR7010+AR9287])

[ 1942.989493] usb 3-9: new high-speed USB device number 11 using xhci_hcd
[ 1943.173948] usb 3-9: New USB device found, idVendor=0cf3, idProduct=7015
[ 1943.173959] usb 3-9: New USB device strings: Mfr=16, Product=32, SerialNumber=48
[ 1943.173963] usb 3-9: Product: USB WLAN
[ 1943.173967] usb 3-9: Manufacturer: ATHEROS
[ 1943.173970] usb 3-9: SerialNumber: 12345
[ 1943.174854] usb 3-9: ath9k_htc: Firmware htc_7010.fw requested
[ 1943.175206] usb 3-9: firmware: direct-loading firmware htc_7010.fw
[ 1943.277489] usb 3-9: ath9k_htc: Transferred FW: htc_7010.fw, size: 72992
[ 1943.341054] ath9k_htc 3-9:1.0: ath9k_htc: HTC initialized with 45 credits
[ 1943.564075] ath9k_htc 3-9:1.0: ath9k_htc: FW Version: 1.3
Comment 34 Emmanuel Grumbach 2014-06-30 15:11:13 UTC
at that stage I am tempted to close the bug.

The problem occurs with another card. Your environment seems really problematic.
Comment 35 Nathan Schulte 2014-06-30 15:23:07 UTC
Emmanuel: my sentiments as well.

I'm still confident there is an issue with this card, along the lines of what's been discussed.  I would like to set up a controlled environment so that I can test this further, but I can't comment as to when that will happen.  I have a Linksys E4200 I plan to test with.

Feel free to close the report (I can open it later, or make a new one, right?), or leave it open with the same understanding or to track the others' issues.
Comment 36 dflogeras2 2014-06-30 17:00:04 UTC
Emmanuel, please let me know the outcome of your firmware discussion, and if we move to a new bug, so I can attach myself to that. I'm happy to test.
Comment 37 Emmanuel Grumbach 2014-06-30 17:05:10 UTC
The FW team is completely overloaded - so that can take time.
Comment 38 Emmanuel Grumbach 2014-07-06 16:56:42 UTC
I am renaming the bug to make sure that we don't have too many people reporting unrelated stuff in this bug. Please add data only if your setup matches *exactly* what is described in the title. Thanks.
Comment 39 Emmanuel Grumbach 2014-07-10 18:06:48 UTC
I talked to the FW team - and they checked that the FW is actually behaving as expected. So my lead goes away.
Comment 40 dflogeras2 2014-07-11 02:24:53 UTC
Hmm, it happens pretty reliably here, can you reproduce it by using a 40MHz AP?
Comment 41 Emmanuel Grumbach 2014-07-27 18:19:48 UTC
Can you please test the firmware from here:
https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/plain/iwlwifi-7260-9.ucode?h=Core6&id=59a2c0aa8b9e26533d01d153a9be2c5f61cc0d62

I have no reason to think that it'll help, but at least it'll let us know on what FW version this bug (still) happens.

We test 40MHz all the time - and also 80MHz. Unfortunately, we can't reproduce the issue you are facing.
Comment 42 dflogeras2 2014-07-28 01:11:46 UTC
(In reply to Emmanuel Grumbach from comment #41)
> Can you please test the firmware from here:
> https://git.kernel.org/cgit/linux/kernel/git/egrumbach/linux-firmware.git/
> plain/iwlwifi-7260-9.
> ucode?h=Core6&id=59a2c0aa8b9e26533d01d153a9be2c5f61cc0d62
> 
> I have no reasons to think that it'll help, but at least it'll let us know
> on what FW version this bug (still) happens.
> 

The above firmware seems to fix this issue for me.  Now before counting my chickens, I'll test on exactly one of the routers that was misbehaving in a couple days (I'm currently in another city, but enabled 40MHz on this AP and ran fine for about an hour).  After I test with one that I _know_ was failing before I'll report back.  I'll also extensively use this setup at 40MHz all day tomorrow.
Comment 43 Emmanuel Grumbach 2014-07-28 04:57:55 UTC
Created attachment 144391 [details]
Latest Core5 FW

Good - when you are able to test again with the same AP that you know was failing, can you please also test this one?

Thanks again for your time.
Comment 44 Emmanuel Grumbach 2014-08-21 20:14:26 UTC
do we have news here?
Comment 45 dflogeras2 2014-08-21 23:12:51 UTC
Emmanuel, I apologize for dragging my heels on this.  I recently moved and did not have much time for testing.  I am back up and running now.

The short answer is that I think I was wrong that the newer firmware fixed the issue.  The issue seems gone; but even with the _original_ firmware.  I now believe that a change in the kernel code may have fixed it along the way, and that coincided with my successful test from Comment 42 with a 40MHz band AP.

I failed to mention that I was then running a newer (3.15.6) kernel.  For posterity, here are my kernel versions over the summer, which may shed some light given the time of my comments in this bug:

Fri Jun 13 19:38:18 2014 >>> sys-kernel/gentoo-sources-3.15.0-r1
Thu Jun 19 08:35:48 2014 >>> sys-kernel/gentoo-sources-3.15.1
Fri Jun 27 12:24:47 2014 >>> sys-kernel/gentoo-sources-3.15.2
Tue Jul  1 17:08:00 2014 >>> sys-kernel/gentoo-sources-3.15.3
Sat Jul 12 07:19:17 2014 >>> sys-kernel/gentoo-sources-3.15.5
Fri Jul 18 12:47:26 2014 >>> sys-kernel/gentoo-sources-3.15.6
Sat Aug  2 22:24:14 2014 >>> sys-kernel/gentoo-sources-3.15.8
Fri Aug  8 17:03:45 2014 >>> sys-kernel/gentoo-sources-3.15.9
Fri Aug 15 17:20:26 2014 >>> sys-kernel/gentoo-sources-3.16.1
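
To line the list above up against the comment dates in this bug, a small Python sketch like the following can help. The function name is hypothetical, and two assumptions apply: the log timestamps are in the machine's local time while bug comments are stamped in UTC, and "installed" does not necessarily mean "booted", so results near a boundary should be treated with care.

```python
from datetime import datetime

# Install log lines copied from the list above (assumed chronological).
INSTALL_LOG = """\
Fri Jun 13 19:38:18 2014 >>> sys-kernel/gentoo-sources-3.15.0-r1
Thu Jun 19 08:35:48 2014 >>> sys-kernel/gentoo-sources-3.15.1
Fri Jun 27 12:24:47 2014 >>> sys-kernel/gentoo-sources-3.15.2
Tue Jul  1 17:08:00 2014 >>> sys-kernel/gentoo-sources-3.15.3
Sat Jul 12 07:19:17 2014 >>> sys-kernel/gentoo-sources-3.15.5
Fri Jul 18 12:47:26 2014 >>> sys-kernel/gentoo-sources-3.15.6
Sat Aug  2 22:24:14 2014 >>> sys-kernel/gentoo-sources-3.15.8
Fri Aug  8 17:03:45 2014 >>> sys-kernel/gentoo-sources-3.15.9
Fri Aug 15 17:20:26 2014 >>> sys-kernel/gentoo-sources-3.16.1
"""


def kernel_installed_by(when, log=INSTALL_LOG):
    """Return the newest gentoo-sources version installed at or before `when`.

    Day and month names are parsed with the current locale, so this
    assumes an English (or C) locale.
    """
    newest = None
    for line in log.splitlines():
        stamp, _, package = line.partition(" >>> ")
        installed = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        if installed <= when:
            newest = package.rsplit("gentoo-sources-", 1)[1]
    return newest
```

For example, `kernel_installed_by(datetime(2014, 6, 25, 2, 44))` (the time of comment 22) gives 3.15.1, and the time of comment 42 (2014-07-28 01:11) gives 3.15.6, which agrees with the versions stated in those comments and in this one.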

-------------------------------------------------------------

Regardless, here are some test results from tonight (running 3.16.1 kernel).

Let's call the first firmware (comment 41) test1, and the second one (comment 43) test2, with the released 23.214.9.0 version called original.

My testing consisted of removing the iwlmvm module, copying the given firmware into /lib64/firmware/iwlwifi-7260-9.ucode and then inserting iwlmvm again.  I then mounted a NFS share and copied a large (~1GiB) file.

I attempted to get network graphs via KDE's network monitor applet, but the data was not useful...  I will upload one image for a qualitative picture, but comparing them against each other (or even multiple runs of the same firmware) did not yield anything useful; there are too many unknowns.

The short story is that all three firmwares (original, test1, test2) copied fine without once locking up (about an hour of stress testing).

One thing I did notice (which may be a different issue): while copying, I had another terminal running 'ping google.ca'.  The average went from about 40ms before the copy to 300-400ms during the copy.  Occasionally, I would see the ping command stop responding for a short while, then a burst (with some ping times in the 4-5 second range).  The original firmware _may_ have been worse, but it is most likely completely random.  TCP seems to be handling this, since rsync would just stall, then keep truckin'.
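
A stall like the one described here can be quantified from a saved ping log rather than judged by eye. The following Python sketch is one way to do it, under the assumption that the log contains standard iputils reply lines; the function name and the 1000 ms default threshold are hypothetical choices, and sequence-number wraparound is not handled.

```python
import re

# Matches reply lines from iputils ping, for example:
# 64 bytes from 192.0.2.1: icmp_seq=7 ttl=55 time=41.2 ms
REPLY = re.compile(r"icmp_seq=(\d+).*time=([\d.]+) ms")


def find_stalls(lines, rtt_threshold_ms=1000.0):
    """Return (slow, missing).

    slow:    list of (icmp_seq, rtt_ms) replies at or above the threshold
    missing: sorted list of sequence numbers between the first and last
             reply that never got an answer
    """
    seen = {}
    for line in lines:
        match = REPLY.search(line)
        if match:
            seen[int(match.group(1))] = float(match.group(2))
    slow = [(seq, rtt) for seq, rtt in sorted(seen.items())
            if rtt >= rtt_threshold_ms]
    if not seen:
        return slow, []
    missing = [seq for seq in range(min(seen), max(seen) + 1)
               if seq not in seen]
    return slow, missing
```

Running this over logs captured with each firmware would give counts of slow and lost replies that can be compared directly, instead of relying on the impression that one firmware "may have been worse".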

I would like to retest on my parents' router, but I'm not sure when I'll travel there next.  That router seemed to be the worst (although the one I used tonight would also cause a lockup before).

Please let me know if I can clarify anything here, or test more.



Finally, once while loading the test1 firmware, I noticed a couple second hang after the modprobe, and noted the following in dmesg.  It looks like it recovered fine, but thought you might want to know.

[55074.439134] Intel(R) Wireless WiFi driver for Linux, in-tree:
[55074.439136] Copyright(c) 2003- 2014 Intel Corporation
[55074.439365] iwlwifi 0000:01:00.0: irq 60 for MSI/MSI-X
[55074.439753] iwlwifi 0000:01:00.0: loaded firmware version 25.223.9.0 op_mode iwlmvm
[55074.440125] iwlwifi 0000:01:00.0: Detected Intel(R) Dual Band Wireless N 7260, REV=0x144
[55074.440177] iwlwifi 0000:01:00.0: L1 Disabled; Enabling L0S
[55074.440382] iwlwifi 0000:01:00.0: RF_KILL bit toggled to disable radio.
[55074.459331] ------------[ cut here ]------------
[55074.459336] WARNING: CPU: 1 PID: 22521 at /usr/src/linux-3.16.1-gentoo/drivers/net/wireless/iwlwifi/pcie/trans.c:1184 iwl_trans_pcie_grab_nic_access+0xec/0x100 [iwlwifi]()
[55074.459337] Timeout waiting for hardware access (CSR_GP_CNTRL 0x000403d8)
[55074.459355] Modules linked in: iwlmvm(+) iwlwifi auth_rpcgss oid_registry nfsv4 nfs lockd sunrpc tun rfcomm bnep xfs libcrc32c snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_codec_realtek coretemp crc32_pclmul snd_hda_codec_generic crc32c_intel ecb uvcvideo videobuf2_vmalloc videobuf2_memops snd_hda_intel videobuf2_core v4l2_common snd_hda_controller i2c_i801 videodev btusb snd_hda_codec hid_multitouch bluetooth snd_hwdep snd_pcm snd_timer mei_me snd mei soundcore sony_laptop [last unloaded: iwlwifi]
[55074.459357] CPU: 1 PID: 22521 Comm: modprobe Not tainted 3.16.1-gentoo #1
[55074.459358] Hardware name: Sony Corporation SVP13215CDB/VAIO, BIOS R1044V7 03/24/2014
[55074.459360]  0000000000000009 ffff8801abd23af8 ffffffff815a5153 ffff8801abd23b40
[55074.459362]  ffff8801abd23b30 ffffffff8105f0e3 ffff880206aa8000 ffff880206aabb30
[55074.459363]  ffff8801abd23bd0 0000000000000000 00000000000003e8 ffff8801abd23b90
[55074.459364] Call Trace:
[55074.459369]  [<ffffffff815a5153>] dump_stack+0x45/0x56
[55074.459372]  [<ffffffff8105f0e3>] warn_slowpath_common+0x73/0x90
[55074.459373]  [<ffffffff8105f147>] warn_slowpath_fmt+0x47/0x50
[55074.459376]  [<ffffffffa004429c>] iwl_trans_pcie_grab_nic_access+0xec/0x100 [iwlwifi]
[55074.459378]  [<ffffffffa003a0a0>] iwl_read_direct32+0x20/0x60 [iwlwifi]
[55074.459381]  [<ffffffffa003a11d>] iwl_poll_direct_bit+0x3d/0x70 [iwlwifi]
[55074.459383]  [<ffffffffa00413a8>] iwl_pcie_tx_stop+0x78/0x120 [iwlwifi]
[55074.459385]  [<ffffffffa0044fa8>] iwl_trans_pcie_stop_device+0x1b8/0x1f0 [iwlwifi]
[55074.459388]  [<ffffffffa0044de1>] iwl_trans_pcie_rf_kill+0x31/0x40 [iwlwifi]
[55074.459390]  [<ffffffffa0045433>] iwl_trans_pcie_start_hw+0x93/0xe0 [iwlwifi]
[55074.459396]  [<ffffffffa0263781>] iwl_op_mode_mvm_start+0x441/0x590 [iwlmvm]
[55074.459398]  [<ffffffffa003a82b>] iwl_opmode_register+0xcb/0xf0 [iwlwifi]
[55074.459401]  [<ffffffffa0087000>] ? 0xffffffffa0086fff
[55074.459405]  [<ffffffffa0087037>] iwl_mvm_init+0x37/0x57 [iwlmvm]
[55074.459407]  [<ffffffff810002c4>] do_one_initcall+0x84/0x1c0
[55074.459410]  [<ffffffff8111ef3a>] ? __vunmap+0xaa/0xf0
[55074.459413]  [<ffffffff810c024e>] load_module+0x1a0e/0x1fc0
[55074.459414]  [<ffffffff810bdc00>] ? ref_module+0x130/0x130
[55074.459416]  [<ffffffff810c0936>] SyS_finit_module+0x76/0x80
[55074.459419]  [<ffffffff815ad0d2>] system_call_fastpath+0x16/0x1b
[55074.459420] ---[ end trace 930d914f6f7e67be ]---
[55076.418187] iwlwifi 0000:01:00.0: Failing on timeout while stopping DMA channel 5 [0x5a5a5a5a]
[55078.367692] iwlwifi 0000:01:00.0: Failing on timeout while stopping DMA channel 7 [0x5a5a5a5a]
[55080.281979] iwlwifi 0000:01:00.0: L1 Disabled; Enabling L0S
[55080.292867] ieee80211 phy7: Selected rate control algorithm 'iwl-mvm-rs'
[55080.302885] systemd-udevd[22530]: renamed network interface wlan0 to wlp1s0
Comment 46 dflogeras2 2014-08-21 23:16:37 UTC
Created attachment 147691 [details]
Network traffic graph during large file copy

Here is a traffic graph.  I believe the long (4-5 second) ping times correspond with some of the "valleys" in the graph.
Comment 47 dflogeras2 2014-08-21 23:23:42 UTC
One thing I know is slightly different between this AP and my parents' (which I do not have access to immediately) is that the Linksys router (mine) seems to leave the CRDA domain unset, which results in lower antenna power settings.  At my parents', the router uses the US CRDA domain, and allows for higher powers.

I have no idea if this could exacerbate the issue, but wanted to note it.
Comment 48 Emmanuel Grumbach 2014-08-22 07:21:49 UTC
ok - I think I'll close the bug then... There isn't much I can do with this.

regarding the 4-5 seconds delay: this might be related to scanning although it should not be *that* long.

The warning you saw is not related to this bug - it is most probably some electrical problem.

Closing this bug for now.
