Bug 93431
Summary: | iwlmvm: 7260: intermittent connection (no data transferred) - MWG100230068 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Stefan Soeffing (soeffing) |
Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | enrico.tagliavini, hashem, ilw, linville, myspecialids-kernel, soeffing, xaviermiller |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
See Also: | https://bugzilla.kernel.org/show_bug.cgi?id=95941 | ||
Kernel Version: | 3.19 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg
Core9 firmware with uSniffer iwl.dump sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg iwl.dump dump dump (with firmware tracing enabled) dump (with firmware tracing enabled) Core9 firmware with uSniffer dmesg output dmesg output NUC iwl 7265 kernel 3.19.3 dmesg output NUC iwl 7265 kernel 4.0 dmesg output - Tx queue problem? |
Description
Stefan Soeffing
2015-02-17 20:19:14 UTC
Created attachment 167401 [details]
sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg
I forgot to mention that I tried the options 11n_disable=1 and wd_disable=1 , but got no improvement. Please try with: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-7260-10.ucode thanks. Thanks Emmanuel, Ucode identified itself as version 23.14.10.0. Unfortunately, I get the same behavior (doesn't matter if power_save is enabled or disabled). Ok - please try this one: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-7260-12.ucode if that doesn't work - I guess I'll have to really look at the traces ;) can't see anything obvious from your trace. Can you get me a sniffer capture? Do you have another WiFi device you could put in monitor mode? I'll try that, but it will take a few days. I haven't played around with sniffers for a while, so I'm not sure about the sniffer settings. Is sniffing in monitor mode more helpful to you than promiscious mode? If I understand correctly, in monitor mode you won't be able to read the transferred data since my network is set up with WPA2... Anything I'd need to consider? Thanks for helping. With firmware version 25.15.12.0: No improvement, unfortunately. ok - we need to debug the firmware. I'll send instruction on how to do that on Sunday. BTW, I just realized that the effect is worse when the wireless network is under load (video streaming from router to a 3rd device) - the connection stalls more often and takes longer to come back. That is interesting... And it clearly points to the firmware. I'll give the instructions shortly. Created attachment 167901 [details] Core9 firmware with uSniffer Please use the firmware attached. Load iwlwifi with fw_monitor=1 as a module parameter. When you feel that the bug is happening, *quickly* do the following: echo 1 > /sys/kernel/debug/iwlwifi/*/iwlmvm/fw_restart This will crash the firmware which will create a devcoredump entry: cat /sys/devices/virtual/devcoredump/devcd1/data > iwl.dump echo 1 > /sys/devices/virtual/devcoredump/devcd1/data iwl.dump should weigh around 4.2MB. Please compress it and attach it to the bug. Please the time to read this note: https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#privacy_aspects correction, if you don't want to have the data attached to the bug, please send by email to me. Note that the data is very obfuscated. Your call. Even with sudo it says bash: /sys/kernel/debug/iwlwifi/*/iwlmvm/fw_restart: Permission denied CONFIG_IWLWIFI_DEBUGFS is probably not enabled, which means I need to recompile the kernel module, is that right? Well, CONFIG_IWLWIFI_DEBUGFS is set, in fact, but CONFIG_IWLWIFI_DEBUG is not. Any hint what I'm missing? No. Sudo doesn't apply on the redirection. You should sudo su - and then run the command. Created attachment 167911 [details]
iwl.dump
Thanks, kind of a beginner's mistake...
Please see the attached file.
ok - I'll send the data to the firmware team. Are you sure that you did echo 1 > /sys/.../../../fw_restart while the bug was happening? Pretty sure. I had ping running, then packets started dropping as usual and I got the 'ping: sendmsg: No buffer space available' message. Immediately after I did the fw_restart, ping recovered. I realized that this was actually with power_save on. Would you like me to get another dump with power_save off? not needed at that stage. The fact that you had the pings recover by restarting the firmware is very interesting. I'll add all this to the internal bug I'll open on the firmware team. Worst case would be that it recovered by itself and I was just a split-second too late in doing the fw_restart. I'll try to reproduce it the next time I see the bug happening. Thanks a lot. Before I involve firmware guys, can you please reproduce the bug including the FW log thing while tracing is running? The flags you used in comment #1 are perfect. I would like to clear out something from the driver level. Thanks. Created attachment 167941 [details]
sudo trace-cmd record -e iwlwifi -e mac80211 -e iwlwifi_msg
Same thing, ping recovered right after fw_restart.
Created attachment 167951 [details]
iwl.dump
Corresponding fw dump
Thanks - the data has been sent to the firmware team Hi! Kernel 3.19.0-gentoo SMP PREEMPT x86_64 Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz GenuineIntel GNU/Linux 04:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73) Subsystem: Intel Corporation Wireless-N 7260 Tested with iwlwifi-7260-10.ucode and iwlwifi-7260-12.ucode: too many disconnections. This is worse with V12. With iwlwifi-7260-9.ucode, it's better unless the driver warning: "Firmware has old API version, expected v10 through v12, got v9." Will watch this bug report :-) Kind regards, Xavier Miller. @Xavier, please follow the procedure from comment 12 so that I can collect data for the firmware team. Don't forget to replace the firmware. Thanks. Emmanuel, How can I enable iwlwifi debug? I've activated it in the kernel: [*] Enable full debugging output in the iwlwifi driver [ ] Experimental uCode support [*] iwlwifi device access tracing But I don't see /sys/kernel/debug/iwlwifi entry (there is a /sys/kernel/debug entry) I had micro-lags watching youtube videos, and I would crash the module as described in the comment 12, but without luck as now. Xavier. Meanwhile, here are dmesg output during some youtube lags: [ 361.721019] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 361.721024] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 361.721114] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 361.721135] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session started in fw [ 361.959954] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 361.959962] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 361.960050] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 361.960064] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session stopped in fw [ 361.960072] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 361.960075] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 361.960163] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 361.960171] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session started in fw [ 361.973692] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 361.973697] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 361.973782] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 361.973806] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session started in fw [ 362.349189] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 362.349196] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 362.349279] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 362.349289] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session stopped in fw [ 362.349295] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 362.349297] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 362.349378] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 362.349385] iwlwifi 0000:04:00.0: U iwl_mvm_sta_rx_agg RX BA Session started in fw [ 562.739045] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 562.739052] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 562.739246] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 572.780630] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command ADD_STA [ 572.780636] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command ADD_STA [ 572.781014] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command ADD_STA [ 621.009543] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Attempting to send sync command SCAN_OFFLOAD_REQUEST_CMD [ 621.009549] iwlwifi 0000:04:00.0: U iwl_pcie_send_hcmd_sync Setting HCMD_ACTIVE for command SCAN_OFFLOAD_REQUEST_CMD [ 621.009716] iwlwifi 0000:04:00.0: I iwl_pcie_hcmd_complete Clearing HCMD_ACTIVE for command SCAN_OFFLOAD_REQUEST_CMD (In reply to Xavier Miller from comment #28) > Emmanuel, > > How can I enable iwlwifi debug? I've activated it in the kernel: > [*] Enable full debugging output in the iwlwifi driver > [ ] Experimental uCode support > [*] iwlwifi device access tracing > > But I don't see /sys/kernel/debug/iwlwifi entry (there is a > /sys/kernel/debug entry) > You need debugfs (IWLWIFI_DEBUGFS) Created attachment 169311 [details]
dump
dump during a disconnect
you forgot fw_monitor=1 And please, make sure that you capture the data by restarting *quickly* when the throughput goes down. Thanks. Created attachment 169361 [details]
dump (with firmware tracing enabled)
oops sorry. Let's hope that dump will be useful for you. Created attachment 169511 [details]
dump (with firmware tracing enabled)
other dump. Scenario : using this computer as distcc server, launching heavy compilations on other computer, then watching the load. When it is down (and seeing a network client disconnected), I restart the firmware and get this dump.
thanks - these look good. Guys, is there any news on that? I switched to a 3.19.2 kernel with firmware 25.16.12.0, which unfortunately gave no improvement at all. To be honest, I'm pretty unhappy with the situation. Thanks for your help, anyway. I opened a ticket on the firmware team. I can't do much more than that unfortunately. I think that with more people moving to kernel 3.19(for which only ucode 10 and 12 work), this will be a big issue. I have struggled with this issue for around one month now. Now running kernel 3.18 and ucode 9 and things are okay. However, by staying on 3.18, I miss out on the updates that my distro offers. This issue has gone away since I turned off WPA2 on my access point. My issue happens with WEP on my access point too. Pretty consistently reproducible. Play a couple of youtube videos and ping latencies drop to seconds instead of ms. (In reply to hashem@riseup.net from comment #40) > This issue has gone away since I turned off WPA2 on my access point. Also, this couldn't be an access point issue because "Kernel 3.18 + ucode-9" works for me while "Kernel 3.19 + ucode-10/12" does not. There is a regression in this combo somewhere. Can you try 3.19 + ucode-9? This combo is supported even if the driver will warn about the firmware being old. Thanks. (In reply to Emmanuel Grumbach from comment #43) > Can you try 3.19 + ucode-9? > This combo is supported even if the driver will warn about the firmware > being old. > > Thanks. Emmanuel, Thanks for engaging on this. 3.19 + ucode-9 works. So I guess this squarely puts all the blame on ucode-10, ucode-12. Also, this testing was done on both WEP and WPA2 with the same results. Thanks. (In reply to sheksis from comment #44) > (In reply to Emmanuel Grumbach from comment #43) > > Can you try 3.19 + ucode-9? > > This combo is supported even if the driver will warn about the firmware > > being old. > > > > Thanks. > Emmanuel, > Thanks for engaging on this. 3.19 + ucode-9 works. So I guess this squarely > puts > all the blame on ucode-10, ucode-12. > > Also, this testing was done on both WEP and WPA2 with the same results. > > Thanks. Okay, declared victory too soon. There might be something in 3.19 + ucode-9 combo too. The youtube videos started getting stuck after 15 minutes of play. I also noticed that the average ping latency for 3.19+ucode-9 was more than 3.18+ucode-10/12. Summary: 1. 3.18 + ucode-9 is better than 3.19 + ucode-9 which is better than 3.19+ucode-10/12. 2. I see less ping latencies going to seconds in 3.18 + ucode-9. And when they do, they recover very fast. 3. In 3.19 + ucode-9 has more ping latencies going to seconds and sometimes they recover fast enough for a video playback. But many videos do end up getting stuck. 4. 3.19 + ucode-10/12 --- *FUBAR* Correction in above post: It should be "...3.19+ucode-9 was more than 3.18+ucode-9". @sheksis: Please open a new bug. CC ilw@linux.intel.com to the bug. I prefer not to mix at this stage to avoid even more confusion. Please attach dmesg output + wpa_supplicant output + tracing output for each one of the scenario you mentioned above. Instructions about tracing: https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#tracing Thanks. Created attachment 173981 [details]
Core9 firmware with uSniffer
Hi all,
I have a request from the firmware team.
I attached a new firmware with more debug data. Can you please repeat the data collection process with this firmware?
It also has a few more bug fixes, so the problem might be gone / harder to reproduce.
Thanks!
Hi!
With kernel 4.0.0 and attachment 173981 [details], I don't see any lag for now.
Watched some HD videos in Youtube + compilations using distcc, no disconnections so far.
It sounds encouraging.
Any news from other users? Thanks Emmanuel and the firmware team.
I've installed 4.0.0 and attachment 173981 [details] on Tuesday, but hadn't much time to test it.
Nevertheless, I haven't had any disconnect so far which is encouraging, indeed. From time to time, ping goes up for a few seconds - but the effect is much less pronounced than before.
I will keep testing for a few days and report back.
The firmware team has released a firmware with the fix: https://git.kernel.org/cgit/linux/kernel/git/iwlwifi/linux-firmware.git/plain/iwlwifi-7260-12.ucode I'll send a pull request to the official linux-firmware.git soon. Same observation as Stefan. Have an Intel Dual Band Wireless AC 7265 (so slightly different HW), REV=0x210 here, from Intel NUC NUC5i5RYH. I'm affected by the same problem. Using the firmware provided by default in Fedora 21 (version 25.15.12.0) the situation is quite severe as in the "no data transfer, but wireless is connected" and the "ping: sendmsg: No buffer space available" are very frequent. Note that when it happens the card usually *does not recover* and a full system reboot is needed. Yes disconnecting and reconnecting fixes the issue, but for a very short time usually, restarting has better chances. There is some lucky day here and there, but usually it is a matter of watching a single youtube video in HD and the issue appears quite easily. Using kernel 3.17.9 instead of 3.19.3 might have a slightly better working ratio, but still breaks. I tried firmware version 25.17.12.0 from iwlwifi/linux-firmware.git . There is a clear improvement. The card manage to recover now, sometimes takes few minutes but it recovers. Pinging it from the AP usually increase of 3 order of magnitude during the "crysis" periods (let's say from 15 ms to > 5000, sometimes 15000). This is still with kernel 3.19.3 as provided by fedora 21. Today I took the kernel SRPM version 4.0.0-1 for fedora 22 and recompiled it for fedora 21, keeping the firmware to version 25.17.12.0. I can see no improvement compared to kernel 3.19.3. Maybe it might be even slightly worst, I got wireless disconnecting with 4.0, but I'm not sure if I did the test with 3.19.3 with power management enabled or not, with 4.0 PM was enabled when the disconnect happened. I will do more test with 3.19.3. Compared to another computer with an older card and looking into the AP interface I can see something interesting. While the signal quality for both clients is very similar (around 50%, -56 db) the TX and RX rates are very different: 1M (very unstable with some, short, peak maybe at 10M, sometimes more) for the NUC, 65M - 130M (and quite stable) for the other computer. However if I run the wavemon utility (or iwconfig) on the NUC it reports a bitrate of 78 Mbit, so it seems the AP and the NUC are not agreeing. Note this is with 11n 2.4 GHz only AP. I tried also to force 11g only mode on the AP and it made no difference. I'm quite happy it doesn't get stuck now, but please keep working on this, the current state is not really what I would define usable yet. Using VoIP, video chat, VOD, streams can be a pain if the card has one void moment. Even SSH gets stuck and can be painful. Ok 3.19.3 is the same as 4.0, depending on the day you get disconnects or not. Yesterday I was lucky, today I get disconnected. I confirm I have less disconnections, but there still happen some. There are less frequent and shorter, but they still exist. Kernel 4.0.0 Thanks to all of you for your time testing. I'll bring all this to the firmware team and I will discuss the next steps with them. In the meantime, Enrico, please share your dmesg output. I'd like to check if the Tx queues get stuck. Thanks. Created attachment 174371 [details]
dmesg output
Linux redpoint 4.0.0-gentoo #1 SMP PREEMPT Mon Apr 13 20:19:30 CEST 2015 x86_64 Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz GenuineIntel GNU/Linux
with proposed GIT firmware
Thanks Xavier. Clearly, you are getting Tx queue hang. There is another bug on that: https://bugzilla.kernel.org/show_bug.cgi?id=95941 I'll wait a few more days to get more reports (I'd like to see the dmesg output from Enrico), but I will probably close this bug then and we will concentrate our effort on bug 95941. Thanks Emmanuel, I will register on bug 95941 Created attachment 174471 [details]
dmesg output NUC iwl 7265 kernel 3.19.3
This is dmesg out from two days ago with kernel 3.19.3. Today is one of the good days and it is going to be harder to reproduce the issue (as I mentioned already it is like a Russian roulette, something it works like a charm sometimes it doesn't. with the newest firmware the chances of having a bad one are fewer). Also the affected system is 500 Kms away from me, it's my parents pc, not mine. I have an hard time debugging the issue since I'm accessing via the unstable wireless and my parents struggles starting firefox cannot ask their help.
Kernel 4.0 dmesg has also a trace that can be related. attaching it in a minute.
Created attachment 174481 [details]
dmesg output NUC iwl 7265 kernel 4.0
this is the one with the trace, near the end after 45 minutes of use. However you can see there are many disconnection in the mean time. Those are not due to bad signal or manual disconnection.
Also full disconnections are way rarer than bad performance, but that produce no output in dmesg.
Ok - this is something completely different. Please loook at bug 95901 @Enrico, please open a new bug and attach the output of your supplicant with -dd to get more debug. Thanks. Anybody else about this bug? The stuck queue issue is monitored in another bug, so I am considering closing this one. Emmanuel, I'm pretty happy with the new combination so far. Connection is much more stable. Sometimes the ping rate goes up to ~1s, but only for a short time (few seconds). Attaching a dmesg output - is that the Tx queue problem as well? Thanks a lot! Created attachment 174491 [details]
dmesg output - Tx queue problem?
nope - this is not the Tx queue problem. The tx queue problem will happen only when we can't get any packet out for 10 seconds. You are far from this. Ok - the original submitter is happy. A lot has gone on on this one. Closing it now. As a reminder, Enrico needs to open a new bug, and anyone with the tx queue problem can look at bug 95941. FYI - the tx queue problem gives you a very detailed information in the dmesg output - you can't really miss it. Hi Emmanuel, I opened bug #96931 and while doing that potentially made a very interesting discovery. Thank you for looking into this. This bug requires further investigation. Look at this! 64 bytes from 192.168.1.1: icmp_seq=57 ttl=254 time=11631 ms 64 bytes from 192.168.1.1: icmp_seq=58 ttl=254 time=10633 ms 64 bytes from 192.168.1.1: icmp_seq=59 ttl=254 time=13219 ms 64 bytes from 192.168.1.1: icmp_seq=60 ttl=254 time=12936 ms 64 bytes from 192.168.1.1: icmp_seq=61 ttl=254 time=12628 ms 64 bytes from 192.168.1.1: icmp_seq=62 ttl=254 time=14109 ms 64 bytes from 192.168.1.1: icmp_seq=63 ttl=254 time=13543 ms 64 bytes from 192.168.1.1: icmp_seq=64 ttl=254 time=15495 ms 64 bytes from 192.168.1.1: icmp_seq=65 ttl=254 time=16075 ms 64 bytes from 192.168.1.1: icmp_seq=67 ttl=254 time=16165 ms 64 bytes from 192.168.1.1: icmp_seq=68 ttl=254 time=16663 ms 64 bytes from 192.168.1.1: icmp_seq=69 ttl=254 time=17606 ms I'll file a new bug and upload the traces, dmesg and wpa_supplicant output there with the firmware that Emmanuel provided. One question, I do not know how to collect the wpa_supplicant output. Are there steps anywhere to guide on this? You might want to have a look at bug 97291 as well... @Stefan Seems very similar to my issue. I'll monitor bug 97291. |