Bug 112931

Summary: iwlwifi: 7260: can't use 11n when on AC (TFD queue hang) - MWG100256100
Product: Drivers Reporter: Lev Melnikovsky (melnikovsky)
Component: network-wireless Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED WILL_NOT_FIX    
Severity: normal CC: linuxwifi, luca, melnikovsky
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.4.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
Core14 FW with debug probes
firmware dump [encrypted]
Core14 FW with debug probes
firmware dump [encrypted]
Core14 FW with LPRX disabled
dmesg with 17.320364.0

Description Lev Melnikovsky 2016-02-23 10:13:16 UTC
Created attachment 204391 [details]
dmesg

I am using Lenovo ideapad U330p, Intel Wireless-N 7260 (rev 73 / 0x144) inside, Gentoo outside. 

Wireless does not work unless I set iwlwifi option 11n_disable=1. Until recently I was quite pessimistic about chances to make it work with 40MHz band. It was accidentally discovered that 802.11n works fine when on battery, but dies soon after AC adapter is connected.

I have tried with no observable effect:
(a) 3 different AC adapters
(b) fixing CPU frequency
(c) different power_save and power_scheme values
What else can I try to help debug it?
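
When comparing runs with different settings, it helps to record which module parameter values were actually in effect. The sketch below reads them from sysfs, where loaded modules export their readable parameters under /sys/module/<module>/parameters/. It assumes that 11n_disable and power_save belong to iwlwifi and power_scheme to iwlmvm; the power_save assignment is an assumption, the other two follow the thread. The function name and the sysfs_root argument are illustrative only.

```python
from pathlib import Path

# Parameters mentioned in this report. Assumption: power_save is an iwlwifi
# parameter; 11n_disable (iwlwifi) and power_scheme (iwlmvm) follow the thread.
PARAMS = {
    "iwlwifi": ["11n_disable", "power_save"],
    "iwlmvm": ["power_scheme"],
}


def read_module_params(sysfs_root="/sys/module", params=PARAMS):
    """Return {"module.param": value} for every parameter file that exists."""
    found = {}
    for module, names in params.items():
        for name in names:
            path = Path(sysfs_root) / module / "parameters" / name
            try:
                found[f"{module}.{name}"] = path.read_text().strip()
            except OSError:
                # Module not loaded, or parameter not exported on this kernel.
                continue
    return found


if __name__ == "__main__":
    for key, value in sorted(read_module_params().items()):
        print(f"{key} = {value}")
```

Parameters that are missing are skipped silently, so the same script can run whether or not iwlmvm is loaded.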

Thanks in advance
-L

P.S. The logs are collected with vanilla 4.4.2.
I started wget <huge-file> after boot. AC power was connected at ~375 s uptime; wget had successfully downloaded ~3GB by that time and stalled immediately afterwards. I can send the firmware dump if needed.
Comment 1 Emmanuel Grumbach 2016-02-23 10:18:49 UTC
This is mostly because of interference with the charger...
Can you try with power_scheme=1 as a module parameter to iwlmvm ?

I'll send a firmware for debug later. That will allow us to know why the firmware is unhappy (most probably because of the interference mentioned above).
Comment 2 Emmanuel Grumbach 2016-02-23 10:25:10 UTC
Created attachment 204401 [details]
Core14 FW with debug probes

Please use the firmware attached and follow the instructions here:

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#firmware_debugging

Take the time to read the privacy notice:
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#privacy_aspects
Comment 3 Lev Melnikovsky 2016-02-23 10:28:37 UTC
# just an interference or some particular electric noise from the adapter?

# power_scheme=1 - I have tried it earlier to no avail. Should I produce dmesg as well?
Comment 4 Emmanuel Grumbach 2016-02-23 10:33:36 UTC
(In reply to Lev Melnikovsky from comment #3)
> # power_scheme=1 - I have tried it earlier to no avail. Should I produce
> dmesg as well?

Nope. Just check you see:
[  385.290146] iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN 

as well. This is enough to assume it is the same problem.
BTW - did you try 11n with 40MHz on 5.2GHz?
Comment 5 Lev Melnikovsky 2016-02-23 11:22:42 UTC
> Please use the firmware attached and follow the instructions here
I've sent the dump via email.

> Nope. Just check you see:
> [  385.290146] iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN 
> as well. This is enough to assume it is the same problem.
Yes, I do see it with power_scheme=1 (and your debug firmware):

[   47.166725] iwlwifi 0000:02:00.0: Start IWL Error Log Dump:
[   47.166730] iwlwifi 0000:02:00.0: Status: 0x00000000, count: 6
[   47.166735] iwlwifi 0000:02:00.0: Loaded firmware version: 17.288042.0
[   47.166740] iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN
[   47.166745] iwlwifi 0000:02:00.0: 0x00800634 | uPc
[   47.166750] iwlwifi 0000:02:00.0: 0x00000000 | branchlink1
[   47.166754] iwlwifi 0000:02:00.0: 0x00000B30 | branchlink2
[   47.166758] iwlwifi 0000:02:00.0: 0x000167DC | interruptlink1
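
Checking for the same signature across many runs can be automated. The following is a minimal sketch that pulls the firmware version and the "value | name" entries out of an error log dump like the one above. The regular expressions are written against the lines quoted in this comment only and are an assumption about the format, not a description of everything the driver may print.

```python
import re

# Matches e.g. "iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN"
ENTRY = re.compile(
    r"iwlwifi\s+\S+\s+(?P<value>0x[0-9A-Fa-f]+)\s+\|\s+(?P<name>\S+)\s*$"
)
VERSION = re.compile(r"Loaded firmware version:\s*(\S+)")

# Lines copied from the dump quoted in this comment.
SAMPLE = """\
[   47.166725] iwlwifi 0000:02:00.0: Start IWL Error Log Dump:
[   47.166730] iwlwifi 0000:02:00.0: Status: 0x00000000, count: 6
[   47.166735] iwlwifi 0000:02:00.0: Loaded firmware version: 17.288042.0
[   47.166740] iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN
[   47.166745] iwlwifi 0000:02:00.0: 0x00800634 | uPc
"""


def parse_error_dump(dmesg_text):
    """Return (firmware_version, {entry_name: integer_value})."""
    version = None
    entries = {}
    for line in dmesg_text.splitlines():
        m = VERSION.search(line)
        if m:
            version = m.group(1)
            continue
        m = ENTRY.search(line)
        if m:
            entries[m.group("name")] = int(m.group("value"), 16)
    return version, entries


if __name__ == "__main__":
    fw, log = parse_error_dump(SAMPLE)
    print(fw, "NMI_INTERRUPT_UNKNOWN" in log)
```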

> BTW - did you try 11n with 40MHz on 5.2GHz?
Now I wonder if my card should support 5.2GHz band? I have only tried 2.4GHz.

Another observation: if I set 11n_disable=1 (w/ or w/o AC adapter), then I get max throughput about 2.5 MByte/s. If I remove 11n_disable and work on battery, then the throughput is 10 MByte/s (probably limited by 100Mbps ethernet at the access point). Does it make sense?
Comment 6 Emmanuel Grumbach 2016-02-23 11:29:20 UTC
(In reply to Lev Melnikovsky from comment #5)
> 
> > BTW - did you try 11n with 40MHz on 5.2GHz?
> Now I wonder if my card should support 5.2GHz band? I have only tried 2.4GHz.

Just checked - your card doesn't support 5.2GHz.
But working in 40MHz in 2.4GHz is not really recommended. Also, take a look at https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#about_platform_noise

> 
> Another observation: if I set 11n_disable=1 (w/ or w/o AC adapter), then I
> get max throughput about 2.5 MByte/s. If I remove 11n_disable and work on
> battery, then the throughput is 10 MByte/s (probably limited by 100Mbps
> ethernet at the access point). Does it make sense?


2.5 MBytes/s = 20Mb/s which is what you can expect without 11n.
10MByte/s = 80Mb/s which can be a decent throughput for 11n in certain case.
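
The conversion above is simply a factor of 8 bits per byte. A small sketch, with a helper name chosen here for illustration:

```python
def mbyte_per_s_to_mbit_per_s(mbytes_per_s):
    """Convert a payload rate in MByte/s to Mb/s (8 bits per byte)."""
    return mbytes_per_s * 8


if __name__ == "__main__":
    for label, rate in (("11n disabled", 2.5), ("11n enabled, on battery", 10.0)):
        mbit = mbyte_per_s_to_mbit_per_s(rate)
        print(f"{label}: {rate} MByte/s = {mbit:.0f} Mb/s")
```

By the same arithmetic, a 100 Mb/s Ethernet link caps the payload at 12.5 MByte/s before protocol overhead, which is consistent with the reporter's guess that the access point's wired port limits the 10 MByte/s figure.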
Comment 7 Lev Melnikovsky 2016-02-23 12:29:15 UTC
> 2.5 MBytes/s = 20Mb/s which is what you can expect without 11n.
> 10MByte/s = 80Mb/s which can be a decent throughput for 11n in certain case.
Sorry, I naively assumed that Shannon theorem predicts twice throughput for 40MHz vs 20MHz band. I was also deceived by iw reporting tx bitrate 54 Mb/s (vs 150 Mb/s with N enabled).

> But working in 40MHz in 2.4GHz is not really recommended.
I can not replace the card due to Lenovo BIOS whitelist...

> Also, take a look at
> https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#about_platform_noise
Want me to look at the AC adapter output with an oscilloscope?
Comment 8 Lev Melnikovsky 2016-02-23 12:30:37 UTC
Created attachment 204421 [details]
firmware dump [encrypted]
Comment 9 Emmanuel Grumbach 2016-02-23 13:08:57 UTC
(In reply to Lev Melnikovsky from comment #7)
> Sorry, I naively assumed that Shannon theorem predicts twice throughput for
> 40MHz vs 20MHz band. I was also deceived by iw reporting tx bitrate 54 Mb/s
> (vs 150 Mb/s with N enabled).

Should be so. 40MHz should give twice the throughput of 20MHz under ideal conditions. Due to interference and the shared nature of the medium, it can be sometimes better to work in 20MHz only, especially on 2.4GHz as I said in the wiki page.
150Mb/s means that you don't have 40MHz enabled, or you don't have SISO. I can't remember which right now.
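
For reference, the "twice the throughput under ideal conditions" statement can be read off the Shannon-Hartley capacity formula. This is a textbook sketch, not a model of 802.11n; the symbols are introduced here and do not come from the thread: $B$ is the channel bandwidth, $S$ the received signal power and $N_0$ the noise power spectral density.

```latex
C(B) = B \log_2\!\left(1 + \frac{S}{N_0 B}\right)
\qquad\Longrightarrow\qquad
C(2B) = 2B \log_2\!\left(1 + \frac{S}{2 N_0 B}\right) \le 2\,C(B)
```

Capacity doubles only if the signal-to-noise ratio $S/(N_0 B)$ stays the same at the wider bandwidth. With fixed transmit power the wider channel collects twice the noise, and any extra interference in the added 20MHz reduces the gain further.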

> 
> > But working in 40MHz in 2.4GHz is not really recommended.
> I can not replace the card due to Lenovo BIOS whitelist...

You can still limit yourself to 20MHz with the cfg80211 module parameter.

> 
> > Also, take a look at
> https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi#about_platform_noise
> Want me to look at the AC adapter output with an oscilloscope?

That's nice to offer, but I am not sure I'll be able to understand anything from the output :)


Anyway, I got your firmware dump, I'll take a look later.
Comment 10 Johannes Berg 2016-02-23 13:11:06 UTC
> 150Mb/s means that you don't have 40MHz enabled, or you don't have SISO. I
> can't remember which right now.

it means only single stream, I guess

http://mcsindex.com/
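
The figures in such an MCS table can be reproduced from the 802.11n OFDM parameters. The sketch below assumes the standard HT values: 52 data subcarriers at 20MHz, 108 at 40MHz, and a symbol time of 4.0 µs with the long guard interval or 3.6 µs with the short one. The function and constant names are made up for this example.

```python
# Data subcarriers per OFDM symbol in 802.11n (HT) for each channel width (MHz).
DATA_SUBCARRIERS = {20: 52, 40: 108}
# OFDM symbol duration in microseconds for long (800 ns) and short (400 ns) GI.
SYMBOL_TIME_US = {"long": 4.0, "short": 3.6}


def ht_rate_mbps(width_mhz, streams, bits_per_subcarrier=6, coding_rate=5 / 6,
                 guard_interval="short"):
    """PHY rate in Mb/s; the defaults describe 64-QAM with rate 5/6 coding."""
    bits_per_symbol = (DATA_SUBCARRIERS[width_mhz] * bits_per_subcarrier
                       * coding_rate * streams)
    # Bits per microsecond is the same number as Mb/s.
    return bits_per_symbol / SYMBOL_TIME_US[guard_interval]


if __name__ == "__main__":
    for width, streams in ((20, 1), (40, 1), (40, 2)):
        print(f"{width}MHz, {streams} stream(s): {ht_rate_mbps(width, streams):.1f} Mb/s")
```

With these parameters a single stream at 40MHz with short guard interval gives 150 Mb/s, a single stream at 20MHz about 72.2 Mb/s, and two streams at 40MHz 300 Mb/s. A reported tx bitrate of 150 Mb/s therefore fits 40MHz with one spatial stream.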
Comment 11 Emmanuel Grumbach 2016-02-24 13:00:57 UTC
I looked at the dump file you created. Seems like at some point, we just stop sending any data. This can be caused by interference. I am involving the firmware team.

Thank you.
Comment 12 Emmanuel Grumbach 2016-02-29 12:48:28 UTC
Hello,

I got a reply from the firmware team. We had issues with the probe enabled in the firmware and this explains the strange things I saw. The firmware team will fix the probes and then we will need another dump from you.

Thank you.
Comment 13 Emmanuel Grumbach 2016-03-09 12:42:03 UTC
Created attachment 208381 [details]
Core14 FW with debug probes

Here is the firmware we need you to use to get the debug data.
Thank you!
Comment 14 Lev Melnikovsky 2016-03-09 23:02:02 UTC
Created attachment 208571 [details]
firmware dump [encrypted]
Comment 15 Lev Melnikovsky 2016-03-09 23:12:27 UTC
This time it *seemed* more stable. The "Microcode SW error" was eventually triggered, but it took much longer. Are you sure only the debug probes were changed?
Comment 16 Emmanuel Grumbach 2016-03-10 03:32:57 UTC
Should be debug probes changes only.

I'll take a look at the dump a bit later.
Is the same story as before (with the power adapter and 11n_disable)?
Comment 17 Emmanuel Grumbach 2016-03-10 10:12:18 UTC
I transferred the data to the firmware team.
Comment 18 Emmanuel Grumbach 2016-03-10 11:01:32 UTC
There is bad interference.
I can see that each time you want to transmit anything, our receiver detects energy in the air and hence can't transmit.
The whole log is full of:

trying to Tx, aborting Tx due to Rx.
I'd need to change the timeout of the hang detection to see when it starts to happen, but I am pessimistic since you say that it is AC / DC related, which hints at platform noise.
Comment 19 Lev Melnikovsky 2016-03-10 19:52:45 UTC
Hello,

> Is the same story as before (with the power adapter and 11n_disable)?
I had not tried 11n_disable with this firmware. Should I?

This time the "bug" was not triggered immediately after connecting AC adapter. It took a minute or two (while downloading at ~10MB/s) before download stalled. I've never observed such stability before. I'll probably try again to see if this is reproducible.

Is it necessary to reboot to try different firmware or I can just rmmod iwlmvm/iwlwifi? What is supposed to be a clean test procedure?

> I can see that each time you want to transmit anything, our receiver detects
> energy in the air and hence can't transmit.
Well, this is what wget should look like - Rx lots, Tx acks only.

I can try a different pattern, like uploading something, or make it symmetric with balanced up/down loading. Would it give more information?

> trying to Tx, aborting Tx due to Rx
Sorry for my naive assumption, but this *sounds* like an echo - RF feedback Tx->Rx...
Comment 20 Emmanuel Grumbach 2016-03-10 19:58:58 UTC
(In reply to Lev Melnikovsky from comment #19)
> Hello,
> 
> > Is the same story as before (with the power adapter and 11n_disable)?
> I had not tried 11n_disable with this firmware. Should I?

Yes please

> 
> Is it necessary to reboot to try different firmware or I can just rmmod
> iwlmvm/iwlwifi? What is supposed to be a clean test procedure?

modprobe -r iwlwifi is sufficient, as long as you see the line "Loading firmware XXX" in dmesg again.

> 
> > I can see that each time you want to transmit anything, our receiver
> detects
> > energy in the air and hence can't transmit.
> Well, this is what wget should look like - Rx lots, Tx acks only.

This is not really relevant. We are most probably not talking about real Rx since these would end at some point. But here we *constantly* see energy (which can't be a real wifi Rx).

> 
> I can try a different pattern, like uploading something, or make it
> symmetric with balanced up/down loading. Would it give more information?

I don't think so.

> 
> > trying to Tx, aborting Tx due to Rx
> Sorry for my naive assumption, but this *sounds* like an echo - RF feedback
> Tx->Rx...

No since we don't get to the Tx part. Before transmitting, we check if someone else is already transmitting (CSMA) and in your case, we keep hearing energy, so we stop our Tx procedure before the radio emitted anything.
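
The carrier-sense behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not the firmware's algorithm: time is cut into 1 ms slots, one frame is sent in every idle slot, and a watchdog fires when the queue has made no progress for hang_timeout_ms. The 10000 ms figure is the one that appears later in this thread; everything else is an assumption made for the example.

```python
def simulate_tx_queue(channel_busy, hang_timeout_ms, slot_ms=1):
    """Walk through time slots and send one queued frame in every idle slot.

    channel_busy: iterable of booleans, True when the receiver senses energy.
    Returns (frames_sent, hang_detected_at_ms), the latter None if no hang.
    """
    frames_sent = 0
    stalled_ms = 0
    for slot, busy in enumerate(channel_busy):
        if busy:
            # Carrier sense: energy detected, so defer before emitting anything.
            stalled_ms += slot_ms
            if stalled_ms >= hang_timeout_ms:
                return frames_sent, (slot + 1) * slot_ms
        else:
            frames_sent += 1
            stalled_ms = 0
    return frames_sent, None


if __name__ == "__main__":
    # Idle for 375 ms, then energy is sensed continuously.
    constant_noise = [False] * 375 + [True] * 20000
    print(simulate_tx_queue(constant_noise, hang_timeout_ms=10000))
    # Ordinary traffic: busy and idle slots alternate, the queue keeps draining.
    print(simulate_tx_queue([True, False] * 100, hang_timeout_ms=10))
```

In the first case the model reports a hang exactly one timeout after the channel became permanently busy. In the second case real receptions end, the stall counter resets, and no hang is reported. That is the distinction drawn above between genuine Rx and constant energy.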
Comment 21 Lev Melnikovsky 2016-03-17 10:16:58 UTC
Hi again,

I have repeated the test many times and gathered a lot of statistics:

(a) the "bug" was never triggered with 11n_disable=1
(b) the "bug" was never triggered w/o AC adapter
(c) the "bug" was always there w/ AC adapter and 11n_disable=0

The time required to trigger the bug in scenario (c) may vary from <1s to 5min. It seems that firmware 17.295852.0 sometimes stands longer than 17.288042.0 , but this is statistically insignificant. It also seems that proper power cycle and reboot give better hang recovery than module unloading/reloading, but again this is just a feeling.

What happens during 10000 ms between wget stall and the woes from iwlwifi about stuck queue?

Would it be possible to set the hung detection timeout as a module parameter?

I have wiped the Windows off the hard drive immediately after the purchase so I can not tell if it behaves better or worse...
Comment 22 Emmanuel Grumbach 2016-03-17 10:44:04 UTC
(In reply to Lev Melnikovsky from comment #21)
> 
> What happens during 10000 ms between wget stall and the woes from iwlwifi
> about stuck queue?

The firmware is trying to send data ... and can't because we feel energy in the air.

> 
> Would it be possible to set the hung detection timeout as a module parameter?
> 

Yes, but we have another way. We add debug data to the .ucode file and configure the timeout this way.
If you want, I can prepare a firmware with a debug data that will reduce the timeout.
Comment 23 Lev Melnikovsky 2016-03-17 21:11:59 UTC
(In reply to Emmanuel Grumbach from comment #22)

> > What happens during 10000 ms between wget stall and the woes from iwlwifi
> > about stuck queue?
> The firmware is trying to send data ... and can't because we feel energy in
> the air.
Do you really mean to assume that "we feel energy in the air" for 10000 ms without a stop?

Sometimes the "bug" is triggered only after a minute of high traffic. The throughput may fluctuate (probably due to severe interference you mentioned) but this is still >1000 packets per second. And then suddenly wget stops and we can not send a single packet for 10 consecutive seconds?

> Yes, but we have another way. We add debug data to the .ucode file and
> configure the timeout this way.
> If you want, I can prepare a firmware with a debug data that will reduce the
> timeout.
Honestly, I don't even understand if this timeout should be increased or reduced... But I will try whatever you suggest to gather information you think is important.
Comment 24 Emmanuel Grumbach 2016-03-20 07:17:34 UTC
(In reply to Lev Melnikovsky from comment #23)
> (In reply to Emmanuel Grumbach from comment #22)
> 
> > > What happens during 10000 ms between wget stall and the woes from iwlwifi
> > > about stuck queue?
> > The firmware is trying to send data ... and can't because we feel energy in
> > the air.
> Do you really mean to assume that "we feel energy in the air" for 10000 ms
> without a stop?

Yes - this can happen because of interference or noise created by the platform / power adapter.
In any case, I opened a bug with the firmware team; there isn't much more I can do for now.
Comment 25 Emmanuel Grumbach 2016-04-12 08:46:46 UTC
Created attachment 212461 [details]
Core14 FW with LPRX disabled

Please retest with this firmware. In this firmware, we disabled a feature that can cause the problem you are seeing and we would like to know if it feels better now.
Thank you.
Comment 26 Emmanuel Grumbach 2016-05-16 12:30:37 UTC
ping?
Comment 27 Lev Melnikovsky 2016-05-18 20:09:03 UTC
Created attachment 216621 [details]
dmesg with 17.320364.0

Sorry for the delay, I could not exactly reproduce the setup I had at home (I don't have a second computer here to fill the bandwidth). So I redirected the traffic back to the laptop using iptables on an openwrt router. This gives stable ~4MB/s up + 4MB/s down with 11n_disable=0 and AC adapter disabled.

(In reply to Emmanuel Grumbach from comment #25)
> Please retest with this firmware. In this firmware, we disabled a feature
> that can cause the problem you are seeing and we would like to know if it
> feels better now.
Unfortunately, it *feels* worse. The transfer stops immediately after I connect AC adapter and the connectivity is not restored even after I disconnect it again. I had to rmmod/insmod to make this post.

dmesg is attached, the AC adapter is connected at 1090.
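
To line up an attached dmesg with a known event time such as the moment the adapter was plugged in, a small helper can report the first iwlwifi message logged at or after that time. This is a sketch with a hypothetical function name; it assumes the usual "[seconds.micros] message" dmesg prefix. The sample line is the one quoted in comment 4, and 375 s is the AC connection time given in the original description.

```python
import re

TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Line quoted in comment 4 of this report.
SAMPLE = "[  385.290146] iwlwifi 0000:02:00.0: 0x00000084 | NMI_INTERRUPT_UNKNOWN\n"


def first_iwlwifi_line_after(dmesg_text, event_time_s):
    """Return (seconds_after_event, message) for the first iwlwifi line logged
    at or after event_time_s, or None if there is no such line."""
    for line in dmesg_text.splitlines():
        m = TIMESTAMP.match(line)
        if not m:
            continue
        stamp, message = float(m.group(1)), m.group(2)
        if stamp >= event_time_s and "iwlwifi" in message:
            return stamp - event_time_s, message
    return None


if __name__ == "__main__":
    hit = first_iwlwifi_line_after(SAMPLE, 375.0)
    if hit is not None:
        print(f"{hit[0]:.1f} s after the event: {hit[1]}")
```

For the first report this gives a delay of roughly 10 s between plugging in the adapter and the firmware error, in line with the 10000 ms hang detection discussed in comment 21. The same call with 1090 as the event time applies to the dmesg attached here.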
Comment 28 Luca Coelho 2016-11-23 13:10:43 UTC
Unfortunately this bug is already very old and we haven't really been able to work around these platform issues, so I'll have to close it as won't fix...