Bug 198265 - iwlwifi: 7265D: Hardware error when network at full speed
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: DO NOT USE - assign "network-wireless-intel" component instead
 
Reported: 2017-12-25 22:12 UTC by joel
Modified: 2018-01-02 21:27 UTC
CC List: 0 users

Kernel Version: 4.14.8
Regression: No

Attachments
dmesg (188.06 KB, text/plain), 2017-12-25 22:12 UTC, joel
kernel config (96.76 KB, text/plain), 2017-12-25 22:13 UTC, joel
dmesg kernel 4.15 (180.92 KB, text/plain), 2017-12-26 20:24 UTC, joel
config 4.15rc5 (97.82 KB, text/x-mpsub), 2017-12-26 20:24 UTC, joel
dmesg with backport iwlwifi (74.66 KB, text/plain), 2017-12-26 21:52 UTC, joel
bisect trace (1.09 MB, text/plain), 2017-12-29 20:50 UTC, joel
selective revert (2.03 KB, patch), 2017-12-31 07:56 UTC, Emmanuel Grumbach
add debug data (1.40 KB, patch), 2017-12-31 10:54 UTC, Emmanuel Grumbach
Fix candidate (5.91 KB, application/mbox), 2017-12-31 14:35 UTC, Emmanuel Grumbach

Description joel 2017-12-25 22:12:39 UTC
Created attachment 273301
dmesg

Hi,

I'm using kernel 4.14.8 and linux-firmware-20171206 with an Intel WiFi 7265D card. I've read that there are (were?) some issues with kernel 4.14 and iwlwifi (most of them for the 8260), but nothing I've read has solved my problem. If this is a duplicate, I'm sorry for wasting your time.

I can reproduce this bug with:
- Kernel 4.14.8
- Linux-firmware 20171206
- Iwlwifi 7265D with firmware 29.
- A connection to a 5 GHz WiFi network (I don't know if that's related, and I haven't tried 2.4 GHz; I can if you need me to).
- A sustained full-speed transfer (about 40 MB/s) for a while (less than 5 minutes), for example an rsync over my LAN.

I'm attaching my dmesg and my kernel config. Feel free to ask anything you need.

Kind regards,
Comment 1 joel 2017-12-25 22:13:07 UTC
Created attachment 273303
kernel config
Comment 2 joel 2017-12-25 22:13:41 UTC
One more thing: I upgraded from kernel 4.12.12, which had no issues (same config, I just ran make oldconfig to update it).
Comment 3 joel 2017-12-26 20:23:39 UTC
The same happens with kernel 4.15.0-rc5.
Comment 4 joel 2017-12-26 20:24:11 UTC
Created attachment 273321
dmesg kernel 4.15
Comment 5 joel 2017-12-26 20:24:49 UTC
Created attachment 273323
config 4.15rc5
Comment 6 joel 2017-12-26 20:40:20 UTC
With kernel 4.12 and firmware 29 I get lower speeds than with >=4.14 and firmware 29 (28 MB/s on kernel 4.12 vs 44 MB/s on kernel 4.14/4.15), but I don't get any errors.
Comment 7 Emmanuel Grumbach 2017-12-26 21:02:49 UTC
Hardware errors like the one in the first dmesg:
[   86.216784] iwlwifi 0000:02:00.0: Hardware error detected.  Restarting.

aren't fun to debug.

I can't think of anything that would have improved the bandwidth either.
You run the same firmware in both configurations, and the 7265D is a fairly old device, so we don't really add new features for it.

What could help here is to run on 4.12 and install our backport tree. This means that you'll have our latest driver but on 4.12.
This will tell us if the difference comes from the kernel (PCI / Net) or from iwlwifi itself.
Our backport tree is also much more convenient to bisect.

The backport tree is here:

https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

BTW: are you connected to an AC network?
Can you share the output of iw wlp2s0 link?
Comment 8 joel 2017-12-26 21:52:33 UTC
Created attachment 273325
dmesg with backport iwlwifi
Comment 9 joel 2017-12-26 21:54:19 UTC
Hi,

The behavior is very similar, but slightly different.

I reached 44 MB/s again (closer to 42 than 44, but still a noticeable difference). When the error happened, I wasn't disconnected from the WiFi; instead, the transfer rate dropped to 5 MB/s.

Yes, I am connected to an AC Network. Here's the output you requested:

Connected to 80:2a:a8:c8:79:e4 (on wlp2s0)
	SSID: Tuxhound
	freq: 5180
	RX: 26323473 bytes (105204 packets)
	TX: 1965211233 bytes (1276643 packets)
	signal: -53 dBm
	tx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2

	bss flags:	short-slot-time
	dtim period:	1
	beacon int:	100
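
As a rough cross-check of the numbers in this comment, the sketch below (Python, not part of the original report) pulls the tx bitrate out of the `iw` output above and compares it with the 44 MB/s transfer rate mentioned earlier. The helper names are made up for illustration, and it assumes 1 MB = 10^6 bytes, which may not match how rsync reports rates.

```python
import re

# Relevant lines copied from the `iw` output above.
IW_LINK_OUTPUT = """\
Connected to 80:2a:a8:c8:79:e4 (on wlp2s0)
    freq: 5180
    signal: -53 dBm
    tx bitrate: 866.7 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 2
"""


def parse_tx_bitrate_mbit(text):
    """Return the 'tx bitrate' value in Mbit/s, or None if the line is absent."""
    match = re.search(r"tx bitrate:\s*([\d.]+)\s*MBit/s", text)
    return float(match.group(1)) if match else None


def mbytes_to_mbit(mbytes_per_s):
    """Convert MB/s to Mbit/s (assumption: 1 MB = 10^6 bytes)."""
    return mbytes_per_s * 8


phy_rate = parse_tx_bitrate_mbit(IW_LINK_OUTPUT)   # negotiated PHY rate
observed = mbytes_to_mbit(44)                      # application-level throughput
utilisation = observed / phy_rate
print(f"PHY rate {phy_rate} Mbit/s, observed {observed} Mbit/s, ratio {utilisation:.0%}")
```

The ratio only says how much of the negotiated PHY rate the transfer uses; protocol overhead means application throughput is always well below the PHY rate.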
Comment 10 Emmanuel Grumbach 2017-12-26 22:05:31 UTC
Not exactly the same: now you have
iwlwifi 0000:02:00.0: HCMD skipped: index (199) 199 198
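
To tell the two failure signatures seen so far apart in a long dmesg, something like the following Python sketch could be used. It is only an illustration: the pattern names and the `classify` helper are hypothetical, and the regular expressions are based solely on the two log lines quoted in this thread.

```python
import re

# Patterns derived from the two iwlwifi messages quoted in this bug.
SIGNATURES = {
    "hardware_error": re.compile(r"iwlwifi .*: Hardware error detected"),
    "hcmd_skipped": re.compile(r"iwlwifi .*: HCMD skipped: index"),
}


def classify(lines):
    """Count how many lines match each known signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "[   86.216784] iwlwifi 0000:02:00.0: Hardware error detected.  Restarting.",
    "iwlwifi 0000:02:00.0: HCMD skipped: index (199) 199 198",
    "[  325.157294] userif-2: sent link down event.",
]
print(classify(sample))
```

In practice the input would be the lines of a saved dmesg file rather than the hard-coded sample.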

Can you please record tracing?
https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing


I am afraid I'll have to ask you to try to go back in the backport tree and find a commit that works.

You can try commit 335a37f70b86095794a048b34b70309117bf113b there.

I just hope that you'll be able to compile that old commit on 4.12. Sometimes very old backport commits don't build on newer kernels, and 4.12 wasn't really out when we merged 335a37f70b86095794a048b34b70309117bf113b.
Comment 11 joel 2017-12-27 18:30:46 UTC
Hi,

I'm unable to set "CONFIG_IWLWIFI_TRACING". Which option do I need to enable for it to become available?
Comment 12 joel 2017-12-27 18:54:10 UTC
Anyway, I cannot compile that commit with kernel 4.12.
Comment 13 Emmanuel Grumbach 2017-12-27 21:09:44 UTC
I found that commit cebfea4918852b2cf9c5eec2f0a60a4280090e65 compiles with 4.12.

Let's hope that on this commit, the bug doesn't reproduce and then we'll be able to bisect.
Comment 14 joel 2017-12-27 21:44:45 UTC
Hi, with that commit I've sent >20 GB with no issues (usually the bug appeared after sending about 2 GB). The only thing I see in dmesg is:

[  325.157294] userif-2: sent link down event.
[  325.157296] userif-2: sent link up event.

but I didn't get a reconnection (at least NetworkManager didn't show anything). I'm sending the file at 40MB/s... everything is perfect.

So, what is the next step?

Kind regards,
Comment 15 Emmanuel Grumbach 2017-12-28 06:27:46 UTC
Awesome!

Then... we can start git bisect.

git bisect start
git bisect good cebfea4918852b2cf9c5eec2f0a60a4280090e65
git bisect bad origin/master

make etc...
Then for each iteration you can either say:

git bisect bad
or
git bisect good


After a few iterations, it'll point to a commit; that is what we need.
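
For anyone unfamiliar with why only "a few iterations" are needed: bisection is a binary search over the history, so the number of builds grows roughly with log2 of the number of commits. The Python sketch below illustrates the idea on a made-up linear history. It is a simplification (real git history is a graph, and `git bisect` handles that itself), and the commit names and the position of the bad commit are invented for the example.

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, number_of_tests).

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1                      # one build-and-test cycle
        if is_bad(commits[mid]):
            bad = mid                   # equivalent of "git bisect bad"
        else:
            good = mid                  # equivalent of "git bisect good"
    return commits[bad], tests


# Hypothetical history of 1000 commits where commit 613 introduced the bug.
history = [f"commit-{i}" for i in range(1000)]
culprit, tests = bisect_first_bad(history, lambda c: int(c.split("-")[1]) >= 613)
print(culprit, tests)
```

Each test here stands for one build, install and reproduction attempt, which is why halving the search range at every step matters so much.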
Comment 16 joel 2017-12-28 19:54:05 UTC
Hi, these are the results:

99e6ccb18602eceec7f951fd989d5e6976f35097 is the first bad commit
commit 99e6ccb18602eceec7f951fd989d5e6976f35097
Author: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Date:   Sun Jul 16 12:28:05 2017 +0300

    iwlwifi: pcie: support short Tx queues for A000 device family
    
    This allows to modify TFD_TX_CMD_SLOTS to a power of 2
    which is smaller than 256.
    Note that we still need to set values to wrap at 256
    into the scheduler's write pointer, but all the rest of
    the code can use shorter transmit queues.
    
    type=feature
    ticket=jira:WIFILNX-956
    
    Change-Id: I60859e7f9586cf3825281873ed8b292264f0d074
    Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
    Reviewed-on: https://git-amr-3.devtools.intel.com/gerrit/127494
    Reviewed-by: Coelho, Luciano <luciano.coelho@intel.com>
    Reviewed-by: ec ger unix iil jenkins <EC.GER.UNIX.IIL.JENKINS@INTEL.COM>
    Tested-by: ec ger unix iil jenkins <EC.GER.UNIX.IIL.JENKINS@INTEL.COM>
    x-iwlwifi-stack-dev: 2b399bbe3e2a6aaeab41193f80f21be0368a7242

:040000 040000 6f8e4ca3caf27885d0db6b8f7cc00c3755053e7e 265a57860b1c1b0b0f082c64019b08044ecd2cf5 M	drivers
:100644 100644 bf2b3c23ef9019d1bb5464806c24f9dd09f030bc 37124383ab7b928d9e0896d0653c507baebbc0b3 M	versions
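
To make the commit message above a little more concrete, here is a small Python sketch of the two index spaces it describes: a pointer that wraps at 256, and a shorter power-of-2 queue whose slots are addressed with a smaller mask. This is not the driver code and not the actual bug or fix (neither is shown in this thread); the function names and the queue size of 32 are assumptions chosen for illustration.

```python
HW_WRAP = 256  # per the commit message, the scheduler's write pointer wraps at 256


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def advance_hw_pointer(ptr):
    """Advance an index that wraps at 256."""
    return (ptr + 1) & (HW_WRAP - 1)


def slot_for(ptr, queue_size):
    """Map a 0..255 pointer onto a slot of a shorter power-of-2 queue."""
    if not is_power_of_two(queue_size) or queue_size > HW_WRAP:
        raise ValueError("queue_size must be a power of 2 no larger than 256")
    return ptr & (queue_size - 1)


# Walk the pointer across the 256 wrap with a hypothetical 32-entry queue.
ptr = 254
seen = []
for _ in range(4):
    seen.append((ptr, slot_for(ptr, 32)))
    ptr = advance_hw_pointer(ptr)
print(seen)
```

The point of the sketch is that the two masks must be applied consistently: using the 256 wrap where the queue-size wrap is expected (or the reverse) yields an index that looks valid but refers to the wrong slot.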
Comment 17 Emmanuel Grumbach 2017-12-28 20:04:11 UTC
wow... ok.
Can you please double check that the commit right before works?

This is very surprising because this patch is fairly trivial but it may be totally bogus.

I'll look at the patch.


HUGE thanks!
Comment 18 joel 2017-12-29 00:02:38 UTC
Hi, sure, no problem. Give me 24 hours and I'll come back to you asap :)

Thanks!
Comment 19 joel 2017-12-29 20:50:40 UTC
Created attachment 273345
bisect trace
Comment 20 joel 2017-12-29 20:51:11 UTC
Hi,

The result is the same. I've attached the trace, the git commands, the git output, dmesg, and so on.

Kind regards,
Comment 21 Emmanuel Grumbach 2017-12-31 07:56:18 UTC
Created attachment 273361
selective revert

Hi,

This is a selective revert of the patch you found with the bisection.
Please check if that helps. In the meantime, I'll try to understand why this commit causes you trouble.

Annoying ring indexes stuff.

Thanks.
Comment 22 Emmanuel Grumbach 2017-12-31 10:54:40 UTC
Created attachment 273363
add debug data

Hi again,

I looked again at the patch you found with the bisection, and I can't understand why it's causing any issues. It should really be a noop for the device you own.

Please apply the attached patch and try to reproduce while you have tracing running.

https://wireless.wiki.kernel.org/en/users/drivers/iwlwifi/debugging#tracing

Thanks.
Comment 23 Emmanuel Grumbach 2017-12-31 14:35:03 UTC
Created attachment 273365
Fix candidate

Hi,

I think I found the bug. Please apply the patch attached and let me know.
Comment 24 joel 2018-01-02 19:59:30 UTC
Hi,

Sorry for the delay, it has been a complicated few days. The patch works perfectly. I tried it on kernel 4.14 and transferred more than 15 GB with no issues.

I think you can close this.

Thank you so much and have a happy new year :)
Comment 25 Emmanuel Grumbach 2018-01-02 21:27:49 UTC
Thanks for testing and for bisecting this.

I'd never have been able to fix this bug without your bisection; it was tremendously helpful.

Happy new year to you as well.
