Bug 209081 - r8169 network regression
Summary: r8169 network regression
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL: https://bugzilla.opensuse.org/show_bu...
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-30 16:41 UTC by Ralf Koelmel
Modified: 2020-10-02 07:26 UTC
CC List: 3 users

See Also:
Kernel Version: 5.3 - 5.8
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
verbose lspci from the affected network card from a kernel 4.12 (3.49 KB, text/plain)
2020-08-30 18:37 UTC, Ralf Koelmel
offload engine settings from a kernel 4.12 (1.64 KB, text/plain)
2020-08-30 18:42 UTC, Ralf Koelmel
verbose lspci from the affected network card from kernel 5.8.1 (3.49 KB, text/plain)
2020-08-30 19:25 UTC, Ralf Koelmel
offload engine settings from kernel 5.8.1 (1.72 KB, text/plain)
2020-08-30 19:26 UTC, Ralf Koelmel
stack trace from the system log with kernel 5.8.1 (8.14 KB, text/plain)
2020-08-30 19:42 UTC, Ralf Koelmel
verbose lspci from the affected network card from kernel 5.3.18 with pci=nomsi (3.49 KB, text/plain)
2020-08-30 22:37 UTC, Ralf Koelmel
disabling EEE on kernel 5.3 (687 bytes, text/plain)
2020-08-30 23:34 UTC, Ralf Koelmel
verbose lspci from the affected network card from kernel 5.3.18 with r8169 module parameter msix_disable=1 (4.33 KB, text/plain)
2020-08-31 18:37 UTC, Ralf Koelmel
network statistic with kernel 5.3.18 with r8169 module parameter msix_disable=1 (296 bytes, text/plain)
2020-08-31 18:40 UTC, Ralf Koelmel
patch for msix_disable as diff to the Linux kernel stable tree (934 bytes, patch)
2020-08-31 18:47 UTC, Ralf Koelmel
stack trace from the system log with kernel 5.3.18 (MSI active) (5.43 KB, text/plain)
2020-08-31 19:09 UTC, Ralf Koelmel
list of patches which are integrated into the distribution kernel of Leap 15.2 (834 bytes, text/plain)
2020-09-01 12:25 UTC, Ralf Koelmel

Description Ralf Koelmel 2020-08-30 16:41:50 UTC
The bug was first reported, with more details, on the openSUSE Bugzilla (https://bugzilla.opensuse.org/show_bug.cgi?id=1175296).
In short:
An onboard Realtek RTL8111E network controller, driven by the r8169 driver, doesn't provide a stable network connection on kernel versions 5.3 (5.3.18-lp152.33-default, openSUSE Leap 15.2) and 5.8 (5.8.1-3.1.g846658e). The workaround mentioned in some similar bug reports, disabling ASPM (pcie_aspm=off), doesn't stabilize the network.
Even with some backported patches the situation only improved to the point that the network was no longer lost completely; a severe network performance regression remained (e.g. rsync: sent 10,640,117 bytes  received 234,491 bytes  13,384.13 bytes/sec over a 1 Gbit/s link).
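To put the rsync figure in proportion, the byte rate can be converted to Mbit/s and compared with the nominal link speed. This is only a small sketch: the helper names to_mbit and link_share are made up for illustration, and 1000 Mbit/s is the nominal rate of the link, not a measured value.

```shell
# Convert a byte rate (bytes/s) to Mbit/s. Helper name is hypothetical.
to_mbit() {
    awk -v b="$1" 'BEGIN { printf "%.3f\n", b * 8 / 1000000 }'
}

# Share of a link's nominal rate (given in Mbit/s) that a byte rate uses,
# in percent. Helper name is hypothetical.
link_share() {
    awk -v b="$1" -v link="$2" \
        'BEGIN { printf "%.4f\n", b * 8 / 1000000 / link * 100 }'
}

to_mbit 13384.13           # rsync rate quoted in the description
link_share 13384.13 1000   # against a nominal 1 Gbit/s link
```

The reported rate works out to roughly 0.107 Mbit/s, i.e. about 0.01 % of the nominal link speed.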
Comment 1 Heiner Kallweit 2020-08-30 17:46:28 UTC
Please attach outputs of:
"lspci -vv" for network device
ethtool -k <if>
ethtool -S <if>   (after some time with network issues, check for rx_missed)
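When looking through the ethtool -S output for rx_missed and similar counters, it can help to filter for error-type counters that are non-zero. The following is a sketch under assumptions: the function name is hypothetical, the pattern list (err, miss, abort, underrun) is a guess at which counter names are of interest, and the sample input is invented to show the "name: value" layout, not taken from the attachments of this bug.

```shell
# Print error-type counters with a non-zero value from `ethtool -S` output
# read on stdin. Function name and pattern list are illustrative assumptions.
nonzero_error_counters() {
    awk -F': *' '
        /err|miss|abort|underrun/ && $2 + 0 > 0 {
            sub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 "=" $2
        }'
}

# Invented sample in the "name: value" layout that ethtool -S prints.
printf '%s\n' \
    'NIC statistics:' \
    '     tx_packets: 1200' \
    '     rx_errors: 0' \
    '     rx_missed: 7' |
    nonzero_error_counters
```

On a real system the input would come from the tool itself, e.g. `ethtool -S eth0 | nonzero_error_counters`, with eth0 replaced by the actual interface name.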
Comment 2 Heiner Kallweit 2020-08-30 17:47:52 UTC
Forgot: Let's test with the 5.8 kernel only for now
Comment 3 Ralf Koelmel 2020-08-30 18:37:32 UTC
Created attachment 292233
verbose lspci from the affected network card from a kernel 4.12

I know from my earlier tests that there were no visible network errors (like rx_missed).
Comment 4 Ralf Koelmel 2020-08-30 18:42:42 UTC
Created attachment 292235
offload engine settings from a kernel 4.12

I disabled the offload engine because I'm using a network bridge.
For now I'm back on a stable kernel (4.12) with the same offload settings.
Comment 5 Ralf Koelmel 2020-08-30 19:25:54 UTC
Created attachment 292241
verbose lspci from the affected network card from kernel 5.8.1

The first logs were from a 4.12 kernel; this one is from kernel 5.8.1.
Comment 6 Ralf Koelmel 2020-08-30 19:26:46 UTC
Created attachment 292243
offload engine settings from kernel 5.8.1

The first logs were from a 4.12 kernel; this one is from kernel 5.8.1.
Comment 7 Ralf Koelmel 2020-08-30 19:42:24 UTC
Created attachment 292245
stack trace from the system log with kernel 5.8.1

The attached stack trace appears before the network goes dead. The network does not die immediately, though; it is very slow, even before that stack trace appears.
Comment 8 Heiner Kallweit 2020-08-30 20:12:15 UTC
ASPM issues typically resulted in missed rx packets; however, the ASPM issues should be fixed since 5.3 anyway. So the question is: does the issue affect the rx and/or the tx direction?
I have a test system with this chip version that runs fine. This chip version is also very common, which means there should be more such reports if this were a driver bug. Having said that, the issue may be system-dependent. 4.12 and 5.8 may have slight differences in PHY initialization. Did you check with another cable / link partner?
A further difference is that newer kernels use MSI-X. Does booting with the nomsi parameter (pci=nomsi) make a difference?
Comment 9 Ralf Koelmel 2020-08-30 20:48:27 UTC
I have two systems with this Gigabyte Z68XP-UD3 motherboard, and both show this problem with kernels >= 5.3. Therefore I can definitely rule out a hardware defect.
I can't say whether tx or rx is more affected; I believe it doesn't matter. I can reproduce the problem with an outgoing rsync from the affected system, but incoming traffic (like querying the package repos) can trigger it too. No errors are visible in the network statistics, though.
I have also seen that kernel 5.8 uses MSI-X. But can I disable MSI-X with pci=nomsi, or does this only affect MSI?
I believe it must be a different bug than the ASPM problem, because "pcie_aspm=off" doesn't help and I lose the network completely after some time (even pings get no reply). The system must be rebooted, although the interface is up and there are no further messages in the system log.
Takashi Iwai (see CC) has integrated some patches into a 5.3 test kernel (openSUSE Leap 15.2), which I also tested. Maybe he can name them. With these patches the severe performance regression was still present, but the network stayed alive. It seems, however, that they aren't included in kernel 5.8. Given my problem, it is really interesting that you have a stable system with good network performance!
Comment 10 Heiner Kallweit 2020-08-30 21:02:32 UTC
nomsi disables MSI and MSI-X and falls back to legacy interrupts.
Takashi backported selected changes, therefore all of them should be included in 5.8. A number of Gigabyte boards from ~2009 have BIOS bugs, however only different symptoms are known (wrong PHY ID reported).
To rule out EEE issues you could use ethtool to switch off EEE (ethtool --set-eee <if> eee off).
Last but not least you could bisect between 4.12 and 5.3.
Comment 11 Ralf Koelmel 2020-08-30 22:00:56 UTC
It is a board from 2011 with BIOS version F8. There are two newer BIOS versions (F9 and F10, from 2012), but I believe I tried to update the BIOS several years ago and it didn't work ;-) The Energy-Efficient Ethernet setting can't be changed:
"
ethtool --set-eee eth0 eee off
Cannot get EEE settings: Operation not supported
"

As Takashi has written, the changes related to the r8169 driver between the openSUSE distribution kernels 4.12 and 5.3 are only minor (https://bugzilla.opensuse.org/show_bug.cgi?id=1175296#c2).
I will now try pci=nomsi, although it would be interesting to use MSI instead of MSI-X. Can I change this after boot, maybe with the setpci tool?
Comment 12 Ralf Koelmel 2020-08-30 22:37:09 UTC
Created attachment 292247
verbose lspci from the affected network card from kernel 5.3.18 with pci=nomsi

The first impression after some rsync tests with the patched 5.3 kernel and the additional kernel parameter pci=nomsi is positive. So far I have a stable network with good performance (800 Mbps on a 1 Gbit/s link). Legacy interrupts are less efficient than MSI, though: in "perf top" the rtl8169_interrupt function is now very dominant.
As far as I know, this RTL8111E chip supports both MSI and MSI-X.
The question now is why MSI-X is used for this network chipset with kernels >= 5.3, and how this can be changed, perhaps via configuration.
Comment 13 Heiner Kallweit 2020-08-30 22:46:46 UTC
Weird that changing EEE settings doesn't work for you. r8169 supports these ops (at least on 5.3 and 5.8). Or did you try this under 4.12?

There are massive changes between 4.12 and 5.3 with regard to r8169.
See following for the changes between 4.12 and 5.2:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/drivers/net/ethernet/realtek/r8169.c?id=v5.2&id2=v4.12

For using MSI instead of MSI-X you'd have to change the driver in rtl_alloc_irq(): change PCI_IRQ_ALL_TYPES to PCI_IRQ_MSI.
MSI-X is superior to MSI, and the kernel selects the best supported option. Older r8169 versions didn't have support for MSI-X.
Maybe also setpci can be used to disable MSI-X for the device, but I'm not an expert with this tool.

So basically the outcome is that MSI-X is broken on your board.
Comment 14 Ralf Koelmel 2020-08-30 23:34:10 UTC
Created attachment 292249
disabling EEE on kernel 5.3

Sorry, I had tried to disable EEE on 4.12. It works on the 5.3 kernel.
Yes, it seems that MSI-X is not usable on this motherboard, and the 4.x kernel kindly configured MSI.
Thank you for your support!
Comment 15 Ralf Koelmel 2020-08-31 10:43:22 UTC
I have one problem with this change in the r8169 driver (activating MSI-X if the chipset seems to be capable).
This motherboard only supports PCIe 2.0 and can't handle MSI-X, which was first supported with PCIe 3.0.
It would be good to have a module parameter for disabling MSI-X.
Comment 16 Heiner Kallweit 2020-08-31 10:56:50 UTC
MSI-X was introduced with PCI 3.0, not PCIe 3.0. Other MSI-X capable devices may suffer from the same issue when used with this board. Therefore a workaround should be implemented in the PCI core. You could add a chip quirk for disabling MSI-X based on the board DMI info. See e.g. nvenet_msi_disable(), even though this refers to MSI and not MSI-X.
Comment 17 Ralf Koelmel 2020-08-31 12:49:24 UTC
You are right about MSI-X having been introduced with PCI 3.0! I also use an NVIDIA GeForce GTX 580 as a separate card, which has MSI-X enabled, and it has no problems with this Intel Z68 motherboard chipset.
I would like to use MSI with the Realtek chipset, as was the case with the older kernels.
Many drivers have such a module parameter to disable MSI-X. In my opinion that would be the best solution for such a legacy driver, so that backward compatibility isn't broken.
Comment 18 Ralf Koelmel 2020-08-31 18:37:36 UTC
Created attachment 292261
verbose lspci from the affected network card from kernel 5.3.18 with r8169 module parameter msix_disable=1

I have added a module parameter msix_disable, with which I can control whether MSI-X is used.
With a self-compiled kernel based on the recent openSUSE Leap 15.2 kernel 5.3.18r1-lp152.36-default (with Takashi's and my changes), I tested using MSI instead of MSI-X. I also disabled EEE.
But the problem is back again!
So the choice between MSI and MSI-X doesn't seem to be the reason for the regression.
For now only the legacy interrupt is a solution, and that can only be a workaround.
Comment 19 Ralf Koelmel 2020-08-31 18:40:52 UTC
Created attachment 292263
network statistic with kernel 5.3.18 with r8169 module parameter msix_disable=1

A slight error rate is visible.
Comment 20 Ralf Koelmel 2020-08-31 18:47:00 UTC
Created attachment 292265
patch for msix_disable as diff to the Linux kernel stable tree

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/net/ethernet/realtek/r8169.c?id=v5.2
Even if MSI usage isn't a solution, the module parameter could be useful.
Comment 21 Ralf Koelmel 2020-08-31 19:09:46 UTC
Created attachment 292267
stack trace from the system log with kernel 5.3.18 (MSI active)

The error counter is still at 32 (see https://bugzilla.kernel.org/attachment.cgi?id=292263) after 30 minutes. I don't see an outgoing network rate above 500Kbs for an rsync job to an NFS share. It is mostly under 100Kbs.
The watchdog stack trace has appeared again.
What can I analyze in this situation?
Comment 22 Heiner Kallweit 2020-08-31 19:16:39 UTC
What you could do is bisect between last known good and first known bad kernel.
Comment 23 Ralf Koelmel 2020-09-01 08:53:09 UTC
Dear Heiner,
Trying all kernels between 4.12 and 5.3 is hard on a system that has no remote management. I can't do this at the moment; for such tests I must first buy and install a separate network card so that I don't lose the connection.
Because r8169 behaves normally with legacy interrupts, there must be a problem in interrupt handling/initialization and/or data processing when MSI/MSI-X is used, under circumstances that are still unclear.
I've also tested the proprietary Realtek driver r8168 (v. 8.048.03, 2020/05/25), which performs well with MSI on the same 5.x kernel on this board.
So the problem is isolated to r8169, since another driver works.
Maybe you can also look at the code to find the differences.
Thank you for the support !
Best Regards,
Ralf
Comment 24 Heiner Kallweit 2020-09-01 10:33:32 UTC
It's not about trying all kernel versions, it's about doing a git bisect.
There's no other way, as the issue seems to be system-dependent.
r8168 and r8169 are quite different, and r8168 is full of undocumented magic settings. If only this one board type is affected (and no bisect can be done), then it would be best if you simply went with the alternative of using the r8168 driver.
Comment 25 Takashi Iwai 2020-09-01 10:47:44 UTC
Well, even without bisection, we can easily see that the likely culprit is the commit 6c6aa15fdea52aa4a19cc0b5478884af591c9351
    r8169: improve interrupt handling
which enables MSI-X in general for r8169.  This one is merged in 4.17, hence 4.12 doesn't suffer from the problem.

If this should be fixed in a PCI quirk, that's fine, we can go for it.
Comment 26 Heiner Kallweit 2020-09-01 10:52:24 UTC
Didn't you say that you see the issue also when using MSI instead of MSI-X? And only switching to legacy interrupts helps on your system?
Comment 27 Ralf Koelmel 2020-09-01 11:02:03 UTC
(In reply to Heiner Kallweit from comment #26)
> Didn't you say that you see the issue also when using MSI instead of MSI-X?
> And only switching to legacy interrupts helps on your system?

Yes, the problem also occurs with MSI.
Comment 28 Heiner Kallweit 2020-09-01 11:14:50 UTC
MSI was also used before the commit you're referring to, therefore it's not necessarily the change that causes the issue on your system.
Comment 29 Takashi Iwai 2020-09-01 11:31:20 UTC
Hrm, I'm confused.  The patch in comment 20 still enables MSI, and this worked around the problem, no?
Comment 30 Ralf Koelmel 2020-09-01 11:59:06 UTC
(In reply to Takashi Iwai from comment #29)
> Hrm, I'm confused.  The patch in comment 20 still enables MSI, and this
> worked around the problem, no?

No, the patch to activate MSI doesn't help.
Comment 31 Takashi Iwai 2020-09-01 12:04:59 UTC
Ah OK, then it can be a different cause.

You can check on the 4.12 kernel which kind of interrupt is actually used there.  If it's MSI and still works, there must be another cause of the breakage than MSI-X enablement, indeed.
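One way to see which interrupt type is active is to look at the MSI and MSI-X capability lines that lspci -vv prints for the device. The sketch below classifies such output; the function name is hypothetical, and the two sample lines are written by hand to show the usual layout of those capability lines, not copied from the attachments of this bug.

```shell
# Report the active interrupt mode from the `lspci -vv` output of ONE device,
# read on stdin. Function name is hypothetical.
irq_mode() {
    awk '
        /MSI-X: Enable\+/ { msix = 1 }
        /MSI: Enable\+/   { msi = 1 }
        END {
            if (msix)     print "msi-x"
            else if (msi) print "msi"
            else          print "legacy"   # neither capability is enabled
        }'
}

# Hand-written sample capability lines: MSI enabled, MSI-X disabled.
printf '%s\n' \
    'Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+' \
    'Capabilities: [b0] MSI-X: Enable- Count=4 Masked-' |
    irq_mode
```

On a real system the input would be something like `lspci -vv -s <bus:dev.fn> | irq_mode`, run as root so that the capability list is readable; without the capability lines the function falls through to "legacy", which would then be misleading.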
Comment 32 Ralf Koelmel 2020-09-01 12:25:32 UTC
Created attachment 292277
list of patches which are integrated into the distribution kernel of Leap 15.2

(In reply to Takashi Iwai from comment #31)
> Ah OK, then it can be a different cause.
> 
> You can check 4.12 kernel what kind of interrupt is actually used there.  If
> it's MSI and still works, there must be another cause of the breakage than
> MSI-X enablement, indeed.
Dear Takashi,
the Leap 15.1 kernel 4.12 uses MSI and doesn't have a problem.
But you have integrated some patches into the 5.3 kernel which keep the network alive with MSI and also with MSI-X. Keep in mind that a 5.3 kernel with these patches is better than a 5.8 kernel, whose network goes dead after some time!
Is the processing really different between MSI and MSI-X? From my understanding it shouldn't be. The difference is more on the hardware side than on the software side, but I'm not a developer. For me it isn't a surprise that both modes fail if it is a driver problem. Maybe it is a firmware problem? Has the firmware changed between 4.12 and 5.3 in openSUSE?
I have attached a log of your patches. Is this list complete, or have I missed something?
Comment 33 Takashi Iwai 2020-09-01 15:27:03 UTC
The commits we've backported on top of 5.3 are:
d4ed7463d02aef4b2270ec2a680813cd8b17def7
bcf2b868a5ae8b9b332176b65247953176630990
7366016d2d4c7b2e5168db6fa7920fa094561db5
4ebcb113edcc498888394821bca2e60ef89c6ff3
62bdc8fd1c21d4263ebd18bec57f82532d09249f
9c6850fea3edefef6e7153b2c466f09155399882
14012c9f3bb922b9e0751ba43d15cc580a6049bf
398fd408ccfb5e44b1cbe73a209d2281d3efa83c
0fc75219fe9a3c90631453e9870e4f6d956f0ebc
f325937735498afb054a0195291bbf68d0b60be5
f13bc68131b0c0d67a77fb43444e109828a983bf
2e8c339b4946490a922a21aa8cd869c6cfad2023
1f8492df081bd66255764f3ce82ba1b2c37def49
Comment 34 Takashi Iwai 2020-09-03 16:47:25 UTC
Also, I have a collection of kernel packages in OBS.  e.g. 5.2.x kernel is found in OBS home:tiwai:kernel:5.2,
  http://download.opensuse.org/repositories/home:/tiwai:/5.2/standard/

Those home:tiwai:kernel:* repos contain the latest stable kernel build for each 4.x and 5.x kernel.  You can try installing them one by one and check with which kernel version the regression started.  That may help to narrow down the regression range.
Comment 35 Ralf Koelmel 2020-09-03 17:55:19 UTC
Dear Iwai,
I will test the kernel versions from 4.13 to 5.3 with your packages by bisecting, as soon as I have installed a second network adapter. It may take a few weeks before I report back, but I promise not to forget this issue ;-)
Comment 36 Ralf Koelmel 2020-09-03 18:22:33 UTC
Dear Takashi,
Am I right that I must test the 4.x kernels on Leap 15.1, and that 5.x kernels can only be tested on Leap 15.2?
Comment 37 Ralf Koelmel 2020-09-03 20:28:15 UTC
The regression first occurs with 5.1.
I've tested kernels 4.18 and 4.20 (both using MSI-X) without problems. In "perf top" rtl8169_poll is the topmost function under higher network load.
Then I tested 5.0 (using MSI-X), also without problems.
There, rtl8169_interrupt is the topmost function in "perf top" under higher network load.
With 5.1 there is again no network throughput, and no "rtl8169_XXX" function shows up in "perf top".
Comment 38 Heiner Kallweit 2020-09-23 09:20:16 UTC
The issue on your system may be related to 288ac524cf70 ("r8169: disable default rx interrupt coalescing on RTL8168"). So you could try to enable rx interrupt coalescing via ethtool -C.
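For illustration, an ethtool -C call for rx coalescing could look like the sketch below. The wrapper function and its DRY_RUN switch are hypothetical, and the values 100 and 4 are arbitrary placeholders, not settings recommended anywhere in this report; whether the driver accepts a given combination on this chip is not verified here.

```shell
# Build (and optionally run) an ethtool call that sets rx interrupt
# coalescing. DRY_RUN defaults to 1, so the sketch only prints the command.
set_rx_coalesce() {
    ifname=$1
    usecs=$2
    frames=$3
    cmd="ethtool -C $ifname rx-usecs $usecs rx-frames $frames"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$cmd"
    else
        $cmd    # needs root and a driver that supports these parameters
    fi
}

set_rx_coalesce eth0 100 4   # placeholder values, not recommendations
```

The current settings can be read back with `ethtool -c eth0` before and after the change.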
Comment 39 Heiner Kallweit 2020-10-02 07:26:34 UTC
It would still be best to bisect the issue (between 5.0 and 5.1). Can you do this (as I can't reproduce the issue)?
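A bisect between the two releases needs far fewer test boots than there are commits, roughly the base-2 logarithm of the commit count. The sketch below estimates that number and outlines the session; both function names are hypothetical, the commit count of 13000 is a made-up figure for illustration, and the session assumes a clone of the mainline or stable kernel tree as the current directory.

```shell
# Rough number of kernels to build and test: about log2 of the commit count.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))     # each test halves the remaining range
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# The bisect session itself; defined here but deliberately not invoked.
bisect_5_0_to_5_1() {
    git bisect start v5.1 v5.0    # bad revision first, then the good one
    # For each revision git checks out: build, boot, test the network,
    # then report the result with one of:
    #   git bisect good
    #   git bisect bad
    # Once git names the first bad commit, finish with:
    #   git bisect reset
}

bisect_steps 13000   # illustrative commit count, not the real v5.0..v5.1 figure
```

With the illustrative count this comes to 14 test kernels. The range can be narrowed further with `git bisect start v5.1 v5.0 -- drivers/net/ethernet/realtek`, but that only finds the culprit if the regression really comes from a change inside the driver directory.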
