Bug 203703
Summary: | 5.1 regression makes r8169 Ethernet connection inoperable if fq_codel qdisc is used | ||
---|---|---|---|
Product: | Networking | Reporter: | Sergey Kondakov (virtuousfox) |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | high | CC: | dahlialarson26, dave.taht, hkallweit1, samwaters1307, stalkerg, varunintern23, wgh, xiyou.wangcong |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.1.4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
kernel config
r8169_stuck-on_5.9.1.txt |
Description
Sergey Kondakov
2019-05-25 04:55:19 UTC
Root cause of the issue could be in quite different places in the network stack. It's not necessarily an r8169 issue. Best would be if you could bisect the issue.

Hmm, you narrowed down the problem to fq_codel, which is great. But I don't see any relevant changes to fq_codel between 5.0 and 5.1. So, the cause is probably somewhere else. As Heiner suggested, a bisect would help a lot.

Indeed, it may not be directly related to fq_codel. Today I've noticed a few strange things about it:
1) During the last boot everything worked for more than an hour without me changing the qdisc or resetting the network. But there still was that dmesg message under 10 minutes after boot. I couldn't see where it went for long: my apartment got its electricity cut off.
2) If the issue has manifested, then just changing the qdisc or removing and inserting the r8169 module doesn't help. Or, at least, isn't guaranteed to help. Usually, I have to "disable" and "enable" "networking" in NetworkManager one or several times, whatever it does, and then change the qdisc to be safe until the next boot.

My current binary distribution is as unsuited as they come for kernel recompiling and manual reinstalling (I use OBS servers to auto-build & update the kernel package with my config), so I planned to at least wait for more minor or one major kernel release before going for a full bisect, in the hope that it's a consequence of a known mishap. I even thought that maybe it's a result of openSUSE pushing some raw patches against the latest Intel vulnerabilities. By the way, you may check them out at https://github.com/openSUSE/kernel-source/tree/stable/patches.suse

The builds are in:
https://build.opensuse.org/project/monitor/home:X0F:HSF:Kernel - my fork
https://build.opensuse.org/package/binaries/home:X0F:HSF:Kernel/kernel-HSF/standard - that package in particular

Just updated to 5.2. The bad news is that the aforementioned warning didn't go away. The "good" (?)
news is that it showed itself at about 6 hours of uptime and currently, at 18 hours, the complete glitch has not happened yet. I've noticed that fq_codel does a lot of requeues and new_flow_count is constantly rising, but I don't remember if it's different from how it behaved prior to the glitch. For some reason there are also packet drops on a machine with a 6-thread FX-6100 CPU and a 50 Mbit/s connection.

Here is the `tc qdisc` dump at ~18 hours of uptime with only mild Youtube watching and a Tor relay (capped at 128/512 KB rate/burst) in the background:

qdisc fq_codel 0: dev enp4s0 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
 Sent 3225700533 bytes 3411973 pkt (dropped 18, overlimits 0 requeues 79903)
 backlog 0b 0p requeues 79903
  maxpacket 2846 drop_overlimit 0 new_flow_count 40713 ecn_mark 0
  new_flows_len 0 old_flows_len 0

The requeues, new_flow_count, and dropped stats show fq_codel working as designed. I think.

However I'm puzzled as to why you'd have any drops at all at a 50mbit "rate". Does the device imposing that limit use pause frames?

(normally we shape using sqm-scripts or these days via cake to slightly below the provided rate)

Two questions regarding the affected system:
1. Which RTL8168 chip version is it? (dmesg | grep XID)
2. TSO activated?

At least one chip version (RTL8168evl) has a hw issue with TSO, therefore it's disabled per default.

(In reply to Dave Täht from comment #5)
> The requeues, new_flow_count, and dropped stats show fq_codel working as
> designed. I think.
>
> However I'm puzzled as to why you'd have any drops at all at a 50mbit
> "rate". Does the device imposing that limit use pause frames?
>
> (normally we shape using sqm-scripts or these days via cake to slightly
> below the provided rate)

So, my PC with r8169 is connected to a cheapo 100mbit switch which is connected to a Huawei EchoLife HG8245 GPON Terminal that is connected via fiber to an ISP that limits bandwidth to 50Mbits somehow.
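
For comparing counters such as the `dropped` and `requeues` values from the fq_codel dump in this thread between runs, the `Sent ...` line can be pulled apart with a small parser. This is only a sketch: the function name `parse_sent_line` is hypothetical, and the pattern assumes the exact line layout quoted in this report, which other `tc` versions may not share.

```python
import re

# Pattern for the "Sent ..." summary line as it appears in this report.
# Assumption: other tc versions may print this line differently.
_SENT_RE = re.compile(
    r"Sent (?P<bytes>\d+) bytes (?P<pkts>\d+) pkt "
    r"\(dropped (?P<dropped>\d+), overlimits (?P<overlimits>\d+) "
    r"requeues (?P<requeues>\d+)\)"
)


def parse_sent_line(text):
    """Return the counters of the first 'Sent ...' line as ints, or None."""
    match = _SENT_RE.search(text)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

Feeding it two dumps taken some time apart and subtracting the dictionaries shows how fast drops and requeues grow, which is easier to compare than absolute totals.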
I recently switched to cake and I was able to reproduce the problem with it… but I forgot exactly how. Now I have:
tc qdisc replace dev ${interface} root cake ethernet ether-vlan rtt 100ms flows diffserv4 ack-filter split-gso

I think it was setting 50ms (I can get about 40-60ms to some big in-country sites but google stuff is about 90ms) and autorate-ingress. Judging by statistics, it was dropping and removing a whole bunch of acks at the time. But now it's working fine for days with statistics like this:

qdisc cake 8003: dev enp4s0 root refcnt 2 bandwidth unlimited diffserv4 flows nonat nowash ack-filter split-gso rtt 100.0ms noatm overhead 42 mpu 84
 Sent 1105703468 bytes 1264192 pkt (dropped 3, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 162Kb of 15140Kb
 capacity estimate: 0ibit
 min/max network layer size: 28 / 1475
 min/max overhead-adjusted size: 84 / 1517
 average network hdr offset: 14

                   Bulk  Best Effort        Video        Voice
thresh            0ibit        0ibit        0ibit        0ibit
target            5.0ms        5.0ms        5.0ms        5.0ms
interval        100.0ms      100.0ms      100.0ms      100.0ms
pk_delay           12us          8us         14us          9us
av_delay            4us          3us          4us          4us
sp_delay            2us          2us          2us          3us
backlog              0b           0b           0b           0b
pkts            1019412       238119         6102          562
bytes        1065322775     36965315      3368008        51779
way_inds          20562         4859            0            0
way_miss          99399        10891           19          155
way_cols              0            0            0            0
drops                 3            0            0            0
marks                 0            0            0            0
ack_drop              0            0            0            0
sp_flows              1            6            0            1
bk_flows              0            0            0            0
un_flows              0            0            0            0
max_len            2816        11672          554          590
quantum            1514         1514         1514         1514

(In reply to Heiner Kallweit from comment #6)
> Two questions regarding the affected system:
> 1. Which RTL8168 chip version is it? (dmesg | grep XID)
> 2. TSO activated?
>
> At least one chip version (RTL8168evl) has a hw issue with TSO, therefore
> it's disabled per default.
1) RTL8168evl/8111evl, XID 2c9, IRQ 42 2) Yes but I think that it manifested before I recently started enabling everything I could via ethtool in a boot-up script: `ethtool -K ${interface} tx-nocache-copy on rx on tx on sg on tso on ufo on gso on gro on lro on rxvlan on txvlan on ntuple on rxhash on` (In reply to Sergey Kondakov from comment #7) > (In reply to Dave Täht from comment #5) > > The requeues, new_flow_count, and dropped stats show fq_codel working as > > designed. I think. > > > > However I'm puzzled as to why you'd have any drops at all at a 50mbit > > "rate". Does the device imposing that limit use pause frames? > > > > (normally we shape using sqm-scripts or these days via cake to slightly > > below the provided rate) > > So, my PC with r8169 is connected to cheapo 100mbit switch which is > connected to Huawei EchoLife HG8245 GPON Terminal that is connected via > fiber to ISP that limits bandwidth to 50Mbits somehow. > > I recently switched to cake and I was able to reproduce the problem with it… > but I forgot exactly how. Now I have: tc qdisc replace dev ${interface} root > cake ethernet ether-vlan rtt 100ms flows diffserv4 ack-filter split-gso > > I think it was setting 50ms (I can get about 40-60ms to some big in-country > sites but google stuff is about 90ms) and autorate-ingress. Judging by > statistics, it was dropping and removing whole bunch of acks at the time. 
> But now it's working fine for days with statistics like this: > qdisc cake 8003: dev enp4s0 root refcnt 2 bandwidth unlimited diffserv4 > flows nonat nowash ack-filter split-gso rtt 100.0ms noatm overhead 42 mpu 84 > Sent 1105703468 bytes 1264192 pkt (dropped 3, overlimits 0 requeues 0) > backlog 0b 0p requeues 0 > memory used: 162Kb of 15140Kb > capacity estimate: 0ibit > min/max network layer size: 28 / 1475 > min/max overhead-adjusted size: 84 / 1517 > average network hdr offset: 14 > > Bulk Best Effort Video Voice > thresh 0ibit 0ibit 0ibit 0ibit > target 5.0ms 5.0ms 5.0ms 5.0ms > interval 100.0ms 100.0ms 100.0ms 100.0ms > pk_delay 12us 8us 14us 9us > av_delay 4us 3us 4us 4us > sp_delay 2us 2us 2us 3us > backlog 0b 0b 0b 0b > pkts 1019412 238119 6102 562 > bytes 1065322775 36965315 3368008 51779 > way_inds 20562 4859 0 0 > way_miss 99399 10891 19 155 > way_cols 0 0 0 0 > drops 3 0 0 0 > marks 0 0 0 0 > ack_drop 0 0 0 0 > sp_flows 1 6 0 1 > bk_flows 0 0 0 0 > un_flows 0 0 0 0 > max_len 2816 11672 554 590 > quantum 1514 1514 1514 1514 It looks like this box is originating traffic, and not acting as a router? If your intent was to shape traffic the topology would be more like gpon - this box - switch - your other boxes and you'd set cake's bandwidth param a bit below 50mbit. (and also shape inbound) diffserv4 is primarily intended for video-heavy loads. ack-filtering doesn't do much unless you are on an asymmetric connection (1gb down, 50mbit up) > (In reply to Heiner Kallweit from comment #6) > > Two questions regarding the affected system: > > 1. Which RTL8168 chip version is it? (dmesg | grep XID) > > 2. TSO activated? > > > > At least one chip version (RTL8168evl) has a hw issue with TSO, therefore > > it's disabled per default. 
>
> 1) RTL8168evl/8111evl, XID 2c9, IRQ 42
> 2) Yes but I think that it manifested before I recently started enabling
> everything I could via ethtool in a boot-up script: `ethtool -K ${interface}
> tx-nocache-copy on rx on tx on sg on tso on ufo on gso on gro on lro on
> rxvlan on txvlan on ntuple on rxhash on`

At 100mbit you really don't want or need gro/tso etc. It just bulks packets up for no reason (and cake automagically splits them again). If it's off by default, leave it off. That said if it's on by default and causing problems, well...

(In reply to Dave Täht from comment #8)
>
> It looks like this box is originating traffic, and not acting as a router?
>
> If your intent was to shape traffic the topology would be more like
>
> gpon - this box - switch - your other boxes
>
> and you'd set cake's bandwidth param a bit below 50mbit. (and also shape
> inbound)
>
> diffserv4 is primarily intended for video-heavy loads.
>
> ack-filtering doesn't do much unless you are on an asymmetric connection
> (1gb down, 50mbit up)
>

Yes, it's not a router. What I would like to achieve is to automatically and heuristically manage up- & down-stream queues and recognize & mark QoS classes appropriately, so that in case of sudden congestion or system overload, less important ones may be sacrificed by delays or drops. On the endpoints. Not just because nothing can be done about the ISP's GPON boxes but because the actual originator/receiver "knows" best what's what and how it should be handled, the ISP's internal backbone shenanigans notwithstanding. You know, a network scheduler.

I don't know much about ack-filtering but it was hyped up by its creators somewhere in https://forum.netgate.com/topic/112527/playing-with-fq_codel-in-2-4/410 as something that squeezes out some benefits without any downsides.

> At 100mbit you really don't want or need gro/tso etc. It just bulks packets
> up for no reason (and cake automagically splits them again). If it's off by
> default, leave it off.
> That said if it's on by default and causing problems,
> well...

So, I will try disabling those explicitly and bring fq_codel back to see if it hangs again.

HOWEVER… "If it's off by default, leave it off" is NOT a good axiom to follow, especially on Linux. Many if not most things are inadequately configured by default, especially for desktop. Such as KMS TearFree of AMD and Intel GPUs being disabled by default, bluez not compiling in SixAxis/DualShock support, autogroup being enabled and screwing up actual process prioritization, kernel.sched_rr_timeslice_ms being defaulted to an embarrassing 100ms with CFS granularity being on par with it, input, audio and GPU kernel processes not being of highest FIFO realtime priority, high-hz mice not using their actual proper polling rate unless forced, or never (bug#60586 on USB2 but !3 and bug#82571 - on USB3 but !2), PulseAudio not even trying to use available 24-32bit 48-192khz or figure out the optimal minimal buffer/latency (talk about "buffer-bloat", huh…) from what the hardware advertises, and so on.

Low-latency high-throughput networking is nice, perpetually desirable, and _should_ be used all the time if available, BUT I would gladly sacrifice networking latency tenfold and bandwidth in half if otherwise it would cause a frame-length stutter on my video output, delay/skip input or, worse, create an audio buffer under/over run / crackle.

Right now if I enable my ath9k WiFi on this PC, my very nice stable 8ms 32b/96khz in hardware / float/192khz (software limiter processing stage requirement) in software (JACK pipeline) audio may start to throw occasional x-run errors and crackle. If I disable MSI on it then it will ruin everything with its interrupt storm, even though the system can handle complete saturation of all CPU cores with multimedia load and have 0 x-runs. But on my old laptop enabling MSI on ath9k WiFi hangs it… or something, I don't remember but I gave up on it completely.
Thankfully, Ethernet in general and r8169 in particular don't seem to have this problem (unless that newest 003bd5b4a7b4a94b501e3a1e2e7c9df6b2a94ed4 patch of the 5.2.8 release from bug#204079 changes it…). But interactive I/O and A/V latencies always stand above network performance. There should be no network-related kernel workload that could compromise those, ever, which means that if a buffer needs to get bigger and more stuff needs to be offloaded to the PHY then so be it. It should stand in line just behind kernel interactive input/output & userland multimedia processing but before all of the latency-insensitive muck. A/V networking and wireless I/O would be a "grey area" in this.

The point is that, for now, it would be very foolish to blindly trust developer defaults in anything unless the reasoning behind them is clearly defined and factually correct. They often have very strange and… single-minded priorities and compromises.

That was a good rant, thank you. I too have issues with linux audio. I used to work on ardour until it became too hard to get a good result.

I'm incidentally one of the authors of the latest ath9k code, but have not tried to run it under an R/T kernel. The current default of a 300 byte quantum creates a lot of extra processing, I'd try a MTU or two instead.

However, signalling intent over the endpoints as you are trying to do won't work. The bottleneck router needs to share the same interpretation of your intent and it doesn't. Stick an (for example - many others to choose from) edgerouter X w/openwrt in front of the gpon box and setup cake and see what you get.

Also, sometimes, the devs are right. If this card is buggy with gso, don't use it.

(In reply to Dave Täht from comment #10)
> That was a good rant, thank you. I too have issues with linux audio. I used
> to work on ardour until it became too hard to get a good result.
>
> I'm incidentally one of the authors of the latest ath9k code, but have not
> tried to run it under an R/T kernel.
> The current default of a 300 byte
> quantum creates a lot of extra processing, I'd try a MTU or two instead.
>
> However, signalling intent over the endpoints as you are trying to do won't
> work. The bottleneck router needs to share the same interpretation of your
> intent and it doesn't. Stick an (for example - many others to choose from)
> edgerouter X w/openwrt in front of the gpon box and setup cake and see what
> you get.
>
> Also, sometimes, the devs are right. If this card is buggy with gso, don't
> use it.

Thanks ! I'm really baffled that among:
* audio/kernel (DAC/ADC drivers);
* audio/userspace (JACK, PA, realtime processing pipeline, apps);
* video/kernel (CPU & GPU RAM management, GPU drivers);
* video/userspace (libdrm, Mesa, LLVM, X11/compositor, Wayland, apps and their rendering pipelines);
* input/kernel (USB, HID, sensors);
* input/userspace (libevdev, libinput, apps);
* network/kernel (TCP/IP & WiFi & BT stacks, filtering, qdisc/shaping?, drivers);
* network/userspace (apps);
network may be considered anything but lowest priority for CPU time. Although, with interactive/realtime traffic and remote I/O (such as wireless input devices and displays) the point does become moot. Same goes for storage, where random reads are much-much-much more important than writes even though prioritising reads may horribly screw up sequential writes, but doing otherwise may halt any or all of the above.

For audio I really hope for PipeWire & Gstreamer success as a universal userspace realtime A/V stream pipelining framework to replace PA & JACK, and maybe give a push to ffmpeg, conventional DAWs and fancy, heavily-processing video players (mpv, vlc) & editors. All of those should scale better with the system's parallelization and acceleration capabilities; JACK's LV2 filters, for example, aren't great with threading at all. But userspace can't do much if the kernel's priorities are all upside down.
My kernel isn't from RT branch, that one updates too slow nowadays and its benefits are not clear. But I used RT priority on all latency-sensitive processes and rearranged all I could to minimize I/O latency of GPU, DAC & JACK and USB. Which does wonders on Linux even on such an old system. Windows has 1up over it in one case though: Bluetooth and Sony DualShock gamepads. There is no DS4 polling rate controls on Linux and BT stack is finicky to off-brand controllers. And what have they done to DS3's multitude of pressure sensors for each button… just ignored all of them in kernel ! But to the point: "quantum" as in codel's and cake's 'quantum' setting or is there something similar on more generic driver-level ? I may try so in codel but doesn't cake lack 'target' and 'quantum' settings ? I'm thinking about also setting target to 15-20ms by default instead of 5ms, as you've suggested once here: https://github.com/systemd/systemd/issues/9725#issuecomment-414079283 If I had actually latency-sensitive remote devices, I would want 4-8ms. For DS4 via BT on Windows I use 7ms polling with 1ms on wired devices. With things like mice and styli you would want to keep it to the minimum at all times. But those still come with their WiFi/BT mystery-dongles, so that's irrelevant for now. Router indeed is the problem that may always screw up any attempt in anything. Especially if it's ISP's "blackbox" that they force you to "buy" but even then try to lock it with their password which, I'm pretty sure, is technically illegal but so is country-wide censorship and surveillance and that did not stop them… But can another box before ISP's make anything better ? What I'm worried about is: * proper congestion control for the whole path, mainly ECN; * distinction for realtime/interactive traffic (remote I/O, realtime A/V streaming/conferencing, gaming) and the rest. 
However, it's not just about transit points dropping or slowing down different classes of traffic, it's also about the behaviour of the system under 100% CPU load or RAM exhaustion, momentary and constant. Ideally, it should delay (dynamically increase buffers), then drop traffic of lowest priority, then all of it, then halt network driver and stack activity, and only after that affect the rest. It should not be an "all or nothing" situation where either the network stack always gets what it wants and nothing else does, or it drops all of its activity during any interference.

We can't expect ISPs to behave sanely until state auditory agencies actually start doing their job of technical quality control for communication infrastructure instead of whatever they are doing now. So it all may be an exercise in futility. But at least we may control endpoints.

If something is assuredly buggy then it should be blacklisted and not passed to userspace controls. If defaults were sane it would have instilled actual trust in developers' intentions, but right now only distro maintainers bring some kind of sanity, even when they don't understand much more than a random user. For example, by default the entire kernel is configured to the bare minimum, as per Linus' own instructions, and debug options are discouraged from being enabled even by maintainers. Devs _actually_ expect all people to make their own situational builds after encountering every single problem, which is ridiculous. So you don't even know which "debug" options just add some useful verbose lines and use some insignificant amount of RAM, and which will crap out your entire log and/or slow down the whole system 10 times over: actual debug versus paranoia. And all that goes with a "safe for a 2-core per 4-sockets 10-year old headless remote-controlled web-server with a fiber-connected high-latency storage-array" default configuration.

More to the point: I tested with codel and TSO disabled, then with cake and TSO/GSO/GRO disabled, but it still halted.
cake seems to be able to actually recover (at least partially) after a minute or a few on its own, with increased "delay" stats and some drops. And it doesn't trigger it as fast as codel. I `rmmod r8169` just in case and re-load the module. But in the "stuck" state it doesn't recover right after a module & qdisc reload, only seconds or a minute later. I can't find a correlation between settings, network activity (even a hundred or a few KBs is enough to suddenly "choke" after gobbling up tens of gigabytes at the full 50mbits for hours) and the time it takes for the halt to manifest, only that cake uses codel's 'flows' and I'm not keen on changing that.

So, after fiddling around with all kinds of networking parameters and trying to revert r8169 changes between 5.0 and 5.1, I think I have actually figured out what triggers the issue, and it's these being set too low:
* net.core.netdev_budget_usecs lower than 500
* net.core.netdev_budget lower than 100
* net.core.dev_weight lower than 30

The lower they are, the faster the qdisc kernel warning comes and then all transfers on the interface halt; after a while some packets come through but the backlog is getting filled and many or most packets are getting dropped (rx_missed rises in `ethtool -S enp4s0` and drops in `tc qdisc show`). It may recover after 5-10 minutes, but to make sure that it did so fully I have to reinsert the r8169 module (even then, it may not start working normally for tens of seconds), and if the parameters aren't changed, it will happen again.

Strangely, when the issue manifests, coalescence parameters get silently dropped to 'rx-frames 1 tx-frames 1 rx-usecs 0 tx-usecs 0' from anything that they might have been set to. But, otherwise, manual ethtool changes of any parameters do not seem to influence the issue, except maybe, again, too conservative coalescence such as 'rx-usecs 250 tx-usecs 500' or 'rx-frames 22 tx-frames 44'.
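
The three thresholds listed above can be turned into a quick check before a long test run. This is a minimal sketch, assuming the limits (500, 100, 30) are nothing more than this report's empirical observations on this hardware, not documented kernel minimums; the names `OBSERVED_MINIMUMS`, `read_sysctls` and `risky_napi_sysctls` are hypothetical.

```python
from pathlib import Path

# Lower bounds observed in this bug report only. Assumption: they are
# specific to the reporter's setup and are not kernel-documented limits.
OBSERVED_MINIMUMS = {
    "net.core.netdev_budget_usecs": 500,
    "net.core.netdev_budget": 100,
    "net.core.dev_weight": 30,
}


def read_sysctls(names, root="/proc/sys"):
    """Read integer sysctls from procfs; unreadable entries are skipped."""
    values = {}
    for name in names:
        path = Path(root, *name.split("."))
        try:
            values[name] = int(path.read_text().split()[0])
        except (OSError, ValueError, IndexError):
            continue
    return values


def risky_napi_sysctls(values):
    """Return the sorted names of settings below the observed thresholds."""
    return sorted(
        name
        for name, minimum in OBSERVED_MINIMUMS.items()
        if name in values and values[name] < minimum
    )
```

Calling `risky_napi_sysctls(read_sysctls(OBSERVED_MINIMUMS))` on the affected machine lists whichever of the three settings sit below the values that triggered the halt here; an empty list says nothing about other hardware.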
During normal operation, I have never seen the backlog be non-zero, codel showing its 'memory_used' stat, or anything being dropped or missed.

So, in the end, I had to put more generous values in those sysctl parameters and increase my audio buffering from 8 to 12 ms, because otherwise perfectly dropless audio became unsustainable due to that or some other tuning. But WiFi does not seem to influence audio stability anymore either. I did:
* net.core.dev_weight=39
* net.core.dev_weight_rx_bias=3
* net.core.dev_weight_tx_bias=2
* net.core.netdev_budget=117
* net.core.netdev_budget_usecs=751
* net.core.rps_sock_flow_entries=1024
* /sys/class/net/*/queues/rx-0/rps_flow_cnt = 1024 (there is only 1 rx queue)
* ethtool -K enp4s0 tx-nocache-copy off rx on tx on sg on ufo on gro on gso off tso on lro on rxvlan on txvlan on ntuple on rxhash on
* ethtool -C enp4s0 rx-frames 0 tx-frames 0 rx-usecs 125 tx-usecs 250 adaptive-rx on adaptive-tx on (but adaptive doesn't apply and real Rus/Tus is 120/240)
* tc qdisc replace dev enp4s0 root fq_codel limit 102400 flows 10240 target 15ms interval 100ms quantum 3028 ecn
* tc qdisc replace dev wlp3s0 root fq_codel limit 102400 flows 5120 target 20ms interval 100ms quantum 2327 ecn (maxmtu of wifi for some reason is 2304 so I've put quantum at ~1.01 of that "just to make sure", as codel by default puts ethernet's quantum to ~1.01 of 1500 mtu)
* tc qdisc replace dev lo root fq_codel limit 102400 flows 1024 target 4ms interval 20ms ecn (for local DNS caching and sockets of daemons)

Seems fine now. Honestly, I'm no longer sure if it was a regression or if a change in those sysctl parameters just coincided with the kernel update. Still, it's not adequate behaviour on the kernel's part to hang all network activity semi-randomly like that.

Puh, impressive analysis. From an r8169 maintainer's perspective it's of course good news that the issue doesn't seem to be caused by a driver bug. I agree that the kernel should handle the situation more gently.
Question is, which sub-system / module this would refer to. (In reply to Heiner Kallweit from comment #13) > Puh, impressive analysis. From a r8169 maintainers perspective it's of > course good news that the issue doesn't seem to be caused by a driver bug. > I agree that kernel should handle the situation more gentle. Question is, > which sub-system / module this would refer to. Thanks. Have you been able to reproduce it too ? After figuring out safe parameters I haven't had a single network "halt", dropped IP packet, missed Ethernet frame, video frame stutter or audio dropout even under full system stress-load on an old clunker of mine. Even requeues are almost non-existent, had to sacrifice some audio latency but now all network, graphics, audio and inputs at their peak interactivity. Youtube player in FF fills its buffer with big chunks so fast that it waits most of the time for need of the next "portion" with zero network activity and downloading speed from German openSUSE distro update build-servers increased from ~300KB to ~1,5MB that shows that either old parameters were that bad (which is most likely), they recently upgraded their servers or some transit line on path from Russia's midland to Germany got better. Local DNS caching is also finally working correctly, previously I thought that there is some bug in unbound that makes it fail semi-randomly. On both fq_codel and cake, cake even increases its quantum from 300 to 1514 in WiFi connection by itself. Although, with more relaxed audio latency, it may not be a problem even if it wouldn't. 
I doubt it's relevant but here are some sysctl overrides that I also did:
net.core.netdev_tstamp_prequeue=0
net.core.somaxconn=4096
net.core.optmem_max=1048576
net.core.rmem_default=1048576
net.core.rmem_max=134217728
net.core.wmem_default=1048576
net.core.wmem_max=134217728
net.ipv4.tcp_mem=32768 196608 262144
net.ipv4.tcp_rmem=16384 1048576 134217728
net.ipv4.tcp_wmem=16384 1048576 134217728
net.ipv4.udp_mem=32768 196608 262144
net.ipv4.udp_rmem_min=16384
net.ipv4.udp_wmem_min=16384
net.ipv4.tcp_mtu_probing=2
net.ipv6.conf.default.mtu=1480
net.ipv4.tcp_min_snd_mss=48
net.ipv4.tcp_base_mss=256
net.ipv4.tcp_timestamps=2
net.ipv4.tcp_reordering=2
net.ipv4.tcp_max_reordering=1000
net.ipv4.tcp_tso_win_divisor=25
net.ipv4.tcp_tw_reuse=2
net.ipv4.tcp_ecn=1
net.ipv4.tcp_allowed_congestion_control=bbr reno cubic scalable highspeed bic cdg dctcp westwood hybla htcp vegas nv veno lp yeah illinois
net.ipv4.tcp_available_congestion_control = reno bbr bic cdg cubic dctcp westwood highspeed hybla htcp vegas nv veno scalable lp yeah illinois
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.ip_local_port_range=18000 65535
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_comp_sack_nr=132
net.unix.max_dgram_qlen=8192

On my test systems used for network driver development I never saw this problem. However they are only slightly loaded and don't have a complex setup like yours.

Ah, I've changed the motherboard to one for the LGA2011 socket which has 'r8169 0000:0d:00.0 eth0: RTL8168evl/8111evl, XID 2c9, IRQ 66' on it and the issue returned (not as aggressively, without kernel warnings but still with needless requeuing, delaying or dropping of packets/connections) even on kernel 5.9.1, but this time no amount of fiddling with any system options seems to help. I've resorted to replacing r8169 with the out-of-tree r8168 and it works flawlessly so far.

Do you run the system with forced interrupt threading (CONFIG_PREEMPT_RT or threadirqs command line parameter)?
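
Whether the `threadirqs` parameter is really present on the running kernel's command line can be checked rather than remembered. A minimal sketch follows; the helper names `has_threadirqs` and `read_cmdline` are hypothetical, and it looks only at the command line, so a kernel that forces interrupt threading through its build configuration would need a separate check.

```python
def has_threadirqs(cmdline):
    """True if the bare 'threadirqs' token is on the kernel command line."""
    # Split on whitespace so that longer tokens which merely contain the
    # substring are not counted as a match.
    return "threadirqs" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On a live system, `has_threadirqs(read_cmdline())` answers the command-line half of the question above.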
(In reply to Heiner Kallweit from comment #18)
> Do you run the system with forced interrupt threading (CONFIG_PREEMPT_RT or
> threadirqs command line parameter)?

Yes, both, in fact: CONFIG_PREEMPT_RT and CONFIG_IRQ_FORCED_THREADING, for the sake of low-latency audio and vsync/compositing. Although, I don't see the 'threadirqs' parameter now, I probably thought that it was implied by CONFIG_IRQ_FORCED_THREADING or something else and removed it.

OK, there's a general known issue with napi_schedule_irqoff() under forced irq threading. See the following very recent fix:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=424a646e072a887aa87283b53aa6f8b19c2a7bef
It should be backported to the stable versions very soon. Can you apply this patch and re-test with r8169?

Explicitly added the 'threadirqs' parameter and applied the linked patch to a new 5.9.1 build, used the system for about 6 hours with low network activity, but then connections started to get "stuck" again. The kernel log got a suspicious "NETDEV WATCHDOG: enp13s0 (r8169): transmit queue 0 timed out" at about the same time. `ethtool -i enp13s0` says 'firmware-version: rtl8168e-3_0.0.4 03/27/12', by the way.
`tc qdisc` at about the same time showed:

qdisc cake 8003: dev enp13s0 root refcnt 2 bandwidth unlimited diffserv4 flows nonat nowash ack-filter no-split-gso rtt 80ms noatm overhead 42 mpu 84
 Sent 117300483 bytes 894392 pkt (dropped 55, overlimits 0 requeues 7)
 backlog 0b 0p requeues 7
 memory used: 42014b of 92080Kb
 capacity estimate: 0ibit
 min/max network layer size: 28 / 1560
 min/max overhead-adjusted size: 84 / 1602
 average network hdr offset: 14

                   Bulk  Best Effort        Video        Voice
thresh            0ibit        0ibit        0ibit        0ibit
target              4ms          4ms          4ms          4ms
interval           80ms         80ms         80ms         80ms
pk_delay          1.64s        1.38s          4us         76us
av_delay          149ms       34.1ms          0us          4us
sp_delay          279us          1us          0us          2us
backlog              0b           0b           0b           0b
pkts              25439       835490           12         3741
bytes          27602686     89505596          648       196972
way_inds            147        28855            0            0
way_miss           3409        30537            3           81
way_cols              0            0            0            0
drops                 0            3            0            0
marks                 0            1            0            0
ack_drop              0           52            0            0
sp_flows             12           26            1            1
bk_flows              0            0            0            0
un_flows              0            0            0            0
max_len            7305         5792           54          590
quantum            1514         1514         1514         1514

Used Wi-Fi to get r8168 back, `tc qdisc` shows:

qdisc cake 8004: dev enp13s0 root refcnt 2 bandwidth unlimited diffserv4 flows nonat nowash ack-filter no-split-gso rtt 80ms noatm overhead 42 mpu 84
 Sent 13149956 bytes 91039 pkt (dropped 0, overlimits 0 requeues 1)
 backlog 0b 0p requeues 1
 memory used: 16640b of 92080Kb
 capacity estimate: 0ibit
 min/max network layer size: 28 / 1560
 min/max overhead-adjusted size: 84 / 1602
 average network hdr offset: 14

                   Bulk  Best Effort        Video        Voice
thresh            0ibit        0ibit        0ibit        0ibit
target              4ms          4ms          4ms          4ms
interval           80ms         80ms         80ms         80ms
pk_delay           13us         13us          0us         11us
av_delay            3us          3us          0us          4us
sp_delay            2us          2us          0us          2us
backlog              0b           0b           0b           0b
pkts              18348        69261            0          940
bytes           1497390     11607601            0        44965
way_inds            339         1410            0            0
way_miss           2132        11116            0           19
way_cols              0            0            0            0
drops                 0            0            0            0
marks                 0            0            0            0
ack_drop              0            0            0            0
sp_flows             11            2            0            0
bk_flows              0            0            0            0
un_flows              0            0            0            0
max_len             644         5792            0          590
quantum            1514         1514         1514         1514

OK, thanks for testing.
r8169 and r8168 use different features of the net subsystem, therefore the test result doesn't necessarily indicate a driver bug. Best would still be a bisect, but as you wrote earlier it's difficult in your setup. r8169 uses the byte queue limit feature, not sure whether this could be related to your problem. But it might be worth a try to adjust the BQL setting via sysfs.

Yes, bisecting is almost impossible, especially since it often manifests slowly & gradually and I don't even know if that exact chip ever worked fine on r8169. `tc qdisc` and the lack of scarier dmesg messages don't actually show how extensive the problem is: drop and requeue numbers may not be big, but once they even start counting, most connections slow down and/or freeze as if all servers started to respond slowly or not at all. Gajim, a python XMPP/jabber client, straight up crashes and can't restart until I `rmmod r8169` (or at least switch off the link) because it just can't handle whatever is happening in the network stack. I might play around with BQL limits under r8169 when I feel like checking it out again, but isn't that only for transmit and not receive ?

W/o bisect basically everything is a shot in the dark. What else you could try:
- use r8169 from 5.0 on top of a recent kernel (you said it was still ok in 5.0)
- rx irq coalescing was removed in 5.1, as a better replacement you could do: use latest linux-next and set:
echo 20000 > /sys/class/net/<if>/gro_flush_timeout
echo 1 > /sys/class/net/<if>/napi_defer_hard_irqs

Both your affected systems have a RTL8168evl. However this chip version is very common, therefore something seems to be special in your setup, else I would expect many more such reports.

Created attachment 293147
r8169_stuck-on_5.9.1.txt

Tried out the napi-patched 5.9.1 with the napi_defer_hard_irqs and gro_flush_timeout options; it worked perfectly for about 12 hours of uptime and then got stuck completely with a familiar kernel trace.
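
The two sysfs settings tried here are easy to lose on reboot or module reload, so it may help to apply and read them back from a script. A minimal sketch, assuming the values 20000 and 1 suggested in this thread; the function name `set_napi_deferral` and its `sysfs_root` parameter are hypothetical, the latter existing only so the code can be exercised against a scratch directory. Writing to the real files requires root.

```python
from pathlib import Path


def set_napi_deferral(interface, gro_flush_timeout=20000, defer_hard_irqs=1,
                      sysfs_root="/sys/class/net"):
    """Write both settings for one interface and return what reads back."""
    base = Path(sysfs_root) / interface
    settings = {
        "gro_flush_timeout": gro_flush_timeout,
        "napi_defer_hard_irqs": defer_hard_irqs,
    }
    for name, value in settings.items():
        (base / name).write_text(f"{value}\n")
    # Read back instead of trusting the write, since the kernel may reject
    # or adjust a value.
    return {name: (base / name).read_text().strip() for name in settings}
```

For example, `set_napi_deferral("enp13s0")` would apply the suggested values to the interface named in this report.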
So far the only consistent things I've noticed are:
1) r8169 uses MSI-X while r8168 uses legacy MSI.
2) If coalesce parameters were customized for r8169, they're silently reset to default 0/1-0/1 when glitching starts.
3) Usually pk_delay of tc-cake is <20us even on sustained ~100% all-core CPU loads but on glitch it maxes out to >1s.
4) For AM3 motherboard and 5.2 kernel it was enough to increase net.core.netdev_budget* but here it seems to have no influence now.
5) Torrent downloading with summary speed of >2MB/s, a lot of active seeders and a lot of connections allowed in a client while having XMPP client logged-in and playing streamed video in background seems to be a good way to trigger it. Except that it may work fine without a single drop or delay for a while.

Bisect would be just as much of a shot in the dark with such a long trigger; it would be easier to buy a PCIe Ethernet card with another chip. Besides, to properly install a kernel it has to be packaged on a build-server. I haven't tried manual installation for years, since dropping Gentoo.

As for what's special on my system: pretty much all tunables are set for the most aggressive preemption and the lowest process scheduling latency possible. As much of the kernel's interrupt handling as possible is offloaded to threads, and audio/video/input/SSD-access is bumped while networking and HDD/complex-i/o-scheduling is sacrificed in priority. For example, scheduler precision is maximized with:
kernel.sched_autogroup_enabled=0 - crucial for priorities to work as expected.
# inspired by https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/
kernel.sched_latency_ns=100000
kernel.sched_min_granularity_ns=100000
kernel.sched_wakeup_granularity_ns=99
kernel.sched_nr_migrate=12
kernel.sched_migration_cost_ns=99000
kernel.timer_migration=0
kernel.sched_cfs_bandwidth_slice_us=100
kernel.sched_tunable_scaling=1
kernel.sched_child_runs_first=1
kernel.sched_rt_period_us=2000000
kernel.sched_rt_runtime_us=1500000
kernel.sched_rr_timeslice_ms=1

So anything that is not expecting to be preempted will have a bad time for the sake of me not having a bad time due to output stuttering and UI hanging. And RCU stuff:
rcu_nocbs=0-126 rcu_nocb_poll rcutree.kthread_prio=1 rcutree.use_softirq=0 rcupdate.rcu_task_ipi_delay=3333 rcutree.rcu_idle_lazy_gp_delay=4 rcutree.rcu_idle_gp_delay=1 io_delay=none

Although, at the beginning I've tested this bug with much more relaxed scheduling & RCU parameters and without CONFIG_PREEMPT_RT.

I had the same driver hang on my new system that I have been running for a week. It's new RTL8125B, XID 641 (support for which got mainlined only in Linux 5.9). The problem occurred when I attempted to connect to Windows 7 VM via RDP, but that might be just a coincidence.

[94171.732301] NETDEV WATCHDOG: enp6s0 (r8169): transmit queue 0 timed out
[94171.732319] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x232/0x240
[94171.732321] Modules linked in: macvtap macvlan tap veth xt_MASQUERADE xt_CHECKSUM xt_comment bridge stp llc ip6table_raw ip6table_nat iptable_raw iptable_nat bpfilter fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq sctp kvm_amd kvm amdgpu irqbypass mfd_core gpu_sched ttm ghash_clmulni_intel nct6775 hwmon_vid k10temp efivarfs
[94171.732345] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.9.1-gentoo #6
[94171.732348] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B550 Extreme4, BIOS P1.20 08/13/2020
[94171.732353] RIP: 0010:dev_watchdog+0x232/0x240
[94171.732357] Code: 85 c0 75 e5 eb 9c 4c 89 ef c6 05 c5 e6 c9 00 01 e8 b3 1d fc ff 44 89 e1 48 89 c2 4c 89 ee 48 c7 c7 c8 12 a8 91 e8 1e 9d 7e ff <0f> 0b e9 7a ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 c7 47
[94171.732360] RSP: 0018:ffff9e8d00003ea0 EFLAGS: 00010286
[94171.732363] RAX: 0000000000000000 RBX: ffff8c1aa331d600 RCX: 0000000000000000
[94171.732365] RDX: ffff8c1aaea278a0 RSI: ffff8c1aaea17820 RDI: 0000000000000300
[94171.732368] RBP: ffff8c1aa306a440 R08: ffff8c1aaea17820 R09: 00000000000006e3
[94171.732370] R10: ffffffff922cdd78 R11: ffff9e8d00003d48 R12: 0000000000000000
[94171.732372] R13: ffff8c1aa306a000 R14: ffff8c1aa306a440 R15: 0000000000000000
[94171.732374] FS: 0000000000000000(0000) GS:ffff8c1aaea00000(0000) knlGS:0000000000000000
[94171.732377] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[94171.732380] CR2: 00007f8d50009048 CR3: 0000001e9b98a000 CR4: 0000000000350ef0
[94171.732382] Call Trace:
[94171.732385] <IRQ>
[94171.732390] ? qdisc_put_unlocked+0x30/0x30
[94171.732395] call_timer_fn+0x2d/0x130
[94171.732399] run_timer_softirq+0x393/0x450
[94171.732403] ? tick_sched_handle.isra.0+0x40/0x40
[94171.732405] ? __hrtimer_run_queues+0xfd/0x260
[94171.732408] ? ktime_get+0x4a/0xc0
[94171.732412] __do_softirq+0xe1/0x2bf
[94171.732416] asm_call_irq_on_stack+0x12/0x20
[94171.732418] </IRQ>
[94171.732423] do_softirq_own_stack+0x36/0x40
[94171.732427] irq_exit_rcu+0x9a/0xa0
[94171.732432] sysvec_apic_timer_interrupt+0x2e/0x80
[94171.732435] asm_sysvec_apic_timer_interrupt+0x12/0x20
[94171.732440] RIP: 0010:cpuidle_enter_state+0xd5/0x380
[94171.732443] Code: c4 0f 1f 44 00 00 31 ff e8 88 77 91 ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 8d 02 00 00 31 ff e8 7f c0 96 ff fb 45 85 f6 <0f> 88 20 01 00 00 49 63 c6 be 68 00 00 00 4c 2b 24 24 48 89 c2 48
[94171.732446] RSP: 0018:ffffffff91c03e58 EFLAGS: 00000202
[94171.732448] RAX: ffff8c1aaea2a5c0 RBX: ffff8c1aa5d6b000 RCX: 000000000000001f
[94171.732451] RDX: 0000000000000000 RSI: 0000000021bf5c7a RDI: 0000000000000000
[94171.732453] RBP: ffffffff91cdce00 R08: 000055a610a6578d R09: ffff8c1aa77c9000
[94171.732455] R10: ffff8c1aaea29584 R11: ffff8c1aaea29564 R12: 000055a610a6578d
[94171.732458] R13: ffffffff91cdcee8 R14: 0000000000000002 R15: ffff8c1aa5d6b000
[94171.732463] ? cpuidle_enter_state+0xb8/0x380
[94171.732466] cpuidle_enter+0x37/0x60
[94171.732470] do_idle+0x1c9/0x240
[94171.732473] cpu_startup_entry+0x19/0x20
[94171.732476] start_kernel+0x50a/0x52c
[94171.732480] secondary_startup_64+0xa4/0xb0
[94171.732484] ---[ end trace 896922ae98389a20 ]---
[94171.744374] r8169 0000:06:00.0 enp6s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
[94171.752789] r8169 0000:06:00.0 enp6s0: rtl_rxtx_empty_cond_2 == 0 (loop: 42, delay: 100).
(I did link up/down at this point)
[94262.048882] r8169 0000:06:00.0 enp6s0: Link is Down
[94264.253871] RTL8125B 2.5Gbps internal r8169-600:00: attached PHY driver [RTL8125B 2.5Gbps internal] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)
[94264.418592] r8169 0000:06:00.0 enp6s0: Link is Down

(In reply to WGH from comment #26)
> I had the same driver hang on my new system that I have been running for a
> week.
It's new RTL8125B, XID 641 (support for which got mainlined only in
> Linux 5.9). The problem occurred when I attempted to connect to Windows 7 VM
> via RDP, but that might be just a coincidence.
>
This bug report is about a regression in 5.1. The trace you posted is a generic tx timeout, root cause can be anything. Also the report is about a different chip version. Having said that it's most likely a different issue. What you could do:
- check whether you can reproduce the issue
- check the 5.9-rc kernels whether there was one w/o this issue, so that you have a basis for a bisect

RDP activity is triggering this bug very easily somehow. No other activity including downloading packages, downloading games from Steam, running iperf in all possible directions, copying VM images with SSH+rsync could trigger this. The first time I got this stacktrace AND two rtl_rxtx_empty_cond warnings. The next times connectivity disappears, and after ~5-60 seconds I get two rtl_rxtx_empty_cond warnings, and connectivity recovers.

Hello all!
I have the same issue, not only with r8169 but also with e1000e and ibt (I have many devices), but it's mostly gone if I change fq_codel to pfifo_fast.
> tc qdisc replace dev eth1 root pfifo_fast
Interestingly, it happens only with a 1Gb connection; at 100Mb it is stable.
I suppose we have some regression in fq_codel itself. All kernels from 5.4 up to 5.10 are affected (that's what I tested).
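The qdisc swap described in this comment can be wrapped in a small helper that first shows what is installed. This is only a sketch under assumptions: the function name `switch_root_qdisc` and the `DRY_RUN` variable are made up for illustration, `eth1` is the interface name from the comment, and the change does not survive a reboot or an interface re-creation.

```shell
#!/bin/sh
# Sketch: replace the root qdisc of one interface.
# switch_root_qdisc and DRY_RUN are hypothetical names, not part of iproute2.
switch_root_qdisc() {
    dev="${1:-}"
    qdisc="${2:-pfifo_fast}"
    if [ -z "$dev" ]; then
        echo "usage: switch_root_qdisc <interface> [qdisc]" >&2
        return 1
    fi
    # With DRY_RUN=1 only print the command so it can be reviewed first.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "tc qdisc replace dev $dev root $qdisc"
        return 0
    fi
    # Show the current qdisc, then replace it (needs root).
    tc qdisc show dev "$dev" &&
        tc qdisc replace dev "$dev" root "$qdisc"
}

# Example (as root):
#   switch_root_qdisc eth1 pfifo_fast
#   sysctl net.core.default_qdisc    # qdisc used for newly created interfaces
```

Running it with `DRY_RUN=1` first makes it easy to confirm the interface name before touching a link that is still carrying traffic.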
(In reply to Zhuravlev Uriy from comment #29)
> Hello all!
>
> I have the same issue but not only for r8169 but also for e1000e and ibt (I
> have many devices) but it's mostly gone if I change fq_codel to pfifo_fast.
>
> > tc qdisc replace dev eth1 root pfifo_fast
>
> Interesting, it happens only with a 1Gb connection for 100Mb is stable.
> I suppose we have some regression in fq_codel itself. All kernels started
> from 5.4 and up to 5.10 affected (it's what I tested).

I can reproduce it also with tc-cake (with 'no-split-gso' option) and even tc-sfq. And it seems to be a lower-level issue because it started happening on third-party r8168 too as well as kernel's r8169, now even without complaints in dmesg or not happening after complaints like "WARNING: CPU: 7 PID: 75 at kernel/softirq.c:484 __raise_softirq_irqoff+0x64/0x110" with trace starting with "napi_watchdog+0x78/0x90". But with codel and cake it happens much sooner and more often; only a full reboot fixes things, just resetting driver modules or replacing qdisc after the fact doesn't. There is no relevant output in dmesg, as if hanging network activity were OK.

After some tweaks it seems to work, despite aforementioned warnings at boot up. Like:
* not enabling /sys/class/net/${interface}/napi_defer_hard_irqs
* setting:
net.core.dev_weight=8
net.core.dev_weight_rx_bias=3
net.core.dev_weight_tx_bias=2
net.core.busy_poll=5
net.core.netdev_max_backlog=1024
net.core.netdev_budget=48
net.core.netdev_budget_usecs=5000
* doing this at interface bring-up:
for txq in /sys/class/net/"${interface}"/queues/tx-*; do
    echo "42" > "${txq}/byte_queue_limits/limit_min"
    echo "65536" > "${txq}/byte_queue_limits/limit_max"
done
echo "9999" > "/sys/class/net/${interface}/gro_flush_timeout"
ip l set "${interface}" gso_max_size 8192 gso_max_segs 8 multicast on txqueuelen 256
ip l set "${interface}" mtu 9194

But it's all complete guesswork.

I also solved this issue - just disable the ASPM for the kernel (pcie_aspm=off).
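Whether ASPM is really off for a given device can be read from the link control register rather than assumed from the boot parameter. The sketch below filters `lspci -vv` output for the `LnkCtl:` line; the helper name `aspm_state` is hypothetical, the exact wording of that line may differ between pciutils versions, and `06:00.0` is only an example address.

```shell
#!/bin/sh
# Sketch: print the ASPM state found on the "LnkCtl:" line(s) of `lspci -vv`
# output read from stdin. aspm_state is a hypothetical helper name.
aspm_state() {
    # Typical line:  LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}

# Example (as root, so that lspci can read the PCIe capabilities):
#   lspci -vv -s 06:00.0 | aspm_state
#   cat /sys/module/pcie_aspm/parameters/policy    # active policy is in [brackets]
```

An output of `Disabled` means ASPM is off for that link; anything like `L1 Enabled` means it is still active regardless of what was passed on the kernel command line.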
(In reply to Zhuravlev Uriy from comment #31)
> I also solved this issue - just disable the ASPM for the kernel
> (pcie_aspm=off).

Except that both my boards are too old to have it enabled by default and I've force-disabled it in the kernel from the beginning anyway. Also, there were some reports that it doesn't actually get disabled for good unless you put 'pcie_aspm=force pcie_aspm.policy=performance', don't know if that ever got fixed.

(In reply to dnieper from comment #33)
> [...]
>
> https://geometrydash-free.com
>
> [...]

Spam

Is this still an issue with 6.11?

(In reply to Artem S. Tashkinov from comment #36)
> Is this still an issue with 6.11?

To avoid that I've been using:
1) cake instead of codel;
2) much beefier CPU, 12-core Xeon on LGA2011-1 board (also with r8169), with high lowest allowed frequency (min_perf_pct=63) and almost lowest scheduling/wakeup latency possible set in /sys/kernel/debug/sched/;
3) much safer 'net.core.*' (budget=2048) and '/byte_queue_limits/' (256-8192) defaults.

So I don't know and am not keen on finding out. But I also have ASPM enabled now, despite it being discouraged on that platform, for the sake of NVMe's power-saving. `lspci -vv` shows it to be disabled for the Ethernet controller though.
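The byte_queue_limits values mentioned in this thread live under each tx queue in sysfs, so the values actually in effect can be listed before and after a tweak. A minimal sketch, assuming the 256-8192 range refers to `limit_min` and `limit_max`; the helper name `show_bql` and the `SYSFS_NET` override are hypothetical and exist only to make the function easy to try against a scratch directory.

```shell
#!/bin/sh
# Sketch: list the BQL limits of every tx queue of an interface.
# show_bql and SYSFS_NET are hypothetical names used for illustration.
show_bql() {
    dev="${1:-}"
    base="${SYSFS_NET:-/sys/class/net}/$dev/queues"
    if [ -z "$dev" ] || [ ! -d "$base" ]; then
        echo "no such interface: $dev" >&2
        return 1
    fi
    for txq in "$base"/tx-*; do
        bql="$txq/byte_queue_limits"
        # Skip queues without BQL support (or an unmatched glob).
        [ -d "$bql" ] || continue
        printf '%s limit_min=%s limit_max=%s limit=%s\n' \
            "${txq##*/}" \
            "$(cat "$bql/limit_min")" \
            "$(cat "$bql/limit_max")" \
            "$(cat "$bql/limit")"
    done
}

# Example:
#   show_bql enp13s0
```

Reading the values does not need root; writing new limits, as in the bring-up loop earlier in the thread, does.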