Bug 217503
| Summary: | nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4 | | |
| --- | --- | --- | --- |
| Product: | Virtualization | Reporter: | Adam Baxter (kernel) |
| Component: | lguest | Assignee: | virtualization_lguest |
| Status: | NEW --- | | |
| Severity: | normal | CC: | bagasdotme, decui, haiyangz, korylprince, mhkelley, snoop05b |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | 6.4.0rc4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
- config used for the kernel
- good and bad minimal kernel configs and module list
(In reply to Adam Baxter from comment #0)
> After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I
> see the following
>
> [ 756.697753] net_ratelimit: 34 callbacks suppressed
> [ 756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0:
> nvsp_rndis_pkt_complete error status: 2
> (snipped repeated messages)

What kernel version did you use before upgrading? Does this issue occur on that version?

---

Previously I used Debian's 6.1 package, and no, I don't see the error there.

---

(In reply to Adam Baxter from comment #2)
> Previously I used Debian's 6.1 package and no, I don't see the error there.

Can you then bisect between v6.1 and v6.4.0-rc4?

---

Does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887; occurs in dca5161f9bd052e9e73be90716ffd57e8762c697.

Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

---

(In reply to Adam Baxter from comment #4)
> does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887
> occurs in dca5161f9bd052e9e73be90716ffd57e8762c697
>
> Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

I have forwarded this BZ report at [1]. Next time, if you want to forward a CC, click Edit on the Cc list.

[1]: https://lore.kernel.org/regressions/15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com/

---

The commit in question does not change the behavior of the networking code. It only checks for an error and outputs a message where previously any errors were silently ignored. The error status "2" indicates that the network packet send failed at the Hyper-V level. I've asked the Hyper-V team for more specifics.

Question: when you run the same scenario (including the "sudo strings /dev/sda") with the older kernel version, do you also see the slight drop in network speed from ~17-18 Gbps to ~16 Gbps? I'm guessing that you do, because the Linux kernel patch just outputs a message about an error that previously was silently ignored. Hopefully outputting the error message is not the cause of the performance drop.
The goal of the patch is to bring more visibility to errors if they *are* happening, rather than having network performance degradation with no indication of what might be causing it.

---

Adam, have you tried doing what Michael asked for?

---

Hi Thorsten & Michael. I'll hopefully get back to testing this this weekend.

---

Confirming I see similar behaviour without the error messages in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 (~6.2rc7).

---

(In reply to Adam Baxter from comment #9)
> Confirming I see similar behaviour without the error messages in
> 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7

Thanks. My interpretation of this result is that there is no regression. The error has been happening all along, and what's new is that the error is now reported. Do you agree?

I still want to learn more about why the error is occurring and its implications. I think one of the implications is that outgoing packets are being dropped, which could affect performance but isn't fatal. I'm in conversations with the Hyper-V team about the meaning of the error. One possible cause is that the "send packet" request from the Linux guest is malformed, but that doesn't seem likely to be the case here. There are other cases that the Hyper-V team described in terms of internal Hyper-V behaviors, and I'm following up to clarify the full implications.

---

FWIW, on 6.4.0rc4 I am now seeing that message without putting the system under significant load (e.g. tar -cf - . | ssh otherhost "tar xf").

---

I'm also getting those messages as well as significantly degraded SMB performance, but I don't believe that this is a kernel regression either, since it only happens if the kernel was built using make localmodconfig. "Full" configs that distributions use don't have this issue, so there has to be some configuration option that's not enabled by localmodconfig, except I'm without a clue which one it might be.

---

I came across this bug report because I was having weird network issues that coincided with several "hv_netvsc ... nvsp_rndis_pkt_complete error status: 2" messages being logged.

I can reliably reproduce the error by trying to run a tftp server (https://github.com/korylprince/tftphere, though I'm sure others would work as well) and trying to send a file from another device. The tftp PUT would time out, with the server "sending" acknowledgement packets (according to tcpdump) but the client not actually receiving them. The "hv_netvsc ..." errors coincided with the acknowledgement packets being dropped.

I tested multiple kernel versions across multiple servers (all Ubuntu): 5.4.0-1109-azure, 5.15.0-86-generic, 6.2.0-34-generic, 6.2.0-1014-azure, and they all had the same issue. These are all VMs in Hyper-V, Server 2019.

I ultimately discovered that turning off tx-checksumming on the virtual network interface (e.g. ethtool -K eth0 tx-checksumming off) fixed my weird network issue and stopped the messages from being logged. Hopefully this is helpful in debugging the underlying issue.

---

Thanks Matej and Kory. I realised this was still going on with 6.8.0-rc4 and noticed the comments about both degraded SMB performance (which I just hit) and turning tx-checksumming off, which fixes the performance. I don't think this is a regression, but I do think someone from the client Hyper-V team at Microsoft needs to take a look at this.

---

Created attachment 306972 [details]
good and bad minimal kernel configs and module list
Hello everyone,
Since there was no update for so long, I decided to dig a bit deeper.

Since looking at the Hyper-V network code didn't show anything obvious, I took a brute-force approach: cutting the extra modules list in halves and rebuilding until I narrowed it down. After way too many kernel compiles and reboots, the single module making the issue magically disappear when additionally passed to localmodconfig was `tunnel4`.

I checked what the difference in the resulting config was and managed to reduce it further to just 4 config options:
CONFIG_NET_IPIP=m
CONFIG_GRO_CELLS=y
CONFIG_INET_TUNNEL=m
CONFIG_NET_IP_TUNNEL=m
Maybe not absolutely minimal, but I believe this should narrow it down enough for any skilled engineer at Microsoft to come up with some fix (e.g. enable those options as a dependency when building hv_netvsc) in the time it takes for a coffee break.

This was tested on 6.12rc1. Hopefully the fix will make it into the final LTS release, and it may also be worth backporting.
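For anyone wanting to try this themselves, a sketch of how the four options above could be forced on top of a `make localmodconfig` build, using the kernel tree's own `scripts/config` helper. The exact sequence is standard kernel-build usage, not something specified in this report:

```shell
# Sketch: force-enable the options identified above on top of an
# existing .config (e.g. one produced by `make localmodconfig`).
# Run from the kernel source tree; scripts/config ships with the kernel.
KCONF=.config

if [ -x scripts/config ]; then
    scripts/config --file "$KCONF" \
        --module NET_IPIP \
        --module INET_TUNNEL \
        --module NET_IP_TUNNEL
    # CONFIG_GRO_CELLS is a non-visible symbol selected by NET_IP_TUNNEL
    # in net/ipv4/Kconfig, so resolving the config should pick it up.
    make olddefconfig
else
    echo "scripts/config not found; run this from a kernel source tree" >&2
fi
```

Then rebuild and reinstall the kernel as usual.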
Hi Matej,

Are you able to provide any more info here? What did you test to decide this fixed it?

I can confirm that enabling

    CONFIG_NET_IPIP=m
    CONFIG_INET_TUNNEL=m
    CONFIG_NET_IP_TUNNEL=m

as well seems to fix the issue, or at least silence the errors when I run iperf3. I do not understand how a module that's not loaded fixes an unrelated (?) issue. At any rate, thank you for investigating!

Another note: there's about a 5 gigabit/second slowdown between VM host and guest in iperf3 speeds between kernel 6.2rc7 (the first commit the error occurred in) and 6.12rc1; possibly interesting to someone on the Hyper-V team. I did try to get ntttcp working but could not.

---

Hi Adam,

My test workload was in fact regular access of an SMB share from the host (exported from the Linux VM guest). On a kernel without the tunnel4 module (or the config options it enables), this would result in very low performance, at times even lockups/timeouts of the whole share, without any heavy load mind you, and obviously spam of the message in dmesg. Once I enabled them, the errors are gone and SMB works flawlessly, basically at disk I/O speeds (and the same if the module/config options are omitted but tx checksumming is disabled).

Since you mention iperf3, I gave that a couple of tests too. Interestingly enough, it's not affected nearly as much, yielding 25-30 Gbps in all configs (guest as server, host as client).

Re a not-loaded module affecting behavior: as I already mentioned, there are non-module options that may get auto-enabled as a dependency of another module being built, even when that module is not loaded, so I guess this is the case. It could potentially be the CONFIG_GRO_CELLS=y that gets enabled, as can be seen in https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/net/ipv4/Kconfig#L186
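The tx-checksumming workaround mentioned several times above can be sketched as a small shell snippet. The interface name `eth0` is a placeholder; substitute your VM's actual NIC, and note the change does not persist across reboots:

```shell
# Sketch of the workaround reported above: disable transmit checksum
# offload on the Hyper-V synthetic NIC. "eth0" is a placeholder;
# requires root and the ethtool package.
IFACE="${IFACE:-eth0}"

if command -v ethtool >/dev/null 2>&1; then
    # Turn off tx-checksumming (ethtool will also report any dependent
    # offloads it has to disable along with it).
    ethtool -K "$IFACE" tx-checksumming off || true
    # Verify the new setting.
    ethtool -k "$IFACE" | grep tx-checksumming || true
else
    echo "ethtool not installed; skipping"
fi
```

Disabling the offload trades some CPU for correctness here, so treat it as a diagnostic aid rather than a permanent fix.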
Created attachment 304350 [details]
config used for the kernel

After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I see the following:

    [ 756.697753] net_ratelimit: 34 callbacks suppressed
    [ 756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0: nvsp_rndis_pkt_complete error status: 2
    (snipped repeated messages)

*but*, I'm only able to reliably reproduce this if I am generating garbage on another terminal, e.g. sudo strings /dev/sda

This doesn't appear to affect latency or bandwidth a huge amount; I ran an iperf3 test between the guest and the host while trying to cause these messages. Although, if you take 17-18 gigabit as the "base" speed, you can see it drop a bit to 16 gigabit while the errors happen and "catch up" when I stop spamming the console.

    [ 5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec
    [ 5] 100.00-101.00 sec  1.91 GBytes  16.4 Gbits/sec
    [ 5] 101.00-102.00 sec  1.91 GBytes  16.4 Gbits/sec
    [ 5] 102.00-103.00 sec  1.91 GBytes  16.4 Gbits/sec
    [ 5] 103.00-104.00 sec  1.92 GBytes  16.5 Gbits/sec
    [ 5] 104.00-105.00 sec  1.94 GBytes  16.6 Gbits/sec
    [ 5] 105.00-106.00 sec  1.89 GBytes  16.2 Gbits/sec
    [ 5] 106.00-107.00 sec  1.90 GBytes  16.3 Gbits/sec
    [ 5] 107.00-108.00 sec  2.23 GBytes  19.2 Gbits/sec
    [ 5] 108.00-109.00 sec  2.57 GBytes  22.0 Gbits/sec
    [ 5] 109.00-110.00 sec  2.66 GBytes  22.9 Gbits/sec
    [ 5] 110.00-111.00 sec  2.64 GBytes  22.7 Gbits/sec
    [ 5] 111.00-112.00 sec  2.65 GBytes  22.7 Gbits/sec
    [ 5] 112.00-113.00 sec  2.65 GBytes  22.8 Gbits/sec
    [ 5] 113.00-114.00 sec  2.65 GBytes  22.8 Gbits/sec
    [ 5] 114.00-115.00 sec  2.65 GBytes  22.8 Gbits/sec
    [ 5] 115.00-116.00 sec  2.66 GBytes  22.9 Gbits/sec
    [ 5] 116.00-117.00 sec  2.63 GBytes  22.6 Gbits/sec
    [ 5] 117.00-118.00 sec  2.69 GBytes  23.1 Gbits/sec
    [ 5] 118.00-119.00 sec  2.66 GBytes  22.9 Gbits/sec
    [ 5] 119.00-120.00 sec  2.67 GBytes  22.9 Gbits/sec
    [ 5] 120.00-121.00 sec  2.66 GBytes  22.9 Gbits/sec
    [ 5] 121.00-122.00 sec  2.49 GBytes  21.4 Gbits/sec
    [ 5] 122.00-123.00 sec  2.15 GBytes  18.5 Gbits/sec
    [ 5] 123.00-124.00 sec  2.16 GBytes  18.6 Gbits/sec
    [ 5] 124.00-125.00 sec  2.16 GBytes  18.6 Gbits/sec
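For anyone trying to reproduce the throughput measurement above, a sketch of the iperf3 setup. The exact flags and the guest address are assumptions; the report only shows the client's 1-second interval output:

```shell
# Hypothetical guest address; substitute your VM's actual IP.
GUEST_IP="${GUEST_IP:-192.0.2.10}"

# On the Hyper-V guest, run the server side:  iperf3 -s
# On the host, run a long client test with 1-second interval reports,
# matching the granularity of the output above.
if command -v iperf3 >/dev/null 2>&1; then
    iperf3 -c "$GUEST_IP" -t 120 -i 1 --connect-timeout 3000 || true
else
    echo "iperf3 not installed; skipping"
fi
```

Running the error-triggering workload (e.g. sudo strings /dev/sda) in a second terminal during the test is what exposed the dip to ~16 Gbit/s in the report.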