Bug 217503

Summary: nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4
Product: Virtualization Reporter: Adam Baxter (kernel)
Component: lguestAssignee: virtualization_lguest
Status: NEW ---    
Severity: normal CC: bagasdotme, decui, haiyangz, korylprince, mhkelley, snoop05b
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.4.0rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: config used for the kernel
good and bad minimal kernel configs and module list

Description Adam Baxter 2023-05-29 10:39:32 UTC
Created attachment 304350 [details]
config used for the kernel

After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I see the following 

[  756.697753] net_ratelimit: 34 callbacks suppressed
[  756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0: nvsp_rndis_pkt_complete error status: 2
(snipped repeated messages)

*but*, I'm only able to reliably reproduce this if I am generating garbage on another terminal, e.g. sudo strings /dev/sda


This doesn't appear to affect latency or bandwidth a huge amount, I ran an iperf3 test between the guest and the host while trying to cause these messages.
Although you if you take 17-18 gigabit as the "base" speed, you can see it drop a bit to 16 gigabit while the errors happen and "catch up" when I stop spamming the console.

[  5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 100.00-101.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 101.00-102.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 102.00-103.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 103.00-104.00 sec  1.92 GBytes  16.5 Gbits/sec
[  5] 104.00-105.00 sec  1.94 GBytes  16.6 Gbits/sec
[  5] 105.00-106.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 106.00-107.00 sec  1.90 GBytes  16.3 Gbits/sec
[  5] 107.00-108.00 sec  2.23 GBytes  19.2 Gbits/sec
[  5] 108.00-109.00 sec  2.57 GBytes  22.0 Gbits/sec
[  5] 109.00-110.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 110.00-111.00 sec  2.64 GBytes  22.7 Gbits/sec
[  5] 111.00-112.00 sec  2.65 GBytes  22.7 Gbits/sec
[  5] 112.00-113.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 113.00-114.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 114.00-115.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 115.00-116.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 116.00-117.00 sec  2.63 GBytes  22.6 Gbits/sec
[  5] 117.00-118.00 sec  2.69 GBytes  23.1 Gbits/sec
[  5] 118.00-119.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 119.00-120.00 sec  2.67 GBytes  22.9 Gbits/sec
[  5] 120.00-121.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 121.00-122.00 sec  2.49 GBytes  21.4 Gbits/sec
[  5] 122.00-123.00 sec  2.15 GBytes  18.5 Gbits/sec
[  5] 123.00-124.00 sec  2.16 GBytes  18.6 Gbits/sec
[  5] 124.00-125.00 sec  2.16 GBytes  18.6 Gbits/sec
Comment 1 Bagas Sanjaya 2023-05-29 12:52:38 UTC
(In reply to Adam Baxter from comment #0)
> After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I
> see the following 
> 
> [  756.697753] net_ratelimit: 34 callbacks suppressed
> [  756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0:
> nvsp_rndis_pkt_complete error status: 2
> (snipped repeated messages)
> 

What previous kernel version do you use before you upgrade? Does this issue
occur on that version?
Comment 2 Adam Baxter 2023-05-30 05:44:22 UTC
Previously I used Debian's 6.1 package and no, I don't see the error there.
Comment 3 Bagas Sanjaya 2023-05-30 09:00:21 UTC
(In reply to Adam Baxter from comment #2)
> Previously I used Debian's 6.1 package and no, I don't see the error there.

Can you then bisect between v6.1 and v6.4.0-rc4?
Comment 4 Adam Baxter 2023-05-30 12:03:31 UTC
does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 
occurs in dca5161f9bd052e9e73be90716ffd57e8762c697

Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.
Comment 5 Bagas Sanjaya 2023-05-30 12:33:34 UTC
(In reply to Adam Baxter from comment #4)
> does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 
> occurs in dca5161f9bd052e9e73be90716ffd57e8762c697
> 
> Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

I have forwarded this BZ report at [1].

Next time, if you want to forward CC, click Edit on Cc list.

[1]: https://lore.kernel.org/regressions/15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com/
Comment 6 Michael Kelley 2023-06-05 14:54:43 UTC
The commit in question does not change the behavior of the networking code. It only checks for an error and outputs a message where previously any errors were silently ignored. The error status "2" indicates that the network packet send failed at the Hyper-V level. I've asked the Hyper-V team for more specifics.

Question: When you run the same scenario (including the "sudo strings /dev/sda") with the older kernel version, do you also see the slight drop in network speed from ~17-18 Gbps to ~16 Gbps?  I'm guessing that you do, because the Linux kernel patch just outputs a message about an error that previously was silently ignored.  Hopefully outputting the error message is not the cause of the performance drop.

The goal of the patch is to bring more visibility to errors if they *are* happening, rather than having network performance degradation with no indication of what might be causing it.
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-06-15 12:18:34 UTC
Adam, haven you tried doing what Michael asked for?
Comment 8 Adam Baxter 2023-06-15 13:16:16 UTC
Hi Thorsten & Michael. I'll hopefully get back to testing this this weekend.
Comment 9 Adam Baxter 2023-06-18 01:12:42 UTC
Confirming I see similar behaviour without the error messages in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7
Comment 10 Michael Kelley 2023-06-19 17:00:52 UTC
(In reply to Adam Baxter from comment #9)
> Confirming I see similar behaviour without the error messages in
> 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7

Thanks. My interpretation of this result is that there is no regression. The error has happening all along, and what's new is that the error is now reported.  Do you agree?

I still want to learn more about why the error is occurring and the implications.  I think one of the implications is that outgoing packets are being dropped, which could affect performance but isn't fatal.  I'm in conversations with the Hyper-V team about the meaning of the error.  One possible cause is that the "send packet" request from the Linux guest is malformed, but that doesn't seem likely to be the case here.  There are other cases that the Hyper-V team described in terms of internal Hyper-V behaviors, and I'm following up to clarify the full implications.
Comment 11 Adam Baxter 2023-06-22 12:29:11 UTC
FWIW on 6.4.0rc4 I am now seeing that message without putting the system under significant load (e.g. tar -cf - . | ssh otherhost "tar xf")
Comment 12 Matej Dian 2023-08-10 14:12:48 UTC
I'm also getting those messages as well as significantly degraded SMB performance, but i don't believe that this is a kernel regression either, since it only happens if kernel was built using make localmodconfig. "Full" configs that distribution use don't have this issue, so there has to be some configuration option that's not enabled using localmodconfig, except i'm without a clue which one it might be.
Comment 13 Kory Prince 2023-10-07 00:15:58 UTC
I came across this bug report because I was having weird network issues that coincided with several "hv_netvsc ... nvsp_rndis_pkt_complete error status: 2" messages being logged.

I can reliably reproduce the error by trying to run a tftp server (https://github.com/korylprince/tftphere, though I'm sure others would as well) and trying to send a file from another device. The tftp PUT would time out, with the server "sending" acknowledge packets (according to tcpdump) but the client not actually receiving them. The "hv_netsvc ..." errors coincided with the acknowledge packets being dropped.

I tested multiple kernel versions across multiple servers (all Ubuntu): 5.4.0-1109-azure, 5.15.0-86-generic, 6.2.0-34-generic, 6.2.0-1014-azure and they all had the same issue. These are all VMs in Hyper-V, Server 2019.

I ultimately discovered that turning off tx-checksumming on the virtual network interface (e.g. ethtool -K eth0 tx-checksumming off) fixed my weird network issue and stopped the messages from being logged. Hopefully this is helpful in debugging the underlying issue.
Comment 14 Adam Baxter 2024-02-18 01:43:40 UTC
Thanks Matej and Kory.

I realised this was still going on with 6.8.0-rc4 and noticed the comments about both degraded SMB performance (which I just hit) and turning tx-checksumming off - which fixes the performance.

I don't think this is a regression, but I do think someone from the client Hyper-V team at Microsoft needs to take a look at this.
Comment 15 Matej Dian 2024-10-05 12:05:13 UTC
Created attachment 306972 [details]
good and bad minimal kernel configs and module list

Hello everyone,

Since there was no update for so long i decided to dig bit deeper.
Since looking at the hyperv network code didn't show anything obvious i took a brute-force approach, cutting the extra modules list in halves and rebuiling till i narrow it down, and after way too many kernel compiles and reboots later the single module making the issue magically disappear when additionaly passed to localmodconfig was `tunnel4`.
Checked what was the difference in resulting config and managed to reduce it further to just 4 config options:
CONFIG_NET_IPIP=m
CONFIG_GRO_CELLS=y
CONFIG_INET_TUNNEL=m
CONFIG_NET_IP_TUNNEL=m

Maybe not absolutely minimal but i believe this should narrow it down enough for  any skilled engineer at Microsoft to come up with some fix (e.g. enable those options as dependency when building the hv_netvsc) in time worth of a coffee break.

This was tested on 6.12rc1, hopefully the fix will make it for the final LTS release and may also be worth as backport.
Comment 16 Adam Baxter 2024-10-05 13:02:44 UTC
Hi Matej,
Are you able to provide any more info here? What did you test to decide this fixed it?

I can confirm that enabling 
CONFIG_NET_IPIP=m
CONFIG_INET_TUNNEL=m
CONFIG_NET_IP_TUNNEL=m
as well seems to fix the issue, or at least silence the errors when I run iperf3.

I do not understand how a module that's not loaded fixes an unrelated (?) issue.

At any rate, thank you for investigating!
Comment 17 Adam Baxter 2024-10-05 13:28:32 UTC
Another note - there's about a 5 gigabit / second slowdown between VM host and guest in iperf3 speeds between kernel 6.2rc7 (the first commit the error occurred in) and 6.12rc1 - possibly interesting to someone on the Hyper-V team.

I did try to get ntttcp working but could not.
Comment 18 Matej Dian 2024-10-05 15:04:57 UTC
Hi Adam,

My test workload was in fact regular access of SMB share from Host (exported from Linux VM guest). On kernel without tunnel4 module/config options it enables would result in very low performance at times even lockups/timeouts of the whole share, without any heavy load mind you, and obviously spam of the mesage in dmesg. Once enabling them the errors are gone and SMB works flawelssly, basically at disk IO speeds (and same if module/config options omitted but tx checksumming disabled)

Since you mention iperf3 i give that couple tests too, interestingly enough its not affected nearly as much in all configs yielding 25-30Gbps (guest as server, host as client).

Re not loaded module affecting behavior, as i already mentioned there are non-module options that may get auto-enabled as dependency of another module being built, even when not loaded, so i guess this is the case. Could potentially be the CONFIG_GRO_CELLS=y that gets enabled as can be seen in https://github.com/torvalds/linux/blob/27cc6fdf720183dce1dbd293483ec5a9cb6b595e/net/ipv4/Kconfig#L186