Bug 217503

Summary: nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4
Product: Virtualization Reporter: Adam Baxter (kernel)
Component: lguestAssignee: virtualization_lguest
Status: NEW ---    
Severity: normal CC: bagasdotme, decui, haiyangz, korylprince, mhkelley, snoop05b
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.4.0rc4 Subsystem:
Regression: No Bisected commit-id:
Attachments: config used for the kernel

Description Adam Baxter 2023-05-29 10:39:32 UTC
Created attachment 304350 [details]
config used for the kernel

After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I see the following 

[  756.697753] net_ratelimit: 34 callbacks suppressed
[  756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0: nvsp_rndis_pkt_complete error status: 2
(snipped repeated messages)

*but*, I'm only able to reliably reproduce this if I am generating garbage on another terminal, e.g. sudo strings /dev/sda


This doesn't appear to affect latency or bandwidth a huge amount, I ran an iperf3 test between the guest and the host while trying to cause these messages.
Although you if you take 17-18 gigabit as the "base" speed, you can see it drop a bit to 16 gigabit while the errors happen and "catch up" when I stop spamming the console.

[  5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 100.00-101.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 101.00-102.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 102.00-103.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 103.00-104.00 sec  1.92 GBytes  16.5 Gbits/sec
[  5] 104.00-105.00 sec  1.94 GBytes  16.6 Gbits/sec
[  5] 105.00-106.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 106.00-107.00 sec  1.90 GBytes  16.3 Gbits/sec
[  5] 107.00-108.00 sec  2.23 GBytes  19.2 Gbits/sec
[  5] 108.00-109.00 sec  2.57 GBytes  22.0 Gbits/sec
[  5] 109.00-110.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 110.00-111.00 sec  2.64 GBytes  22.7 Gbits/sec
[  5] 111.00-112.00 sec  2.65 GBytes  22.7 Gbits/sec
[  5] 112.00-113.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 113.00-114.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 114.00-115.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 115.00-116.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 116.00-117.00 sec  2.63 GBytes  22.6 Gbits/sec
[  5] 117.00-118.00 sec  2.69 GBytes  23.1 Gbits/sec
[  5] 118.00-119.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 119.00-120.00 sec  2.67 GBytes  22.9 Gbits/sec
[  5] 120.00-121.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 121.00-122.00 sec  2.49 GBytes  21.4 Gbits/sec
[  5] 122.00-123.00 sec  2.15 GBytes  18.5 Gbits/sec
[  5] 123.00-124.00 sec  2.16 GBytes  18.6 Gbits/sec
[  5] 124.00-125.00 sec  2.16 GBytes  18.6 Gbits/sec
Comment 1 Bagas Sanjaya 2023-05-29 12:52:38 UTC
(In reply to Adam Baxter from comment #0)
> After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I
> see the following 
> 
> [  756.697753] net_ratelimit: 34 callbacks suppressed
> [  756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0:
> nvsp_rndis_pkt_complete error status: 2
> (snipped repeated messages)
> 

What previous kernel version do you use before you upgrade? Does this issue
occur on that version?
Comment 2 Adam Baxter 2023-05-30 05:44:22 UTC
Previously I used Debian's 6.1 package and no, I don't see the error there.
Comment 3 Bagas Sanjaya 2023-05-30 09:00:21 UTC
(In reply to Adam Baxter from comment #2)
> Previously I used Debian's 6.1 package and no, I don't see the error there.

Can you then bisect between v6.1 and v6.4.0-rc4?
Comment 4 Adam Baxter 2023-05-30 12:03:31 UTC
does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 
occurs in dca5161f9bd052e9e73be90716ffd57e8762c697

Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.
Comment 5 Bagas Sanjaya 2023-05-30 12:33:34 UTC
(In reply to Adam Baxter from comment #4)
> does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 
> occurs in dca5161f9bd052e9e73be90716ffd57e8762c697
> 
> Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

I have forwarded this BZ report at [1].

Next time, if you want to forward CC, click Edit on Cc list.

[1]: https://lore.kernel.org/regressions/15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com/
Comment 6 Michael Kelley 2023-06-05 14:54:43 UTC
The commit in question does not change the behavior of the networking code. It only checks for an error and outputs a message where previously any errors were silently ignored. The error status "2" indicates that the network packet send failed at the Hyper-V level. I've asked the Hyper-V team for more specifics.

Question: When you run the same scenario (including the "sudo strings /dev/sda") with the older kernel version, do you also see the slight drop in network speed from ~17-18 Gbps to ~16 Gbps?  I'm guessing that you do, because the Linux kernel patch just outputs a message about an error that previously was silently ignored.  Hopefully outputting the error message is not the cause of the performance drop.

The goal of the patch is to bring more visibility to errors if they *are* happening, rather than having network performance degradation with no indication of what might be causing it.
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-06-15 12:18:34 UTC
Adam, haven you tried doing what Michael asked for?
Comment 8 Adam Baxter 2023-06-15 13:16:16 UTC
Hi Thorsten & Michael. I'll hopefully get back to testing this this weekend.
Comment 9 Adam Baxter 2023-06-18 01:12:42 UTC
Confirming I see similar behaviour without the error messages in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7
Comment 10 Michael Kelley 2023-06-19 17:00:52 UTC
(In reply to Adam Baxter from comment #9)
> Confirming I see similar behaviour without the error messages in
> 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7

Thanks. My interpretation of this result is that there is no regression. The error has happening all along, and what's new is that the error is now reported.  Do you agree?

I still want to learn more about why the error is occurring and the implications.  I think one of the implications is that outgoing packets are being dropped, which could affect performance but isn't fatal.  I'm in conversations with the Hyper-V team about the meaning of the error.  One possible cause is that the "send packet" request from the Linux guest is malformed, but that doesn't seem likely to be the case here.  There are other cases that the Hyper-V team described in terms of internal Hyper-V behaviors, and I'm following up to clarify the full implications.
Comment 11 Adam Baxter 2023-06-22 12:29:11 UTC
FWIW on 6.4.0rc4 I am now seeing that message without putting the system under significant load (e.g. tar -cf - . | ssh otherhost "tar xf")
Comment 12 Matej Dian 2023-08-10 14:12:48 UTC
I'm also getting those messages as well as significantly degraded SMB performance, but i don't believe that this is a kernel regression either, since it only happens if kernel was built using make localmodconfig. "Full" configs that distribution use don't have this issue, so there has to be some configuration option that's not enabled using localmodconfig, except i'm without a clue which one it might be.
Comment 13 Kory Prince 2023-10-07 00:15:58 UTC
I came across this bug report because I was having weird network issues that coincided with several "hv_netvsc ... nvsp_rndis_pkt_complete error status: 2" messages being logged.

I can reliably reproduce the error by trying to run a tftp server (https://github.com/korylprince/tftphere, though I'm sure others would as well) and trying to send a file from another device. The tftp PUT would time out, with the server "sending" acknowledge packets (according to tcpdump) but the client not actually receiving them. The "hv_netsvc ..." errors coincided with the acknowledge packets being dropped.

I tested multiple kernel versions across multiple servers (all Ubuntu): 5.4.0-1109-azure, 5.15.0-86-generic, 6.2.0-34-generic, 6.2.0-1014-azure and they all had the same issue. These are all VMs in Hyper-V, Server 2019.

I ultimately discovered that turning off tx-checksumming on the virtual network interface (e.g. ethtool -K eth0 tx-checksumming off) fixed my weird network issue and stopped the messages from being logged. Hopefully this is helpful in debugging the underlying issue.
Comment 14 Adam Baxter 2024-02-18 01:43:40 UTC
Thanks Matej and Kory.

I realised this was still going on with 6.8.0-rc4 and noticed the comments about both degraded SMB performance (which I just hit) and turning tx-checksumming off - which fixes the performance.

I don't think this is a regression, but I do think someone from the client Hyper-V team at Microsoft needs to take a look at this.