Bug 217503
| Summary: | nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4 | | |
|---|---|---|---|
| Product: | Virtualization | Reporter: | Adam Baxter (kernel) |
| Component: | lguest | Assignee: | virtualization_lguest |
| Status: | NEW --- | | |
| Severity: | normal | CC: | bagasdotme, decui, haiyangz, korylprince, mhkelley, snoop05b |
| Priority: | P3 | | |
| Hardware: | Intel | | |
| OS: | Linux | | |
| Kernel Version: | 6.4.0rc4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | config used for the kernel | | |
(In reply to Adam Baxter from comment #0)
> After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I
> see the following
>
> [ 756.697753] net_ratelimit: 34 callbacks suppressed
> [ 756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0:
> nvsp_rndis_pkt_complete error status: 2
> (snipped repeated messages)

What previous kernel version did you use before you upgraded? Does this issue occur on that version?

Previously I used Debian's 6.1 package, and no, I don't see the error there.

(In reply to Adam Baxter from comment #2)
> Previously I used Debian's 6.1 package and no, I don't see the error there.

Can you then bisect between v6.1 and v6.4.0-rc4?

Does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887; occurs in dca5161f9bd052e9e73be90716ffd57e8762c697.

Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

(In reply to Adam Baxter from comment #4)
> does not occur in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887
> occurs in dca5161f9bd052e9e73be90716ffd57e8762c697
>
> Can you please cc linux-hyperv@vger.kernel.org? I wasn't able to.

I have forwarded this BZ report at [1]. Next time, if you want to add someone to the Cc list, click Edit on the Cc list.

[1]: https://lore.kernel.org/regressions/15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com/

The commit in question does not change the behavior of the networking code. It only checks for an error and outputs a message where previously any errors were silently ignored. The error status "2" indicates that the network packet send failed at the Hyper-V level. I've asked the Hyper-V team for more specifics.

Question: when you run the same scenario (including the "sudo strings /dev/sda") with the older kernel version, do you also see the slight drop in network speed from ~17-18 Gbps to ~16 Gbps? I'm guessing that you do, because the Linux kernel patch just outputs a message about an error that previously was silently ignored. Hopefully outputting the error message is not the cause of the performance drop.
The goal of the patch is to bring more visibility to errors if they *are* happening, rather than having network performance degradation with no indication of what might be causing it.

Adam, have you tried doing what Michael asked for?

Hi Thorsten & Michael. I'll hopefully get back to testing this this weekend.

Confirming I see similar behaviour without the error messages in 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 (~6.2rc7).

(In reply to Adam Baxter from comment #9)
> Confirming I see similar behaviour without the error messages in
> 991cbd4f34b1d2d4e4cc41aed6eb4799186c3887 ~6.2rc7

Thanks. My interpretation of this result is that there is no regression. The error has been happening all along; what's new is that the error is now reported. Do you agree?

I still want to learn more about why the error is occurring and its implications. I think one of the implications is that outgoing packets are being dropped, which could affect performance but isn't fatal. I'm in conversations with the Hyper-V team about the meaning of the error. One possible cause is that the "send packet" request from the Linux guest is malformed, but that doesn't seem likely to be the case here. There are other cases that the Hyper-V team described in terms of internal Hyper-V behaviors, and I'm following up to clarify the full implications.

FWIW, on 6.4.0rc4 I am now seeing that message without putting the system under significant load (e.g. tar -cf - . | ssh otherhost "tar xf").

I'm also getting those messages as well as significantly degraded SMB performance, but I don't believe that this is a kernel regression either, since it only happens if the kernel was built using make localmodconfig. "Full" configs that distributions use don't have this issue, so there has to be some configuration option that's not enabled by localmodconfig, except I'm without a clue which one it might be.

I came across this bug report because I was having weird network issues that coincided with several "hv_netvsc ... nvsp_rndis_pkt_complete error status: 2" messages being logged. I can reliably reproduce the error by trying to run a tftp server (https://github.com/korylprince/tftphere, though I'm sure others would work as well) and trying to send a file from another device. The tftp PUT would time out, with the server "sending" acknowledgement packets (according to tcpdump) but the client not actually receiving them. The "hv_netvsc ..." errors coincided with the acknowledgement packets being dropped.

I tested multiple kernel versions across multiple servers (all Ubuntu): 5.4.0-1109-azure, 5.15.0-86-generic, 6.2.0-34-generic, and 6.2.0-1014-azure, and they all had the same issue. These are all VMs in Hyper-V, Server 2019.

I ultimately discovered that turning off tx-checksumming on the virtual network interface (e.g. ethtool -K eth0 tx-checksumming off) fixed my weird network issue and stopped the messages from being logged. Hopefully this is helpful in debugging the underlying issue.

Thanks Matej and Kory. I realised this was still going on with 6.8.0-rc4 and noticed the comments about both degraded SMB performance (which I just hit) and turning tx-checksumming off, which fixes the performance. I don't think this is a regression, but I do think someone from the client Hyper-V team at Microsoft needs to take a look at this.
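The workaround described above amounts to disabling transmit checksum offload on the Hyper-V synthetic NIC. A minimal command sketch (it assumes the interface is eth0 and requires root; this is a mitigation that trades some CPU for reliability, not a fix for the underlying host-side issue):

```shell
# Inspect the current offload settings on the synthetic NIC
# (assumes the interface is eth0 -- adjust to match your VM).
ethtool -k eth0 | grep -i checksum

# Disable transmit checksumming as a mitigation: packets are then
# checksummed by the guest kernel instead of being offloaded.
sudo ethtool -K eth0 tx-checksumming off
```

Note that ethtool settings do not persist across reboots; to keep the workaround, reapply it from your distribution's network configuration.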
Created attachment 304350 [details]
config used for the kernel

After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I see the following:

[ 756.697753] net_ratelimit: 34 callbacks suppressed
[ 756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0: nvsp_rndis_pkt_complete error status: 2
(snipped repeated messages)

*but*, I'm only able to reliably reproduce this if I am generating garbage on another terminal, e.g. sudo strings /dev/sda

This doesn't appear to affect latency or bandwidth a huge amount; I ran an iperf3 test between the guest and the host while trying to cause these messages. Although, if you take 17-18 gigabit as the "base" speed, you can see it drop a bit to 16 gigabit while the errors happen and "catch up" when I stop spamming the console.

[  5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 100.00-101.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 101.00-102.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 102.00-103.00 sec  1.91 GBytes  16.4 Gbits/sec
[  5] 103.00-104.00 sec  1.92 GBytes  16.5 Gbits/sec
[  5] 104.00-105.00 sec  1.94 GBytes  16.6 Gbits/sec
[  5] 105.00-106.00 sec  1.89 GBytes  16.2 Gbits/sec
[  5] 106.00-107.00 sec  1.90 GBytes  16.3 Gbits/sec
[  5] 107.00-108.00 sec  2.23 GBytes  19.2 Gbits/sec
[  5] 108.00-109.00 sec  2.57 GBytes  22.0 Gbits/sec
[  5] 109.00-110.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 110.00-111.00 sec  2.64 GBytes  22.7 Gbits/sec
[  5] 111.00-112.00 sec  2.65 GBytes  22.7 Gbits/sec
[  5] 112.00-113.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 113.00-114.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 114.00-115.00 sec  2.65 GBytes  22.8 Gbits/sec
[  5] 115.00-116.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 116.00-117.00 sec  2.63 GBytes  22.6 Gbits/sec
[  5] 117.00-118.00 sec  2.69 GBytes  23.1 Gbits/sec
[  5] 118.00-119.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 119.00-120.00 sec  2.67 GBytes  22.9 Gbits/sec
[  5] 120.00-121.00 sec  2.66 GBytes  22.9 Gbits/sec
[  5] 121.00-122.00 sec  2.49 GBytes  21.4 Gbits/sec
[  5] 122.00-123.00 sec  2.15 GBytes  18.5 Gbits/sec
[  5] 123.00-124.00 sec  2.16 GBytes  18.6 Gbits/sec
[  5] 124.00-125.00 sec  2.16 GBytes  18.6 Gbits/sec
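For summarizing interval logs like the one above, a small helper can average the per-second rates (an illustrative addition, not part of the report; `avg_rate` is a hypothetical name, and it assumes the plain iperf3 interval format shown, with the bitrate in the second-to-last column):

```shell
# avg_rate: average the Gbits/sec column of iperf3 interval lines read
# from stdin, e.g. "[  5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec".
avg_rate() {
    awk '/Gbits\/sec/ { sum += $(NF - 1); n++ }
         END { if (n) printf "%.1f Gbits/sec avg\n", sum / n }'
}
```

Usage, e.g. `iperf3 -c <host> -t 120 | avg_rate`, run once while triggering the errors and once quiescent, to compare the degraded and recovered phases.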