Hi, I have a server with plenty of free RAM (16 GB free while the guests are running). Each guest has 8 GB of RAM. One guest has 3 network interfaces, connected to bridges on the host (is it called "dom0" in KVM as well?).

That guest now frequently (multiple times per day) loses all connectivity on one of those interfaces. If I then ping any host connected to that interface, no reply comes back: only a "no buffer space available" error. Neither the host nor the guest logs anything to dmesg, syslog or anywhere else. The only thing that helps is completely rebooting the guest; ifdown in the guest and/or the host does not. The problem happens with only 1 guest and only 1 interface; I verified that the other adapters can use their networks just fine. When the interface is bridged, the host can still use it; only the guest can't. When the guest is connected via a bridge and the failure hits, the "dropped packets" counter starts increasing for every packet sent out; with a "direct" connection the counter does not increase.

Things I tried, none of which helped:
- increasing wmem and related sysctls
- removing the bridge for the problematic interface and replacing it with a direct connection
- loading the vhost_net module (a suggestion I found while googling)
- using e1000 instead of virtio
- disabling STP on the bridge

I also verified that there were not too many sockets open: only about 800. Note that this exact configuration ran fine for years on real hardware. Also, the guest usually has plenty of free RAM, mostly around 5 GB (not even used by cache/buffers, as the problem also occurs after only a couple of minutes of uptime).
Apart from that increasing "dropped packets" counter, there's another difference between a direct connection and a connection via a bridge:

20:19:03.910892 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:04.906854 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:05.493445 ARP, Request who-has 192.168.178.2 tell 192.168.178.83, length 46
20:19:05.903027 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:06.490750 ARP, Request who-has 192.168.178.2 tell 192.168.178.83, length 46
...

.2 is the problematic guest, and .83 and .1 are indeed devices in that network! So ARP requests come in but no replies go out, and no other traffic comes in either: this is the interface to the internet and it normally receives a constant stream of data (e.g. NTP requests, VPN data (tinc), web server requests, mail, etc.).

Versions used:
pxe-qemu 1.0.0+git-20120202.f6840ba-3
kvm 1:1.1.2+dfsg-6
qemu 1.1.2+dfsg-6a
qemu-keymaps 1.1.2+dfsg-6a
qemu-kvm 1.1.2+dfsg-6
qemu-system 1.1.2+dfsg-6a
qemu-user 1.1.2+dfsg-6a
qemu-utils 1.1.2+dfsg-6a

Kernels: 3.8, 3.9 and 3.10, on both the guest and the host.
Both the guests and their configurations were created through libvirt.
I verified that there are no duplicate MAC addresses (apart from the bridges and their hardware backends).
Created attachment 107047 [details] configuration of the problem guest vm
I did some sniffing on the problem network port. Just before the problem happens, I suddenly see tons of TCP retransmits; then after a while (9 seconds) things go silent, as the guest stops replying to all traffic entirely.
I thought running tcpdump on the problematic interface in the guest would help, but tonight I found out that was a placebo: I had this problem about 18 times in a row, sometimes even while the guest was booting. I also thought I had noticed that rebooting the host helps; tonight I found out that is also incorrect. I have no idea what triggers it.
Today another interface failed on the same guest: this time the interface for the local LAN. It could not be pinged from within the host, and the guest could not ping hosts on the LAN either.
Moments ago another(!) guest on this system lost its connectivity. I did a tcpdump on its network interface from within the host/hypervisor/dom0 and saw that traffic goes in but nothing comes out, e.g. no traffic from the client comes out of the 'vnet1' interface. While I ran the tcpdump I also ran a ping from within that guest, which indeed showed only timeouts. As there is _no_ firewall on this guest, we can rule out the possibility that it is a firewall issue.
And now the firewall lost connectivity again.
- I removed all iptables rules and set the default policies to ACCEPT
- from within the guest I started tcpdump
- from the host I started a ping
- in the guest I see the pings coming in, but no ping reply goes out! The same happened on the other guest that had this problem 10 minutes ago
- after a while the host starts sending ARP requests and gets no replies from the guest either
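For reference, the rule-flush from the first step can be done as below; a minimal sketch (run as root inside the guest; it leaves the machine completely unfiltered, so only use it while testing):

```shell
# Set the default policies to ACCEPT first, then flush all rules,
# so that no packet can be dropped by a leftover filter rule.
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F            # flush rules in the filter table
iptables -t nat -F     # flush rules in the nat table
iptables -X            # remove user-defined chains
```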
Good news! If I:
- bring down all interfaces in the guest (ifdown eth0, ...)
- rmmod virtio_net
- modprobe virtio_net
- bring the interfaces up again
then it all works again! So hopefully this helps the bug hunt?
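The recovery steps above can be wrapped in a small script; a sketch, assuming a Debian-style guest with ifupdown and root privileges (interface names are passed as arguments, since they are site-specific):

```shell
#!/bin/sh
# Recover a stuck virtio-net device without rebooting the guest:
# take the interfaces down, reload the virtio_net driver, bring them up.
reset_virtio_net() {
    for iface in "$@"; do
        ifdown "$iface"      # deconfigure the interface (ifupdown)
    done
    rmmod virtio_net         # unload the driver, discarding the stuck queues
    modprobe virtio_net      # reload it so the devices re-probe cleanly
    for iface in "$@"; do
        ifup "$iface"        # reconfigure the interface
    done
}

if [ "$#" -eq 0 ]; then
    echo "usage: $0 iface [iface...]" >&2
else
    reset_virtio_net "$@"
fi
```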
This problem could be related to power management putting the virtio network devices to sleep. Try deactivating power management entirely by using noapm or pci=noacpi as kernel boot parameters and see if the network devices still die.
You can also disable power management on devices selectively by writing "on" to /sys/devices/pci<your device>/power/control. E.g.:

cat on> /sys/devices/pci0000\:00/0000\:00\:03.0/power/control
> cat on> /sys/devices/pci0000\:00/0000\:00\:03.0/power/control

Aaargh! It was supposed to be:

echo on > /sys/devices/pci0000\:00/0000\:00\:03.0/power/control
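To apply that setting without looking up each PCI address by hand, a small sketch (assuming the standard /sys/bus/pci layout; run as root inside the guest):

```shell
# Disable runtime power management for every PCI device in the guest,
# so that no virtio NIC can be put to sleep. Devices without a
# power/control file, or where we lack write permission, are skipped.
for ctl in /sys/bus/pci/devices/*/power/control; do
    [ -w "$ctl" ] || continue   # skip missing/unwritable entries
    echo on > "$ctl"            # "on" keeps the device fully powered
done
```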
Michael Tsirkin and I are currently looking into a similar case. The bug can be triggered when the guest transmits large packets using GSO. As a workaround, try turning off TX offload features inside the guest:

# ethtool --offload eth0 tx off
# ethtool --offload eth0 tso off
# ethtool --offload eth0 ufo off
# ethtool --offload eth0 gso off

If the guest is already broken, you can check whether it's the same bug by looking at the output of "ifconfig <tap>" for the guest's tap device on the host. If you see "RX dropped" > 0, then the virtio-net device is stuck.
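Note that ethtool settings are lost on reboot. On a Debian guest using ifupdown (as in this thread), one way to reapply the workaround automatically is a post-up hook; an illustrative fragment, with the address and interface name as placeholders:

```
# /etc/network/interfaces fragment (illustrative): reapply the offload
# workaround every time eth0 is brought up. ethtool accepts several
# feature/value pairs in a single --offload invocation.
iface eth0 inet static
    address 192.168.178.2
    netmask 255.255.255.0
    post-up ethtool --offload eth0 tx off tso off ufo off gso off
```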
Hi, I've executed the ethtool commands. It happens at least once a week, so hopefully I'll be able to tell next week whether it is resolved. Thanks for looking into it, because last Tuesday evening I had to rmmod/modprobe virtio_net at least 30 times to keep networking alive. And I'm not exaggerating :-)
Hi Folkert: what kind of workload did you run in the guest? Did the VM do forwarding? Thanks
@jason It happened with multiple virtual machines, but 99 out of 100 times it happened with one specific guest. That one is a firewall/gateway: it does iptables filtering and forwarding, both IPv4 and IPv6. It also runs things like mail, web, NTP, etc. By the way, the suggestion by Stefan Hajnoczi from 2013-10-30 08:58:26 UTC (the ethtool tricks) seems to work pretty well! In fact this virtual machine ran for a week without any disconnects! Only a week, because the kernel panicked when I implemented traffic control rules to prioritize ssh, but that's a different problem.
Created attachment 114331 [details]
patch to limit the head length of skb allocated

Would you care to test this patch to see if it solves your problem? Thanks
(In reply to Jason Wang from comment #17)
> Created attachment 114331 [details]
> patch to limit the head length of skb allocated
>
> Would you care to test this patch to see if it solves your problem?

I'm not an expert in kernel programming, but wouldn't an oversized skb only affect the current incoming packet, and not the stability of the whole interface? Because what I see is that packets do indeed get dropped, and then after a while communication stops completely and does not recover anymore (unless I rmmod/modprobe virtio_net).
(In reply to Folkert van Heusden from comment #18)
> I'm not an expert in kernel programming but wouldn't an oversized skb only
> affect the current incoming packet and not the whole interface stability?

Yes, the patch just limits the head length of a packet received by tun/tap. Without it, the host may try to allocate 64K of contiguous memory in one go, which has a high chance of failing when memory is fragmented or under heavy stress. The patch makes sure only one page is allocated at a time.

> Because what I see is that indeed packets get dropped and then after a while
> communication totally stops and also does not recover anymore (unless I
> rmmod/modprobe virtio_net).