Bug 60620 - guest loses frequently (multiple times per day!) connectivity to network device
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: kvm
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: virtualization_kvm
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-07-24 19:31 UTC by Folkert van Heusden
Modified: 2013-11-12 14:26 UTC

See Also:
Kernel Version: 3.8, 3.9, 3.10 in both the host as well as the guest
Subsystem:
Regression: No
Bisected commit-id:


Attachments
configuration of the problem guest vm (2.92 KB, text/plain)
2013-07-30 18:58 UTC, Folkert van Heusden
patch to limit the head length of skb allocated (1.45 KB, patch)
2013-11-12 10:05 UTC, Jason Wang

Description Folkert van Heusden 2013-07-24 19:31:41 UTC
Hi,

I have a server with plenty of free ram (16GB free when running the guests). Each guest has 8GB of ram.
One guest has 3 network interfaces. They are connected to bridges in the host (is it called "DOM0" in KVM as well?). That guest now frequently (multiple times per day) loses all connectivity on one of those interfaces. If I then ping any host connected to that interface, no reply comes back: only a message about buffer space not being enough. As a test I increased wmem and friends but that did not help at all. I also removed the bridge for that one problematic interface and replaced it with a direct connection: did not help.
Neither the host nor the guest log anything to dmesg or syslog or anything.
The only thing that helps is completely rebooting the guest. Ifdown in the guest and/or host does not help.
This problem happens only with 1 guest and only with 1 interface. I verified that the other adapters can use their networks just fine. When it is used in bridging mode, the host is still able to use it; only the guest isn't.
When the guest is connected via a bridge and it fails, the "dropped packets" counter starts increasing for every packet sent out. If it is "directly" connected, this counter does not increase.
I did some googling and found the suggestion to modprobe the vhost_net module: did not help. Using e1000 instead of virtio: did not help.
I verified that there were not too many sockets open: only +/- 800. Note that this exact configuration ran fine for years on real hardware. Also, the guest frequently has plenty of free ram (not even used by cache/buffers, as the problem also happens after a couple of minutes of uptime); mostly 5GB.
I tried disabling STP on the bridge: did not help.
Apart from that increasing "dropped packets" counter, there is another difference between the direct connection and the connection via a bridge:

20:19:03.910892 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:04.906854 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:05.493445 ARP, Request who-has 192.168.178.2 tell 192.168.178.83, length 46
20:19:05.903027 ARP, Request who-has 192.168.178.2 tell 192.168.178.1, length 46
20:19:06.490750 ARP, Request who-has 192.168.178.2 tell 192.168.178.83, length 46
...

.2 is the problematic guest, and .83 and .1 are indeed devices in that network!
So ARP requests come in but no replies go out. No other traffic comes in either: this is the interface to the internet and normally it has a constant input of data (e.g. NTP requests, VPN data (tinc), web-server requests, mail, etc.).

Versions used:

pxe-qemu       1.0.0+git-20120202.f6840ba-3
kvm            1:1.1.2+dfsg-6
qemu           1.1.2+dfsg-6a
qemu-keymaps   1.1.2+dfsg-6a
qemu-kvm       1.1.2+dfsg-6
qemu-system    1.1.2+dfsg-6a
qemu-user      1.1.2+dfsg-6a
qemu-utils     1.1.2+dfsg-6a

Kernels: 3.8, 3.9 and 3.10.
Both on the guest and the host.
Comment 1 Folkert van Heusden 2013-07-30 15:59:14 UTC
Both the guests and their configurations were created through libvirt.
Comment 2 Folkert van Heusden 2013-07-30 18:42:47 UTC
I verified that there are no duplicate mac addresses (apart from the bridges and their hardware backend).
Comment 3 Folkert van Heusden 2013-07-30 18:58:57 UTC
Created attachment 107047 [details]
configuration of the problem guest vm
Comment 4 Folkert van Heusden 2013-07-30 19:03:34 UTC
I did some sniffing on the problem network port. Just before the problem happens, I suddenly see tons of TCP retransmits, and then after a while (9 seconds) things go silent as the guest stops replying to all traffic entirely.
Comment 5 Folkert van Heusden 2013-07-30 20:45:07 UTC
I thought that running tcpdump on the problematic interface in the guest would help, but tonight I found out that was a placebo: I had this problem about 18 times in a row. Sometimes even during boot of the guest.

I also thought that I had noticed that rebooting the host would help: that's also incorrect I found out tonight.

I have no idea what triggers it.
Comment 6 Folkert van Heusden 2013-07-31 19:43:52 UTC
Today another interface failed on the same guest. This time it was the interface for the local LAN. It could not be pinged from within the host, and the guest could not ping hosts on the LAN either.
Comment 7 Folkert van Heusden 2013-08-02 10:38:00 UTC
Moments ago another(!) guest on this system lost its connectivity.
I did a tcpdump on its network interface from within the host/hypervisor/dom0 and saw that traffic goes in but nothing comes out, i.e. no traffic from the guest comes out of the 'vnet1' interface. While I ran the tcpdump I also ran a ping from within that guest, which indeed showed only timeouts.
As there's _no_ firewall on this guest, we can rule out the possibility that it is a firewall issue.
Comment 8 Folkert van Heusden 2013-08-02 11:03:48 UTC
And now the firewall lost connectivity again.
- I removed all iptables rules and set the default to ACCEPT
- from within the guest I started tcpdump
- from the host I started ping
- from the guest I see the pings coming in but no ping reply goes out! the same happened on the other guest that had this problem 10 minutes ago
- after a while the host starts doing arp requests and also does not get any replies from it
Comment 9 Folkert van Heusden 2013-08-02 11:28:45 UTC
Good news!
If I

- bring down all interfaces in the guest (ifdown eth0...)
- rmmod virtio_net
- modprobe virtio_net
- bring up the interfaces again
+ it all works again!

So hopefully this helps the bug hunt?
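
The steps above can be wrapped in a small script so they do not have to be typed by hand each time. This is only a sketch: the helper names `run` and `reload_virtio_net` and the `DRY_RUN` variable are made up for illustration, the interface names are assumptions, and it assumes a guest that has `ifdown`/`ifup` and virtio_net built as a module.

```shell
#!/bin/sh
# Sketch of the workaround from this comment: bring the guest interfaces
# down, reload virtio_net, bring them up again.
# DRY_RUN=1 prints the commands instead of running them (hypothetical knob).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

reload_virtio_net() {
    # $@: interfaces to cycle (assumed names, e.g. eth0 eth1 eth2)
    for ifc in "$@"; do
        run ifdown "$ifc"
    done
    # Stop if the module cannot be unloaded or loaded again.
    run rmmod virtio_net || return 1
    run modprobe virtio_net || return 1
    for ifc in "$@"; do
        run ifup "$ifc"
    done
}

# Example, inside the guest as root:
#   reload_virtio_net eth0 eth1 eth2
```

Running it once with `DRY_RUN=1` first shows the exact commands without touching the network.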
Comment 10 gitne 2013-10-20 19:55:48 UTC
This problem could be related to power management putting the virtio network devices to sleep. Try deactivating power management entirely by using noapm or pci=noacpi as kernel boot parameters and see if the network devices still die.
Comment 11 gitne 2013-10-20 20:34:10 UTC
You can also disable power management on devices selectively by writing "on" to /sys/devices/pci<your device>/power/control. E.g.:

cat on> /sys/devices/pci0000\:00/0000\:00\:03.0/power/control
Comment 12 gitne 2013-10-20 20:36:33 UTC
> cat on> /sys/devices/pci0000\:00/0000\:00\:03.0/power/control

Aaargh! It was supposed to be:

echo on> /sys/devices/pci0000\:00/0000\:00\:03.0/power/control
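
To avoid looking up the PCI address by hand, the same write can be applied to every virtio PCI device in the guest. The following is a sketch under assumptions: it treats each PCI device with vendor ID 0x1af4 (the vendor ID used by virtio PCI devices) as a candidate, and the function names and the `PCI_DEVICES` override are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Sketch: show and force the runtime power management setting of virtio
# PCI devices. PCI_DEVICES is overridable only so the functions can be
# tried against a scratch directory instead of the real sysfs tree.
PCI_DEVICES="${PCI_DEVICES:-/sys/bus/pci/devices}"

virtio_pci_devices() {
    for dev in "$PCI_DEVICES"/*; do
        [ -r "$dev/vendor" ] || continue
        # 0x1af4 is the PCI vendor ID of virtio devices.
        [ "$(cat "$dev/vendor")" = "0x1af4" ] || continue
        [ -e "$dev/power/control" ] || continue
        echo "$dev"
    done
    return 0
}

list_virtio_power_control() {
    # Prints "<pci address> <on|auto>" per device.
    virtio_pci_devices | while read -r dev; do
        echo "$(basename "$dev") $(cat "$dev/power/control")"
    done
}

force_virtio_power_on() {
    # Writing "on" keeps the device out of runtime suspend (needs root).
    virtio_pci_devices | while read -r dev; do
        echo on > "$dev/power/control"
    done
}
```

`list_virtio_power_control` is read-only, so it can be run first to see whether any device is set to `auto` at all.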
Comment 13 Stefan Hajnoczi 2013-10-30 08:58:26 UTC
Michael Tsirkin and I are currently looking into a similar case.  The bug can be triggered when the guest transmits large packets using GSO.  As a workaround, try turning off TX offload features inside the guest:

# ethtool --offload eth0 tx off
# ethtool --offload eth0 tso off
# ethtool --offload eth0 ufo off
# ethtool --offload eth0 gso off

If the guest is already broken you can check whether it's the same bug by looking at the output of "ifconfig <tap>" for the guest's tap device on the host.  If you see "RX dropped" > 0 then the virtio-net device is stuck.
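
The "RX dropped" check can also be scripted on the host by reading the same counter from sysfs. This is a minimal sketch: the function names and the `SYSFS_NET` override are hypothetical, and the tap device name (for example vnet1) has to be looked up for the guest in question. A non-zero value is only a hint, so watching whether it keeps increasing is more telling than a single reading.

```shell
#!/bin/sh
# Sketch: read the RX dropped counter of a host-side tap device.
# SYSFS_NET is overridable only so this can be tried without a real tap.

rx_dropped() {
    # $1: tap device name on the host (assumed, e.g. vnet1)
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/rx_dropped"
}

check_tap() {
    # Returns 0 if nothing was dropped, 1 if drops were seen, 2 on error.
    dropped=$(rx_dropped "$1") || return 2
    if [ "$dropped" -gt 0 ]; then
        echo "$1: RX dropped = $dropped (possibly the stuck virtio-net case)"
        return 1
    fi
    echo "$1: RX dropped = 0"
}

# Example, on the host:
#   check_tap vnet1
```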
Comment 14 Folkert van Heusden 2013-10-30 15:19:09 UTC
Hi,

I've executed the ethtool commands.
It happens at least once a week so hopefully I'll be able to tell if it is resolved next week.
Thanks for looking at it because last Tuesday evening I had to rmmod/modprobe virtio_net at least 30 times to keep networking alive. And I'm not exaggerating :-)
Comment 15 Jason Wang 2013-11-11 03:22:51 UTC
Hi Folkert:

What kind of workload do you run in the guest? Does the VM do forwarding?

Thanks
Comment 16 Folkert van Heusden 2013-11-11 09:53:20 UTC
@jason

It happened with multiple virtual machines but 99 out of 100 times it happened with one specific machine. That one is a firewall/gateway. It does iptables filtering and forwarding. Both ipv4 and ipv6. It also runs things like mail, web, ntp, etc.

It seems, by the way, that the suggestion by Stefan Hajnoczi from 2013-10-30 08:58:26 UTC works pretty well! (the ethtool tricks)
In fact this virtual machine ran for a week without any disconnects! Only a week because the kernel panicked when I implemented traffic control rules to prioritize ssh but that's a different problem.
Comment 17 Jason Wang 2013-11-12 10:05:56 UTC
Created attachment 114331 [details]
patch to limit the head length of skb allocated

Would you care to test this patch to see if it solves your problem?

Thanks
Comment 18 Folkert van Heusden 2013-11-12 11:09:04 UTC
(In reply to Jason Wang from comment #17)
> Created attachment 114331 [details]
> patch to limit the head length of skb allocated
> Would you care to test this patch to see if it solves your problem?

I'm not an expert in kernel programming but wouldn't an oversized skb only affect the current incoming packet and not the whole interface stability?
Because what I see is that indeed packets get dropped and then after a while communication totally stops and also does not recover anymore (unless I rmmod/modprobe virtio_net).
Comment 19 Jason Wang 2013-11-12 14:26:03 UTC
(In reply to Folkert van Heusden from comment #18)
> (In reply to Jason Wang from comment #17)
> > Created attachment 114331 [details]
> > patch to limit the head length of skb allocated
> > Would you care to test this patch to see if it solves your problem?
> 
> I'm not an expert in kernel programming but wouldn't an oversized skb only
> affect the current incoming packet and not the whole interface stability?

Yes, the patch just limits the head length of a packet received by tun/tap. Without this patch, the host may try to allocate 64K of contiguous memory at one time, which is likely to fail when memory is fragmented or under heavy stress. So this patch makes sure that only one page is allocated at a time.
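
For a rough idea of whether such a 64K allocation is likely to fail on a given host, the free-block counts in /proc/buddyinfo can be summed. This is a sketch under assumptions: it assumes 4 KiB pages, so that 64K of contiguous memory corresponds to an order-4 block, and the usual /proc/buddyinfo layout where the counts for order 0 upward start in the fifth column. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: per memory zone, count the free blocks that are large enough
# for a 64K contiguous allocation (order 4 or higher with 4 KiB pages).

free_blocks_order4_plus() {
    # $1: file to read (defaults to /proc/buddyinfo)
    # Columns: "Node" "N," "zone" <name> <order 0> <order 1> ...
    # so the order-4 count is field 9.
    awk '{ n = 0; for (i = 9; i <= NF; i++) n += $i; print $4, n }' \
        "${1:-/proc/buddyinfo}"
}

# Example, on the host:
#   free_blocks_order4_plus
```

A zone that reports 0 while the guest is stuck would be consistent with the explanation above, though it does not prove it.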
> Because what I see is that indeed packets get dropped and then after a while
> communication totally stops and also does not recover anymore (unless I
> rmmod/modprobe virtio_net).
