Bug 16413
Summary: | Kernel BUG and guest hang on network activity | ||
---|---|---|---|
Product: | Networking | Reporter: | Alex Unigovsky (unik) |
Component: | Other | Assignee: | Michael S. Tsirkin (m.s.tsirkin) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alexander.h.duyck, herbert, jbrandeb, jeffrey.t.kirsher, m.s.tsirkin |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.34.1 (gentoo 2.6.34-r2) | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Full dmesg from host machine
lspci -vv from host machine
Kernel .config from host machine
Kernel .config with NETPOLL disabled
debugging patch for tun
Full dmesg with diagnostic patch applied
gro: Fix bogus gso_size on the first fraglist entry |
Created attachment 27139 [details]
lspci -vv from host machine
Created attachment 27140 [details]
Kernel .config from host machine
Looks like a bug in NETPOLL in the host bridge code. Please try disabling CONFIG_NET_POLL_CONTROLLER (and/or all of netpoll) and try to reproduce this. Thanks!

Nope, no luck. I'm reproducing this without NETPOLL:
aegis ~ # zcat /proc/config.gz |grep NET_POLL
# CONFIG_NET_POLL_CONTROLLER is not set
aegis ~ # zcat /proc/config.gz |grep NETPOLL
# CONFIG_NETPOLL is not set
I'm attaching new dmesg.

Further details: the SCP connection attempt is made from a host in the 192.168.2.0/24 subnet, which is routed on eth0 via OSPF (quagga). Initial SSH auth succeeds, and the guest always hangs only after the first file of the batch (scp) is transferred. That connection request passes NAT inside this box. This box also uses multiple routing tables (ip route ... table X, ip rule), in case this matters.

Created attachment 27149 [details]
Kernel .config with NETPOLL disabled
aegis ~ # zcat /proc/config.gz |grep NET_POLL
# CONFIG_NET_POLL_CONTROLLER is not set
aegis ~ # zcat /proc/config.gz |grep NETPOLL
# CONFIG_NETPOLL is not set
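The two greps above can be folded into one reusable check. The following is only a sketch: `kconfig_state` is a hypothetical helper name, and it assumes the config text uses the usual `CONFIG_X=value` and `# CONFIG_X is not set` line forms shown in the output above.

```shell
# Sketch only: report the state of one kernel option from config text on stdin.
# kconfig_state is a hypothetical helper, not an existing tool.
kconfig_state() {
    # $1 = option name, e.g. CONFIG_NETPOLL
    awk -v opt="$1" '
        # "CONFIG_X=value" line: print the value after the "="
        index($0, opt "=") == 1 { print substr($0, length(opt) + 2); found = 1; exit }
        # "# CONFIG_X is not set" line: option explicitly disabled
        $0 == ("# " opt " is not set") { print "not set"; found = 1; exit }
        # option not mentioned anywhere in the config
        END { if (!found) print "absent" }
    '
}

# Usage on a host that exposes /proc/config.gz:
#   zcat /proc/config.gz | kconfig_state CONFIG_NETPOLL
#   zcat /proc/config.gz | kconfig_state CONFIG_NET_POLL_CONTROLLER
```

Matching the whole option name avoids the need for separate NET_POLL and NETPOLL patterns.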
Bah, sorry. Previous attachment was dmesg, not config.

So if I understand it correctly, we have the packet with a wrong gso_type come in through igb_poll?

Sorry, assigned by mistake.

I see no definitive proof that it is coming from igb_poll. I think what would help is to print out some more info about the bogus packet. We should probably make that permanent as this isn't the first time we've had a crash at this spot and had to hunt for more information.

Created attachment 27152 [details]
debugging patch for tun
Please try this patch (it might need minor tweaks to apply to the gentoo kernel).
It replaces BUG with WARN and adds extra printouts.
Created attachment 27156 [details]
Full dmesg with diagnostic patch applied
With this patch, no processes or guests hang and traffic is undisturbed, but the WARNs still trigger.
Hoping that the attached log will help squash the bug. Thanks.
My guess would be that this is due to some sort of interaction with GRO, since the in-kernel igb driver doesn't support LRO and does not set gso_size or gso_type.

One thing you might try would be to disable GRO via "ethtool -K <ethX> gro off". If that resolves the issue, it is likely something in the way that igb is working with GRO to aggregate packets.

(In reply to comment #12)
> My guess would be that this is due to some sort of interaction with GRO since
> the in-kernel igb driver doesn't support LRO and does not set gso_size or
> gso_type.
>
> One thing you might try would be to disable GRO via "ethtool -K <ethX> gro
> off". If that resolves the issue it is likely something in the way that igb
> is
> working with GRO to aggregate packets.

Yes, indeed, switching GRO off on eth0 (incoming interface) made all the WARNs disappear.

Created attachment 27161 [details]
gro: Fix bogus gso_size on the first fraglist entry
Does this patch help?
(In reply to comment #14)
> gro: Fix bogus gso_size on the first fraglist entry
>
> Does this patch help?

Yes, it does. Now I can no longer get WARNs even with GRO enabled. Thank you very much.

I'm closing the bug. Is it ok? |
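For anyone repeating the GRO test from this thread on a kernel without the fix, a minimal sketch follows. `gro_state` is a hypothetical helper name, and it assumes `ethtool -k` prints a `generic-receive-offload: on|off` line. The interface name eth0 is taken from this report and will differ on other hosts.

```shell
# Sketch only: extract the GRO state from `ethtool -k <iface>` output on stdin.
# gro_state is a hypothetical helper name, not part of ethtool.
gro_state() {
    awk -F': *' '
        # tolerate leading whitespace before the feature name
        { key = $1; sub(/^[ \t]+/, "", key) }
        key == "generic-receive-offload" { print $2; exit }
    '
}

# Usage on a live host (changing the setting needs root):
#   ethtool -k eth0 | gro_state     # query the current state
#   ethtool -K eth0 gro off         # workaround used in this thread
#   ethtool -K eth0 gro on          # re-enable once the gro fix is applied
```

The lowercase `-k` only queries offload settings, while the uppercase `-K` changes them.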
Created attachment 27138 [details]
Full dmesg from host machine

The kernel produces a BUG(), hangs the qemu process and generally misbehaves on inbound scp traffic. Further attempts to connect to libvirtd fail. After SIGKILL'ing the hanging qemu and libvirtd processes, cleaning stale PID files etc., libvirt and its guests are able to start again, but they use new vnetX interfaces. The old vnetX interfaces just sit there. After downing them with ifconfig, an attempt to delete these interfaces with the command 'ip link delete vnet0' hangs, with the kernel spamming 'kernel:unregister_netdevice: waiting for vnet0 to become free. Usage count = 1' to syslog.

This BUG() and hang are reproducible 100% of the time.

Host configuration
------------------
x86_64, Gentoo
16-way (2x Xeon X5550)
Kernel 2.6.34.1 (gentoo 2.6.34-r2)

Partial list of interfaces:
* eth0 192.168.0.1/24
  SCP connection came from here.
* br0 188.114.196.59/29
  Guests join this bridge. Enslaved physical interfaces: vlan108, vnetX.
* vlan108
  Sits on top of eth1, using 802.1Q tagging. No IP address. Part of br0 bridge.
* eth1
  No IP address.

vhost_net loaded and used by TUN/TAP.
lspci -vv and full dmesg are attached.
Libvirt listens on eth0. Kerberos is used for auth.

Guest
-----
Fedora 13 x86_64
Uses 2 cores.
Uses 1536MB RAM.
32GB LVM volume as the only block device (provided by host).