Hello, I encountered a very weird bug: after a while of uptime kernel stops to deliver DNS reply to applications. Tcpdump shows that correct reply is recieved, but strace shows inquiring application never recieves it and ends with timeout, epoll_wait() always returns 0: a slice from: $ host kernel.org 8.8.8.8: sendmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\266\344\1\0\0\1\0\0\0\0\0\0\6k ernel\3org\0\0\1\0\1", 28}], msg_controllen=0, msg_flags=0}, 0) = 28 epoll_wait(3, {}, 64, 0) = 0 epoll_wait(3, {}, 64, 4999) = 0 Though tcpdump shows a normal reply: 20:28:44.162897 IP 10.7.74.7.43167 > 8.8.8.8.domain: 46820+ A? kernel.org. (28) 20:28:44.221308 IP 8.8.8.8.domain > 10.7.74.7.43167: 46820 1/0/0 A 149.20.4.69 (44) After this bug has occured, it is no longer possible to perform DNS request on the crippled system. I tried to stop/restart all network-related daemons, to recreate network interfaces whenever possible (e.g. pppX devices), but with no help. I use iptables and ebtables on this host, but reseting them (flushing all chains, removing user chains, setting all policies to ACCEPT) doesn't help. The only worknig solution is to reboot the system. This bug happens rarely and randomly (about once in 7-12 days on 24x7 available production system), but I had it 5 times already. Due to rare and random nature of the bug I can't bisect it. This problem occured after I updated vanilla kernel from 2.6.39.4 to 3.4.6. Afterward I updated kernel to 3.4.10 in the hope that this will fix the problem, but with no result. (I updated kernel due to commit 2ce42ec4ef551b08d2e5d26775d838ac640f82ad, which describes somewhat similar issue, though I don't use I/OAT engine due to lack of hardware support.) I run multiple hosts with somewhat similar kernel setups, though only this one runs unbound DNS Resolver daemon (as of now 1.4.18). Maybe this program somehow triggers a kernel bug being accumulated with time and resulting into unusable system, though I can't test for several weeks without unbound. Soon I'll attach the output of strace of host kernel.org 8.8.8.8 and dig @8.8.8.8 kernel.org commands after the bug occured and after the system reboot, together with tcpdumps and kernel configs for both working 2.6.39.4 and crippled 3.4.10 kernels.
Created attachment 81461 [details] bug.host.strace Output of: $ strace -f host kernel.org 8.8.8.8 when bug occurs Corresponding tcpdump output: 20:35:50.661470 IP (tos 0x0, ttl 49, id 46967, offset 0, flags [none], proto UDP (17), length 72) 8.8.8.8.domain > 10.7.74.7.55644: [udp sum ok] 63060 q: A? kernel.org. 1/0/0 kernel.org. A 149.20.4.69 (44) 20:35:55.608519 IP (tos 0x0, ttl 64, id 19471, offset 0, flags [none], proto UDP (17), length 56) 10.7.74.7.55644 > 8.8.8.8.domain: [udp sum ok] 63060+ A? kernel.org. (28) 20:35:55.666798 IP (tos 0x0, ttl 49, id 46971, offset 0, flags [none], proto UDP (17), length 72) 8.8.8.8.domain > 10.7.74.7.55644: [udp sum ok] 63060 q: A? kernel.org. 1/0/0 kernel.org. A 149.20.4.69 (44) 20:36:18.334447 IP (tos 0x0, ttl 64, id 19472, offset 0, flags [none], proto UDP (17), length 67) 10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39)
Created attachment 81471 [details] bug.dig.strace Output of: $ strace -f dig @8.8.8.8 kernel.org when bug occurs Corresponding tcpdump output: 20:36:18.392081 IP (tos 0x0, ttl 49, id 39834, offset 0, flags [none], proto UDP (17), length 83) 8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55) 20:36:23.339239 IP (tos 0x0, ttl 64, id 19473, offset 0, flags [none], proto UDP (17), length 67) 10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39) 20:36:23.397329 IP (tos 0x0, ttl 49, id 38051, offset 0, flags [none], proto UDP (17), length 83) 8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55) 20:36:28.345484 IP (tos 0x0, ttl 64, id 19474, offset 0, flags [none], proto UDP (17), length 67) 10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39) 20:36:28.403043 IP (tos 0x0, ttl 49, id 49982, offset 0, flags [none], proto UDP (17), length 83) 8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55)
Created attachment 81481 [details] good.host.strace Output of: $ strace -f host kernel.org 8.8.8.8 after system was rebooted.
Created attachment 81491 [details] good.dig.strace Output of: $ strace -f dig @8.8.8.8 kernel.org after system was rebooted.
Created attachment 81501 [details] .config-2.6.39.4 Old working kernel.
Created attachment 81511 [details] .config-3.4.10 Currently used buggy kernel.
Created attachment 81521 [details] sysctl.conf
Kernel cmdline options are the same for both kernels: max_loop=32 max_part=4 nf_conntrack_ftp.ports=21,2121 If you need some additional information, let me know.
Please can you post a summary of the problem and a pointer to the bug to netdev@vger.kernel.org. That will then reach the folks who might have a good idea about it. We only really use bugzilla for tracking bugs not fixing them
Hi, I've mailed this issue to netdev list. Thanks.
Was fixed in February: http://lists.openwall.net/netdev/2013/02/08/1