Bug 48081 - Kernel recieves DNS reply, but doesn't deliver it to a waiting application
Summary: Kernel recieves DNS reply, but doesn't deliver it to a waiting application
Status: RESOLVED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4 (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-09-29 18:05 UTC by Andrew Savchenko
Modified: 2013-08-12 22:17 UTC (History)
1 user (show)

See Also:
Kernel Version: 3.4.10
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
bug.host.strace (15.39 KB, application/octet-stream)
2012-09-29 18:23 UTC, Andrew Savchenko
Details
bug.dig.strace (15.88 KB, application/octet-stream)
2012-09-29 18:27 UTC, Andrew Savchenko
Details
good.host.strace (18.75 KB, application/octet-stream)
2012-09-29 18:30 UTC, Andrew Savchenko
Details
good.dig.strace (16.71 KB, application/octet-stream)
2012-09-29 18:31 UTC, Andrew Savchenko
Details
.config-2.6.39.4 (58.56 KB, application/octet-stream)
2012-09-29 18:34 UTC, Andrew Savchenko
Details
.config-3.4.10 (64.91 KB, application/octet-stream)
2012-09-29 18:35 UTC, Andrew Savchenko
Details
sysctl.conf (461 bytes, application/octet-stream)
2012-09-29 18:35 UTC, Andrew Savchenko
Details

Description Andrew Savchenko 2012-09-29 18:05:28 UTC
Hello,

I encountered a very weird bug: after a while of uptime kernel stops to deliver DNS reply to applications. Tcpdump shows that correct reply is recieved, but  strace shows inquiring application never recieves it and ends with timeout, epoll_wait() always returns 0:
a slice from: $ host kernel.org 8.8.8.8:

sendmsg(20, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("8.8.8.8")}, msg_iov(1)=[{"\266\344\1\0\0\1\0\0\0\0\0\0\6k
ernel\3org\0\0\1\0\1", 28}], msg_controllen=0, msg_flags=0}, 0) = 28                                                                           
epoll_wait(3, {}, 64, 0)                = 0                                                                                                    
epoll_wait(3, {}, 64, 4999)             = 0

Though tcpdump shows a normal reply:

20:28:44.162897 IP 10.7.74.7.43167 > 8.8.8.8.domain: 46820+ A? kernel.org. (28)                                                                
20:28:44.221308 IP 8.8.8.8.domain > 10.7.74.7.43167: 46820 1/0/0 A 149.20.4.69 (44)

After this bug has occured, it is no longer possible to perform DNS request on the crippled system. I tried to stop/restart all network-related daemons, to recreate network interfaces whenever possible (e.g. pppX devices), but with no help. I use iptables and ebtables on this host, but reseting them (flushing all chains, removing user chains, setting all policies to ACCEPT) doesn't help. The only worknig solution is to reboot the system.

This bug happens rarely and randomly (about once in 7-12 days on 24x7 available production system), but I had it 5 times already. Due to rare and random nature of the bug I can't bisect it.

This problem occured after I updated vanilla kernel from 2.6.39.4 to 3.4.6. Afterward I updated kernel to 3.4.10 in the hope that this will fix the problem, but with no result. (I updated kernel due to commit 2ce42ec4ef551b08d2e5d26775d838ac640f82ad, which describes somewhat similar issue, though I don't use I/OAT engine due to lack of hardware support.)

I run multiple hosts with somewhat similar kernel setups, though only this one runs unbound DNS Resolver daemon (as of now 1.4.18). Maybe this program somehow triggers a kernel bug being accumulated with time and resulting into unusable system, though I can't test for several weeks without unbound.

Soon I'll attach the output of strace of
host kernel.org 8.8.8.8
and 
dig @8.8.8.8 kernel.org
commands after the bug occured and after the system reboot, together with tcpdumps and kernel configs for both working 2.6.39.4 and crippled 3.4.10 kernels.
Comment 1 Andrew Savchenko 2012-09-29 18:23:38 UTC
Created attachment 81461 [details]
bug.host.strace

Output of:
$ strace -f host kernel.org 8.8.8.8
when bug occurs

Corresponding tcpdump output:

20:35:50.661470 IP (tos 0x0, ttl 49, id 46967, offset 0, flags [none], proto UDP (17), length 72)                                              
    8.8.8.8.domain > 10.7.74.7.55644: [udp sum ok] 63060 q: A? kernel.org. 1/0/0 kernel.org. A 149.20.4.69 (44)                                
20:35:55.608519 IP (tos 0x0, ttl 64, id 19471, offset 0, flags [none], proto UDP (17), length 56)                                              
    10.7.74.7.55644 > 8.8.8.8.domain: [udp sum ok] 63060+ A? kernel.org. (28)                                                                  
20:35:55.666798 IP (tos 0x0, ttl 49, id 46971, offset 0, flags [none], proto UDP (17), length 72)                                              
    8.8.8.8.domain > 10.7.74.7.55644: [udp sum ok] 63060 q: A? kernel.org. 1/0/0 kernel.org. A 149.20.4.69 (44)                                
20:36:18.334447 IP (tos 0x0, ttl 64, id 19472, offset 0, flags [none], proto UDP (17), length 67)                                              
    10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39)
Comment 2 Andrew Savchenko 2012-09-29 18:27:34 UTC
Created attachment 81471 [details]
bug.dig.strace

Output of:
$ strace -f dig @8.8.8.8 kernel.org
when bug occurs

Corresponding tcpdump output:

20:36:18.392081 IP (tos 0x0, ttl 49, id 39834, offset 0, flags [none], proto UDP (17), length 83)                                              
    8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55)          
20:36:23.339239 IP (tos 0x0, ttl 64, id 19473, offset 0, flags [none], proto UDP (17), length 67)                                              
    10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39)                                     
20:36:23.397329 IP (tos 0x0, ttl 49, id 38051, offset 0, flags [none], proto UDP (17), length 83)                                              
    8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55)          
20:36:28.345484 IP (tos 0x0, ttl 64, id 19474, offset 0, flags [none], proto UDP (17), length 67)                                              
    10.7.74.7.40416 > 8.8.8.8.domain: [udp sum ok] 59770+ [1au] A? kernel.org. ar: . OPT UDPsize=4096 (39)                                     
20:36:28.403043 IP (tos 0x0, ttl 49, id 49982, offset 0, flags [none], proto UDP (17), length 83)                                              
    8.8.8.8.domain > 10.7.74.7.40416: [udp sum ok] 59770 q: A? kernel.org. 1/0/1 kernel.org. A 149.20.4.69 ar: . OPT UDPsize=512 (55)
Comment 3 Andrew Savchenko 2012-09-29 18:30:30 UTC
Created attachment 81481 [details]
good.host.strace

Output of:
$ strace -f host kernel.org 8.8.8.8
after system was rebooted.
Comment 4 Andrew Savchenko 2012-09-29 18:31:19 UTC
Created attachment 81491 [details]
good.dig.strace

Output of:
$ strace -f dig @8.8.8.8 kernel.org
after system was rebooted.
Comment 5 Andrew Savchenko 2012-09-29 18:34:45 UTC
Created attachment 81501 [details]
.config-2.6.39.4

Old working kernel.
Comment 6 Andrew Savchenko 2012-09-29 18:35:25 UTC
Created attachment 81511 [details]
.config-3.4.10

Currently used buggy kernel.
Comment 7 Andrew Savchenko 2012-09-29 18:35:47 UTC
Created attachment 81521 [details]
sysctl.conf
Comment 8 Andrew Savchenko 2012-09-29 18:38:14 UTC
Kernel cmdline options are the same for both kernels:
max_loop=32 max_part=4 nf_conntrack_ftp.ports=21,2121

If you need some additional information, let me know.
Comment 9 Alan 2012-09-30 13:33:21 UTC
Please can you post a summary of the problem and a pointer to the bug to netdev@vger.kernel.org. That will then reach the folks who might have a good idea about it.

We only really use bugzilla for tracking bugs not fixing them
Comment 10 Andrew Savchenko 2012-10-03 19:30:08 UTC
Hi,

I've mailed this issue to netdev list.

Thanks.
Comment 11 Andrew Savchenko 2013-08-12 22:17:51 UTC
Was fixed in February: http://lists.openwall.net/netdev/2013/02/08/1

Note You need to log in before you can comment on or make changes to this bug.