Distribution:Debian Hardware Environment:DELL 1850 two xeon 2.8Ghz CPU, 4G RAM, disabled Hyper- threading Software Environment:sarge 3.1 using the office kernel image 2.6.8-2-686-smp Problem Description: I use it running an udp service, using nonblock udp socket. Sometimes I found it use too much cpu. so I use strace. I found recvfrom would got an error return when select got a good return. ... recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 34901}) = 1 (in [5], left {0, 35000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 34668}) = 1 (in [5], left {0, 35000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 34454}) = 1 (in [5], left {0, 35000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 34239}) = 1 (in [5], left {0, 35000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 34023}) = 1 (in [5], left {0, 35000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 33811}) = 1 (in [5], left {0, 34000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 33598}) = 1 (in [5], left {0, 34000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 33386}) = 1 (in [5], left {0, 34000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 33172}) = 1 (in [5], left {0, 34000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 32956}) = 1 (in [5], left {0, 33000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 32743}) = 1 (in [5], left {0, 33000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 32531}) = 1 (in [5], left {0, 33000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 32317}) = 1 (in [5], left {0, 33000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 32105}) = 1 (in [5], left {0, 33000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 31889}) = 1 (in [5], left {0, 32000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 31677}) = 1 (in [5], left {0, 32000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 31464}) = 1 (in [5], left {0, 32000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 31250}) = 1 (in [5], left {0, 32000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 31034}) = 1 (in [5], left {0, 32000}) recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource temporarily unavailable) select(7, [3 4 5 6], NULL, NULL, {0, 30822}) = 1 (in [5], left {0, 31000}) ....
This bug has already been fixed probably in 2.6.10. The problem was if a packet with a bad checksum was received, it could get queued and wake up select but never get received. The patch introduced new code (udp_poll) in net/ipv4/udp.c
after upgrade to the debian kernel 2.6.12-1-686-smp the bug still here, and I got the same system call trace.
This is the trace with count: strace -c -p 6301 Process 6301 attached - interrupt to quit Process 6301 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 71.14 2.209248 10 221775 219851 recvfrom 28.52 0.885781 8 110194 select 0.18 0.005714 14 399 sendto 0.08 0.002361 39 60 read 0.03 0.000992 33 30 open 0.01 0.000378 13 30 munmap 0.01 0.000331 11 30 close 0.01 0.000326 11 30 mmap2 0.01 0.000181 6 30 fstat64 0.01 0.000170 6 30 getsockname ------ ----------- ----------- --------- --------- ---------------- 100.00 3.105482 332608 219851 total And this is my /etc/sysctl.conf kernel.pid_max=65536 net.ipv4.tcp_max_syn_backlog=4096 net.ipv4.icmp_echo_ignore_broadcasts=1 net.ipv4.tcp_tw_recycle=1 net.ipv4.tcp_tw_reuse=1 net.core.rmem_max=16777216 net.core.wmem_max=1048576 net.core.wmem_default=1048576 kernel.core_pattern=core.%e.%h.%p
Is this a dup (or related to Bug 5505)?
Created attachment 6454 [details] the trace log of the good status this is the normal syscall strace, and I attached the good trace log. Process 8187 attached - interrupt to quit Process 8187 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 84.89 0.039607 61 652 select 9.64 0.004496 5 992 652 recvfrom 2.59 0.001210 38 32 read 1.21 0.000564 35 16 open 0.42 0.000194 12 16 munmap 0.36 0.000166 10 16 mmap2 0.32 0.000151 9 16 close 0.23 0.000109 14 8 sendto 0.18 0.000082 5 16 fstat64 0.17 0.000077 5 16 getsockname ------ ----------- ----------- --------- --------- ---------------- 100.00 0.046656 1780 652 total
Created attachment 6455 [details] the trace log of the bad status the problem usually occur when the problem running more than 1 day. the best record is after running 7 daya. if I restart the program, the problem will disappear. sometimes the problem would disappear without restart the program, and the problem can occur when the program near 100% idle( I means there is very little packet come in). I also use the tcpdump to check what kind of the packet can make this problem. But I found there is almost the same with the normal status. my program idea is try to select udp packet, and then recvfrom it, and loop, if the kernel give me the wrong return code when do a select, then it will make it use too much cpu. Is it any help that I attache the tcpdump log here?
Is this issue still present in kernel 2.6.16.2?
Please reopen this bug if it's still present in kernel 2.6.17.