Bug 5498 - nonblock udp socket recvfrom got an error when select got a good return
Summary: nonblock udp socket recvfrom got an error when select got a good return
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4 (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-10-26 04:37 UTC by kyo guan
Modified: 2006-07-10 13:25 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.8
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
the trace log of the good status (25.59 KB, application/x-gzip-compressed)
2005-11-02 20:05 UTC, kyo guan
Details
the trace log of the bad status (58.56 KB, application/x-gzip-compressed)
2005-11-02 20:25 UTC, kyo guan
Details

Description kyo guan 2005-10-26 04:37:26 UTC
Distribution:Debian
Hardware Environment:DELL 1850 two xeon 2.8Ghz CPU, 4G RAM, disabled Hyper-
threading 
Software Environment:sarge 3.1 using the office kernel image 2.6.8-2-686-smp
Problem Description:


I use it running an udp service, using nonblock udp socket. Sometimes I found
it use too much cpu. so I use strace. I found recvfrom would got an error return
when select got a good return. 

...
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 34901}) = 1 (in [5], left {0, 35000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 34668}) = 1 (in [5], left {0, 35000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 34454}) = 1 (in [5], left {0, 35000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 34239}) = 1 (in [5], left {0, 35000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 34023}) = 1 (in [5], left {0, 35000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 33811}) = 1 (in [5], left {0, 34000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 33598}) = 1 (in [5], left {0, 34000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 33386}) = 1 (in [5], left {0, 34000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 33172}) = 1 (in [5], left {0, 34000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 32956}) = 1 (in [5], left {0, 33000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 32743}) = 1 (in [5], left {0, 33000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 32531}) = 1 (in [5], left {0, 33000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 32317}) = 1 (in [5], left {0, 33000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 32105}) = 1 (in [5], left {0, 33000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 31889}) = 1 (in [5], left {0, 32000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 31677}) = 1 (in [5], left {0, 32000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 31464}) = 1 (in [5], left {0, 32000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 31250}) = 1 (in [5], left {0, 32000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 31034}) = 1 (in [5], left {0, 32000})
recvfrom(5, 0xb247bb0, 1304, 0, 0xbfffa200, 0xbfffa1ec) = -1 EAGAIN (Resource 
temporarily unavailable)
select(7, [3 4 5 6], NULL, NULL, {0, 30822}) = 1 (in [5], left {0, 31000})
....
Comment 1 Stephen Hemminger 2005-10-26 09:54:21 UTC
This bug has already been fixed probably in 2.6.10.
The problem was if a packet with a bad checksum was received, it could get queued
and wake up select but never get received.

The patch introduced new code (udp_poll) in net/ipv4/udp.c
Comment 2 kyo guan 2005-11-01 20:29:31 UTC
after upgrade to the debian kernel 2.6.12-1-686-smp
the bug still here, and I got the same system call trace.
Comment 3 kyo guan 2005-11-01 20:59:39 UTC
This is the trace with count:

strace -c -p 6301
Process 6301 attached - interrupt to quit
Process 6301 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.14    2.209248          10    221775    219851 recvfrom
 28.52    0.885781           8    110194           select
  0.18    0.005714          14       399           sendto
  0.08    0.002361          39        60           read
  0.03    0.000992          33        30           open
  0.01    0.000378          13        30           munmap
  0.01    0.000331          11        30           close
  0.01    0.000326          11        30           mmap2
  0.01    0.000181           6        30           fstat64
  0.01    0.000170           6        30           getsockname
------ ----------- ----------- --------- --------- ----------------
100.00    3.105482                332608    219851 total

And this is my /etc/sysctl.conf

kernel.pid_max=65536
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.icmp_echo_ignore_broadcasts=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1

net.core.rmem_max=16777216
net.core.wmem_max=1048576
net.core.wmem_default=1048576
kernel.core_pattern=core.%e.%h.%p
Comment 4 Stephen Hemminger 2005-11-02 08:55:45 UTC
Is this a dup (or related to Bug 5505)?
Comment 5 kyo guan 2005-11-02 20:05:30 UTC
Created attachment 6454 [details]
the trace log of the good status

this is the normal syscall strace, and I attached the good trace log.

Process 8187 attached - interrupt to quit
Process 8187 detached
% time	   seconds  usecs/call	   calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 84.89	  0.039607	    61	     652	   select
  9.64	  0.004496	     5	     992       652 recvfrom
  2.59	  0.001210	    38	      32	   read
  1.21	  0.000564	    35	      16	   open
  0.42	  0.000194	    12	      16	   munmap
  0.36	  0.000166	    10	      16	   mmap2
  0.32	  0.000151	     9	      16	   close
  0.23	  0.000109	    14	       8	   sendto
  0.18	  0.000082	     5	      16	   fstat64
  0.17	  0.000077	     5	      16	   getsockname
------ ----------- ----------- --------- --------- ----------------
100.00	  0.046656		    1780       652 total
Comment 6 kyo guan 2005-11-02 20:25:05 UTC
Created attachment 6455 [details]
the trace log of the bad status

the problem usually occur when the problem running more than 1 day.
the best record is after running 7 daya. if I restart the program,
the problem will disappear. sometimes the problem would disappear without
restart the program, and the problem can occur when the program
near 100% idle( I means there is very little packet come in). I also use
the tcpdump to check what kind of the packet can make this problem. But
I found there is almost the same with the normal status. my program idea
is try to select udp packet, and then recvfrom it, and loop, if the kernel
give me the wrong return code when do a select, then it will make it use too
much cpu. Is it any help that I attache the tcpdump log here?
Comment 7 Adrian Bunk 2006-04-10 16:34:21 UTC
Is this issue still present in kernel 2.6.16.2?
Comment 8 Adrian Bunk 2006-07-10 13:25:26 UTC
Please reopen this bug if it's still present in kernel 2.6.17.

Note You need to log in before you can comment on or make changes to this bug.