Bug 8536 - Kernel drops UDP packets silently when reading from certain proc file entries
Summary: Kernel drops UDP packets silently when reading from certain proc file entries
Status: CLOSED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 high
Assignee: Arnaldo Carvalho de Melo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-05-24 12:51 UTC by andsve
Modified: 2008-09-26 05:07 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.x
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description andsve 2007-05-24 12:51:23 UTC
Most recent kernel where this bug did *NOT* occur:
I do not know, but I know that it exists in RHEL4 2.6.9.x kernels
Distribution:
All
Hardware Environment:
Multi core SMP
Software Environment:
All
Problem Description:
It is possible to introduce UDP packet loss by reading
the proc file entry /proc/net/tcp. The really strange thing is that
the error counters for packet drops are not increased.
This means that the kernel introduces "silent" packet drops just from reading a
proc statistics entry, which is not a good thing! It can most probably be used
for denial-of-service attacks by non-root users.

When looking at the network code it does not seem possible that silent packet
drops can occur, so it is probably a quite nasty kernel bug.
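
The counters referred to above are the per-protocol statistics in
/proc/net/snmp (the same numbers that netstat -su reports). A minimal
sketch of dumping the UDP rows, so the reader can watch them stay flat
while packets disappear:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[512];
        FILE *f = fopen("/proc/net/snmp", "r");

        if (!f) {
                perror("fopen /proc/net/snmp");
                return 1;
        }
        /* Each protocol appears on two lines: a header naming the
         * fields, then a line carrying the current values. */
        while (fgets(line, sizeof(line), f))
                if (strncmp(line, "Udp:", 4) == 0)
                        fputs(line, stdout);
        fclose(f);
        return 0;
}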


Steps to reproduce:

* Send high-speed RTP/UDP multicast traffic towards the system, 50 Mbit/s.

* Receive the RTP packets and check/validate the RTP counters, printing a
message whenever the counter is not continuous.

* In a while loop, cat /proc/net/tcp and watch packets being dropped but not
accounted for in the counter statistics (a sketch of such a reader loop
follows below).

I have reproduced this behavior on all our systems, ranging from dual- to
quad-core Xeon and Opteron machines, and also on different OS releases: RHEL4,
RHEL5, Fedora Core 5 and 6.
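
A sketch of the trigger loop from the last step, equivalent to
"while true; do cat /proc/net/tcp > /dev/null; done" (the RTP send and
receive sides are assumed to be covered by existing tooling):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        char buf[4096];

        for (;;) {
                FILE *f = fopen("/proc/net/tcp", "r");

                if (!f) {
                        perror("fopen /proc/net/tcp");
                        return EXIT_FAILURE;
                }
                /* Drain the whole table; on unpatched kernels each
                 * read walks the hash buckets with BH disabled. */
                while (fread(buf, 1, sizeof(buf), f) > 0)
                        ;
                fclose(f);
        }
}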
Comment 1 Herbert Xu 2007-05-24 23:27:11 UTC
Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> It is possible to introduce UDP packet losses by reading
>> the proc file entry /proc/net/tcp. The really strange thing is that
>> the error counters for packet drops are not increased. 

Please try this patch and let us know if it helps.

[TCPv4]: Improve BH latency in /proc/net/tcp

Currently the code for /proc/net/tcp disables BH while iterating
over the entire established hash table.  Even though we call
cond_resched_softirq for each entry, we still won't process
softirqs as regularly as we would otherwise do, which results
in poor performance when the system is loaded near capacity.

This anomaly comes from the 2.4 code where this was all in a
single function and the local_bh_disable might have made sense
as a small optimisation.

The cost of each local_bh_disable is so small when compared
against the increased latency in keeping it disabled over a
large but mostly empty TCP established hash table that we
should just move it to the individual read_lock/read_unlock
calls as we do in inet_diag.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a3e7f8..9dab06d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2039,10 +2039,7 @@ static void *established_get_first(struct seq_file *seq)
 		struct hlist_node *node;
 		struct inet_timewait_sock *tw;
 
-		/* We can reschedule _before_ having picked the target: */
-		cond_resched_softirq();
-
-		read_lock(&tcp_hashinfo.ehash[st->bucket].lock);
+		read_lock_bh(&tcp_hashinfo.ehash[st->bucket].lock);
 		sk_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
 			if (sk->sk_family != st->family) {
 				continue;
@@ -2059,7 +2056,7 @@ static void *established_get_first(struct seq_file *seq)
 			rc = tw;
 			goto out;
 		}
-		read_unlock(&tcp_hashinfo.ehash[st->bucket].lock);
+		read_unlock_bh(&tcp_hashinfo.ehash[st->bucket].lock);
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 	}
 out:
@@ -2086,14 +2083,11 @@ get_tw:
 			cur = tw;
 			goto out;
 		}
-		read_unlock(&tcp_hashinfo.ehash[st->bucket].lock);
+		read_unlock_bh(&tcp_hashinfo.ehash[st->bucket].lock);
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 
-		/* We can reschedule between buckets: */
-		cond_resched_softirq();
-
 		if (++st->bucket < tcp_hashinfo.ehash_size) {
-			read_lock(&tcp_hashinfo.ehash[st->bucket].lock);
+			read_lock_bh(&tcp_hashinfo.ehash[st->bucket].lock);
 			sk = sk_head(&tcp_hashinfo.ehash[st->bucket].chain);
 		} else {
 			cur = NULL;
@@ -2138,7 +2132,6 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 
 	if (!rc) {
 		inet_listen_unlock(&tcp_hashinfo);
-		local_bh_disable();
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
 		rc	  = established_get_idx(seq, pos);
 	}
@@ -2171,7 +2164,6 @@ static void *tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		rc = listening_get_next(seq, v);
 		if (!rc) {
 			inet_listen_unlock(&tcp_hashinfo);
-			local_bh_disable();
 			st->state = TCP_SEQ_STATE_ESTABLISHED;
 			rc	  = established_get_first(seq);
 		}
@@ -2203,8 +2195,7 @@ static void tcp_seq_stop(struct seq_file *seq, void *v)
 	case TCP_SEQ_STATE_TIME_WAIT:
 	case TCP_SEQ_STATE_ESTABLISHED:
 		if (v)
-			read_unlock(&tcp_hashinfo.ehash[st->bucket].lock);
-		local_bh_enable();
+			read_unlock_bh(&tcp_hashinfo.ehash[st->bucket].lock);
 		break;
 	}
 }

Comment 2 Anonymous Emailer 2007-05-24 23:53:35 UTC
Reply-To: dada1@cosmosbay.com

Herbert Xu wrote:

If this patch really helps, this means cond_resched_softirq()
doesn't work at all and should be fixed, or just zapped as it
is seldom used.
Comment 3 Herbert Xu 2007-05-25 00:00:28 UTC
On Fri, May 25, 2007 at 08:50:20AM +0200, Eric Dumazet wrote:
>
> If this patch really helps, this means cond_resched_softirq()
> doesn't work at all and should be fixed, or just zapped as it
> is seldom used.

cond_resched_softirq lets other threads run if they want to.
It doesn't run pending softirqs at all.  In fact, it doesn't
even wake up ksoftirqd.

So if the only work we get comes from softirqs then we'll just
block them until we're done with /proc/net/tcp.

You can (correctly) argue that cond_resched_softirq is broken,
but it doesn't change the fact that we don't even need to call
it for /proc/net/tcp.

This patch simply changes /proc/net/tcp to be in line with the
behaviour of inet_diag.

Cheers,
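
For context, this is roughly what cond_resched_softirq() looked like in
2.6-era kernels (paraphrased from kernel/sched.c; exact details vary by
version):

/* Paraphrased sketch, not verbatim kernel source.  Note it is a
 * no-op unless a reschedule is pending: it never drains softirq
 * work on its own and never wakes ksoftirqd, which is the
 * behaviour described in the comment above. */
int __sched cond_resched_softirq(void)
{
        BUG_ON(!in_softirq());

        if (need_resched() && system_state == SYSTEM_RUNNING) {
                local_bh_enable();      /* briefly allow BH processing */
                __cond_resched();       /* yield to the waiting thread */
                local_bh_disable();     /* restore caller's BH-off state */
                return 1;
        }
        return 0;
}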
Comment 4 andsve 2007-05-25 00:07:35 UTC
I encountered this strange behavior when running "vacuum analyze" on a postgres
database, and I noticed UDP packet drops that were not accounted for in the
statistics.

My first idea was that there was some issue with high load on the SATA disk
that triggered this behavior, but I tested running other disk-intensive
operations and those did not have any effect.

After some troubleshooting and searching on the web I noticed that access to
the /proc/net/tcp entry did indeed affect the packet drops. I also noticed that
if I set the kernel boot argument thash_entries to a low value, 10, I could not
trigger the packet loss problem just by cat'ing the /proc/net/tcp file.

However, this setting did not help with the issues when running vacuum, so
there must be more of the same problem in the kernel. I must say that it does
not feel comfortable that one can trigger this kind of error just by walking
the proc file system, so it would be great to find the real cause of this.
Comment 5 Anonymous Emailer 2007-05-25 00:18:27 UTC
Reply-To: dada1@cosmosbay.com

Herbert Xu wrote:

I am very glad you fixed /proc/net/tcp, but I would like to
understand why this cond_resched_softirq() even exists.
Its name and behavior don't match at all.
Comment 6 Herbert Xu 2007-05-25 00:20:56 UTC
On Fri, May 25, 2007 at 09:15:17AM +0200, Eric Dumazet wrote:
> 
> I am very glad you fixed /proc/net/tcp, but I would like to
> understand why this cond_resched_softirq() even exists.

Well presumably it lets other threads have a chance to run in
a BH-disabled section.

> Its name and behavior don't match at all.

But yes it probably makes sense for it to process some softirq
work as well.  Ingo?

Cheers,
Comment 7 andsve 2007-05-25 00:35:34 UTC
Even if there are some issues with softirq handling, how come the packets are
dropped completely silently?

The error counters on the NICs are not increased, so the packets must have
entered the netdev queues?
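
For anyone chasing this further: drops at the per-CPU input backlog are
counted in /proc/net/softnet_stat (second hex column) rather than in the
NIC or UDP statistics, so that is one more place to look. A small sketch,
assuming the 2.6 column layout; whether the drops in this report actually
happen there is not confirmed in the thread:

#include <stdio.h>

int main(void)
{
        char line[256];
        unsigned int total, dropped;
        int cpu = 0;
        FILE *f = fopen("/proc/net/softnet_stat", "r");

        if (!f) {
                perror("fopen /proc/net/softnet_stat");
                return 1;
        }
        /* One row per CPU; the first two hex fields are packets
         * processed and packets dropped at the input backlog. */
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%x %x", &total, &dropped) == 2)
                        printf("cpu%d: processed=%u dropped=%u\n",
                               cpu, total, dropped);
                cpu++;
        }
        fclose(f);
        return 0;
}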
 
