Bug 52991 - Race condition in netfilter connection tracking can lead to erroneous DROPs
Summary: Race condition in netfilter connection tracking can lead to erroneous DROPs
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Netfilter/Iptables
Hardware: All Linux
Importance: P1 normal
Assignee: networking_netfilter-iptables@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-01-24 20:49 UTC by Brian Conry
Modified: 2014-03-14 18:10 UTC

See Also:
Kernel Version: 3.5.0+ (earlier versions untested)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
A small program that can demonstrate the race condition (6.28 KB, text/plain)
2013-01-24 20:49 UTC, Brian Conry
The kernel configuration file I used to repro with 3.8.0-rc4 (79.80 KB, text/plain)
2013-01-24 20:52 UTC, Brian Conry

Description Brian Conry 2013-01-24 20:49:49 UTC
Created attachment 91761
A small program that can demonstrate the race condition

If multiple threads simultaneously attempt a "first send" on a newly created outbound IPv4 UDP socket, all but one of them can be incorrectly DROPped by the netfilter conntrack logic.  This is due to the mistaken assumption that a new entry in the connection hash table is the result of a NAT rule and therefore has higher priority.  What actually happens is that the send on behalf of one thread completes the connection tracking logic and has an entry created for it in the table, causing the sends on behalf of the other threads to be DROPped.

I have determined that the DROP originates in net/netfilter/nf_conntrack_core.c:__nf_conntrack_confirm().

Extract from net/netfilter/nf_conntrack_core.c:

	/* See if there's one in the list already, including reverse:
	   NAT could have grabbed it without realizing, since we're
	   not in the hash.  If there is, we lost race. */
	hlist_nulls_for_each_entry(h, n, &net->ct.hash[hash], hnnode)
		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple,
				      &h->tuple) &&
		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
			goto out;
	hlist_nulls_for_each_entry(h, n, &net->ct.hash[repl_hash], hnnode)
		if (nf_ct_tuple_equal(&ct->tuplehash[IP_CT_DIR_REPLY].tuple,
				      &h->tuple) &&
		    zone == nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)))
			goto out;


The DROP caused by this logic surfaces in userspace as a failed sendmsg(2) call with errno set to EPERM.


It has been observed in the wild affecting BIND DNS servers on Red Hat 6.1 (2.6.32-131.6.1.el6.x86_64), Red Hat 6.2 (2.6.32-220.4.1.el6.x86_64), and Debian Squeeze (3.2.0-0.bpo.1-amd64).

It has been reproduced outside of BIND on Gentoo with both the Gentoo sources (3.5.7-gentoo) and with the mainline kernel (3.5.0, 3.5.7, and multiple other versions, as selected by "git bisect" up through 3.8.0-rc4 at commit 903ab86d195cca295379699299c5fc10beba31c7).

Some kernel builds seem to be immune to this issue for no known reason.  Red Hat 6.3 (2.6.32-279.19.1.el6.x86_64) is the only one currently known to be in this category.

Due to the timing constraints in this race condition, it may not be possible to reproduce it on a VM.

I have done no testing to determine when this behavior was introduced.


The symptoms within BIND are that slave servers will log messages of the form:

> general: info: zone zone.example/IN: refresh: failure trying master
>  192.168.7.27#53 (source 0.0.0.0#0): operation canceled

and may have difficulty transferring zones from their master(s).


Work-arounds include:
* unloading the netfilter nf_conntrack kernel modules
* writing iptables rules that exempt the affected packet flows from connection tracking (see the example rules below).
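
For example, assuming the affected flow is DNS on UDP port 53 (a placeholder; match the real flows instead), rules in the raw table can exempt it from conntrack:

# raw-table rules are evaluated before connection tracking;
# NOTRACK prevents conntrack entries from being created at all.
iptables -t raw -A OUTPUT     -p udp --dport 53 -j NOTRACK
iptables -t raw -A PREROUTING -p udp --sport 53 -j NOTRACK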

ISC may also patch BIND to work around this bug.


Notes on the attached program:
* It requires libpthread to link.
* It requires a port number as its first argument; this port will be used for the test on 127.0.0.1.
* It accepts an optional number of workers to spawn, with a default of 4.  Matching the number of cores will probably give the best results.
* It performs better (i.e. reproduces the error more consistently) on lightly loaded systems (e.g. on the console without X or any other major services running).
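
For reference, here is a minimal sketch of the kind of program described above (illustrative only; the actual attachment is authoritative, and the name send_worker is mine).  All workers share one freshly created UDP socket and release their first sends simultaneously, so the first packets of the flow race through conntrack together; on an affected kernel, some of the sendto(2) calls fail with EPERM.

/* Illustrative sketch of the attached reproducer (not the attachment
 * itself).  Build: cc -o repro repro.c -lpthread
 * Usage: ./repro <port> [nworkers]
 */
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int sock;                   /* shared, newly created UDP socket */
static struct sockaddr_in dst;     /* 127.0.0.1:<port> */
static pthread_barrier_t barrier;  /* lines the workers up */

static void *send_worker(void *arg)
{
	(void)arg;

	/* Wait until every worker is ready, then all send at once so the
	 * first packets of the flow race through conntrack together. */
	pthread_barrier_wait(&barrier);
	if (sendto(sock, "x", 1, 0,
		   (struct sockaddr *)&dst, sizeof(dst)) < 0 &&
	    errno == EPERM)
		printf("sendto: EPERM (lost the conntrack race)\n");
	return NULL;
}

int main(int argc, char **argv)
{
	int i, nworkers = (argc > 2) ? atoi(argv[2]) : 4;
	pthread_t tid[64];

	if (argc < 2 || nworkers < 1 || nworkers > 64) {
		fprintf(stderr, "usage: %s <port> [nworkers]\n", argv[0]);
		return 1;
	}

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons((unsigned short)atoi(argv[1]));
	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	/* The socket must be fresh: the race exists only on the first
	 * sends, before conntrack has confirmed an entry for the flow. */
	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("socket");
		return 1;
	}

	pthread_barrier_init(&barrier, NULL, nworkers);
	for (i = 0; i < nworkers; i++)
		pthread_create(&tid[i], NULL, send_worker, NULL);
	for (i = 0; i < nworkers; i++)
		pthread_join(tid[i], NULL);

	close(sock);
	return 0;
}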
Comment 1 Brian Conry 2013-01-24 20:52:01 UTC
Created attachment 91771
The kernel configuration file I used to repro with 3.8.0-rc4
Comment 2 Patrick McHardy 2013-01-25 21:20:46 UTC
This is mainly done for stateful protocols like TCP, where two simultaneous new connections with the same identity can't be handled properly due to different ISNs etc.  In the case of UDP we might be able to associate the second packet with the first connection if the NAT mappings, helpers, etc. all match.  Will try to look into this over the weekend.
Comment 3 Brian Conry 2014-03-14 18:10:18 UTC
In the months since I first submitted this ticket, we (ISC) developed a patch intended to work around this issue by allowing a retry after receiving an EPERM from sendmsg(2) on any of the "first sends", just as in my example program.
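
The retry amounted to roughly the following (an illustrative sketch; send_with_eperm_retry is a hypothetical name, and the real BIND patch differs):

/* Retry a send that fails with EPERM, on the theory that a first send
 * which lost the conntrack race will succeed once the winning packet's
 * entry has been confirmed. */
#include <errno.h>
#include <sys/socket.h>

static ssize_t send_with_eperm_retry(int fd, const struct msghdr *msg,
				     int flags, int max_retries)
{
	ssize_t n;
	int tries = 0;

	do {
		n = sendmsg(fd, msg, flags);
	} while (n < 0 && errno == EPERM && tries++ < max_retries);

	return n;
}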

It didn't help either of our customers.

So while this is (probably) a valid bug, it's one with no known impact.
