Bug 5505

Summary: BIND9 getting errno 14: Bad address
Product: Networking Reporter: Stefan Schmidt (zaphodb)
Component: OtherAssignee: Herbert Xu (herbert)
Status: RESOLVED CODE_FIX    
Severity: normal CC: chutz
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.14-rc5 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Patch to fix zero-length packet reception

Description Stefan Schmidt 2005-10-27 05:02:52 UTC
Most recent kernel where this bug did not occur: 2.6.14-rc5 , 2.6.14-rc4-mm1
Distribution: Debian GNU/Linux unstable
Hardware Environment: Dual PIV Xeon HT (HP Proliant DL360 G4)
Software Environment: 

ii  libc6                       2.3.5-7              GNU C Library: Shared
libraries and Timezone
ii  libc6-dev                   2.3.5-7              GNU C Library: Development
Libraries and Hea
ii  libc6-i686                  2.3.5-7              GNU C Library: Shared
libraries [i686 optimi
ii  bin86                       0.16.14-1.2          16-bit x86 assembler and loader
ii  gcc-4.0                     4.0.2-3              The GNU C compiler
ii  gcc-4.0-base                4.0.2-3              The GNU Compiler Collection
(base package)
ns1:~# cc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v
--enable-languages=c,c++,java,f95,objc,ada,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--program-suffix=-4.0 --enable-__cxa_atexit --enable-libstdcxx-allocator=mt
--enable-clocale=gnu --enable-libstdcxx-debug --enable-java-gc=boehm
--enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-4.0-1.4.2.0/jre --enable-mpfr
--disable-werror --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.0.3 20051023 (prerelease) (Debian 4.0.2-3)


Problem Description:

BIND9 receives errno 14 and is unable to answer further queries.
Oct 26 15:38:53 ns1 named[2197]: errno2result.c:109: unexpected error:
Oct 26 15:38:53 ns1 named[2197]: unable to convert errno to isc_result: 14: Bad
address
Oct 26 15:38:53 ns1 named[2197]: UDP client handler shutting down due to fatal
receive error: unexpected error

I got this with BIND9 version 9.3.2beta1 which is running smoothly on my other
machines and with 9.3.2beta2 and the debian bind9 package which provides 9.3.1 .
I tried with and without CONFIG_PACKET_MMAP=y.

The same BIND9 binary runs on 2.6.13.4 for 3 hours without problems now, on the
abovementioned other two 2.6 kernels the problem occurred from 10m to 1h after
BIND9 startup. I had no problems whatsoever with 2.6.14-rc4-mm1 running about a
day of distributed.net's dnetc client. The machine is brand new and passes all
BIOS checks.
DNS-Workload of the machine is ~10 queries per second. It is holding 40k zones
as secondary nameserver.
I have machines running BIND 9.3.1beta1 recursive-only that are seeing ~1000
queries per second - i tell you this to explain why i tend to believe BIND9 is
not the problem here.
Comment 1 Stefan Schmidt 2005-10-31 02:18:25 UTC
Date: Mon, 31 Oct 2005 11:04:54 +0100
From: Stefan Schmidt <zaphodb@zaphods.net>
To: Ian McDonald <imcdnzl@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>,
        Andrew Morton <akpm@osdl.org>,
        Herbert Xu <herbert@gondor.apana.org.au>,
        Arnaldo Carvalho de Melo <acme@conectiva.com.br>,
        "David S. Miller" <davem@davemloft.net>, netdev@vger.kernel.org,
        Linus Torvalds <torvalds@osdl.org>
Subject: Re: [Bug 5505] New: BIND9 getting errno 14: Bad address

On Fri, Oct 28, 2005 at 12:18:27AM +0200, Stefan Schmidt wrote:
> > > > Stefan, are you able to obtain the `strace -f' output from a failur=
e?
> > > Please.

=2E..
> I will try to get hold of another machine and load it with the same
> installation and zone-data in order to reproduce the problem.
> Getting the same machine model will be difficult as we are low on pool
> machines but i'll try to aquire a PIV Xeon HT machine.

It seems my strategy to forget about the week through some pints of guinness
was a full success. I just remembered i haven't sent you my results yet this
very morning, sorry:

I was able to reproduce the bug with on another DL360 and kernel 2.6.14-rc5
and i have a full strace for you. This is the interesting part of it:

2539  18:31:57.648544 <... futex resumed> ) =3D 0
2539  18:31:57.648624 futex(0x818ab08, FUTEX_WAKE, 1) =3D 0
2539  18:31:57.648722 gettimeofday({1130517117, 648752}, NULL) =3D 0
2539  18:31:57.648797 recvmsg(20, {msg_name(16)=3D{sa_family=3DAF_INET, sin=
_port=3Dhtons(4269), sin_addr=3Dinet_addr("222.255.121.130")}, msg_iov(1)=
=3D[{"+\236\0\0\0\1\0\0\0\0\0\1\rextreme-event\2de\0\0\17"..., 4096}], msg_=
con
trollen=3D20, {cmsg_len=3D20, cmsg_level=3DSOL_SOCKET, cmsg_type=3D0x1d /* =
SCM_??? */, ...}, msg_flags=3D0}, 0) =3D 45
2539  18:31:57.648930 recvmsg(20, 0xb764e3c8, 0) =3D -1 EFAULT (Bad address)
2539  18:31:57.648993 futex(0x8186eac, FUTEX_WAKE, 2147483647) =3D 0
2539  18:31:57.649086 time([1130517117]) =3D 1130517117
2539  18:31:57.649152 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649233 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649305 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649391 send(3, "<27>Oct 28 16:31:57 named[2537]:"..., 70, MS=
G_NOSIGNAL) =3D 70
2539  18:31:57.649478 time([1130517117]) =3D 1130517117
2539  18:31:57.649544 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649617 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649688 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649769 send(3, "<27>Oct 28 16:31:57 named[2537]:"..., 87, MS=
G_NOSIGNAL) =3D 87
2539  18:31:57.648930 recvmsg(20, 0xb764e3c8, 0) =3D -1 EFAULT (Bad address)
2539  18:31:57.648993 futex(0x8186eac, FUTEX_WAKE, 2147483647) =3D 0
2539  18:31:57.649086 time([1130517117]) =3D 1130517117
2539  18:31:57.649152 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649233 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649305 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649391 send(3, "<27>Oct 28 16:31:57 named[2537]:"..., 70, MS=
G_NOSIGNAL) =3D 70
2539  18:31:57.649478 time([1130517117]) =3D 1130517117
2539  18:31:57.649544 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649617 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649688 open("/etc/localtime", O_RDONLY) =3D -1 ENOENT (No su=
ch file or directory)
2539  18:31:57.649769 send(3, "<27>Oct 28 16:31:57 named[2537]:"..., 87, MS=
G_NOSIGNAL) =3D 87
2539  18:31:57.649852 futex(0x818ab40, FUTEX_WAKE, 1) =3D 1
2539  18:31:57.649954 recvmsg(20,  <unfinished ...>
2539  18:31:57.650021 <... recvmsg resumed> 0xb764e3c8, 0) =3D -1 EAGAIN (R=
esource temporarily unavailable)
2539  18:31:57.650088 write(8, "\24\0\0\0\375\377\377\377", 8 <unfinished .=
=2E.>
2539  18:31:57.650209 <... write resumed> ) =3D 8
2539  18:31:57.650459 sendmsg(20, {msg_name(16)=3D{sa_family=3DAF_INET, sin=
_port=3Dhtons(4269), sin_addr=3Dinet_addr("222.255.121.130")}, msg_iov(1)=
=3D[{"+\236\204\0\0\1\0\1\0\2\0\3\rextreme-event\2de\0\0\17"..., 152}], msg=
_co
ntrollen=3D0, msg_flags=3D0}, 0 <unfinished ...>
2539  18:31:57.650625 <... sendmsg resumed> ) =3D 152
2539  18:31:57.650764 recvmsg(20,  <unfinished ...>
2539  18:31:57.650880 <... recvmsg resumed> 0xb764cdc8, 0) =3D -1 EAGAIN (R=
esource temporarily unavailable)
2539  18:31:57.650964 futex(0x818ab40, FUTEX_WAIT, 3815, NULL <unfinished .=
=2E.>

I only provided you with the output from pid 2539 here.
Find the full strace -f -tt here:
 -> http://www.zaphods.net/~zaphodb/bigger.named.strace.bz2
but be warned:
-rw-r--r--   1 root root 256M 2005-10-28 18:38 bigger.named.strace
-rw-r--r--  1 root root 19M 2005-10-28 18:38 bigger.named.strace.bz2

> > I would also add are you able to use git-bisect to track down the
> > changeset causing this. I've used this to find obscure bugs already
> > and it is easy to use.
> I will google on what git-bisect is and then tell you whether or not i am
> able to. ;)

I will have to give back the machine soon, i'll see what i can do.

        Stefan

PS: tcpreplay rocks.
Comment 2 Gerald Hopf 2005-11-02 03:45:58 UTC
I had this error twice now on my Gentoo Linux System with

- bind-9.3.1
- kernel 2.6.14

Since i upgraded both Kernel and Bind on the same day, i can't really say who's
fault this is. Bind 9.3.1 has been out for quite some time, however...

I will go back to 2.6.13 tonight and within a week i'll know if this has fixed
it :-)

Oct 30 19:04:08 [named] errno2result.c:109: unexpected error:
Oct 30 19:04:08 [named] unable to convert errno to isc_result: 14: Bad address
Oct 30 19:04:08 [named] UDP client handler shutting down due to fatal receive
error: unexpected error

Oct 31 13:44:44 [named] errno2result.c:109: unexpected error:
Oct 31 13:44:44 [named] unable to convert errno to isc_result: 14: Bad address
Oct 31 13:44:44 [named] UDP client handler shutting down due to fatal receive
error: unexpected error
Comment 3 Herbert Xu 2005-11-02 14:45:22 UTC
Created attachment 6453 [details]
Patch to fix zero-length packet reception

This patch fixes zero-length packet reception which currently results in EFAULT
due to the rewrite of skb_copy_datagram_iovec.