Bug 6802 - pktgen cause kernel oops with transmit load balanced bonding
Summary: pktgen cause kernel oops with transmit load balanced bonding
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 high
Assignee: Arnaldo Carvalho de Melo
Reported: 2006-07-07 07:35 UTC by Chen-Li Tien
Modified: 2006-07-10 13:46 UTC
Kernel Version: 2.4.32, 2.6.17.2

Description Chen-Li Tien 2006-07-07 07:35:30 UTC
Most recent kernel where this bug did not occur:
not found
Distribution:
Fedora Core 4

Hardware Environment:
i386, ppc

Software Environment:
gcc

Problem Description:
Running the kernel packet generator (pktgen) with the output device set to a
bonding interface in balance-tlb or balance-alb mode causes a kernel oops.

I only set odev, dst and count (0 for an infinite test) in pktgen. I wondered
whether I had made a mistake in the pktgen parameters, but the same settings
cause no problem when odev is a physical device such as eth0.

My investigation shows that the problem happens when bond_alb_xmit tries to
access the daddr field of the IP header through skb->nh.iph. If I do the same
access in round-robin mode, it oopses as well.

Steps to reproduce:
1. Build kernel with pktgen (CONFIG_NET_PKTGEN) and bonding driver 
(CONFIG_BONDING).
2. Set up the bonding interface:
ifenslave bond0 eth0
3. Create a script to start the packet generator. The script I use for the
2.4 kernel series is as follows:
-----------cut here --------
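#!/bin/bash
# Uses bash-only syntax ("function", "local"), so run it with bash.

modprobe pktgen

PGDEV=/proc/net/pktgen/pg0

# Write one command to the pktgen device; print the result line on failure.
function pgset() {
    local result

    echo "$1" > "$PGDEV"

    result=$(grep -F "Result: OK:" "$PGDEV")
    if [ -z "$result" ]; then
         grep -F "Result:" "$PGDEV"
    fi
}

# Start injecting packets, then show the device status.
function pg() {
    echo inject > "$PGDEV"
    cat "$PGDEV"
}

pgset "odev bond0"
pgset "dst 127.1.16.1"
pgset "count 0"
pg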
-----------cut here --------

4. If odev is set to eth0, this script runs without problems; the oops
happens only when it is set to bond0.
Comment 1 Chen-Li Tien 2006-07-07 07:45:05 UTC
I forgot to mention that the kernel I use has been modified so that only
127.0.x.x is treated as loopback; 127.1.16.1 is the destination machine on the
LAN.

Running tcpdump on the destination machine shows the following:

17:40:57.801701 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801740 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801802 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801841 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801880 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801919 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801958 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.801997 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.802036 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.802075 127.1.18.1.discard > 127.1.16.1.discard:  udp 18
17:40:57.802114 127.1.18.1.discard > 127.1.16.1.discard:  udp 18

Is it normal for the packets to show up as discard? The result is identical
when the packets are sent through eth0, however.

Chen-Li Tien
Comment 2 Chen-Li Tien 2006-07-07 10:20:40 UTC
Before step 2, you need to load the bonding driver in transmit load balancing
mode:
modprobe bonding mode=balance-tlb

The balance-alb mode has the same problem, but it depends on the Ethernet
device driver.

The default balance-rr mode has no such problem.

Chen-Li Tien
Comment 3 Andrew Morton 2006-07-07 11:48:04 UTC
Please send (via an emailed reply-to-all) a copy of the oops output.

Thanks.

Comment 4 Chen-Li Tien 2006-07-10 07:50:36 UTC
Output of kernel 2.6.17.4:

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000010
 printing eip:
e0988e2e
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: pktgen bonding ip_tables x_tables microcode ata_piix libata
CPU:    0
EIP:    0060:[<e0988e2e>]    Not tainted VLI
EFLAGS: 00010213   (2.6.17.4 #1)
EIP is at bond_alb_xmit+0xb2/0x1c9 [bonding]
eax: 00000000   ebx: df526ba0   ecx: 00000005   edx: dd8f0ee8
esi: df53b6a3   edi: e098ad51   ebp: dca23f18   esp: dca23f00
ds: 007b   es: 007b   ss: 0068
Process pktgen/0 (pid: 2104, threadinfo=dca22000 task=ddd10520)
Stack: df53b6a2 00000001 df526cb8 df526940 dde8e16c df526940 dca23fe4 e097a324
       dd8f0ee8 df526940 5a5a5a5a 5a5a5a5a 5a5a5a5a 5a5a5a5a 5a5a5a5a 5a5a5a5a
       00000000 00000000 dca22000 5a5a5a5a 5a5a5a5a 5a5a5a5a 5a5a5a5a bc5e5016
Call Trace:
 <c0102a82> show_stack_log_lvl+0x87/0x8f  <c0102bd3> show_registers+0x112/0x17b
 <c0102d8e> die+0xda/0x19f  <c010b228> do_page_fault+0x467/0x551
 <c0102727> error_code+0x4f/0x54  <e097a324> pktgen_thread_worker+0x3b1/0x790 [pktgen]
 <c0100d3d> kernel_thread_helper+0x5/0xb
Code: 74 69 81 fa dd 86 00 00 74 3f e9 b5 00 00 00 fc 8b 75 e8 bf 50 ad 98 e0 b9 06 00 00 00 f3 a6 0f 84 a3 00 00 00 8b 55 08 8b 42 20 <83> 78 10 ff 0f 84 93 00 00 00 80 78 09 02 0f 84 89 00 00 00 8d
EIP: [<e0988e2e>] bond_alb_xmit+0xb2/0x1c9 [bonding] SS:ESP 0068:dca23f00
 <0>Kernel panic - not syncing: Fatal exception in interrupt

The script I ran with kernel 2.6 is:

#!/bin/bash
# Uses bash-only syntax ("function", "local"), so run it with bash.

modprobe pktgen

PGDEV=/proc/net/pktgen/bond0
PGCTL=/proc/net/pktgen/pgctrl

# Write one command to the pktgen device; print the result line on failure.
function pgset() {
    local result

    echo "$1" > "$PGDEV"

    result=$(grep -F "Result: OK:" "$PGDEV")
    if [ -z "$result" ]; then
         grep -F "Result:" "$PGDEV"
    fi
}

# Start the pktgen threads, then show the device status.
function pg() {
    echo start > "$PGCTL"
    cat "$PGDEV"
}

echo "add_device bond0" > /proc/net/pktgen/kpktgend_0
pgset "frags 5"         # packet will consist of 5 fragments
pgset "dst 192.168.0.1"

pgset "count 0"         # number of packets to send; zero means send
                        # continuously until explicitly stopped
pg

Comment 5 Chen-Li Tien 2006-07-10 07:56:51 UTC
It seems to happen in the following line of bond_alb_xmit() (line 1679 in
2.6.17.4):

                    (skb->nh.iph->daddr == ip_bcast) ||
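
The fault address in the oops is consistent with this line. The faulting
instruction (bytes 83 78 10 ff in the Code: line) appears to compare the 32-bit
word at offset 0x10 from eax against -1, eax is 00000000, and the reported
address is 00000010. Offset 0x10 is where daddr sits in an IPv4 header without
options, so a NULL skb->nh.iph would fault at exactly that address. The sketch
below checks the arithmetic in user space. It is only an illustration: struct
mock_iphdr and both helper functions are hypothetical stand-ins, not kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mock of the IPv4 header layout (RFC 791, no options).
 * Hypothetical stand-in for the kernel's struct iphdr. */
struct mock_iphdr {
    uint8_t  ver_ihl;    /* version and header length, 4 bits each */
    uint8_t  tos;
    uint16_t tot_len;
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  protocol;
    uint16_t check;
    uint32_t saddr;
    uint32_t daddr;
};

/* Address touched when a field at field_offset is read through a header
 * pointer whose numeric value is base. */
uintptr_t field_address(uintptr_t base, size_t field_offset)
{
    return base + field_offset;
}

/* Address that would fault if daddr were read through a NULL header pointer. */
uintptr_t null_daddr_fault_address(void)
{
    /* Sanity check: the mock has no padding, so it is 20 bytes long. */
    assert(sizeof(struct mock_iphdr) == 20);
    return field_address(0, offsetof(struct mock_iphdr, daddr));
}
```

Calling null_daddr_fault_address() yields 0x10, which matches the "virtual
address 00000010" reported in the oops.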
Comment 6 Chen-Li Tien 2006-07-10 13:46:32 UTC
This is caused by pktgen, which doesn't initialize skb->nh, which bonding uses
to check the destination address.

I made a patch for 2.6.17.4; 2.4.32 can be fixed in the same way.

--- linux-2.6.17.4/net/core/pktgen.c.orig       2006-07-06 16:02:28.000000000 -0400
+++ linux-2.6.17.4/net/core/pktgen.c    2006-07-10 16:40:47.000000000 -0400
@@ -2149,6 +2149,9 @@
        skb->mac.raw = ((u8 *) iph) - 14 - pkt_dev->nr_labels*sizeof(u32);
        skb->dev = odev;
        skb->pkt_type = PACKET_HOST;
+       skb->mac.raw = eth;
+       skb->nh.iph = iph;
+       skb->h.uh = udph;

        if (pkt_dev->nfrags <= 0)
                pgh = (struct pktgen_hdr *)skb_put(skb, datalen);

Please help review the code, thanks!
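
For reviewers, the idea behind the patch can be modelled in user space. This is
a rough sketch under stated assumptions, not the pktgen or bonding code: struct
mock_skb, mock_build() and mock_is_ip_broadcast() are hypothetical names, and
the NULL check in the consumer exists only to make the failure visible without
crashing. The builder records the header pointers only when asked to, which
mirrors the difference between the unpatched and patched pktgen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN  14   /* Ethernet header length */
#define IP_DADDR_OFF 16   /* offset of daddr inside an IPv4 header */

/* Hypothetical stand-in for an skb: packet bytes plus header pointers. */
struct mock_skb {
    uint8_t data[64];
    const uint8_t *mac;   /* start of the Ethernet header, or NULL */
    const uint8_t *nh;    /* start of the IP header, or NULL */
};

/* Build a frame carrying daddr. The header pointers are recorded only when
 * set_headers is nonzero (the behaviour the patch adds). */
void mock_build(struct mock_skb *skb, const uint8_t daddr[4], int set_headers)
{
    assert(skb != NULL);
    memset(skb->data, 0, sizeof(skb->data));
    memcpy(skb->data + ETH_HDR_LEN + IP_DADDR_OFF, daddr, 4);
    skb->mac = NULL;
    skb->nh = NULL;
    if (set_headers) {
        skb->mac = skb->data;
        skb->nh = skb->data + ETH_HDR_LEN;
    }
}

/* Consumer that reads daddr through the header pointer.
 * Returns 1 for broadcast, 0 otherwise, -1 if the pointer was never set.
 * The code path in the oops dereferences the pointer without such a check. */
int mock_is_ip_broadcast(const struct mock_skb *skb)
{
    static const uint8_t bcast[4] = { 255, 255, 255, 255 };

    assert(skb != NULL);
    if (skb->nh == NULL)
        return -1;
    return memcmp(skb->nh + IP_DADDR_OFF, bcast, 4) == 0;
}
```

With set_headers equal to 0 the consumer reports -1, the case that oopses in
the kernel; with set_headers equal to 1 it reads the destination address
normally.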
