Bug 12698

Summary: bnx2_poll_work kernel panic
Product: Drivers
Reporter: Pascal de Bruijn (pmjdebruijn)
Component: Network
Assignee: Michael Chan (mchan)
Status: CLOSED CODE_FIX
Severity: high
CC: alan, jtorrice, pterjan
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27.15
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Digital screen capture

Description Pascal de Bruijn 2009-02-13 03:10:15 UTC
Latest working kernel version: 
  2.6.22.19

Earliest failing kernel version: 
  2.6.27.15 
  We did not test kernel versions in between 2.6.27.15 and 2.6.22.19

Distribution: 
  Debian GNU/Linux 4.0r6 (fully updated), with self built vanilla kernel.

Hardware Environment: 
  HP DL360 G5, with two onboard Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet Controllers

Software Environment:
  Debian GNU/Linux 4.0r6 using rsync 3.0.3

Problem Description:
  After our nightly rsync backups start, our backup server panics with the following call trace (this has happened twice, so we downgraded our kernel).

The call trace was captured with a digital camera and transcribed by hand as the following text:

[24756.631365] Call Trace:
[24756.631365]  [<c0108ba8>] nommu_map_single+0x38/0x80
[24756.631365]  [<c0261e77>] bnx2_poll+0x67/0x1d0
[24756.631365]  [<c02e9e47>] net_rx_action+0x87/0x130
[24756.631365]  [<c0128b82>] __do_softirq+0x82/0x100
[24756.631365]  [<c0128da0>] ksoftirqd+0x0/0xf0
[24756.631365]  [<c0128c37>] do_softirq+0x37/0x40
[24756.631365]  [<c0128e12>] ksoftirqd+0x72/0xf0
[24756.631365]  [<c01365d2>] kthread+0x42/0x70
[24756.631365]  [<c0136590>] kthread+0x0/0x70
[24756.631365]  [<c0103c2f>] kernel_thread_helper+0x7/0x18
[24756.631697]  =======================
[24756.631697] Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00
 00 0f 45 94 24 88 00 00 00 89 24 88 00 00 00 8b 94 24 8c 00 00 00 <8b> 0e 8d
 04 52 8b 93 8c 00 00 00 8d 04 85 10 00 00 00 01 d0 8d
[24756.631925] EIP: [<c02613d6>] bnx2_poll_work+0xa0d/0x1010 SS:ESP 0068:f787fea8
[24756.632218] Kernel panic - not syncing: Fatal exception in interrupt

Steps to reproduce:
1. Boot server using 2.6.27.15 kernel
2. Start several (about 10) simultaneous rsync processes to the server
3. Wait for crash
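
The load in step 2 can be approximated with a small script. This is only a sketch: the helper names `rsync_commands` and `run_parallel` are hypothetical, the source and destination paths are placeholders, and `-a` is a common rsync option rather than the one known to be used on the affected server.

```python
import subprocess


def rsync_commands(sources, dest):
    """Build one `rsync -a` argv list per source (all paths are placeholders)."""
    return [["rsync", "-a", src, dest] for src in sources]


def run_parallel(commands):
    """Start every command at once, then wait for all of them to finish.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling `run_parallel(rsync_commands(sources, dest))` with about 10 sources would start the transfers simultaneously, which is the situation described in step 2.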
Comment 1 Pascal de Bruijn 2009-02-13 03:11:30 UTC
Created attachment 20223
Digital screen capture

The original screen capture taken with a digital camera.
Comment 2 Anonymous Emailer 2009-02-17 14:16:53 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 13 Feb 2009 03:10:16 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12698
> 
>            Summary: bnx2_poll_work kernel panic
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.27.15
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Network
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: pmjdebruijn@pcode.nl
> 
> 
> Latest working kernel version: 
>   2.6.22.19
> 
> Earliest failing kernel version: 
>   2.6.27.15 
>   We did not test kernel versions in between 2.6.27.15 and 2.6.22.19

Is a regression.

> Distribution: 
>   Debian GNU/Linux 4.0r6 (fully updated), with self built vanilla kernel.
> 
> Hardware Environment: 
>   HP DL360 G5, with two onboard Broadcom Corporation NetXtreme II BCM5708
> Gigabit Ethernet Controllers
> 
> Software Environment:
>   Debian GNU/Linux 4.0r6 using rsync 3.0.3
> 
> Problem Description:
>   After initializing daily (nightly) rsync backups our backup server panics
> with the following call trace (we've had this issue twice, so we downgraded
> our kernel).
> 
> The following call trace was captured using a digital camera and reproduced
> as the following text:
> 
> [24756.631365] Call Trace:
> [24756.631365]  [<c0108ba8>] nommu_map_single+0x38/0x80
> [24756.631365]  [<c0261e77>] bnx2_poll+0x67/0x1d0
> [24756.631365]  [<c02e9e47>] net_rx_action+0x87/0x130
> [24756.631365]  [<c0128b82>] __do_softirq+0x82/0x100
> [24756.631365]  [<c0128da0>] ksoftirqd+0x0/0xf0
> [24756.631365]  [<c0128c37>] do_softirq+0x37/0x40
> [24756.631365]  [<c0128e12>] ksoftirqd+0x72/0xf0
> [24756.631365]  [<c01365d2>] kthread+0x42/0x70
> [24756.631365]  [<c0136590>] kthread+0x0/0x70
> [24756.631365]  [<c0103c2f>] kernel_thread_helper+0x7/0x18
> [24756.631697]  =======================
> [24756.631697] Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00
>  00 0f 45 94 24 88 00 00 00 89 24 88 00 00 00 8b 94 24 8c 00 00 00 <8b> 0e 8d
>  04 52 8b 93 8c 00 00 00 8d 04 85 10 00 00 00 01 d0 8d
> [24756.631925] EIP: [<c02613d6>] bnx2_poll_work+0xa0d/0x1010 SS:ESP 0068:f787fea8
> [24756.632218] Kernel panic - not syncing: Fatal exception in interrupt
> 
> Steps to reproduce:
> 1. Boot server using 2.6.27.15 kernel
> 2. Start several (about 10) simultaneous rsync processes to the server
> 3. Wait for crash
> 

There's a partial screenshot in bugzilla.  It crashed in
bnx2_poll_work().
Comment 3 Michael Chan 2009-02-17 14:26:24 UTC
On Tue, 2009-02-17 at 14:16 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> There's a partial screenshot in bugzilla.  It crashed in
> bnx2_poll_work().
> 

Pascal, what's the MTU setting?  Is it set to 1500 or bigger jumbo
frames?
Comment 4 Pascal de Bruijn 2009-02-17 23:56:50 UTC
Well, we never touched the MTU settings... We're not using Jumbo frames at all. So we're at the normal MTU 1500...

The following was taken using the 2.6.22.19 kernel, but 2.6.27.15 should have been the same:

eth0      Link encap:Ethernet  HWaddr 00:1E:0B:E9:A1:32  
          inet addr:XX.XX.XX.XX  Bcast:XX.XX.XX.XX  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:185791948 errors:0 dropped:41 overruns:0 frame:0
          TX packets:90091372 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:2325280305 (2.1 GiB)  TX bytes:2478280955 (2.3 GiB)
          Interrupt:18 Memory:f8000000-f8012100 
Comment 5 Pascal de Bruijn 2009-02-19 07:26:01 UTC
On Tue, Feb 17, 2009 at 11:26 PM,  <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #3 from mchan@broadcom.com  2009-02-17 14:26 -------
>
> On Tue, 2009-02-17 at 14:16 -0800, bugme-daemon@bugzilla.kernel.org
> wrote:
>> There's a partial screenshot in bugzilla.  It crashed in
>> bnx2_poll_work().
>>
>
> Pascal, what's the MTU setting?  Is it set to 1500 or bigger jumbo
> frames?

We haven't fiddled with the MTU, no jumbo frames, just plain old
1500 byte ones...
Comment 6 Michael Chan 2009-02-20 14:32:00 UTC
> Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00

This i386 32-bit code looks a bit strange and I suspect that it's been corrupted.
Can you send me the output of objdump -d bnx2.o taken from the kernel that
crashed? 
Comment 7 Pascal de Bruijn 2009-02-21 04:04:10 UTC
On Fri, 2009-02-20 at 14:32 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
> 
> ------- Comment #6 from mchan@broadcom.com  2009-02-20 14:32 -------
> > Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00
> 
> This i386 32-bit code looks a bit strange and I suspect that it's been
> corrupted.
> Can you send me the output of objdump -d bnx2.o taken from the kernel that
> crashed? 

How could it be corrupted? We've built this kernel with GCC 4.1.1
(Debian Etch), and the module was compiled into the kernel, so I don't
have a separate bnx2.o file.

I could provide the vmlinuz?

Please also note, we had EDAC (i5000) enabled, and no ECC errors occurred
on this specific machine.
Comment 8 Michael Chan 2009-02-21 22:40:28 UTC
bugme-daemon@bugzilla.kernel.org wrote:

> How could it be corrupted? We've built this kernel with GCC 4.1.1
> (Debian Etch), and the module was compiled into the kernel, so I don't
> have a separate bnx2.o file.

You should still have the bnx2.o in the drivers/net directory.  It
could be corrupted because buggy software wrote to the driver's
code pages or corrupted by DMA.  That's why I wanted to compare
with the disassembled instructions from bnx2.o.

Just run objdump -d bnx2.o from the kernel source tree that crashed.
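
One way to do this comparison is to turn the oops "Code:" line into raw bytes and look for them in the object file. The following is a minimal sketch under assumptions: the helper names `parse_oops_code` and `find_all` are hypothetical, the input is expected to contain only the "Code:" line and its continuation lines, and bytes covered by relocations in bnx2.o may differ from the bytes of the linked kernel.

```python
def parse_oops_code(text):
    """Return (code_bytes, fault_index) from an oops "Code:" line.

    The byte wrapped in <..> is the first byte of the faulting instruction.
    fault_index is None if no byte is marked.
    """
    _, _, tail = text.partition("Code:")
    code = bytearray()
    fault = None
    for tok in tail.split():
        if tok.startswith("<") and tok.endswith(">"):
            fault = len(code)  # index of the marked byte
            tok = tok[1:-1]
        code.append(int(tok, 16))
    return bytes(code), fault


def find_all(blob, needle):
    """Return every offset at which `needle` occurs in `blob`."""
    hits = []
    start = blob.find(needle)
    while start != -1:
        hits.append(start)
        start = blob.find(needle, start + 1)
    return hits
```

Searching the contents of bnx2.o for a run of bytes starting at the fault index, for example `find_all(blob, code[fault:fault + 8])`, should produce a hit if the running code was intact. No hit would suggest either corruption or a transcription error in the hand-copied trace.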

>
> I could provide the vmlinuz?

Or send me drivers/net/bnx2.o

>
> Please also note, we had EDAC (i5000) enabled, and no ECC errors occurred
> on this specific machine.

Memory overwritten by buggy code or DMA will not produce ECC errors.

Thanks.
Comment 9 Pascal de Bruijn 2009-02-22 05:02:57 UTC
On Sat, 2009-02-21 at 22:40 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
> 
> ------- Comment #8 from mchan@broadcom.com  2009-02-21 22:40 -------
> bugme-daemon@bugzilla.kernel.org wrote:
> 
> > How could it be corrupted? We've built this kernel with GCC 4.1.1
> > (Debian Etch), and the module was compiled into the kernel, so I don't
> > have a separate bnx2.o file.
> 
> You should still have the bnx2.o in the drivers/net directory.  It
> could be corrupted because buggy software wrote to the driver's
> code pages or corrupted by DMA.  That's why I wanted to compare
> with the disassembled instructions from bnx2.o.
> 
> Just run objdump -d bnx2.o from the kernel source tree that crashed.
> 
> >
> > I could provide the vmlinuz?
> 
> Or send me drivers/net/bnx2.o

I've attached the bnx2.o file.

> >
> > Please also note, we had EDAC (i5000) enabled, and no ECC errors occurred
> > on this specific machine.
> 
> Memory overwritten by buggy code or DMA will not produce ECC errors.

Right.
Comment 10 Michael Chan 2009-02-23 16:33:30 UTC
I've looked at the disassembled code from the bnx2.o.  The code was not
corrupted and it indeed crashed in bnx2_poll_work + 0xa06 which was
somewhere in bnx2_rx_skb().  If I decoded the instructions right, it was
trying to process rx pages and it shouldn't be doing that if the MTU was
set to 1500.
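
To look up a crash location in disassembly, the symbol, offset, and function size have to be taken from the trace entry. The sketch below parses entries of the form used in this report; `parse_frame` and `FRAME_RE` are hypothetical names, and the pattern only covers the `[<address>] symbol+0xOFF/0xSIZE` layout shown in the trace.

```python
import re

# Matches entries such as "[<c02613d6>] bnx2_poll_work+0xa0d/0x1010".
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frame(line):
    """Split a trace entry into symbol, address, offset, size and start."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    address, offset, size = (int(m.group(i), 16) for i in (1, 3, 4))
    return {
        "symbol": m.group(2),
        "address": address,
        "offset": offset,
        "size": size,
        "start": address - offset,  # address where the function begins
    }
```

For the EIP line of this report, the function start works out to 0xc02613d6 - 0xa0d = 0xc02609c9. In the output of objdump -d on bnx2.o, the same instruction would be at the start of bnx2_poll_work plus the offset.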

Since it is related to jumbo frames logic (even though you're not using
jumbo frames), the latest 2.6.29-rc has the latest firmware with some
related bug fixes.  If you can test that kernel, it will be great.  I
can also send you some debug patches for 2.6.27.15 to confirm what I
suspect and narrow it down further.  We can also try to reproduce this
in our lab.

Thanks.
Comment 11 Pascal de Bruijn 2009-02-24 02:53:38 UTC
On Mon, 2009-02-23 at 16:33 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
> 
> ------- Comment #10 from mchan@broadcom.com  2009-02-23 16:33 -------
> I've looked at the disassembled code from the bnx2.o.  The code was not
> corrupted and it indeed crashed in bnx2_poll_work + 0xa06 which was
> somewhere in bnx2_rx_skb().  If I decoded the instructions right, it was
> trying to process rx pages and it shouldn't be doing that if the MTU was
> set to 1500.
> 
> Since it is related to jumbo frames logic (even though you're not using
> jumbo frames), the latest 2.6.29-rc has the latest firmware with some
> related bug fixes.  If you can test that kernel, it will be great.  I
> can also send you some debug patches for 2.6.27.15 to confirm what I
> suspect and narrow it down further.  We can also try to reproduce this
> in our lab.

We are mostly interested in 2.6.27.x (-stable), so I hope all related
fixes will also appear in 2.6.27.20+.
Comment 12 Michael Chan 2009-02-25 15:41:59 UTC
Because of the size of the firmware patch, it won't be eligible for -stable.

But we need to confirm if it actually fixes the problem or not.  Will you be able to test 2.6.29-rc for us?
Comment 13 Pascal de Bruijn 2009-02-26 00:24:30 UTC
On Thu, Feb 26, 2009 at 12:41 AM,  <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #12 from mchan@broadcom.com  2009-02-25 15:41 -------
> Because of the size of the firmware patch, it won't be eligible for -stable.
>
> But we need to confirm if it actually fixes the problem or not.  Will you be
> able to test 2.6.29-rc for us?

I'll have to discuss this... I'll let you know...

I understand the rules for -stable... though I can't imagine leaving
bnx2 broken in -stable, especially since bnx2/e1000 are the two most
common NICs in server parks.
Comment 14 Pascal de Bruijn 2009-03-11 05:46:31 UTC
On Thu, Feb 26, 2009 at 9:24 AM, Pascal de Bruijn <pmjdebruijn@pcode.nl> wrote:
> On Thu, Feb 26, 2009 at 12:41 AM,  <bugme-daemon@bugzilla.kernel.org> wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>>
>> ------- Comment #12 from mchan@broadcom.com  2009-02-25 15:41 -------
>> Because of the size of the firmware patch, it won't be eligible for -stable.
>>
>> But we need to confirm if it actually fixes the problem or not.  Will you be
>> able to test 2.6.29-rc for us?
>
> I'll have to discuss this... I'll let you know...
>
> I understand the rules for -stable... though I can't imagine leaving
> bnx2 broken in -stable, especially since bnx2/e1000 are the two
> most common NICs in server parks.

We're trying to reproduce the bug in a testing environment, but we're finding it
difficult to reliably reproduce the problem.

1. We run into this bug every night in our production environment, that's why
 we rolled back to our older kernel.
2. In the testing environment we can reproduce the issue, but not reliably. Just
 transferring lots of data simultaneously from multiple sources seems to help
 in triggering the issue, but I can't say for sure.

In our production environment we have about 200 rsyncs, of which 20 run
simultaneously in sequences of 10, during the night.

Without a reliable way to reproduce this in our testing environment, we
can't reliably test if this also occurs in 2.6.29-rc.

Regards,
Pascal
Comment 15 Michael Chan 2009-03-11 09:29:39 UTC
OK, we'll try to reproduce this in our lab.  Thanks.
Comment 16 Pascal de Bruijn 2009-04-23 10:07:18 UTC
Out of curiosity, is there a reason why bnx2 isn't using FW_LOADER yet? Then the firmware could be exchanged relatively easily...
Comment 17 Michael Chan 2009-05-04 22:23:19 UTC
We're using it starting in 2.6.30.  But firmware will not be easily changeable because the firmware interface is tightly coupled and not all combinations will work.

Anyway, I think we have some breakthrough with this problem.  Another user who encountered the same problem was able to help us debug it.  The problem is in the compiled code and we'll need to add some barrier() to force it to generate the proper code.

If you can send me the disassembled driver code on the kernel that shows the problem, I can confirm it.  Use objdump -d bnx2.ko > bnx2.obj and please send me the output file.  Thanks.
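
Once two such output files exist (for example from builds with different compilers), a single function can be compared between them. This is a rough sketch with hypothetical helper names. It assumes the usual objdump -d layout: a header line `address <name>:` per function, tab-separated columns, and a blank line after each function. Jump targets that embed addresses will still show up as differences.

```python
import difflib
import re

HEADER_RE = re.compile(r"^[0-9a-f]+ <([^>]+)>:\s*$")


def extract_function(objdump_text, name):
    """Return the disassembly lines of one function from `objdump -d` text."""
    lines, inside = [], False
    for line in objdump_text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            inside = m.group(1) == name
        elif not line.strip():
            inside = False  # objdump separates functions with a blank line
        elif inside:
            lines.append(line)
    return lines


def strip_addresses(lines):
    """Keep only the last tab-separated column (normally the instruction)."""
    return [line.split("\t")[-1].strip() for line in lines]


def diff_function(old_text, new_text, name):
    """Unified diff of one function's instructions between two dumps."""
    old = strip_addresses(extract_function(old_text, name))
    new = strip_addresses(extract_function(new_text, name))
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

Running `diff_function` on two dumps with the name bnx2_poll_work would list the instructions that differ between the builds, which is where a compiler-dependent problem should become visible.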
Comment 18 Pascal de Bruijn 2009-05-05 06:47:09 UTC
On Tue, May 5, 2009 at 12:23 AM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #17 from Michael Chan <mchan@broadcom.com>  2009-05-04 22:23:19 ---
> We're using it starting in 2.6.30.  But firmware will not be easily
> changeable because the firmware interface is tightly coupled and not all
> combinations will work.
>
> Anyway, I think we have some breakthrough with this problem.  Another user
> who encountered the same problem was able to help us debug it.  The problem
> is in the compiled code and we'll need to add some barrier() to force it to
> generate the proper code.
>
> If you can send me the disassembled driver code on the kernel that shows the
> problem, I can confirm it.  Use objdump -d bnx2.ko > bnx2.obj and please send
> me the output file.  Thanks.

I've attached the compressed obj file. I hope this helps.

If the fix turns out to be relatively trivial, I'd love for it to
get submitted for 2.6.27.22.

Regards,
Pascal de Bruijn
Comment 19 Michael Chan 2009-05-11 17:59:39 UTC
Problem should be fixed by this upstream patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=581daf7e00c5e766f26aff80a61a860a17b0d75a
Comment 20 Pascal de Bruijn 2009-05-15 12:55:49 UTC
I've tried the patch with 2.6.27.23, and our system has now been stable for two days in a row, whereas it previously crashed every night while running 2.6.27.x.

So I think I can confirm the patch solved our problem.

Thank you!
Comment 21 Pascal de Bruijn 2009-05-21 09:53:13 UTC
On Mon, May 11, 2009 at 7:59 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #19 from Michael Chan <mchan@broadcom.com>  2009-05-11 17:59:39
> ---
> Problem should be fixed by this upstream patch:
>
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=581daf7e00c5e766f26aff80a61a860a17b0d75a

We've had a kernel with this fix in our production setup now for a
couple of days, and we've run into a new issue.

The kernel does not crash anymore, however every other day (or so),
the interface just stops working. We get no significant kernel
messages (no explicit link down). Bringing the interface down and up
again (ifdown/ifup) fixes things.

So even with this fix the bnx2 driver in 2.6.27.23 still can't really
be considered reliable under load.

We've had this issue twice now, and the problem occurs consistently
about 2 hours 30 minutes into our backup schedule.

Please note, we have other machines using 2.6.27.x with the bnx2
driver without issues. But again, our backup machine is the only one
that truly gets loaded.

Regards,
Pascal de Bruijn
Comment 22 Michael Chan 2009-05-21 19:04:16 UTC
We now know that the original problem was highly dependent on the compiler and that's why we couldn't reproduce it in our lab.  This new problem may also be similar.  We will try to set up the exact distro user environment.  Please provide gcc -v so we can double check the gcc version.  Please also provide the .config.  Lastly, please attach the latest disassembled bnx2.ko from 2.6.27.23 with the upstream patch applied.  Thanks.
Comment 23 Pascal de Bruijn 2009-05-22 11:56:47 UTC
On Thu, 2009-05-21 at 19:04 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
> 
> --- Comment #22 from Michael Chan <mchan@broadcom.com>  2009-05-21 19:04:16 ---
> We now know that the original problem was highly dependent on the compiler
> and that's why we couldn't reproduce it in our lab.  This new problem may
> also be similar.  We will try to set up the exact distro user environment.
> Please provide gcc -v so we can double check the gcc version.  Please also
> provide the .config.  Lastly, please attach the latest disassembled bnx2.ko
> from 2.6.27.23 with the upstream patch applied.  Thanks.

I've attached the object dump, however it's from 2.6.27.22 plus your
barrier fix applied (I was mistaken about the .23)...

Our system is a Debian Etch install:

# gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v
--enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu
--enable-libstdcxx-debug --enable-mpfr --with-tune=i686
--enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
# dpkg -l | grep gcc
ii  gcc                      4.1.1-15                             The GNU C compiler
ii  gcc-4.1                  4.1.1-21                             The GNU C compiler
ii  gcc-4.1-base             4.1.1-21                             The GNU Compiler Collection (base package)
ii  libgcc1                  4.1.1-21                             GCC support library
filer02:~# 

I hope this helps, please let me know if additional information is
required.

Regards,
Pascal de Bruijn
Comment 24 Pascal de Bruijn 2009-05-27 09:32:04 UTC
On Thu, May 21, 2009 at 9:04 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #22 from Michael Chan <mchan@broadcom.com>  2009-05-21 19:04:16 ---
> We now know that the original problem was highly dependent on the compiler
> and that's why we couldn't reproduce it in our lab.  This new problem may
> also be similar.  We will try to set up the exact distro user environment.
> Please provide gcc -v so we can double check the gcc version.  Please also
> provide the .config.  Lastly, please attach the latest disassembled bnx2.ko
> from 2.6.27.23 with the upstream patch applied.  Thanks.

Since Debian Etch support will end in less than a year, we considered
upgrading the system in question to Debian Lenny. So we did...

Again we compiled our own vanilla 2.6.27.24 kernel with the barrier
fix applied...

We used the Debian system compiler, as specified below:

# gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --enable-cld
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)

# dpkg -l | grep gcc
ii  gcc                                  4:4.3.2-2                The GNU C compiler
ii  gcc-4.3                              4.3.2-1.1                The GNU C compiler
ii  gcc-4.3-base                         4.3.2-1.1                The GNU Compiler Collection (base package)
ii  libgcc1                              1:4.3.2-1.1              GCC support library

This time our kernel didn't crash and the link wasn't lost. However, we
did get a trace in dmesg...

[122774.972277]  [<c0159855>] __alloc_pages_internal+0x375/0x470
[122774.972310]  [<c01746e9>] cache_alloc_refill+0x2e9/0x510
[122774.972339]  [<c01749d9>] __kmalloc+0xc9/0xd0
[122774.972366]  [<c02f6d81>] __alloc_skb+0x51/0x110
[122774.972394]  [<c02f7802>] __netdev_alloc_skb+0x22/0x50
[122774.972421]  [<c0271b1d>] bnx2_poll_work+0x81d/0x1030
[122774.972453]  [<c011bb79>] place_entity+0xa9/0x130
[122774.972481]  [<c01fa1fb>] rb_insert_color+0x7b/0xf0
[122774.972510]  [<c011975b>] memtype_get_idx+0xb/0xe0
[122774.972537]  [<c0272447>] bnx2_poll+0x47/0x210
[122774.972563]  [<c02fa847>] net_rx_action+0x97/0x160
[122774.972591]  [<c012a642>] __do_softirq+0x82/0xf0
[122774.972619]  [<c012a6ed>] do_softirq+0x3d/0x50
[122774.972646]  [<c01060a0>] do_IRQ+0x40/0x70
[122774.972672]  [<c0113745>] smp_apic_timer_interrupt+0x55/0x80
[122774.972702]  [<c0103b3b>] common_interrupt+0x23/0x28
[122774.972735]  [<c0109eef>] mwait_idle+0x2f/0x40
[122774.972765]  [<c01020f3>] cpu_idle+0x73/0xf0

I've attached the object dump of our bnx2.o file, our kernel .config,
and the complete dmesg.

Regards,
Pascal de Bruijn
Comment 25 Michael Chan 2009-06-01 22:09:54 UTC
I don't see the entire kernel message.  Did it say:

"... page allocation error. order:..."

If it did, then it's out of memory which is known to happen.

I'll compare the obj files with the last one you sent me to try to determine the 2nd problem you encountered using the older gcc.
Comment 26 Michael Chan 2009-06-03 21:45:05 UTC
1.  I checked the complete dmesg log that you sent to me privately.  The system is out of memory when the bnx2 driver is trying to allocate a single page.  Not sure why your system is so low in kernel memory.  But this should be unrelated to the bnx2 driver.

[122774.972217] swapper: page allocation failure. order:0, mode:0x20
[122774.972248] Pid: 0, comm: swapper Not tainted 2.6.27.24-bnx2fix #1
[122774.972277]  [<c0159855>] __alloc_pages_internal+0x375/0x470
[122774.972310]  [<c01746e9>] cache_alloc_refill+0x2e9/0x510
[122774.972339]  [<c01749d9>] __kmalloc+0xc9/0xd0
[122774.972366]  [<c02f6d81>] __alloc_skb+0x51/0x110
[122774.972394]  [<c02f7802>] __netdev_alloc_skb+0x22/0x50
[122774.972421]  [<c0271b1d>] bnx2_poll_work+0x81d/0x1030

2.  I've looked at all the obj files that you sent.  The one before the patch, the one after the patch, and the one after the patch using gcc 4.3.2.  Only the last one using gcc 4.3.2 has the correct code.  The 2nd one has obj code that is a little different than the first one near the place where barrier was added, but the code is still not correct.  The system should still crash using the 2nd obj code.

Anyway, the latest compiled code using the new gcc looks correct.  Please let me know if you encounter any more issues.  Thanks.
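
For the allocation failure in point 1, the header line of the message carries the useful numbers. The sketch below pulls them out of a dmesg line. `parse_alloc_failure` is a hypothetical helper, and the pattern only matches the message format shown above. An order of 0 means 2^0 = 1 page, which matches "a single page". Mode 0x20 appears to correspond to an atomic allocation (GFP_ATOMIC) on kernels of this era, but that should be checked against the gfp flag definitions of the kernel in use.

```python
import re

ALLOC_FAIL_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-f]+)"
)


def parse_alloc_failure(line):
    """Extract task name, order, page count and gfp mode from a dmesg line."""
    m = ALLOC_FAIL_RE.search(line)
    if m is None:
        return None
    order = int(m.group(2))
    return {
        "comm": m.group(1),
        "order": order,
        "pages": 1 << order,  # an order-N request asks for 2**N pages
        "mode": int(m.group(3), 16),
    }
```

Counting how often such lines appear in a full dmesg capture, and with which order, gives a first impression of how much memory pressure the machine is under during the backups.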
Comment 27 Pascal de Bruijn 2009-06-04 11:36:20 UTC
On Wed, Jun 3, 2009 at 11:45 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #26 from Michael Chan <mchan@broadcom.com>  2009-06-03 21:45:05 ---
> 1.  I checked the complete dmesg log that you sent to me privately.  The
> system is out of memory when the bnx2 driver is trying to allocate a single
> page.  Not sure why your system is so low in kernel memory.  But this should
> be unrelated to the bnx2 driver.
>
> [122774.972217] swapper: page allocation failure. order:0, mode:0x20
> [122774.972248] Pid: 0, comm: swapper Not tainted 2.6.27.24-bnx2fix #1
> [122774.972277]  [<c0159855>] __alloc_pages_internal+0x375/0x470
> [122774.972310]  [<c01746e9>] cache_alloc_refill+0x2e9/0x510
> [122774.972339]  [<c01749d9>] __kmalloc+0xc9/0xd0
> [122774.972366]  [<c02f6d81>] __alloc_skb+0x51/0x110
> [122774.972394]  [<c02f7802>] __netdev_alloc_skb+0x22/0x50
> [122774.972421]  [<c0271b1d>] bnx2_poll_work+0x81d/0x1030
>
> 2.  I've looked at all the obj files that you sent.  The one before the
> patch, the one after the patch, and the one after the patch using gcc 4.3.2.
> Only the last one using gcc 4.3.2 has the correct code.  The 2nd one has obj
> code that is a little different than the first one near the place where
> barrier was added, but the code is still not correct.  The system should
> still crash using the 2nd obj code.
>
> Anyway, the latest compiled code using the new gcc looks correct.  Please let
> me know if you encounter any more issues.  Thanks.

Thanks for your efforts!

Regards,
Pascal de Bruijn
Comment 28 Alan 2010-01-25 14:48:47 UTC
The page allocation bug seems unrelated, so closing this one.

(Lots of vm tuning has occurred since 2.6.27 as well)