Bug 12698
Summary: | bnx2_poll_work kernel panic | ||
---|---|---|---|
Product: | Drivers | Reporter: | Pascal de Bruijn (pmjdebruijn) |
Component: | Network | Assignee: | Michael Chan (mchan) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | alan, jtorrice, pterjan |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.27.15 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | Digital screen capture |
Description
Pascal de Bruijn
2009-02-13 03:10:15 UTC
Created attachment 20223
Digital screen capture
The original screen capture taken with a digital camera.
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 13 Feb 2009 03:10:16 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> Summary: bnx2_poll_work kernel panic
> Product: Drivers
> Version: 2.5
> KernelVersion: 2.6.27.15
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: Network
> AssignedTo: jgarzik@pobox.com
> ReportedBy: pmjdebruijn@pcode.nl
>
>
> Latest working kernel version:
> 2.6.22.19
>
> Earliest failing kernel version:
> 2.6.27.15
> We did not test kernel versions in between 2.6.27.15 and 2.6.22.19

Is a regression.

> Distribution:
> Debian GNU/Linux 4.0r6 (fully updated), with self built vanilla kernel.
>
> Hardware Environment:
> HP DL360 G5, with two onboard Broadcom Corporation NetXtreme II BCM5708
> Gigabit Ethernet Controllers
>
> Software Environment:
> Debian GNU/Linux 4.0r6 using rsync 3.0.3
>
> Problem Description:
> After initializing daily (nightly) rsync backups our backup server panics
> with the following call trace (we've had this issue twice, so we downgraded
> our kernel).
>
> The following call trace was captured using a digital camera and reproduced
> as the following text:
>
> [24756.631365] Call Trace:
> [24756.631365] [<c0108ba8>] nommu_map_single+0x38/0x80
> [24756.631365] [<c0261e77>] bnx2_poll+0x67/0x1d0
> [24756.631365] [<c02e9e47>] net_rx_action+0x87/0x130
> [24756.631365] [<c0128b82>] __do_softirq+0x82/0x100
> [24756.631365] [<c0128da0>] ksoftirqd+0x0/0xf0
> [24756.631365] [<c0128c37>] do_softirq+0x37/0x40
> [24756.631365] [<c0128e12>] ksoftirqd+0x72/0xf0
> [24756.631365] [<c01365d2>] kthread+0x42/0x70
> [24756.631365] [<c0136590>] kthread+0x0/0x70
> [24756.631365] [<c0103c2f>] kernel_thread_helper+0x7/0x18
> [24756.631697] =======================
> [24756.631697] Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00 00 0f 45 94 24 88 00 00 00 89 24 88 00 00 00 8b 94 24 8c 00 00 00 <8b> 0e 8d 04 52 8b 93 8c 00 00 00 8d 04 85 10 00 00 00 01 d0 8d
> [24756.631925] EIP: [<c02613d6>] bnx2_poll_work+0xa0d/0x1010 SS:ESP 0068:f787fea8
> [24756.632218] Kernel panic - not syncing: Fatal exception in interrupt
>
> Steps to reproduce:
> 1. Boot server using 2.6.27.15 kernel
> 2. Start several (about 10) simultaneous rsync processes to the server
> 3. Wait for crash
>

There's a partial screenshot in bugzilla. It crashed in bnx2_poll_work().

On Tue, 2009-02-17 at 14:16 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> There's a partial screenshot in bugzilla. It crashed in
> bnx2_poll_work().
>

Pascal, what's the MTU setting? Is it set to 1500 or bigger jumbo frames?

Well, we never touched the MTU settings... We're not using Jumbo frames at all. So we're at the normal MTU 1500...
The following was taken using the 2.6.22.19 kernel, but 2.6.27.15 should have been the same:

eth0      Link encap:Ethernet HWaddr 00:1E:0B:E9:A1:32
          inet addr:XX.XX.XX.XX Bcast:XX.XX.XX.XX Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:185791948 errors:0 dropped:41 overruns:0 frame:0
          TX packets:90091372 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2325280305 (2.1 GiB) TX bytes:2478280955 (2.3 GiB)
          Interrupt:18 Memory:f8000000-f8012100

On Tue, Feb 17, 2009 at 11:26 PM, <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #3 from mchan@broadcom.com 2009-02-17 14:26 -------
>
> On Tue, 2009-02-17 at 14:16 -0800, bugme-daemon@bugzilla.kernel.org
> wrote:
>> There's a partial screenshot in bugzilla. It crashed in
>> bnx2_poll_work().
>>
>
> Pascal, what's the MTU setting? Is it set to 1500 or bigger jumbo
> frames?

We haven't fiddled with the MTU, no jumbo frames, just plain old 1500 byte ones...

> Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00
This i386 32-bit code looks a bit strange and I suspect that it's been corrupted.
Can you send me the output of objdump -d bnx2.o taken from the kernel that
crashed?
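
For readers following along, the comparison requested above relies on two conventions of the oops text: a trace entry has the form symbol+0xoffset/0xsize, and the byte in angle brackets on the Code: line is the byte at EIP. The following Python sketch shows one way to pull those pieces out so they can be checked by eye against an objdump -d listing. It is only an illustration: the helper names (split_symbol, parse_code_bytes, bytes_at_fault) are hypothetical and not part of any kernel tool, and the sample strings are copied from the report above.

```python
import re


def split_symbol(entry):
    """Split an oops entry of the form 'name+0xoff/0xsize' into its parts."""
    m = re.fullmatch(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)", entry)
    if m is None:
        raise ValueError(f"not a symbol+offset/size entry: {entry!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def parse_code_bytes(code):
    """Return (data, index): the dumped bytes and the position of the <xx> byte.

    index is None when no byte is marked.
    """
    data = bytearray()
    fault_index = None
    for token in code.split():
        marked = re.fullmatch(r"<([0-9a-fA-F]{2})>", token)
        if marked:
            fault_index = len(data)
            token = marked.group(1)
        data.append(int(token, 16))
    return bytes(data), fault_index


def bytes_at_fault(code, count=2):
    """Format the bytes starting at the marked position, objdump-style."""
    data, index = parse_code_bytes(code)
    if index is None:
        raise ValueError("no <xx> marker in Code: line")
    return " ".join(f"{b:02x}" for b in data[index:index + count])


# Sample values taken from the report in this thread.
name, offset, size = split_symbol("bnx2_poll_work+0xa0d/0x1010")
excerpt = "8b 94 24 8c 00 00 00 <8b> 0e 8d 04 52"
print(name, hex(offset), hex(size), "|", bytes_at_fault(excerpt))
```

Note that objdump -d prints one instruction per line, so a longer byte run from the Code: line may span several lines of the listing.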
On Fri, 2009-02-20 at 14:32 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #6 from mchan@broadcom.com 2009-02-20 14:32 -------
> > Code: 00 00 00 8b 9c 24 a8 00 00 00 48 83 ea 04 3b 84 24 8c 00 00
>
> This i386 32-bit code looks a bit strange and I suspect that it's been
> corrupted.
> Can you send me the output of objdump -d bnx2.o taken from the kernel that
> crashed?

How could it be corrupted? We've built this kernel with GCC 4.1.1 (Debian Etch), and the module was compiled into the kernel, so I don't have a separate bnx2.o file.

I could provide the vmlinuz?

Please also note, we had EDAC (i5000) enabled, and no ECC errors occurred on this specific machine.

bugme-daemon@bugzilla.kernel.org wrote:
> How could it be corrupted? We've built this kernel with GCC 4.1.1
> (Debian Etch), and the module was compiled into the kernel, so I don't
> have a separate bnx2.o file.

You should still have the bnx2.o in the drivers/net directory. It could be corrupted because buggy software wrote to the driver's code pages or corrupted by DMA. That's why I wanted to compare with the disassembled instructions from bnx2.o.

Just run objdump -d bnx2.o from the kernel source tree that crashed.

>
> I could provide the vmlinuz?

Or send me drivers/net/bnx2.o

>
> Please also note, we had EDAC (i5000) enabled, and no ECC
> errors occurred
> on this specific machine.

Memory overwritten by buggy code or DMA will not produce ECC errors. Thanks.

On Sat, 2009-02-21 at 22:40 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #8 from mchan@broadcom.com 2009-02-21 22:40 -------
> bugme-daemon@bugzilla.kernel.org wrote:
>
> > How could it be corrupted? We've built this kernel with GCC 4.1.1
> > (Debian Etch), and the module was compiled into the kernel, so I don't
> > have a separate bnx2.o file.
>
> You should still have the bnx2.o in the drivers/net directory.
> It could be corrupted because buggy software wrote to the driver's
> code pages or corrupted by DMA. That's why I wanted to compare
> with the disassembled instructions from bnx2.o.
>
> Just run objdump -d bnx2.o from the kernel source tree that crashed.
>
> >
> > I could provide the vmlinuz?
>
> Or send me drivers/net/bnx2.o

I've attached the bnx2.o file.

> >
> > Please also note, we had EDAC (i5000) enabled, and no ECC
> > errors occurred
> > on this specific machine.
>
> Memory overwritten by buggy code or DMA will not produce ECC errors.

Right.

I've looked at the disassembled code from the bnx2.o. The code was not corrupted and it indeed crashed in bnx2_poll_work + 0xa06 which was somewhere in bnx2_rx_skb(). If I decoded the instructions right, it was trying to process rx pages and it shouldn't be doing that if the MTU was set to 1500.

Since it is related to jumbo frames logic (even though you're not using jumbo frames), the latest 2.6.29-rc has the latest firmware with some related bug fixes. If you can test that kernel, it will be great. I can also send you some debug patches for 2.6.27.15 to confirm what I suspect and narrow it down further. We can also try to reproduce this in our lab. Thanks.

On Mon, 2009-02-23 at 16:33 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #10 from mchan@broadcom.com 2009-02-23 16:33 -------
> I've looked at the disassembled code from the bnx2.o. The code was not
> corrupted and it indeed crashed in bnx2_poll_work + 0xa06 which was
> somewhere in bnx2_rx_skb(). If I decoded the instructions right, it was
> trying to process rx pages and it shouldn't be doing that if the MTU was
> set to 1500.
>
> Since it is related to jumbo frames logic (even though you're not using
> jumbo frames), the latest 2.6.29-rc has the latest firmware with some
> related bug fixes. If you can test that kernel, it will be great.
> I can also send you some debug patches for 2.6.27.15 to confirm what I
> suspect and narrow it down further. We can also try to reproduce this
> in our lab.

We are mostly interested in 2.6.27.x (-stable), so I hope all related fixes will also appear in 2.6.27.20+.

Because of the size of the firmware patch, it won't be eligible for -stable.

But we need to confirm if it actually fixes the problem or not. Will you be able to test 2.6.29-rc for us?

On Thu, Feb 26, 2009 at 12:41 AM, <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> ------- Comment #12 from mchan@broadcom.com 2009-02-25 15:41 -------
> Because of the size of the firmware patch, it won't be eligible for -stable.
>
> But we need to confirm if it actually fixes the problem or not. Will you be
> able to test 2.6.29-rc for us?

I'll have to discuss this... I'll let you know...

I understand the rules for -stable... though I can't imagine leaving bnx2 broken in -stable, especially since bnx2/e1000 are the two most common NICs in server parks.

On Thu, Feb 26, 2009 at 9:24 AM, Pascal de Bruijn <pmjdebruijn@pcode.nl> wrote:
> On Thu, Feb 26, 2009 at 12:41 AM, <bugme-daemon@bugzilla.kernel.org> wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>>
>> ------- Comment #12 from mchan@broadcom.com 2009-02-25 15:41 -------
>> Because of the size of the firmware patch, it won't be eligible for -stable.
>>
>> But we need to confirm if it actually fixes the problem or not. Will you be
>> able to test 2.6.29-rc for us?
>
> I'll have to discuss this... I'll let you know...
>
> I understand the rules for -stable... though I can't imagine leaving
> bnx2 broken in -stable, especially since bnx2/e1000 are the two
> most common NICs in server parks.

We're trying to reproduce the bug in a testing environment, but we're finding it difficult to reliably reproduce the problem.

1. We run into this bug every night in our production environment, that's why we rolled back to our older kernel.

2. In the testing environment we can reproduce the issue, but not reliably. Just transferring lots of data simultaneously from multiple sources seems to help in triggering the issue, but I can't say for sure.

In our production environment we have about 200 rsyncs, of which 20 run simultaneously in sequences of 10, during the night.

Without a reliable way to reproduce this in our testing environment, we can't reliably test if this also occurs in 2.6.29-rc.

Regards,
Pascal

OK, we'll try to reproduce this in our lab. Thanks.

Out of curiosity, is there a reason why bnx2 isn't using FW_LOADER yet? So firmware can be relatively easily interchanged...

We're using it starting in 2.6.30. But firmware will not be easily changeable because the firmware interface is tightly coupled and not all combinations will work.

Anyway, I think we have some breakthrough with this problem. Another user who encountered the same problem was able to help us debug it. The problem is in the compiled code and we'll need to add some barrier() to force it to generate the proper code.

If you can send me the disassembled driver code on the kernel that shows the problem, I can confirm it. Use objdump -d bnx2.ko > bnx2.obj and please send me the output file. Thanks.

On Tue, May 5, 2009 at 12:23 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #17 from Michael Chan <mchan@broadcom.com> 2009-05-04 22:23:19 ---
> We're using it starting in 2.6.30. But firmware will not be easily changeable
> because the firmware interface is tightly coupled and not all combinations
> will work.
>
> Anyway, I think we have some breakthrough with this problem. Another user who
> encountered the same problem was able to help us debug it.
> The problem is in the compiled code and we'll need to add some barrier() to
> force it to generate the proper code.
>
> If you can send me the disassembled driver code on the kernel that shows the
> problem, I can confirm it. Use objdump -d bnx2.ko > bnx2.obj and please send
> me the output file. Thanks.

I've attached the compressed obj file. I hope this helps.

If the fix turns out to be relatively trivial, I'd love for it to get submitted for 2.6.27.22.

Regards,
Pascal de Bruijn

Problem should be fixed by this upstream patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=581daf7e00c5e766f26aff80a61a860a17b0d75a

I've tried the patch with 2.6.27.23, and now our system has been stable for two days in a row, whereas it previously crashed every night while running 2.6.27.x.

So I think I can confirm the patch solved our problem.

Thank you!

On Mon, May 11, 2009 at 7:59 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #19 from Michael Chan <mchan@broadcom.com> 2009-05-11 17:59:39 ---
> Problem should be fixed by this upstream patch:
>
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=581daf7e00c5e766f26aff80a61a860a17b0d75a

We've had a kernel with this fix in our production setup now for a couple of days, and we've run into a new issue.

The kernel does not crash anymore, however every other day (or so), the interface just stops working. We get no significant kernel messages (no explicit link down). ifdown/ifupping the interface fixes things again.

So even with this fix the bnx2 driver in 2.6.27.23 still can't really be considered reliable under load. We've had this issue twice now, and the problem occurs consistently about 2:30 hours into our backup scheme.

Please note, we have other machines using 2.6.27.x with the bnx2 driver without issues. But again, our backup machine is the only one that truly gets loaded.
Regards,
Pascal de Bruijn

We now know that the original problem was highly dependent on the compiler and that's why we couldn't reproduce it in our lab. This new problem may also be similar. We will try to set up the exact distro user environment. Please provide gcc -v so we can double check the gcc version. Please also provide the .config. Lastly, please attach the latest disassembled bnx2.ko from 2.6.27.23 with the upstream patch applied. Thanks.

On Thu, 2009-05-21 at 19:04 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #22 from Michael Chan <mchan@broadcom.com> 2009-05-21 19:04:16 ---
> We now know that the original problem was highly dependent on the compiler and
> that's why we couldn't reproduce it in our lab. This new problem may also be
> similar. We will try to set up the exact distro user environment. Please
> provide gcc -v so we can double check the gcc version. Please also provide the
> .config. Lastly, please attach the latest disassembled bnx2.ko from 2.6.27.23
> with the upstream patch applied. Thanks.

I've attached the object dump, however it's from 2.6.27.22 plus your barrier fix applied (I was mistaken about the .23)...

Our system is a Debian Etch install:

# gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)

# dpkg -l | grep gcc
ii gcc 4.1.1-15 The GNU C compiler
ii gcc-4.1 4.1.1-21 The GNU C compiler
ii gcc-4.1-base 4.1.1-21 The GNU Compiler Collection (base package)
ii libgcc1 4.1.1-21 GCC support library
filer02:~#

I hope this helps, please let me know if additional information is required.

Regards,
Pascal de Bruijn

On Thu, May 21, 2009 at 9:04 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #22 from Michael Chan <mchan@broadcom.com> 2009-05-21 19:04:16 ---
> We now know that the original problem was highly dependent on the compiler and
> that's why we couldn't reproduce it in out lab. This new problem may also be
> similar. We will try to set up the exact distro user environemnt. Please
> provide gcc -v so we can double check the gcc version. Please also provide the
> .config. Lastly, please attach the lastest disassembled bnx2.ko from 2.6.27.23
> with the upstream patch applied. Thanks.

Since Debian Etch support will end in less than a year, we considered upgrading the system in question to Debian Lenny. So we did...

Again we compiled our own vanilla 2.6.27.24 kernel with the barrier fix applied...

We used the Debian system compiler, as specified below:

# gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-cld --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)

# dpkg -l | grep gcc
ii gcc 4:4.3.2-2 The GNU C compiler
ii gcc-4.3 4.3.2-1.1 The GNU C compiler
ii gcc-4.3-base 4.3.2-1.1 The GNU Compiler Collection (base package)
ii libgcc1 1:4.3.2-1.1 GCC support library

This time our kernel didn't crash, the link wasn't lost... However we did get a trace in a dmesg...

[122774.972277] [<c0159855>] __alloc_pages_internal+0x375/0x470
[122774.972310] [<c01746e9>] cache_alloc_refill+0x2e9/0x510
[122774.972339] [<c01749d9>] __kmalloc+0xc9/0xd0
[122774.972366] [<c02f6d81>] __alloc_skb+0x51/0x110
[122774.972394] [<c02f7802>] __netdev_alloc_skb+0x22/0x50
[122774.972421] [<c0271b1d>] bnx2_poll_work+0x81d/0x1030
[122774.972453] [<c011bb79>] place_entity+0xa9/0x130
[122774.972481] [<c01fa1fb>] rb_insert_color+0x7b/0xf0
[122774.972510] [<c011975b>] memtype_get_idx+0xb/0xe0
[122774.972537] [<c0272447>] bnx2_poll+0x47/0x210
[122774.972563] [<c02fa847>] net_rx_action+0x97/0x160
[122774.972591] [<c012a642>] __do_softirq+0x82/0xf0
[122774.972619] [<c012a6ed>] do_softirq+0x3d/0x50
[122774.972646] [<c01060a0>] do_IRQ+0x40/0x70
[122774.972672] [<c0113745>] smp_apic_timer_interrupt+0x55/0x80
[122774.972702] [<c0103b3b>] common_interrupt+0x23/0x28
[122774.972735] [<c0109eef>] mwait_idle+0x2f/0x40
[122774.972765] [<c01020f3>] cpu_idle+0x73/0xf0

I've attached the object dump of our bnx2.o file, our
kernel .config, and the complete dmesg.

Regards,
Pascal de Bruijn

I don't see the entire kernel message. Did it say:

"... page allocation error. order:..."

If it did, then it's out of memory which is known to happen. I'll compare the obj files with the last one you sent me to try to determine the 2nd problem you encountered using the older gcc.

1. I checked the complete dmesg log that you sent to me privately. The system is out of memory when the bnx2 driver is trying to allocate a single page. Not sure why your system is so low in kernel memory. But this should be unrelated to the bnx2 driver.

[122774.972217] swapper: page allocation failure. order:0, mode:0x20
[122774.972248] Pid: 0, comm: swapper Not tainted 2.6.27.24-bnx2fix #1
[122774.972277] [<c0159855>] __alloc_pages_internal+0x375/0x470
[122774.972310] [<c01746e9>] cache_alloc_refill+0x2e9/0x510
[122774.972339] [<c01749d9>] __kmalloc+0xc9/0xd0
[122774.972366] [<c02f6d81>] __alloc_skb+0x51/0x110
[122774.972394] [<c02f7802>] __netdev_alloc_skb+0x22/0x50
[122774.972421] [<c0271b1d>] bnx2_poll_work+0x81d/0x1030

2. I've looked at all the obj files that you sent. The one before the patch, the one after the patch, and the one after the patch using gcc 4.3.2. Only the last one using gcc 4.3.2 has the correct code. The 2nd one has obj code that is a little different than the first one near the place where barrier was added, but the code is still not correct. The system should still crash using the 2nd obj code.

Anyway, the latest compiled code using the new gcc looks correct. Please let me know if you encounter any more issues. Thanks.

On Wed, Jun 3, 2009 at 11:45 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12698
>
> --- Comment #26 from Michael Chan <mchan@broadcom.com> 2009-06-03 21:45:05 ---
> 1. I checked the complete dmesg log that you sent to me privately. The system
> is out of memory when the bnx2 driver is trying to allocate a single page.
> Not sure why your system is so low in kernel memory. But this should be
> unrelated to the bnx2 driver.
>
> [122774.972217] swapper: page allocation failure. order:0, mode:0x20
> [122774.972248] Pid: 0, comm: swapper Not tainted 2.6.27.24-bnx2fix #1
> [122774.972277] [<c0159855>] __alloc_pages_internal+0x375/0x470
> [122774.972310] [<c01746e9>] cache_alloc_refill+0x2e9/0x510
> [122774.972339] [<c01749d9>] __kmalloc+0xc9/0xd0
> [122774.972366] [<c02f6d81>] __alloc_skb+0x51/0x110
> [122774.972394] [<c02f7802>] __netdev_alloc_skb+0x22/0x50
> [122774.972421] [<c0271b1d>] bnx2_poll_work+0x81d/0x1030
>
> 2. I've looked at all the obj files that you sent. The one before the patch,
> the one after the patch, and the one after the patch using gcc 4.3.2. Only the
> last one using gcc 4.3.2 has the correct code. The 2nd one has obj code that
> is a little different than the first one near the place where barrier was
> added, but the code is still not correct. The system should still crash using
> the 2nd obj code.
>
> Anyway, the latest compiled code using the new gcc looks correct. Please let
> me know if you encounter any more issues. Thanks.

Thanks for your efforts!

Regards,
Pascal de Bruijn

The page allocation bug seems unrelated so closing this one.
(Lots of vm tuning has occurred since 2.6.27 as well)
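
The thread states only that the compiled code was wrong and that barrier() was needed to make the compiler "generate the proper code"; it does not show the miscompiled sequence. In kernel C, barrier() is a compiler-only barrier that stops the compiler from caching or reordering memory accesses across it. As an illustration of the general hazard, and assuming (this is not stated in the thread) that the compiler emitted two loads of an index that the NIC updates by DMA, the Python sketch below simulates the difference between testing one read and using another, versus taking a single snapshot. All names here (StatusBlock, cons_with_reload, cons_with_snapshot) and the MAX_RX_DESC_CNT value are hypothetical stand-ins, not code from the driver or the patch.

```python
MAX_RX_DESC_CNT = 255  # assumed ring mask for this sketch, not taken from the thread


class StatusBlock:
    """Stand-in for device-written memory: each read may see a newer index."""

    def __init__(self, successive_values):
        self._values = iter(successive_values)

    def read_cons(self):
        return next(self._values)


def cons_with_reload(status_block):
    """Test one read but return another, as if the compiler reloaded memory."""
    if (status_block.read_cons() & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT:
        return status_block.read_cons() + 1
    return status_block.read_cons()


def cons_with_snapshot(status_block):
    """Read once into a local and use only that copy afterwards."""
    cons = status_block.read_cons()
    if (cons & MAX_RX_DESC_CNT) == MAX_RX_DESC_CNT:
        cons += 1
    return cons


# The simulated device advances the index from 254 to 255 between two reads.
reloaded = cons_with_reload(StatusBlock([254, 255]))
snapshot = cons_with_snapshot(StatusBlock([254, 255]))
print("reload:", reloaded, "snapshot:", snapshot)
```

In the reload variant the check passes on the first value, but the value returned is the second one, which was never checked. The snapshot variant always tests and returns the same value, which is the behaviour a correctly placed barrier() is meant to guarantee in the compiled C code.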