Bug 6009
Summary: | tcpdump causes kernel panic | ||
---|---|---|---|
Product: | SCSI Drivers | Reporter: | Donny Jekels (djekels) |
Component: | Other | Assignee: | James Bottomley (jejb) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | akpm, jejb, protasnb |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.15 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
messages files without crond and sshd messages
top xterm display when OS hung
vmstat -s at lock up
i o errors after 3ware controller reset
workerqueue strace and io rejecting I/O
kernel panic pics taken from console
dmesg after patch applied
panic output after patch applied 2 pics
two more pcis of panic
last two pcis of panic
3 jpegs tarred and gzipped
more jpegs
as requested -
3ware patch causes panic at boot time |
Description
Donny Jekels
2006-02-04 09:33:55 UTC
------- Additional Comments From mbligh@mbligh.org 2006-02-04 09:47 -------
Would help if you could include the kernel panic ;-) If it doesn't capture in /var/log/messages, and you don't have a serial console, even a digital camera helps ...

I am running 2.6.16-rc2 now and that kernel panic message does not appear anymore, but the system exhibits the same results after the 3ware array controller resets itself during this high multicast traffic period.
Let me know if there is anything else you'd like me to do. I can even boot 2.6.15 and get the error up again if need be.
Created attachment 7235 [details]
messages files without crond and sshd messages
bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6009

This 2.6.15 kernel has some 3ware problems:

messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue:.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi0 : 3ware 9000 Storage Controller
messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfeaffc00, IRQ: 217.
messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: Firmware FE9X 2.06.00.009, BIOS BE9X 2.03.01.051, Ports: 8.
messages.4:Jan 6 12:15:41 chilsp010 kernel: Vendor: AMCC Model: 9500S-8 DISK Rev: 2.06
messages.4:Jan 6 12:15:41 chilsp010 kernel: Type: Direct-Access ANSI SCSI revision: 03
messages.4:Jan 6 12:15:41 chilsp010 kernel: SCSI device sda: 390602752 512-byte hdwr sectors (199989 MB)
messages.4:Jan 6 12:15:41 chilsp010 kernel: SCSI device sda: drive cache: write back, no read (daft)
messages.4:Jan 6 12:15:41 chilsp010 kernel: sda: sda1 sda2
messages.4:Jan 6 12:15:41 chilsp010 kernel: Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: On host 0 channel 0 id 0 only 511 (max_scsi_report_luns) of 493425154 luns reported, try increasing max_scsi_report_luns.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xb0b800008ed88ec0 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xfbbe007cbf0006b9 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x0002f3a4ea210600 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x00bebe073804750b has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x83c61081fefe0775 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xf3eb16b402b001bb has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x007cb2808a740302 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x8000008041c80100 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x0008fa80ca80ea53 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x7c000031c08ed88e has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xd0bc0020fba0407c has a LUN larger than currently supported.

Donny, I think we have two separate bugs here, but we won't be able to tell until you've been able to get us a copy of the oops output.

Applied the 2.6.16-rc2 patch, rebuilt and running. Not experiencing the same results anymore. Keep new - I will run it for a few hours to test.
Created attachment 7236 [details]
top xterm display when OS hung
And just when I thought it was going to be all right: the rsync job that I ran for
almost 3 hours got stuck and hung the OS.
Created attachment 7237 [details]
vmstat -s at lock up
Created attachment 7238 [details]
i o errors after 3ware controller reset
Attached is the output of any command I try to execute after the host hangs.
No panic this time: it just got stuck, then I got my prompt back after a few minutes,
and all file systems are now mounted read-only. No console messages
either.
------- Additional Comments From djekels@breakwater.com 2006-02-04 15:15 -------
3ware 9000 Storage Controller device driver for Linux v2.26.02.004.
input: AT Translated Set 2 keyboard as /class/input/input0
scsi0 : 3ware 9000 Storage Controller
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfeaff000, IRQ: 9.
3w-9xxx: scsi0: Firmware FE9X 3.02.00.016, BIOS BE9X 3.01.00.027, Ports: 8.
Vendor: AMCC Model: 9550SX-8LP DISK Rev: 3.02
Type: Direct-Access ANSI SCSI revision: 03
SCSI device sda: 390602752 512-byte hdwr sectors (199989 MB)
SCSI device sda: drive cache: write back, no read (daft)
SCSI device sda: 390602752 512-byte hdwr sectors (199989 MB)
SCSI device sda: drive cache: write back, no read (daft)
sda: sda1 sda2
sd 0:0:0:0: Attached scsi disk sda
Vendor: AMCC Model: 9550SX-8LP DISK Rev: 3.02
Type: Direct-Access ANSI SCSI revision: 03
SCSI device sdb: 781205504 512-byte hdwr sectors (399977 MB)
SCSI device sdb: drive cache: write back, no read (daft)
SCSI device sdb: 781205504 512-byte hdwr sectors (399977 MB)
SCSI device sdb: drive cache: write back, no read (daft)
sdb: unknown partition table
sd 0:0:1:0: Attached scsi disk sdb
Debug: sleeping function called from invalid context at kernel/workqueue.c:266
in_atomic():1, irqs_disabled():0

Call Trace: <IRQ> <ffffffff801299ae>{__might_sleep+190}
<ffffffff80139482>{try_to_del_timer_sync+75}
<ffffffff80140c74>{flush_workqueue+24} <ffffffff801ea956>{as_exit_queue+22}
<ffffffff801e1e03>{elevator_exit+18} <ffffffff801e5f4b>{blk_cleanup_queue+42}
<ffffffff8800abba>{:scsi_mod:scsi_device_dev_release+230}
<ffffffff801efcd7>{kobject_cleanup+84} <ffffffff801ea046>{as_queue_empty+0}
<ffffffff801efd04>{kobject_release+0} <ffffffff801f07e5>{kref_put+83}
<ffffffff880077d2>{:scsi_mod:scsi_end_request+186}
<ffffffff88007c8d>{:scsi_mod:scsi_io_completion+1063}
<ffffffff88002d4c>{:scsi_mod:scsi_softirq+360}
<ffffffff80135de8>{__do_softirq+80}
<ffffffff8010ee1b>{call_softirq+31} <ffffffff8011030a>{do_softirq+47}
<ffffffff801102d4>{do_IRQ+50} <ffffffff8010de1a>{ret_from_intr+0}
<EOI> <ffffffff8010bc36>{default_idle+53} <ffffffff8010be37>{cpu_idle+93}
<ffffffff80501347>{start_secondary+1138}
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel@redhat.com
cdrom: open failed.

------- Additional Comments From akpm@osdl.org 2006-02-04 15:31 -------
ah-hah, a trace.

OK, this is a) not related to tcpdump and b) not an oops. It's a warning that we're doing illegal things from softirq context. In this case, we're doing the final kref_put() on an object from within softirq context in the scsi code.

James, I don't recall whether we've fixed this or not? It was non-trivial, wasn't it?

Non trivial yes. But I am still battling getting a core dump or a trace when the OS hangs. I am recompiling 2.6.15 with patch .2 to see if I can recreate the crash.
Created attachment 7242 [details]
workerqueue strace and io rejecting I/O
I cannot replicate the crash, but the OS hangs, and when I get my prompt back most
filesystems are mounted read-only.
Reply-To: James.Bottomley@SteelEye.com

On Sat, 2006-02-04 at 15:31 -0800, Andrew Morton wrote:
> James, I don't recall whether we've fixed this or not? It was non-trivial,
> wasn't it?

It's not fixed, and pretty non-trivial. Basically we'd have to redo most of our generic device (or kobject) handling through workqueues.

What I'd like for this is a way to tell context. We know the locking context and can cope with that, but it would be nice to tell if we have user context or not and then only go through the workqueue for the softirq or hardirq contexts.

If we can get the check, it would probably make sense to do the actual manipulation in put_device().

James

Which kernel should I use to get back to a stable release? The Red Hat 2.6.9 kernel also gives me the same result; that is why I am now using, or want to use, 2.6.15.

James Bottomley <James.Bottomley@SteelEye.com> wrote:
> If we can get the check, it would probably make sense to do the actual
> manipulation in put_device().

in_interrupt() will return true in hard- or soft-irq context on all architectures and all .configs - you can certainly use that.

What we cannot use is in_atomic() to detect whether we're inside spinlock - that only works if CONFIG_PREEMPT.

Regarding this bug, Donny: are you saying that the 3ware failure is caused by putting the NIC into promiscuous mode (or something related)? That it is caused by running tcpdump?
Created attachment 7245 [details]
kernel panic pics taken from console
I finally reproduced the panic.
------- Additional Comments From djekels@breakwater.com 2006-02-04 18:42 -------
James,

Initially, the kernel panicked after we had run our multicast C++ application for about 20 minutes. What I accidentally discovered was that if I run tcpdump the panic occurs much faster, about 2 minutes from start to the panic.
I upgraded the firmware on the 3ware SATA array controller and the device driver, 3w-xxx.ko, per instruction of 3ware developers.
Even with this new firmware I get the same results.

No, there's no kernel panic here. What we have is two things:

a) A kernel _warning_, telling us that we're doing illegal things from softirq context in the scsi stack. This is a known bug.

It's possible that the _probability_ of this happening is increased when there's a lot of network traffic happening, because that causes more softirq activity.

b) The 3ware driver is shitting itself:

messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x0053): Battery capacity test is overdue:.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi0 : 3ware 9000 Storage Controller
messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfeaffc00, IRQ: 217.
messages.4:Jan 6 12:15:41 chilsp010 kernel: 3w-9xxx: scsi0: Firmware FE9X 2.06.00.009, BIOS BE9X 2.03.01.051, Ports: 8.
messages.4:Jan 6 12:15:41 chilsp010 kernel: Vendor: AMCC Model: 9500S-8 DISK Rev: 2.06
messages.4:Jan 6 12:15:41 chilsp010 kernel: Type: Direct-Access ANSI SCSI revision: 03
messages.4:Jan 6 12:15:41 chilsp010 kernel: SCSI device sda: 390602752 512-byte hdwr sectors (199989 MB)
messages.4:Jan 6 12:15:41 chilsp010 kernel: SCSI device sda: drive cache: write back, no read (daft)
messages.4:Jan 6 12:15:41 chilsp010 kernel: sda: sda1 sda2
messages.4:Jan 6 12:15:41 chilsp010 kernel: Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: On host 0 channel 0 id 0 only 511 (max_scsi_report_luns) of 493425154 luns reported, try increasing max_scsi_report_luns.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xb0b800008ed88ec0 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xfbbe007cbf0006b9 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x0002f3a4ea210600 has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x00bebe073804750b has a LUN larger than currently supported.
messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0x83c61081fefe0775 has a LUN larger than currently supported.

I don't know why b) is happening. Can you please confirm that the occurrence of b) is increased if there's a tcpdump happening? I don't believe that's the case, because b) happened at boot.

In other words, we have two completely unrelated bugs. Do you agree?

James, no, the max lun issue only occurs at boot time. I believe there is a method to lower it in /etc/modprobe.conf, "options max_lun=XXX" - I tried this before but it doesn't work. Anyway, see attachment http://bugzilla.kernel.org/attachment.cgi?id=7242&action=view - this is what happens when we have an increase in network traffic.
I also get this when I rsync large filesystems between themselves. It seems a large volume of UDP traffic is a catalyst in this case...
Do you know of any tools I can use to capture the panic, since I cannot access the filesystems after this condition occurs, and a reboot loses all info?

Reply-To: James.Bottomley@SteelEye.com

On Sat, 2006-02-04 at 16:51 -0800, Andrew Morton wrote:
> in_interrupt() will return true in hard- or soft-irq context on all
> architectures and all .configs - you can certainly use that.
>
> What we cannot use is in_atomic() to detect whether we're inside spinlock -
> that only works if CONFIG_PREEMPT.

OK, what do you think about this? It introduces an extra api to ensure user context. I suppose one thing that could be done is to get the wrappers off a slab instead of using kmalloc, but that could be fixed up later.

James

---

[PATCH] fix scsi function called from wrong context errors

There are several functions in our call tree that need to be called with user context, but for which we cannot guarantee this. The two examples here are scsi_target_reap() and scsi_device_dev_release().

The problem is that SCSI commands can be retried from softirq context, so it's very difficult to guarantee user context for anything in SCSI.

The solution is to introduce a new workqueue API: execute_in_user_context() that checks the context and executes the function directly if it has user context, otherwise schedules it for execution via a workqueue. This is the optimal behaviour for SCSI because there are only rare occasions where we don't have context.

Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>

Index: BUILD-2.6/drivers/scsi/scsi_scan.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/scsi_scan.c	2006-02-05 10:30:00.000000000 -0600
+++ BUILD-2.6/drivers/scsi/scsi_scan.c	2006-02-05 10:53:11.000000000 -0600
@@ -385,19 +385,12 @@
 	return found_target;
 }
 
-struct work_queue_wrapper {
-	struct work_struct	work;
-	struct scsi_target	*starget;
-};
-
-static void scsi_target_reap_work(void *data) {
-	struct work_queue_wrapper *wqw = (struct work_queue_wrapper *)data;
-	struct scsi_target *starget = wqw->starget;
+static void scsi_target_reap_usercontext(void *data)
+{
+	struct scsi_target *starget = data;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	unsigned long flags;
 
-	kfree(wqw);
-
 	spin_lock_irqsave(shost->host_lock, flags);
 
 	if (--starget->reap_ref == 0 && list_empty(&starget->devices)) {
@@ -426,18 +419,8 @@
  */
 void scsi_target_reap(struct scsi_target *starget)
 {
-	struct work_queue_wrapper *wqw =
-		kzalloc(sizeof(struct work_queue_wrapper), GFP_ATOMIC);
-
-	if (!wqw) {
-		starget_printk(KERN_ERR, starget,
-			       "Failed to allocate memory in scsi_reap_target()\n");
-		return;
-	}
-
-	INIT_WORK(&wqw->work, scsi_target_reap_work, wqw);
-	wqw->starget = starget;
-	schedule_work(&wqw->work);
+	execute_in_user_context(scsi_target_reap_usercontext,
+				starget);
 }
 
 /**
Index: BUILD-2.6/kernel/workqueue.c
===================================================================
--- BUILD-2.6.orig/kernel/workqueue.c	2006-02-05 10:41:28.000000000 -0600
+++ BUILD-2.6/kernel/workqueue.c	2006-02-05 11:21:14.000000000 -0600
@@ -27,6 +27,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/kthread.h>
+#include <linux/hardirq.h>
 
 /*
  * The per-CPU workqueue (if single thread, we always use the first
@@ -476,6 +477,63 @@
 }
 EXPORT_SYMBOL(cancel_rearming_delayed_work);
 
+struct work_queue_wrapper {
+	struct work_struct work;
+	void (*fn)(void *);
+	void *data;
+};
+
+static void execute_in_user_context_work(void *data)
+{
+	void (*fn)(void *data);
+	struct work_queue_wrapper *wqw = data;
+
+	fn = wqw->fn;
+	data = wqw->data;
+
+	kfree(wqw);
+
+	fn(data);
+}
+
+/**
+ * execute_in_user_context - reliably execute the routine with user context
+ * @fn:		the function to execute
+ * @data:	data to pass to the function
+ *
+ * Executes the function immediately if user context is available, otherwise
+ * schedules the function for delayed execution.
+ *
+ * Returns:	0 - function was executed
+ *		1 - function was scheduled for execution
+ *		<0 - error
+ */
+int execute_in_user_context(void (*fn)(void *data), void *data)
+{
+	struct work_queue_wrapper *wqw;
+
+	if(!in_interrupt()) {
+		fn(data);
+		return 0;
+	}
+
+	wqw = kmalloc(sizeof(struct work_queue_wrapper), GFP_ATOMIC);
+
+	if (!wqw) {
+		printk(KERN_ERR "Failed to allocate memory\n");
+		WARN_ON(1);
+		return -ENOMEM;
+	}
+
+	INIT_WORK(&wqw->work, execute_in_user_context_work, wqw);
+	wqw->fn = fn;
+	wqw->data = data;
+	schedule_work(&wqw->work);
+
+	return 1;
+}
+EXPORT_SYMBOL(execute_in_user_context);
+
 int keventd_up(void)
 {
 	return keventd_wq != NULL;
Index: BUILD-2.6/drivers/scsi/scsi_sysfs.c
===================================================================
--- BUILD-2.6.orig/drivers/scsi/scsi_sysfs.c	2006-02-05 10:43:39.000000000 -0600
+++ BUILD-2.6/drivers/scsi/scsi_sysfs.c	2006-02-05 10:59:54.000000000 -0600
@@ -217,8 +217,9 @@
 	put_device(&sdev->sdev_gendev);
 }
 
-static void scsi_device_dev_release(struct device *dev)
+static void scsi_device_dev_release_usercontext(void *data)
 {
+	struct device *dev = data;
 	struct scsi_device *sdev;
 	struct device *parent;
 	struct scsi_target *starget;
@@ -237,6 +238,7 @@
 
 	if (sdev->request_queue) {
 		sdev->request_queue->queuedata = NULL;
+		/* user context needed to free queue */
 		scsi_free_queue(sdev->request_queue);
 		/* temporary expedient, try to catch use of queue lock
 		 * after free of sdev */
@@ -252,6 +254,12 @@
 	put_device(parent);
 }
 
+static void scsi_device_dev_release(struct device *dev)
+{
+	execute_in_user_context(scsi_device_dev_release_usercontext,
+				dev);
+}
+
 static struct class sdev_class = {
 	.name		= "scsi_device",
 	.release	= scsi_device_cls_release,
Index: BUILD-2.6/include/linux/workqueue.h
===================================================================
--- BUILD-2.6.orig/include/linux/workqueue.h	2006-01-31 17:01:53.000000000 -0600
+++ BUILD-2.6/include/linux/workqueue.h	2006-02-05 10:51:30.000000000 -0600
@@ -74,6 +74,7 @@
 void cancel_rearming_delayed_work(struct work_struct *work);
 void cancel_rearming_delayed_workqueue(struct workqueue_struct *,
 				       struct work_struct *);
+int execute_in_user_context(void (*fn)(void *), void *);
 
 /*
  * Kill off a pending schedule_delayed_work().  Note that the work callback

Thanks, I will apply the patch and test. Will keep you posted.

[root@childd010 linux-2.6.15]# patch -p1 <patch.crash
patching file drivers/scsi/scsi_scan.c
Hunk #1 succeeded at 400 (offset 15 lines).
patching file kernel/workqueue.c
Hunk #1 succeeded at 27 with fuzz 1.
Hunk #2 succeeded at 450 (offset -27 lines).
patching file drivers/scsi/scsi_sysfs.c
Hunk #1 succeeded at 214 (offset -3 lines).
Hunk #3 succeeded at 251 (offset -3 lines).
patching file include/linux/workqueue.h
Hunk #1 succeeded at 73 (offset -1 lines).
[root@childd010 linux-2.6.15]# make clean
Created attachment 7250 [details]
dmesg after patch applied
Applied the patch; attached is dmesg after the patch was applied. No test started yet. Hold thumbs. Here
goes.
Created attachment 7251 [details]
panic output after patch applied 2 pics
James,
Here is the panic after I applied the patch you submitted.
I did the following on the host to produce a large amount of network traffic.
1] rsync 1G file
2] opened 6 xterm windows.
3] tcpdump -i eth0 | grep -v ssh
4] tcpdump -i eth1 | grep -i udp
5] while true; do vmstat -s ; sleep 2; clear; done
6] top
7] kept it open to check if the prompt is active.
Created attachment 7252 [details]
two more pcis of panic
more pics
Created attachment 7253 [details]
last two pcis of panic
These are the last two of the pics I took of the panic.
------- Additional Comments From akpm@osdl.org 2006-02-05 12:05 -------
In what format are attachments 7252 and 7253? I get

bix:/home/akpm> file attachment.cgi
attachment.cgi: RAR archive data

and I don't know what that is.

I used WinRAR to compress the pictures.

Please don't do that. (And please don't send html mail into bugzilla!)

jpeg or something like that would be fine.
Created attachment 7256 [details]
3 jpegs tarred and gzipped
Sorry about the rar and html email. Attached are 2 jpeg files, tarred and gzipped.
I will attach 1 more with 3 pics inside.
Created attachment 7257 [details]
more jpegs
last jpegs of the panic
akpm@osdl.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jejb@steeleye.com

------- Additional Comments From akpm@osdl.org 2006-02-05 15:50 -------
Most of the oops scrolled off the screen. Please use a higher screen resolution.

Add `vga=extended' to grub.conf
Use SYSFONT="iso08.08" in /etc/sysconfig/i18n
Make sure that the whole screen is in the photo.
Just attach the single jpeg to the bug report.

Please remind us which kernel you're using and which patches - I'm losing track.

Thanks.

2.6.15 - rebooting now. Then I will run another test to get it to crash.

Don't you have James's patch applied? There's not much point in testing without it.
Created attachment 7258 [details]
as requested -
this is a full screen
Yes, I only have James's patch applied.

Oh. In that case, James's patch didn't work.

The jpeg looks fine, thanks.

Are you able to test Adam's patch?

I don't know where to get Adam's patch from. I do not see it in the thread.

I forwarded it to you today. I re-forwarded it just now.
Created attachment 7260 [details]
3ware patch causes panic at boot time
The patch that was supplied does not allow the kernel to boot.
Driver version available from 3ware: c0 Driver Version = 2.26.04.007
Removing the patch and testing again on a clean 2.6.16-rc2, with the driver from
3ware, 2.26.04.007.
I think something must have gone wrong with your kernel build or install - the 3ware driver failed to load because of a missing module symbol, but Adam's patch doesn't reference any new symbols.

Hi Andrew,

messages.4:Jan 6 12:15:41 chilsp010 kernel: scsi: host 0 channel 0 id 0 lun 0xb0b800008ed88ec0 has a LUN larger than currently supported.

These messages are a known issue with kernels before 2.6.14 on 3ware systems that have over 4G of RAM. I think it was something in scsi_scan.c but I haven't hunted down the exact patchset. If they are occurring again, that's a regression...

Hi Donny,

If I'm reading this right there are actually 3 separate bugs here.
1) Nonsense LUNs reported on boot. I looked through the var log messages and these only happened with 2.6.9. Let's mark them as fixed.
2) A sleeping function called from invalid context on boot
3) The 3ware resets that happen 20 minutes after boot.

Do you have any kernel panic messages that show #3 type bugs?

Donny,

My patch is only for the in-kernel 3w-9xxx driver. It does in fact fix the report luns error messages on machines with 4GB+ memory. Please reproduce the report luns failure messages with the in-kernel 3w-9xxx driver (with the patch applied) which is v2.26.02.005.

The driver you are running from the 3ware website is only supplied as source to be built against certain distributions we claim to support and their stock kernels. If you have a newer kernel, such as 2.6.14, 2.6.15, etc, run the in-kernel driver. It has the latest updates for that kernel.

-Adam

On Sun, Feb 05, 2006 at 12:02:35PM -0600, James Bottomley wrote:
> On Sat, 2006-02-04 at 16:51 -0800, Andrew Morton wrote:
> > in_interrupt() will return true in hard- or soft-irq context on all
> > architectures and all .configs - you can certainly use that.
> >
> > What we cannot use is in_atomic() to detect whether we're inside spinlock -
> > that only works if CONFIG_PREEMPT.
>
> OK, what do you think about this? It introduces an extra api to ensure
> user context. I suppose one think that could be done is to get the
> wrappers off a slab instead of using kmalloc, but that could be fixed up
> later.
This looks good to me, and odds are this function will come in handy for
other parts of the kernel.
thanks,
greg k-h
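
For readers who want to poke at the control flow of execute_in_user_context() outside the kernel, the dispatch logic of the patch can be mimicked in user space. What follows is only a minimal sketch under stated assumptions, not the patch itself: sim_in_interrupt, sim_execute_in_user_context(), sim_run_deferred() and sim_release() are hypothetical stand-ins (a plain flag for in_interrupt(), malloc() for kmalloc(GFP_ATOMIC), a linked list for the shared workqueue), and allocation failure returns -1 where the patch returns -ENOMEM.

```c
/* User-space sketch of the dispatch logic in execute_in_user_context().
 * All names prefixed sim_ are hypothetical; nothing here is kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for in_interrupt(): the caller flips it by hand. */
static bool sim_in_interrupt;

/* Stand-in for struct work_queue_wrapper: function, argument, list link. */
struct deferred_call {
	void (*fn)(void *);
	void *data;
	struct deferred_call *next;
};

/* Stand-in for the shared workqueue: a simple FIFO list. */
static struct deferred_call *queue_head, *queue_tail;

/* Returns 0 if fn ran immediately, 1 if it was queued, -1 if allocation failed. */
int sim_execute_in_user_context(void (*fn)(void *), void *data)
{
	struct deferred_call *call;

	assert(fn != NULL);

	if (!sim_in_interrupt) {
		/* "User context": safe to call directly. */
		fn(data);
		return 0;
	}

	call = malloc(sizeof(*call));
	if (!call)
		return -1;

	call->fn = fn;
	call->data = data;
	call->next = NULL;
	if (queue_tail)
		queue_tail->next = call;
	else
		queue_head = call;
	queue_tail = call;
	return 1;
}

/* Stand-in for the worker thread: drains the queue and returns how many
 * deferred calls it ran. */
int sim_run_deferred(void)
{
	int ran = 0;

	while (queue_head) {
		struct deferred_call *call = queue_head;
		void (*fn)(void *) = call->fn;
		void *data = call->data;

		queue_head = call->next;
		if (!queue_head)
			queue_tail = NULL;
		free(call);	/* wrapper is freed before the call, as in the patch */
		fn(data);
		ran++;
	}
	return ran;
}

/* Example callback: counts invocations through an int pointer. */
void sim_release(void *data)
{
	(*(int *)data)++;
}
```

With the flag clear, sim_execute_in_user_context() runs the callback at once and returns 0; with the flag set it returns 1 and the callback only runs on the next sim_run_deferred() call. That deferral is the property the SCSI release path in the patch depends on.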
Donny,

Were you able to sort out the patches, apply them and test? Have you tried newer kernels recently? Is the problem still there?

Thanks.

It looks like all three problems listed in #43 have been taken care of. Closing the bug, thanks. |