Latest working kernel version:2.6.27 Earliest failing kernel version:2.6.28-rc2 Distribution: RHEL 4u2 Hardware Environment: IA64 Software Environment: Problem Description: When we use dd tool to clear a SCSI disk. it may getinto a kernel panic like below: Unable to handle kernel NULL pointer dereference (address 0000000000000044) ksoftirqd/1[6]: Oops 8813272891392 [1] Modules linked in: Pid: 6, CPU 1, comm: ksoftirqd/1 psr : 00001010081a6010 ifs : 8000000000000289 ip : [<a000000100391cf0>] Not tainted (2.6.28-rc6) ip is at scsi_softirq_done+0x50/0x2e0 unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000005a81 ldrs: 0000000000000000 ccv : 0000000000000002 fpsr: 0009804c8a70433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a00000010023f910 b6 : a000000100391ca0 b7 : a00000010000b220 f6 : 1003e0000000000001d4c f7 : 1003e0000000000200000 f8 : 1003e0000000080000000 f9 : 1003e0000054a28eb9332 f10 : 1003eb59a50a4dbb025f6 f11 : 1003e00000000000000c7 r1 : a000000100a1b720 r2 : e000000001105e48 r3 : e00000000cff9518 r8 : e00000000cff9510 r9 : ffffffffffff5e48 r10 : 0000000000000018 r11 : 0000000000000044 r12 : e000000008a97e10 r13 : e000000008a90000 r14 : e000000008a97e30 r15 : e000000008a97e38 r16 : 0000000000000020 r17 : 0000000000000030 r18 : e00000000cff9650 r19 : e00000000cff95e0 r20 : a00000010082eea8 r21 : a000000100391ca0 r22 : e000000001e9ab70 r23 : e000000001e9aac8 r24 : e00000000cff9540 r25 : e000000008a97e30 r26 : e000000008a97e30 r27 : e00000000cff9518 r28 : e000000008a97e38 r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000000 Call Trace: [<a000000100012870>] show_stack+0x50/0xa0 sp=e000000008a979e0 bsp=e000000008a90eb0 [<a000000100013150>] show_regs+0x830/0x860 sp=e000000008a97bb0 bsp=e000000008a90e68 [<a000000100022a60>] die+0x1c0/0x2c0 sp=e000000008a97bb0 bsp=e000000008a90e20 [<a000000100042350>] ia64_do_page_fault+0x830/0x960 sp=e000000008a97bb0 bsp=e000000008a90db8 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270 sp=e000000008a97c40 bsp=e000000008a90db8 [<a000000100391cf0>] scsi_softirq_done+0x50/0x2e0 sp=e000000008a97e10 bsp=e000000008a90d70 [<a00000010023f910>] blk_done_softirq+0x170/0x1c0 sp=e000000008a97e20 bsp=e000000008a90d58 [<a00000010006a8b0>] __do_softirq+0x110/0x2e0 sp=e000000008a97e30 bsp=e000000008a90cc8 [<a00000010006aaf0>] do_softirq+0x70/0xc0 sp=e000000008a97e30 bsp=e000000008a90c68 [<a00000010006c2d0>] ksoftirqd+0xf0/0x240 sp=e000000008a97e30 bsp=e000000008a90c40 [<a00000010008df80>] kthread+0xa0/0x120 sp=e000000008a97e30 bsp=e000000008a90c10 [<a000000100014430>] kernel_thread_helper+0x30/0x60 sp=e000000008a97e30 bsp=e000000008a90be0 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40 sp=e000000008a97e30 bsp=e000000008a90be0 Kernel panic - not syncing: Aiee, killing interrupt handler! Steps to reproduce: while true; do echo "3">/proc/sys/vm/drop_caches; dd if=/dev/sdb2 of=/dev/zero bs=1024k count=8900; done;
the device is : 06:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07) Subsystem: Intel Corporation: Unknown device 3402 Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 49 I/O ports at 8800 [size=256] Memory at 00000000f8fd0000 (64-bit, non-prefetchable) [size=64K] Memory at 00000000f8fc0000 (64-bit, non-prefetchable) [size=64K] Expansion ROM at 00000000f8d00000 [disabled] [size=1M] Capabilities: <available only to root> 06:02.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07) Subsystem: Intel Corporation: Unknown device 3402 Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 50 I/O ports at 8c00 [size=256] Memory at 00000000f8ff0000 (64-bit, non-prefetchable) [size=64K] Memory at 00000000f8fe0000 (64-bit, non-prefetchable) [size=64K] Expansion ROM at 00000000f8e00000 [disabled] [size=1M] Capabilities: <available only to root>
Created attachment 19244 [details] kernel config file
I instrumented kernel. below is some finding. 1) The disk isn't stable. If we run dd in a loop, sometimes the disk doesn't work, so block request will expire to start a scsi error handling; 2) fussion driver has its own periodic (1 sec) checking in mpt_fault_reset_work. If something is wrong, it will do a hardware reset and flush all pending requests; 3) Step 1) and 2) have a race that a request might be flushed in step 2), but step 1) releases the request firstly. Later on, when SOFTIRQ tries to release the scsi_cmnd, kernel panic.
Good finding!
Reply-To: James.Bottomley@HansenPartnership.com On Thu, 2008-12-11 at 01:05 -0800, bugme-daemon@bugzilla.kernel.org wrote: > I instrumented kernel. below is some finding. > 1) The disk isn't stable. If we run dd in a loop, sometimes the disk doesn't > work, so block request will expire to start a scsi error handling; > 2) fussion driver has its own periodic (1 sec) checking in > mpt_fault_reset_work. If something is wrong, it will do a hardware reset and > flush all pending requests; > 3) Step 1) and 2) have a race that a request might be flushed in step 2), but > step 1) releases the request firstly. Later on, when SOFTIRQ tries to release > the scsi_cmnd, kernel panic. OK, so to rule out the block timer problems which we think are fixed, could you reproduce on 2.6.28-rc8. Also, the routine > [<a000000100391cf0>] scsi_softirq_done+0x50/0x2e0 > sp=e000000008a97e10 bsp=e000000008a90d70 Is hard to diagnose. Could you ask gdb or addr2line which actual source line this is? Thanks, James
2.6.28-rc8 also panic addr2line -e vmlinux a0000001003971b0 linux-2.6.28-rc8/drivers/scsi/scsi_lib.c:1538 1535 static void scsi_softirq_done(struct request *rq) 1536 { 1537 struct scsi_cmnd *cmd = rq->special; 1538 unsigned long wait_for = (cmd->allowed + 1) * rq->timeout; 1539 int disposition;
Reply-To: andmike@linux.vnet.ibm.com bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12195 > > > > > > ------- Comment #6 from ming.m.lin@intel.com 2008-12-11 18:27 ------- > 2.6.28-rc8 also panic The blk_mark_rq_complete check should prevent completions from occurring on already timed out requests unless the interaction previous mentioned between mpt_fault_reset_work and the scsi eh thread requeue alows the REQ_ATOM_COMPLETE bit to get cleared prior to the scsi_done being called from mptscsih_flush_running_cmds. This did not look obvious to hit. mpt_fault_reset_work mpt_HardResetHandler mpt_signal_reset mptsas_ioc_reset mptscsih_flush_running_cmds mpt_do_ioc_recovery When scsi_times_out is called there should not be a transportt->eh_timed_out, or hostt->eh_timed_out set for mptsas which should lead to waking up the eh thread. We will then call mptscsih_abort from the eh thread and it will return success if the scsi command is not found leading to a possible requeue. If you have time for another re-create it would be good to set some scsi logging. sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4 echo "1" > /proc/sys/kernel/sysrq # If needed echo 9 > /proc/sysrq-trigger # Raise console log level Also if you have more dmesg output prior to the error from the previous failure runs that would be good to post also. -andmike -- Michael Anderson andmike@linux.vnet.ibm.com
Reply-To: James.Bottomley@HansenPartnership.com On Fri, 2008-12-12 at 02:22 -0800, Mike Anderson wrote: > bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=12195 > > > > > > > > > > > > ------- Comment #6 from ming.m.lin@intel.com 2008-12-11 18:27 ------- > > 2.6.28-rc8 also panic > > The blk_mark_rq_complete check should prevent completions from occurring on > already timed out requests unless the interaction previous mentioned between > mpt_fault_reset_work and the scsi eh thread requeue alows the > REQ_ATOM_COMPLETE > bit to get cleared prior to the scsi_done being called from > mptscsih_flush_running_cmds. This did not look obvious to hit. > > mpt_fault_reset_work > mpt_HardResetHandler > mpt_signal_reset > mptsas_ioc_reset > mptscsih_flush_running_cmds > mpt_do_ioc_recovery Actually, this isn't quite true. Particularly in the eh case. It looks like the block timeout isn't stopped until blk_complete_request() which is pretty late. If the timeout fires after scsi_done is called but before we complete the request, any timeout goes through the BLK_EH_HANDLED path to __blk_complete_request(). This routine unconditionally adds to the done routine without checking the mark, so there is a window where we can get double dones. James
Reply-To: James.Bottomley@HansenPartnership.com On Fri, 2008-12-12 at 09:09 -0600, James Bottomley wrote: > On Fri, 2008-12-12 at 02:22 -0800, Mike Anderson wrote: > > bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote: > > > http://bugzilla.kernel.org/show_bug.cgi?id=12195 > > > > > > > > > > > > > > > > > > ------- Comment #6 from ming.m.lin@intel.com 2008-12-11 18:27 ------- > > > 2.6.28-rc8 also panic > > > > The blk_mark_rq_complete check should prevent completions from occurring on > > already timed out requests unless the interaction previous mentioned > between > > mpt_fault_reset_work and the scsi eh thread requeue alows the > REQ_ATOM_COMPLETE > > bit to get cleared prior to the scsi_done being called from > > mptscsih_flush_running_cmds. This did not look obvious to hit. > > > > mpt_fault_reset_work > > mpt_HardResetHandler > > mpt_signal_reset > > mptsas_ioc_reset > > mptscsih_flush_running_cmds > > mpt_do_ioc_recovery > > Actually, this isn't quite true. Particularly in the eh case. It looks > like the block timeout isn't stopped until blk_complete_request() which > is pretty late. If the timeout fires after scsi_done is called but > before we complete the request, any timeout goes through the > BLK_EH_HANDLED path to __blk_complete_request(). This routine > unconditionally adds to the done routine without checking the mark, so > there is a window where we can get double dones. Actually, I take that back ... the patch to plug the unprep race was send over the list but never applied because the timer changes seemed to fix the problem. So, we still have a small window where unprep can NULL out rq->special while an asynchronous mpt reset is flushing the commands via scsi_done. Could you see if it goes away (or at least lessens in frequency) with this patch? Thanks, James --- diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 111f9e9..f2f51e0 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -653,8 +653,8 @@ static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd) struct request *req = cmd->request; unsigned long flags; - scsi_unprep_request(req); spin_lock_irqsave(q->queue_lock, flags); + scsi_unprep_request(req); blk_requeue_request(q, req); spin_unlock_irqrestore(q->queue_lock, flags);
Handled-By : James Bottomley <James.Bottomley@HansenPartnership.com> Notify-Also : Michael Anderson <andmike@linux.vnet.ibm.com> Notify-Also : Lin Ming <ming.m.lin@intel.com> Notify-Also : Yanmin Zhang <yanmin_zhang@linux.intel.com>
kernel panic again with patch at #9 applied to 2.6.28-rc8 ---------------------------------------------------------- mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh end_request: I/O error, dev sdb, sector 22968800 Buffer I/O error on device sdb, logical block 5742200 Buffer I/O error on device sdb, logical block 5742201 Buffer I/O error on device sdb, logical block 5742202 Buffer I/O error on device sdb, logical block 5742203 Buffer I/O error on device sdb, logical block 5742204 Buffer I/O error on device sdb, logical block 5742205 Buffer I/O error on device sdb, logical block 5742206 Buffer I/O error on device sdb, logical block 5742207 Buffer I/O error on device sdb, logical block 5742208 Buffer I/O error on device sdb, logical block 5742209 sd 0:0:1:0: rejecting I/O to offline device end_request: I/O error, dev sdb, sector 22969056 mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! Unable to handle kernel NULL pointer dereference (address 0000000000000044) ksoftirqd/2[10]: Oops 8813272891392 [1] Modules linked in: Pid: 10, CPU 2, comm: ksoftirqd/2 psr : 00001010081a6010 ifs : 8000000000000289 ip : [<a000000100397190>] Not tainted (2.6.28-rc8-mlin) ip is at scsi_softirq_done+0x50/0x2e0 unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000005a81 ldrs: 0000000000000000 ccv : 0000000000000002 fpsr: 0009804c8a70433f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100244e90 b6 : a000000100397140 b7 : a00000010000b220 f6 : 1003e0000000000000000 f7 : 1003e0000000000200000 f8 : 1003e0000000080000000 f9 : 1003e000001de86495475 f10 : 1003e9277c68cf83c1120 f11 : 1003e0000000000000046 r1 : a000000100a2bf40 r2 : e000000001115ec8 r3 : e0000000402d6658 r8 : e0000000402d6650 r9 : ffffffffffff5ec8 r10 : 0000000000000018 r11 : 0000000000000044 r12 : e000000008b07e10 r13 : e000000008b00000 r14 : e000000008b07e30 r15 : e000000008b07e38 r16 : 0000000000000020 r17 : 0000000000000030 r18 : e0000000402d6790 r19 : e0000000402d6720 r20 : a00000010083fb20 r21 : a000000100397140 r22 : e00000004062b388 r23 : e00000004062b2e0 r24 : e0000000402d6680 r25 : e000000008b07e30 r26 : e000000040119700 r27 : e0000000402d6658 r28 : e000000040119708 r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000000 Call Trace: [<a000000100012870>] show_stack+0x50/0xa0 sp=e000000008b079e0 bsp=e000000008b00ec8 [<a000000100013150>] show_regs+0x830/0x860 sp=e000000008b07bb0 bsp=e000000008b00e80 [<a000000100022a60>] die+0x1c0/0x2c0 sp=e000000008b07bb0 bsp=e000000008b00e38 [<a000000100042350>] ia64_do_page_fault+0x830/0x960 sp=e000000008b07bb0 bsp=e000000008b00dd0 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270 sp=e000000008b07c40 bsp=e000000008b00dd0 [<a000000100397190>] scsi_softirq_done+0x50/0x2e0 sp=e000000008b07e10 bsp=e000000008b00d88 [<a000000100244e90>] blk_done_softirq+0x170/0x1c0 sp=e000000008b07e20 bsp=e000000008b00d70 [<a00000010006f050>] __do_softirq+0x110/0x2e0 sp=e000000008b07e30 bsp=e000000008b00ce0 [<a00000010006f290>] do_softirq+0x70/0xc0 sp=e000000008b07e30 bsp=e000000008b00c80 [<a000000100070a70>] ksoftirqd+0xf0/0x240 sp=e000000008b07e30 bsp=e000000008b00c58 [<a000000100092720>] kthread+0xa0/0x120 sp=e000000008b07e30 bsp=e000000008b00c28 [<a000000100014430>] kernel_thread_helper+0x30/0x60 sp=e000000008b07e30 bsp=e000000008b00c00 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40 sp=e000000008b07e30 bsp=e000000008b00c00 Kernel panic - not syncing: Aiee, killing interrupt handler!
(In reply to comment #7) > If you have time for another re-create it would be good to set some scsi > logging. > sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4 > echo "1" > /proc/sys/kernel/sysrq # If needed > echo 9 > /proc/sysrq-trigger # Raise console log level > mptscsih: ioc0: attempting task abort! (sc=e000000037a36980) sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: Initiating recovery mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh mptscsih: ioc0: Issue of TaskMgmt failed! mptscsih: ioc0: task abort: FAILED (sc=e000000037a36980) mptscsih: ioc0: attempting task abort! (sc=e000000037a35f80) sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4f a0 00 01 00 00 mptscsih: ioc0: task abort: FAILED (sc=e000000037a35f80) mptscsih: ioc0: attempting target reset! (sc=e000000037a36980) sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 mptscsih: ioc0: target reset: FAILED (sc=e000000037a36980) mptscsih: ioc0: attempting bus reset! (sc=e000000037a36980) sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 mptscsih: ioc0: bus reset: FAILED (sc=e000000037a36980) mptscsih: ioc0: attempting host reset! (sc=e000000037a36980) mptscsih: ioc0: host reset: SUCCESS (sc=e000000037a36980) sd 0:0:1:0: Device offlined - not ready after error recovery sd 0:0:1:0: Device offlined - not ready after error recovery sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06 end_request: I/O error, dev sdb, sector 20663968 Buffer I/O error on device sdb, logical block 5165992 Buffer I/O error on device sdb, logical block 5165993 Buffer I/O error on device sdb, logical block 5165994 Buffer I/O error on device sdb, logical block 5165995 Buffer I/O error on device sdb, logical block 5165996 Buffer I/O error on device sdb, logical block 5165997 Buffer I/O error on device sdb, logical block 5165998 Buffer I/O error on device sdb, logical block 5165999 Buffer I/O error on device sdb, logical block 5166000 Buffer I/O error on device sdb, logical block 5166001 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06 end_request: I/O error, dev sdb, sector 20664224 mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=e000000037a36980, mf = e0000000406847e0, idx=55 sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=e000000037a35f80, mf = e0000000406883e0, idx=f5 Unable to handle kernel NULL pointer dereference (address 0000000000000044) mpt_poll_0[378]: Oops 8813272891392 [1] Modules linked in: Pid: 378, CPU 1, comm: mpt_poll_0 psr : 0000101008026018 ifs : 8000000000000289 ip : [<a0000001003988b0>] Not tainted (2.6.28-rc8-mlin) ip is at scsi_softirq_done+0x50/0x2e0 unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003 rnat: 5555555555555555 bsps: 5555555555555555 pr : 000002a1491599a5 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100245310 b6 : a000000100398860 b7 : a000000100011ec0 f6 : 1003e0000000000000000 f7 : 1003e0000000000200802 f8 : 1003e0000000080000000 f9 : 1003e00000000000001d3 f10 : 1003e392baaf0b5995995 f11 : 1003e0000000000000030 r1 : a000000100a2c6c0 r2 : e000000001105ec8 r3 : e000000035e2d688 r8 : e000000035e2d680 r9 : ffffffffffff5ec8 r10 : 0000000000000018 r11 : 0000000000000044 r12 : e0000000405c7830 r13 : e0000000405c0000 r14 : e0000000405c7850 r15 : e0000000405c7858 r16 : 0000000000000020 r17 : 0000000000000030 r18 : e000000035e2d7c0 r19 : e000000035e2d750 r20 : 0000000000000001 r21 : a000000100398860 r22 : e000000040599ac0 r23 : e000000040599a18 r24 : e000000035e2d6b0 r25 : e0000000405c7850 r26 : e0000000405c7850 r27 : e000000035e2d688 r28 : e0000000405c7858 r29 : a000000100011ec0 r30 : 0000000000000000 r31 : e000000008a78000 Call Trace: [<a000000100012870>] show_stack+0x50/0xa0 sp=e0000000405c7400 bsp=e0000000405c1368 [<a000000100013150>] show_regs+0x830/0x860 sp=e0000000405c75d0 bsp=e0000000405c1320 [<a000000100022a60>] die+0x1c0/0x2c0 sp=e0000000405c75d0 bsp=e0000000405c12d8 [<a000000100042350>] ia64_do_page_fault+0x830/0x960 sp=e0000000405c75d0 bsp=e0000000405c1278 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270 sp=e0000000405c7660 bsp=e0000000405c1278 [<a0000001003988b0>] scsi_softirq_done+0x50/0x2e0 sp=e0000000405c7830 bsp=e0000000405c1230 [<a000000100245310>] blk_done_softirq+0x170/0x1c0 sp=e0000000405c7840 bsp=e0000000405c1218 [<a00000010006f3d0>] __do_softirq+0x110/0x2e0 sp=e0000000405c7850 bsp=e0000000405c1180 [<a00000010006f610>] do_softirq+0x70/0xc0 sp=e0000000405c7850 bsp=e0000000405c1120 [<a00000010006fa50>] irq_exit+0x70/0xa0 sp=e0000000405c7850 bsp=e0000000405c1108 [<a000000100011c70>] ia64_handle_irq+0x350/0x380 sp=e0000000405c7850 bsp=e0000000405c10d0 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270 sp=e0000000405c7850 bsp=e0000000405c10d0 [<a0000001000662e0>] vprintk+0x660/0x6a0 sp=e0000000405c7a20 bsp=e0000000405c1000 [<a0000001000663a0>] printk+0x80/0xc0 sp=e0000000405c7a90 bsp=e0000000405c0f98 [<a0000001003d4c50>] mptscsih_ioc_reset+0x250/0x420 sp=e0000000405c7ad0 bsp=e0000000405c0f30 [<a0000001003d9c50>] mptspi_ioc_reset+0x30/0xa0 sp=e0000000405c7ae0 bsp=e0000000405c0f08 [<a0000001003c1d30>] mpt_signal_reset+0x150/0x180 sp=e0000000405c7ae0 bsp=e0000000405c0ed8 [<a0000001003c53c0>] KickStart+0x680/0x1000 sp=e0000000405c7ae0 bsp=e0000000405c0e70 [<a0000001003c5f50>] MakeIocReady+0x210/0x500 sp=e0000000405c7ae0 bsp=e0000000405c0e18 [<a0000001003c86f0>] mpt_do_ioc_recovery+0x150/0x2840 sp=e0000000405c7ae0 bsp=e0000000405c0d48 [<a0000001003caf80>] mpt_HardResetHandler+0x1a0/0x2a0 sp=e0000000405c7dd0 bsp=e0000000405c0d00 [<a0000001003cb190>] mpt_fault_reset_work+0x110/0x260 sp=e0000000405c7dd0 bsp=e0000000405c0cc0 [<a00000010008a310>] run_workqueue+0x1f0/0x340 sp=e0000000405c7dd0 bsp=e0000000405c0c80 [<a00000010008a5a0>] worker_thread+0x140/0x180 sp=e0000000405c7dd0 bsp=e0000000405c0c58 [<a000000100092aa0>] kthread+0xa0/0x120 sp=e0000000405c7e30 bsp=e0000000405c0c28 [<a000000100014430>] kernel_thread_helper+0x30/0x60 sp=e0000000405c7e30 bsp=e0000000405c0c00 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40 sp=e0000000405c7e30 bsp=e0000000405c0c00 Kernel panic - not syncing: Aiee, killing interrupt handler!
Reply-To: James.Bottomley@HansenPartnership.com On Mon, 2008-12-15 at 02:39 -0800, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12195 > > > > > > ------- Comment #12 from ming.m.lin@intel.com 2008-12-15 02:39 ------- > (In reply to comment #7) > > If you have time for another re-create it would be good to set some scsi > > logging. > > sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4 > > echo "1" > /proc/sys/kernel/sysrq # If needed > > echo 9 > /proc/sysrq-trigger # Raise console log level > > > > mptscsih: ioc0: attempting task abort! (sc=e000000037a36980) > sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 > mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! > mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! > mptbase: ioc0: Initiating recovery > mptbase: ioc0: WARNING - IOC is in FAULT state!!! > mptbase: ioc0: WARNING - FAULT code = 000eh > mptscsih: ioc0: Issue of TaskMgmt failed! > mptscsih: ioc0: task abort: FAILED (sc=e000000037a36980) > mptscsih: ioc0: attempting task abort! (sc=e000000037a35f80) > sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4f a0 00 01 00 00 > mptscsih: ioc0: task abort: FAILED (sc=e000000037a35f80) > mptscsih: ioc0: attempting target reset! (sc=e000000037a36980) > sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 > mptscsih: ioc0: target reset: FAILED (sc=e000000037a36980) > mptscsih: ioc0: attempting bus reset! (sc=e000000037a36980) > sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00 > mptscsih: ioc0: bus reset: FAILED (sc=e000000037a36980) > mptscsih: ioc0: attempting host reset! (sc=e000000037a36980) > mptscsih: ioc0: host reset: SUCCESS (sc=e000000037a36980) > sd 0:0:1:0: Device offlined - not ready after error recovery > sd 0:0:1:0: Device offlined - not ready after error recovery > sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06 > end_request: I/O error, dev sdb, sector 20663968 > Buffer I/O error on device sdb, logical block 5165992 > Buffer I/O error on device sdb, logical block 5165993 > Buffer I/O error on device sdb, logical block 5165994 > Buffer I/O error on device sdb, logical block 5165995 > Buffer I/O error on device sdb, logical block 5165996 > Buffer I/O error on device sdb, logical block 5165997 > Buffer I/O error on device sdb, logical block 5165998 > Buffer I/O error on device sdb, logical block 5165999 > Buffer I/O error on device sdb, logical block 5166000 > Buffer I/O error on device sdb, logical block 5166001 > sd 0:0:1:0: rejecting I/O to offline device > sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06 > end_request: I/O error, dev sdb, sector 20664224 > mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! > sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, > sc=e000000037a36980, mf = e0000000406847e0, idx=55 > sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, > sc=e000000037a35f80, mf = e0000000406883e0, idx=f5 > Unable to handle kernel NULL pointer dereference (address 0000000000000044) > mpt_poll_0[378]: Oops 8813272891392 [1] Oh ... this is actually a fusion problem, then. It looks like the fusion is relying on the old done behaviour. Does this work? It flushes the fusion internal queue if we go into host reset. This should prevent the commands turning up later after the device has been offlined. James --- diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c index d62fd4f..ee09041 100644 --- a/drivers/message/fusion/mptscsih.c +++ b/drivers/message/fusion/mptscsih.c @@ -2008,6 +2008,9 @@ mptscsih_host_reset(struct scsi_cmnd *SCpnt) return FAILED; } + /* make sure we have no outstanding commands at this stage */ + mptscsih_flush_running_cmds(hd); + ioc = hd->ioc; printk(MYIOC_s_INFO_FMT "attempting host reset! (sc=%p)\n", ioc->name, SCpnt);
(In reply to comment #13) > > diff --git a/drivers/message/fusion/mptscsih.c > b/drivers/message/fusion/mptscsih.c > index d62fd4f..ee09041 100644 > --- a/drivers/message/fusion/mptscsih.c > +++ b/drivers/message/fusion/mptscsih.c > @@ -2008,6 +2008,9 @@ mptscsih_host_reset(struct scsi_cmnd *SCpnt) > return FAILED; > } > > + /* make sure we have no outstanding commands at this stage */ > + mptscsih_flush_running_cmds(hd); > + > ioc = hd->ioc; > printk(MYIOC_s_INFO_FMT "attempting host reset! (sc=%p)\n", > ioc->name, SCpnt); > It seems that this patch fixes the bug. I have been running "dd" for 4 hours and no panic. dmesg attached. Thanks ------------------------------------------------------ end_request: I/O error, dev sdb, sector 34572800 Buffer I/O error on device sdb, logical block 8643200 Buffer I/O error on device sdb, logical block 8643201 Buffer I/O error on device sdb, logical block 8643202 Buffer I/O error on device sdb, logical block 8643203 Buffer I/O error on device sdb, logical block 8643204 Buffer I/O error on device sdb, logical block 8643205 Buffer I/O error on device sdb, logical block 8643206 Buffer I/O error on device sdb, logical block 8643207 Buffer I/O error on device sdb, logical block 8643208 Buffer I/O error on device sdb, logical block 8643209 Info fld=0x4044 end_request: I/O error, dev sdb, sector 34572544 __ratelimit: 54 callbacks suppressed Buffer I/O error on device sdb, logical block 8643136 Buffer I/O error on device sdb, logical block 8643137 Buffer I/O error on device sdb, logical block 8643138 Buffer I/O error on device sdb, logical block 8643139 Buffer I/O error on device sdb, logical block 8643140 Buffer I/O error on device sdb, logical block 8643141 Buffer I/O error on device sdb, logical block 8643142 Buffer I/O error on device sdb, logical block 8643143 Buffer I/O error on device sdb, logical block 8643144 Buffer I/O error on device sdb, logical block 8643145 program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 54 callbacks suppressed Buffer I/O error on device sdb, logical block 5081200 Buffer I/O error on device sdb, logical block 5081201 Buffer I/O error on device sdb, logical block 5081202 Buffer I/O error on device sdb, logical block 5081203 Buffer I/O error on device sdb, logical block 5081204 Buffer I/O error on device sdb, logical block 5081205 Buffer I/O error on device sdb, logical block 5081206 Buffer I/O error on device sdb, logical block 5081207 Buffer I/O error on device sdb, logical block 5081208 Buffer I/O error on device sdb, logical block 5081209 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 126 callbacks suppressed Buffer I/O error on device sdb, logical block 5642800 Buffer I/O error on device sdb, logical block 5642801 Buffer I/O error on device sdb, logical block 5642802 Buffer I/O error on device sdb, logical block 5642803 Buffer I/O error on device sdb, logical block 5642804 Buffer I/O error on device sdb, logical block 5642805 Buffer I/O error on device sdb, logical block 5642806 Buffer I/O error on device sdb, logical block 5642807 Buffer I/O error on device sdb, logical block 5642808 Buffer I/O error on device sdb, logical block 5642809 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 126 callbacks suppressed Buffer I/O error on device sdb, logical block 6333104 Buffer I/O error on device sdb, logical block 6333105 Buffer I/O error on device sdb, logical block 6333106 Buffer I/O error on device sdb, logical block 6333107 Buffer I/O error on device sdb, logical block 6333108 Buffer I/O error on device sdb, logical block 6333109 Buffer I/O error on device sdb, logical block 6333110 Buffer I/O error on device sdb, logical block 6333111 Buffer I/O error on device sdb, logical block 6333112 Buffer I/O error on device sdb, logical block 6333113 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 126 callbacks suppressed Buffer I/O error on device sdb, logical block 7920944 Buffer I/O error on device sdb, logical block 7920945 Buffer I/O error on device sdb, logical block 7920946 Buffer I/O error on device sdb, logical block 7920947 Buffer I/O error on device sdb, logical block 7920948 Buffer I/O error on device sdb, logical block 7920949 Buffer I/O error on device sdb, logical block 7920950 Buffer I/O error on device sdb, logical block 7920951 Buffer I/O error on device sdb, logical block 7920952 Buffer I/O error on device sdb, logical block 7920953 sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 126 callbacks suppressed Buffer I/O error on device sdb, logical block 6794608 Buffer I/O error on device sdb, logical block 6794609 Buffer I/O error on device sdb, logical block 6794610 Buffer I/O error on device sdb, logical block 6794611 Buffer I/O error on device sdb, logical block 6794612 Buffer I/O error on device sdb, logical block 6794613 Buffer I/O error on device sdb, logical block 6794614 Buffer I/O error on device sdb, logical block 6794615 Buffer I/O error on device sdb, logical block 6794616 Buffer I/O error on device sdb, logical block 6794617 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!! mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! mptbase: ioc0: WARNING - IOC is in FAULT state!!! mptbase: ioc0: WARNING - FAULT code = 000eh sd 0:0:1:0: rejecting I/O to offline device __ratelimit: 126 callbacks suppressed Buffer I/O error on device sdb, logical block 8816496 Buffer I/O error on device sdb, logical block 8816497 Buffer I/O error on device sdb, logical block 8816498 Buffer I/O error on device sdb, logical block 8816499 Buffer I/O error on device sdb, logical block 8816500 Buffer I/O error on device sdb, logical block 8816501 Buffer I/O error on device sdb, logical block 8816502 Buffer I/O error on device sdb, logical block 8816503 Buffer I/O error on device sdb, logical block 8816504 Buffer I/O error on device sdb, logical block 8816505 sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device sd 0:0:1:0: rejecting I/O to offline device mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000! mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
Fixed by commit a6da74cb077e88a604e5b2a6663b44c221f0ae75 .