Bug 12195

Summary: "dd" make kernel panic
Product: SCSI Drivers Reporter: Alex Shi (alex.shi)
Component: OtherAssignee: scsi_drivers-other
Status: CLOSED CODE_FIX    
Severity: high CC: ming.m.lin, rjw, yanmin_zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11808    
Attachments: kernel config file

Description Alex Shi 2008-12-10 18:07:17 UTC
Latest working kernel version:2.6.27
Earliest failing kernel version:2.6.28-rc2
Distribution: RHEL 4u2
Hardware Environment: IA64
Software Environment: 
Problem Description: 
When we use dd tool to clear a SCSI disk. it may getinto a kernel panic like below:
Unable to handle kernel NULL pointer dereference (address 0000000000000044)
ksoftirqd/1[6]: Oops 8813272891392 [1]
Modules linked in:

Pid: 6, CPU 1, comm:          ksoftirqd/1
psr : 00001010081a6010 ifs : 8000000000000289 ip  : [<a000000100391cf0>]    Not tainted (2.6.28-rc6)
ip is at scsi_softirq_done+0x50/0x2e0
unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000005a81
ldrs: 0000000000000000 ccv : 0000000000000002 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a00000010023f910 b6  : a000000100391ca0 b7  : a00000010000b220
f6  : 1003e0000000000001d4c f7  : 1003e0000000000200000
f8  : 1003e0000000080000000 f9  : 1003e0000054a28eb9332
f10 : 1003eb59a50a4dbb025f6 f11 : 1003e00000000000000c7
r1  : a000000100a1b720 r2  : e000000001105e48 r3  : e00000000cff9518
r8  : e00000000cff9510 r9  : ffffffffffff5e48 r10 : 0000000000000018
r11 : 0000000000000044 r12 : e000000008a97e10 r13 : e000000008a90000
r14 : e000000008a97e30 r15 : e000000008a97e38 r16 : 0000000000000020
r17 : 0000000000000030 r18 : e00000000cff9650 r19 : e00000000cff95e0
r20 : a00000010082eea8 r21 : a000000100391ca0 r22 : e000000001e9ab70
r23 : e000000001e9aac8 r24 : e00000000cff9540 r25 : e000000008a97e30
r26 : e000000008a97e30 r27 : e00000000cff9518 r28 : e000000008a97e38
r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a000000100012870>] show_stack+0x50/0xa0
                                sp=e000000008a979e0 bsp=e000000008a90eb0
 [<a000000100013150>] show_regs+0x830/0x860
                                sp=e000000008a97bb0 bsp=e000000008a90e68
 [<a000000100022a60>] die+0x1c0/0x2c0
                                sp=e000000008a97bb0 bsp=e000000008a90e20
 [<a000000100042350>] ia64_do_page_fault+0x830/0x960
                                sp=e000000008a97bb0 bsp=e000000008a90db8
 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270
                                sp=e000000008a97c40 bsp=e000000008a90db8
 [<a000000100391cf0>] scsi_softirq_done+0x50/0x2e0
                                sp=e000000008a97e10 bsp=e000000008a90d70
 [<a00000010023f910>] blk_done_softirq+0x170/0x1c0
                                sp=e000000008a97e20 bsp=e000000008a90d58
 [<a00000010006a8b0>] __do_softirq+0x110/0x2e0
                                sp=e000000008a97e30 bsp=e000000008a90cc8
 [<a00000010006aaf0>] do_softirq+0x70/0xc0
                                sp=e000000008a97e30 bsp=e000000008a90c68
 [<a00000010006c2d0>] ksoftirqd+0xf0/0x240
                                sp=e000000008a97e30 bsp=e000000008a90c40
 [<a00000010008df80>] kthread+0xa0/0x120
                                sp=e000000008a97e30 bsp=e000000008a90c10
 [<a000000100014430>] kernel_thread_helper+0x30/0x60
                                sp=e000000008a97e30 bsp=e000000008a90be0
 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40
                                sp=e000000008a97e30 bsp=e000000008a90be0
Kernel panic - not syncing: Aiee, killing interrupt handler!

Steps to reproduce:
while true; do echo "3">/proc/sys/vm/drop_caches; dd if=/dev/sdb2 of=/dev/zero bs=1024k count=8900; done;
Comment 1 Alex Shi 2008-12-10 18:09:01 UTC
the device is :
06:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
        Subsystem: Intel Corporation: Unknown device 3402
        Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 49
        I/O ports at 8800 [size=256]
        Memory at 00000000f8fd0000 (64-bit, non-prefetchable) [size=64K]
        Memory at 00000000f8fc0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 00000000f8d00000 [disabled] [size=1M]
        Capabilities: <available only to root>

06:02.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
        Subsystem: Intel Corporation: Unknown device 3402
        Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 50
        I/O ports at 8c00 [size=256]
        Memory at 00000000f8ff0000 (64-bit, non-prefetchable) [size=64K]
        Memory at 00000000f8fe0000 (64-bit, non-prefetchable) [size=64K]
        Expansion ROM at 00000000f8e00000 [disabled] [size=1M]
        Capabilities: <available only to root>
Comment 2 Alex Shi 2008-12-10 18:15:00 UTC
Created attachment 19244 [details]
kernel config file
Comment 3 Yanmin Zhang 2008-12-11 01:05:58 UTC
I instrumented kernel. below is some finding.
1) The disk isn't stable. If we run dd in a loop, sometimes the disk doesn't
work, so block request will expire to start a scsi error handling;
2) fussion driver has its own periodic (1 sec) checking in
mpt_fault_reset_work. If something is wrong, it will do a hardware reset and
flush all pending requests;
3) Step 1) and 2) have a race that a request might be flushed in step 2), but
step 1) releases the request firstly. Later on, when SOFTIRQ tries to release
the scsi_cmnd, kernel panic.
Comment 4 Alex Shi 2008-12-11 01:26:26 UTC
Good finding! 
Comment 5 Anonymous Emailer 2008-12-11 07:17:25 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Thu, 2008-12-11 at 01:05 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> I instrumented kernel. below is some finding.
> 1) The disk isn't stable. If we run dd in a loop, sometimes the disk doesn't
> work, so block request will expire to start a scsi error handling;
> 2) fussion driver has its own periodic (1 sec) checking in
> mpt_fault_reset_work. If something is wrong, it will do a hardware reset and
> flush all pending requests;
> 3) Step 1) and 2) have a race that a request might be flushed in step 2), but
> step 1) releases the request firstly. Later on, when SOFTIRQ tries to release
> the scsi_cmnd, kernel panic.

OK, so to rule out the block timer problems which we think are fixed,
could you reproduce on 2.6.28-rc8.

Also, the routine 

> [<a000000100391cf0>] scsi_softirq_done+0x50/0x2e0
>                                 sp=e000000008a97e10 bsp=e000000008a90d70

Is hard to diagnose. Could you ask gdb or addr2line which actual source
line this is?

Thanks,

James
Comment 6 Lin Ming 2008-12-11 18:27:03 UTC
2.6.28-rc8 also panic

addr2line -e vmlinux a0000001003971b0
linux-2.6.28-rc8/drivers/scsi/scsi_lib.c:1538

   1535 static void scsi_softirq_done(struct request *rq)
   1536 {
   1537         struct scsi_cmnd *cmd = rq->special;
   1538         unsigned long wait_for = (cmd->allowed + 1) * rq->timeout;
   1539         int disposition;
   
Comment 7 Anonymous Emailer 2008-12-12 02:22:17 UTC
Reply-To: andmike@linux.vnet.ibm.com

bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12195
> 
> 
> 
> 
> 
> ------- Comment #6 from ming.m.lin@intel.com  2008-12-11 18:27 -------
> 2.6.28-rc8 also panic

The blk_mark_rq_complete check should prevent completions from occurring on
already timed out requests unless the interaction previous mentioned between
mpt_fault_reset_work and the scsi eh thread requeue alows the REQ_ATOM_COMPLETE
bit to get cleared prior to the scsi_done being called from
mptscsih_flush_running_cmds. This did not look obvious to hit.

mpt_fault_reset_work
	mpt_HardResetHandler
		mpt_signal_reset
			mptsas_ioc_reset 
				mptscsih_flush_running_cmds
		mpt_do_ioc_recovery

When scsi_times_out is called there should not be a transportt->eh_timed_out,
or hostt->eh_timed_out set for mptsas which should lead to waking up the eh
thread. We will then call mptscsih_abort from the eh thread and it will return
success if the scsi command is not found leading to a possible requeue.

If you have time for another re-create it would be good to set some scsi
logging.
sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4
echo "1" > /proc/sys/kernel/sysrq # If needed
echo 9 > /proc/sysrq-trigger # Raise console log level

Also if you have more dmesg output prior to the error from the previous
failure runs that would be good to post also.

-andmike
--
Michael Anderson
andmike@linux.vnet.ibm.com
Comment 8 Anonymous Emailer 2008-12-12 07:09:38 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-12-12 at 02:22 -0800, Mike Anderson wrote:
> bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=12195
> > 
> > 
> > 
> > 
> > 
> > ------- Comment #6 from ming.m.lin@intel.com  2008-12-11 18:27 -------
> > 2.6.28-rc8 also panic
> 
> The blk_mark_rq_complete check should prevent completions from occurring on
> already timed out requests unless the interaction previous mentioned between
> mpt_fault_reset_work and the scsi eh thread requeue alows the
> REQ_ATOM_COMPLETE
> bit to get cleared prior to the scsi_done being called from
> mptscsih_flush_running_cmds. This did not look obvious to hit.
> 
> mpt_fault_reset_work
>       mpt_HardResetHandler
>               mpt_signal_reset
>                       mptsas_ioc_reset 
>                               mptscsih_flush_running_cmds
>               mpt_do_ioc_recovery

Actually, this isn't quite true.  Particularly in the eh case.  It looks
like the block timeout isn't stopped until blk_complete_request() which
is pretty late.  If the timeout fires after scsi_done is called but
before we complete the request, any timeout goes through the
BLK_EH_HANDLED path to __blk_complete_request().  This routine
unconditionally adds to the done routine without checking the mark, so
there is a window where we can get double dones.

James
Comment 9 Anonymous Emailer 2008-12-12 11:28:51 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-12-12 at 09:09 -0600, James Bottomley wrote:
> On Fri, 2008-12-12 at 02:22 -0800, Mike Anderson wrote:
> > bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=12195
> > > 
> > > 
> > > 
> > > 
> > > 
> > > ------- Comment #6 from ming.m.lin@intel.com  2008-12-11 18:27 -------
> > > 2.6.28-rc8 also panic
> > 
> > The blk_mark_rq_complete check should prevent completions from occurring on
> > already timed out requests unless the interaction previous mentioned
> between
> > mpt_fault_reset_work and the scsi eh thread requeue alows the
> REQ_ATOM_COMPLETE
> > bit to get cleared prior to the scsi_done being called from
> > mptscsih_flush_running_cmds. This did not look obvious to hit.
> > 
> > mpt_fault_reset_work
> >     mpt_HardResetHandler
> >             mpt_signal_reset
> >                     mptsas_ioc_reset 
> >                             mptscsih_flush_running_cmds
> >             mpt_do_ioc_recovery
> 
> Actually, this isn't quite true.  Particularly in the eh case.  It looks
> like the block timeout isn't stopped until blk_complete_request() which
> is pretty late.  If the timeout fires after scsi_done is called but
> before we complete the request, any timeout goes through the
> BLK_EH_HANDLED path to __blk_complete_request().  This routine
> unconditionally adds to the done routine without checking the mark, so
> there is a window where we can get double dones.

Actually, I take that back ... the patch to plug the unprep race was
send over the list but never applied because the timer changes seemed to
fix the problem.

So, we still have a small window where unprep can NULL out rq->special
while an asynchronous mpt reset is flushing the commands via scsi_done.

Could you see if it goes away (or at least lessens in frequency) with
this patch?

Thanks,

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 111f9e9..f2f51e0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -653,8 +653,8 @@ static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd)
 	struct request *req = cmd->request;
 	unsigned long flags;
 
-	scsi_unprep_request(req);
 	spin_lock_irqsave(q->queue_lock, flags);
+	scsi_unprep_request(req);
 	blk_requeue_request(q, req);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
Comment 10 Rafael J. Wysocki 2008-12-13 09:58:18 UTC
Handled-By : James Bottomley <James.Bottomley@HansenPartnership.com>
Notify-Also : Michael Anderson <andmike@linux.vnet.ibm.com>
Notify-Also : Lin Ming <ming.m.lin@intel.com>
Notify-Also : Yanmin Zhang <yanmin_zhang@linux.intel.com>
Comment 11 Lin Ming 2008-12-15 01:52:34 UTC
kernel panic again with patch at #9 applied to 2.6.28-rc8
----------------------------------------------------------

mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!                           
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
end_request: I/O error, dev sdb, sector 22968800
Buffer I/O error on device sdb, logical block 5742200
Buffer I/O error on device sdb, logical block 5742201
Buffer I/O error on device sdb, logical block 5742202
Buffer I/O error on device sdb, logical block 5742203
Buffer I/O error on device sdb, logical block 5742204
Buffer I/O error on device sdb, logical block 5742205
Buffer I/O error on device sdb, logical block 5742206
Buffer I/O error on device sdb, logical block 5742207
Buffer I/O error on device sdb, logical block 5742208
Buffer I/O error on device sdb, logical block 5742209
sd 0:0:1:0: rejecting I/O to offline device
end_request: I/O error, dev sdb, sector 22969056
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
Unable to handle kernel NULL pointer dereference (address 0000000000000044)
ksoftirqd/2[10]: Oops 8813272891392 [1]
Modules linked in:

Pid: 10, CPU 2, comm:          ksoftirqd/2
psr : 00001010081a6010 ifs : 8000000000000289 ip  : [<a000000100397190>]    Not tainted (2.6.28-rc8-mlin)
ip is at scsi_softirq_done+0x50/0x2e0
unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 0000000000005a81
ldrs: 0000000000000000 ccv : 0000000000000002 fpsr: 0009804c8a70433f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100244e90 b6  : a000000100397140 b7  : a00000010000b220
f6  : 1003e0000000000000000 f7  : 1003e0000000000200000
f8  : 1003e0000000080000000 f9  : 1003e000001de86495475
f10 : 1003e9277c68cf83c1120 f11 : 1003e0000000000000046
r1  : a000000100a2bf40 r2  : e000000001115ec8 r3  : e0000000402d6658
r8  : e0000000402d6650 r9  : ffffffffffff5ec8 r10 : 0000000000000018
r11 : 0000000000000044 r12 : e000000008b07e10 r13 : e000000008b00000
r14 : e000000008b07e30 r15 : e000000008b07e38 r16 : 0000000000000020
r17 : 0000000000000030 r18 : e0000000402d6790 r19 : e0000000402d6720
r20 : a00000010083fb20 r21 : a000000100397140 r22 : e00000004062b388
r23 : e00000004062b2e0 r24 : e0000000402d6680 r25 : e000000008b07e30
r26 : e000000040119700 r27 : e0000000402d6658 r28 : e000000040119708
r29 : 0000000000000000 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a000000100012870>] show_stack+0x50/0xa0
                                sp=e000000008b079e0 bsp=e000000008b00ec8
 [<a000000100013150>] show_regs+0x830/0x860
                                sp=e000000008b07bb0 bsp=e000000008b00e80
 [<a000000100022a60>] die+0x1c0/0x2c0
                                sp=e000000008b07bb0 bsp=e000000008b00e38
 [<a000000100042350>] ia64_do_page_fault+0x830/0x960
                                sp=e000000008b07bb0 bsp=e000000008b00dd0
 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270
                                sp=e000000008b07c40 bsp=e000000008b00dd0
 [<a000000100397190>] scsi_softirq_done+0x50/0x2e0
                                sp=e000000008b07e10 bsp=e000000008b00d88
 [<a000000100244e90>] blk_done_softirq+0x170/0x1c0
                                sp=e000000008b07e20 bsp=e000000008b00d70
 [<a00000010006f050>] __do_softirq+0x110/0x2e0
                                sp=e000000008b07e30 bsp=e000000008b00ce0
 [<a00000010006f290>] do_softirq+0x70/0xc0
                                sp=e000000008b07e30 bsp=e000000008b00c80
 [<a000000100070a70>] ksoftirqd+0xf0/0x240
                                sp=e000000008b07e30 bsp=e000000008b00c58
 [<a000000100092720>] kthread+0xa0/0x120
                                sp=e000000008b07e30 bsp=e000000008b00c28
 [<a000000100014430>] kernel_thread_helper+0x30/0x60
                                sp=e000000008b07e30 bsp=e000000008b00c00
 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40
                                sp=e000000008b07e30 bsp=e000000008b00c00
Kernel panic - not syncing: Aiee, killing interrupt handler!
Comment 12 Lin Ming 2008-12-15 02:39:57 UTC
(In reply to comment #7)
> If you have time for another re-create it would be good to set some scsi
> logging.
> sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4
> echo "1" > /proc/sys/kernel/sysrq # If needed
> echo 9 > /proc/sysrq-trigger # Raise console log level
> 

mptscsih: ioc0: attempting task abort! (sc=e000000037a36980)
sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
mptscsih: ioc0: Issue of TaskMgmt failed!
mptscsih: ioc0: task abort: FAILED (sc=e000000037a36980)
mptscsih: ioc0: attempting task abort! (sc=e000000037a35f80)
sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4f a0 00 01 00 00
mptscsih: ioc0: task abort: FAILED (sc=e000000037a35f80)
mptscsih: ioc0: attempting target reset! (sc=e000000037a36980)
sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
mptscsih: ioc0: target reset: FAILED (sc=e000000037a36980)
mptscsih: ioc0: attempting bus reset! (sc=e000000037a36980)
sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
mptscsih: ioc0: bus reset: FAILED (sc=e000000037a36980)
mptscsih: ioc0: attempting host reset! (sc=e000000037a36980)
mptscsih: ioc0: host reset: SUCCESS (sc=e000000037a36980)
sd 0:0:1:0: Device offlined - not ready after error recovery
sd 0:0:1:0: Device offlined - not ready after error recovery
sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
end_request: I/O error, dev sdb, sector 20663968
Buffer I/O error on device sdb, logical block 5165992
Buffer I/O error on device sdb, logical block 5165993
Buffer I/O error on device sdb, logical block 5165994
Buffer I/O error on device sdb, logical block 5165995
Buffer I/O error on device sdb, logical block 5165996
Buffer I/O error on device sdb, logical block 5165997
Buffer I/O error on device sdb, logical block 5165998
Buffer I/O error on device sdb, logical block 5165999
Buffer I/O error on device sdb, logical block 5166000
Buffer I/O error on device sdb, logical block 5166001
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
end_request: I/O error, dev sdb, sector 20664224
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=e000000037a36980, mf = e0000000406847e0, idx=55
sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1, sc=e000000037a35f80, mf = e0000000406883e0, idx=f5
Unable to handle kernel NULL pointer dereference (address 0000000000000044)
mpt_poll_0[378]: Oops 8813272891392 [1]
Modules linked in:

Pid: 378, CPU 1, comm:           mpt_poll_0
psr : 0000101008026018 ifs : 8000000000000289 ip  : [<a0000001003988b0>]    Not tainted (2.6.28-rc8-mlin)
ip is at scsi_softirq_done+0x50/0x2e0
unat: 0000000000000000 pfs : 0000000000000184 rsc : 0000000000000003
rnat: 5555555555555555 bsps: 5555555555555555 pr  : 000002a1491599a5
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100245310 b6  : a000000100398860 b7  : a000000100011ec0
f6  : 1003e0000000000000000 f7  : 1003e0000000000200802
f8  : 1003e0000000080000000 f9  : 1003e00000000000001d3
f10 : 1003e392baaf0b5995995 f11 : 1003e0000000000000030
r1  : a000000100a2c6c0 r2  : e000000001105ec8 r3  : e000000035e2d688
r8  : e000000035e2d680 r9  : ffffffffffff5ec8 r10 : 0000000000000018
r11 : 0000000000000044 r12 : e0000000405c7830 r13 : e0000000405c0000
r14 : e0000000405c7850 r15 : e0000000405c7858 r16 : 0000000000000020
r17 : 0000000000000030 r18 : e000000035e2d7c0 r19 : e000000035e2d750
r20 : 0000000000000001 r21 : a000000100398860 r22 : e000000040599ac0
r23 : e000000040599a18 r24 : e000000035e2d6b0 r25 : e0000000405c7850
r26 : e0000000405c7850 r27 : e000000035e2d688 r28 : e0000000405c7858
r29 : a000000100011ec0 r30 : 0000000000000000 r31 : e000000008a78000

Call Trace:
 [<a000000100012870>] show_stack+0x50/0xa0
                                sp=e0000000405c7400 bsp=e0000000405c1368
 [<a000000100013150>] show_regs+0x830/0x860
                                sp=e0000000405c75d0 bsp=e0000000405c1320
 [<a000000100022a60>] die+0x1c0/0x2c0
                                sp=e0000000405c75d0 bsp=e0000000405c12d8
 [<a000000100042350>] ia64_do_page_fault+0x830/0x960
                                sp=e0000000405c75d0 bsp=e0000000405c1278
 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270
                                sp=e0000000405c7660 bsp=e0000000405c1278
 [<a0000001003988b0>] scsi_softirq_done+0x50/0x2e0
                                sp=e0000000405c7830 bsp=e0000000405c1230
 [<a000000100245310>] blk_done_softirq+0x170/0x1c0
                                sp=e0000000405c7840 bsp=e0000000405c1218
 [<a00000010006f3d0>] __do_softirq+0x110/0x2e0
                                sp=e0000000405c7850 bsp=e0000000405c1180
 [<a00000010006f610>] do_softirq+0x70/0xc0
                                sp=e0000000405c7850 bsp=e0000000405c1120
 [<a00000010006fa50>] irq_exit+0x70/0xa0
                                sp=e0000000405c7850 bsp=e0000000405c1108
 [<a000000100011c70>] ia64_handle_irq+0x350/0x380
                                sp=e0000000405c7850 bsp=e0000000405c10d0
 [<a00000010000ba20>] ia64_native_leave_kernel+0x0/0x270
                                sp=e0000000405c7850 bsp=e0000000405c10d0
 [<a0000001000662e0>] vprintk+0x660/0x6a0
                                sp=e0000000405c7a20 bsp=e0000000405c1000
 [<a0000001000663a0>] printk+0x80/0xc0
                                sp=e0000000405c7a90 bsp=e0000000405c0f98
 [<a0000001003d4c50>] mptscsih_ioc_reset+0x250/0x420
                                sp=e0000000405c7ad0 bsp=e0000000405c0f30
 [<a0000001003d9c50>] mptspi_ioc_reset+0x30/0xa0
                                sp=e0000000405c7ae0 bsp=e0000000405c0f08
 [<a0000001003c1d30>] mpt_signal_reset+0x150/0x180
                                sp=e0000000405c7ae0 bsp=e0000000405c0ed8
 [<a0000001003c53c0>] KickStart+0x680/0x1000
                                sp=e0000000405c7ae0 bsp=e0000000405c0e70
 [<a0000001003c5f50>] MakeIocReady+0x210/0x500
                                sp=e0000000405c7ae0 bsp=e0000000405c0e18
 [<a0000001003c86f0>] mpt_do_ioc_recovery+0x150/0x2840
                                sp=e0000000405c7ae0 bsp=e0000000405c0d48
 [<a0000001003caf80>] mpt_HardResetHandler+0x1a0/0x2a0
                                sp=e0000000405c7dd0 bsp=e0000000405c0d00
 [<a0000001003cb190>] mpt_fault_reset_work+0x110/0x260
                                sp=e0000000405c7dd0 bsp=e0000000405c0cc0
 [<a00000010008a310>] run_workqueue+0x1f0/0x340
                                sp=e0000000405c7dd0 bsp=e0000000405c0c80
 [<a00000010008a5a0>] worker_thread+0x140/0x180
                                sp=e0000000405c7dd0 bsp=e0000000405c0c58
 [<a000000100092aa0>] kthread+0xa0/0x120
                                sp=e0000000405c7e30 bsp=e0000000405c0c28
 [<a000000100014430>] kernel_thread_helper+0x30/0x60
                                sp=e0000000405c7e30 bsp=e0000000405c0c00
 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40
                                sp=e0000000405c7e30 bsp=e0000000405c0c00
Kernel panic - not syncing: Aiee, killing interrupt handler!
Comment 13 Anonymous Emailer 2008-12-15 12:13:53 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Mon, 2008-12-15 at 02:39 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12195
> 
> 
> 
> 
> 
> ------- Comment #12 from ming.m.lin@intel.com  2008-12-15 02:39 -------
> (In reply to comment #7)
> > If you have time for another re-create it would be good to set some scsi
> > logging.
> > sysctl -w dev.scsi.logging_level=4100 # mlcomplete 1 and error 4
> > echo "1" > /proc/sys/kernel/sysrq # If needed
> > echo 9 > /proc/sysrq-trigger # Raise console log level
> > 
> 
> mptscsih: ioc0: attempting task abort! (sc=e000000037a36980)
> sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
> mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
> mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
> mptbase: ioc0: Initiating recovery
> mptbase: ioc0: WARNING - IOC is in FAULT state!!!
> mptbase: ioc0: WARNING -            FAULT code = 000eh
> mptscsih: ioc0: Issue of TaskMgmt failed!
> mptscsih: ioc0: task abort: FAILED (sc=e000000037a36980)
> mptscsih: ioc0: attempting task abort! (sc=e000000037a35f80)
> sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4f a0 00 01 00 00
> mptscsih: ioc0: task abort: FAILED (sc=e000000037a35f80)
> mptscsih: ioc0: attempting target reset! (sc=e000000037a36980)
> sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
> mptscsih: ioc0: target reset: FAILED (sc=e000000037a36980)
> mptscsih: ioc0: attempting bus reset! (sc=e000000037a36980)
> sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28 00 01 3b 4e a0 00 01 00 00
> mptscsih: ioc0: bus reset: FAILED (sc=e000000037a36980)
> mptscsih: ioc0: attempting host reset! (sc=e000000037a36980)
> mptscsih: ioc0: host reset: SUCCESS (sc=e000000037a36980)
> sd 0:0:1:0: Device offlined - not ready after error recovery
> sd 0:0:1:0: Device offlined - not ready after error recovery
> sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
> end_request: I/O error, dev sdb, sector 20663968
> Buffer I/O error on device sdb, logical block 5165992
> Buffer I/O error on device sdb, logical block 5165993
> Buffer I/O error on device sdb, logical block 5165994
> Buffer I/O error on device sdb, logical block 5165995
> Buffer I/O error on device sdb, logical block 5165996
> Buffer I/O error on device sdb, logical block 5165997
> Buffer I/O error on device sdb, logical block 5165998
> Buffer I/O error on device sdb, logical block 5165999
> Buffer I/O error on device sdb, logical block 5166000
> Buffer I/O error on device sdb, logical block 5166001
> sd 0:0:1:0: rejecting I/O to offline device
> sd 0:0:1:0: [sdb] Result: hostbyte=0x00 driverbyte=0x06
> end_request: I/O error, dev sdb, sector 20664224
> mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
> sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1,
> sc=e000000037a36980, mf = e0000000406847e0, idx=55
> sd 0:0:1:0: mptscsih: ioc0: completing cmds: fw_channel 0, fw_id 1,
> sc=e000000037a35f80, mf = e0000000406883e0, idx=f5
> Unable to handle kernel NULL pointer dereference (address 0000000000000044)
> mpt_poll_0[378]: Oops 8813272891392 [1]

Oh ... this is actually a fusion problem, then.  It looks like the
fusion is relying on the old done behaviour.  Does this work?  It
flushes the fusion internal queue if we go into host reset.  This should
prevent the commands turning up later after the device has been
offlined.

James

---

diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
index d62fd4f..ee09041 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -2008,6 +2008,9 @@ mptscsih_host_reset(struct scsi_cmnd *SCpnt)
 		return FAILED;
 	}
 
+	/* make sure we have no outstanding commands at this stage */
+	mptscsih_flush_running_cmds(hd);
+
 	ioc = hd->ioc;
 	printk(MYIOC_s_INFO_FMT "attempting host reset! (sc=%p)\n",
 	    ioc->name, SCpnt);
Comment 14 Lin Ming 2008-12-15 21:12:05 UTC
(In reply to comment #13)
> 
> diff --git a/drivers/message/fusion/mptscsih.c
> b/drivers/message/fusion/mptscsih.c
> index d62fd4f..ee09041 100644
> --- a/drivers/message/fusion/mptscsih.c
> +++ b/drivers/message/fusion/mptscsih.c
> @@ -2008,6 +2008,9 @@ mptscsih_host_reset(struct scsi_cmnd *SCpnt)
>                 return FAILED;
>         }
> 
> +       /* make sure we have no outstanding commands at this stage */
> +       mptscsih_flush_running_cmds(hd);
> +
>         ioc = hd->ioc;
>         printk(MYIOC_s_INFO_FMT "attempting host reset! (sc=%p)\n",
>             ioc->name, SCpnt);
> 

It seems that this patch fixes the bug.
I have been running "dd" for 4 hours and no panic.
dmesg attached. 
Thanks
------------------------------------------------------

end_request: I/O error, dev sdb, sector 34572800                                    
Buffer I/O error on device sdb, logical block 8643200
Buffer I/O error on device sdb, logical block 8643201
Buffer I/O error on device sdb, logical block 8643202
Buffer I/O error on device sdb, logical block 8643203
Buffer I/O error on device sdb, logical block 8643204
Buffer I/O error on device sdb, logical block 8643205
Buffer I/O error on device sdb, logical block 8643206
Buffer I/O error on device sdb, logical block 8643207
Buffer I/O error on device sdb, logical block 8643208
Buffer I/O error on device sdb, logical block 8643209
Info fld=0x4044
end_request: I/O error, dev sdb, sector 34572544
__ratelimit: 54 callbacks suppressed
Buffer I/O error on device sdb, logical block 8643136
Buffer I/O error on device sdb, logical block 8643137
Buffer I/O error on device sdb, logical block 8643138
Buffer I/O error on device sdb, logical block 8643139
Buffer I/O error on device sdb, logical block 8643140
Buffer I/O error on device sdb, logical block 8643141
Buffer I/O error on device sdb, logical block 8643142
Buffer I/O error on device sdb, logical block 8643143
Buffer I/O error on device sdb, logical block 8643144
Buffer I/O error on device sdb, logical block 8643145
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 54 callbacks suppressed
Buffer I/O error on device sdb, logical block 5081200
Buffer I/O error on device sdb, logical block 5081201
Buffer I/O error on device sdb, logical block 5081202
Buffer I/O error on device sdb, logical block 5081203
Buffer I/O error on device sdb, logical block 5081204
Buffer I/O error on device sdb, logical block 5081205
Buffer I/O error on device sdb, logical block 5081206
Buffer I/O error on device sdb, logical block 5081207
Buffer I/O error on device sdb, logical block 5081208
Buffer I/O error on device sdb, logical block 5081209
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 126 callbacks suppressed
Buffer I/O error on device sdb, logical block 5642800
Buffer I/O error on device sdb, logical block 5642801
Buffer I/O error on device sdb, logical block 5642802
Buffer I/O error on device sdb, logical block 5642803
Buffer I/O error on device sdb, logical block 5642804
Buffer I/O error on device sdb, logical block 5642805
Buffer I/O error on device sdb, logical block 5642806
Buffer I/O error on device sdb, logical block 5642807
Buffer I/O error on device sdb, logical block 5642808
Buffer I/O error on device sdb, logical block 5642809
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 126 callbacks suppressed
Buffer I/O error on device sdb, logical block 6333104
Buffer I/O error on device sdb, logical block 6333105
Buffer I/O error on device sdb, logical block 6333106
Buffer I/O error on device sdb, logical block 6333107
Buffer I/O error on device sdb, logical block 6333108
Buffer I/O error on device sdb, logical block 6333109
Buffer I/O error on device sdb, logical block 6333110
Buffer I/O error on device sdb, logical block 6333111
Buffer I/O error on device sdb, logical block 6333112
Buffer I/O error on device sdb, logical block 6333113
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 126 callbacks suppressed
Buffer I/O error on device sdb, logical block 7920944
Buffer I/O error on device sdb, logical block 7920945
Buffer I/O error on device sdb, logical block 7920946
Buffer I/O error on device sdb, logical block 7920947
Buffer I/O error on device sdb, logical block 7920948
Buffer I/O error on device sdb, logical block 7920949
Buffer I/O error on device sdb, logical block 7920950
Buffer I/O error on device sdb, logical block 7920951
Buffer I/O error on device sdb, logical block 7920952
Buffer I/O error on device sdb, logical block 7920953
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 126 callbacks suppressed
Buffer I/O error on device sdb, logical block 6794608
Buffer I/O error on device sdb, logical block 6794609
Buffer I/O error on device sdb, logical block 6794610
Buffer I/O error on device sdb, logical block 6794611
Buffer I/O error on device sdb, logical block 6794612
Buffer I/O error on device sdb, logical block 6794613
Buffer I/O error on device sdb, logical block 6794614
Buffer I/O error on device sdb, logical block 6794615
Buffer I/O error on device sdb, logical block 6794616
Buffer I/O error on device sdb, logical block 6794617
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
program smartd is using a deprecated SCSI ioctl, please convert it to SG_IO
mptbase: ioc0: WARNING - IOC is in FAULT state (000eh)!!!
mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
mptbase: ioc0: WARNING - IOC is in FAULT state!!!
mptbase: ioc0: WARNING -            FAULT code = 000eh
sd 0:0:1:0: rejecting I/O to offline device
__ratelimit: 126 callbacks suppressed
Buffer I/O error on device sdb, logical block 8816496
Buffer I/O error on device sdb, logical block 8816497
Buffer I/O error on device sdb, logical block 8816498
Buffer I/O error on device sdb, logical block 8816499
Buffer I/O error on device sdb, logical block 8816500
Buffer I/O error on device sdb, logical block 8816501
Buffer I/O error on device sdb, logical block 8816502
Buffer I/O error on device sdb, logical block 8816503
Buffer I/O error on device sdb, logical block 8816504
Buffer I/O error on device sdb, logical block 8816505
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
sd 0:0:1:0: rejecting I/O to offline device
mptbase: ioc0: ERROR - Doorbell ACK timeout (count=4999), IntStatus=80000000!
mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
Comment 15 Rafael J. Wysocki 2008-12-21 11:52:18 UTC
Fixed by commit a6da74cb077e88a604e5b2a6663b44c221f0ae75 .