Latest working kernel version: 2.6.27 Earliest failing kernel version: 2.6.28-rc4 Hardware Environment: Infortrend G2430 connected to LSI22320R Problem Description: Hello, first in 2.6.28-rc{1,2,3} the error handler was entirely broken - it deadlocked. In rc4 this is fixed, but now I already two times got a Null pointer dereference while doing some error handler tests. All of that looks like due to the scsi timeout commits. Steps to reproduce: E.g. reset devices connected to LSI 53C1030 devices using lsiutil. Can be reproduced on about 20% eh activations. (gdb) l *(scsi_times_out+0x15) 0xffffffff80460f1e is in scsi_times_out (drivers/scsi/scsi_error.c:176). 171 enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); 172 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED; 173 174 scsi_log_completion(scmd, TIMEOUT_ERROR); 175 176 if (scmd->device->host->transportt->eh_timed_out) 177 eh_timed_out = scmd->device->host->transportt->eh_timed_out; 178 else if (scmd->device->host->hostt->eh_timed_out) 179 eh_timed_out = scmd->device->host->hostt->eh_timed_out; 180 else [ 143.804672] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 143.808507] IP: [<ffffffff80460f1e>] scsi_times_out+0x15/0x71 [ 143.816020] PGD f9381067 PUD f9360067 PMD 0 [ 143.824018] Oops: 0000 [#1] SMP [ 143.824018] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map [ 143.832016] Dumping ftrace buffer: [ 143.832016] (ftrace buffer empty) [ 143.832016] CPU 1 [ 143.832016] Modules linked in: mptctl ib_ipoib inet_lro ib_umad rdma_ucm rdma_cm ib_cm iw_cm ib_sa ib_addr ib_uvee [ 143.832016] Pid: 246, comm: pdflush Not tainted 2.6.28-rc4-bs1 #10 [ 143.832016] RIP: 0010:[<ffffffff80460f1e>] [<ffffffff80460f1e>] scsi_times_out+0x15/0x71 [ 143.832016] RSP: 0018:ffff88007f6a3df0 EFLAGS: 00010086 [ 143.832016] RAX: ffff88007ebf5330 RBX: 0000000000000000 RCX: ffff8800f93804b8 [ 143.832016] RDX: ffff88007ebf5948 RSI: 0000000000000246 RDI: ffff8800f9380378 [ 143.832016] RBP: ffff88007f6a3e00 R08: 0000000000000000 R09: 0000000000000000 [ 143.832016] R10: ffff8800f9144680 R11: ffff88007eeac240 R12: ffff88007ebf5330 [ 143.832016] R13: ffff88007ebf5808 R14: ffffffff80380461 R15: 0000000000000000 [ 143.832016] FS: 0000000000733860(0000) GS:ffff8800fb29ab40(0000) knlGS:0000000000000000 [ 143.832016] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 143.832016] CR2: 0000000000000000 CR3: 00000000e80ec000 CR4: 00000000000006e0 [ 143.832016] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 143.832016] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 143.832016] Process pdflush (pid: 246, threadinfo ffff88007ed12000, task ffff88007ed11890) [ 143.832016] Stack: [ 143.832016] ffff88007f6a3e00 ffff8800f9380378 ffff88007f6a3e20 ffffffff80380426 [ 143.832016] ffff88007ebf5330 ffff8800f9380378 ffff88007f6a3e70 ffffffff803804f9 [ 143.832016] ffff88007eea0000 ffff88007ebf5668 0000000000000246 ffff88007ebf5330 [ 143.832016] Call Trace: [ 143.832016] <IRQ> <0> [<ffffffff80380426>] blk_rq_timed_out+0x1b/0x56 [ 143.832016] [<ffffffff803804f9>] blk_rq_timed_out_timer+0x98/0x118 [ 143.832016] [<ffffffff80380461>] ? blk_rq_timed_out_timer+0x0/0x118 [ 143.832016] [<ffffffff802464e2>] run_timer_softirq+0x14c/0x1cc [ 143.832016] [<ffffffff80242392>] __do_softirq+0x83/0x128 [ 143.832016] [<ffffffff8020d03c>] call_softirq+0x1c/0x28 [ 143.832016] [<ffffffff8020ea39>] do_softirq+0x49/0x90 [ 143.832016] [<ffffffff802422aa>] irq_exit+0x44/0x46 [ 143.832016] [<ffffffff8020e88b>] do_IRQ+0xba/0xcf
Reply-To: James.Bottomley@HansenPartnership.com On Thu, 2008-11-13 at 10:30 -0800, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12020 > > Summary: scsi_times_out NULL pointer dereference > Product: SCSI Drivers > Version: 2.5 > KernelVersion: 2.6.28-git20081113 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: scsi_drivers-other@kernel-bugs.osdl.org > ReportedBy: bs@q-leap.de > > > Latest working kernel version: 2.6.27 > Earliest failing kernel version: 2.6.28-rc4 > Hardware Environment: Infortrend G2430 connected to LSI22320R > Problem Description: > > Hello, > > first in 2.6.28-rc{1,2,3} the error handler was entirely broken - it > deadlocked. In rc4 this is fixed, but now I already two times got a Null > pointer dereference while doing some error handler tests. All of that looks > like due to the scsi timeout commits. > > Steps to reproduce: E.g. reset devices connected to LSI 53C1030 devices using > lsiutil. Can be reproduced on about 20% eh activations. > > (gdb) l *(scsi_times_out+0x15) > 0xffffffff80460f1e is in scsi_times_out (drivers/scsi/scsi_error.c:176). > 171 enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); > 172 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED; > 173 > 174 scsi_log_completion(scmd, TIMEOUT_ERROR); > 175 > 176 if (scmd->device->host->transportt->eh_timed_out) > 177 eh_timed_out = > scmd->device->host->transportt->eh_timed_out; > 178 else if (scmd->device->host->hostt->eh_timed_out) > 179 eh_timed_out = > scmd->device->host->hostt->eh_timed_out; > 180 else Actually, I think the trace is slightly off. I suspect this is the problem: struct scsi_cmnd *scmd = req->special; I bet req->special is NULL because the command timed out even before it was prepared by the subsystem. Does this fix it? The fix is more of a bandaid than anything ... we can't really have commands timing out in the mid-layer because we expect we have full control of them. With this patch, if we run out of resets, block will complete a command we're still processing. James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 94ed262..5612c42 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request *req) enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED; + if (!scmd) + /* + * nasty: command timed out before the mid layer + * even prepared it + */ + return BLK_EH_RESET_TIMER; + scsi_log_completion(scmd, TIMEOUT_ERROR); if (scmd->device->host->transportt->eh_timed_out)
Thanks going to test it now. While we are at this function, could you please check if (eh_timed_out) rtn = eh_timed_out(scmd); switch (rtn) { case BLK_EH_NOT_HANDLED: break; default: return rtn; } Is the indentation wrong or are there missing if-braces? Thanks, Bernd
Reply-To: James.Bottomley@HansenPartnership.com On Thu, 2008-11-13 at 12:12 -0800, bugme-daemon@bugzilla.kernel.org wrote: > While we are at this function, could you please check > > if (eh_timed_out) > rtn = eh_timed_out(scmd); > switch (rtn) { > case BLK_EH_NOT_HANDLED: > break; > default: > return rtn; > } > > > Is the indentation wrong or are there missing if-braces? It's not as intended, but harmless: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=6ec39f02cf48df89c3cbab4aeef521569fec00e4 James
Reply-To: James.Bottomley@HansenPartnership.com On Thu, 2008-11-13 at 13:03 -0600, James Bottomley wrote: > Actually, I think the trace is slightly off. I suspect this is the > problem: > > struct scsi_cmnd *scmd = req->special; > > I bet req->special is NULL because the command timed out even before it > was prepared by the subsystem. > > Does this fix it? > > The fix is more of a bandaid than anything ... we can't really have > commands timing out in the mid-layer because we expect we have full > control of them. With this patch, if we run out of resets, block will > complete a command we're still processing. > > James > > --- > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c > index 94ed262..5612c42 100644 > --- a/drivers/scsi/scsi_error.c > +++ b/drivers/scsi/scsi_error.c > @@ -127,6 +127,13 @@ enum blk_eh_timer_return scsi_times_out(struct request > *req) > enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); > enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED; > > + if (!scmd) > + /* > + * nasty: command timed out before the mid layer > + * even prepared it > + */ > + return BLK_EH_RESET_TIMER; > + > scsi_log_completion(scmd, TIMEOUT_ERROR); > > if (scmd->device->host->transportt->eh_timed_out) Mike Anderson pointed out that we have a potential window where the timer can fire after we've unprepped the request in SCSI (so making req->special NULL) but before we call blk_requeue_request() which stops the timer. We can rejig the locking to prevent this from happening, so could you (separately) try this patch? James --- diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index f5d3b96..3475b74 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -649,8 +643,8 @@ static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd) struct request *req = cmd->request; unsigned long flags; - scsi_unprep_request(req); spin_lock_irqsave(q->queue_lock, flags); + scsi_unprep_request(req); blk_requeue_request(q, req); spin_unlock_irqrestore(q->queue_lock, flags);
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=12020#c4 Handled-By : James Bottomley <James.Bottomley@HansenPartnership.com>
looks very similar [ 316.336654] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 316.339972] IP: [<ffffffff803f84d3>] scsi_times_out+0x10/0x72 [ 316.339972] PGD 3e627067 PUD 3de0b067 PMD 0 [ 316.339972] Oops: 0000 [#1] PREEMPT SMP [ 316.339972] last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version [ 316.339972] Dumping ftrace buffer: [ 316.339972] (ftrace buffer empty) [ 316.339972] CPU 1 [ 316.339972] Modules linked in: floppy sg [ 316.339972] Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #1 [ 316.339972] RIP: 0010:[<ffffffff803f84d3>] [<ffffffff803f84d3>] scsi_times_out+0x10/0x72 [ 316.339972] RSP: 0018:ffff88003fb53e20 EFLAGS: 00010082 [ 316.339972] RAX: ffff88003ef60000 RBX: 0000000000000000 RCX: ffff88003ef60308 [ 316.339972] RDX: ffff88003ef60308 RSI: 0000000000006cb2 RDI: ffff880033dae5c0 [ 316.339972] RBP: ffff88003fb53e30 R08: ffff880001019180 R09: 0000000000000010 [ 316.339972] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003ef601c8 [ 316.339972] R13: ffff88003ef60308 R14: 0000000000000102 R15: 0000000000000000 [ 316.339972] FS: 0000000000000000(0000) GS:ffff88003fb23b00(0000) knlGS:0000000000000000 [ 316.339972] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 316.339972] CR2: 0000000000000000 CR3: 000000003e7e8000 CR4: 00000000000006e0 [ 316.339972] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 316.339972] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 316.339972] Process swapper (pid: 0, threadinfo ffff88003fb4e000, task ffff88003f863500) [ 316.339972] Stack: [ 316.339972] ffff88000101f900 ffff880033dae5c0 ffff88003fb53e50 ffffffff803544ca [ 316.339972] ffff88003ef60000 ffff880033dae5c0 ffff88003fb53ea0 ffffffff803545d8 [ 316.339972] ffff88003fb53ea0 ffff88003ef60000 0000000000000286 ffff88003ef60000 [ 316.339972] Call Trace: [ 316.339972] <IRQ> <0> [<ffffffff803544ca>] blk_rq_timed_out+0x16/0x5c [ 316.339972] [<ffffffff803545d8>] blk_rq_timed_out_timer+0xc8/0x138 [ 316.339972] [<ffffffff80354510>] ? blk_rq_timed_out_timer+0x0/0x138 [ 316.339972] [<ffffffff8023f8b7>] run_timer_softirq+0x183/0x1ec [ 316.339972] [<ffffffff80254c0c>] ? tick_dev_program_event+0x6c/0xa4 [ 316.339972] [<ffffffff8023b326>] __do_softirq+0x72/0x128 [ 316.339972] [<ffffffff8020c8cc>] call_softirq+0x1c/0x30 [ 316.339972] [<ffffffff8020df2d>] do_softirq+0x3d/0x78 [ 316.339972] [<ffffffff8023b249>] irq_exit+0x8f/0x98 [ 316.339972] [<ffffffff8021d8e4>] smp_apic_timer_interrupt+0x8a/0xd6 [ 316.339972] [<ffffffff8020c31b>] apic_timer_interrupt+0x6b/0x70 [ 316.339972] <EOI> <0> [<ffffffff80212f22>] ? mwait_idle+0x45/0x4a [ 316.339972] [<ffffffff80209deb>] ? enter_idle+0x22/0x24 [ 316.339972] [<ffffffff8020a386>] ? cpu_idle+0x41/0x80 [ 316.339972] Code: cb ff ff 85 c0 74 a0 45 31 e4 eb d2 45 31 e4 44 89 e0 5b 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 53 48 83 ec 08 48 8b 9f e0 00 00 00 <48> 8b 03 48 8b 10 48 8b 82 b8 00 00 00 48 8b 80 60 01 00 00 48 [ 316.339972] RIP [<ffffffff803f84d3>] scsi_times_out+0x10/0x72 [ 316.339972] RSP <ffff88003fb53e20> [ 316.339972] CR2: 0000000000000000 [ 316.339972] Kernel panic - not syncing: Fatal exception in interrupt in my case easily be triggered by disk activity (i.g. rsync) on rebuilding raid On Thu, 2008-11-13 at 13:03 -0600, James Bottomley wrote: > Actually, I think the trace is slightly off. I suspect this is the > problem: > > struct scsi_cmnd *scmd = req->special; > > I bet req->special is NULL because the command timed out even before it > was prepared by the subsystem. > > Does this fix it? In my case it doesn't 'fix', but proof of concept. With your patch [i just printk-ing the comment] system remains locked printk-ing from time to time: "nasty: command timed out before the mid layer even prepared it" > The fix is more of a bandaid than anything ... we can't really have > commands timing out in the mid-layer because we expect we have full > control of them. With this patch, if we run out of resets, block will > complete a command we're still processing. here is a dmesg: http://sysadminday.org.ru/2.6.28-rc5-git3/scsi_times_out-NULL_pointer_dereference_dmesg
Reply-To: andmike@linux.vnet.ibm.com I have two systems that are hitting similar signatures in scsi_times_out. Note: that my testing is using a distro kernel, but in this area the code is very similar. I will work to get a reproduction on mainline. ..but.. I added some debug to scsi_times_out and noticed that the request with no scmd set in req->special also did not have REQ_STARTED set. I added a WARN_ON check to blk_add_timer for any requests that we where starting a timer for that did not have REQ_STARTED. This is shown below. This does not look good as the elv_dequeue_request is being called off elv_next_request for some cases. Call Trace: [c00000007b747580] [c00000000027808c] .blk_add_timer+0x74/0x134 (unreliable) [c00000007b747610] [c00000000026f9b8] .elv_dequeue_request+0x78/0x8c [c00000007b747680] [c000000000275830] .blk_do_ordered+0x8c/0x31c [c00000007b747720] [c00000000026fc18] .elv_next_request+0x24c/0x2d4 [c00000007b7477c0] [d000000000368004] .scsi_request_fn+0xc8/0x628 [scsi_mod] [c00000007b7478a0] [c00000000026fdf4] .elv_insert+0x154/0x38c [c00000007b747940] [c000000000273ad0] .__make_request+0x4e4/0x568 [c00000007b7479f0] [c000000000271a68] .generic_make_request+0x3f4/0x468 [c00000007b747af0] [c000000000271bd8] .submit_bio+0xfc/0x124 [c00000007b747bb0] [c000000000160a00] .submit_bh+0x14c/0x198 [c00000007b747c40] [c0000000001630a0] .sync_dirty_buffer+0xbc/0x15c [c00000007b747cd0] [c0000000001fcac0] .journal_commit_transaction+0x1014/0x158c [c00000007b747e10] [c00000000020111c] .kjournald+0x104/0x2f4 [c00000007b747f00] [c0000000000a909c] .kthread+0x78/0xc4 [c00000007b747f90] [c00000000002ae2c] .kernel_thread+0x4c/0x68 I changed the previous mentioned WARN_ON to just do a return if the request does not have REQ_STARTED. This corrected the issue of seeing an oops in scsi_times_out. But this is just a hack. Hope this analysis is not flawed because of kernel deltas. It also may not address this specific issue being seen in this bug, but does appear to indicate a possible path to get a request on the timeout list with out a req->special set. I think we may need to look at some of the paths that are calling blkdev_dequeue_request and understand how to prevent blk_add_timer from being called if we are not really starting a SCSI cmd. -andmike -- Michael Anderson andmike@linux.vnet.ibm.com
> Mike Anderson pointed out that we have a potential window where the > timer can fire after we've unprepped the request in SCSI (so making > req->special NULL) but before we call blk_requeue_request() which stops > the timer. We can rejig the locking to prevent this from happening, so > could you (separately) try this patch? > > James > Hello James, sorry for the huge delay. Unfortunately it turned out I was just 'lucky' to run into this bug the first few times. When I later on tried to reproduce this specific issue, I tried 20 times and couldn't, even without any patch :( So testing especially the 2nd patch turnes out to be a bit difficult (can't be verified by printk). I now also mostly only have remotely access to the test hardware and can't reset this system remotely, so testing has become a bit difficult :( Tomorrow will be in the lab again and try to test again. Thanks for your help and patience, Bernd