Bug 11990

Summary: Kernel hang in spin_unlock_irq from scsi_request_fn from do_IRQ
Product: IO/Storage Reporter: Petr Vandrovec (vandrove)
Component: SCSI Assignee: linux-scsi (linux-scsi)
Status: CLOSED DUPLICATE    
Severity: normal CC: rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.28-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11808    

Description Petr Vandrovec 2008-11-08 19:50:18 UTC
Latest working kernel version: commit c8d7aa after 2.6.28-rc2
Earliest failing kernel version: commit 920da6 after 2.6.28-rc2
Distribution: Debian
Hardware Environment: sata_sil24, amd64, 2cpu
Software Environment: 64bit kernel, 32bit userspace, preemptible kernel
Problem Description:

When I/O is under stress, CPU1 hangs from time to time, most probably due to an endless stream of interrupts.  The backtrace below was printed either by the kernel's softlockup detection or by alt-sysrq-p (written down; I/O is dead when this happens).

_spin_unlock_irq + 0x30  (after sti)
scsi_request_fn + 0x1b9  (after spin_unlock_irq(shost->host_lock) at not_ready:)
blk_invoke_request_fn
__blk_runqueue
scsi_run_queue
scsi_next_command
scsi_end_request
scsi_io_completion
scsi_finish_command
scsi_softirq_done
blk_done_softirq
__do_softirq
call_softirq
do_softirq
irqexit
do_IRQ
ret_from_intr
<EOI>
native_safe_halt
trace_hardirqs_on
default_idle
c1e_idle
cpu_idle
start_secondary

Steps to reproduce:

It seems to occur under heavy I/O (updatedb, dumping core from a ~3GB app), but I was not able to trigger it reliably. The most reliable method is hard-resetting the box; the hang then occurs in ~80% of cases when replaying journals on disks connected to sata_sil24 (through PMP, but the problem does not seem to occur on 2.6.28-rc2 with Jens's PMP patches).
Comment 1 Petr Vandrovec 2008-11-09 01:57:51 UTC
Apparently my lower-bound test kernel did not have Jens's patch to use tagged queuing from the SCSI layer applied.  I've found a reliable test case (write 4GB of data concurrently to every attached disk), and found that reverting all 4 SATA tagged-queueing-related check-ins gets rid of the problem:

43a49cbdf31e812c0d8f553d433b09b421f5d52c
3070f69b66b7ab2f02d8a2500edae07039c38508
e013e13bf605b9e6b702adffbe2853cfc60e7806
2fca5ccf97d2c28bcfce44f5b07d85e74e3cd18e
Comment 2 Anonymous Emailer 2008-11-09 07:22:59 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Sat, 2008-11-08 at 19:50 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11990
> 
>            Summary: Kernel hang in spin_unlock_irq from scsi_request_fn from
>                     do_IRQ
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: 2.6.28-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SCSI
>         AssignedTo: linux-scsi@vger.kernel.org
>         ReportedBy: vandrove@vc.cvut.cz
> 
> 
> Latest working kernel version: commit c8d7aa after 2.6.28-rc2
> Earliest failing kernel version: commit 920da6 after 2.6.28-rc2
> Distribution: Debian
> Hardware Environment: sata_sil24, amd64, 2cpu
> Software Environment: 64bit kernel, 32bit userspace, preemptible kernel
> Problem Description:
> 
> When I/O is under stress, from time to time CPU1 hangs, most probably due to
> endless stream of interrupts.  Backtrace printed either by kernel's softlockup
> detection or alt-sysrq-p is below (written down; I/O is dead when this
> happens).
> 
> _spin_unlock_irq + 0x30  (after sti)
> scsi_request_fn + 0x1b9  (after spin_unlock_irq(shost->host_lock) at not_ready:)
> blk_invoke_request_fn
> __blk_runqueue
> scsi_run_queue
> scsi_next_command
> scsi_end_request
> scsi_io_completion
> scsi_finish_command
> scsi_softirq_done
> blk_done_softirq
> __do_softirq
> call_softirq
> do_softirq
> irqexit
> do_IRQ
> ret_from_intr
> <EOI>
> native_safe_halt
> trace_hardirqs_on
> default_idle
> c1e_idle
> cpu_idle
> start_secondary
> 
> Steps to reproduce:
> 
> It seems to occur under heavy I/O (updatedb, dumping core from ~3GB app), but I
> was not able to trigger it reliably - most reliable is hard resetting box, then
> it occurs in ~80% cases when replaying journals on disks connected to
> sata_sil24 (through PMP, but problem does not seem to occur on 2.6.28-rc2 with
> Jens's PMP patches).

This looks identical to

http://bugzilla.kernel.org/show_bug.cgi?id=11898

Could you see if this refinement of the discussed patches fixes it for
you?

Thanks,

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f5d3b96..e09a661 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -606,6 +606,7 @@ static void scsi_run_queue(struct request_queue *q)
 		}
 
 		list_del_init(&sdev->starved_entry);
+		starved_head = NULL;
 		spin_unlock(shost->host_lock);
 
 		spin_lock(sdev->request_queue->queue_lock);
@@ -620,6 +621,12 @@ static void scsi_run_queue(struct request_queue *q)
 		spin_unlock(sdev->request_queue->queue_lock);
 
 		spin_lock(shost->host_lock);
+		if (unlikely(!list_empty(&sdev->starved_entry)))
+			/* 
+			 * sdev got put back on the starved list
+			 * so finish starved handling
+			 */
+			break;
 	}
 	spin_unlock_irqrestore(shost->host_lock, flags);
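
The effect of the added `break` can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: the list helpers mimic the kernel's `list_head` API, while `model_host`, `model_dev`, `model_run_queue`, `run_starved`, and `simulate` are hypothetical names invented for illustration. It assumes, following the patch comment, that running a device's queue can put that device straight back on the starved list, so that the loop never drains the list unless it stops when it sees the device re-added.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	list_init(e);
}

/* Hypothetical model types; NOT the real struct Scsi_Host / scsi_device. */
struct model_host { struct list_head starved_list; };

struct model_dev {
	struct list_head starved_entry; /* first member, so the cast below is valid */
	struct model_host *host;
	int requeues_left; /* times that running the queue re-starves the device */
};

/* Stand-in for running the device queue: while the (assumed) busy condition
 * persists, the device ends up on the starved list again. */
static void model_run_queue(struct model_dev *dev)
{
	if (dev->requeues_left > 0) {
		dev->requeues_left--;
		list_add_tail(&dev->starved_entry, &dev->host->starved_list);
	}
}

/* Models the starved-list loop; returns the number of passes, capped at
 * max_passes so that the unpatched variant terminates in this model. */
static int run_starved(struct model_host *host, int with_fix, int max_passes)
{
	int passes = 0;

	while (!list_empty(&host->starved_list) && passes < max_passes) {
		struct model_dev *dev = (struct model_dev *)host->starved_list.next;

		list_del_init(&dev->starved_entry);
		model_run_queue(dev);
		passes++;

		/* Mirrors the patch: the device was put back, so stop here. */
		if (with_fix && !list_empty(&dev->starved_entry))
			break;
	}
	return passes;
}

/* Builds a host with one starved device and runs the loop once. */
static int simulate(int with_fix, int requeues, int max_passes)
{
	struct model_host host;
	struct model_dev dev;

	assert(requeues >= 0 && max_passes > 0);
	list_init(&host.starved_list);
	list_init(&dev.starved_entry);
	dev.host = &host;
	dev.requeues_left = requeues;
	list_add_tail(&dev.starved_entry, &host.starved_list);
	return run_starved(&host, with_fix, max_passes);
}
```

In this model, `simulate(0, 1000, 100)` keeps looping until it hits the 100-pass cap, whereas `simulate(1, 1000, 100)` leaves the loop after a single pass. The real function also deals with locking, `starved_head`, and host and target busy checks, all of which are omitted here.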
 
Comment 3 Rafael J. Wysocki 2008-11-09 11:01:06 UTC

*** This bug has been marked as a duplicate of bug 11898 ***