Latest working kernel version: commit c8d7aa after 2.6.28-rc2
Earliest failing kernel version: commit 920da6 after 2.6.28-rc2
Distribution: Debian
Hardware Environment: sata_sil24, amd64, 2cpu
Software Environment: 64bit kernel, 32bit userspace, preemptible kernel

Problem Description:

When I/O is under stress, CPU1 hangs from time to time, most probably due to an endless stream of interrupts. The backtrace, printed either by the kernel's softlockup detection or by alt-sysrq-p, is below (written down by hand; I/O is dead when this happens).

_spin_unlock_irq + 0x30 (after sti)
scsi_request_fn + 0x1b9 (after spin_unlock_irq(shost->host_lock) at not_ready:)
blk_invoke_request_fn
__blk_run_queue
scsi_run_queue
scsi_next_command
scsi_end_request
scsi_io_completion
scsi_finish_command
scsi_softirq_done
blk_done_softirq
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
<EOI>
native_safe_halt
trace_hardirqs_on
default_idle
c1e_idle
cpu_idle
start_secondary

Steps to reproduce:

It seems to occur under heavy I/O (updatedb, dumping core from a ~3GB app), but I was not able to trigger it reliably. The most reliable way is hard-resetting the box: the hang then occurs in ~80% of cases while replaying journals on disks connected to sata_sil24 (through a PMP, but the problem does not seem to occur on 2.6.28-rc2 with Jens's PMP patches).
Apparently my lower-bound test kernel did not have Jens's patch to use tagged queuing from the SCSI layer applied. I've discovered a reliable test case (write 4GB of data concurrently to every attached disk), and found that reverting all four SATA tagged-queuing related commits gets rid of the problem:

43a49cbdf31e812c0d8f553d433b09b421f5d52c
3070f69b66b7ab2f02d8a2500edae07039c38508
e013e13bf605b9e6b702adffbe2853cfc60e7806
2fca5ccf97d2c28bcfce44f5b07d85e74e3cd18e
Reply-To: James.Bottomley@HansenPartnership.com

On Sat, 2008-11-08 at 19:50 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11990
>
>            Summary: Kernel hang in spin_unlock_irq from scsi_request_fn
>                     from do_IRQ
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: 2.6.28-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SCSI
>         AssignedTo: linux-scsi@vger.kernel.org
>         ReportedBy: vandrove@vc.cvut.cz
>
> [...]
This looks identical to http://bugzilla.kernel.org/show_bug.cgi?id=11898

Could you see if this refinement of the discussed patches fixes it for you?

Thanks,

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f5d3b96..e09a661 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -606,6 +606,7 @@ static void scsi_run_queue(struct request_queue *q)
 		}
 		list_del_init(&sdev->starved_entry);
+		starved_head = NULL;
 		spin_unlock(shost->host_lock);
 
 		spin_lock(sdev->request_queue->queue_lock);
@@ -620,6 +621,12 @@ static void scsi_run_queue(struct request_queue *q)
 		spin_unlock(sdev->request_queue->queue_lock);
 
 		spin_lock(shost->host_lock);
+		if (unlikely(!list_empty(&sdev->starved_entry)))
+			/*
+			 * sdev got put back on the starved list
+			 * so finish starved handling
+			 */
+			break;
 	}
 	spin_unlock_irqrestore(shost->host_lock, flags);
*** This bug has been marked as a duplicate of bug 11898 ***