Bug 47701
| Summary: | When too many disks fall out at the same time, RCU hangs | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Steinar H. Gunderson (steinar+kernel) |
| Component: | SCSI | Assignee: | linux-scsi (linux-scsi) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | alan, joe.lawrence, lists2009 |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.5.4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | dmesg of boot, and removal of one drive at 9462 seconds. | | |
Description
Steinar H. Gunderson
2012-09-18 23:13:08 UTC
Created attachment 82531
dmesg of boot, and removal of one drive at 9462 seconds.
I can reproduce this on 3.5.5 & 3.6.
I have 2 of these cards: 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 02)
My test setup places only 2 drives on the SAS cards. I create a RAID10 from them. Simply pulling a drive causes the following RCU hang and prevents the machine from syncing, rebooting, or using the array. Alt-sysrq gets me rebooted and back up and running. 100% reproducible.
Stratus noticed a similar crash (a hang, actually) earlier this week when removing a single SAS disk that was part of a RAID 1 MD mirror. In our instance, all CPUs were idle except one that was running scsi_target_reap and another that was waiting on RCU synchronize_sched. Since the former function was stuck in some loop, RCU stalled and the machine wedged.
Another Stratus engineer noticed patch [1], and once it was applied to our kernel, MD/mpt2sas disk removal no longer hung the machine.
[1] [SCSI] scsi_remove_target: fix softlockup regression on hot remove
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=bc3f02a795d3b4faa99d37390174be2a75d091bd
Apparently fixed as of 3.6.0-07201-ged5062d (current git as of 8 hours ago).
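For readers unfamiliar with why one looping CPU can wedge the whole machine, here is a rough toy model of the stall described in the Stratus comment. It is not kernel code and does not reproduce the actual scsi_target_reap loop; it only assumes the general rule that a synchronize_sched grace period cannot end until every CPU has passed through a quiescent state (for example a context switch). The function name, CPU names, and tick count are all made up for illustration.

```python
# Toy model (not kernel code): a grace period ends only after every CPU
# has passed through a quiescent state, e.g. a context switch.

def grace_period_completes(cpus, max_ticks):
    """Return True if all CPUs report a quiescent state within max_ticks.

    cpus maps a hypothetical CPU name to True if that CPU is spinning in
    a loop without ever scheduling, False if it is idle or scheduling.
    """
    pending = set(cpus)              # CPUs that have not yet reported
    for _ in range(max_ticks):
        for name, is_spinning in cpus.items():
            if not is_spinning:      # idle or scheduling CPUs report promptly
                pending.discard(name)
        if not pending:
            return True              # every CPU reported: grace period ends
    return False                     # a spinning CPU never reports: stall


# Hypothetical four-CPU machine: cpu1 is stuck in a loop, the rest are idle.
cpus = {"cpu0": False, "cpu1": True, "cpu2": False, "cpu3": False}
stalled = not grace_period_completes(cpus, max_ticks=1000)

# Same machine with no CPU spinning: the waiter is released.
healthy = grace_period_completes({name: False for name in cpus}, max_ticks=1000)
```

In this sketch the waiter on the stuck machine never gets released no matter how large max_ticks is, which matches the symptom of idle CPUs plus one task blocked in synchronize_sched.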