Bug 17551
Summary: | mpt2sas -- spurious hotplug event causes drive to drop out of JBOD array | ||
---|---|---|---|
Product: | SCSI Drivers | Reporter: | starlight |
Component: | Other | Assignee: | scsi_drivers-other |
Status: | RESOLVED DOCUMENTED | ||
Severity: | normal | CC: | alan, kashyap.desai |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26 - 2.6.32rc4-scsi-misc | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
kernel messages from failure with logging_level=0x1F8
boot-time messages with logging_level=0x1F8
firmware events from boot and failure
boot-time information from 'lsiutil'
miscellaneous 'lsiutil' information collected after failure
kernel log from yet another failure |
Description
starlight
2010-08-31 08:05:57 UTC
Created attachment 28611 [details]
kernel messages from failure with logging_level=0x1F8
Created attachment 28621 [details]
boot-time messages with logging_level=0x1F8
Created attachment 28631 [details]
firmware events from boot and failure
Created attachment 28641 [details]
boot-time information from 'lsiutil'
Created attachment 28651 [details]
miscellaneous 'lsiutil' information collected after failure
Hardware details:
Supermicro 1026T-URF
two Intel / Xeon X5560 2.8GHz / 1333MHz / 8MB L3 / D0
Hynix / ECC UDIMM HMT351U7AFR8C-H9 / 1333MHz / 4GB x 6 = 24GB
Supermicro AOC-USAS2-L8i SAS controller

Left out one bit of hardware, and remembering it led to an idea. A SuperMicro SAS-113TQ SAS/SATA backplane ( http://www.supermicro.com/manuals/other/BPN-SAS-113TQ.pdf ) is also in the mix and could be a cause of random hotplug events. Distinctly recall puzzling over two tiny ribbon cables that run between the controller card and the backplane. Turned out that the extra connections allow the controller and backplane to communicate via the obscure "SGPIO" protocol ( http://en.wikipedia.org/wiki/SGPIO ). Seems to be for flashing LEDs but who knows? Maybe the backplane can trigger hotplug events.

Another detail is that it's always the last drive in each of the two SAS IPASS cable groups that drops: either physical slot 3 or physical slot 7 (where the ranges are 0-3 and 4-7). A suspicious coincidence.

Created attachment 28962 [details]
kernel log from yet another failure
yet another controller failure
different profile: infinite hot-plug event loop this time
Possibly have figured this out. Since the problem has often occurred when the logging application becomes idle, it seems possible that power management in the drives is a cause. The Seagate Momentus ST9500420AS drives are known for parking their heads aggressively (and driving laptop users nuts).

For some reason 'hdparm' does not work with LSI attached drives under CentOS 5.5, but it does work under Fedora 12. Have an F12 OS image available on the server and used it to run 'hdparm -B 255 /dev/sdX' on all of the drives, then rebooted back to CentOS after verifying that the value sticks. Time will tell if disabling APM on the drives works around the issue.

If this is the cause, it implies that the LSI firmware may be mistaking APM event notifications from the drives for hot-plug events. Seems to me that would be a bug. However it's strange that this only happens after an extended period of time, so it may be a more complex variation of that basic theory. Perhaps the drives have a quirk where they drop into the spin-down power state only after a certain amount of uptime.

Arrgh! It appears that this drive has bad firmware that hangs and freezes, on top of parking the heads excessively. Even better, Seagate has not released a fix. Second server with bad Seagate firmware; definitely sticking with Western Digital going forward.

http://forums.seagate.com/t5/Momentus-XT-Momentus-and/Momentus-ST9500420AS-Firmware-Update/td-p/33862

Hopefully disabling APM will avoid the firmware bug.
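
For anyone wanting to repeat the APM workaround across several drives, here is a rough sketch of how the 'hdparm -B 255' step could be scripted. The device names and both helper functions are hypothetical, not something taken from this report. The script only prints the commands unless dry_run is turned off, and it assumes 'hdparm' is installed and actually works against the controller in question (it did not under CentOS 5.5 here).

```python
import subprocess


def build_apm_commands(devices, level=255):
    """Return one 'hdparm -B <level> <device>' argument list per device.

    Per the hdparm man page, -B takes 1..255 and 255 disables APM.
    """
    if not 1 <= level <= 255:
        raise ValueError("APM level must be in the range 1..255")
    return [["hdparm", "-B", str(level), dev] for dev in devices]


def apply_commands(commands, dry_run=True):
    """Print each command; run it (as root) only when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical device list; substitute the drives behind the controller.
    drives = ["/dev/sda", "/dev/sdb"]
    apply_commands(build_apm_commands(drives))
```

Running 'hdparm -B /dev/sdX' with no value afterwards should report the current APM setting, which is one way to check that the value sticks across a reboot.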