At random intervals of between 10 and 40 days, a Seagate Momentus drive drops out of an eight-drive JBOD array attached to an LSI 2008 SAS controller.
eight Seagate Momentus ST9500420AS SATA drives, JBOD
LVM2 8x striped/RAID0 LV
CentOS 5.5 kernel 2.6.18-194.8.1.el5
MPT2BIOS 7.05.01.00 (2010.09.09)
LSI mpt2sas 05.00.00.00
CentOS 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
distro mpt2sas version 01.101.00.00
Striped LV is for logging and receives moderate write activity for 6.5 hours each day. Additionally, a 'pbzip2' job runs nightly to compress each day's log. Uncompressed logs run between 250 and 500 GB each. Ext4 filesystem.
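For a sense of what "moderate" means here, a back-of-the-envelope sketch of the average write rate implied by those figures. It assumes decimal units (1 GB = 1000 MB) and writes spread evenly over the 6.5-hour window and across all eight stripes; neither assumption was measured, and the function name is only illustrative.

```python
# Rough sustained write rate implied by the log sizes above.
# Assumptions (not measured): 1 GB = 1000 MB, and writes are spread
# evenly over the 6.5-hour window and across all eight stripes.

WINDOW_SECONDS = 6.5 * 3600   # daily logging window in seconds
STRIPES = 8                   # drives in the striped LV


def write_rate_mb_s(log_gb, stripes=1):
    """Average write rate in MB/s for one day's log, divided over `stripes` drives."""
    return log_gb * 1000 / WINDOW_SECONDS / stripes


for log_gb in (250, 500):
    total = write_rate_mb_s(log_gb)
    per_drive = write_rate_mb_s(log_gb, STRIPES)
    print(f"{log_gb} GB/day: {total:.1f} MB/s total, {per_drive:.1f} MB/s per drive")
```

Under those assumptions the array averages roughly 11 to 21 MB/s in total, or about 1.3 to 2.7 MB/s per drive. Real traffic is probably burstier than an even average suggests.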
Originally reported under bug 14831 before exact nature of problem was identified. See bottom of that report for initial analysis by kdesai.
Created attachment 28611
kernel messages from failure with logging_level=0x1F8
Created attachment 28621
boot-time messages with logging_level=0x1F8
Created attachment 28631
firmware events from boot and failure
Created attachment 28641
boot-time information from 'lsiutil'
Created attachment 28651
miscellaneous 'lsiutil' information collected after failure
two Intel / Xeon X5560 2.8GHz / 1333MHz / 8MB L3 / D0
Hynix / ECC UDIMM HMT351U7AFR8C-H9 / 1333MHz / 4GB x 6 = 24GB
Supermicro AOC-USAS2-L8i SAS controller
Left out one bit of hardware, and remembering it led to an idea. A SuperMicro SAS-113TQ SAS/SATA backplane ( http://www.supermicro.com/manuals/other/BPN-SAS-113TQ.pdf ) is also in the mix and could be a cause of random hotplug events. Distinctly recall puzzling over two tiny ribbon cables that run between the controller card and the backplane. Turned out that the extra connections allow the controller and backplane to communicate via the obscure "SGPIO" protocol ( http://en.wikipedia.org/wiki/SGPIO ). Seems to be for flashing LEDs but who knows? Maybe the backplane can trigger hotplug events.
Another detail is that it's always the last drive in each of the two SAS IPASS cable groups that drops: either physical slot 3 or physical slot 7 (where the ranges are 0-3 and 4-7). A suspicious coincidence.
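To make the slot pattern explicit, a small sketch that maps a physical slot number to its cable group and lane. It assumes four lanes per cable and consecutive slot numbering, as the 0-3 and 4-7 ranges suggest; the helper names are made up for illustration.

```python
# Assumption: each cable carries four lanes and physical slots are
# numbered consecutively, as the 0-3 and 4-7 ranges suggest.
LANES_PER_CABLE = 4


def locate(slot):
    """Return (cable_group, lane_in_group) for a physical slot number."""
    return divmod(slot, LANES_PER_CABLE)


def is_last_in_group(slot):
    """True if the slot is the final lane of its cable group."""
    return slot % LANES_PER_CABLE == LANES_PER_CABLE - 1


for slot in range(8):
    group, lane = locate(slot)
    note = "  <- last lane in group" if is_last_in_group(slot) else ""
    print(f"slot {slot}: cable group {group}, lane {lane}{note}")
```

Only slots 3 and 7 get flagged, which are the two slots seen dropping so far.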
Created attachment 28962
kernel log from yet another failure
yet another controller failure
different profile: infinite hot-plug event loop this time
Possibly have figured this out. Since the problem has often occurred when the logging application becomes idle, it seems possible that power management in the drives is a cause. The Seagate Momentus ST9500420AS drives are known for parking their heads aggressively (and driving laptop users nuts). For some reason 'hdparm' does not work with LSI-attached drives under CentOS 5.5, but it does work under Fedora 12. Have an F12 OS image available on the server and used it to run 'hdparm -B 255 /dev/sdX' on all of the drives, then rebooted back to CentOS after verifying that the value sticks. Time will tell if disabling APM on the drives works around the issue.
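For anyone wanting to repeat the workaround, a minimal sketch of applying 'hdparm -B 255' to every array member. It assumes the eight drives show up as /dev/sda through /dev/sdh (adjust for the actual host), that 'hdparm' works against the controller on the OS in use, and a Python 3.6+ interpreter. The helper names are hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumption: the eight array members appear as /dev/sda through /dev/sdh.
# Adjust to match the actual device names on the host.
DRIVES = [f"/dev/sd{letter}" for letter in "abcdefgh"]


def apm_commands(drives, level=255):
    """Build one 'hdparm -B <level> <dev>' argument list per drive."""
    return [["hdparm", "-B", str(level), dev] for dev in drives]


def apply(commands, dry_run=True):
    """Print each command; also execute it unless dry_run is True."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # Needs root and a working hdparm; raises if the command fails.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    apply(apm_commands(DRIVES))  # dry run by default; nothing is executed
```

Calling apply(apm_commands(DRIVES), dry_run=False) as root would actually issue the commands. Running 'hdparm -B /dev/sdX' with no value afterwards should report the current APM setting, which is one way to check that it sticks.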
If this is the cause, it implies that the LSI firmware may be mistaking APM event notifications from the drives for hot-plug events. Seems to me that would be a bug. However, it's strange that this only happens after an extended period of time, so it may be a more complex variation of that basic theory. Perhaps the drives have a quirk where they drop into the spin-down power state only after a certain amount of uptime.
Arrgh! It appears that this drive has bad firmware that hangs and freezes, in addition to parking the heads excessively. Even better, Seagate has not released a fix. Second server with bad Seagate firmware; definitely sticking with Western Digital going forward.
Hopefully disabling APM will avoid the firmware bug.