Bug 17551 - mpt2sas -- spurious hotplug event causes drive to drop out of JBOD array
Summary: mpt2sas -- spurious hotplug event causes drive to drop out of JBOD array
Status: RESOLVED DOCUMENTED
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-31 08:05 UTC by starlight
Modified: 2012-08-13 16:07 UTC

See Also:
Kernel Version: 2.6.26 - 2.6.32rc4-scsi-misc
Tree: Fedora
Regression: No


Attachments
kernel messages from failure with logging_level=0x1F8 (38.33 KB, text/plain)
2010-08-31 08:08 UTC, starlight
boot-time messages with logging_level=0x1F8 (63.46 KB, text/plain)
2010-08-31 08:08 UTC, starlight
firmware events from boot and failure (1.96 KB, text/plain)
2010-08-31 08:09 UTC, starlight
boot-time information from 'lsiutil' (4.64 KB, text/plain)
2010-08-31 08:10 UTC, starlight
miscellaneous 'lsiutil' information collected after failure (59.63 KB, text/plain)
2010-08-31 08:12 UTC, starlight
kernel log from yet another failure (618.05 KB, application/octet-stream)
2010-09-04 20:14 UTC, starlight

Description starlight 2010-08-31 08:05:57 UTC
At random intervals of between 10 and 40 days, a Seagate Momentus drive drops out of an eight-drive JBOD array attached to an LSI 2008 SAS controller.

LSI 2008
eight Seagate Momentus ST9500420AS SATA drives, JBOD
LVM2 8x striped/RAID0 LV

CentOS 5.5 kernel 2.6.18-194.8.1.el5
MPT2BIOS 7.05.01.00 (2010.09.09)
SAS2008-IT 5.00.00.00
LSI mpt2sas 05.00.00.00

also

CentOS 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00
distro mpt2sas version 01.101.00.00

-----

Striped LV is for logging and receives moderate write activity for 6.5 hours each day.  Additionally a 'pbzip2' job runs nightly to compress each day's log.  Uncompressed logs run between 250 and 500 GB each.  Ext4 filesystem.
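
To put "moderate write activity" in numbers, here is a rough back-of-the-envelope sketch.  It assumes decimal gigabytes and that the day's log is written evenly across the 6.5-hour window; neither is stated above, and the function names are made up for illustration.

```python
# Rough average write rate implied by the daily log size.
# Assumptions (not from the report): 1 GB = 1e9 bytes, and writes are
# spread evenly over the 6.5-hour logging window.

LOGGING_HOURS = 6.5
STRIPE_WIDTH = 8  # eight drives in the striped LV


def avg_write_rate_mb_s(log_size_gb, hours=LOGGING_HOURS):
    """Average write rate to the LV in MB/s for one day's log."""
    seconds = hours * 3600
    return log_size_gb * 1e9 / seconds / 1e6


def per_drive_rate_mb_s(log_size_gb, drives=STRIPE_WIDTH):
    """Average write rate per drive, assuming an even stripe."""
    return avg_write_rate_mb_s(log_size_gb) / drives


if __name__ == "__main__":
    for size_gb in (250, 500):
        print(size_gb, "GB/day:",
              round(avg_write_rate_mb_s(size_gb), 1), "MB/s total,",
              round(per_drive_rate_mb_s(size_gb), 1), "MB/s per drive")
```

Under those assumptions each drive sees only a few MB/s on average, so the drives are far from saturated while logging.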

-----

Originally reported under bug 14831 before exact nature of problem was identified.  See bottom of that report for initial analysis by kdesai.
Comment 1 starlight 2010-08-31 08:08:04 UTC
Created attachment 28611
kernel messages from failure with logging_level=0x1F8
Comment 2 starlight 2010-08-31 08:08:39 UTC
Created attachment 28621
boot-time messages with logging_level=0x1F8
Comment 3 starlight 2010-08-31 08:09:16 UTC
Created attachment 28631
firmware events from boot and failure
Comment 4 starlight 2010-08-31 08:10:31 UTC
Created attachment 28641
boot-time information from 'lsiutil'
Comment 5 starlight 2010-08-31 08:12:16 UTC
Created attachment 28651
miscellaneous 'lsiutil' information collected after failure
Comment 6 starlight 2010-08-31 08:23:09 UTC
Hardware details:

Supermicro 1026T-URF
two Intel / Xeon X5560 2.8GHz / 1333MHz / 8MB L3 / D0
Hynix / ECC UDIMM HMT351U7AFR8C-H9 / 1333MHz / 4GB x 6 = 24GB
Supermicro AOC-USAS2-L8i SAS controller
Comment 7 starlight 2010-08-31 09:32:44 UTC
Left out one bit of hardware, and remembering it led to an idea.  A SuperMicro SAS-113TQ SAS/SATA backplane ( http://www.supermicro.com/manuals/other/BPN-SAS-113TQ.pdf ) is also in the mix and could be a cause of random hotplug events.  Distinctly recall puzzling over two tiny ribbon cables that run between the controller card and the backplane.  Turned out that the extra connections allow the controller and backplane to communicate via the obscure "SGPIO" protocol ( http://en.wikipedia.org/wiki/SGPIO ).  Seems to be for flashing LEDs but who knows?  Maybe the backplane can trigger hotplug events.

Another detail is that it's always the last drive in each of the two SAS IPASS cable groups that drops:  either physical slot 3 or physical slot 7 (where the ranges are 0-3 and 4-7).  A suspicious coincidence.
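
To make the slot pattern concrete, a small sketch of the slot-to-cable mapping described above.  The constant and function names are hypothetical, and it assumes each cable group carries exactly four consecutive slots (0-3 and 4-7).

```python
# Map a physical slot number to its cable group and position in that group.
# Assumption: two cable groups of four consecutive slots each (0-3, 4-7).

SLOTS_PER_CABLE = 4


def cable_position(slot):
    """Return (cable group, position within group) for a physical slot."""
    return divmod(slot, SLOTS_PER_CABLE)


def is_last_in_group(slot):
    """True if the slot is the last lane of its cable group."""
    return slot % SLOTS_PER_CABLE == SLOTS_PER_CABLE - 1


# Slots that sit in the last position of their cable group.
last_slots = [slot for slot in range(8) if is_last_in_group(slot)]
print(last_slots)  # [3, 7]
```

Both slots that have dropped so far fall in position 3 of their group, which is what makes the coincidence look suspicious.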
Comment 8 starlight 2010-09-04 20:14:36 UTC
Created attachment 28962
kernel log from yet another failure

yet another controller failure

different profile:  infinite hot-plug event loop this time
Comment 9 starlight 2010-09-05 06:40:44 UTC
Possibly have figured this out.  Since the problem has often occurred when the logging application becomes idle, it seems possible that power management in the drives is a cause.  The Seagate Momentus ST9500420AS drives are known for parking their heads aggressively (and driving laptop users nuts).  For some reason 'hdparm' does not work with LSI-attached drives under CentOS 5.5, but it does work under Fedora 12.  Have an F12 OS image available on the server and used it to run 'hdparm -B 255 /dev/sdX' on all of the drives, then rebooted back to CentOS after verifying that the value sticks.  Time will tell if disabling APM on the drives works around the issue.

If this is the cause, it implies that the LSI firmware is possibly mistaking APM event notifications from the drives for hot-plug events.  Seems to me that would be a bug.  However it's strange that this only happens after an extended period of time, so it may be a more complex variation of that basic theory.  Perhaps the drives have a quirk where they drop into the spin-down power state only after a certain amount of uptime.
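
For anyone wanting to repeat the workaround, a minimal sketch of applying 'hdparm -B 255' across all eight drives.  The device names /dev/sda through /dev/sdh and the helper names are assumptions for illustration only; check the real device mapping first.  It defaults to a dry run that only prints the commands, since the real thing needs root and, as noted above, did not work under CentOS 5.5.

```python
import subprocess

# Assumed device names for the eight-drive array; adjust to the real mapping.
DRIVES = ["/dev/sd" + letter for letter in "abcdefgh"]


def apm_disable_commands(devices, level=255):
    """Build one 'hdparm -B <level> <device>' argument list per device.

    Level 255 asks the drive to disable APM entirely.
    """
    if not 1 <= level <= 255:
        raise ValueError("APM level must be in the range 1..255")
    return [["hdparm", "-B", str(level), dev] for dev in devices]


def disable_apm(devices, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in apm_disable_commands(devices):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    disable_apm(DRIVES, dry_run=True)
```

Running 'hdparm -B /dev/sdX' with no value afterwards reports the current setting, which is one way to verify that it sticks across a reboot.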
Comment 10 starlight 2010-09-05 16:55:19 UTC
Arrgh!  It appears that this drive has bad firmware that hangs and freezes, along with excessively parking the heads.  Even better, Seagate has not released a fix.  This is the second server with bad Seagate firmware; definitely sticking with Western Digital going forward.

http://forums.seagate.com/t5/Momentus-XT-Momentus-and/Momentus-ST9500420AS-Firmware-Update/td-p/33862

Hopefully the disabling of APM will avoid the firmware bug.
