Bug 111381
Summary: | mvsas 0.8.16 on Marvell 88SE9485 reports timeouts on load with SMART commands | | |
---|---|---|---|
Product: | SCSI Drivers | Reporter: | Gabriel A. Devenyi (gdevenyi) |
Component: | Other | Assignee: | scsi_drivers-other |
Status: | RESOLVED WILL_NOT_FIX | | |
Severity: | high | | |
Priority: | P1 | | |
Hardware: | x86-64 | | |
OS: | Linux | | |
Kernel Version: | 4.1.15 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | attachment-706-0.html | | |
Description
Gabriel A. Devenyi
2016-01-27 21:56:17 UTC
Created attachment 202131: attachment-706-0.html
Okay, I'll move around the cables and see if the problems move with them.
After a check and reseat of all cables, I am now getting this in dmesg:

[ 4386.920825] sd 3:0:4:0: [sdn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4386.920834] sd 3:0:4:0: [sdn] tag#0 Sense Key : Aborted Command [current]
[ 4386.920841] sd 3:0:4:0: [sdn] tag#0 ASC=0x4b ASCQ=0x20
[ 4386.920847] sd 3:0:4:0: [sdn] tag#0 CDB: Read(10) 28 00 33 03 4f bf 00 00 25 00
[ 4386.920852] blk_update_request: I/O error, dev sdn, sector 855855039

Recently, problems very similar to this showed up in a discussion on Reddit. See https://www.reddit.com/r/zfs/comments/43z0sn/scrub_knocked_a_drive_offline_flaky_hardware_other/

It kinda looks like Marvell controllers (or the mvsas driver?) are sensitive to cable length. I'm looking at halving the length of my cabling and will report back if the issues resolve.

After much more research and reading, I believe I've found the root issue causing the communication timeouts to the disks. I have not changed any cables (other than reseating everything). I came across several reports of Marvell (mvsas) controllers randomly causing issues under load, which mentioned SMART:

http://www.spinics.net/lists/linux-ide/msg50075.html
https://bugzilla.kernel.org/show_bug.cgi?id=42679

So I decided to stop smartd and rerun my load tests. I have now gone through three iterations of high-load zfs scrubs with iotest.sh hammering the array simultaneously, with no timeouts or communication issues. I believe this indicates some issue with the driver/controller/disks handling SMART commands under load.

Do you have any recommendations as to how I can test this further? Is there any debug info I can provide?
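As a sanity check on the dmesg output above, the failed Read(10) CDB can be decoded by hand to confirm it refers to the same location that blk_update_request reports. This is only a minimal sketch in Python, assuming the standard 10-byte READ(10) layout (opcode 0x28, big-endian 4-byte LBA in bytes 2-5, big-endian 2-byte transfer length in bytes 7-8). The function name decode_read10 is made up for illustration and is not part of any driver or tool.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes.

    Hypothetical helper for illustration; assumes the standard
    READ(10) layout (opcode 0x28, LBA in bytes 2-5, length in bytes 7-8).
    """
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks requested
    }


# CDB bytes copied from the dmesg line above
result = decode_read10("28 00 33 03 4f bf 00 00 25 00")
print(result)  # {'lba': 855855039, 'blocks': 37}
```

The decoded LBA equals the sector in the I/O error line (855855039), so the aborted command is a 37-block read starting at exactly that sector. Since the kernel reports sectors in 512-byte units, the match would be consistent with the drive exposing 512-byte logical blocks, though that is an inference and not something the log states.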