Bug 14831 - mptsas - Use of ATA command pass-through results in unreliable operation - drive / controller resets
Status: CLOSED OBSOLETE
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assigned To: scsi_drivers-other
Depends on:
Blocks:
Reported: 2009-12-18 11:25 UTC by Tim Small
Modified: 2012-06-18 13:20 UTC
CC List: 14 users

See Also:
Kernel Version: 2.6.26 - 2.6.32rc4-scsi-misc
Tree: Mainline
Regression: No


Attachments
latest upstream fusion driver 3.4.14 (251.37 KB, application/octet-stream)
2009-12-21 04:52 UTC, kashyap
Script to stress-test ATA command passthrough whilst write-loading a SATA device. (1.83 KB, application/x-sh)
2009-12-21 12:11 UTC, Tim Small
kernel messages from failure (59.13 KB, text/plain)
2010-08-16 16:04 UTC, starlight
kernel messages from corresponding boot (57.45 KB, text/plain)
2010-08-16 16:07 UTC, starlight
boot-time information from 'lsiutil' (4.64 KB, text/plain)
2010-08-28 15:45 UTC, starlight
firmware events from boot and failure (1.96 KB, text/plain)
2010-08-28 15:45 UTC, starlight
boot-time messages with logging_level=0x1F8 (38.33 KB, text/plain)
2010-08-28 15:47 UTC, starlight
kernel messages from failure with logging_level=0x1F8 (63.46 KB, text/plain)
2010-08-28 15:51 UTC, starlight

Description Tim Small 2009-12-18 11:25:40 UTC
On Debian 2.6.26-2-amd64, and with mptsas 3.04.13 from scsi-misc-2.6.git, using ATA command pass-through on LSI SAS1068 and SAS1068E controllers may result in:

. Device resets
. Device offline
. Controller offline (only observed on 2.6.26)

The problem seems to occur far more frequently with the SAS1068 (PCI version).

I haven't verified whether any data loss is occurring, but this does at least seem to be a possibility.


For 2.6.26:

/lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version:        3.04.06
license:        GPL
description:    Fusion MPT SAS Host driver
author:         LSI Corporation

With this driver and a couple of Western Digital SATA drives, I ran the following command:

while true ; do smartctl -a /dev/sg0 > /dev/null ; done

After approx 45 minutes this happened:

kernel: [5060492.926757] mptctldrivers/message/fusion/mptctl.c::mptctl_ioctl() @602 - Controller disabled.


For 2.6.32-rc4 with mptsas 3.04.13:

[   22.414415] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 9, phy 0, sas_addr 0x1221000000000000
[   22.466953] mptsas: ioc0: attaching sata device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x1221000001000000
[   22.519305] mptsas: ioc0: attaching raid volume, channel 1, id 0
[   33.727405] Fusion MPT misc device (ioctl) driver 3.04.13
[   33.738270] mptctl: Registered with Fusion MPT base driver
[   33.749277] mptctl: /dev/mptctl @ (major,minor=10,220)
[ 5300.611795] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.629028] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.646254] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.663478] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.680700] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5300.697924] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[ 5312.111795] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[ 5312.131469] mptscsih: ioc0: attempting task abort! (sc=ffff88012c5fc8c0)
[ 5312.156831] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c5fc8c0)
[ 5312.169534] mptscsih: ioc0: attempting target reset! (sc=ffff88012c5fc8c0)
[ 5312.195222] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c5fc8c0)
[ 5312.208276] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c5fc8c0)
[ 5316.612245] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c5fc8c0)
[ 5328.112389] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[ 5328.128508] mptscsih: ioc0: attempting host reset! (sc=ffff88012c5fc8c0)
[12537.867482] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[12537.885769] mptscsih: ioc0: attempting host reset! (sc=ffff88012d55c8c0)
[12537.899173] mptbase: ioc0: Initiating recovery
[12559.704264] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012d55c8c0)
[44184.424640] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[44184.441866] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[44195.924782] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[44195.944449] mptscsih: ioc0: attempting task abort! (sc=ffff88012c403ac0)
[44195.969799] mptscsih: ioc0: task abort: FAILED (sc=ffff88012c403ac0)
[44195.982500] mptscsih: ioc0: attempting target reset! (sc=ffff88012c403ac0)
[44196.008182] mptscsih: ioc0: target reset: FAILED (sc=ffff88012c403ac0)
[44196.021230] mptscsih: ioc0: attempting bus reset! (sc=ffff88012c403ac0)
[44200.425026] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88012c403ac0)
[44211.925127] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[44211.943416] mptscsih: ioc0: attempting host reset! (sc=ffff88012c403ac0)
[44211.956814] mptbase: ioc0: Initiating recovery
[44233.760010] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88012c403ac0)
[49878.447977] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
[49889.948381] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[49889.968080] mptscsih: ioc0: attempting task abort! (sc=ffff88003799acc0)
[49889.993425] mptscsih: ioc0: task abort: FAILED (sc=ffff88003799acc0)
[49890.006129] mptscsih: ioc0: attempting target reset! (sc=ffff88003799acc0)
[49890.031817] mptscsih: ioc0: target reset: FAILED (sc=ffff88003799acc0)
[49890.044869] mptscsih: ioc0: attempting bus reset! (sc=ffff88003799acc0)
[49894.448617] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88003799acc0)
[49905.948189] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[49905.966491] mptscsih: ioc0: attempting host reset! (sc=ffff88003799acc0)
[49905.979888] mptbase: ioc0: Initiating recovery
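
For anyone reading the LogInfo words in these messages: the hex value appears to pack the printed fields into fixed bit positions (top nibble a bus type, next nibble the originator, then an 8-bit code and a 16-bit subcode). The sketch below is a minimal decoder under that assumption. The name tables are not a complete list; they hold only the values that show up in logs quoted in this bug, and the function names are made up for illustration.

```python
# Sketch: split an MPT LogInfo word into the fields printed by mptbase.
# Assumption: the bit layout is inferred from the messages in this thread;
# the tables below contain only values seen in those messages.

ORIGINATORS = {0x0: "IOP", 0x1: "PL"}

CODES = {
    ("IOP", 0x03): "Invalid Page",
    ("PL", 0x11): "Reset",
    ("PL", 0x13): "IO Not Yet Executed",
    ("PL", 0x14): "IO Executed",
    ("PL", 0x17): "IO Device Missing Delay Retry",
}


def decode_loginfo(value):
    """Return the bit fields of a 32-bit LogInfo value as a dict."""
    originator = (value >> 24) & 0xF
    code = (value >> 16) & 0xFF
    origin_name = ORIGINATORS.get(originator, "0x%x" % originator)
    return {
        "bus_type": (value >> 28) & 0xF,  # 0x3 in every message quoted here
        "originator": origin_name,
        "code": CODES.get((origin_name, code), "0x%02x" % code),
        "subcode": value & 0xFFFF,
    }


def format_loginfo(value):
    """Render the fields in the same shape as the kernel message."""
    f = decode_loginfo(value)
    return "LogInfo(0x%08x): Originator={%s}, Code={%s}, SubCode(0x%04x)" % (
        value, f["originator"], f["code"], f["subcode"])
```

Calling `format_loginfo(0x31110d00)` reproduces the text of the first LogInfo message above, which is a quick way to check the assumed layout against any other value you see.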
Comment 1 kashyap 2009-12-18 12:44:28 UTC
Can you please provide the firmware version information?
Comment 2 Tim Small 2009-12-18 15:18:37 UTC
For the 1068 (Dell PE860):

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1068 B0      105      000a3300

Current active firmware version is 0.10.51
Firmware image's version is MPTFW-00.10.51.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.12.05.00 (2007.09.29)


For the 1068E: (Dell PE1950):
     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E 08     105      00192f00

Current active firmware version is 0.25.47
Firmware image's version is MPTFW-00.25.47.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.03.00 (2008.08.06)


I've just started a test on a SAS1064 (Intel S5000PSL).  Will leave it on test and report back here...

Thanks,

Tim.
Comment 3 Tim Small 2009-12-18 15:31:55 UTC
Hmm, failed as soon as I submitted the last comment (Intel S5000PSL on-board controller)...

filename:       /lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptsas.ko
version:        3.04.06

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev
 1.  /proc/mpt/ioc0    LSI Logic SAS1064E 04     105      01190100


Current active firmware version is 1.25.01
Firmware image's version is MPTFW-01.25.01.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.22.00.00 (2008.04.10)
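
A note for comparing these listings: the hexadecimal "Firmware Rev" column and the dotted version strings seem to be the same four bytes written two ways (for example 000a3300 next to MPTFW-00.10.51.00). A minimal sketch of that conversion follows, assuming one byte per dotted component; the function name is hypothetical and the mapping is inferred only from the listings in this bug.

```python
# Sketch: convert the hex "Firmware Rev" into the dotted form used in the
# MPTFW-xx.xx.xx.xx image names. Assumption: one byte per component,
# most significant byte first, as the listings in this thread suggest.

def decode_fw_rev(fw_rev_hex):
    """Turn e.g. '000a3300' (or '01190100h') into '00.10.51.00'."""
    value = int(fw_rev_hex.rstrip("hH"), 16)
    parts = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return "%02d.%02d.%02d.%02d" % tuple(parts)
```

This makes it easier to match the `FwRev=...h` value printed at boot against a firmware download, without needing the vendor utility.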


[    2.369101] ioc0: LSISAS1064E B2: Capabilities={Initiator}
[    2.371062] mptbase: ioc0: PCI-MSI enabled
[    2.371612] PCI: Setting latency timer of device 0000:04:00.0 to 64
[   18.377426] scsi0 : ioc0: LSISAS1064E B2, FwRev=01190100h, Ports=1, MaxQ=478, IRQ=1269
[   19.249357] scsi 0:0:0:0: Direct-Access     ATA      WDC WD3200BJKT-0 1A11 PQ: 0 ANSI: 5
[  881.982165] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[  882.086359] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 1514.521445] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 1514.525947] mptbase: ioc0: LogInfo(0x30030108): Originator={IOP}, Code={Invalid Page}, SubCode(0x0108)
[ 2051.568333] mptscsih: ioc0: attempting task abort! (sc=ffff8101190f6940)
[ 2051.568446] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2056.593202] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[ 2056.594064] mptsas: ioc0: removing sata device, channel 0, id 0, phy 0
[ 2056.594182]  port-0:0: mptsas: ioc0: delete port (0)
[ 2056.617086] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 2056.797030] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8101190f6940)
[ 2056.797166] mptscsih: ioc0: attempting task abort! (sc=ffff810124da15c0)
[ 2056.798195] sd 0:0:0:0: [sda] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[ 2056.799567] mptscsih: ioc0: task abort: SUCCESS (sc=ffff810124da15c0)
[ 2056.799697] mptscsih: ioc0: attempting target reset! (sc=ffff8101190f6940)
[ 2056.799821] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.137289] mptscsih: ioc0: target reset: SUCCESS (sc=ffff8101190f6940)
[ 2057.140469] mptscsih: ioc0: attempting bus reset! (sc=ffff8101190f6940)
[ 2057.140585] sd 0:0:0:0: [sda] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
[ 2057.398218] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8101190f6940)
[ 2068.601852] mptscsih: ioc0: attempting host reset! (sc=ffff8101190f6940)
[ 2068.606217] mptbase: ioc0: Initiating recovery
[ 2082.235819] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8101190f6940)
[ 2082.235932] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236054] sd 0:0:0:0: Device offlined - not ready after error recovery
[ 2082.236197] end_request: I/O error, dev sda, sector 19534911
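
To see what the timed-out command actually was, the CDB in the log can be decoded by hand. The sketch below follows my reading of the ATA PASS-THROUGH (16) layout in the SCSI/ATA Translation standard: protocol in bits 4:1 of byte 1, transfer flags in byte 2, ATA command in byte 14. The lookup tables are deliberately partial, covering only the values in this log, and the function name is invented for illustration.

```python
# Sketch: decode an ATA PASS-THROUGH (16) CDB as printed by the sd driver.
# Assumption: field positions follow the SAT layout; the name tables are
# partial and cover only the values that occur in this bug's logs.

PROTOCOLS = {4: "PIO Data-In"}
ATA_COMMANDS = {0xEC: "IDENTIFY DEVICE"}


def decode_ata_pass_through_16(cdb_hex):
    """Decode a space-separated hex CDB string into named fields."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    protocol = (cdb[1] >> 1) & 0xF
    extend = cdb[1] & 0x1
    # With EXTEND clear only the low sector-count byte is meaningful.
    sector_count = ((cdb[5] << 8) | cdb[6]) if extend else cdb[6]
    return {
        "protocol": PROTOCOLS.get(protocol, "0x%x" % protocol),
        "extend": extend,
        "ck_cond": (cdb[2] >> 5) & 0x1,
        "t_dir": "from device" if (cdb[2] >> 3) & 0x1 else "to device",
        "byt_blok": (cdb[2] >> 2) & 0x1,
        "t_length": cdb[2] & 0x3,
        "sector_count": sector_count,
        "command": ATA_COMMANDS.get(cdb[14], "0x%02x" % cdb[14]),
    }
```

For the CDB in the log above (`85 08 0e ... ec 00`) this reports a PIO Data-In transfer of one block from the device for IDENTIFY DEVICE, i.e. an ordinary read-only identification query rather than anything exotic.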
Comment 4 kashyap 2009-12-21 04:51:47 UTC
I tried the same test with the setup detailed below and it works fine for me. Can you try upgrading the firmware on the 1068 B0 card mentioned in comment #2?

I used a 1068 B0 card, and the HDDs are Western Digital SATA drives, REV: 1E01
FW version is 1.29.00.00-IE 
Card Name is SAS3442X.

The MPT driver version is 3.4.14 (the latest upstream driver). See the attachment fusion_03_04_14.tgz for quick access to the fusion driver. You may need to change some code to make it compile with your kernel.


Please give it a try and let me know your results.

The 1.29.00 firmware is available at
http://lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/combo/sas3442x-r/index.html

Thanks,
Kashyap
Comment 5 kashyap 2009-12-21 04:52:56 UTC
Created attachment 24240 [details]
latest upstream fusion driver 3.4.14
Comment 6 Tim Small 2009-12-21 12:08:50 UTC
Hi Kashyap,

Thanks for your input.  Unfortunately, I can't test on the 1068, as the machine is now in production (with SMART disabled!).

I do still have access to the 1068E and the 1064, and I will see if I can borrow another 1068.

Could you try the attached script on your test system?  It carries out I/O to the device which is under test, and seems to trigger failures much more quickly as a result.

Thanks,

Tim.
Comment 7 Tim Small 2009-12-21 12:11:31 UTC
Created attachment 24243 [details]
Script to stress-test ATA command passthrough whilst write-loading a SATA device.

This script uses dd to repeatedly write and remove a 1G zero-filled file to/from a file-system whilst executing smartctl against the associated device file.
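
The attachment itself is not reproduced here, so the following is only a rough sketch of the same idea, not the attached script: one thread keeps writing and removing a zero-filled file while the main loop repeatedly runs a pass-through query. The function names, the `fail_mask` default (based on my understanding that bits 0-2 of smartctl's exit status signal command-line, open, or command failures) and all paths are assumptions.

```python
# Sketch: write-load a filesystem while repeatedly issuing a query command.
# Assumption: this mirrors the description of the attached script, not its
# actual contents; names, defaults, and paths are illustrative only.
import os
import subprocess
import threading


def write_load(path, size_bytes, stop, chunk=1 << 20):
    """Repeatedly write and remove a zero-filled file until `stop` is set."""
    zeros = bytes(chunk)
    while not stop.is_set():
        with open(path, "wb") as f:
            written = 0
            while written < size_bytes and not stop.is_set():
                piece = zeros[: min(chunk, size_bytes - written)]
                f.write(piece)
                written += len(piece)
            f.flush()
            os.fsync(f.fileno())
        os.remove(path)


def stress(query_cmd, scratch_file, size_bytes=1 << 30,
           max_queries=None, fail_mask=0x07):
    """Run query_cmd in a loop under write load.

    Returns (number of queries run, last exit status). The loop stops at
    the first status that has any bit of fail_mask set.
    """
    stop = threading.Event()
    writer = threading.Thread(target=write_load,
                              args=(scratch_file, size_bytes, stop))
    writer.start()
    queries, status = 0, 0
    try:
        while max_queries is None or queries < max_queries:
            status = subprocess.call(query_cmd,
                                     stdout=subprocess.DEVNULL,
                                     stderr=subprocess.DEVNULL)
            queries += 1
            if status & fail_mask:
                break
    finally:
        stop.set()
        writer.join()
    return queries, status
```

A call such as `stress(["smartctl", "-a", "/dev/sg0"], "/mnt/test/zero.bin")` (both paths hypothetical) would run until the query fails. Only run something like this on a scratch device, since a failure can take the disk offline.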
Comment 8 amf 2010-01-11 11:59:39 UTC
We see this on numerous Dell hosts running the SAS6iR based on the LSISAS1068E chip. Running the stock RHEL5 driver.

It's simple to reproduce with SMART commands, but we actually see huge issues with drives dropping off cards even during heavy I/O and no SMART commands involved at all. It seems to be all the worse when the disk reallocates a bad sector. The disks are Dell-supplied and therefore 'enterprise' models capable of TLER type behaviour.

I believe the SMART command method makes it easier to reproduce what may be a problem not specifically related to SMART, but that's just my own feeling.

HTH.
Comment 9 Aaron Williams 2010-01-12 23:15:37 UTC
I am seeing similar events with my current computer and with my last computer. My setup consists of two WD Black Edition 1TB drives:

Model=WDC WD1001FALS-00J7B0, FwRev=05.00K05, SerialNo=WD-WMATV0705568
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller

I also have smartmond running to periodically query the drive. In my case I have two drives running in a mirrored configuration and this will arbitrarily kick one of the drives out of my RAID array (using md). I just spent the last few days recovering from a RAID event that killed the entire array (due to this problem I believe).
Comment 10 Tim Small 2010-01-13 12:30:47 UTC
Hi Aaron,

It's possible that this is an unrelated issue.  On one of the systems with the MPT SAS controllers, I have moved a drive from an MPT SAS channel onto an Intel 631xESB/632xESB SATA channel, and the unreliable behaviour appeared to stop.

Do you have any other drives which you can test in place of the WD drives?  Personally I have found Hitachi SATA drives to be well engineered in recent years from a SMART PoV.

If you'd like to open another bug, the script included in this bug might help you reproduce the problems.  You could also try disabling NCQ and/or using a different SATA controller (Silicon Image SiI 3132 based PCIe cards are available very inexpensively) to see if this helps.

Thanks,

Tim.
Comment 11 Brian Sullivan 2010-03-21 11:34:37 UTC
I too am running into this bug.

Here is firmware rev of the onboard LSI controller I am using:
Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
/proc/mpt/ioc0    LSI Logic SAS1068E B2     105      01160000     0

I tried the 2.6.32 kernel from Ubuntu 10.04 and then tried updating to 2.6.33 from mainline.  I then also tried updating the mptsas driver to the latest from LSI's site, v4.18.00.00.  Nothing seemed to improve the issue.

The problem, for me, is that when I read SMART info fast enough, or for long enough, the command eventually fails.  The driver then tries a task abort, a bus reset, and then a host reset.  This takes some amount of time.

The pause is what I believe causes drives to sometimes drop off the controller.  I am not sure what is to blame, but a workaround, at least, is to go into the LSI controller's BIOS and set all the timeout values to 0.  The default timeout value seems to vary depending on which 1068E card you have and which firmware is installed.  After setting all timeout values to 0, I still have the problem with ATA pass-through, but the drives no longer drop off the controller when I hit the pass-through bug.

I also have both WDC and Hitachi drives.  Both behave the same.

BTW, here are the errors I get when running hddtemp, basically the same as the OP's:
[156291.890023] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156291.890028] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156293.532938] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[156293.533080] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531268] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156303.531274] sd 7:0:12:0: [sdo] CDB: Test Unit Ready: 00 00 00 00 00 00
[156303.531283] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156303.531299] mptscsih: ioc0: attempting target reset! (sc=ffff880369e51000)
[156303.531302] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156305.050176] mptscsih: ioc0: target reset: FAILED (sc=ffff880369e51000)
[156305.050185] mptscsih: ioc0: attempting bus reset! (sc=ffff880369e51000)
[156305.050189] sd 7:0:12:0: [sdo] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[156309.553552] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880369e51000)
[156329.560014] mptscsih: ioc0: attempting task abort! (sc=ffff880369e51000)
[156329.560020] sd 7:0:12:0: [sdo] CDB: Test Unit Ready: 00 00 00 00 00 00
[156331.297762] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
[156331.297903] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880369e51000)
[156331.297907] mptscsih: ioc0: attempting host reset! (sc=ffff880369e51000)
[156342.470033] mptscsih: ioc0: host reset: SUCCESS (sc=ffff880369e51000)
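
Logs like this one are easier to compare across machines once the error-recovery steps are pulled out of the noise. The sketch below extracts each "task abort / target reset / bus reset / host reset" result line and reports how far the escalation went. The regular expression is written against the mptscsih message format visible in this bug; the function names are made up.

```python
# Sketch: summarise SCSI error-recovery escalation from mptscsih messages.
# Assumption: message wording matches the logs quoted in this thread.
import re

LEVELS = ["task abort", "target reset", "bus reset", "host reset"]

RESULT_RE = re.compile(
    r"mptscsih: \w+: (task abort|target reset|bus reset|host reset): "
    r"(SUCCESS|FAILED) \(sc=([0-9a-f]+)\)"
)


def recovery_steps(log_lines):
    """Return (step, outcome, sc) tuples in the order they were logged."""
    steps = []
    for line in log_lines:
        match = RESULT_RE.search(line)
        if match:
            steps.append(match.groups())
    return steps


def highest_level(steps):
    """Return the most drastic recovery step reached, or None."""
    if not steps:
        return None
    return max((step for step, _, _ in steps), key=LEVELS.index)
```

Fed the lines above, `highest_level(recovery_steps(lines))` gives `"host reset"`, which is the case where drives are most likely to be dropped.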
Comment 12 bpkroth 2010-03-30 14:52:52 UTC
I'm also seeing the problem, also on a Dell server.  I'm running Debian Squeeze.  I didn't see this issue with 2.6.29.

# cat /proc/mpt/summary 
ioc0: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=32

# uname -a
Linux oberon 2.6.32-3-amd64 #1 SMP Wed Feb 24 18:07:42 UTC 2010 x86_64 GNU/Linux

# dmesg | egrep -i 'scsi 0.*(wdc|samsung)'
[   18.427778] scsi 0:0:0:0: Direct-Access     ATA      WDC WD1602ABKS-1 3B04 PQ: 0 ANSI: 5
[   18.432650] scsi 0:0:1:0: Direct-Access     ATA      WDC WD1602ABKS-1 3B04 PQ: 0 ANSI: 5
[   18.448696] scsi 0:0:2:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.464756] scsi 0:0:3:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.480824] scsi 0:0:4:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5
[   18.496910] scsi 0:0:5:0: Direct-Access     ATA      SAMSUNG HD103UJ  1118 PQ: 0 ANSI: 5

Please let me know if you need any additional information.  I would like to be able to know when my drives are starting to have issues so I can replace them before it causes major problems.  Right now I can't issue SMART queries without it throwing the disks from the md.

Thanks,
Brian
Comment 13 Brian Sullivan 2010-04-29 07:25:25 UTC
This issue seems to get no attention.  I would be happy to go buy some other HBA controller, but the 1068E is used everywhere and I can't figure out what to buy as an alternative.

Is this a firmware bug?  There are a billion firmwares for the 1068E; I'm stuck with an onboard controller and am not sure if it's possible to update the firmware.  Would it be worth buying an add-in card to be able to try different firmwares?

Does LSI care about this?  If not, fine.  If so, is the problem that you cannot reproduce it?

If you google a bit you can find many people running into this bug.

Argh, I'm so frustrated.
Comment 14 amf 2010-04-29 09:27:19 UTC
This isn't going to solve your MPTSAS problem, but may help you choose to move on.

I've found LSI to be fairly unresponsive on this matter, and the same goes for my vendor (Dell, who the moment you mention Linux just aren't interested).

LSI have now brought out their new range of 6Gb/sec based cards which use the MPT2 system and I therefore doubt they will be progressing the development of MPT any further.

The good news is that if you get hold of an MPT2 card (LSISAS2008 chipset) you should (at least by my testing) be able to migrate your RAID to this card using the BIOS. The MPT2 drivers seem a whole lot better (I understand they're a complete re-write, which speaks volumes about how good the MPT driver was, IMHO). I've not been able to break them so far.

Performance also seems to be quite improved.

HTH.
Comment 15 Ryan Kuester 2010-05-03 22:22:06 UTC
Take a look at my diagnosis here:
http://lkml.org/lkml/2010/4/26/335

It includes a rough-draft patch.  I'd be very interested in hearing reports of whether this fixes this smartctl issue in others' environments as it has in mine.

The reason I haven't proposed it as a real patch is that there's probably a better location for that code.  Where I have it, it'll apply to every SCSI host using the MPT Fusion framework, and if it's a hardware bug, perhaps we want it to apply only to this specific LSI 1068 controller.

That said, I expect most requests hitting the device are already well-aligned, so this wouldn't affect many requests even if it did apply to a broader-than-necessary collection of hardware.
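
The patch itself is at the link above and is not reproduced in this bug, so the following is only a generic illustration of what "well-aligned" means for a request buffer, not the patch's logic. It checks whether a buffer's start address and length are both multiples of a power-of-two boundary. The function name is hypothetical, and which boundary actually matters for this controller is not stated here; any value passed in is an assumption.

```python
# Sketch: generic power-of-two alignment check for a DMA buffer.
# Assumption: purely illustrative; the boundary relevant to the 1068 is
# not given in this thread, so callers must supply their own value.

def is_aligned(addr, length, boundary):
    """True if both addr and length are multiples of `boundary`."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    mask = boundary - 1
    return (addr & mask) == 0 and (length & mask) == 0
```

For example, `is_aligned(0x1000, 512, 512)` is true while `is_aligned(0x1004, 512, 512)` is not. A buffer failing such a check is the kind of request that a workaround would need to bounce through an aligned copy.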
Comment 16 kashyap 2010-05-04 05:16:28 UTC
(In reply to comment #15)
> Take a look at my diagnosis here:
> http://lkml.org/lkml/2010/4/26/335
> 
> It includes a rough-draft patch.  I'd be very interested in hearing reports of
> whether this fixes this smartctl issue in others' environments as it has in
> mine.
I am doing my own analysis and am also in touch with our firmware team to understand this issue.  Using Ryan's diagnosis tool, I can see that the LSI controller is unable to DMA for a particular alignment. I will post an update ASAP.
Thanks, Kashyap
> 
> The reason I haven't proposed it as a real patch is that there's probably a
> better location for that code.  Where I have it, it'll apply to every SCSI host
> using the MPT Fusion framework, and if it's a hardware bug, perhaps we want it
> to apply only to this specific LSI 1068 controller.
> 
> That said, I expect most requests hitting the device are already well-aligned,
> so this wouldn't affect many requests even if it did apply to a
> broader-than-necessary collection of hardware.
Comment 17 Brian Sullivan 2010-05-04 09:15:45 UTC
No way, the issue is fixed!  After waiting for months I lost hope and ordered an
mpt2sas controller.  The next day the issue is fixed.  Argh! lol :)

Without the patch, running hddtemp in a loop on 15 drives would last maybe 5-10
seconds before the controller would crap out.

With the patch it's been going for at least 20 minutes now without issue.  I put
a load on the controller too (~600MB/sec reads total) & it's still stable.  I'll
let this go overnight but really, this issue is fixed for me.



Comment 18 Andrew Dunn 2010-05-05 10:35:17 UTC
I anxiously await confirmation of this patch. This issue has been plaguing me for quite a while. Just to verify: the mpt2sas controllers don't have problems with this? I was thinking of trying to get an AOC-USAS2-L8i (http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I).
Comment 19 Brian Sullivan 2010-05-07 04:55:40 UTC
So... patch seems to fix ATA command pass-through problems.  I let it go a
day spamming hddtemp in a loop on all the drives, while at same time reading
600MB/sec or so.  No problem.  Again, without patch, it would never manage
more than 10 seconds spamming all the drives at once.

IMO it seems like the ATA-Passthrough bug is fixed by this patch.  I cannot
cause a failure using ATA-Passthrough.

All is not good news however....

With this bug fixed I was going to start expanding a md array one disk at a
time.  Unfortunately sooner or later the controller seems to crap out.  I
don't know what is at fault, but the mptsas driver's method of just blowing
up and blocking processes forever sucks.

I've tried this 4 times now and each time I see some read errors, then task
resets fail and eventually it gets to point it just keeps spamming 'sometask
has been blocked for 120s'.  I WISH this was a bad drive, but even if it was
a bad drive it shouldn't take down the system like this, but just to be sure
I've been swapping a few drives and it doesn't really make a difference.
Each time a different drive starts the fail sequence.  I'm guessing it's
unlikely I have a pile of bad drives.

I do have 16 drives all attached via a HP SAS Expander, perhaps the expander
is at fault.  I also have a backup Chenbro Expander I could try...  but I'm
too lazy to at the moment.  I could also try ditching the Expanders to see
if that is the cause of these problems, but again too lazy at the moment.
Monday a mpt2sas expander is being delivered, I think my best bet is to
ditch this mptsas driver altogether.  If that doesn't fix problems I'll
then go back and try swapping Expanders and whatnot.

Anyways, TL;DR:  ATA-PassThrough bug is fixed, mptsas still blows.

Here is the log from the current failures; I'm fairly sure this is unrelated to the entire
ATA-Passthrough problem:
May  6 17:52:09 nine kernel: [18838.207805] md: recovery of RAID array md127
May  6 17:52:09 nine kernel: [18838.207815] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
May  6 17:52:09 nine kernel: [18838.207818] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  6 17:52:09 nine kernel: [18838.207831] md: using 128k window, over a total of 1953510784 blocks.
May  6 17:52:09 nine kernel: [18838.207833] md: resuming recovery of md127 from checkpoint.
May  6 20:51:21 nine kernel: [29589.980035] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4900)
May  6 20:51:21 nine kernel: [29589.980041] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:28 nine kernel: [29596.503483] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
May  6 20:51:28 nine kernel: [29596.503747] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4900)
May  6 20:51:28 nine kernel: [29597.253319] mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay Retry}, SubCode(0x0000)
May  6 20:51:28 nine kernel: [29597.253329] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4e00)
May  6 20:51:28 nine kernel: [29597.253332] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e fc 00 00 01 00 00
May  6 20:51:28 nine kernel: [29597.253341] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4e00)
May  6 20:51:29 nine kernel: [29597.753599] mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
May  6 20:51:29 nine kernel: [29597.753608] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f4c00)
May  6 20:51:29 nine kernel: [29597.753610] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 02 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753619] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f4c00)
May  6 20:51:29 nine kernel: [29597.753622] mptscsih: ioc0: attempting task abort! (sc=ffff8803318f5b00)
May  6 20:51:29 nine kernel: [29597.753624] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 0e 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753633] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8803318f5b00)
May  6 20:51:29 nine kernel: [29597.753636] mptscsih: ioc0: attempting task abort! (sc=ffff880331e3d900)
May  6 20:51:29 nine kernel: [29597.753638] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 00 00 00 08 00
May  6 20:51:29 nine kernel: [29597.753646] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880331e3d900)
May  6 20:51:29 nine kernel: [29597.753649] mptscsih: ioc0: attempting task abort! (sc=ffff880331e3d400)
May  6 20:51:29 nine kernel: [29597.753651] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 08 00 00 68 00
May  6 20:51:29 nine kernel: [29597.753659] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880331e3d400)
May  6 20:51:29 nine kernel: [29597.753671] mptscsih: ioc0: attempting target reset! (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753673] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753685] mptscsih: ioc0: target reset: FAILED (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753693] mptscsih: ioc0: attempting bus reset! (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753695] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
May  6 20:51:29 nine kernel: [29597.753712] mptscsih: ioc0: bus reset: FAILED (sc=ffff8803318f4900)
May  6 20:51:29 nine kernel: [29597.753715] mptscsih: ioc0: attempting host reset! (sc=ffff8803318f4900)
May  6 20:52:04 nine kernel: [29632.830020] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8803318f4900)
May  6 20:52:14 nine kernel: [29642.840021] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840024] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840026] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840028] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840030] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840032] sd 6:0:5:0: Device offlined - not ready after error recovery
May  6 20:52:14 nine kernel: [29642.840076] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.840082] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.840087] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e f6 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.840112] raid5:md127: read error not correctable (sector 1284435456 on sdh2).
May  6 20:52:14 nine kernel: [29642.840129] raid5:md127: read error not correctable (sector 1284435464 on sdh2).
May  6 20:52:14 nine kernel: [29642.840133] raid5:md127: read error not correctable (sector 1284435472 on sdh2).
May  6 20:52:14 nine kernel: [29642.840136] raid5:md127: read error not correctable (sector 1284435480 on sdh2).
May  6 20:52:14 nine kernel: [29642.840139] raid5:md127: read error not correctable (sector 1284435488 on sdh2).
May  6 20:52:14 nine kernel: [29642.840143] raid5:md127: read error not correctable (sector 1284435496 on sdh2).
May  6 20:52:14 nine kernel: [29642.840149] raid5:md127: read error not correctable (sector 1284435504 on sdh2).
May  6 20:52:14 nine kernel: [29642.840196] raid5:md127: read error not correctable (sector 1284435512 on sdh2).
May  6 20:52:14 nine kernel: [29642.840199] raid5:md127: read error not correctable (sector 1284435520 on sdh2).
May  6 20:52:14 nine kernel: [29642.840202] raid5:md127: read error not correctable (sector 1284435528 on sdh2).
May  6 20:52:14 nine kernel: [29642.847676] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.847678] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847681] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8e fc 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847745] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.847746] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847749] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 02 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847812] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.847813] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847816] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 0e 00 00 01 00 00
May  6 20:52:14 nine kernel: [29642.847871] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.847873] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847875] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 00 00 00 08 00
May  6 20:52:14 nine kernel: [29642.847907] sd 6:0:5:0: [sdh] Unhandled error code
May  6 20:52:14 nine kernel: [29642.847908] sd 6:0:5:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May  6 20:52:14 nine kernel: [29642.847911] sd 6:0:5:0: [sdh] CDB: Read(10): 28 00 4c 8f 14 08 00 00 68 00
May  6 20:52:19 nine kernel: [29647.840019] mptbase: ioc0: WARNING - Issuing Reset from mpt_config!!
May  6 20:52:50 nine kernel: [29678.961260] ------------[ cut here ]------------
May  6 20:52:50 nine kernel: [29678.961268] WARNING: at /home/kernel-ppa/mainline/build/kernel/workqueue.c:485 flush_cpu_workqueue+0x8c/0x90()
May  6 20:52:50 nine kernel: [29678.961271] Hardware name: empty
May  6 20:52:50 nine kernel: [29678.961273] Modules linked in: btrfs
zlib_deflate crc32c libcrc32c xfs exportfs mptctl binfmt_misc ppdev
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge
stp kvm_intel kvm snd_hda_codec_realtek snd_hda_intel snd_hda_codec
snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device
psmouse serio_raw ioatdma snd i5100_edac nvidia(P) dca soundcore
snd_page_alloc edac_core lp parport raid10 raid456 async_raid6_recov
async_pq raid6_pq async_xor ses enclosure xor async_memcpy async_tx raid1
raid0 multipath linear ahci e1000e mptsas mptscsih mptbase
scsi_transport_sas
May  6 20:52:50 nine kernel: [29678.961333] Pid: 321, comm: mpt/0 Tainted:
P           2.6.34-020634rc6-generic #020634rc6
May  6 20:52:50 nine kernel: [29678.961336] Call Trace:
May  6 20:52:50 nine kernel: [29678.961341]  [<ffffffff8107a9ac>] ?
flush_cpu_workqueue+0x8c/0x90
May  6 20:52:50 nine kernel: [29678.961346]  [<ffffffff8105f1ec>]
warn_slowpath_common+0x8c/0xc0
May  6 20:52:50 nine kernel: [29678.961350]  [<ffffffff8105f234>]
warn_slowpath_null+0x14/0x20
May  6 20:52:50 nine kernel: [29678.961353]  [<ffffffff8107a9ac>]
flush_cpu_workqueue+0x8c/0x90
May  6 20:52:50 nine kernel: [29678.961357]  [<ffffffff8106f981>] ?
try_to_del_timer_sync+0x51/0xe0
May  6 20:52:50 nine kernel: [29678.961360]  [<ffffffff8107aa74>]
flush_workqueue+0x44/0x70
May  6 20:52:50 nine kernel: [29678.961373]  [<ffffffffa004531c>]
mptsas_cleanup_fw_event_q+0x12c/0x160 [mptsas]
May  6 20:52:50 nine kernel: [29678.961378]  [<ffffffffa0048434>]
mptsas_ioc_reset+0x94/0x130 [mptsas]
May  6 20:52:50 nine kernel: [29678.961383]  [<ffffffff81033d39>] ?
default_spin_lock_flags+0x9/0x10
May  6 20:52:50 nine kernel: [29678.961389]  [<ffffffffa001222d>]
mpt_signal_reset+0x4d/0x60 [mptbase]
May  6 20:52:50 nine kernel: [29678.961394]  [<ffffffffa0018eb6>]
mpt_SoftResetHandler+0x1b6/0x3c0 [mptbase]
May  6 20:52:50 nine kernel: [29678.961399]  [<ffffffffa001bee7>]
mpt_config+0x307/0x640 [mptbase]
May  6 20:52:50 nine kernel: [29678.961404]  [<ffffffffa004c6f0>] ?
mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961409]  [<ffffffffa001d0b1>]
mpt_findImVolumes+0xb1/0x600 [mptbase]
May  6 20:52:50 nine kernel: [29678.961415]  [<ffffffffa004c6f0>] ?
mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961419]  [<ffffffffa004cd88>]
mptsas_firmware_event_work+0x698/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961424]  [<ffffffff8100985b>] ?
__switch_to+0xbb/0x2e0
May  6 20:52:50 nine kernel: [29678.961428]  [<ffffffff8105118e>] ?
put_prev_entity+0x2e/0x80
May  6 20:52:50 nine kernel: [29678.961430]  [<ffffffff81051af6>] ?
finish_task_switch+0x66/0xd0
May  6 20:52:50 nine kernel: [29678.961435]  [<ffffffffa004c6f0>] ?
mptsas_firmware_event_work+0x0/0xe80 [mptsas]
May  6 20:52:50 nine kernel: [29678.961438]  [<ffffffff8107a10c>]
run_workqueue+0xbc/0x190
May  6 20:52:50 nine kernel: [29678.961441]  [<ffffffff8107a65b>]
worker_thread+0x9b/0x100
May  6 20:52:50 nine kernel: [29678.961444]  [<ffffffff8107edc0>] ?
autoremove_wake_function+0x0/0x40
May  6 20:52:50 nine kernel: [29678.961447]  [<ffffffff8107a5c0>] ?
worker_thread+0x0/0x100
May  6 20:52:50 nine kernel: [29678.961450]  [<ffffffff8107e9e6>]
kthread+0x96/0xa0
May  6 20:52:50 nine kernel: [29678.961453]  [<ffffffff8100be64>]
kernel_thread_helper+0x4/0x10
May  6 20:52:50 nine kernel: [29678.961456]  [<ffffffff8107e950>] ?
kthread+0x0/0xa0
May  6 20:52:50 nine kernel: [29678.961458]  [<ffffffff8100be60>] ?
kernel_thread_helper+0x0/0x10
May  6 20:52:50 nine kernel: [29678.961460] ---[ end trace 5b0b1793526edc2a
]---
May  6 20:53:20 nine kernel: [29709.040090] mptscsih: ioc0: attempting task
abort! (sc=ffff880331812400)
May  6 20:53:20 nine kernel: [29709.040093] sd 6:0:15:0: [sdr] CDB:
Write(10): 2a 00 00 00 00 47 00 00 02 00
May  6 20:53:50 nine kernel: [29739.040011] mptscsih: ioc0: WARNING -
Issuing Reset from mptscsih_IssueTaskMgmt!!
May  6 20:54:13 nine kernel: [29761.700122] md127_resync  D
ffff880001f55740     0  6733      2 0x00000000
May  6 20:54:13 nine kernel: [29761.700130]  ffff8803318f3b90
0000000000000046 ffff8803318f3b50 ffff8803318f3fd8
May  6 20:54:13 nine kernel: [29761.700134]  ffff8803318eae20
0000000000015740 0000000000015740 ffff8803318f3fd8
May  6 20:54:13 nine kernel: [29761.700137]  0000000000015740
ffff8803318f3fd8 0000000000015740 ffff8803318eae20
May  6 20:54:13 nine kernel: [29761.700141] Call Trace:
May  6 20:54:13 nine kernel: [29761.700160]  [<ffffffffa00f20e2>]
get_active_stripe+0x232/0x340 [raid456]
May  6 20:54:13 nine kernel: [29761.700167]  [<ffffffff810507e0>] ?
default_wake_function+0x0/0x20
May  6 20:54:13 nine kernel: [29761.700172]  [<ffffffffa00f49ad>]
sync_request+0x26d/0x2d0 [raid456]
May  6 20:54:13 nine kernel: [29761.700176]  [<ffffffffa00f1e8e>] ?
raid5_unplug_device+0x7e/0xa0 [raid456]


Comment 20 kashyap 2010-05-07 08:01:25 UTC
(In reply to comment #19)
> So... patch seems to fix ATA command pass-through problems.  I let it go a
> day spamming hddtemp in a loop on all the drives, while at same time reading
> 600MB/sec or so.  No problem.  Again, without patch, it would never manage
> more than 10 seconds spamming all the drives at once.
> 
> IMO it seems like the ATA-Passthrough bug is fixed by this patch.  I cannot
> cause a failure using ATA-Passthrough.
> 
> All is not good news however....
> 
> With this bug fixed I was going to start expanding a md array one disk at a
> time.  Unfortunately sooner or later the controller seems to crap out.  I
> don't know what is at fault, but the mptsas driver's method of just blowing
> up and blocking processes forever sucks.
> 
> I've tried this 4 times now and each time I see some read errors, then task
> resets fail and eventually it gets to point it just keeps spamming 'sometask
> has been blocked for 120s'.  I WISH this was a bad drive, but even if it was
> a bad drive it shouldn't take down the system like this, but just to be sure
> I've been swapping a few drives and it doesn't really make a difference.
> Each time a different drive starts the fail sequence.  I'm guessing it's
> unlikely I have a pile of bad drives.
> 
> I do have 16 drives all attached via a HP SAS Expander, perhaps the expander
> is at fault.  I also have a backup Chenbro Expander I could try...  but I'm
> too lazy to at the moment.  I could also try ditching the Expanders to see
> if that is the cause of these problems, but again too lazy at the moment.
> Monday a mpt2sas expander is being delivered, I think my best bet is to
> ditch this mptsas driver all together.  If that doesn't fix problems I'll
> then go back and try swapping Expanders and whatnot.
> 
> Anyways, TL;DR:  ATA-PassThrough bug is fixed, mptsas still blows.

The patch that sets the DMA boundary merely avoids the condition that triggers this issue. LSI Gen-1 controllers do not have a 512-byte DMA boundary limitation. I have started an internal discussion with our firmware engineer and will post our findings as soon as anything important turns up.
> 
> Here log from current failures, fairly sure this is unrelated to the entire
> ATA-Passthrough problem:

For now you can continue with the patch for the DMA boundary alignment issue.
For this new issue, please provide the complete /var/log/messages with debug turned on:

echo 0x8188 > /sys/module/mptbase/parameters/mpt_debug_level

Thanks,
Kashyap
Comment 21 Brian Sullivan 2010-05-12 08:50:32 UTC
So apparently this bug affects mpt2sas too???

Parts of dmesg:
[    4.460541] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00),
ChipRevision(0x02), BiosVersion(07.01.00.00)
[    4.460543] mpt2sas0: Protocol=(Initiator,Target),
Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)
[    4.460615] mpt2sas0: sending port enable !!
[   33.760036] mpt2sas0: sending diag reset !!
[   34.661882] eth0: no IPv6 routers present
[   34.710015] mpt2sas0: diag reset: SUCCESS
[   34.714397] mpt2sas0: attempting task abort! scmd(ffff88036f74bf00)
[   34.714404] sd 0:0:3:0: [sdg] CDB: Inquiry: 12 01 80 00 fe 00
[   34.714441] mpt2sas0: task abort: SUCCESS scmd(ffff88036f74bf00)
[   35.290527] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00),
ChipRevision(0x02), BiosVersion(07.01.00.00)
[   35.290532] mpt2sas0: Protocol=(Initiator,Target),
Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)
[   35.290618] mpt2sas0: sending port enable !!
[   44.711262] mpt2sas0: attempting task abort! scmd(ffff88036f74bf00)
[   44.711264] sd 0:0:3:0: [sdg] CDB: Test Unit Ready: 00 00 00 00 00 00
[   44.711272] mpt2sas0: task abort: SUCCESS scmd(ffff88036f74bf00)
[   44.711274] mpt2sas0: attempting task abort! scmd(ffff88036e456a00)
[   44.711276] sd 0:0:9:0: [sdm] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   44.711285] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456a00)
[   46.090185] mpt2sas0: port enable: SUCCESS
[   46.090299] mpt2sas0: _scsih_search_responding_sas_devices
[   46.091172] scsi target0:0:0: handle(0x000a),
sas_addr(0x50014380048874cc), enclosure logical id(0x50014380048874e5),
slot(47)
[   46.091259] scsi target0:0:1: handle(0x000b),
sas_addr(0x50014380048874cd), enclosure logical id(0x50014380048874e5),
slot(46)
[   46.091350] scsi target0:0:2: handle(0x000c),
sas_addr(0x50014380048874ce), enclosure logical id(0x50014380048874e5),
slot(45)
[   46.091437] scsi target0:0:3: handle(0x000d),
sas_addr(0x50014380048874cf), enclosure logical id(0x50014380048874e5),
slot(44)
[   46.091521] scsi target0:0:4: handle(0x000e),
sas_addr(0x50014380048874d0), enclosure logical id(0x50014380048874e5),
slot(51)
[   46.091612] scsi target0:0:5: handle(0x000f),
sas_addr(0x50014380048874d1), enclosure logical id(0x50014380048874e5),
slot(50)
[   46.091702] scsi target0:0:6: handle(0x0010),
sas_addr(0x50014380048874d2), enclosure logical id(0x50014380048874e5),
slot(49)
[   46.091789] scsi target0:0:7: handle(0x0011),
sas_addr(0x50014380048874d3), enclosure logical id(0x50014380048874e5),
slot(48)
[   46.091872] scsi target0:0:8: handle(0x0012),
sas_addr(0x50014380048874d4), enclosure logical id(0x50014380048874e5),
slot(55)
[   46.091964] scsi target0:0:9: handle(0x0013),
sas_addr(0x50014380048874d5), enclosure logical id(0x50014380048874e5),
slot(54)
[   46.092048] scsi target0:0:10: handle(0x0014),
sas_addr(0x50014380048874d6), enclosure logical id(0x50014380048874e5),
slot(53)
[   46.092134] scsi target0:0:11: handle(0x0015),
sas_addr(0x50014380048874d7), enclosure logical id(0x50014380048874e5),
slot(52)
[   46.092218] scsi target0:0:12: handle(0x0016),
sas_addr(0x50014380048874e0), enclosure logical id(0x0000000000000000),
slot(0)
[   46.092306] scsi target0:0:13: handle(0x0017),
sas_addr(0x50014380048874e1), enclosure logical id(0x0000000000000000),
slot(0)
[   46.092401] scsi target0:0:14: handle(0x0018),
sas_addr(0x50014380048874e2), enclosure logical id(0x0000000000000000),
slot(0)
[   46.092488] scsi target0:0:15: handle(0x0019),
sas_addr(0x50014380048874e3), enclosure logical id(0x0000000000000000),
slot(0)
[   46.092572] scsi target0:0:16: handle(0x001a),
sas_addr(0x50014380048874e5), enclosure logical id(0x50014380048874e5),
slot(0)
[   46.092658] mpt2sas0: _scsih_search_responding_raid_devices
[   46.092660] mpt2sas0: _scsih_search_responding_expanders
[   46.092753]  expander present: handle(0x0009),
sas_addr(0x50014380048874e6)
[   54.711261] mpt2sas0: attempting task abort! scmd(ffff88036e456a00)
[   54.711265] sd 0:0:9:0: [sdm] CDB: Test Unit Ready: 00 00 00 00 00 00
[   54.711275] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456a00)
[   54.711277] mpt2sas0: attempting task abort! scmd(ffff88036f02fc00)
[   54.711279] sd 0:0:14:0: [sdr] CDB: ATA command pass through(12)/Blank:
a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711290] mpt2sas0: task abort: SUCCESS scmd(ffff88036f02fc00)
[   54.711383] mpt2sas0: attempting task abort! scmd(ffff88036f72ed00)
[   54.711387] sd 0:0:15:0: [sds] CDB: ATA command pass through(12)/Blank:
a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711401] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72ed00)
[   54.711479] mpt2sas0: attempting task abort! scmd(ffff88036f72fe00)
[   54.711487] sd 0:0:2:0: [sdf] CDB: Inquiry: 12 00 00 00 fe 00
[   54.711495] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72fe00)
[   54.711566] mpt2sas0: attempting task abort! scmd(ffff88036cd99300)
[   54.711570] sd 0:0:5:0: [sdi] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.711585] mpt2sas0: task abort: SUCCESS scmd(ffff88036cd99300)
[   54.711651] mpt2sas0: attempting task abort! scmd(ffff88036cd99900)
[   54.711654] sd 0:0:7:0: [sdk] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.711664] mpt2sas0: task abort: SUCCESS scmd(ffff88036cd99900)
[   54.711781] mpt2sas0: attempting task abort! scmd(ffff8803721f9000)
[   54.711784] sd 0:0:12:0: [sdp] CDB: ATA command pass through(12)/Blank:
a1 08 2e 00 01 00 00 00 00 ec 00 00
[   54.711794] mpt2sas0: task abort: SUCCESS scmd(ffff8803721f9000)
[   54.711867] mpt2sas0: attempting task abort! scmd(ffff88036f72fc00)
[   54.711871] sd 0:0:0:0: [sdd] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.711891] mpt2sas0: task abort: SUCCESS scmd(ffff88036f72fc00)
[   54.711981] mpt2sas0: attempting task abort! scmd(ffff88036e456b00)
[   54.711986] sd 0:0:4:0: [sdh] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.712030] mpt2sas0: task abort: SUCCESS scmd(ffff88036e456b00)
[   54.712097] mpt2sas0: attempting task abort! scmd(ffff88036cdd0d00)
[   54.712100] sd 0:0:6:0: [sdj] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.712110] mpt2sas0: task abort: SUCCESS scmd(ffff88036cdd0d00)
[   54.712176] mpt2sas0: attempting task abort! scmd(ffff88036f02ef00)
[   54.712181] sd 0:0:8:0: [sdl] CDB: ATA command pass through(12)/Blank: a1
08 2e 00 01 00 00 00 00 ec 00 00
[   54.712230] mpt2sas0: task abort: SUCCESS scmd(ffff88036f02ef00)
[   54.712310] mpt2sas0: attempting task abort! scmd(ffff88036cd99a00)
[   54.712313] sd 0:0:10:0: [sdn] CDB: ATA command pass through(12)/Blank:
a1 08 2e 00 01 00 00 00 00 ec 00 00
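
To see what the aborted pass-through commands above actually carry, the CDB bytes can be decoded. This is a rough sketch, assuming the ATA PASS-THROUGH (12) and (16) layouts from the SCSI/ATA Translation (SAT) standard: protocol in bits 4:1 of byte 1, CK_COND and T_DIR flags in byte 2, and the ATA command in byte 9 (12-byte form) or byte 14 (16-byte form). The function name and the lookup tables are illustrative and deliberately incomplete.

```python
# Partial lookup tables; values not listed are reported numerically.
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}
ATA_COMMANDS = {0xEC: "IDENTIFY DEVICE", 0xB0: "SMART"}


def decode_ata_passthrough(cdb_hex):
    """Decode an ATA PASS-THROUGH (12) or (16) CDB from space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) == 12 and cdb[0] == 0xA1:
        command = cdb[9]       # ATA command register, 12-byte form
    elif len(cdb) == 16 and cdb[0] == 0x85:
        command = cdb[14]      # ATA command register, 16-byte form
    else:
        raise ValueError("not an ATA PASS-THROUGH (12) or (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F
    flags = cdb[2]
    return {
        "protocol": PROTOCOLS.get(protocol, "protocol %d" % protocol),
        "ck_cond": bool(flags & 0x20),      # request ATA status via sense data
        "from_device": bool(flags & 0x08),  # T_DIR: data flows device -> host
        "command": ATA_COMMANDS.get(command, "0x%02x" % command),
    }


if __name__ == "__main__":
    print(decode_ata_passthrough("a1 08 2e 00 01 00 00 00 00 ec 00 00"))
```

Under these assumptions, the 12-byte CDBs in the dmesg above decode as PIO Data-In IDENTIFY DEVICE (0xEC) commands with CK_COND set.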

Spam hddtemp on drives and bam:
[ 1161.151577] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.151580] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.151582] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.151948] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.151951] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.151953] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.152313] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.152316] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.152318] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.152684] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.152688] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.152690] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153054] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153058] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153060] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153418] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153420] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153422] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.153787] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1161.153790] mpt2sas0: attempting target reset! scmd(ffff880342884200)
[ 1161.153792] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1161.154151] mpt2sas0: target reset: SUCCESS scmd(ffff880342884200)
[ 1171.151888] mpt2sas0: attempting task abort! scmd(ffff880342884200)
[ 1171.151892] sd 0:0:17:0: [sdm] CDB: Test Unit Ready: 00 00 00 00 00 00
[ 1171.151902] mpt2sas0: task abort: SUCCESS scmd(ffff880342884200)
[ 1171.151906] mpt2sas0: attempting host reset! scmd(ffff880342884200)
[ 1171.151908] sd 0:0:17:0: [sdm] CDB: ATA command pass through(16): 85 08
2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
[ 1171.151923] mpt2sas0: sending diag reset !!
[ 1172.110009] mpt2sas0: diag reset: SUCCESS
[ 1172.690466] mpt2sas0: LSISAS2008: FWVersion(02.00.50.00),
ChipRevision(0x02), BiosVersion(07.01.00.00)
[ 1172.690469] mpt2sas0: Protocol=(Initiator,Target),
Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)
[ 1172.690536] mpt2sas0: sending port enable !!
[ 1181.730641] mpt2sas0: port enable: SUCCESS
[ 1181.730759] mpt2sas0: _scsih_search_responding_sas_devices
[ 1181.731611] scsi target0:0:0: handle(0x000a),
sas_addr(0x50014380048874cc), enclosure logical id(0x50014380048874e5),
slot(47)
[ 1181.731698] scsi target0:0:1: handle(0x000b),
sas_addr(0x50014380048874cd), enclosure logical id(0x50014380048874e5),
slot(46)
[ 1181.731782] scsi target0:0:2: handle(0x000c),
sas_addr(0x50014380048874ce), enclosure logical id(0x50014380048874e5),
slot(45)
[ 1181.731865] scsi target0:0:3: handle(0x000d),
sas_addr(0x50014380048874cf), enclosure logical id(0x50014380048874e5),
slot(44)
[ 1181.731945] scsi target0:0:4: handle(0x000e),
sas_addr(0x50014380048874d0), enclosure logical id(0x50014380048874e5),
slot(51)
[ 1181.732034] scsi target0:0:5: handle(0x000f),
sas_addr(0x50014380048874d1), enclosure logical id(0x50014380048874e5),
slot(50)
[ 1181.732118] scsi target0:0:6: handle(0x0010),
sas_addr(0x50014380048874d2), enclosure logical id(0x50014380048874e5),
slot(49)
[ 1181.732203] scsi target0:0:7: handle(0x0011),
sas_addr(0x50014380048874d3), enclosure logical id(0x50014380048874e5),
slot(48)
[ 1181.732286] scsi target0:0:8: handle(0x0012),
sas_addr(0x50014380048874d4), enclosure logical id(0x50014380048874e5),
slot(55)
[ 1181.732371] scsi target0:0:17: handle(0x0013),
sas_addr(0x50014380048874d5), enclosure logical id(0x50014380048874e5),
slot(54)
[ 1181.732454] scsi target0:0:10: handle(0x0014),
sas_addr(0x50014380048874d6), enclosure logical id(0x50014380048874e5),
slot(53)
[ 1181.732538] scsi target0:0:11: handle(0x0015),
sas_addr(0x50014380048874d7), enclosure logical id(0x50014380048874e5),
slot(52)
[ 1181.732621] scsi target0:0:12: handle(0x0016),
sas_addr(0x50014380048874e0), enclosure logical id(0x0000000000000000),
slot(0)
[ 1181.732704] scsi target0:0:13: handle(0x0017),
sas_addr(0x50014380048874e1), enclosure logical id(0x0000000000000000),
slot(0)
[ 1181.732788] scsi target0:0:14: handle(0x0018),
sas_addr(0x50014380048874e2), enclosure logical id(0x0000000000000000),
slot(0)
[ 1181.732870] scsi target0:0:15: handle(0x0019),
sas_addr(0x50014380048874e3), enclosure logical id(0x0000000000000000),
slot(0)
[ 1181.732954] scsi target0:0:16: handle(0x001a),
sas_addr(0x50014380048874e5), enclosure logical id(0x50014380048874e5),
slot(0)
[ 1181.733043] mpt2sas0: _scsih_search_responding_raid_devices
[ 1181.733046] mpt2sas0: _scsih_search_responding_expanders
[ 1181.733138]  expander present: handle(0x0009),
sas_addr(0x50014380048874e6)
[ 1181.733220] mpt2sas0: host reset: SUCCESS scmd(ffff880342884200)

The drives did not drop off this time, though I didn't keep the test running for long.
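For anyone trying to read the CDB bytes in the logs above, here is a minimal Python sketch that splits an ATA PASS-THROUGH (12) or (16) CDB into its main fields. The byte offsets follow my reading of the SAT (SCSI/ATA Translation) layout; the function name and the two lookup tables are made up for illustration and are deliberately incomplete.

```python
# Partial lookup tables (illustrative only, not complete lists).
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out", 6: "DMA"}
ATA_COMMANDS = {0xB0: "SMART", 0xE5: "CHECK POWER MODE", 0xEC: "IDENTIFY DEVICE"}


def decode_ata_passthrough(cdb_hex):
    """Decode an ATA PASS-THROUGH CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) == 12 and cdb[0] == 0xA1:
        # 12-byte form: sector count in byte 4, ATA command in byte 9.
        sector_count, command = cdb[4], cdb[9]
    elif len(cdb) == 16 and cdb[0] == 0x85:
        # 16-byte form: sector count in bytes 5-6, ATA command in byte 14.
        sector_count, command = (cdb[5] << 8) | cdb[6], cdb[14]
    else:
        raise ValueError("not an ATA PASS-THROUGH (12) or (16) CDB")
    protocol = (cdb[1] >> 1) & 0x0F  # bits 4..1 of byte 1
    return {
        "protocol": PROTOCOLS.get(protocol, hex(protocol)),
        "ck_cond": bool(cdb[2] & 0x20),      # request ATA status in sense data
        "from_device": bool(cdb[2] & 0x08),  # T_DIR: data flows device -> host
        "sector_count": sector_count,
        "command": ATA_COMMANDS.get(command, hex(command)),
    }


if __name__ == "__main__":
    for cdb in ("a1 08 2e 00 01 00 00 00 00 ec 00 00",
                "85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00"):
        print(decode_ata_passthrough(cdb))
```

Both CDB variants seen in the logs carry ATA command 0xEC with the PIO Data-In protocol, so the commands being aborted are IDENTIFY DEVICE requests issued through pass-through.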




On Fri, May 7, 2010 at 1:01 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
>
>
>
>
> --- Comment #20 from kdesai <kashyap.desai@lsi.com>  2010-05-07 08:01:25
> ---
> (In reply to comment #19)
> > So... patch seems to fix ATA command pass-through problems.  I let it go a
> > day spamming hddtemp in a loop on all the drives, while at same time reading
> > 600MB/sec or so.  No problem.  Again, without patch, it would never manage
> > more than 10 seconds spamming all the drives at once.
> >
> > IMO it seems like the ATA-Passthrough bug is fixed by this patch.  I cannot
> > cause a failure using ATA-Passthrough.
> >
> > All is not good news however....
> >
> > With this bug fixed I was going to start expanding a md array one disk at a
> > time.  Unfortunately sooner or later the controller seems to crap out.  I
> > don't know what is at fault, but the mptsas driver's method of just blowing
> > up and blocking processes forever sucks.
> >
> > I've tried this 4 times now and each time I see some read errors, then task
> > resets fail and eventually it gets to the point where it just keeps spamming
> > 'sometask has been blocked for 120s'.  I WISH this was a bad drive, but even
> > if it was a bad drive it shouldn't take down the system like this, but just
> > to be sure I've been swapping a few drives and it doesn't really make a
> > difference.  Each time a different drive starts the fail sequence.  I'm
> > guessing it's unlikely I have a pile of bad drives.
> >
> > I do have 16 drives all attached via a HP SAS Expander, perhaps the expander
> > is at fault.  I also have a backup Chenbro Expander I could try...  but I'm
> > too lazy to at the moment.  I could also try ditching the Expanders to see
> > if that is the cause of these problems, but again too lazy at the moment.
> > Monday a mpt2sas expander is being delivered, I think my best bet is to
> > ditch this mptsas driver altogether.  If that doesn't fix problems I'll
> > then go back and try swapping Expanders and whatnot.
> >
> > Anyways, TL;DR:  ATA-PassThrough bug is fixed, mptsas still blows.
>
> The patch that sets the DMA boundary merely avoids the condition that causes
> this issue. The LSI Gen-1 controller does not have a 512-byte DMA boundary
> limitation. I have started an internal discussion with our firmware engineer
> and will update you as soon as there are any important findings.
> >
> > Here log from current failures, fairly sure this is unrelated to the
> entire
> > ATA-Passthrough problem:
> > May  6 17:52:09 nine kernel: [18838.207805] md: recovery of RAID array
> md127
> > May  6 17:52:09 nine kernel: [18838.207815] md: minimum _guaranteed_
>  speed:
> > 1000 KB/sec/disk.
> > May  6 17:52:09 nine kernel: [18838.207818] md: using maximum available
> idle
> > IO bandwidth (but not more than 200000 KB/sec) for recovery.
> > May  6 17:52:09 nine kernel: [18838.207831] md: using 128k window, over a
> > total of 1953510784 blocks.
> > May  6 17:52:09 nine kernel: [18838.207833] md: resuming recovery of
> md127
> > from checkpoint.
> > May  6 20:51:21 nine kernel: [29589.980035] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff8803318f4900)
> > May  6 20:51:21 nine kernel: [29589.980041] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:28 nine kernel: [29596.503483] mptbase: ioc0:
> > LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
> > May  6 20:51:28 nine kernel: [29596.503747] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff8803318f4900)
> > May  6 20:51:28 nine kernel: [29597.253319] mptbase: ioc0:
> > LogInfo(0x31170000): Originator={PL}, Code={IO Device Missing Delay
> Retry},
> > SubCode(0x0000)
> > May  6 20:51:28 nine kernel: [29597.253329] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff8803318f4e00)
> > May  6 20:51:28 nine kernel: [29597.253332] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e fc 00 00 01 00 00
> > May  6 20:51:28 nine kernel: [29597.253341] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff8803318f4e00)
> > May  6 20:51:29 nine kernel: [29597.753599] mptbase: ioc0:
> > LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed},
> > SubCode(0x0000)
> > May  6 20:51:29 nine kernel: [29597.753608] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff8803318f4c00)
> > May  6 20:51:29 nine kernel: [29597.753610] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 02 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753619] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff8803318f4c00)
> > May  6 20:51:29 nine kernel: [29597.753622] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff8803318f5b00)
> > May  6 20:51:29 nine kernel: [29597.753624] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 0e 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753633] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff8803318f5b00)
> > May  6 20:51:29 nine kernel: [29597.753636] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff880331e3d900)
> > May  6 20:51:29 nine kernel: [29597.753638] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 14 00 00 00 08 00
> > May  6 20:51:29 nine kernel: [29597.753646] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff880331e3d900)
> > May  6 20:51:29 nine kernel: [29597.753649] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff880331e3d400)
> > May  6 20:51:29 nine kernel: [29597.753651] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 14 08 00 00 68 00
> > May  6 20:51:29 nine kernel: [29597.753659] mptscsih: ioc0: task abort:
> > SUCCESS (sc=ffff880331e3d400)
> > May  6 20:51:29 nine kernel: [29597.753671] mptscsih: ioc0: attempting
> > target reset! (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753673] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753685] mptscsih: ioc0: target reset:
> > FAILED (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753693] mptscsih: ioc0: attempting
> bus
> > reset! (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753695] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:51:29 nine kernel: [29597.753712] mptscsih: ioc0: bus reset:
> > FAILED (sc=ffff8803318f4900)
> > May  6 20:51:29 nine kernel: [29597.753715] mptscsih: ioc0: attempting
> host
> > reset! (sc=ffff8803318f4900)
> > May  6 20:52:04 nine kernel: [29632.830020] mptscsih: ioc0: host reset:
> > SUCCESS (sc=ffff8803318f4900)
> > May  6 20:52:14 nine kernel: [29642.840021] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840024] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840026] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840028] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840030] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840032] sd 6:0:5:0: Device offlined -
> > not ready after error recovery
> > May  6 20:52:14 nine kernel: [29642.840076] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.840082] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.840087] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e f6 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.840112] raid5:md127: read error not
> > correctable (sector 1284435456 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840129] raid5:md127: read error not
> > correctable (sector 1284435464 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840133] raid5:md127: read error not
> > correctable (sector 1284435472 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840136] raid5:md127: read error not
> > correctable (sector 1284435480 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840139] raid5:md127: read error not
> > correctable (sector 1284435488 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840143] raid5:md127: read error not
> > correctable (sector 1284435496 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840149] raid5:md127: read error not
> > correctable (sector 1284435504 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840196] raid5:md127: read error not
> > correctable (sector 1284435512 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840199] raid5:md127: read error not
> > correctable (sector 1284435520 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.840202] raid5:md127: read error not
> > correctable (sector 1284435528 on sdh2).
> > May  6 20:52:14 nine kernel: [29642.847676] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.847678] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847681] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8e fc 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847745] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.847746] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847749] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 02 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847812] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.847813] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847816] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 0e 00 00 01 00 00
> > May  6 20:52:14 nine kernel: [29642.847871] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.847873] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847875] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 14 00 00 00 08 00
> > May  6 20:52:14 nine kernel: [29642.847907] sd 6:0:5:0: [sdh] Unhandled
> > error code
> > May  6 20:52:14 nine kernel: [29642.847908] sd 6:0:5:0: [sdh] Result:
> > hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> > May  6 20:52:14 nine kernel: [29642.847911] sd 6:0:5:0: [sdh] CDB:
> Read(10):
> > 28 00 4c 8f 14 08 00 00 68 00
> > May  6 20:52:19 nine kernel: [29647.840019] mptbase: ioc0: WARNING -
> Issuing
> > Reset from mpt_config!!
> > May  6 20:52:50 nine kernel: [29678.961260] ------------[ cut here
> > ]------------
> > May  6 20:52:50 nine kernel: [29678.961268] WARNING: at
> > /home/kernel-ppa/mainline/build/kernel/workqueue.c:485
> > flush_cpu_workqueue+0x8c/0x90()
> > May  6 20:52:50 nine kernel: [29678.961271] Hardware name: empty
> > May  6 20:52:50 nine kernel: [29678.961273] Modules linked in: btrfs
> > zlib_deflate crc32c libcrc32c xfs exportfs mptctl binfmt_misc ppdev
> > ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_state
> > nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables
> bridge
> > stp kvm_intel kvm snd_hda_codec_realtek snd_hda_intel snd_hda_codec
> > snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss
> > snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer
> snd_seq_device
> > psmouse serio_raw ioatdma snd i5100_edac nvidia(P) dca soundcore
> > snd_page_alloc edac_core lp parport raid10 raid456 async_raid6_recov
> > async_pq raid6_pq async_xor ses enclosure xor async_memcpy async_tx raid1
> > raid0 multipath linear ahci e1000e mptsas mptscsih mptbase
> > scsi_transport_sas
> > May  6 20:52:50 nine kernel: [29678.961333] Pid: 321, comm: mpt/0
> Tainted:
> > P           2.6.34-020634rc6-generic #020634rc6
> > May  6 20:52:50 nine kernel: [29678.961336] Call Trace:
> > May  6 20:52:50 nine kernel: [29678.961341]  [<ffffffff8107a9ac>] ?
> > flush_cpu_workqueue+0x8c/0x90
> > May  6 20:52:50 nine kernel: [29678.961346]  [<ffffffff8105f1ec>]
> > warn_slowpath_common+0x8c/0xc0
> > May  6 20:52:50 nine kernel: [29678.961350]  [<ffffffff8105f234>]
> > warn_slowpath_null+0x14/0x20
> > May  6 20:52:50 nine kernel: [29678.961353]  [<ffffffff8107a9ac>]
> > flush_cpu_workqueue+0x8c/0x90
> > May  6 20:52:50 nine kernel: [29678.961357]  [<ffffffff8106f981>] ?
> > try_to_del_timer_sync+0x51/0xe0
> > May  6 20:52:50 nine kernel: [29678.961360]  [<ffffffff8107aa74>]
> > flush_workqueue+0x44/0x70
> > May  6 20:52:50 nine kernel: [29678.961373]  [<ffffffffa004531c>]
> > mptsas_cleanup_fw_event_q+0x12c/0x160 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961378]  [<ffffffffa0048434>]
> > mptsas_ioc_reset+0x94/0x130 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961383]  [<ffffffff81033d39>] ?
> > default_spin_lock_flags+0x9/0x10
> > May  6 20:52:50 nine kernel: [29678.961389]  [<ffffffffa001222d>]
> > mpt_signal_reset+0x4d/0x60 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961394]  [<ffffffffa0018eb6>]
> > mpt_SoftResetHandler+0x1b6/0x3c0 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961399]  [<ffffffffa001bee7>]
> > mpt_config+0x307/0x640 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961404]  [<ffffffffa004c6f0>] ?
> > mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961409]  [<ffffffffa001d0b1>]
> > mpt_findImVolumes+0xb1/0x600 [mptbase]
> > May  6 20:52:50 nine kernel: [29678.961415]  [<ffffffffa004c6f0>] ?
> > mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961419]  [<ffffffffa004cd88>]
> > mptsas_firmware_event_work+0x698/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961424]  [<ffffffff8100985b>] ?
> > __switch_to+0xbb/0x2e0
> > May  6 20:52:50 nine kernel: [29678.961428]  [<ffffffff8105118e>] ?
> > put_prev_entity+0x2e/0x80
> > May  6 20:52:50 nine kernel: [29678.961430]  [<ffffffff81051af6>] ?
> > finish_task_switch+0x66/0xd0
> > May  6 20:52:50 nine kernel: [29678.961435]  [<ffffffffa004c6f0>] ?
> > mptsas_firmware_event_work+0x0/0xe80 [mptsas]
> > May  6 20:52:50 nine kernel: [29678.961438]  [<ffffffff8107a10c>]
> > run_workqueue+0xbc/0x190
> > May  6 20:52:50 nine kernel: [29678.961441]  [<ffffffff8107a65b>]
> > worker_thread+0x9b/0x100
> > May  6 20:52:50 nine kernel: [29678.961444]  [<ffffffff8107edc0>] ?
> > autoremove_wake_function+0x0/0x40
> > May  6 20:52:50 nine kernel: [29678.961447]  [<ffffffff8107a5c0>] ?
> > worker_thread+0x0/0x100
> > May  6 20:52:50 nine kernel: [29678.961450]  [<ffffffff8107e9e6>]
> > kthread+0x96/0xa0
> > May  6 20:52:50 nine kernel: [29678.961453]  [<ffffffff8100be64>]
> > kernel_thread_helper+0x4/0x10
> > May  6 20:52:50 nine kernel: [29678.961456]  [<ffffffff8107e950>] ?
> > kthread+0x0/0xa0
> > May  6 20:52:50 nine kernel: [29678.961458]  [<ffffffff8100be60>] ?
> > kernel_thread_helper+0x0/0x10
> > May  6 20:52:50 nine kernel: [29678.961460] ---[ end trace
> 5b0b1793526edc2a
> > ]---
> > May  6 20:53:20 nine kernel: [29709.040090] mptscsih: ioc0: attempting
> task
> > abort! (sc=ffff880331812400)
> > May  6 20:53:20 nine kernel: [29709.040093] sd 6:0:15:0: [sdr] CDB:
> > Write(10): 2a 00 00 00 00 47 00 00 02 00
> > May  6 20:53:50 nine kernel: [29739.040011] mptscsih: ioc0: WARNING -
> > Issuing Reset from mptscsih_IssueTaskMgmt!!
> > May  6 20:54:13 nine kernel: [29761.700122] md127_resync  D
> > ffff880001f55740     0  6733      2 0x00000000
> > May  6 20:54:13 nine kernel: [29761.700130]  ffff8803318f3b90
> > 0000000000000046 ffff8803318f3b50 ffff8803318f3fd8
> > May  6 20:54:13 nine kernel: [29761.700134]  ffff8803318eae20
> > 0000000000015740 0000000000015740 ffff8803318f3fd8
> > May  6 20:54:13 nine kernel: [29761.700137]  0000000000015740
> > ffff8803318f3fd8 0000000000015740 ffff8803318eae20
> > May  6 20:54:13 nine kernel: [29761.700141] Call Trace:
> > May  6 20:54:13 nine kernel: [29761.700160]  [<ffffffffa00f20e2>]
> > get_active_stripe+0x232/0x340 [raid456]
> > May  6 20:54:13 nine kernel: [29761.700167]  [<ffffffff810507e0>] ?
> > default_wake_function+0x0/0x20
> > May  6 20:54:13 nine kernel: [29761.700172]  [<ffffffffa00f49ad>]
> > sync_request+0x26d/0x2d0 [raid456]
> > May  6 20:54:13 nine kernel: [29761.700176]  [<ffffffffa00f1e8e>] ?
> > raid5_unplug_device+0x7e/0xa0 [raid456]
> >
> >
>
> For now you can continue with the patch for the DMA boundary alignment issue.
> For this new issue, please provide the complete /var/log/messages with debug
> turned on:
>
> echo 0x8188 > /sys/module/mptbase/parameters/mpt_debug_level
>
> Thanks,
> Kashyap
> > On Wed, May 5, 2010 at 3:35 AM, <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=14831
> > >
> > >
> > > Andrew Dunn <andrew.g.dunn.dod@gmail.com> changed:
> > >
> > >           What    |Removed                     |Added
> > >
> > >
> ----------------------------------------------------------------------------
> > >                 CC|                            |
> > > andrew.g.dunn.dod@gmail.com
> > >
> > >
> > >
> > >
> > > --- Comment #18 from Andrew Dunn <andrew.g.dunn.dod@gmail.com>
>  2010-05-05
> > > 10:35:17 ---
> > > I anxiously await confirmation of this patch. This issue has been
> plaguing
> > > me
> > > for quite a while. Just for verification the mpt2sas controllers don't
> have
> > > problems with this? I was thinking of trying to get an AOC-USAS2-L8i
> > > (
> > >
> http://www.supermicro.com/products/accessories/addon/AOC-USAS2-L8i.cfm?TYP=I
> > > )
> > >
> > > --
> > > Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> > > ------- You are receiving this mail because: -------
> > > You are on the CC list for the bug.
> > >
>
>
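The mptbase LogInfo words in the quoted log (0x31140000, 0x31170000, 0x31130000) can be pulled apart by hand. The sketch below assumes the SAS LogInfo layout as I understand it from the fusion driver: top nibble is the bus type (3 for SAS), next nibble the originator, then an 8-bit code and a 16-bit subcode. The code names are only the three that appear in the log itself; the function name is made up for illustration.

```python
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}

# Only the PL codes that appear in the quoted log; anything else stays numeric.
PL_CODES = {
    0x13: "IO Not Yet Executed",
    0x14: "IO Executed",
    0x17: "IO Device Missing Delay Retry",
}


def decode_sas_loginfo(loginfo):
    """Split a 32-bit SAS LogInfo word into its fields (assumed layout)."""
    bus_type = (loginfo >> 28) & 0xF
    if bus_type != 3:
        raise ValueError("not a SAS LogInfo word")
    originator = (loginfo >> 24) & 0xF
    code = (loginfo >> 16) & 0xFF
    subcode = loginfo & 0xFFFF
    origin_name = ORIGINATORS.get(originator, hex(originator))
    code_name = PL_CODES.get(code, hex(code)) if origin_name == "PL" else hex(code)
    return {"originator": origin_name, "code": code_name, "subcode": subcode}


if __name__ == "__main__":
    for word in (0x31140000, 0x31170000, 0x31130000):
        print(hex(word), decode_sas_loginfo(word))
```

All three words decode to originator PL with subcode 0, which matches what the driver printed next to them.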
Comment 22 Tim Small 2010-05-12 09:03:57 UTC
(In reply to comment #21)

> So apparently this bug affects mpt2sas too???

Hi Brian,

Can you give the drive makes and models too?  Some recent WD drives can lock-up on SMART with all controllers.

Thanks,

Tim.

p.s. please try and keep the quotes under control in comments - only quote the relevant parts!
Comment 23 Brian Sullivan 2010-05-12 09:33:13 UTC
>
> Can you give the drive makes and models too?  Some recent WD drives can
> lock-up on SMART with all controllers.


1,1  /dev/sdq unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46815978
2,1  /dev/sdp unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46783271
3,1  /dev/sdo unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46803203
4,1  /dev/sdn unmounted  WDC WD1000FYPS-0:1B01  WD-WCASJ0318455
1,2  /dev/sdi unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42580399
2,2  /dev/sdh unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42557087
3,2  /dev/sdg unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46812587
4,2  /dev/sdf unmounted  WDC WD15EADS-00H:0K05  WD-WCAUP0019266
1,3  /dev/sde unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY2526737
2,3  /dev/sdd unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY2252304
3,3  (empty)
4,3  (empty)
1,4  /dev/sdm unmounted  Hitachi HDS72202:A20N  JK1130YAGVZSRT
2,4  /dev/sdl unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1361632
3,4  /dev/sdk unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1338263
4,4  /dev/sdj unmounted  WDC WD20EADS-00S:0A01  WD-WCAVY1279474
1,5  /dev/sdr unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1985273
2,5  /dev/sds unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1861891
3,5  /dev/sdt unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1985283
4,5  /dev/sdu unmounted  WDC WD20EADS-00R:0A01  WD-WCAVY1831784

I have some more Hitachi drives, but I am pretty sure at some point I
compared WD to Hitachi and had problems with both.

Maybe next weekend if I have some time I'll try pulling all the WD drives
and adding some more Hitachi and seeing what happens.
Comment 24 Tim Small 2010-05-12 10:22:08 UTC
(In reply to comment #23)

> 3,1  /dev/sdo unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46803203
> 4,1  /dev/sdn unmounted  WDC WD1000FYPS-0:1B01  WD-WCASJ0318455
> 1,2  /dev/sdi unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42580399
> 2,2  /dev/sdh unmounted  WDC WD10EACS-00D:1A01  WD-WCAU42557087
> 3,2  /dev/sdg unmounted  WDC WD10EADS-00L:1A01  WD-WCAU46812587
> 4,2  /dev/sdf unmounted  WDC WD15EADS-00H:0K05  WD-WCAUP0019266

> Maybe next weekend if I have some time I'll try pulling all the WD drives
> and adding some more Hitachi and seeing what happens.

Perhaps you could stress-test the drives on a different controller e.g. AHCI, PIIX, Silicon Image, or whatever?

Tim.
Comment 25 dujun 2010-05-24 09:05:10 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=16021 
We have a similar problem, which I described in the bug report above. Ryan told me about this patch and I have tested it on our system. It seems to work, and the md RAID rebuild speed is around 60MB/s for 16 HDDs.

However, the dd speed while the RAID 5 is rebuilding is much lower than the original result: around 150MB/s for write, compared with almost 400MB/s without the patch, and 600MB/s for read, compared with 800MB/s.

Is this caused by the bounce buffer used for the alignment? Is there any way to solve this problem in the LSI firmware so that we don't need a forced alignment?

We will test the software raid grow problem reported by Brian later.
Comment 26 dujun 2010-05-26 08:08:01 UTC
It seems to us that the software raid growing has no problem at all with the forced alignment patch.
Comment 27 richard 2010-06-07 20:00:22 UTC
Hi Dujun,

Can you add any more information to this performance drop, here and on linux-scsi?

See discussion thread here: 

http://marc.info/?l=linux-scsi&m=127567915722288&w=2

Thanks,

Richard
Comment 28 dujun 2010-06-08 00:29:49 UTC
Hi Richard,
Sorry, I don't follow the linux-scsi mailing list. Please note the following.

There are 3 test environments, all running Linux 2.6.32 and the official LSI driver 4.22.00.00:

1. LSI 1068e HBA chip on the mainboard connected to two LSI x12A expander chips, with 16 1TB WD SATA-II disks.
2. LSI 1068e HBA chip on the mainboard connected to one LSI x36 expander chip, with 16 1TB WD SATA-II disks.
3. LSI 1068e HBA chip on the mainboard and LSI 1068e HBA in a PCI-e slot, each connected to 8 1TB WD SATA-II disks, 16 disks in total.

Without the 512-byte alignment patch, only environment 1 has the reset problem. Environments 2 and 3 work without any problem.

With the patch, all environments passed stability testing. However, the dd test showed that performance in environment 1 degraded. We used
mdadm -C /dev/md10 -l 5 -n 16 /dev/sd[b-q]
to set up the md RAID 5, then
dd if=/dev/zero of=/dev/md10 bs=1M count=10960
to test the write speed of the md device.

Environments 2 and 3 show no performance penalty.
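To put the dd figures in perspective, here is a small Python sketch of the arithmetic, assuming bs=1M means 1 MiB blocks and that the MB/s figures are decimal megabytes per second (as GNU dd reports them). The helper names are made up for illustration; the 150 and 400 MB/s rates are the write speeds quoted in comment #25.

```python
MIB = 1024 * 1024


def dd_total_bytes(bs_mib, count):
    """Total bytes written by dd with bs=<bs_mib>M and count=<count>."""
    return bs_mib * MIB * count


def seconds_at(total_bytes, mb_per_s):
    """Time to move total_bytes at a rate given in decimal MB/s."""
    return total_bytes / (mb_per_s * 1_000_000)


if __name__ == "__main__":
    total = dd_total_bytes(1, 10960)
    print("bytes written:", total)
    for rate in (150, 400):
        print(f"{rate} MB/s -> {seconds_at(total, rate):.1f} s")
```

So the same test run takes well over twice as long on a setup that shows the slowdown.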

After several more days of investigation, swapping the hardware part by part, we found that some of the x12A expanders may cause the problem. Most of them work fine, just as in environments 2 and 3; only two or three caused the performance issue.

So we conclude that the patch should be included in the next release. The performance drop may be caused by the x12A expander. We are going to ask our chip solution provider to investigate further why some of the chips work, but with lower performance.

(In reply to comment #27)
> Hi Dujun,
> 
> Can you add any more information to this performance drop, here and on
> linux-scsi?
> 
> See discussion thread here: 
> 
> http://marc.info/?l=linux-scsi&m=127567915722288&w=2
> 
> Thanks,
> 
> Richard
Comment 29 Tim Small 2010-06-08 06:44:45 UTC
(In reply to comment #28)

> linux 2.6.32 and the LSI official driver 4.22.00.00:

The official Linux driver seems to be 3.04.15 -

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/message/fusion/mptbase.h;hb=HEAD#l69

so my understanding is that you should probably be testing with this driver, rather than the 4.x LSI driver which hasn't made it into the Linux tree (yet).  The 4.x driver might be useful for providing additional data points.

Some official word from LSI on this would be useful.

Thanks,

Tim.
Comment 30 kashyap 2010-06-08 08:42:48 UTC
(In reply to comment #29)
> (In reply to comment #28)
> 
> > linux 2.6.32 and the LSI official driver 4.22.00.00:
> 
> The official Linux driver seems to be 3.04.15 -
> 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=drivers/message/fusion/mptbase.h;hb=HEAD#l69
> 
> so my understanding is that you should probably be testing with this driver,
Yes. I would recommend using the 3.4.15 driver, since this bug was raised on kernel.org.

> rather than the 4.x LSI driver which hasn't made it into the Linux tree (yet). 
> The 4.x driver might be useful for providing additional data points.
> 
> Some official word from LSI on this would be useful.
> 
> Thanks,
> 
> Tim.
Comment 31 starlight 2010-06-29 20:26:33 UTC
FYI

Experiencing the same problem on a CentOS kernel with the latest LSI
driver and firmware:

LSI 2008
eight Seagate Momentus ST9500420AS SATA drives
LVM2 8x striped LV

CentOS 5.5 kernel 2.6.18-194.3.1.el5
MPT2BIOS 7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI mpt2sas 05.00.00.00

also

CentOS 5.4 kernel 2.6.18-164.10.1.el5
MPT2BIOS 7.03.00.00 (2009-10-12)
SAS2008-IR 4.00.00.00
distro mpt2sas version 01.101.00.00

-----

With 'smartd' running, the controller resets and drops the last drive
after about one or two days.  It fails during very light write
activity rather than heavy write activity.  The LV is used for
writing a very large log file to an 'ext4' file system.

With 'smartd' disabled ran for longer under mpt2sas v01.101,
but a somewhat different error corrupted the 'ext4' filesystem
after about three weeks.

EXT4-fs error (device dm-19): ext4_mb_generate_buddy: EXT4-fs: group 1168: 32768 blocks in bitmap, 32720 in gd
EXT4-fs error (device dm-19): ext4_mb_mark_diskspace_used: Allocating block 38273024 in system zone of 1168 group
.
.
.
mpt2sas0: attempting task abort! scmd(ffff810130449540)
sd 0:0:2:0: command: Read(10): 28 00 11 35 5b 07 00 00 08 00
mpt2sas0: task abort: SUCCESS scmd(ffff810130449540)
.
.
.

Too soon to tell if mpt2sas-05.00.00.00 is better.
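In case it helps when matching aborted commands against filesystem or md errors, here is a minimal Python sketch that extracts the logical block address and transfer length from a Read(10) CDB as printed in these logs. It relies only on the standard READ(10) layout (big-endian LBA in bytes 2-5, length in bytes 7-8); the function name is made up for illustration.

```python
def decode_read10(cdb_hex):
    """Return (lba, blocks) for a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_read10("28 00 11 35 5b 07 00 00 08 00")
    print(f"LBA {lba}, {blocks} blocks")
```

For the aborted command above this gives an 8-block read; the LBA is relative to the whole disk, not to a partition or logical volume.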
Comment 32 starlight 2010-08-16 16:00:33 UTC
Happened again with 'smartd' disabled and with *latest*
kernel, LSI device driver and LSI IT (initiator target)
firmware.  Took 37 days of uptime for it to happen.
Failure was during moderate write activity rather than
light activity as with the 'smartd' pass-through
transactions.  Kernel messages attached.

kernel 5.5 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00
Comment 33 starlight 2010-08-16 16:04:36 UTC
Created attachment 27469 [details]
kernel messages from failure
Comment 34 starlight 2010-08-16 16:07:39 UTC
Created attachment 27470 [details]
kernel messages from corresponding boot
Comment 35 starlight 2010-08-28 15:43:16 UTC
Another controller failure, this time with logging_level=0x1F8.
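
For reference, logging_level is a bit mask. Below is a small Python sketch that lists which bits 0x1F8 sets. What each bit enables is defined by the driver's debug header and is not reproduced here; the helper name set_bits is only for illustration.

```python
def set_bits(mask):
    """Return the positions of the bits set in mask, lowest first."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


level = 0x1F8
positions = set_bits(level)
print(positions)                          # bit positions that are set
print([hex(1 << b) for b in positions])   # the same bits as individual masks
```

So 0x1F8 is the six bits 0x8 through 0x100 combined.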
Comment 36 starlight 2010-08-28 15:45:04 UTC
Created attachment 28171 [details]
boot-time information from 'lsiutil'
Comment 37 starlight 2010-08-28 15:45:49 UTC
Created attachment 28181 [details]
firmware events from boot and failure
Comment 38 starlight 2010-08-28 15:47:50 UTC
Created attachment 28191 [details]
boot-time messages with logging_level=0x1F8
Comment 39 starlight 2010-08-28 15:51:01 UTC
Created attachment 28201 [details]
kernel messages from failure with logging_level=0x1F8
Comment 40 starlight 2010-08-28 15:52:05 UTC
descriptions for attachments in #38 and #39 are reversed
Comment 41 kashyap 2010-08-30 15:18:21 UTC
(In reply to comment #40)
> descriptions for attachments in #38 and #39 are reversed

I have taken a close look at all the available logs for the configuration below.

kernel 5.5 2.6.18-194.8.1.el5
MPT2BIOS-7.05.01.00 (2010.02.09)
SAS2008-IT 5.00.00.00
LSI driver mpt2sas-05.00.00.00


Things are different in this case. This is not the same issue as the "smartd"-related one described in this bugzilla.

I see some kind of hotplug action in this case (or possibly a connection issue that created a hotplug-like situation).

1. See the snippet below from (https://bugzilla.kernel.org/attachment.cgi?id=28191)
--
Aug 27 14:23:34 X kernel: mpt2sas0: Device Status Change
Aug 27 14:23:34 X kernel: 	handle(0x000f), sas address(0x4433221107000000)<6>mpt2sas0: SAS Topology Change List
Aug 27 14:23:34 X kernel: sd 0:0:7:0: device_blocked, handle(0x000f)
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: sd 0:0:7:0: 
Aug 27 14:24:02 X kernel:         comma

---

The driver has received the hotplug action "device delay removal" (this is relevant to the LSI controller's device-missing-delay parameters).
Check "/sys/class/scsi_host/host6/device_delay"

2. Very soon afterwards, I see some task aborts followed by a device delete event.
See the snippet below.

--
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: sd 0:0:7:0: 
Aug 27 14:24:02 X kernel:         command: Write(10): 2a 00 11 51 68 0f 00 04 00 00
Aug 27 14:24:02 X kernel: mpt2sas0: Device Status Change
Aug 27 14:24:02 X kernel: mpt2sas0: task abort: SUCCESS scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: 
Aug 27 14:24:02 X kernel: mpt2sas0: updating handles for sas_host(0x5003048573212988)
Aug 27 14:24:02 X kernel: 	handle(0x000f), sas address(0x4433221107000000)<6>
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (start)
Aug 27 14:24:02 X kernel: mpt2sas0: SAS Topology Change List
Aug 27 14:24:02 X kernel: mpt2sas0: tr_send:handle(0x000f), (open), smid(439), cb(7)
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: updating handles for sas_host(0x5003048573212988)
Aug 27 14:24:02 X kernel: mpt2sas0: tr_complete:handle(0x000f), (open) smid(439), ioc_status(0x0000), loginfo(0x00000000), completed(0)
Aug 27 14:24:02 X kernel: mpt2sas0: sc_send:handle(0x000f), (open), smid(540), cb(5)
Aug 27 14:24:02 X kernel: mpt2sas0: sc_complete:handle(0x000f), (open) smid(540), ioc_status(0x0000), loginfo(0x00000000)
Aug 27 14:24:02 X kernel: mpt2sas0: _scsih_remove_device: enter: handle(0x000f), sas_addr(0x4433221107000000)
Aug 27 14:24:02 X kernel: sd 0:0:7:0: device_unblocked, handle(0x000f)
Aug 27 14:24:02 X kernel: mpt2sas0: removing handle(0x000f), sas_addr(0x4433221107000000)
Aug 27 14:24:02 X kernel: mpt2sas0: _scsih_remove_device: exit: handle(0x000f), sas_addr(0x4433221107000000)


---

3. The driver then immediately receives a device add (see the snippet below).
--
Aug 27 14:24:02 X kernel: mpt2sas0: Discovery: (stop)
Aug 27 14:24:02 X kernel: mpt2sas0: REPORT_LUNS: handle(0x000f), retries(0)
Aug 27 14:24:02 X kernel: mpt2sas0: 	ioc_status(0x0045), loginfo(0x00000000), rc(ready)
Aug 27 14:24:02 X kernel: mpt2sas0: TEST_UNIT_READY: handle(0x000f), lun(0)
Aug 27 14:24:02 X kernel: mpt2sas0: 	ioc_status(0x0000), loginfo(0x00000000), rc(retry_ua)
Aug 27 14:24:02 X kernel: mpt2sas0: 	[sense_key,asc,ascq]: [0x06,0x29,0x00]
Aug 27 14:24:02 X kernel: mpt2sas0: TEST_UNIT_READY: handle(0x000f), lun(0)
Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)
Aug 27 14:24:02 X kernel: scsi 0:0:7:0: 
Aug 27 14:24:02 X kernel:         command: Test Unit Ready: 00 00 00 00 00 00
Aug 27 14:24:02 X kernel: mpt2sas0: device been deleted! scmd(ffff81005a235cc0)
--

4. At the end, an HBA reset is executed, which removes device "scsi 0:0:7:0".
This means the device is not actually present in the firmware table (this can be confirmed if we have the output of lsiutil options 8 and 16).
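
The sequence in steps 1 to 4 can be pulled out of a long kernel log by filtering on the device handle. Below is a rough Python sketch, assuming syslog-style lines in the format shown in the snippets above. The function name handle_events and both regular expressions are my own and only match this log layout.

```python
import re

# "Aug 27 14:23:34 X kernel: <message>" as seen in the attached logs.
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s*(.*)$")
HANDLE_RE = re.compile(r"handle\((0x[0-9a-fA-F]+)\)")


def handle_events(lines, handle):
    """Return (timestamp, message) pairs for lines that mention the given handle."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip continuation lines and anything not in syslog format
        timestamp, message = match.groups()
        handles = [h.lower() for h in HANDLE_RE.findall(message)]
        if handle.lower() in handles:
            events.append((timestamp, message.strip()))
    return events


# Lines copied from the snippets above.
SAMPLE = [
    "Aug 27 14:23:34 X kernel: sd 0:0:7:0: device_blocked, handle(0x000f)",
    "Aug 27 14:24:02 X kernel: mpt2sas0: attempting task abort! scmd(ffff81005a235cc0)",
    "Aug 27 14:24:02 X kernel: mpt2sas0: _scsih_remove_device: enter: handle(0x000f), sas_addr(0x4433221107000000)",
    "Aug 27 14:24:02 X kernel: sd 0:0:7:0: device_unblocked, handle(0x000f)",
]

for timestamp, message in handle_events(SAMPLE, "0x000f"):
    print(timestamp, "|", message)
```

Lines that do not carry a handle, such as the task abort itself, are dropped by this filter, so it is only a starting point for lining up the block, remove and add events for one device.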

In summary, this may be a completely different issue. Can we move it to a new bugzilla entry, so that I can take a fresh look at it?

Thanks, Kashyap
Comment 42 starlight 2010-08-30 16:42:08 UTC
Kashyap,

Thank you for looking at this problem in depth.

Since it is different, I certainly can create a new bugzilla for 
it.  I'll do that in the next day or so.

Do you have any ideas about what might be the cause?

One thing that crosses my mind is that the drives here are not 
enterprise Seagate Constellation drives, but are Seagate 
Momentus drives that have more aggressive power saving features 
intended for laptops.  We chose them because at the time they 
were much less expensive and the sequential read/write 
performance was the same.  Now the price differential is much 
smaller and we would have gone with Constellations.

Is there any chance that the Momentus drives require the 10 
second command time-out in the LSI BIOS config to be extended? 
This is just a random idea.  The drives were all active at the 
time of the event and so would not have been in power saving 
mode or been responding slowly to commands.

Another theory I have is that there might be a memory leak in 
the firmware and that when all free memory is exhausted, the 
controller "goes insane".  Is such a memory leak something that 
would be apparent in the tracing?

Finally, I should mention that I believe the configuration is 
unusual.  A large, identical partition on the eight drives is 
configured as a software RAID0 volume.  I doubt that many people 
configure systems this way.  It might be stressing the
firmware/software in a unique fashion.

Thanks,

David





At 03:18 PM 8/30/2010 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>https://bugzilla.kernel.org/show_bug.cgi?id=14831
>
>I have taken a deep look of all the available logs for below 
>configuration.
>
>In summary, this can be a completely different issue. Can we move this issue to
>new bugzilla, so that I can have a fresh look on it ?
>
>Thanks, Kashyap
Comment 43 starlight 2010-08-31 08:17:07 UTC
The recent activity I reported above has been filed as new bug 17551, as it is apparently not related to the 'smartd' failure.
Comment 44 Benjamin ESTRABAUD 2010-09-28 19:07:00 UTC
Is the fix suggested in previous comment #15 (http://lkml.org/lkml/2010/4/26/335) not the same as the commit proposed by Yuri Tikonov in September 2009?

reference: (https://kerneltrap.org/mailarchive/linux-scsi/2009/9/1/6371653)

Is the one suggested in mptscsih.c more 'global' than the one suggested in mptsas.c by Yuri? Or are they complementary/redundant/different?
Comment 45 Tim Small 2010-11-03 08:14:51 UTC
As per:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2a1b7e5

etc.

Do LSI now consider this to be closed, or is some other firmware fix etc. in the offing?

If LSI do consider this the final fix to this issue, could they actually close this bug, rather than leaving it in the "NEW" state, so that the various distros have more information to guide them with respect to backporting etc.?

Thanks,

Tim.
