Bug 42765

Summary: mptscsih driver issues task aborts during high write utilization
Product: SCSI Drivers Reporter: Eric (eric.hidle)
Component: OtherAssignee: scsi_drivers-other
Status: RESOLVED OBSOLETE    
Severity: blocking CC: alan, eric.hidle, jamireste
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-8 Subsystem:
Regression: No Bisected commit-id:
Attachments: Graph of total write rate in guest during rsync data restore + mdadm rebuild (high write utilization)
TGZ file with patched mptsas driver for ESXi 5.0
Patched VIB. The tgz I added will not work

Description Eric 2012-02-13 13:36:50 UTC
Created attachment 72364 [details]
Graph of total write rate in guest during rsync data restore + mdadm rebuild (high write utilization)

During high write utilization, a sting of errors similar to the following appears in syslog:

Feb 13 07:42:22 Beluga kernel: [54224.040144] mptscsih: ioc1: attempting task abort! (sc=ffff8800018dea00)
Feb 13 07:42:22 Beluga kernel: [54224.040154] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 55 3f 00 02 00 00
Feb 13 07:42:23 Beluga kernel: [54224.779457] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff8800018dea00) (sn=53380191)
Feb 13 07:42:24 Beluga kernel: [54226.018680] mptscsih: ioc1: attempting task abort! (sc=ffff88001737a300)
Feb 13 07:42:24 Beluga kernel: [54226.018699] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 3f 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.018711] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88001737a300) (sn=53380197)
Feb 13 07:42:24 Beluga kernel: [54226.025368] mptscsih: ioc1: attempting task abort! (sc=ffff88002b2c7200)
Feb 13 07:42:24 Beluga kernel: [54226.025372] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 47 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.025382] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b2c7200) (sn=53380199)
Feb 13 07:42:24 Beluga kernel: [54226.025556] mptscsih: ioc1: attempting task abort! (sc=ffff88002b3b3300)
Feb 13 07:42:24 Beluga kernel: [54226.025559] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 4f 00 00 60 00
Feb 13 07:42:24 Beluga kernel: [54226.025569] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b3b3300) (sn=53380205)
Feb 13 07:42:24 Beluga kernel: [54226.025737] mptscsih: ioc1: attempting task abort! (sc=ffff88002b2c7900)
Feb 13 07:42:24 Beluga kernel: [54226.025740] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 57 3f 00 01 f0 00
Feb 13 07:42:24 Beluga kernel: [54226.025749] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b2c7900) (sn=53380211)
Feb 13 07:42:24 Beluga kernel: [54226.025916] mptscsih: ioc1: attempting task abort! (sc=ffff88002b385e00)
Feb 13 07:42:24 Beluga kernel: [54226.025919] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 59 2f 00 03 10 00
Feb 13 07:42:24 Beluga kernel: [54226.025928] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b385e00) (sn=53380217)
Feb 13 07:42:24 Beluga kernel: [54226.026094] mptscsih: ioc1: attempting task abort! (sc=ffff88002b385b00)
Feb 13 07:42:24 Beluga kernel: [54226.026098] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 2f 7f ad bf 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026107] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b385b00) (sn=53380218)
Feb 13 07:42:24 Beluga kernel: [54226.026271] mptscsih: ioc1: attempting task abort! (sc=ffff88001737ad00)
Feb 13 07:42:24 Beluga kernel: [54226.026274] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 2f 7f ad c7 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026283] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88001737ad00) (sn=53380224)

Simultaneous with this output in the Linux Guest Syslog, the following appear in the VMWare Kernel Log:

2012-02-13T12:42:21.677Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x4124015017c0)
2012-02-13T12:42:21.677Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 55 bf 00 00 80 00
2012-02-13T12:42:22.141Z cpu1:2049)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412400728580) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:EVAL
2012-02-13T12:42:22.141Z cpu1:2049)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.50024e92063340f2" state in doubt; requested fast path state update...
2012-02-13T12:42:22.141Z cpu1:2049)ScsiDeviceIO: 2305: Cmd(0x412400728580) 0x2a, CmdSN 0x800e0069 to dev "naa.50024e92063340f2" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-13T12:42:22.141Z cpu1:2049)<6>mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
2012-02-13T12:42:22.142Z cpu3:100554)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy vmhba3:0:4:0 (driver name: MPT SAS Host) - Message repeated 1 time
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x4124015017c0)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 56 3f 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x41240141ba80)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 ::
<6>        command: Write(10): 2a 00 47 0e 56 bf 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x41240141ba80)
2012-02-13T12:42:23.397Z cpu3:2171)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x0 (0x412400720d40) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.Act:NONE
2012-02-13T12:42:23.397Z cpu3:2171)ScsiDeviceIO: 2305: Cmd(0x412400720d40) 0x0, CmdSN 0x800e0061 to dev "naa.50024e92063340f2" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2012-02-13T12:42:23.397Z cpu3:2171)ScsiCore: 1455: Power-on Reset occurred on naa.50024e92063340f2


This has occurred on several of the disks attached to the LSI 1068E controller in the system. All disks are Samsung HD204UI. O/S is Ubuntu 11.04 Server running in VMWare ESXi 5.0 with all 6 drives attached to the guest via Raw Device Mapping, and assembled into a RAID5 array using mdadm. 

When a hard disk undergoes POR, it can fall out of an mdadm array, causing permanent data loss. We have seen one occurrence of a "Rebuild20" event from mdadm in the Guest syslog. The POR causes all writes to the array to stop, long enough to show up in the ESXi disk performance graph (attached image).
Comment 1 Eric 2012-02-14 14:57:38 UTC
Created attachment 72375 [details]
TGZ file with patched mptsas driver for ESXi 5.0

This is a TGZ file designed to be used with the ESXi Customizer v2.5 or v2.6 to patch a new mptsas driver into the stock ESXi 5.0 installation CD. I recompiled the community source driver to incorporate the patch against this driver that was introduced in Linux 2.6.35 that is not present in the ESXi 5.0 driver. 

I did this because I suspect this problem is being caused by ESXi, and not the Linux Guest. I will follow up once I have been able to test this new driver on an ESXi installation.
Comment 2 Eric 2012-02-14 18:48:10 UTC
Created attachment 72380 [details]
Patched VIB. The tgz I added will not work

I am adding this VIB since a TGZ cannot be used to replace an existing driver in the ISO. This VIB can be added to the ISO using the Customizer, or be added to a running system using CLI tools.
Comment 3 Alan 2013-12-23 14:28:20 UTC
If this is still present in modern kernels please update the bug