Bug 42765
Summary: | mptscsih driver issues task aborts during high write utilization | ||
---|---|---|---|
Product: | SCSI Drivers | Reporter: | Eric (eric.hidle) |
Component: | Other | Assignee: | scsi_drivers-other |
Status: | RESOLVED OBSOLETE | ||
Severity: | blocking | CC: | alan, eric.hidle, jamireste |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.38-8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Graph of total write rate in guest during rsync data restore + mdadm rebuild (high write utilization); TGZ file with patched mptsas driver for ESXi 5.0; Patched VIB. The tgz I added will not work | ||
Created attachment 72375 [details]
TGZ file with patched mptsas driver for ESXi 5.0
This is a TGZ file designed to be used with the ESXi Customizer v2.5 or v2.6 to patch a new mptsas driver into the stock ESXi 5.0 installation CD. I recompiled the community source driver to incorporate a patch that was introduced in Linux 2.6.35 and is not present in the ESXi 5.0 driver.
I did this because I suspect this problem is being caused by ESXi, and not the Linux Guest. I will follow up once I have been able to test this new driver on an ESXi installation.
Created attachment 72380 [details]
Patched VIB. The tgz I added will not work
I am adding this VIB since a TGZ cannot be used to replace an existing driver in the ISO. This VIB can be added to the ISO using the Customizer, or be added to a running system using CLI tools.
If this is still present in modern kernels, please update the bug.
Created attachment 72364 [details]
Graph of total write rate in guest during rsync data restore + mdadm rebuild (high write utilization)
During high write utilization, a string of errors similar to the following appears in syslog:
Feb 13 07:42:22 Beluga kernel: [54224.040144] mptscsih: ioc1: attempting task abort! (sc=ffff8800018dea00)
Feb 13 07:42:22 Beluga kernel: [54224.040154] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 55 3f 00 02 00 00
Feb 13 07:42:23 Beluga kernel: [54224.779457] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff8800018dea00) (sn=53380191)
Feb 13 07:42:24 Beluga kernel: [54226.018680] mptscsih: ioc1: attempting task abort! (sc=ffff88001737a300)
Feb 13 07:42:24 Beluga kernel: [54226.018699] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 3f 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.018711] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88001737a300) (sn=53380197)
Feb 13 07:42:24 Beluga kernel: [54226.025368] mptscsih: ioc1: attempting task abort! (sc=ffff88002b2c7200)
Feb 13 07:42:24 Beluga kernel: [54226.025372] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 47 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.025382] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b2c7200) (sn=53380199)
Feb 13 07:42:24 Beluga kernel: [54226.025556] mptscsih: ioc1: attempting task abort! (sc=ffff88002b3b3300)
Feb 13 07:42:24 Beluga kernel: [54226.025559] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 50 4f 00 00 60 00
Feb 13 07:42:24 Beluga kernel: [54226.025569] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b3b3300) (sn=53380205)
Feb 13 07:42:24 Beluga kernel: [54226.025737] mptscsih: ioc1: attempting task abort! (sc=ffff88002b2c7900)
Feb 13 07:42:24 Beluga kernel: [54226.025740] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 57 3f 00 01 f0 00
Feb 13 07:42:24 Beluga kernel: [54226.025749] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b2c7900) (sn=53380211)
Feb 13 07:42:24 Beluga kernel: [54226.025916] mptscsih: ioc1: attempting task abort! (sc=ffff88002b385e00)
Feb 13 07:42:24 Beluga kernel: [54226.025919] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 47 0e 59 2f 00 03 10 00
Feb 13 07:42:24 Beluga kernel: [54226.025928] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b385e00) (sn=53380217)
Feb 13 07:42:24 Beluga kernel: [54226.026094] mptscsih: ioc1: attempting task abort! (sc=ffff88002b385b00)
Feb 13 07:42:24 Beluga kernel: [54226.026098] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 2f 7f ad bf 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026107] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88002b385b00) (sn=53380218)
Feb 13 07:42:24 Beluga kernel: [54226.026271] mptscsih: ioc1: attempting task abort! (sc=ffff88001737ad00)
Feb 13 07:42:24 Beluga kernel: [54226.026274] sd 3:0:5:0: [sdg] CDB: Write(10): 2a 00 2f 7f ad c7 00 00 08 00
Feb 13 07:42:24 Beluga kernel: [54226.026283] mptscsih: ioc1: task abort: SUCCESS (rv=2002) (sc=ffff88001737ad00) (sn=53380224)
Simultaneously with this output in the Linux guest syslog, the following appears in the VMware kernel log:
2012-02-13T12:42:21.677Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x4124015017c0)
2012-02-13T12:42:21.677Z cpu6:65683)MPT SAS Host:8:0:4:0 :: <6> command: Write(10): 2a 00 47 0e 55 bf 00 00 80 00
2012-02-13T12:42:22.141Z cpu1:2049)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x412400728580) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.Act:EVAL
2012-02-13T12:42:22.141Z cpu1:2049)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.50024e92063340f2" state in doubt; requested fast path state update...
2012-02-13T12:42:22.141Z cpu1:2049)ScsiDeviceIO: 2305: Cmd(0x412400728580) 0x2a, CmdSN 0x800e0069 to dev "naa.50024e92063340f2" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-02-13T12:42:22.141Z cpu1:2049)<6>mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
2012-02-13T12:42:22.142Z cpu3:100554)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1055 Host Busy vmhba3:0:4:0 (driver name: MPT SAS Host) - Message repeated 1 time
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x4124015017c0)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 :: <6> command: Write(10): 2a 00 47 0e 56 3f 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x4124014d6380)
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: attempting task abort! (sc=0x41240141ba80)
2012-02-13T12:42:22.142Z cpu6:65683)MPT SAS Host:8:0:4:0 :: <6> command: Write(10): 2a 00 47 0e 56 bf 00 00 80 00
2012-02-13T12:42:22.142Z cpu6:65683)<6>mptscsih: ioc0: task abort: SUCCESS (sc=0x41240141ba80)
2012-02-13T12:42:23.397Z cpu3:2171)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x0 (0x412400720d40) to dev "naa.50024e92063340f2" on path "vmhba3:C0:T4:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.Act:NONE
2012-02-13T12:42:23.397Z cpu3:2171)ScsiDeviceIO: 2305: Cmd(0x412400720d40) 0x0, CmdSN 0x800e0061 to dev "naa.50024e92063340f2" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2012-02-13T12:42:23.397Z cpu3:2171)ScsiCore: 1455: Power-on Reset occurred on naa.50024e92063340f2
This has occurred on several of the disks attached to the LSI 1068E controller in the system. All disks are Samsung HD204UI. The OS is Ubuntu 11.04 Server running in VMware ESXi 5.0, with all 6 drives attached to the guest via Raw Device Mapping and assembled into a RAID5 array using mdadm.
When a hard disk undergoes a power-on reset (POR), it can fall out of an mdadm array, causing permanent data loss. We have seen one occurrence of a "Rebuild20" event from mdadm in the guest syslog. The POR causes all writes to the array to stop, long enough to show up in the ESXi disk performance graph (attached image).
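For anyone comparing the guest and host logs, the following is a minimal Python sketch (not part of the original report) that decodes the Write(10) CDB bytes and the SCSI status and sense triple shown above. The function names `decode_write10` and `describe_sense` are hypothetical, the lookup tables cover only the codes that appear in this report, and the byte count assumes 512-byte logical blocks, which the report does not state.

```python
# Sketch: decode values seen in the logs of this report.
# Field layouts follow the standard SCSI WRITE(10) CDB; the 512-byte
# logical block size is an assumption, not something stated in the report.

ASSUMED_BLOCK_SIZE = 512  # assumption: logical sector size in bytes

# Only the codes that occur in this report are listed.
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SENSE_KEYS = {0x6: "UNIT ATTENTION"}
ASC_ASCQ = {(0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED"}


def decode_write10(cdb_hex):
    """Return LBA and transfer length for a WRITE(10) CDB given as hex text."""
    cdb = bytes.fromhex("".join(cdb_hex.split()))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return {
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * ASSUMED_BLOCK_SIZE,
    }


def describe_sense(status, key, asc, ascq):
    """Describe a device status plus sense key/ASC/ASCQ, e.g. D:0x2 and 0x6 0x29 0x0."""
    return "{} / {} / {}".format(
        SCSI_STATUS.get(status, "unknown status 0x%02x" % status),
        SENSE_KEYS.get(key, "unknown sense key 0x%x" % key),
        ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ 0x%02x/0x%02x" % (asc, ascq)),
    )


if __name__ == "__main__":
    # First aborted write in the guest syslog.
    print(decode_write10("2a 00 47 0e 55 3f 00 02 00 00"))
    # "D:0x2 ... Valid sense data: 0x6 0x29 0x0" from the ESXi log.
    print(describe_sense(0x02, 0x6, 0x29, 0x00))
```

Decoding the CDBs this way makes it easier to check whether the aborted writes in the guest (`sdg`) and on the host (`vmhba3:C0:T4:L0`) fall in the same LBA range. The sense triple decodes to the same power-on reset condition that ESXi reports in its final log line.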