Bug 18652

Summary: mptscsih: ioc0: attempting task abort when heavy disk operations on MPT SAS
Product: SCSI Drivers Reporter: ksb
Component: OtherAssignee: scsi_drivers-other
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, bigplum, eric.hidle, jamireste, quintin, tobias.pal, wangdb
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35.4 - 2.6.36.1 Subsystem:
Regression: No Bisected commit-id:

Description ksb 2010-09-16 08:14:46 UTC
Happens when heavy disk write operations ongoing:
[ 4499.860030] mptscsih: ioc0: attempting task abort! (sc=ffff88007a588200)
[ 4499.860036] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dc f8 9f 00 04 00 00
[ 4499.894551] mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptbase_reply
[ 4501.256258] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_do
[ 4501.268298] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007a588200)
[ 4503.256426] mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptscsih_io_done
[ 4503.256439] mptscsih: ioc0: attempting task abort! (sc=ffff88007ab5cc00)
[ 4503.256443] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dc fc 9f 00 04 00 00
[ 4503.256455] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007ab5cc00)
[ 4503.506394] mptscsih: ioc0: attempting task abort! (sc=ffff88007a588000)
[ 4503.506399] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dd 00 9f 00 04 00 00
[ 4503.506412] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007a588000)
[ 4503.506698] mptscsih: ioc0: attempting task abort! (sc=ffff88007ab5c700)
[ 4503.506702] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dd 04 9f 00 02 50 00
[ 4503.506715] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007ab5c700)
[ 4503.506999] mptscsih: ioc0: attempting task abort! (sc=ffff88007a589600)
[ 4503.507002] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dd 06 f7 00 01 80 00
[ 4503.507014] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007a589600)

[.. a lot of them, and ends with..]

[ 4503.525540] mptscsih: ioc0: attempting task abort! (sc=ffff88007becbe00)
[ 4503.525543] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 10 4c 78 ff 00 01 50 00
[ 4503.525555] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007becbe00)
[ 4503.525858] mptscsih: ioc0: attempting target reset! (sc=ffff88007a588200)
[ 4503.525861] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dc f8 9f 00 04 00 00
[ 4505.006219] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88007a588200)
Happens when heavy disk write operations ongoing.
Identically on ubuntu's stock 2.6.32-24 and also on custom built 2.6.35.4 and
2.6.36-rc3 kernels.

following on 2.6.35.4 kernel:
# cat /proc/mpt/version
mptlinux-3.04.15
  Fusion MPT base driver
  Fusion MPT SAS host driver

# cat /proc/mpt/summary
ioc0: LSISAS1064E B2, FwRev=01140000h, Ports=1, MaxQ=511, IRQ=17 

# cat /proc/mpt/ioc0/info
ioc0:
  ProductID = 0x2204 (LSISAS1064E B2)
  FWVersion = 0x01140000
  MsgVersion = 0x0105
  FirstWhoInit = 0x00
  EventState = 0x00
  CurrentHostMfaHighAddr = 0x00000000
  CurrentSenseBufferHighAddr = 0x00000000
  MaxChainDepth = 0x60 frames
  MinBlockSize = 0x20 bytes
  RequestFrames @ 0xffff88007a402800 (Dma @ 0x000000007a402800)
    {CurReqSz=128} x {CurReqDepth=511} = 65408 bytes ^= 0x10000
    {MaxReqSz=128}   {MaxReqDepth=511}
  Frames   @ 0xffff88007a400000 (Dma @ 0x000000007a400000)
    {CurRepSz=80} x {CurRepDepth=128} = 10240 bytes ^= 0x2880
    {MaxRepSz=0}   {MaxRepDepth=511}
  MaxDevices = 173
  MaxBuses = 1
  PortNumber = 1 (of 1)

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: TSSTcorp Model: DVD-ROM SH-D162D Rev: SB00
  Type:   CD-ROM                           ANSI  SCSI revision: 05
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: WDC WD2500AAKS-0 Rev: 1B01
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi4 Channel: 00 Id: 01 Lun: 00
  Vendor: ATA      Model: WDC WD2500AAKS-0 Rev: 1B01
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi4 Channel: 00 Id: 02 Lun: 00
  Vendor: FUJITSU  Model: MAX3147RC        Rev: 0104
  Type:   Direct-Access                    ANSI  SCSI revision: 03
Host: scsi4 Channel: 00 Id: 03 Lun: 00
  Vendor: FUJITSU  Model: MAX3147RC        Rev: 0104
  Type:   Direct-Access                    ANSI  SCSI revision: 03
Host: scsi4 Channel: 00 Id: 04 Lun: 00
  Vendor: ESG-SHV. Model: SCA HSBP M1..... Rev: 2.15
  Type:   Enclosure                        ANSI  SCSI revision: 03

Happens at least on WDC's (SATA).
Comment 1 pipa.tk 2010-10-13 07:31:36 UTC
I have the same issue, and I am testing the patch from http://lkml.org/lkml/2010/4/26/335

It seems resolved, try it.
Comment 2 ksb 2010-10-14 14:09:18 UTC
Tested again with 2.6.35.7 where the patch already included. Unfortunately nothing changed - errors on heavy writes. Test cases like smartctl don't trigger any errors.
Further tests will be with other sata hdd's.
Comment 3 pipa.tk 2010-10-15 01:10:37 UTC
Yes, this patch not functioned completely. After I apply this patch with 2.6.18.8-xen and run for 2 days, IO still hang every 1~3 hours. It's more less than before. The message shows: 

mptscsih: ioc0: attempting task abort! (sc=ffff88028efdcb40)
sd 4:0:6:0:
        command: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff88028efdcb40)
mptscsih: ioc0: attempting task abort! (sc=ffff88028f12e080)
sd 4:0:6:0:
        command: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff88028f12e080)
Comment 4 pipa.tk 2010-10-29 03:34:31 UTC
Which harddisk ary you used, ST31500341AS?

Relate bug: https://bugzilla.kernel.org/show_bug.cgi?id=13594
Comment 5 ksb 2010-11-04 09:59:46 UTC
As I said before - happens only on WD'c, exactly WD2500AAKS. It don't happen on other older hdd's, i.e. WDC WD2500KS.
HDD's are connected to 6-Drive Hot-Swap Expander Board AXX6DRV3GEXP, but when I attach SATA hdd's directly to mainboard - everything works ok, without any errors.
Comment 6 quintin 2010-11-06 11:51:34 UTC
I too am seeing this behavior on 2.6.26 after applying the patch from: http://lkml.org/lkml/2010/4/26/335 - The drives we're using are: WD5000BEKT

Sometimes we have 1 drive drop from the array, other times multiple. The messages from the latest were:

[2149681.954263] EXT3 FS on dm-1, internal journal
[2149681.954396] EXT3-fs: mounted filesystem with ordered data mode.
[2150378.948488] mptscsih: ioc0: attempting task abort! (sc=ffff88000b13b580)
[2150378.948589] sd 1:0:6:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[2150383.337348] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[2150383.337806] mptsas: ioc0: removing sata device, channel 0, id 6, phy 6
[2150383.337902]  port-1:6: mptsas: ioc0: delete port (6)
[2150383.338043] sd 1:0:6:0: [sdh] Synchronizing SCSI cache
[2150383.554191] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88000b13b580)
[2150383.554292] mptscsih: ioc0: attempting target reset! (sc=ffff88000b13b580)
[2150383.554372] sd 1:0:6:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[2150383.810191] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88000b13b580)
[2150383.810288] mptscsih: ioc0: attempting bus reset! (sc=ffff88000b13b580)
[2150383.810367] sd 1:0:6:0: [sdh] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[2150384.578124] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff88000b13b580)
[2150394.582457] mptscsih: ioc0: attempting host reset! (sc=ffff88000b13b580)
[2150394.582555] mptbase: ioc0: Initiating recovery
[2150415.384673] mptscsih: ioc0: host reset: SUCCESS (sc=ffff88000b13b580)
[2150415.384773] sd 1:0:6:0: Device offlined - not ready after error recovery
[2150415.384865] end_request: I/O error, dev sdh, sector 976767931
[2150415.384944] md: super_written gets error=-5, uptodate=0
[2150415.385021] raid10: Disk failure on sdh6, disabling device.
[2150415.385022] raid10: Operation continuing on 7 devices.
[2150415.385223] raid10: sdh6: rescheduling sector 439149440
[2150415.385302] raid10: sdh6: rescheduling sector 439149456
[2150415.385378] raid10: sdh6: rescheduling sector 439149464
[2150415.385454] raid10: sdh6: rescheduling sector 439149472
[2150415.385530] raid10: sdh6: rescheduling sector 439149480
[2150415.385606] raid10: sdh6: rescheduling sector 439149488
[2150415.385682] raid10: sdh6: rescheduling sector 439149496
[2150415.385758] raid10: sdh6: rescheduling sector 439149504
[2150415.385834] raid10: sdh6: rescheduling sector 439149512
[2150415.385910] raid10: sdh6: rescheduling sector 439149520
[2150415.385990] end_request: I/O error, dev sdh, sector 20000703
[2150415.386067] md: super_written gets error=-5, uptodate=0
[2150415.386143] raid1: Disk failure on sdh1, disabling device.
[2150415.386144] raid1: Operation continuing on 7 devices.
[2150415.386422] sd 1:0:6:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
[2150415.760573] raid10: Disk failure on sdh5, disabling device.
[2150415.760576] raid10: Operation continuing on 7 devices.
[2150415.786990] RAID1 conf printout:
[2150415.787080]  --- wd:7 rd:8
[2150415.787151]  disk 0, wo:0, o:1, dev:sdb1
[2150415.787225]  disk 1, wo:0, o:1, dev:sdc1
[2150415.787298]  disk 2, wo:0, o:1, dev:sdd1
[2150415.787381]  disk 3, wo:0, o:1, dev:sde1
[2150415.787455]  disk 4, wo:0, o:1, dev:sdf1
[2150415.787528]  disk 5, wo:0, o:1, dev:sdg1
[2150415.787611]  disk 6, wo:1, o:0, dev:sdh1
[2150415.787664]  disk 7, wo:0, o:1, dev:sdi1
[2150415.796490] RAID1 conf printout:
[2150415.796581]  --- wd:7 rd:8
[2150415.796653]  disk 0, wo:0, o:1, dev:sdb1
[2150415.796727]  disk 1, wo:0, o:1, dev:sdc1
[2150415.796800]  disk 2, wo:0, o:1, dev:sdd1
[2150415.796873]  disk 3, wo:0, o:1, dev:sde1
[2150415.796958]  disk 4, wo:0, o:1, dev:sdf1
[2150415.797032]  disk 5, wo:0, o:1, dev:sdg1
[2150415.797105]  disk 7, wo:0, o:1, dev:sdi1
[2150415.813487] RAID10 conf printout:
[2150415.813573]  --- wd:7 rd:8
[2150415.813645]  disk 0, wo:0, o:1, dev:sdb6
[2150415.813730]  disk 1, wo:0, o:1, dev:sdc6
[2150415.813803]  disk 2, wo:0, o:1, dev:sdd6
[2150415.813885]  disk 3, wo:0, o:1, dev:sde6
[2150415.813970]  disk 4, wo:0, o:1, dev:sdf6
[2150415.814044]  disk 5, wo:0, o:1, dev:sdg6
[2150415.814117]  disk 6, wo:1, o:0, dev:sdh6
[2150415.814200]  disk 7, wo:0, o:1, dev:sdi6
[2150415.815864] RAID10 conf printout:
[2150415.815937]  --- wd:7 rd:8
[2150415.816009]  disk 0, wo:0, o:1, dev:sdb6
[2150415.816082]  disk 1, wo:0, o:1, dev:sdc6
[2150415.816156]  disk 2, wo:0, o:1, dev:sdd6
[2150415.816229]  disk 3, wo:0, o:1, dev:sde6
[2150415.816302]  disk 4, wo:0, o:1, dev:sdf6
[2150415.821507]  disk 5, wo:0, o:1, dev:sdg6
[2150415.821580]  disk 7, wo:0, o:1, dev:sdi6
[2150415.843762] RAID10 conf printout:
[2150415.843839]  --- wd:7 rd:8
[2150415.843911]  disk 0, wo:0, o:1, dev:sdb5
[2150415.843984]  disk 1, wo:0, o:1, dev:sdc5
[2150415.844058]  disk 2, wo:0, o:1, dev:sdd5
[2150415.844131]  disk 3, wo:0, o:1, dev:sde5
[2150415.844204]  disk 4, wo:0, o:1, dev:sdf5
[2150415.844278]  disk 5, wo:0, o:1, dev:sdg5
[2150415.844351]  disk 6, wo:1, o:0, dev:sdh5
[2150415.844424]  disk 7, wo:0, o:1, dev:sdi5
[2150415.853059] RAID10 conf printout:
[2150415.853135]  --- wd:7 rd:8
[2150415.853206]  disk 0, wo:0, o:1, dev:sdb5
[2150415.853279]  disk 1, wo:0, o:1, dev:sdc5
[2150415.853352]  disk 2, wo:0, o:1, dev:sdd5
[2150415.853426]  disk 3, wo:0, o:1, dev:sde5
[2150415.853499]  disk 4, wo:0, o:1, dev:sdf5
[2150415.853572]  disk 5, wo:0, o:1, dev:sdg5
[2150415.853645]  disk 7, wo:0, o:1, dev:sdi5
[2150450.460260] mptsas: ioc0: attaching sata device, channel 0, id 6, phy 6
[2150450.466970] scsi 1:0:8:0: Direct-Access     ATA      WDC WD5000BEKT-6 1A01 PQ: 0 ANSI: 5
[2150450.468312] sd 1:0:8:0: [sdj] 976773168 512-byte hardware sectors (500108 MB)
[2150450.475724] sd 1:0:8:0: [sdj] Write Protect is off
[2150450.475802] sd 1:0:8:0: [sdj] Mode Sense: 73 00 00 08
[2150450.481065] sd 1:0:8:0: [sdj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[2150450.481385] sd 1:0:8:0: [sdj] 976773168 512-byte hardware sectors (500108 MB)
[2150450.489103] sd 1:0:8:0: [sdj] Write Protect is off
[2150450.489179] sd 1:0:8:0: [sdj] Mode Sense: 73 00 00 08
[2150450.494628] sd 1:0:8:0: [sdj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[2150450.494731]  sdj: sdj1 sdj2 < sdj5 sdj6 >
[2150450.831656] sd 1:0:8:0: [sdj] Attached SCSI disk

Upon further testing the drive turns out to be fine. I have hardware available & are happy to test patches / perform additional debugging if required.
Comment 7 ksb 2010-11-10 20:53:25 UTC
After testing on 2.6.36 situation seems to be slightly better. Probably it happens little bit rarer, but the problem still exists.
Comment 8 Daobang Wang 2011-08-03 06:31:49 UTC
I test it on 2.6.36.4 with HitachiHDE72101, it has the same issue.

sd 7:0:9:0: [sdj] CDB: cdb[0]=0x28: 28 00 00 18 ac 00 00 01 c8 00
mptbase: ioc0: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700) cb_idx mptbase_reply
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801937cd1c0)
mptbase: ioc0: LogInfo(0x31110700): Originator={PL}, Code={Reset}, SubCode(0x0700) cb_idx mptscsih_io_done
mptscsih: ioc0: attempting task abort! (sc=ffff8801e562d4c0)
sd 7:0:9:0: [sdj] CDB: cdb[0]=0x2a: 2a 00 00 18 ab 00 00 01 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801e562d4c0)
mptscsih: ioc0: attempting task abort! (sc=ffff88021eb6c880)
...............

And i saw the /var/log/messages, there were some log:

Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1627904
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1628160
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1628920
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1629096
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1630120
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1631144
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1619456
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1632168
Aug  3 12:00:13 nss12345678 kernel: 
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1616384
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1620992
Aug  3 12:00:13 nss12345678 kernel:  00 00 18 ae 00 00 04 00 00
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1617408
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1618432
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1619808
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1619968
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1619984
Aug  3 12:00:13 nss12345678 kernel:  00 18 b8 18 00 00 e8 00
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1619992
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1620224
Aug  3 12:00:13 nss12345678 kernel:  00
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1620480
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1620736
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1621248
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1621504
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1621760
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1622784
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1623808
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1624832
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1625856
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1626880
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 1627648
Aug  3 12:00:13 nss12345678 kernel: end_request: I/O error, dev sdb, sector 16
Aug  3 12:00:13 nss12345678 kernel: md: super_written gets error=-5, uptodate=0
Aug  3 12:00:13 nss12345678 kernel: md/raid:md127: Disk failure on sdb, disabling device.
But it was ok when test it with smartctl
Comment 9 Eric 2012-02-14 18:52:44 UTC
I have this issue as well (Bug #42765) in Linux on VMWare, but the patch from lkml is not in ESXi 5.0. I created a VIB to patch the ESXi scsi-mptsas driver and will follow up when I have time to test it.

Mine occurs during high disk write utilization, seemingly independent of what disk is attached to the controller.