Bug 16547 - mptscsih: ioc0: attempting task abort, raid array LUNs not detected properly on some boots
Status: RESOLVED OBSOLETE
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-09 09:22 UTC by Martin Steigerwald
Modified: 2013-12-10 21:54 UTC

See Also:
Kernel Version: 2.6.32-bpo.5-amd64
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci -nnvv of one of the servers (82.12 KB, text/plain)
2010-08-09 09:24 UTC, Martin Steigerwald
config for the 2.6.32-5-amd64 debian backport kernel (103.05 KB, text/plain)
2010-08-09 09:35 UTC, Martin Steigerwald

Description Martin Steigerwald 2010-08-09 09:22:06 UTC
This is with a FibreChannel driver, the MPT Fusion driver, but I did not find any more suitable category.

Latest kernel known to work: 2.6.26 from Debian Backports

This likely is related to:

LSI Fusion MPT driver problem - recurring messages: mptscsih ioc0 attempting task abort
https://bugzilla.redhat.com/show_bug.cgi?id=483424

On two FTS servers from a customer I see "attempting task abort" errors on some boots. Then the FibreChannel LUNs are not detected properly. Sometimes I see no errors, but one of the external RAID arrays is missing completely. And often it just works. The errors usually disappear after rebooting, though sometimes it takes quite a few reboots until it works again. On boots where the LUNs are detected properly, there do not seem to be any further errors until the next boot.

This did not happen on some SuperMicro servers with exactly the same FibreChannel host bus adapter, running Debian Etch with kernels from 2.6.18 to 2.6.26 (Debian Backport Kernel). On the FTS servers I use the 2.6.32 Lenny backport kernel, since 2.6.26 is not able to boot from the internal SATA controller.

Now to the details:

Our setup is as follows: Two backend servers are each connected to two external EasyRAID arrays. So both see each array all the time. Usually one server takes the first one of both arrays and the other one takes the second one. Each LUN is a SoftRAID 1 with LVM on top of it, so that data is stored synchronously on both RAID arrays. A heartbeat setup with STONITH makes sure that only one server ever writes to a LUN even on cluster takeover.

When everything works each server sees the following LUNs - each one twice due to being connected to both of the RAID arrays which carry the "same" LUNs; the SoftRAID is over sdb and sdd or sdc and sde:

backend01:~# fdisk -l 2>/dev/null | grep "sd[b-e]"
Disk /dev/sdb: 2097.1 GB, 2097146764800 bytes
/dev/sdb1               1      254963  2047990266   fd  Linux raid autodetect
Disk /dev/sdc: 1101.7 GB, 1101725337600 bytes
/dev/sdc1               1      133943  1075897116   fd  Linux raid autodetect
Disk /dev/sdd: 2097.1 GB, 2097146764800 bytes
/dev/sdd1               1      254963  2047990266   fd  Linux raid autodetect
Disk /dev/sde: 1101.7 GB, 1101725337600 bytes
/dev/sde1               1      133943  1075897116   fd  Linux raid autodetect
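
To make the "each LUN twice" pairing explicit, the fdisk output above can be grouped by byte size. This is only a minimal sketch: the function name `group_by_size` and the regular expression are illustrative assumptions, and equal size alone does not prove that two devices belong to the same mirror.

```python
import re
from collections import defaultdict

# Matches fdisk header lines such as:
#   Disk /dev/sdb: 2097.1 GB, 2097146764800 bytes
DISK_RE = re.compile(r"^Disk (/dev/\w+): .*?, (\d+) bytes")


def group_by_size(fdisk_output):
    """Map each disk size in bytes to the list of devices reporting it."""
    groups = defaultdict(list)
    for line in fdisk_output.splitlines():
        match = DISK_RE.match(line.strip())
        if match:
            groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


# Header lines copied from the fdisk output in this report.
SAMPLE = """\
Disk /dev/sdb: 2097.1 GB, 2097146764800 bytes
Disk /dev/sdc: 1101.7 GB, 1101725337600 bytes
Disk /dev/sdd: 2097.1 GB, 2097146764800 bytes
Disk /dev/sde: 1101.7 GB, 1101725337600 bytes
"""

if __name__ == "__main__":
    for size, devices in sorted(group_by_size(SAMPLE).items()):
        print(size, devices)
```

On a boot where an array is missing, a size that maps to only one device would point at the LUN that lost its second path.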


Now after upgrading to the new FTS servers and to Debian Lenny with 2.6.32 backport kernel we sometimes see FC errors on boot.

The driver is loaded as:

Aug  2 16:17:22 backend02 kernel: [   27.547240] Fusion MPT base driver 3.04.12
Aug  2 16:17:22 backend02 kernel: [   27.547241] Copyright (c) 1999-2008 LSI Corporation
Aug  2 16:17:22 backend02 kernel: [   27.548426] dca service started, version 1.12.1
Aug  2 16:17:22 backend02 kernel: [   27.556900] Fusion MPT FC Host driver 3.04.12
Aug  2 16:17:22 backend02 kernel: [   27.556939] mptfc 0000:07:00.0: PCI INT A -> GSI 33 (level, low) -> IRQ 33

Then the driver detects a LUN:

Aug  2 16:17:22 backend02 kernel: [   38.081418] ioc0: LSIFC949E A1: Capabilities={Initiator,Target,LAN}
Aug  2 16:17:22 backend02 kernel: [   38.081435] mptfc 0000:07:00.0: setting latency timer to 64
Aug  2 16:17:22 backend02 kernel: [   39.025071] scsi5 : ioc0: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, IRQ=33
Aug  2 16:17:22 backend02 kernel: [   39.025285] mptfc: ioc0: FC Link Established, Speed = 4 Gbps
Aug  2 16:17:22 backend02 kernel: [   39.025750] mptfc 0000:07:00.1: PCI INT B -> GSI 31 (level, low) -> IRQ 31
Aug  2 16:17:22 backend02 kernel: [   39.026674] scsi 5:0:0:0: Direct-Access     easyRAID easyRAID_Q16P2   0001 PQ: 0 ANSI: 5
Aug  2 16:17:22 backend02 kernel: [   39.026810] sd 5:0:0:0: Attached scsi generic sg2 type 0
Aug  2 16:17:22 backend02 kernel: [   39.027010] scsi: host 5 channel 0 id 0 lun134217728 has a LUN larger than allowed by the host adapter
Aug  2 16:17:22 backend02 kernel: [   39.027017] sd 5:0:0:0: [sdb] 4095989775 512-byte logical blocks: (2.09 TB/1.90 TiB)
Aug  2 16:17:22 backend02 kernel: [   39.027295] sd 5:0:0:0: [sdb] Write Protect is off
Aug  2 16:17:22 backend02 kernel: [   39.027297] sd 5:0:0:0: [sdb] Mode Sense: b7 00 00 08
Aug  2 16:17:22 backend02 kernel: [   39.027415] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

The message "lun134217728 has a LUN larger than allowed by the host adapter" came to our attention. I don't know how it is related. When everything works both LUNs are detected properly. Each LUN is below 2 TiB. Maybe this is just a side effect of not detecting LUNs and their "geometry" properly.

Then some more of these:

Aug  2 16:17:22 backend02 kernel: [   41.768233] scsi6 : ioc1: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, IRQ=31
Aug  2 16:17:22 backend02 kernel: [   41.768507] mptfc: ioc1: FC Link Established, Speed = 4 Gbps
Aug  2 16:17:22 backend02 kernel: [   41.768555]  sdb:
Aug  2 16:17:22 backend02 kernel: [   41.769231] scsi 6:0:0:0: Direct-Access     easyRAID easyRAID_Q16P2   0001 PQ: 0 ANSI: 5
Aug  2 16:17:22 backend02 kernel: [   41.769354] sd 6:0:0:0: Attached scsi generic sg3 type 0
Aug  2 16:17:22 backend02 kernel: [   41.769592] scsi: host 6 channel 0 id 0 lun 0x6561737952414944 has a LUN larger than currently supporte
Aug  2 16:17:22 backend02 kernel: [   41.769597] scsi: host 6 channel 0 id 0 lun 0x6561737952414944 has a LUN larger than currently supporte
Aug  2 16:17:22 backend02 kernel: [   41.769601] scsi: host 6 channel 0 id 0 lun 0x5f51313650322020 has a LUN larger than currently supporte
Aug  2 16:17:22 backend02 kernel: [   41.769605] scsi: host 6 channel 0 id 0 lun134479872 has a LUN larger than allowed by the host adapter
Aug  2 16:17:22 backend02 kernel: [   41.769608] scsi: host 6 channel 0 id 0 lun134217728 has a LUN larger than allowed by the host adapter
Aug  2 16:17:22 backend02 kernel: [   41.769614] sd 6:0:0:0: [sdc] 4095989775 512-byte logical blocks: (2.09 TB/1.90 TiB)
Aug  2 16:17:22 backend02 kernel: [   41.769617] scsi: host 6 channel 0 id 0 lun1934688609 has a LUN larger than allowed by the host adapter
Aug  2 16:17:22 backend02 kernel: [   41.769985] sd 6:0:0:0: [sdc] Write Protect is off
Aug  2 16:17:22 backend02 kernel: [   41.769988] sd 6:0:0:0: [sdc] Mode Sense: b7 00 00 08
Aug  2 16:17:22 backend02 kernel: [   41.770145] sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Aug  2 16:17:22 backend02 kernel: [   41.770819]  sdc: unknown partition table
Aug  2 16:17:22 backend02 kernel: [   43.137000] ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 23 (level, low) -> IRQ 23
Aug  2 16:17:22 backend02 kernel: [   43.137434]  unknown partition table
Aug  2 16:17:22 backend02 kernel: [   43.137715] sd 6:0:0:0: [sdc] Attached SCSI disk

This includes "unknown partition table", which isn't true, because each LUN contains one partition of type 0xFD (Linux RAID autodetect).

After this come the error messages which, in my opinion, pinpoint the real problem:
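
A side note on the two 64-bit LUN values in the log above: read as big-endian bytes, they are printable ASCII. The sketch below only decodes the numbers; the helper name `lun_as_ascii` is made up for illustration. That the LUN list was filled with INQUIRY-style product text instead of real LUNs is merely one possible reading and is not confirmed anywhere in this report.

```python
def lun_as_ascii(lun):
    """Interpret a 64-bit LUN value as 8 big-endian bytes of ASCII text."""
    raw = lun.to_bytes(8, "big")
    return raw.decode("ascii", errors="replace")


if __name__ == "__main__":
    # The two values reported as "larger than currently supported".
    for value in (0x6561737952414944, 0x5F51313650322020):
        print(hex(value), repr(lun_as_ascii(value)))
```

The two values decode to "easyRAID" and "_Q16P2  ", which together match the product string "easyRAID_Q16P2" printed in the "Direct-Access" line for the same device.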

Aug  2 16:17:22 backend02 kernel: [   73.342434] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [   73.378007] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [   73.378009] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [   73.378143] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [   73.378146] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [   73.378148] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [   73.378508] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [   73.905285] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [   73.989932] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [   74.066044] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [   74.148385] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [   74.233436] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  104.343825] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  104.355877] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  104.355878] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  104.356002] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  104.356004] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  104.356005] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  104.356365] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  104.906661] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  104.991288] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  105.067390] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  105.149731] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  105.234778] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  135.322725] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  135.334773] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  135.334775] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  135.334900] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  135.334903] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  135.334904] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  135.335262] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  135.885560] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  135.970189] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  136.046292] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  136.128630] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  136.213683] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  166.301627] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  166.313676] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  166.313677] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  166.313807] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  166.313810] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  166.313811] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  166.314172] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  166.864460] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  166.949102] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  167.025204] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  167.107544] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  167.192601] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  197.280524] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  197.292576] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  197.292577] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  197.292709] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  197.292711] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  197.292713] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  197.293073] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  197.843362] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  197.928004] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  198.004106] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  198.086446] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  198.171494] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
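
The abort attempts in the log above repeat at a fairly regular interval. The following sketch extracts the kernel timestamps of the "attempting task abort" lines for one controller and prints the gaps between them. The function name `abort_intervals` and the regular expression are illustrative assumptions. A gap of about 31 seconds would be consistent with a 30-second command timeout followed by error handling, but the timeout actually configured on these machines is not stated in this report.

```python
import re

# Captures the kernel timestamp and the controller name of abort attempts.
ABORT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] mptscsih: (ioc\d+): attempting task abort!"
)


def abort_intervals(log_text, ioc):
    """Return the gaps in seconds between consecutive abort attempts on ioc."""
    times = [
        float(match.group(1))
        for match in ABORT_RE.finditer(log_text)
        if match.group(2) == ioc
    ]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


# ioc0 abort lines copied from the log in this report.
SAMPLE = """\
kernel: [   73.342434] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
kernel: [  104.343825] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
kernel: [  135.322725] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
kernel: [  166.301627] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
kernel: [  197.280524] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
"""

if __name__ == "__main__":
    print(abort_intervals(SAMPLE, "ioc0"))  # four gaps, each close to 31 s
```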

After some more tries the driver seems to hand the error to the block layer:

Aug  2 16:17:22 backend02 kernel: [  228.260728] mptscsih: ioc0: attempting task abort! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  228.260731] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.260916] mptscsih: ioc0: task abort: FAILED (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  228.260921] mptscsih: ioc0: attempting target reset! (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  228.260922] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.261439] mptscsih: ioc0: target reset: SUCCESS (sc=ffff88023d7d0e00)
Aug  2 16:17:22 backend02 kernel: [  228.275519] mptscsih: ioc1: attempting task abort! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  228.275521] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.275701] mptscsih: ioc1: task abort: FAILED (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  228.275704] mptscsih: ioc1: attempting target reset! (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  228.275706] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.276073] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88023d78a400)
Aug  2 16:17:22 backend02 kernel: [  228.278697] sd 5:0:0:0: [sdb] Unhandled error code
Aug  2 16:17:22 backend02 kernel: [  228.278699] sd 5:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Aug  2 16:17:22 backend02 kernel: [  228.278701] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.278704] end_request: I/O error, dev sdb, sector 1
Aug  2 16:17:22 backend02 kernel: [  228.278707] Buffer I/O error on device sdb, logical block 1
Aug  2 16:17:22 backend02 kernel: [  228.278709] Buffer I/O error on device sdb, logical block 2
Aug  2 16:17:22 backend02 kernel: [  228.278711] Buffer I/O error on device sdb, logical block 3
Aug  2 16:17:22 backend02 kernel: [  228.278712] Buffer I/O error on device sdb, logical block 4
Aug  2 16:17:22 backend02 kernel: [  228.278713] Buffer I/O error on device sdb, logical block 5
Aug  2 16:17:22 backend02 kernel: [  228.278715] Buffer I/O error on device sdb, logical block 6
Aug  2 16:17:22 backend02 kernel: [  228.278716] Buffer I/O error on device sdb, logical block 7
Aug  2 16:17:22 backend02 kernel: [  228.278720] Buffer I/O error on device sdb, logical block 8
Aug  2 16:17:22 backend02 kernel: [  228.278721] Buffer I/O error on device sdb, logical block 9
Aug  2 16:17:22 backend02 kernel: [  228.278723] Buffer I/O error on device sdb, logical block 10
Aug  2 16:17:22 backend02 kernel: [  228.293595] sd 6:0:0:0: [sdc] Unhandled error code
Aug  2 16:17:22 backend02 kernel: [  228.293596] sd 6:0:0:0: [sdc] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Aug  2 16:17:22 backend02 kernel: [  228.293599] sd 6:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 00 01 00 00 1f 00
Aug  2 16:17:22 backend02 kernel: [  228.293604] end_request: I/O error, dev sdc, sector 1

Well, when it can't read sector one, it also can't read the partition table, so maybe these two are related.
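
The CDB that keeps timing out can be decoded to see which blocks it asks for. The sketch below assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 and 8); the function name `decode_rw10` is made up for illustration.

```python
def decode_rw10(cdb_hex):
    """Decode a READ(10) or WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a 10-byte READ(10)/WRITE(10) CDB")
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": int.from_bytes(cdb[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks requested
    }


if __name__ == "__main__":
    # The CDB repeated throughout the log above.
    print(decode_rw10("28 00 00 00 00 01 00 00 1f 00"))
```

For the CDB in the log this gives a read of 31 blocks starting at LBA 1, which agrees with the "I/O error, dev sdb, sector 1" message.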


On some occasions we see each LUN just once, without any apparent error messages. In those cases, one of the links does not show a remote port. Let me see whether I can find this again. Here it is; usually I see both remote ports:

backend01:/sys/module/mptfc/drivers/pci:mptfc# ls -ld 0000\:07\:00.0/host?/rp*
drwxr-xr-x 5 root root 0 2010-08-09 11:02 0000:07:00.0/host3/rport-3:0-0
backend01:/sys/module/mptfc/drivers/pci:mptfc# ls -ld 0000\:07\:00.1/host?/rp*
drwxr-xr-x 5 root root 0 2010-08-09 11:02 0000:07:00.1/host6/rport-6:0-0

In that case where one RAID array is missing, I see a remote port on just one of the links.

Please tell me whether I should open a separate bug report regarding this issue.
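
The remote port check above could be automated after each boot. The sketch below takes a list of rport paths, such as the ones printed by the `ls -ld` commands, and reports which expected SCSI hosts have no remote port. The function names and the idea of passing the expected host numbers by hand are illustrative assumptions.

```python
import re
from collections import Counter

# rport directory names look like "rport-<host>:<channel>-<number>".
RPORT_RE = re.compile(r"rport-(\d+):(\d+)-(\d+)$")


def rports_per_host(paths):
    """Count remote ports per SCSI host number found in the given paths."""
    counts = Counter()
    for path in paths:
        match = RPORT_RE.search(path)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


def hosts_without_rport(expected_hosts, paths):
    """Return the expected host numbers for which no remote port was seen."""
    seen = rports_per_host(paths)
    return [host for host in expected_hosts if seen.get(host, 0) == 0]


if __name__ == "__main__":
    good_boot = [
        "0000:07:00.0/host3/rport-3:0-0",
        "0000:07:00.1/host6/rport-6:0-0",
    ]
    print(hosts_without_rport([3, 6], good_boot))      # both links have an rport
    print(hosts_without_rport([3, 6], good_boot[:1]))  # host6 is missing its rport
```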


Speeds are currently as follows:

backend01:/sys/module/mptfc/drivers/pci:mptfc# grep "" 0000\:07\:00.?/host?/fc_host/host?/speed
0000:07:00.0/host3/fc_host/host3/speed:1 Gbit
0000:07:00.1/host6/fc_host/host6/speed:2 Gbit
backend01:/sys/module/mptfc/drivers/pci:mptfc# grep "" 0000\:07\:00.?/host?/fc_host/host?/supported_speeds
0000:07:00.0/host3/fc_host/host3/supported_speeds:1 Gbit, 2 Gbit, 4 Gbit
0000:07:00.1/host6/fc_host/host6/supported_speeds:1 Gbit, 2 Gbit, 4 Gbit

backend02:/sys/module/mptfc/drivers/pci:mptfc# grep "" 0000\:07\:00.?/host?/fc_host/host?/speed
0000:07:00.0/host1/fc_host/host1/speed:4 Gbit
0000:07:00.1/host2/fc_host/host2/speed:4 Gbit
backend02:/sys/module/mptfc/drivers/pci:mptfc# grep "" 0000\:07\:00.?/host?/fc_host/host?/supported_speeds
0000:07:00.0/host1/fc_host/host1/supported_speeds:1 Gbit, 2 Gbit, 4 Gbit
0000:07:00.1/host2/fc_host/host2/supported_speeds:1 Gbit, 2 Gbit, 4 Gbit

These are autonegotiated; I don't think that we set any constraints. I do not know why server backend01 has lower speeds.
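
A link that negotiated below the best supported speed can be flagged by comparing the two sysfs values shown above. This is a minimal sketch that assumes the "N Gbit" text format seen in this report; the helper names are made up, and values in any other format (for example an unknown speed) are simply not flagged.

```python
import re


def gbit_values(text):
    """Extract all integer Gbit values from a sysfs speed string."""
    return [int(number) for number in re.findall(r"(\d+)\s*Gbit", text)]


def below_max(speed, supported_speeds):
    """True if the negotiated speed is lower than the best supported speed."""
    current = gbit_values(speed)
    supported = gbit_values(supported_speeds)
    return bool(current and supported) and current[0] < max(supported)


if __name__ == "__main__":
    supported = "1 Gbit, 2 Gbit, 4 Gbit"
    # Values taken from backend01 and backend02 above.
    for host, speed in (("host3", "1 Gbit"), ("host6", "2 Gbit"), ("host1", "4 Gbit")):
        print(host, speed, "below max" if below_max(speed, supported) else "ok")
```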


This is the Fibre Channel hostbus adapter in use:

07:00.0 Fibre Channel [0c04]: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter [1000:0646] (rev 01)
        Subsystem: LSI Logic / Symbios Logic Device [1000:1020]
        Flags: bus master, fast devsel, latency 0, IRQ 33
        I/O ports at 4000 [size=256]
        Memory at ce320000 (64-bit, non-prefetchable) [size=16K]
        Memory at ce300000 (64-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at c0200000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: mptfc
        Kernel modules: mptfc

07:00.1 Fibre Channel [0c04]: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter [1000:0646] (rev 01)
        Subsystem: LSI Logic / Symbios Logic Device [1000:1020]
        Flags: bus master, fast devsel, latency 0, IRQ 31
        I/O ports at 4400 [size=256]
        Memory at ce324000 (64-bit, non-prefetchable) [size=16K]
        Memory at ce310000 (64-bit, non-prefetchable) [size=64K]
        [virtual] Expansion ROM at c0300000 [disabled] [size=1M]
        Capabilities: [50] Power Management version 2
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [98] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [b0] MSI-X: Enable- Mask- TabSize=1
        Capabilities: [100] Advanced Error Reporting <?>
        Kernel driver in use: mptfc
        Kernel modules: mptfc


Regarding issues mentioned in the RedHat bug reports:
- There is no smartd or hddtemp running


I will attach a full lspci -nnvv. Please tell me if you need any further details. Please note that these are production machines, so I can't easily bisect between 2.6.26 and 2.6.32 there. We might be willing to build or backport a newer Debian or upstream kernel for these machines if there is a good chance that it fixes the issue. These are dual quad-core Nehalems, so they should build a kernel package really fast. Due to the cluster nature of the setup we are able to do some limited testing.
Comment 1 Martin Steigerwald 2010-08-09 09:24:05 UTC
Created attachment 27386 [details]
lspci -nnvv of one of the servers
Comment 2 Martin Steigerwald 2010-08-09 09:35:12 UTC
Created attachment 27387 [details]
config for the 2.6.32-5-amd64 debian backport kernel
Comment 3 Martin Steigerwald 2010-08-09 09:45:32 UTC
Some additional information on the MPT driver version and controller:

backend01:~# grep -r "" /proc/mpt/*
/proc/mpt/ioc0/summary:ioc0: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, LanAddr=00:06:[...], IRQ=33
/proc/mpt/ioc0/info:ioc0:
/proc/mpt/ioc0/info:  ProductID = 0x1005 (LSIFC949E A1)
/proc/mpt/ioc0/info:  FWVersion = 0x01030e00 (fw_size=190556)
/proc/mpt/ioc0/info:  MsgVersion = 0x0105
/proc/mpt/ioc0/info:  FirstWhoInit = 0x00
/proc/mpt/ioc0/info:  EventState = 0x00
/proc/mpt/ioc0/info:  CurrentHostMfaHighAddr = 0x00000004
/proc/mpt/ioc0/info:  CurrentSenseBufferHighAddr = 0x00000004
/proc/mpt/ioc0/info:  MaxChainDepth = 0x3e frames
/proc/mpt/ioc0/info:  MinBlockSize = 0x20 bytes
/proc/mpt/ioc0/info:  RequestFrames @ 0xffff88043c102800 (Dma @ 0x000000043c102800)
/proc/mpt/ioc0/info:    {CurReqSz=128} x {CurReqDepth=1023} = 130944 bytes ^= 0x20000
/proc/mpt/ioc0/info:    {MaxReqSz=128}   {MaxReqDepth=1023}
/proc/mpt/ioc0/info:  Frames   @ 0xffff88043c100000 (Dma @ 0x000000043c100000)
/proc/mpt/ioc0/info:    {CurRepSz=80} x {CurRepDepth=128} = 10240 bytes ^= 0x2880
/proc/mpt/ioc0/info:    {MaxRepSz=0}   {MaxRepDepth=1023}
/proc/mpt/ioc0/info:  MaxDevices = 255
/proc/mpt/ioc0/info:  MaxBuses = 2
/proc/mpt/ioc0/info:  PortNumber = 1 (of 1)
/proc/mpt/ioc0/info:    LanAddr = 00:06:[...]
/proc/mpt/ioc0/info:    WWN = 2000[...]
/proc/mpt/ioc1/summary:ioc1: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:11:3B:79, IRQ=31
/proc/mpt/ioc1/info:ioc1:
/proc/mpt/ioc1/info:  ProductID = 0x1005 (LSIFC949E A1)
/proc/mpt/ioc1/info:  FWVersion = 0x01030e00 (fw_size=190556)
/proc/mpt/ioc1/info:  MsgVersion = 0x0105
/proc/mpt/ioc1/info:  FirstWhoInit = 0x00
/proc/mpt/ioc1/info:  EventState = 0x00
/proc/mpt/ioc1/info:  CurrentHostMfaHighAddr = 0x00000004
/proc/mpt/ioc1/info:  CurrentSenseBufferHighAddr = 0x00000004
/proc/mpt/ioc1/info:  MaxChainDepth = 0x3e frames
/proc/mpt/ioc1/info:  MinBlockSize = 0x20 bytes
/proc/mpt/ioc1/info:  RequestFrames @ 0xffff88043c202800 (Dma @ 0x000000043c202800)
/proc/mpt/ioc1/info:    {CurReqSz=128} x {CurReqDepth=1023} = 130944 bytes ^= 0x20000
/proc/mpt/ioc1/info:    {MaxReqSz=128}   {MaxReqDepth=1023}
/proc/mpt/ioc1/info:  Frames   @ 0xffff88043c200000 (Dma @ 0x000000043c200000)
/proc/mpt/ioc1/info:    {CurRepSz=80} x {CurRepDepth=128} = 10240 bytes ^= 0x2880
/proc/mpt/ioc1/info:    {MaxRepSz=0}   {MaxRepDepth=1023}
/proc/mpt/ioc1/info:  MaxDevices = 255
/proc/mpt/ioc1/info:  MaxBuses = 2
/proc/mpt/ioc1/info:  PortNumber = 1 (of 1)
/proc/mpt/ioc1/info:    LanAddr = 00:06:[...]
/proc/mpt/ioc1/info:    WWN = 2000[...]
/proc/mpt/summary:ioc0: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:11:3B:78, IRQ=33
/proc/mpt/summary:ioc1: LSIFC949E A1, FwRev=01030e00h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:11:3B:79, IRQ=31
/proc/mpt/version:mptlinux-3.04.12
/proc/mpt/version:  Fusion MPT base driver
/proc/mpt/version:  Fusion MPT FC host driver
Comment 4 ksb 2010-09-12 11:08:46 UTC
I also have something like that:
[ 4499.860030] mptscsih: ioc0: attempting task abort! (sc=ffff88007a588200)
[ 4499.860036] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dc f8 9f 00 04 00 00
[ 4499.894551] mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptbase_reply
[ 4501.256258] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
[ 4501.268298] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007a588200)
[ 4503.256426] mptbase: ioc0: LogInfo(0x31120403): Originator={PL}, Code={Abort}, SubCode(0x0403) cb_idx mptscsih_io_done
[ 4503.256439] mptscsih: ioc0: attempting task abort! (sc=ffff88007ab5cc00)
[ 4503.256443] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dc fc 9f 00 04 00 00
[ 4503.256455] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007ab5cc00)
[ 4503.506394] mptscsih: ioc0: attempting task abort! (sc=ffff88007a588000)
[ 4503.506399] sd 4:0:0:0: [sda] CDB: Write(10): 2a 00 0f dd 00 9f 00 04 00 00
[ 4503.506412] mptscsih: ioc0: task abort: SUCCESS (sc=ffff88007a588000)
... and so on.
This happens while heavy disk write operations are ongoing.
It is identical on Ubuntu's stock 2.6.32-24 and on custom-built 2.6.35.4 and 2.6.36-rc3 kernels.

cat /proc/mpt/version
mptlinux-3.04.17
  Fusion MPT base driver
  Fusion MPT SAS host driver

cat /proc/mpt/summary
ioc0: LSISAS1064E B2, FwRev=01140000h, Ports=1, MaxQ=511, IRQ=17

cat /proc/mpt/ioc0/info
ioc0:
  ProductID = 0x2204 (LSISAS1064E B2)
  FWVersion = 0x01140000
  MsgVersion = 0x0105
  FirstWhoInit = 0x00
  EventState = 0x00
  CurrentHostMfaHighAddr = 0x00000000
  CurrentSenseBufferHighAddr = 0x00000000
  MaxChainDepth = 0x60 frames
  MinBlockSize = 0x20 bytes
  RequestFrames @ 0xffff88007a502800 (Dma @ 0x000000007a502800)
    {CurReqSz=128} x {CurReqDepth=511} = 65408 bytes ^= 0x10000
    {MaxReqSz=128}   {MaxReqDepth=511}
  Frames   @ 0xffff88007a500000 (Dma @ 0x000000007a500000)
    {CurRepSz=80} x {CurRepDepth=128} = 10240 bytes ^= 0x2880
    {MaxRepSz=0}   {MaxRepDepth=511}
  MaxDevices = 173
  MaxBuses = 1
  PortNumber = 1 (of 1)
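
The LogInfo words in the dmesg excerpt above can be split into fields. The layout used here (bus type in the top 4 bits, originator in the next 4, an 8-bit code, and a 16-bit subcode) is an assumption; it is at least consistent with the SubCode values the driver prints next to each LogInfo. The function name `split_loginfo` is made up for illustration.

```python
def split_loginfo(value):
    """Split a 32-bit LogInfo word into its assumed bit fields."""
    return {
        "bus_type": (value >> 28) & 0xF,    # assumed: top nibble
        "originator": (value >> 24) & 0xF,  # assumed: second nibble
        "code": (value >> 16) & 0xFF,       # assumed: next 8 bits
        "subcode": value & 0xFFFF,          # low 16 bits, printed as SubCode
    }


if __name__ == "__main__":
    for word in (0x31120403, 0x31140000):
        fields = split_loginfo(word)
        print(hex(word), {name: hex(val) for name, val in fields.items()})
```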
Comment 5 kashyap 2010-09-16 06:03:33 UTC
(In reply to comment #3)

Your bug is a completely different issue. The Red Hat Bugzilla entry you point to concerns a SAS controller.

In your case it is an FC controller.

You have mentioned that 
"Latest kernel known to work: 2.6.26 from Debian Backports"

Can you provide the driver version with which things work fine? If there is a working kernel, I would like to upgrade only the MPTFUSION driver on it (not the whole kernel). This way only one component of the system changes at a time.

This will help to understand where things are broken.
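
To report the driver version for a given kernel, the version can be read from the driver banner lines in the kernel log. This is a minimal sketch based on the banner format shown in the original description; the function name `mpt_driver_versions` and the regular expression are illustrative assumptions.

```python
import re

# Matches banners such as "Fusion MPT base driver 3.04.12".
BANNER_RE = re.compile(r"Fusion MPT (.+?) driver (\d+(?:\.\d+)+)")


def mpt_driver_versions(log_text):
    """Map each Fusion MPT component named in the log to its version string."""
    return {
        match.group(1): match.group(2)
        for match in BANNER_RE.finditer(log_text)
    }


# Banner lines copied from the boot log in the original description.
SAMPLE = """\
kernel: [   27.547240] Fusion MPT base driver 3.04.12
kernel: [   27.556900] Fusion MPT FC Host driver 3.04.12
"""

if __name__ == "__main__":
    print(mpt_driver_versions(SAMPLE))
```

Running the same extraction on the boot log of the working 2.6.26 kernel would give the version to compare against.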

FYI,
The MPTFC driver is essentially in maintenance mode. Very few changes have been made to the MPTFC driver since 2008.

The last MPTFC change that went upstream is:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=03cb3829e0e5650518ce37e2b4420a35e034dc9e


Thanks, Kashyap
Comment 6 kashyap 2010-09-16 06:05:26 UTC
(In reply to comment #4)

Your bug is not similar to the first reported bug. Please open a new Bugzilla report.

Your product is an LSI SAS controller, while the first bug was reported for an LSI FC controller.
Thanks, Kashyap
Comment 7 Martin Steigerwald 2010-09-21 08:12:01 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > Some additional information on the MPT driver version and controller:
> > 
> > backend01:~# grep -r "" /proc/mpt/*
> > /proc/mpt/ioc0/summary:ioc0: LSIFC949E A1, FwRev=01030e00h, Ports=1,
> MaxQ=1023,
> > LanAddr=00:06:[...], IRQ=33
> > /proc/mpt/ioc0/info:ioc0:
> > /proc/mpt/ioc0/info:  ProductID = 0x1005 (LSIFC949E A1)
> > /proc/mpt/ioc0/info:  FWVersion = 0x01030e00 (fw_size=190556)
[...]
> Your bug is completely different issue. Whatever you are point to redhat
> bugzilla is with respect to SAS controller.

I thought it might be related nevertheless. I don't know the inner structure of the MPT driver. It also sounded similar, because that bug report also mentions that it worked with 2.6.26 but, AFAIR, not with 2.6.27. Maybe it's a general change in the SCSI layer that triggers the issue.

> In your case it is FC controller.

Yes, I know.

> You have mentioned that 
> "Latest kernel known to work: 2.6.26 from Debian Backports"
> 
> Can you provide me driver version where things are working fine. 

Here is the version from a 2.6.26 lenny kernel, which should be the one that has been backported to Etch:

pasta:~# modinfo /lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptfc.ko 
filename:       /lib/modules/2.6.26-2-amd64/kernel/drivers/message/fusion/mptfc.ko
version:        3.04.06
license:        GPL
description:    Fusion MPT FC Host driver
author:         LSI Corporation
srcversion:     F3D99FE0544BDDD1455BAAA
alias:          pci:v00001657d00000646sv*sd*bc*sc*i*
alias:          pci:v00001000d00000646sv*sd*bc*sc*i*
alias:          pci:v00001000d00000640sv*sd*bc*sc*i*
alias:          pci:v00001000d00000642sv*sd*bc*sc*i*
alias:          pci:v00001000d00000626sv*sd*bc*sc*i*
alias:          pci:v00001000d00000628sv*sd*bc*sc*i*
alias:          pci:v00001000d00000622sv*sd*bc*sc*i*
alias:          pci:v00001000d00000624sv*sd*bc*sc*i*
alias:          pci:v00001000d00000621sv*sd*bc*sc*i*
depends:        mptscsih,scsi_transport_fc,scsi_mod,mptbase
vermagic:       2.6.26-2-amd64 SMP mod_unload modversions 
parm:           mptfc_dev_loss_tmo: Initial time the driver programs the  transport to wait for an rport to  return following a device loss event.  Default=60. (int)
parm:           max_lun: max lun, default=16895  (int)

The 2.6.32 kernel, where we see the described issues, has:

backend01:~# modinfo mptfc
filename:       /lib/modules/2.6.32-bpo.5-amd64/kernel/drivers/message/fusion/mptfc.ko
version:        3.04.12
license:        GPL
description:    Fusion MPT FC Host driver
author:         LSI Corporation
srcversion:     92E350C096B75A9714B8B0E
alias:          pci:v00001657d00000646sv*sd*bc*sc*i*
alias:          pci:v00001000d00000646sv*sd*bc*sc*i*
alias:          pci:v00001000d00000640sv*sd*bc*sc*i*
alias:          pci:v00001000d00000642sv*sd*bc*sc*i*
alias:          pci:v00001000d00000626sv*sd*bc*sc*i*
alias:          pci:v00001000d00000628sv*sd*bc*sc*i*
alias:          pci:v00001000d00000622sv*sd*bc*sc*i*
alias:          pci:v00001000d00000624sv*sd*bc*sc*i*
alias:          pci:v00001000d00000621sv*sd*bc*sc*i*
depends:        mptscsih,mptbase,scsi_transport_fc,scsi_mod
vermagic:       2.6.32-bpo.5-amd64 SMP mod_unload modversions 
parm:           mptfc_dev_loss_tmo: Initial time the driver programs the  transport to wait for an rport to  return following a device loss event.  Default=60. (int)
parm:           max_lun: max lun, default=16895  (int)
backend01:~#
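
For comparing the two listings above field by field, here is a minimal sketch. It assumes each modinfo output was saved to a text file; the file names and the helper name `modinfo_field` are hypothetical.

```shell
#!/bin/sh
# Print the value of one field from saved `modinfo` output read on stdin.
# Usage: modinfo_field FIELD < saved-modinfo.txt
modinfo_field() {
    # Match lines of the form "field:   value" and strip the label and padding.
    # The pattern is anchored, so "version" does not match "srcversion".
    sed -n "s/^$1:[[:space:]]*//p"
}

# Hypothetical file names; each holds the output of `modinfo mptfc` on one host.
for f in modinfo-2.6.26.txt modinfo-2.6.32.txt; do
    [ -r "$f" ] || continue
    printf '%s: version=%s srcversion=%s\n' "$f" \
        "$(modinfo_field version < "$f")" \
        "$(modinfo_field srcversion < "$f")"
done
```

On a live system, `modinfo -F version mptfc` prints the same field directly without the intermediate file.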

> In case of
> some working kernel is there, I would like to simply upgrade MPTFUSION driver
> (do not upgrade a whole kernel). This way I would like to change only one
> component of the system at a time...

Well, the old 2.6.26 kernel worked. But it does not boot on the new servers, because the old version of ata_piix does not talk to the newer onboard SATA controller. Thus it would be necessary to use a newer ata_piix and a newer MPT Fusion FC driver with the 2.6.26 kernel. I don't know whether that's feasible.

It's a production machine and I need to be careful with testing. I can only test with the agreement of the customer. But for a defined test case it might be workable. Would it be as easy as replacing the directories containing the driver source with a newer version? From 2.6.26 to 2.6.32 is quite a step.

> This will help to understand where things are broken.

I understand.

> FYI,
> MPTFC drive is highly in mentionation mode. There are very very minimal
> changes
> happened to MPTFC driver since 2008.
> 
> Last change went to upstream for MPTFC is 
> 
>
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=03cb3829e0e5650518ce37e2b4420a35e034dc9e

I don't think that commit has landed in 2.6.32, since Linus released 2.6.32 on 3rd December 2009. It also does not seem to be in any of the stable patches:

ms@mango:~/Linux/Kernel/Mainline> ls ChangeLog-2.6.32*
ChangeLog-2.6.32     ChangeLog-2.6.32.16  ChangeLog-2.6.32.3
ChangeLog-2.6.32.1   ChangeLog-2.6.32.17  ChangeLog-2.6.32.4
ChangeLog-2.6.32.10  ChangeLog-2.6.32.18  ChangeLog-2.6.32.5
ChangeLog-2.6.32.11  ChangeLog-2.6.32.19  ChangeLog-2.6.32.6
ChangeLog-2.6.32.12  ChangeLog-2.6.32.2   ChangeLog-2.6.32.7
ChangeLog-2.6.32.13  ChangeLog-2.6.32.20  ChangeLog-2.6.32.8
ChangeLog-2.6.32.14  ChangeLog-2.6.32.21  ChangeLog-2.6.32.9
ChangeLog-2.6.32.15  ChangeLog-2.6.32.22
ms@mango:~/Linux/Kernel/Mainline> grep 03cb3829e0e5650518ce37e2b4420a35e034dc9e ChangeLog-2.6.32*
ms@mango:~/Linux/Kernel/Mainline#1>
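
The check above can be wrapped in a small helper. This is a minimal sketch, assuming the ChangeLog files sit in the current directory as in the listing; the function name `commit_in_changelogs` is hypothetical. It only finds a commit whose full id is quoted in a changelog entry.

```shell
#!/bin/sh
# List the changelog files that mention a given commit id.
# Returns non-zero (like grep) when no file mentions it.
# Usage: commit_in_changelogs COMMIT FILE...
commit_in_changelogs() {
    commit=$1
    shift
    # -l: print only the names of matching files; -F: treat the id as a fixed string.
    grep -l -F -- "$commit" "$@"
}

if commit_in_changelogs 03cb3829e0e5650518ce37e2b4420a35e034dc9e ChangeLog-2.6.32* 2>/dev/null; then
    echo "commit is referenced in the files listed above"
else
    echo "commit not found in any ChangeLog-2.6.32* file"
fi
```

grep exits with status 1 when nothing matches, so the helper can be used directly in a condition as shown.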

Thanks,
Martin
Comment 8 kashyap 2010-09-21 13:23:04 UTC
Since the issue is seen on a production system and involves an MPTFC controller, I would recommend that the customer report this issue through the LSI support channel.

Thanks, Kashyap
