Bug 13594 - SMART responses for SATA disks on SAS get interpreted as errors
Status: CLOSED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
Reported: 2009-06-21 17:26 UTC by Steinar H. Gunderson
Modified: 2012-06-08 15:40 UTC
Kernel Version: 2.6.30-rc6
Tree: Mainline
Regression: No


Description Steinar H. Gunderson 2009-06-21 17:26:27 UTC
Hi,

I just bought an LSI SAS3081E-R which I use against a Supermicro backplane to
drive ten Seagate SATA disks (7200.11, 750GB and 1.5TB). I'm using the
standard Linux Fusion MPT device driver (CONFIG_FUSION_SAS) under Linux
2.6.30-rc6. Everything seems to work pretty well, with one exception: When I
use SMART against the drives (say, smartctl -a /dev/sda) the kernel complains
with:

  [  811.091916] sd 0:0:0:0: [sda] Sense Key : Recovered Error [current] [descriptor]
  [  811.099807] Descriptor sense data with sense descriptors (in hex):
  [  811.106175]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
  [  811.113262]         00 4f 00 c2 00 50
  [  811.117379] sd 0:0:0:0: [sda] Add. Sense: ATA pass through information available

I've tried upgrading to the newest firmware (1.28.02.00, 05-MAY-2009), but
all that changed is that the hex dump was added to the error message.

Whenever this happens, it appears like all the disks “hiccup” and the kernel
loses contact with the controller for a small while. If too many of these
happen at once, eventually disks start falling off RAIDs, and the entire
machine goes down. It looks to me as if these messages should simply not be
treated as errors by the kernel -- smartctl explicitly asks for a response
even if the command doesn't fail (by setting CK_COND), so the response
probably shouldn't be taken as an error.
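
For readers who want to see what the hex dump in the kernel log above contains, here is a minimal sketch that decodes it. It assumes the usual descriptor-format sense layout (response code 0x72, additional length in byte 7) and the ATA Status Return descriptor (code 0x09) as I understand it from the SAT drafts; the function and field names are made up for illustration and are not kernel or smartmontools APIs.

```python
# Sense bytes copied verbatim from the kernel log in the description.
SENSE_HEX = (
    "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 "
    "00 4f 00 c2 00 50"
)


def decode_descriptor_sense(raw: bytes) -> dict:
    """Decode descriptor-format sense data; hypothetical helper, assumed layout."""
    if (raw[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    out = {"sense_key": raw[1] & 0x0F, "asc": raw[2], "ascq": raw[3]}
    end = min(8 + raw[7], len(raw))  # byte 7 holds the additional sense length
    pos = 8
    while pos + 2 <= end:
        code, length = raw[pos], raw[pos + 1]
        desc = raw[pos:pos + 2 + length]
        if code == 0x09 and len(desc) >= 14:
            # ATA Status Return descriptor: 16-bit fields are stored high byte first.
            out["ata"] = {
                "extend": desc[2] & 0x01,
                "error": desc[3],
                "count": (desc[4] << 8) | desc[5],
                "lba_low": (desc[6] << 8) | desc[7],
                "lba_mid": (desc[8] << 8) | desc[9],
                "lba_high": (desc[10] << 8) | desc[11],
                "device": desc[12],
                "status": desc[13],
            }
        pos += 2 + length
    return out


decoded = decode_descriptor_sense(bytes.fromhex(SENSE_HEX))
print(decoded)
```

Under those assumptions the dump reads as sense key 0x1 (Recovered Error) with ASC/ASCQ 00/1D, plus a set of returned ATA registers with no error bits set, which fits the reading that the command itself did not fail.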
Comment 1 Anonymous Emailer 2009-06-21 18:47:59 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 17:26 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> Hi,
> 
> I just bought an LSI SAS3081E-R which I use against a Supermicro backplane to
> drive ten Seagate SATA disks (7200.11, 750GB and 1.5TB). I'm using the
> standard Linux Fusion MPT device driver (CONFIG_FUSION_SAS) under Linux
> 2.6.30-rc6. Everything seems to work pretty well, with one exception: When I
> use SMART against the drives (say, smartctl -a /dev/sda) the kernel complains
> with:
> 
>   [  811.091916] sd 0:0:0:0: [sda] Sense Key : Recovered Error [current] [descriptor]
>   [  811.099807] Descriptor sense data with sense descriptors (in hex):
>   [  811.106175]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
>   [  811.113262]         00 4f 00 c2 00 50
>   [  811.117379] sd 0:0:0:0: [sda] Add. Sense: ATA pass through information available

This is a message the kernel prints out on all recovered error returns
(except those marked REQ_QUIET).  It's purely informational and doesn't
affect return processing of the command at all, so the kernel is
actually treating this as a successful completion not an error.

> I've tried upgrading to the newest firmware (1.28.02.00, 05-MAY-2009), but
> all that changed is that the hex dump was added to the error message.
> 
> Whenever this happens, it appears like all the disks “hiccup” and the kernel
> loses contact with the controller for a small while. If too many of these
> happen at once, eventually disks start falling off RAIDs, and the entire
> machine goes down. It looks to me as if these messages should simply not be
> treated as errors by the kernel -- smartctl explicitly asks for a response
> even if the command doesn't fail (by setting CK_COND), so the response
> probably shouldn't be taken as an error.

So this sounds like the bug ... however, for the LSI card, this bug will
be in the SAT layer in the fusion firmware.  I can shut the kernel up by
making the recovered error processing clause look for 01/00/1D as well
as REQ_QUIET, but it won't affect this problem.

James
Comment 2 Anonymous Emailer 2009-06-21 18:55:06 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 13:47 -0500, James Bottomley wrote:
> So this sounds like the bug ... however, for the LSI card, this bug will
> be in the SAT layer in the fusion firmware.  I can shut the kernel up by
> making the recovered error processing clause look for 01/00/1D as well
> as REQ_QUIET, but it won't affect this problem.

Actually quieting the message is trivially easy, try this.

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f3c4089..a0235c9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -774,7 +774,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	 * is what gets returned to the user
 	 */
 	if (sense_valid && sshdr.sense_key == RECOVERED_ERROR) {
-		if (!(req->cmd_flags & REQ_QUIET))
+		if (!(req->cmd_flags & REQ_QUIET) &&
+		    !(sshdr.asc == 0x00 && sshdr.ascq == 0x1d))
 			scsi_print_sense("", cmd);
 		result = 0;
 		/* BLOCK_PC may have set error */
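
To make the effect of the patch above easier to check, here is a small sketch of the patched condition as a stand-alone predicate. It only models the recovered-error clause shown in the diff; the function name and argument list are invented for illustration, and nothing here is kernel code.

```python
RECOVERED_ERROR = 0x01  # SCSI sense key value for Recovered Error


def recovered_error_clause_prints(sense_valid, sense_key, asc, ascq, quiet):
    """Return True when the patched clause would call scsi_print_sense().

    Hypothetical model of the diff: 'quiet' stands in for REQ_QUIET being set.
    """
    if not (sense_valid and sense_key == RECOVERED_ERROR):
        return False  # this clause does not apply; other paths are not modelled
    if quiet:
        return False  # REQ_QUIET requests are never logged here
    if asc == 0x00 and ascq == 0x1D:
        return False  # "ATA pass through information available" is now silenced
    return True
```

Note that in both the original and the patched code the result is still set to 0 for any recovered error, so only the logging changes, not the completion status.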
Comment 3 Steinar H. Gunderson 2009-06-21 18:58:28 UTC
(In reply to comment #1)
> This is a message the kernel prints out on all recovered error returns
> (except those marked REQ_QUIET).  It's purely informational and doesn't
> affect return processing of the command at all, so the kernel is
> actually treating this as a successful completion not an error.

OK.

> So this sounds like the bug ... however, for the LSI card, this bug will
> be in the SAT layer in the fusion firmware.  I can shut the kernel up by
> making the recovered error processing clause look for 01/00/1D as well
> as REQ_QUIET, but it won't affect this problem.

I tried reporting this to the Linux fusionmpt driver people a while ago, but never received any response (thus this bug)... I guess I'm out of luck, then, if there's nothing that can be done for it in the kernel. It's a bit weird, though; one would believe people ran smartd on their systems and discovered this already.

/* Steinar */
Comment 4 Anonymous Emailer 2009-06-21 19:07:13 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 18:58 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> --- Comment #3 from Steinar H. Gunderson <sgunderson@bigfoot.com>  2009-06-21
> 18:58:28 ---
> (In reply to comment #1)
> > This is a message the kernel prints out on all recovered error returns
> > (except those marked REQ_QUIET).  It's purely informational and doesn't
> > affect return processing of the command at all, so the kernel is
> > actually treating this as a successful completion not an error.
> 
> OK.
> 
> > So this sounds like the bug ... however, for the LSI card, this bug will
> > be in the SAT layer in the fusion firmware.  I can shut the kernel up by
> > making the recovered error processing clause look for 01/00/1D as well
> > as REQ_QUIET, but it won't affect this problem.
> 
> I tried reporting this to the Linux fusionmpt driver people a while ago, but
> never received any response (thus this bug)... I guess I'm out of luck,

OK, cc'd LSI people, let's see if I get better luck

>  then,
> if there's nothing that can be done for it in the kernel. It's a bit weird,
> though; one would believe people ran smartd on their systems and discovered
> this already.

I can guess that it's some type of firmware mode problem: either it runs
for SMART or it runs for normal commands, hence the hiatus.  If that's
true, you'd likely only see the problem in a large disk setup ... it
might also be possible to work around by simply quiescing the card
before sending down SMART commands (that would be grossly inefficient,
but at least devices wouldn't get errored).

James
Comment 5 Anonymous Emailer 2009-06-21 20:53:36 UTC
Reply-To: dgilbert@interlog.com

bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #4 from Anonymous Emailer <anonymous@kernel-bugs.osdl.org> 
> 2009-06-21 19:07:13 ---
> Reply-To: James.Bottomley@HansenPartnership.com
> 
> On Sun, 2009-06-21 at 18:58 +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> I can guess that it's some type of firmware mode problem: either it runs
> for SMART or it runs for normal commands, hence the hiatus.  If that's
> true, you'd likely only see the problem in a large disk setup ... it
> might also be possible to work around by simply quiescing the card
> before sending down SMART commands (that would be grossly inefficient,
> but at least devices wouldn't get errored).

I have just replicated the "ATA pass through information
available" message report on a similar vintage LSI
controller and a SATA disk with a recent smartctl
version.

There is no need to report this in the kernel error log,
as the smartmontools ATA pass-through (SCSI) command asked
for the final state of the ATA registers and the sense
buffer is the conduit for that information. That ASC/ASCQ
pair basically means "you asked for them and here they
are". [reference: sat2r07b.pdf section 12.2.5 table 107
when CK_COND is 1]

As for the hiccup, I have noticed that with SAS (SCSI)
disks from Seagate there is a curious sound and a pause
before the response to LOG SENSE SCSI command (the
type the smartmontools uses on SCSI disks).

Another annoyance is that the disk must be ready (i.e.
spun up) before MODE SENSE and LOG SENSE work; haven't
Seagate heard of flash? :-)
SCSI standards permit that (i.e. only
a small number of commands have to work when the disk
is not ready), but you would think accessing metadata,
given the disk has spun up once since power up, could
be accomplished from RAM or flash.

Doug Gilbert
Comment 6 Steinar H. Gunderson 2009-06-21 21:14:37 UTC
On Sun, Jun 21, 2009 at 08:53:37PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I have just replicated the "ATA pass through information
> available" message report on a similar vintage LSI
> controller and a SATA disk with a recent smartctl
> version.
> 
> There is no need to report this in the kernel error log,
> as the smartmontools ATA pass-through (SCSI) command asked
> for the final state of the ATA registers and the sense
> buffer is the conduit for that information. That ASC/ASCQ
> pair basically means "you asked for them and here they
> are". [reference: sat2r07b.pdf section 12.2.5 table 107
> when CK_COND is 1]

OK, this is basically what we agreed on already. I'm not able to
test the given patch right now, though (the machine is a production
machine).

> As for the hiccup, I have noticed that with SAS (SCSI)
> disks from Seagate there is a curious sound and a pause
> before the response to LOG SENSE SCSI command (the
> type the smartmontools uses on SCSI disks).

FWIW, I've used the same disks on SATA controllers with smartctl
without any problems. I'm not entirely sure how to parse your
message, though -- do you imply that the problem is in smartctl?
The disk?

/* Steinar */
Comment 7 Matthew Wilcox 2009-06-22 12:04:31 UTC
On Sun, Jun 21, 2009 at 04:53:29PM -0400, Douglas Gilbert wrote:
> As for the hiccup, I have noticed that with SAS (SCSI)
> disks from Seagate there is a curious sound and a pause
> before the response to LOG SENSE SCSI command (the
> type the smartmontools uses on SCSI disks).
>
> Another annoyance is that the disk must be ready (i.e.
> spun up) before MODE SENSE and LOG SENSE work, haven't
> Seagate heard of flash :-)
> SCSI standards permit that (i.e. only
> a small number of commands have to work when the disk
> is not ready) but you would think accessing metadata
> given the disk has spun up once since power up could
> be accomplished from RAM or flash.

We've experienced similar problems at Intel with an LSI card and Intel
SSDs (SATA, not SAS).  This issue got pushed into the 'investigate later'
category, as we were able to just disable smartd.  I'll try and get some
more information on this later.
Comment 8 Al Tobey 2009-11-21 00:20:30 UTC
I get the same issue on LSI SAS2008 using the mpt2sas driver in 2.6.32-rc5.   It wouldn't be a big deal, but it actually increments /sys/block/$dev/device/ioerr_cnt, which I'd like to use for quick & dirty checks for drives going south (I realize it's not perfect).

This occurs with both smartmontools 5.38-2+lenny1 as shipped with Debian 5 and with a local backport of 5.38+svn2956 from experimental.

Trying smartctl -d scsi returns an outright failure. 

I can also reproduce with sg_sat_identify -c.

~$ sudo sg_sat_identify -c /dev/sg13
~$ dmesg |tail -n 5
sd 4:0:11:0: [sg13] Sense Key : Recovered Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 
        00 00 00 00 00 00 
sd 4:0:11:0: [sg13] Add. Sense: ATA pass through information available

~$ cat /sys/block/sdm/device/ioerr_cnt
0x5

~$ sudo smartctl -d sat -q errorsonly -H /dev/sdm
smartctl 5.39 2009-10-10 r2955 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

~$ cat /sys/block/sdm/device/ioerr_cnt
0x6

~$ cat /sys/class/scsi_host/host4/device_delay
00
~$ cat /sys/class/scsi_host/host4/version_fw
02.00.50.00
~$ cat /sys/class/scsi_host/host4/version_mpi
200.0b
~$ cat /sys/class/scsi_host/host4/version_product 
LSISAS2008
~$ cat /sys/class/scsi_host/host4/version_bios
07.01.01.00

~$ sudo sg_inq /dev/sg12
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TGPS=0  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=74 (0x4a)   Peripheral device type: disk
 Vendor identification: ATA     
 Product identification: WDC WD2002FYPS-0
 Product revision level: 5G04
 Unit serial number:      WD-WCAVY0517841
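
The before/after check on ioerr_cnt shown above can be scripted. This is a minimal sketch, assuming the counter is exposed as a hexadecimal string such as 0x5 at /sys/block/<dev>/device/ioerr_cnt, as in the transcript; the helper names are made up for illustration.

```python
from pathlib import Path


def parse_counter(text: str) -> int:
    """Parse a sysfs counter such as '0x5' into an integer."""
    return int(text.strip(), 16)


def read_ioerr_cnt(dev: str, sysfs_root: str = "/sys/block") -> int:
    """Read ioerr_cnt for a block device name such as 'sdm'."""
    path = Path(sysfs_root) / dev / "device" / "ioerr_cnt"
    return parse_counter(path.read_text())


def errors_added(before: int, after: int) -> int:
    """Number of errors counted between two readings."""
    return after - before
```

A typical use would be to call read_ioerr_cnt("sdm") before and after a single smartctl run and compare the two values; with the readings from the transcript (0x5, then 0x6) the difference is one per SMART query.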
Comment 9 Cláudio Martins 2010-04-03 22:07:47 UTC
Hello,

 I'd like to point out that this bug is still present on kernel version 2.6.34-rc3-00163-g5e11611.

 I'm using a Supermicro enclosure with a SAS backplane and 16 SATA 1.5TB drives (ST31500341AS).

The onboard controller, as reported by lspci:

05:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

At boot time the mptsas kernel driver reports:

scsi4 : ioc0: LSISAS1068E B3, FwRev=011a0000h, Ports=1, MaxQ=478, IRQ=16

Smartmontools is version 5.38-2+lenny1 (v5.38 from Debian Lenny)


While generating I/O on the disks, I can easily make all I/O stall for several minutes and even kick drives out of an MD array by running "smartctl -a /dev/sdX" repeatedly on several drives. During the stall, the kernel logged the following messages:

mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35ec00)
sd 4:0:10:0: [sdk] CDB: Write(10): 2a 00 96 27 78 00 00 04 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35ec00)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Write(10): 2a 00 96 27 7c 00 00 04 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting target reset! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptscsih: ioc0: target reset: FAILED (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting bus reset! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting host reset! (sc=ffff8802b57aa100)
mptbase: ioc0: Initiating recovery
mptscsih: ioc0: host reset: SUCCESS (sc=ffff8802b57aa100)
end_request: I/O error, dev sdb, sector 3903551
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sdb1, disabling device.
raid1: Operation continuing on 1 devices.
end_request: I/O error, dev sda, sector 3903551
md: super_written gets error=-5, uptodate=0
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
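
The command that the aborts in the log above keep pointing at is an ATA pass-through(16) CDB. As a rough aid for reading it, here is a sketch that splits the 16 bytes into fields. The bit positions follow the SAT layout as I understand it (protocol in byte 1, CK_COND/T_DIR/BYTE_BLOCK/T_LENGTH in byte 2); treat them as an assumption, and note that the function name is invented for illustration.

```python
# CDB bytes copied verbatim from the "ATA command pass through(16)" log lines.
CDB_HEX = "85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00"


def decode_ata_pass_through_16(cdb: bytes) -> dict:
    """Split an ATA PASS-THROUGH(16) CDB into fields (assumed SAT layout)."""
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA pass-through(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,
        "extend": cdb[1] & 0x01,
        "ck_cond": (cdb[2] >> 5) & 0x01,
        "t_dir": (cdb[2] >> 3) & 0x01,
        "byte_block": (cdb[2] >> 2) & 0x01,
        "t_length": cdb[2] & 0x03,
        "features": (cdb[3] << 8) | cdb[4],
        "count": (cdb[5] << 8) | cdb[6],
        "lba_low": (cdb[7] << 8) | cdb[8],
        "lba_mid": (cdb[9] << 8) | cdb[10],
        "lba_high": (cdb[11] << 8) | cdb[12],
        "device": cdb[13],
        "command": cdb[14],
        "control": cdb[15],
    }


fields = decode_ata_pass_through_16(bytes.fromhex(CDB_HEX))
print(fields)
```

If the layout assumption holds, this CDB carries ATA command 0xB0 (SMART) with features 0xD5 and the 4F/C2 signature in LBA mid/high, transfers data from the device, and has CK_COND clear, so it is a data-in SMART request rather than one of the register-returning requests discussed earlier in the thread.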


--------------

 I have this hardware available for a few weeks, so I am willing to help with any tests, diagnostic operations, patches or firmware that you might have.

 Any help with this is appreciated, since drives being kicked from MD arrays makes using Smartmontools quite difficult.

 Thanks in advance for your help.

Best regards 

Cláudio
Comment 10 kashyap 2010-04-05 07:49:42 UTC
Claudio,

I tried something similar on my setup and could not reproduce the issue you reported.

We need to find out whether this is specific to certain SATA disks or a generic issue.

Could you please provide the following details?

a) What happens if you use a different model of SATA disk instead of the ones you are using now?
b) I used the steps below to try to reproduce the problem. (Please correct me if I missed anything from your test case.)
	mdadm --create --verbose /dev/md0 --level=raid1 --raid-devices=2 /dev/sdc /dev/sdd
	while true; do smartctl -a /dev/sdX; done
I kept this running for 15 minutes and did not see any issue on my setup.
Is this the correct way to reproduce the issue?

My disks are Seagate ST320000641AS (2TB), FW version CC12.


I suspect this issue may also depend on the end devices.
I need to confirm that with some other combinations of experiments. Could you answer the questions above so that we can move on to the next steps of the investigation?


--Kashyap
Comment 11 AndCycle 2010-04-27 22:29:53 UTC
Oops, someone just posted a more detailed view of this problem on LKML;
I am going to try this.

Date	Mon, 26 Apr 2010 18:11:54 -0500
From	Ryan Kuester
Subject	mptsas hangs caused by ATA pass-through explained

http://lkml.org/lkml/2010/4/26/335
Comment 12 Ken Stailey 2010-05-12 14:09:28 UTC
A utility from LSI is available here:
ftp://ftp.lsil.com/HostAdapterDrivers/linux/lsiutil/ 

Some information from my use of lsiutil:

Board name: LSISAS3442E-R
Board assembly: L3-00120-05E

Current active firmware version is 01172b00 (1.23.43)
Firmware image's version is MPTFW-01.23.43.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.18.05.00 (2008.05.14)
EFI BIOS image's version is 3.05.01.01

Diagnostics -> Display phy counters:
Adapter Phy 1: Link Up
  Invalid DWord Count 2,734
  Running Disparity Error Count 2,757
  Loss of DWord Synch Count 0
  Phy Reset Problem Count 0 

Other information:

$ lspci | grep LSI
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

$ uname -srvm
Linux 2.6.31-21-generic #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010 x86_64

$ strings /lib/modules/2.6.31-21-generic/kernel/drivers/message/fusion/mptsas.ko | grep version=
version=3.04.10
srcversion=4023EA52994688E9AE61982

$ lsb_release -d
Description:    Ubuntu 9.10
Comment 13 Anonymous Emailer 2010-05-12 15:20:09 UTC
Reply-To: dgilbert@interlog.com

The originally reported problem has been fixed. See:
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=91b25002bd58f55207e4662a611a6cded4ef9834

I was told that was scheduled to go in lk 2.6.33

Reading the bugzilla entry, some of the later posts
could be reporting some other LSI related problems.
Anyway, the bug report should be closed.

Doug Gilbert


Comment 14 Steinar H. Gunderson 2010-05-12 17:42:13 UTC
On Wed, May 12, 2010 at 03:20:14PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> The originally reported problem has been fixed. See:
>
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=91b25002bd58f55207e4662a611a6cded4ef9834
> 
> I was told that was scheduled to go in lk 2.6.33
> 
> Reading the bugzilla entry some of the latter posts
> could be reporting some other LSI related problems.
> Anyway, the bug report should be closed.

It actually seems that in 2.6.34-rc6, I can use SMART pretty much with
impunity. Don't know if I'm just luckier now or what happened...

/* Steinar */
Comment 15 Steinar H. Gunderson 2010-05-12 17:42:55 UTC
On Wed, May 12, 2010 at 06:45:33PM +0200, Steinar H. Gunderson wrote:
> It actually seems like that in 2.6.34-rc6, I can use SMART pretty much with
> impunity. Don't know if I'm just luckier now or what happened...

Scratch that; I could use smartctl all I wanted, but installing smartd
promptly floored the entire card (and with it, the machine, since the RAID
went away). dmesg below.

At reboot, I kept seeing the “IOC is in FAULT state” until I got logged in
and killed smartd again.

/* Steinar */

[588630.695020] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588630.702007] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588632.074809] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[588632.084283] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588638.081578] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588638.095332] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588642.090380] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588642.097310] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588642.104177] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588642.110862] mptscsih: ioc0: attempting task abort! (sc=ffff8801c9ffb600)
[588642.117813] sd 0:0:4:0: [sde] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[588642.126382] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801c9ffb600)
[588652.133012] mptscsih: ioc0: attempting task abort! (sc=ffff8801c9ffb600)
[588652.140020] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588652.146909] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801c9ffb600)
[588652.153621] mptscsih: ioc0: attempting target reset! (sc=ffff880182bab200)
[588652.160768] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588652.177222] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588653.575199] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880182bab200)
[588656.583548] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588656.594285] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588663.582006] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588663.588952] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588663.595810] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588663.602509] mptscsih: ioc0: attempting bus reset! (sc=ffff880182bab200)
[588663.609381] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588663.670443] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588665.077991] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880182bab200)
[588668.083821] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588675.085656] sd 0:0:4:0: [sde] Device not ready
[588675.090326] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.097771] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.103969] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.111743] sd 0:0:4:0: [sde] CDB: Write(10): 2a 00 57 54 52 08 00 00 08 00
[588675.119172] end_request: I/O error, dev sde, sector 1465143816
[588675.125343] end_request: I/O error, dev sde, sector 1465143816
[588675.126238] md: super_written gets error=-5, uptodate=0
[588675.126238] raid5: Disk failure on sde6, disabling device.
[588675.126238] raid5: Operation continuing on 5 devices.
[588675.148011] sd 0:0:4:0: [sde] Device not ready
[588675.152712] sd 0:0:4:0: [sde] Device not ready
[588675.152723] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152725] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152727] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152730] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 44 bc ae 88 00 00 80 00
[588675.152733] end_request: I/O error, dev sde, sector 1153216136
[588675.152740] sd 0:0:4:0: [sde] Device not ready
[588675.152741] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152743] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152745] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152747] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 2b 6a 42 70 00 00 18 00
[588675.152751] end_request: I/O error, dev sde, sector 728384112
[588675.152755] sd 0:0:4:0: [sde] Device not ready
[588675.152756] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152758] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152759] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152762] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 45 9c d0 08 00 00 80 00
[588675.152765] end_request: I/O error, dev sde, sector 1167904776
[588675.152769] sd 0:0:4:0: [sde] Device not ready
[588675.152770] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152771] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152773] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152775] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 2c c6 b8 88 00 00 80 00
[588675.152779] end_request: I/O error, dev sde, sector 751220872
[588675.152783] sd 0:0:4:0: [sde] Device not ready
[588675.152784] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152785] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152787] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152789] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 32 fb f1 08 00 00 80 00
[588675.152793] end_request: I/O error, dev sde, sector 855372040
[588675.152796] sd 0:0:4:0: [sde] Device not ready
[588675.152797] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152799] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152801] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152803] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 34 1e a8 88 00 00 80 00
[588675.152806] end_request: I/O error, dev sde, sector 874424456
[588675.152811] sd 0:0:4:0: [sde] Device not ready
[588675.152812] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152813] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152815] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152817] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 3e 4d bd 88 00 00 80 00
[588675.152821] end_request: I/O error, dev sde, sector 1045282184
[588675.152824] sd 0:0:4:0: [sde] Device not ready
[588675.152825] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152827] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152828] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152831] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 18 a3 50 88 00 00 10 00
[588675.152834] end_request: I/O error, dev sde, sector 413356168
[588675.152838] sd 0:0:4:0: [sde] Device not ready
[588675.152839] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152841] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152842] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152845] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 18 a3 50 a0 00 00 68 00
[588675.152848] end_request: I/O error, dev sde, sector 413356192
[588675.152855] sd 0:0:4:0: [sde] Device not ready
[588675.152856] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152857] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152859] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152861] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 36 12 e4 08 00 00 80 00
[588675.152865] end_request: I/O error, dev sde, sector 907207688
[588675.152868] sd 0:0:4:0: [sde] Device not ready
[588675.152869] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152871] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152873] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152875] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 31 8e c0 08 00 00 80 00
[588675.152878] end_request: I/O error, dev sde, sector 831438856
[588675.152882] sd 0:0:4:0: [sde] Device not ready
[588675.152883] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152885] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152886] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152889] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 3d e3 2f 08 00 00 80 00
[588675.152892] end_request: I/O error, dev sde, sector 1038298888
[588675.152910] end_request: I/O error, dev sdj, sector 2930271882
[588675.152912] md: super_written gets error=-5, uptodate=0
[588675.152915] raid5: Disk failure on sdj6, disabling device.
[588675.152915] raid5: Operation continuing on 4 devices.
[588675.158145] end_request: I/O error, dev sdh, sector 2930271882
[588675.158147] md: super_written gets error=-5, uptodate=0
[588675.158149] raid5: Disk failure on sdh6, disabling device.
[588675.158150] raid5: Operation continuing on 3 devices.
[588675.160440] end_request: I/O error, dev sdk, sector 2930271882
[588675.160442] md: super_written gets error=-5, uptodate=0
[588675.160444] raid5: Disk failure on sdk6, disabling device.
[588675.160445] raid5: Operation continuing on 2 devices.
[588675.161965] end_request: I/O error, dev sdg, sector 2930271882
[588675.161967] md: super_written gets error=-5, uptodate=0
[588675.161969] raid5: Disk failure on sdg6, disabling device.
[588675.161970] raid5: Operation continuing on 1 devices.
[588675.168925] end_request: I/O error, dev sdi, sector 2930271882
[588675.168927] md: super_written gets error=-5, uptodate=0
[588675.168929] raid5: Disk failure on sdi6, disabling device.
[588675.168930] raid5: Operation continuing on 0 devices.
[588675.168948] RAID5 conf printout:
[588675.168950]  --- rd:5 wd:0
[588675.168951]  disk 0, o:0, dev:sdg6
[588675.168952]  disk 1, o:0, dev:sdh6
[588675.168953]  disk 2, o:0, dev:sdi6
[588675.168955]  disk 3, o:0, dev:sdj6
[588675.168956]  disk 4, o:0, dev:sdk6
[588675.758839] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.766304] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.772415] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.780146] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 00 5a cc 98 00 00 08 00
[588675.787487] end_request: I/O error, dev sde, sector 5950616
[588675.793314] sd 0:0:4:0: [sde] Device not ready
[588675.797984] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.805453] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.811572] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.819312] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 00 5a cc b0 00 00 08 00
[588675.826713] end_request: I/O error, dev sde, sector 5950640
[588675.838005] RAID5 conf printout:
[588675.841422]  --- rd:5 wd:0
[588675.844323]  disk 1, o:0, dev:sdh6
[588675.847913]  disk 2, o:0, dev:sdi6
[588675.851516]  disk 3, o:0, dev:sdj6
[588675.855106]  disk 4, o:0, dev:sdk6
[588675.858697] RAID5 conf printout:
[588675.862139]  --- rd:5 wd:0
[588675.862154] RAID5 conf printout:
[588675.862155]  --- rd:6 wd:5
[588675.862157]  disk 0, o:1, dev:sda6
[588675.862158]  disk 1, o:1, dev:sdf6
[588675.862160]  disk 2, o:0, dev:sde6
[588675.862162]  disk 3, o:1, dev:sdc6
[588675.862163]  disk 4, o:1, dev:sdb1
[588675.862164]  disk 5, o:1, dev:sdd1
[588675.892893]  disk 1, o:0, dev:sdh6
[588675.896490]  disk 2, o:0, dev:sdi6
[588675.900086]  disk 3, o:0, dev:sdj6
[588675.903674]  disk 4, o:0, dev:sdk6
[588675.912005] RAID5 conf printout:
[588675.915438]  --- rd:5 wd:0
[588675.918339]  disk 1, o:0, dev:sdh6
[588675.919256] RAID5 conf printout:
[588675.919258]  --- rd:6 wd:5
[588675.919259]  disk 0, o:1, dev:sda6
[588675.919261]  disk 1, o:1, dev:sdf6
[588675.919262]  disk 3, o:1, dev:sdc6
[588675.919263]  disk 4, o:1, dev:sdb1
[588675.919264]  disk 5, o:1, dev:sdd1
[588675.946353]  disk 2, o:0, dev:sdi6
[588675.949956]  disk 3, o:0, dev:sdj6
[588675.953553] RAID5 conf printout:
[588675.956990]  --- rd:5 wd:0
[588675.959890]  disk 1, o:0, dev:sdh6
[588675.963502]  disk 2, o:0, dev:sdi6
[588675.967103]  disk 3, o:0, dev:sdj6
[588675.974006] RAID5 conf printout:
[588675.977431]  --- rd:5 wd:0
[588675.980380]  disk 1, o:0, dev:sdh6
[588675.984017]  disk 2, o:0, dev:sdi6
[588675.987633] RAID5 conf printout:
[588675.991069]  --- rd:5 wd:0
[588675.993989]  disk 1, o:0, dev:sdh6
[588675.997600]  disk 2, o:0, dev:sdi6
[588676.006006] RAID5 conf printout:
[588676.009465]  --- rd:5 wd:0
[588676.012375]  disk 1, o:0, dev:sdh6
[588676.015985] RAID5 conf printout:
[588676.019473]  --- rd:5 wd:0
[588676.022367]  disk 1, o:0, dev:sdh6
[588676.030005] RAID5 conf printout:
[588676.033439]  --- rd:5 wd:0
[588676.036350] Buffer I/O error on device dm-15, logical block 307593216
[588676.043081] lost page write due to I/O error on dm-15
[588676.751012] ttyS0: 1 input overrun(s)
[588679.821212] ttyS0: 1 input overrun(s)
[588702.915013] mptbase: ioc0: WARNING - IOC is in FAULT state (7827h)!!!
[588702.921701] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
[588702.929579] mptbase: ioc0: Initiating recovery
[588702.934245] mptbase: ioc0: WARNING - IOC is in FAULT state!!!
[588702.940210] mptbase: ioc0: WARNING -            FAULT code = 7827h
[588706.051011] mptbase: ioc0: Recovered from IOC FAULT
[588717.036031] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
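For anyone reading the log above, the CDB bytes can be decoded to see what was actually sent. The sketch below is only an illustration: the helper names are made up, and the field offsets are assumed from the standard READ(10)/WRITE(10) and ATA PASS-THROUGH (16) layouts rather than taken from this report.

```python
# Sketch: decode the CDB hex dumps that the sd driver prints in the log.
# Helper names are hypothetical; byte offsets are assumed from the usual
# READ(10)/WRITE(10) and ATA PASS-THROUGH (16) CDB layouts.

def parse_cdb(hex_string):
    """Turn a space-separated dump such as '2a 00 57 ...' into bytes."""
    return bytes.fromhex(hex_string)


def rw10_lba_and_length(cdb):
    """Return (lba, block_count) for a READ(10) (0x28) or WRITE(10) (0x2a)."""
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


def ata_pass_through16(cdb):
    """Return the low-order ATA register values carried in a 16-byte CDB."""
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,  # bits 4:1 of byte 1
        "ck_cond": bool(cdb[2] & 0x20),    # bit 5 of byte 2
        "features": cdb[4],
        "sector_count": cdb[6],
        "lba_low": cdb[8],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "device": cdb[13],
        "command": cdb[14],
    }


if __name__ == "__main__":
    print(rw10_lba_and_length(parse_cdb("2a 00 57 54 52 08 00 00 08 00")))
    print(ata_pass_through16(
        parse_cdb("85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00")))
```

Decoded this way, the Write(10) LBA matches the sector in the following end_request line, and the pass-through CDB carries ATA command B0h (SMART) with feature D5h and the 4Fh/C2h values in LBA mid/high.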
Comment 16 Ken Stailey 2010-05-18 15:04:15 UTC
If this bug report is to be closed on the grounds that it only covers suppressing some log messages, can anyone post the IDs of any bug reports that are for the "real" LSI MPT driver issues?
Comment 18 pipa.tk 2010-10-29 03:30:34 UTC
I also use an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS controller with Seagate ST31500341AS 1.5TB hard disks.

I found that the ST31500341AS has a firmware issue: http://www.avsforum.com/avs-vb/showthread.php?t=1080005. So I checked /var/log/messages and lsscsi. There are two firmware versions in the server, and all the sdX error messages logged are for drives with version SD17. The SD17 firmware should be upgraded to SD1B, or it will randomly hang I/O for almost half a minute.

Oct 29 08:27:21 XEN-ST-27 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff8801e5465840)
Oct 29 08:27:21 XEN-ST-27 kernel: sd 4:0:3:0:
Oct 29 08:27:21 XEN-ST-27 kernel:         command: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
Oct 29 08:27:23 XEN-ST-27 kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Oct 29 08:27:23 XEN-ST-27 kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801e5465840)

[4:0:0:0]    disk    ATA      ST31500341AS     SD17  /dev/sda
[4:0:1:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdb
[4:0:2:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdc
[4:0:3:0]    disk    ATA      ST31500341AS     SD17  /dev/sdd
[4:0:4:0]    disk    ATA      ST31500341AS     CC1H  /dev/sde
[4:0:5:0]    disk    ATA      ST31500341AS     SD17  /dev/sdf
[4:0:6:0]    disk    ATA      ST31500341AS     SD17  /dev/sdg
[4:0:7:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdh
[4:0:8:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdi
[4:0:9:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdj
[4:0:10:0]   disk    ATA      ST31500341AS     CC1H  /dev/sdk
[4:0:11:0]   disk    ATA      ST31500341AS     CC1H  /dev/sdl
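With many drives it is easier to pick out the affected firmware automatically. The following is a rough sketch, not a tested tool: the function name is made up, and it assumes the six-column lsscsi layout shown above, with the revision and device node as the last two fields.

```python
# Sketch: list device nodes whose firmware revision matches a given string.
# Assumes lsscsi lines like the ones above, where the last two
# whitespace-separated fields are the revision and the device node.

SAMPLE = """\
[4:0:0:0]    disk    ATA      ST31500341AS     SD17  /dev/sda
[4:0:1:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdb
[4:0:2:0]    disk    ATA      ST31500341AS     CC1H  /dev/sdc
[4:0:3:0]    disk    ATA      ST31500341AS     SD17  /dev/sdd
"""


def drives_with_firmware(lsscsi_text, revision):
    """Return the device nodes of all listed drives with this revision."""
    matches = []
    for line in lsscsi_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or unexpected lines
        rev, device = fields[-2], fields[-1]
        if rev == revision:
            matches.append(device)
    return matches


if __name__ == "__main__":
    for dev in drives_with_firmware(SAMPLE, "SD17"):
        print(dev)
```

In practice the text would come from running lsscsi on the server instead of the embedded sample.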

I am suffering I/O hangs on many Xen servers. I applied the patch http://lkml.org/lkml/2010/4/26/335 to 2.6.18-xen with mpt version mptlinux-3.04.01, and "task abort" still shows up in dmesg. But on that kernel smartctl -a does not trigger the error even without the patch. So I think the heavy I/O hang issue may be caused by both the Seagate firmware and an ATA pass-through bug in the kernel.

I didn't find the ATA pass-through issue in 2.6.18-xen and 2.6.16-xen, but 2.6.29, 2.6.31 and 2.6.32 have it. It can be reproduced easily by running "while true; do smartctl -a /dev/sdd > /dev/null; done". It still happens with the patch http://lkml.org/lkml/2010/4/26/335 applied, and with every mpt fusion driver I could find, from 3.04.01 to the latest LSI version 4.0.22.
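For a test that stops by itself, a bounded version of that loop could look like the sketch below. The function name and the injectable run parameter are only for illustration. Note that the kernel errors show up in dmesg, not in the smartctl exit status (which is a bitmask and can be nonzero for unrelated reasons), so the log still has to be checked afterwards.

```python
# Sketch: run "smartctl -a <device>" a fixed number of times.
# stress_smart and its "run" parameter are hypothetical names; the default
# runner needs smartctl installed and enough privileges to query the disk.
import subprocess


def stress_smart(device, iterations, run=None):
    """Issue smartctl -a repeatedly and return the exit status of each run."""
    if run is None:
        def run(argv):
            # Discard the report, like "> /dev/null" in the shell loop.
            result = subprocess.run(argv, stdout=subprocess.DEVNULL)
            return result.returncode
    statuses = []
    for _ in range(iterations):
        statuses.append(run(["smartctl", "-a", device]))
    return statuses


if __name__ == "__main__":
    # Example: 100 queries against the disk used in the comment above.
    print(stress_smart("/dev/sdd", 100))
```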

Finally I tested 2.6.36, and the ATA issue seems to be solved there. But it doesn't support Xen dom0, so I can't test this kernel on a production server. I'm trying to reproduce the I/O hang issue in the lab, and will upgrade the Seagate firmware to verify it.

Related bug: https://bugzilla.kernel.org/show_bug.cgi?id=18652
