Hi,

I just bought an LSI SAS3081E-R, which I use against a Supermicro backplane to drive ten Seagate SATA disks (7200.11, 750GB and 1.5TB). I'm using the standard Linux Fusion MPT device driver (CONFIG_FUSION_SAS) under Linux 2.6.30-rc6. Everything seems to work pretty well, with one exception: when I use SMART against the drives (say, smartctl -a /dev/sda), the kernel complains with:

[ 811.091916] sd 0:0:0:0: [sda] Sense Key : Recovered Error [current] [descriptor]
[ 811.099807] Descriptor sense data with sense descriptors (in hex):
[ 811.106175]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
[ 811.113262]         00 4f 00 c2 00 50
[ 811.117379] sd 0:0:0:0: [sda] Add. Sense: ATA pass through information available

I've tried upgrading to the newest firmware (1.28.02.00, 05-MAY-2009), but all that changed is that the hex dump was added to the error message.

Whenever this happens, all the disks appear to “hiccup” and the kernel loses contact with the controller for a short while. If too many of these happen at once, disks eventually start falling out of RAID arrays, and the entire machine goes down. It looks to me as if these messages should simply not be treated as errors by the kernel -- smartctl explicitly asks for a response even if the command doesn't fail (by setting CK_COND), so the response probably shouldn't be taken as an error.
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 17:26 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13594
>
>            Summary: SMART responses for SATA disks on SAS get interpreted as errors
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.30-rc6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SCSI
>         AssignedTo: linux-scsi@vger.kernel.org
>         ReportedBy: sgunderson@bigfoot.com
>         Regression: No
>
> [ 811.091916] sd 0:0:0:0: [sda] Sense Key : Recovered Error [current] [descriptor]
> [ 811.099807] Descriptor sense data with sense descriptors (in hex):
> [ 811.106175]         72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
> [ 811.113262]         00 4f 00 c2 00 50
> [ 811.117379] sd 0:0:0:0: [sda] Add. Sense: ATA pass through information available

This is a message the kernel prints out on all recovered error returns (except those marked REQ_QUIET). It's purely informational and doesn't affect return processing of the command at all, so the kernel is actually treating this as a successful completion not an error.

> Whenever this happens, it appears like all the disks “hiccup” and the kernel
> loses contact with the controller for a small while. If too many of these
> happen at once, eventually disks start falling off RAIDs, and the entire
> machine goes down. It looks to me as if these messages should simply not be
> treated as errors by the kernel -- smartctl explicitly asks for a response even
> if the command doesn't fail (by setting CK_COND), so the response probably
> shouldn't be taken as an error.

So this sounds like the bug ... however, for the LSI card, this bug will be in the SAT layer in the fusion firmware. I can shut the kernel up by making the recovered error processing clause look for 01/00/1D as well as REQ_QUIET, but it won't affect this problem.

James
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 13:47 -0500, James Bottomley wrote:
> So this sounds like the bug ... however, for the LSI card, this bug will
> be in the SAT layer in the fusion firmware. I can shut the kernel up by
> making the recovered error processing clause look for 01/00/1D as well
> as REQ_QUIET, but it won't affect this problem.

Actually quieting the message is trivially easy, try this.

James

---

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f3c4089..a0235c9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -774,7 +774,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		 * is what gets returned to the user
 		 */
 		if (sense_valid && sshdr.sense_key == RECOVERED_ERROR) {
-			if (!(req->cmd_flags & REQ_QUIET))
+			if (!(req->cmd_flags & REQ_QUIET) &&
+			    !(sshdr.asc == 0x00 && sshdr.ascq == 0x1d))
 				scsi_print_sense("", cmd);
 			result = 0;
 			/* BLOCK_PC may have set error */
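To make the effect of the changed condition easier to see, here is a small userspace model of it in Python. This is only a sketch: the function name is made up, the `quiet` argument stands in for `req->cmd_flags & REQ_QUIET` being set, and the 17/01 pair is just an arbitrary "other" ASC/ASCQ value for comparison. It models the RECOVERED_ERROR branch only and says nothing about the rest of scsi_io_completion().

```python
def would_print_sense(quiet, asc, ascq):
    """Model of the patched condition for a RECOVERED_ERROR completion.

    quiet: True when the request carries REQ_QUIET (assumption: modelled
           here as a plain boolean rather than a flag word).
    asc, ascq: additional sense code and qualifier from the sense header.
    """
    # 00/1D is "ATA pass through information available".
    ata_pass_through_info = (asc == 0x00 and ascq == 0x1D)
    return (not quiet) and (not ata_pass_through_info)


if __name__ == "__main__":
    cases = [
        ("SMART via SAT (00/1d)", False, 0x00, 0x1D),
        ("some other recovered error (17/01)", False, 0x17, 0x01),
        ("quiet request (17/01)", True, 0x17, 0x01),
    ]
    for label, quiet, asc, ascq in cases:
        action = "print" if would_print_sense(quiet, asc, ascq) else "silent"
        print(label, "->", action)
```

In every case the kernel still sets result = 0, so the sketch only covers whether the sense data is logged, not how the command completes.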
(In reply to comment #1)
> This is a message the kernel prints out on all recovered error returns
> (except those marked REQ_QUIET). It's purely informational and doesn't
> affect return processing of the command at all, so the kernel is
> actually treating this as a successful completion not an error.

OK.

> So this sounds like the bug ... however, for the LSI card, this bug will
> be in the SAT layer in the fusion firmware. I can shut the kernel up by
> making the recovered error processing clause look for 01/00/1D as well
> as REQ_QUIET, but it won't affect this problem.

I tried reporting this to the Linux fusionmpt driver people a while ago, but never received any response (thus this bug)... I guess I'm out of luck, then, if there's nothing that can be done for it in the kernel. It's a bit weird, though; one would believe people ran smartd on their systems and discovered this already.

/* Steinar */
Reply-To: James.Bottomley@HansenPartnership.com

On Sun, 2009-06-21 at 18:58 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13594
>
> --- Comment #3 from Steinar H. Gunderson <sgunderson@bigfoot.com>  2009-06-21 18:58:28 ---
> I tried reporting this to the Linux fusionmpt driver people a while ago, but
> never received any response (thus this bug)... I guess I'm out of luck,

OK, cc'd LSI people, let's see if I get better luck

> then,
> if there's nothing that can be done for it in the kernel. It's a bit weird,
> though; one would believe people ran smartd on their systems and discovered
> this already.

I can guess that it's some type of firmware mode problem: either it runs for SMART or it runs for normal commands, hence the hiatus. If that's true, you'd likely only see the problem in a large disk setup ... it might also be possible to work around by simply quiescing the card before sending down SMART commands (that would be grossly inefficient, but at least devices wouldn't get errored).

James
Reply-To: dgilbert@interlog.com

bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=13594
>
> --- Comment #4 from Anonymous Emailer <anonymous@kernel-bugs.osdl.org>  2009-06-21 19:07:13 ---
> Reply-To: James.Bottomley@HansenPartnership.com
>
> I can guess that it's some type of firmware mode problem: either it runs
> for SMART or it runs for normal commands, hence the hiatus. If that's
> true, you'd likely only see the problem in a large disk setup ... it
> might also be possible to work around by simply quiescing the card
> before sending down SMART commands (that would be grossly inefficient,
> but at least devices wouldn't get errored).

I have just replicated the "ATA pass through information available" message report on a similar vintage LSI controller and a SATA disk with a recent smartctl version.

There is no need to report this in the kernel error log, as the smartmontools ATA pass-through (SCSI) command asked for the final state of the ATA registers and the sense buffer is the conduit for that information. That ASC/ASCQ pair basically means "you asked for them and here they are". [reference: sat2r07b.pdf section 12.2.5 table 107 when CK_COND is 1]

As for the hiccup, I have noticed that with SAS (SCSI) disks from Seagate there is a curious sound and a pause before the response to LOG SENSE SCSI command (the type the smartmontools uses on SCSI disks).

Another annoyance is that the disk must be ready (i.e. spun up) before MODE SENSE and LOG SENSE work, haven't Seagate heard of flash :-) SCSI standards permit that (i.e. only a small number of commands have to work when the disk is not ready) but you would think accessing metadata given the disk has spun up once since power up could be accomplished from RAM or flash.

Doug Gilbert
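To illustrate the "you asked for them and here they are" point, the 22 sense bytes from the original report can be decoded by hand. The following is a minimal Python sketch, not code from the kernel or smartmontools: the function name and dictionary keys are made up for this example, and the offsets assume the descriptor-format sense header followed by a single ATA Status Return descriptor (type 0x09, additional length 0x0c) as described in the SAT drafts.

```python
# Sense bytes copied from the kernel log in the original report.
SENSE_HEX = "72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50"


def decode_ata_return_sense(hex_dump):
    """Split descriptor-format sense data into header fields and ATA registers.

    Hypothetical helper for illustration only; it handles exactly one
    ATA Status Return descriptor placed directly after the 8-byte header.
    """
    raw = bytes.fromhex(hex_dump.replace(" ", ""))
    if raw[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    header = {
        "sense_key": raw[1] & 0x0F,
        "asc": raw[2],
        "ascq": raw[3],
        "additional_length": raw[7],
    }
    desc = raw[8:8 + header["additional_length"]]
    if len(desc) < 14 or desc[0] != 0x09 or desc[1] != 0x0C:
        raise ValueError("no ATA Status Return descriptor found")
    registers = {
        "extend": desc[2] & 0x01,
        "error": desc[3],
        "sector_count": desc[5],
        "lba_low": desc[7],
        "lba_mid": desc[9],
        "lba_high": desc[11],
        "device": desc[12],
        "status": desc[13],
    }
    return header, registers


if __name__ == "__main__":
    header, regs = decode_ata_return_sense(SENSE_HEX)
    print("sense key %#x, asc/ascq %02x/%02x"
          % (header["sense_key"], header["asc"], header["ascq"]))
    for name, value in regs.items():
        print("%-12s 0x%02x" % (name, value))
```

Under those layout assumptions, the dump decodes to sense key 1 (Recovered Error) with ASC/ASCQ 00/1D, an ATA error register of 0x00, status 0x50 and LBA mid/high of 0x4f/0xc2, which is consistent with a command that completed and merely returned its registers.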
On Sun, Jun 21, 2009 at 08:53:37PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I have just replicated the "ATA pass through information
> available" message report on a similar vintage LSI
> controller and a SATA disk with a recent smartctl
> version.
>
> There is no need to report this in the kernel error log,
> as the smartmontools ATA pass-through (SCSI) command asked
> for the final state of the ATA registers and the sense
> buffer is the conduit for that information. That ASC/ASCQ
> pair basically means "you asked for them and here they
> are". [reference: sat2r07b.pdf section 12.2.5 table 107
> when CK_COND is 1]

OK, this is basically what we agreed on already. I'm not able to test the given patch right now, though (the machine is a production machine).

> As for the hiccup, I have noticed that with SAS (SCSI)
> disks from Seagate there is a curious sound and a pause
> before the response to LOG SENSE SCSI command (the
> type the smartmontools uses on SCSI disks).

FWIW, I've used the same disks on SATA controllers with smartctl without any problems. I'm not entirely sure how to parse your message, though -- do you imply that the problem is in smartctl? The disk?

/* Steinar */
On Sun, Jun 21, 2009 at 04:53:29PM -0400, Douglas Gilbert wrote:
> As for the hiccup, I have noticed that with SAS (SCSI)
> disks from Seagate there is a curious sound and a pause
> before the response to LOG SENSE SCSI command (the
> type the smartmontools uses on SCSI disks).
>
> Another annoyance is that the disk must be ready (i.e.
> spun up) before MODE SENSE and LOG SENSE work, haven't
> Seagate heard of flash :-)
> SCSI standards permit that (i.e. only
> a small number of commands have to work when the disk
> is not ready) but you would think accessing metadata
> given the disk has spun up once since power up could
> be accomplished from RAM or flash.

We've experienced similar problems at Intel with an LSI card and Intel SSDs (SATA, not SAS). This issue got pushed into the 'investigate later' category, as we were able to just disable smartd. I'll try and get some more information on this later.
I get the same issue on LSI SAS2008 using the mpt2sas driver in 2.6.32-rc5.

It wouldn't be a big deal, but it actually increments /sys/block/$dev/device/ioerr_cnt, which I'd like to use for quick & dirty checks for drives going south (I realize it's not perfect).

This occurs with both smartmontools 5.38-2+lenny1 as shipped with Debian 5 and with a local backport of 5.38+svn2956 from experimental. Trying smartctl -d scsi returns an outright failure.

I can also reproduce with sg_sat_identify -c.

~$ sudo sg_sat_identify -c /dev/sg13
~$ dmesg |tail -n 5
sd 4:0:11:0: [sg13] Sense Key : Recovered Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
        72 01 00 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
        00 00 00 00 00 00
sd 4:0:11:0: [sg13] Add. Sense: ATA pass through information available

~$ cat /sys/block/sdm/device/ioerr_cnt
0x5
~$ sudo smartctl -d sat -q errorsonly -H /dev/sdm
smartctl 5.39 2009-10-10 r2955 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

~$ cat /sys/block/sdm/device/ioerr_cnt
0x6

~$ cat /sys/class/scsi_host/host4/device_delay
00
~$ cat /sys/class/scsi_host/host4/version_fw
02.00.50.00
~$ cat /sys/class/scsi_host/host4/version_mpi
200.0b
~$ cat /sys/class/scsi_host/host4/version_product
LSISAS2008
~$ cat /sys/class/scsi_host/host4/version_bios
07.01.01.00

~$ sudo sg_inq /dev/sg12
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TGPS=0  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=74 (0x4a)   Peripheral device type: disk
 Vendor identification: ATA
 Product identification: WDC WD2002FYPS-0
 Product revision level: 5G04
 Unit serial number: WD-WCAVY0517841
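Since the concern here is that ioerr_cnt moves although nothing actually failed, one way to quantify it is to sample the counter before and after a SMART query. The sketch below is illustrative only: the function names are hypothetical, the sysfs path is the one used in the transcript, and it assumes the file holds a single hexadecimal value such as 0x5.

```python
from pathlib import Path


def parse_ioerr_cnt(text):
    """Parse the contents of an ioerr_cnt file (e.g. '0x5\\n') into an int."""
    return int(text.strip(), 16)


def read_ioerr_cnt(dev, sysfs_root="/sys/block"):
    """Read <sysfs_root>/<dev>/device/ioerr_cnt, the path used in the transcript."""
    path = Path(sysfs_root) / dev / "device" / "ioerr_cnt"
    return parse_ioerr_cnt(path.read_text())


def ioerr_delta(before_text, after_text):
    """Number of errors counted between two samples of the file."""
    return parse_ioerr_cnt(after_text) - parse_ioerr_cnt(before_text)


if __name__ == "__main__":
    # Samples from the transcript: 0x5 before "smartctl -H", 0x6 after it.
    print("ioerr_cnt increased by", ioerr_delta("0x5\n", "0x6\n"))
```

A monitoring script built on this would have to discount increments that coincide with its own SMART polling, which is exactly why the counter is less useful as a quick health check on these controllers.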
Hello,

I'd like to point out that this bug is still present on kernel version 2.6.34-rc3-00163-g5e11611.

I'm using a Supermicro enclosure with a SAS backplane and 16 SATA 1.5TB drives (ST31500341AS).

The onboard controller, as reported by lspci:
05:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

At boot time the mptsas kernel driver reports:
scsi4 : ioc0: LSISAS1068E B3, FwRev=011a0000h, Ports=1, MaxQ=478, IRQ=16

Smartmontools is version 5.38-2+lenny1 (v5.38 from Debian Lenny)

While generating I/O in the disks, I can easily make all I/O stall for several minutes and even kick drives out of an MD Array by running "smartctl -a /dev/sdX" repeatedly on several drives.

During the stall, the kernel logged the following messages:

mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35ec00)
sd 4:0:10:0: [sdk] CDB: Write(10): 2a 00 96 27 78 00 00 04 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35ec00)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Write(10): 2a 00 96 27 7c 00 00 04 00 00
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting target reset! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptscsih: ioc0: target reset: FAILED (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting bus reset! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
mptscsih: ioc0: bus reset: SUCCESS (sc=ffff8802b57aa100)
mptscsih: ioc0: attempting task abort! (sc=ffff8802b57aa100)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802b57aa100)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptbase: ioc0: LogInfo(0x31123000): Originator={PL}, Code={Abort}, SubCode(0x3000)
mptscsih: ioc0: attempting task abort! (sc=ffff8802be35eb00)
sd 4:0:10:0: [sdk] CDB: Test Unit Ready: 00 00 00 00 00 00
mptbase: ioc0: LogInfo(0x31130000): Originator={PL}, Code={IO Not Yet Executed}, SubCode(0x0000)
mptscsih: ioc0: task abort: SUCCESS (sc=ffff8802be35eb00)
mptscsih: ioc0: attempting host reset! (sc=ffff8802b57aa100)
mptbase: ioc0: Initiating recovery
mptscsih: ioc0: host reset: SUCCESS (sc=ffff8802b57aa100)
end_request: I/O error, dev sdb, sector 3903551
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sdb1, disabling device.
raid1: Operation continuing on 1 devices.
end_request: I/O error, dev sda, sector 3903551
md: super_written gets error=-5, uptodate=0
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
--------------

I have this hardware available for a few weeks, so I am willing to help with any tests, diagnostic operations, patches or firmware, that you might have.

Any help with this is appreciated, since the fact that drives are being kicked from MD arrays, makes Smartmontools use quite difficult.

Thanks in advance for your help.

Best regards
Cláudio
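For readers who want to know which command the controller is timing out on, the "ATA command pass through(16)" CDB in the log can be unpacked field by field. This is a rough Python sketch rather than anything from smartmontools or the driver: the function name and the two small lookup tables are made up for illustration, and the bit positions assume the ATA PASS-THROUGH (16) layout from the SAT drafts.

```python
# CDB copied from the "ATA command pass through(16)" lines in the log above.
CDB_HEX = "85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00"

# Only the few values needed for this example, not complete tables.
PROTOCOLS = {3: "Non-data", 4: "PIO Data-In", 5: "PIO Data-Out"}
SMART_FEATURES = {0xD0: "READ DATA", 0xD5: "READ LOG", 0xDA: "RETURN STATUS"}


def decode_ata_pass_through_16(hex_dump):
    """Unpack the fields of an ATA PASS-THROUGH (16) CDB (illustrative helper)."""
    cdb = bytes.fromhex(hex_dump.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH (16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,
        "extend": cdb[1] & 0x01,
        "ck_cond": (cdb[2] >> 5) & 0x01,
        "t_dir": (cdb[2] >> 3) & 0x01,
        "byt_blok": (cdb[2] >> 2) & 0x01,
        "t_length": cdb[2] & 0x03,
        "features": cdb[4],
        "sector_count": cdb[6],
        "lba_low": cdb[8],
        "lba_mid": cdb[10],
        "lba_high": cdb[12],
        "device": cdb[13],
        "command": cdb[14],
    }


if __name__ == "__main__":
    fields = decode_ata_pass_through_16(CDB_HEX)
    print("protocol:", PROTOCOLS.get(fields["protocol"], fields["protocol"]))
    print("ck_cond :", fields["ck_cond"])
    print("command : 0x%02x" % fields["command"])
    if fields["command"] == 0xB0:  # ATA SMART command
        print("feature :", SMART_FEATURES.get(fields["features"], hex(fields["features"])))
```

If the layout assumption holds, the CDB that gets aborted is ATA command 0xB0 (SMART) with feature 0xD5 (READ LOG), a one-sector PIO data-in transfer with CK_COND clear. That would suggest the stall is not limited to the commands that request the register return discussed earlier in this report.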
Claudio,

I tried something similar on my setup and was not able to reproduce the issue you reported. We need to know whether it is specific to certain SATA disks or a generic issue. Can you please provide the details below?

a) What happens if you use a different SATA disk model instead of the one you are currently using?

b) I did the steps below to reproduce the problem (please correct me if I missed anything while mimicking your test case):

mdadm --create --verbose /dev/md0 --level=raid1 --raid-devices=2 /dev/sdc /dev/sdd
while true; do smartctl -a /dev/sdX; done

I kept this running for 15 minutes and could not see any issue on my setup. Is this the correct way of reproducing the issue?

My disks are Seagate ST320000641AS (2TB), FW version CC12.

I suspect this issue may also depend on the end devices. I need to clarify this by trying some other combinations.

Can you answer the questions above so that we can move on to the next steps of the investigation?

--Kashyap
Oops, someone just posted a more detailed explanation of this problem on LKML; I am going to try this:

Date: Mon, 26 Apr 2010 18:11:54 -0500
From: Ryan Kuester
Subject: mptsas hangs caused by ATA pass-through explained
http://lkml.org/lkml/2010/4/26/335
A utility from LSI is available here:
ftp://ftp.lsil.com/HostAdapterDrivers/linux/lsiutil/

Some information from my use of lsiutil:

Board name: LSISAS3442E-R
Board assembly: L3-00120-05E

Current active firmware version is 01172b00 (1.23.43)
Firmware image's version is MPTFW-01.23.43.00-IE
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.18.05.00 (2008.05.14)
EFI BIOS image's version is 3.05.01.01

Diagnostics -> Display phy counters:
Adapter Phy 1:  Link Up
  Invalid DWord Count                                       2,734
  Running Disparity Error Count                             2,757
  Loss of DWord Synch Count                                     0
  Phy Reset Problem Count                                       0

Other information:

$ lspci | grep LSI
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)

$ uname -srvm
Linux 2.6.31-21-generic #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010 x86_64

$ strings /lib/modules/2.6.31-21-generic/kernel/drivers/message/fusion/mptsas.ko | grep version=
version=3.04.10
srcversion=4023EA52994688E9AE61982

$ lsb_release -d
Description: Ubuntu 9.10
Reply-To: dgilbert@interlog.com

The originally reported problem has been fixed. See:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=91b25002bd58f55207e4662a611a6cded4ef9834

I was told that was scheduled to go in lk 2.6.33

Reading the bugzilla entry, some of the later posts could be reporting some other LSI-related problems. Anyway, the bug report should be closed.

Doug Gilbert

bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=13594
>
> Ken Stailey <kstailey@yahoo.com> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |kstailey@yahoo.com
>
> --- Comment #12 from Ken Stailey <kstailey@yahoo.com>  2010-05-12 14:09:28 ---
> A utility from LSI is available here:
> ftp://ftp.lsil.com/HostAdapterDrivers/linux/lsiutil/
On Wed, May 12, 2010 at 03:20:14PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> The originally reported problem has been fixed. See:
>
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=91b25002bd58f55207e4662a611a6cded4ef9834
>
> I was told that was scheduled to go in lk 2.6.33
>
> Reading the bugzilla entry some of the latter posts
> could be reporting some other LSI related problems.
> Anyway, the bug report should be closed.

It actually seems like that in 2.6.34-rc6, I can use SMART pretty much with impunity. Don't know if I'm just luckier now or what happened...

/* Steinar */
On Wed, May 12, 2010 at 06:45:33PM +0200, Steinar H. Gunderson wrote:
> It actually seems like that in 2.6.34-rc6, I can use SMART pretty much with
> impunity. Don't know if I'm just luckier now or what happened...

Scratch that; I could use smartctl all I wanted, but installing smartd promptly floored the entire card (and with it, the machine, since the RAID went away). dmesg below. At reboot, I kept seeing the “IOC is in FAULT state” until I got logged in and killed smartd again.

/* Steinar */

[588630.695020] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588630.702007] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588632.074809] mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
[588632.084283] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588638.081578] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588638.095332] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588642.090380] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588642.097310] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588642.104177] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588642.110862] mptscsih: ioc0: attempting task abort! (sc=ffff8801c9ffb600)
[588642.117813] sd 0:0:4:0: [sde] CDB: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[588642.126382] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801c9ffb600)
[588652.133012] mptscsih: ioc0: attempting task abort! (sc=ffff8801c9ffb600)
[588652.140020] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588652.146909] mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801c9ffb600)
[588652.153621] mptscsih: ioc0: attempting target reset! (sc=ffff880182bab200)
[588652.160768] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588652.177222] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588653.575199] mptscsih: ioc0: target reset: SUCCESS (sc=ffff880182bab200)
[588656.583548] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588656.594285] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588663.582006] mptscsih: ioc0: attempting task abort! (sc=ffff880182bab200)
[588663.588952] sd 0:0:4:0: [sde] CDB: Test Unit Ready: 00 00 00 00 00 00
[588663.595810] mptscsih: ioc0: task abort: SUCCESS (sc=ffff880182bab200)
[588663.602509] mptscsih: ioc0: attempting bus reset! (sc=ffff880182bab200)
[588663.609381] sd 0:0:4:0: [sde] CDB: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 09 00 4f 00 c2 00 b0 00
[588663.670443] mptbase: ioc0: LogInfo(0x31112000): Originator={PL}, Code={Reset}, SubCode(0x2000)
[588665.077991] mptscsih: ioc0: bus reset: SUCCESS (sc=ffff880182bab200)
[588668.083821] mptbase: ioc0: LogInfo(0x31111000): Originator={PL}, Code={Reset}, SubCode(0x1000)
[588675.085656] sd 0:0:4:0: [sde] Device not ready
[588675.090326] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.097771] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.103969] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.111743] sd 0:0:4:0: [sde] CDB: Write(10): 2a 00 57 54 52 08 00 00 08 00
[588675.119172] end_request: I/O error, dev sde, sector 1465143816
[588675.125343] end_request: I/O error, dev sde, sector 1465143816
[588675.126238] md: super_written gets error=-5, uptodate=0
[588675.126238] raid5: Disk failure on sde6, disabling device.
[588675.126238] raid5: Operation continuing on 5 devices.
[588675.148011] sd 0:0:4:0: [sde] Device not ready
[588675.152712] sd 0:0:4:0: [sde] Device not ready
[588675.152723] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152725] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152727] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152730] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 44 bc ae 88 00 00 80 00
[588675.152733] end_request: I/O error, dev sde, sector 1153216136
[588675.152740] sd 0:0:4:0: [sde] Device not ready
[588675.152741] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152743] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152745] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152747] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 2b 6a 42 70 00 00 18 00
[588675.152751] end_request: I/O error, dev sde, sector 728384112
[588675.152755] sd 0:0:4:0: [sde] Device not ready
[588675.152756] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152758] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152759] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152762] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 45 9c d0 08 00 00 80 00
[588675.152765] end_request: I/O error, dev sde, sector 1167904776
[588675.152769] sd 0:0:4:0: [sde] Device not ready
[588675.152770] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152771] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152773] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152775] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 2c c6 b8 88 00 00 80 00
[588675.152779] end_request: I/O error, dev sde, sector 751220872
[588675.152783] sd 0:0:4:0: [sde] Device not ready
[588675.152784] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152785] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152787] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152789] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 32 fb f1 08 00 00 80 00
[588675.152793] end_request: I/O error, dev sde, sector 855372040
[588675.152796] sd 0:0:4:0: [sde] Device not ready
[588675.152797] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152799] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152801] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152803] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 34 1e a8 88 00 00 80 00
[588675.152806] end_request: I/O error, dev sde, sector 874424456
[588675.152811] sd 0:0:4:0: [sde] Device not ready
[588675.152812] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152813] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152815] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152817] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 3e 4d bd 88 00 00 80 00
[588675.152821] end_request: I/O error, dev sde, sector 1045282184
[588675.152824] sd 0:0:4:0: [sde] Device not ready
[588675.152825] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152827] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152828] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152831] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 18 a3 50 88 00 00 10 00
[588675.152834] end_request: I/O error, dev sde, sector 413356168
[588675.152838] sd 0:0:4:0: [sde] Device not ready
[588675.152839] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152841] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152842] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152845] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 18 a3 50 a0 00 00 68 00
[588675.152848] end_request: I/O error, dev sde, sector 413356192
[588675.152855] sd 0:0:4:0: [sde] Device not ready
[588675.152856] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152857] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152859] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152861] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 36 12 e4 08 00 00 80 00
[588675.152865] end_request: I/O error, dev sde, sector 907207688
[588675.152868] sd 0:0:4:0: [sde] Device not ready
[588675.152869] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152871] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152873] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152875] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 31 8e c0 08 00 00 80 00
[588675.152878] end_request: I/O error, dev sde, sector 831438856
[588675.152882] sd 0:0:4:0: [sde] Device not ready
[588675.152883] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.152885] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.152886] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration
[588675.152889] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 3d e3 2f 08 00 00 80 00
[588675.152892] end_request: I/O error, dev sde, sector 1038298888
[588675.152910] end_request: I/O error, dev sdj, sector 2930271882
[588675.152912] md: super_written gets error=-5, uptodate=0
[588675.152915] raid5: Disk failure on sdj6, disabling device.
[588675.152915] raid5: Operation continuing on 4 devices.
[588675.158145] end_request: I/O error, dev sdh, sector 2930271882
[588675.158147] md: super_written gets error=-5, uptodate=0
[588675.158149] raid5: Disk failure on sdh6, disabling device.
[588675.158150] raid5: Operation continuing on 3 devices.
[588675.160440] end_request: I/O error, dev sdk, sector 2930271882
[588675.160442] md: super_written gets error=-5, uptodate=0
[588675.160444] raid5: Disk failure on sdk6, disabling device.
[588675.160445] raid5: Operation continuing on 2 devices.
[588675.161965] end_request: I/O error, dev sdg, sector 2930271882
[588675.161967] md: super_written gets error=-5, uptodate=0
[588675.161969] raid5: Disk failure on sdg6, disabling device.
[588675.161970] raid5: Operation continuing on 1 devices.
[588675.168925] end_request: I/O error, dev sdi, sector 2930271882
[588675.168927] md: super_written gets error=-5, uptodate=0
[588675.168929] raid5: Disk failure on sdi6, disabling device.
[588675.168930] raid5: Operation continuing on 0 devices.
[588675.168948] RAID5 conf printout:
[588675.168950]  --- rd:5 wd:0
[588675.168951]  disk 0, o:0, dev:sdg6
[588675.168952]  disk 1, o:0, dev:sdh6
[588675.168953]  disk 2, o:0, dev:sdi6
[588675.168955]  disk 3, o:0, dev:sdj6
[588675.168956]  disk 4, o:0, dev:sdk6
[588675.758839] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[588675.766304] sd 0:0:4:0: [sde] Sense Key : Not Ready [current]
[588675.772415] sd 0:0:4:0: [sde] Add.
Sense: Logical unit failed self-configuration [588675.780146] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 00 5a cc 98 00 00 08 00 [588675.787487] end_request: I/O error, dev sde, sector 5950616 [588675.793314] sd 0:0:4:0: [sde] Device not ready [588675.797984] sd 0:0:4:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [588675.805453] sd 0:0:4:0: [sde] Sense Key : Not Ready [current] [588675.811572] sd 0:0:4:0: [sde] Add. Sense: Logical unit failed self-configuration [588675.819312] sd 0:0:4:0: [sde] CDB: Read(10): 28 00 00 5a cc b0 00 00 08 00 [588675.826713] end_request: I/O error, dev sde, sector 5950640 [588675.838005] RAID5 conf printout: [588675.841422] --- rd:5 wd:0 [588675.844323] disk 1, o:0, dev:sdh6 [588675.847913] disk 2, o:0, dev:sdi6 [588675.851516] disk 3, o:0, dev:sdj6 [588675.855106] disk 4, o:0, dev:sdk6 [588675.858697] RAID5 conf printout: [588675.862139] --- rd:5 wd:0 [588675.862154] RAID5 conf printout: [588675.862155] --- rd:6 wd:5 [588675.862157] disk 0, o:1, dev:sda6 [588675.862158] disk 1, o:1, dev:sdf6 [588675.862160] disk 2, o:0, dev:sde6 [588675.862162] disk 3, o:1, dev:sdc6 [588675.862163] disk 4, o:1, dev:sdb1 [588675.862164] disk 5, o:1, dev:sdd1 [588675.892893] disk 1, o:0, dev:sdh6 [588675.896490] disk 2, o:0, dev:sdi6 [588675.900086] disk 3, o:0, dev:sdj6 [588675.903674] disk 4, o:0, dev:sdk6 [588675.912005] RAID5 conf printout: [588675.915438] --- rd:5 wd:0 [588675.918339] disk 1, o:0, dev:sdh6 [588675.919256] RAID5 conf printout: [588675.919258] --- rd:6 wd:5 [588675.919259] disk 0, o:1, dev:sda6 [588675.919261] disk 1, o:1, dev:sdf6 [588675.919262] disk 3, o:1, dev:sdc6 [588675.919263] disk 4, o:1, dev:sdb1 [588675.919264] disk 5, o:1, dev:sdd1 [588675.946353] disk 2, o:0, dev:sdi6 [588675.949956] disk 3, o:0, dev:sdj6 [588675.953553] RAID5 conf printout: [588675.956990] --- rd:5 wd:0 [588675.959890] disk 1, o:0, dev:sdh6 [588675.963502] disk 2, o:0, dev:sdi6 [588675.967103] disk 3, o:0, dev:sdj6 [588675.974006] RAID5 conf 
printout: [588675.977431] --- rd:5 wd:0 [588675.980380] disk 1, o:0, dev:sdh6 [588675.984017] disk 2, o:0, dev:sdi6 [588675.987633] RAID5 conf printout: [588675.991069] --- rd:5 wd:0 [588675.993989] disk 1, o:0, dev:sdh6 [588675.997600] disk 2, o:0, dev:sdi6 [588676.006006] RAID5 conf printout: [588676.009465] --- rd:5 wd:0 [588676.012375] disk 1, o:0, dev:sdh6 [588676.015985] RAID5 conf printout: [588676.019473] --- rd:5 wd:0 [588676.022367] disk 1, o:0, dev:sdh6 [588676.030005] RAID5 conf printout: [588676.033439] --- rd:5 wd:0 [588676.036350] Buffer I/O error on device dm-15, logical block 307593216 [588676.043081] lost page write due to I/O error on dm-15 [588676.751012] ttyS0: 1 input overrun(s) [588679.821212] ttyS0: 1 input overrun(s) [588702.915013] mptbase: ioc0: WARNING - IOC is in FAULT state (7827h)!!! [588702.921701] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!! [588702.929579] mptbase: ioc0: Initiating recovery [588702.934245] mptbase: ioc0: WARNING - IOC is in FAULT state!!! [588702.940210] mptbase: ioc0: WARNING - FAULT code = 7827h [588706.051011] mptbase: ioc0: Recovered from IOC FAULT [588717.036031] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
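For anyone reading the log above, the failing commands can be cross-checked against the reported sectors. The following is only a minimal sketch, assuming the standard Read(10) layout (opcode 0x28, big-endian 32-bit LBA in bytes 2-5, big-endian 16-bit transfer length in bytes 7-8) and 512-byte logical blocks; the helper name decode_read10 is made up for illustration.

```python
import struct


def decode_read10(cdb_hex):
    """Decode a Read(10) CDB given as a hex string like the ones in dmesg.

    Returns (lba, blocks). Hypothetical helper, not part of any kernel tool.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    # Bytes 2-5: logical block address, big-endian.
    (lba,) = struct.unpack(">I", cdb[2:6])
    # Bytes 7-8: transfer length in blocks, big-endian.
    (blocks,) = struct.unpack(">H", cdb[7:9])
    return lba, blocks


if __name__ == "__main__":
    # First failing command on sde from the log above.
    lba, blocks = decode_read10("28 00 44 bc ae 88 00 00 80 00")
    print(lba, blocks)
```

For the first CDB in the log this gives LBA 1153216136, which matches the sector in the following end_request line, so these look like ordinary reads that failed while the device reported "Not Ready".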
If this bug report is to be closed on the grounds that it only covers suppressing some log messages, can anyone post the IDs of any bug reports that cover the "real" LSI MPT driver issues?
Related bug reports:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/605939
https://bugzilla.redhat.com/show_bug.cgi?id=616572
I also use an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS controller with Seagate ST31500341AS 1.5TB hard disks. I found that the ST31500341AS has a firmware issue: http://www.avsforum.com/avs-vb/showthread.php?t=1080005. So I checked /var/log/messages and lsscsi: there are two firmware versions in the server, and all the sdX error messages logged are for drives with version SD17. SD17 should be upgraded to SD1B, otherwise I/O randomly hangs for almost half a minute.

Oct 29 08:27:21 XEN-ST-27 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff8801e5465840)
Oct 29 08:27:21 XEN-ST-27 kernel: sd 4:0:3:0:
Oct 29 08:27:21 XEN-ST-27 kernel: command: Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
Oct 29 08:27:23 XEN-ST-27 kernel: mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Oct 29 08:27:23 XEN-ST-27 kernel: mptscsih: ioc0: task abort: SUCCESS (sc=ffff8801e5465840)

[4:0:0:0] disk ATA ST31500341AS SD17 /dev/sda
[4:0:1:0] disk ATA ST31500341AS CC1H /dev/sdb
[4:0:2:0] disk ATA ST31500341AS CC1H /dev/sdc
[4:0:3:0] disk ATA ST31500341AS SD17 /dev/sdd
[4:0:4:0] disk ATA ST31500341AS CC1H /dev/sde
[4:0:5:0] disk ATA ST31500341AS SD17 /dev/sdf
[4:0:6:0] disk ATA ST31500341AS SD17 /dev/sdg
[4:0:7:0] disk ATA ST31500341AS CC1H /dev/sdh
[4:0:8:0] disk ATA ST31500341AS CC1H /dev/sdi
[4:0:9:0] disk ATA ST31500341AS CC1H /dev/sdj
[4:0:10:0] disk ATA ST31500341AS CC1H /dev/sdk
[4:0:11:0] disk ATA ST31500341AS CC1H /dev/sdl

I am seeing I/O hangs on many Xen servers. I applied the patch from http://lkml.org/lkml/2010/4/26/335 to 2.6.18-xen with mpt version mptlinux-3.04.01, and "task abort" still shows up in dmesg. But there, smartctl -a does not trigger errors even without this patch. So I think the heavy I/O hangs may be caused by the Seagate firmware and by an ATA pass-through bug in the kernel. I did not see the ATA pass-through issue in 2.6.18-xen or 2.6.16-xen, but 2.6.29, 2.6.31 and 2.6.32 have it.
It can be reproduced easily by running "while true; do smartctl -a /dev/sdd > /dev/null; done". It still reproduces with the patch http://lkml.org/lkml/2010/4/26/335 applied, and with every mpt fusion driver I could find, from 3.04.01 to the latest LSI version 4.0.22. Finally I tested 2.6.36, where the ATA issue seems to be solved. But that kernel doesn't support Xen dom0, so I can't test it on production servers. I am trying to reproduce the I/O hang in the lab and to upgrade the Seagate firmware to verify this.

Related bug: https://bugzilla.kernel.org/show_bug.cgi?id=18652
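To find the affected drives on other servers, the lsscsi check described above could be scripted roughly like this. It is only a sketch, assuming the default lsscsi column layout seen in this comment ([H:C:T:L], type, vendor, model, revision, device node); the function names are invented for illustration, and the sample input is copied from the listing above.

```python
def parse_lsscsi(text):
    """Parse default `lsscsi` output into dicts (hypothetical helper).

    Assumes the columns are: [H:C:T:L] type vendor model... rev device.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("["):
            continue  # skip blank or unexpected lines
        entries.append({
            "hctl": fields[0].strip("[]"),
            "type": fields[1],
            "vendor": fields[2],
            # The model may contain spaces; everything between the
            # vendor and the last two columns is treated as the model.
            "model": " ".join(fields[3:-2]),
            "rev": fields[-2],
            "dev": fields[-1],
        })
    return entries


def devices_with_firmware(entries, model, rev):
    """Return device nodes of drives matching the given model and firmware."""
    return [e["dev"] for e in entries
            if e["model"] == model and e["rev"] == rev]


if __name__ == "__main__":
    sample = """\
[4:0:0:0] disk ATA ST31500341AS SD17 /dev/sda
[4:0:1:0] disk ATA ST31500341AS CC1H /dev/sdb
[4:0:3:0] disk ATA ST31500341AS SD17 /dev/sdd
"""
    print(devices_with_firmware(parse_lsscsi(sample), "ST31500341AS", "SD17"))
```

On a live system the text would come from running lsscsi itself (for example via subprocess.run(["lsscsi"], capture_output=True, text=True).stdout); the devices it reports are the candidates for the firmware upgrade.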