Created attachment 23729 [details] lspci -vvv output

On our soon-to-be primary backup machine with an Adaptec 52445 (for the disk pool) and an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (for the tape library), we see random failures during tape I/O (using btape from Bacula):

- A few minutes into the writes, the changer device and both tape devices disappear from the system. Rescanning the bus sometimes helps, but sometimes results in a hard freeze.
- A few minutes into the writes, the system freezes.

The output of lspci -vvv, lsscsi, dmesg and /proc/modules is attached. This might also be related to #14577, as it is the same machine. Also attached is a screenshot of the oops we saw on the last crash, which happened during a btape run. The system runs stock Debian Lenny with a custom-built 2.6.30.5 kernel.
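For reference, the bus rescan mentioned above is done by writing the wildcard triple "- - -" to a host's sysfs scan file (the form quoted later in this thread). The following is only a minimal sketch of that step: the function name and the `sysfs_root` parameter are made up for illustration, it needs root on a real system, and on the affected machine a rescan sometimes froze the box, so it is not a fix.

```python
from pathlib import Path


def rescan_scsi_hosts(sysfs_root="/sys/class/scsi_host"):
    """Ask every SCSI host under sysfs_root to rescan its bus.

    Writes "- - -" (wildcards for channel, target and LUN) to each
    hostN/scan file and returns the names of the hosts that were rescanned.
    sysfs_root is a parameter only so the function can be tried on a
    scratch directory; it is an assumption of this sketch.
    """
    rescanned = []
    for host in sorted(Path(sysfs_root).glob("host*")):
        scan_file = host / "scan"
        if scan_file.exists():
            scan_file.write_text("- - -\n")
            rescanned.append(host.name)
    return rescanned
```

Running lsscsi before and after the call shows whether the changer and tape devices came back.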
Created attachment 23730 [details] cat /proc/modules output
Created attachment 23731 [details] lsscsi output
Created attachment 23732 [details] screenshot of last oops
Created attachment 23733 [details] dmesg output
fyi

From: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
To: "Support@techfak.uni-bielefeld.de" <Support@techfak.uni-bielefeld.de>, "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
CC: Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de>
Date: Fri, 13 Nov 2009 17:29:43 +0530
Subject: RE: Bug 14579 - Devices disappear... and Bug 14577 - Data corruption with Adaptec
Message-ID: <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE> <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE>
In-Reply-To: <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE>

Subject line is related to *Adaptec* and there are some places LSI related issue is pointed out. Little confusing to me. Is it possible to rewrite what is an issue related to LSI card?

From dmesg log I can figure out 3.04.07 is mpt fusion driver version. Please update LSI driver using latest upstream driver version 3.04.13. And see what a result is.

- Kashyap

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Sascha Frey
Sent: Friday, November 13, 2009 4:28 AM
To: linux-scsi@vger.kernel.org
Cc: Lukas Kolbe
Subject: Re: Bug 14579 - Devices disappear... and Bug 14577 - Data corruption with Adaptec

Hi,

Lukas Kolbe wrote:
>we'd really appreciate any hints and help we can get for the following
>bugs:
>http://bugzilla.kernel.org/show_bug.cgi?id=14579

We've done some further testing: it's very hard to trigger this bug. Sometimes the machine freezes a few minutes into tape access, and sometimes it works for days - or even weeks - without any problem.

The bug only appears during tape I/O (regardless of which tape program is used: btape, dd or tar). In most cases the tape write ends with an input/output error.

After this error has occurred, any access to the tape library robot (connected through the SAS interface of the first drive) fails:

# mtx unload 1 1
Unloading drive 1 into Storage Element 1...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Illegal Request
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 53
mtx: Request Sense: Additional Sense Qualifier = 01
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 257 to 4096 Failed

After resetting the SCSI bus (echo "- - -" > /sys/class/scsi_host/host5/scan) the tape drives are revitalized, but the changer device disappears. Even after a cold restart of the whole library the device remains missing.

Yet another problem: resetting the SCSI bus of the LSI SAS HBA sometimes results in a hard freeze (console stuck; no log messages).

> [...]
>
>I do not believe it's a hardware fault at the moment as the machine
>ran OK under Solaris for a few weeks (including successful btape runs).
>

The very same piece of hardware worked fine using Solaris 10 with heavy disk and tape I/O at the same time for two months. We really prefer using Linux instead, but we are under time pressure.

We appreciate any help resolving this bug!

Regards,
Sascha Frey

From: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
To: "support@TechFak.Uni-Bielefeld.DE" <support@TechFak.Uni-Bielefeld.DE>
CC: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Date: Wed, 18 Nov 2009 10:24:38 +0530
Subject: RE: Bug 14579 - Devices disappear... and Bug 14577 - Data corruption with Adaptec
Message-ID: <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE> <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE> <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com> <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE>
In-Reply-To: <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE>

Hello Lukas,

> -----Original Message-----
> From: Lukas Kolbe [mailto:lkolbe@TechFak.Uni-Bielefeld.DE]
> Sent: Tuesday, November 17, 2009 7:53 PM
> To: Desai, Kashyap
> Cc: linux-scsi@vger.kernel.org
> Subject: Re: Bug 14579 - Devices disappear... and Bug 14577 - Data
> corruption with Adaptec
>
> Desai, Kashyap wrote:
>
> >Subject line is related to *Adaptec* and there are some places LSI
> >related issue is pointed out. Little confusing to me. Is it possible to
> >rewrite what is an issue related to LSI card?
>
> Sorry for that one. This system has an Adaptec Controller for its
> Storage array and an LSI controller for the tape library. Bug 14577 is
> about a possible data corruption on 2.6.32-rc6 that seems to be either a
> hardware error (currently trying to find that out) or a regression in
> 2.6.32-rc6, as 2.6.30 is very happy with its storage.

OK. In data corruption condition only LSI driver and controller are involved? I mean can I nullify Adaptec controller's role in your test?

>
> Finally, the real problem here is Bug 14579 that is about the systems
> problems when using the tape library.
>
> >From dmesg log I can figure out 3.04.07 is mpt fusion driver version.
> >Please update LSI driver using latest upstream driver version 3.04.13.
> And see what a result is.
>
> Thanks for the pointer. Linus' current tree contains 3.04.12 - where can
> I find 3.04.13?

It is there in 2.6.32-rc5. Not sure in which exact rc version it is included, but I have 2.6.32-rc5 tree in my setup and for that kernel mptfusion version is 3.104.13

>
> >- Kashyap
>
> Kind regards,
> Lukas Kolbe
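To make the mtx failure quoted in Sascha Frey's message easier to read, here is a small sketch that pulls the "Request Sense" fields out of mtx output and looks up the additional sense code pair. It assumes mtx prints the two codes in hexadecimal and one field per line. The lookup table is a hand-copied excerpt of the generic T10 ASC/ASCQ names and the function names are invented for this note; a library vendor may document a more specific meaning for the same pair.

```python
# Excerpt of generic T10 ASC/ASCQ names (assumption: hand-copied subset,
# not a complete table; vendor documentation may refine these).
ASC_ASCQ_NAMES = {
    (0x53, 0x00): "MEDIA LOAD OR EJECT FAILED",
    (0x53, 0x01): "UNLOAD TAPE FAILURE",
    (0x53, 0x02): "MEDIUM REMOVAL PREVENTED",
}


def parse_request_sense(output):
    """Collect 'mtx: Request Sense: key=value' lines into a dict."""
    fields = {}
    for line in output.splitlines():
        _, marker, rest = line.partition("Request Sense:")
        if not marker or "=" not in rest:
            continue  # not a sense line, e.g. the final MOVE MEDIUM line
        key, _, value = rest.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def describe_sense(fields):
    """Return (sense key, ASC/ASCQ name) for a parsed sense report."""
    asc = int(fields["Additional Sense Code"], 16)
    ascq = int(fields["Additional Sense Qualifier"], 16)
    name = ASC_ASCQ_NAMES.get((asc, ascq), "unknown")
    return fields.get("Sense Key"), name
```

For the output above, this yields the sense key "Illegal Request" with ASC/ASCQ 53/01, which the generic table names as an unload failure. That matches the failed MOVE MEDIUM, but does not by itself say whether the drive, the changer or the HBA is at fault.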
fyi

From lkolbe@TechFak.Uni-Bielefeld.DE Wed Nov 18 14:39:08 2009
Date: Wed, 18 Nov 2009 14:39:09 +0100
From: Lukas Kolbe <lkolbe@TechFak.Uni-Bielefeld.DE>
To: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: Bug 14579 - Devices disappear... and Bug 14577 - Data corruption with Adaptec
Message-ID: <20091118133909.GD16440@TechFak.Uni-Bielefeld.DE>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE> <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE> <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com> <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE> <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>
In-Reply-To: <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>

Desai, Kashyap wrote:

>> >Subject line is related to *Adaptec* and there are some places LSI
>> >related issue is pointed out. Little confusing to me. Is it possible to
>> >rewrite what is an issue related to LSI card?
>>
>> Sorry for that one. This system has an Adaptec Controller for its
>> Storage array and an LSI controller for the tape library. Bug 14577 is
>> about a possible data corruption on 2.6.32-rc6 that seems to be either a
>> hardware error (currently trying to find that out) or a regression in
>> 2.6.32-rc6, as 2.6.30 is very happy with its storage.

>OK. In data corruption condition only LSI driver and controller are
>involved? I mean can I nullify Adaptec controller's role in your test?

No, it is the other way round. We have 24 1TB Seagate hard disks connected in a RAID 60 to the Adaptec controller, and a Tandberg T80 with two IBM Ultrium-HH4 tape drives connected to the LSI controller. The system is installed on an LVM volume within the RAID 60.

The data corruption occurs when we try to boot 2.6.32-rc6: we get write errors and the boot process stops somewhere. So it seems the data corruption is related _only_ to the Adaptec controller, the RAID array or the hard disks.

>> Finally, the real problem here is Bug 14579 that is about the systems
>> problems when using the tape library.
>>
>> >From dmesg log I can figure out 3.04.07 is mpt fusion driver version.
>> >Please update LSI driver using latest upstream driver version 3.04.13.
>> And see what a result is.
>>
>> Thanks for the pointer. Linus' current tree contains 3.04.12 - where can
>> I find 3.04.13?
>
>It is there in 2.6.32-rc5. Not sure in which exact rc version it is
>included, but I have 2.6.32-rc5 tree in my setup and for that kernel
>mptfusion version is 3.104.13

Okay, I grep'ed for 3.04 in the source and only got one reference to the older version number.

But there lies the problem: unless we can fix the Adaptec bug first (or confirm it is a hardware issue), we can't boot 2.6.32-rc on that machine to test the new LSI driver version. Is it easily possible to backport/include the mptfusion driver in 2.6.30?

Thanks for the help and kind regards,
--
Lukas Kolbe
As it turned out, this machine was faulty (all 24 disks began dematerializing under our feet), so we replaced it. We now face similar issues, but I'll open a new bug for those to keep a clear separation. Could whoever has permission to close this bug please do so? Kind regards, Lukas
fyi: Supermicro's X7DWN+ mainboard needs a BIOS update (at least version 1.2b; no changelog is available, though) to cope with multiple SAS controllers under Linux. In the end, Kashyap Desai was right when he suspected it had something to do with IRQ routing. Seagate's firmware problems (timeouts) with Adaptec's RAID controller did not help either: every few days, more than two disks at a time were thrown out of the array, resulting in the loss of the array. Supposedly, the newest firmware 'DN06' for the Barracuda ES.2 drives fixes these problems, but our distributor was kind enough to replace all our Seagate drives with Hitachis.