Bug 14579 - Devices disappear; on bus reset machine hangs; on I/O machine hangs
Status: CLOSED INVALID
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All Linux
Importance: P1 blocking
Assigned To: linux-scsi@vger.kernel.org
Depends on: 14577
Blocks:
Reported: 2009-11-10 15:28 UTC by Sascha Frey
Modified: 2012-06-14 16:58 UTC (History)

See Also:
Kernel Version: 2.6.30-1-amd64 Debian
Tree: Mainline
Regression: No


Attachments
lspci -vvv output (40.25 KB, text/plain)
2009-11-10 15:28 UTC, Sascha Frey
cat /proc/modules output (3.54 KB, text/plain)
2009-11-10 15:29 UTC, Sascha Frey
lsscsi output (17.11 KB, text/plain)
2009-11-10 15:30 UTC, Sascha Frey
screenshot of last oops (63.88 KB, image/jpeg)
2009-11-10 15:30 UTC, Sascha Frey
dmesg output (63.76 KB, text/plain)
2009-11-10 15:35 UTC, Sascha Frey

Description Sascha Frey 2009-11-10 15:28:22 UTC
Created attachment 23729 [details]
lspci -vvv output

On our soon-to-be primary backup machine with an Adaptec 52445 (for the disk pool) and an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (for the Tape Library), we see random failures:
During tape I/O (using btape from Bacula), a few minutes into the writes, the changer device and both tape devices disappear from the system. Rescanning the bus sometimes helps, but sometimes results in a hard freeze.

On Tape-IO, a few minutes into the writes, the system freezes.

The lspci -vvv, lsscsi, dmesg and /proc/modules outputs are attached. This might also be related to #14577, as it's the same machine. Also attached is a screenshot of the oops we saw on the last crash, which happened during a btape run.

The system uses stock Debian/Lenny with a custom built 2.6.30.5 kernel.
Comment 1 Sascha Frey 2009-11-10 15:29:20 UTC
Created attachment 23730 [details]
cat /proc/modules output
Comment 2 Sascha Frey 2009-11-10 15:30:02 UTC
Created attachment 23731 [details]
lsscsi output
Comment 3 Sascha Frey 2009-11-10 15:30:46 UTC
Created attachment 23732 [details]
screenshot of last oops
Comment 4 Sascha Frey 2009-11-10 15:35:41 UTC
Created attachment 23733 [details]
dmesg output
Comment 5 lkolbe 2009-11-18 14:05:34 UTC
fyi

From: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
To: "Support@techfak.uni-bielefeld.de" <Support@techfak.uni-bielefeld.de>,
        "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
CC: Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de>
Date: Fri, 13 Nov 2009 17:29:43 +0530
Subject: RE: Bug 14579 -  Devices disappear... and Bug 14577 - Data
	corruption with Adaptec
Message-ID: <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE>
 <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE>
In-Reply-To: <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE>

Subject line is related to *Adaptec* and there are some places LSI related
issue is pointed out. Little confusing to me. Is it possible to rewrite what
is an issue related to LSI card?

From dmesg log I can figure out 3.04.07 is mpt fusion driver version.
Please update LSI driver using latest upstream driver version 3.04.13. And
see what a result is.

- Kashyap

-----Original Message-----
From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-owner@vger.kernel.org] On Behalf Of Sascha Frey
Sent: Friday, November 13, 2009 4:28 AM
To: linux-scsi@vger.kernel.org
Cc: Lukas Kolbe
Subject: Re: Bug 14579 - Devices disappear... and Bug 14577 - Data corruption with Adaptec

Hi,

Lukas Kolbe wrote:
>we'd really appreciate any hints and help we can get for the following
>bugs:
>http://bugzilla.kernel.org/show_bug.cgi?id=14579

We've done some further testing:
it's very hard to trigger this bug. Sometimes the machine freezes after
a few minutes into tape access and sometimes it works days - or even
weeks - without any problem.

The bug only appears during tape I/O (regardless of which tape program is
used: btape, dd or tar).
In most cases the tape write ends with an input/output error. After this
error occurred, any access to the tape library robot (connected through
the SAS interface of the first drive) fails:

# mtx unload 1 1
Unloading drive 1 into Storage Element 1...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Illegal Request
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 53
mtx: Request Sense: Additional Sense Qualifier = 01
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 257 to 4096 Failed
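
For reading sense data like the above, a minimal Python sketch follows. It collects the "mtx: Request Sense:" fields into a dictionary and looks up the ASC/ASCQ pair. It assumes mtx prints the additional sense code and qualifier in hexadecimal, and the lookup table is only a three-entry excerpt of the T10 ASC/ASCQ assignments; the function names are made up for illustration.

```python
import re

# Tiny excerpt of the T10 ASC/ASCQ table (only the 0x53 family).
ASC_ASCQ = {
    (0x53, 0x00): "Media load or eject failed",
    (0x53, 0x01): "Unload tape failure",
    (0x53, 0x02): "Medium removal prevented",
}

# A sense field may follow other text on the same line, so search, don't match.
FIELD = re.compile(r"mtx: Request Sense: ([^=]+?)\s*=\s*(.*\S)")


def parse_request_sense(output):
    """Return a dict of the 'mtx: Request Sense:' fields found in output."""
    fields = {}
    for line in output.splitlines():
        m = FIELD.search(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def describe(fields):
    """Summarise sense key and ASC/ASCQ (assumed to be printed in hex)."""
    asc = int(fields["Additional Sense Code"], 16)
    ascq = int(fields["Additional Sense Qualifier"], 16)
    text = ASC_ASCQ.get((asc, ascq), "not in this excerpt")
    key = fields.get("Sense Key", "?")
    return f"{key}: ASC/ASCQ {asc:02X}/{ascq:02X} = {text}"
```

Under these assumptions, the report above decodes to sense key Illegal Request with ASC/ASCQ 53/01, which the excerpt maps to "Unload tape failure".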

After rescanning the SCSI bus (echo "- - -" >
/sys/class/scsi_host/host5/scan) the tape drives come back, but
the changer device disappears. Even after a cold restart of the whole
library the device stays missing.

Yet another problem: resetting the SCSI bus of the LSI SAS HBA sometimes
results in a hard freeze (console stuck; no log messages).
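
The echo to the sysfs scan file mentioned above can be sketched in Python as follows. Writing "- - -" asks the kernel to scan all channels, targets and LUNs on that host; it is a rescan request, not a bus reset. The function name and the sysfs_root parameter are hypothetical (the parameter only exists so the sketch can be tried against a scratch directory), and on a real system the write needs root.

```python
from pathlib import Path


def rescan_scsi_host(host, sysfs_root="/sys"):
    """Ask the kernel to rescan every channel/target/LUN on one SCSI host.

    host is the host number (5 for host5); sysfs_root is a hypothetical
    parameter so the function can be pointed at a test directory.
    """
    scan_file = Path(sysfs_root) / "class" / "scsi_host" / f"host{host}" / "scan"
    # The three fields are channel, target and LUN; "-" is a wildcard.
    scan_file.write_text("- - -\n")
    return scan_file
```

For the setup described here, rescan_scsi_host(5) would correspond to the echo command quoted above.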

> [...]
>
>I do not believe it's a hardware fault at the moment as the machine
>ran OK under Solaris for a few weeks (including successful btape runs).
>

The very same piece of hardware worked fine using Solaris 10 with heavy
disk and tape I/O at the same time for two months.

We really prefer using Linux instead, but we're under time pressure.


We appreciate any help resolving this bug!




Regards,
Sascha Frey


From: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
To: "support@TechFak.Uni-Bielefeld.DE" <support@TechFak.Uni-Bielefeld.DE>
CC: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Date: Wed, 18 Nov 2009 10:24:38 +0530
Subject: RE: Bug 14579 -  Devices disappear... and Bug 14577 - Data
	corruption with Adaptec
Message-ID: <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE>
 <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE>
 <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com>
 <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE>
In-Reply-To: <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE>

Hello Lukas,


> -----Original Message-----
> From: Lukas Kolbe [mailto:lkolbe@TechFak.Uni-Bielefeld.DE]
> Sent: Tuesday, November 17, 2009 7:53 PM
> To: Desai, Kashyap
> Cc: linux-scsi@vger.kernel.org
> Subject: Re: Bug 14579 - Devices disappear... and Bug 14577 - Data
> corruption with Adaptec
>
> Desai, Kashyap wrote:
>
> >Subject line is related to *Adaptec* and there are some places LSI
> >related issue is pointed out. Little confusing to me. Is it possible to
> >rewrite what is an issue related to LSI card?
>
> Sorry for that one. This system has an Adaptec Controller for its
> Storage array and an LSI controller for the tape library. Bug 14577 is
> about a possible data corruption on 2.6.32-rc6 that seems to be either a
> hardware error (currently trying to find that out) or a regression in
> 2.6.32-rc6, as 2.6.30 is very happy with its storage.
OK. In data corruption condition only LSI driver and controller are
involved? I mean can I nullify Adaptec controller's role in your test?
>
> Finally, the real problem here is Bug 14579 that is about the systems
> problems when using the tape library.
>=20
> >From dmesg log I can figure out 3.04.07 is mpt fusion driver version.
> >Please update LSI driver using latest upstream driver version 3.04.13.
> And see what a result is.
>
> Thanks for the pointer. Linus' current tree contains 3.04.12 - where can
> I find 3.04.13?

It is there in 2.6.32-rc5. Not sure in which exact rc version it is
included, but I have 2.6.32-rc5 tree in my setup and for that kernel
mptfusion version is 3.104.13
>
> >- Kashyap
>
> Kind regards,
> Lukas Kolbe
Comment 6 lkolbe 2009-11-18 14:07:10 UTC
fyi


From lkolbe@TechFak.Uni-Bielefeld.DE Wed Nov 18 14:39:08 2009
Date: Wed, 18 Nov 2009 14:39:09 +0100
From: Lukas Kolbe <lkolbe@TechFak.Uni-Bielefeld.DE>
To: "Desai, Kashyap" <Kashyap.Desai@lsi.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>
Subject: Re: Bug 14579 -  Devices disappear... and Bug 14577 - Data
	corruption with Adaptec
Message-ID: <20091118133909.GD16440@TechFak.Uni-Bielefeld.DE>
References: <20091111160220.GC5705@TechFak.Uni-Bielefeld.DE> <20091112225825.GA20808@TechFak.Uni-Bielefeld.DE> <0D1E8821739E724A86F4D16902CE275C1C93C04462@inbmail01.lsi.com> <20091117142242.GA15638@TechFak.Uni-Bielefeld.DE> <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>
In-Reply-To: <0D1E8821739E724A86F4D16902CE275C1C93C74A49@inbmail01.lsi.com>

Desai, Kashyap wrote:

>> >Subject line is related to *Adaptec* and there are some places LSI
>> >related issue is pointed out. Little confusing to me. Is it possible to
>> >rewrite what is an issue related to LSI card?
>> 
>> Sorry for that one. This system has an Adaptec Controller for its
>> Storage array and an LSI controller for the tape library. Bug 14577 is
>> about a possible data corruption on 2.6.32-rc6 that seems to be either a
>> hardware error (currently trying to find that out) or a regression in
>> 2.6.32-rc6, as 2.6.30 is very happy with its storage.
>OK. In data corruption condition only LSI driver and controller are
>involved? I mean can I nullify Adaptec controller's role in your test?

No, it is the other way round. We have 24 1TB Seagate harddisks
connected in a RAID 60 to the adaptec controller, and a Tandberg T80
with two IBM Ultrium-HH4 tape drives connected to the LSI controller.

The system is installed on an LVM volume within the RAID 60.
The data corruption occurs when we try to boot 2.6.32-rc6, we get write
errors and the boot process stops somewhere. So, it seems the data
corruption is related _only_ with the Adaptec Controller, the RAID array
or the harddisks.

>> Finally, the real problem here is Bug 14579 that is about the systems
>> problems when using the tape library.
>> 
>> >From dmesg log I can figure out 3.04.07 is mpt fusion driver version.
>> >Please update LSI driver using latest upstream driver version 3.04.13.
>> And see what a result is.
>> 
>> Thanks for the pointer. Linus' current tree contains 3.04.12 - where can
>> I find 3.04.13?
>
>It is there in 2.6.32-rc5. Not sure in which exact rc version it is
>included, but I have 2.6.32-rc5 tree in my setup and for that kernel
>mptfusion version is 3.104.13

Okay, I grep'ed for 3.04 in the source and only got one reference to the
older version number. But there lies the problem: Unless we can fix the
Adaptec-Bug first (or confirm it is a hardware issue), we can't boot
2.6.32-rc on that machine to test the new LSI driver version. Is it
easily possible to backport/include the mptfusion in 2.6.30?

Thanks for the help and kind regards, 
-- 
Lukas Kolbe
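
The driver-version check discussed in this thread (reading the running mptfusion version from the kernel log and comparing it with the suggested one) could be sketched in Python as below. It assumes the log contains a line of the form "Fusion MPT base driver <version>"; that line format and the helper names are assumptions, not something confirmed by the attached dmesg.

```python
import re


def mpt_version_from_dmesg(dmesg_text):
    """Return the mptfusion version string found in dmesg text, or None.

    Assumes a log line like 'Fusion MPT base driver 3.04.07'.
    """
    m = re.search(r"Fusion MPT base driver (\d+(?:\.\d+)+)", dmesg_text)
    return m.group(1) if m else None


def version_tuple(version):
    """Turn '3.04.07' into (3, 4, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_update(current, wanted):
    """True if the running driver is older than the wanted version."""
    return version_tuple(current) < version_tuple(wanted)
```

With the versions named in this thread, needs_update("3.04.07", "3.04.13") is True.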
Comment 7 lkolbe 2010-02-10 10:10:46 UTC
As it turned out, this machine was faulty (all 24 disks began dematerializing under our feet), so we replaced it. We now face similar issues, though I'll open a new bug for those to keep a clear separation. Whoever has permission to close this bug, please do so.

Kind regards, 
Lukas
Comment 8 lkolbe 2010-04-13 09:25:22 UTC
fyi: Supermicro's X7DWN+ mainboard needs a BIOS update (at least version 1.2b, no changelog available though) to cope with multiple SAS controllers under Linux. In the end, Kashyap Desai was right when he suspected it had something to do with IRQ routing.

Seagate's firmware problems (timeouts) with Adaptec's RAID controller didn't help with this either, as every few days more than two disks at a time were thrown out of the array, resulting in loss of said array. Supposedly, the newest firmware 'DN06' for the Barracuda.ES2 drives fixes these problems, but our distributor was so kind as to replace all our Seagate drives with Hitachis.
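
For anyone chasing a similar IRQ-routing suspicion, a rough Python sketch for spotting shared interrupt lines in /proc/interrupts follows. It assumes the 2.6-era layout (one count column per CPU, a single-token controller name such as IO-APIC-fasteoi, then a comma-separated device list); newer kernels format this file differently. The function name is made up for illustration.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> device names for IRQs used by more than one device.

    Assumes the 2.6-era /proc/interrupts layout described above.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        # Per-CPU counts, controller name, then the device list.
        fields = rest.split(None, n_cpus + 1)
        if len(fields) < n_cpus + 2:
            continue
        devices = [dev.strip() for dev in fields[n_cpus + 1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared
```

Calling shared_irqs(open("/proc/interrupts").read()) lists every numbered IRQ that more than one driver is attached to, which is one quick thing to check when two storage controllers sit in the same machine.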
