Bug 14577

Summary: Data Corruption with Adaptec 52445, Firmware 5.2-0 (17380)
Product: SCSI Drivers Reporter: lkolbe
Component: AACRAID Assignee: scsi_drivers-aacraid
Status: CLOSED CODE_FIX    
Severity: normal CC: rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230, 14579    
Attachments: dmesg of the running (and failing) 2.6.32-rc6 on this system
lspci -vvv on a working kernel
lsscsi -t/-v/-c/-H output on a working kernel
contents of /proc/modules on the working system (2.6.30)

Description lkolbe 2009-11-10 13:31:52 UTC
Created attachment 23725 [details]
dmesg of the running (and failing) 2.6.32-rc6 on this system

When booting this system with the 2.6.32-rc6 kernel, we see immediate data corruption on the root filesystem:

[   21.813985] Adding 4194296k swap on /dev/mapper/system-swap.  Priority:-1 extents:1 across:4194296k 
[   21.903794] sd 0:0:0:0: [sda] Unhandled sense code
[   21.903865] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   21.903998] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
[   21.904219] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[   21.904382] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 01 bf 00 00 08 00
[   21.905051] end_request: I/O error, dev sda, sector 447
[   21.905122] Buffer I/O error on device dm-0, logical block 0
[   21.905191] lost page write due to I/O error on dm-0
[   21.905271] EXT3 FS on dm-0, internal journal
[   23.263958] sd 0:0:0:0: [sda] Unhandled sense code
[   23.264029] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   23.264162] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
[   23.264368] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[   23.264531] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 bc 11 cf 00 00 08 00
[   23.265186] end_request: I/O error, dev sda, sector 12325327
[   23.265257] Buffer I/O error on device dm-1, logical block 492034
[   23.265327] lost page write due to I/O error on dm-1
[   23.364800] kjournald starting.  Commit interval 5 seconds
[   23.423871] sd 0:0:0:0: [sda] Unhandled sense code
[   23.423874] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[   23.423877] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
[   23.423880] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[   23.423886] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 01 bf 00 00 08 00
[   23.423893] end_request: I/O error, dev sda, sector 8389055
[   23.423983] Buffer I/O error on device dm-1, logical block 0
[   23.424052] lost page write due to I/O error on dm-1

On 2.6.30 the system works 'fine' (in that it doesn't shred the filesystems on bootup). The dmesg, lspci and lsscsi output are attached in the following comments.
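
For readers cross-checking the log above: the sector numbers in the end_request lines can be recovered from the CDB bytes themselves. The following is a minimal Python sketch, not part of the original report. It assumes 512-byte sectors and a 4 KiB filesystem block size, and fs_block() is a hypothetical helper; the dm-1 start sector (8389055) is inferred from the "logical block 0" error and is not confirmed by the reporter.

```python
SECTOR_SIZE = 512  # bytes per sector; an assumption, not stated in the report


def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return {"lba": lba, "blocks": blocks}


def fs_block(sector, volume_start, sectors_per_block=8):
    """Hypothetical helper: map an sda sector to a block inside a volume."""
    return (sector - volume_start) // sectors_per_block


if __name__ == "__main__":
    # The three failing CDBs, copied from the dmesg excerpt above.
    cdbs = (
        "2a 00 00 00 01 bf 00 00 08 00",
        "2a 00 00 bc 11 cf 00 00 08 00",
        "2a 00 00 80 01 bf 00 00 08 00",
    )
    for cdb in cdbs:
        info = decode_write10(cdb)
        size = info["blocks"] * SECTOR_SIZE
        print(f"LBA {info['lba']}, {info['blocks']} sectors ({size} bytes)")

    # Assumed: dm-1 begins at sda sector 8389055 (the "logical block 0" error).
    print("dm-1 block:", fs_block(12325327, 8389055))
```

Under those assumptions the three CDBs decode to sectors 447, 12325327 and 8389055, each an 8-sector (4096-byte) write, and the second write maps to dm-1 logical block 492034, which matches the log.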
Comment 1 lkolbe 2009-11-10 13:35:00 UTC
Created attachment 23726 [details]
lspci -vvv on a working kernel
Comment 2 lkolbe 2009-11-10 13:35:53 UTC
Created attachment 23727 [details]
lsscsi -t/-v/-c/-H output on a working kernel
Comment 3 lkolbe 2009-11-10 13:39:29 UTC
Created attachment 23728 [details]
contents of /proc/modules on the working system (2.6.30)

/proc/version is: Linux version 2.6.30-1-amd64 (Debian 2.6.30-5+techfaklenny1) (sfrey@TechFak.Uni-Bielefeld.DE) (gcc version 4.3.2 (Debian 4.3.2-1.1) ) #1 SMP Sat Aug 15 11:20:08 UTC 2009
Comment 4 lkolbe 2009-11-10 13:41:21 UTC
The firmware of the Adaptec controller is 5.2-0 (17380).
Comment 5 Anonymous Emailer 2009-11-10 14:03:56 UTC
Reply-To: James.Bottomley@suse.de

On Tue, 2009-11-10 at 13:31 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14577


> [   21.903794] sd 0:0:0:0: [sda] Unhandled sense code
> [   21.903865] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [   21.903998] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
> [   21.904219] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [   21.904382] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 01 bf 00 00 08
> 00
> [   21.905051] end_request: I/O error, dev sda, sector 447
> [   21.905122] Buffer I/O error on device dm-0, logical block 0
> [   21.905191] lost page write due to I/O error on dm-0
> [   21.905271] EXT3 FS on dm-0, internal journal
> [   23.263958] sd 0:0:0:0: [sda] Unhandled sense code
> [   23.264029] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [   23.264162] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
> [   23.264368] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [   23.264531] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 bc 11 cf 00 00 08
> 00
> [   23.265186] end_request: I/O error, dev sda, sector 12325327
> [   23.265257] Buffer I/O error on device dm-1, logical block 492034
> [   23.265327] lost page write due to I/O error on dm-1
> [   23.364800] kjournald starting.  Commit interval 5 seconds
> [   23.423871] sd 0:0:0:0: [sda] Unhandled sense code
> [   23.423874] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK
> driverbyte=DRIVER_SENSE
> [   23.423877] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current] 
> [   23.423880] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [   23.423886] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 01 bf 00 00 08
> 00

These are sense codes (Hardware error and Target Failure) generated by
the driver when it encounters some type of firmware related failure in
the RAID ... the only people who can really debug and fix this are the
aacraid people (cc'd).  It may be a simple matter of a support call if
it's a hardware error.

James
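
As background to the sense codes mentioned above, here is a rough Python sketch of how fixed-format SCSI sense data maps to the strings printed by the sd driver. It is illustrative only: the sample buffer is constructed by hand for this note and is not taken from the attached dmesg, and the lookup tables cover only a few well-known values.

```python
# Subset of SCSI sense keys (low nibble of byte 2 in fixed-format sense data).
SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
}

# Only the additional sense code pair seen in this bug is listed here.
ASC_ASCQ = {
    (0x44, 0x00): "Internal target failure",
}


def decode_fixed_sense(buf):
    """Return (sense key text, additional sense text) for fixed-format sense."""
    if len(buf) < 14:
        raise ValueError("sense buffer too short")
    response_code = buf[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = buf[2] & 0x0F
    asc, ascq = buf[12], buf[13]
    key_text = SENSE_KEYS.get(key, f"key 0x{key:x}")
    asc_text = ASC_ASCQ.get((asc, ascq), f"asc 0x{asc:02x} ascq 0x{ascq:02x}")
    return key_text, asc_text


if __name__ == "__main__":
    # Hypothetical 18-byte buffer: current error (0x70), sense key 0x4,
    # additional sense length 10, ASC 0x44, ASCQ 0x00.
    sample = bytes([0x70, 0, 0x04, 0, 0, 0, 0, 0x0A,
                    0, 0, 0, 0, 0x44, 0x00, 0, 0, 0, 0])
    print(decode_fixed_sense(sample))
```

A sense key of Medium Error (0x3) would point at the disks themselves, whereas the Hardware Error / Internal target failure pair seen here is consistent with the explanation above that the report originates in the controller or driver layer.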
Comment 6 lkolbe 2009-11-10 14:51:42 UTC
I understand that this indicates a hardware failure, but it doesn't occur when using 2.6.30. That's why I opened it as a regression in the first place. I welcome input from the aacraid people, of course :)

-- 
Lukas
Comment 7 Anonymous Emailer 2009-11-16 01:58:08 UTC
Reply-To: Ganapathy_Sridaran@adaptec.com

*** Copying the bugzilla email in the email address ***


Hi James,

We'll look into this issue and get back to you. I wonder if they are genuine write errors due to the medium errors. Is this problem seen with other RAID volumes or just with this one particular volume? 

Are you running Adaptec Storage Manager Software on this system? If so, can you please collect the support.zip file? Support.zip can be collected by clicking "support.zip" under the "actions" menu. The Support.zip file provides us additional information on any disk errors that might have happened in the system.

Thanks,
Gana 


Gana S. Sridaran
Engg., Manager - RAID FW/Drivers
Adaptec Inc.,
691, S. Milpitas Blvd.,
Milpitas, CA 95035
408.957.4985



Comment 8 lkolbe 2009-11-17 12:25:21 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:

>Hi James,
>
>We'll look into this issue and get back to you. I wonder if they are genuine
>write errors due to the medium errors. Is this problem seen with other RAID
>volumes or just with this one particular volume? 
>
>Are you running Adaptec Storage Manager Software on this system? If so, can
>you
>please collect the support.zip file? Support.zip can be collected by clicking
>"support.zip" under the "actions" menu. The Support.zip file provides us
>additional information on any disk errors that might have happened in the
>system.

Hello Gana,

We have the arcconf utility available on that system, but since it is
Debian Lenny based and we only found RPMs for the Storage Manager, we
cannot run it on that system (I suppose). Would the arcconf utility be
of any help here?

>Thanks,
>Gana 

Thanks for your help,
Lukas Kolbe
Comment 9 Rafael J. Wysocki 2009-11-17 22:44:00 UTC
On Tuesday 17 November 2009, Lukas Kolbe wrote:
> Rafael J. Wysocki wrote:
> 
> >This message has been generated automatically as a part of a report
> >of recent regressions.
> >
> >The following bug entry is on the current list of known regressions
> >from 2.6.31.  Please verify if it still should be listed and let me know
> >(either way).
> 
> It is still valid. We haven't yet been able to verify if it is either a
> hardware problem (working with the adaptec folks to sort that out) or a
> kernel problem (working with you to find that out ;). Kernel 2.6.30, as
> already said, seems to think everything is fine, so it really might be a
> regression.
> 
> >Bug-Entry    : http://bugzilla.kernel.org/show_bug.cgi?id=14577
> >Subject              : Data Corruption with Adaptec 52445, Firmware 5.2-0
> (17380)
> >Submitter    :  <lkolbe@techfak.uni-bielefeld.de>
> >Date         : 2009-11-10 13:31 (7 days old)
Comment 10 lkolbe 2009-11-18 13:57:28 UTC
This bug also blocks #14579
LSI asked us to try the current mptfusion driver, which is included in 2.6.32-rc, but we are unable to boot that kernel successfully.

FYI, the 24 1 TB SATA disks are in a RAID 60, and the system (Debian Lenny) is installed in an LVM volume.
Comment 11 lkolbe 2009-11-23 11:39:11 UTC
With 2.6.32-rc7 _and_ the new firmware 5.2-0 (17544) on the Adaptec controller, this problem no longer occurs. Thanks for your patience.