Created attachment 23725 [details]
dmesg of the running (and failing) 2.6.32-rc6 on this system

When booting this system with the 2.6.32-rc6 kernel, we see immediate data corruption on the root filesystem:

[ 21.813985] Adding 4194296k swap on /dev/mapper/system-swap. Priority:-1 extents:1 across:4194296k
[ 21.903794] sd 0:0:0:0: [sda] Unhandled sense code
[ 21.903865] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 21.903998] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[ 21.904219] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[ 21.904382] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 01 bf 00 00 08 00
[ 21.905051] end_request: I/O error, dev sda, sector 447
[ 21.905122] Buffer I/O error on device dm-0, logical block 0
[ 21.905191] lost page write due to I/O error on dm-0
[ 21.905271] EXT3 FS on dm-0, internal journal
[ 23.263958] sd 0:0:0:0: [sda] Unhandled sense code
[ 23.264029] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 23.264162] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[ 23.264368] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[ 23.264531] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 bc 11 cf 00 00 08 00
[ 23.265186] end_request: I/O error, dev sda, sector 12325327
[ 23.265257] Buffer I/O error on device dm-1, logical block 492034
[ 23.265327] lost page write due to I/O error on dm-1
[ 23.364800] kjournald starting. Commit interval 5 seconds
[ 23.423871] sd 0:0:0:0: [sda] Unhandled sense code
[ 23.423874] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 23.423877] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
[ 23.423880] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
[ 23.423886] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 01 bf 00 00 08 00
[ 23.423893] end_request: I/O error, dev sda, sector 8389055
[ 23.423983] Buffer I/O error on device dm-1, logical block 0
[ 23.424052] lost page write due to I/O error on dm-1

On 2.6.30 the system works 'fine' (in that it doesn't shred the filesystems on bootup).

dmesg, lspci and lsscsi output are all attached below.
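As a cross-check of the log above, the failing Write(10) CDBs can be decoded by hand. The following is a minimal sketch in Python; the helper name decode_write10 is made up for this illustration and is not part of any kernel or Adaptec tool. It assumes the standard 10-byte WRITE(10) layout (opcode 0x2a, big-endian LBA in bytes 2-5, transfer length in bytes 7-8).

```python
def decode_write10(cdb_hex):
    """Return (lba, block_count) for a WRITE(10) CDB given as hex bytes.

    Hypothetical helper for illustration only.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, count


# The three CDBs copied from the dmesg excerpt in this report.
for cdb in (
    "2a 00 00 00 01 bf 00 00 08 00",
    "2a 00 00 bc 11 cf 00 00 08 00",
    "2a 00 00 80 01 bf 00 00 08 00",
):
    lba, count = decode_write10(cdb)
    print(f"{cdb} -> LBA {lba}, {count} blocks")
```

The decoded LBAs (447, 12325327 and 8389055) match the sector numbers in the end_request lines, and each failed write covers 8 blocks.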
Created attachment 23726 [details] lspci -vvv on a working kernel
Created attachment 23727 [details] lsscsi -t/-v/-c/-H output on a working kernel
Created attachment 23728 [details]
contents of /proc/modules on the working system (2.6.30)

/proc/version is:
Linux version 2.6.30-1-amd64 (Debian 2.6.30-5+techfaklenny1) (sfrey@TechFak.Uni-Bielefeld.DE) (gcc version 4.3.2 (Debian 4.3.2-1.1) ) #1 SMP Sat Aug 15 11:20:08 UTC 2009
Firmware of the Adaptec-Controller is 5.2-0 (17380).
Reply-To: James.Bottomley@suse.de

On Tue, 2009-11-10 at 13:31 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14577

> [ 21.903794] sd 0:0:0:0: [sda] Unhandled sense code
> [ 21.903865] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 21.903998] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 21.904219] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 21.904382] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 01 bf 00 00 08 00
> [ 21.905051] end_request: I/O error, dev sda, sector 447
> [ 21.905122] Buffer I/O error on device dm-0, logical block 0
> [ 21.905191] lost page write due to I/O error on dm-0
> [ 21.905271] EXT3 FS on dm-0, internal journal
> [ 23.263958] sd 0:0:0:0: [sda] Unhandled sense code
> [ 23.264029] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 23.264162] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 23.264368] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 23.264531] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 bc 11 cf 00 00 08 00
> [ 23.265186] end_request: I/O error, dev sda, sector 12325327
> [ 23.265257] Buffer I/O error on device dm-1, logical block 492034
> [ 23.265327] lost page write due to I/O error on dm-1
> [ 23.364800] kjournald starting. Commit interval 5 seconds
> [ 23.423871] sd 0:0:0:0: [sda] Unhandled sense code
> [ 23.423874] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 23.423877] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 23.423880] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 23.423886] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 01 bf 00 00 08 00

These are sense codes (Hardware Error and Internal Target Failure) generated by the driver when it encounters some type of firmware-related failure in the RAID ... the only people who can really debug and fix this are the aacraid people (cc'd).

It may be a simple matter of a support call if it's a hardware error.

James
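To illustrate how the "Sense Key" and "Add. Sense" lines relate to raw sense data, here is a minimal Python sketch. It assumes fixed-format sense data as defined by the SCSI Primary Commands standard (sense key in the low nibble of byte 2, additional sense code and qualifier in bytes 12 and 13). The buffer below is constructed by hand for this example and was not captured from the affected system; decode_fixed_sense and the two lookup tables are hypothetical names holding only the values seen in this report.

```python
# Only the codes that appear in this report; the standard defines many more.
SENSE_KEYS = {0x4: "Hardware Error"}
ASC_ASCQ = {(0x44, 0x00): "Internal target failure"}


def decode_fixed_sense(buf):
    """Return (sense_key, asc, ascq) from fixed-format sense data.

    Hypothetical helper for illustration only.
    """
    if len(buf) < 14:
        raise ValueError("sense buffer too short")
    if buf[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return buf[2] & 0x0F, buf[12], buf[13]


# Hand-built example buffer (NOT real data from this bug):
# response code 0x70 (current error), sense key 0x4, ASC 0x44, ASCQ 0x00.
example = bytes([0x70, 0, 0x04, 0, 0, 0, 0, 0x0A,
                 0, 0, 0, 0, 0x44, 0x00, 0, 0, 0, 0])

key, asc, ascq = decode_fixed_sense(example)
print("Sense Key :", SENSE_KEYS.get(key, hex(key)))
print("Add. Sense:", ASC_ASCQ.get((asc, ascq), f"{asc:#04x}/{ascq:#04x}"))
```

Whether aacraid builds its sense buffer in exactly this layout is not something this thread confirms; the sketch only shows how the two printed strings map to numeric codes.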
I understand that indicates a hardware failure, but this doesn't occur when using 2.6.30. That's why I opened it as a regression in the first place. I welcome input from the aacraid people, of course :) -- Lukas
Reply-To: Ganapathy_Sridaran@adaptec.com

*** Copying the bugzilla email in the email address ***

Hi James,

We'll look into this issue and get back to you. I wonder if they are genuine write errors due to the medium errors. Is this problem seen with other RAID volumes or just with this one particular volume?

Are you running Adaptec Storage Manager Software on this system? If so, can you please collect the support.zip file? Support.zip can be collected by clicking "support.zip" under the "actions" menu. The Support.zip file provides us additional information on any disk errors that might have happened in the system.

Thanks,
Gana

Gana S. Sridaran
Engg., Manager - RAID FW/Drivers
Adaptec Inc.,
691, S. Milpitas Blvd.,
Milpitas, CA 95035
408.957.4985

-----Original Message-----
From: James Bottomley [mailto:James.Bottomley@suse.de]
Sent: Tuesday, November 10, 2009 6:04 AM
To: bugzilla-daemon@bugzilla.kernel.org
Cc: linux-scsi@vger.kernel.org; AACRAID
Subject: Re: [Bug 14577] New: Data Corruption with Adaptec 52445

On Tue, 2009-11-10 at 13:31 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14577

> [ 21.903794] sd 0:0:0:0: [sda] Unhandled sense code
> [ 21.903865] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 21.903998] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 21.904219] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 21.904382] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 01 bf 00 00 08 00
> [ 21.905051] end_request: I/O error, dev sda, sector 447
> [ 21.905122] Buffer I/O error on device dm-0, logical block 0
> [ 21.905191] lost page write due to I/O error on dm-0
> [ 21.905271] EXT3 FS on dm-0, internal journal
> [ 23.263958] sd 0:0:0:0: [sda] Unhandled sense code
> [ 23.264029] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 23.264162] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 23.264368] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 23.264531] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 bc 11 cf 00 00 08 00
> [ 23.265186] end_request: I/O error, dev sda, sector 12325327
> [ 23.265257] Buffer I/O error on device dm-1, logical block 492034
> [ 23.265327] lost page write due to I/O error on dm-1
> [ 23.364800] kjournald starting. Commit interval 5 seconds
> [ 23.423871] sd 0:0:0:0: [sda] Unhandled sense code
> [ 23.423874] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ 23.423877] sd 0:0:0:0: [sda] Sense Key : Hardware Error [current]
> [ 23.423880] sd 0:0:0:0: [sda] Add. Sense: Internal target failure
> [ 23.423886] sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 80 01 bf 00 00 08 00

These are sense codes (Hardware error and Target Failure) generated by the driver when it encounters some type of firmware related failure in the RAID ... the only people who can really debug and fix this are the aacraid people (cc'd).

It may be a simple matter of a support call if it's a hardware error.

James
bugzilla-daemon@bugzilla.kernel.org wrote:

>Hi James,
>
>We'll look into this issue and get back to you. I wonder if they are genuine
>write errors due to the medium errors. Is this problem seen with other RAID
>volumes or just with this one particular volume?
>
>Are you running Adaptec Storage Manager Software on this system? If so, can
>you please collect the support.zip file? Support.zip can be collected by
>clicking "support.zip" under the "actions" menu. The Support.zip file provides
>us additional information on any disk errors that might have happened in the
>system.

Hello Gana,

we have the arcconf utility available on that system, but since it is Debian Lenny based and we only found RPMs for the storage manager, we cannot run it on that system (I suppose). Would the arcconf utility be of any help here?

>Thanks,
>Gana

Thanks for your help,
Lukas Kolbe
On Tuesday 17 November 2009, Lukas Kolbe wrote:
> Rafael J. Wysocki wrote:
>
> >This message has been generated automatically as a part of a report
> >of recent regressions.
> >
> >The following bug entry is on the current list of known regressions
> >from 2.6.31. Please verify if it still should be listed and let me know
> >(either way).
>
> It is still valid. We haven't yet been able to verify if it is either a
> hardware problem (working with the adaptec folks to sort that out) or a
> kernel problem (working with you to find that out ;). Kernel 2.6.30, as
> already said, seems to think everything is fine, so it really might be a
> regression.
>
> >Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14577
> >Subject : Data Corruption with Adaptec 52445, Firmware 5.2-0 (17380)
> >Submitter : <lkolbe@techfak.uni-bielefeld.de>
> >Date : 2009-11-10 13:31 (7 days old)
This bug also blocks #14579. LSI asked us to try the current mptfusion driver, which is included in 2.6.32-rc, but we are unable to boot that kernel successfully.

FYI, the 24 1 TB SATA disks are in a RAID-60, and the system (Debian Lenny) is installed in an LVM volume.
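Since the system lives on LVM, the dmesg excerpt in the original report pairs each failing sda sector with a dm-N logical block. The following Python sketch checks that arithmetic under two assumptions that are not stated anywhere in this thread: the buffer blocks are 4 KiB (8 sectors of 512 bytes), and each volume is one contiguous, linearly mapped range on sda. The start sectors are inferred from the two log lines that report "logical block 0", and logical_block is a hypothetical helper name.

```python
SECTORS_PER_BLOCK = 8  # assumption: 4 KiB blocks over 512-byte sectors

# Inferred from the log lines that report "logical block 0" (assumption:
# each volume is a single contiguous linear range on sda).
DM0_START = 447
DM1_START = 8389055


def logical_block(sector, volume_start):
    """Map an sda sector to a block index inside a linearly mapped volume."""
    offset = sector - volume_start
    if offset < 0 or offset % SECTORS_PER_BLOCK:
        raise ValueError("sector is not block-aligned inside this volume")
    return offset // SECTORS_PER_BLOCK


print(logical_block(447, DM0_START))       # dm-0 block for sector 447
print(logical_block(8389055, DM1_START))   # dm-1 block for sector 8389055
print(logical_block(12325327, DM1_START))  # dm-1 block for sector 12325327
```

Under these assumptions sector 12325327 lands on dm-1 logical block 492034, which agrees with the "Buffer I/O error on device dm-1, logical block 492034" line, so the three errors are consistent with ordinary writes to the two volumes rather than with stray addresses.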
With rc7 _and_ new firmware 5.2-0 (17544) on the adaptec controller, this problem doesn't occur anymore. Thanks for your patience.