Bug 11564

Summary: ext3 I/O errors on certain hardware
Product: File System
Reporter: Felix Miata (mrmazda)
Component: ext3
Assignee: Andrew Morton (akpm)
Status: REJECTED INVALID
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: /var/log/messages Mandriva 2007.1 2.6.17
/var/log/messages OpenSUSE 10.2 2.6.18
/var/log/messages current Cooker 2.6.27-0rc6
/var/log/messages current Factory 2.6.27-0.rc5

Description Felix Miata 2008-09-13 19:20:35 UTC
Latest working kernel version: 2.6.17
Earliest failing kernel version: 2.6.18
Distribution: Mandriva, OpenSUSE
Hardware Environment: PIII-700 on i440BX (100FSB Tyan S1846)
piix/sym53c8xx (SYM8751SP)
dysfunctional HD: Quantum Atlas III QM39100TD-SW Rev: N1B0
OK HD: IBM DPSS 309170; 07N3120; MLC: PS0S96 (Ultrastar)
OK HD #2: 60G Seagate Barracuda PATA on piix
Software Environment: typical, except that all partitions are formatted ext3 with -I128 and -b1024 or -b2048 because of their small size (4.8G or less)
Problem Description:
Tail of most recent (Factory 2.6.27-rc6) /var/log/messages:
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7

Similar errors occur with other post-2.6.17 kernels. The typical result is rpm database corruption (see e.g. https://qa.mandriva.com/show_bug.cgi?id=32547, not reported by me), which makes the system very difficult to use.

I've run current Cookers on this hardware combination for several years, but started having trouble just over a year ago, when the 2.6.17 kernel was upgraded. I ran the manufacturer's QDPS diagnostics on the Quantum shortly after the problem appeared, about 13 or so months ago, and again a few days ago; both times the drive was OK according to QDPS. I ran the LSI controller's format program on it a few days ago too.

I then tried installing fresh Mandriva 2007.1 (complete success) and OpenSUSE 10.2 (limited number of errors of this type). Trying to do a current install of Cooker or Factory is hopeless. I tested Factory by copying a Factory/11.0 installation from the PATA to sda7 on SCSI, then trying to update to current Factory; Cooker had been on sda7 for several years.

The problem simply did not and does not exist with the Mandriva 2.6.17 and older kernels using the Atlas III. I tried cloning the Atlas III to the Ultrastar, and cannot reproduce the problem using either the Barracuda or the Ultrastar. Trying a different SCSI cable didn't help.

Steps to reproduce:
Try to use a wrong hardware combination.
Comment 1 Felix Miata 2008-09-13 19:24:51 UTC
Created attachment 17767 [details]
/var/log/messages Mandriva 2007.1 2.6.17
Comment 2 Felix Miata 2008-09-13 19:24:55 UTC
Created attachment 17768 [details]
/var/log/messages OpenSUSE 10.2 2.6.18
Comment 3 Felix Miata 2008-09-13 19:25:04 UTC
Created attachment 17769 [details]
/var/log/messages current Cooker 2.6.27-0rc6
Comment 4 Felix Miata 2008-09-13 19:25:08 UTC
Created attachment 17770 [details]
/var/log/messages current Factory 2.6.27-0.rc5
Comment 5 Anonymous Emailer 2008-09-14 00:14:39 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 13 Sep 2008 19:20:35 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11564
> 
>            Summary: ext3 I/O errors when <4096 blocksize on certain hardware
>            Product: File System
>            Version: 2.5
>      KernelVersion: 2.6.27
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: ext3
>         AssignedTo: akpm@osdl.org
>         ReportedBy: mrmazda@ij.net
> 
> 
> Latest working kernel version:2.6.17
> Earliest failing kernel version:2.6.18
> Distribution:Mandriva, OpenSUSE
> Hardware Environment: PIII-700 on i440BX (100FSB Tyan S1846)
> piix/sym53c8xx (SYM8751SP)
> dysfunctional HD: Quantum Atlas III QM39100TD-SW Rev: N1B0
> OK HD: IBM DPSS 309170; 07N3120; MLC: PS0S96 (Ultrastar)
> OK HD #2: 60G Seagate Barracuda PATA on piix
> Software Environment:typical, except  all partitions formatted ext3 -I128 &
> -b1024 or -b2048 due to their small size (4.8G or less)
> Problem Description:
> Tail of most recent (Factory 2.6.27-rc6) /var/log/messages:
> Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
> Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
> Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
> Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
> 
> Similar errors occur with other post-2.6.17 kernels. Typical result is rpm
> database corruption (see e.g. https://qa.mandriva.com/show_bug.cgi?id=32547
> not reported by me) making system very difficult to use.
> 
> I've run current Cookers on this hardware combination for several years, but
> just over a year ago started having trouble when the 2.6.17 kernel was
> upgraded. I ran the manufacturer's QDPS diagnostics on the Quantum shortly
> after the problem appeared about 13 or so months ago, and again a few days
> ago, both times OK according to QDPS. I ran the LSI controller's format
> program on it a few days ago too. I then tried installing fresh Mandriva
> 2007.1 (complete success) and OpenSUSE 10.2 (limited number of errors of this
> type). Trying to do a current install of Cooker or Factory are hopeless. I
> tested Factory by copying a Factory/11.0 installation from the PATA to sda7
> on SCSI, then trying to update to current Factory, while Cooker was on sda7
> for several years. The problem simply did and does not exist with the
> Mandriva 2.6.17 and old kernels using the Atlas III. I tried cloning the
> Atlas III to the Ultrastar, and cannot reproduce using either the Barracuda
> or the Ultrastar. Trying a different SCSI cable didn't help.
> 
> Steps to reproduce:
> Try to use a wrong hardware combination.
> 
Comment 6 Anonymous Emailer 2008-09-14 23:22:44 UTC
Reply-To: adilger@sun.com

On Sep 14, 2008  00:14 -0700, Andrew Morton wrote:
> On Sat, 13 Sep 2008 19:20:35 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=11564
> > Tail of most recent (Factory 2.6.27-rc6) /var/log/messages:
> > Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> > Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
> > Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> > Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
> > Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
> > Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7

I'd think from the above errors that the problem is in the device itself,
or in the SCSI layer.  No amount of ext3 IO should be able to trigger SCSI
errors.

> > Similar errors occur with other post-2.6.17 kernels. Typical result is rpm
> > database corruption (see e.g. https://qa.mandriva.com/show_bug.cgi?id=32547
> > not reported by me) making system very difficult to use.
> > 
> > The problem simply did and does not exist with the
> > Mandriva 2.6.17 and old kernels using the Atlas III. I tried cloning
> > the Atlas III to the Ultrastar, and cannot reproduce using either the
> > Barracuda or the Ultrastar. Trying a different SCSI cable didn't help.

This sounds like a case where git-bisect of 2.6.17-2.6.18 would be able
to isolate the problem fairly efficiently.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
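
The reading of the log given in this comment (SCSI errors surface first, and JBD only reports them afterwards) can be checked mechanically. Below is a minimal Python sketch that sorts the quoted log lines by the layer that emitted them and collects the failing sectors. The layer labels ("scsi", "block", "jbd") and the function names are illustrative choices of mine, not kernel terminology, and the patterns only cover the message formats quoted in this report.

```python
import re
from collections import Counter

# Sample lines copied from the log tail quoted in this report.
SAMPLE_LOG = """\
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
"""

# Hypothetical layer labels; each pattern matches one message format seen above.
PATTERNS = [
    ("scsi", re.compile(
        r"kernel: sd \S+ \[(?P<dev>\w+)\] Result: hostbyte=(?P<host>\w+)")),
    ("block", re.compile(
        r"kernel: end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")),
    ("jbd", re.compile(
        r"kernel: JBD: Detected IO errors while flushing file data on (?P<dev>\w+)")),
]


def classify(line):
    """Return (layer, fields) for a recognised log line, or None otherwise."""
    for layer, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return layer, match.groupdict()
    return None


def summarize(text):
    """Count messages per layer, list failing sectors, and record layer order."""
    counts = Counter()
    sectors = []
    first_seen = []  # layers in the order they first appear in the log
    for line in text.splitlines():
        hit = classify(line)
        if hit is None:
            continue
        layer, fields = hit
        counts[layer] += 1
        if layer not in first_seen:
            first_seen.append(layer)
        if layer == "block":
            sectors.append(int(fields["sector"]))
    return counts, sectors, first_seen


if __name__ == "__main__":
    counts, sectors, first_seen = summarize(SAMPLE_LOG)
    print(dict(counts))
    print(sectors)
    print(first_seen)
```

On the sample above, the SCSI result lines appear before the block-layer and JBD messages, which is consistent with the filesystem reporting a failure that originated below it. To try it on a full /var/log/messages, pass the file contents to summarize() instead of SAMPLE_LOG.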
Comment 7 Felix Miata 2008-09-18 10:17:43 UTC
I did more testing and found that a 4096 block size does not help. I put the Atlas III in a different system, on a twin SYM8751SP SCSI host, using a different cable. With both the 2.6.17 kernel in Fedora 4 and the 2.6.19 kernel in Knoppix 5.1.1, the errors do occur. Obviously Linux is better at detecting a bad disk than the manufacturer's disk failure detection software. -> INVALID