Latest working kernel version: 2.6.17
Earliest failing kernel version: 2.6.18
Distribution: Mandriva, OpenSUSE
Hardware Environment:
  PIII-700 on i440BX (100FSB Tyan S1846)
  piix/sym53c8xx (SYM8751SP)
  dysfunctional HD: Quantum Atlas III QM39100TD-SW Rev: N1B0
  OK HD: IBM DPSS 309170; 07N3120; MLC: PS0S96 (Ultrastar)
  OK HD #2: 60G Seagate Barracuda PATA on piix
Software Environment: typical, except all partitions formatted ext3 -I128 & -b1024 or -b2048 due to their small size (4.8G or less)

Problem Description:
Tail of most recent (Factory 2.6.27-rc6) /var/log/messages:

Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7

Similar errors occur with other post-2.6.17 kernels. The typical result is rpm database corruption (see e.g. https://qa.mandriva.com/show_bug.cgi?id=32547, not reported by me), making the system very difficult to use.

I've run current Cookers on this hardware combination for several years, but just over a year ago I started having trouble after the kernel was upgraded from 2.6.17. I ran the manufacturer's QDPS diagnostics on the Quantum shortly after the problem appeared, about 13 or so months ago, and again a few days ago; both times the drive was OK according to QDPS. I ran the LSI controller's format program on it a few days ago too. I then tried installing fresh Mandriva 2007.1 (complete success) and OpenSUSE 10.2 (limited number of errors of this type). Trying to do a current install of Cooker or Factory is hopeless.

I tested Factory by copying a Factory/11.0 installation from the PATA to sda7 on SCSI, then trying to update to current Factory, while Cooker was on sda7 for several years. The problem did not and does not exist with the Mandriva 2.6.17 and older kernels using the Atlas III. I tried cloning the Atlas III to the Ultrastar, and cannot reproduce the errors using either the Barracuda or the Ultrastar. Trying a different SCSI cable didn't help.

Steps to reproduce:
Try to use a wrong hardware combination.
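When comparing the attached logs across kernels, it can help to pull out just the block-layer error lines and count them. The following is a minimal sketch, not part of the original report: the function names `find_io_errors` and `summarize` are made up for illustration, and the pattern assumes the `end_request: I/O error, dev ..., sector ...` wording shown in the excerpt above.

```python
import re
from collections import Counter

# Matches block-layer error lines of the form quoted in the report, e.g.
# "end_request: I/O error, dev sda, sector 1810985"
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def find_io_errors(log_text):
    """Return (device, sector) tuples in order of appearance."""
    return [(dev, int(sector)) for dev, sector in IO_ERROR_RE.findall(log_text)]


def summarize(errors):
    """Per device: number of errors and the lowest/highest failing sector."""
    counts = Counter(dev for dev, _ in errors)
    summary = {}
    for dev, count in counts.items():
        sectors = [s for d, s in errors if d == dev]
        summary[dev] = {
            "count": count,
            "min_sector": min(sectors),
            "max_sector": max(sectors),
        }
    return summary


# Sample input: the log tail quoted in the report.
SAMPLE = """\
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
"""

if __name__ == "__main__":
    print(summarize(find_io_errors(SAMPLE)))
```

Running the same summary over each attached /var/log/messages would give a rough per-kernel error count, which is easier to compare than the raw logs.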
Created attachment 17767 /var/log/messages Mandriva 2007.1 2.6.17
Created attachment 17768 /var/log/messages OpenSUSE 10.2 2.6.18
Created attachment 17769 /var/log/messages current Cooker 2.6.27-0rc6
Created attachment 17770 /var/log/messages current Factory 2.6.27-0.rc5
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sat, 13 Sep 2008 19:20:35 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11564
>
>            Summary: ext3 I/O errors when <4096 blocksize on certain hardware
>            Product: File System
>            Version: 2.5
>      KernelVersion: 2.6.27
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: ext3
>         AssignedTo: akpm@osdl.org
>         ReportedBy: mrmazda@ij.net
Reply-To: adilger@sun.com

On Sep 14, 2008 00:14 -0700, Andrew Morton wrote:
> On Sat, 13 Sep 2008 19:20:35 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=11564
> > Tail of most recent (Factory 2.6.27-rc6) /var/log/messages:
> > Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> > Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1810985
> > Sep 13 21:29:23 xxxxx kernel: sd 0:0:1:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
> > Sep 13 21:29:23 xxxxx kernel: end_request: I/O error, dev sda, sector 1811039
> > Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7
> > Sep 13 21:29:23 xxxxx kernel: JBD: Detected IO errors while flushing file data on sda7

I'd think from the above errors that the problem is in the device itself, or in the SCSI layer. No amount of ext3 IO should be able to trigger SCSI errors.

> > Similar errors occur with other post-2.6.17 kernels. Typical result is rpm
> > database corruption (see e.g. https://qa.mandriva.com/show_bug.cgi?id=32547
> > not reported by me) making system very difficult to use.
> >
> > The problem simply did and does not exist with the
> > Mandriva 2.6.17 and old kernels using the Atlas III. I tried cloning
> > the Atlas III to the Ultrastar, and cannot reproduce using either the
> > Barracuda or the Ultrastar. Trying a different SCSI cable didn't help.

This sounds like a case where git-bisect of 2.6.17-2.6.18 would be able to isolate the problem fairly efficiently.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
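For readers unfamiliar with why bisection is efficient here: git-bisect is a binary search over the commits between a known-good and a known-bad kernel, so the number of kernels to build and boot grows only with the logarithm of the number of commits. Below is a minimal sketch of that idea, not git's actual implementation. The commit names, the count of 1024, and the helper `first_bad` are all hypothetical; the thread does not say how many commits lie between 2.6.17 and 2.6.18, and the sketch assumes linear history in which every commit after the culprit also fails.

```python
def first_bad(commits, is_bad):
    """Binary search for the first failing commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    once a commit is bad every later commit is bad too.
    Returns (first_bad_commit, number_of_commits_tested).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # in real life: build, boot, and exercise the disk
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tests


if __name__ == "__main__":
    # Hypothetical history: 1024 commits, regression introduced at "c700".
    commits = [f"c{i}" for i in range(1024)]
    culprit, tests = first_bad(commits, lambda c: int(c[1:]) >= 700)
    print(culprit, tests)
```

With the real tool, the corresponding workflow would start from `git bisect start`, `git bisect good v2.6.17`, and `git bisect bad v2.6.18`, then build and boot each kernel git checks out and mark it good or bad depending on whether the I/O errors appear.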
I did more testing, and found that a 4096 blocksize does not help. I put the Atlas III in a different system, on a twin SYM8751SP SCSI host, using a different cable. Errors still occur there, with both the 2.6.17 kernel in Fedora 4 and the 2.6.19 kernel in Knoppix 5.1.1. Obviously Linux is better at detecting a bad disk than the manufacturer's disk failure detection software.

-> INVALID