Bug 16545 - i7 920 motherboard, corrupts ext4 filesystem on SATA
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB
Hardware: All Linux
Importance: P1 high
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-09 04:02 UTC by Andrew Valencia
Modified: 2012-02-22 21:51 UTC

See Also:
Kernel Version: 2.6.32-24-generic-pae
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci output (28.57 KB, text/plain)
2010-08-09 04:03 UTC, Andrew Valencia
lsusb output (23.81 KB, text/plain)
2010-08-09 04:03 UTC, Andrew Valencia
dmesg (55.78 KB, text/plain)
2010-08-09 04:04 UTC, Andrew Valencia

Description Andrew Valencia 2010-08-09 04:02:05 UTC
Newer i7 quad core with 6GB of RAM running in 32-bit mode (problem verified with and without PAE).  System installs and runs OK, but when I attach an external 1.5TB drive via USB as an ext4 filesystem, and then back up a 1TB filesystem onto it (using dump), I reliably get through about 75% of the backup, and then I suddenly find the system offline--unresponsive to keyboard, no disk activity, no response to pings.  On reset there is no sign of what happened in the logs.

Post-crash fsck shows USB storage OK, partition being backed up OK, but root filesystem has extensive corruption, requiring manual fsck and repair--in one case, beyond the abilities of fsck and requiring a filesystem rebuild.

I suspected LVM, but have reproduced without it.  I thought it was the drive, but have swapped in a new drive.  I thought it was a PAE issue, but reproduced with a non-PAE kernel.  I am left to wonder if 32-bit kernels are hitting something Bad when run on an i7 920 motherboard.  In about 2 weeks I'll be swapping out this hardware in its entirety, at which point I can install 64-bit and see if the problem still reproduces--somebody let me know if that's of interest.

Otherwise I'm pretty much out of ideas.  This is the most messed up I've ever seen a properly run system get.  I'm pretty sure it only happens with USB mass storage I/O, and I have a moratorium on USB for the system--I'll update if we hit corruption anyway.  So far it's looking OK.
Comment 1 Andrew Valencia 2010-08-09 04:03:18 UTC
Created attachment 27381
lspci output
Comment 2 Andrew Valencia 2010-08-09 04:03:42 UTC
Created attachment 27382
lsusb output
Comment 3 Andrew Valencia 2010-08-09 04:04:02 UTC
Created attachment 27383
dmesg
Comment 4 Andrew Valencia 2010-08-10 01:54:04 UTC
Nope, today verified that with *no* USB activity, the filesystem was still corrupted.  So this is a core ext4/SATA bug.  I'm going to try to shuffle everything off the first SATA master/slave port pair and see if it's related to a specific SATA controller port.  Reaching, I know... but dang, gotta keep this thing alive until the replacement server gets here.
Comment 5 Andrew Valencia 2010-08-10 02:35:04 UTC
<bad word inserted here>

Did you know that fsck.ext4, when it finishes, does not necessarily leave a clean and consistent filesystem?  You have to run it successively (at least with the kind of corruption I'm dealing with) until it finishes cleanly.

So I have *no* idea whether USB activity is indeed needed to cause the problem.  I will run with a now (apparently) fully clean disk, so if I see corruption again I can be pretty sure it's from a new instance of the problem, not residual damage left by fsck.ext4 incompletely doing its job.
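
For anyone wanting to repeat the "run it until it finishes cleanly" step, here is a minimal sketch rather than anything from the original report. It assumes the filesystem is unmounted, that e2fsck is on the PATH, and that its exit status follows the bit values documented in the e2fsck(8) man page. The function names, the pass limit, and the device path are hypothetical.

```python
import subprocess
from typing import Callable, List

# e2fsck exit status bits that mean the run itself failed, per e2fsck(8):
# 8 = operational error, 16 = usage error, 32 = cancelled, 128 = library error.
FSCK_FATAL_MASK = 8 | 16 | 32 | 128


def run_e2fsck(device: str) -> int:
    """Run one forced, non-interactive e2fsck pass and return its exit status."""
    # -f checks even if the filesystem is marked clean.
    # -y answers "yes" to every repair prompt, which can discard data.
    return subprocess.run(["e2fsck", "-f", "-y", device]).returncode


def fsck_until_clean(
    device: str,
    max_passes: int = 5,
    run_pass: Callable[[str], int] = run_e2fsck,
) -> List[int]:
    """Repeat fsck passes until one exits 0; return the status of every pass."""
    statuses: List[int] = []
    for _ in range(max_passes):
        status = run_pass(device)
        statuses.append(status)
        if status == 0:
            # A pass that found nothing to fix: the filesystem is consistent.
            return statuses
        if status & FSCK_FATAL_MASK:
            raise RuntimeError(f"e2fsck failed on {device} with status {status}")
        # Status 1, 2 or 4 (errors corrected or left behind): check again.
    raise RuntimeError(
        f"{device} still not clean after {max_passes} passes: {statuses}"
    )
```

Calling fsck_until_clean("/dev/sdXN") with a real, unmounted partition in place of the placeholder returns the list of exit statuses, ending in 0 once a pass finds nothing left to repair.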
Comment 6 Andrew Valencia 2010-08-11 03:37:28 UTC
Confirmed corruption with exclusively SATA filesystem access.
Drive has been swapped, so all that remains is a problem with the
chain from ext4 through SATA driver and onto the MB's chip set.
Comment 7 Andrew Morton 2010-08-26 23:01:06 UTC
I don't know whether to bug the ext4 guys or the ata guys or someone else, really.  Hard.

Are you sure the hardware isn't just busted?  Has it been observed on more than one machine?
Comment 8 Andrew Valencia 2010-08-27 15:51:51 UTC
It is AT MOST a motherboard compatibility issue.  Swapped MB with all other components the same and am running fine.  Since this bug hasn't received any "me too" posts, I think it's safe to say it was bad hardware.  Sorry to bother you, and feel free to close it out.
Comment 9 Greg Kroah-Hartman 2012-02-22 21:51:30 UTC
All USB bugs should be sent to the linux-usb@vger.kernel.org mailing 
list, and not entered into bugzilla.  Please bring this issue up there,
if it is still a problem in the latest kernel release.
