Most recent kernel where this bug did not occur:
Distribution: Fedora Core 4
Hardware Environment: Tyan S2882, 2 x Opteron 252, 8G RAM, LSI MegaRAID SATA 150-4
Software Environment:
Problem Description:

A RAID1 mirror was set up using the MegaRAID card and 2 x 250 GB drives. After only a day or so of use there was severe filesystem corruption (using ext3), and the user complained that even before the corruption it was incredibly slow (the mirror is used to house $HOME). We have other machines here with only 4G RAM using the same card and setup; performance there is fine and there has been no filesystem corruption.

In one case I removed the drives from the card and connected them to the SATA ports on the motherboard, using software RAID. That machine is working fine. In the other case I have tried a Fedora test kernel, equivalent to 2.6.16-rc2, which has the latest version of the MegaRAID driver (2.20.4.7). I was unable to recover the filesystem on the original partition when running fsck under the new kernel; however, at least on this occasion fsck kept running (for two days...) instead of crashing. When running mkfs on the hardware RAID1 mirror, performance still seems very slow.

Hence it appears there is a specific problem with the LSI megaraid driver on x86_64 machines with more than 4G RAM.

Steps to reproduce:
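For context, the software-RAID fallback described above, and the kind of quick sequential-write check used to compare it with the hardware mirror, look roughly like this (a sketch only; device names, partition numbers, and mount point are assumptions, not the exact configuration of these machines):

  # Build a software RAID1 mirror from the two drives on the on-board SATA ports
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mke2fs -j /dev/md0            # ext3 filesystem, as on the hardware mirror
  mount /dev/md0 /home

  # Rough sequential-write comparison against the hardware RAID1 volume
  time dd if=/dev/zero of=/home/ddtest bs=1M count=2048
  sync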
Changed severity to high owing to the filesystem corruption and severe performance degradation.
Begin forwarded message:

Date: Fri, 10 Feb 2006 10:19:45 -0800
From: bugme-daemon@bugzilla.kernel.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 6052] New: Megaraid file corruption and performance degradation on x86_64 with 8G RAM

http://bugzilla.kernel.org/show_bug.cgi?id=6052

Summary: Megaraid file corruption and performance degradation on x86_64 with 8G RAM
Kernel Version: 2.6.15-1.2005_FC4smp
Status: NEW
Severity: normal
Owner: andmike@us.ibm.com
Submitter: bloch@verdurin.com

[The forwarded mail repeats the hardware environment and problem description quoted in the original report above.]
Reply-To: ak@muc.de

On Fri, Feb 10, 2006 at 11:26:40AM -0800, Andrew Morton wrote:
> Begin forwarded message:

Hmm, could be the merging issue too? Does megaraid do anything funny with sg lists?

Does iommu=nomerge help? If yes apply
ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/gart-dma-merge
and report back.

-Andi
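A sketch of how those two suggestions can be tried on an FC4 box (the GRUB entry, source-tree path, and patch level are assumptions, not taken from the affected machine):

  # 1) Disable IOMMU merging for one boot: append iommu=nomerge to the
  #    kernel line of the relevant entry in /boot/grub/grub.conf, e.g.
  kernel /vmlinuz-2.6.15-1.2005_FC4smp ro root=LABEL=/ iommu=nomerge

  # 2) If that helps, fetch and apply the gart-dma-merge patch to a
  #    kernel source tree and rebuild (patch level assumed to be -p1)
  cd /usr/src/linux-2.6.16-rc2
  wget ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/gart-dma-merge
  patch -p1 < gart-dma-merge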
Have tried rebooting the machine remotely with the iommu parameter but it hasn't come back up... I'll have physical access to it again on Monday.

The original report to LKML was here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113093777230634&w=2
and the original Fedora bugzilla report (regarding the ext3 errors) is at:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=172284
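Once the machine is back up, whether the parameter actually took effect can be confirmed from the running kernel's command line, e.g.:

  cat /proc/cmdline    # should list iommu=nomerge if the modified entry was booted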
Booting with iommu=nomerge doesn't seem to have helped. After having wiped and rebuilt the filesystem in that configuration, there are still errors:

EXT3-fs error (device sdb2): ext3_check_descriptors: Block bitmap for group 1 not in group (block 0)!
EXT3-fs: group descriptors corrupted !

These errors occur when trying to mount the newly-created filesystem. mke2fs was also rather slow, as before.
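For the record, the sequence that hits the mount error above is essentially the following (mount point and mke2fs options are assumptions):

  mke2fs -j /dev/sdb2      # noticeably slower here than on the 4G machines
  mount /dev/sdb2 /home    # fails; dmesg shows the group-descriptor errors quoted above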
As suggested on linux-scsi, I altered the controller to use 'cachedio' instead of 'directio'. This made performance a lot faster but there are still corruption problems. When I untarred a kernel image, there was a journal error and the filesystem was remounted read-only:

EXT3 FS on sdb2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device sdb2): ext3_new_block: Allocating block in system zone - block = 44335105
Aborting journal on device sdb2.
EXT3-fs error (device sdb2) in ext3_reserve_inode_write: Journal has aborted
ext3_abort called.
EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
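The untar test that triggered the journal abort was essentially of this form (tarball name and mount point are assumptions):

  cd /home/test
  tar xjf /tmp/linux-2.6.15.tar.bz2    # moderate sustained write load
  dmesg | tail -n 30                   # shows the ext3_new_block / journal-abort messages above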
Please file Fedora bugs in their own bugzilla, or reproduce on mainline.
As I indicated in my earlier comments, I had already reported it in the Fedora bugzilla. Of course, their policy is "report upstream", which, combined with your reaction, would leave me with no avenue at all. In any case, it appears the problem has been addressed in version 2.20.4.9 of the megaraid SCSI module.
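For anyone checking whether they already have the fixed driver, the version of the running module can be queried with something like the following (the module name megaraid_mbox for the 2.20.4.x series is an assumption here):

  modinfo -F version megaraid_mbox
  # or, if the module is loaded and exports a version:
  cat /sys/module/megaraid_mbox/version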