Bug 6052

Summary: Megaraid file corruption and performance degradation on x86_64 with 8G RAM
Product: SCSI Drivers Reporter: Adam Huffman (bloch)
Component: Other Assignee: Mike Anderson (andmike)
Status: CLOSED INVALID    
Severity: high    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.15-1.2005_FC4smp Subsystem:
Regression: --- Bisected commit-id:

Description Adam Huffman 2006-02-10 10:19:38 UTC
Most recent kernel where this bug did not occur:
Distribution: 
Fedora Core 4

Hardware Environment: 

Tyan S2882
2 x Opteron 252
8G RAM
LSI MegaRAID SATA 150-4

Software Environment:

Problem Description:

A RAID1 mirror was set up using the MegaRAID card and 2 x 250 GB drives.
After only a day or so of use, there was severe filesystem corruption (using
ext3) and the user complained that even before the corruption it was incredibly
slow (the mirror is used to house $HOME).

We have other machines here with only 4G RAM using the same card and setup. 
Performance there is fine and there's been no filesystem corruption.

In one case I removed the drives from the card and connected them to the SATA
ports on the motherboard, using software RAID.  That machine is working fine.

In the other case I have tried using a Fedora test kernel, equivalent to
2.6.16-rc2, which has the latest version of the MegaRAID driver (2.20.4.7).
I was unable to recover the filesystem on the original partition when running
fsck using the new kernel.  However, at least on this occasion fsck kept running
(for two days...) instead of crashing.

When running mkfs on the hardware RAID1 mirror, performance still seems to be
very slow.

Hence it appears there is a specific problem with the LSI megaraid driver on
x86_64 machines with more than 4G RAM.
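
To tell which machines fall on the affected side of the 4G boundary, total
memory can be read from /proc/meminfo.  The following is only a minimal
sketch: it assumes the usual "MemTotal: <n> kB" line, and the function names
are illustrative, not part of any existing tool.

```python
import re

FOUR_GIB_KIB = 4 * 1024 * 1024  # 4 GiB expressed in KiB


def mem_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))


def above_4g(meminfo_text):
    """True when the reported total RAM is larger than 4 GiB."""
    return mem_total_kib(meminfo_text) > FOUR_GIB_KIB
```

On a live system the text would come from open("/proc/meminfo").read().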

Steps to reproduce:
Comment 1 Adam Huffman 2006-02-10 10:20:40 UTC
Changed to severity high owing to the filesystem corruption and severe
performance degradation.
Comment 2 Andrew Morton 2006-02-10 11:27:17 UTC
Begin forwarded message:

Date: Fri, 10 Feb 2006 10:19:45 -0800
From: bugme-daemon@bugzilla.kernel.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 6052] New: Megaraid file corruption and performance degradation on x86_64 with 8G RAM


http://bugzilla.kernel.org/show_bug.cgi?id=6052

           Summary: Megaraid file corruption and performance degradation on
                    x86_64 with 8G RAM
    Kernel Version: 2.6.15-1.2005_FC4smp
            Status: NEW
          Severity: normal
             Owner: andmike@us.ibm.com
         Submitter: bloch@verdurin.com



Comment 3 Anonymous Emailer 2006-02-11 01:37:57 UTC
Reply-To: ak@muc.de

On Fri, Feb 10, 2006 at 11:26:40AM -0800, Andrew Morton wrote:
> 
> 
> Begin forwarded message:

Hmm, could be the merging issue too? Does megaraid do anything
funny with sg lists? Does iommu=nomerge help?

If yes apply

ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/gart-dma-merge

and report back.

-Andi

Comment 4 Adam Huffman 2006-02-11 03:04:28 UTC
Have tried rebooting the machine remotely with the iommu parameter but it hasn't
come back up...

I'll have physical access to it again on Monday.

The original report to LKML was here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113093777230634&w=2

and the original Fedora bugzilla report (regarding the ext3 errors) is at:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=172284
Comment 5 Adam Huffman 2006-02-13 08:52:32 UTC
Booting with iommu=nomerge doesn't seem to have helped.  After having wiped and
rebuilt the filesystem in that configuration, there are still errors:

EXT3-fs error (device sdb2): ext3_check_descriptors: Block bitmap for group 1
not in group (block 0)!
EXT3-fs: group descriptors corrupted !

These errors occur when trying to mount the newly-created filesystem.

mke2fs was also rather slow, as before.
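
When testing a boot parameter like this, it can help to confirm that it
actually reached the running kernel.  A minimal sketch, assuming the command
line is taken from /proc/cmdline and that iommu= may carry a comma-separated
list of options; the helper names are hypothetical:

```python
def iommu_options(cmdline):
    """Collect the comma-separated options from every iommu= parameter."""
    options = []
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token[len("iommu="):]
            options.extend(opt for opt in value.split(",") if opt)
    return options


def nomerge_active(cmdline):
    """True when 'nomerge' appears among the iommu= options."""
    return "nomerge" in iommu_options(cmdline)
```

On a live system the argument would be open("/proc/cmdline").read().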
Comment 6 Adam Huffman 2006-02-16 08:21:18 UTC
As suggested on linux-scsi, I altered the controller to use 'cachedio' instead
of 'directio'.  This made performance a lot faster but there are still
corruption problems.  When I untarred a kernel image, there was a journal error
and the filesystem was remounted read-only:

EXT3 FS on sdb2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
EXT3-fs error (device sdb2): ext3_new_block: Allocating block in system zone -
block = 44335105
Aborting journal on device sdb2.
EXT3-fs error (device sdb2) in ext3_reserve_inode_write: Journal has aborted
ext3_abort called.
EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data
Comment 7 Martin J. Bligh 2006-05-03 07:41:34 UTC
Please file Fedora bugs in their own bugzilla, or reproduce on mainline.
Comment 8 Adam Huffman 2006-07-25 14:58:29 UTC
As I indicated in my earlier comments, I had already reported it in the Fedora
bugzilla.  Of course, their policy is "report upstream", which, following from
your reaction, would leave me with no avenue at all.

In any case, it appears as though the problem has been addressed in version
2.20.4.9 of the megaraid scsi module.
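
For anyone checking other machines, driver versions can be compared
numerically rather than as strings.  A minimal sketch, assuming purely numeric
dotted version strings; the default of 2.20.4.9 is simply the version
mentioned above, where the problem only appears to be addressed, and the
function names are illustrative:

```python
def version_tuple(version):
    """Turn a dotted version such as '2.20.4.7' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def at_least(installed, wanted="2.20.4.9"):
    """True when the installed driver version is at or above 'wanted'."""
    return version_tuple(installed) >= version_tuple(wanted)
```

Comparing tuples of integers avoids the string-ordering trap where "2.20.4.10"
would sort before "2.20.4.9".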