Bug 207359 - MegaRAID SAS 9361 controller hang/reset
Summary: MegaRAID SAS 9361 controller hang/reset
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: PPC-64
Hardware: PPC-64 Linux
Importance: P1 normal
Assignee: platform_ppc-64
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-19 18:25 UTC by Cameron Berkenpas
Modified: 2020-08-06 17:56 UTC
CC List: 1 user

See Also:
Kernel Version: >=v5.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output for controller hang (5.03 KB, text/plain)
2020-04-19 18:25 UTC, Cameron Berkenpas
5.6.11 megaraid POWER hang (20.49 KB, text/plain)
2020-05-10 03:02 UTC, Cameron Berkenpas

Description Cameron Berkenpas 2020-04-19 18:25:56 UTC
Created attachment 288623
dmesg output for controller hang

On a Talos II 2x 36 core (144 thread) POWER9 box, a MegaRAID SAS 9361-16i PCIe controller hangs fairly consistently under heavy IO on kernel versions later than 5.3.18.
I am unable to reproduce this on a 16-core/32-thread amd64 box with a MegaRAID SAS 9361-16i PCIe controller running the exact same firmware revision.

The box also has a Microsemi SAS HBA which seems unaffected by this.

System details:
Talos II motherboard
2x 36 core (144 thread) POWER9 processors
512GB memory
4k page size
MegaRAID SAS 9361-16i PCIE controller (4 disk RAID10 volume, megaraid_sas driver)
Microsemi HBA w/ 4x SSDs

The relevant dmesg messages are attached.
Comment 1 gyakovlev 2020-04-19 20:24:39 UTC
In my case I see a similar problem on the same motherboard, but with the aacraid driver (the Microsemi one).

https://bugzilla.kernel.org/show_bug.cgi?id=206123
Comment 2 Cameron Berkenpas 2020-04-19 20:55:08 UTC
Looking at bug 206123 above, it's worth noting that the amd64 box I'm using for comparison has SATA disks, though this is probably still a PPC-specific issue.
Comment 3 Cameron Berkenpas 2020-05-10 03:02:06 UTC
Created attachment 289041
5.6.11 megaraid POWER hang

Still happens with 5.6.11. There seems to be a bit more output this time, and I've also included the output from shutting down in case it's useful.
Comment 4 Cameron Berkenpas 2020-08-06 17:56:24 UTC
I converted the box's filesystems from BTRFS to XFS and switched the page size from 4k to 64k. The problem appears to be entirely gone now. I am able to run 5.7.13 without issue; before the conversion, I had verified that 5.7.13 exhibits the megaraid_sas controller hang on my previous BTRFS + 4k page configuration.

Unfortunately, the conversion took a great deal of time, and I wasn't able to keep the box down even longer to test whether converting to XFS or switching to 64k pages would resolve the issue on its own. All I can say for certain is that switching to XFS, to a 64k page size, or both has fixed the problem for me.

The backup volume is a single SATA disk that is still using BTRFS (for snapshotting), and it is not giving me any trouble. But if this is related to https://bugzilla.kernel.org/show_bug.cgi?id=206123, that may not be conclusive, because SATA disks may not trigger the issue. The single disk also can't push as much IO as the RAID10 volume, which may be another reason.

My quasi-educated, non-kernel-dev guess is that this is probably a bug related to the 4k page size. It is possible that the regular behavior of BTRFS exacerbates it (making it easier to reproduce), but that is unknown.

Hopefully someone else encountering this issue will find this helpful.
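
For anyone comparing their own box against the two configurations described above (BTRFS + 4k pages versus XFS + 64k pages), here is a minimal sketch that reports the running kernel's page size and which mount points use BTRFS or XFS. It assumes a Linux system with /proc/mounts available; the function names are illustrative and do not come from this report or from any driver tooling.

```python
import os


def page_size_kib() -> int:
    """Return the running kernel's page size in KiB (4 for 4k pages, 64 for 64k pages)."""
    return os.sysconf("SC_PAGE_SIZE") // 1024


def filesystem_types(mounts_text: str) -> dict:
    """Map mount point -> filesystem type from /proc/mounts-style text.

    Each line has the form: <device> <mount point> <fs type> <options> ...
    If a mount point appears more than once, the last entry wins.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            result[fields[1]] = fields[2]
    return result


if __name__ == "__main__":
    print(f"page size: {page_size_kib()}k")
    # /proc/mounts is an assumption: it exists on Linux but not on every Unix.
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as mounts_file:
            mounts = filesystem_types(mounts_file.read())
        for mount_point, fs_type in sorted(mounts.items()):
            if fs_type in ("btrfs", "xfs"):
                print(f"{mount_point}: {fs_type}")
```

This only identifies the configuration; it does not reproduce the hang or show which of the two changes matters.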
