Bug 208605 - AACRAID frequent host bus reset with intensive IO on large arrays
Summary: AACRAID frequent host bus reset with intensive IO on large arrays
Status: RESOLVED INVALID
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: AACRAID
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: scsi_drivers-aacraid
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-19 06:36 UTC by Janpieter Sollie
Modified: 2020-08-12 09:10 UTC

See Also:
Kernel Version: 4.14 - 5.7.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
quick and dirty patch to fix the issue (30.81 KB, patch)
2020-07-19 06:36 UTC, Janpieter Sollie
modification to make Microsemi driver work with 5.7 kernel (27.80 KB, patch)
2020-07-20 10:22 UTC, Janpieter Sollie

Description Janpieter Sollie 2020-07-19 06:36:30 UTC
Created attachment 290345
quick and dirty patch to fix the issue

On a large array (>15 drives), it is impossible to back up the storage to a SAS tape without the driver detecting a lockup and triggering a bus reset.
This seems to be a false detection, as the host controller actually is not locking up anything.  It's just a bit delayed.
This issue seems to go back to 4.14.

I reverted some cleanup stuff introduced in 4.14, and the driver is working correctly.

I attached a patch for it, but only to show where the bug may be. It is not ready for production: it works here, but possibly for 7 series only. I also have no idea what exactly causes this issue.

Bug observed on a series 7 controller with a 12-drive RAID6 array.
Comment 1 Janpieter Sollie 2020-07-19 07:32:09 UTC
Sorry, this patch seems to be a false positive: the error still occurs. scsi_eh_handler still appears, only a little later.
Comment 2 Andrey Jr. Melnikov 2020-07-19 10:08:41 UTC
Check this: https://patchwork.kernel.org/patch/11038347/
Comment 3 Janpieter Sollie 2020-07-19 12:31:50 UTC
I saw that. The modifications are included in this patch (for 7 series instead of 6), but they do not seem to work.  There must be another issue.
I know that the controller works fine when issuing commands like create / erase / repair etc ... but during large IO, it fails.  So there must be some sync issue between the scsi subsystem (or the aacraid driver) and the adapter.
Comment 4 Janpieter Sollie 2020-07-20 10:22:25 UTC
Created attachment 290373
modification to make Microsemi driver work with 5.7 kernel

I know this is bad practice, but at least it produces some results:
I tried the proprietary Microsemi driver (58012).  Of course it does not work with recent kernels, but after modifying the code a bit, I made "something" that works.
Patch attached.  Any idea why this one works but the open source variant does not? Given the amount of abandoned code and junk left in the Microsemi driver after my modifications, I'd expect the opposite.
Comment 5 Janpieter Sollie 2020-07-29 07:43:18 UTC
I think I found a solution:
When I force sync mode, the driver handles everything perfectly.  Of course this has a performance impact, so if anyone could help me debug this driver in async mode, it would be very much appreciated ...
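
The comment does not say how sync mode was forced. One plausible route, stated here as an assumption, is the aacraid module parameter aac_sync_mode (described by the driver as forcing sync transfer mode, 0=off, 1=on). A minimal sketch for checking which value the loaded module is using; SYSFS_ROOT and the helper name aacraid_param are illustrative only and not part of any tool:

```shell
#!/bin/sh
# Root of sysfs; overridable so the helper can be tried against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the current value of an aacraid module parameter.
# Fails with a message if the parameter is not exposed by the loaded module.
aacraid_param() {
    param_file="$SYSFS_ROOT/module/aacraid/parameters/$1"
    if [ ! -r "$param_file" ]; then
        echo "parameter $1 not exposed at $param_file" >&2
        return 1
    fi
    cat "$param_file"
}
```

With the module loaded, `aacraid_param aac_sync_mode` would be expected to print 1 when sync mode is forced. Setting it at load time would presumably look like `modprobe aacraid aac_sync_mode=1`, again assuming that this is the parameter meant above.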
Comment 6 Janpieter Sollie 2020-08-12 09:10:32 UTC
The previous setting was not a solution: the functionality of the driver is largely reduced.  What fixed the issue was the combination of three settings: aacraid cache=3, arcconf setcache ld 1 coff, and echo "write through" > /sys/block/sdc/queue/write_cache.  This is most probably hardware related, not a Linux bug.
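
A hedged sketch of how these three settings could be applied, not a tested recipe. As far as the aacraid parameter description goes, cache is a bitmask in which bit 0 disables FUA in WRITE commands and bit 1 disables SYNCHRONIZE_CACHE, so cache=3 sets both. The function names, the SYSFS_ROOT variable, and the modprobe.d file name in the usage note are assumptions made for illustration. The arcconf command is copied from the comment as written and appears to be shorthand, so check arcconf's own help for the exact syntax.

```shell
#!/bin/sh
# Root of sysfs; overridable so the helper can be tried against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# 1. Emit a modprobe.d option line that sets the aacraid cache bitmask.
aacraid_cache_option() {
    printf 'options aacraid cache=%s\n' "$1"
}

# 2. Controller-side cache setting, quoted from the comment (not run here):
#      arcconf setcache ld 1 coff

# 3. Tell the block layer to treat the given disk as write-through.
set_write_through() {
    target="$SYSFS_ROOT/block/$1/queue/write_cache"
    if [ ! -w "$target" ]; then
        echo "cannot write $target" >&2
        return 1
    fi
    printf 'write through\n' > "$target"
}
```

Run as root, this could be used as `aacraid_cache_option 3 > /etc/modprobe.d/aacraid.conf` followed by a module reload, and `set_write_through sdc` for the disk named in the comment. The sysfs write does not persist across reboots.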
