Created attachment 290345: quick and dirty patch to fix the issue. On a large array (>15 drives), it is impossible to back up the storage to a SAS tape without the driver detecting a lockup and triggering a bus reset. This seems to be a false detection, because the host controller is not actually locked up; it is just responding with some delay. The issue seems to go back to 4.14: after reverting some of the cleanup changes introduced in 4.14, the driver works correctly. I attached a patch for it, but it is only meant to show where the bug may be and is not ready for production (it works, but it may only apply to the 7 series). I also have no idea what exactly causes this issue. Bug observed on a series 7 controller with a 12-drive RAID6 array.
Sorry, this patch turned out to be a false positive: the error still occurs. scsi_eh_handler still appears, just a little later.
Check this: https://patchwork.kernel.org/patch/11038347/
I saw that. Those modifications are already included in this patch (for the 7 series instead of the 6 series), but they do not seem to help, so there must be another issue. The controller works fine when issuing commands like create, erase, repair, etc., but it fails during large I/O. So there must be some synchronization issue between the SCSI subsystem (or the aacraid driver) and the adapter.
Created attachment 290373: modification to make the Microsemi driver work with the 5.7 kernel. I know this is bad practice, but at least it produces some results: I tried the proprietary Microsemi driver (58012). Of course it does not work with recent kernels as-is, but after modifying the code a bit, I got "something" that works. The patch is attached. Any idea why this one works while the open-source variant does not? Given the amount of abandoned/junk code I saw in the Microsemi driver while modifying it, I would have expected the opposite.
I think I found a solution: when I force sync mode, the driver handles everything perfectly. Of course this has a performance impact, so if anyone could help me debug this driver in async mode, it would be very much appreciated.
The previous setting was not a solution: it largely reduces the driver's functionality. What fixed the issue was the combination of aacraid cache=3, arcconf setcache ld 1 coff, and echo "write through" > /sys/block/sdc/queue/write_cache. This is most probably hardware related, not a Linux bug.
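For anyone reproducing the workaround, the sysfs part of it can be applied and checked from a script. This is a minimal sketch in Python, assuming the array shows up as `sdc` as in the report. The helper names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a scratch directory. It does not touch the aacraid `cache` parameter or the arcconf controller cache setting. Per the kernel's queue sysfs documentation, writing `write_cache` changes the kernel's view of the device cache rather than the device state itself, and, like any sysfs setting, it does not persist across reboots.

```python
from pathlib import Path

# The two modes the kernel reports in /sys/block/<dev>/queue/write_cache.
VALID_MODES = ("write back", "write through")


def write_cache_path(device: str, sysfs_root: str = "/sys") -> Path:
    """Build the path to the queue's write_cache attribute, e.g. /sys/block/sdc/queue/write_cache."""
    return Path(sysfs_root) / "block" / device / "queue" / "write_cache"


def get_write_cache(device: str, sysfs_root: str = "/sys") -> str:
    """Return the current mode as reported by the kernel, without the trailing newline."""
    return write_cache_path(device, sysfs_root).read_text().strip()


def set_write_cache(device: str, mode: str, sysfs_root: str = "/sys") -> str:
    """Write a new mode (equivalent to `echo "<mode>" > .../write_cache`) and return the re-read value."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    path = write_cache_path(device, sysfs_root)
    # echo appends a newline; the kernel accepts the value with or without it.
    path.write_text(mode + "\n")
    return get_write_cache(device, sysfs_root)
```

On a real system this needs root, e.g. `set_write_cache("sdc", "write through")`, and would have to be re-run at boot (for example from a udev rule or init script) to keep the setting.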