Bug 219467 - Adaptec 71605 hangs with aacraid: Host adapter abort request after update to linux 6.11.5
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: AACRAID
Hardware: All Linux
Importance: P3 normal
Assignee: scsi_drivers-aacraid
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-04 22:17 UTC by Nathan Grennan
Modified: 2024-11-04 22:47 UTC
CC List: 0 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Nathan Grennan 2024-11-04 22:17:04 UTC
On October 31st I upgraded a system from Fedora 40 to Fedora 41. This upgraded the kernel from 6.10.6-200.fc40.x86_64 to 6.11.5-300.fc41.x86_64. One of the system's primary uses is as a NAS using an Adaptec 71605 and zfs-2.2.6. The system does zfs scrubs on the two zfs filesystems on Mondays, like Oct 28th and Nov 4th. On Oct 28th it was still on the 6.10.6 kernel, and today it was on the 6.11.5 kernel.

The errors repeated until I woke up and found that the scrubs had stopped because of zfs errors caused by the controller errors. After a bit I rebooted the system, and then had to stop the scrubs again because they had automatically restarted. I then installed 6.10.14-200.fc40.x86_64, and restarted the scrubs.

The scrub processes started at nearly 4am. You can see from the timing of the logs below that the errors didn't start until over two hours into the scrub. The house thermostat is set to 73F/76F, and the outside temperature at 6am was 45F. So the room shouldn't have been unusually hot.

I saw zfs read and write errors on all the drives on the 71605.

I restarted the scrubs after downgrading to 6.10.14. It has been about three hours since then, which means it has already lasted longer than it did on 6.11.5. I will update with a new comment when it either throws an error or completes.

I built the system in May of 2021, and it hasn't given me any issues like this before. It started with a 5.11.12-300.fc34 kernel.

I did look for a newer version of the disk controller's BIOS, but found it is already the latest, 32118.

System hardware:
AMD Ryzen 9 5950X, processor
Kingston 128gb(4x32gb) DDR4 ECC, memory
ASUS Pro WS X570-ACE, motherboard
Adaptec 71605, disk controller
6 WD 18tb SATA, drives(one on the 71605, rest on other controllers)
9 WD 8tb SATA, drives(all on the 71605)

BIOS/Firmware versions:
BIOS                                       : 7.5-0 (32118)
Firmware                                   : 7.5-0 (32118)

An older, but very similar bug:
https://bugzilla.kernel.org/show_bug.cgi?id=217599

Timing of scrubs and errors:
Nov 04 03:46:01 storage zed[2545101]: eid=11 class=scrub_start pool='data18'
Nov 04 03:46:11 storage zed[2545231]: eid=13 class=scrub_start pool='data8'
Nov 04 06:08:38 storage kernel: aacraid: Host adapter abort request.
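For reference, the gap between the first scrub start and the first abort request can be worked out directly from the timestamps above. This is only a small sketch: the year is an assumption taken from the report date (the journal stamps omit it), and the helper name parse_ts is made up for illustration.

```python
from datetime import datetime

# Assumption: the journal stamps carry no year, so use the year of the report.
YEAR = 2024

def parse_ts(line: str) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a journal line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")

# Lines copied from the timing excerpt above.
scrub_start = parse_ts(
    "Nov 04 03:46:01 storage zed[2545101]: eid=11 class=scrub_start pool='data18'"
)
first_abort = parse_ts(
    "Nov 04 06:08:38 storage kernel: aacraid: Host adapter abort request."
)

elapsed = first_abort - scrub_start
print(elapsed)  # 2:22:37
```

So the controller ran for roughly 2 hours 22 minutes under scrub load on 6.11.5 before the first abort request.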

Errors:
Nov 04 06:08:38 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host adapter abort request.
                                aacraid: Outstanding commands on (2,1,12,0):
Nov 04 06:09:08 storage kernel: aacraid: Host bus reset request. SCSI hang ?
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: midlevel-0
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: lowlevel-0
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: error handler-8
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: firmware-0
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: kernel-0
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: Controller reset type is 3
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:10:19 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
Nov 04 06:10:19 storage kernel: aacraid 0000:0a:00.0: ARC Reset attempt failed
Nov 04 06:11:19 storage kernel: aacraid: Host bus reset request. SCSI hang ?
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: Adapter health - -3
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: midlevel-0
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: lowlevel-0
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: error handler-0
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: firmware-124
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: kernel-0
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: Controller reset type is 3
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:11:19 storage kernel:  rfkill wmi_bmof snd_timer drm_ttm_helper pcspkr ttm k10temp i2c_piix4 snd i2c_smbus video soundcore igc nfsd auth_rpcgss nfs_acl lockd grace sunrpc loop nfnetlink crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic raid1 ghash_clmulni_intel mxm_wmi nvme sha512_ssse3 aacraid sha256_ssse3 sha1_ssse3 nvme_core sp5100_tco nvme_auth wmi ip6_tables ip_tables fuse
Nov 04 06:11:19 storage kernel:  src_sync_cmd+0x108/0x2e0 [aacraid]
Nov 04 06:11:19 storage kernel:  aac_src_restart_adapter.part.0+0x112/0x2b6 [aacraid]
Nov 04 06:11:19 storage kernel:  aac_reset_adapter+0xeb/0x650 [aacraid]
Nov 04 06:11:19 storage kernel:  aac_eh_host_reset+0x62/0xe0 [aacraid]
Nov 04 06:12:34 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
Nov 04 06:12:34 storage kernel: aacraid 0000:0a:00.0: ARC Reset attempt failed
Nov 04 06:12:34 storage kernel:  mxm_wmi nvme sha512_ssse3 aacraid
Nov 04 06:13:04 storage kernel: aacraid: Host bus reset request. SCSI hang ?
Nov 04 06:13:04 storage kernel: aacraid 0000:0a:00.0: Adapter health - -3
Nov 04 06:13:04 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: midlevel-0
Nov 04 06:13:04 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: lowlevel-0
Nov 04 06:13:04 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: error handler-0
Nov 04 06:13:05 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: firmware-1
Nov 04 06:13:05 storage kernel: aacraid 0000:0a:00.0: outstanding cmd: kernel-0
Nov 04 06:13:05 storage kernel: aacraid 0000:0a:00.0: Controller reset type is 3
Nov 04 06:13:05 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:13:05 storage kernel:  rfkill wmi_bmof snd_timer drm_ttm_helper pcspkr ttm k10temp i2c_piix4 snd i2c_smbus video soundcore igc nfsd auth_rpcgss nfs_acl lockd grace sunrpc loop nfnetlink crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic raid1 ghash_clmulni_intel mxm_wmi nvme sha512_ssse3 aacraid sha256_ssse3 sha1_ssse3 nvme_core sp5100_tco nvme_auth wmi ip6_tables ip_tables fuse
Nov 04 06:13:05 storage kernel:  src_sync_cmd+0x108/0x2e0 [aacraid]
Nov 04 06:13:05 storage kernel:  aac_src_restart_adapter.part.0+0x112/0x2b6 [aacraid]
Nov 04 06:13:05 storage kernel:  aac_reset_adapter+0xeb/0x650 [aacraid]
Nov 04 06:13:05 storage kernel:  aac_eh_host_reset+0x62/0xe0 [aacraid]
Nov 04 06:14:20 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
Nov 04 06:14:20 storage kernel: aacraid 0000:0a:00.0: ARC Reset attempt failed
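As a rough way to see how long each reset attempt ran before giving up, the "Issuing IOP reset" and "IOP reset failed" lines above can be paired up. This is a sketch only: the log lines are copied from the excerpt, the year is assumed from the report date, and the function names are hypothetical.

```python
from datetime import datetime

# "Issuing IOP reset" / "IOP reset failed" lines copied from the log above.
LOG = """\
Nov 04 06:09:08 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:10:19 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
Nov 04 06:11:19 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:12:34 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
Nov 04 06:13:05 storage kernel: aacraid 0000:0a:00.0: Issuing IOP reset
Nov 04 06:14:20 storage kernel: aacraid 0000:0a:00.0: IOP reset failed
"""

def stamp(line: str) -> datetime:
    # The first 15 characters are 'Mon DD HH:MM:SS'; the year is an assumption.
    return datetime.strptime("2024 " + line[:15], "%Y %b %d %H:%M:%S")

def reset_durations(log: str) -> list[int]:
    """Seconds between each 'Issuing IOP reset' and the following failure."""
    durations = []
    issued = None
    for line in log.splitlines():
        if line.endswith("Issuing IOP reset"):
            issued = stamp(line)
        elif line.endswith("IOP reset failed") and issued is not None:
            durations.append(int((stamp(line) - issued).total_seconds()))
            issued = None
    return durations

durations = reset_durations(LOG)
print(durations)  # [71, 75, 75]
```

Each of the three attempts took a little over a minute to fail, and none of them recovered the adapter.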
Comment 1 Nathan Grennan 2024-11-04 22:32:00 UTC
boot drives:
2x Samsung SSD 980 PRO 500GB drives in mdadm raid1

lspci, short, disk controllers:
07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
08:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
0a:00.0 RAID bus controller: Adaptec Series 7 6G SAS/PCIe 3 (rev 01)

lspci, long, everything:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse/Vermeer Data Fabric: Device 18h; Function 7
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
06:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
08:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
09:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
0a:00.0 RAID bus controller: Adaptec Series 7 6G SAS/PCIe 3 (rev 01)
0b:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
0c:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
0c:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
0c:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
Comment 2 Nathan Grennan 2024-11-04 22:47:28 UTC
Normal boot kernel messages about the disk controller for both versions of the kernel:
Nov 04 11:31:53 storage kernel: Linux version 6.11.5-300.fc41.x86_64 (mockbuild@a0564de4e00d4277aa3a51770ad85255) (gcc (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3), GNU ld version 2.43.1-2.fc41) #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:11:15 UTC 2024
Nov 04 11:31:53 storage kernel: Adaptec aacraid driver 1.2.1[50983]-custom
Nov 04 11:31:53 storage kernel: aacraid: Comm Interface type2 enabled
Nov 04 11:31:53 storage kernel: scsi host2: aacraid

Nov 04 12:09:05 storage kernel: Linux version 6.10.14-200.fc40.x86_64 (mockbuild@2cac3d8aa36b4f0888a34a961cba75ab) (gcc (GCC) 14.2.1 20240912 (Red Hat 14.2.1-3), GNU ld version 2.41-37.fc40) #1 SMP PREEMPT_DYNAMIC Thu Oct 10 18:49:57 UTC 2024
Nov 04 12:09:06 storage kernel: Adaptec aacraid driver 1.2.1[50983]-custom
Nov 04 12:09:06 storage kernel: aacraid: Comm Interface type2 enabled
Nov 04 12:09:06 storage kernel: scsi host2: aacraid
