Latest working kernel version: unknown
Earliest failing kernel version: probably forever
Distribution: Ubuntu hardy
Hardware Environment: Dell PowerEdge 2650

Problem Description:
Under load, this happens rather often:

Jul 18 22:55:24 nun kernel: [86674.467410] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:55:24 nun kernel: [86674.467487] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:55:24 nun kernel: [86674.467617] aacraid: Host adapter reset request. SCSI hang ?
Jul 18 22:57:26 nun kernel: [86815.728423] aacraid: Host adapter abort request (0,0,0,0)
Jul 18 22:57:26 nun kernel: [86815.728500] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:57:26 nun kernel: [86815.728573] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:57:26 nun kernel: [86815.728640] aacraid: Host adapter abort request (0,0,1,0)
Jul 18 22:57:26 nun kernel: [86815.728772] aacraid: Host adapter reset request. SCSI hang ?

Access to the storage thus stalls for ten seconds or so.

I have successfully worked around the problem by using "schedtool -a 1 pid-of-basically-everything" to pin everything to one CPU, so it seems to be an SMP-related problem. However, one CPU is _somewhat_ slower than four, which is quite noticeable, so we'd like to get this handled somehow :-/

lspci:

05:06.0 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
        Subsystem: Dell PowerEdge 2400,2500,2550,4400
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
        BIST result: 00
        I/O ports at cc00 [size=256]
        Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at fcd00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

05:06.1 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
        Subsystem: Dell PowerEdge 2400,2500,2550,4400
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
        BIST result: 00
        I/O ports at c800 [size=256]
        Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at f8100000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

lspci -n:

05:06.0 0100: 9005:00c5 (rev 01)
        Subsystem: 1028:00c5
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
        BIST result: 00
        I/O ports at cc00 [size=256]
        Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at fcd00000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2

05:06.1 0100: 9005:00c5 (rev 01)
        Subsystem: 1028:00c5
        Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
        BIST result: 00
        I/O ports at c800 [size=256]
        Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
        Expansion ROM at f8100000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
Update: my uniprocessor band-aid, besides significantly decreasing performance, resulted in an eventual soft-hang of all CPUs some hours later, so this workaround obviously doesn't work.
Increase your SCSI bus timeouts and/or decrease the device queue depth. The driver is doing what it can when the adapter's firmware gets overloaded and stops responding. One of the changes post 2.6.18 was to raise the maximum scatter-gather block (SGB) count from 128 to 256, which was considered safe at the time; in combination with other changes and improvements in the block and SCSI subsystems, this may have allowed this series of adapters to run out of internal resources. The value is set by this line in .../drivers/scsi/aacraid/aacraid.h:

#define AAC_MAX_32BIT_SGBCOUNT  ((unsigned short)256)
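A minimal sketch of the runtime tuning suggested above, assuming the affected volume shows up as sda and that the driver accepts a queue-depth change at runtime; the device name and the values (60-second timeout, queue depth 32) are assumptions for illustration only:

/* Sketch: raise the per-command SCSI timeout and lower the queue depth
 * for one device via sysfs.  "sda", 60 and 32 are example values. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* give the firmware more time before the error handler kicks in */
        write_sysfs("/sys/block/sda/device/timeout", "60");
        /* keep fewer commands in flight so the adapter is not overloaded */
        write_sysfs("/sys/block/sda/device/queue_depth", "32");
        return 0;
}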
Thank you. I will test this workaround today.
Works.

---

Reduce AACRAID hardware queue size (kernel bug #11120)

Signed-off-by: Matthias Urlichs <matthias@urlichs.de>

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index 73916ad..b1b10b3 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -24,7 +24,7 @@
 #define AAC_MAX_LUN (8)

 #define AAC_MAX_HOSTPHYSMEMPAGES (0xfffff)
-#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)256)
+#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)127)

 /*
  * These macros convert from physical channels to virtual channels
Also confirmed on CentOS 5.3 x64 with an Adaptec 2810SA.
Taking ownership of this bug to kick it upstream.
James, the SCSI maintainer, says:

  The maximum transfer length critically impacts I/O throughput and performance ... I can't just penalise everyone for the sake of two bug reports. This value can already be altered on the fly using /sys/block/<dev>/queue/max_sectors_kb.

So closing as WONTFIX (pending other evidence, obviously).
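For completeness, a small sketch of the on-the-fly tunable mentioned above, again assuming the device is sda and using 128 KiB purely as an illustrative value:

/* Sketch: cap the maximum request size via the block-layer sysfs knob
 * mentioned above; "sda" and 128 KiB are assumptions. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/block/sda/queue/max_sectors_kb", "w");

        if (!f) {
                perror("max_sectors_kb");
                return 1;
        }
        fprintf(f, "128\n");   /* limit requests to 128 KiB */
        return fclose(f);
}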