Bug 11120

Summary: [PATCH, TRIVIAL] aacraid driver stalls on high-load SMP machines
Product: SCSI Drivers
Reporter: Matthias Urlichs (smurf)
Component: AACRAID
Assignee: Alan (alan)
Status: CLOSED WILL_NOT_FIX
Severity: normal
CC: alan, stefan.nader
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.24
Subsystem:
Regression: No
Bisected commit-id:

Description Matthias Urlichs 2008-07-18 14:36:00 UTC
Latest working kernel version: unknown
Earliest failing kernel version: probably forever
Distribution: Ubuntu hardy
Hardware Environment: Dell PowerEdge 2650

Problem Description:

Under load, this happens rather often:

Jul 18 22:55:24 nun kernel: [86674.467410] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:55:24 nun kernel: [86674.467487] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:55:24 nun kernel: [86674.467617] aacraid: Host adapter reset request. SCSI hang ?
Jul 18 22:57:26 nun kernel: [86815.728423] aacraid: Host adapter abort request (0,0,0,0)
Jul 18 22:57:26 nun kernel: [86815.728500] aacraid: Host adapter abort request (0,0,3,0)
Jul 18 22:57:26 nun kernel: [86815.728573] aacraid: Host adapter abort request (0,0,2,0)
Jul 18 22:57:26 nun kernel: [86815.728640] aacraid: Host adapter abort request (0,0,1,0)
Jul 18 22:57:26 nun kernel: [86815.728772] aacraid: Host adapter reset request. SCSI hang ?

Access to the storage thus stalls for ten seconds or so.

I have successfully worked around the problem by using "schedtool -a 1 pid-of-basically-everything", so it seems to be an SMP-related problem.

However, one CPU is _somewhat_ slower than four, which is quite noticeable, so we'd like to get this handled somehow :-/


lspci:

05:06.0 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
	Subsystem: Dell PowerEdge 2400,2500,2550,4400
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
	BIST result: 00
	I/O ports at cc00 [size=256]
	Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
	Expansion ROM at fcd00000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2

05:06.1 SCSI storage controller: Adaptec RAID subsystem HBA (rev 01)
	Subsystem: Dell PowerEdge 2400,2500,2550,4400
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
	BIST result: 00
	I/O ports at c800 [size=256]
	Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
	Expansion ROM at f8100000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2


lspci -n:
05:06.0 0100: 9005:00c5 (rev 01)
	Subsystem: 1028:00c5
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 7
	BIST result: 00
	I/O ports at cc00 [size=256]
	Memory at fccff000 (64-bit, non-prefetchable) [size=4K]
	Expansion ROM at fcd00000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2

05:06.1 0100: 9005:00c5 (rev 01)
	Subsystem: 1028:00c5
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 11
	BIST result: 00
	I/O ports at c800 [size=256]
	Memory at fccfe000 (64-bit, non-prefetchable) [size=4K]
	Expansion ROM at f8100000 [disabled] [size=128K]
	Capabilities: [dc] Power Management version 2
Comment 1 Matthias Urlichs 2008-07-18 18:13:10 UTC
Update: my uniprocessor band-aid, besides significantly decreasing performance, resulted in an eventual soft hang of all CPUs some hours later, so this workaround obviously doesn't work.
Comment 2 Mark Salyzyn 2008-07-20 05:30:05 UTC
Increase your SCSI bus timeouts and/or decrease the device queue depth. The driver is doing what it can when the adapter's firmware gets overloaded and reticent. One of the changes after 2.6.18 was to increase the maximum SGB length from 128 to 256, which was considered safe at the time; in combination with other changes and improvements in the block and SCSI subsystems, this may have allowed this series of adapters to run out of internal resources.

The line in .../drivers/scsi/aacraid/aacraid.h:

#define AAC_MAX_32BIT_SGBCOUNT  ((unsigned short)256)

affects this value.
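
For readers who want to try the timeout and queue-depth suggestion without rebuilding the kernel, here is a minimal sketch. It assumes the standard per-device SCSI sysfs attributes `timeout` and `queue_depth` under /sys/block/<dev>/device/; the device name "sda", the values 60 and 16, and the helper names are illustrative assumptions, not values taken from this report. Writing these files needs root, and whether `queue_depth` is writable depends on the driver.

```python
from pathlib import Path


def set_scsi_device_attr(dev, attr, value, sysfs_root="/sys"):
    """Write `value` to <sysfs_root>/block/<dev>/device/<attr>; return the path."""
    path = Path(sysfs_root) / "block" / dev / "device" / attr
    path.write_text(f"{value}\n")
    return path


def relax_device(dev, timeout_s=60, queue_depth=16, sysfs_root="/sys"):
    """Raise the command timeout and lower the queue depth for one device.

    The defaults are placeholders (assumptions); tune them per adapter.
    """
    set_scsi_device_attr(dev, "timeout", timeout_s, sysfs_root)
    set_scsi_device_attr(dev, "queue_depth", queue_depth, sysfs_root)


# Hypothetical usage, as root, for a disk that shows up as sda:
# relax_device("sda", timeout_s=60, queue_depth=16)
```

The `sysfs_root` parameter exists only so the helpers can be exercised against a scratch directory; the settings do not persist across reboots.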
Comment 3 Matthias Urlichs 2008-08-28 22:41:39 UTC
Thank you. I will test this workaround today.
Comment 4 Matthias Urlichs 2008-08-29 20:36:55 UTC
Works.

---

Reduce AACRAID hardware queue size (kernel bug#11120)
    
Signed-off-by: Matthias Urlichs <matthias@urlichs.de>

diff --git a/drivers/scsi/aacraid/aacraid.h b/drivers/scsi/aacraid/aacraid.h
index 73916ad..b1b10b3 100644
--- a/drivers/scsi/aacraid/aacraid.h
+++ b/drivers/scsi/aacraid/aacraid.h
@@ -24,7 +24,7 @@
 #define AAC_MAX_LUN            (8)
 
 #define AAC_MAX_HOSTPHYSMEMPAGES (0xfffff)
-#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)256)
+#define AAC_MAX_32BIT_SGBCOUNT ((unsigned short)127)
 
 /*
  * These macros convert from physical channels to virtual channels
Comment 5 stefan.nader 2009-04-30 18:39:14 UTC
Also confirmed on:

CentOS 5.3 x64 - Adaptec 2810SA
Comment 6 Alan 2010-01-19 19:52:24 UTC
Owning to kick upstream
Comment 7 Alan 2010-01-27 23:18:31 UTC
James, the SCSI maintainer, says:

The maximum transfer length critically impacts I/O throughput and
performance ... I can't just penalise everyone for the sake of two bug
reports.

This value can already be altered on the fly using the sysfs attribute

/sys/block/<dev>/queue/max_sectors_kb

So closing as won't fix (pending other evidence, obviously)
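
A minimal sketch of the runtime adjustment mentioned above, for anyone hitting the same stalls. It assumes the block-layer attributes `max_sectors_kb` and `max_hw_sectors_kb` under /sys/block/<dev>/queue/; the device name "sda", the value 64, and the helper name are illustrative assumptions, not recommendations from this report. Root is required.

```python
from pathlib import Path


def set_max_sectors_kb(dev, kb, sysfs_root="/sys"):
    """Lower the per-request transfer size limit for `dev`.

    Returns the previous max_sectors_kb value so it can be restored later.
    """
    queue = Path(sysfs_root) / "block" / dev / "queue"
    hw_limit = int((queue / "max_hw_sectors_kb").read_text())
    if not 0 < kb <= hw_limit:
        raise ValueError(f"{kb} KiB is outside 1..{hw_limit} KiB")
    current = queue / "max_sectors_kb"
    previous = int(current.read_text())
    current.write_text(f"{kb}\n")
    return previous


# Hypothetical usage, as root:
# old = set_max_sectors_kb("sda", 64)
```

The setting applies per block device and is lost on reboot, so it would need to be reapplied from a boot script or udev rule.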