Bug 11117

Summary: aic94xx doesn't sustain the load when more than 2 SAS drives are connected and actively used
Product: SCSI Drivers Reporter: Michael Gleibman (michael.gleibman)
Component: AIC94XXAssignee: scsi_drivers-aic94xx
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.23.1, 2.6.25 Subsystem:
Regression: No Bisected commit-id:

Description Michael Gleibman 2008-07-18 08:37:42 UTC
Latest working kernel version: NA
Earliest failing kernel version: 2.6.23.1
Distribution: Gentoo, customized; vanilla kernels
Hardware Environment: SuperMicro X7DB3 motherboard with integrated AIC 9410 controllers, 4 SAS drives connected 
Software Environment: aic94xx SAS/SATA driver version 1.0.3
Problem Description:

When more than 2 SAS drives are connected and system becomes loaded, aic94xx logs a lot of messages similar to the following:

aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
sas: command 0xffff8101d39733c0, task 0xffff8105e9e51240, timed out: EH_NOT_HANDLED
sas: command 0xffff8104db3d1e40, task 0xffff8105ed10a6c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81060f001380, task 0xffff81046fc0e3c0, timed out: EH_NOT_HANDLED
sas: command 0xffff8102710e6200, task 0xffff810587fb2980, timed out: EH_NOT_HANDLED
sas: command 0xffff8101df43f0c0, task 0xffff810235fcf9c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810182974380, task 0xffff810020939b40, timed out: EH_NOT_HANDLED
sas: command 0xffff810591dd93c0, task 0xffff8105aab7ee00, timed out: EH_NOT_HANDLED
sas: command 0xffff8104fa24b540, task 0xffff8105e9e51840, timed out: EH_NOT_HANDLED
sas: command 0xffff8103a400d540, task 0xffff8102c7927e40, timed out: EH_NOT_HANDLED
sas: command 0xffff8105f8d29cc0, task 0xffff8102c3582980, timed out: EH_NOT_HANDLED
sas: command 0xffff8105c684c980, task 0xffff8105aebd5240, timed out: EH_NOT_HANDLED
sas: command 0xffff8104fa24b240, task 0xffff8103140edcc0, timed out: EH_NOT_HANDLED
sas: command 0xffff81058572cb00, task 0xffff8105aab7e980, timed out: EH_NOT_HANDLED
sas: command 0xffff8105f08d6500, task 0xffff8100266169c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810182974500, task 0xffff8103140ed3c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810302b9a080, task 0xffff8101ca94e800, timed out: EH_NOT_HANDLED
sas: command 0xffff81058572c080, task 0xffff8105aab7eb00, timed out: EH_NOT_HANDLED
sas: command 0xffff8105bc7df980, task 0xffff810026616540, timed out: EH_NOT_HANDLED
sas: Enter sas_scsi_recover_host
sas: trying to find task 0xffff8105e9e51240
sas: sas_scsi_find_task: aborting task 0xffff8105e9e51240
aic94xx: tmf tasklet complete
aic94xx: tmf resp tasklet
aic94xx: tmf came back
aic94xx: task not done, clearing nexus
aic94xx: asd_clear_nexus_tag: PRE
aic94xx: asd_clear_nexus_tag: POST
aic94xx: asd_clear_nexus_tag: clear nexus posted, waiting...
aic94xx: task 0xffff8105e9e51240 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: asd_clear_nexus_tasklet_complete: here
aic94xx: asd_clear_nexus_tasklet_complete: opcode: 0x0
aic94xx: task 0xffff8105ed10a6c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff81046fc0e3c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810587fb2980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810235fcf9c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810020939b40 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7ee00 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105e9e51840 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8102c7927e40 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8102c3582980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aebd5240 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8103140edcc0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7e980 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8100266169c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8103140ed3c0 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8101ca94e800 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff8105aab7eb00 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: task 0xffff810026616540 done with opcode 0x0 resp 0x0 stat 0x0 but aborted by upper layer!
aic94xx: came back from clear nexus
aic94xx: task 0xffff8105e9e51240 aborted, res: 0x0
sas: sas_scsi_find_task: task 0xffff8105e9e51240 is done
sas: sas_eh_handle_sas_errors: task 0xffff8105e9e51240 is done
sas: trying to find task 0xffff8105ed10a6c0
sas: sas_scsi_find_task: aborting task 0xffff8105ed10a6c0
aic94xx: asd_abort_task: task 0xffff8105ed10a6c0 done
aic94xx: task 0xffff8105ed10a6c0 aborted, res: 0x0
sas: sas_scsi_find_task: task 0xffff8105ed10a6c0 is done
sas: sas_eh_handle_sas_errors: task 0xffff8105ed10a6c0 is done
...
...
EXT2-fs error (device sda9): read_block_bitmap: Cannot read block bitmap - block_group = 0, block_bitmap = 19
sd 2:0:0:0: rejecting I/O to offline device

It seems to closely resemble the problem earlier reported by Patrick LeBoutillier: http://lkml.org/lkml/2008/6/25/305

The system would eventually freeze during stress tests.
Here's what I've tried:

- I've tried to spread the drives over 2 ports (2 drives on each port) - same thing
- Leaving only 2 drives connected, unplugging the other 2 - the system passes the stress tests well
- Adaptec sequencer firmware versions 1.1 (V32A4, V30) same story
- latest Adaptec bios - same thing
- plugged in an external ASC48300 controller with the same AIC-9410 chip, connected the drives to it - the same problem as before
- plugged in an external LSI MegaRaid 8208ELP controller to the same system and connected the same 4 drives to it - system passes all the load tests just fine
- reproduced the same behavior on another similar system, same motherboard, different drives
- the current workaround, as suggested by Patrick, setting the PHY rate to 1.5 Gbps in Adaptec controller BIOS seems to cure the problem for both onboard and PCI controllers at cost of some performance degradation; system is able to pass the stress tests without errors, but the drives are a bit (~10%) slower. 

Steps to reproduce:

Plug 4 SAS drives to a system with aic9410 controller and load kernel 2.6.23.1 or 2.6.25 with aic94xx driver configured; apply some kind of stress test to all the drives simultaneously (I used a homebrew intensive disk load test utility, but earlier we used dd if=/dev/sd* of=/dev/null bs=16384 in parallel to create enough load to reproduce the errors)
Comment 1 Anonymous Emailer 2008-07-28 07:45:06 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-07-18 at 08:37 -0700, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11117
> 
>            Summary: aic94xx doesn't sustain the load when more than 2 SAS
>                     drives are connected and actively used
[...]
> aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> sas: command 0xffff8101d39733c0, task 0xffff8105e9e51240, timed out:
> EH_NOT_HANDLED
> sas: command 0xffff8104db3d1e40, task 0xffff8105ed10a6c0, timed out:

This is more or less a known problem with aic94xx.  It's root cause is
that there are certain bus conditions the firmware requires help with.
REQ_TASK_ABORT is one of them (reason 0x6 means there was a protocol
error on the bus).  What the card would like is for us to abort and
retransmit that command immediately (running abort).  What we actually
do is to mark the command for abort by the error handler, halt all
in-progress commands and wake up the eh thread.  This causes a nasty
hiccough in the data flow and runs into a potential snowball effect in
that if we get another REQ_TASK_ABORT on the retry of all the halted
commands (and there are quite a number of them), we have to do
everything over again (do this too often and the command will time out).

The fix is to alter the aic94xx code to do a running abort (as in do it
itself on the single command instead of halting everything and waking
the error handler).  Unfortunately no-one's found the time to sit down
and code this up yet.

James