Bug 5378

Summary: aic7xxx deadlock/freeze on Adaptec AIC-7899P
Product: SCSI Drivers
Component: Other
Status: REJECTED UNREPRODUCIBLE
Severity: high
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13.3
Subsystem:
Regression: ---
Bisected commit-id:
Reporter: Tomasz Lemiech (szpajder)
Assignee: Hannes Reinecke (hare)
CC: akpm, bunk, protasnb
Attachments:
  .config for 2.6.12.4
  .config for 2.6.13.3
  Startup dmesg of 2.6.12.4
  Startup dmesg of 2.6.13.3
  aic7xxx error messages from dmesg
  lspci -vvv

Description Tomasz Lemiech 2005-10-06 09:25:25 UTC
The following messages appeared in dmesg:

scsi0:0:1:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x0 0xab 0x8b 0x99 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x9
Card was paused
[...] (entire dump attached)
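
The CDB line in the excerpt above can be decoded by hand, since opcode 0x28 is the standard 10-byte READ command. The following is a minimal sketch, not part of the driver: the function name `decode_cdb10` and the opcode table are illustrative, and it only assumes the standard READ(10)/WRITE(10) byte layout (opcode, flags, 4-byte big-endian LBA, group number, 2-byte transfer length, control).

```python
import struct

# Names for the two 10-byte block commands of interest; other opcodes
# are reported as a hex string.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(line):
    """Decode a 'CDB: 0x.. 0x.. ...' log line holding a 10-byte CDB.

    Returns (command name, starting LBA, transfer length in blocks).
    """
    raw = bytes(int(tok, 16) for tok in line.split(":", 1)[1].split())
    if len(raw) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Big-endian: opcode, flags, LBA (4 bytes), group, length (2 bytes), control
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", raw)
    return OPCODES.get(opcode, hex(opcode)), lba, blocks


print(decode_cdb10("CDB: 0x28 0x0 0x0 0xab 0x8b 0x99 0x0 0x0 0x8 0x0"))
```

For the line above this gives a READ(10) starting at LBA 11242393 for 8 blocks, i.e. the command that timed out was an ordinary small read rather than anything unusual.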

At the time of these errors, the load average exceeded 30. After a SCSI RESET was
issued, the system went back to normal. The problem reappeared several hours later:
the load average reached 140 and all tasks hung waiting for I/O. I waited
for a SCSI RESET, which did not occur this time; after about 3 minutes I had to
reboot with sysrq.

The problem occurred about a day after upgrading from 2.6.12.4 (which had been running
fine for over 50 days) to 2.6.13.3. Hardware: Intel SDS2 mainboard with onboard Adaptec
AIC-7899P SCSI, 4 x Seagate ST336753LW, software RAID-5. Configs, lspci output,
etc. are attached. The main difference between the startup dmesgs is that under
2.6.13.3 all hard drives were set up as asynchronous during bootup; this didn't occur
under 2.6.12.4. I have gone back to 2.6.12.4 for now, and it seems to be OK.
Comment 1 Tomasz Lemiech 2005-10-06 09:26:18 UTC
Created attachment 6239
.config for 2.6.12.4
Comment 2 Tomasz Lemiech 2005-10-06 09:26:58 UTC
Created attachment 6240
.config for 2.6.13.3
Comment 3 Tomasz Lemiech 2005-10-06 09:27:36 UTC
Created attachment 6241
Startup dmesg of 2.6.12.4
Comment 4 Tomasz Lemiech 2005-10-06 09:28:07 UTC
Created attachment 6242
Startup dmesg of 2.6.13.3
Comment 5 Tomasz Lemiech 2005-10-06 09:29:00 UTC
Created attachment 6243
aic7xxx error messages from dmesg
Comment 6 Tomasz Lemiech 2005-10-06 09:29:36 UTC
Created attachment 6244
lspci -vvv
Comment 7 Andrew Morton 2005-10-10 18:40:16 UTC
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5378
> 
>            Summary: aic7xxx deadlock/freeze on Adaptec AIC-7899P
>     Kernel Version: 2.6.13.3
>             Status: NEW
>           Severity: high
>              Owner: andmike@us.ibm.com
>          Submitter: szpajder@staszic.waw.pl
> 
> 
> The following messages appeared in dmesg:
> 
> scsi0:0:1:0: Attempting to queue an ABORT message
> CDB: 0x28 0x0 0x0 0xab 0x8b 0x99 0x0 0x0 0x8 0x0
> scsi0: At time of recovery, card was not paused
> >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
> scsi0: Dumping Card State while idle, at SEQADDR 0x9
> Card was paused
> [...] (entire dump attached)
> 
> At the time of these errors, load average exceeded 30. After issuing SCSI RESET,
> the system went back to normal. The problem reappeared several hours later -
> load average reached 140 and all the tasks hung waiting for I/O. I was waiting
> for SCSI RESET, which did not occur this time - after about 3 minutes I had to
> reboot with sysrq.
> 
> The problem occurred about a day after upgrading from 2.6.12.4 (which was running
> fine for over 50 days) to 2.6.13.3. Hardware: Intel SDS2 mainboard with Adaptec
> AIC-7899P SCSI onboard, 4 x Seagate ST336753LW, software RAID-5, configs, lspci,
> etc - attached. The main difference between startup dmesgs is that all hard
> drives were set up as asynchronous during bootup - this didn't occur under
> 2.6.12.4. So I went back to 2.6.12.4 for now, it seems to be ok.

ISTR that there have been several reports of this regression.

What could have caused this?

Comment 8 Anonymous Emailer 2005-10-10 19:59:06 UTC
Reply-To: James.Bottomley@SteelEye.com

On Mon, 2005-10-10 at 18:41 -0700, Andrew Morton wrote:
> ISTR that there have been several reports of this regression.
> 
> What could have caused this?

Well ... the prior bug reports with this are in aic79xx, and there, there
were no significant code changes between the working and the non-working
versions.  The aic7xxx driver has been fairly significantly changed, but
all in the area of setup and initialisation.

This one looks like a sequencer error, possibly induced by a flaky bus.
Apparently it actually managed to recover once, which is surprising for
the aic driver, but then it died a second time around.  There actually
was a sequencer change (backport from aic latest driver) that could be
responsible.  However, it's in both 2.6.12 and 2.6.13.

James


Comment 9 Tomasz Lemiech 2005-10-27 15:09:03 UTC
After reporting this problem, I changed back to 2.6.12.6, and after several days
of running it froze in a similar way. Output from dmesg follows.

I'm running out of ideas. Could this be a hardware problem? I've been running
this machine for over 2 years without hardware modifications. It ran
flawlessly on the 2.4 series and also on 2.6.11.x and 2.6.12.x for several weeks
before I upgraded to 2.6.13.3.

scsi0:0:1:0: Attempting to queue an ABORT message
CDB: 0x2a 0x0 0x0 0x27 0x4f 0x4f 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x9
Card was paused
ACCUM = 0x0, SINDEX = 0x0, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0x0
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0] 
LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x12]:(ENAUTOATNP|ENRSELI) 
SBLKCTL[0xa]:(SELWIDE|SELBUSB) SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE) 
SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED) SSTAT0[0x0] 
SSTAT1[0x8]:(BUSFREE) SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8]:(ENSWRAP) 
SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO) 
SXFRCTL0[0x80]:(DFON) DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL) 
STACK: 0x0 0x163 0x109 0x3
SCB count = 8
Kernel NEXTQSCB = 3
Card NEXTQSCB = 3
QINFIFO entries: 
Waiting Queue entries: 
Disconnected Queue entries: 3:1 
QOUTFIFO entries: 
Sequencer Free SCB List: 0 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
22 23 24 25 26 27 28 29 30 31 
Sequencer SCB Info: 
  0 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x27] 
SCB_LUN[0x0] SCB_TAG[0xff] 
  1 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x17] 
SCB_LUN[0x0] SCB_TAG[0xff] 
  2 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x47] 
SCB_LUN[0x0] SCB_TAG[0xff] 
  3 SCB_CONTROL[0x64]:(DISCONNECTED|TAG_ENB|DISCENB) SCB_SCSIID[0x17] 
SCB_LUN[0x0] SCB_TAG[0x1] 
  4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
  5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
  6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
  7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
  8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
  9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 16 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 17 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 18 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 19 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) 
SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff] 
Pending list: 
  1 SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x17] 
SCB_LUN[0x0] 
Kernel Free SCB list: 0 2 7 6 5 4 
Untagged Q(1): 1 

<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:1:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:1:0): Abort Tag Message Sent
Recovery code awake
Timer Expired
aic7xxx_abort returns 0x2003
scsi0:0:1:0: Attempting to queue a TARGET RESET message
CDB: 0x2a 0x0 0x0 0x27 0x4f 0x4f 0x0 0x0 0x8 0x0
aic7xxx_dev_reset returns 0x2003
Recovery SCB completes
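
When comparing several of these dumps, it can help to pull the `NAME[0x..]:(FLAGS)` tokens into a structure instead of reading them by eye. The sketch below is only an illustration based on the token format visible in the dump above; the names `REG_RE` and `parse_registers` are made up here, and nothing in it comes from the aic7xxx driver itself.

```python
import re

# Matches NAME[0xVALUE], optionally followed by :(FLAG|FLAG|...)
REG_RE = re.compile(r"(\w+)\[(0x[0-9a-fA-F]+)\](?::\(([^)]*)\))?")


def parse_registers(text):
    """Return {register name: (value, [flag names])} for each token in text.

    If a name appears more than once, the last occurrence wins, so feed
    the per-SCB lines one entry at a time.
    """
    regs = {}
    for name, value, flags in REG_RE.findall(text):
        regs[name] = (int(value, 16), flags.split("|") if flags else [])
    return regs


line = "LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x12]:(ENAUTOATNP|ENRSELI) SSTAT0[0x0]"
for name, (value, flags) in parse_registers(line).items():
    print(name, hex(value), flags)
```

Run over the register section of two dumps, the resulting dictionaries can be compared key by key to see which registers differ between a hang that recovered and one that did not.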
Comment 10 Natalie Protasevich 2007-07-22 02:43:35 UTC
Any update on this problem?
Thanks.
Comment 11 Tomasz Lemiech 2007-07-22 05:52:07 UTC
(In reply to comment #10)
> Any update on this problem?
> Thanks.

Sorry for keeping quiet for so long. In the meantime, I've got another job, so I'm no longer in charge of maintaining that particular piece of hardware. But AFAIK, the drive was later replaced and the problem disappeared.