The following messages appeared in dmesg:

scsi0:0:1:0: Attempting to queue an ABORT message
CDB: 0x28 0x0 0x0 0xab 0x8b 0x99 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x9
Card was paused
[...] (entire dump attached)

At the time of these errors, the load average exceeded 30. After issuing a SCSI RESET, the system went back to normal. The problem reappeared several hours later - the load average reached 140 and all tasks hung waiting for I/O. I waited for a SCSI RESET, which did not occur this time - after about 3 minutes I had to reboot with sysrq.

The problem occurred about a day after upgrading from 2.6.12.4 (which had been running fine for over 50 days) to 2.6.13.3. Hardware: Intel SDS2 mainboard with onboard Adaptec AIC-7899P SCSI, 4 x Seagate ST336753LW, software RAID-5; configs, lspci output, etc. are attached. The main difference between the startup dmesgs is that under 2.6.13.3 all hard drives were set up as asynchronous during bootup - this did not happen under 2.6.12.4. So I have gone back to 2.6.12.4 for now, and it seems to be OK.
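For reference, a bus reset like the one mentioned above can also be requested from userspace through the sg driver's SG_SCSI_RESET ioctl, without waiting for the midlayer's error recovery. A minimal sketch follows; the /dev/sg0 path and the choice of a bus-level reset are illustrative assumptions, not details taken from this report:

/* Sketch: ask the SCSI midlayer for a bus reset via the sg driver.
 * The device node (/dev/sg0) is a placeholder - use any sg node that
 * sits on the affected host adapter. Needs appropriate permissions. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <scsi/sg.h>

int main(void)
{
	int fd = open("/dev/sg0", O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		perror("open /dev/sg0");
		return 1;
	}

	int op = SG_SCSI_RESET_BUS;	/* SG_SCSI_RESET_DEVICE or _HOST also possible */
	if (ioctl(fd, SG_SCSI_RESET, &op) < 0)
		perror("SG_SCSI_RESET");

	close(fd);
	return 0;
}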
Created attachment 6239 [details] .config for 2.6.12.4
Created attachment 6240 [details] .config for 2.6.13.3
Created attachment 6241 [details] Startup dmesg of 2.6.12.4
Created attachment 6242 [details] Startup dmesg of 2.6.13.3
Created attachment 6243 [details] aic7xxx error messages from dmesg
Created attachment 6244 [details] lspci -vvv
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5378
>
>         Summary: aic7xxx deadlock/freeze on Adaptec AIC-7899P
>  Kernel Version: 2.6.13.3
>          Status: NEW
>        Severity: high
>           Owner: andmike@us.ibm.com
>       Submitter: szpajder@staszic.waw.pl
>
> [...]

ISTR that there have been several reports of this regression.

What could have caused this?
Reply-To: James.Bottomley@SteelEye.com

On Mon, 2005-10-10 at 18:41 -0700, Andrew Morton wrote:
> ISTR that there have been several reports of this regression.
>
> What could have caused this?

Well ... the prior bug reports with this are in aic79xx, and there, there were no significant code changes between the working and the non-working versions. The aic7xxx driver has been changed fairly significantly, but all in the area of setup and initialisation.

This one looks like a sequencer error, possibly induced by a flaky bus. Apparently it actually managed to recover once, which is surprising for the aic driver, but then it died the second time around.

There actually was a sequencer change (a backport from the latest aic driver) that could be responsible. However, it's in both 2.6.12 and 2.6.13.

James
After reporting this problem, I changed back to 2.6.12.6, and after several days of running it froze in a similar way. Output from dmesg follows. I'm running out of ideas. Could this be a hardware problem? I've been running this machine for over 2 years without hardware modifications. It ran flawlessly on the 2.4 series, and also on 2.6.11.x and 2.6.12.x for several weeks before I upgraded to 2.6.13.3.

scsi0:0:1:0: Attempting to queue an ABORT message
CDB: 0x2a 0x0 0x0 0x27 0x4f 0x4f 0x0 0x0 0x8 0x0
scsi0: At time of recovery, card was not paused
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State while idle, at SEQADDR 0x9
Card was paused
ACCUM = 0x0, SINDEX = 0x0, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0x0
SCSIPHASE[0x0] SCSISIGI[0x0] ERROR[0x0] SCSIBUSL[0x0]
LASTPHASE[0x1]:(P_BUSFREE) SCSISEQ[0x12]:(ENAUTOATNP|ENRSELI)
SBLKCTL[0xa]:(SELWIDE|SELBUSB) SCSIRATE[0x0] SEQCTL[0x10]:(FASTMODE)
SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED) SSTAT0[0x0]
SSTAT1[0x8]:(BUSFREE) SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8]:(ENSWRAP)
SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO) SXFRCTL0[0x80]:(DFON)
DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
STACK: 0x0 0x163 0x109 0x3
SCB count = 8
Kernel NEXTQSCB = 3
Card NEXTQSCB = 3
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 3:1
QOUTFIFO entries:
Sequencer Free SCB List: 0 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Sequencer SCB Info:
  0 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x27] SCB_LUN[0x0] SCB_TAG[0xff]
  1 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0xff]
  2 SCB_CONTROL[0xe0]:(TAG_ENB|DISCENB|TARGET_SCB) SCB_SCSIID[0x47] SCB_LUN[0x0] SCB_TAG[0xff]
  3 SCB_CONTROL[0x64]:(DISCONNECTED|TAG_ENB|DISCENB) SCB_SCSIID[0x17] SCB_LUN[0x0] SCB_TAG[0x1]
  4 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
  5 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
  6 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
  7 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
  8 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
  9 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 10 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 11 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 12 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 13 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 14 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 15 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 16 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 17 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 18 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 19 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 20 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 21 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 22 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 23 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 24 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 25 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 26 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 27 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 28 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 29 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 30 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
 31 SCB_CONTROL[0x0] SCB_SCSIID[0xff]:(TWIN_CHNLB|OID|TWIN_TID) SCB_LUN[0xff]:(SCB_XFERLEN_ODD|LID) SCB_TAG[0xff]
Pending list:
  1 SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x17] SCB_LUN[0x0]
Kernel Free SCB list: 0 2 7 6 5 4
Untagged Q(1): 1
<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
(scsi0:A:1:0): Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:1:0): Abort Tag Message Sent
Recovery code awake
Timer Expired
aic7xxx_abort returns 0x2003
scsi0:0:1:0: Attempting to queue a TARGET RESET message
CDB: 0x2a 0x0 0x0 0x27 0x4f 0x4f 0x0 0x0 0x8 0x0
aic7xxx_dev_reset returns 0x2003
Recovery SCB completes
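When a flaky bus or a fallback to asynchronous negotiation is suspected, one thing worth checking is the transfer settings the midlayer actually negotiated for each target. A minimal sketch, assuming the SPI transport class is compiled in and exposes the usual period/offset/width attributes under /sys/class/spi_transport/ - the target ID and paths here are illustrative assumptions, not values taken from this system:

/* Sketch: print the negotiated SPI transfer parameters for one target.
 * The target ID (0:0:1) is a placeholder; attribute names assume the
 * standard scsi_transport_spi sysfs layout. */
#include <stdio.h>

static void show(const char *attr)
{
	char path[128], buf[64];
	FILE *f;

	snprintf(path, sizeof(path),
	         "/sys/class/spi_transport/target0:0:1/%s", attr);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		printf("%s: <unreadable>\n", attr);
	else
		printf("%s: %s", attr, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("period");	/* negotiated transfer period */
	show("offset");	/* sync offset; 0 indicates asynchronous transfers */
	show("width");	/* bus width (wide vs. narrow) */
	return 0;
}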
Any update on this problem? Thanks.
(In reply to comment #10)
> Any update on this problem?
> Thanks.

Sorry for keeping quiet for so long. In the meantime I've got another job, so I'm no longer in charge of maintaining that particular piece of hardware. But AFAIK, the drive was later replaced and the problem disappeared.