Latest working kernel version:
Earliest failing kernel version:
Distribution: Gentoo
Hardware Environment: ML150G3 (2-core CPU, 64-bit), AHA3944AUWD card, Storagetek L80 + 2x DLT8000
Software Environment: Gentoo
Problem Description: kernel panic

Steps to reproduce:
Panic if the L80 is powered up when the kernel boots. 100% reproducible on any failing kernel.
Not all kernels fail, but most do.
Git bisect across Linus's tree did not produce a convincing patch.
Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
I have joined the linux-scsi list and will ..continue
What I said previously..

The event that brought the problem to light was the installation of a secondhand Storagetek L80 tape library. This has two DLT8000 drives on an HV-Differential bus. This needed a special card, an Adaptec 3944AUWD.
The kernel I was running at that time was 2.6.22-gentoo-r8. It worked fine. Then when -r9 came out and this error manifested, the assumption was that -r9 was broken.

I no longer think this to be the case.

I think they are _ALL_ broken, possibly going way back toward the start of the 2.6 series.
I think that the bug may or may not manifest depending on the internal layout of data in the kernel
--A true heisenbug--

All that the git bisect did was to change the internal layout, not add/remove a bad patch.

This explains why I could take the 2.6.23.8 kernel, compile it for SMP and have it fail, then compile it for UP and have it work. Initially I thought that meant a locking or race issue.
Now I think it was just another case of altering the internal kernel layout.
Created attachment 14500 [details] Screenshot showing panic
Created attachment 14501 [details] My .config file
I have drivers for other scsi cards compiled in. For the purpose of testing only the 3944AUWD card is installed.
Reply-To: James.Bottomley@HansenPartnership.com

> Latest working kernel version:
> Earliest failing kernel version:
> Distribution: Gentoo
> Hardware Environment: ML150G3, (2Core cpu, 64Bit) AHA3944AUWD card,
> Storagetek L80 +2x DLT8000
> Software Environment: gentoo
> Problem Description: kernel panic
>
> Steps to reproduce:
> Panic if the L80 is powered up when the kernel boots. 100% on any failing
> kernel.
> Not all kernels fail but most do.
> Git Bisect across linus's tree did not produce a convincing patch.
> Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
> I have joined the linux-scsi list and will
>
> The event that brought the problem to light was the installation of a
> secondhand Storagetek L80 tape library. This has two DLT8000 drives on a
> HV-Differential bus.
> This needed special card, an adaptec 3944AUWD.
> The kernel I was running at that time was 2.6.22-gentoo-r8.
> It worked fine. Then when -r9 came out and this error manifested, the
> assumption was that -r9 was broken.
>
> I no longer think this to be the case.
>
> I think they are _ALL_ broken, possibly going way back toward the start of
> the 2.6 series.
> I think that the bug may or may not manifest depending on the internal layout
> of data in the kernel
> --A true heisenbug--
>
> All that the git bisect did was to change the internal layout, not add/remove
> a bad patch.
>
> This explains why I could take the 2.6.23.8 kernel and compile for SMP and
> have it fail.
> Compile it for UP and have it work. Initially I thought that meant a locking
> or race issue.
> Now I think its was just another case of altering the internal kernel layout.

Actually, I'd investigate either your tapes or the SCSI bus. The message is produced deep in the heart of the aic7xxx driver. It happens when the driver gets reselected with a tag that doesn't exist.

However, in this case, I think your device is untagged, in which case this is some handling issue with SCB_LIST_NULL (the value 0xff).

James
Thanks,

I've just done some more testing. There are no tapes in the drives.
Normally, there is the L80 and a DLT8000 on channel B and a DLT8000 on channel A.
Both busses have external terminators.

If Ch B is used alone the system is fine!
If Ch A is used alone it will fail.

If you are thinking of some hardware problem, it's possible to boot with the L80 off, cause the SCSI bus to rescan and have everything work fine.

Regards,
john
Duh! I mean boot with it off, power it up and rescan.
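The comment above does not say how the rescan was triggered. One common way on 2.6 kernels is to write the wildcard string "- - -" (channel, target, LUN) to the host's sysfs scan file. The following is only a sketch of that approach in C; the helper name scsi_host_rescan and the host number in the usage note are assumptions, not something taken from this report.

```c
#include <stdio.h>

/*
 * Hypothetical helper: ask a SCSI host to rescan all channels, targets
 * and LUNs by writing the wildcard triple "- - -" to its sysfs scan file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int scsi_host_rescan(const char *scan_path)
{
	FILE *f = fopen(scan_path, "w");
	int ok;

	if (f == NULL)
		return -1;

	ok = (fputs("- - -\n", f) >= 0);

	/* fclose() flushes the buffer, so a failed write can surface here. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

As root, a call might look like scsi_host_rescan("/sys/class/scsi_host/host0/scan"), where host0 is a guess; the number assigned to the 3944AUWD depends on which other adapters are present.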
Ok, I've spent some time trying different combinations of devices.

Against kernel 2.6.24
T0 is Quantum DLT8000 ID0
T1 is Quantum DLT8000 ID1
MTX is STK L80 ID 15
Terminators A, B

Channel A      Channel B      Result
T0,T1,MTX,B    Nil            Crash
Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
-- Both drives work ok
T1,MTX,B       Nil            Ok Skipped Tests
T1,MTX,A       Nil            Ok Skipped Tests
T0,MTX,B       Nil            Crash
T0,MTX,A       Nil            Crash
-- Not the terminator

--Test on two channels
T0,MTX,A       T1,B           Crash
T1,B           T0,MTX,A       Parity Error in Data-in Phase

It really doesn't like three devices, on two busses or one.
Wrap around doesn't help..

I've also tried the 'old' AIC78XX driver. That driver hangs even with no devices attached.

So now what?

--john
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-02-08 at 18:52 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> Ok, I've spent some time trying different combinations of devices.
>
> Against kernel 2.6.24
> T0 is Quantum DLT8000 ID0
> T1 is Quantum DLT8000 ID1
> MTX is STK L80 ID 15
> Terminators A, B
>
> Channel A       B
> T0,T1,MTX,B     Nil
> Crash
> Nil             T0,T1,MTX,B
> Parity Error in Data-in Phase
> Nil             T0,MTX,B
> Ok, Tar test ok, MTX ok
> Nil             T1,MTX,B
> Ok, Tar test ok, MTX ok
> -- Both drives work ok
> T1,MTX,B        Nil
> Ok Skipped Tests
> T1,MTX,A        Nil
> Ok Skipped Tests
> T0,MTX,B        Nil
> Crash
> T0,MTX,A        Nil
> Crash
> -- Not the terminator
>
>
> --Test on two channels
> T0,MTX,A        T1,B
> Crash
> T1,B            T0,MTX,A
> Parity Error in Data-in Phase
>
> It really doesn't like three devices, on two busses or one.

Well, I still think you have some type of bus instability, but that said we need to get rid of the panic. I'm afraid this is going to be a long process.

For the first attempt, let's see if this is an unsolicited msgin ... it looks like the driver handling for those is wrong. Can you try this patch?

Thanks,

James

---

diff --git a/drivers/scsi/aic7xxx/aic7xxx_core.c b/drivers/scsi/aic7xxx/aic7xxx_core.c
index 6d2ae64..64e62ce 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_core.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_core.c
@@ -695,15 +695,16 @@ ahc_handle_seqint(struct ahc_softc *ahc, u_int intstat)
 			scb_index = ahc_inb(ahc, SCB_TAG);
 			scb = ahc_lookup_scb(ahc, scb_index);
 			if (devinfo.role == ROLE_INITIATOR) {
-				if (scb == NULL)
-					panic("HOST_MSG_LOOP with "
-					      "invalid SCB %x\n", scb_index);
+				if (bus_phase == P_MESGOUT) {
+					if (scb == NULL)
+						panic("HOST_MSG_LOOP with "
+						      "invalid SCB %x\n",
+						      scb_index);
 
-				if (bus_phase == P_MESGOUT)
 					ahc_setup_initiator_msgout(ahc,
 								   &devinfo,
 								   scb);
-				else {
+				} else {
 					ahc->msg_type =
 					    MSG_TYPE_INITIATOR_MSGIN;
 					ahc->msgin_index = 0;
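To make the intent of the patch easier to follow, here is a small standalone model of the branch it changes. It is only a sketch: the enums and the two functions (host_msg_loop_before, host_msg_loop_after) are hypothetical stand-ins invented for illustration, not driver code. It assumes the only behavioural difference is that a missing SCB no longer panics during message-in.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for driver state; not the real aic7xxx types. */
enum bus_phase { PHASE_MESGOUT, PHASE_MESGIN };
enum outcome { OUT_PANIC, OUT_SETUP_MSGOUT, OUT_SETUP_MSGIN };

/* Before the patch: a missing SCB panics regardless of the bus phase. */
enum outcome host_msg_loop_before(bool have_scb, enum bus_phase phase)
{
	if (!have_scb)
		return OUT_PANIC;
	if (phase == PHASE_MESGOUT)
		return OUT_SETUP_MSGOUT;
	return OUT_SETUP_MSGIN;
}

/*
 * After the patch: the SCB is only required for message-out, where it is
 * passed to the msgout setup. Message-in proceeds without one, so an
 * unsolicited msgin no longer brings the machine down.
 */
enum outcome host_msg_loop_after(bool have_scb, enum bus_phase phase)
{
	if (phase == PHASE_MESGOUT) {
		if (!have_scb)
			return OUT_PANIC;
		return OUT_SETUP_MSGOUT;
	}
	return OUT_SETUP_MSGIN;
}
```

The two versions differ only for the combination (no SCB, message-in): the first returns OUT_PANIC, the second OUT_SETUP_MSGIN.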
Thanks James,

I've spent an afternoon rebooting now and finally discovered I had a faulty external SCSI cable. Initial tests suggest it's OK. However, I remain perplexed: the problem initially manifested when I upgraded my kernel, not when I diddled with my hardware.

This now seems to have fixed udev bug http://bugs.gentoo.org/show_bug.cgi?id=200437 as well. How bizarre!

Thanks for your help everyone.

Regards
John