Bug 9775 - HOST_MSG_LOOP invalid SCB ff
Summary: HOST_MSG_LOOP invalid SCB ff
Status: CLOSED CODE_FIX
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-18 13:34 UTC by John Huttley
Modified: 2021-05-24 15:25 UTC
CC List: 2 users

See Also:
Kernel Version: Through to 2.6.24-rc8 (X64)
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Screenshot showing panic (154.86 KB, image/jpeg)
2008-01-18 13:39 UTC, John Huttley
My .config file (40.95 KB, text/plain)
2008-01-18 13:53 UTC, John Huttley

Description John Huttley 2008-01-18 13:34:00 UTC
Latest working kernel version:
Earliest failing kernel version: 
Distribution: Gentoo
Hardware Environment: ML150G3, (2Core cpu, 64Bit)  AHA3944AUWD card, Storagetek L80 +2x DLT8000
Software Environment: gentoo
Problem Description: kernel panic 

Steps to reproduce:
Panic if the L80 is powered up when the kernel boots. 100% on any failing kernel.
Not all kernels fail but most do.
A git bisect across Linus's tree did not produce a convincing patch.
Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
I have joined the linux-scsi list and will
..continue
Comment 1 John Huttley 2008-01-18 13:35:05 UTC
What I said previously..

The event that brought the problem to light was the installation of a secondhand Storagetek L80
tape library. This has two DLT8000 drives on a HV-Differential bus.
This needed a special card, an Adaptec 3944AUWD.
The kernel I was running at that time was 2.6.22-gentoo-r8.
It worked fine. Then when -r9 came out and this error manifested, the assumption
was that -r9 was broken.

I no longer think this to be the case.

I think they are _ALL_ broken, possibly going way back toward the start of the 2.6 series.
I think that the bug may or may not manifest depending on the internal layout of data in the kernel
--A true heisenbug--

All that the git bisect did was to change the internal layout, not add/remove a bad patch.

This explains why I could take the 2.6.23.8 kernel and compile for SMP and have it fail.
Compile it for UP and have it work. Initially I thought that meant a locking or race issue.
Now I think it was just another case of altering the internal kernel layout.
Comment 2 John Huttley 2008-01-18 13:39:17 UTC
Created attachment 14500
Screenshot showing panic
Comment 3 John Huttley 2008-01-18 13:53:27 UTC
Created attachment 14501
My .config file
Comment 4 John Huttley 2008-01-18 13:55:54 UTC
I have drivers for other scsi cards compiled in.
For the purpose of testing only the 3944AUWD card is installed.
Comment 5 Anonymous Emailer 2008-01-18 14:27:59 UTC
Reply-To: James.Bottomley@HansenPartnership.com


> Latest working kernel version:
> Earliest failing kernel version: 
> Distribution: Gentoo
> Hardware Environment: ML150G3, (2Core cpu, 64Bit)  AHA3944AUWD card,
> Storagetek
> L80 +2x DLT8000
> Software Environment: gentoo
> Problem Description: kernel panic 
> 
> Steps to reproduce:
> Panic if the L80 is powered up when the kernel boots. 100% on any failing
> kernel.
> Not all kernels fail but most do.
> Git Bisect across linus's tree did not produce a convincing patch.
> Originally filed here: http://bugs.gentoo.org/show_bug.cgi?id=200708
> I have joined the linux-scsi list and will
> 
> The event that brought the problem to light was the installation of a
> secondhand Storagetek L80
> tape library. This has two DLT8000 drives on a HV-Differential bus.
> This needed special card, an adaptec 3944AUWD.
> The kernel I was running at that time was 2.6.22-gentoo-r8.
> It worked fine. Then when -r9 came out and this error manifested, the
> assumption
> was that -r9 was broken.
> 
> I no longer think this to be the case.
> 
> I think they are _ALL_ broken, possibly going way back toward the start of
> the
> 2.6 series.
> I think that the bug may or may not manifest depending on the internal layout
> of data in the kernel
> --A true heisenbug--
> 
> All that the git bisect did was to change the internal layout, not add/remove
> a
> bad patch.
> 
> This explains why I could take the 2.6.23.8 kernel and compile for SMP and
> have
> it fail.
> Compile it for UP and have it work. Initially I thought that meant a locking
> or
> race issue.
> Now I think it was just another case of altering the internal kernel layout.

Actually, I'd investigate either your tapes or the SCSI bus.

The message is produced deep in the heart of the aic7xxx driver.  It
happens when the driver gets reselected with a tag that doesn't exist.
However, in this case, I think your device is untagged, in which case
this is some handling issue with SCB_LIST_NULL (the value 0xff).
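
To illustrate the condition described above, here is a minimal sketch of a tag lookup that treats 0xff as a "no SCB" sentinel. It is a simplified model, not the aic7xxx driver's code: every `sketch_` name, the table size, and the struct layout are assumptions made up for this example. It only shows why a reselection carrying 0xff (or any tag with no active command) yields a NULL SCB, which is the case the panic message reports.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the driver's real data structures. */
#define SKETCH_SCB_LIST_NULL 0xffu  /* sentinel meaning "no SCB" */
#define SKETCH_NUM_SCBS      16u    /* assumed table size for the sketch */

struct sketch_scb {
    uint8_t tag;     /* index of this control block */
    int     in_use;  /* nonzero while a command is outstanding */
};

static struct sketch_scb sketch_table[SKETCH_NUM_SCBS];

/* Mark a control block as belonging to an outstanding command. */
void sketch_start_command(unsigned int tag)
{
    if (tag < SKETCH_NUM_SCBS) {
        sketch_table[tag].tag = (uint8_t)tag;
        sketch_table[tag].in_use = 1;
    }
}

/*
 * Return the active control block for a tag, or NULL when the tag is
 * the sentinel, out of range, or not associated with any command.
 */
struct sketch_scb *sketch_lookup_scb(unsigned int tag)
{
    if (tag == SKETCH_SCB_LIST_NULL || tag >= SKETCH_NUM_SCBS)
        return NULL;
    if (!sketch_table[tag].in_use)
        return NULL;
    return &sketch_table[tag];
}
```

In this model, a caller that assumes the lookup always succeeds has no safe path when the bus delivers a tag the host never issued, for example because of a signal integrity problem.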

James
Comment 6 John Huttley 2008-01-18 14:35:27 UTC
Thanks, I've just done some more testing.
There are no tapes in the drives.
Normally, there is the L80 and a DLT8000 on channel B
and a DLT8000 on channel A

Both busses have external terminators.

If Ch B is used alone the system is fine!
If Ch A is used alone it will fail.

If you are thinking of some hardware problem, it's possible to boot with the L80 off, cause the SCSI bus to rescan, and have everything work fine.
Regards,
john
Comment 7 John Huttley 2008-01-18 14:36:26 UTC
Duh! I mean boot with it off, power it up and rescan.
Comment 8 John Huttley 2008-02-08 18:52:41 UTC
Ok, I've spent some time trying different combinations of devices.

Against kernel 2.6.24
T0 is Quantum DLT8000 ID0
T1 is Quantum DLT8000 ID1
MTX	is STK L80  ID 15
Terminators A, B

Channel A      Channel B      Result
T0,T1,MTX,B    Nil            Crash
Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
-- Both drives work ok
T1,MTX,B       Nil            Ok   Skipped Tests
T1,MTX,A       Nil            Ok   Skipped Tests
T0,MTX,B       Nil            Crash
T0,MTX,A       Nil            Crash
-- Not the terminator

--Test on two channels
T0,MTX,A       T1,B           Crash
T1,B           T0,MTX,A       Parity Error in Data-in Phase

It really doesn't like three devices, on two busses or one.
Comment 9 John Huttley 2008-02-08 18:54:20 UTC
Wrap around doesn't help..

I've also tried the 'old' AIC78XX driver.
That driver hangs even with no devices attached.

So now what?

--john
Comment 10 Anonymous Emailer 2008-02-12 13:56:03 UTC
Reply-To: James.Bottomley@HansenPartnership.com

On Fri, 2008-02-08 at 18:52 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> Ok, I've spent some time trying different combinations of devices.
> 
> Against kernel 2.6.24
> T0 is Quantum DLT8000 ID0
> T1 is Quantum DLT8000 ID1
> MTX     is STK L80  ID 15
> Terminators A, B
> 
> Channel A      Channel B      Result
> T0,T1,MTX,B    Nil            Crash
> Nil            T0,T1,MTX,B    Parity Error in Data-in Phase
> Nil            T0,MTX,B       Ok, Tar test ok, MTX ok
> Nil            T1,MTX,B       Ok, Tar test ok, MTX ok
> -- Both drives work ok
> T1,MTX,B       Nil            Ok   Skipped Tests
> T1,MTX,A       Nil            Ok   Skipped Tests
> T0,MTX,B       Nil            Crash
> T0,MTX,A       Nil            Crash
> -- Not the terminator
>
> --Test on two channels
> T0,MTX,A       T1,B           Crash
> T1,B           T0,MTX,A       Parity Error in Data-in Phase
> 
> It really doesn't like three devices, on two busses or one.

Well, I still think you have some type of bus instability, but that said
we need to get rid of the panic.

I'm afraid this is going to be a long process.  For the first attempt,
let's see if this is an unsolicited msgin ... it looks like the driver
handling for those is wrong.  Can you try this patch?

Thanks,

James

---

diff --git a/drivers/scsi/aic7xxx/aic7xxx_core.c b/drivers/scsi/aic7xxx/aic7xxx_core.c
index 6d2ae64..64e62ce 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_core.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_core.c
@@ -695,15 +695,16 @@ ahc_handle_seqint(struct ahc_softc *ahc, u_int intstat)
 			scb_index = ahc_inb(ahc, SCB_TAG);
 			scb = ahc_lookup_scb(ahc, scb_index);
 			if (devinfo.role == ROLE_INITIATOR) {
-				if (scb == NULL)
-					panic("HOST_MSG_LOOP with "
-					      "invalid SCB %x\n", scb_index);
+				if (bus_phase == P_MESGOUT) {
+					if (scb == NULL)
+						panic("HOST_MSG_LOOP with "
+						      "invalid SCB %x\n",
+						      scb_index);
 
-				if (bus_phase == P_MESGOUT)
 					ahc_setup_initiator_msgout(ahc,
 								   &devinfo,
 								   scb);
-				else {
+				} else {
 					ahc->msg_type =
 					    MSG_TYPE_INITIATOR_MSGIN;
 					ahc->msgin_index = 0;
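
The control-flow change in the hunk above can be sketched in isolation as follows. This is a toy model under stated assumptions, not driver code: the `sketch_` enums and functions are invented for illustration and only mirror the branch structure visible in the diff (initiator role, NULL check, message-out versus message-in).

```c
#include <stddef.h>

/* Hypothetical names modelling only the branch structure of the diff. */
enum sketch_phase  { SKETCH_MESGOUT, SKETCH_MESGIN };
enum sketch_action { SKETCH_PANIC, SKETCH_SETUP_MSGOUT, SKETCH_START_MSGIN };

/* Before the patch: a NULL SCB panics regardless of bus phase. */
enum sketch_action sketch_before_patch(enum sketch_phase phase, const void *scb)
{
    if (scb == NULL)
        return SKETCH_PANIC;
    return phase == SKETCH_MESGOUT ? SKETCH_SETUP_MSGOUT
                                   : SKETCH_START_MSGIN;
}

/*
 * After the patch: the NULL check applies only in the message-out
 * phase, where the SCB is actually needed to build the message.
 * A message-in with no SCB is allowed to proceed.
 */
enum sketch_action sketch_after_patch(enum sketch_phase phase, const void *scb)
{
    if (phase == SKETCH_MESGOUT) {
        if (scb == NULL)
            return SKETCH_PANIC;
        return SKETCH_SETUP_MSGOUT;
    }
    return SKETCH_START_MSGIN;
}
```

The only input on which the two versions differ is a message-in phase with no SCB, which is the unsolicited msgin case the patch is meant to test.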
Comment 11 John Huttley 2008-02-16 18:40:09 UTC
Thanks James,
I've spent an afternoon rebooting and finally discovered I had a faulty external SCSI cable.

Initial tests suggest it's ok.

However I remain perplexed. The problem initially manifested when I upgraded my kernel, not when I diddled with my hardware.

This now seems to have fixed the udev bug
http://bugs.gentoo.org/show_bug.cgi?id=200437

as well

how bizarre!

Thanks for your help everyone.

Regards
John
