Bug 8469

Summary: aacraid oops during boot
Product: IO/Storage Reporter: Rainer Malitzke-Goes (malitzke)
Component: SCSI Assignee: Mark Salyzyn (aacraid)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: akpm, bunk, randy.dunlap
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.21.1 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Netconsole dumps of three boot attempts (1 panicked, 2 working
Another panicked SMP boot using kernel-2.6.22-rc1

Description Rainer Malitzke-Goes 2007-05-12 07:36:49 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.20.11
Distribution: kernel.org 2.6.21.1
Hardware Environment: Dell 6300 SMP, 4x Pentium III, 4G memory, Adaptec SCSI controller
Software Environment: compiled with gcc-4.1.3 (gcc-4.1 branch)
Problem Description: Recognizes the Adaptec controller and seems to try to go to init
on successive processors.
Code: Bad EIP value.
EIP: [<00000000>] _stext+0x3fefff/0x20  SS:ESP 0068:c295fe58
Kernel panic - not syncing: Attempted to kill init!
BUG: at arch/i386/kernel/smp.c:546 smp_call_function()
[<c010e00b>] smp_call_function+0x12b/0x130
.....
(above hand copied from screen, apparently one message like this for each CPU)

Steps to reproduce: Recompiled both 2.6.20.11 and 2.6.21.1 and the fault is
consistent. The same kernel (2.6.21.1) boots fine on an SMP G4 (MAC) machine.

Will try various release candidate versions.

Will need help to characterize further.
Comment 1 Rainer Malitzke-Goes 2007-05-12 08:38:33 UTC
Well! The boot bug has occurred ever since kernel-2.6.21-rc1.

However, kernel-2.6.21.1 boots fine on a single processor Pentium3 machine.

Thus, the bug is restricted to SMP i386 type machines ever since release
candidate 1.
Comment 2 Andrew Morton 2007-05-12 10:15:05 UTC
Thanks.  A digital photo of the screen might help us to get a look
at that oops.

Or set up netconsole - it's pretty easy:
Documentation/networking/netconsole.txt

Comment 3 Rainer Malitzke-Goes 2007-05-12 13:05:38 UTC
Thanks for the direction.

Will set up and debug netconsole on another machine, as boot on that server
is quite slow and the SCSI drives do not like to stop and spin-up repeatedly.
There is no reset on that machine.

I do not have a digital camera and I do not believe what appears on the screen
will be helpful as it is one set of consequential calls like "panic, do_exit,
die, do_page_fault, do_page_fault, error code+, acpi_nmi_disable" etc. At the
top of the screen is what appears to be the tail end of one other message
sequence that appears whole. Therefore I surmise that there are three or four
equivalent message sequences referring to either all four CPUs or just the
additional three not used during the initial boot.

After correctly identifying the three Adaptec controllers, things scroll too fast
to capture either by eye or camera until things lock up with the last message
sequence.
Comment 4 Rainer Malitzke-Goes 2007-05-12 19:40:11 UTC
Created attachment 11488 [details]
Netconsole dumps of three boot attempts (1 panicked, 2 working

These three boot attempts show that 2.6.21.1 fails on i386 SMP but works on a
single processor; 2.6.20.11 boots fine on SMP. The corresponding SMP .configs
are equivalent except for differences introduced by menuconfig. The build logs
('make V=1 2>&1 | tee .Build') are available if needed, as are the .configs.
Comment 5 Rainer Malitzke-Goes 2007-05-12 19:45:08 UTC
I have some suggestions about the netconsole documentation, if they are wanted;
please tell me whom to send them to.
Comment 6 Randy Dunlap 2007-05-12 23:08:49 UTC
Post netconsole Doc. comments here or send them to me or to the
netconsole owner:  Matt Mackall <mpm@selenic.com>
Comment 7 Rainer Malitzke-Goes 2007-05-13 15:25:30 UTC
Created attachment 11493 [details]
Another panicked SMP boot using kernel-2.6.22-rc1

I am afraid that this is more bad news.

I had not noticed before that in going from 2.6.20.11 to 2.6.21 both aacraid
and aic7xxx had undergone significant changes. I was too fixed on SMP.

As 2.6.22 has more changes in aacraid I am submitting another failed boot
netconsole dump using 2.6.22-rc1.

I will try 2.6.22-rc1 on the MAC SMP G4, on which I also installed a SCSI drive
with an Adaptec APD-29160N Ultra160 controller.

The good news is that netconsole is a fantastic tool that should be given much
more prominence instead of being practically hidden. I will review the 2.6.22
documentation and configuration before submitting my comments.
Comment 8 Mark Salyzyn 2007-05-14 07:35:04 UTC
This coincides with the introduction of the adapter_comm and adapter_deliver 
platform functions.

I need to know which aacraid based adapters are installed in the system.

The panic appears to occur with an uninitialized adapter_deliver platform
function pointer. I can see an oversight in the sa style adapters, but it
would affect all kernel configurations, not just SMP. This may be the case
because it appears the UP boot did NOT load the aacraid driver (!). These
adapters are the Adaptec 5400S and HP NetRAID; these cards were last produced
in 2000. Inspection has not turned up any other holes, as this is part of a
single-threaded initialization of the adapters. I am aware that the
aac_command_thread has started up, but it is inert.
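
To illustrate the failure mode described above, here is a minimal user-space
sketch in C. It is not the aacraid source: the struct, the demo_* functions and
their signatures are hypothetical stand-ins, and only the hook names
adapter_comm and adapter_deliver are taken from this comment. It shows how one
init path can leave a function pointer unset, and how a guard turns what would
otherwise be a jump to address 0 (consistent with the EIP of 00000000 in the
report) into an error return.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-adapter table of platform functions. */
struct demo_adapter_ops {
    int (*adapter_comm)(int mode);
    int (*adapter_deliver)(int fib);
};

/* Trivial hook implementations, for illustration only. */
static int demo_comm(int mode)   { return mode; }
static int demo_deliver(int fib) { return fib + 1; }

/* Init path that fills in every hook. */
void demo_init_complete(struct demo_adapter_ops *ops)
{
    ops->adapter_comm = demo_comm;
    ops->adapter_deliver = demo_deliver;
}

/* Init path with the kind of oversight described: one hook is left unset. */
void demo_init_incomplete(struct demo_adapter_ops *ops)
{
    ops->adapter_comm = demo_comm;
    ops->adapter_deliver = NULL;    /* never assigned a real function */
}

/* Guarded call: returns -1 instead of calling through a NULL pointer. */
int demo_send(const struct demo_adapter_ops *ops, int fib)
{
    if (ops->adapter_deliver == NULL) {
        fprintf(stderr, "adapter_deliver not set\n");
        return -1;
    }
    return ops->adapter_deliver(fib);
}
```

Calling demo_send() after demo_init_complete() reaches demo_deliver(); after
demo_init_incomplete() it returns -1. Without the NULL check, the second case
would be an indirect call to address 0, which in kernel context produces an
oops of the kind reported here.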

If there is an Adaptec 5400S or HP NetRAID, please pull them to confirm that 
these are the cause of the panic.
Comment 9 Rainer Malitzke-Goes 2007-05-14 22:12:36 UTC
Thanks for the prompt action!

The machine in question, a Dell 6300/550, is now working with the patch applied
to 2.6.22-rc1.

It worked on the second try because a change in configuration was required. The
details and answers to Mark's question follow:

The controller is neither an HP NetRAID nor an Adaptec 5400S. It is an OEM
Adaptec ASSY 1790106-01 with an Adaptec ASSY 1790206-01. It sports two Adaptec
AIC-7897.
In an earlier query Adaptec claimed no residual responsibility and referred me
to Dell, who claimed it was obsolete.

I am using that Dell 6300 not as a server but as a fantastic development machine
with its four processors and three Gigabytes of memory. I am not even using it as
a RAID machine but as a plain SCSI machine. When I first tried to bring it up
with Linux I had a rather steep learning curve, and ended up with the old
aic7xxx_old driver but also had to activate the aacraid driver. However, I never
selected the RAID setup option.

However, with the new drivers I also had to select the RAID option; otherwise it
would not find the root on /dev/sda3.

As I am not familiar with the kernel/osdl/bugzilla arrangement, I only realized
the existence of the patch when checking my mail, as there was no mention of it
in the bugzilla problem report.

It seems that I am very spoiled by the excellent quality of the kernel releases.
I only hacked the kernel in 1993 to get it to read the "old" SCO Xenix
formatted hard drives. Luckily I refrained from publishing my work, given the
legal encumbrances imposed even then by SCO. Had I published it, it could have
been fodder for the "bad-new" SCO and their legal manoeuvrings.

I am quite willing to act as a tester using the Dell and a MAC with dual G4.
Just to introduce myself a little, here follows:

I am 72 and retired but still active, trying to preserve about 30 G of
work-station packages in peril of ending in bit-buckets.

I started programming with unit-record machine plug-boards and progressed to
real-time assembly language programmer on central office telephone switches.
Then I went on to system designer and international telecommunications
consultant, ending up in the satellite industry (COMSAT, INTELSAT).

I have had more exposure with bugzilla as operated by GCC-GNU.ORG, where I filed
about ten problem reports. 

Oh yes I was also an airline pilot and want to take flightgear to become an
instrument flying trainer to prevent unnecessary deaths like the one that befell
the young Kennedy and his wife.

Testing kernels and compilers just fits in with these activities.

PS I could add Mark as mark_salyzyn to the CC. Please forward the info to him.