Bug 5171

Summary: 2.6.13 SMP kernel crash on boot at pm_idle_save()
Product: ACPI
Reporter: Masoud Sharbiani (masouds)
Component: Other
Assignee: Venkatesh Pallipadi (venki)
Status: REJECTED UNREPRODUCIBLE
Severity: high
CC: acpi-bugzilla, akpm, gbillios, masouds
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13-rc7 and later
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
  Kernel messages on failure
  Kernel config file
  The config file that works happily w/2.6.13 and ACPI
  The happy 2.6.13 dmesg output
  /proc/cpuinfo for working 2.6.13 config

Description Masoud Sharbiani 2005-09-01 10:38:53 UTC
Most recent kernel where this bug did not occur: 2.6.13-rc6
Distribution: Ubuntu 5.04
Hardware Environment: 2x866 MHz machine with an MSI-694D board (VIA chipset),
768 MB of RAM, a 160 GB IDE HDD, VIA Rhine NIC, SymBIOS SCSI card, 3dfx Voodoo
graphics adapter
Software Environment: gcc version 3.3.5, nothing else running at the time of crash
Problem Description/Steps to reproduce:
Configure the kernel on the same hardware with the attached config and boot it.
It crashes as shown in the attached capture file (taken through the serial
interface). Only the SMP kernel crashes; UP boots fine.
Some chasing in the source showed that it is the call to the non-set
pm_idle_save() function that fails. It probably has something to do with the
machine being SMP.
Comment 1 Masoud Sharbiani 2005-09-01 10:44:44 UTC
Created attachment 5849 [details]
Kernel messages on failure

Captured through the serial port.
Comment 2 Masoud Sharbiani 2005-09-01 10:45:33 UTC
Created attachment 5850 [details]
Kernel config file
Comment 3 Masoud Sharbiani 2005-09-01 11:03:08 UTC
0xc0237817 <acpi_processor_idle+217>:   movzbl 0x1(%esi),%eax
0xc023781b <acpi_processor_idle+221>:   incl   0x14(%esi)
0xc023781e <acpi_processor_idle+224>:   cmp    $0x2,%eax
0xc0237821 <acpi_processor_idle+227>:   je     0xc023784e <acpi_processor_idle+272>
0xc0237823 <acpi_processor_idle+229>:   jg     0xc023782d <acpi_processor_idle+239>
0xc0237825 <acpi_processor_idle+231>:   dec    %eax
0xc0237826 <acpi_processor_idle+232>:   je     0xc0237837 <acpi_processor_idle+249>
0xc0237828 <acpi_processor_idle+234>:   jmp    0xc023792a <acpi_processor_idle+492>
0xc023782d <acpi_processor_idle+239>:   cmp    $0x3,%eax
0xc0237830 <acpi_processor_idle+242>:   je     0xc023788a <acpi_processor_idle+332>
0xc0237832 <acpi_processor_idle+244>:   jmp    0xc023792a <acpi_processor_idle+492>
0xc0237837 <acpi_processor_idle+249>:   mov    0xc054e160,%eax
0xc023783c <acpi_processor_idle+254>:   test   %eax,%eax
0xc023783e <acpi_processor_idle+256>:   je     0xc0237844 <acpi_processor_idle+262>
0xc0237840 <acpi_processor_idle+258>:   call   *%eax
(which is the first invocation of the pm_idle_save() function in the
acpi_processor_idle() function.)
Comment 4 Masoud Sharbiani 2005-09-07 12:15:34 UTC
The machine boots just fine if the ACPI is disabled in BIOS.
Comment 5 Len Brown 2005-09-07 19:07:04 UTC
curious -- processor_idle.c didn't change between
2.6.13-rc6 and 2.6.13.  Any change if you
do a make clean and then build the kernel from scratch?

> CONFIG_PREEMPT_NONE=y

hmmm, that's not it...

no idea...
Comment 6 Masoud Sharbiani 2005-09-08 08:33:55 UTC
Created attachment 5932 [details]
The config file that works happily w/2.6.13 and ACPI

With this config, and a clean 2.6.13, it works just fine.
With the other config file and a clean-compiled 2.6.13, it crashes at the point
I've mentioned. I've also tried it with gcc-4.0.1 (FC4), so it is not a gcc
issue.
Unless the first config file is ruled out as invalid by the more experienced
people, I'll go back and try to move options from the old one to the new one so
that I can find out what option combination causes this.
Comment 7 Venkatesh Pallipadi 2005-09-08 11:11:00 UTC
This is really strange...

Nothing obvious from the configs. You will probably have to check the config
options in the BAD config one by one to narrow this down.
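
One way to make the "one by one" comparison less tedious is to list only the options that differ between the two configs. The following is a minimal Python sketch, not something posted in this bug: the function names and the two tiny config fragments are invented for illustration and are not taken from the attachments. It assumes the usual .config syntax, where an enabled option appears as CONFIG_FOO=value and a disabled one as "# CONFIG_FOO is not set", and it treats an absent option the same as a disabled one.

```python
def parse_config(text):
    """Parse kernel .config text into a {option: value} dict.

    '# CONFIG_FOO is not set' lines are recorded with the value 'n'.
    Other comments and blank lines are ignored.
    """
    suffix = " is not set"
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(good_text, bad_text):
    """Return a sorted list of (option, good_value, bad_value) that differ.

    An option missing from one file is treated as 'n' (an assumption made
    here to keep the comparison simple).
    """
    good = parse_config(good_text)
    bad = parse_config(bad_text)
    diffs = []
    for name in sorted(set(good) | set(bad)):
        good_value = good.get(name, "n")
        bad_value = bad.get(name, "n")
        if good_value != bad_value:
            diffs.append((name, good_value, bad_value))
    return diffs


# Hypothetical fragments, NOT the configs attached to this bug.
GOOD_EXAMPLE = """\
CONFIG_SMP=y
CONFIG_ACPI_PROCESSOR=m
# CONFIG_ACPI_DEBUG is not set
"""
BAD_EXAMPLE = """\
CONFIG_SMP=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_DEBUG=y
"""

for name, good_value, bad_value in diff_configs(GOOD_EXAMPLE, BAD_EXAMPLE):
    print(f"{name}: good={good_value} bad={bad_value}")
```

With real files, the two strings would come from reading the good and bad .config files. The resulting list is the set of candidates to flip one at a time (or half at a time) while rebuilding and rebooting.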
Comment 8 Venkatesh Pallipadi 2005-09-08 17:04:03 UTC
Can you also post the complete dmesg and /proc/cpuinfo from when the system
boots (with the good config)?

Thanks.
Comment 9 Masoud Sharbiani 2005-09-08 18:26:23 UTC
Created attachment 5941 [details]
The happy 2.6.13 dmesg output
Comment 10 Masoud Sharbiani 2005-09-08 18:27:16 UTC
Created attachment 5942 [details]
/proc/cpuinfo for working 2.6.13 config
Comment 11 Masoud Sharbiani 2005-09-14 11:16:45 UTC
Same problem and symptoms persist with 2.6.14-rc1.
Comment 12 Masoud Sharbiani 2005-09-16 10:43:51 UTC
I am now pretty sure that this is caused by a flipping bit (or a very small
region of faulty memory).
I managed to reproduce the bug just by adding a printk() line to the *bsd
partition detection code, which has absolutely nothing to do with cpu_idle and
the subsequent call trace; it could merely be moving the offending address to a
different address upon load.

I'll spend the weekend memtesting it with different memory sticks. It may also
help to switch the CPU1 and CPU2 and rerun the test (so that the CPUs get a good
beating too).
It is very weird, though, as the machine (with the kernels that actually boot
successfully) manages to go through a cycle of make -jN kernel builds.
Feel free to close the bug; if I happen to make it work with a very well known
RAM chip, I'll reopen or file a new one.
Thanks!
Comment 13 Venkatesh Pallipadi 2005-09-16 16:31:37 UTC
We are trying to see whether we can reproduce the problem in our lab. I will
update you when I find something.

Thanks.
Comment 14 Masoud Sharbiani 2005-09-20 12:33:59 UTC
During the weekend I played with this: 
1) It goes away when ACPI is disabled from BIOS
2) Exchanging the RAM (There are 2 sticks in that machine, I tested with one of
the two sticks) doesn't help.
3) I even exchanged the CPUs on the motherboard (CPU0 with CPU1); it still happens.

 
Comment 15 George Billios 2005-09-21 04:57:13 UTC
I also have a similar problem with every 2.6.13 and 2.6.14 kernel. When I boot
with SMP enabled my system just hangs at boot, but it works OK if I boot without SMP.
I can't tell if it is the same problem because I can't capture through serial.
My system is a 3.06GHz HT P4 laptop.
Note that my 2.6.12 kernel boots with the acpi=noapic command, but if I boot the
2.6.13 kernel with that parameter it hangs instantly, while without it, it hangs
after a few seconds during boot.
Comment 16 Venkatesh Pallipadi 2005-09-21 10:15:39 UTC
About the problem with P3 Coppermine:
Can you please try to boot with the kernel boot option "idle=halt"? With that,
this system should boot OK with ACPI enabled. I still don't know what is causing
the original error, but this should be a good workaround until we figure out the
real cause.

About P4 issue:
My feeling is it is a different problem. Do you see any oops message on the
console at the time of hang (both early hang and hang at a later point)? What
are the last messages that you see on console before the hang? 
Can you try to get some more info using Magic SysRq keys?
Comment 17 George Billios 2005-09-22 08:57:57 UTC
Can you please tell me exactly what you want me to capture using the Magic SysRq
keys? Memory output, registers, something else?
Also note that I get no oops message on the console.
Comment 18 Venkatesh Pallipadi 2005-09-22 14:53:20 UTC
SysRq:

Current registers and current tasks information will be useful

Also, note this 2-3 times within the span of a few seconds, so that we will know
whether it is stuck at the same point.

Thanks.
Comment 19 Masoud Sharbiani 2005-09-26 19:42:51 UTC
I have *NOT* been able to reproduce the bug with 2.6.14-rc2, compiled with the 
dotconfig above (so called bad config). If you guys haven't been able to 
reproduce it, mark it either as fixed or invalid.
Comment 20 Len Brown 2005-09-28 19:01:00 UTC
okay, please re-open if you reproduce it.
Comment 21 George Billios 2005-09-29 09:30:48 UTC
Well, with 2.6.14-rc2 I have the same problem, but I can't use SysRq at all and
I can't understand why.
What I noticed this time, though, is that my PC doesn't hang; it just boots
very, very slowly. It stays a long time at the 'Starting udev' message, but
the slowness starts almost right after boot. I haven't noticed any unusual
message like an oops or anything else during boot, though. And as always, if I
compile the kernel for 1 CPU it works just fine!
Comment 22 Venkatesh Pallipadi 2005-09-29 09:44:04 UTC
Hmm, looks like it's some MTRR issue. Is your CPU model/stepping f33 or f34 by
any chance?

Complete dmesg with slow boot and with normal boot (earlier kernel that used to
boot fine) should help. Also if you can open a new bugzilla with a pointer to
this one, that should help too, as the two issues in this bug are different.

Thanks.
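
For readers comparing against /proc/cpuinfo: assuming "f33" and "f34" above are hexadecimal CPUID signatures (family/model/stepping packed into one value), the sketch below decodes such a value into the decimal "cpu family", "model" and "stepping" fields that /proc/cpuinfo prints. It follows the standard x86 CPUID leaf 1 EAX layout. The function name is made up for this illustration, and the reading of "f33"/"f34" as signatures is an assumption rather than something stated in the comment.

```python
def decode_cpuid_signature(signature):
    """Split an x86 CPUID signature (leaf 1, EAX) into (family, model, stepping).

    Bit layout: stepping 3:0, model 7:4, family 11:8,
    extended model 19:16, extended family 27:20.
    """
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# The two signatures asked about above (assumed to be hex CPUID values).
for sig in (0xF33, 0xF34):
    family, model, stepping = decode_cpuid_signature(sig)
    print(f"{sig:#x}: cpu family {family}, model {model}, stepping {stepping}")
```

Under that assumption, the question amounts to whether /proc/cpuinfo shows cpu family 15, model 3, and stepping 3 or 4.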
Comment 23 George Billios 2005-09-29 11:03:44 UTC
I have created a new bug here: http://bugzilla.kernel.org/show_bug.cgi?id=5331

I have attached the dmesg of the working kernel but can't get one from the
non-working. 
My P4 3.06GHz HT is revision 09.