Bug 5171
Summary: | 2.6.13 SMP kernel crash on boot at pm_idle_save() | ||
---|---|---|---|
Product: | ACPI | Reporter: | Masoud Sharbiani (masouds) |
Component: | Other | Assignee: | Venkatesh Pallipadi (venki) |
Status: | REJECTED UNREPRODUCIBLE | ||
Severity: | high | CC: | acpi-bugzilla, akpm, gbillios, masouds |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.13-rc7 and later | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
Kernel messages on failure
Kernel config file
The config file that works happily w/2.6.13 and ACPI
The happy 2.6.13 dmesg output
/proc/cpuinfo for working 2.6.13 config |
Description
Masoud Sharbiani
2005-09-01 10:38:53 UTC
Created attachment 5849 [details]
Kernel messages on failure
Captured through the serial port.
Created attachment 5850 [details]
Kernel config file
0xc0237817 <acpi_processor_idle+217>: movzbl 0x1(%esi),%eax
0xc023781b <acpi_processor_idle+221>: incl 0x14(%esi)
0xc023781e <acpi_processor_idle+224>: cmp $0x2,%eax
0xc0237821 <acpi_processor_idle+227>: je 0xc023784e <acpi_processor_idle+272>
0xc0237823 <acpi_processor_idle+229>: jg 0xc023782d <acpi_processor_idle+239>
0xc0237825 <acpi_processor_idle+231>: dec %eax
0xc0237826 <acpi_processor_idle+232>: je 0xc0237837 <acpi_processor_idle+249>
0xc0237828 <acpi_processor_idle+234>: jmp 0xc023792a <acpi_processor_idle+492>
0xc023782d <acpi_processor_idle+239>: cmp $0x3,%eax
0xc0237830 <acpi_processor_idle+242>: je 0xc023788a <acpi_processor_idle+332>
0xc0237832 <acpi_processor_idle+244>: jmp 0xc023792a <acpi_processor_idle+492>
0xc0237837 <acpi_processor_idle+249>: mov 0xc054e160,%eax
0xc023783c <acpi_processor_idle+254>: test %eax,%eax
0xc023783e <acpi_processor_idle+256>: je 0xc0237844 <acpi_processor_idle+262>
0xc0237840 <acpi_processor_idle+258>: call *%eax

(The last instruction is the first invocation of the pm_idle_save() function in the acpi_processor_idle() function.)
The machine boots just fine if ACPI is disabled in the BIOS.

curious -- processor_idle.c didn't change between 2.6.13-rc6 and 2.6.13. Any change if you do a make clean and then build the kernel from scratch?
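For readers who do not want to trace the jumps by hand, the branch structure of the listing in the report can be modeled roughly as below. This is only a reading aid written in Python, not kernel code: the function name idle_dispatch and the returned strings are made up for illustration, and treating the byte loaded by movzbl as a state type with values 1, 2 and 3 is an interpretation of the compares in the listing. The one thing taken from the report itself is that the pointer loaded from 0xc054e160 is pm_idle_save.

```python
def idle_dispatch(state_type, pm_idle_save=None):
    """Rough model of the branches in the listing; returns the path taken.

    state_type stands for the byte loaded by `movzbl 0x1(%esi),%eax`
    (hypothetical name). pm_idle_save stands for the saved handler
    pointer loaded from 0xc054e160; None models a null pointer.
    """
    if state_type == 2:              # cmp $0x2,%eax ; je +272
        return "type 2 branch (+272)"
    if state_type > 2:               # jg +239
        if state_type == 3:          # cmp $0x3,%eax ; je +332
            return "type 3 branch (+332)"
        return "exit (+492)"         # jmp +492
    if state_type - 1 == 0:          # dec %eax ; je +249
        if pm_idle_save is not None: # mov 0xc054e160,%eax ; test ; je +262
            pm_idle_save()           # call *%eax  (the reported crash site)
            return "called saved idle handler"
        return "null handler, skip call (+262)"
    return "exit (+492)"             # jmp +492
```

The point of the model is that the reported instruction is an indirect call through a stored pointer that has already passed a null check, so a crash there suggests the pointer held a non-null but invalid value, or that the call target itself faulted.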
> CONFIG_PREEMPT_NONE=y
hmmm, that's not it...
no idea...
Created attachment 5932 [details]
The config file that works happily w/2.6.13 and ACPI
With this config, and a clean 2.6.13, it works just fine.
With the other config file and a clean-compiled 2.6.13, it crashes at the point
I've mentioned. I've also tried it with gcc-4.0.1 (FC4), so it is not a gcc
issue.
Unless the first config file is ruled out as invalid by the more experienced
people, I'll go back and try to move options from the old one to the new one so
that I can find out what option combination causes this.
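Before moving options across by hand, it may help to list exactly which options differ between the two config files. The following is a minimal sketch, not a tool from the kernel tree: the function names parse_config and diff_configs are invented here, and it assumes that an option missing from a file can be treated the same as one marked "is not set", which is a simplification.

```python
import re

# "CONFIG_FOO=value" and "# CONFIG_FOO is not set" lines in a kernel .config
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(good_text, bad_text):
    """Return sorted (option, good_value, bad_value) tuples that differ.

    Assumption: an option absent from one file is treated as 'n'.
    """
    good = parse_config(good_text)
    bad = parse_config(bad_text)
    diffs = []
    for name in sorted(set(good) | set(bad)):
        good_value = good.get(name, "n")
        bad_value = bad.get(name, "n")
        if good_value != bad_value:
            diffs.append((name, good_value, bad_value))
    return diffs
```

To use it, read the contents of the working and the failing config files into two strings and pass them to diff_configs(); the resulting list is the set of candidates to flip one at a time.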
This is really strange... Nothing obvious from the configs. You will probably have to check the config options in the BAD config one by one to narrow this down. Can you also post the complete dmesg and /proc/cpuinfo from when the system boots (with the good config)? Thanks.

Created attachment 5941 [details]
The happy 2.6.13 dmesg output
Created attachment 5942 [details]
/proc/cpuinfo for working 2.6.13 config
Same problem and symptoms persist with 2.6.14-rc1.

I am now pretty sure that this is caused by a flipping bit (or a very small region of memory). I managed to reproduce the bug by just adding a printk() line to the *bsd partition detection code, which has absolutely nothing to do with cpu_idle and the subsequent call trace; it could merely be moving the offending address to a different address upon load. I'll spend the weekend memtesting it with different memory sticks. It may also help to switch CPU1 and CPU2 and rerun the test (so that the CPUs get a good beating too). It is very weird though, as the machine (with the kernels that actually boot successfully) manages to go through a cycle of make -jN kernel builds. Feel free to close the bug; if I happen to make it work with a known-good RAM chip, I'll reopen or file a new one. Thanks!

We are trying to see whether we can reproduce the problem in our lab. I will update you when I have news. Thanks.

During the weekend I played with this:
1) It goes away when ACPI is disabled in the BIOS.
2) Exchanging the RAM (there are 2 sticks in that machine; I tested with one of the two sticks) doesn't help.
3) I even exchanged the CPUs on the motherboard, CPU0 with CPU1; it still happens.

I also have a similar problem with every 2.6.13 and 2.6.14 kernel. When I boot with SMP enabled my system just hangs at boot, but it works OK if I boot without SMP. I can't tell if it is the same problem because I can't capture through serial. My system is a 3.06GHz HT P4 laptop. Note that my 2.6.12 kernel boots with the acpi=noapic command, but if I boot the 2.6.13 kernel with that parameter it hangs instantly, while without it, it hangs after a few seconds during boot.

About the problem with the P3 Coppermine: can you please try to boot with the kernel boot option "idle=halt"? With that, this system should boot OK with ACPI enabled. I still don't know what is causing the original error.
But this should be a good workaround until we figure out the real cause. About the P4 issue: my feeling is that it is a different problem. Do you see any oops message on the console at the time of the hang (both the early hang and the hang at a later point)? What are the last messages that you see on the console before the hang? Can you try to get some more info using the Magic SysRq keys?

Can you please tell me what exactly you want from the Magic SysRq keys? Memory output, registers, something else? Also note that I get no oops message on the console.

SysRq: current registers and current tasks information will be useful. Also, note this 2-3 times in the span of a few seconds, so that we will know whether it is stuck at the same point. Thanks.

I have *NOT* been able to reproduce the bug with 2.6.14-rc2, compiled with the dotconfig above (the so-called bad config). If you guys haven't been able to reproduce it, mark it either as fixed or invalid.

okay, please re-open if you reproduce it.

Well, with 2.6.14-rc2 I have the same problem, but I can't use SysRq at all; I can't understand why. What I noticed this time, though, is that my PC doesn't hang, it just boots very, very, very slowly. It stays a long time at the 'Starting udev' message, but the slowness starts almost right after boot. I haven't noticed any unusual message like an oops or anything else during boot, though. And as always, if I compile the kernel for 1 CPU it works just fine!

Hmm. Looks like it's some MTRR issue. Is your CPU model/stepping f33 or f34 by any chance? Complete dmesg with slow boot and with normal boot (earlier kernel that used to boot fine) should help. Also, if you can open a new bugzilla with a pointer to this one, that should help too, as the two issues in this bug are different. Thanks.

I have created a new bug here: http://bugzilla.kernel.org/show_bug.cgi?id=5331 I have attached the dmesg of the working kernel but can't get one from the non-working one. My P4 3.06GHz HT is revision 09.