Bug 12590
Summary: | Kernel Panic with Pentium D and ECS MB | | |
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Philip Wall (philip.wall) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | RESOLVED OBSOLETE | | |
Severity: | high | CC: | akataria, alan, rjw |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 2.6.28.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | | | |
Bug Blocks: | 56331 | | |
Attachments:
Capture of the panic
apic debug
dmesg out with noapic
acpidump
lspci
dmidecode
acpi_use_timer_override panic
acpi=off dmesg
Debug patch
The panic with the patch
working release kernel 2.6.26.8 dmesg
Most current DMESG
dmesg with hpet=disable and the noapic boot option removed
Description
Philip Wall
2009-01-31 15:18:07 UTC
Created attachment 20053 [details]
Capture of the panic
console capture of the failure
Created attachment 20054 [details]
apic debug
Created attachment 20055 [details]
dmesg out with noapic
Will you please attach the output of acpidump, lspci -vxxx, and dmidecode?
Created attachment 20058 [details]
acpidump
Created attachment 20059 [details]
lspci
Created attachment 20060 [details]
dmidecode
Hi, Philip
From the dmesg in comment #1 it seems that there exists the following message:
>ACPI: BIOS IRQ0 pin2 override ignored
How about the boot option of "acpi_use_timer_override"?
Thanks.

No effect; it still causes a panic.

Hi, Philip
Will you please attach the capture of the kernel panic after adding the boot option "acpi_use_timer_override"? Will you also add the option "acpi=off" and see whether the system can boot?
Thanks.
Created attachment 20128 [details]
acpi_use_timer_override panic
Created attachment 20129 [details]
acpi=off dmesg
It booted with this switch.
Keep in mind this all started with the 2.6.27 branch, so something changed at that point that doesn't agree with my hardware.

Hi, Philip
Thanks for the confirmation. From the log in comment #12 it seems that the timer is connected to the I/O APIC through I/O APIC pin 0, and in that case it works. In fact, the log in comment #1 shows the timer is also connected to the I/O APIC through pin 0, yet there it does not work. Maybe the difference is that the irqtype reported by the MPS table is MP_EXTINT while the irqtype in the ACPI table is MP_INT.
Thanks.

Is there anything I can do to make it start working again? It used to have no problem until the 2.6.27 kernels hit the streets.

Hi, Philip
Will you please use git-bisect to identify the commit that causes the regression?
Thanks.

Output from git is:
fca515fbfa5ecd9f7b54db311317e2c877d7831a is first bad commit

But this commit is an ia64 change; nothing changed on the x86 side. Can you please double check?

The Pentium D is an ia64 dual core processor and I run it in 64 bit mode.

Hi, Philip
The Pentium D is a dual-core processor based on the x86 architecture. It can work in 32-bit or 64-bit mode, but it is not based on the Itanium architecture. The ia64 branch is for processors based on Itanium. Will you please double check this issue?
Thanks.

I will run git bisect again. It will take a few days, but I'll run it again.

Output this time is:
3da757daf86e498872855f0b5e101f763ba79499 is first bad commit

During some of the bad builds the reboot died with this error:
MP-BIOS bug: 8254 timer not connected to IO-APIC
Kernel Panic - not syncing: IO-APIC + timer doesn't work! Try using the 'noapic' kernel parameter.

cc: Rafael J. Wysocki

Hi, Rui
From the result of git-bisect it seems that this is related to the x86 branch. How about assigning it to another category?
Thanks.
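git bisect is, in essence, a binary search over the commits between a known-good and a known-bad kernel, which is why the two runs above could name completely different commits: a single wrong good/bad verdict redirects the whole search. The sketch below models that on a hypothetical linear history. The commit names, the position of the regression, and the mistaken verdict are all made up for illustration, and real kernel history is a graph rather than a list.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]


# Hypothetical history c0..c15 in which c11 introduces the regression.
COMMITS = [f"c{i}" for i in range(16)]
TRUE_FIRST_BAD = 11


def honest(commit: str) -> bool:
    """Verdict from a test that always reports the truth."""
    return int(commit[1:]) >= TRUE_FIRST_BAD


def one_wrong_verdict(commit: str) -> bool:
    """Same as honest(), except c7 is wrongly reported bad, for example
    because of an unrelated boot failure during that one test."""
    return commit == "c7" or honest(commit)


print(first_bad(COMMITS, honest))             # c11
print(first_bad(COMMITS, one_wrong_verdict))  # c7, an unrelated commit
```

When a failure may be intermittent, booting each candidate kernel several times before marking it good or bad reduces the chance of this kind of misdirection.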
first bad commit:

commit 3da757daf86e498872855f0b5e101f763ba79499
Author: Alok Kataria <akataria@vmware.com>
Date: Fri Jun 20 15:06:33 2008 -0700

x86: use cpu_khz for loops_per_jiffy calculation

On the x86 platform we can use the value of tsc_khz computed during tsc calibration to calculate the loops_per_jiffy value. Its very important to keep the error in lpj values to minimum as any error in that may result in kernel panic in check_timer. In virtualization environment, On a highly overloaded host the guest delay calibration may sometimes result in errors beyond the ~50% that timer_irq_works can handle, resulting in the guest panicking.

Does some formating changes to lpj_setup code to now have a single printk to print the bogomips value.

We do this only for the boot processor because the AP's can have different base frequencies or the BIOS might boot a AP at a different frequency.

Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Daniel Hecht <dhecht@vmware.com>
Cc: Tim Mann <mann@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Cc: Sahil Rihan <srihan@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

> ------- Comment #26 from rui.zhang@intel.com 2009-03-08 19:54 -------
> first bad commit:
>
> commit 3da757daf86e498872855f0b5e101f763ba79499
> Author: Alok Kataria <akataria@vmware.com>
> Date: Fri Jun 20 15:06:33 2008 -0700
>
> x86: use cpu_khz for loops_per_jiffy calculation
I doubt that this is in any way connected to the IO-APIC timer
failure.
Thanks,
tglx
Actually, the timer_irq_works function uses mdelay to check whether the IO-APIC is working correctly on the hardware. The mdelay routine depends on the lpj value that is calculated. With this commit (79499), since we calculate the lpj value directly from the cpu_khz value, we expect the TSC frequency calibration algorithm to give us the correct value for cpu_khz. Maybe that's not working correctly on this hardware.
Philip, can you please post the dmesg of the kernel which booted fine (the one without any command-line options)?
Thanks,
Alok
Created attachment 20472 [details]
Debug patch
Patch on top of current git, but should apply on 2.6.27.. as well.
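The reasoning above, that timer_irq_works can fail purely because mdelay runs too short, can be illustrated with a small model. This is a sketch, not the kernel code: the ten-tick wait and the "more than four ticks" pass threshold are assumptions about the check in kernels of that era, and `lpj_true` (the number of delay loops that really fit in one timer tick) is a hypothetical quantity introduced only for this example.

```python
def ticks_seen(lpj_used: int, lpj_true: int, requested_ticks: int = 10) -> int:
    """Whole timer ticks that elapse during a busy-wait meant to last
    `requested_ticks` jiffies.

    lpj_used: loops-per-jiffy value the delay loop believes.
    lpj_true: loops that actually fit in one timer tick (hypothetical).
    """
    loops_spun = lpj_used * requested_ticks
    return loops_spun // lpj_true


def timer_irq_works(lpj_used: int, lpj_true: int, min_ticks: int = 4) -> bool:
    """Assumed rule: wait about ten ticks, pass if more than four were counted."""
    return ticks_seen(lpj_used, lpj_true) > min_ticks


# Correct calibration: ten ticks are seen and the check passes.
print(timer_irq_works(1_000_000, 1_000_000))   # True
# lpj underestimated to 30% of the real value: only 3 ticks, check fails.
print(timer_irq_works(300_000, 1_000_000))     # False
# lpj overestimated 16 times: the wait is simply longer, check passes.
print(timer_irq_works(16_000_000, 1_000_000))  # True
```

In this model an lpj that is too small fails the check, while one that is too large merely lengthens the wait. The model says nothing about the case where timer interrupts are not delivered at all.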
(In reply to comment #29)
> Created an attachment (id=20472) [details]
> Debug patch
>
> Patch on top of current git, but should apply on 2.6.27.. as well.
>
Philip, I have attached a patch which removes the code that uses the tsc_khz value to calculate lpj, and which also adds a printk statement to get some more debug data. If the commit above is the root cause of this problem, the attached debug patch should at least get the system up for you. Please send the dmesg output with this patch.
Thanks,
Alok
Created attachment 20473 [details]
The panic with the patch
At this point I'm not sure what to do. I could try and run git bisect again but I'm concerned I'll probably get a completely different answer again. And I've lost one of the kernels that worked. Is there a way to get git to roll to just before that patch or do I have to bisect the entire thing again? I think I still have the last release kernel that worked without noapic.

(In reply to comment #32)
> At this point I'm not sure what to do. I could try and run git bisect again but
> I'm concerned I'll probably get a completely different answer again. And I've
> lost one of the kernels that worked. Is there a way to get git to roll to just
> before that patch or do I have to bisect the entire thing again?

git reset --hard f22f9a89ce6857d377bf22dba4c1a8cd256c5136

This should take the tree to the commit just before the one that you mentioned was faulty. But please be careful: a hard reset will lose all the local changes that you might have made in your tree.

> I think I still have the last release kernel that worked without noapic.
>
Yes, please try to boot this one and get the dmesg output. Also please try multiple boot cycles, just to confirm that the problem is actually not present on all kernels for that hardware.

From the dmesg that you just posted, here are the relevant lines:
*********
Calibrating delay using timer specific routine.. 87629.16 BogoMIPS (lpj=145990657)
ANK, debug lpj_fine is 8857983 tsc_khz is 2657395 cpu_khz 2657395 hz is 300
*********
The lpj value that was calibrated by the timer-specific routine looks wrong to me; the lpj value shouldn't be this large IMO, but we will know better once we get the dmesg for a working kernel. The panic with this patch also makes me wonder whether the "IOAPIC + timer" panic is totally unrelated to the delay being calibrated incorrectly. Maybe something else in the IOAPIC space is wrong.
Created attachment 20474 [details]
working release kernel 2.6.26.8 dmesg
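The figures in the debug output posted above can be cross-checked with a little arithmetic. The sketch below assumes the usual kernel formulas: lpj derived as tsc_khz * 1000 / HZ, and BogoMIPS printed as lpj / (500000 / HZ) using integer division. The function names are illustrative, not kernel identifiers. It also assumes that the lpj from the timer-specific routine can be read as timer cycles per tick, so that lpj * HZ implies a clock frequency.

```python
HZ = 300  # "hz is 300" in the debug printk


def lpj_from_khz(tsc_khz: int, hz: int) -> int:
    """loops_per_jiffy derived from the TSC frequency in kHz."""
    return tsc_khz * 1000 // hz


def bogomips(lpj: int, hz: int) -> str:
    """Format lpj the way the 'Calibrating delay' line shows it."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"


def implied_khz(lpj: int, hz: int) -> int:
    """Clock frequency in kHz implied by lpj, read as cycles per tick."""
    return lpj * hz // 1000


print(lpj_from_khz(2657395, HZ))            # 8857983, matches lpj_fine
print(bogomips(145990657, HZ))              # 87629.16, matches the dmesg line
print(implied_khz(145990657, HZ))           # 43797197 kHz
print(round(145990657 / 8857983, 2))        # 16.48
```

Under those assumptions the timer-calibrated lpj implies a clock of roughly 43.8 GHz, about 16.5 times the 2.66 GHz that tsc_khz reports, which fits the observation that the value looks far too large.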
Looking at the dmesg for 2.6.26.8: the log looks really weird to me.
====================
For CPU0
Calibrating delay using timer specific routine.. 87629.44 BogoMIPS (lpj=145991105)
For CPU1
Initializing CPU#1
Calibrating delay using timer specific routine.. 5317.01 BogoMIPS (lpj=8859218)
====================
Why do the lpj values for these two processors differ by so much? Philip, is this an asymmetric system? Or do you have any setting in the BIOS which changes the frequency at which the processors are running? Also, I notice that you are running with a HZ value of 300; does the system boot with any other values of HZ, say 100 or 1000?
Assuming the values you are seeing for lpj on CPU0 are correct, the kernel with the debug patch also set that value and it still didn't boot; that part confuses me. How many boots did you try with the patched kernel? Did all of them fail?

No BIOS setting I have ever seen. The ECS motherboard doesn't have a lot of options. This is just one of the first Intel dual core processors, Pentium D 805. The HZ value of 300 was a good middle-ground choice for me since I watch a lot of video on this machine. I tried it 5 times with the patch and all failed. The older kernel seems much more tolerant of APIC and timer strangeness.

Given that we are not able to boot with the debug patch that I attached in comment #29, I think it's some problem with the ioapic on this machine. Thomas/Ingo, any ideas on that?
Apart from that, commit 79499 has a problem of miscalculating the lpj value for this box too. Though it's separate from the problem mentioned above, I still don't understand how we calibrate the lpj value to such a high value on this machine.
Created attachment 20600 [details]
Most current DMESG
Newest Kernels are producing some new errors and warnings concerning APIC now.
Plus an error of
select() to /dev/rtc to wait for clock tick timed out
which has produced the interesting problem that nothing seems to be able to figure out the timezone I'm in.
While searching for fixes for the /dev/rtc problem I found out that it is caused by an HPET issue of an unshared IRQ with the RTC. On my system, passing --directisa to the hwclock program kept producing the same /dev/rtc error, so I tried booting with hpet=disable to kill off the fancy timer and go old school. When that worked I took off my noapic boot param and suddenly the thing booted with no issue; I just had to make sure I had the hpet=disable option. So somewhere in my hardware there is a conflict between HPET, APIC and RTC, as far as I can tell.
Created attachment 21193 [details]
dmesg with hpet=disable and the noapic boot option removed
If this is still seen on modern kernels then please re-open/update.