Created attachment 20718 [details]
dmesg with acpi=debug

Last working kernel version: 2.6.29-rc8-gitsomething

Hardware: Tyan K8W motherboard with two 1.8GHz dual-core Opterons and 2GB RAM for each node. The motherboard uses the AMD 8xxx chipsets for AGP, PCI-X and southbridge.

Software: Debian unstable, 64bit.

The kernel crashes with the following message (and a backtrace afterwards):

[ 0.008000] ENABLING IO-APIC IRQs
[ 0.008000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[ 0.008000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.008000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.008000] ..... (found apic 0 pin 0) ...
[ 0.008000] ....... failed.
[ 0.008000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.008000] ..... failed.
[ 0.008000] ...trying to set up timer as ExtINT IRQ...
[ 0.008000] ..... failed :(.
[ 0.008000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a r.

Attached is the complete dmesg as captured via serial console (and with acpi=debug).
I've just tested v2.6.29-7100-g833bb30. The problem still exists. Does anyone have an idea or should I start bisecting?
Does v2.6.29 without any patches on top work?
On Fri, Apr 03, 2009 at 09:37:48PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12961
> --- Comment #2 from Rafael J. Wysocki <rjw@sisk.pl> 2009-04-03 21:37:48 ---
> Does v2.6.29 without any patches on top work?

v2.6.29-rc8-223-ga1e4ee2 does work, and there is nothing related to timers/irq or x86 between that specific version and 2.6.29. So v2.6.29 most likely works, too.

I'm going to bisect now, because there seems to be no obvious culprit at hand.
I've bisected and tracked the problem down to the following commit:

commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date: Sat Feb 21 00:10:44 2009 +0100

    x86: hpet: provide separate functions to stop and start the counter

Yes, I've looked at the patch, and nothing in it looks suspicious. I then reverted this patch on top of v2.6.29-7100-g833bb30. This yielded a non-compilable kernel, so I additionally reverted

commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date: Sat Feb 21 00:16:35 2009 +0100

    x86: hpet: stop HPET_COUNTER when programming periodic mode

With these two patches reverted, the kernel boots fine and everything works. I'll post the dmesg of this working kernel shortly.
Created attachment 20811 [details] dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
On Sun, Apr 05, 2009 at 05:34:36PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403

That's wrong. First-Bad-Commit is 8d6f0c8214928f7c5083dd54ecb69c5d615b516e

I had to revert them both only because c23... depends on 8d6... for the kernel to compile at all.

The only thing 8d6... does is split a function into two parts and replace the original function with one that calls both parts. So most likely the bug is somewhere else entirely and was only uncovered by a small timing/code layout change.
First-Bad-Commit : 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Daniel, can you please retest both kernels, with and without commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e, but this time with the kernel parameter hpet=verbose? Thanks.
Further kernel parameters you should use are "debug" and "apic=debug".
Created attachment 21013 [details] dmesg with "debug acpi=debug hpet=verbose", broken kernel
Created attachment 21014 [details] dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
Thanks for the logs.

(BTW, I meant apic=debug, not acpi=debug, but it doesn't matter as the problem seems to be unrelated to APIC configuration.)

I assume that the problem is the HPET of your chipset. The diff between both dmesgs shows (- == working, + == broken):

-[ 0.004000] hpet: hpet_set_mode(325):
+[ 0.004000] hpet: hpet_set_mode(328):
 [ 0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
 [ 0.004000] hpet: CFG: 0x3, STATUS: 0x1
-[ 0.004000] hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
+[ 0.004000] hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
 [ 0.004000] hpet: T0: CFG_l: 0x261c, CFG_h: 0xfdefa
-[ 0.004000] hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
+[ 0.004000] hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0

In the broken case the timer comparator is less than the counter value. The counter should have been reset, e.g. as in the following example:

hpet: hpet_set_mode(328):
hpet: ID: 0x43538301, PERIOD: 0x429b17e
hpet: CFG: 0x3, STATUS: 0x0
hpet: COUNTER_l: 0x1c80, COUNTER_h: 0x0
hpet: T0: CFG_l: 0x1c, CFG_h: 0xc0ffff
hpet: T0: CMP_l: 0x37ee, CMP_h: 0x0

This means that the reset of the counter did not work properly. I'll check the errata information for your chipset.

Last but not least, can you please confirm that commit 8d6f0c8214928f7c508 (which doesn't contain functional changes) introduced the problem on your system? Please do an additional test with the kernel obtained with

# git checkout 8d6f0c8214928f7c5083dd54ecb69c5d615b516e

and the kernel parameters "debug hpet=verbose", so that I can also compare the dmesg of this test with that of the working test run.
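The comparator check described in the comment above can be written down as a small userspace sketch. This is an illustration only, not kernel code: the helper name `hpet_cmp_is_ahead` and the constant names are hypothetical, and the values are simply the COUNTER_l and T0 CMP_l readings from the two dmesg excerpts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true when the
 * timer 0 comparator lies ahead of the main counter, i.e. the next
 * match is still to come without the counter having to wrap first.
 * Only the low 32 bits are compared; COUNTER_h and CMP_h are 0x0 in
 * both logs.
 */
static bool hpet_cmp_is_ahead(uint32_t counter_l, uint32_t cmp_l)
{
	return cmp_l > counter_l;
}

/* Readings copied from the hpet=verbose output of the two test runs. */
static const uint32_t working_counter = 0x2c90d, working_cmp = 0x5450a;
static const uint32_t broken_counter  = 0x2cd4d, broken_cmp  = 0xdfb7;
```

Calling `hpet_cmp_is_ahead(working_counter, working_cmp)` yields true, while `hpet_cmp_is_ahead(broken_counter, broken_cmp)` yields false, which matches the observation that the comparator is less than the counter only in the broken run.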
And please attach also output of # lspci -nnxxx
Created attachment 21016 [details] dmesg of 8d6f... with "debug hpet=verbose"
Created attachment 21017 [details]
dmesg of 23e2... with "debug hpet=verbose"

This one actually crashes, in contrast to 8d6f..., which doesn't crash anymore. I double-checked my bisect log, and I really did note 8d6f as bad. Most likely I screwed something up, but it was the last kernel to test, so the rest still holds. Sorry for the confusion this caused.
Created attachment 21018 [details] lspci -nnxxx as root
> lspci -nnxxx as root

Ok, thanks for this information. I've checked for errata wrt your chipset and hpet -- there are none.

> This one actually crashes, in contrast
> to 8d6f..., which doesn't crash anymore.

Nice to know. So the second patch causes the problem. It writes 0 to the main counter, but this doesn't seem to work on your chipset (for an unknown reason).

I'll prepare a patch which stops the main counter but does not reset it while the HPET is programmed in periodic mode. This should fix the system hang that I tried to fix with commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403 and might also be compatible with your system.
Created attachment 21070 [details]
[PATCH] x86: hpet: fix periodic mode programming on AMD 81xx

My observation was wrong. The reset of HPET_COUNTER works. But the period for periodic mode is not properly set. On this chipset a second write is required. I've tested the patch on a similar system with 81xx chipset.
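To illustrate why a second write can matter, here is a toy userspace model. It is not the patch from attachment 21070 and does not touch real hardware. Every name in it is hypothetical, and the register behaviour it encodes (the first comparator write after arming loads only the comparator, and only a further write latches the period) is an assumption chosen to match the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one HPET timer; all names here are hypothetical. */
struct toy_hpet_timer {
	bool     setval_armed; /* models a "set value" configuration bit */
	uint32_t comparator;   /* next match value */
	uint32_t period;       /* 0 means the period was never programmed */
};

/* Arm the timer for periodic programming and clear old state. */
static void toy_arm_periodic(struct toy_hpet_timer *t)
{
	t->setval_armed = true;
	t->comparator = 0;
	t->period = 0;
}

/*
 * ASSUMED chipset behaviour: a comparator write while "set value" is
 * armed loads only the comparator and disarms the bit; the period is
 * latched only by a write that arrives with the bit disarmed.
 */
static void toy_write_cmp(struct toy_hpet_timer *t, uint32_t value)
{
	if (t->setval_armed) {
		t->comparator = value;
		t->setval_armed = false;
	} else {
		t->period = value;
	}
}

/*
 * Program periodic mode using one or two comparator writes.
 * Returns true when the period actually took effect in the model.
 */
static bool toy_program_periodic(struct toy_hpet_timer *t,
				 uint32_t delta, int writes)
{
	assert(delta != 0);
	assert(writes == 1 || writes == 2);

	toy_arm_periodic(t);
	toy_write_cmp(t, delta);
	if (writes == 2)
		toy_write_cmp(t, delta);

	return t->period == delta;
}
```

In this model, `toy_program_periodic(&t, 0xdfb7, 1)` leaves the period unset, while the same call with `writes` set to 2 latches it. The value 0xdfb7 is just the T0 comparator reading from the broken dmesg, reused as a sample delta.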
On Tue, Apr 21, 2009 at 05:58:01PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #19 from Andreas Herrmann <andreas.herrmann3@amd.com> 2009-04-21 17:58:01 ---
> Created an attachment (id=21070)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=21070)
> [PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
>
> My observation was wrong. The reset of HPET_COUNTER works. But
> the period for periodic mode is not properly set. On this chipset
> a second write is required.

I've just tested your patch on top of latest -linus and it indeed fixes the hang. Thanks a lot for tracking down this issue.

I'll close the bug report as soon as the fix has hit mainline and I've retested.

-Daniel
Handled-By : Andreas Herrmann <andreas.herrmann3@amd.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=21070
Just tested v2.6.30-rc3-376-ge25c2c8 in mainline and the hang is gone! - Thx, Daniel