Summary: Kernel panics in early boot: IO-APIC + timer doesn't work
Product: Platform Specific/Hardware
Reporter: Daniel Vetter (daniel)
dmesg with acpi=debug
dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
dmesg with "debug acpi=debug hpet=verbose", broken kernel
dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
dmesg of 8d6f... with "debug hpet=verbose"
dmesg of 23e2... with "debug hpet=verbose"
lspci -nnxxx as root
[PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
Description Daniel Vetter 2009-03-28 19:00:13 UTC
Created attachment 20718 [details] dmesg with acpi=debug

Last working kernel version: 2.6.29-rc8-gitsomething

Hardware: Tyan K8W motherboard with two 1.8 GHz dual-core Opterons and 2 GB of RAM per node. The motherboard uses the AMD 8xxx chipsets for AGP, PCI-X and the southbridge.

Software: Debian unstable, 64-bit.

The kernel crashes with the following message (followed by a backtrace):

[ 0.008000] ENABLING IO-APIC IRQs
[ 0.008000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[ 0.008000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.008000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.008000] ..... (found apic 0 pin 0) ...
[ 0.008000] ....... failed.
[ 0.008000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.008000] ..... failed.
[ 0.008000] ...trying to set up timer as ExtINT IRQ...
[ 0.008000] ..... failed :(.
[ 0.008000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a r.

Attached is the complete dmesg as captured via serial console (with acpi=debug).
Comment 1 Daniel Vetter 2009-04-02 15:01:06 UTC
I've just tested v2.6.29-7100-g833bb30. The problem still exists. Does anyone have an idea or should I start bisecting?
Comment 2 Rafael J. Wysocki 2009-04-03 21:37:48 UTC
Does v2.6.29 without any patches on top work?
Comment 3 Daniel Vetter 2009-04-04 18:35:53 UTC
On Fri, Apr 03, 2009 at 09:37:48PM +0000, firstname.lastname@example.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12961
> --- Comment #2 from Rafael J. Wysocki <email@example.com> 2009-04-03 21:37:48 ---
> Does v2.6.29 without any patches on top work?

v2.6.29-rc8-223-ga1e4ee2 does work, and there is nothing related to timers/irq or x86 between that specific version and 2.6.29. So v2.6.29 most likely works, too. I'm gonna bisect now, because it looks like there's no obvious culprit at hand.
Comment 4 Daniel Vetter 2009-04-05 12:39:16 UTC
I've bisected and tracked down the problem to the following commit:

commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Author: Andreas Herrmann <firstname.lastname@example.org>
Date: Sat Feb 21 00:10:44 2009 +0100
x86: hpet: provide separate functions to stop and start the counter

Yep, I've looked at the patch, and there is nothing suspicious in there. I then reverted this patch on top of v2.6.29-7100-g833bb30. This yielded a non-compilable kernel, so I additionally reverted

commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
Author: Andreas Herrmann <email@example.com>
Date: Sat Feb 21 00:16:35 2009 +0100
x86: hpet: stop HPET_COUNTER when programming periodic mode

With these two patches reverted, the kernel boots fine and everything works. I'll post the dmesg of this working kernel shortly.
Comment 5 Daniel Vetter 2009-04-05 12:41:24 UTC
Created attachment 20811 [details] dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
Comment 6 Rafael J. Wysocki 2009-04-05 17:34:36 UTC
First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
Comment 7 Daniel Vetter 2009-04-05 17:56:28 UTC
On Sun, Apr 05, 2009 at 05:34:36PM +0000, firstname.lastname@example.org wrote:
> First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403

That's wrong. First-Bad-Commit is 8d6f0c8214928f7c5083dd54ecb69c5d615b516e

I only had to revert them both because c23... depends on 8d6... for the kernel to compile at all. The only thing 8d6... does is split a function into two parts and replace the original function with one that calls both parts. So most likely the bug is somewhere else entirely and was only uncovered by a small timing/code layout change.
Comment 8 Rafael J. Wysocki 2009-04-05 18:32:31 UTC
First-Bad-Commit : 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Comment 9 herrmann.der.user 2009-04-06 12:42:53 UTC
Daniel, can you please retest both kernels, with and without commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e, but this time with the kernel parameter hpet=verbose? Thanks.
Comment 10 herrmann.der.user 2009-04-06 12:45:13 UTC
Further kernel parameters you should use are "debug" and "apic=debug".
Comment 11 Daniel Vetter 2009-04-16 14:03:05 UTC
Created attachment 21013 [details] dmesg with "debug acpi=debug hpet=verbose", broken kernel
Comment 12 Daniel Vetter 2009-04-16 14:04:35 UTC
Created attachment 21014 [details] dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
Comment 13 herrmann.der.user 2009-04-16 16:40:35 UTC
Thanks for the logs. (BTW, I meant apic=debug, not acpi=debug, but it doesn't matter, as the problem seems to be unrelated to the APIC configuration.)

I assume that the problem is the HPET of your chipset. The diff between both dmesgs shows (- == working, + == broken):

-[ 0.004000] hpet: hpet_set_mode(325):
+[ 0.004000] hpet: hpet_set_mode(328):
 [ 0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
 [ 0.004000] hpet: CFG: 0x3, STATUS: 0x1
-[ 0.004000] hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
+[ 0.004000] hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
 [ 0.004000] hpet: T0: CFG_l: 0x261c, CFG_h: 0xfdefa
-[ 0.004000] hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
+[ 0.004000] hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0

In the broken case the timer comparator (0xdfb7) is less than the counter value (0x2cd4d). The counter should have been reset, as in the following example:

hpet: hpet_set_mode(328):
hpet: ID: 0x43538301, PERIOD: 0x429b17e
hpet: CFG: 0x3, STATUS: 0x0
hpet: COUNTER_l: 0x1c80, COUNTER_h: 0x0
hpet: T0: CFG_l: 0x1c, CFG_h: 0xc0ffff
hpet: T0: CMP_l: 0x37ee, CMP_h: 0x0

This means that the reset of the counter did not work properly. I'll check the errata information for your chipset.

Last but not least, can you please confirm that commit 8d6f0c8214928f7c508 (which doesn't contain functional changes) introduced the problem on your system? Can you do an additional test with the kernel obtained with

# git checkout 8d6f0c8214928f7c5083dd54ecb69c5d615b516e

and the kernel parameters "debug hpet=verbose", so that I can also compare the dmesg of this test with that of the working test run?
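The comparison made in the comment above (is the T0 comparator behind the main counter?) can be checked mechanically. The following is a minimal Python sketch, assuming the `hpet=verbose` line layout quoted in this thread and a single timer (T0). The function names are made up for illustration and are not part of any kernel tooling. It also converts the PERIOD register, which holds the main counter tick length in femtoseconds, into a clock frequency.

```python
import re

# Matches "NAME: 0x..." pairs as printed in the hpet=verbose lines quoted above.
FIELD = re.compile(r"(\w+): (0x[0-9a-fA-F]+)")

# Register values copied from the working (-) and broken (+) dumps in this thread.
WORKING = """
[ 0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
[ 0.004000] hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
[ 0.004000] hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
"""
BROKEN = """
[ 0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
[ 0.004000] hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
[ 0.004000] hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0
"""


def parse_dump(text):
    """Collect every NAME: 0x... pair into a dict of ints (one timer only)."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def counter(regs):
    """Combine the low and high halves of the main counter."""
    return (regs["COUNTER_h"] << 32) | regs["COUNTER_l"]


def comparator(regs):
    """Combine the low and high halves of the T0 comparator."""
    return (regs["CMP_h"] << 32) | regs["CMP_l"]


def comparator_behind_counter(regs):
    """True if the comparator value is below the current counter value."""
    return comparator(regs) < counter(regs)


def frequency_hz(regs):
    """PERIOD is the tick length in femtoseconds, so f = 1e15 / PERIOD."""
    return 10**15 / regs["PERIOD"]


def ticks_to_ms(ticks, regs):
    """Convert a tick count to milliseconds (1 ms = 1e12 fs)."""
    return ticks * regs["PERIOD"] / 10**12


if __name__ == "__main__":
    for label, text in (("working", WORKING), ("broken", BROKEN)):
        regs = parse_dump(text)
        print(label,
              "| comparator behind counter:", comparator_behind_counter(regs),
              "| clock: %.3f MHz" % (frequency_hz(regs) / 1e6),
              "| comparator as time: %.2f ms" % ticks_to_ms(comparator(regs), regs))
```

With the quoted PERIOD the clock works out to about 14.318 MHz, and the broken comparator value 0xdfb7 corresponds to roughly 4 ms. Reading that as one timer tick counted from zero is an inference from the numbers, not something stated in the thread.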
Comment 14 herrmann.der.user 2009-04-16 18:03:32 UTC
And please attach also output of # lspci -nnxxx
Comment 15 Daniel Vetter 2009-04-16 19:04:31 UTC
Created attachment 21016 [details] dmesg of 8d6f... with "debug hpet=verbose"
Comment 16 Daniel Vetter 2009-04-16 19:21:27 UTC
Created attachment 21017 [details] dmesg of 23e2... with "debug hpet=verbose"

This one actually crashes, in contrast to 8d6f..., which doesn't crash anymore. I double-checked my bisect log, and I really did note 8d6f as bad. Most likely I screwed something up, but it was the last kernel to test, so the rest of the bisection still holds. Sorry for the confusion this caused.
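Why one wrong verdict at the final bisection step shifts the result by exactly one commit, while the earlier steps remain valid, can be seen in a small sketch. This is a plain binary search in Python, not `git bisect` itself. The commit list is hypothetical apart from the two abbreviated hashes from this thread, and the placeholder names `good-*` and `later-1` are invented for illustration.

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical history; only the two short hashes come from the thread.
COMMITS = ["good-1", "good-2", "good-3", "8d6f0c8", "c23e253", "later-1"]
TRULY_BAD = {"c23e253", "later-1"}


def truthful(commit):
    """Every test run reports the real state."""
    return commit in TRULY_BAD


def one_mistake(commit):
    """Same as truthful(), except 8d6f0c8 is wrongly noted as bad."""
    return commit == "8d6f0c8" or commit in TRULY_BAD


if __name__ == "__main__":
    print("all verdicts correct:", first_bad(COMMITS, truthful))
    print("8d6f0c8 misjudged:   ", first_bad(COMMITS, one_mistake))
```

In this layout 8d6f0c8 is the last commit the search tests, so the mistaken verdict moves the answer from c23e253 to its direct parent and nothing else.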
Comment 17 Daniel Vetter 2009-04-16 19:25:44 UTC
Created attachment 21018 [details] lspci -nnxxx as root
Comment 18 herrmann.der.user 2009-04-16 19:42:56 UTC
> lspci -nnxxx as root

OK, thanks for this information. I've checked for errata wrt your chipset and HPET -- there are none.

> This one actually crashes, in contrast
> to 8d6f..., which doesn't crash anymore.

Nice to know. So the second patch causes the problem. It writes 0 to the main counter, but this doesn't seem to work on your chipset (for an unknown reason). I'll prepare a patch that stops the main counter but does not reset it while the HPET is programmed in periodic mode. This should fix the system hang that I tried to fix with commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403 and might also be compatible with your system.
Comment 19 herrmann.der.user 2009-04-21 17:58:01 UTC
Created attachment 21070 [details] [PATCH] x86: hpet: fix periodic mode programming on AMD 81xx

My observation was wrong. The reset of HPET_COUNTER works, but the period for periodic mode is not properly set. On this chipset a second write is required. I've tested the patch on a similar system with an 81xx chipset.
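The effect described above, a period that is never latched unless a second write follows, can be illustrated with a toy model. This Python sketch is not the attached patch and does not reproduce real HPET register semantics. The class, its `needs_second_write` flag and the rule "first write loads the comparator, a later write loads the period" are all assumptions made up for illustration.

```python
class ToyPeriodicTimer:
    """Hypothetical, heavily simplified periodic timer (not real HPET registers)."""

    def __init__(self, needs_second_write):
        # Assumption: a "quirky" timer latches the period only on a second write.
        self.needs_second_write = needs_second_write
        self.comparator = 0
        self.period = 0
        self.writes = 0

    def write_cmp(self, value):
        """Model a write to the comparator register."""
        self.writes += 1
        if self.writes == 1:
            self.comparator = value
            if not self.needs_second_write:
                self.period = value   # forgiving variant: one write is enough
        else:
            self.period = value       # any later write sets the period


def program(timer, delta, writes):
    """Program the timer by writing delta the given number of times."""
    for _ in range(writes):
        timer.write_cmp(delta)
    return timer


def count_interrupts(timer, horizon):
    """Count comparator matches while the counter runs from 1 to horizon."""
    fired = 0
    next_match = timer.comparator
    for counter in range(1, horizon + 1):
        if counter == next_match:
            fired += 1
            if timer.period == 0:
                break                 # no period latched: no further matches
            next_match += timer.period
    return fired


if __name__ == "__main__":
    for quirky in (False, True):
        for writes in (1, 2):
            timer = program(ToyPeriodicTimer(quirky), delta=10, writes=writes)
            print("quirky=%s writes=%d interrupts=%d"
                  % (quirky, writes, count_interrupts(timer, 100)))
```

In this model only the quirky timer programmed with a single write stops after its first interrupt. A timer that stops delivering interrupts would be consistent with the "IO-APIC + timer doesn't work" panic, although the thread itself does not spell out that chain.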
Comment 20 Daniel Vetter 2009-04-21 21:17:21 UTC
On Tue, Apr 21, 2009 at 05:58:01PM +0000, email@example.com wrote:
> --- Comment #19 from Andreas Herrmann <firstname.lastname@example.org> 2009-04-21 17:58:01 ---
> Created an attachment (id=21070)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=21070)
> [PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
>
> My observation was wrong. The reset of HPET_COUNTER works. But
> the period for periodic mode is not properly set. On this chipset
> a second write is required.

I've just tested your patch on top of latest -linus and it indeed fixes the hang. Thanks a lot for tracking down this issue. I'll close the bug report as soon as the fix has hit mainline and I've retested.

-Daniel
Comment 21 Rafael J. Wysocki 2009-04-26 11:30:37 UTC
Handled-By : Andreas Herrmann <email@example.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=21070
Comment 22 Daniel Vetter 2009-04-27 16:20:25 UTC
Just tested v2.6.30-rc3-376-ge25c2c8 in mainline and the hang is gone! - Thx, Daniel