Created attachment 20718 [details]
dmesg with acpi=debug
Last working kernel version: 2.6.29-rc8-gitsomething
Tyan K8W motherboard with two 1.8 GHz dual-core Opterons and 2 GB of RAM per node. The motherboard uses the AMD 8xxx chipsets for AGP, PCI-X and southbridge.
Software: Debian unstable, 64bit.
The kernel crashes with the following message (and a backtrace afterwards):
[ 0.008000] ENABLING IO-APIC IRQs
[ 0.008000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[ 0.008000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.008000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.008000] ..... (found apic 0 pin 0) ...
[ 0.008000] ....... failed.
[ 0.008000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.008000] ..... failed.
[ 0.008000] ...trying to set up timer as ExtINT IRQ...
[ 0.008000] ..... failed :(.
[ 0.008000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a r.
Attached is the complete dmesg as captured via serial console (with acpi=debug).
I've just tested v2.6.29-7100-g833bb30. The problem still exists. Does anyone have an idea or should I start bisecting?
Does v2.6.29 without any patches on top work?
On Fri, Apr 03, 2009 at 09:37:48PM +0000, email@example.com wrote:
> --- Comment #2 from Rafael J. Wysocki <firstname.lastname@example.org> 2009-04-03 21:37:48 ---
> Does v2.6.29 without any patches on top work?
v2.6.29-rc8-223-ga1e4ee2 does work and there is nothing related to
timers/irq or x86 between that specific version and 2.6.29. So v2.6.29
most likely works, too. I'm gonna bisect now, because it looks like
there's no obvious culprit at hand.
I've bisected and tracked down the problem to the following commit:
Author: Andreas Herrmann <email@example.com>
Date: Sat Feb 21 00:10:44 2009 +0100
x86: hpet: provide separate functions to stop and start the counter
Yep, I've looked at the patch and nothing in it looks suspicious. I then reverted this patch on top of v2.6.29-7100-g833bb30. This yielded a kernel that did not compile. So I additionally reverted
Author: Andreas Herrmann <firstname.lastname@example.org>
Date: Sat Feb 21 00:16:35 2009 +0100
x86: hpet: stop HPET_COUNTER when programming periodic mode
With these two patches reverted, the kernel boots fine and everything works. I'll post the dmesg of this working kernel shortly.
Created attachment 20811 [details]
dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
On Sun, Apr 05, 2009 at 05:34:36PM +0000, email@example.com wrote:
> First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
That's wrong. First-Bad-Commit is 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
I just had to revert them both because c23... depends on 8d6... to
compile the kernel at all. The only thing 8d6... does is split a function
in two parts and replace the original function with one calling both
parts. So most likely the bug is somewhere else entirely and was only
uncovered by a small timing/code layout change.
First-Bad-Commit : 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Daniel, can you please retest both kernels with and without commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e but this time with
kernel parameter hpet=verbose. Thanks.
Further kernel parameters you should use are "debug" and "apic=debug".
Created attachment 21013 [details]
dmesg with "debug acpi=debug hpet=verbose", broken kernel
Created attachment 21014 [details]
dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
Thanks for the logs.
(BTW, I meant apic=debug not acpi=debug but it doesn't matter as the
problem seems to be unrelated to apic configuration.)
I assume that the problem is the hpet of your chipset.
The diff between both dmesgs shows:
(- == working, + == broken)
-[ 0.004000] hpet: hpet_set_mode(325):
+[ 0.004000] hpet: hpet_set_mode(328):
[ 0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
[ 0.004000] hpet: CFG: 0x3, STATUS: 0x1
-[ 0.004000] hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
+[ 0.004000] hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
[ 0.004000] hpet: T0: CFG_l: 0x261c, CFG_h: 0xfdefa
-[ 0.004000] hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
+[ 0.004000] hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0
In the broken case the timer comparator is less than the counter value.
The counter should have been reset, as in the following example:
hpet: ID: 0x43538301, PERIOD: 0x429b17e
hpet: CFG: 0x3, STATUS: 0x0
hpet: COUNTER_l: 0x1c80, COUNTER_h: 0x0
hpet: T0: CFG_l: 0x1c, CFG_h: 0xc0ffff
hpet: T0: CMP_l: 0x37ee, CMP_h: 0x0
This means that the reset of the counter did not work properly.
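The comparison above can be reproduced from the register dumps. The following is a minimal Python sketch, not kernel code. The helper names (parse_hpet_dump, comparator_ahead, frequency_hz) are made up for illustration, and only the lower 32 bits are compared, which is enough here because COUNTER_h and CMP_h are zero in all three dumps. It also converts PERIOD to a frequency, relying on the HPET reporting its period in femtoseconds.

```python
import re

# Register values copied from the hpet=verbose excerpts quoted above.
DUMPS = {
    "working": (
        "hpet: ID: 0x10228203, PERIOD: 0x429b17f\n"
        "hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0\n"
        "hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0\n"
    ),
    "broken": (
        "hpet: ID: 0x10228203, PERIOD: 0x429b17f\n"
        "hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0\n"
        "hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0\n"
    ),
    "reference": (
        "hpet: ID: 0x43538301, PERIOD: 0x429b17e\n"
        "hpet: COUNTER_l: 0x1c80, COUNTER_h: 0x0\n"
        "hpet: T0: CMP_l: 0x37ee, CMP_h: 0x0\n"
    ),
}

FEMTOSECONDS_PER_SECOND = 10**15


def parse_hpet_dump(text):
    """Map every 'NAME: 0x...' pair in the dump to an integer (hypothetical helper)."""
    pairs = re.findall(r"(\w+): (0x[0-9a-fA-F]+)", text)
    return {name: int(value, 16) for name, value in pairs}


def comparator_ahead(regs):
    """True if the timer 0 comparator is ahead of the main counter (low 32 bits only)."""
    return regs["CMP_l"] > regs["COUNTER_l"]


def frequency_hz(regs):
    """Counter frequency derived from PERIOD, which is given in femtoseconds."""
    return FEMTOSECONDS_PER_SECOND / regs["PERIOD"]


for label, text in DUMPS.items():
    regs = parse_hpet_dump(text)
    print(label, hex(regs["COUNTER_l"]), hex(regs["CMP_l"]),
          comparator_ahead(regs), round(frequency_hz(regs)))
```

Only the broken dump has the comparator behind the counter. The check says nothing about why that happened; it just makes the observation easy to repeat on other logs.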
I'll check errata information for your chipset.
Last but not least, can you please confirm that commit
8d6f0c8214928f7c508 (which doesn't contain functional changes)
introduced the problem on your system:
Can you do an additional test with the kernel obtained with
# git checkout 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
and kernel parameters "debug hpet=verbose" so that I can
also compare the dmesg of this test with that of the working kernel?
Please also attach the output of
# lspci -nnxxx
Created attachment 21016 [details]
dmesg of 8d6f... with "debug hpet=verbose"
Created attachment 21017 [details]
dmesg of c23e... with "debug hpet=verbose"
This one actually crashes, in contrast to 8d6f..., which doesn't crash anymore. I double-checked with my bisect-log, and I really noted 8d6f as bad. Most likely I screwed something up, but it was the last kernel to test, so the rest still holds. Sorry for the confusion this caused.
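To illustrate why a single wrong label on the last bisect step only shifts the result by one commit, here is a small Python sketch of a binary search over a linear history. It is a simplification and not how git bisect is implemented (git also has to handle merges). The names "known-good", "later-commit" and "known-bad" are hypothetical placeholders, and placing 8d6f0c8 directly before c23e253 is an assumption.

```python
def first_bad(commits, is_bad):
    """Binary search over a linear history, oldest first.

    commits[0] is assumed good and commits[-1] is assumed bad.
    Returns the first bad commit and the list of commits that were tested.
    """
    good, bad = 0, len(commits) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(commits[mid])
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tested


# Hypothetical linear history; only the two abbreviated hashes come from this report.
HISTORY = ["known-good", "8d6f0c8", "c23e253", "later-commit", "known-bad"]
ACTUALLY_BAD = {"c23e253", "later-commit", "known-bad"}


def accurate(commit):
    """Test result that matches reality."""
    return commit in ACTUALLY_BAD


def mislabelled(commit):
    """Same as accurate(), except 8d6f0c8 is wrongly recorded as bad."""
    return commit == "8d6f0c8" or commit in ACTUALLY_BAD


print(first_bad(HISTORY, accurate))
print(first_bad(HISTORY, mislabelled))
```

In both runs the same commits get tested, with 8d6f0c8 last. The accurate run ends on c23e253 and the mislabelled run ends on 8d6f0c8, so every earlier step of the bisection remains valid.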
Created attachment 21018 [details]
lspci -nnxxx as root
> lspci -nnxxx as root
Ok, thanks for this information.
I've checked for errata wrt your chipset and hpet -- there are none.
> This one actually crashes, in contrast
> to 8d6f..., which doesn't crash anymore.
Nice to know. So, the second patch causes the problem.
It writes 0 to the main counter but this doesn't seem
to work on your chipset (for an unknown reason).
I'll prepare a patch which stops the main counter but does not
reset it while the HPET is programmed in periodic mode.
This should fix the system hang that I tried to fix with
commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403 and might also
be compatible with your system.
Created attachment 21070 [details]
[PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
My observation was wrong. The reset of HPET_COUNTER works. But
the period for periodic mode is not properly set. On this chipset
a second write is required.
I've tested the patch on a similar system with 81xx chipset.
On Tue, Apr 21, 2009 at 05:58:01PM +0000, firstname.lastname@example.org wrote:
> --- Comment #19 from Andreas Herrmann <email@example.com> 2009-04-21
> 17:58:01 ---
> Created an attachment (id=21070)
> --> (http://bugzilla.kernel.org/attachment.cgi?id=21070)
> [PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
> My observation was wrong. The reset of HPET_COUNTER works. But
> the period for periodic mode is not properly set. On this chipset
> a second write is required.
I've just tested your patch on top of latest -linus and it indeed fixes
the hang. Thanks a lot for tracking down this issue. I'll close the bug
report as soon as the fix has hit mainline and I've retested.
Handled-By : Andreas Herrmann <firstname.lastname@example.org>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=21070
Just tested v2.6.30-rc3-376-ge25c2c8 in mainline and the hang is gone!
- Thx, Daniel