Bug 12961

Summary: Kernel panics in early boot: IO-APIC + timer doesn't work
Product: Platform Specific/Hardware
Reporter: Daniel Vetter (daniel)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX
Severity: normal
CC: herrmann.der.user, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29-03652-g5d80f8e
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 12398
Attachments: dmesg with acpi=debug
dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
dmesg with "debug acpi=debug hpet=verbose", broken kernel
dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
dmesg of 8d6f... with "debug hpet=verbose"
dmesg of 23e2... with "debug hpet=verbose"
lspci -nnxxx as root
[PATCH] x86: hpet: fix periodic mode programming on AMD 81xx

Description Daniel Vetter 2009-03-28 19:00:13 UTC
Created attachment 20718
dmesg with acpi=debug

Last working kernel version: 2.6.29-rc8-gitsomething

Hardware:

Tyan K8W motherboard with two 1.8 GHz dual-core Opterons and 2 GB of RAM per node. The motherboard uses the AMD 8xxx chipset for AGP, PCI-X and the southbridge.

Software: Debian unstable, 64bit.

The kernel crashes with the following message (and a backtrace afterwards):

[    0.008000] ENABLING IO-APIC IRQs
[    0.008000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[    0.008000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.008000] ...trying to set up timer (IRQ0) through the 8259A ...
[    0.008000] ..... (found apic 0 pin 0) ...
[    0.008000] ....... failed.
[    0.008000] ...trying to set up timer as Virtual Wire IRQ...
[    0.008000] ..... failed.
[    0.008000] ...trying to set up timer as ExtINT IRQ...
[    0.008000] ..... failed :(.
[    0.008000] Kernel panic - not syncing: IO-APIC + timer doesn't work!  Boot with apic=debug and send a r.

Attached is the complete dmesg, captured via serial console (with acpi=debug).
Comment 1 Daniel Vetter 2009-04-02 15:01:06 UTC
I've just tested v2.6.29-7100-g833bb30. The problem still exists. Does anyone have an idea or should I start bisecting?
Comment 2 Rafael J. Wysocki 2009-04-03 21:37:48 UTC
Does v2.6.29 without any patches on top work?
Comment 3 Daniel Vetter 2009-04-04 18:35:53 UTC
On Fri, Apr 03, 2009 at 09:37:48PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12961
> --- Comment #2 from Rafael J. Wysocki <rjw@sisk.pl>  2009-04-03 21:37:48 ---
> Does v2.6.29 without any patches on top work?

v2.6.29-rc8-223-ga1e4ee2 works and there is nothing related to
timers/irq or x86 between that specific version and 2.6.29. So v2.6.29
most likely works, too. I'm going to bisect now, because there's no
obvious culprit at hand.
Comment 4 Daniel Vetter 2009-04-05 12:39:16 UTC
I've bisected and tracked down the problem to the following commit:

commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date:   Sat Feb 21 00:10:44 2009 +0100

    x86: hpet: provide separate functions to stop and start the counter

Yes, I've looked at the patch and there is nothing suspicious in it. I then reverted this patch on top of v2.6.29-7100-g833bb30, which yielded a kernel that does not compile. So I additionally reverted

commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
Author: Andreas Herrmann <andreas.herrmann3@amd.com>
Date:   Sat Feb 21 00:16:35 2009 +0100

    x86: hpet: stop HPET_COUNTER when programming periodic mode

With these two patches reverted, the kernel boots fine and everything works. I'll post the dmesg of this working kernel shortly.
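
The bisection described in this comment can be sketched as a binary search over an ordered commit history. This is only an illustration of the idea behind git bisect, not its implementation: the function name find_first_bad, the history list and its ordering, and the "unrelated" entries are hypothetical. Only the two abbreviated commit ids come from this report.

```python
def find_first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the oldest is good,
    the newest is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history; only the two abbreviated ids appear in the report.
history = ["good-base", "unrelated-1", "8d6f0c8", "c23e253", "unrelated-2", "tip"]


def fails_to_boot(commit):
    """Test verdicts if the breakage starts at c23e253."""
    return history.index(commit) >= history.index("c23e253")


def one_wrong_verdict(commit):
    """Same verdicts, except that 8d6f0c8 is wrongly recorded as bad."""
    return commit == "8d6f0c8" or fails_to_boot(commit)


print(find_first_bad(history, fails_to_boot))
print(find_first_bad(history, one_wrong_verdict))
```

With correct verdicts the search ends on c23e253; a single wrong "bad" verdict on the commit just before it moves the result one commit earlier, which is why a surprising bisection result is worth re-testing.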
Comment 5 Daniel Vetter 2009-04-05 12:41:24 UTC
Created attachment 20811
dmesg of v2.6.29-7100-g833bb30 with the 2 mentioned reverts
Comment 6 Rafael J. Wysocki 2009-04-05 17:34:36 UTC
First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
Comment 7 Daniel Vetter 2009-04-05 17:56:28 UTC
On Sun, Apr 05, 2009 at 05:34:36PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> First-Bad-Commit : c23e253e67c9d8a91a0ffa33c1f571a17f0a2403
That's wrong. First-Bad-Commit is 8d6f0c8214928f7c5083dd54ecb69c5d615b516e.
I only had to revert both because c23... depends on 8d6..., so the
kernel does not compile with 8d6... reverted alone. The only thing 8d6...
does is split a function into two parts and replace the original function
with one that calls both parts. So most likely the bug is somewhere else
entirely and was only uncovered by a small timing/code layout change.
Comment 8 Rafael J. Wysocki 2009-04-05 18:32:31 UTC
First-Bad-Commit : 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
Comment 9 herrmann.der.user 2009-04-06 12:42:53 UTC
Daniel, can you please retest both kernels, with and without commit 8d6f0c8214928f7c5083dd54ecb69c5d615b516e, this time with the
kernel parameter hpet=verbose? Thanks.
Comment 10 herrmann.der.user 2009-04-06 12:45:13 UTC
Further kernel parameters you should use are "debug" and "apic=debug".
Comment 11 Daniel Vetter 2009-04-16 14:03:05 UTC
Created attachment 21013
dmesg with "debug acpi=debug hpet=verbose", broken kernel
Comment 12 Daniel Vetter 2009-04-16 14:04:35 UTC
Created attachment 21014
dmesg with "debug acpi=debug hpet=verbose", 8d6f0c... and c23e25... reverted
Comment 13 herrmann.der.user 2009-04-16 16:40:35 UTC
Thanks for the logs.
(BTW, I meant apic=debug not acpi=debug but it doesn't matter as the
problem seems to be unrelated to apic configuration.)
I assume that the problem is the hpet of your chipset.

The diff between both dmesgs shows:
(- == working, + == broken)

-[    0.004000] hpet: hpet_set_mode(325):
+[    0.004000] hpet: hpet_set_mode(328):
 [    0.004000] hpet: ID: 0x10228203, PERIOD: 0x429b17f
 [    0.004000] hpet: CFG: 0x3, STATUS: 0x1
-[    0.004000] hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
+[    0.004000] hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
 [    0.004000] hpet: T0: CFG_l: 0x261c, CFG_h: 0xfdefa
-[    0.004000] hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
+[    0.004000] hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0

In the broken case the timer comparator is less than the counter value.
The counter should have been reset, as in the following example:

hpet: hpet_set_mode(328):
hpet: ID: 0x43538301, PERIOD: 0x429b17e
hpet: CFG: 0x3, STATUS: 0x0
hpet: COUNTER_l: 0x1c80, COUNTER_h: 0x0
hpet: T0: CFG_l: 0x1c, CFG_h: 0xc0ffff
hpet: T0: CMP_l: 0x37ee, CMP_h: 0x0

This means that reset of the counter did not work properly.
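
The comparison made above can be repeated mechanically on any hpet=verbose dump. The following is a minimal sketch, not a kernel tool: the helper names parse_hpet_dump and comparator_behind_counter are made up for this illustration, the register values are copied from the diff above, and the check looks only at the low 32 bits and ignores counter wraparound.

```python
import re


def parse_hpet_dump(text):
    """Return (COUNTER_l, T0 CMP_l) from an hpet=verbose register dump."""
    counter = re.search(r"COUNTER_l:\s*(0x[0-9a-fA-F]+)", text)
    cmp0 = re.search(r"T0: CMP_l:\s*(0x[0-9a-fA-F]+)", text)
    if counter is None or cmp0 is None:
        raise ValueError("dump lacks COUNTER_l or T0 CMP_l")
    return int(counter.group(1), 16), int(cmp0.group(1), 16)


def comparator_behind_counter(text):
    """True if the timer 0 comparator is below the main counter."""
    counter, cmp0 = parse_hpet_dump(text)
    return cmp0 < counter


# Register values taken from the working and broken dmesg excerpts above.
WORKING = """\
hpet: COUNTER_l: 0x2c90d, COUNTER_h: 0x0
hpet: T0: CMP_l: 0x5450a, CMP_h: 0x0
"""
BROKEN = """\
hpet: COUNTER_l: 0x2cd4d, COUNTER_h: 0x0
hpet: T0: CMP_l: 0xdfb7, CMP_h: 0x0
"""

print(comparator_behind_counter(WORKING))
print(comparator_behind_counter(BROKEN))
```

Only the broken dump is flagged: 0xdfb7 is below 0x2cd4d, while 0x5450a in the working dump is ahead of 0x2c90d.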

I'll check errata information for your chipset.

Last but not least, can you please confirm that commit
8d6f0c8214928f7c508 (which doesn't contain functional changes)
introduced the problem on your system?

Can you do an additional test with the kernel obtained with
# git checkout 8d6f0c8214928f7c5083dd54ecb69c5d615b516e
and kernel parameters "debug hpet=verbose", so that I can
also compare the dmesg of this test with that of the working
test run?
Comment 14 herrmann.der.user 2009-04-16 18:03:32 UTC
Please also attach the output of

 # lspci -nnxxx
Comment 15 Daniel Vetter 2009-04-16 19:04:31 UTC
Created attachment 21016
dmesg of 8d6f... with "debug hpet=verbose"
Comment 16 Daniel Vetter 2009-04-16 19:21:27 UTC
Created attachment 21017
dmesg of 23e2... with "debug hpet=verbose"

This one actually crashes, in contrast to 8d6f..., which doesn't crash anymore. I double-checked my bisect log, and I really did note 8d6f as bad. Most likely I screwed something up, but it was the last kernel I tested, so the rest of the bisection still holds. Sorry for the confusion this caused.
Comment 17 Daniel Vetter 2009-04-16 19:25:44 UTC
Created attachment 21018
lspci -nnxxx as root
Comment 18 herrmann.der.user 2009-04-16 19:42:56 UTC
> lspci -nnxxx as root

Ok, thanks for this information.
I've checked for errata wrt your chipset and hpet -- there are none.

> This one actually crashes, in contrast
> to 8d6f..., which doesn't crash anymore.

Good to know. So the second patch causes the problem.
It writes 0 to the main counter, but this doesn't seem
to work on your chipset (for an unknown reason).

I'll prepare a patch which stops the main counter but does not
reset it while the HPET is programmed in periodic mode.
This should fix the system hang that I tried to fix with
commit c23e253e67c9d8a91a0ffa33c1f571a17f0a2403 and might also
be compatible with your system.
Comment 19 herrmann.der.user 2009-04-21 17:58:01 UTC
Created attachment 21070
[PATCH] x86: hpet: fix periodic mode programming on AMD 81xx

My observation was wrong. The reset of HPET_COUNTER works. But
the period for periodic mode is not properly set. On this chipset
a second write is required.

I've tested the patch on a similar system with 81xx chipset.
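
As a rough illustration of why a second write could matter, here is a toy Python model. It is not the attached patch, not kernel code, and not a description of real AMD 81xx behavior: the class ToyPeriodicTimer, its methods, and the rule that a "set value" flag is consumed by the first comparator write are all assumptions made for this sketch. The comment above only states that a second write is required to set the period.

```python
class ToyPeriodicTimer:
    """Hypothetical timer: one comparator write is not enough for a period."""

    def __init__(self):
        self.setval = False   # assumed "set value" flag
        self.comparator = 0   # next counter value that fires an interrupt
        self.period = 0       # 0 means the period was never programmed

    def write_cfg(self, setval):
        self.setval = setval

    def write_cmp(self, value):
        if self.setval:
            # Assumption: with the flag set, the write loads only the
            # comparator, and the flag is consumed by that write.
            self.comparator = value
            self.setval = False
        else:
            # Assumption: with the flag cleared, the write sets the period.
            self.period = value


def program_periodic(timer, now, delta, second_write):
    """Program periodic mode with one or two comparator writes."""
    timer.write_cfg(setval=True)
    timer.write_cmp(now + delta)   # first write: first expiry
    if second_write:
        timer.write_cmp(delta)     # second write: period
    return timer


single = program_periodic(ToyPeriodicTimer(), now=100, delta=50, second_write=False)
double = program_periodic(ToyPeriodicTimer(), now=100, delta=50, second_write=True)
print(single.comparator, single.period)
print(double.comparator, double.period)
```

In this model the single-write sequence leaves the period unprogrammed, while the two-write sequence sets both the first expiry and the period.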
Comment 20 Daniel Vetter 2009-04-21 21:17:21 UTC
On Tue, Apr 21, 2009 at 05:58:01PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #19 from Andreas Herrmann <andreas.herrmann3@amd.com>  2009-04-21
> 17:58:01 ---
> Created an attachment (id=21070)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=21070)
>  [PATCH] x86: hpet: fix periodic mode programming on AMD 81xx
> 
> My observation was wrong. The reset of HPET_COUNTER works. But
> the period for periodic mode is not properly set. On this chipset
> a second write is required.

I've just tested your patch on top of latest -linus and it indeed fixes
the hang. Thanks a lot for tracking down this issue. I'll close the bug
report as soon as the fix has hit mainline and I've retested.

-Daniel
Comment 21 Rafael J. Wysocki 2009-04-26 11:30:37 UTC
Handled-By : Andreas Herrmann <andreas.herrmann3@amd.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=21070
Comment 22 Daniel Vetter 2009-04-27 16:20:25 UTC
Just tested v2.6.30-rc3-376-ge25c2c8 in mainline and the hang is gone!

- Thx, Daniel