Bug 15289 - Regression 2.6.32 -> 2.6.33, Kernel needs a helping key to boot :) (C1E)
Status: ASSIGNED
Product: Timers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assigned To: john stultz
Duplicates: 15476 29952
Reported: 2010-02-12 19:47 UTC by Asbjørn Sannes
Modified: 2015-02-19 15:40 UTC (History)
Kernel Version: 3.15
Tree: Mainline
Regression: Yes


Attachments
dmesg of plain 2.6.33rc8 with problem (81.51 KB, text/plain)
2010-02-13 09:53 UTC, Asbjørn Sannes
dmesg of plain 2.6.33rc8 with hpet=disabled and problem (81.52 KB, text/plain)
2010-02-13 09:56 UTC, Asbjørn Sannes
dmesg of 2.6.33rc8 with 73472a46b5b28116b145fb5fc05242c1aa8e1461 reverted, works (81.46 KB, text/plain)
2010-02-13 09:59 UTC, Asbjørn Sannes
my kernel config (69.68 KB, text/plain)
2010-02-13 10:12 UTC, Asbjørn Sannes
/proc/interrupts in plain 2.6.33rc8 (not working) (1.99 KB, text/plain)
2010-02-13 10:14 UTC, Asbjørn Sannes
/proc/interrupts in plain 2.6.33rc8 with 73472a46b5b28116b145fb5fc05242c1aa8e1461 reverted (working) (2.06 KB, text/plain)
2010-02-13 10:15 UTC, Asbjørn Sannes
GA-MA78GM-S2H lspci (111.65 KB, text/plain)
2010-02-17 13:18 UTC, Yanko Kaneti
GA-MA790FXT-UD5P lspci (201.40 KB, text/plain)
2010-02-18 06:16 UTC, Asbjørn Sannes
dmesg of plain 2.6.33rc8 with apic=debug, hpet=verbose and problem on a GA-MA790FXT-UD5P (96.38 KB, text/plain)
2010-02-18 06:19 UTC, Asbjørn Sannes
dmesg of 2.6.33rc8 with apic=debug, hpet=verbose and 7347.. reverted on GA-MA790FXT-UD5P (92.36 KB, text/plain)
2010-02-18 06:22 UTC, Asbjørn Sannes
2.6.32.8 on GA-MA78GM-S2H with apic=debug and hpet=verbose (problem) (52.14 KB, text/plain)
2010-02-18 15:09 UTC, Yanko Kaneti
GA-MA78GM-S2H lspci verbose under 2.6.32.8 (problem) (111.65 KB, text/plain)
2010-02-18 15:11 UTC, Yanko Kaneti
2.6.32.7 on the same hardware with apic=debug and hpet=verbose (no problem) (52.76 KB, text/plain)
2010-02-18 15:13 UTC, Yanko Kaneti
x86, hpet: disable MSI on ATI SBX00 (2.41 KB, patch)
2010-02-23 14:00 UTC, herrmann.der.user
apic=debug hpet=vervose with attachment 25176 applied (50.02 KB, text/plain)
2010-02-23 17:09 UTC, Yanko Kaneti
test patch to set physical APIC ID of IOAPIC(0) to 4 (973 bytes, text/plain)
2010-03-02 18:21 UTC, herrmann.der.user
lsmsr with idle=mwait (548 bytes, text/plain)
2010-03-04 08:41 UTC, Yanko Kaneti
default dmesg log with freeze on MA-790X-UD4 (102.05 KB, text/plain)
2010-11-02 09:52 UTC, Yann Droneaud
dmesg log without freeze on MA-790X-UD4 with nmi_watchdog=panic,ioapic (103.43 KB, text/plain)
2010-11-02 10:00 UTC, Yann Droneaud
dmesg log without freeze on MA-790X-UD4 with irqpoll (102.31 KB, text/plain)
2010-11-02 10:05 UTC, Yann Droneaud

Description Asbjørn Sannes 2010-02-12 19:47:41 UTC
When booting a 2.6.33rc[567] kernel, it stops and hangs; if I press a key (even the soft power button), it continues. As long as I feed it interrupts, it continues happily.

Anyway, bisecting with git found:
73472a46b5b28116b145fb5fc05242c1aa8e1461
x86: Disable HPET MSI on ATI SB700/SB800

Reverting that from 2.6.33rc7 has made it work for me for the last week at least.
Also, that is the same chipset I have :) Hope this helps.

If you have a fix you want to test, I'll test it.
Comment 1 john stultz 2010-02-12 20:23:02 UTC
Asbjørn: From the bisection point, I assume you have an ATI SB700/SB800 board? Also do the hangs continue to occur after boot, or just at boot time?

Could you attach dmesg output as well as "cat /sys/devices/system/clocksource/clocksource0/current_clocksource" output once the system is up?


Venkatesh: Any thoughts here? Shouldn't the hpet work ok without msi?
Comment 2 Venkatesh Pallipadi 2010-02-12 22:46:18 UTC
Hmm. This is strange.

Without the patch the clockevents will be
HPET 0 used for platform timer in legacy mode, and should also be the broadcast timer
APIC timers used on each CPU and used as percpu clockevent
HPET2 will be used as MSI per CPU timers and attached to CPU 0. This is meant to be used as per CPU timer so that we don't need to use APIC timer broadcast. IIRC, there are no HPET3, HPET4, etc supported on this platform.

With this patch
HPET 0 used for platform timer in legacy mode, and should also be the broadcast timer
APIC timers used on each CPU and used as percpu clockevent


So, if the system has a problem, that means broadcast logic has a problem or HPET 0 also has some problem on this system.

Some data that will be interesting
- Does the system support deep C-states? Otherwise, things should just work fine with APIC timer, without ever needing HPET.
(grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/* )

- How does /proc/interrupts look with and without this patch, after complete bootup

- dmesg as john mentioned, with and without the patch.

- Can you also try hpet=disable boot option and see whether that makes any difference in the failing case.
Comment 3 Rafael J. Wysocki 2010-02-13 00:07:29 UTC
First-Bad-Commit : 73472a46b5b28116b145fb5fc05242c1aa8e1461

Bisected to:

commit 73472a46b5b28116b145fb5fc05242c1aa8e1461
Author: Pallipadi, Venkatesh <venkatesh.pallipadi@intel.com>
Date:   Thu Jan 21 11:09:52 2010 -0800

    x86: Disable HPET MSI on ATI SB700/SB800

    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
    Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Comment 4 Asbjørn Sannes 2010-02-13 09:53:06 UTC
Created attachment 25022 [details]
dmesg of plain 2.6.33rc8 with problem
Comment 5 Asbjørn Sannes 2010-02-13 09:56:59 UTC
Created attachment 25023 [details]
dmesg of plain 2.6.33rc8 with hpet=disabled and problem
Comment 6 Asbjørn Sannes 2010-02-13 09:59:25 UTC
Created attachment 25024 [details]
dmesg of 2.6.33rc8 with 73472a46b5b28116b145fb5fc05242c1aa8e1461 reverted, works
Comment 7 Asbjørn Sannes 2010-02-13 10:08:11 UTC
It often occurs more than once during boot. When there are lots of things running it seems not to happen, although it almost always happens again during shutdown, where I have to help it by pressing a key or similar.

Using hpet=disable does not help and the exact same thing happens.

Clock sources (/sys/devices/system/clocksource/clocksource0/current_clocksource):
plain 2.6.33rc8: tsc
2.6.33rc8 hpet=disable: acpi_pm
2.6.33rc8 with 73472a.. reverted: tsc
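
The clocksource values above come from the sysfs file John asked for. A minimal Python sketch for collecting them on each test kernel could look like the following; the helper name read_sysfs_value is made up for illustration, and the available_clocksource file is assumed to sit next to current_clocksource, as it does on kernels of this era.

```python
from pathlib import Path

# Directory quoted in comment 1; "available_clocksource" alongside it is an assumption.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_sysfs_value(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


print("current:  ", read_sysfs_value(CLOCKSOURCE_DIR / "current_clocksource"))
print("available:", read_sysfs_value(CLOCKSOURCE_DIR / "available_clocksource"))
```

On a machine without that sysfs directory the sketch prints None rather than failing, which keeps it usable in a loop over several boots.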
Comment 8 Asbjørn Sannes 2010-02-13 10:12:22 UTC
Created attachment 25025 [details]
my kernel config
Comment 9 Asbjørn Sannes 2010-02-13 10:14:06 UTC
Created attachment 25026 [details]
/proc/interrupts in plain 2.6.33rc8 (not working)
Comment 10 Asbjørn Sannes 2010-02-13 10:15:09 UTC
Created attachment 25027 [details]
/proc/interrupts in plain 2.6.33rc8 with 73472a46b5b28116b145fb5fc05242c1aa8e1461 reverted (working)
Comment 11 Asbjørn Sannes 2010-02-13 10:20:12 UTC
Could not find any C states, seems I do not have a cpuidle directory within /sys/devices/system/cpu/cpu*/, so :
$ grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/*
grep: /sys/devices/system/cpu/cpu*/cpuidle/state*/*: No such file or directory

thank you for looking into it :)
Comment 12 Yanko Kaneti 2010-02-15 13:14:31 UTC
Similar symptoms here with the Fedora 2.6.33-rc kernels around rc5 and later; I haven't tried reverting the patch in question yet. ATI SB700/SB800 (Gigabyte GA-MA78GM-S2H)
Comment 13 Venkatesh Pallipadi 2010-02-17 00:04:36 UTC
Maybe related to C1E. Can you try the idle=halt boot option on the non-working kernel (without the hpet=disable option)?

Andreas: Have you seen any issues like this?
Looks like IRQ0 does not work correctly here, both with HPET and PIT
  0:        351         11      36077        745   IO-APIC-edge      timer

And the platform needs this
  24:      11040          0          0          0  HPET_MSI-edge      hpet2
to function correctly.
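
For anyone comparing the two /proc/interrupts attachments by hand, here is a small Python sketch that splits a row into its per-CPU counters. It is only an illustration: the function name parse_interrupt_line is made up, and it assumes the usual layout of an IRQ label, one counter per CPU, then the controller and device names.

```python
def parse_interrupt_line(line):
    """Split one /proc/interrupts row into (irq, per_cpu_counts, description)."""
    irq, rest = line.split(":", 1)
    fields = rest.split()
    counts = []
    # Leading all-digit fields are the per-CPU counters; the rest names the source.
    while fields and fields[0].isdigit():
        counts.append(int(fields.pop(0)))
    return irq.strip(), counts, " ".join(fields)


# The two rows quoted in the comment above.
rows = [
    "  0:        351         11      36077        745   IO-APIC-edge      timer",
    "  24:      11040          0          0          0  HPET_MSI-edge      hpet2",
]

for row in rows:
    irq, counts, desc = parse_interrupt_line(row)
    print(f"IRQ {irq}: total={sum(counts)} per-cpu={counts} ({desc})")
```

Running it against snapshots taken a few seconds apart would show whether the IRQ0 counters are still advancing while the machine appears stalled.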
Comment 14 Yanko Kaneti 2010-02-17 08:01:58 UTC
idle=halt does not go very far here. First there is a stall that's broken by a key press or mouse movement, and then it seems to lock up after
  NET: Registered protocol family 1
which in a normal fedora boot is usually followed by 
 pci 0000:01:05.0: Boot video device
 Trying to unpack rootfs image as initramfs...
....

FWIW, stable 2.6.32.8 now also exhibits this, presumably because the hpet patch was pulled in there. The behavior is the same as with 2.6.33rc.

2.6.32.7 with idle=halt has the same single stall but once broken with key or mouse it goes on without further problems.
Comment 15 herrmann.der.user 2010-02-17 12:54:52 UTC
Can you please provide output of "lspci -nn -xxxx"
from your system. Thanks.
Comment 16 Yanko Kaneti 2010-02-17 13:18:55 UTC
Created attachment 25090 [details]
GA-MA78GM-S2H lspci
Comment 17 herrmann.der.user 2010-02-17 20:39:12 UTC
I've checked the lspci output for HPET relevant settings.
Everything seems ok (e.g. interrupt routing).

I've also checked the code and at the moment it's not clear to me why your
system shows these hiccups.

To gather more debug information it would be nice if you could boot
with apic=debug parameter (to double check IO-APIC/PIC settings) -
both in the working and non-working case.

And while doing this you can also add "hpet=verbose" to your command line.

Maybe that gives some insight what's wrong with the latest kernel on
your system.

And another question.
Do you have the latest BIOS for your system installed?
Comment 18 Asbjørn Sannes 2010-02-18 06:14:32 UTC
idle=halt makes it stop for me too.
I got a GA-MA790FXT-UD6P (BIOS version F7), which is the second to newest.
Comment 19 Asbjørn Sannes 2010-02-18 06:16:46 UTC
Created attachment 25098 [details]
GA-MA790FXT-UD5P lspci
Comment 20 Asbjørn Sannes 2010-02-18 06:19:51 UTC
Created attachment 25099 [details]
dmesg of plain 2.6.33rc8 with apic=debug, hpet=verbose and problem on a GA-MA790FXT-UD5P
Comment 21 Asbjørn Sannes 2010-02-18 06:22:05 UTC
Created attachment 25100 [details]
dmesg of 2.6.33rc8 with apic=debug, hpet=verbose and 7347.. reverted on GA-MA790FXT-UD5P
Comment 22 Yanko Kaneti 2010-02-18 15:09:42 UTC
Created attachment 25101 [details]
2.6.32.8 on GA-MA78GM-S2H  with apic=debug and hpet=verbose (problem)
Comment 23 Yanko Kaneti 2010-02-18 15:11:54 UTC
Created attachment 25102 [details]
GA-MA78GM-S2H lspci verbose  under 2.6.32.8 (problem)

I realized that the previous lspci was done under an ok (2.6.32.7) kernel, here is with the problem one.
Comment 24 Yanko Kaneti 2010-02-18 15:13:28 UTC
Created attachment 25103 [details]
2.6.32.7 on the same hardware with apic=debug and hpet=verbose (no problem)
Comment 25 Yanko Kaneti 2010-02-18 15:15:01 UTC
Running with the latest BIOS for this board (rev 1.0, F11)
Comment 26 herrmann.der.user 2010-02-23 11:44:38 UTC
I am wondering why for both test runs there is this log message:

  hpet: hpet_msi_capability_lookup(591):
  hpet: hpet_msi_capability_lookup(596):

The respective hpet_print_config() call is

static void hpet_msi_capability_lookup(unsigned int start_timer)
{
        unsigned int id;
        unsigned int num_timers;
        unsigned int num_timers_used = 0;
        int i;

        if (hpet_msi_disable)
                return;

        if (boot_cpu_has(X86_FEATURE_ARAT))
                return;
        id = hpet_readl(HPET_ID);

        num_timers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT);
        num_timers++; /* Value read out starts from 0 */
        hpet_print_config();
...
}
but that should never be reached, if hpet_msi_disable is set.

I guess the respective quirk should be added to arch/x86/kernel/early-quirks.c
instead of arch/x86/kernel/quirks.c.

Patch to fix this follows asap.
Comment 27 herrmann.der.user 2010-02-23 14:00:06 UTC
Created attachment 25176 [details]
x86, hpet: disable MSI on ATI SBX00
Comment 28 herrmann.der.user 2010-02-23 14:01:44 UTC
Can you please verify whether this patch on top of commit
73472a46b5b28116b145fb5fc05242c1aa8e1461 or current
git (v2.6.33-rc8-189-g9f3a628) fixes this issue on your box?
Thanks.
Comment 29 Yanko Kaneti 2010-02-23 17:09:02 UTC
Created attachment 25177 [details]
apic=debug hpet=vervose  with attachment 25176 [details] applied

Didn't seem to help here. verbose log of 2.6.33-0.51.rc8.git6.fc14.x86_64 + patch
local rebuild attached.
Comment 30 Asbjørn Sannes 2010-03-01 06:17:06 UTC
hm, so from what I have gathered the previously bisected commit only exposes another bug (hidden by enabling hpet). So I started bisecting again, this time with hpet=disabled all the way and found:

commit aa276e1cafb3ce9d01d1e837bcd67e92616013ac
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jun 9 19:15:00 2008 +0200

x86, clockevents: add C1E aware idle function

Reverting this from a non-working 2.6.27 makes it work also. Things have changed considerably since then so I was not able to revert it from the newest kernel.

Maybe that disabling of SBX00 hpet msi, only should be done when you do actually have floppy support? Makes more people boot at least :P
Comment 31 Rafael J. Wysocki 2010-03-01 20:01:12 UTC
[Switching to e-mail, please keep bugzilla-daemon in the CC list.]

On Monday 01 March 2010, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15289
> 
> --- Comment #30 from Asbjørn Sannes <kernelbugzilla@sannes.org>  2010-03-01 06:17:06 ---
> hm, so from what I have gathered the previously bisected commit only exposes
> another bug (hidden by enabling hpet). So I started bisecting again, this time
> with hpet=disabled all the way and found:
> 
> commit aa276e1cafb3ce9d01d1e837bcd67e92616013ac
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Mon Jun 9 19:15:00 2008 +0200
> 
> x86, clockevents: add C1E aware idle function
> 
> Reverting this from a non-working 2.6.27 makes it work also. Things have
> changed considerably since then so I was not able to revert it from the newest
> kernel.
> 
> Maybe that disabling of SBX00 hpet msi, only should be done when you do
> actually have floppy support? Makes more people boot at least :P

Thomas, it looks like something's missing in our C1E handling.  Can you have
a look at this bug report, please?

Rafael
Comment 32 Rafael J. Wysocki 2010-03-01 20:03:24 UTC
Not a recent regression, so dropping from the list.
Comment 33 Asbjørn Sannes 2010-03-01 20:59:28 UTC
I agree that it may not be a recent regression that is the root cause, but it certainly is a regression in the sense that fewer people will be able to boot 2.6.33 than 2.6.32 (although more people will be able to use their floppies flawlessly). It is a tradeoff, just my two cents.
Comment 34 Thomas Gleixner 2010-03-01 22:10:36 UTC
On Mon, 1 Mar 2010, Rafael J. Wysocki wrote:
> [Switching to e-mail, please keep bugzilla-daemon in the CC list.]
> 
> On Monday 01 March 2010, bugzilla-daemon@bugzilla.kernel.org wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=15289
> > 
> > --- Comment #30 from Asbjørn Sannes <kernelbugzilla@sannes.org>  2010-03-01 06:17:06 ---
> > hm, so from what I have gathered the previously bisected commit only exposes
> > another bug (hidden by enabling hpet). So I started bisecting again, this time
> > with hpet=disabled all the way and found:
> > 
> > commit aa276e1cafb3ce9d01d1e837bcd67e92616013ac
> > Author: Thomas Gleixner <tglx@linutronix.de>
> > Date:   Mon Jun 9 19:15:00 2008 +0200
> > 
> > x86, clockevents: add C1E aware idle function
> > 
> > Reverting this from a non-working 2.6.27 makes it work also. Things have
> > changed considerably since then so I was not able to revert it from the newest
> > kernel.
> > 
> > Maybe that disabling of SBX00 hpet msi, only should be done when you do
> > actually have floppy support? Makes more people boot at least :P
> 
> Thomas, it looks like something's missing in our C1E handling.  Can you have
> a look at this bug report, please?

Groan. We have been through that exercise of blaming the above commit
and the C1E handling a couple of times now. It never turned out to
be the real culprit.

Looking at the various steps Asbjorn took to analyse that problem it
simply boils down to the oldest problem with timers on ATI chipsets:

       the irq0 timer interrupt routing is hosed

I have no clue yet, why this is not detected by the test logic we have
in place for that, but it might be something which gets borked later
in the boot process.

Enabling MSI for HPET just papers over the problem as it uses a
different interrupt vector and mechanism.

Disabling HPET does not help simply because PIT is using IRQ0 as well
as the MSI disabled HPET.

I need some sleep to come up with a reasonable method to debug that,
but maybe someone else has a brilliant idea before I have to twist my
brain around it.

Thanks,

	tglx
Comment 35 Thomas Gleixner 2010-03-02 12:05:05 UTC
Asbjørn,

can you please run the following tests and report the results ?

1) Add "maxcpus=1" to the kernel command line

2) Add "nomsi" to the kernel command line

Thanks,

	tglx
Comment 36 herrmann.der.user 2010-03-02 16:44:27 UTC
Looking again at the dmesg output, I am now wondering why APIC IDs are not unique:

CPUs 0,...,3 have lapic_ids of 0, ..., 3
The IOAPIC has

IO APIC #2......
.... register #00: 00000000
.......    : physical APIC id: 00

I think that each APIC device must have a unique ID (both local APICs and IO APICs).

Still have to look into the Linux code, how it copes with that.
Comment 37 Asbjørn Sannes 2010-03-02 17:43:34 UTC
Thomas:
1) Boots
2) No difference

1 + 2 boots ..
Comment 38 herrmann.der.user 2010-03-02 17:54:30 UTC
I assume booting with maxcpus=2 also works as this would result in
a system with 3 unique APIC IDs ...
Comment 39 herrmann.der.user 2010-03-02 18:21:03 UTC
Created attachment 25326 [details]
test patch to set physical APIC ID of IOAPIC(0) to 4

Can you please test this patch. (I assume this is a BIOS bug, because your BIOS should assign unique IDs to all APICs in the system.)
Comment 40 Asbjørn Sannes 2010-03-02 18:42:02 UTC
maxcpus=2 and maxcpus=3 works
patch still had the same problem

The changelog of the bios does not say it should fix any apic issues, but they do have one newer (beta) bios:
http://www.gigabyte.com.tw/Support/Motherboard/BIOS_Model.aspx?ProductID=3005

Should I give that a chance, or try contacting them for a fix?
Comment 41 Thomas Gleixner 2010-03-03 13:48:52 UTC
> --- Comment #36 from Andreas Herrmann <andreas.herrmann3@amd.com>  2010-03-02 16:44:27 ---
> Looking again on the dmesg output, I am now wondering why APIC IDs are not
> unique:
> 
> CPUs 0,...,3 have lapic_ids of 0, ..., 3
> The IOAPIC has
> 
> IO APIC #2......
> .... register #00: 00000000
> .......    : physical APIC id: 00

Good catch. Missed that !
 
> I think that each APIC device must have a unique ID (both local APICs and IO
> APICs).

Yes, we need unique IDs.

Thanks,

	tglx
Comment 42 herrmann.der.user 2010-03-03 14:45:11 UTC
> maxcpus=2 and maxcpus=3 works
> patch still had the same problem

Hmm, unfortunate.

> The changelog of the bios does not say it should fix any apic issues,
> but they do have one newer (beta) bios:
http://www.gigabyte.com.tw/Support/Motherboard/BIOS_Model.aspx?ProductID=3005

By changelog you mean the short description, right?
That's quite sparse.

> Should I give that a chance, or try contacting them for a fix?

Informing them about the apic IDs which are reported as

 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
 ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
 ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
 ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
 ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
 ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
 IOAPIC[0]: apic_id 2, version 33, address 0xfec00000, GSI 0-23

is always a good idea.
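
As an aside, the overlap in the listing above can be checked mechanically. The following Python sketch scans dmesg text for "ACPI: LAPIC" and "ACPI: IOAPIC" lines in the format shown and reports IDs used by both. The regular expressions and the function name find_apic_id_collisions are illustrative assumptions, not something taken from the kernel sources.

```python
import re

# Patterns written against the dmesg lines quoted above (illustrative, not exhaustive).
LAPIC_RE = re.compile(
    r"ACPI: LAPIC \(acpi_id\[(0x[0-9a-f]+)\] lapic_id\[(0x[0-9a-f]+)\] enabled\)"
)
IOAPIC_RE = re.compile(r"ACPI: IOAPIC \(id\[(0x[0-9a-f]+)\]")


def find_apic_id_collisions(dmesg_text):
    """Return the sorted APIC IDs used by both a local APIC and an IO-APIC."""
    lapic_ids = {int(m.group(2), 16) for m in LAPIC_RE.finditer(dmesg_text)}
    ioapic_ids = {int(m.group(1), 16) for m in IOAPIC_RE.finditer(dmesg_text)}
    return sorted(lapic_ids & ioapic_ids)


sample = """\
 ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
 ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
 ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
"""

print(find_apic_id_collisions(sample))
```

Feeding it the full output of the dmesg command instead of the sample string would give the same kind of report for any other board.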

I can't say whether you should install a beta-BIOS.
In general newer BIOS versions are supposed to contain bug fixes.

One further test you could do is to boot with idle=mwait.
If this works you could use this as a workaround with current kernels
for the time being.

And can you please install the package x86info and run
# for i in `seq 0 3 `; do lsmsr -c $i Int; done
Comment 43 Asbjørn Sannes 2010-03-03 19:15:06 UTC
Booting with idle=mwait works, and it gives:
# for i in `seq 0 3 `; do lsmsr -c $i Int; done
IntPendingMessage    = 0x00000000083400b0
IntPendingMessage    = 0x00000000083400b0
IntPendingMessage    = 0x00000000083400b0
IntPendingMessage    = 0x00000000083400b0

Also found that disabling C1E support in the BIOS works, then:
IntPendingMessage    = 0x0000000000000000
IntPendingMessage    = 0x0000000000000000
IntPendingMessage    = 0x0000000000000000
IntPendingMessage    = 0x0000000000000000

I will try out the beta bios, to see if it helps, and report back again (either way), and then try to figure out how to contact gigabyte about their bioses, which I suspect is not as easy as getting in touch with kernel developers :P
Comment 44 herrmann.der.user 2010-03-04 07:16:12 UTC
IMHO, this is a clear indication that the root cause is not the timer interrupt routing
but the way your system (BIOS) is handling C1e.

Your system uses "SMI Initiated C1E" (bit 27 is set in the
IntPendingMsg MSR) if C1e is enabled in your BIOS.
That means that BIOS provides an SMM handler which has to place the system
into the C1e state. I think there must be something wrong with that handler.

Reporting this problem to Gigabyte and hoping to obtain a
working BIOS is the only thing you can do. It doesn't make
sense to try further debugging of this problem from OS perspective.

HPET MSI and older kernels just seem to hide this C1e problem
on your machine. Besides you should disable C1e when using latest
kernels (as long as you don't have a BIOS that is properly working).

The duplicate APIC IDs are an indication that your BIOS is not 100
percent correct. Maybe Gigabyte did not correctly implement support
for quad-core processors and they also forgot to check state of all
4 cores in their SMM handler for C1e. But that's just a wild guess.

Last thing to note:
The CPU support list of your motherboard (GA-MA790FXT-UD5P) contains only
AM3 CPUs. Your CPU is not listed.
It also states that AM2+ Phenom II (45nm) CPUs are not supported. And
your CPU is even an AMD Phenom (65nm) AM2+ CPU:
AMD Phenom(tm) 9350e Quad-Core Processor stepping 03

Are you sure that this hardware configuration is really supported?
Comment 45 Yanko Kaneti 2010-03-04 07:31:19 UTC
> Last thing to note:
> The CPU support list of your motherboard (GA-MA790FXT-UD5P) contains only
> AM3 CPUs. Your CPU is not listed.
> It also states that AM2+ Phenom II (45nm) CPUs are not supported. And
> your CPU is even an AMD Phenom (65nm) AM2+ CPU:
> AMD Phenom(tm) 9350e Quad-Core Processor stepping 03
> 
> Are your sure that this hardware configuration is really supported?

Some confusion here.
Asbjørn's config:
GA-MA790FXT-UD5P (F7 and F8C beta) with 
CPU0: AMD Phenom(tm) II X4 955 Processor stepping 02

My (Yanko) config:
GA-MA78GM-S2H (F11) with 
CPU0: AMD Phenom(tm) 9350e Quad-Core Processor stepping 03
Comment 46 herrmann.der.user 2010-03-04 07:48:48 UTC
Yanko,

You have a different motherboard but same CPU, same BIOS weirdness
(duplicate APIC IDs), similar HPET/APIC configuration (as seen in your
lspci-output) and same symptoms.
One difference is that Gigabyte's CPU support list for your motherboard
mentions your CPU.

Can you please also try to
(1) boot with idle=mwait with problematic kernel,
(2) check for BIOS option to toggle C1e support
(3) and provide lsmsr output(*)?

If all is similar to Asbjørn's observation I would give you similar
recommendations. (report problem to Gigabyte and switch off C1e support
in BIOS.)


Thanks, Andreas

(*) Install package x86info and run
# for i in `seq 0 3`; do lsmsr -c $i Int -V 3; done
Comment 47 herrmann.der.user 2010-03-04 07:52:30 UTC
In reply to comment #44 and #45:
Please ignore my comment about unsupported CPU. I was looking at the wrong dmesg
when writing this.
Sorry.

Of course AMD Phenom(tm) II X4 955 Processor stepping 02
is in the CPU support list of Asbjørn's  mobo.

Just need some more coffee in the morning ;-)
Comment 48 herrmann.der.user 2010-03-04 07:59:47 UTC
Correcting comment #46:
> You have a different motherboard but same CPU, same BIOS weirdness
> (duplicate APIC IDs), similar HPET/APIC configuration (as seen in your
> lspci-output) and same symptoms.
> One difference is that Gigabyte's CPU support list for your motherboard
> mentions your CPU.

That is wrong; it should be:

  You have a different motherboard and CPU,
  same BIOS weirdness (duplicate APIC IDs),
  similar HPET/APIC configuration (as seen in your lspci-output)
  and same symptoms.
Comment 49 herrmann.der.user 2010-03-04 08:13:23 UTC
FYI, currently there are two ways to enter C1e mode.
(1) SMI initiated C1e (which is sometimes problematic as the SMM handler
might do things wrong.)
(2) Hardware initiated C1e

(2) is supported on dual-core CPUs and with family 0x10 revision C3 also
with any number of cores.

That is the reason why (1) has to be used for Asbjørn's system (CPU revision C2)
and for Yanko's system (revision B3).

If we get more reports due to insufficient implementation of (1) it might
be an option to clear bit 27 of MSRC001_0055 in c1e_idle to simply avoid
usage of SMI initiated C1e.
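
To make the bit positions concrete, here is a small Python sketch that decodes the IntPendingMessage values posted in comment 43. Bit 27 is the "SMI Initiated C1E" bit named in this thread. Treating bit 28 as the hardware-initiated C1E enable is an assumption based on AMD's family 10h documentation and is not stated here. The function and constant names are made up for illustration.

```python
# Bit positions in MSRC001_0055 (Interrupt Pending Message).
# Bit 27: "SMI Initiated C1E", as named in this thread.
# Bit 28: hardware-initiated C1E; an assumption from AMD family 10h
# documentation, not something stated in this thread.
SMI_ON_CMP_HALT = 1 << 27
C1E_ON_CMP_HALT = 1 << 28


def decode_int_pending_msg(value):
    """Report which C1E entry mechanisms are enabled in an IntPendingMessage value."""
    return {
        "smi_initiated_c1e": bool(value & SMI_ON_CMP_HALT),
        "hw_initiated_c1e": bool(value & C1E_ON_CMP_HALT),
    }


# Values reported by lsmsr in comment 43.
for label, value in [
    ("C1E enabled in BIOS", 0x00000000083400B0),
    ("C1E disabled in BIOS", 0x0000000000000000),
]:
    print(label, decode_int_pending_msg(value))
```

With C1E enabled in the BIOS the value has only the SMI-initiated bit set, which matches the reading given in comment 44.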
Comment 50 Yanko Kaneti 2010-03-04 08:41:38 UTC
Created attachment 25352 [details]
lsmsr  with idle=mwait

Yes, idle=mwait or disabling "AMD C1E Support" in the BIOS work around the problem here. lsmsr output attached.
I feel quite incompetent to write a bug report to GigaByte that might be taken seriously. Perhaps there is an internal AMD-GB channel that might help here, or at least I expect there to be.
Comment 51 daniel schroeder 2010-03-09 06:17:40 UTC
*** Bug 15476 has been marked as a duplicate of this bug. ***
Comment 52 Asbjørn Sannes 2010-03-13 14:36:17 UTC
M79XTUD6.F8c (beta version of the bios) did not work either with C1E enabled.
Comment 53 herrmann.der.user 2010-03-16 09:01:52 UTC
FYI, I've passed the problem information on to people who are working with Gigabyte. Will see what happens. For the time being you should disable C1E in BIOS. Thanks.
Comment 54 john stultz 2010-07-14 01:37:42 UTC
Patch for a similar issue was posted to lkml: https://patchwork.kernel.org/patch/111824/
Comment 55 herrmann.der.user 2010-09-08 18:10:01 UTC
Please, can all who have observed this issue run a test with C1e enabled
in BIOS and use "acpi_skip_timer_override" on the kernel command line?
(Kernel version shouldn't matter -- just use a recent kernel >=2.6.32).

Thanks!

Further info below.

For the record.

Unfortunately I am still not in contact with BIOS developers from Gigabyte.

But at least Gigabyte provided a mainboard for further testing/debugging.
So I was able to test on a "GA-MA78LMT-US2H".

My observations are:
- It's no regression from .32 to .33.
  .32 does not work properly either. During shutdown/reboot a helping key is
  required to continue operation.
- So far I did not find a kernel where "helping keys" are not required.
- The APIC ID collision does not seem to cause the problem. Activating
  code (integrated for some other vendor) to check for collisions does
  not fix the problem; with a slightly modified 2.6.35.4 (32-bit) I get
         "IOAPIC[0]: apic_id 2 already used, trying 4"
  but a helping key is still needed during boot/shutdown.
- HPET does not seem to be the problem; it's properly configured.
- Configuration for HPET and timer IRQ routing is correct as far as I can tell.
- Commit e8c534ec068af1a0845aceda373a9bfd2de62030 (x86: Fix keeping track of AMD C1E) does not solve the problem either.

The only workaround that I have at the moment to run all recent kernels
when C1e is activated is using
              acpi_skip_timer_override
as kernel command line parameter.

This modifies routing of the timer interrupt. It will be routed via PIC
and then to IO-APIC pin 0 (instead of directly routing it to IO-APIC pin 2).
I don't have an explanation for this though. (Buggy BIOS?)
Comment 56 Yanko Kaneti 2010-09-09 06:20:30 UTC
I've just booted fedora's 2.6.36-0.18.rc3.git1.fc15.x86_64 kernel with C1E enabled on the same Gigabyte board with the same bios as before  without adding additional parameters on the command line and it didn't need a helping hand.
Comment 57 herrmann.der.user 2010-09-09 13:31:32 UTC
Thanks for this info. My system still requires the helping key during boot
with vanilla kernel 2.6.36-rc3-00185-gd56557a. Using acpi_skip_timer_override
fixes the problem.
(Somehow I am afraid that it is pure luck that your system works with
the fedora kernel.)
Comment 58 Asbjørn Sannes 2010-09-18 07:45:59 UTC
acpi_skip_timer_override fixes the problem for me (only tested on 2.6.36rc4).

This was on a "GA-MA790FXT-UD5P".
Comment 59 Yann Droneaud 2010-11-02 09:43:54 UTC
I have the problem with a Gigabyte GA-MA-790X-UD4 with an AMD Phenom II X4 955 when trying to install Fedora 14 Beta (kernel 2.6.35.4).

Disabling AMD C1E in BIOS fixes the problem.

Using command line options like nmi_watchdog=panic,ioapic or irqpoll seems to also work around the problem.
Comment 60 Yann Droneaud 2010-11-02 09:52:49 UTC
Created attachment 35862 [details]
default dmesg log with freeze on MA-790X-UD4
Comment 61 Yann Droneaud 2010-11-02 10:00:55 UTC
Created attachment 35872 [details]
dmesg log without freeze on MA-790X-UD4 with nmi_watchdog=panic,ioapic

Using nmi_watchdog changes which timer is used, so this works around the problem.
Difference between default and nmi_watchdog option:

...
 CPU0: AMD Phenom(tm) II X4 955 Processor stepping 02
+APIC calibration not consistent with PM-Timer: 49ms instead of 100ms
+APIC delta adjusted to PM-Timer: 1255722 (627847)
+APIC timer registered as dummy, due to nmi_watchdog=1!
 calling  migration_init+0x0/0x57 @ 1
...
 Switch to broadcast mode on CPU3
-Total of 4 processors activated (25718.88 BogoMIPS).
+Total of 4 processors activated (16084.16 BogoMIPS).
+Testing NMI watchdog ... 
+WARNING: CPU#0: NMI appears to be stuck (0->0)!
+Please report this to bugzilla.kernel.org,
+and attach the output of the 'dmesg' command.
+
+WARNING: CPU#1: NMI appears to be stuck (0->0)!
+Please report this to bugzilla.kernel.org,
+and attach the output of the 'dmesg' command.
+
+WARNING: CPU#2: NMI appears to be stuck (0->0)!
+Please report this to bugzilla.kernel.org,
+and attach the output of the 'dmesg' command.
+
+WARNING: CPU#3: NMI appears to be stuck (0->0)!
+Please report this to bugzilla.kernel.org,
+and attach the output of the 'dmesg' command.
 sizeof(vma)=184 bytes
...
+initcall init_ext3_fs+0x0/0x6f returned 0 after 130 usecs
+Clockevents: could not switch to one-shot mode: lapic is not functional.
+Could not switch to high resolution mode on CPU 3
+Clockevents: could not switch to one-shot mode: lapic is not functional.
+Could not switch to high resolution mode on CPU 0
+Clockevents: could not switch to one-shot mode: lapic is not functional.
+Could not switch to high resolution mode on CPU 2
+Clockevents: could not switch to one-shot mode: lapic is not functional.
+Could not switch to high resolution mode on CPU 1
 calling  init_ext4_fs+0x0/0xe5 @ 1
Comment 62 Yann Droneaud 2010-11-02 10:05:53 UTC
Created attachment 35882 [details]
dmesg log without freeze on MA-790X-UD4 with irqpoll

With irqpoll option, it seems to work without freezing.

Difference from the default options:

+Kernel command line: root=live:CDLABEL=Fedora-14-Beta-x86_64-Live liveimg rhgb debug initcall_debug irqpoll
+Misrouted IRQ fixup and polling support enabled
+This may significantly impact system performance
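
Since several different boot options have been reported as workarounds in this thread, a quick way to see which of them a given boot used is to scan the kernel command line. The Python sketch below does that. The WORKAROUNDS tuple only lists options mentioned in the comments here, and the function name workarounds_in_cmdline is made up.

```python
# Boot options that commenters in this thread reported as workarounds.
WORKAROUNDS = ("idle=mwait", "acpi_skip_timer_override", "irqpoll", "nmi_watchdog=")


def workarounds_in_cmdline(cmdline):
    """Return the thread's workaround options that appear on a kernel command line."""
    found = []
    for token in cmdline.split():
        for option in WORKAROUNDS:
            # Options ending in '=' match any value; the others must match exactly.
            if (option.endswith("=") and token.startswith(option)) or token == option:
                found.append(token)
    return found


# Command line from the irqpoll dmesg above.
cmdline = (
    "root=live:CDLABEL=Fedora-14-Beta-x86_64-Live liveimg rhgb "
    "debug initcall_debug irqpoll"
)
print(workarounds_in_cmdline(cmdline))
```

On a running system the input would be the contents of /proc/cmdline instead of the hard-coded string.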
Comment 63 herrmann.der.user 2010-11-02 10:08:40 UTC
Yann,
does acpi_skip_timer_override kernel option work around the problem
on your system?
Comment 64 Yann Droneaud 2010-11-03 09:50:13 UTC
(In reply to comment #63)
> Yann,
> does acpi_skip_timer_override kernel option work around the problem
> on your system?

Yes, it does also work.

Here is the difference in dmesg:

 IOAPIC[0]: apic_id 2, version 33, address 0xfec00000, GSI 0-23
 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
+ACPI: BIOS IRQ0 pin2 override ignored.
 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 low level)
-ACPI: IRQ0 used by override.
-ACPI: IRQ2 used by override.
 ACPI: IRQ9 used by override.
 Using ACPI (MADT) for SMP configuration information
...
 Setting APIC routing to flat
-..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
+..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
+..MP-BIOS bug: 8254 timer not connected to IO-APIC
+...trying to set up timer (IRQ0) through the 8259A ...
+..... (found apic 0 pin 0) ...
+....... works.
 CPU0: AMD Phenom(tm) II X4 955 Processor stepping 02
+APIC calibration not consistent with PM-Timer: 49ms instead of 100ms
+APIC delta adjusted to PM-Timer: 1255712 (627849)
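
The routing change in the diff above (timer on pin 2 by default, pin 0 with acpi_skip_timer_override) can be pulled out of a saved dmesg mechanically. The following is only a minimal sketch, assuming the "..TIMER:" line format exactly as it appears in this comment; parse_timer_line is a hypothetical helper name, not part of any kernel tool.

```python
import re

# Matches lines such as:
#   ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
# The field layout is assumed from the dmesg excerpt in this comment.
TIMER_RE = re.compile(
    r"\.\.TIMER: vector=(0x[0-9a-fA-F]+) apic1=(-?\d+) pin1=(-?\d+) "
    r"apic2=(-?\d+) pin2=(-?\d+)"
)


def parse_timer_line(line):
    """Return the timer routing fields as a dict, or None if no match."""
    m = TIMER_RE.search(line)
    if m is None:
        return None
    vector, apic1, pin1, apic2, pin2 = m.groups()
    return {
        "vector": int(vector, 16),
        "apic1": int(apic1),
        "pin1": int(pin1),
        "apic2": int(apic2),
        "pin2": int(pin2),
    }


# The two lines from the diff: default boot vs. acpi_skip_timer_override.
default = parse_timer_line("..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1")
skipped = parse_timer_line("..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1")
print("default pin1:", default["pin1"], "with override skipped pin1:", skipped["pin1"])
```

Comparing pin1 between two boots is a quick way to confirm that the option actually took effect.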
Comment 65 xenus 2010-11-19 20:22:37 UTC
An additional data point for you.
I have an Asus M4N78 (BIOS 1001) with an Athlon II X2 250.
Enabling C1E in the BIOS gives desktop lockups when playing back audio, e.g. aplay -vv *.wav.
No errors in messages after lockup.
Disabling C1E fixes the issue.
Noticed the issue (!) after upgrading to Ubuntu maverick kernels 2.6.35-22 and -23
Didn't see anything about C1E in dmesg.  Let me know if there are logs that are of interest.
Comment 66 gir1dhar 2010-12-15 13:34:37 UTC
I'm having the same problem on a Toshiba nb205-n210 netbook.
acpi_skip_timer_override seems to work around it; at least it boots without the need to press shift all the time.
Comment 67 dajoker 2011-02-13 14:55:55 UTC
Having the problem with the 2.6.34 kernel using OpenSUSE 11.3 x86_32 on an HP laptop.  The Bug# 579932 with Novell is located at the following URL: https://bugzilla.novell.com/show_bug.cgi?id=579932

A couple of workarounds consistently work (the system boots without extra user input), including hpet=disable and processor.max_cstate=1 on the kernel command line from Grub.  Anything I can do to test/help?

Thanks.
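
For anyone comparing notes, it can help to record which of the workaround options reported so far were actually active on a given boot. The following is a minimal sketch, assuming the options are passed as plain space-separated tokens on the kernel command line; the WORKAROUNDS tuple and the function names are hypothetical and only cover options mentioned in this thread.

```python
# Options reported as workarounds in this thread (not an exhaustive list).
WORKAROUNDS = (
    "irqpoll",
    "acpi_skip_timer_override",
    "hpet=disable",
    "processor.max_cstate=1",
)


def active_workarounds(cmdline):
    """Return the known workaround options present in a command line string."""
    tokens = cmdline.split()
    return [opt for opt in WORKAROUNDS if opt in tokens]


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


# Example using the command line quoted in comment 62.
example = (
    "root=live:CDLABEL=Fedora-14-Beta-x86_64-Live liveimg rhgb "
    "debug initcall_debug irqpoll"
)
print(active_workarounds(example))
```

On a running system, active_workarounds(read_cmdline()) gives the same report for the current boot. Matching whole tokens avoids counting near misses such as hpet=disabled.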
Comment 68 Yann Droneaud 2011-02-22 14:55:19 UTC
For the record, this bug is also reported on Red Hat's bugzilla as bug #648837, see https://bugzilla.redhat.com/show_bug.cgi?id=648837
Comment 69 Rafael J. Wysocki 2011-03-27 20:00:25 UTC
*** Bug 29952 has been marked as a duplicate of this bug. ***
Comment 70 Álvaro Manuel Recio Pérez 2011-05-04 00:08:37 UTC
I have the same problem with kernel 2.6.28 in Ubuntu 11.04.

My motherboard is a Gigabyte GA-MA790XT-UD4P, BIOS version F8g (latest beta), with an AMD Phenom II X2 550 processor. With C1E disabled, the system boots fine but with C1E enabled it needs the helping key.
Comment 71 Yann Droneaud 2011-05-04 14:55:49 UTC
Another APIC timer bug on older AMD processors was fixed in kernel 2.6.39-rc6 by Boris Ostrovsky. His commit has the following title:
"x86, AMD: Fix APIC timer erratum 400 affecting K8 Rev.A-E processors"

https://patchwork.kernel.org/patch/746192/

Regression report by Jörg-Volker Peetz "regression due to x86 AMD ARAT feature"
https://lkml.org/lkml/2011/4/24/20

Fix by Boris Ostrovsky
https://lkml.org/lkml/2011/4/29/328
Comment 72 Álvaro Manuel Recio Pérez 2011-05-07 12:02:52 UTC
(In reply to comment #70)
> I have the same problem with kernel 2.6.28 in Ubuntu 11.04.
> 
> My motherboard is a Gigabyte GA-MA790XT-UD4P, BIOS version F8g (latest beta),
> with an AMD Phenom II X2 550 processor. With C1E disabled, the system boots
> fine but with C1E enabled it needs the helping key.

There is an unfortunate typo there. I meant kernel 2.6.38 not 2.6.28.
Comment 73 Angus Turnbull 2011-07-06 06:13:48 UTC
I have a Gigabyte motherboard with a Phenom II X4 810, per DMIDECODE:

Base Board Information
	Manufacturer: Gigabyte Technology Co., Ltd.
	Product Name: GA-MA78GPM-UD2H
	Version: x.x
	Serial Number: 

It's on the latest non-beta BIOS (F7) and my APIC ID table is the same as Andreas'.

Running Ubuntu, since some mid-Lucid-cycle 2.6.32.x kernel update it has had intermittent failures to boot, as described above, and slow suspends. Both were solved by adding "clocksource=hpet" to the kernel command line (after I noticed a few "clocksource TSC unstable" dmesg lines after long pauses).

Beginning with the 2.6.38.x kernels in Natty it wouldn't even boot reliably with that fix. Disabling C1E solved it (but locked the CPU to full speed), and now I've re-enabled C1E and changed my kernel command line to just "acpi_skip_timer_override", and it boots well, suspends/resumes well, and clocks down to 800 MHz when idle.

This seems to be a fairly widespread Gigabyte problem. Should this be added to some known-quirks list?
Comment 74 Alan 2012-06-18 14:49:07 UTC
Is this still the case with recent kernels ?
Comment 75 Yanko Kaneti 2012-06-18 16:02:15 UTC
I've just tried enabling C1E support in the BIOS with 3.4.2-4.fc17.x86_64 on the same board, just with a slightly newer BIOS version (12b), and it wouldn't even boot. It hangs somewhere after HPET initialisation...
Comment 76 Asbjørn Sannes 2012-06-18 17:10:25 UTC
I also just tried to boot with C1E support enabled, and it still has the same problems on 3.4.2 from kernel.org.
Comment 77 Alan 2012-06-18 17:42:22 UTC
thanks
Comment 78 Andrew Kohlsmith 2013-06-07 20:20:03 UTC
I have a Gigabyte X58A-UD3R v2.0 that seems to be exhibiting this exact same problem. It is running the FH1 (beta) BIOS, and I have also tried the actual FH release BIOS.

It boots and runs my old Debian Squeeze just fine, but a new Wheezy install has CPUs #1 through #7 not responding, then reboots.

None of hpet=disable, acpi_skip_timer_override or idle=mwait fixes or changes the issue. The only way I can boot is with maxcpus=1. This is obviously suboptimal.

My APIC IDs seem fine:

[    0.000000] ACPI: PM-Timer IO Port: 0x408
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x04] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x06] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x01] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x03] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x05] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x0c] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0d] lapic_id[0x0d] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0e] lapic_id[0x0e] disabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x0f] lapic_id[0x0f] disabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x04] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x05] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x06] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x07] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x08] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x09] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0a] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0b] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0c] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0d] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0e] dfl dfl lint[0x1])
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0f] dfl dfl lint[0x1])
[    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.000000] ACPI: IRQ0 used by override.
[    0.000000] ACPI: IRQ2 used by override.
[    0.000000] ACPI: IRQ9 used by override.
[    0.000000] Using ACPI (MADT) for SMP configuration information
[    0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
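
A listing like the one above is easier to compare across machines when the LAPIC entries are tallied. The following is a minimal sketch, assuming the "ACPI: LAPIC (...)" line format shown in this comment; summarize_lapics is a hypothetical helper name, and the three sample lines are copied from the listing.

```python
import re

# Matches "ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)".
# The space after "LAPIC" keeps LAPIC_NMI lines from matching.
LAPIC_RE = re.compile(
    r"ACPI: LAPIC \(acpi_id\[(0x[0-9a-fA-F]+)\] "
    r"lapic_id\[(0x[0-9a-fA-F]+)\] (enabled|disabled)\)"
)


def summarize_lapics(dmesg_text):
    """Return (enabled_ids, disabled_ids) as lists of integer LAPIC IDs."""
    enabled, disabled = [], []
    for _acpi_id, lapic_id, state in LAPIC_RE.findall(dmesg_text):
        target = enabled if state == "enabled" else disabled
        target.append(int(lapic_id, 16))
    return enabled, disabled


sample = """\
[    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x02] enabled)
[    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] disabled)
[    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])
"""
enabled, disabled = summarize_lapics(sample)
print("enabled:", enabled, "disabled:", disabled)
```

Fed the full dmesg text, the same function gives the number of enabled CPUs to check against what the kernel later brings up.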

I've tried booting with the BIOS C1E support set to on, off and auto with no change.  I have also tried with the HPET set to 32 or 64 bits with no change. I have not yet tried disabling the HPET, but none of this was required with Squeeze.

Is there any other information I could provide which would help?
Comment 79 Dan 2014-08-09 01:29:25 UTC
I am not sure if this is the same as my issue, but it sounds similar. I am using a Gigabyte GA-890FXA-UD5 updated to the latest BIOS, and cannot boot any version of Fedora (kernels 3.15.5-3.15.8 x86_64, or the Fedora 20 Live CD) without either acpi=off or turning off AMD C1E support in the BIOS (I have opted for the latter). With C1E on, it locks up almost immediately. With quiet boot turned off, the text on the screen at the time of the lockup does not seem helpful.
