------------[ cut here ]------------
WARNING: at arch/x86/kernel/hpet.c:402 hpet_next_event+0x7d/0x90()
Hardware name: Studio 1557
hpet: compare register read back failed.
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.33-rc3 #5
Call Trace:
 [<ffffffff81038c58>] warn_slowpath_common+0x78/0xb0
 [<ffffffff81038cec>] warn_slowpath_fmt+0x3c/0x40
 [<ffffffff8102133d>] hpet_next_event+0x7d/0x90
 [<ffffffff8102136b>] hpet_legacy_next_event+0xb/0x10
 [<ffffffff8105ab99>] clockevents_program_event+0x59/0xa0
 [<ffffffff8105bfb0>] tick_dev_program_event+0x40/0xc0
 [<ffffffff8105b956>] tick_broadcast_oneshot_control+0x116/0x120
 [<ffffffff8105b1a5>] tick_notify+0x305/0x440
 [<ffffffff811acf10>] ? acpi_hw_read+0x51/0x58
 [<ffffffff81053ff7>] notifier_call_chain+0x47/0x90
 [<ffffffff81054111>] raw_notifier_call_chain+0x11/0x20
 [<ffffffff8105a9d2>] clockevents_notify+0x32/0x140
 [<ffffffff811bb4e5>] lapic_timer_state_broadcast+0x41/0x43
 [<ffffffff811bb9ef>] acpi_idle_enter_bm+0x15c/0x2cd
 [<ffffffff811bb885>] ? acpi_idle_enter_c1+0xe9/0xf7
 [<ffffffff81230f5b>] cpuidle_idle_call+0x9b/0x100
 [<ffffffff81001e3d>] cpu_idle+0x5d/0xb0
 [<ffffffff812dd38b>] start_secondary+0x197/0x19b
---[ end trace d7852ea24f7bfccd ]---
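For reference, the warning comes from the comparator read-back check in hpet_next_event(). Roughly, paraphrased from the 2.6.33-era arch/x86/kernel/hpet.c (a sketch, not an exact copy of the source):

/*
 * Paraphrased sketch of the code behind the warning above: program the
 * comparator, read it back, and warn if the value read differs from the
 * one just written.
 */
static int hpet_next_event(unsigned long delta,
			   struct clock_event_device *evt, int timer)
{
	u32 cnt;

	cnt = hpet_readl(HPET_COUNTER);
	cnt += (u32) delta;
	hpet_writel(cnt, HPET_Tn_CMP(timer));

	/* This read-back is what fails on the affected hardware. */
	WARN_ONCE(hpet_readl(HPET_Tn_CMP(timer)) != cnt,
		  KERN_WARNING "hpet: compare register read back failed.\n");

	return (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0 ? -ETIME : 0;
}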
Matteo: A few questions: Did anything like this crop up with older kernels? Does this happen on every boot, or is it sporadic? thanks -john
Thomas: Would a possible fix be to disable the hpet clockevent driver, similar to how we disable the TSC clocksource? Alternatively, I don't see any locking around the hpet_Tn_CMP(timer) write, so it seems possible that a race could occur. I haven't fully audited the code, but it would make sense for a clockevent device to only have one operation run on it at a time; maybe the suspend/resume paths are catching us? Matteo: Could you also attach a dmesg log?
Created attachment 24544 [details] dmesg
It happens with 2.6.31, 2.6.32 and 2.6.33-rc4 on every boot.
Venkatesh: Does this look like the unannounced hpet-related hardware erratum that was brought up earlier on lkml (http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/02305.html)?
Yes. This is the same problem as in the thread mentioned in comment #5. As of now the erratum has not yet been announced in the spec update; I hear it will be there soon. It is probably safe to add the workaround now and update it with a pointer to the spec update once that goes public.
Created attachment 25096 [details] Use spinlock_irqsave to disable irqs in hpet_next_event

This patch adds locking around the hpet_next_event internals. It's really just to make sure we're not hitting any sort of driver that's taking a long time while we set the event to expire. I could probably have used local_irq_disable(), but this should also remove the possibility that lots of cpus hammering on the hpet write cause the write to stall and take longer than expected.

It would be interesting to see if the messages continue with this patch or go away. If they continue, it's likely a hardware issue or an SMI issue, which we can't really control. If they go away, we'll have to look at what is going on in irq context that is preempting this function for so long.

Oh, and this patch is totally untested, so my apologies if it doesn't boot.
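A minimal sketch of the approach described above, in the spirit of the attachment but not the attachment itself (the lock name is made up for illustration): serialize the comparator write and read-back under a spinlock with interrupts disabled, so nothing on this CPU can get in between the write and the read-back.

/*
 * Sketch only: take a dedicated spinlock with irqs disabled around the
 * comparator programming, so the write and the read-back cannot be
 * interleaved with other CPUs or interrupted locally.
 */
static DEFINE_SPINLOCK(hpet_next_event_lock);	/* hypothetical name */

static int hpet_next_event(unsigned long delta,
			   struct clock_event_device *evt, int timer)
{
	unsigned long flags;
	u32 cnt;
	int ret;

	spin_lock_irqsave(&hpet_next_event_lock, flags);

	cnt = hpet_readl(HPET_COUNTER);
	cnt += (u32) delta;
	hpet_writel(cnt, HPET_Tn_CMP(timer));

	WARN_ONCE(hpet_readl(HPET_Tn_CMP(timer)) != cnt,
		  KERN_WARNING "hpet: compare register read back failed.\n");

	ret = (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0 ? -ETIME : 0;

	spin_unlock_irqrestore(&hpet_next_event_lock, flags);

	return ret;
}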
(In reply to comment #7)
> Created an attachment (id=25096) [details]
> Use spinlock_irqsave to disable irqs in hpet_next_event
> [...]

Ugh! Attached this to the wrong bug #. Please ignore.
Venkatesh posted a patch to document and resolve this issue: http://patchwork.kernel.org/patch/82092/
It fixes the problem, great.