Bug 10561

Summary: p4_clockmod: repeated lockups - classmate PC
Product: Power Management Reporter: Omair Eshkenazi (stimpson)
Component: cpufreqAssignee: acpi_power-processor
Status: REJECTED INSUFFICIENT_DATA    
Severity: normal CC: bunk, davej, lenb, rui.zhang, tglx, venki
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24.5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Lots of info from /proc that might help
Output of dmesg -s 1M
Info about system time being slowed down.

Description Omair Eshkenazi 2008-04-27 02:07:32 UTC
Latest working kernel version: 2.6.18.1 (haven't tried any other since)
Earliest failing kernel version: 2.6.24.4
Distribution: none (vanilla +unionfs +squashfs +TuxOnIce)
Hardware Environment: 1st (1.5) generation Intel Classmate (BIOS version CM91510A.86A.0024.2007.0820.1744)
Software Environment: gcc-3.4.4

Problem Description: With HPET enabled, loading the processor and p4_clockmod modules causes the kernel to repeatedly lock up.

Steps to reproduce:

I booted without loading the above modules, killed nearly all processes (except for klogd and syslogd) and loaded the modules, which resulted in seemingly random lockups, which are released after some time (from a few seconds to a few minutes).
(note that the system does not lock up immediately after the drivers are loaded and that rmmod-ing p4-clockmod and speedstep-lib does not cause the lockups to cease -- it seems like it triggers some problem that persists afterwards.)

However, starting X will always cause it to lock up, about 10 seconds after starting.

Sometimes the lockups can be freed by generating some acpi event -- (dis)connecting the AC cable or pressing the power button.

Sometimes, after the lockup, one of two kernel messages appears:
- psmouse.c: TouchPad at isa0060/serio2/input0 lost synchronization, throwing 5 bytes away.
- usb 1-6: reset high speed USB device using ehci_hcd and address 4
(I tried enabling USB_DEBUG, USB_STORAGE_DEBUG and ACPI_DEBUG, but no extra messages were emitted after a lockup was released.) 

Some time (~30 sec) after the modules are loaded, a message appears. It seems like the lockups only start after it. The message is of the form:
Clocksource tsc unstable (delta = -28083753020 ns)
(I also noticed that if powernowd is running at boot and I kill it, it gives a message like:
daemon.info powernowd:   63 speed changes in -27445 seconds
However, I also get such a message with the old kernel.)

Something that might be of some relevance, the difference in dmesg output when booting with hpet enabled and disabled:
APIC calibration not consistent with PM Timer: 156ms instead of 100ms
APIC delta adjusted to PM-Timer: 623436 (976648)
--
hpet clockevent registered
hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
hpet0: 3 64-bit timers, 14318180 Hz

Looking at /proc/sched_debug, it is obvious that, with hpet enabled,
the time is wrong (order of magnitude too big).

I will attach a tarball with detailed information from /proc and such later.
Comment 1 Omair Eshkenazi 2008-04-27 02:11:07 UTC
Created attachment 15931 [details]
Lots of info from /proc that might help
Comment 2 Adrian Bunk 2008-04-27 12:10:28 UTC
Please attach the output of "dmesg -s 1000000" for both booting with HPET and booting without HPET.
Comment 3 Omair Eshkenazi 2008-04-28 08:41:44 UTC
Created attachment 15955 [details]
Output of dmesg -s 1M
Comment 4 Thomas Gleixner 2008-04-28 10:04:16 UTC
Can you please provide the output of 

/sys/devices/system/clocksource/clocksource0/current_clocksource

for both cases ?

Thanks,
       tglx
Comment 5 Omair Eshkenazi 2008-04-28 12:45:29 UTC
When using hpet it says hpet and when not it says tsc, but I have a feeling I saw it say acpi_pm in some cases.
Comment 6 Omair Eshkenazi 2008-04-28 13:01:55 UTC
My previous comment was incorrect.

It says tsc weather hpet is enabled or not.

However, I just tried and modprobing processor, battery and ac made it change to hpet! (battery and ac might be irrelevant and it just needed time to update -- I can rmmod them but not processor - it says "Module processor is in use".) 
This is with hpet enabled, so with hpet disabled it will probably be acpi_pm, as I've noticed before.
(I also tried modprobing p4-clockmod to see if it has any effect, but it hasn't -- it still says hpet and it just locked up...)
Comment 7 Thomas Gleixner 2008-04-28 14:03:56 UTC
> This is with hpet enabled, so with hpet disabled it will probably be acpi_pm,
> as I've noticed before.

Is the switch to hpet resp. pm_timer noted in dmesg ?

> (I also tried modprobing p4-clockmod to see if it has any effect, but it
> hasn't
> -- it still says hpet and it just locked up...)

Hmm, that's a tough one :( Venki, any pointers ?
Comment 8 Len Brown 2008-04-28 19:23:32 UTC
Can you reproduce this when the p4_clockmod driver is not used?

(and why are you using p4_clockmod on a modern processor in the first place
 instead of acpi_cpufreq?)

But it may be that p4_clockmod has nothing to do with this,
and that it is purely an hpet issue.  For we switch to using
the hpet when the processor module is loaded -- as this
box has C2 we will stop using the tsc when processor is loaded
and use the next best timer, which will be the hpet if allowed...

In particular, the BIOS doesn't export the HPET on this box,
we are enabling via chipset probing:

> Force enabled HPET at base address 0xfed00000

Another thing to try is to load the processor module
with processor.max_cstate=1 (via /etc/modprobe.conf)
which will disable C2 and allow the TSC to be used instead
of the hpet.  This should give you the same results as
hpet=disable, and perhaps you should be able to happily
run p4_clockmod in this case and prove that it is not
related to the hang.
Comment 9 Omair Eshkenazi 2008-04-29 03:44:56 UTC
Created attachment 15974 [details]
Info about system time being slowed down.

| Can you reproduce this when the p4_clockmod driver is not used?
No, I tried that in the past and again now -- I waited a long time and also run X and everything seems fine. As I mentioned at the beginning, I *can* remove p4-clockmod (and speedstep_lib) and the lockups will continue, but it seems to trigger something that causes the lockups... 

| (and why are you using p4_clockmod on a modern processor in the first place
 instead of acpi_cpufreq?)
acpi-cpufreq won't load on this machine: "no such device".

Another thing I tried is, after it had already locked up once, I echoed acpi_pm into /sys/devices/system/clocksource/clocksource0/current_clocksource, which caused a message to appear about acpi_pm being enabled (so it worked ok).
However, it still locked up afterwards...

| Another thing to try is to load the processor module
| with processor.max_cstate=1 (via /etc/modprobe.conf)
You mean "option processor max_cstate=1" (the other notation is for builtins).
When probing processor I got a message about it being limited to max C-state 1.
It also caused the time/tsc problem (below) to go away so I got no "tsc unstable" message and it didn't lock up.
However, unlike booting with hpet=disable, in this case it kept using tsc, rather than use acpi_pm.
If I'm not mistaken, it also means the processor can't sleep... so it's not very power-efficient.

I discovered something about the "Clocksource tsc unstable" messages:
When I boot (with hpet enabled or disabled), tsc is used at first.
Then I modprobe processor and there's a period of from a few seconds to a few minutes before that message appears (after which the clocksource used changes to hpet or acpi_pm, depending on how I booted).
Now, during that period, is seems like time pretty much comes to a halt! For every 5 seconds that pass in the real world, the system time changes by about 20 ms.
I've gathered some data that shows how it develops and am attaching it (more info inside).
Note that this is the same whether I boot with hpet enabled or disabled, as hpet/acpi_pm aren't used yet at this stage.
Comment 10 Thomas Gleixner 2008-04-29 08:51:52 UTC
What happens when you boot with clocksource=hpet or clocksource=acpi_pm on the kernel command line ?

Thanks,
       tglx
Comment 11 Omair Eshkenazi 2008-04-30 03:31:09 UTC
> What happens when you boot with clocksource=hpet or clocksource=acpi_pm on
> the
> kernel command line ?

Whichever one I boot with is, indeed, used from early on in the boot process (when tsc was previously initialized) and goes on being used after processor is loaded.
As a result, the period (after loading processor) when the system clock is slowed down does not exist.
However, after loading processor I still get the "tsc clocksource unstable" message (so I guess it's alive somewhere).
Loading p4-clockmod still causes lockups when hpet is used...
Comment 12 Thomas Gleixner 2008-04-30 03:53:02 UTC
> ------- Comment #11 from stimpson@phys.huji.ac.il  2008-04-30 03:31 -------
> > What happens when you boot with clocksource=hpet or clocksource=acpi_pm on
> the
> > kernel command line ?
> 
> Whichever one I boot with is, indeed, used from early on in the boot process
> (when tsc was previously initialized) and goes on being used after processor
> is
> loaded.
> As a result, the period (after loading processor) when the system clock is
> slowed down does not exist.
> However, after loading processor I still get the "tsc clocksource unstable"
> message (so I guess it's alive somewhere).
> Loading p4-clockmod still causes lockups when hpet is used...

Hmm. When you use clocksource=acpi_pm then it works with p4-clockmod, right ?
Can you please provide the output of /proc/timer_list for that scenario ?

Thanks,
	tglx
Comment 13 Omair Eshkenazi 2008-04-30 04:21:10 UTC
> When you use clocksource=acpi_pm then it works with p4-clockmod, right ?

No it locks up. I guess it's like when I boot with no parameter and echo acpi_pm into the current_clocksource file (as I did a couple of days ago).
I'll try and get you some stats from /proc/time_list, just in case it helps, but it seems to be completely locked up at the moment so I'll have to forcefully power it off and get back to you later.
Comment 14 Thomas Gleixner 2008-04-30 05:29:13 UTC
On Wed, 30 Apr 2008, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #13 from stimpson@phys.huji.ac.il  2008-04-30 04:21 -------
> > When you use clocksource=acpi_pm then it works with p4-clockmod, right ?
> 
> No it locks up. I guess it's like when I boot with no parameter and echo
> acpi_pm into the current_clocksource file (as I did a couple of days ago).
> I'll try and get you some stats from /proc/time_list, just in case it helps,
> but it seems to be completely locked up at the moment so I'll have to
> forcefully power it off and get back to you later.

Ok, so the only way to make it work is hpet=disable. I fear we can't
do much about that then. hpet is not supported by the BIOS so it might
be a chipset/BIOS bug in combination with the clock scaling.

Thanks,
	tglx
Comment 15 Omair Eshkenazi 2008-05-03 02:55:50 UTC
> hpet is not supported by the BIOS so it might
> be a chipset/BIOS bug in combination with the clock scaling.

Hmm. That was kind of my basic assumption (hence the "funky" in the title) and I was hoping something could be done to batter handle such problems (distro kernels have hpet enabled, so some users might encounter it and not know what the cause is...).

Oh, well, on to my next bug report...
Comment 16 Len Brown 2008-05-05 18:53:40 UTC
> Ok, so the only way to make it work is hpet=disable.

Isn't the other solution to simply not load p4_clockmod?
(and continue using the hpet, as is done by default)

I thought that dave was going to delete p4_clockmod from the source
tree about this time last year, I'm not sure why it still exists.
Comment 17 Omair Eshkenazi 2008-05-06 03:35:16 UTC
> I thought that dave was going to delete p4_clockmod from the source
> tree about this time last year, I'm not sure why it still exists.

But if it is removed, what will us mortals use? 
These new Intel mobile processors don't support EST and so the only driver that works with them in recent kernels is p4_clockmod...
Comment 18 Len Brown 2011-01-18 05:58:29 UTC
re: comment #9

> You mean "option processor max_cstate=1" (the other notation is for
> builtins).
> When probing processor I got a message about it being limited to max C-state
> 1.
> It also caused the time/tsc problem (below) to go away so I got no "tsc
> unstable" message and it didn't lock up.
> However, unlike booting with hpet=disable, in this case it kept using tsc,
> rather than use acpi_pm.
> If I'm not mistaken, it also means the processor can't sleep... so it's not
> very power-efficient.

Actually this is probably fine.  It will give you C1E,
which is likely 90+ % of the value of C2, and it will
not disable the use of the TSC as a timesource.

I'd be interested if you can detect a battery life
difference between booting with max_cstate=1 and without.

But this business of the HPET and the TSC is not directly
related to the actual failure, as in comment #13 you stated
that this can happen when using acpi_pm as the clocksource,
eliminating HPET and TSC from the picture.  The problem is
with p4_clockmod.

I'd be interested if you can measure any benefit from using
p4_clockmod.  My recommendation would be to simply not use it.
Comment 19 Zhang Rui 2011-04-19 07:43:10 UTC
bug closed as there is no response from the bug reporter.
please re-open it if you can provide the info required in comment #18.
Comment 20 Omair Eshkenazi 2011-04-19 13:19:10 UTC
Sorry, I had no internet access for six months... will email Len about this soon.
Comment 21 Zhang Rui 2011-04-20 01:32:03 UTC
that's okay.
you don't need to email len, please just re-open the bug once you can do the test and attach the test result here. :)