Bug 12118

Summary: Random freezes - unless nohz=off - Fujitsu Siemens Amilo Xi 2428, Intel Core2Duo T8100, 4GB
Product: Timers Reporter: Pay87 (pay)
Component: Other Assignee: john stultz (john.stultz)
Status: ASSIGNED ---    
Severity: high CC: ajungstand, alan, andreas_nordal_4, estellnb, gabriele.svelto, james, jbucata, kernel, tglx, thahn01
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.7.6 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 56331    
Attachments: Kernel failure message with nohz=off
Boot-Log with nolapic_timer
Boot-Log without nolapic_timer
Output from normal boot plus points 2--4 in comment #18
dmesg from James Ettle's notebook (2.6.27.10-169.fc10, normal cmdline params)
Info req'd in comment #18
dmidecode from a Clevo M720R
dmidecode from a Fujitsu-Siemens Amilo Pi 2540
dmidecode from a Multicom Compal JFL92+

Description Pay87 2008-11-28 17:37:56 UTC
Latest working kernel version: none
Earliest failing kernel version: since tickless kernel feature
Distribution: Fedora, openSuse, Ubuntu
Hardware Environment: Fujitsu Siemens Amilo Xi 2428, Intel Core2Duo T8100, 4GB RAM, Intel PM965 (Crestline-PM) + ICH8M, nVidia GF 8600 GS (G86M)  
Software Environment: All
Problem Description: Notebook only works with nohz=off, else I get random freezes or short random freezes until I move the mouse or type something.
It seems it has something to do with the tickless kernel feature.
If you search the forums you will have a lot of others having this problem on several notebooks. I get this error with 32bit and with 64bit kernel versions.
Comment 1 ykzhao 2008-11-29 05:53:32 UTC
Will you please attach the output of acpidump?
Will you please try the following boot option and see whether the system can work well?
   a. processor.max_cstate=1 (the processor driver should be built into the kernel, not compiled as a module)
   b. idle=poll
   c. idle=nomwait
   d. nolapic_timer
   
   Thanks.
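
After rebooting with one of these options it is worth confirming that the option really reached the kernel. A minimal sketch, assuming the options appear verbatim in /proc/cmdline; the helper names has_boot_opt and report_boot_opts are hypothetical:

```shell
#!/bin/sh
# has_boot_opt OPTION [CMDLINE_FILE]
# Hypothetical helper: succeeds if OPTION appears as a whole parameter in
# the kernel command line (default file: /proc/cmdline).
has_boot_opt() {
    opt=$1
    file=${2:-/proc/cmdline}
    # Put one parameter per line, then require an exact, fixed-string match.
    tr ' \t' '\n\n' < "$file" | grep -Fqx -- "$opt"
}

# report_boot_opts [CMDLINE_FILE]
# Lists which of the options discussed in this report are currently active.
report_boot_opts() {
    file=${1:-/proc/cmdline}
    for o in processor.max_cstate=1 idle=poll idle=nomwait nolapic_timer nohz=off; do
        if has_boot_opt "$o" "$file"; then
            echo "$o: active"
        else
            echo "$o: not set"
        fi
    done
}

# Usage on a running system:
#   report_boot_opts
```

The exact match matters here: a substring test would report nolapic_timer as present when only some longer, differently named parameter is on the command line.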
Comment 2 Pay87 2008-11-29 17:53:31 UTC
Hello, I wasn't able to get the output of acpidump yet; I have to look up how to get it.
I tried all the commands and it seems that all of them made things a little better.
While testing I had no freezes, either during boot or while starting some basic operations.
I tried all of the commands twice and all seemed to work for the moment.
Of course I have to do a long time test to see how stable this will work.
This will take some time and I try to contact a few people with the same problem to test the same commands. If I get this problem fixed without using nohz=off, and with one of the commands above for a longer time, I will post again. Thanks and regards.
Comment 3 ykzhao 2008-11-30 17:18:00 UTC
Thanks for the test.
   It seems that the system can be booted after the boot option mentioned in comment #1 is used.
   
   Please use the acpidump tools to get the output of acpidump.The latest dump tool can be found in 
   http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/

   Thanks.
Comment 4 Zhang Rui 2008-11-30 18:06:52 UTC
A tickless kernel bug?
It doesn't look like an ACPI bug from the current description.
Re-assigning to the timers-other category.
Comment 5 Len Brown 2008-11-30 19:45:34 UTC
Exactly what kernel version is this reported against?
(uname -a)
Do you still see it with 2.6.27.stable?
how about with 2.6.28-rc?
Comment 6 Pay87 2008-12-05 09:26:42 UTC
Created attachment 19158 [details]
Kernel failure message with nohz=off
Comment 7 Pay87 2008-12-05 09:27:51 UTC
Well, nothing new from me.
I tested the new fedora 10 live cd to see if some things have changed in kernel 2.6.27.
Without nohz=off I got a lot of freezes, but even with I got this message after a while (see attachment above).
Comment 8 Pay87 2008-12-20 09:02:16 UTC
Something new from me:
On Fedora (2.6.27.7-134.fc10.i686) nohz=off causes freezes and a very slow pc performance. Very strange because before it solved my problems. 
I'm now testing nolapic_timer and it seems to work fine (so far).
If new problems occur or this turns out to be a permanent fix, I will post again.
Comment 9 Pay87 2008-12-21 16:23:24 UTC
nolapic_timer really seems to fix it on 2.6.27.7-134.fc10.i686.
Now I wonder what exactly this option does, and why nohz=off doesn't work anymore?
Comment 10 Thomas Gleixner 2008-12-22 15:42:21 UTC
nolapic_timer does not fix it. It just disables code paths which expose the problem.

Can you please upload complete boot logs for a boot with and without nolapic_timer on the kernel command line ?

Thanks,

        tglx
 
Comment 11 Pay87 2008-12-22 19:31:58 UTC
hello, do you mean the log file which can be found in /var/log/boot.log ?
because both seem to contain the same information, even with nolapic_timer.
everything is "ok" there.
Comment 12 Thomas Gleixner 2008-12-22 23:00:25 UTC
> hello, do you mean the log file which can be found in /var/log/boot.log ?
> because both seem to contain the same information, even with nolapic_timer.
> everything is "ok" there.

Do after boot:

# dmesg >boot.txt

And upload the file.
Comment 13 Pay87 2008-12-23 07:25:48 UTC
Created attachment 19456 [details]
Boot-Log with nolapic_timer
Comment 14 Pay87 2008-12-23 07:26:39 UTC
Created attachment 19457 [details]
Boot-Log without nolapic_timer
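
Two boot logs like these are easier to compare once the per-line timestamps are removed, since otherwise nearly every line differs. A minimal sketch, assuming the logs carry the usual bracketed dmesg timestamps; strip_ts and diff_boot_logs are hypothetical helper names:

```shell
#!/bin/sh
# strip_ts: read dmesg-style lines on stdin and drop a leading
# "[   12.345678] " timestamp if one is present.
strip_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//'
}

# diff_boot_logs LOG_A LOG_B
# Unified diff of two boot logs with timestamps stripped.
# Returns diff's own exit status (0 = identical, 1 = different).
diff_boot_logs() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    strip_ts < "$1" > "$a"
    strip_ts < "$2" > "$b"
    rc=0
    diff -u "$a" "$b" || rc=$?
    rm -f "$a" "$b"
    return "$rc"
}

# Usage (file names are placeholders):
#   diff_boot_logs boot-with.txt boot-without.txt
```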
Comment 15 James Ettle 2009-01-09 02:26:09 UTC
Pay87, I believe I am seeing something similar on my hardware --- also a T8100-based notebook with the same chipset as yours, as far back as kernel 2.6.24 on Fedora 8. I filed under bug 12390; please let me know if my symptoms match yours and I'll mark my bug as a duplicate of this one.
Comment 16 James Ettle 2009-01-09 07:57:46 UTC
*** Bug 12390 has been marked as a duplicate of this bug. ***
Comment 17 James Ettle 2009-01-09 08:03:00 UTC
Hello, I think I see the same thing on the same processor and chipset. I reported it in the Red Hat Bugzilla[1] back in April, where I found that  processor.max_cstate=1 seemed to stop the issue (I'd have to check again); someone else said it went away at processor.max_cstate=2. Anyone want any more logs/dumps? ;)

1. https://bugzilla.redhat.com/show_bug.cgi?id=443155
Comment 18 Thomas Gleixner 2009-01-14 00:09:38 UTC
Ok, I checked the boot logs. The interesting differences:

--- boot-with.txt	2009-01-14 08:48:44.000000000 +0100
+++ boot-without.txt	2009-01-14 08:48:53.000000000 +0100

+Marking TSC unstable due to TSC halts in idle

+Clocksource tsc unstable (delta = -94756607 ns)

+CE: hpet increasing min_delta_ns to 15000 nsec
+CE: hpet increasing min_delta_ns to 22500 nsec
+CE: hpet increasing min_delta_ns to 33750 nsec

That means, that we don't go into deep power states when nolapic_timer
is on the kernel command line.

Please try the following steps:

1) boot w/o the nolapic_timer option and wait until the freezes start
   to happen. When the freezes become longer, run dmesg >log.txt and
   upload the file.

2) boot with and w/o nolapic_timer option and provide the output of
   # cat /proc/timer_list 
   and
   # cat /sys/devices/system/clocksource/clocksource0/current_clocksource
   and
   # cat /proc/acpi/processor/CPU0/power
   for each

3) boot with "hpet=disable" on the kernel command line

4) boot with "clocksource=acpi_pm" on the kernel command line

5) If possible can you try 2.6.28 ?

Thanks,

	tglx
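
The three files from step 2 can be gathered in one go for each boot. A minimal sketch, assuming a POSIX shell; collect_timer_info is a hypothetical helper, and its second argument exists only so the function can be tried against a copy of the tree. A file that cannot be read is recorded as such rather than aborting the run.

```shell
#!/bin/sh
# collect_timer_info OUTDIR [ROOT]
# Hypothetical helper: copies the three files requested in step 2 into
# OUTDIR. ROOT is an optional path prefix (empty on a real system).
collect_timer_info() {
    out=$1
    root=${2:-}
    mkdir -p "$out" || return 1
    for f in \
        /proc/timer_list \
        /sys/devices/system/clocksource/clocksource0/current_clocksource \
        /proc/acpi/processor/CPU0/power
    do
        # Name each copy after the last path component.
        name=$(basename "$f")
        if [ -r "$root$f" ]; then
            cat "$root$f" > "$out/$name"
        else
            echo "not readable: $f" > "$out/$name"
        fi
    done
}

# Usage, once per boot (directory names are placeholders):
#   collect_timer_info with-nolapic_timer
#   collect_timer_info without-nolapic_timer
```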
Comment 19 James Ettle 2009-01-14 11:38:43 UTC
Created attachment 19794 [details]
Output from normal boot plus points 2--4 in comment #18

tglx, I haven't uploaded a dmesg because there's nothing in it for me between freezes. In my case, they show up after around 10 seconds and I've observed them to last between 30--60 seconds, or until a hardware interrupt comes along (so, keypress, attach USB device, network activity, etc.). In order to reproduce this, I have to disable all networking, plus unplug all external and internal USB devices. Processes coming out of sleep don't seem to wake things up.

What I'm attaching comes from my default config, plus points 2--4 above. Rather than overload the attachment list, it's all in timer.tar.bz2 --- the filenames should be self-explanatory. Hope this is helpful!
Comment 20 Thomas Gleixner 2009-01-14 12:37:51 UTC
> What I'm attaching comes from my default config, plus points 2--4
> above.

And the freezes happened in all 4 scenarios ?

Thanks,

	tglx
Comment 21 Thomas Gleixner 2009-01-14 12:48:35 UTC
> And the freezes happened in all 4 scenarios ?

Just noticed, that on 2.6.27 when you run a 64bit kernel you need
"noapictimer" instead of "nolapic_timer" :(

Thanks,

	tglx
Comment 22 James Ettle 2009-01-14 13:05:52 UTC
Created attachment 19799 [details]
dmesg from James Ettle's notebook (2.6.27.10-169.fc10, normal cmdline params)

Sorry, tglx, I forgot... Yes, the "pauses" happen in all four cases, this is with kernel-2.6.27.10-169 from Fedora 10 (2.6.28 has some other "resource sanity check" bug so I'm leaving that one alone for the time-being). I'll get the noapictimer results for you soon. I've decided to attach dmesg for this kernel (normal boot options) since it might have some useful info for you anyway.
Comment 23 James Ettle 2009-01-15 11:42:43 UTC
I tried noapictimer on 2.6.27. *As far as observed*, the bug did not manifest. The default clocksource was tsc, which is normally marked unstable; this upset a number of multimedia applications. I'll upload a new archive obsoleting the old one with the results using this boot option.
Comment 24 James Ettle 2009-01-15 11:43:55 UTC
Created attachment 19819 [details]
Info req'd in comment #18
Comment 25 Thomas Gleixner 2009-01-16 01:51:01 UTC
> I tried noapictimer on 2.6.27. *As far as observed*, the bug did not
> manifest.  The default clocksource was tsc, which is normally marked
> unstable; 

Hmm. The power log says that the system is permanently in the C0 state. The
TSC is not marked unstable in that case.

> this upset a number of multimedia applications.

Can you add "clocksource=acpi_pm" as well ? Are the multimedia apps
more happy then ?

Thanks,

	tglx
Comment 26 Thomas Gleixner 2009-01-16 02:07:26 UTC
Another test would be to add "idle=nomwait" (no other options) to the
kernel command line.

Thanks,

	tglx
Comment 27 James Ettle 2009-01-19 03:55:56 UTC
The bug still happens with "idle=nomwait" (as in Comment #26). Using "noapictimer clocksource=acpi_pm", I didn't see the processor entering anything below C1; I'm not sure the MM apps were *completely* happy, either --- I think PulseAudio in particular likes hpet.
Comment 28 Arne Jungstand 2009-01-25 13:24:39 UTC
I confirm this with a Clevo M720R mainboard,
T9300 processor, PM965/GM965/GL960, 4G RAM
Kernel 2.6.27-11, different distributions (Ubuntu, SUSE, both 32bit or 64 bit)
One additional observation: after the freezes the system time is delayed by 5 min or multiples of this.
With nohz=off, the bug doesn't occur.
idle=nomwait not tested yet.

I hope someone finds a solution

Thanks
Arne
Comment 29 James Ettle 2009-01-25 23:55:40 UTC
(In reply to comment #28)
> One additional observation: after the freezes the system time is delayed by 5
> min or multiples of this.

Just to add to the confusion, I've NOT seen any clockskew on my M720R...
Comment 30 Arne Jungstand 2009-03-26 09:56:05 UTC
Just tested with kernel 2.6.27-14, 64bit,
with nohz=off no freezes and no clockskews, 
without nohz=off the problem persists

Thanks
Arne
Comment 31 James Ettle 2009-04-07 20:17:49 UTC
Any more thoughts on this? Running with nohz seems to make the latest "glitchless" PulseAudio rather, er, glitchy... this is on 2.6.29.1-54.fc11.x86_64.
Comment 32 Pay87 2009-04-07 21:16:17 UTC
I use nolapic_timer because nohz=off caused some random crashes on my system.
Comment 33 James Ettle 2009-05-08 13:17:43 UTC
Hi, I note this bug is still NEEDINFO. Please let me know what extra information is required and I'll try and provide it. Thanks!
Comment 34 James Ettle 2009-05-26 20:01:45 UTC
No different with kernel-2.6.30-0.91.rc7.git1.fc12.x86_64.
Comment 35 James Ettle 2009-08-13 21:00:38 UTC
Anyone else notice an improvement between 2.6.29 and 2.6.30.4, which I'm testing now? .29 had it severely, the system was basically unusable without nohz=off or continual keypresses; however .30.4 isn't perfect so I'm not going to cry "fixedforme" just yet...
Comment 36 James Ettle 2009-08-13 21:35:43 UTC
Addendum: It still happens in 2.6.30.4, but it's somewhat rarer.
Comment 37 James Ettle 2009-10-01 10:49:51 UTC
Still exhibited by 2.6.31-series kernels.
Comment 38 john stultz 2009-11-02 22:18:36 UTC
Just for reference, there's another bug (bug #14280) that is an Amilo Pro 2030, which seems to have an ACPI PM timer that changes speed when NO_HZ is enabled.

May be related to this issue. Might need some sort of pciquirk that disables NOHZ on these boxes?
Comment 39 James Ettle 2009-11-02 22:24:46 UTC
My notebook currently defaults to hpet as its clocksource (.31); a few kernels ago (.29? .30?) it used tsc; neither made a difference to the problem. It's like the interrupts for whatever was supposed to be waking the machine up either weren't being received --- or weren't being sent in the first place. The trouble is nohz=off prevents the machine from reaching the higher C-states. If anyone can point me to a deep diagnostic test to find out precisely what's (not) going on, I'd be quite willing to try it.
Comment 40 James Ettle 2009-12-04 19:15:18 UTC
No improvement for me with 2.6.32; if anything, it's become worse.
Comment 41 James Ettle 2009-12-04 19:16:18 UTC
I should add, my notebook seems to experience the freezes even with nohz=off.
Comment 42 James Ettle 2009-12-09 21:49:13 UTC
Could this be the same as/related to bug 11166?
Comment 43 Morten Nielsen 2010-04-03 20:38:20 UTC
I have the same issue on my lenovo ideapad S12.

I have tried both hpet and acpi_pm clocksources and on both kernel 2.6.32 (debian trunk) and 2.6.34rc2.

I started investigating because of the system time being wrong. See this for other details. http://forums.debian.net/viewtopic.php?f=5&t=50634
Basically, the system time lost 6 hours overnight.

Time is correct using nohz=off, but I get some 500 extra wakeups per second. My kernel is compiled with CONFIG_HZ_250=y, which (I think) explains the number of 500.
powertop reports:
Top causes for wakeups:
  81,2% (500,4)     <kernel core> : hrtimer_start_range_ns (tick_sched_timer) 

and also
Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        ( 4,0%)         1,60 Ghz     0,0%
polling           8,2ms (96,0%)         1333 Mhz     0,0%
C1 mwait          0,0ms ( 0,0%)         1067 Mhz     0,0%
C2 mwait          0,0ms ( 0,0%)          800 Mhz   100,0%
C4 mwait          0,0ms ( 0,0%)

I don't get the polling part, but it is not a good solution never to use the low C-levels. I expect the battery lifetime will be way longer if I started using the low power modes of the processor :-)

That was my two-pence, since this bug is still marked as NEEDINFO.
I would like to see it fixed, so what other info is needed?
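
The figure of 500 is consistent with one periodic tick per logical CPU: with nohz=off every CPU takes HZ timer interrupts per second, and /proc/cpuinfo for this machine (posted later in this report) lists two processors. A rough sketch of that arithmetic, assuming the kernel .config is at hand; config_hz and expected_tick_wakeups are hypothetical helper names:

```shell
#!/bin/sh
# config_hz CONFIG_FILE
# Print the numeric CONFIG_HZ value from a kernel .config
# (the line "CONFIG_HZ=250", not the "CONFIG_HZ_250=y" choice entry).
config_hz() {
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$1"
}

# expected_tick_wakeups HZ NCPUS
# Periodic tick wakeups per second when every CPU ticks at HZ.
expected_tick_wakeups() {
    echo $(( $1 * $2 ))
}

# expected_tick_wakeups 250 2    prints 500
```

This only accounts for the tick itself; wakeups caused by applications and other timers come on top of it.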
Comment 44 Gabriele Svelto 2010-04-17 08:19:41 UTC
I am seeing the very same problem on a Fujitsu-Siemens Amilo Pi 2540 using the stock kernels in Fedora 12 as well as the SystemRescueCd live distribution (kernel versions 2.6.31 as well as 2.6.32).

The problem goes away if I boot with the processor.max_cstate=1 option. On the other hand changing clock source doesn't seem to fix the problem. I can post more information if needed (hardware listing, dumps, etc...).
Comment 45 Morten Nielsen 2010-04-19 11:40:05 UTC
I have tried the processor.max_cstate=1 parameter.

Three observations:
1) I don't get the systematic 500 wakeups per second. This is expected since I now use dynamic ticks.
2) powertop doesn't show the C states anymore. It says
"< Detailed C-state information is not available.>"
I would have expected to have both C0 and C1 shown.
3) Even with dynamic ticks, the time is correct and I get no random freezes.

In conclusion, processor.max_cstate=1 seems to work, but still, the part about using the lower c-states would be nice :-)

Cpu info - if relevant.
leon:~# cat /proc/cpuinfo 
[snip]
model name	: Intel(R) Atom(TM) CPU N270   @ 1.60GHz
[/snip]
This is for both processors.
Comment 46 James Ettle 2010-05-19 10:37:48 UTC
Reading through some recent entries on the kernel mailing list about another notebook observed to do this, I tried adding the command-line option

  pci=nomsi

and this seems to work. No strange pauses. (Doesn't help with bug 12788, but it means the machine now seems to be able to use C3 without nodding off.)
Comment 47 Gabriele Svelto 2010-05-27 09:46:28 UTC
I have tried the pci=nomsi command-line option on my machine and while it reduces significantly the number of pauses it doesn't eliminate them entirely. My machine still freezes from time to time even though it takes several seconds for this to happen, without the option it happens pretty much all the time.
Comment 48 James Ettle 2010-11-10 11:26:00 UTC
Anyone tried adding acpi_skip_timer_override to the kernel cmdline? I have a *suspicion* this fixes things on mine, but I'm still investigating.
Comment 49 Gabriele Svelto 2010-11-26 23:04:31 UTC
(In reply to comment #48)
> Anyone tried adding acpi_skip_timer_override to the kernel cmdline? I have a
> *suspicion* this fixes things on mine, but I'm still investigating.

On my machine (Amilo Pi 2540) using this flag greatly improves the situation but doesn't remove the pauses entirely, I'm using Fedora 14 (x86), kernel 2.6.35.
Comment 50 Morten Nielsen 2010-11-27 18:19:50 UTC
I switched from processor.max_cstate=1 to the suggested acpi_skip_timer_override. I changed a couple of days after James Ettle's comment, and have been using it since.

It works. I don't have random freezes, and the C-states work too.

I have an issue with too many wake-ups-from-idle, but that might be something else (see below).

Good suggestion, thanks.

PS. running debian testing with custom compiled kernel 2.6.34

--- powertop output, should it be relevant ---
leon:~# powertop -d
PowerTOP 1.11   (C) 2007, 2008 Intel Corporation 

Collecting data for 15 seconds 


Your CPU supports the following C-states : C1 C2 C4 
Your BIOS reports the following C-states : C1 C2 C4 
Cn	          Avg residency
C0 (cpu running)        ( 5,9%)
C0		  0,0ms ( 0,0%)
C1 mwait	  9,7ms (47,5%)
C2 mwait	  0,9ms (29,2%)
C4 mwait	  0,4ms (17,3%)
P-states (frequencies)
  1,60 Ghz     2,1%
  1333 Mhz     0,1%
  1067 Mhz     0,1%
   800 Mhz    97,8%
Wakeups-from-idle per second : 859,0	interval: 15,0s
no ACPI power usage estimate available
Top causes for wakeups:
  38,7% ( 89,5)     <kernel core> : hrtimer_start_range_ns (tick_sched_timer) 
  33,2% ( 76,9)              java : hrtimer_start_range_ns (hrtimer_wakeup)
Comment 51 James Ettle 2010-11-27 19:31:38 UTC
(In reply to comment #50)
> I have an issue with too many wake-ups-from-idle, but that might be something
> else (see below).

I see this too, many more wake-ups with the timer_override option (around 2500 wups).
Comment 52 Thomas Hahn 2011-01-05 22:01:40 UTC
I am still having freezing when booting.
It is not always the same time but mostly when configuring network.
As soon as I touch the touchpad booting resumes.

Kernel command line: BOOT_IMAGE=/boot/vmlinuz-2.6.36-2.slh.3-aptosid-686 root=UUID=ab6a290e-a341-4776-9846-fd8787b9d3ad ro acpi_skip_timer_override quiet

The first line on the screen when booting is:
Jan  5 20:49:48 xtrema kernel: [    0.010999] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
Jan  5 20:49:48 xtrema kernel: [    0.010999] ...trying to set up timer (IRQ0) through the 8259A ...
Jan  5 20:49:48 xtrema kernel: [    0.010999] ..... (found apic 0 pin 0) ...
Jan  5 20:49:48 xtrema kernel: [    0.021803] ....... works.

Does this have anything to do with it?

Thomas
Comment 53 Morten Nielsen 2011-01-06 19:12:47 UTC
I have the exact same 4 lines in dmesg, and I don't experience the random freezes anymore, so most likely the answer is no.
See my comment #50 above for system details.
Comment 54 Thomas Hahn 2011-01-06 22:33:21 UTC
Ok, now I have tried the latest kernel:

Jan  6 23:27:58 xtrema kernel: [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-2.6.37-0.slh.1-aptosid-686 root=UUID=ab6a290e-a341-4776-9846-fd8787b9d3ad ro acpi_skip_timer_override quiet

Still about the same.
Hangs twice and resumes action as soon as I hit the touchpad.

How come the acpi_skip_timer_override doesn't work for my box?
We are all talking about the same laptop, right?
Comment 55 Morten Nielsen 2011-01-07 07:29:14 UTC
I work with a Lenovo S12 with an Intel Mobile 945GME graphics card and 2gb of memory. CPU is reported to be two Intel(R) Atom(TM) CPU N270   @ 1.60GHz
Comment 56 Gabriele Svelto 2011-06-15 16:51:59 UTC
Another update: if I boot my machine with acpi_skip_timer_override, the pauses last only until X starts. Once X has started, the machine doesn't pause any more. However, with this option on I noticed the following error message at boot:

[    0.012999] ..MP-BIOS bug: 8254 timer not connected to IO-APIC

I will try the noapic option to see if the problem goes away. Still on Fedora 14 BTW, kernel 2.6.35.13-92.fc14.i686.PAE.
Comment 57 James Ettle 2011-06-26 14:28:12 UTC
This bug is still present in kernel 2.6.39.2. It's also still marked NEEDINFO; what specific further information is required at this time?
Comment 58 john stultz 2011-06-27 18:39:13 UTC
James: First thanks for your diligence here and sorry this issue has gone on for so long.

Could you provide a brief summary of which boot options resolve the issue (against the 2.6.39 kernel)? From the logs above it seems there is some uncertainty as to how well "nolapic_timer" and "acpi_skip_timer_override" help.

I suspect we are going to have to quirk the specific system to try to address this, as it seems the board in your laptop acts oddly enough and I'm not sure we have a good method to detect the problem without causing issues on other systems.

Could you also provide the output to dmidecode so we have the right machine id to wire the quirk up to?
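
As a quick cross-check next to the full dmidecode output, the identity strings that a DMI-based quirk would typically match can also be read from sysfs. A minimal sketch, assuming the kernel exposes /sys/class/dmi/id; dmi_summary is a hypothetical helper and is not a substitute for the dmidecode attachment asked for above:

```shell
#!/bin/sh
# dmi_summary [DIR]
# Hypothetical helper: print the main DMI identity fields, one per line,
# from DIR (default: /sys/class/dmi/id). Fields that are missing or
# unreadable are skipped.
dmi_summary() {
    dir=${1:-/sys/class/dmi/id}
    for key in sys_vendor product_name board_name bios_vendor bios_version bios_date; do
        if [ -r "$dir/$key" ]; then
            printf '%s: %s\n' "$key" "$(cat "$dir/$key")"
        fi
    done
}

# Usage on a running system:
#   dmi_summary
```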
Comment 59 Morten Nielsen 2011-07-08 18:15:44 UTC
I looked at it again, and the issue seems to have gone. This is the conclusion after one day. 

I run a custom compiled kernel, but I think the important part is that I now use version 2.6.37. I can supply the .config, if anyone is interested.

# dmesg   | grep Kernel
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-2.6.37.6 root=UUID=2b09e3eb-7445-4a4b-9af7-6dd00c036061 ro no_console_suspend

and powertop shows me all the C-states and processor speeds. I have around 100-150 wakeups per second, divided among my applications, so that looks normal (as opposed to the 500+ wakeups/sec that I reported earlier).

I have not tested with 2.6.39, since it breaks hibernation, but that is (most likely) unrelated.
Comment 60 James Ettle 2012-01-19 12:38:25 UTC
(In reply to comment #58)
> James: First thanks for your diligence here and sorry this issue has gone on
> for so long.

Likewise, I apologise for the delay (thankfully this Bugzilla has now resumed normal service!). I have attached the machine's dmidecode below. I'm currently testing acpi_skip_timer_override on kernel 3.1.6 (Fedora build). This *seems* to resolve the pausing issue and now no longer introduces excessive wakeups on this machine. However, I'd like to test this for a few more days; if it doesn't work out, I'll be back to max_cstate=1.
Comment 61 James Ettle 2012-01-19 12:39:07 UTC
Created attachment 72125 [details]
dmidecode from a Clevo M720R
Comment 62 Gabriele Svelto 2012-01-19 19:17:34 UTC
Using the acpi_skip_timer_override seems to work for me too on kernel 3.1.9 (Fedora build). I will attach the dmidecode data from my laptop too, in case it could be useful.
Comment 63 Gabriele Svelto 2012-01-19 19:20:24 UTC
Created attachment 72133 [details]
dmidecode from a Fujitsu-Siemens Amilo Pi 2540
Comment 64 James Ettle 2012-01-28 10:36:49 UTC
(In reply to comment #60)
> this machine. However, I'd like to test this for a few more days; if it
> doesn't work out, I'll be back to max_cstate=1.

Yes, I spoke too soon. Only processor.max_cstate=1 consistently stops the freezing on mine at this stage.
Comment 65 Morten Nielsen 2012-01-28 11:03:24 UTC
I just reinstalled my laptop with the latest debian testing, and the problem remains.
The acpi_skip_timer_override option is still necessary.

Looking at powertop, hrtimer_wakeups are at 20-25/s. I guess that is fairly normal; it is not comparable to the 500 wakeups/s that I reported in comment 43 above.

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.1.0-1-686-pae root=/dev/mapper/leon-root ro acpi_skip_timer_override
Comment 66 Andreas Nordal 2012-05-19 00:42:05 UTC
Have we done parallel investigations?
https://bugzilla.novell.com/show_bug.cgi?id=579932

If you think we are experiencing the same problem, I think my fellow bughunters would like to join forces with you.
Comment 67 Andreas Nordal 2012-05-23 18:53:55 UTC
My test of ancient distros (as interpreted by Nik Swiridow at openSUSE) indicated that random hangs appeared with the introduction of dyn-ticks:
https://bugzilla.novell.com/show_bug.cgi?id=579932#c82

In other words, no bug was ever introduced per se; dyn-ticks just never worked (at least on my laptop).

Well, it works for the vast majority of timer interrupts, it's just that some go unnoticed. Note that some kernels hang vastly easier than others, making this very hard to test. In my tests, I used a short-sleeping task (rt-benchmark) to provoke hangs (audio playback is also effective). I can reliably hang some kernels (e.g. 2.6.35) quicker than humanly observable (instant indefinite kernel hang without root = LOL), whereas with 2.6.32, you would never notice anything wrong during normal desktop usage (but it eventually hung a few hundred seconds on the bench). Linux 3.1 feels kinda-good, but not like 2.6.32.

My laptop:
Multicom Compal JFL92+
Intel Core 2 Duo T8100
Phoenix BIOS v1.16

Elmar Stellnberger is having random hangs with this:
Fujitsu Siemens Amilo Xi 2550
Intel Core 2 Duo T9300
Phoenix BIOS v1.15

Aaron Burgemeister only has hangs during boot, maybe related to bug 15289. His hardware:
HP Pavilion dv6700 Notebook PC
AMD Turion(tm) 64 X2 Mobile Technology TL-60
Hewlett-Packard BIOS version F.25 (released 2007-11-29)

What bioses do people have here? Has anyone tried updating their bios?
(I didn't make it because installing DOS was too difficult)
Comment 68 Gabriele Svelto 2012-05-24 06:53:51 UTC
I'm on the latest BIOS from Fujitsu (1.15c) but it didn't solve the problem (and the MP-BIOS bug: 8254 timer not connected to IO-APIC message still appears though I don't know if the BIOS bug is the cause of the freezes). acpi_skip_timer_override mostly solves it but I can still run in the odd freeze. The only robust way to prevent the freezes is to use either nohz=off or processor.max_cstate=1. Now on kernel 3.3.6-3.fc16.i686.PAE (Fedora 16).
Comment 69 Andreas Nordal 2012-05-25 07:54:10 UTC
Created attachment 73388 [details]
dmidecode from a Multicom Compal JFL92+
Comment 70 Morten Nielsen 2012-05-25 22:48:35 UTC
As requested, I have included my BIOS info from dmidecode below.

As I said in comment #65 above, I still need to use acpi_skip_timer_override.

I recently checked lenovos homepage, and they did not have a BIOS update available. Just for completeness, I still get a kernel error saying "MP-BIOS bug: 8254 timer not connected to IO-APIC".

Since I am not into the details of how the kernel handles timers, I don't know what kind of info might be relevant. Please suggest stuff to post.

--- 
BIOS Information
	Vendor: LENOVO
	Version: 19CN21WW
	Release Date: 07/17/2009
	Address: 0xE71C0
	Runtime Size: 101952 bytes
	ROM Size: 1024 kB
	Characteristics:
		PCI is supported
		PC Card (PCMCIA) is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		ESCD support is available
		Boot from CD is supported
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
	BIOS Revision: 1.12
	Firmware Revision: 3.30
Comment 71 Elmar Stellnberger 2013-02-11 15:48:58 UTC
confirmed for kernel 3.7.6-1.2-desktop (FS Amilo Xi-2550).
Comment 72 Elmar Stellnberger 2013-02-11 15:50:45 UTC
 Note that this is NOT a BIOS issue.
"... C2 is the 2nd idle state. The external I/O Controller Hub blocks interrupts
to the processor. And so on with C3, C4, etc. I'll discuss this further down in
this paper. By the way, there is nothing preventing the OS from busy waiting in
its idle state, and thus keeping the processor in C0, as did older operating
systems. ... "
http://software.intel.com/en-us/blogs/2008/03/27/update-c-states-c-states-and-even-more-c-states/
Comment 73 Elmar Stellnberger 2013-02-11 16:14:35 UTC
... you will need to busy wait with at least one core on any Intel Core 2 Duo system if there are pending timers.
  "C1 is the first idle state. The clock running to the processor is gated, i.e. the clock is prevented from reaching the core, effectively shutting it down in an operational sense. "
  ... or perhaps use the APIC timer to wake up at a coarser granularity.
Comment 74 Andreas Nordal 2013-03-01 00:45:40 UTC
Consistent with others here, I independently concluded that:
* processor.max_cstate=1 works
* processor.max_cstate=2 does not

(In reply to comment #72)
> C2 is the 2nd idle state. The external I/O Controller Hub blocks
> interrupts to the processor.
Nice finding, Elmar! HPET uses interrupts (according to wikipedia), so based on this info, HPET should not work in C2. But then I don't get why other timers would work in C2 either… Except we know there must be different levels of interrupts or something, since the kind of interrupt coming from user interaction works.

I doubt busy waiting (C0) is necessary; experience says C1 works. I would say:
* At least one core of an Intel Core 2 Duo needs to be in CC1 or CC0 whenever HPET is the only timer with a pending interrupt. Otherwise, the processor sleeps indefinitely.
Comment 75 Jason Bucata 2014-07-07 06:39:59 UTC
I believe I've also been encountering the same problem with dynticks.  I've got a Gigabyte AMD motherboard GA-MA78GPM with (if I'm reading the manual right) an AMD 780G chipset.

I recently lost my old kernel compilation history in a hard drive failure but I remember struggling with dynticks and another related feature (maybe having to do with ACPI, but I can't swear to it now).  Dynticks has never worked for me since the feature was released.  Since I compile my own kernels I just disabled both of them and get on with my life.  Now recently it became an issue for me because (due to the aforementioned hard drive failure) I was booted into a rescue CD that had a kernel with dynticks enabled.  In order to get stuff done I had to constantly move the mouse or tap the Shift key to generate interrupts to get things to actually finish.

I see some kernel parameters to try at boot time from this thread.  In the next few days I'll give them a try and see what I can find, if I can confirm that I'm seeing the very same issue here.

I see this bug has been quiet for over a year.  I'm hoping we can finally get this thing fixed, since now it appears that it might affect a lot of unsuspecting users who don't compile their own kernels.

As I said, I compile my own kernels so I can try configurations and test patches and stuff.  I've become motivated to get this thing squashed. :D
Comment 76 Jason Bucata 2014-07-07 07:14:18 UTC
This may be old news, but net searching brought me back around to bug 13053.  Apparently that bug was fixed for the OP with a BIOS update.
Comment 77 Elmar Stellnberger 2014-07-07 16:14:41 UTC
Concerning me, I have experienced this bug on multiple machines all of them provided with the latest BIOS. However I have run out of time and resources and could no more continue my testing effort on this bug. Nonetheless it is somehow possible to live with that bug when certain command line options are used. This bug has a long history. You may also find some interesting material at: https://bugzilla.novell.com/show_bug.cgi?id=579932.