Most recent kernel where this bug did not occur: 2.6.23 Distribution: Fedora Rawhide (almost F8) Hardware Environment: Laptop HP nx6125 Problem Description: see https://bugzilla.redhat.com/show_bug.cgi?id=312201 and http://lkml.org/lkml/2007/11/9/24 for hardware info, boot logs etcetera. I'm creating this bug to add some data requested, without spamming lkml. I will also close the Fedora bug as UPSTREAM, trying to keep all info together. Another interesting info is that sometimes while booting 2.6.24-rc2, I get: Marking TSC unstable due to: TSC halts in idle. Time: acpi_pm clocksource has been installed. Then the kernel halts for a while (seconds or minutes), and then continues. But if I press a key on the keyboard the system continues the boot immediately (as if it has been waiting in idle, and is restarted by an hardware interrupt). Unfortunately the patch sent by Venki Pallipadi (http://lkml.org/lkml/2007/11/9/263) doesn't work. I will attach to this bug the acpi dump (btw, using pmtools-20070714 I get: "ACPI tables were not found. If you know location of RSD PTR table (from dmesg, etc), supply it with either --addr or -a option", so the dump is generated by pmtools-20070511) and the .config used.
Created attachment 13506 [details] 'acpidump -b' output (binary form)
Created attachment 13507 [details] 'acpidump' output (hex dump format)
Created attachment 13508 [details] .config used to build kernel 2.6.24-rc2
So we have two separate bugs here. Is the time/clocksource instability/stall also a regression from 2.6.23?
With Fedora kernel 2.6.23-0.197.rc7.git5.fc8 I get: Marking TSC unstable due to: TSC halts in C3. I'm now building again a vanilla 2.6.23 and will report what it says.
Reply-To: akpm@linux-foundation.org On Mon, 12 Nov 2007 02:53:46 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > With Fedora kernel 2.6.23-0.197.rc7.git5.fc8 I get: > > Marking TSC unstable due to: TSC halts in C3. OK. Does 2.6.23 also have the long delay after this?
Created attachment 13509 [details] Full dmesg output for kernel 2.6.23. Ok, vanilla 2.6.23 says: Marking TSC unstable due to: TSC halts in C3. Time: acpi_pm clocksource has been installed. with no delay. I'm attaching the full boot log (easy to get, as this kernel boots fine :-)) just in case somebody needs it.
I am confused a bit, as I don't see the mention of CPU_IDLE here. Just to double check: The above hang in udev happens only when CPU_IDLE is configured and things work fine when it is not configured?
Yes: - On 2.6.24-rc2 (see attachment named ".config used to build kernel 2.6.24-rc2"), compiling with CONFIG_CPU_IDLE=y, kernel hangs. Simply switching CONFIG_CPU_IDLE to 'n' and rebuilding gives my a working kernel. I said this in the original report to lkml (http://lkml.org/lkml/2007/11/9/24), but probably I was not clear enough (my english is not so good). - on 2.6.23 there's no CONFIG_CPU_IDLE option :-)
OK. We can keep the timer guys off the hook until we figure out the problem with cpuidle or that this is not a cpuidle problem. Can you try to boot rc2 with CPU_IDLE and adding processor.max_cstate=1 boot option in grub. Atleast that will tell us the problem is only with C3 state and C1 is working fine.
Created attachment 13518 [details] Forward porting Ingo's timer correctionpatch to cpuidle Can you try the attched patch and see whether this makes any difference. This is a patch that fixes timer part on C3 exit and it was missing in CPU_IDLE path. This patch can be tried over the previous patch I had or without it. I would suggest trying over the previous one I sent on lkml.
unpatched kernel 2.6.24-rc2, CPU_IDLE=y, with 'processor.max_cstate=1' boots fine. I have tried your patch "Forward porting Ingo's timer correctionpatch to cpuidle" applying to a plain 2.6.24-rc2 and over your previous patch, and either way the kernel hangs. I don't know if this can be useful (probably not) but I have tried also sticking a printk() of pr->flags.bm_check and pr->flags.bm_control and both seems to be always equal to 1 (at least until it hangs).
Hmmm.. I am thoroughly confused now. With the above two patches, code executed on C3 is very identical with or without CPU_IDLE. Can you try few more things that can give some lead on this. boot CPU_IDLE kernel with boot options 1) "highres=off" 2) "nohz=off" 3) "highres=off" "nohz=off" 4) One other thing if possible, Change the rating from 20 to 5 in file /drivers/cpuidle/governors/menu.c and try (without any above off options. Thanks.
And one more bit of information that can help here. With CPU_IDLE disabled and once system boots provide the output of the following cat /proc/acpi/processor/*/power cat /proc/interrupts sleep 10 cat /proc/acpi/processor/*/power cat /proc/interrupts
Ok: - 2.6.24-rc2 unpatched: - highres=off -> stops after "Marking TSC as unstable...", hangs while running udev (it behaves as with no parameters). - nohz=off -> does NOT hangs after "Marking TSC as unstable...", and boots fine (dmesg attached as 2.6.24-rc2.nohz-off.dmesg, power dump as requested in comment #14 attached as 2.6.24-rc2.nohz-off.dumppower). - highres=off nohz=off -> as with nohz=off only. setting .rating = 5 in menu.c and booting without parameters hangs. using nohz=off somtimes hangs after 'Marking TSC as unstable..." but this time pressing a key does not makes the boot restart... I'm doing a clean build right now to rule out problems on my side, I will report in a while.
Created attachment 13538 [details] dmesg after booting unpatched kernel 2.6.24-rc2 using "nohz=off" parameter
Created attachment 13539 [details] commands requested on comment #14 after booting unpatched kernel 2.6.24-rc2 with "nohz=off" parameter
Ok, after a rebuild from scratch, I can say that: - 2.6.24-rc2 with .rating = 5, with no boot parameters: nearly half times it goes on after "Marking TSC unstable..." without user intervention, but halts while running udev. The other half it prints: Marking TSC unstable due to: RSC halts in idle. Time: acpi_pm clocksource has been installed. Creating initial device nodes Settin up hotplug. Synaptics Touchpad, model: 1, fw: 6.2, id: 0x1a0b1, caps: 0xa04713/0x200000 input: SynPS/2 Synaptics TouchPad as /class/input/input2 Then halts, and even if I press some keys, the boot keeps stalled. But the keys I press can be seen on the screen. - 2.6.24-rc2 with .rating = 5, with "nohz=off" parameter behaves as without it: sometimes halts soon after "marking TSC unstable...", every time halts in udev. - 2.6.24-rc2 with the default .rating = 20, without parameters, after booting 10 times, everytime it halts after "Marking...", but pressing a key everytime restarts the boot. Then halts at udev. - 2.6.24-rc2 with the default .rating = 20, with "nohz=off" always boots fine. Should I try a kernel with .rating=5 and both your patches applied?
OK. Thanks for all the data. Will get back if any more experiments are required. Let me look at all this data and try to figure out what is going on here..
nohz=off is making cpuidle disable C3 state and the result is same as processor.max_cstate=1. In both cases system boots fine. The problem here seems to be C3 state in this platform. For some reason it is not waking up from this C3 state on timer interrupt. Can you provide the data from comment #14. That is with rc2 and CPU_IDLE disabled in config and no nohz or highres option. That will tell us if C3 is really getting used in that case and how the system is working there. And also same data with rc2 and CPU_IDLE enabled and again no nohz and highres option. This time system hangs. use your key pressing workaround to boot the system up and then get the same data as in comment #14. Thanks for providing all these data. Hopefully we should be able to figure out something after this round.
Created attachment 13559 [details] output of commands requested in comment #14 on kernel 2.6.24-rc2 CPU_IDLE=n and 2.6.23 Exactly! C3 seems not used on kernel 2.6.24-rc2 configured with CPU_IDLE=n, and booted without parameters. kernel 2.6.23 seems to use C3 just fine. I'm attaching a file containing the output of commands in comment #14, both on kernel 2.6.23 and on kernel 2.6.24-rc2. I have also tried booting 2.6.24-rc2 (both CPU_IDLE=n and CPU_IDLE=y) with nohz=off parameter, and I never seen C3 being used.
acpidump output shows that this box does not support _CST: FACP.dsl: [05Fh 095 1] _CST Support : 00 [060h 096 2] C2 Latency : 03E7 [062h 098 2] C3 Latency : 000A it shows that C2 latency is 999 (which is > 100, and thus not supported) it shows that C3 latency is 10 (which is < 1000, and thus legal). So we have the strange situation of no C2, but yes to C3, and C3 reports a mysteriously low 10usec latency. I wonder what C-states windows uses on this box...
Dino, More patches to test. Hopefully this will be the last time. I worked with Alexey offline to root cause this one and below are the three patches for cpuidle needed ontop of 2.6.24-rc. Please test it and let me know. Two of the three patches are cleaned up versions of the patch I had provided earlier. So, just apply these three patches in order over 2.6.24-rc.
Created attachment 13634 [details] forwardport_no_bmcontrol_fix_02.patch
Created attachment 13635 [details] forwardport_ingo_timer_changes_02.patch
Created attachment 13636 [details] bug9355_fix_02.patch
pacthes in comment #24 #25 and #26 applied to cpuidle branch
(In reply to comment #22) > acpidump output shows that this box does not support _CST: > > FACP.dsl: > [05Fh 095 1] _CST Support : 00 > [060h 096 2] C2 Latency : 03E7 > [062h 098 2] C3 Latency : 000A > > it shows that C2 latency is 999 (which is > 100, and thus not supported) > it shows that C3 latency is 10 (which is < 1000, and thus legal). > > So we have the strange situation of no C2, but yes to C3, > and C3 reports a mysteriously low 10usec latency. > > I wonder what C-states windows uses on this box... > Well, until kernel 2.6.23 linux used to hit C3 (and obviously never C2). If you are interested, I have windows on this laptop, so if there is a way to test the C-states, let me know. Anyways: applying the patches to 2.6.24-rc2, compiling with CPU_IDLE=y, NO_HZ=y, HIGH_RES_TIMERS=y and booting without parameters works, so it fixes my problem, Thanks!
shipped in linux-2.6.24-rc4 closed