Bug 9355

Summary: CPU_IDLE=y NO_HZ=y hangs at boot - HP nx6125
Product: ACPI Reporter: Dino Leonardo Sangoi (dino)
Component: Power-ProcessorAssignee: Venkatesh Pallipadi (venki)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, akpm, john.stultz, rjw, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9243    
Attachments: 'acpidump -b' output (binary form)
'acpidump' output (hex dump format)
.config used to build kernel 2.6.24-rc2
Full dmesg output for kernel 2.6.23.
Forward porting Ingo's timer correctionpatch to cpuidle
dmesg after booting unpatched kernel 2.6.24-rc2 using "nohz=off" parameter
commands requested on comment #14 after booting unpatched kernel 2.6.24-rc2 with "nohz=off" parameter
output of commands requested in comment #14 on kernel 2.6.24-rc2 CPU_IDLE=n and 2.6.23
forwardport_no_bmcontrol_fix_02.patch
forwardport_ingo_timer_changes_02.patch
bug9355_fix_02.patch

Description Dino Leonardo Sangoi 2007-11-12 02:37:35 UTC
Most recent kernel where this bug did not occur: 2.6.23
Distribution: Fedora Rawhide (almost F8)
Hardware Environment: Laptop HP nx6125
Problem Description:
see https://bugzilla.redhat.com/show_bug.cgi?id=312201 and http://lkml.org/lkml/2007/11/9/24 for hardware info, boot logs etcetera.
I'm creating this bug to add some data requested, without spamming lkml. I will also close the Fedora bug as UPSTREAM, trying to keep all info together.

Another interesting info is that sometimes while booting 2.6.24-rc2, I get:

Marking TSC unstable due to: TSC halts in idle.
Time: acpi_pm clocksource has been installed.

Then the kernel halts for a while (seconds or minutes), and then continues. But if I press a key on the keyboard the system continues the boot immediately (as if it has been waiting in idle, and is restarted by an hardware interrupt).

Unfortunately the patch sent by Venki Pallipadi (http://lkml.org/lkml/2007/11/9/263) doesn't work.

I will attach to this bug the acpi dump (btw, using pmtools-20070714 I get: "ACPI tables were not found. If you know location of RSD PTR table (from dmesg, etc), supply it with either --addr or -a option", so the dump is generated by pmtools-20070511) and the .config used.
Comment 1 Dino Leonardo Sangoi 2007-11-12 02:38:38 UTC
Created attachment 13506 [details]
'acpidump -b' output (binary form)
Comment 2 Dino Leonardo Sangoi 2007-11-12 02:40:43 UTC
Created attachment 13507 [details]
'acpidump' output (hex dump format)
Comment 3 Dino Leonardo Sangoi 2007-11-12 02:41:16 UTC
Created attachment 13508 [details]
.config used to build kernel 2.6.24-rc2
Comment 4 Andrew Morton 2007-11-12 02:45:45 UTC
So we have two separate bugs here.

Is the time/clocksource instability/stall also a regression from 2.6.23?
Comment 5 Dino Leonardo Sangoi 2007-11-12 02:53:46 UTC
With Fedora kernel 2.6.23-0.197.rc7.git5.fc8 I get:

Marking TSC unstable due to: TSC halts in C3.

I'm now building again a vanilla 2.6.23 and will report what it says.
Comment 6 Anonymous Emailer 2007-11-12 03:04:55 UTC
Reply-To: akpm@linux-foundation.org

On Mon, 12 Nov 2007 02:53:46 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> With Fedora kernel 2.6.23-0.197.rc7.git5.fc8 I get:
> 
> Marking TSC unstable due to: TSC halts in C3.

OK.  Does 2.6.23 also have the long delay after this?
Comment 7 Dino Leonardo Sangoi 2007-11-12 04:28:46 UTC
Created attachment 13509 [details]
Full dmesg output for kernel 2.6.23.

Ok, vanilla 2.6.23 says:

Marking TSC unstable due to: TSC halts in C3.
Time: acpi_pm clocksource has been installed.

with no delay.

I'm attaching the full boot log (easy to get, as this kernel boots fine :-)) just in case somebody needs it.
Comment 8 Venkatesh Pallipadi 2007-11-12 10:06:15 UTC
I am confused a bit, as I don't see the mention of CPU_IDLE here.
Just to double check:
The above hang in udev happens only when CPU_IDLE is configured and things work fine when it is not configured?
Comment 9 Dino Leonardo Sangoi 2007-11-12 10:23:31 UTC
Yes:

- On 2.6.24-rc2 (see attachment named ".config used to build kernel 2.6.24-rc2"), compiling with CONFIG_CPU_IDLE=y, kernel hangs. Simply switching CONFIG_CPU_IDLE to 'n' and rebuilding gives my a working kernel. I said this in the original report to lkml (http://lkml.org/lkml/2007/11/9/24), but probably I was not clear enough (my english is not so good).
- on 2.6.23 there's no CONFIG_CPU_IDLE option :-)
Comment 10 Venkatesh Pallipadi 2007-11-12 11:07:18 UTC
OK. We can keep the timer guys off the hook until we figure out the problem with cpuidle or that this is not a cpuidle problem.

Can you try to boot rc2 with CPU_IDLE and adding
processor.max_cstate=1
boot option in grub. Atleast that will tell us the problem is only with C3 state and C1 is working fine.
Comment 11 Venkatesh Pallipadi 2007-11-12 11:47:07 UTC
Created attachment 13518 [details]
Forward porting Ingo's timer correctionpatch to cpuidle


Can you try the attched patch and see whether this makes any difference. This is a patch that fixes timer part on C3 exit and it was missing in CPU_IDLE path. This patch can be tried over the previous patch I had or without it. I would suggest trying over the previous one I sent on lkml.
Comment 12 Dino Leonardo Sangoi 2007-11-13 02:04:00 UTC
unpatched kernel 2.6.24-rc2, CPU_IDLE=y, with 'processor.max_cstate=1' boots fine.

I have tried your patch "Forward porting Ingo's timer correctionpatch to cpuidle" applying to a plain 2.6.24-rc2 and over your previous patch, and either way the kernel hangs.

I don't know if this can be useful (probably not) but I have tried also sticking a printk() of pr->flags.bm_check and pr->flags.bm_control and both seems to be always equal to 1 (at least until it hangs). 
Comment 13 Venkatesh Pallipadi 2007-11-13 11:37:54 UTC
Hmmm.. I am thoroughly confused now. With the above two patches, code executed on C3 is very identical with or without CPU_IDLE.

Can you try few more things that can give some lead on this.
boot CPU_IDLE kernel with boot options
1) "highres=off"
2) "nohz=off"
3) "highres=off" "nohz=off"

4) One other thing if possible, Change the rating from 20 to 5 in file /drivers/cpuidle/governors/menu.c and try (without any above off options.

Thanks.
Comment 14 Venkatesh Pallipadi 2007-11-13 15:02:05 UTC
And one more bit of information that can help here.
With CPU_IDLE disabled and once system boots provide the output of the following

cat /proc/acpi/processor/*/power
cat /proc/interrupts
sleep 10
cat /proc/acpi/processor/*/power
cat /proc/interrupts
Comment 15 Dino Leonardo Sangoi 2007-11-14 02:18:46 UTC
Ok: 
- 2.6.24-rc2 unpatched:
  - highres=off -> stops after "Marking TSC as unstable...", hangs while running udev (it behaves as with no parameters).
  - nohz=off -> does NOT hangs after "Marking TSC as unstable...", and boots fine (dmesg attached as 2.6.24-rc2.nohz-off.dmesg, power dump as requested in comment #14 attached as 2.6.24-rc2.nohz-off.dumppower).
  - highres=off nohz=off -> as with nohz=off only.

setting .rating = 5 in menu.c and booting without parameters hangs. using nohz=off somtimes hangs after 'Marking TSC as unstable..." but this time pressing a key does not makes the boot restart... I'm doing a clean build right now to rule out problems on my side, I will report in a while.
Comment 16 Dino Leonardo Sangoi 2007-11-14 02:19:57 UTC
Created attachment 13538 [details]
dmesg after booting unpatched kernel 2.6.24-rc2 using "nohz=off" parameter
Comment 17 Dino Leonardo Sangoi 2007-11-14 02:21:04 UTC
Created attachment 13539 [details]
commands requested on comment #14 after booting unpatched kernel 2.6.24-rc2 with "nohz=off" parameter
Comment 18 Dino Leonardo Sangoi 2007-11-14 05:16:15 UTC
Ok, after a rebuild from scratch, I can say that:
- 2.6.24-rc2 with .rating = 5, with no boot parameters: nearly half times it goes on after "Marking TSC unstable..." without user intervention, but halts while running udev. The other half it prints: 

Marking TSC unstable due to: RSC halts in idle.
Time: acpi_pm clocksource has been installed.
Creating initial device nodes
Settin up hotplug.
Synaptics Touchpad, model: 1, fw: 6.2, id: 0x1a0b1, caps: 0xa04713/0x200000
input: SynPS/2 Synaptics TouchPad as /class/input/input2

Then halts, and even if I press some keys, the boot keeps stalled. But the keys I press can be seen on the screen.

- 2.6.24-rc2 with .rating = 5, with "nohz=off" parameter behaves as without it: sometimes halts soon after "marking TSC unstable...", every time halts in udev.

- 2.6.24-rc2 with the default .rating = 20, without parameters, after booting 10 times, everytime it halts after "Marking...", but pressing a key everytime restarts the boot. Then halts at udev. 
- 2.6.24-rc2 with the default .rating = 20, with "nohz=off" always boots fine.

Should I try a kernel with .rating=5 and both your patches applied?
Comment 19 Venkatesh Pallipadi 2007-11-14 05:31:11 UTC
OK. Thanks for all the data. Will get back if any more experiments are required. Let me look at all this data and try to figure out what is going on here..
Comment 20 Venkatesh Pallipadi 2007-11-14 12:08:28 UTC
nohz=off is making cpuidle disable C3 state and the result is same as processor.max_cstate=1. In both cases system boots fine.

The problem here seems to be C3 state in this platform. For some reason it is not waking up from this C3 state on timer interrupt.

Can you provide the data from comment #14. That is with rc2 and CPU_IDLE disabled in config and no nohz or highres option. That will tell us if C3 is really getting used in that case and how the system is working there.

And also same data with rc2 and CPU_IDLE enabled and again no nohz and highres option. This time system hangs. use your key pressing workaround to boot the system up and then get the same data as in comment #14.

Thanks for providing all these data. Hopefully we should be able to figure out something after this round.
Comment 21 Dino Leonardo Sangoi 2007-11-15 01:40:53 UTC
Created attachment 13559 [details]
output of commands requested in comment #14 on kernel 2.6.24-rc2 CPU_IDLE=n and 2.6.23

Exactly! C3 seems not used on kernel 2.6.24-rc2 configured with CPU_IDLE=n, and booted without parameters. kernel 2.6.23 seems to use C3 just fine.

I'm attaching a file containing the output of commands in comment #14, both on kernel 2.6.23 and on kernel 2.6.24-rc2.

I have also tried booting 2.6.24-rc2 (both CPU_IDLE=n and CPU_IDLE=y) with nohz=off parameter, and I never seen C3 being used.
Comment 22 Len Brown 2007-11-16 11:51:14 UTC
acpidump output shows that this box does not support _CST:

FACP.dsl:
[05Fh 095  1]                 _CST Support : 00
[060h 096  2]                   C2 Latency : 03E7
[062h 098  2]                   C3 Latency : 000A

it shows that C2 latency is 999 (which is > 100, and thus not supported)
it shows that C3 latency is 10 (which is < 1000, and thus legal).

So we have the strange situation of no C2, but yes to C3,
and C3 reports a mysteriously low 10usec latency.

I wonder what C-states windows uses on this box...
Comment 23 Venkatesh Pallipadi 2007-11-19 16:34:30 UTC
Dino,

More patches to test. Hopefully this will be the last time. I worked with
Alexey offline to root cause this one and below are the three patches for
cpuidle needed ontop of 2.6.24-rc. Please test it and let me know. Two of the
three patches are cleaned up versions of the patch I had provided earlier. So,
just apply these three patches in order over 2.6.24-rc.
Comment 24 Venkatesh Pallipadi 2007-11-19 16:36:34 UTC
Created attachment 13634 [details]
forwardport_no_bmcontrol_fix_02.patch
Comment 25 Venkatesh Pallipadi 2007-11-19 16:37:27 UTC
Created attachment 13635 [details]
forwardport_ingo_timer_changes_02.patch
Comment 26 Venkatesh Pallipadi 2007-11-19 16:37:59 UTC
Created attachment 13636 [details]
bug9355_fix_02.patch
Comment 27 Len Brown 2007-11-19 18:35:15 UTC
pacthes in comment #24 #25 and #26 applied to cpuidle branch
Comment 28 Dino Leonardo Sangoi 2007-11-20 08:26:56 UTC
(In reply to comment #22)
> acpidump output shows that this box does not support _CST:
> 
> FACP.dsl:
> [05Fh 095  1]                 _CST Support : 00
> [060h 096  2]                   C2 Latency : 03E7
> [062h 098  2]                   C3 Latency : 000A
> 
> it shows that C2 latency is 999 (which is > 100, and thus not supported)
> it shows that C3 latency is 10 (which is < 1000, and thus legal).
> 
> So we have the strange situation of no C2, but yes to C3,
> and C3 reports a mysteriously low 10usec latency.
> 
> I wonder what C-states windows uses on this box...
> 

Well, until kernel 2.6.23 linux used to hit C3 (and obviously never C2).

If you are interested, I have windows on this laptop, so if there is a way to test the C-states, let me know.

Anyways: applying the patches to 2.6.24-rc2, compiling with CPU_IDLE=y, NO_HZ=y, HIGH_RES_TIMERS=y and booting without parameters works, so it fixes my problem, Thanks!
Comment 29 Len Brown 2007-12-03 22:51:24 UTC
shipped in linux-2.6.24-rc4
closed