Bug 20722 - intel_idle boot hang if C-states disabled in BIOS, NO_HZ=n CONFIG_HIGH_RES_TIMERS=n - Xeon X5550 - IBM System x3650 M2
Summary: intel_idle boot hang if C-states disabled in BIOS, NO_HZ=n CONFIG_HIGH_RES_TI...
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: power-management_intel_idle@kernel-bugs.osdl.org
URL:
Keywords:
Duplicates: 20002
Depends on:
Blocks:
 
Reported: 2010-10-18 20:07 UTC by Marc Aurele La France
Modified: 2012-06-05 04:08 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.35
Subsystem:
Regression: No
Bisected commit-id:


Attachments
acpidump output (identical in all cases) (76.21 KB, text/plain)
2010-10-18 20:23 UTC, Marc Aurele La France
/proc/cpuinfo (identical in all cases modulo speed & bogomips) (6.40 KB, text/plain)
2010-10-18 20:26 UTC, Marc Aurele La France
lspci output (identical in all cases) (3.88 KB, text/plain)
2010-10-18 20:28 UTC, Marc Aurele La France
grep output (2.6.36-rc8 INTEL_IDLE=n) (74 bytes, text/plain)
2010-10-18 20:31 UTC, Marc Aurele La France
grep output (2.6.36-rc8 INTEL_IDLE=y max_cstate=1) (5.38 KB, text/plain)
2010-10-18 20:33 UTC, Marc Aurele La France
dmesg (2.6.36-rc8 INTEL_IDLE=y max_cstat=1) (59.40 KB, application/octet-stream)
2010-10-18 20:34 UTC, Marc Aurele La France
dmesg (2.6.36-rc8 INTEL_IDLE=y max_cstate=1) (59.40 KB, text/plain)
2010-10-18 20:46 UTC, Marc Aurele La France
.config for 2.6.36-rc8 INTEL_IDLE=n (51 bytes, text/plain)
2010-10-19 20:14 UTC, Marc Aurele La France
requested dmesg (processor.max_cstate=7) (59.08 KB, text/plain)
2010-10-20 14:05 UTC, Marc Aurele La France
.config for comment #15 (51 bytes, text/plain)
2010-10-20 19:37 UTC, Marc Aurele La France
C-state disabled MSRs (64.10 KB, text/plain)
2010-11-29 16:57 UTC, Marc Aurele La France
C-state enabled MSRs (63.95 KB, text/plain)
2010-11-29 16:57 UTC, Marc Aurele La France
3.6.37 .config diff (3.81 KB, text/plain)
2011-02-17 17:01 UTC, Marc Aurele La France

Description Marc Aurele La France 2010-10-18 20:07:27 UTC
+++ This bug was initially created as a clone of Bug #20002 +++
(... as it is not a regression against 2.6.34)
On Thu, 7 Oct 2010, Marc Aurele La France wrote:

> I administer a cluster composed of a mixture of various Opteron models and
> Intel Xeon X5550's.  The 2.6.34.*, and prior, kernels run fine on all of
> them.  The 2.6.35 series also runs fine on the Opterons, but not on the
> Xeon's.  All of these are CONFIG_GENERIC_CPU kernels.

> On the Xeon's, 2.6.35 hangs early on, upon the first test of trace events
> (in kernel/trace/trace_events.c:event_trace_self_tests()).  When disabling
> all tracing, debugging, etc., it still hangs but slightly later.  The
> megaraid_sas module is loaded, detects the adapter, but never gets around
> to registering it with the SCSI layer.

> Core2-specific kernels also hang the same way, as do UP kernels.  I've
> tried backing out certain commits that seemed likely candidates, but have
> yet to stumble upon the one (or more) that is causing this.

> Does anyone have any ideas?

This is due to "CONFIG_INTEL_IDLE=y".  With "m" or "n", the hang doesn't occur.

Of the kernels I've tested, INTEL_IDLE first appears in 2.6.34-git15.  So,
technically, this is not a regression against 2.6.34.
Comment 1 Marc Aurele La France 2010-10-18 20:10:29 UTC
On Sat, 16 Oct 2010, Len Brown wrote:

> > > On the Xeon's, 2.6.35 hangs early on, upon the first test of trace events
> > > (in kernel/trace/trace_events.c:event_trace_self_tests()).  When
> > > disabling all tracing, debugging, etc., it still hangs but slightly
> > > later.  The megaraid_sas module is loaded, detects the adapter, but never
> > > gets around to registering it with the SCSI layer.

> > This is due to "CONFIG_INTEL_IDLE=y".

> Please file a bug report at bugzilla.kernel.org and assign it to me.

> Please reproduce using an upstream 2.6.36-rc8 kernel.

> Boot a CONFIG_INTEL_IDLE=n kernel and to the bug report...

> attach the output from acpidump
> 'cat /proc/cpuinfo'
> 'grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*'
> 'lspci'

> Then boot a CONFIG_INTEL_IDLE=y kernel and see what is the highest N that
> boots when you boot with "intel_idle.max_cstate=N"  (0 will disable
> the driver completely) and if any of them boot, for the highest N,
> attach to the bug report the complete dmesg and the output from
> 'grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*'
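
For anyone collecting the same data from a script rather than a shell, the grep invocation requested above can be approximated as follows. This is only a minimal sketch and not part of the original exchange: the function name dump_cpuidle and its root parameter are made up for illustration, and it assumes the usual sysfs layout of cpuN/cpuidle/stateM/attribute.

```python
import glob
import os


def dump_cpuidle(root="/sys/devices/system/cpu"):
    """Return 'path:value' lines, roughly like `grep . cpu*/cpuidle/*/*`."""
    lines = []
    pattern = os.path.join(root, "cpu*", "cpuidle", "*", "*")
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as handle:
                text = handle.read()
        except OSError:
            continue  # some sysfs attributes may not be readable
        for row in text.splitlines():
            if row:  # `grep .` only prints non-empty lines
                lines.append(f"{path}:{row}")
    return lines
```

An empty result corresponds to the "No such file or directory" case: no idle driver has registered any states with cpuidle.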
Comment 2 Marc Aurele La France 2010-10-18 20:23:13 UTC
Created attachment 33992 [details]
acpidump output (identical in all cases)
Comment 3 Marc Aurele La France 2010-10-18 20:26:51 UTC
Created attachment 34002 [details]
/proc/cpuinfo (identical in all cases modulo speed & bogomips)
Comment 4 Marc Aurele La France 2010-10-18 20:28:41 UTC
Created attachment 34012 [details]
lspci  output (identical in all cases)
Comment 5 Marc Aurele La France 2010-10-18 20:31:18 UTC
Created attachment 34022 [details]
grep output (2.6.36-rc8 INTEL_IDLE=n)
Comment 6 Marc Aurele La France 2010-10-18 20:33:36 UTC
Created attachment 34032 [details]
grep output (2.6.36-rc8 INTEL_IDLE=y max_cstate=1)
Comment 7 Marc Aurele La France 2010-10-18 20:34:42 UTC
Created attachment 34042 [details]
dmesg (2.6.36-rc8 INTEL_IDLE=y max_cstat=1)
Comment 8 Marc Aurele La France 2010-10-18 20:35:36 UTC
2.6.36-rc8 INTEL_IDLE=y max_cstate=2 hangs as described above
Comment 9 Marc Aurele La France 2010-10-18 20:46:24 UTC
Created attachment 34052 [details]
dmesg (2.6.36-rc8 INTEL_IDLE=y max_cstate=1)
Comment 10 Rafael J. Wysocki 2010-10-18 21:34:02 UTC
*** Bug 20002 has been marked as a duplicate of this bug. ***
Comment 11 Len Brown 2010-10-18 22:11:52 UTC
re: comment #5

the ACPI baseline case with CONFIG_INTEL_IDLE=n ...

> grep: /sys/devices/system/cpu/cpu*/cpuidle/*/*: No such file or directory

please grep CONFIG_ACPI_PROCESSOR .config

if it is =m, try 'modprobe processor' or try =y.

what do you see with:

cat /proc/acpi/processor/*/power
Comment 12 Len Brown 2010-10-18 22:17:37 UTC
Re: comment #9 -- dmesg
> cpuidle: using governor ladder

Hmmm, haven't used that since we went tickless a few years ago.

please attach the .config
in particular, what is CONFIG_NO_HZ?
If it is =n, please try =y
and be sure that 
CPU_IDLE_GOV_MENU is enabled
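
The .config checks asked for here and in comment #11 can also be done in one pass. The following is a rough sketch, assuming the standard .config conventions (CONFIG_FOO=y, CONFIG_FOO=m, and "# CONFIG_FOO is not set"); the names read_kconfig, report and WANTED are hypothetical helpers, not anything from the kernel tree.

```python
WANTED = (
    "CONFIG_NO_HZ",
    "CONFIG_CPU_IDLE_GOV_MENU",
    "CONFIG_ACPI_PROCESSOR",
    "CONFIG_INTEL_IDLE",
)


def read_kconfig(text):
    """Map CONFIG_* symbols to their values; explicitly unset ones map to 'n'."""
    suffix = " is not set"
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            symbols[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            symbols[name] = value
    return symbols


def report(text):
    """Return the value of each symbol of interest, treating absent ones as 'n'."""
    cfg = read_kconfig(text)
    return {name: cfg.get(name, "n") for name in WANTED}
```

Calling report(open(".config").read()) shows at a glance whether the kernel is tickless and which idle driver and governor options were built in.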
Comment 13 Marc Aurele La France 2010-10-19 20:14:54 UTC
Created attachment 34172 [details]
.config for 2.6.36-rc8 INTEL_IDLE=n

This config originally hails from Red Hat's 2.6.9 modified kernel.  It has been incrementally `make oldconfig`'ed over the years.  So I'm not surprised it isn't tickless.

Anyway, a 2.6.36-rc8 kernel with INTEL_IDLE, ACPI_PROCESSOR, NO_HZ and IDLE_GOV_MENU all set to "y" runs fine regardless of max_cstates [0-7].  So, it seems INTEL_IDLE simply needs another Kconfig dependency.

As for /proc/acpi/processor/*/power, a commit you signed off on removes them.
Comment 14 Len Brown 2010-10-20 08:38:49 UTC
intel_idle does not depend on ACPI_PROCESSOR, NO_HZ, or IDLE_GOV_MENU.
When I delete them from my working system, it still boots.

Unfortunately, I've not been able to reproduce your boot hang
using your .config, as I've failed to convince it to mount root
on my Fedora 13 test box.

Please try ACPI mode, like so:
CONFIG_ACPI_PROCESSOR=y
CONFIG_NO_HZ=n
CONFIG_INTEL_IDLE=n

Attach the full dmesg and output from
'grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*'

If acpi_idle fails to boot, try "processor.max_cstate=1"
and increase until it fails.
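
The "start at 1 and increase until it fails" procedure is a simple upward search. A sketch of the bookkeeping, assuming a hypothetical boots(n) callback that reports whether the machine came up with max_cstate=n (in practice that step is a manual reboot), and assuming 7 as the upper bound tried in this report:

```python
def highest_booting_cstate(boots, limit=7):
    """Return the largest N in 1..limit that boots, stopping at the first failure.

    `boots` is a caller-supplied function (hypothetical here) that returns
    True when the kernel booted with max_cstate=N. Returns None when even
    N=1 fails.
    """
    best = None
    for n in range(1, limit + 1):
        if not boots(n):
            break
        best = n
    return best
```

With the intel_idle results from comment #8 (1 boots, 2 hangs), highest_booting_cstate(lambda n: n <= 1) returns 1.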
Comment 15 Marc Aurele La France 2010-10-20 14:05:53 UTC
Created attachment 34222 [details]
requested dmesg (processor.max_cstate=7)

This config (re: comment #14) runs fine regardless of processor.max_cstate [0-7].

There are no /sys/devices/system/cpu/cpu*/cpuidle/*/*.
Comment 16 Len Brown 2010-10-20 16:45:51 UTC
please attach the .config run in comment #15
and show the output from 'turbostat -v sleep 20'

turbostat is available here:
http://www.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/pmtools-latest/turbostat/turbostat.c
Comment 17 Marc Aurele La France 2010-10-20 19:37:10 UTC
Created attachment 34282 [details]
.config for comment #15

After `modprobe msr`, turbostat gives...

CPUID GenuineIntel 11 levels family:model:stepping 0x6:1a:5 (6:26:5)
12 * 133 = 1600 MHz max efficiency
20 * 133 = 2667 MHz TSC frequency
22 * 133 = 2933 MHz max turbo 4 active cores
22 * 133 = 2933 MHz max turbo 3 active cores
23 * 133 = 3067 MHz max turbo 2 active cores
23 * 133 = 3067 MHz max turbo 1 active cores
pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %pc3   %pc6
               0.23 2.05 2.67  99.77   0.00   0.00   0.00   0.00
   0   0   0   0.37 2.25 2.67  99.63   0.00   0.00   0.00   0.00
   0   1   1   0.21 2.00 2.67  99.79   0.00   0.00   0.00   0.00
   0   2   2   0.13 1.63 2.67  99.87   0.00   0.00   0.00   0.00
   0   3   3   0.41 2.20 2.67  99.59   0.00   0.00   0.00   0.00
   1   0   4   0.27 2.19 2.67  99.73   0.00   0.00   0.00   0.00
   1   1   5   0.21 1.94 2.67  99.79   0.00   0.00   0.00   0.00
   1   2   6   0.13 1.69 2.67  99.87   0.00   0.00   0.00   0.00
   1   3   7   0.10 1.62 2.67  99.90   0.00   0.00   0.00   0.00
20.010713 sec
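
To compare several turbostat runs without reading the columns by eye, the per-CPU rows can be parsed mechanically. This is a sketch only, written against the column layout pasted above (pkg, core, CPU followed by eight numeric columns); other turbostat versions print different columns, and the function names are made up for illustration.

```python
COLUMNS = ["%c0", "GHz", "TSC", "%c1", "%c3", "%c6", "%pc3", "%pc6"]


def parse_turbostat(text):
    """Return {cpu_number: {column: value}} for the per-CPU rows."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 + len(COLUMNS):
            continue  # summary row, timing line, or short banner line
        try:
            cpu = int(fields[2])
            values = [float(field) for field in fields[3:]]
        except ValueError:
            continue  # column header or an 11-word banner line
        rows[cpu] = dict(zip(COLUMNS, values))
    return rows


def reaches_deep_cstates(rows):
    """True if any CPU spent time in core C3 or C6."""
    return any(row["%c3"] > 0 or row["%c6"] > 0 for row in rows.values())
```

For the output above, every CPU shows 0.00 in %c3 and %c6, so reaches_deep_cstates() is false: the machine never goes deeper than C1.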
Comment 18 Len Brown 2010-10-22 06:34:56 UTC
Hmmm, the 2.6.36 kernel has CONFIG_ACPI_PROCESSOR=y,
yet turbostat shows that it fails to enter any C-states deeper than C1.

How about if you boot it with "processor.nocst=1"?

please confirm that acpi_idle probed by showing
'grep .  /sys/devices/system/cpu/cpuidle/*'

Assuming it did, then for some reason it didn't actually
register any C-states with cpuidle, which would explain
the missing states in /sys/devices/system/cpu/cpu*/cpuidle/*/*

What if you boot a 2.6.35 CONFIG_ACPI_PROCESSOR=y kernel --
what do you see in /proc/acpi/processor/*/power?
(if processor.nocst=1 worked above, try it with and without
 here also)

Please show the turbostat output for the
INTEL_IDLE=y, NO_HZ=y, IDLE_GOV_MENU=y kernel
to see if it is getting into deep C-states.

That kernel will fail to boot if you boot with "nohz=off", right?
Does it boot when "nolapic_timer" is added to the cmdline?

Finally, please identify the motherboard.
Are the BIOS SETUP options at their defaults WRT
power management options?
Comment 19 Marc Aurele La France 2010-10-25 19:25:47 UTC
(In reply to comment #18)
> Hmmm, the 2.6.36 kernel has CONFIG_ACPI_PROCESSOR=y,
> yet turbostat shows that it fails to enter any C-states deeper than C1.

> How about if you boot it with "processor.nocst=1"?

Just about the same as before ...

CPUID GenuineIntel 11 levels family:model:stepping 0x6:1a:5 (6:26:5)
12 * 133 = 1600 MHz max efficiency
20 * 133 = 2667 MHz TSC frequency
22 * 133 = 2933 MHz max turbo 4 active cores
22 * 133 = 2933 MHz max turbo 3 active cores
23 * 133 = 3067 MHz max turbo 2 active cores
23 * 133 = 3067 MHz max turbo 1 active cores
pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %pc3   %pc6
               0.25 2.07 2.67  99.75   0.00   0.00   0.00   0.00
   0   0   0   0.23 1.94 2.67  99.77   0.00   0.00   0.00   0.00
   0   1   1   0.19 1.91 2.67  99.81   0.00   0.00   0.00   0.00
   0   2   2   0.13 1.69 2.67  99.87   0.00   0.00   0.00   0.00
   0   3   3   0.11 1.62 2.67  99.89   0.00   0.00   0.00   0.00
   1   0   4   0.28 2.19 2.67  99.72   0.00   0.00   0.00   0.00
   1   1   5   0.20 2.00 2.67  99.80   0.00   0.00   0.00   0.00
   1   2   6   0.13 1.69 2.67  99.87   0.00   0.00   0.00   0.00
   1   3   7   0.69 2.35 2.67  99.31   0.00   0.00   0.00   0.00
20.007365 sec

> please confirm that acpi_idle probed by showing
> 'grep .  /sys/devices/system/cpu/cpuidle/*'

grep -r . `find /sys/devices/system/cpu -name cpuidle` shows

/sys/devices/system/cpu/cpuidle/current_driver:acpi_idle
/sys/devices/system/cpu/cpuidle/current_governor_ro:ladder
 
> Assuming it did, then for some reason it didn't actually
> register any C-states with cpuidle, which would explain
> the missing states in /sys/devices/system/cpu/cpu*/cpuidle/*/*

> What if you boot a 2.6.35 CONFIG_ACPI_PROCESSOR=y kernel --
> what do you see in /proc/acpi/processor/*/power?
> (if processor.nocst=1 worked above, try it with and without
>  here also)

This hangs, regardless of nocst.

> Please show the turbostat output for the
> INTEL_IDLE=y, NO_HZ=y, IDLE_GOV_MENU=y kernel
> to see if it is getting into deep C-states.

It does ...

CPUID GenuineIntel 11 levels family:model:stepping 0x6:1a:5 (6:26:5)
12 * 133 = 1600 MHz max efficiency
20 * 133 = 2667 MHz TSC frequency
22 * 133 = 2933 MHz max turbo 4 active cores
22 * 133 = 2933 MHz max turbo 3 active cores
23 * 133 = 3067 MHz max turbo 2 active cores
23 * 133 = 3067 MHz max turbo 1 active cores
pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %pc3   %pc6
               0.59 1.66 2.67   0.60   0.14  98.67   0.76  79.46
   0   0   0   0.82 1.90 2.67   0.59   0.10  98.50   0.76  79.45
   0   1   1   0.57 1.60 2.67   0.60   0.00  98.83   0.76  79.45
   0   2   2   0.56 1.61 2.67   0.60   0.52  98.31   0.76  79.45
   0   3   3   0.61 1.60 2.67   0.62   0.18  98.59   0.76  79.45
   1   0   4   0.55 1.62 2.67   0.59   0.03  98.83   0.76  79.47
   1   1   5   0.53 1.61 2.67   0.60   0.06  98.81   0.76  79.47
   1   2   6   0.54 1.61 2.67   0.61   0.15  98.70   0.76  79.47
   1   3   7   0.52 1.60 2.67   0.61   0.11  98.76   0.76  79.47
20.005420 sec
 
> That kernel will fail to boot if you boot with "nohz=off", right?
> Does it boot when "nolapic_timer" is added to the cmdline?

No.  It boots fine with all four combinations of these options.

> Finally, please identify the motherboard.

The system is an IBM System x3650 M2.  God knows what IBM calls its board.  I can privately send you the docs, if you want.  Not all that detailed, really.

> Are the BIOS SETUP options at their defaults WRT
> power management options?

No, because the defaults are not optimal for my purposes.  Relevant settings appear to be:

1) Package ACPI C-State Limit:  Set to ACPI C2;  help says ...
   Package ACPI C-State Limit selects the processor's lowest idle power state.
   Choosing a higher C-State allows lower processor idle power.  ACPI C2 equals
   Intel C3.  ACPI C3 equals Intel C6.

2) CPU C-States:  Set to Disabled;  help says ...
   Enable/Disable ACPI Processor Power states C2 & C3.
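
The naming mismatch in that help text is easy to trip over when reading the rest of this report, so here it is as a lookup. The two mapped entries come straight from the BIOS help quoted above; treating every other name (such as C1) as unchanged is an assumption, and the names ACPI_TO_INTEL and intel_name are illustrative only.

```python
# From the BIOS help text: "ACPI C2 equals Intel C3. ACPI C3 equals Intel C6."
ACPI_TO_INTEL = {"C2": "C3", "C3": "C6"}


def intel_name(acpi_name):
    """Translate an ACPI C-state name to the Intel name used by turbostat.

    Names without an entry (for example C1) are assumed to be identical.
    """
    return ACPI_TO_INTEL.get(acpi_name, acpi_name)
```

So a package limit of "ACPI C2" means the package may go no deeper than what turbostat reports as %pc3.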
Comment 20 Len Brown 2010-10-26 00:53:23 UTC
> CPU C-States:  Set to Disabled

Yay, this explains the unsolved mystery
of the 1st 19 comments in this bug report.
Linux acpi_idle sees no C-states besides C1
because they have been manually disabled.

BTW, this is a very unusual way to run the machine.
Note that if C-state latency avoidance is your goal,
you can do that at run-time via linux pm_qos, or at
boot time via the acpi_idle and intel_idle max_cstate
boot parameters.  Or if even C1 latency is too high,
you can disable all C-states by booting with "idle=poll"...
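
As an illustration of the run-time pm_qos route: user space can hold a latency request by writing a 32-bit value in microseconds to /dev/cpu_dma_latency and keeping the file open; the request is dropped when the file is closed. The sketch below assumes that interface is present on the running kernel and that the caller has permission to open it. The LatencyRequest class is a made-up wrapper, not an existing library.

```python
import struct


class LatencyRequest:
    """Hold a PM QoS CPU latency bound for as long as the object is open."""

    def __init__(self, max_latency_us, device="/dev/cpu_dma_latency"):
        # Unbuffered, so the 4-byte value reaches the kernel immediately.
        self._handle = open(device, "wb", buffering=0)
        self._handle.write(struct.pack("=i", max_latency_us))

    def close(self):
        # Closing the file releases the request.
        self._handle.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False
```

Wrapping a latency-sensitive section in "with LatencyRequest(3):" would, on this machine, be expected to keep cpuidle from choosing states whose exit latency exceeds 3 usec, without touching BIOS SETUP.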

Note also that disabling deep C-states impacts the
ability of the machine to reach high-frequency turbo modes.
You can observe this in the GHz column in turbostat
when the system is under load.

Please try the BIOS SETUP options at the defaults
to enable all the ACPI C-states and run acpi_idle,
does turbostat show that you are then getting into the deep C-states?

intel_idle ignores the ACPI BIOS settings, of course, so what
we are trying to do here is get an apples/apples comparison
of acpi_idle and intel_idle using the same states.  In this
case, it would be interesting if acpi_idle is able to boot
and access the deep C-states when intel_idle can not.

>> What if you boot a 2.6.35 CONFIG_ACPI_PROCESSOR=y kernel --
>> what do you see in /proc/acpi/processor/*/power?
>> (if processor.nocst=1 worked above, try it with and without
>>  here also)
>
>This hangs, regardless of nocst.

The idea there was to run in ACPI mode, rather than in intel_idle
mode.  So you would have to build with CONFIG_INTEL_IDLE=n
or disable it at boot time with intel_idle.max_cstate=0.

But as you've identified that the ACPI C-states are disabled
in the BIOS, that test is no longer necessary to explain
why ACPI saw no C-states.

>> Please show the turbostat output for the
>> INTEL_IDLE=y, NO_HZ=y, IDLE_GOV_MENU=y kernel
>> to see if it is getting into deep C-states.
>
> It does ...

>> That kernel will fail to boot if you boot with "nohz=off", right?
>> Does it boot when "nolapic_timer" is added to the cmdline?

> No.  It boots fine with all four combinations of these options.

Hmm, this kernel works with "nohz=off", yet if you change
it to CONFIG_NO_HZ=n it fails?
What governor was it using?
(grep . /sys/devices/system/cpu/cpuidle/*)
Comment 21 Marc Aurele La France 2010-10-26 21:19:10 UTC
(In reply to comment #20)
> > CPU C-States:  Set to Disabled

> Yay, this explains the unsolved mystery of the 1st 19 comments in this bug
> report.  Linux acpi_idle sees no C-states besides C1 because they have been
> manually disabled.

> BTW, this is a very unusual way to run the machine.  Note that if C-state
> latency avoidance is your goal, you can do that at run-time via linux pm_qos,
> or at boot time via the acpi_idle and intel_idle max_cstate boot parameters.
> Or if even C1 latency is too high, you can disable all C-states by booting
> with "idle=poll"...

... or, to be more in line with the KISS principle, not configure anything at all that depends on CPU_IDLE.  That's probably what I'll end up doing shortly, when I move the entire cluster to 2.6.36-release.

> Note also that disabling deep C-states impacts the ability of the machine to
> reach high-frequency turbo modes.  You can observe this in the GHz column in
> turbostat when the system is under load.

This makes no sense.  You get more performance if you allow the CPU(s) to use less power when idle?  Performance-per-watt maybe, but I don't care too much about that.

> Please try the BIOS SETUP options at the defaults to enable all the ACPI
> C-states and run acpi_idle, does turbostat show that you are then getting
> into the deep C-states?

It turns out the settings I mention in comment #19 were the only ones I needed to change.  Anyway, turbostat says ...

CPUID GenuineIntel 11 levels family:model:stepping 0x6:1a:5 (6:26:5)
12 * 133 = 1600 MHz max efficiency
20 * 133 = 2667 MHz TSC frequency
22 * 133 = 2933 MHz max turbo 4 active cores
22 * 133 = 2933 MHz max turbo 3 active cores
23 * 133 = 3067 MHz max turbo 2 active cores
23 * 133 = 3067 MHz max turbo 1 active cores
pkg core CPU   %c0   GHz  TSC   %c1    %c3    %c6   %pc3   %pc6
               2.25 1.61 2.67   2.19   0.64  94.92   1.45  76.55
   0   0   0   2.36 1.60 2.67   2.03   0.27  95.34   1.45  76.50
   0   1   1   2.25 1.60 2.67   2.16   0.31  95.29   1.45  76.50
   0   2   2   2.20 1.60 2.67   2.21   0.31  95.28   1.45  76.50
   0   3   3   2.22 1.60 2.67   2.25   0.25  95.29   1.45  76.50
   1   0   4   2.35 1.63 2.67   2.13   1.08  94.43   1.45  76.60
   1   1   5   2.24 1.61 2.67   2.21   1.06  94.50   1.45  76.60
   1   2   6   2.23 1.60 2.67   2.29   0.92  94.57   1.45  76.60
   1   3   7   2.20 1.61 2.67   2.26   0.90  94.64   1.45  76.60
20.006242 sec

> intel_idle ignores the ACPI BIOS settings, of course, so what
> we are trying to do here is get an apples/apples comparison
> of acpi_idle and intel_idle using the same states.  In this
> case, it would be interesting if acpi_idle is able to boot
> and access the deep C-states when intel_idle can not.

It would also be good if it didn't hang.

> >> What if you boot a 2.6.35 CONFIG_ACPI_PROCESSOR=y kernel --
> >> what do you see in /proc/acpi/processor/*/power?
> >> (if processor.nocst=1 worked above, try it with and without
> >>  here also)

> >This hangs, regardless of nocst.

> The idea there was to run in ACPI mode, rather than in intel_idle mode.  So
> you would have to build with CONFIG_INTEL_IDLE=n or disable it at boot time
> with intel_idle.max_cstate=0.

> But as you've identified that the ACPI C-states are disabled in the BIOS,
> that test is no longer necessary to explain why ACPI saw no C-states.

I did so anyway.

grep . /proc/acpi/processor/*/power gives 
/proc/acpi/processor/CPU0/power:active state:            C0
/proc/acpi/processor/CPU0/power:max_cstate:              C8
/proc/acpi/processor/CPU0/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU0/power:states:
/proc/acpi/processor/CPU1/power:active state:            C0
/proc/acpi/processor/CPU1/power:max_cstate:              C8
/proc/acpi/processor/CPU1/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU1/power:states:
/proc/acpi/processor/CPU2/power:active state:            C0
/proc/acpi/processor/CPU2/power:max_cstate:              C8
/proc/acpi/processor/CPU2/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU2/power:states:
/proc/acpi/processor/CPU3/power:active state:            C0
/proc/acpi/processor/CPU3/power:max_cstate:              C8
/proc/acpi/processor/CPU3/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU3/power:states:
/proc/acpi/processor/CPU4/power:active state:            C0
/proc/acpi/processor/CPU4/power:max_cstate:              C8
/proc/acpi/processor/CPU4/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU4/power:states:
/proc/acpi/processor/CPU5/power:active state:            C0
/proc/acpi/processor/CPU5/power:max_cstate:              C8
/proc/acpi/processor/CPU5/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU5/power:states:
/proc/acpi/processor/CPU6/power:active state:            C0
/proc/acpi/processor/CPU6/power:max_cstate:              C8
/proc/acpi/processor/CPU6/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU6/power:states:
/proc/acpi/processor/CPU7/power:active state:            C0
/proc/acpi/processor/CPU7/power:max_cstate:              C8
/proc/acpi/processor/CPU7/power:maximum allowed latency: 2000000000 usec
/proc/acpi/processor/CPU7/power:states:
 
> >> Please show the turbostat output for the INTEL_IDLE=y, NO_HZ=y,
> >> IDLE_GOV_MENU=y kernel to see if it is getting into deep C-states.

> >> That kernel will fail to boot if you boot with "nohz=off", right?
> >> Does it boot when "nolapic_timer" is added to the cmdline?

> > No.  It boots fine with all four combinations of these options.

> Hmm, this kernel works with "nohz=off", yet if you change it to
> CONFIG_NO_HZ=n it fails?

That is so, yes.

> What governor was it using?
> (grep . /sys/devices/system/cpu/cpuidle/*)

/sys/devices/system/cpu/cpuidle/current_driver:intel_idle
/sys/devices/system/cpu/cpuidle/current_governor_ro:menu
Comment 22 Marc Aurele La France 2010-11-04 14:46:50 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > >> Please show the turbostat output for the INTEL_IDLE=y, NO_HZ=y,
> > >> IDLE_GOV_MENU=y kernel to see if it is getting into deep C-states.

> > >> That kernel will fail to boot if you boot with "nohz=off", right?
> > >> Does it boot when "nolapic_timer" is added to the cmdline?

> > > No.  It boots fine with all four combinations of these options.

> > Hmm, this kernel works with "nohz=off", yet if you change it to
> > CONFIG_NO_HZ=n it fails?

> That is so, yes.

> > What governor was it using?
> > (grep . /sys/devices/system/cpu/cpuidle/*)

> /sys/devices/system/cpu/cpuidle/current_driver:intel_idle
> /sys/devices/system/cpu/cpuidle/current_governor_ro:menu

Is there anything more you want me to try on this?

Thanks.
Comment 23 Len Brown 2010-11-19 04:16:55 UTC
>> Note also that disabling deep C-states impacts the ability of the machine to
>> reach high-frequency turbo modes.  You can observe this in the GHz column in
>> turbostat when the system is under load.

> This makes no sense.  You get more performance if you allow the CPU(s) to use
> less power when idle?  Performance-per-watt maybe, but I don't care too much
> about that.

Each processor package has a fixed power and thermal budget.
When some cores are idle, the busy cores have available power
and thermal budget that they can use for "opportunistic
frequency upside", AKA "turbo mode".

Turbostat spells this out:

12 * 133 = 1600 MHz max efficiency
20 * 133 = 2667 MHz TSC frequency
22 * 133 = 2933 MHz max turbo 4 active cores
22 * 133 = 2933 MHz max turbo 3 active cores
23 * 133 = 3067 MHz max turbo 2 active cores
23 * 133 = 3067 MHz max turbo 1 active cores

So under nominal electrical and cooling conditions, this part
will run all cores continuously at up to 2667.  Turbo allows
all 4 cores to run at up to 2933 until the part gets hot.
If 2 or 3 cores are idle, then the maximum frequency is 3067,
and that will be sustained as long as the part stays within
its thermal limits.

So the answer is yes, the part can deliver more performance
when the cores are permitted to use less power when idle.
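
The arithmetic behind those turbostat lines can be reproduced directly. In this sketch the 133.33 MHz bus clock is inferred from the printed products (turbostat shows it rounded to 133), the ratios are copied from the output above, and the function names are illustrative.

```python
BCLK_MHZ = 400 / 3  # 133.33 MHz bus clock, inferred from "12 * 133 = 1600"

# Turbo ratio by number of active cores, as printed by turbostat above.
TURBO_RATIO = {1: 23, 2: 23, 3: 22, 4: 22}
BASE_RATIO = 20  # the "TSC frequency" line
MIN_RATIO = 12   # the "max efficiency" line


def frequency_mhz(ratio):
    """Core frequency in MHz for a given bus-clock multiplier."""
    return round(ratio * BCLK_MHZ)


def max_turbo_mhz(active_cores):
    """Highest frequency available with this many non-idle cores."""
    return frequency_mhz(TURBO_RATIO[active_cores])


def turbo_gain_mhz(active_cores):
    """Headroom over the base frequency for this many active cores."""
    return max_turbo_mhz(active_cores) - frequency_mhz(BASE_RATIO)
```

For example, max_turbo_mhz(4) gives 2933 and max_turbo_mhz(2) gives 3067, so letting two cores idle deeply is worth up to 400 MHz over the 2667 MHz base on the cores that remain busy.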
Comment 24 Len Brown 2010-11-19 05:04:15 UTC
With C-states re-enabled in BIOS SETUP,
booting 2.6.36 intel_idle.max_cstate=0 to enable acpi_idle,
you were able to get into deep C-states, as shown by
the turbostat output in comment #21.

For that scenario, please show the output from
grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*

The same kernel booted with:
intel_idle.max_cstate=1 works fine.
But
intel_idle.max_cstate=2 hangs someplace during boot.

You are using a CONFIG_NO_HZ=n kernel.
If you change that to CONFIG_NO_HZ=y then everything works fine.

Unexpectedly, a CONFIG_NO_HZ=y booted with "nohz=off"
also works fine.

Is that summary accurate?
Comment 25 Marc Aurele La France 2010-11-26 02:52:23 UTC
(In reply to comment #23)
>>> Note also that disabling deep C-states impacts the ability of the machine
>>> to reach high-frequency turbo modes.  You can observe this in the GHz
>>> column in turbostat when the system is under load.

>> This makes no sense.  You get more performance if you allow the CPU(s) to
>> use less power when idle?  Performance-per-watt maybe, but I don't care too
>> much about that.

> Each processor package has a fixed power and thermal budget.
> When some cores are idle, the busy cores have available power
> and thermal budget that they can use for "opportunistic
> frequency upside", AKA "turbo mode".

> Turbostat spells this out:

> 12 * 133 = 1600 MHz max efficiency
> 20 * 133 = 2667 MHz TSC frequency
> 22 * 133 = 2933 MHz max turbo 4 active cores
> 22 * 133 = 2933 MHz max turbo 3 active cores
> 23 * 133 = 3067 MHz max turbo 2 active cores
> 23 * 133 = 3067 MHz max turbo 1 active cores

> So under nominal electrical and cooling conditions, this part
> will run all cores continuously at up to 2667.  Turbo allows
> all 4 cores to run at up to 2933 until the part gets hot.
> If 2 or 3 cores are idle, then the maximum frequency is 3067,
> and that will be sustained as long as the part stays within
> its thermal limits.

> So the answer is yes, the part can deliver more performance
> when the cores are permitted to use less power when idle.

OK.  You've got me convinced of the error in my ways.  Silly question to ask of you perhaps, but I'm wondering if the AMDs have a similar scheme.  Probably not.

Thanks for the correction.
Comment 26 Marc Aurele La France 2010-11-26 03:04:57 UTC
(In reply to comment #24)
> With C-states re-enabled in BIOS SETUP,
> booting 2.6.36 intel_idle.max_cstate=0 to enable acpi_idle,
> you were able to get into deep C-states, as shown by
> the turbostat output in comment #21.

Yes.

> For that scenario, please show the output from
> grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*

/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:2979
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:3466983
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu0/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:6104
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:8843703
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu0/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu0/cpuidle/state3/usage:56814
/sys/devices/system/cpu/cpu0/cpuidle/state3/time:151472271
/sys/devices/system/cpu/cpu1/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu1/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu1/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu1/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu1/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu1/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu1/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu1/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu1/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu1/cpuidle/state1/usage:4585
/sys/devices/system/cpu/cpu1/cpuidle/state1/time:3363710
/sys/devices/system/cpu/cpu1/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu1/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu1/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu1/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu1/cpuidle/state2/usage:6896
/sys/devices/system/cpu/cpu1/cpuidle/state2/time:9265035
/sys/devices/system/cpu/cpu1/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu1/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu1/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu1/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu1/cpuidle/state3/usage:57609
/sys/devices/system/cpu/cpu1/cpuidle/state3/time:1423589867
/sys/devices/system/cpu/cpu2/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu2/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu2/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu2/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu2/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu2/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu2/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu2/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu2/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu2/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu2/cpuidle/state1/usage:2403
/sys/devices/system/cpu/cpu2/cpuidle/state1/time:3142363
/sys/devices/system/cpu/cpu2/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu2/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu2/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu2/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu2/cpuidle/state2/usage:5152
/sys/devices/system/cpu/cpu2/cpuidle/state2/time:7944935
/sys/devices/system/cpu/cpu2/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu2/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu2/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu2/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu2/cpuidle/state3/usage:59271
/sys/devices/system/cpu/cpu2/cpuidle/state3/time:1427461159
/sys/devices/system/cpu/cpu3/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu3/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu3/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu3/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu3/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu3/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu3/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu3/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu3/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu3/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu3/cpuidle/state1/usage:2167
/sys/devices/system/cpu/cpu3/cpuidle/state1/time:3059448
/sys/devices/system/cpu/cpu3/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu3/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu3/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu3/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu3/cpuidle/state2/usage:4950
/sys/devices/system/cpu/cpu3/cpuidle/state2/time:7897376
/sys/devices/system/cpu/cpu3/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu3/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu3/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu3/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu3/cpuidle/state3/usage:60101
/sys/devices/system/cpu/cpu3/cpuidle/state3/time:1425792279
/sys/devices/system/cpu/cpu4/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu4/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu4/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu4/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu4/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu4/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu4/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu4/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu4/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu4/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu4/cpuidle/state1/usage:2798
/sys/devices/system/cpu/cpu4/cpuidle/state1/time:2781128
/sys/devices/system/cpu/cpu4/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu4/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu4/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu4/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu4/cpuidle/state2/usage:6026
/sys/devices/system/cpu/cpu4/cpuidle/state2/time:10147072
/sys/devices/system/cpu/cpu4/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu4/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu4/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu4/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu4/cpuidle/state3/usage:58763
/sys/devices/system/cpu/cpu4/cpuidle/state3/time:1422955488
/sys/devices/system/cpu/cpu5/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu5/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu5/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu5/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu5/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu5/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu5/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu5/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu5/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu5/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu5/cpuidle/state1/usage:4747
/sys/devices/system/cpu/cpu5/cpuidle/state1/time:4148144
/sys/devices/system/cpu/cpu5/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu5/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu5/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu5/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu5/cpuidle/state2/usage:7056
/sys/devices/system/cpu/cpu5/cpuidle/state2/time:9609591
/sys/devices/system/cpu/cpu5/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu5/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu5/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu5/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu5/cpuidle/state3/usage:60452
/sys/devices/system/cpu/cpu5/cpuidle/state3/time:1422184588
/sys/devices/system/cpu/cpu6/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu6/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu6/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu6/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu6/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu6/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu6/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu6/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu6/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu6/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu6/cpuidle/state1/usage:2454
/sys/devices/system/cpu/cpu6/cpuidle/state1/time:3193182
/sys/devices/system/cpu/cpu6/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu6/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu6/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu6/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu6/cpuidle/state2/usage:5287
/sys/devices/system/cpu/cpu6/cpuidle/state2/time:8147649
/sys/devices/system/cpu/cpu6/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu6/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu6/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu6/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu6/cpuidle/state3/usage:59626
/sys/devices/system/cpu/cpu6/cpuidle/state3/time:1427344192
/sys/devices/system/cpu/cpu7/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu7/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu7/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu7/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu7/cpuidle/state0/usage:0
/sys/devices/system/cpu/cpu7/cpuidle/state0/time:0
/sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu7/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu7/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu7/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu7/cpuidle/state1/usage:2209
/sys/devices/system/cpu/cpu7/cpuidle/state1/time:3352888
/sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2
/sys/devices/system/cpu/cpu7/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
/sys/devices/system/cpu/cpu7/cpuidle/state2/latency:205
/sys/devices/system/cpu/cpu7/cpuidle/state2/power:4294967293
/sys/devices/system/cpu/cpu7/cpuidle/state2/usage:5085
/sys/devices/system/cpu/cpu7/cpuidle/state2/time:8344587
/sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu7/cpuidle/state3/desc:ACPI FFH INTEL MWAIT 0x20
/sys/devices/system/cpu/cpu7/cpuidle/state3/latency:245
/sys/devices/system/cpu/cpu7/cpuidle/state3/power:4294967292
/sys/devices/system/cpu/cpu7/cpuidle/state3/usage:58917
/sys/devices/system/cpu/cpu7/cpuidle/state3/time:1427177694
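
A listing in this form can be generated with something like `grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/*`. As a rough sketch only, the `usage` counters in such a listing can be summed per state across CPUs to see which states are actually being entered. The helper name `cpuidle_totals` is made up for illustration and is not part of any existing tool.

```shell
#!/bin/bash
# Sum the cpuidle "usage" counters per state index across all CPUs.
# Input: lines of the form <sysfs path>:<value>, as in the listing above.
# cpuidle_totals is a hypothetical helper name, not part of any tool.
cpuidle_totals() {
	awk -F: '
		$1 ~ /\/cpuidle\/state[0-9]+\/usage$/ {
			n = split($1, part, "/")
			total[part[n - 1]] += $2      # part[n-1] is "stateN"
		}
		END {
			for (s in total) printf "%s %d\n", s, total[s]
		}
	' | sort
}
```

Possible usage: `grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/usage | cpuidle_totals`.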

> The same kernel booted with:
> intel_idle.max_cstate=1 works fine.
> But
> intel_idle.max_cstate=2 hangs someplace during boot.

If C-states are disabled in the firmware, yes.

> You are using a CONFIG_NO_HZ=n kernel.
> If you change that to CONFIG_NO_HZ=y then everything works fine.

> Unexpectedly, a CONFIG_NO_HZ=y booted with "nohz=off"
> also works fine.

Both true, with C-states disabled, but they both use the menu governor.

I've just resolved some trepidation I had in trying to duplicate this problem with 2.6.36 release.  So far, for the hang to occur, all of the following must hold:

    C-states disabled in firmware, as above;
    INTEL_IDLE=y;
    NO_HZ=n (implies ladder governor);
    HIGH_RES_TIMERS=n;
    intel_idle.max_cstate>1.

Negate any one, or more, of these conditions and the hang doesn't occur.  HIGH_RES_TIMERS controls SCHED_HRTICK, so that might be involved as well.
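
For anyone trying to reproduce this, the build-time half of the list above can be checked mechanically. The following is a minimal sketch, assuming a standard kernel `.config` file; `matches_hang_config` is a hypothetical helper name, and the firmware C-state setting and `intel_idle.max_cstate` are not visible in `.config`, so they are not checked.

```shell
#!/bin/bash
# Report whether a kernel .config matches the build-time conditions listed
# above: INTEL_IDLE=y, NO_HZ=n, HIGH_RES_TIMERS=n.
# matches_hang_config is a hypothetical helper; the firmware C-state setting
# and intel_idle.max_cstate cannot be read from .config and are not checked.
matches_hang_config() {
	local cfg=$1
	grep -q '^CONFIG_INTEL_IDLE=y$' "$cfg" || return 1
	# A disabled option appears as "# CONFIG_FOO is not set" or is absent.
	grep -q '^CONFIG_NO_HZ=y$' "$cfg" && return 1
	grep -q '^CONFIG_HIGH_RES_TIMERS=y$' "$cfg" && return 1
	return 0
}
```

Possible usage: `matches_hang_config .config && echo "build-time conditions for the hang are met"`.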

Hope this helps.
Comment 27 Len Brown 2010-11-26 19:00:54 UTC
>    C-states disabled in firmware, as above;
>    INTEL_IDLE=y;
>    NO_HZ=n (implies ladder governor);
>    HIGH_RES_TIMERS=n;
>    intel_idle.max_cstate>1.
>
>Negate any one, or more, of these conditions and the hang doesn't occur. 
>HIGH_RES_TIMERS controls SCHED_HRTICK, so that might be involved as well.

If C-states are enabled in BIOS SETUP (the default)
then intel_idle works properly with no special cmdline params?
Comment 28 Marc Aurele La France 2010-11-26 22:43:49 UTC
(In reply to comment #27)
> >    C-states disabled in firmware, as above;
> >    INTEL_IDLE=y;
> >    NO_HZ=n (implies ladder governor);
> >    HIGH_RES_TIMERS=n;
> >    intel_idle.max_cstate>1.

> >Negate any one, or more, of these conditions and the hang doesn't occur. 
> >HIGH_RES_TIMERS controls SCHED_HRTICK, so that might be involved as well.

> If C-states are enabled in BIOS SETUP (the default)
> then intel_idle works properly with no special cmdline params?

Yes.  But that requirement is not documented.
Comment 29 Len Brown 2010-11-27 20:46:29 UTC
>> If C-states are enabled in BIOS SETUP (the default)
>> then intel_idle works properly with no special cmdline params?
>
>Yes.  But that requirement is not documented.

I'd be surprised to see that there is any documentation
for what that BIOS SETUP option really does.
But we can endeavor to find out, say by dumping the MSRs
with default BIOS SETUP and with C-state disabled BIOS setup.

My guess at this point is that this is a BIOS bug,
or at least a BIOS quirk.

please fetch the msr-tools from here:

git clone git://git.kernel.org/pub/scm/utils/cpu/msr-tools/msr-tools.git

and build rdmsr
we can use it to dump out the MSRs with the BIOS defaults
and compare the same MSRs for the BIOS c-state disabled setting.
Comment 30 Len Brown 2010-11-27 20:51:43 UTC
with ./rdmsr present, please run this script and save
msr.out for the default BIOS setting, and also for the
C-state disabled BIOS setting, and attach them to this
bug report.


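#!/bin/bash
# Dump every readable MSR in the range 0..1599 on all CPUs.
OUTPUT_FILE=msr.out
echo "output to $OUTPUT_FILE"

typeset -i msr
msr=0
while [ "$msr" -lt 1600 ] ; do
	# On success, label the values just printed with their MSR number.
	# Unreadable MSRs fail and their error output is discarded below.
	if ./rdmsr -a "$msr" ; then
		printf "MSR 0x%x\n" "$msr"
	fi
	msr=$msr+1
done > "$OUTPUT_FILE" 2> /dev/null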
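
Because the script above prints each MSR's values first and the "MSR 0x..." label after them, a plain diff of two msr.out files can be awkward to read. One possible way to line them up, sketched here with a made-up helper name `msr_flatten` and assuming every non-label line is a per-CPU value, is to fold each MSR onto a single line before comparing:

```shell
#!/bin/bash
# Rewrite msr.out so each MSR is one line: "MSR 0x<n> <value per CPU>...".
# The dump script prints the per-CPU values first and the "MSR 0x<n>" label
# after them, so values are buffered until their label arrives.
# msr_flatten is a hypothetical helper name.
msr_flatten() {
	awk '
		/^MSR / { print $0 vals; vals = ""; next }
		        { vals = vals " " $0 }
	' "$1"
}
```

Possible usage, with `msr.disabled` and `msr.enabled` as placeholder names for the two saved dumps: `diff <(msr_flatten msr.disabled) <(msr_flatten msr.enabled)`.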
Comment 31 Len Brown 2010-11-28 08:03:11 UTC
> CONFIG_HIGH_RES_TIMERS=n

So if you set CONFIG_HIGH_RES_TIMERS=y then everything works fine?

If you build with CONFIG_HIGH_RES_TIMERS=y and then boot
with "highres=off" then we see the failure?
Comment 32 Marc Aurele La France 2010-11-29 16:57:05 UTC
Created attachment 38532 [details]
C-state disabled MSRs
Comment 33 Marc Aurele La France 2010-11-29 16:57:53 UTC
Created attachment 38542 [details]
C-state enabled MSRs
Comment 34 Marc Aurele La France 2010-11-29 16:59:53 UTC
(In reply to comment #31)
> > CONFIG_HIGH_RES_TIMERS=n

> So if you set CONFIG_HIGH_RES_TIMERS=y then everything works fine?

Yes, with the default highres setting.

> If you build with CONFIG_HIGH_RES_TIMERS=y and then boot
> with "highres=off" then we see the failure?

Yes.
Comment 35 Len Brown 2010-11-29 23:18:09 UTC
Does the failing config still fail if CONFIG_HZ_100=y is used?
Comment 36 Len Brown 2010-11-30 06:44:27 UTC
Please clarify if "nolapic_timer" has an effect on the failing configuration.

Also, it would be interesting to know if CONFIG_HPET=y has any effect.
Comment 37 Marc Aurele La France 2010-11-30 18:34:33 UTC
Neither HZ_100 nor HPET have any effect.  Setting "nolapic_timer" does, however, prevent the hang.
Comment 38 Len Brown 2010-11-30 21:56:54 UTC
about the failure mode itself...

Is it a hard hang, or does the system make forward progress if you hold down a key to give it a stream of interrupts?
Comment 39 Len Brown 2010-11-30 21:57:40 UTC
oh, you can also test for hard hang by pressing the CAPS-LOCK key and see if that lights up, or ping on the network.
Comment 40 Marc Aurele La France 2010-11-30 22:16:20 UTC
There is no keyboard nor network at the point of the hang.  Neither USB nor NIC drivers are compiled into the kernel, and an IP address that I could ping has not yet been assigned at that point.
Comment 41 Marc Aurele La France 2010-12-07 21:33:35 UTC
(In reply to comment #40)
> There is no keyboard nor network at the point of the hang.  Neither USB nor NIC
> drivers are compiled into the kernel, and an IP address that I could ping has
> not yet been assigned at that point.

I just now had a chance to confirm this.  The keyboard is completely unresponsive during the hang.  The various LOCK keys don't turn on LEDs, magic SysRq sequences do nothing, carriage returns don't work, etc.

But, as I allude to in comment #40, that doesn't necessarily mean the CPUs are uninterruptible, nor that we have a hardware lockup here.

On the other hand, I do have an Infiniband driver and the entire IPoIB infrastructure compiled into this particular kernel, and there definitely is traffic on the internal network the adapter is connected to.  But I have no idea whether the adapter would be generating interrupts before being assigned an IP address and ifup'ed.
Comment 42 Marc Aurele La France 2010-12-19 15:54:57 UTC
Is there anything more on this?

I've gone through the MSR differences and it appears that 0xe2 & 0xe4 are the relevant ones, although they are documented for the Sandy Bridge parts, not the Nehalems.
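
To make a 0xe2 difference easier to read, a value printed by rdmsr can be split into fields. This is only a sketch: the bit positions used (package C-state limit in bits 2:0, I/O MWAIT redirection in bit 10, configuration lock in bit 15) are an assumption based on Intel's description of MSR_PKG_CST_CONFIG_CONTROL and should be verified for the CPU model at hand, and `decode_pkg_cst` is a hypothetical helper name.

```shell
#!/bin/bash
# Pull a few fields out of an MSR 0xe2 value as printed by rdmsr (hex, no "0x").
# Field positions are an assumption taken from Intel's description of
# MSR_PKG_CST_CONFIG_CONTROL; verify them for the CPU model in question.
# decode_pkg_cst is a hypothetical helper name.
decode_pkg_cst() {
	local val=$(( 16#$1 ))
	printf 'pkg_cstate_limit=%d io_mwait_redirect=%d cfg_lock=%d\n' \
		$(( val & 0x7 )) $(( (val >> 10) & 1 )) $(( (val >> 15) & 1 ))
}
```

Possible usage: `decode_pkg_cst "$(./rdmsr -p 0 0xe2)"`, run once with each BIOS setting. The raw limit value is printed as-is because its meaning differs between processor models.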
Comment 43 Marc Aurele La France 2010-12-23 14:11:53 UTC
(In reply to comment #25)
> (In reply to comment #23)
> >>> Note also that disabling deep C-states impacts the ability of the machine
> >>> to reach high-frequency turbo modes.  You can observe this in the GHz
> >>> column in turbostat when the system is under load.

> >> This makes no sense.  You get more performance if you allow the CPU(s) to
> >> use less power when idle?  Performance-per-watt maybe, but I don't care too
> >> much about that.

> > Each processor package has a fixed power and thermal budget.
> > When some cores are idle, the busy cores have available power
> > and thermal budget that they can use for "opportunistic
> > frequency upside", AKA "turbo mode".

> > Turbostat spells this out:

> > 12 * 133 = 1600 MHz max efficiency
> > 20 * 133 = 2667 MHz TSC frequency
> > 22 * 133 = 2933 MHz max turbo 4 active cores
> > 22 * 133 = 2933 MHz max turbo 3 active cores
> > 23 * 133 = 3067 MHz max turbo 2 active cores
> > 23 * 133 = 3067 MHz max turbo 1 active cores

> > So under nominal electrical and cooling conditions, this part
> > will run all cores continuously at up to 2667.  Turbo allows
> > all 4 cores to run at up to 2933 until the part gets hot.
> > If 2 or 3 cores are idle, then the maximum frequency is 3067,
> > and that will be sustained as long as the part stays within
> > its thermal limits.

> > So the answer is yes, the part can deliver more performance
> > when the cores are permitted to use less power when idle.

> OK.  You've got me convinced of the error in my ways.  Silly question to ask
> of you perhaps, but I'm wondering if the AMDs have a similar scheme.  Probably
> not.

> Thanks for the correction.

BTW, would this explain why threads are at times reported as using more than 100% of an HT thread?

Thanks.
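
As a side note on the turbostat figures quoted above: the products only come out as shown if the base clock is taken as 133.33 MHz (400/3) rather than the rounded 133 that is printed. A small sketch of that arithmetic, assuming that base clock and using a made-up helper name `ratio_to_mhz`:

```shell
#!/bin/bash
# Convert a clock multiplier to MHz, assuming a 133.33 MHz (400/3) base clock.
# The turbostat output above prints the base clock rounded to 133.
# ratio_to_mhz is a hypothetical helper name.
ratio_to_mhz() {
	# round(ratio * 400 / 3) using integer arithmetic only
	echo $(( (2 * $1 * 400 + 3) / 6 ))
}

for ratio in 12 20 22 23; do
	printf '%d * 133 = %d MHz\n' "$ratio" "$(ratio_to_mhz "$ratio")"
done
```

The four ratios used in the loop are the ones that appear in the quoted output.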
Comment 44 ykzhao 2011-02-15 01:47:19 UTC
Hi, Marc
    Does this issue still exist if the latest kernel is used? 

Thanks.
Comment 45 Marc Aurele La France 2011-02-17 16:52:12 UTC
(In reply to comment #44)
> Does this issue still exist if the latest kernel is used? 

Yes, it still occurs with 2.6.37.
Comment 46 Marc Aurele La France 2011-02-17 17:01:12 UTC
Created attachment 48162 [details]
2.6.37 .config diff

... between a working kernel (2.6.37-smp) and a non-working one (called 2.6.37-1-smp)

This includes the settings indicated in comment #26, and the USB stuff I need for module-less support of keyboard and mouse.  This looks like a hard hang (rather than a loop) as there is no keyboard response (no LED action, etc.) despite the keyboard & mouse having been earlier recognised.

Thanks.
Comment 47 Marc Aurele La France 2011-04-12 14:55:36 UTC
I've had an opportunity to test both 2.6.37.6 and 2.6.38.2.  2.6.37.6 hangs as before.  But at the point where previous kernels hang, 2.6.38.2 simply slows down to a crawl, returning to normal speed when userland is started (in the initrd in this case).  This results in grub-to-login-prompt boot times of nearly half an hour.  Also, vesafb's cursor blink rate on the console is much slower than normal.
The system is otherwise as responsive as usual.
Comment 48 Marc Aurele La France 2011-06-28 14:17:12 UTC
I inadvertently ended up testing 2.6.39.2 configured as above, and it behaves the same as 2.6.38.2 in comment #47, but this time despite having C-states enabled in the firmware.
Comment 49 Len Brown 2011-08-01 16:45:24 UTC
the "slow down to a crawl" is certainly due to lack of clock interrupts.
If you pelt the system with interrupts from a device, like the network
or the keyboard, then it will crawl faster;-)
Comment 50 Zhang Rui 2012-01-18 02:21:35 UTC
It's great that kernel bugzilla is back.

can you please verify if the problem still exists in the latest upstream
kernel?
Comment 51 Marc Aurele La France 2012-01-18 11:28:11 UTC
(In reply to comment #50)
> can you please verify if the problem still exists in the latest upstream
> kernel?

I have a major outage scheduled for Jan 30th.  I will see if 3.2.1 still exhibits the problem then.

Thanks.
Comment 52 Marc Aurele La France 2012-01-30 20:07:59 UTC
(In reply to comment #50)
> can you please verify if the problem still exists in the latest upstream
> kernel?

I've had an opportunity to test this with 3.2.2, and there is no hang nor slowdown anymore.  So I consider this issue resolved.  Thanks to all for your time.
