Bug 20172 - 2.6.36-rc7 boot stalls unless nolapic_timer - Lenovo S12
Summary: 2.6.36-rc7 boot stalls unless nolapic_timer - Lenovo S12
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: intel_idle
Hardware: All Linux
Importance: P1 normal
Assignee: power-management_intel_idle@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-12 17:11 UTC by Maciej Rutecki
Modified: 2012-01-29 17:11 UTC
CC List: 8 users

See Also:
Kernel Version: 2.6.36-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2 (1.01 KB, patch)
2010-10-24 19:20 UTC, Len Brown
lspci -v output from my Lenovo S12 (6.32 KB, text/plain)
2010-10-24 19:39 UTC, Ville-Pekka Vainio
/proc/cpuinfo from my Lenovo S12 (1.44 KB, text/plain)
2010-10-25 08:37 UTC, Ville-Pekka Vainio
/proc/cpuinfo for my S12 (1.44 KB, text/plain)
2010-10-25 09:57 UTC, Chris Vine
'lspci -v' for my S12 (8.71 KB, text/plain)
2010-10-25 09:58 UTC, Chris Vine
intel_idle.c (10.96 KB, text/x-csrc)
2010-10-25 10:12 UTC, Chris Vine
my .config based on Fedora's i686-PAE .config (112.32 KB, text/plain)
2010-10-25 19:59 UTC, Ville-Pekka Vainio
lapic_timer_reliable_states (148 bytes, text/plain)
2010-10-26 00:13 UTC, Chris Vine
kernel config (73.61 KB, text/plain)
2010-10-26 00:14 UTC, Chris Vine
dmidecode output (8.37 KB, text/plain)
2010-12-01 11:01 UTC, Ville-Pekka Vainio
dmidecode output (8.37 KB, text/plain)
2010-12-01 16:58 UTC, Chris Vine

Description Maciej Rutecki 2010-10-12 17:11:20 UTC
Subject    : Lenovo S12 2.6.36-rc7 lockup
Submitter  : Chris Vine <chris@cvine.freeserve.co.uk>
Date       : 2010-10-10 12:59
Message-ID : 20101010135912.76eb61b7@boulder.homenet
References : http://marc.info/?l=linux-kernel&m=128671620721712&w=2

This entry is being used for tracking a regression from 2.6.35.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2010-10-18 22:42:39 UTC
On Tuesday, October 19, 2010, Chris Vine wrote:
> On Sun, 17 Oct 2010 22:21:44 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a summary
> > report of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.35.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=20172
> > Subject     : Lenovo S12 2.6.36-rc7 lockup
> > Submitter   : Chris Vine <chris@cvine.freeserve.co.uk>
> > Date        : 2010-10-10 12:59 (8 days old)
> > Message-ID  : <20101010135912.76eb61b7@boulder.homenet>
> > References  : http://marc.info/?l=linux-kernel&m=128671620721712&w=2
> 
> This bug is still present in 2.6.36-rc8.
Comment 2 Len Brown 2010-10-19 01:51:34 UTC
<quote>
Interesting, yes, the 'nolapic_timer' option fixes it on this netbook,
but that seems a bit sub-optimal as it will presumably force use of i/o
apic. Whether that makes much difference in practice I don't know: the
netbook has a single Atom processor which does hyperthreading to appear
as two.

kernel 2.6.36-rc is the first kernel to trigger this on this particular
netbook.

<end quote>

Chris, can you bisect which patch in 2.6.36 rc broke this machine?
Comment 3 Chris Vine 2010-10-19 18:46:46 UTC
<quote>
Chris, can you bisect which patch in 2.6.36 rc broke this machine?
<end quote>

I can give it a go: I suppose it is an opportunity to learn to use git (I am involved in 3 projects all of which use cvs/svn), although I do not have vast amounts of spare time at the moment.

I know that the regression is not present in 2.6.35.  I know it is present in 2.6.36-rc6.  Do they sound like reasonable end points for a git bisect, and would these be tagged as v2.6.35 and v2.6.36-rc6 in the mainline tree?
Comment 4 Chris Vine 2010-10-20 21:12:55 UTC
I found that the bug is present in 2.6.36-rc1, and began a bisect with 'git bisect start v2.6.36-rc1 v2.6.35'.  It offered me a reasonable test commit point between those bounds which still exhibited the bug.  I then entered 'git bisect bad' for the next commit point and it offered me a commit point between 2.6.35-rc1 and 2.6.35-rc2, which is obviously completely bogus.

So evidently v2.6.36-rc1 and v2.6.35 are not tagged in a way which allows them to be used as end points for a bisect.

Can you offer me suitable end points?
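
A note on the bisect result described above: git bisect searches the commits that are reachable from the bad revision but not from the good one. That set can include topic-branch commits that were started from an older base (for example around 2.6.35-rc1) and only merged after v2.6.35, so being offered such a commit does not by itself mean the end points are unusable. The following is a toy model in Python with an invented five-commit history; the commit names and the graph are hypothetical, and this is not git's actual implementation.

```python
# Toy commit graph: child -> list of parents. All names are hypothetical.
PARENTS = {
    "v2.6.35-rc1": [],
    "topic-1": ["v2.6.35-rc1"],             # topic branch forked from -rc1
    "topic-2": ["topic-1"],
    "v2.6.35": ["v2.6.35-rc1"],             # release, does not contain the topic
    "v2.6.36-rc1": ["v2.6.35", "topic-2"],  # merge window pulls the topic in
}


def ancestors(commit):
    """Return the commit itself plus everything reachable through parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def bisect_candidates(bad, good):
    """Commits reachable from `bad` but not from `good`."""
    return ancestors(bad) - ancestors(good)


candidates = bisect_candidates("v2.6.36-rc1", "v2.6.35")
print(sorted(candidates))  # → ['topic-1', 'topic-2', 'v2.6.36-rc1']
```

In this model topic-1 and topic-2 sit directly on top of the -rc1 base, yet both are legitimate candidates between the good and bad end points.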
Comment 5 Chris Vine 2010-10-21 21:29:31 UTC
I have not got anywhere with a bisect for the reasons mentioned, but I have independently found the cause of the problem, which may not in fact be a regression.

The problem is caused by the INTEL_IDLE configuration option.  If that is set to Yes with kernel 2.6.36, then my Lenovo S12 with Atom CPU hangs on boot up unless the nolapic_timer boot option is chosen.  If the INTEL_IDLE configuration option is set to No then boot up completes normally.

The factor which triggered this with the 2.6.36 kernel is that with that kernel, INTEL_IDLE is no longer available as a module - it is compiled directly into the kernel.  With the 2.6.35 kernel, the intel_idle module had been compiled but was not loaded, so no hang exhibited itself.  (I do not know whether the bug would have exhibited itself if it had been compiled directly into the kernel in the 2.6.35 kernel: I can try that if that would be helpful.)
Comment 6 Ville-Pekka Vainio 2010-10-22 12:55:00 UTC
I'm seeing this problem as well with my S12. Adding 'intel_idle.max_cstate=0' also fixes it.

(In reply to comment #5)
>  (I do not know
> whether the bug would have exhibited itself if it had been compiled directly
> into the kernel in the 2.6.35 kernel: I can try that if that would be
> helpful.)

Yes, it does: Fedora builds intel_idle into the kernel with 2.6.35 and I'm seeing this with the Fedora kernels.
Comment 7 Len Brown 2010-10-24 19:18:50 UTC
> 'intel_idle.max_cstate=0' also fixes it.

Okay, then it is specific to intel_idle, and not seen with acpi_idle, yes?

Please share the output from lspci for the Lenovo S12.
Comment 8 Len Brown 2010-10-24 19:20:48 UTC
Created attachment 34832
patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2

Please test this patch from bug 21032.
Comment 9 Ville-Pekka Vainio 2010-10-24 19:39:03 UTC
(In reply to comment #7)
> Okay, then it is specific to intel_idle, and not seen with acpi_idle, yes?

Yes. I'm currently using acpi_idle with no such problems.

> please share the output from lspci for the lenovo S12.

I'll attach that here. The S12 I have has an Intel graphics controller and Broadcom networking; I know Lenovo also sells an ION-based version under the same name.
Comment 10 Ville-Pekka Vainio 2010-10-24 19:39:56 UTC
Created attachment 34842
lspci -v output from my Lenovo S12
Comment 11 Chris Vine 2010-10-24 22:58:30 UTC
Comment #8: The patch does not have any effect on this bug.

My S12 also uses the i915/945GME graphics controller rather than the ION one, but I doubt that makes any difference.
Comment 12 Len Brown 2010-10-25 02:53:01 UTC
Chris, please share your /proc/cpuinfo and lspci output.
Does "nolapic_timer" help your system or not?
The patch in comment #8 should effectively be
the same as using "nolapic_timer", at least
for states deeper than C1.

Ville, please also share your /proc/cpuinfo output.
Comment 13 Ville-Pekka Vainio 2010-10-25 08:37:34 UTC
Created attachment 34932
/proc/cpuinfo from my Lenovo S12
Comment 14 Ville-Pekka Vainio 2010-10-25 08:41:05 UTC
I've been running a kernel with the patch from comment #8 only for a few hours now, but it seems like the patch fixes the problem for me.
Comment 15 Chris Vine 2010-10-25 09:53:28 UTC
Comments #12 and #14:  Just to make sure I am not going completely mad, I have reapplied the patch, selected INTEL_IDLE in kernel configuration, recompiled and reinstalled the kernel, and it definitely hangs at the usual place, and will only proceed if I either keep a keyboard key pressed or I reboot with the nolapic_timer option (as I am doing right now).  I can absolutely confirm from 'uname -a' that I am running this morning's recompiled kernel with the patch applied.

Ville-Pekka, can you recheck that you are not booting the patched kernel with the nolapic_timer or intel_idle.max_cstate=0 kernel options, and that you recompiled with INTEL_IDLE selected rather than just CPU_IDLE?  Otherwise we have the uncomfortable outcome of seemingly identical hardware offering different results.  My 'lspci -v' seems to give identical output to yours.
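
The recheck asked for above can be scripted. This is a minimal sketch in Python, assuming the standard /proc/cmdline file and the cpuidle sysfs path quoted elsewhere in this report; the helper names are made up for illustration.

```python
from pathlib import Path

# Boot parameters mentioned in this report as working around the hang.
WORKAROUNDS = ("nolapic_timer", "intel_idle.max_cstate=0")


def active_workarounds(cmdline):
    """Return the workaround parameters present on a kernel command line."""
    params = cmdline.split()
    return [w for w in WORKAROUNDS if w in params]


def read_or_none(path):
    """Read a small text file, or return None if it is not available."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_or_none("/proc/cmdline")
    driver = read_or_none("/sys/devices/system/cpu/cpuidle/current_driver")
    print("cpuidle driver:", driver)
    if cmdline is not None:
        print("workarounds on command line:", active_workarounds(cmdline))
```

An empty workaround list together with a driver of intel_idle would indicate that the patched kernel is being tested without either boot option.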
Comment 16 Chris Vine 2010-10-25 09:57:54 UTC
Created attachment 34942
/proc/cpuinfo for my S12
Comment 17 Chris Vine 2010-10-25 09:58:48 UTC
Created attachment 34952
'lspci -v' for my S12
Comment 18 Chris Vine 2010-10-25 10:12:27 UTC
Created attachment 34962
intel_idle.c

And just to prove none of us is mad, here is the version of intel_idle.c used to produce my hanging kernel this morning, which you can see at line 279 does have the patch applied.
Comment 19 Ville-Pekka Vainio 2010-10-25 10:23:39 UTC
I'm using the Fedora source RPM, which I rebuilt with the patch from this bug report. The source RPM is from http://koji.fedoraproject.org/koji/buildinfo?buildID=201305

(In reply to comment #15)
> Ville-Pekka, can you recheck that you are not booting the patched kernel with
> the nolapic_timer or intel_idle.max_cstate=0 kernel options

I'm not, based on /proc/cmdline

> , and that you
> recompiled with INTEL_IDLE selected rather than just CPU_IDLE?

Yes. I have CONFIG_INTEL_IDLE=y and CONFIG_CPU_IDLE=y. /sys/devices/system/cpu/cpuidle/current_driver is intel_idle and dmesg|grep intel_idle outputs:

[    1.039425] intel_idle: MWAIT substates: 0x20220
[    1.039430] intel_idle: v0.4 model 0x1C
[    1.039434] intel_idle: lapic_timer_reliable_states 0x2
[    1.045870] ACPI: acpi_idle yielding to intel_idle
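
For readers decoding the first dmesg line: the "MWAIT substates" value appears to be the EDX output of CPUID leaf 5, which packs one 4-bit sub-state count per MWAIT C-state, starting with C0 in the lowest nibble. A small Python sketch under that assumption follows; the function name is illustrative, and these are hardware MWAIT C-state numbers, which need not match the cpuidle state indices Linux reports.

```python
def mwait_substates(edx, max_cstate=7):
    """Split a CPUID.05H:EDX value into per-C-state MWAIT sub-state counts."""
    return {
        f"C{n}": (edx >> (4 * n)) & 0xF
        for n in range(max_cstate + 1)
    }


counts = mwait_substates(0x20220)
supported = [name for name, count in counts.items() if count > 0]
print(supported)  # → ['C1', 'C2', 'C4']
```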
Comment 20 Chris Vine 2010-10-25 10:42:58 UTC
Comment #19: Mine is the same:

  intel_idle: MWAIT substates: 0x20220
  intel_idle: v0.4 model 0x1C
  intel_idle: lapic_timer_reliable_states 0x2
  ACPI: acpi_idle yielding to intel_idle

Assuming the hardware is identical, Fedora's 2.6.36 appears to behave differently from mainline 2.6.36 (which I am using).  So it looks as if a second, different bug is also at work, one which is resolved in Fedora's custom kernel.
Comment 21 Chris Vine 2010-10-25 10:49:18 UTC
Or I suppose the difference could just be down to different kernel configuration options.  My kernel is compiled for the Atom, whereas presumably the Fedora one is more generic.
Comment 22 Chris Vine 2010-10-25 10:59:28 UTC
Comment #12: Len, on your "does 'nolapic_timer' help your system or no?": yes, my S12 correctly boots a kernel compiled with intel idle if the nolapic_timer boot-up option is chosen.  Your patch makes no difference to that on my machine.
Comment 23 Len Brown 2010-10-25 18:53:43 UTC
Chris, Ville,
Please attach your .config files -- maybe there is something there
that explains why your systems are behaving differently?

Ville,
> [    1.039434] intel_idle: lapic_timer_reliable_states 0x2

When the patch from comment #8 is applied, this should say:

intel_idle: lapic_timer_reliable_states 0x1

Chris, can you see the driver output upon the hang --
does it show the lapic_timer_reliable_states line?
(we can also put an additional printk in there for when
you actually enter C2 for the first time, and that
should pop out upon the hang and confirm that
you are running the driver that you built.)
Comment 24 Ville-Pekka Vainio 2010-10-25 19:57:35 UTC
I'm currently running vanilla 2.6.36 with the patch from comment #8. I copied Fedora's i686-PAE config, ran make oldconfig and set CONFIG_MATOM=y to see if this had something to do with it. I'm still not seeing this issue, but right now I've been running this kernel only for about 15 minutes. I'll attach the .config file.
Comment 25 Ville-Pekka Vainio 2010-10-25 19:59:44 UTC
Created attachment 35032
my .config based on Fedora's i686-PAE .config
Comment 26 Len Brown 2010-10-25 20:46:32 UTC
>> [    1.039434] intel_idle: lapic_timer_reliable_states 0x2
>
> When the patch from comment #8 is applied, this should say:
>
> intel_idle: lapic_timer_reliable_states 0x1

Oops, my thinko there.  Before the patch, lapic_timer_reliable_states
should be 0x6, and after the patch it should show as 0x2.
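
To make the mask values concrete: lapic_timer_reliable_states appears to be a bitmask in which bit N set means the LAPIC timer is trusted to keep running in C-state N. Under that assumption (the helper name below is hypothetical), 0x6 and 0x2 decode as follows.

```python
def reliable_cstates(mask):
    """List the C-state numbers whose bit is set in the mask."""
    return [n for n in range(mask.bit_length()) if mask & (1 << n)]


print(reliable_cstates(0x6))  # → [1, 2]  (before the patch: C1 and C2)
print(reliable_cstates(0x2))  # → [1]     (after the patch: C1 only)
```

This is consistent with the title of the patch in comment #8: with the C2 bit cleared, the driver no longer relies on the LAPIC timer in ATM-C2.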
Comment 27 Chris Vine 2010-10-26 00:13:40 UTC
Created attachment 35042
lapic_timer_reliable_states

Here is the output of 'dmesg | grep intel_idle' if I reverse your patch, and boot up by holding down a keyboard key instead of booting with the nolapic_timer option.  As you say, without the patch lapic_timer_reliable_states is set to 0x6.
Comment 28 Chris Vine 2010-10-26 00:14:31 UTC
Created attachment 35052
kernel config
Comment 29 Rafael J. Wysocki 2010-11-19 20:08:55 UTC
On Friday, November 19, 2010, Chris Vine wrote:
> On Fri, 19 Nov 2010 00:53:35 +0100 (CET)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.35 and 2.6.36.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.35 and 2.6.36.  Please verify if it still
> > should be listed and let the tracking team know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=20172
> > Subject     : 2.6.36-rc7 boot stalls unless nolapic_timer - Lenovo S12
> > Submitter   : Chris Vine <chris@cvine.freeserve.co.uk>
> > Date        : 2010-10-10 12:59 (40 days old)
> > Message-ID  : <<20101010135912.76eb61b7@boulder.homenet>>
> > References  : http://marc.info/?l=linux-kernel&m=128671620721712&w=2
> 
> This bug is still present in 2.6.37-rc2, but (as the original reporter)
> I now doubt it should be classified as a regression.  It looked as if
> it was a regression because in the 2.6.35 kernel (the first kernel to
> include intel idle), intel idle was available as a kernel module and
> that module never in fact loaded on my hardware and so never caused a
> hang. In 2.6.36 onwards, intel idle is only available directly compiled
> into the kernel and that is when the hang became apparent: subsequent
> testing shows that the hang also occurs with 2.6.35 if intel idle is
> compiled in rather than as a module.
> 
> So it is a bug, which is still present, but not necessarily a
> regression.
Comment 30 Chris Vine 2010-11-23 23:46:38 UTC
This is to confirm that the patch now in kernel 2.6.36.1, which is presumably the same as the one in comment #8, does not fix this issue for me.

I still get a hang on boot-up with kernel 2.6.36.1 if intel idle is chosen, unless the nolapic_timer boot option is used.
Comment 31 Len Brown 2010-11-26 19:16:53 UTC
Yes, the patch from comment #8 is present in 2.6.36.1

Ville - can you try Chris' .config?
Chris - can you try Ville's .config?

Perhaps there is a BIOS difference; can you
attach the output from dmidecode?
Also, please verify that
BIOS SETUP defaults are being used.
Comment 32 Ville-Pekka Vainio 2010-12-01 11:01:19 UTC
Created attachment 38742
dmidecode output

Here's the dmidecode output from my S12 running with the default BIOS settings (I've removed some serial numbers). I'm currently using Fedora's 2.6.36.1-10.fc15 without the intel_idle.max_cstate=0 option and I've not seen this problem any more. I probably won't have the time to build and test a vanilla kernel with Chris's .config in the next few days; I might do so later.
Comment 33 Chris Vine 2010-12-01 16:58:37 UTC
Created attachment 38762
dmidecode output

Attached is the dmidecode output on my S12.  It looks the same as Vainio's, except that the BIOS is a few revisions earlier.

If I compile stock 2.6.36.1 using Vainio's .config, apart from taking a very long time to compile, nearly 5 hours (as it packages everything in the house including the car in the garage), it demonstrates similar though not identical behaviour to my more trimmed down version.  With intel idle compiled in and without the nolapic_timer boot option, it boots and runs very slowly.  It takes 2 mins 23 secs to boot to a console prompt, as compared with 46 seconds with my kernel with acpi idle and no intel idle; and once booted up it is generally sluggish and unpleasant.  If you hold down a keyboard key to generate interrupts, it does boot up more quickly though.

There is a very substantial speed-up if I boot Vainio's config with the nolapic_timer option - it then boots to a console prompt in 55 seconds.  With this boot option, the remaining difference in boot times between my config and his seems to be mainly due to a call to depmod on boot-up, which takes a considerable time to execute because of all the modules produced by his config.  (The boot-up scripts could be much improved in my distribution, which is slackware 13.1, by commenting out the calls to fc-cache, ldconfig and depmod, but I have never quite got round to doing that.)
Comment 34 Jon Jahren 2010-12-02 13:31:39 UTC
This bug is fixed for me in rc4, my S12 boots a lot faster and never hangs. Sleep and function keys work and everything is ok. Is there anything I can upload or test to help?
Comment 35 Chris Vine 2010-12-02 14:35:18 UTC
Comment #34: "This bug is fixed for me in rc4"

It certainly isn't fixed for me in 2.6.37-rc4.  In addition there is a regression introduced between 2.6.37-rc1 and rc2 (still in rc4) which breaks i915 graphics on a warm boot, which occurs at kernel mode setting (some cold boots work).  If you are using an S12 with the nvidia chip however, you will not see that.  If you are using Intel graphics, you can work around it by suspending and then resuming, which will bring graphics up.
Comment 36 Ville-Pekka Vainio 2010-12-22 20:52:56 UTC
An update: I was mistaken. My system is actually still suffering from this issue, for some reason it's just more subtle than in Chris' case, which is why I thought it was fixed. I haven't played any music on the S12 for months and now that I did, I noticed the music often jams and starts repeating the same clip over and over for a while. Rtkit also complains about the canary thread being starved. I'm thinking these are symptoms of this bug.

intel_idle.max_cstate=0 does not fix it, but nolapic_timer does. I'm running Fedora's 2.6.36.2 now, I could try a 2.6.37-rc* if needed.
Comment 37 Len Brown 2011-08-01 16:41:26 UTC
Is this still a problem when running a recent kernel?
Comment 38 Ville-Pekka Vainio 2011-08-02 18:07:47 UTC
I'm currently running Fedora 15 with 2.6.40-4.fc15, which apparently is a renamed 3.0. I am not currently experiencing the problems with music stuttering as described in comment #36. This may have been fixed, but I'm interested in hearing others' experiences with newer kernels.
Comment 39 Ville-Pekka Vainio 2011-08-06 09:34:25 UTC
I've tested this a bit more and this is still a problem. I let the system idle for about 20 minutes and the clock started running late and ssh connections dropped.
Comment 40 Chris Vine 2011-08-17 22:57:42 UTC
On my netbook, intel idle now appears to work OK with kernel-3.0.1.  The boot stalls have gone and I have not noticed any other odd effects after a couple of days of use.  If anything comes up I will post again.
Comment 41 Zhang Rui 2012-01-18 02:21:24 UTC
It's great that kernel bugzilla is back.

Can you please verify whether the problem still exists in the latest upstream
kernel?
Comment 42 Chris Vine 2012-01-18 09:45:56 UTC
The problem does not exist for me with the latest kernel (3.2.1).
Comment 43 Zhang Rui 2012-01-19 02:04:04 UTC
Good to know.
Bug closed.
Comment 44 Ville-Pekka Vainio 2012-01-29 17:11:50 UTC
Unfortunately I am still seeing this problem, or something like this, on my Lenovo S12 (the Intel-based version). I am running Fedora's kernel-PAE-3.2.1-3.fc16.i686. Like I mentioned in comment #39, if I let the system idle for a while, ssh connections drop and the clock starts running late, which is probably a symptom of the system sleeping "too deep". I wonder why this is not happening on Chris' system. Maybe our .configs are just so different or maybe Fedora is carrying some patches which change things.
