Subject    : Lenovo S12 2.6.36-rc7 lockup
Submitter  : Chris Vine <chris@cvine.freeserve.co.uk>
Date       : 2010-10-10 12:59
Message-ID : 20101010135912.76eb61b7@boulder.homenet
References : http://marc.info/?l=linux-kernel&m=128671620721712&w=2

This entry is being used for tracking a regression from 2.6.35. Please don't close it until the problem is fixed in the mainline.
On Tuesday, October 19, 2010, Chris Vine wrote:
> On Sun, 17 Oct 2010 22:21:44 +0200 (CEST)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a summary
> > report of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.35. Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=20172
> > Subject : Lenovo S12 2.6.36-rc7 lockup
> > Submitter : Chris Vine <chris@cvine.freeserve.co.uk>
> > Date : 2010-10-10 12:59 (8 days old)
> > Message-ID : <20101010135912.76eb61b7@boulder.homenet>
> > References : http://marc.info/?l=linux-kernel&m=128671620721712&w=2
>
> This bug is still present in 2.6.36-rc8.
<quote> Interesting, yes, the 'nolapic_timer' option fixes it on this netbook, but that seems a bit sub-optimal as it will presumably force use of i/o apic. Whether that makes much difference in practice I don't know: the netbook has a single Atom processor which does hyperthreading to appear as two. kernel 2.6.36-rc is the first kernel to trigger this on this particular netbook. <end quote> Chris, can you bisect which patch in 2.6.36 rc broke this machine?
<quote> Chris, can you bisect which patch in 2.6.36 rc broke this machine? <end quote> I can give it a go: I suppose it is an opportunity to learn to use git (I am involved in 3 projects all of which use cvs/svn), although I do not have vast amounts of spare time at the moment. I know that the regression is not present in 2.6.35. I know it is present in 2.6.36-rc6. Do they sound like reasonable end points for a git bisect, and would these be tagged as v2.6.35 and v2.6.36-rc6 in the mainline tree?
I found that the bug is present in 2.6.36-rc1, and began a bisect with 'git bisect start v2.6.36-rc1 v2.6.35'. It offered me a reasonable test commit point between those bounds which still exhibited the bug. I then entered 'git bisect bad' for the next commit point and it offered me a commit point between 2.6.35-rc1 and 2.6.35-rc2, which is obviously completely bogus. So evidently v2.6.36-rc1 and v2.6.35 are not tagged in a way which allows them to be used as end points for a bisect. Can you offer me suitable end points?
I have not got anywhere with a bisect for the reasons mentioned, but I have independently found the cause of the problem, which may not in fact be a regression. The problem is caused by the INTEL_IDLE configuration option. If that is set to Yes with kernel 2.6.36, then my Lenovo S12 with Atom CPU hangs on boot up unless the nolapic_timer boot option is chosen. If the INTEL_IDLE configuration option is set to No then boot up completes normally. The factor which triggered this with the 2.6.36 kernel is that with that kernel, INTEL_IDLE is no longer available as a module - it is compiled directly into the kernel. With the 2.6.35 kernel, the intel_idle module had been compiled but was not loaded, so no hang exhibited itself. (I do not know whether the bug would have exhibited itself if it had been compiled directly into the kernel in the 2.6.35 kernel: I can try that if that would be helpful.)
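For anyone comparing kernel builds, here is a minimal sketch of reading how INTEL_IDLE is set in a .config file. It assumes the usual .config text layout ("CONFIG_FOO=y", "CONFIG_FOO=m", or "# CONFIG_FOO is not set"); the helper name config_value and the sample text are made up for illustration and are not taken from any .config attached to this report.

```python
def config_value(config_text, symbol):
    """Return the value of a .config symbol ('y', 'm', ...), or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        # Unset symbols appear as comments ("# CONFIG_FOO is not set") and are skipped here.
        if line.startswith(symbol + "="):
            return line.split("=", 1)[1]
    return None


# Illustrative sample only, not a real attachment.
sample = """\
CONFIG_CPU_IDLE=y
CONFIG_INTEL_IDLE=y
# CONFIG_EXAMPLE_UNSET is not set
"""

print(config_value(sample, "CONFIG_INTEL_IDLE"))
print(config_value(sample, "CONFIG_EXAMPLE_UNSET"))
```

A result of 'y' means built in, 'm' means built as a module, and None means the option is not enabled in that build.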
I'm seeing this problem as well with my S12. Adding 'intel_idle.max_cstate=0' also fixes it.

(In reply to comment #5)
> (I do not know
> whether the bug would have exhibited itself if it had been compiled directly
> into the kernel in the 2.6.35 kernel: I can try that if that would be
> helpful.)

Yes, it does: Fedora builds intel_idle into the kernel with 2.6.35 and I'm seeing this with the Fedora kernels.
> 'intel_idle.max_cstate=0' also fixes it. Okay, then it is specific to intel_idle, and not seen with acpi_idle, yes? please share the output from lspci for the lenovo S12.
Created attachment 34832 [details] patch vs 2.6.36 to avoid using the LAPIC timer in ATM-C2 please test this patch from bug 21032
(In reply to comment #7)
> Okay, then it is specific to intel_idle, and not seen with acpi_idle, yes?

Yes. I'm currently using acpi_idle with no such problems.

> please share the output from lspci for the lenovo S12.

I'll attach that here. The S12 I have has an Intel graphics controller and Broadcom networking. I know Lenovo sells an ION-based version with the same name as well.
Created attachment 34842 [details] lspci -v output from my Lenovo S12
Comment #8: The patch does not have any effect on this bug. My S12 also uses the i915/945GME graphics controller rather than the ION one, but I doubt that makes any difference.
Chris, please share your /proc/cpuinfo and lspci output. Does "nolapic_timer" help your system or no? The patch in comment #8 should effectively be the same as using "nolapic_timer", at least for states deeper than C1. Ville, please also share your /proc/cpuinfo output.
Created attachment 34932 [details] /proc/cpuinfo from my Lenovo S12
I've been running a kernel with the patch from comment #8 only for a few hours now, but it seems like the patch fixes the problem for me.
Comments #12 and #14: Just to make sure I am not going completely mad, I have reapplied the patch, selected INTEL_IDLE in kernel configuration, recompiled and reinstalled the kernel, and it definitely hangs at the usual place, and will only proceed if I either keep a keyboard key pressed or I reboot with the nolapic_timer option (as I am doing right now). I can absolutely confirm from 'uname -a' that I am running this morning's recompiled kernel with the patch applied. Ville-Pekka, can you recheck that you are not booting the patched kernel with the nolapic_timer or intel_idle.max_cstate=0 kernel options, and that you recompiled with INTEL_IDLE selected rather than just CPU_IDLE? Otherwise we have the uncomfortable outcome of seemingly identical hardware offering different results. My 'lspci -v' seems to give identical output to yours.
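The command-line recheck asked for above can be sketched as follows. This is only a rough illustration: it assumes the kernel command line (for example the contents of /proc/cmdline) is a single space-separated string, it only looks for the two workaround options named in this report, and the function name active_workarounds is hypothetical.

```python
def active_workarounds(cmdline):
    """Return the workaround options from this report that appear on a kernel command line."""
    found = []
    for token in cmdline.split():
        # nolapic_timer is a bare flag; intel_idle.max_cstate carries a value.
        if token == "nolapic_timer" or token.startswith("intel_idle.max_cstate="):
            found.append(token)
    return found


# Illustrative command lines, not copied from either reporter's machine.
print(active_workarounds("ro root=/dev/sda1 quiet"))
print(active_workarounds("ro root=/dev/sda1 nolapic_timer"))
print(active_workarounds("ro root=/dev/sda1 intel_idle.max_cstate=0"))
```

An empty list means neither workaround is in effect, so any change in behaviour should come from the patched driver itself.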
Created attachment 34942 [details] /proc/cpuinfo for my S12
Created attachment 34952 [details] 'lspci -v' for my S12
Created attachment 34962 [details] intel_idle.c And just to prove none of us is mad, here is the version of intel_idle.c used to produce my hanging kernel this morning, which you can see at line 279 does have the patch applied.
I'm using the Fedora source RPM, which I rebuilt with the patch from this bug report. The source RPM is from http://koji.fedoraproject.org/koji/buildinfo?buildID=201305

(In reply to comment #15)
> Ville-Pekka, can you recheck that you are not booting the patched kernel with
> the nolapic_timer or intel_idle.max_cstate=0 kernel options

I'm not, based on /proc/cmdline

> , and that you
> recompiled with INTEL_IDLE selected rather than just CPU_IDLE?

Yes. I have CONFIG_INTEL_IDLE=y and CONFIG_CPU_IDLE=y. /sys/devices/system/cpu/cpuidle/current_driver is intel_idle and dmesg|grep intel_idle outputs:

[ 1.039425] intel_idle: MWAIT substates: 0x20220
[ 1.039430] intel_idle: v0.4 model 0x1C
[ 1.039434] intel_idle: lapic_timer_reliable_states 0x2
[ 1.045870] ACPI: acpi_idle yielding to intel_idle
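To compare this value between machines without reading the log by eye, the lapic_timer_reliable_states line can be pulled out of saved dmesg text. This is a small sketch that assumes the driver prints the line in the form shown above; the function name reliable_states_from_dmesg is made up for illustration.

```python
import re

# Matches e.g. "[ 1.039434] intel_idle: lapic_timer_reliable_states 0x2",
# with or without the leading dmesg timestamp.
PATTERN = re.compile(r"intel_idle: lapic_timer_reliable_states (0x[0-9a-fA-F]+)")


def reliable_states_from_dmesg(text):
    """Return the lapic_timer_reliable_states value as an int, or None if the line is absent."""
    match = PATTERN.search(text)
    return int(match.group(1), 16) if match else None


# The four lines quoted in this comment.
dmesg_text = """\
[ 1.039425] intel_idle: MWAIT substates: 0x20220
[ 1.039430] intel_idle: v0.4 model 0x1C
[ 1.039434] intel_idle: lapic_timer_reliable_states 0x2
[ 1.045870] ACPI: acpi_idle yielding to intel_idle
"""

print(hex(reliable_states_from_dmesg(dmesg_text)))
```

A return value of None would suggest that intel_idle did not register at all, for instance when acpi_idle is in use instead.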
Comment #19: Mine is the same:

intel_idle: MWAIT substates: 0x20220
intel_idle: v0.4 model 0x1C
intel_idle: lapic_timer_reliable_states 0x2
ACPI: acpi_idle yielding to intel_idle

Assuming the hardware is identical, Fedora's 2.6.36 appears to behave differently from mainline 2.6.36 (which I am using). So it looks as if there is also another different bug at work, the second of which is resolved in Fedora's custom kernel.
Or I suppose the difference could just be down to different kernel configuration options. My kernel is compiled for the Atom, whereas presumably the Fedora one is more generic.
Comment #12: Len on your "does 'nolapic_timer' help your system or no?": Yes, my S12 correctly boots a kernel compiled with intel idle if the nolapic_timer boot-up option is chosen. Your patch makes no difference to that on my machine.
Chris, Ville,

Please attach your .config files -- maybe there is something there that explains why your systems are behaving differently?

Ville,
> [ 1.039434] intel_idle: lapic_timer_reliable_states 0x2

When the patch from comment #8 is applied, this should say:

intel_idle: lapic_timer_reliable_states 0x1

Chris, can you see the driver output upon the hang -- does it show the lapic_timer_reliable_states line? (We can also put an additional printk in there for when you actually enter C2 for the first time, and that should pop out upon the hang and confirm that you are running the driver that you built.)
I'm currently running vanilla 2.6.36 with the patch from comment #8. I copied Fedora's i686-PAE config, ran make oldconfig and set CONFIG_MATOM=y to see if this had something to do with it. I'm still not seeing this issue, but right now I've been running this kernel only for about 15 minutes. I'll attach the .config file.
Created attachment 35032 [details] my .config based on Fedora's i686-PAE .config
>> [ 1.039434] intel_idle: lapic_timer_reliable_states 0x2
>
> When the patch from comment #8 is applied, this should say:
>
> intel_idle: lapic_timer_reliable_states 0x1

Oops, my thinko there. Before the patch, lapic_timer_reliable_states should be 0x6, and after the patch it should show as 0x2.
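For readers puzzling over these values, here is a rough sketch of decoding the mask. It assumes that bit N of lapic_timer_reliable_states stands for C-state N. That reading fits the patch description ("avoid using the LAPIC timer in ATM-C2") and the 0x6 to 0x2 change, but the thread does not spell it out, so treat it as an assumption. The function name reliable_cstates is hypothetical.

```python
def reliable_cstates(mask):
    """List the C-state numbers whose bits are set in the mask (assumes bit N <-> C-state N)."""
    return [n for n in range(mask.bit_length()) if mask & (1 << n)]


print(reliable_cstates(0x6))  # before the patch -> [1, 2]
print(reliable_cstates(0x2))  # after the patch  -> [1]
```

Under that assumption, the patch stops treating the LAPIC timer as reliable in C2 and leaves C1 alone, which matches the earlier remark that it should behave like "nolapic_timer" for states deeper than C1.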
Created attachment 35042 [details] lapic_timer_reliable_states Here is the output of 'dmesg | grep intel_idle' if I reverse your patch, and boot up by holding down a keyboard key instead of booting with the nolapic_timer option. As you say, without the patch lapic_timer_reliable_states is set to 0x6.
Created attachment 35052 [details] kernel config
On Friday, November 19, 2010, Chris Vine wrote:
> On Fri, 19 Nov 2010 00:53:35 +0100 (CET)
> "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.35 and 2.6.36.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.35 and 2.6.36. Please verify if it still
> > should be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=20172
> > Subject : 2.6.36-rc7 boot stalls unless nolapic_timer - Lenovo S12
> > Submitter : Chris Vine <chris@cvine.freeserve.co.uk>
> > Date : 2010-10-10 12:59 (40 days old)
> > Message-ID : <20101010135912.76eb61b7@boulder.homenet>
> > References : http://marc.info/?l=linux-kernel&m=128671620721712&w=2
>
> This bug is still present in 2.6.37-rc2, but (as the original reporter)
> I now doubt it should be classified as a regression. It looked as if
> it was a regression because in the 2.6.35 kernel (the first kernel to
> include intel idle), intel idle was available as a kernel module and
> that module never in fact loaded on my hardware and so never caused a
> hang. In 2.6.36 onwards, intel idle is only available directly compiled
> into the kernel and that is when the hang became apparent: subsequent
> testing shows that the hang also occurs with 2.6.35 if intel idle is
> compiled in rather than as a module.
>
> So it is a bug, which is still present, but not necessarily a
> regression.
This is to confirm that the patch now in kernel 2.6.36.1, which was presumably the same as at comment #8, does not fix this issue for me. I still get a hang on boot-up with kernel 2.6.36.1 if intel idle is chosen, unless the nolapic_timer boot option is used.
Yes, the patch from comment #8 is present in 2.6.36.1. Ville, can you try Chris' .config? Chris, can you try Ville's .config? Perhaps there is a BIOS difference; can you attach the output from dmidecode? Also, please verify that BIOS SETUP defaults are being used.
Created attachment 38742 [details] dmidecode output Here's the dmidecode output from my S12 running with the default BIOS settings (I've removed some serial numbers). I'm currently using Fedora's 2.6.36.1-10.fc15 without the intel_idle.max_cstate=0 option and I've not seen this problem any more. I probably won't have the time to build and test a vanilla kernel with Chris's .config in the next few days, but I might do so later.
Created attachment 38762 [details] dmidecode output Attached is the dmidecode output on my S12. It looks the same as Vainio's, except that the BIOS is a few revisions earlier. If I compile stock 2.6.36.1 using Vainio's .config, apart from taking a very long time to compile, nearly 5 hours (as it packages everything in the house including the car in the garage), it demonstrates similar though not identical behaviour to my more trimmed down version. With intel idle compiled in and without the nolapic_timer boot option, it boots and runs very slowly. It takes 2 mins 23 secs to boot to a console prompt, as compared with 46 seconds with my kernel with acpi idle and no intel idle; and once booted up it is generally sluggish and unpleasant. If you hold down a keyboard key to generate interrupts, it does boot up more quickly though. There is a very substantial speed-up if I boot Vainio's config with the nolapic_timer option - it then boots to a console prompt in 55 seconds. With this boot option, most of the difference in boot times between my and his config seems to be mainly due to a call to depmod on boot-up, which takes a considerable time to execute because of all the modules produced by his config. (The boot-up scripts could be much improved in my distribution, which is slackware 13.1, by commenting out the calls to fc-cache, ldconfig and depmod, but I have never quite got round to doing that.)
This bug is fixed for me in rc4, my S12 boots a lot faster and never hangs. Sleep and function keys work and everything is ok. Is there anything I can upload or test to help?
Comment #34: "This bug is fixed for me in rc4" It certainly isn't fixed for me in 2.6.37-rc4. In addition there is a regression introduced between 2.6.37-rc1 and rc2 (still in rc4) which breaks i915 graphics on a warm boot, which occurs at kernel mode setting (some cold boots work). If you are using an S12 with the nvidia chip however, you will not see that. If you are using Intel graphics, you can work around it by suspending and then resuming, which will bring graphics up.
An update: I was mistaken. My system is actually still suffering from this issue; for some reason it's just more subtle than in Chris' case, which is why I thought it was fixed. I haven't played any music on the S12 for months and now that I did, I noticed the music often jams and starts repeating the same clip over and over for a while. Rtkit also complains about the canary thread being starved. I'm thinking these are symptoms of this bug. intel_idle.max_cstate=0 does not fix it, but nolapic_timer does. I'm running Fedora's 2.6.36.2 now, and I could try a 2.6.37-rc* if needed.
Is this still a problem when running a recent kernel?
I'm currently running Fedora 15 with 2.6.40-4.fc15, which apparently is a renamed 3.0. I am not currently experiencing the problems with music stuttering as described in comment #36. This may have been fixed, but I'm interested in hearing others' experiences with newer kernels.
I've tested this a bit more and this is still a problem. I let the system idle for about 20 minutes and the clock started running late and ssh connections dropped.
On my netbook, intel idle now appears to work OK with kernel-3.0.1. The boot stalls have gone and I have not noticed any other odd effects after a couple of days of use. If anything comes up I will post again.
It's great that kernel bugzilla is back. Can you please verify if the problem still exists in the latest upstream kernel?
The problem does not exist for me with the latest kernel (3.2.1).
Good to know. Bug closed.
Unfortunately I am still seeing this problem, or something like this, on my Lenovo S12 (the Intel-based version). I am running Fedora's kernel-PAE-3.2.1-3.fc16.i686. Like I mentioned in comment #39, if I let the system idle for a while, ssh connections drop and the clock starts running late, which is probably a symptom of the system sleeping "too deep". I wonder why this is not happening on Chris' system. Maybe our .configs are just so different or maybe Fedora is carrying some patches which change things.