Bug 21952
Summary: | resume hangs unless intel_idle.max_cstate=3 or maxcpus=1 - Samsung N145, N148, N150, N210, Lenovo S10-3 | ||
---|---|---|---|
Product: | ACPI | Reporter: | Benjamin Nave (bnave) |
Component: | Power-Sleep-Wake | Assignee: | Len Brown (lenb) |
Status: | CLOSED OBSOLETE | ||
Severity: | normal | CC: | abhijeet.1989, acpi-bugzilla, alan, aleksandr.tishin, anachesa, BenjiM, bnave, david, develop, feng.tang, frank.fqc, lenb, marogge, mlord, oliparcol, r.schtz, rjw, rk, rui.zhang, seth, vitekd88 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216, 16055 | ||
Attachments: |
lspci -v (Samsung N150+)
default boot parameters
default boot parameters, hyperthreading disabled in BIOS
maxcpus=1
intel_idle.max_cstate=0
intel_idle.max_cstate=3
nolapic_timer
patch vs 2.6.37
patch vs 2.6.36
patch vs 2.6.35.9
lspci -v output on Samsung N145 Plus
dmesg|grep idle and cpuidle sysfs with default configuration
dmesg|grep idle with intel_idle.max_cstate=0
cpuidle sysfs with nolapic_timer
Debug output with intel_idle.max_cstate=0 |
Description
Benjamin Nave
2010-11-04 00:36:01 UTC
Also affects Samsung N148 Plus (same hardware as the N150 Plus, but without preinstalled Windows).
Ditto for the Samsung N210 netbook. Very difficult to debug, too. Suspend (to RAM) appears to work, but the wakeup simply hangs with a dark screen. No serial ports, and no way to attach one either (USB is no good for this). -ml
Same problem on the Samsung N150 Plus. With 2.6.35.4 (Kubuntu Maverick) suspend does not work, but with 2.6.34.7 it works. Is there some way to get useful information about the problem? I tried the procedure described here: https://wiki.ubuntu.com/DebuggingKernelSuspend but it reports a different hash every time. If you need someone to run a test on this platform, I offer my support.
The problem exists with 2.6.34-2.6.36 kernels. On 2.6.33 no hang-ups detected so far. 2.6.27 not tested (since it is not released).
I don't know whether this helps to solve the problem, but on the Ubuntu bug tracker I found a bug that covered the same problem in June: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/594885 The focus was on the setting CONFIG_PM_DISABLE_CONSOLE. This email contains the Ubuntu patch that created the problem: https://lists.ubuntu.com/archives/kernel-team/2010-June/011203.html In that thread the Ubuntu kernel team dropped the patch from the kernel build, but now we suffer from the same problem...
Davide, that bug is definitely not the same. In bug 594885 the only problem is that no picture is shown on the display while everything else works, whereas the current bug is about a completely hung system where nothing works except the power button. It is a much bigger problem, with a different (so far unknown) cause.
Well, on my N210 at least, that's what happens -- no backlight, and total system lockup on resume. ALT-SYSRQ-X buttons don't work either. -ml
Right, Юрий Чудновский! Sorry for the mistake...
Re-tested the 2.6.34 kernel - it seems to work well. Maybe I mistakenly tested a kernel that was not from the 2.6.34 branch earlier. I confirm the regression begins within the 2.6.35 branch.
Looks very similar to the symptoms from this more generic bug: http://bugzilla.kernel.org/show_bug.cgi?id=21652
Also affects Samsung N210 Plus (Ubuntu 10.10 Netbook Edition, kernel 2.6.35-23-generic; dual-boot with Windows 7 Starter).
I had this issue on my Samsung N210 netbook with the 2.6.32.6 kernel. Appending intel_idle.max_cstate=0 to the kernel boot parameters solves the issue. I wonder what this means and how it solves the issue! Can anyone shed some light?
Appending intel_idle.max_cstate=0 to the kernel boot parameters solved the issue for my Samsung N210 as well (using the 2.6.35.24 kernel now).
Setting max_cstate=0 essentially disables aggressive power-saving on the CPU. So battery life will likely be reduced. Why does this help? Dunno, but it does indicate some kind of race condition in the kernel. With normal settings, operations can have higher latency as the CPU transitions from lower (slower) cstates up to cstate-0 (the fastest). Disabling cstates means the timing is always fast, and predictable. Cheers
So.. possibly a better fix is to have the suspend script read and save the current max_cstate value, then set it to zero before suspending. Then on resume, the resume script could restore the old value. This way, everything ought to work well enough. Ideally, I expect this should really be done in-kernel, since that's where the bug is. Cheers
I am able to set intel_idle.max_cstate=3 on my Samsung N150+ without triggering this bug, and it's not until intel_idle.max_cstate=4 that I run into problems. That indicates to me that the problem isn't so much that the intel_idle driver needs to be disabled, but rather that cstate-4 is glitchy. Power management is a mystery to me so I don't know whether there's much to be gained by running with max_cstate=3 rather than max_cstate=0, but I hope this can at least help someone more knowledgeable narrow down the problem.
Linux tinytronix 2.6.35-24-generic #43~ppa1~loms~maverick-Ubuntu SMP Fri Dec 24 18:15:40 UTC 2010 i686 GNU/Linux
BOOT_IMAGE=/boot/vmlinuz-2.6.35-24-generic root=UUID=66817f18-a5c7-4d9e-8a29-3220974cb618 ro intel_idle.max_cstate=3 quiet splash
- dpk
Is the problem still present in 2.6.37?
I am experiencing the problem with my Samsung N145 Plus and kernel 2.6.37, too. I noticed that disabling Hyper Threading in BIOS or at runtime (echo 0 > /sys/devices/system/cpu/cpu1/online) helps. Can someone confirm this? Furthermore, all the devices mentioned in the topic seem to have Hyper Threading capable processors.
Confirmed, disabling hyperthreading in BIOS resolves the issue without requiring intel_idle.max_cstate to be specified, on 2.6.35 as well. - dpk
please show the output from
dmesg |grep idle
grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*
in the default configuration, and also when booting with intel_idle.max_cstate=0 (to fall back to ACPI).
does booting with "maxcpus=1" also work around the problem?
does booting with "nolapic_timer" help? If yes, does the kernel being tested include the patch from bug #21032?
please show the lspci output for each of the failing systems.
Created attachment 44232 [details]
lspci -v (Samsung N150+)
Created attachment 44242 [details]
default boot parameters
Created attachment 44252 [details]
default boot parameters, hyperthreading disabled in BIOS
Created attachment 44262 [details]
maxcpus=1
Created attachment 44272 [details]
intel_idle.max_cstate=0
Created attachment 44282 [details]
intel_idle.max_cstate=3
Created attachment 44292 [details]
nolapic_timer
'maxcpus=1' and 'nolapic_timer' both work around the issue on 2.6.35. I don't know whether the patch has been applied. - dpk
> 00:00.0 Host bridge: Intel Corporation N10 Family DMI Bridge
Intel NM10. Good to know.
> intel_idle: lapic_timer_reliable_states 0x6
This should be 0x2. This means that your kernel (vmlinuz-2.6.35-24-generic) is older than upstream 2.6.35.9, which is when the patch from bug #21032 shipped. However, I don't think that patch will actually help here. That is because comment #27 shows that intel_idle.max_cstate=3 fixes your system, and that disables ATM-C4, yet leaves ATM-C2 active. So the problem here is with ATM-C4.
> intel_idle.max_cstate=0 (the acpi_idle case)
No output from "# grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*" ? If that is the case, then ACPI C-states deeper than C1 are somehow disabled on this system. Is it running with default BIOS SETUP settings? Do you see the same thing when on AC vs on Battery?
> nolapic_timer output in comment #28
The output shows that with nolapic_timer you are not entering *any* c-states, not even c1; instead you are polling. It looks like something is broken related to the nolapic_timer option -- need to look into that; because it would otherwise implicate the lapic timer; but here it nukes all the c-states, which doesn't tell us anything about the lapic timer.
Created attachment 44302 [details]
patch vs 2.6.37
please try Shaohua's broadcast clock event patch from 2.6.38-rc1
This version should apply cleanly to 2.6.37
Created attachment 44312 [details]
patch vs 2.6.36
Here is the same patch, back-ported to apply cleanly to 2.6.36.3
Created attachment 44322 [details]
patch vs 2.6.35.9
Here is the same patch, back-ported to apply cleanly to 2.6.35.9
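For readers puzzling over the "lapic_timer_reliable_states 0x6" versus "0x2" remark earlier in this comment series: the value is read here as a bitmask in which bit N marks C-state N as one where the local APIC timer is treated as reliable. That reading is an assumption about how the intel_idle driver of that era tested the mask, and the helper name below is hypothetical. A minimal sketch:

```python
def reliable_cstates(mask):
    """Return the C-state numbers whose bit is set in the mask.

    Assumption: bit N set means the LAPIC timer is treated as reliable
    in C-state N. This is an illustrative reading, not driver code.
    """
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


# 0x6 is the value logged by the reporter's 2.6.35-24-generic kernel;
# 0x2 is the value the thread says it should be.
print(reliable_cstates(0x6))  # [1, 2]
print(reliable_cstates(0x2))  # [1]
```

Under that reading, 0x6 marks both C1 and C2 as timer-safe while 0x2 marks only C1, which fits the remark that the patch from bug #21032 concerns C2 and is unlikely to help with a problem that only appears at C4.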
Created attachment 44382 [details]
lspci -v output on Samsung N145 Plus
Created attachment 44392 [details]
dmesg|grep idle and cpuidle sysfs with default configuration
Created attachment 44402 [details]
dmesg|grep idle with intel_idle.max_cstate=0
there's no data for cpuidle in sysfs
Created attachment 44412 [details]
cpuidle sysfs with nolapic_timer
The patch from bug #21032 should already be included, since this is 2.6.37. maxcpus=1 helps, as does nolapic_timer (obviously, because it disables the c-states). The patch from #31 does not fix the problem in my case.
The patch from #33 didn't fix it for me, with 2.6.35.10.
I am not sure whether this is the same issue (pls ignore if not), but I just wanted to make you aware of a similar 2.6.35 regression that went away in 2.6.36: https://bugzilla.kernel.org/show_bug.cgi?id=16532 2.6.37 is still good on the hardware in question.
I've been debugging this issue a little with an N150. It's still present as of v2.6.38-rc6. In addition to what's been reported here, I've observed that acpi_skip_timer_override and nohpet seem to make this issue go away. I've traced it down to something going wrong when bringing the secondary logical CPU online. When the hpet code receives the CPU_ONLINE notification, the primary CPU schedules some work on the secondary CPU and waits for it to complete, but the work never gets executed. The secondary CPU is coming online and executing instructions, and I haven't isolated exactly where it hangs. I've also noticed that this problem seems to be timing sensitive, so it's entirely possible that some of the command-line options that "fix" the issue just alter the timing enough to mask it. Let me know if there's anything you want me to try, and I'll post any further findings here as well.
Created attachment 49672 [details]
Debug output with intel_idle.max_cstate=0
Attached requested data with intel_idle.max_cstate=0. I got some data from the cpuidle sysfs nodes, it just seemed to take a while after boot before they appeared for some reason. I still see the same nolapic_timer behavior with 2.6.38.
I also note that acpi_idle only seems to utilize C3 and higher, so if this is a problem with C4 it makes sense that disabling intel_idle eliminates the issue.
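Since the cpuidle sysfs nodes reportedly took a while to appear after boot, a small script can make it easier to re-check them. The sketch below roughly mirrors the `grep . /sys/devices/system/cpu/cpu*/cpuidle/*/*` command requested earlier in the thread; the function name and the `root` parameter are illustrative additions, not part of any kernel tool.

```python
import glob
import os


def dump_cpuidle(root="/sys/devices/system/cpu"):
    """Collect 'path:value' lines for every readable cpuidle attribute.

    `root` is a parameter only so the function can be pointed at a
    test directory instead of the real sysfs tree.
    """
    lines = []
    pattern = os.path.join(root, "cpu*", "cpuidle", "*", "*")
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "r", errors="replace") as handle:
                text = handle.read()
        except OSError:
            continue  # attribute could not be opened or read; skip it
        for value in text.splitlines():
            if value:  # like `grep .`, skip empty lines
                lines.append(f"{path}:{value}")
    return lines


if __name__ == "__main__":
    output = dump_cpuidle()
    print("\n".join(output) if output else "no cpuidle data in sysfs (yet)")
```

An empty result corresponds to the "there's no data for cpuidle in sysfs" case reported above; running it again some time after boot would show whether the nodes simply appear late.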
Does anyone else get rare (say, once per day) sudden hangs on 2.6.35+ kernels (even with intel_idle.max_cstate=3 or similar)? If so, it may be the same regression, because I don't remember any hangs on 2.6.32.
I got some time to look into this a little more. I have some more information, but still no clear answer. The secondary CPU starts executing and hits idle at least once. It hangs after coming out of idle and re-enabling irqs -- I can see that it makes it as far as local_irq_enable() in intel_idle(), but no farther in that function. Seeing where it goes from there is more of a challenge, given the limited debug capabilities in this state. However, I don't see it hitting smp_reschedule_interrupt(), which is expected from the schedule_delayed_work_on() call from hpet_cpuhp_notify().
I have the same problem with a Samsung N220 with kernel 2.6.38.2. I tried changing intel_idle.max_cstate to 0 (I also tried 1, 2 and 3) and I still can't resume from suspend.
re: comment #43
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH INTEL MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH INTEL MWAIT 0x10
it seems that when running acpi_idle via "intel_idle.max_cstate=0", only C1 and C2 are exposed. Is this on AC? Please try it on DC to see if additional C-states show up under ACPI.
Sorry, I'm not in possession of that machine any more so I'll be unable to do any more testing.
FYI: With Linux 3.0.1 (didn't try 3.0) on my Samsung N145P netbook suspend is working fine now. Which patch could have fixed it?
Same thing for me, resume now works without any workaround with Linux 3.0.1 (Samsung N220).
Good to know. Bug closed.
Please reopen. I was a bit overhasty: sometimes suspend works fine on my Samsung N145P, but sometimes it still fails in the same way as before.
I have been using and learning Ubuntu since Maverick, and ever since then, when I initiate suspend, the system can't wake up normally. The disk activity indicator just flashes for a short time, and the screen remains black.
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-20-generic 3.2.0-20.33
ProcVersionSignature: Ubuntu 3.2.0-20.33-generic 3.2.12
Uname: Linux 3.2.0-20-generic i686
MachineType: LENOVO S10-3
Proc: Intel Atom N450 1.66 GHz
Motherboard: Intel NM10
Video: Intel Graphics Media Accelerator (GMA) 3150
Network: Realtek PCIe GBE Family Controller (10/100/1000MBit), Atheros AR9285 Wireless Network Adapter (bgn), 2.1+EDR Bluetooth
Just updated system with update manager and installed Linux kernel 3.3.1 to test - problem remains. My Lenovo can't get up from suspend.
Like Viktor, I have the same issue with a Lenovo S10. After suspend, the screen remains black. I have no idea how to debug it, since I can't even access the netbook through SSH.
(In reply to comment #54)
> Just updated system with update manager and installed Linux kernel 3.3.1 to
> test - problem remains. My Lenovo can't get up from suspend.
I saw a similar problem on my Lenovo s10-3t, but my machine can resume from suspend, sometimes after 120 seconds, sometimes after 150 or 300 seconds. So could you try waiting 6 minutes to see whether it comes back?
(In reply to comment #56)
> > So could you try to wait 6 minutes to see whether it could come back.
no, without intel_idle.max_cstate=3 it doesn't wake even after 10 min. cstate=3 makes the computer wake up fast (but it badly affects the browser; Firefox starts to load the memory and processor).
Regarding the Lenovo S10-3...
originally its resume problem was fixed by this patch in 2.6.36:
commit 4731fdcf6f7bdab3e369a3f844d4ea4d4017284d
Author: Len Brown <len.brown@intel.com>
Date: Fri Sep 24 21:02:27 2010 -0400
intel_idle: PCI quirk to prevent Lenovo Ideapad s10-3 boot hang
You can tell if that quirk is running b/c it spews a dmesg line:
[ 0.624375] pci 0000:00:1f.0: [Firmware Bug]: TigerPoint LPC.BM_STS cleared
The way that original issue was debugged was finding the difference between the working acpi_idle and the failing intel_idle. But today the failure is different.
I have access to a Lenovo S10-3. I just dropped FC17 on it, which is 3.5.2, and resume hangs with a black screen. The quirk above is in place. intel_idle.max_cstate=3 allows resume to work. But some cmdline params that fail are surprising:
intel_idle.max_cstate=1
maxcpus=1
intel_idle.max_cstate=0
intel_idle.max_cstate=0 processor.max_cstate=1
intel_idle.max_cstate=0 processor.max_cstate=2 (gives MWAIT 0x10)
cpuidle.off=1 crashes on boot
nohpet
idle=poll
here is a clue, after leaving the system "failed" for about 5 minutes it actually resumed, and dmesg says this:
[ 118.624575] PM: Syncing filesystems ... done.
[ 118.627338] PM: Preparing system for mem sleep
[ 118.779349] Freezing user space processes ... (elapsed 0.01 seconds) done.
[ 118.791287] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 118.802295] PM: Entering mem sleep
[ 118.802338] Suspending console(s) (use no_console_suspend to debug)
[ 118.803318] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 118.803577] sd 1:0:0:0: [sda] Stopping disk
[ 119.169169] ACPI handle has no context!
[ 119.365118] PM: suspend of devices complete after 562.255 msecs
[ 119.365590] PM: late suspend of devices complete after 0.459 msecs
[ 119.377247] pcieport 0000:00:1c.0: wake-up capability enabled by ACPI
[ 119.388261] ehci_hcd 0000:00:1d.7: wake-up capability enabled by ACPI
[ 119.399244] uhci_hcd 0000:00:1d.3: wake-up capability enabled by ACPI
[ 120.757907] uhci_hcd 0000:00:1d.1: wake-up capability enabled by ACPI
[ 120.758006] uhci_hcd 0000:00:1d.0: wake-up capability enabled by ACPI
[ 120.758137] PM: noirq suspend of devices complete after 1392.536 msecs
[ 120.758174] ACPI: Preparing to enter system sleep state S3
[ 120.784273] PM: Saving platform NVS memory
[ 120.784330] Disabling non-boot CPUs ...
[ 120.786548] CPU 1 is now offline
[ 120.787428] Extended CMOS year: 2000
[ 120.787428] ACPI: Low-level resume complete
[ 120.787428] PM: Restoring platform NVS memory
[ 120.787428] CPU0: Thermal monitoring handled by SMI
[ 120.787428] Extended CMOS year: 2000
[ 120.787428] microcode: CPU0 updated to revision 0x107, date = 2009-08-25
[ 120.792528] Enabling non-boot CPUs ...
[ 120.792749] Booting Node 0 Processor 1 APIC 0x1
[ 120.807014] microcode: CPU1 updated to revision 0x107, date = 2009-08-25
[ 120.812158] CPU1 is up
[ 120.812539] ACPI: Waking up from system sleep state S3
[ 424.700077] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120320/evregion-501)
[ 424.700185] ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPCB.EC0_.DSSV] (Node ffff88003d1d0488), AE_TIME (20120320/psparse-536)
[ 424.700228] ACPI Error: Method parse/execution failed [\_WAK] (Node ffff88003d1cbaf0), AE_TIME (20120320/psparse-536)
[ 424.700402] ACPI Exception: AE_TIME, While executing method \_WAK (20120320/hwesleep-82)
[ 424.721458] uhci_hcd 0000:00:1d.0: wake-up capability disabled by ACPI
[ 424.728327] uhci_hcd 0000:00:1d.1: wake-up capability disabled by ACPI
[ 424.732428] uhci_hcd 0000:00:1d.3: wake-up capability disabled by ACPI
[ 424.732540] ehci_hcd 0000:00:1d.7: wake-up capability disabled by ACPI
[ 424.733523] PM: noirq resume of devices complete after 21.159 msecs
[ 424.733927] PM: early resume of devices complete after 0.323 msecs
So it appears that we had some kind of time-out in the EC while evaluating _WAK. This is not intel_idle specific, and it looks like ACPI, so moving bug categories.
acpi_sleep=nonvs
hmm, we get the _WAK EC timeout also in the working intel_idle.max_cstate=3 case:
[ 62.018118] PM: Syncing filesystems ... done.
[ 63.634734] PM: Preparing system for mem sleep
[ 63.768209] Freezing user space processes ... (elapsed 0.01 seconds) done.
[ 63.779276] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[ 63.790285] PM: Entering mem sleep
[ 63.790330] Suspending console(s) (use no_console_suspend to debug)
[ 63.791278] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 63.791622] sd 1:0:0:0: [sda] Stopping disk
[ 64.157146] ACPI handle has no context!
[ 64.392113] PM: suspend of devices complete after 601.259 msecs
[ 64.392513] PM: late suspend of devices complete after 0.389 msecs
[ 64.403259] pcieport 0000:00:1c.0: wake-up capability enabled by ACPI
[ 64.414241] ehci_hcd 0000:00:1d.7: wake-up capability enabled by ACPI
[ 64.425214] uhci_hcd 0000:00:1d.3: wake-up capability enabled by ACPI
[ 65.783698] uhci_hcd 0000:00:1d.1: wake-up capability enabled by ACPI
[ 65.783791] uhci_hcd 0000:00:1d.0: wake-up capability enabled by ACPI
[ 65.783900] PM: noirq suspend of devices complete after 1391.379 msecs
[ 65.783936] ACPI: Preparing to enter system sleep state S3
[ 65.811197] PM: Saving platform NVS memory
[ 65.811251] Disabling non-boot CPUs ...
[ 65.813252] CPU 1 is now offline
[ 65.814435] Extended CMOS year: 2000
[ 65.814435] ACPI: Low-level resume complete
[ 65.814435] PM: Restoring platform NVS memory
[ 65.814435] CPU0: Thermal monitoring handled by SMI
[ 65.814435] Extended CMOS year: 2000
[ 65.814435] microcode: CPU0 updated to revision 0x107, date = 2009-08-25
[ 65.819131] Enabling non-boot CPUs ...
[ 65.819449] Booting Node 0 Processor 1 APIC 0x1
[ 65.853992] microcode: CPU1 updated to revision 0x107, date = 2009-08-25
[ 65.858078] CPU1 is up
[ 65.858479] ACPI: Waking up from system sleep state S3
[ 69.862782] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120320/evregion-501)
[ 69.862803] ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPCB.EC0_.DSSV] (Node ffff88003d1d0488), AE_TIME (20120320/psparse-536)
[ 69.862829] ACPI Error: Method parse/execution failed [\_WAK] (Node ffff88003d1cbaf0), AE_TIME (20120320/psparse-536)
[ 69.862861] ACPI Exception: AE_TIME, While executing method \_WAK (20120320/hwesleep-82)
[ 69.864039] Clocksource tsc unstable (delta = 1099511324104 ns)
[ 69.864227] Switching to clocksource hpet
Len, The resume problems for s10-3 has 2 types:
1. hang on resume for ever
2. the resume will hang for 2-5 minutes, and then come back to life.
For the 2nd one I have a debug patch to fix it, pls check in bugzilla 41932 https://bugzilla.kernel.org/show_bug.cgi?id=41932
acpi_sleep=nonvs no joy
acpi.ec_delay=5000 no joy
Feng, Apparently I'm mostly seeing failure #1, because the patch in bug 41932 doesn't seem to help. (applied w/ typo fixed to 3.5.2)
that said... suspend seems to always work on my lenovo s10-3 when I use intel_idle.max_cstate=3 on Linux-3.5.2, and I've not the foggiest idea why. The same c-state accessed by acpi-idle doesn't work, and shallower c-states don't work. bizarre.
(In reply to comment #62)
> Feng,
> Apparently I'm mostly seeing failure #1, because the patch
> in bug 41932 doesn't seem to help. (applied w/ typo fixed to 3.5.2)
I see, my machine is a s10-3t, which is different from your s10-3. the bios version is Rev 0.25, released on 05/26/2010
Len, any idea/progress on this?
Resume fails on my Lenovo s10-3 running Ubuntu 13.04 (Linux 3.8.0-27-generic), but the recent upstream kernels I try all work fine:
3.11.0-rc5-gf1d6e17
3.10.7
3.8.13
3.8.0
For the newest one, I tested AC, DC, intel_idle.max_cstate=0 -- all OK.
Ubuntu 10.04's kernel still fails always -- no matter if running intel_idle, acpi_idle, or even idle=halt
I grabbed the latest -- 3.8.0-29-generic from raring-proposed, but no joy.
So I installed Ubuntu 13.10's daily build -- 3.11.0-2-generic (Aug 12th) and suspend/resume on the Lenovo Ideapad S10-3 works fine.
I don't know what 13.04's problem was, but since upstream is working and 13.10 is working, we seem to be done here. |
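The two dmesg excerpts in this thread differ mainly in how long the system stalls between "Waking up from system sleep state S3" and the first AE_TIME message. A minimal sketch for finding the largest jump between consecutive dmesg timestamps is shown below; the function name is hypothetical, and the sample lines are copied from the logs above.

```python
import re

# Matches "[ 120.812539] message" style dmesg lines.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def largest_gap(log_lines):
    """Return (gap_seconds, message_before, message_after) for the biggest
    jump between consecutive timestamps, or None if fewer than two
    timestamped lines are present."""
    stamped = []
    for line in log_lines:
        match = TIMESTAMP.match(line.strip())
        if match:
            stamped.append((float(match.group(1)), match.group(2)))
    best = None
    for (t0, msg0), (t1, msg1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


failing = [
    "[ 120.812158] CPU1 is up",
    "[ 120.812539] ACPI: Waking up from system sleep state S3",
    "[ 424.700077] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120320/evregion-501)",
]
working = [
    "[ 65.858078] CPU1 is up",
    "[ 65.858479] ACPI: Waking up from system sleep state S3",
    "[ 69.862782] ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20120320/evregion-501)",
]

for name, log in (("default", failing), ("max_cstate=3", working)):
    gap, before, _after = largest_gap(log)
    print(f"{name}: {gap:.1f} s stall after '{before}'")
```

On these samples the stall is roughly 304 seconds in the default case and roughly 4 seconds with intel_idle.max_cstate=3, so the same _WAK timeout shows up in both runs and only its duration differs.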