Bug 6166
Summary: | Suspend to RAM regression (failure to resume) | | |
---|---|---|---|
Product: | ACPI | Reporter: | Jean-Marc Valin (jean-marc.valin) |
Component: | Power-Sleep-Wake | Assignee: | Shaohua (shaohua.li) |
Status: | REJECTED UNREPRODUCIBLE | | |
Severity: | normal | CC: | acpi-bugzilla, john.stultz |
Priority: | P2 | | |
Hardware: | i386 | | |
OS: | Linux | | |
Kernel Version: | 2.6.12 and up | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Bug Depends on: | | | |
Bug Blocks: | 7216 | | |
Attachments: | Loaded modules | | |
Description
Jean-Marc Valin
2006-03-04 19:10:48 UTC
There isn't any significant change between 12-rc5 and 12-rc6. Can you investigate further to find the last kernel that works well? Thanks!

Both 2.6.10 final and 2.6.11 final work well for me, while 2.6.12, 2.6.13 and 2.6.14 are broken. Also, not sure if the info is useful, but the Ubuntu 5.04 custom version of 2.6.11 was broken in the same way as stock 2.6.12 and up are. I've heard it was broken for a lot of other people, so it may not mean much. Regarding 2.6.12-rc6, I'm seeing a lot of ACPI stuff in the changelog (http://lwn.net/Articles/138755/), so maybe one of those changes is the cause.

That stuff isn't suspend/resume related. Can you double-check whether 12-rc5 works well?

I've been running rc5 for several weeks now and it works fine so far. I really have the impression that the problem is not in the suspend/resume code itself, otherwise it would always behave the same. Because the probability of my machine not resuming is proportional to the time it's been up, I think the problem might be a corruption somewhere else that leaves the system in an inconsistent state.

Just another detail that could be useful. I've used three different ways to suspend:
- The ACPI scripts that come with Ubuntu
- The gnome-power-management scripts
- echo mem > /sys/power/state (works even from X)

I'm getting the same results (success or failure depending on the kernel) with all three, so I don't think the suspend scripts and the video restore are involved. Of course, I could be wrong.

I was just trying with 2.6.15 and I suspect that the problem *may* be related to 6266.

OK, here's something more precise. 2.6.12-rc5-git8 is affected, while 2.6.12-rc5-git5 looks OK (so far so good). Before I continue testing all versions, I'd like to know if someone is actually interested in investigating/fixing the problem or if I'm wasting my time.

I can't promise I can fix this, but I really hope it can be fixed. I think you have already done a good job.
BTW, with an unstable kernel, did you try to unload as many drivers as possible? Maybe it's a driver issue.

I tried unloading the USB stuff (which I think the scripts do anyway) when I was testing 2.6.12-final and it didn't make a difference. Are there any particular modules I should try unloading (or any that changed significantly in the current investigation window)? To me it still really looks like a sort of "internal corruption" (that may or may not be caused by a driver) that happens randomly and causes strange things, while not crashing the machine. I've had problems with at least USB and cpufreq, but never before 2.6.12-rc5, so I suspect they could have the same cause as the suspend problem. Interestingly, when one of those problems occurs, my system usually does not wake if I suspend it.

Created attachment 7851
Loaded modules
This is the result of:
% lsmod > lsmod.txt
Any module suspected more than others?
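To help decide which modules are candidates for unloading, a saved `lsmod` listing can be summarized with a short script. The following is only a sketch: the sample listing and its sizes are made up for illustration (they are not taken from the attachment), and the helper names `parse_lsmod`, `unloadable` and `anonymous_refs` are hypothetical.

```python
# Hypothetical sample in the usual "Module  Size  Used by" layout.
# The sizes and use counts are invented, not copied from attachment 7851.
SAMPLE_LSMOD = """
Module                  Size  Used by
cpufreq_ondemand        6000  0
speedstep_centrino      7000  1
ide_cd                 40000  0
cdrom                  38000  1 ide_cd
"""


def parse_lsmod(text):
    """Return {module: (size, use_count, [named users])} from lsmod output."""
    modules = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        users = [u for u in fields[3].split(",") if u] if len(fields) > 3 else []
        modules[fields[0]] = (int(fields[1]), int(fields[2]), users)
    return modules


def unloadable(modules):
    """Modules with a zero use count, i.e. candidates for rmmod."""
    return sorted(name for name, (_, used, _) in modules.items() if used == 0)


def anonymous_refs(modules):
    """Modules that are in use but list no named user."""
    return sorted(
        name for name, (_, used, users) in modules.items() if used > 0 and not users
    )


mods = parse_lsmod(SAMPLE_LSMOD)
print("can unload:", unloadable(mods))
print("in use, owner not listed:", anonymous_refs(mods))
```

Running it against the real `lsmod.txt` only requires replacing `SAMPLE_LSMOD` with the file contents, e.g. `open("lsmod.txt").read()`.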
No particular one. I can only say please try more, sorry.

This got fixed in recent kernels. Can someone set this one as a duplicate of bug #6331? For whatever reason, I cannot.

Oops, wrong. This bug was mentioned in another one as possibly related. It's the same machine, but another bug..., sorry.

OK, problem tracked down to the transition between 2.6.12-rc5-git5 and 2.6.12-rc5-git6. Just looking at the patch, there seems to be a lot of cpufreq-related stuff, which happens to be one of the things I suspect. Is there a complete changelog for -git6? Is this precise enough for you to be able to find the bug?

>between 2.6.12-rc5-git5 and 2.6.12-rc5-git6.
That's great. I'll do a diff and see the changes.
Then, if you don't use the cpufreq driver, what happens?
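Narrowing a regression to a single snapshot, as was done above for -git5 and -git6, is in effect a binary search over an ordered list of kernels. The sketch below illustrates the idea under stated assumptions: the snapshot list is hypothetical (the thread only mentions -git5, -git6 and -git8), and `is_bad` stands in for a manual suspend/resume test. Because the failure reported here is intermittent and becomes more likely with uptime, a "good" verdict is only as reliable as the length of the test run behind it.

```python
def first_bad(versions, is_bad):
    """Return the first version for which is_bad(version) is true.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and every version after the first bad one is bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical list of snapshots between rc5 and the known-bad -git8.
SNAPSHOTS = ["2.6.12-rc5"] + [f"2.6.12-rc5-git{n}" for n in range(1, 9)]


def simulated_test(version):
    """Stand-in for booting the kernel and testing resume by hand."""
    return SNAPSHOTS.index(version) >= SNAPSHOTS.index("2.6.12-rc5-git6")


print(first_bad(SNAPSHOTS, simulated_test))
```

In practice each call to `is_bad` means booting that kernel and using the machine long enough to trust the result, so halving the search range at every step matters far more than it does in the simulation.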
I just did a diff between git5 and git6. There are a lot of cpufreq changes, but none of them looks significant to me. Please make sure that the git5 to git6 change really breaks your system. Anyway, please try to unload the ide-cd driver and the cpufreq driver, and let's see what happens.

I can unload ide-cd with no problem but I don't see how I can unload the cpufreq driver because it's always in use.

You need to unload the speedstep-centrino, speedstep-ich or acpi-cpufreq driver first.

Did some more testing that keeps confirming git6 as the source of the problem. Now I'm trying to unload modules, but I can't remove speedstep-centrino because it's referenced once (owner not specified by lsmod). I was able to remove ondemand and ide-cd, so we'll see if that helps, especially leaving the CPU at max frequency all the time.

Just a note: I have a colleague who also has a D600 and has the same problem with suspend to RAM. While I can't confirm that all D600s are affected, at least I know it's not just my laptop.

I tried removing ide-cd and cpufreq-ondemand (couldn't remove speedstep-centrino) and set the policy to performance. It turns out that in this setup, my machine always resumes. This narrows the problem down quite a bit, I guess.

That's great. So remove just one driver at a time, and let's see which one is the root cause. I'd also like you to set the CPU to a low speed before suspending, with the cpufreq driver loaded, and see if resume works. I suspect that resume fails when the CPU isn't at full speed.

Well, it's not the ide-cd driver, so it's something to do with cpufreq. I'll try with the CPU at low speed and give you the results. One thing I already noticed is that -git6 tends to wrongly report the CPU speed in /proc/cpufreq after running a while with the ondemand governor. That could point to the calibration code that changed from -git5 to -git6.

If I just set the CPU speed to the lowest speed (600 MHz using the powersave governor) right after booting my machine, it resumes fine.
Then I tried changing the state from powersave (600 MHz) to performance (1.6 GHz) every 2 seconds for several hours. When I tried resuming, it failed, but with the HD LED flashing every few seconds. In all previous cases I've tried, there was nothing going on at all when resume failed. I'm now testing with -git5 just to make sure it's not a different bug.

Seems like the problem above was due to the fact that I had the webcam plugged in (it usually isn't). So far, it seems like my machine actually resumes fine with -git6 if I just switch between powersave and performance. That would leave the ondemand governor as the cause. Is it possible to separate the changes to ondemand from the ones made to the rest of the cpufreq stuff? I could then test -git5 with only the ondemand changes and -git6 without the ondemand changes.

The ondemand driver actually just changes the CPU speed according to your workload. As far as I can see, it shouldn't be any different from switching between powersave and performance. Can you set the governor to 'user' mode, try different CPU speeds and see if it works?

Still haven't tried the "user" (you mean userspace governor, right?) mode, but I think I remember having problems with the default Ubuntu setup, which is to have the userspace governor with powernowd. One interesting thing I tried was to switch to the ondemand governor *after* my machine is hit by (unrelated) cpufreq bug #6331. This means that the ondemand governor is unable to really change the frequency. Despite that, after a day, my machine failed to resume. Right now, I'm trying to see what happens if I run with ondemand, but switch to performance just before suspending.

Tried running with ondemand, but switching to performance just before suspending. It went fine for two days, then I got a complete lockup (keyboard LEDs flashing) when simply trying to switch from performance to ondemand after a successful resume. Not sure how to interpret that.
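The governor experiments described above (alternating between powersave and performance every 2 seconds, and forcing performance just before suspending) could be scripted roughly as follows. This is a sketch, not the procedure actually used in the tests: it assumes the cpufreq governor is exposed at `/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`, that writing `mem` to `/sys/power/state` triggers suspend to RAM (as with the `echo` command mentioned earlier), and that it runs as root. The function names are hypothetical.

```python
import time
from pathlib import Path

# Assumed sysfs locations; adjust for the kernel and CPU being tested.
GOVERNOR_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
STATE_PATH = Path("/sys/power/state")


def set_governor(name, path=GOVERNOR_PATH):
    """Select a cpufreq governor by writing its name to sysfs."""
    Path(path).write_text(name + "\n")


def toggle_governors(cycles, interval=2.0,
                     governors=("powersave", "performance"),
                     path=GOVERNOR_PATH, sleep=time.sleep):
    """Alternate between the given governors, one switch per interval."""
    for i in range(cycles):
        set_governor(governors[i % len(governors)], path)
        sleep(interval)


def suspend_with_governor(governor="performance",
                          gov_path=GOVERNOR_PATH, state_path=STATE_PATH):
    """Force a governor, then request suspend to RAM."""
    set_governor(governor, gov_path)
    Path(state_path).write_text("mem\n")


if __name__ == "__main__":
    # Roughly one hour of switching at 2-second intervals, then suspend.
    toggle_governors(cycles=1800)
    suspend_with_governor("performance")
```

The paths and the `sleep` function are parameters so that the logic can be tried against ordinary files before pointing it at real hardware.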
I'm now trying the userspace governor with the powernowd daemon.

OK, so I can't reproduce the problem with the userspace governor, but I've heard from someone with a D600 that he had the problem with userspace as well. Now what? Any way to fix this? I don't see much more I can do in terms of experimenting...

OK, anyone working on fixing this? Or should the bug be moved to cpufreq?

No idea. Does the system use the speed-step cpufreq driver? I wonder if the speed-step driver changes anything. Maybe the I/O-based method (acpi-cpufreq) has a BIOS bug.

I'm using the speedstep-centrino driver. Also, not sure what you mean about acpi-cpufreq, but I think if it were a BIOS bug it would always have caused problems (not just from 2.6.12-rc5-git6 on).

The I/O-based method needs the BIOS to handle the CPU frequency, so maybe it's buggy in some situations (just my guess). But you are using speedstep-centrino, so just ignore it.

I've been thinking about the options here. I've already spent months of work trying to pinpoint this bug and no significant progress has been made towards solving it. Given that I'm planning on replacing this laptop in a few months, I think the cost-benefit of continuing to chase this bug (e.g. learning git just so I can test further) is just not worth it for me. I seem to be the only one who cares about it anyway (despite the fact that it probably affects all Dell D600 laptops), so I guess the only option here is to mark it as WILL_NOT_FIX and close it.

I am also experiencing this bug on my Thinkpad X60s. I am reopening it as I have experienced it on 2.6.15 through 2.6.19 kernels. I have not tested 2.6.20 yet. What is the current status of this bug?

Gregory, the main thing to try if you want to test whether this is the same thing I reported is to compare 2.6.12-rc5 and 2.6.12-rc6. In my case, things stopped working correctly in 2.6.12-rc6 (or 2.6.12-rc5-git6 to be more precise). However, I ended up stopping the testing since there seemed to be very little interest in fixing the bug.
I have changed my laptop since then anyway.

2.6.21 was reported to work on Thinkpad X60, etc. Please re-open the bug if you still have the problem.