Kernel Bug Tracker – Bug 6166
Suspend to RAM regression (failure to resume)
Last modified: 2007-06-05 01:52:18 UTC
Most recent kernel where this bug did not occur:
The bug was introduced in kernel 2.6.12 and occurs on later versions too (tested
2.6.13 and 2.6.14). After doing further testing, it seems like the regression
happened between 2.6.12-rc5 and 2.6.12-rc6.
Ubuntu 5.10 (Breezy), although it also happened on Ubuntu 5.04 and Debian stable.
Dell Latitude D600 (Bios rev. A14)
Pentium-M 1.6 GHz / 1 GB RAM
ATI Technologies, Inc. Radeon Mobility 9000 M9 (R250 Lf)
Default Ubuntu 5.10 software installed, except for stock kernel.
With kernel 2.6.12-rc6 and later, my machine *sometimes* doesn't resume when I
suspend it. This happens especially when it has been running for a while. It
almost always works when I just rebooted, or if I just successfully resumed. So
it behaves like "something" gets randomly corrupted, at which point the machine
still works, but will not resume if I suspend it. Also, I've observed the same
behaviour with and without preemption enabled.
Steps to reproduce:
It does not reproduce reliably. In general, the longer the machine has been up,
the more likely the bug is to happen. After 7-10 days of active uptime, resume
almost never works. After a few hours, it almost always works.
There isn't any significant change between 12-rc5 and 12-rc6. Can you
investigate further to find the last kernel that works well? Thanks!
Both 2.6.10 final and 2.6.11 final work well for me, while 2.6.12, 2.6.13 and
2.6.14 are broken. Also, not sure if the info can be useful, but the Ubuntu 5.04
custom version of 2.6.11 was broken in the same way as stock 2.6.12 and up are.
I've heard it was broken for a lot of other people, so it may not mean much.
Regarding 2.6.12-rc6, I'm seeing a lot of ACPI stuff in the changelog
(http://lwn.net/Articles/138755/), so maybe one of them is the cause.
That stuff isn't suspend/resume related. Can you double check if 12-rc5 work
I've been running rc5 for several weeks now and it works fine so far. I really
have the impression that it's not in the suspend/resume code itself, otherwise
it would always behave the same. Because the probability of my machine not
resuming is proportional to the time it's been up, it makes me think the problem
might be a corruption somewhere else that might leave the system in an
Just another detail that could be useful. I've used three different ways to suspend:
- The ACPI scripts that come with Ubuntu
- The gnome-power-management scripts
- echo mem > /sys/power/state (works even from X)
I'm getting the same results (success or failure depending on the kernel) with
all three, so I don't think the suspend scripts and the video restore are
involved. Of course, I could be wrong.
I just tried 2.6.15 and I suspect that the problem *may* be related to
bug 6266.
OK, here's something more precise. 2.6.12-rc5-git8 is affected, while
2.6.12-rc5-git5 looks OK (so far so good). Before I continue testing all
versions, I'd like to know if someone is actually interested in
investigating/fixing the problem or if I'm wasting my time.
I can't promise I can fix this, but I really hope it can be fixed. I think
you already did a good job. BTW, with an unstable kernel, did you try unloading
as many drivers as possible? Maybe it's a driver issue.
I tried unloading the usb stuff (which I think the scripts do anyway) when I was
testing 2.6.12-final and it didn't make a difference. Any particular modules I
should try unloading (or any that had a significant change in the current
investigation window)? To me it still really looks like a sort of "internal
corruption" (that may or may not be caused by a driver) that happens randomly
and causes strange things, while not crashing the machine. I've had problems
with at least USB and cpufreq, but never before 2.6.12-rc5, so I suspect they
could have the same cause as the suspend problem. Interestingly, when one of
those occurs, my system usually does not wake if I suspend it.
Created attachment 7851 [details]
This is the result of:
% lsmod > lsmod.txt
Any module suspected more than others?
No particular one. I can only say please try more, sorry.
This got fixed in recent kernels. Can someone mark this one as a duplicate of
bug #6331? For whatever reason, I cannot.
Oops, wrong. This bug was mentioned in another one as possibly related. It's
the same machine, but a different bug, sorry.
OK, problem tracked down to the transition between 2.6.12-rc5-git5 and
2.6.12-rc5-git6. Just looking at the patch, there seems to be a lot of
cpufreq-related stuff, which happens to be one of the things I suspect. Is there
a complete changelog for -git6? Is this precise enough for you to be able to
find the bug?
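The narrowing-down described above is a binary search over an ordered list of kernel snapshots. A minimal sketch follows, assuming the first snapshot in the list is known good and the last is known bad. The function name first_bad and the caller-supplied test command are hypothetical, not part of any existing tool. Because this bug is intermittent, a "good" verdict from the test is only as reliable as the uptime it was given.

```shell
#!/bin/sh
# Sketch: find the first bad snapshot in an ordered list.
# $1       name of a caller-supplied command (hypothetical); it is called
#          with one snapshot name and must return 0 if that snapshot is BAD
#          (fails to resume) and non-zero if it is good.
# $2...    snapshot names, oldest first. The first one is assumed good and
#          the last one is assumed bad.
first_bad() {
    tester=$1
    shift
    lo=1      # index of the newest snapshot known to be good
    hi=$#     # index of the oldest snapshot known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"       # look up the mid-th snapshot name
        if "$tester" "$candidate"; then
            hi=$mid                     # bad: the culprit is here or earlier
        else
            lo=$mid                     # good: the culprit is later
        fi
    done
    eval "printf '%s\n' \"\${$hi}\""
}
```

With a test command that encodes the results reported so far, a call such as `first_bad my_test 2.6.12-rc5 2.6.12-rc5-git5 2.6.12-rc5-git6 2.6.12-rc5-git8` would print the snapshot where the regression first appears.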
>between 2.6.12-rc5-git5 and 2.6.12-rc5-git6.
That's great. I'll do a diff and see the changes.
Then if you don't use the cpufreq driver, what happens?
I just did a diff between git5 and git6. There are a lot of cpufreq changes,
but it doesn't matter to me. Please confirm that the git5 to git6 change breaks
your system. Anyway, please try unloading the ide-cd driver and the cpufreq
driver, and let's see what happens.
I can unload ide-cd with no problem but I don't see how I can unload the cpufreq
driver because it's always in use.
You need to unload the speedstep-centrino, speedstep-ich or acpi-cpufreq
Did some more testing that keeps confirming git6 as the source of the problem.
Now I'm trying to unload modules, but I can't remove speedstep-centrino because
it's referenced once (owner not specified by lsmod). I was able to remove
ondemand and ide-cd, so we'll see if that helps, especially leaving the CPU at
max frequency all the time.
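To see why a module refuses to unload, the use count and the listed users can be read directly from /proc/modules, which is the same data lsmod formats. A minimal sketch, assuming the 2.6-style field order (name, size, use count, users, state, address); the function name module_refs is hypothetical. Note that the kernel reports module names with underscores, so speedstep-centrino appears as speedstep_centrino.

```shell
#!/bin/sh
# Sketch: print "<use count> <users>" for one module.
# $1  module name as the kernel reports it (underscores, not dashes)
# $2  optional file in /proc/modules format (defaults to /proc/modules)
# A users field of "-" means no other module is listed as the holder,
# which matches the "referenced once, owner not specified" case above.
module_refs() {
    awk -v m="$1" '$1 == m { print $3, $4 }' "${2:-/proc/modules}"
}
```

For example, `module_refs speedstep_centrino` prints the count and users for that driver on the running system.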
Just a note, I have a colleague who also has a D600 and has the same problem
with suspend to RAM. While I can't confirm that all D600s are affected, at least
I know it's not just my laptop that's affected.
I tried removing ide-cd and cpufreq-ondemand (couldn't remove
speedstep-centrino) and set the policy to performance. Turns out that in this
setup, my machine always resumes. This narrows the problem down quite a bit I guess.
That's great. So remove just one driver at a time, and let's see which one is
the root cause. I'd also like you to set the CPU to a low speed before
suspending, with the cpufreq driver loaded, and see if resume works. I suspect
that resume fails when the CPU isn't at full speed.
Well, it's not the ide-cd driver, so it's something to do with cpufreq. I'll try
with cpu at low speed and give you the results. One thing I already noticed is
that -git6 tends to wrongly report the cpu speed in /proc/cpufreq after running
a while with the ondemand governor. Could point to the calibration code that
changed from -git5 to -git6.
If I just set the CPU speed to the lowest speed (600 MHz using the powersave
governor) right after booting my machine, it resumes fine. Then I tried changing
the state from powersave (600 MHz) to performance (1.6 GHz) every 2 seconds for
several hours. When I tried resuming, it failed, but with the HD led flashing
every few seconds. In all previous cases I've tried, there was nothing going on
at all when resume would fail. I'm now testing with -git5 just to make sure it's
not a different bug.
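The stress test described above (alternating between powersave and performance at a fixed interval) can be sketched as a small loop. This assumes the sysfs cpufreq interface is available and writable, which normally requires root; the function name toggle_governor and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: alternate between the powersave and performance governors.
# $1  path to the scaling_governor file, assumed here to be
#     /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on a real system
# $2  number of switches to perform
# $3  seconds to wait after each switch
toggle_governor() {
    gov_file=$1
    count=$2
    delay=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ $((i % 2)) -eq 0 ]; then
            echo powersave > "$gov_file"      # even steps: lowest speed
        else
            echo performance > "$gov_file"    # odd steps: full speed
        fi
        sleep "$delay"
        i=$((i + 1))
    done
}
```

As an illustration, `toggle_governor /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 5400 2` switches every 2 seconds for about three hours.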
Seems like the problem above was due to the fact I had the webcam plugged in (it
usually isn't). So far, it seems like my machine actually resumes fine with
-git6 if I just switch between powersave and performance. That would leave the
ondemand governor as the cause. Is it possible to separate the changes to
ondemand from the ones made to the rest of the cpufreq stuff? I could then test
-git5 with only the ondemand changes and -git6 without the ondemand changes.
The ondemand driver actually just changes cpu speed according to your
workload. As far as I can tell, it shouldn't be any different from switching
between powersave and performance. Can you set the governor to 'user' mode,
try different cpu speeds, and see if it works?
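Pinning the CPU to one speed, as requested here, can be sketched with the userspace governor. This assumes the sysfs cpufreq directory exposes scaling_governor and scaling_setspeed, that the userspace governor is built in or loaded, and that the caller is root. The function name set_fixed_speed is hypothetical, and frequencies are given in kHz.

```shell
#!/bin/sh
# Sketch: select the userspace governor and request one fixed frequency.
# $1  cpufreq directory, assumed to be /sys/devices/system/cpu/cpu0/cpufreq
#     on a real system
# $2  target frequency in kHz (for example 600000 for 600 MHz)
set_fixed_speed() {
    cpufreq_dir=$1
    khz=$2
    echo userspace > "$cpufreq_dir/scaling_governor"
    echo "$khz" > "$cpufreq_dir/scaling_setspeed"
}
```

Calling `set_fixed_speed /sys/devices/system/cpu/cpu0/cpufreq 600000` before suspending would test resume at the lowest speed mentioned in this thread; repeating with other supported frequencies covers the remaining cases.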
Still haven't tried the "user" (you mean userspace governor, right?) mode, but I
think I remember having problems with the default Ubuntu setup, which is to have
the userspace governor with powernowd. One interesting thing I tried was to
switch to the ondemand governor *after* my machine is hit by (unrelated) cpufreq
bug #6331. This means that the ondemand governor is unable to really change the
frequency. Despite that, after a day, my machine failed to resume. Right now,
I'm trying to see what happens if I run with ondemand, but switch to performance
just before suspending.
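The experiment described above (run with ondemand, then switch to performance immediately before suspending) can be sketched as follows. It assumes the standard sysfs paths shown as defaults and root privileges; the function name suspend_at_full_speed is hypothetical. The governor is not restored afterwards, so switching back is left as a separate manual step.

```shell
#!/bin/sh
# Sketch: force the performance governor, then suspend to RAM.
# $1  optional path to scaling_governor
# $2  optional path to the power state file
suspend_at_full_speed() {
    gov_file=${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor}
    state_file=${2:-/sys/power/state}
    echo performance > "$gov_file"   # CPU at full speed before suspend
    echo mem > "$state_file"         # same request as "echo mem > /sys/power/state"
}
```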
Tried running with ondemand, but switching to performance just before
suspending. Went fine for two days, then I got a complete lockup (keyboard LEDs
flashing) when simply trying to switch from performance to ondemand after a
successful resume. Not sure how to interpret that. I'm now trying the userspace
governor with the powernowd daemon.
OK, so I can't reproduce the problem with the userspace governor, but I've heard
from someone with a D600 that he had that problem with userspace as well. Now
what? Any
way to fix this? I don't see much more I can do in terms of experimenting...
OK, anyone working on fixing this? Or should the bug be moved to cpufreq?
No idea. Does the system use the speedstep cpufreq driver? I wonder if the
speedstep driver changes anything. Maybe the I/O-based one (acpi-cpufreq) hits
a BIOS bug.
I'm using the speedstep-centrino driver. Also, not sure what you mean about
acpi-cpufreq, but I think if it were a BIOS bug it would have always caused
problems (not just since 2.6.12-rc5-git6).
The I/O-based method needs the BIOS to handle cpu freq, so maybe it's buggy in
some situations (just my guess). But you are using speedstep-centrino, so just
I've been thinking about the options here. I've already spent months of work
trying to pinpoint this bug and no significant progress has been made towards
solving it. Given that I'm planning on replacing this laptop in a few months, I
think the cost-benefit of continuing to chase this bug (e.g. learning git just
so I can
test further) is just not worth it for me. I seem to be the only one who cares
about it anyway (despite the fact that it probably affects all Dell D600
laptops), so I guess the only option here is to mark it as WILL_NOT_FIX and
I am also experiencing this bug on my Thinkpad X60s. I am reopening it as I
have experienced it on 2.6.15 through 2.6.19 kernels. I have not tested 2.6.20 yet.
What is the current status of this bug?
Gregory, the main thing to try if you want to test whether this is the same
thing I reported is to compare 2.6.12-rc5 and 2.6.12-rc6. In my case, things
stopped working correctly in 2.6.12-rc6 (or 2.6.12-rc5-git6 to be more precise).
However, I ended up stopping the testing since there seemed to be very little
interest in fixing the bug. I have changed my laptop since then anyway.
2.6.21 was reported to work on Thinkpad X60, etc.
Please re-open the bug if you still have a bug.