Bug 6166 - Suspend to RAM regression (failure to resume)
Status: REJECTED UNREPRODUCIBLE
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake
Hardware: i386 Linux
Importance: P2 normal
Assigned To: Shaohua
Depends on:
Blocks: 7216
Reported: 2006-03-04 19:10 UTC by Jean-Marc Valin
Modified: 2007-06-05 01:52 UTC (History)

Kernel Version: 2.6.12 and up
Tree: Mainline
Regression: ---


Attachments
Loaded modules (2.96 KB, text/plain)
2006-04-12 18:57 UTC, Jean-Marc Valin

Description Jean-Marc Valin 2006-03-04 19:10:48 UTC
Most recent kernel where this bug did not occur:
The bug was introduced in kernel 2.6.12 and occurs on later versions too (tested
2.6.13 and 2.6.14). After doing further testing, it seems like the regression
happened between 2.6.12-rc5 and 2.6.12-rc6

Distribution:
Ubuntu 5.10 (Breezy), although it also happened on Ubuntu 5.04 and Debian stable.

Hardware Environment:
Dell Latitude D600 (Bios rev. A14)
Pentium-M 1.6 GHz / 1 GB RAM
ATI Technologies, Inc. Radeon Mobility 9000 M9 (R250 Lf)

Software Environment:
Default Ubuntu 5.10 software installed, except for stock kernel.

Problem Description:
With kernel 2.6.12-rc6 and later, my machine *sometimes* doesn't resume when I
suspend it. This happens especially when it has been running for a while. It
almost always works when I just rebooted, or if I just successfully resumed. So
it behaves like "something" gets randomly corrupted, at which point the machine
still works, but will not resume if I suspend it. Also, I've observed the same
behaviour with and without preemption enabled.

Steps to reproduce:
It does not reproduce reliably. In general, the longer the machine has been up,
the more likely the bug is to happen. After 7-10 days of active uptime, resume
almost never works. After a few hours, it almost always works.
Comment 1 Shaohua 2006-03-05 22:39:48 UTC
There isn't any significant change between 12-rc5 and 12-rc6. Can you
investigate further which kernel is the last one that works well? Thanks!
Comment 2 Jean-Marc Valin 2006-03-05 22:52:25 UTC
Both 2.6.10 final and 2.6.11 final work well for me, while 2.6.12, 2.6.13 and
2.6.14 are broken. Also, not sure if the info can be useful, but the Ubuntu 5.04
custom version of 2.6.11 was broken in the same way as stock 2.6.12 and up are.
I've heard it was broken for a lot of other people, so it may not mean much.
Regarding 2.6.12-rc6, I'm seeing a lot of ACPI stuff in the changelog
(http://lwn.net/Articles/138755/), so maybe one of them is the cause.
Comment 3 Shaohua 2006-03-05 22:57:54 UTC
That stuff isn't suspend/resume related. Can you double check whether 12-rc5
works well?
Comment 4 Jean-Marc Valin 2006-03-06 00:25:52 UTC
I've been running rc5 for several weeks now and it works fine so far. I really
have the impression that it's not in the suspend/resume code itself, otherwise
it would always behave the same. Because the probability of my machine not
resuming is proportional to the time it's been up, it makes me think the problem
might be a corruption somewhere else that might leave the system in an
inconsistent state.
Comment 5 Jean-Marc Valin 2006-03-06 01:12:36 UTC
Just another detail that could be useful. I've used three different ways to suspend:
- The ACPI scripts that come with Ubuntu
- The gnome-power-management scripts
- echo mem > /sys/power/state (works even from X)

I'm getting the same results (success or failure depending on the kernel) with
all three, so I don't think the suspend scripts and the video restore are
involved. Of course, I could be wrong.
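
The third method listed above amounts to a single write to a sysfs file. A minimal sketch, assuming root privileges and a kernel that offers "mem" in /sys/power/state; the helper name request_suspend and its state_path parameter are hypothetical and exist only for illustration:

```python
from pathlib import Path


# Hypothetical helper. The default path is the sysfs file named in the
# comment above; everything else here is an illustrative assumption.
def request_suspend(state="mem", state_path="/sys/power/state"):
    """Write a sleep state to the power file, like `echo mem > /sys/power/state`."""
    path = Path(state_path)
    # Reading the file lists the sleep states the kernel offers.
    supported = path.read_text().split()
    if state not in supported:
        raise ValueError(f"{state!r} not offered by {state_path}: {supported}")
    # Writing the state name asks the kernel to enter that state.
    path.write_text(state + "\n")
```

Because this bypasses the distribution scripts entirely, a failure that also shows up with it is unlikely to come from those scripts.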
Comment 6 Jean-Marc Valin 2006-03-21 15:23:21 UTC
I was just trying with 2.6.15 and I'm suspecting that the problem *may* be
related to 6266. 
Comment 7 Jean-Marc Valin 2006-04-12 09:03:18 UTC
OK, here's something more precise. 2.6.12-rc5-git8 is affected, while
2.6.12-rc5-git5 looks OK (so far so good). Before I continue testing all
versions, I'd like to know if someone is actually interested in
investigating/fixing the problem or if I'm wasting my time.
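
The window between a known-good and a known-bad snapshot can be narrowed by bisection. A minimal sketch, assuming the snapshots are listed in release order and that every snapshot after the first bad one is also bad; the function name first_bad and the is_bad callback are hypothetical:

```python
def first_bad(versions, is_bad):
    """Return the first entry of `versions` for which is_bad() is true.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    every version after the first bad one is bad as well.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return versions[hi]


# Snapshot names taken from the comment above; the intermediate ones are
# assumed to exist between git5 and git8.
snapshots = [
    "2.6.12-rc5-git5",
    "2.6.12-rc5-git6",
    "2.6.12-rc5-git7",
    "2.6.12-rc5-git8",
]
```

Here is_bad stands for "boot this kernel, run it for several days, and check whether resume fails". Since the failure is intermittent, a "good" verdict is only as reliable as the uptime behind it.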
Comment 8 Shaohua 2006-04-12 18:38:54 UTC
I can't promise I can fix this, but I really hope it can be fixed. I think
you have already done a good job. BTW, with an unstable kernel, did you try to
unload as many drivers as possible? Maybe it's a driver issue.
Comment 9 Jean-Marc Valin 2006-04-12 18:55:57 UTC
I tried unloading the usb stuff (which I think the scripts do anyway) when I was
testing 2.6.12-final and it didn't make a difference. Any particular modules I
should try unloading (or any that had a significant change in the current
investigation window)? To me it still really looks like a sort of "internal
corruption" (that may or may not be caused by a driver) that happens randomly
and causes strange things, while not crashing the machine. I've had problems with
at least USB and cpufreq, but never before 2.6.12-rc5, so I suspect they could
have the same cause as the suspend problem. Interesting is that when one of
those occurs, my system usually does not wake if I suspend it.
Comment 10 Jean-Marc Valin 2006-04-12 18:57:18 UTC
Created attachment 7851
Loaded modules

This is the result of:
% lsmod > lsmod.txt
Any module suspected more than others?
Comment 11 Shaohua 2006-04-12 19:08:23 UTC
No particular one. I can only say please keep trying, sorry.
Comment 12 Thomas Renninger 2006-04-13 03:47:38 UTC
This got fixed in recent kernels. Can someone mark this one as a duplicate of
bug #6331? For whatever reason I cannot.
Comment 13 Thomas Renninger 2006-04-13 03:53:07 UTC
Oops, wrong. This bug was mentioned in another one as possibly related. It's the
same machine, but a different bug, sorry.
Comment 14 Jean-Marc Valin 2006-04-17 17:08:06 UTC
OK, problem tracked down to the transition between 2.6.12-rc5-git5 and
2.6.12-rc5-git6. Just looking at the patch, there seems to be a lot of
cpufreq-related stuff, which happens to be one of the things I suspect. Is there
a complete changelog for -git6? Is this precise enough for you to be able to
find the bug?
Comment 15 Shaohua 2006-04-17 17:52:02 UTC
>between 2.6.12-rc5-git5 and 2.6.12-rc5-git6.
That's great. I'll do a diff and see the changes.

Then, if you don't use the cpufreq driver, what happens?
Comment 16 Shaohua 2006-04-17 18:54:59 UTC
I just did a diff between git5 and git6. There are a lot of cpufreq changes,
but they don't look relevant to me. Please confirm that the git5 to git6 change
is what breaks your system. Anyway, please try to unload the ide-cd driver and
the cpufreq driver, and let's see what happens.
Comment 17 Jean-Marc Valin 2006-04-17 21:52:00 UTC
I can unload ide-cd with no problem but I don't see how I can unload the cpufreq
driver because it's always in use.
Comment 18 Shaohua 2006-04-17 22:01:41 UTC
You need to unload the speedstep-centrino, speedstep-ich or acpi-cpufreq 
driver first.
Comment 19 Jean-Marc Valin 2006-04-18 18:35:03 UTC
Did some more testing that keeps confirming git6 as the source of the problem.
Now I'm trying to unload modules, but I can't remove speedstep-centrino because
it's referenced once (owner not specified by lsmod). I was able to remove
ondemand and ide-cd, so we'll see if that helps, especially leaving the CPU at
max frequency all the time.
Comment 20 Jean-Marc Valin 2006-04-18 18:43:12 UTC
Just a note, I have a colleague who also has a D600 and has the same problem with
suspend to RAM. While I can't confirm that all D600 are affected, at least I
know it's not just my laptop that's affected.
Comment 21 Jean-Marc Valin 2006-04-21 08:11:41 UTC
I tried removing ide-cd and cpufreq-ondemand (couldn't remove
speedstep-centrino) and set the policy to performance. Turns out that in this
setup, my machine always resumes. This narrows the problem down quite a bit I guess.
Comment 22 Shaohua 2006-04-23 19:03:57 UTC
That's great. Now remove just one driver at a time, and let's see which one is
the root cause. I'd also like you to try setting the CPU to a low speed before
suspend with the cpufreq driver loaded, and see if resume works. I suspect
resume will fail when the CPU isn't at full speed.
Comment 23 Jean-Marc Valin 2006-04-25 03:34:35 UTC
Well, it's not the ide-cd driver, so it's something to do with cpufreq. I'll try
with cpu at low speed and give you the results. One thing I already noticed is
that -git6 tends to wrongly report the cpu speed in /proc/cpufreq after running
a while with the ondemand governor. Could point to the calibration code that
changed from -git5 to -git6.
Comment 24 Jean-Marc Valin 2006-04-26 00:56:16 UTC
If I just set the CPU speed to the lowest speed (600 MHz using the powersave
governor) right after booting my machine, it resumes fine. Then I tried changing
the state from powersave (600 MHz) to performance (1.6 GHz) every 2 seconds for
several hours. When I tried resuming, it failed, but with the HD led flashing
every few seconds. In all previous cases I've tried, there was nothing going on
at all when resume would fail. I'm now testing with -git5 just to make sure it's
not a different bug.
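
The governor stress test described above can be sketched as a small loop. This is only an illustration, assuming root privileges and the usual sysfs layout for cpu0; the function name toggle_governors and its parameters are hypothetical:

```python
import time
from pathlib import Path

# Assumed sysfs location of the governor setting for the first CPU.
GOVERNOR_FILE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"


def toggle_governors(cycles, path=GOVERNOR_FILE, interval=2.0,
                     governors=("powersave", "performance"),
                     sleep=time.sleep):
    """Alternate between the given governors, one switch per `interval` seconds."""
    target = Path(path)
    for i in range(cycles):
        # Even iterations select the first governor, odd ones the second.
        target.write_text(governors[i % len(governors)] + "\n")
        sleep(interval)
```

With the defaults, toggle_governors(7200) would keep switching for about four hours. The sleep parameter is there only so the loop can be exercised without waiting.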
Comment 25 Jean-Marc Valin 2006-04-28 03:48:06 UTC
Seems like the problem above was due to the fact I had the webcam plugged in (it
usually isn't). So far, it seems like my machine actually resumes fine with
-git6 if I just switch between powersave and performance. That would leave the
ondemand governor as the cause. Is it possible to separate the changes to
ondemand from the ones made to the rest of the cpufreq stuff? I could then test
-git5 with only the ondemand changes and -git6 without the ondemand changes.
Comment 26 Shaohua 2006-04-28 18:16:34 UTC
The ondemand driver actually just changes the CPU speed according to your
workload. As far as I can tell, it shouldn't be any different from switching
between powersave and performance. Can you set the governor to 'user' mode,
try different CPU speeds, and see if it works?
Comment 27 Jean-Marc Valin 2006-05-04 00:02:41 UTC
Still haven't tried the "user" (you mean userspace governor, right?) mode, but I
think I remember having problems with the default Ubuntu setup, which is to have
the userspace governor with powernowd. One interesting thing I tried was to
switch to the ondemand governor *after* my machine is hit by (unrelated) cpufreq
bug #6331. This means that the ondemand governor is unable to really change the
frequency. Despite that, after a day, my machine failed to resume. Right now,
I'm trying to see what happens if I run with ondemand, but switch to performance
just before suspending.
Comment 28 Jean-Marc Valin 2006-05-05 08:16:21 UTC
Tried running with ondemand, but switching to performance just before
suspending. Went fine for two days, then I got a complete lockup (keyboard leds
flashing) when simply trying to switch from performance to ondemand after a
successful resume. Not sure how to interpret that. I'm now trying the userspace
governor with the powernowd daemon.
Comment 29 Jean-Marc Valin 2006-06-03 07:51:46 UTC
OK, so I can't reproduce the problem with the userspace governor, but I've heard
from someone with a D600 that he had that problem with userspace as well. Now what? Any
way to fix this? I don't see much more I can do in terms of experimenting...
Comment 30 Jean-Marc Valin 2006-06-08 18:09:27 UTC
OK, anyone working on fixing this? Or should the bug be moved to cpufreq?
Comment 31 Shaohua 2006-07-18 18:43:23 UTC
No idea. Does the system use the speedstep cpufreq driver? I wonder if the
speedstep driver changes anything. Maybe the I/O-based driver (acpi-cpufreq)
is hitting a BIOS bug.
Comment 32 Jean-Marc Valin 2006-07-18 18:57:05 UTC
I'm using the speedstep-centrino driver. Also, not sure what you mean about
acpi-cpufreq, but I think if it were a BIOS bug it would have always caused
problems (not just since 2.6.12-rc5-git6).
Comment 33 Shaohua 2006-07-18 19:07:56 UTC
The I/O-based method needs the BIOS to handle CPU frequency changes, so maybe
it's buggy in some situations (just my guess). But you are using
speedstep-centrino, so just ignore it.
Comment 34 Jean-Marc Valin 2006-07-26 18:31:10 UTC
I've been thinking about the options here. I've already spent months of work
trying to pinpoint this bug and no significant progress has been made towards solving
it. Given that I'm planning on replacing this laptop in a few months, I think
the cost-benefit of continuing to chase this bug (e.g. learning git just so I can
test further) is just not worth it for me. I seem to be the only one who cares
about it anyway (despite the fact that it probably affects all Dell D600
laptops), so I guess the only option here is to mark it as WILL_NOT_FIX and
close it. 
Comment 35 Gregory Oschwald 2007-02-10 15:45:22 UTC
I am also experiencing this bug on my Thinkpad X60s.  I am reopening it as I
have experienced it on 2.6.15 through 2.6.19 kernels.  I have not tested 2.6.20 yet.
Comment 36 Rafael J. Wysocki 2007-06-04 10:47:10 UTC
What is the current status of this bug?
Comment 37 Jean-Marc Valin 2007-06-04 17:11:25 UTC
Gregory, the main thing to try if you want to test whether this is the same
thing I reported is to compare 2.6.12-rc5 and 2.6.12-rc6. In my case, things
stopped working correctly in 2.6.12-rc6 (or 2.6.12-rc5-git6 to be more precise).
However, I ended up stopping the testing since there seemed to be very little
interest in fixing the bug. I have changed my laptop since then anyway.
Comment 38 Alexey Starikovskiy 2007-06-05 01:52:18 UTC
2.6.21 was reported to work on Thinkpad X60, etc.
Please re-open the bug if you still have a bug.
