Bug 12149 - suspend/hibernate issues with ThinkPad T41
Status: CLOSED UNREPRODUCIBLE
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assigned To: Thomas Gleixner
Depends on:
Blocks: 7216
Reported: 2008-12-03 09:59 UTC by Toralf Förster
Modified: 2011-01-17 22:16 UTC

See Also:
Kernel Version: 2.6.27.7
Tree: Mainline
Regression: Yes


Attachments
/var/log/message excerpt (82.45 KB, text/plain)
2008-12-03 11:02 UTC, Toralf Förster
.config to reproduce the (new) issue (21.83 KB, text/plain)
2008-12-04 12:40 UTC, Toralf Förster
dmesg of failure of 2.6.27-gentoo-r7 (523.64 KB, text/plain)
2009-01-19 09:36 UTC, Toralf Förster
/var/log/messages part (213.49 KB, text/plain)
2009-05-20 14:00 UTC, Toralf Förster

Description Toralf Förster 2008-12-03 09:59:48 UTC
Latest working kernel version: 2.6.26.8
Earliest failing kernel version: 2.6.27.7
Distribution: Gentoo
Hardware Environment: T41
Software Environment: Gentoo Linux
Problem Description: 

Steps to reproduce: https://bugs.gentoo.org/show_bug.cgi?id=249498
Comment 1 Daniel Drake 2008-12-03 10:19:55 UTC
Toralf hits the bug through the following process:

1. suspend-to-ram
2. resume
3. suspend-to-disk

At this point, the suspend-to-disk fails with the following error message:
Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze):

Oddly, it is a bash process refusing to freeze.

Toralf, please attach the /var/log/messages file here so that it can be accessed directly.

The suspend-to-RAM prior to suspend-to-disk is necessary for the bug to appear. suspend-to-disk works fine if he does not suspend-to-ram beforehand.

This bug is a 2.6.27 regression. However, it does not appear in 2.6.28-rc7. We would like to identify the fix and backport it into 2.6.27.x; can you help us? Thanks!
Comment 2 Toralf Förster 2008-12-03 11:02:55 UTC
Created attachment 19126
/var/log/message excerpt

Syslog output; please see https://bugs.gentoo.org/show_bug.cgi?id=249498#add_comment for the actions taken.
Comment 3 Rafael J. Wysocki 2008-12-03 14:59:18 UTC
References : http://marc.info/?l=linux-kernel&m=122805295416976&w=4
Comment 4 Rafael J. Wysocki 2008-12-03 16:13:48 UTC
Date : 2008-12-03 09:59
Comment 5 Toralf Förster 2008-12-04 12:40:12 UTC
Created attachment 19148
.config to reproduce the (new) issue

I could narrow down the issue to the attached .config - BTW it took nearly as long as bisecting it ... :-(

Please note that the behaviour is now reversed, which means:

1. echo disk > /sys/power/state <--- works
2. wakeup system
3. echo mem > /sys/power/state  <--- failed

If I deselect CONFIG_HPET_TIMER, the issue goes away and step 3 (suspend to mem) works fine.
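
Editorial sketch: the three steps above can also be driven from a small C helper instead of the shell. This is only an illustration under assumptions. The function names request_sleep_state and reproduce_sequence are hypothetical, and the path is passed in as a parameter so the helper can be pointed at an ordinary file for a dry run rather than at /sys/power/state.

```c
#include <stdio.h>

/*
 * Write a sleep-state keyword ("disk", "mem", ...) to the given file.
 * Returns 0 on success and -1 on failure. With buffered stdio the actual
 * write usually happens at fclose(), so its result is checked as well.
 */
static int request_sleep_state(const char *path, const char *state)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = fputs(state, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Hypothetical driver for the sequence from this comment:
 * hibernate, wake up, then suspend to RAM.
 * Returns 0 if both requests were accepted, otherwise the failing step.
 */
static int reproduce_sequence(const char *path)
{
    if (request_sleep_state(path, "disk") != 0)
        return 1;   /* step 1 failed */
    /* On real hardware the write above returns only after the wakeup (step 2). */
    if (request_sleep_state(path, "mem") != 0)
        return 3;   /* step 3 failed: the symptom reported here */
    return 0;
}
```

Calling reproduce_sequence("/sys/power/state") as root would really put the machine to sleep, so a scratch file is the safer target when only checking the helper itself.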
Comment 6 Daniel Drake 2008-12-05 08:21:17 UTC
Toralf, any chance of a bisection? It would give us something concrete to work from.
Comment 7 Toralf Förster 2008-12-06 07:03:19 UTC
(In reply to comment #6)
> Toralf, any chance of a bisection? 
Yes - b/c of hours of live winter sport on German TV I could bisect this. In the stable 2.6.27.y branch (b/c both v2.6.17 and current git were fine) the first bad commit is:

5e55aa8db085dad1aabb4574c73c23c7ae571e7b


Comment 8 Daniel Drake 2008-12-06 07:28:56 UTC
Oh, another regression in 2.6.27-stable :(

The commit in question is 5b7dba4ff834259a5623e03a565748704a8fe449 in Linus' tree

author	Dave Kleikamp <shaggy@linux.vnet.ibm.com>

sched_clock: prevent scd->clock from moving backwards

When sched_clock_cpu() couples the clocks between two cpus, it may
increment scd->clock beyond the GTOD tick window that __update_sched_clock()
uses to clamp the clock.  A later call to __update_sched_clock() may move
the clock back to scd->tick_gtod + TICK_NSEC, violating the clock's
monotonic property.

This patch ensures that scd->clock will not be set backward.


---

Dave, the above commit was backported into 2.6.27-stable for the 2.6.27.5 release, but it produced the strange suspend regression detailed in this bug. However, the bug does not appear in 2.6.28. Can you help us fix the 2.6.27 tree (was there a dependent patch too?), or should we propose that it gets removed from linux-stable, or something else? Thanks
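
Editorial sketch: the clamp described in the commit message can be modelled in user space to see when the two variants differ. This is a simplified model, not the kernel source. The struct is cut down to three fields, TICK_NSEC is assumed to be 1 ms (HZ=1000), and update_clock with its patched flag is a hypothetical wrapper; only the wrap_min/wrap_max names and the min_clock/max_clock expressions follow the code quoted in this report.

```c
#include <stdint.h>

/* Assumed tick length of 1 ms (HZ=1000); the real value depends on the config. */
#define TICK_NSEC 1000000ULL

/* Cut-down stand-in for the kernel's struct sched_clock_data. */
struct sched_clock_data {
    uint64_t tick_raw;   /* raw clock value sampled at the last tick */
    uint64_t tick_gtod;  /* GTOD time sampled at the last tick */
    uint64_t clock;      /* last value handed out */
};

/* Wrap-safe min/max: compare through the signed difference. */
static uint64_t wrap_min(uint64_t x, uint64_t y)
{
    return (int64_t)(x - y) < 0 ? x : y;
}

static uint64_t wrap_max(uint64_t x, uint64_t y)
{
    return (int64_t)(x - y) > 0 ? x : y;
}

/*
 * Clamp tick_gtod + delta into [min_clock, max_clock].
 * patched == 0: max_clock = tick_gtod + TICK_NSEC (before the commit)
 * patched != 0: max_clock = max(clock, tick_gtod + TICK_NSEC) (after it)
 */
static uint64_t update_clock(struct sched_clock_data *scd, uint64_t now, int patched)
{
    int64_t delta = (int64_t)(now - scd->tick_raw);
    uint64_t clock, min_clock, max_clock;

    if (delta < 0)
        delta = 0;

    clock = scd->tick_gtod + (uint64_t)delta;
    min_clock = wrap_max(scd->tick_gtod, scd->clock);
    max_clock = scd->tick_gtod + TICK_NSEC;
    if (patched)
        max_clock = wrap_max(scd->clock, max_clock);

    clock = wrap_max(clock, min_clock);
    clock = wrap_min(clock, max_clock);

    scd->clock = clock;
    return clock;
}
```

If scd->clock has already been pushed past tick_gtod + TICK_NSEC, the unpatched variant pulls it back to the end of the tick window, while the patched variant leaves it where it was. While scd->clock stays inside the window, both variants return the same value.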
Comment 9 Rafael J. Wysocki 2008-12-06 08:58:40 UTC
Since the bug is apparently not present in the mainline, I'm dropping it from the list of mainline regressions from 2.6.26.  Still, Dave, please take care of it if possible.
Comment 10 Dave Kleikamp 2008-12-07 20:08:44 UTC
I'm at a loss to explain why this patch should cause the suspend regression.  In fact, I would expect my patch to be a no-op on a single cpu system, since I don't expect that scd->clock would ever exceed scd->tick_gtod + TICK_NSEC.

I'm not an authority on either the scheduler or suspend/resume.  There have been a lot of scheduler changes in the mainline kernel since 2.6.27.  Ingo or Peter may know of something that could explain this.

My patch didn't fix an observed problem.  Rather, I was testing some debug code for powerpc on an x86 machine when I discovered the problem.  If we can't figure out what's causing the problem, removing the patch from the stable branch is probably safe.
Comment 11 Daniel Drake 2008-12-14 10:10:31 UTC
Thanks for your input. I've reverted it from the Gentoo 2.6.27 kernel.
Comment 12 Toralf Förster 2008-12-14 11:58:31 UTC
BTW, reverting commit 5e55aa8 at 2.6.27.9 solved the issue for the current stable kernel at my system.
Comment 13 Axel Dyks 2008-12-14 12:17:34 UTC
(In reply to comment #12)
> BTW, reverting commit 5e55aa8 at 2.6.27.9 solved the issue for the current
> stable kernel at my system.
Seems to be the same commit
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commitdiff;h=5e55aa8db085dad1aabb4574c73c23c7ae571e7b
That's exactly what has been reverted from 2.6.27(.9) on gentoo.
http://sources.gentoo.org/viewcvs.py/linux-patches/genpatches-2.6/tags/2.6.27-8/1800_revert-clock-backwards-fix.patch?rev=1437&view=markup
Comment 14 Toralf Förster 2008-12-14 13:05:59 UTC
FWIW, with the following diff on top of 2.6.27.9 (revert the bad commit and add a printk I hope it wasn't done in a too dumb way b/c I do not have experiences in kernel programming nor did I really code in C within the last 10 years) 

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 8178724..2a67340 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -124,7 +124,9 @@ static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)

        clock = scd->tick_gtod + delta;
        min_clock = wrap_max(scd->tick_gtod, scd->clock);
-       max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC);
+       max_clock = scd->tick_gtod + TICK_NSEC;
+       if (scd->clock > max_clock)
+               printk(KERN_INFO "%d %d\n", scd->clock, max_clock);

        clock = wrap_max(clock, min_clock);
        clock = wrap_min(clock, max_clock);

I get in my syslog after suspend-to-disk:

...
Dec 14 21:55:24 n22 ACPI: Preparing to enter system sleep state S4
Dec 14 21:55:24 n22 Extended CMOS year: 2000
Dec 14 21:55:24 n22 PM: Creating hibernation image:
Dec 14 21:55:24 n22 PM: Need to copy 8127 pages
Dec 14 21:55:24 n22 Intel machine check architecture supported.
Dec 14 21:55:24 n22 Intel machine check reporting enabled on CPU#0.
Dec 14 21:55:24 n22 Extended CMOS year: 2000
Dec 14 21:55:24 n22 Force enabled HPET at resume
Dec 14 21:55:24 n22 -1556628607 70
Dec 14 21:55:24 n22 ACPI: Waking up from system sleep state S4
...

and for suspend-to-ram:

...
Dec 14 21:55:55 n22 ACPI: Preparing to enter system sleep state S3
Dec 14 21:55:55 n22 Extended CMOS year: 2000
Dec 14 21:55:55 n22 Intel machine check architecture supported.
Dec 14 21:55:55 n22 Intel machine check reporting enabled on CPU#0.
Dec 14 21:55:55 n22 Back to C!
Dec 14 21:55:55 n22 Extended CMOS year: 2000
Dec 14 21:55:55 n22 Force enabled HPET at resume
Dec 14 21:55:55 n22 212611283 77
Dec 14 21:55:55 n22 ACPI: Waking up from system sleep state S3
...
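
Editorial sketch: the printk in the diff passes two u64 values to "%d %d". One possible reading, assuming a 32-bit x86 build (where each u64 argument occupies two 32-bit slots and each %d consumes one), is that the two numbers in each log line are the low and high halves of scd->clock and that max_clock was not printed at all. The hypothetical helper join_words below rebuilds a 64-bit value under that assumption.

```c
#include <stdint.h>

/*
 * Rebuild a 64-bit value from the two 32-bit words that "%d %d" would
 * show for a single u64 argument, assuming the low word comes first.
 */
static uint64_t join_words(int32_t low, int32_t high)
{
    return ((uint64_t)(uint32_t)high << 32) | (uint32_t)low;
}

/* Convert nanoseconds to whole seconds for a rough plausibility check. */
static uint64_t ns_to_seconds(uint64_t ns)
{
    return ns / 1000000000ULL;
}
```

Under this reading, join_words(-1556628607, 70) gives 303386049409 ns (about 303 s) and join_words(212611283, 77) gives 330925093075 ns (about 330 s), which would be plausible scheduler clock readings a few minutes after boot. Printing a u64 with "%llu" and a cast to unsigned long long avoids the ambiguity.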
Comment 15 Axel Dyks 2008-12-14 15:03:34 UTC
(In reply to comment #14)
> FWIW, with the following diff on top of 2.6.27.9 (revert the bad commit and add
> a printk I hope it wasn't done in a too dumb way b/c I do not have experiences
> in kernel programming nor did I really code in C within the last 10 years) 
Is this something that's additionally required to solve your problem,
or is it meant as an improvement / better diagnostic output? 
Comment 16 Toralf Förster 2008-12-15 01:29:53 UTC
(In reply to comment #15)

> or is it meant as an improvement / better diagnostic output? 
It is meant in this way; maybe it is helpful for Dave Kleikamp (comment #10).

Comment 18 Thomas Gleixner 2008-12-22 15:28:32 UTC
Not fixed, just papered over the real problem.
See http://lkml.org/lkml/2008/12/22/192 ...
Comment 19 Toralf Förster 2009-01-19 09:36:18 UTC
Created attachment 19887
dmesg of failure of 2.6.27-gentoo-r7

Today I observed with 2.6.27-gentoo-r7 the problem that the system doesn't go into suspend when I close the lid. I tried it several times; finally I removed the battery to bring the system down.
Comment 20 Toralf Förster 2009-05-20 13:59:13 UTC
Today I played with NFSv4 on my T41 (2.6.29.4) using a user mode linux image (v2.6.30-rc6-72-g279e677) to verify 7ee2cb7f32b299c (Fix NFS v4 client handling of MAY_EXEC in nfs_permission). I could not verify this, but I got some zombie processes, killed some processes accessing the (in the meantime unmounted) NFSv4 dirs, shut down the nfs daemon and tried to hibernate my T41. That didn't work; I had to shut down the system instead.

The next attachment contains the relevant part of /var/log/messages. Not sure whether this is a hibernation issue or related to NFSv4.
Comment 21 Toralf Förster 2009-05-20 14:00:19 UTC
Created attachment 21448
/var/log/messages part
Comment 22 Rafael J. Wysocki 2011-01-16 22:51:44 UTC
Is this problem still present in 2.6.37?
Comment 23 Toralf Förster 2011-01-17 08:36:38 UTC
(In reply to comment #22)
> Is this problem still present in 2.6.37?

Well, in the meantime the T41 died (really) due to a hairline crack in the main board and was replaced by a T400 (bug #15100 BTW), therefore I'll close this bug report.
