Bug 9501

Summary: suspend/hibernation issue with kernel 2.6.23-gentoo-r3 at T41
Product: Power Management
Reporter: Toralf Förster (toralf.foerster)
Component: Hibernation/Suspend
Assignee: Rafael J. Wysocki (rjwysocki)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: akpm, jdike
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.23.9
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 9056    

Description Toralf Förster 2007-12-04 04:33:52 UTC
Most recent kernel where this bug did not occur: 2.6.24-rc3-g92d499d9
Distribution: Gentoo
Hardware Environment: ThinkPad T41
Software Environment: Gentoo Linux + user mode linux
Problem Description:


This seems to be a regression in kernel 2.6.23: suspend/hibernate fails if a user mode linux image is running.

With previous kernel versions (tested gentoo-sources-2.6.22-r9 and gentoo-sources-2.6.21-r4) I have no problem either suspending or hibernating the ThinkPad T41.

However, with gentoo-sources-2.6.23-r3 the system neither suspends nor hibernates. The good news is that the system doesn't hang :-) It starts to go to sleep, but after a few seconds it comes back and I can continue working.


Steps to reproduce:

1) Start the user mode linux image:
/usr/local/bin/linux-2.6.23 ubda=/opt/uml/root_fs ubdb=/opt/uml/swap_fs eth0=tuntap,,2A:93:E5:15:4E:B9,192.168.0.253 mem=128M umid=root

2) Press <Fn>+<F4> to suspend or <Fn>+<F12> to hibernate.

For the UML kernel I tested both the Gentoo linux-2.6.18-usermode-r2 UML kernel and a vanilla linux-2.6.23 UML kernel.
Comment 1 Andrew Morton 2007-12-04 09:11:17 UTC
I don't understand.  Are you saying that the kernel fails to suspend when
a UML instance is running?  Or are you saying that the UML kernel will no
longer suspend?

(If the latter: that's news to me - I didn't know you could suspend and resume
UML...)
Comment 2 Toralf Förster 2007-12-04 12:40:13 UTC
The host system cannot be suspended or hibernated while a UML instance is running. (BTW, I never tried to suspend/hibernate the UML instance itself.)
Comment 3 Rafael J. Wysocki 2007-12-04 15:49:08 UTC
Do I understand correctly that 2.6.24-rc3-g92d499d9 is the kernel in which the problem appears?  If I do, what's the last known good kernel?  Is that vanilla 2.6.23 or 2.6.24-rc[12]*?

Also, what exactly is the configuration?  Do you have the same kernel on both the host system and UML or are these kernels different, in which case what kernel is used on the host system?
Comment 4 Toralf Förster 2007-12-05 00:49:05 UTC
The last good host kernel was linux-2.6.22-gentoo-r9-pppoe.
The currently used host kernel, linux-2.6.23-gentoo-r3, has the issue.
With the current git kernel, 2.6.24-rc3-g92d499d9, the problem cannot be reproduced.

For the UML kernel I tried two different versions (linux-2.6.18-usermode-r2 and linux-2.6.23); both show the same behaviour.
Comment 5 Rafael J. Wysocki 2007-12-05 10:43:44 UTC
(In reply to comment #4)
> The last good host kernel was linux-2.6.22-gentoo-r9-pppoe.
> The currently used host kernel, linux-2.6.23-gentoo-r3, has the issue.
> With the current git kernel, 2.6.24-rc3-g92d499d9, the problem cannot be
> reproduced.

Does this mean that the problem has disappeared between 2.6.23-gentoo-r3 and 2.6.24-rc3-g92d499d9 or that 2.6.24-rc3-g92d499d9 is completely unusable and the problem cannot be reproduced for this reason?

> For the UML kernel I tried two different versions
> (linux-2.6.18-usermode-r2 and linux-2.6.23); both show the same behaviour.

OK, so the problem is with the host kernel rather than with the UML one.
Comment 6 Toralf Förster 2007-12-06 00:55:40 UTC
> Does this mean that the problem has disappeared between 2.6.23-gentoo-r3 and
> 2.6.24-rc3-g92d499d9
Right, the current kernel is OK.
Comment 7 Rafael J. Wysocki 2007-12-06 08:12:09 UTC
So, the problem is that 2.6.23 doesn't work correctly, right?
Comment 8 Toralf Förster 2007-12-06 08:50:40 UTC
2.6.23.9, to be exact.
Comment 9 Rafael J. Wysocki 2007-12-12 16:11:15 UTC
Well, this looks like a freezer problem to me, but I don't think any of my patches that went in after 2.6.23 could fix it.  That may very well be a UML patch.

To find a patch that fixes the problem, you'd have to carry out a bisection between 2.6.23 and the current mainline.
Comment 10 Toralf Förster 2007-12-13 13:11:10 UTC
(In reply to comment #9)
> 
> To find a patch that fixes the problem, you'd have to carry out a bisection
> between 2.6.23 and the current mainline.
> 
I bisected it; the first bad commit is:

commit 0c1eecfb345401629aa57c9d3b077273e56c45a7
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date:   Thu Jul 19 01:47:33 2007 -0700

    Freezer: avoid freezing kernel threads prematurely

    Kernel threads should not have TIF_FREEZE set when user space processes are
    being frozen, since otherwise some of them might be frozen prematurely.
    To prevent this from happening we can (1) make exit_mm() unset TIF_FREEZE
    unconditionally just after clearing tsk->mm and (2) make try_to_freeze_tasks()
    check if p->mm is different from zero and PF_BORROWED_MM is unset in p->flags
    when user space processes are to be frozen.

    Namely, when user space processes are being frozen, we only should set
    TIF_FREEZE for tasks that have p->mm different from NULL and don't have
    PF_BORROWED_MM set in p->flags.  For this reason task_lock() must be used to
    prevent try_to_freeze_tasks() from racing with use_mm()/unuse_mm(), in which
    p->mm and p->flags.PF_BORROWED_MM are changed under task_lock(p).  Also, we
    need to prevent the following scenario from happening:

    * daemonize() is called by a task spawned from a user space code path
    * freezer checks if the task has p->mm set and the result is positive
    * task enters exit_mm() and clears its TIF_FREEZE
    * freezer sets TIF_FREEZE for the task
    * task calls try_to_freeze() and goes to the refrigerator, which is wrong at
      that point

    This requires us to acquire task_lock(p) before p->flags.PF_BORROWED_MM and
    p->mm are examined and release it after TIF_FREEZE is set for p (or it turns
    out that TIF_FREEZE should not be set).

    Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
    Cc: Gautham R Shenoy <ego@in.ibm.com>
    Cc: Pavel Machek <pavel@ucw.cz>
    Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment 11 Rafael J. Wysocki 2007-12-13 13:35:01 UTC
(In reply to comment #10)
> (In reply to comment #9)
> > 
> > To find a patch that fixes the problem, you'd have to carry out a bisection
> > between 2.6.23 and the current mainline.
> 
> I bisected it, first bad commit is : 

Well, thanks, but I didn't mean that.  You have found a patch that broke 2.6.23 for you (which is useful information BTW), but what we need is the patch that _fixed_ this breakage in 2.6.24-rc.

IOW, mark 2.6.23 as "good" (in fact it's bad, but so what?), mark the current mainline as "bad" (we know that in fact it's good) and run the bisection marking all good kernels as "bad" and vice versa.  Then, the first "bad" one returned by the bisection should be the one that fixes the bug.
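
The label swap described above can be tried out on a throwaway repository before spending hours on kernel builds. What follows is only a minimal sketch: the repository, the `status` file and the commit subjects are made up for illustration, and `grep` stands in for "boot the kernel and try to suspend". Because a broken revision must be reported as "good", the test command exits 0 when the bug is still present.

```shell
#!/bin/sh
# Toy "reverse" bisection: find the commit that FIXED a bug.
# Hypothetical setup: the file "status" says whether a revision is broken.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Bisect Demo"
git config user.email "demo@example.invalid"

# Commits 1..3 are broken, commit 4 contains the fix, 5..8 keep it.
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -ge 4 ]; then echo fixed > status; else echo broken > status; fi
    echo "$i" > counter
    git add status counter
    git commit -q -m "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)

# Inverted labels: the newest (working) revision is "bad",
# the oldest (broken) revision is "good".
git bisect start HEAD "$first" > /dev/null

# Exit status 0 means "good" to git bisect, so return 0 while the bug
# is still there and non-zero once it is gone.
git bisect run grep -q broken status > /dev/null

# The first "bad" commit is the one that introduced the fix.
fix=$(git log -1 --format=%s refs/bisect/bad)
echo "first fixed commit: $fix"
git bisect reset > /dev/null 2>&1
```

On a real kernel tree the `grep` step would be replaced by building, booting and testing suspend by hand, answering `git bisect good` for a kernel that still fails and `git bisect bad` for one that works. Git 2.7 and later can also name the labels explicitly, for example `git bisect start --term-old=broken --term-new=fixed`, which avoids the mental inversion; that option was not available when this report was written.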
Comment 12 Toralf Förster 2007-12-14 04:47:16 UTC
(In reply to comment #11)
> 2.6.23 for you (which is useful information BTW)
I hope so, because it took 2-3 hours.

> but what we need is the patch that _fixed_ this breakage in 2.6.24-rc.
OK, I'll try to find some time for this at the weekend.

> IOW, mark 2.6.23 as "good" (in fact it's bad, but so what?), mark the current
> mainline as "bad" (we know that in fact it's good) and run the bisection
> marking all good kernels as "bad" and vice versa.  Then, the first "bad" one
> returned by the bisection should be the one that fixes the bug.
I thought git could handle the case of 2.6.23 being "bad" and 2.6.24 being "good", couldn't it?
Comment 13 Toralf Förster 2007-12-16 08:59:58 UTC
Following comment #11, I found this commit, which solves the issue:

commit d5d8c5976d6adeddb8208c240460411e2198b393
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date:   Thu Oct 18 03:04:46 2007 -0700

    freezer: do not send signals to kernel threads

    The freezer should not send signals to kernel threads, since that may lead to
    subtle problems.  In particular, commit
    b74d0deb968e1f85942f17080eace015ce3c332c has changed recalc_sigpending_tsk()
    so that it doesn't clear TIF_SIGPENDING.  For this reason, if the freezer
    continues to send fake signals to kernel threads and the freezing of kernel
    threads fails, some of them may be running with TIF_SIGPENDING set forever.

    Accordingly, recalc_sigpending_tsk() shouldn't set the task's TIF_SIGPENDING
    flag if TIF_FREEZE is set.

    Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
    Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
    Cc: Pavel Machek <pavel@ucw.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment 14 Rafael J. Wysocki 2007-12-16 10:19:33 UTC
Thanks a lot for your work!

Well, this patch is definitely too invasive to go into 2.6.23.y.  I'll try to figure out if we can get a simpler fix for the "UML vs freezer" breakage in 2.6.23 (I think that's a "stopped tasks vs freezer" breakage, BTW).
Comment 15 Toralf Förster 2007-12-16 13:02:31 UTC
BTW, I gave this a try:

n22 /usr/src/linux # (cd /home/tfoerste/devel/linux-2.6/; git diff e42837b..d5d8c59) | patch -p1

(against gentoo-sources-2.6.23-r3, which is mainly 2.6.23.9) but had no success.
Comment 16 Rafael J. Wysocki 2007-12-16 15:04:08 UTC
Hmm.  Looks like something else is still missing.  No idea what that is, for now, but I'm going to find out.
Comment 17 Rafael J. Wysocki 2008-01-02 09:29:21 UTC
Unfortunately, I'm unable to reproduce the problem without UML, so there has to be a subtle bug affecting UML only.  Moreover, it's difficult to single out the specific change that fixed the problem in the recent kernels.  It is probably a combination of commits:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=232b14328050a4639130b0dec185f43968e72035
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2e1318956ce6bf149af5c5e98499b5cd99f99c89
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e42837bcd35b75bb59ae5d3e62f87be1aeeb05c3
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d5d8c5976d6adeddb8208c240460411e2198b393
(in this order), but I can't test that locally.
Comment 19 Rafael J. Wysocki 2008-01-03 15:14:42 UTC
Thanks for verifying.

This will also require these three fixes on top, at a minimum:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cc5f916e90a811dd8f809b4d17409f98e74b237c
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8baabde66c60a84781c718c28fe283ed411a7bd0
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e136e769d471e7f3d24a8f6bf9c91dcb372bd0ab

All in all, I'm afraid the combined patch is too complex for 2.6.23.y, and since we now know which patches need to be applied, I'd like to resolve this entry with PATCH_ALREADY_AVAILABLE.
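
For anyone who wants to try the combined backport locally, the commits named in comments 17 and 19 can be collected into one series. This is a dry-run sketch only: it prints the `git cherry-pick` commands in the order given in the comments and applies nothing. It assumes a clone of Linus' tree with a working branch based on 2.6.23; the function name `print_backport_plan` is made up here. Whether the picks apply cleanly to 2.6.23.y is not confirmed anywhere in this report.

```shell
#!/bin/sh
# Dry run: list the cherry-pick commands for the backport series.
# Nothing is applied; the commands are only printed.

# The four commits from comment 17, then the three fixes from comment 19,
# in the order given there.
series="
232b14328050a4639130b0dec185f43968e72035
2e1318956ce6bf149af5c5e98499b5cd99f99c89
e42837bcd35b75bb59ae5d3e62f87be1aeeb05c3
d5d8c5976d6adeddb8208c240460411e2198b393
cc5f916e90a811dd8f809b4d17409f98e74b237c
8baabde66c60a84781c718c28fe283ed411a7bd0
e136e769d471e7f3d24a8f6bf9c91dcb372bd0ab
"

# Hypothetical helper: emit one cherry-pick command per commit.
# -x records the original commit id in the new commit message.
print_backport_plan() {
    for commit in $series; do
        echo "git cherry-pick -x $commit"
    done
}

plan=$(print_backport_plan)
echo "$plan"
```

Printing the plan first makes it easy to review the order before touching the tree. If a pick conflicts, `git cherry-pick --abort` returns the branch to its previous state.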