Bug 13277

Summary: 2.6.30 regression - hang on 2nd resume - bisected - Thinkpad X40
Product: ACPI Reporter: Daniel Vetter (daniel)
Component: Power-Sleep-Wake Assignee: Len Brown (lenb)
Status: CLOSED CODE_FIX    
Severity: normal CC: alex.shi, lenb, rjw, yakui.zhao
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.30-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13070    
Attachments: dmesg of 2.6.30-rc2 with ff69.. reverted
dmesg of 2.6.30-rc6-lockdep-00043-g22ef37e
dmesg of 2.6.30-rc7-lockdep-00082-g07f4f3e freezing after resume

Description Daniel Vetter 2009-05-11 10:08:08 UTC
Created attachment 21303 [details]
dmesg of 2.6.30-rc2 with ff69.. reverted

Hardware: IBM Thinkpad X40, 1.2 GHz Pentium M, 1.5 GB RAM
Software: Debian unstable 32bit

ff69f2bba67bd45514923aaedbf40fe351787c59 caused a regression when booting-up on my setup. See

http://bugzilla.kernel.org/show_bug.cgi?id=13087

for reference. This issue has been fixed with f461ddea0af8b98e2b7940eba9c693b0ee44d64a. Unfortunately ff69.. also caused a resume hang on my setup, which is _not_ yet fixed.

Problem: 2.6.30-rc2 hangs on resume in the second suspend/resume cycle. The fan and hard disk spin up and the LCD backlight switches on (showing the blinking cursor), but then the machine hangs. Only holding down the power button for 4 secs helps. The suspend indicator light does not switch to the blinking mode like it does when resuming normally (before it switches off completely). Reverting ff69.. on top of 2.6.30-rc2 fixes the issue.

The latest kernel I tried is v2.6.30-rc5-96-ga4d7749. But there, reverting f461.. and ff69.. didn't fully fix the problem: I could only resume two times (instead of only once) before the machine hung when resuming.

Kernels before ff69 work flawlessly (at least the ones I've tested, and I'm updating -linus from git fairly often).

I'll add the dmesg of one suspend-resume cycle of 2.6.30-rc2 with ff69 reverted.
Comment 1 Len Brown 2009-05-12 01:51:02 UTC
Please test this patch:
http://patchwork.kernel.org/patch/22499/
(ACPI: suspend: restore BM_RLD on resume)
Comment 2 ykzhao 2009-05-12 03:59:48 UTC
Hi, Daniel
    Will you please try the patch in comment #1 and see whether the issue still exists?
   If it still exists, will you please enable "CONFIG_PM_DEBUG" in the kernel configuration and do the following test?
   a. echo core > /sys/power/pm_test
   b. echo mem > /sys/power/state 
   
   It would be great if you could do several suspend/resume cycles.
   Thanks.
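
A minimal sketch of how steps a and b above could be repeated for several cycles. The function name pm_test_cycles and its optional third argument (a directory standing in for /sys/power, useful for a dry run) are assumptions made for illustration and are not part of the thread. It must be run as root on a kernel built with CONFIG_PM_DEBUG.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): repeat steps a and b COUNT times.
# Usage: pm_test_cycles LEVEL COUNT [POWER_DIR]
# POWER_DIR defaults to /sys/power; pointing it at a scratch directory
# gives a harmless dry run.
pm_test_cycles() {
    level=$1
    count=$2
    power_dir=${3:-/sys/power}
    i=1
    while [ "$i" -le "$count" ]; do
        # Step a: select the pm_test level (e.g. "core").
        echo "$level" > "$power_dir/pm_test" || return 1
        # Step b: start a suspend-to-RAM cycle.
        echo mem > "$power_dir/state" || return 1
        echo "cycle $i/$count at pm_test=$level completed"
        i=$((i + 1))
    done
}
```

For example, "pm_test_cycles core 5" would run five cycles at the core level; saving dmesg afterwards gives a log to attach.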
Comment 3 Daniel Vetter 2009-05-12 08:04:20 UTC
On Tue, May 12, 2009 at 01:51:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #1 from Len Brown <len.brown@intel.com>  2009-05-12 01:51:02 ---
> Please test this patch:
> http://patchwork.kernel.org/patch/22499/
> (ACPI: suspend: restore BM_RLD on resume)
I've done about a dozen suspend/resume cycles with this patch applied.
It didn't hang a single time and works rock solid.

Thanks for the speedy fix, Daniel

[I'll close the bug report as soon as your patch hits mainline and I've
retested]
Comment 4 Rafael J. Wysocki 2009-05-13 09:25:52 UTC
Handled-By : Len Brown <len.brown@intel.com>
Patch : http://patchwork.kernel.org/patch/22499/
Comment 5 Len Brown 2009-05-16 03:09:05 UTC

*** This bug has been marked as a duplicate of bug 13032 ***
Comment 6 Daniel Vetter 2009-05-19 20:41:26 UTC
Looks like my problem is not really fixed yet and the patch only papered over the real issue. I've tested several kernels since v2.6.30-rc4-289-g815ab0f (this is the patch I tested first, as merged in mainline) and all were broken.

I'll be testing now the suggestion in comment #2.
Comment 7 Daniel Vetter 2009-05-19 20:52:32 UTC
Created attachment 21438 [details]
dmesg of 2.6.30-rc6-lockdep-00043-g22ef37e

The attached dmesg contains 5 runs of

# echo core > /sys/power/pm_test
# echo mem > /sys/power/state

The machine always resumed perfectly.
Comment 8 Daniel Vetter 2009-05-19 20:58:33 UTC
I've also tested the other options for /sys/power/pm_test (processors platform devices freezer). They all seem to work for multiple consecutive runs on 2.6.30-rc6-lockdep-00043-g22ef37e.
Comment 9 Rafael J. Wysocki 2009-05-26 19:14:05 UTC
On Tuesday 26 May 2009, Daniel Vetter wrote:
> On Sun, May 24, 2009 at 09:11:52PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.29.  Please verify if it still should be listed and let me know
> > (either way).
> I've just tested 2.6.30-rc7 and the problem still exists.
> 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject             : Thinkpad X40 no longer resumes reliable since ff69f2bba67bd45514923aaedbf40fe351787c59
> > Submitter   : Daniel Vetter <daniel@ffwll.ch>
> > Date                : 2009-05-11 10:08 (14 days old)
> > Handled-By  : Len Brown <len.brown@intel.com>
> > Patch               : http://patchwork.kernel.org/patch/22499/
> This is not correct. This patch was merged, but does not fix the problem.
> It worked while I tested it separately, but obviously it only papered over
> the real issue in this very specific kernel.
Comment 10 Rafael J. Wysocki 2009-05-26 19:14:48 UTC
Ignore-Patch : http://patchwork.kernel.org/patch/22499/
Comment 11 Len Brown 2009-05-27 02:06:35 UTC
Thanks for the additional testing, Daniel.

What does this command show?:

grep .  /sys/devices/system/clocksource/clocksource0/*

What happens if you boot the system with
"clocksource=acpi_pm" (try all of the alternatives shown
in available_clocksource)
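
A minimal sketch of how the list of boot parameters to try could be derived from the two sysfs files named above. The function name list_clocksource_params and its optional directory argument are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print one "clocksource=NAME"
# boot parameter per entry in available_clocksource and mark the active one.
# Usage: list_clocksource_params [CLOCKSOURCE_DIR]
list_clocksource_params() {
    cs_dir=${1:-/sys/devices/system/clocksource/clocksource0}
    current=$(cat "$cs_dir/current_clocksource") || return 1
    # available_clocksource holds a space-separated list of names.
    for cs in $(cat "$cs_dir/available_clocksource"); do
        if [ "$cs" = "$current" ]; then
            echo "clocksource=$cs (current)"
        else
            echo "clocksource=$cs"
        fi
    done
}
```

Each printed parameter would then be appended to the kernel command line for one boot, followed by a few suspend/resume cycles.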
Comment 12 Daniel Vetter 2009-05-27 13:04:08 UTC
On Wed, May 27, 2009 at 02:06:36AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #11 from Len Brown <len.brown@intel.com>  2009-05-27 02:06:35 ---
> Thanks for the additional testing, Daniel.
> 
> What does this command show?:
> 
> grep .  /sys/devices/system/clocksource/clocksource0/*
/sys/devices/system/clocksource/clocksource0/available_clocksource:hpet acpi_pm jiffies tsc
/sys/devices/system/clocksource/clocksource0/current_clocksource:hpet

> What happens if you boot the system with
> "clocksource=acpi_pm" (try all of the alternatives shown
> in available_clocksource)
hpet - default, crashes in second resume
acpi_pm - crashes like hpet
jiffies - works (I've done about ten suspend-resume cycles)
tsc - doesn't work, kernel switches back to hpet because the tsc is
unstable

I've also tried

echo jiffies > current_clocksource

before suspending (normal setup, i.e. clocksource=hpet) as a work-around.
But that does not work, it hangs when I try to suspend the first time.

Furthermore suspend-to-disk has the same issue: In the second resume the
kernel hangs right before it switches the suspend-led to blinking mode
(indicating an ongoing suspend/resume operation). Like with suspend-to-ram
I only see the console cursor frozen in the upper-left corner. Sysrq also
doesn't work.

-Daniel
Comment 13 Daniel Vetter 2009-05-29 08:58:39 UTC
Created attachment 21622 [details]
dmesg of 2.6.30-rc7-lockdep-00082-g07f4f3e freezing after resume

Just now my system froze a few seconds after resuming. This was with clocksource=jiffies. Luckily SysRq still worked so I could capture the dmesg (relevant parts attached). Might this be related to the problem?
Comment 14 Rafael J. Wysocki 2009-06-08 11:03:24 UTC
On Monday 08 June 2009, Daniel Vetter wrote:
> On Sun, Jun 07, 2009 at 11:52:49AM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.29.  Please verify if it still should be listed and let me know
> > (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject             : 2.6.30 regression - unreliable resume - bisected - Thinkpad X40
> > Submitter   : Daniel Vetter <daniel@ffwll.ch>
> > Date                : 2009-05-11 10:08 (28 days old)
> > Handled-By  : Len Brown <len.brown@intel.com>
> 
> I've just tested v2.6.30-rc8-34-g81ee1ba and this version has the exact
> same problem. I've also tested a few versions in the rc7-rc8 timeframe.
> Something must have slightly changed because with these kernels I've
> tested, the resume-hang was way less likely. I've tried tracking down the
> changeset that introduced this new behaviour, but it was too unreliable to
> classify for a bisect run. But with v2.6.30-rc8-34-g81ee1ba the kernel
> hangs again reliably in the second resume.
Comment 15 Len Brown 2009-06-09 02:03:49 UTC
Can you reproduce the failure when you boot with "highres=off"?
How about with "idle=poll"?
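
A minimal sketch for confirming that a test boot really picked up one of these parameters before starting suspend cycles. The function name has_boot_param and the optional file argument are assumptions for illustration; /proc/cmdline is the standard place where the running kernel's command line can be read.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeed if PARAM appears as a
# whole word on the kernel command line.
# Usage: has_boot_param PARAM [CMDLINE_FILE]
has_boot_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    cmdline=
    # read fails at EOF without a trailing newline, so also accept a
    # non-empty line in that case.
    read -r cmdline < "$cmdline_file" || [ -n "$cmdline" ] || return 1
    # Pad with spaces so the first and last words match too.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "has_boot_param highres=off && echo active" prints "active" only when the kernel was booted with that option.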
Comment 16 Len Brown 2009-06-09 02:04:59 UTC
WARNING: at kernel/hrtimer.c:625 hres_timers_resume+0x2c/0x42()
...
hres_timers_resume() called with IRQs enabled!

Do you see this every time it hangs,
and does it hang every time you see this?
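
One way to answer this per boot is to save dmesg to a file after each resume and count the warning in each file. This is only a sketch: the function name count_hres_warnings and the idea of one saved log file per boot are assumptions, and the match string is the message quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print how many times the
# hres_timers_resume() warning appears in a saved dmesg file.
# Usage: count_hres_warnings LOG_FILE
count_hres_warnings() {
    log_file=$1
    [ -r "$log_file" ] || return 1
    # -F matches the string literally; -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so mask that status.
    grep -cF 'hres_timers_resume() called with IRQs enabled' "$log_file" || true
}
```

Comparing the counts against a note of which boots hung would show whether the two always coincide.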
Comment 17 Daniel Vetter 2009-06-09 08:53:27 UTC
> --- Comment #16 from Len Brown <len.brown@intel.com>  2009-06-09 02:04:59 ---
> WARNING: at kernel/hrtimer.c:625 hres_timers_resume+0x2c/0x42()
> ...
> hres_timers_resume() called with IRQs enabled!
> 
> Do you see this every time it hangs,
> and does it hang every time you see this?
If it hangs, I don't see anything at all (it hangs before the console
comes up). But when I apply the clocksource=jiffies workaround I sometimes
see this. Just this morning the system behaved strangely after a resume
(clocksource=jiffies workaround applied) - X hung up. And there was the
same backtrace in the logs. So my gut feeling tells me this backtrace
might be related to the problem I'm seeing. But I don't have further
evidence.

-Daniel
Comment 18 Daniel Vetter 2009-06-09 08:56:29 UTC
[Moving some e-mail discussion back on the bug report -Daniel]

From: "Rafael J. Wysocki" <rjw@sisk.pl>

> > > I've just tested v2.6.30-rc8-34-g81ee1ba and this version has the exact
> > > same problem. I've also tested a few versions in the rc7-rc8 timeframe.
> > > Something must have slightly changed because with these kernels I've
> > > tested, the resume-hang was way less likely. I've tried tracking down the
> > > changeset that introduced this new behaviour, but it was too unreliable to
> > > classify for a bisect run. But with v2.6.30-rc8-34-g81ee1ba the kernel
> > > hangs again reliably in the second resume.
> >
> > Well, thanks for the update.
> >
> > I wonder what we've changed recently that it makes the problem more
> > reproducible for you.  Puzzled.
>
> Actually, it was for a few kernel revisions _less_ reproducible (only hung
> after about a dozen suspend cycles). But that's nothing special: Since the
> regression was introduced there were already a few other kernels that
> almost never crashed. One of them was the reason I've preliminarily
> declared the bug fixed. But I was never able to pinpoint an exact cause
> for the change in behaviour. Most likely we have a very tight race window
> and when instructions get moved around a little bit due to totally
> unrelated changes, chances are massively lower that the kernel hangs (e.g.
> because of delay due to a cache-miss). At least that's the only consistent
> explanation I could come up with. And I always look at the patches/try to
> bisect when something changes.
>
> -Daniel
>
> PS: It's also possible that this is not really a regression, but the bug
> was just uncovered. I vaguely remember similar resume problems with this
> exact machine from a few years ago. But I can't remember any details nor
> which kernels might have been affected.

Hmm.  Can you please try to comment out suspend_device_irqs()
and resume_device_irqs() in drivers/base/power/main.c ?

Rafael
Comment 19 Daniel Vetter 2009-06-09 09:45:33 UTC
> --- Comment #15 from Len Brown <len.brown@intel.com>  2009-06-09 02:03:49 ---
> Can you reproduce the failure when you boot with "highres=off"?
> How about with "idle=poll"?

Base kernel was v2.6.30-rc8-34-g81ee1ba plus an unrelated revert. Results:

base: hung on 3rd resume
base + "highres=off": survived 10 resume-to-mem cycles
base + "idle=poll": hung on 2nd resume

I'm now using "highres=off" as a workaround to test some more. One recent
kernel (without any workaround) also survived 10 resume cycles but then
crashed after a few days of day-to-day use. I'll report back how this one
fares after a few days of use.

-Daniel
Comment 20 Daniel Vetter 2009-06-09 10:55:03 UTC
> Hmm.  Can you please try to comment out suspend_device_irqs()
> and resume_device_irqs() in drivers/base/power/main.c ?
> 
> Rafael

I've tested this against the same base kernel as the previous tests. It
hung on the 4th resume.

-Daniel
Comment 21 Daniel Vetter 2009-06-15 07:41:26 UTC
> --- Comment #19 from Daniel Vetter <daniel@ffwll.ch>  2009-06-09 09:45:33 ---
> I'm now using "highres=off" as a workaround to test some more. One recent
> kernel (without any workaround) also survived 10 resume cycles but then
> crashed after a few days of day-to-day use. I'll report back how this one
> fares after a few days of use.
I've now been using this workaround for a few days with suspend-resume
cycles under various conditions. The system never hung on resume, so this
really prevents the bug.

-Daniel
Comment 22 Daniel Vetter 2009-06-22 08:37:48 UTC
> --- Comment #21 from Daniel Vetter <daniel@ffwll.ch>  2009-06-15 07:41:26 ---
> > --- Comment #19 from Daniel Vetter <daniel@ffwll.ch>  2009-06-09 09:45:33 ---
> > I'm now using "highres=off" as a workaround to test some more. One recent
> > kernel (without any workaround) also survived 10 resume cycles but then
> > crashed after a few days of day-to-day use. I'll report back how this one
> > fares after a few days of use.
> I've now been using this workaround for a few days with suspend-resume
> cycles under various conditions. The system never hung on resume, so this
> really prevents the bug.

I stand corrected: On 2.6.30-03984-g45e3e19, the highres=off workaround
does not work anymore. Further, my laptop now hangs on the _first_ resume
and no longer only on the second or a later resume cycle.

-Daniel
Comment 23 Rafael J. Wysocki 2009-07-07 10:44:15 UTC
On Tuesday 07 July 2009, Daniel Vetter wrote:
> On Tue, Jul 07, 2009 at 02:00:35AM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.29 and 2.6.30.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.29 and 2.6.30.  Please verify if it still should
> > be listed and let me know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject             : 2.6.30 regression - hang on 2nd resume - bisected - Thinkpad X40
> > Submitter   : Daniel Vetter <daniel@ffwll.ch>
> > Date                : 2009-05-11 10:08 (57 days old)
> > Handled-By  : Len Brown <len.brown@intel.com>
> 
> I've now put two different recent kernel versions (2.6.31-rc1-00268 and
> 2.6.31-rc2) through a few days of real-world testing. The machine _never_
> hung on resume. I don't know by what means, but I'd say the problem's
> fixed and we can close this report.