Bug 13277
| Summary: | 2.6.30 regression - hang on 2nd resume - bisected - Thinkpad X40 | | |
|---|---|---|---|
| Product: | ACPI | Reporter: | Daniel Vetter (daniel) |
| Component: | Power-Sleep-Wake | Assignee: | Len Brown (lenb) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | alex.shi, lenb, rjw, yakui.zhao |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.30-rc2 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 13070 | | |

Attachments:
dmesg of 2.6.30-rc2 with ff69.. reverted
dmesg of 2.6.30-rc6-lockdep-00043-g22ef37e
dmesg of 2.6.30-rc7-lockdep-00082-g07f4f3e freezing after resume

Description
Daniel Vetter
2009-05-11 10:08:08 UTC
Please test this patch:
http://patchwork.kernel.org/patch/22499/
(ACPI: suspend: restore BM_RLD on resume)

Hi, Daniel
Will you please try the patch in comment #1 and see whether the issue still exists?
If it still exists, will you please enable "CONFIG_PM_DEBUG" in the kernel configuration and do the following test?
a. echo core > /sys/power/pm_test
b. echo mem > /sys/power/state
It would be great if you could do several suspend/resume cycles.
Thanks.

On Tue, May 12, 2009 at 01:51:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #1 from Len Brown <len.brown@intel.com> 2009-05-12 01:51:02 ---
> Please test this patch:
> http://patchwork.kernel.org/patch/22499/
> (ACPI: suspend: restore BM_RLD on resume)
I've done about a dozen suspend/resume cycles with this patch applied. It didn't hang a single time and works rock solid.
Thanks for the speedy fix, Daniel
[I'll close the bug report as soon as your patch hits mainline and I've retested]

Handled-By : Len Brown <len.brown@intel.com>
Patch : http://patchwork.kernel.org/patch/22499/

*** This bug has been marked as a duplicate of bug 13032 ***

Looks like my problem is not really fixed yet and the patch only papered over the real issue. I've tested several kernels since v2.6.30-rc4-289-g815ab0f (this is the patch I tested first, as merged in mainline) and all were broken. I'll now test the suggestion in comment #2.

Created attachment 21438 [details]
dmesg of 2.6.30-rc6-lockdep-00043-g22ef37e
The attached dmesg contains 5 runs of
# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
The machine always resumed perfectly.
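For reference, the two-command test sequence above can be wrapped in a small loop. This is only a sketch and not a script from the report: the function name `run_pm_test_cycles` and the `SYSFS_POWER` override are hypothetical, and it assumes a kernel built with CONFIG_PM_DEBUG so that /sys/power/pm_test exists.

```shell
#!/bin/sh
# Sketch: repeat the pm_test suspend test a given number of times.
# SYSFS_POWER is a hypothetical override for dry runs against a scratch
# directory; when unset, the real /sys/power is used.

run_pm_test_cycles() {
    level=$1     # pm_test level: core, processors, platform, devices or freezer
    cycles=$2    # number of suspend/resume cycles to attempt
    power_dir=${SYSFS_POWER:-/sys/power}

    i=1
    while [ "$i" -le "$cycles" ]; do
        # Select how far the suspend sequence goes before it is unwound.
        echo "$level" > "$power_dir/pm_test" || return 1
        # Start suspend-to-RAM; in test mode the kernel is expected to
        # back out of the suspend on its own after a short delay.
        echo mem > "$power_dir/state" || return 1
        echo "cycle $i ($level) completed"
        i=$((i + 1))
    done
}
```

On the affected machine this would be run as root, for example `run_pm_test_cycles core 5`. Pointing `SYSFS_POWER` at a scratch directory gives a dry run that never suspends anything.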
I've also tested the other options for /sys/power/pm_test (processors platform devices freezer). They all seem to work for multiple consecutive runs on 2.6.30-rc6-lockdep-00043-g22ef37e.

On Tuesday 26 May 2009, Daniel Vetter wrote:
> On Sun, May 24, 2009 at 09:11:52PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.29. Please verify if it still should be listed and let me know
> > (either way).
> I've just tested 2.6.30-rc7 and the problem still exists.
>
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject : Thinkpad X40 no longer resumes reliable since
> ff69f2bba67bd45514923aaedbf40fe351787c59
> > Submitter : Daniel Vetter <daniel@ffwll.ch>
> > Date : 2009-05-11 10:08 (14 days old)
> > Handled-By : Len Brown <len.brown@intel.com>
> > Patch : http://patchwork.kernel.org/patch/22499/
> This is not correct. This patch was merged, but does not fix the problem.
> It worked while I tested it separately, but obviously it only papered over
> the real issue in this very specific kernel.
Ignore-Patch : http://patchwork.kernel.org/patch/22499/

Thanks for the additional testing, Daniel.
What does this command show?:
grep . /sys/devices/system/clocksource/clocksource0/*
What happens if you boot the system with "clocksource=acpi_pm" (try all of the alternatives shown in available_clocksource)

On Wed, May 27, 2009 at 02:06:36AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #11 from Len Brown <len.brown@intel.com> 2009-05-27 02:06:35 ---
> Thanks for the additional testing, Daniel.
>
> What does this command show?:
>
> grep . /sys/devices/system/clocksource/clocksource0/*
/sys/devices/system/clocksource/clocksource0/available_clocksource:hpet acpi_pm jiffies tsc
/sys/devices/system/clocksource/clocksource0/current_clocksource:hpet
> What happens if you boot the system with
> "clocksource=acpi_pm" (try all of the alternatives shown
> in available_clocksource)
hpet - default, crashes in second resume
acpi_pm - crashes like hpet
jiffies - works (I've done about ten suspend-resume cycles)
tsc - doesn't work, kernel switches back to hpet because the tsc is unstable
I've also tried
echo jiffies > current_clocksource
before suspending (normal setup, i.e. clocksource=hpet) as a work-around. But that does not work; it hangs when I try to suspend the first time.
Furthermore, suspend-to-disk has the same issue: in the second resume the kernel hangs right before it switches the suspend LED to blinking mode (indicating an ongoing suspend/resume operation). As with suspend-to-ram, I only see the console cursor frozen in the upper-left corner. SysRq also doesn't work.
-Daniel

Created attachment 21622 [details]
dmesg of 2.6.30-rc7-lockdep-00082-g07f4f3e freezing after resume
Just now my system froze a few seconds after resuming. This was with clocksource=jiffies. Luckily SysRq still worked, so I could capture the dmesg (relevant parts attached). Might this be related to the problem?
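The clocksource checks discussed in the preceding comments can be sketched as two small helpers. They are illustrative only: the function names and the `CLOCKSOURCE_DIR` override are hypothetical, while the sysfs file names are the ones shown in the grep output in this report.

```shell
#!/bin/sh
# Sketch: inspect the clocksource sysfs files and list the boot
# parameters worth testing. CLOCKSOURCE_DIR is a hypothetical override
# for dry runs; when unset, the real sysfs directory is used.

show_clocksources() {
    cs_dir=${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}
    echo "available: $(cat "$cs_dir/available_clocksource")"
    echo "current: $(cat "$cs_dir/current_clocksource")"
}

# Print one "clocksource=<name>" boot parameter per available clocksource,
# so that each alternative can be tried on a separate boot.
list_clocksource_boot_params() {
    cs_dir=${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}
    for cs in $(cat "$cs_dir/available_clocksource"); do
        echo "clocksource=$cs"
    done
}
```

Each printed parameter would be appended to the kernel command line for one boot, followed by a few suspend/resume cycles to see whether the hang reproduces with that clocksource.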
On Monday 08 June 2009, Daniel Vetter wrote:
> On Sun, Jun 07, 2009 at 11:52:49AM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.29. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject : 2.6.30 regression - unreliable resume - bisected -
> Thinkpad X40
> > Submitter : Daniel Vetter <daniel@ffwll.ch>
> > Date : 2009-05-11 10:08 (28 days old)
> > Handled-By : Len Brown <len.brown@intel.com>
>
> I've just tested v2.6.30-rc8-34-g81ee1ba and this version has the exact
> same problem. I've also tested a few versions in the rc7-rc8 timeframe.
> Something must have slightly changed because with these kernels I've
> tested, the resume-hang was way less likely. I've tried tracking down the
> changeset that introduced this new behaviour, but it was too unreliable to
> classify for a bisect run. But with v2.6.30-rc8-34-g81ee1ba the kernel
> hangs again reliably in the second resume.
Can you reproduce the failure when you boot with "highres=off"?
How about with "idle=poll"?

WARNING: at kernel/hrtimer.c:625 hres_timers_resume+0x2c/0x42()
...
hres_timers_resume() called with IRQs enabled!

Do you see this every time it hangs, and does it hang every time you see this?

> --- Comment #16 from Len Brown <len.brown@intel.com> 2009-06-09 02:04:59 ---
> WARNING: at kernel/hrtimer.c:625 hres_timers_resume+0x2c/0x42()
> ...
> hres_timers_resume() called with IRQs enabled!
>
> Do you see this every time it hangs,
> and does it hang every time you see this?
If it hangs, I don't see anything at all (it hangs before the console
comes up). But when I apply the clocksource=jiffies workaround I sometimes
see this. Just this morning the system behaved strangely after a resume
(clocksource=jiffies workaround applied) - X hung up. And there was the
same backtrace in the logs. So my gut feeling tells me this backtrace
might be related to the problem I'm seeing. But I don't have further
evidence.
-Daniel
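One way to check how often the warning quoted above shows up is to count it in a saved kernel log after each resume. The helper below is a sketch under that assumption; the function name `count_hres_warnings` is hypothetical, and the message text is copied from the warning in this report.

```shell
#!/bin/sh
# Sketch: count occurrences of the hres_timers_resume warning in a
# kernel log read from standard input.

count_hres_warnings() {
    # -c prints the number of matching lines, -F treats the pattern as a
    # fixed string so the parentheses are matched literally.
    # grep exits non-zero when nothing matches; the count (0) is still
    # printed, so that exit status is ignored here.
    grep -cF 'hres_timers_resume() called with IRQs enabled' || true
}
```

Typical use would be `dmesg | count_hres_warnings` after a resume, or feeding it one of the attached dmesg files, to compare against whether that cycle hung or misbehaved.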
[Moving some e-mail discussion back on the bug report -Daniel]

From: "Rafael J. Wysocki" <rjw@sisk.pl>
> > > I've just tested v2.6.30-rc8-34-g81ee1ba and this version has the exact
> > > same problem. I've also tested a few versions in the rc7-rc8 timeframe.
> > > Something must have slightly changed because with these kernels I've
> > > tested, the resume-hang was way less likely. I've tried tracking down the
> > > changeset that introduced this new behaviour, but it was too unreliable to
> > > classify for a bisect run. But with v2.6.30-rc8-34-g81ee1ba the kernel
> > > hangs again reliably in the second resume.
> >
> > Well, thanks for the update.
> >
> > I wonder what we've changed recently that it makes the problem more
> > reproducible for you. Puzzled.
>
> Actually, it was for a few kernel revisions _less_ reproducible (only hung
> after about a dozen suspend cycles). But that's nothing special: Since the
> regression was introduced there were already a few other kernels that
> almost never crashed. One of them was the reason I've preliminarily
> declared the bug fixed. But I was never able to pinpoint an exact cause
> for the change in behaviour. Most likely we have a very tight race window
> and when instructions get moved around a little bit due to totally
> unrelated changes, chances are massively lower that the kernel hangs (e.g.
> because of delay due to a cache-miss). At least that's the only consistent
> explanation I could come up with. And I always look at the patches/try to
> bisect when something changes.
>
> -Daniel
>
> PS: It's also possible that this is not really a regression, but the bug
> was just uncovered. I vaguely remember similar resume problems with this
> exact machine from a few years ago. But I can't remember any details nor
> which kernels might have been affected.

Hmm. Can you please try to comment out suspend_device_irqs() and resume_device_irqs() in drivers/base/power/main.c?

Rafael

> --- Comment #15 from Len Brown <len.brown@intel.com> 2009-06-09 02:03:49 ---
> Can you reproduce the failure when you boot with "highres=off"?
> How about with "idle=poll"?
Base kernel was v2.6.30-rc8-34-g81ee1ba plus an unrelated revert. Results:
base: hung on 3rd resume
base + "highres=off": survived 10 resume-to-mem cycles
base + "idle=poll": hung on 2nd resume
I'm now using "highres=off" as a workaround to test some more. One recent
kernel (without any workaround) also survived 10 resume cycles but then
crashed after a few days of day-to-day use. I'll report back how this one
fares after a few days of use.
-Daniel
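When comparing runs like the ones above, it helps to confirm which workaround parameter the running kernel was actually booted with. The following is a minimal sketch; the function name `has_boot_param` and its optional file argument are hypothetical, and it assumes the parameters are separated by spaces as in /proc/cmdline.

```shell
#!/bin/sh
# Sketch: check whether a given parameter is present on the kernel
# command line. The second argument is a hypothetical override for the
# file to read; it defaults to /proc/cmdline.

has_boot_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # Put each space-separated word on its own line, then look for an
    # exact, whole-line, fixed-string match.
    tr -s ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example, `has_boot_param highres=off && echo "highres=off is active"` could be recorded next to each test result so that the base, "highres=off" and "idle=poll" runs are not mixed up.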
> Hmm. Can you please try to comment out suspend_device_irqs()
> and resume_device_irqs() in drivers/base/power/main.c ?
>
> Rafael
I've tested this against the same base kernel as the previous tests. It
hung on the 4th resume.
-Daniel
> --- Comment #19 from Daniel Vetter <daniel@ffwll.ch> 2009-06-09 09:45:33 ---
> I'm now using "highres=off" as a workaround to test some more. One recent
> kernel (without any workaround) also survived 10 resume cycles but then
> crashed after a few days of day-to-day use. I'll report back how this one
> fares after a few days of use.
I've now been using this workaround for a few days with suspend-resume
cycles under various conditions. The system never hung on resume, so this
really prevents the bug.
-Daniel
> --- Comment #21 from Daniel Vetter <daniel@ffwll.ch> 2009-06-15 07:41:26 ---
> > --- Comment #19 from Daniel Vetter <daniel@ffwll.ch> 2009-06-09 09:45:33
> ---
> > I'm now using "highres=off" as a workaround to test some more. One recent
> > kernel (without any workaround) also survived 10 resume cycles but then
> > crashed after a few days of day-to-day use. I'll report back how this one
> > fares after a few days of use.
> I've now been using this workaround for a few days with suspend-resume
> cycles under various conditions. The system never hung on resume, so this
> really prevents the bug.
I stand corrected: On 2.6.30-03984-g45e3e19, the highres=off workaround
does not work anymore. Further, my laptop now hangs on the _first_ resume
and no longer only on the second or a later resume cycle.
-Daniel
On Tuesday 07 July 2009, Daniel Vetter wrote:
> On Tue, Jul 07, 2009 at 02:00:35AM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.29 and 2.6.30.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.29 and 2.6.30. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13277
> > Subject : 2.6.30 regression - hang on 2nd resume - bisected -
> Thinkpad X40
> > Submitter : Daniel Vetter <daniel@ffwll.ch>
> > Date : 2009-05-11 10:08 (57 days old)
> > Handled-By : Len Brown <len.brown@intel.com>
>
> I've now put two different recent kernel versions (2.6.31-rc1-00268 and
> 2.6.31-rc2) through a few days of real-world testing. The machine _never_
> hung on resume. So by whatever means I don't know but I'd say the problem's
> fixed and we can close this report.