Bug 14222 - Hibernation oopses for the 2nd time with 2.6.31 (won't fit the screen)
Summary: Hibernation oopses for the 2nd time with 2.6.31 (won't fit the screen)
Status: CLOSED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: john stultz
URL:
Keywords:
: 14254 (view as bug list)
Depends on:
Blocks: 7216 13615
  Show dependency tree
 
Reported: 2009-09-24 14:07 UTC by Ondrej Zary
Modified: 2009-10-08 21:10 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.31
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg (25.22 KB, text/plain)
2009-10-02 17:28 UTC, Ondrej Zary
Details
dmesg after hibernation (3.36 KB, text/plain)
2009-10-02 17:34 UTC, Ondrej Zary
Details
debug patch (1.09 KB, patch)
2009-10-02 18:58 UTC, john stultz
Details | Diff
dmesg with clocksource change messages (27.84 KB, text/plain)
2009-10-02 19:59 UTC, Ondrej Zary
Details
second debug patch (2.48 KB, patch)
2009-10-02 21:46 UTC, john stultz
Details | Diff
dmesg with modified debug 2 patch (28.21 KB, text/plain)
2009-10-03 10:22 UTC, Ondrej Zary
Details
use clocksource.list instead of mult to check if the clock is reigstered (408 bytes, patch)
2009-10-05 19:39 UTC, john stultz
Details | Diff
don't use clocksource.mult to cehck if a clock is registered (953 bytes, patch)
2009-10-05 19:58 UTC, john stultz
Details | Diff
Avoid bad pit mult check and try to re-register pit on resume (1.56 KB, patch)
2009-10-07 23:22 UTC, john stultz
Details | Diff

Description Ondrej Zary 2009-09-24 14:07:08 UTC
Hibernation worked great with 2.6.30 - I had over a month of uptime (it never 
worked so well before - always crashed after some time). But in 2.6.31, it's 
broken again - and completely. The first hibernation works but it always 
crashes on the second one.

The oops won't fit the screen, even 1280x1024 is not enough.

I bisected it and found this:

c7121843685de2bf7f3afd3ae1d6a146010bf1fc is first bad commit
commit c7121843685de2bf7f3afd3ae1d6a146010bf1fc
Author: Magnus Damm <magnus.damm@gmail.com>
Date:   Tue Jul 28 14:09:55 2009 -0700

    clocksource: Save mult_orig in clocksource_disable()

    To fix the common case where ->enable() does not set up
    mult, make sure mult_orig is saved in mult on disable.

    Also add comments to explain why we do this.

    Signed-off-by: Magnus Damm <damm@igel.co.jp>
    Cc: johnstul@us.ibm.com
    Cc: lethal@linux-sh.org
    Cc: akpm@linux-foundation.org
    LKML-Reference: <20090618152432.10136.9932.sendpatchset@rx1.opensource.se>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

:040000 040000 c965062bc79af46cdb3522fc7ab1cc81d2f84de3 26616c88a8128b25fd15c3da39ac6251cd9723b6 M      include

Reverting this commit fixes the problem.
Comment 1 Andrew Morton 2009-09-24 22:27:04 UTC
On Thu, 24 Sep 2009 14:07:09 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14222
> 
>            Summary: Hibernation oopses for the 2nd time with 2.6.31 (won't
>                     fit the screen)
>            Product: Timers
>            Version: 2.5
>     Kernel Version: 2.6.31
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>         AssignedTo: johnstul@us.ibm.com
>         ReportedBy: linux@rainbow-software.org
>         Regression: Yes

It's a 2.6.21 regression.

> 
> Hibernation worked great with 2.6.30 - I had over a month of uptime (it never 
> worked so well before - always crashed after some time). But in 2.6.31, it's 
> broken again - and completely. The first hibernation works but it always 
> crashes on the second one.
> 
> The oops won't fit the screen, even 1280x1024 is not enough.
> 
> I bisected it and found this:

Thanks for bisecting it.

> c7121843685de2bf7f3afd3ae1d6a146010bf1fc is first bad commit
> commit c7121843685de2bf7f3afd3ae1d6a146010bf1fc
> Author: Magnus Damm <magnus.damm@gmail.com>
> Date: __ Tue Jul 28 14:09:55 2009 -0700
> 
> __ __ clocksource: Save mult_orig in clocksource_disable()
> 
> __ __ To fix the common case where ->enable() does not set up
> __ __ mult, make sure mult_orig is saved in mult on disable.
> 
> __ __ Also add comments to explain why we do this.
> 
> __ __ Signed-off-by: Magnus Damm <damm@igel.co.jp>
> __ __ Cc: johnstul@us.ibm.com
> __ __ Cc: lethal@linux-sh.org
> __ __ Cc: akpm@linux-foundation.org
> __ __ LKML-Reference:
> <20090618152432.10136.9932.sendpatchset@rx1.opensource.se>
> __ __ Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> :040000 040000 c965062bc79af46cdb3522fc7ab1cc81d2f84de3
> 26616c88a8128b25fd15c3da39ac6251cd9723b6 M __ __ __include
> 
> Reverting this commit fixes the problem.

OK, I cc'ed everyone.

Guys, is reverting this patch the best approach?

It might be for 2.6.31.x but not for 2.6.32, perhaps?
Comment 2 Rafael J. Wysocki 2009-09-25 12:40:12 UTC
First-Bad-Commit : c7121843685de2bf7f3afd3ae1d6a146010bf1fc
Comment 3 john stultz 2009-09-28 18:45:38 UTC
(In reply to comment #0)
> Hibernation worked great with 2.6.30 - I had over a month of uptime (it never 
> worked so well before - always crashed after some time). But in 2.6.31, it's 
> broken again - and completely. The first hibernation works but it always 
> crashes on the second one.
> 
> The oops won't fit the screen, even 1280x1024 is not enough.

Thanks for bisecting this down! Can you provide a little bit of information about your box, so I can try to reproduce the issue?

dmesg
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Thanks
-john
Comment 4 john stultz 2009-09-28 18:50:53 UTC
Andrew: Reverting it might be the best approach for 2.6.31.x, although the patch was needed to fix the mult breakage on resume that Magnus earlier patch caused. Granted, that breakage isn't as bad as borking suspend, so reverting it would be the lesser evil.

The code has been totally reworked by Martin Schwidefsky in 2.6.32, so I don't believe there should be an issue there.
Comment 5 john stultz 2009-10-01 23:21:59 UTC
Ondrej: Just wanted to follow up here again. Can you provide the following output?

dmesg
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

thanks
-john
Comment 6 john stultz 2009-10-01 23:26:05 UTC
Also, are you running ntp during the suspend/resume cycle? If so, does disabling ntp change the behavior?
Comment 7 john stultz 2009-10-02 01:45:46 UTC
Looking more at the change i'm still baffled at how the simple one-liner in a code path that we only call when we're throwing out a clocksource (only reason we call clocksource_disable()) could be causing problems.

The change was made so that if you switch away from a clocksource you could switch back to it and NTP wouldn't be confused. But I don't see anything in the suspend/resume path that would cause you to go back to using a disabled clocksource.
Comment 8 Ondrej Zary 2009-10-02 11:15:44 UTC
Maybe it's related to another bug - TSC is marked unstable on hibernation on this machine: http://lkml.org/lkml/fancy/2009/2/20/428
Comment 9 Ondrej Zary 2009-10-02 17:28:39 UTC
Created attachment 23231 [details]
dmesg

dmesg attached

This is before hibernation:
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
pit jiffies tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
pit

NTP is running.
Comment 10 Ondrej Zary 2009-10-02 17:34:57 UTC
Created attachment 23232 [details]
dmesg after hibernation

And it gets interesting after resume from hibernation:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
jiffies tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
jiffies
Comment 11 Rafael J. Wysocki 2009-10-02 17:50:10 UTC
*** Bug 14254 has been marked as a duplicate of this bug. ***
Comment 12 john stultz 2009-10-02 18:58:03 UTC
Created attachment 23235 [details]
debug patch

Thanks for the extra info. I've not seen a box that was still using the pit clocksource in awhile, esp ones with ACPI Cstates but no ACPI PM. Odd, but not the issue here.

I'm still a little baffled, as I don't see why we would switch away from the PIT clocksource to jiffies on resume. Further, its odd the pit clocksource drops off the available list as well.

Can you apply this patch and then again provide the boot/resume dmesges and available/current_clocksource data?
Comment 13 Ondrej Zary 2009-10-02 19:59:34 UTC
Created attachment 23236 [details]
dmesg with clocksource change messages

Got warnings about %lu and u32 so changed them to %u. Attached dmesg shows two clocksource changes during boot (jiffies->TSC->PIT) and one during resume (if I understand it correctly): PIT->jiffies. Available clocksources are still the same.
Comment 14 john stultz 2009-10-02 21:46:34 UTC
Created attachment 23238 [details]
second debug patch

Hrmm.. Here's another debug patch to try to show more about whats happening.
Comment 15 Ondrej Zary 2009-10-03 10:22:06 UTC
Created attachment 23249 [details]
dmesg with modified debug 2 patch

With this patch, it oopses after "echo disk >/sys/power/state". Splitted each of the new printk lines into two with NULL checks and got this dmesg. I see only one NULL message on resume, but there was probably at least one more during hibernation which got lost.
Comment 16 john stultz 2009-10-05 19:34:05 UTC
Yuck. Ok. So we are disqualifying the PIT on suspend, which I wasn't expecting.

ACPI: Preparing to enter system sleep state S4
PM: Saving platform NVS memory
clocksource_unregister(pit)
clocksource_unregister: next: jiffies
PM: Creating hibernation image: 
PM: Need to copy 19506 pages
PM: Restoring platform NVS memory
Changing clocksource from pit to jiffies

Now, trying to connect the patch that broke things and this issue, I'm guessing the check against pit_cs.mult in pit_disable_clocksource() got messed up since we probably unregister the clocksource, then disable it. But I'm honestly a little suspicious of the code, since we're unregistering and manipulating a clocksource that may be actively in use.

I've got a simple change to try that I suspect resolves the issue, but we may have a bit more work to do to really fix this code up.
Comment 17 john stultz 2009-10-05 19:39:37 UTC
Created attachment 23270 [details]
use clocksource.list instead of mult to check if the clock is reigstered

This patch simply changes the registered check in pit_disable_clocksource() to use the .list entry instead of .mult. 

I'm suspecting the issue in this bug is occurring because we unregister the clocksource and set mult to 0 (this is esp bad btw, since the clocksource may be in active use, so manipulating mult directly can cause all sorts of issues before we actually switch clocksources), however when we suspend, we set mult to mult_orig, which overwrites the null. Then, we likely hit pit_disable_clocksource() again, and try to unregister and already unregistered clocksource, and one of the list manipulations probably goes bust.

So using the list entry instead is safer, because none of the timekeeping code other then register and unregister should manipulate it, esp after its unregistered.
Comment 18 john stultz 2009-10-05 19:41:21 UTC
Ah, crud, that doesn't build. Let me take another shot at it.
Comment 19 john stultz 2009-10-05 19:58:09 UTC
Created attachment 23271 [details]
don't use clocksource.mult to cehck if a clock is registered

This patch simply changes the registered check in pit_disable_clocksource() to
a dedicated flag value instead of .mult. 

I'm suspecting the issue in this bug is occurring because we unregister the
clocksource and set mult to 0 (this is esp bad btw, since the clocksource may
be in active use, so manipulating mult directly can cause all sorts of issues
before we actually switch clocksources), however when we suspend, we set mult
to mult_orig, which overwrites the null. Then, we likely hit
pit_disable_clocksource() again, and try to unregister and already unregistered
clocksource, and one of the list manipulations probably goes bust.

My previous attempt was flawed as its a actual list head, not pointer to a list head, so it can't be simply set to null for the test.

So this patch builds. :)
Comment 20 Ondrej Zary 2009-10-07 17:13:06 UTC
This patch not only builds, it also fixes the problem. It does not crash anymore.

It looks like there's another bug here - that disappearing PIT after first hibernation.
Comment 21 john stultz 2009-10-07 17:42:51 UTC
(In reply to comment #20)
> This patch not only builds, it also fixes the problem. It does not crash
> anymore.

Good! At we know whats going on now!

> It looks like there's another bug here - that disappearing PIT after first
> hibernation.

Yes, there is no code to re-register the PIT after we remove it. So you'll likely be stuck with jiffies after that. I'll try to take a swing at that later today if you're still interested in testing things.
Comment 22 Ondrej Zary 2009-10-07 18:15:15 UTC
Yes, I'm interested in testing.

I wonder why the PIT is unregistered at all - if it's a bug or a feature. If it's a bug, there's no need for the re-registration code.
Comment 23 john stultz 2009-10-07 23:22:04 UTC
Created attachment 23307 [details]
Avoid bad pit mult check and try to re-register pit on resume

Ok. So this patch should avoid the hang and try to re-register the PIT so its available on resume. Let me know if it works or you run into any other issues with it.
Comment 24 Ondrej Zary 2009-10-08 18:18:39 UTC
Thanks, it works fine with this patch. No more crashes and PIT does not disappear anymore.
Comment 25 john stultz 2009-10-08 18:27:32 UTC
Good to hear! Thanks for for the great bug report and all the testing! 
I'll send this off to lkml today.
Comment 26 john stultz 2009-10-08 18:39:19 UTC
Huh. So it looks like this might have already been fixed upstream. 

See commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8cab02dc3c58a12235c6d463ce684dded9696848

Could I ask one more favor, and have you try the suspend/resume issue against 2.6.32-rc3? If that's all good, then I might push my patch to 2.6.31-stable but otherwise I think this is fixed.
Comment 27 Ondrej Zary 2009-10-08 20:09:03 UTC
You're right, it's already fixed. 2.6.23-rc3 works fine - no crashes, PIT does not disappear.
Comment 28 john stultz 2009-10-08 20:32:30 UTC
Ok, patch sent to lkml for 31-stable. Marking this resolved.

Note You need to log in before you can comment on or make changes to this bug.