Hibernation worked great with 2.6.30 - I had over a month of uptime (it never worked so well before - always crashed after some time). But in 2.6.31, it's broken again - and completely. The first hibernation works but it always crashes on the second one. The oops won't fit the screen, even 1280x1024 is not enough. I bisected it and found this: c7121843685de2bf7f3afd3ae1d6a146010bf1fc is first bad commit commit c7121843685de2bf7f3afd3ae1d6a146010bf1fc Author: Magnus Damm <magnus.damm@gmail.com> Date: Tue Jul 28 14:09:55 2009 -0700 clocksource: Save mult_orig in clocksource_disable() To fix the common case where ->enable() does not set up mult, make sure mult_orig is saved in mult on disable. Also add comments to explain why we do this. Signed-off-by: Magnus Damm <damm@igel.co.jp> Cc: johnstul@us.ibm.com Cc: lethal@linux-sh.org Cc: akpm@linux-foundation.org LKML-Reference: <20090618152432.10136.9932.sendpatchset@rx1.opensource.se> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> :040000 040000 c965062bc79af46cdb3522fc7ab1cc81d2f84de3 26616c88a8128b25fd15c3da39ac6251cd9723b6 M include Reverting this commit fixes the problem.
On Thu, 24 Sep 2009 14:07:09 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=14222 > > Summary: Hibernation oopses for the 2nd time with 2.6.31 (won't > fit the screen) > Product: Timers > Version: 2.5 > Kernel Version: 2.6.31 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > AssignedTo: johnstul@us.ibm.com > ReportedBy: linux@rainbow-software.org > Regression: Yes It's a 2.6.21 regression. > > Hibernation worked great with 2.6.30 - I had over a month of uptime (it never > worked so well before - always crashed after some time). But in 2.6.31, it's > broken again - and completely. The first hibernation works but it always > crashes on the second one. > > The oops won't fit the screen, even 1280x1024 is not enough. > > I bisected it and found this: Thanks for bisecting it. > c7121843685de2bf7f3afd3ae1d6a146010bf1fc is first bad commit > commit c7121843685de2bf7f3afd3ae1d6a146010bf1fc > Author: Magnus Damm <magnus.damm@gmail.com> > Date: __ Tue Jul 28 14:09:55 2009 -0700 > > __ __ clocksource: Save mult_orig in clocksource_disable() > > __ __ To fix the common case where ->enable() does not set up > __ __ mult, make sure mult_orig is saved in mult on disable. > > __ __ Also add comments to explain why we do this. > > __ __ Signed-off-by: Magnus Damm <damm@igel.co.jp> > __ __ Cc: johnstul@us.ibm.com > __ __ Cc: lethal@linux-sh.org > __ __ Cc: akpm@linux-foundation.org > __ __ LKML-Reference: > <20090618152432.10136.9932.sendpatchset@rx1.opensource.se> > __ __ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > > :040000 040000 c965062bc79af46cdb3522fc7ab1cc81d2f84de3 > 26616c88a8128b25fd15c3da39ac6251cd9723b6 M __ __ __include > > Reverting this commit fixes the problem. OK, I cc'ed everyone. Guys, is reverting this patch the best approach? It might be for 2.6.31.x but not for 2.6.32, perhaps?
First-Bad-Commit : c7121843685de2bf7f3afd3ae1d6a146010bf1fc
(In reply to comment #0) > Hibernation worked great with 2.6.30 - I had over a month of uptime (it never > worked so well before - always crashed after some time). But in 2.6.31, it's > broken again - and completely. The first hibernation works but it always > crashes on the second one. > > The oops won't fit the screen, even 1280x1024 is not enough. Thanks for bisecting this down! Can you provide a little bit of information about your box, so I can try to reproduce the issue? dmesg cat /sys/devices/system/clocksource/clocksource0/available_clocksource cat /sys/devices/system/clocksource/clocksource0/current_clocksource Thanks -john
Andrew: Reverting it might be the best approach for 2.6.31.x, although the patch was needed to fix the mult breakage on resume that Magnus earlier patch caused. Granted, that breakage isn't as bad as borking suspend, so reverting it would be the lesser evil. The code has been totally reworked by Martin Schwidefsky in 2.6.32, so I don't believe there should be an issue there.
Ondrej: Just wanted to follow up here again. Can you provide the following output? dmesg cat /sys/devices/system/clocksource/clocksource0/available_clocksource cat /sys/devices/system/clocksource/clocksource0/current_clocksource thanks -john
Also, are you running ntp during the suspend/resume cycle? If so, does disabling ntp change the behavior?
Looking more at the change i'm still baffled at how the simple one-liner in a code path that we only call when we're throwing out a clocksource (only reason we call clocksource_disable()) could be causing problems. The change was made so that if you switch away from a clocksource you could switch back to it and NTP wouldn't be confused. But I don't see anything in the suspend/resume path that would cause you to go back to using a disabled clocksource.
Maybe it's related to another bug - TSC is marked unstable on hibernation on this machine: http://lkml.org/lkml/fancy/2009/2/20/428
Created attachment 23231 [details] dmesg dmesg attached This is before hibernation: $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource pit jiffies tsc $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource pit NTP is running.
Created attachment 23232 [details] dmesg after hibernation And it gets interesting after resume from hibernation: $ cat /sys/devices/system/clocksource/clocksource0/available_clocksource jiffies tsc $ cat /sys/devices/system/clocksource/clocksource0/current_clocksource jiffies
*** Bug 14254 has been marked as a duplicate of this bug. ***
Created attachment 23235 [details] debug patch Thanks for the extra info. I've not seen a box that was still using the pit clocksource in awhile, esp ones with ACPI Cstates but no ACPI PM. Odd, but not the issue here. I'm still a little baffled, as I don't see why we would switch away from the PIT clocksource to jiffies on resume. Further, its odd the pit clocksource drops off the available list as well. Can you apply this patch and then again provide the boot/resume dmesges and available/current_clocksource data?
Created attachment 23236 [details] dmesg with clocksource change messages Got warnings about %lu and u32 so changed them to %u. Attached dmesg shows two clocksource changes during boot (jiffies->TSC->PIT) and one during resume (if I understand it correctly): PIT->jiffies. Available clocksources are still the same.
Created attachment 23238 [details] second debug patch Hrmm.. Here's another debug patch to try to show more about whats happening.
Created attachment 23249 [details] dmesg with modified debug 2 patch With this patch, it oopses after "echo disk >/sys/power/state". Splitted each of the new printk lines into two with NULL checks and got this dmesg. I see only one NULL message on resume, but there was probably at least one more during hibernation which got lost.
Yuck. Ok. So we are disqualifying the PIT on suspend, which I wasn't expecting. ACPI: Preparing to enter system sleep state S4 PM: Saving platform NVS memory clocksource_unregister(pit) clocksource_unregister: next: jiffies PM: Creating hibernation image: PM: Need to copy 19506 pages PM: Restoring platform NVS memory Changing clocksource from pit to jiffies Now, trying to connect the patch that broke things and this issue, I'm guessing the check against pit_cs.mult in pit_disable_clocksource() got messed up since we probably unregister the clocksource, then disable it. But I'm honestly a little suspicious of the code, since we're unregistering and manipulating a clocksource that may be actively in use. I've got a simple change to try that I suspect resolves the issue, but we may have a bit more work to do to really fix this code up.
Created attachment 23270 [details] use clocksource.list instead of mult to check if the clock is reigstered This patch simply changes the registered check in pit_disable_clocksource() to use the .list entry instead of .mult. I'm suspecting the issue in this bug is occurring because we unregister the clocksource and set mult to 0 (this is esp bad btw, since the clocksource may be in active use, so manipulating mult directly can cause all sorts of issues before we actually switch clocksources), however when we suspend, we set mult to mult_orig, which overwrites the null. Then, we likely hit pit_disable_clocksource() again, and try to unregister and already unregistered clocksource, and one of the list manipulations probably goes bust. So using the list entry instead is safer, because none of the timekeeping code other then register and unregister should manipulate it, esp after its unregistered.
Ah, crud, that doesn't build. Let me take another shot at it.
Created attachment 23271 [details] don't use clocksource.mult to cehck if a clock is registered This patch simply changes the registered check in pit_disable_clocksource() to a dedicated flag value instead of .mult. I'm suspecting the issue in this bug is occurring because we unregister the clocksource and set mult to 0 (this is esp bad btw, since the clocksource may be in active use, so manipulating mult directly can cause all sorts of issues before we actually switch clocksources), however when we suspend, we set mult to mult_orig, which overwrites the null. Then, we likely hit pit_disable_clocksource() again, and try to unregister and already unregistered clocksource, and one of the list manipulations probably goes bust. My previous attempt was flawed as its a actual list head, not pointer to a list head, so it can't be simply set to null for the test. So this patch builds. :)
This patch not only builds, it also fixes the problem. It does not crash anymore. It looks like there's another bug here - that disappearing PIT after first hibernation.
(In reply to comment #20) > This patch not only builds, it also fixes the problem. It does not crash > anymore. Good! At we know whats going on now! > It looks like there's another bug here - that disappearing PIT after first > hibernation. Yes, there is no code to re-register the PIT after we remove it. So you'll likely be stuck with jiffies after that. I'll try to take a swing at that later today if you're still interested in testing things.
Yes, I'm interested in testing. I wonder why the PIT is unregistered at all - if it's a bug or a feature. If it's a bug, there's no need for the re-registration code.
Created attachment 23307 [details] Avoid bad pit mult check and try to re-register pit on resume Ok. So this patch should avoid the hang and try to re-register the PIT so its available on resume. Let me know if it works or you run into any other issues with it.
Thanks, it works fine with this patch. No more crashes and PIT does not disappear anymore.
Good to hear! Thanks for for the great bug report and all the testing! I'll send this off to lkml today.
Huh. So it looks like this might have already been fixed upstream. See commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8cab02dc3c58a12235c6d463ce684dded9696848 Could I ask one more favor, and have you try the suspend/resume issue against 2.6.32-rc3? If that's all good, then I might push my patch to 2.6.31-stable but otherwise I think this is fixed.
You're right, it's already fixed. 2.6.23-rc3 works fine - no crashes, PIT does not disappear.
Ok, patch sent to lkml for 31-stable. Marking this resolved.