Kernel Bug Tracker – Bug 11568
spontaneous reboot on resume with 2.6.27
Last modified: 2008-10-22 22:22:16 UTC
Latest working kernel version: 2.6.26.x
Earliest failing kernel version: 2.6.27-rc2 (haven't tested earlier versions)
Distribution: Debian unstable
Hardware Environment: MSI AMD Socket 754 with VIA K8T800 and AMI BIOS
Problem Description: Resume from S3 causes machine to reset itself.
Steps to reproduce:
Enter S3 with "echo mem > /sys/power/state"
Try to resume.
Created attachment 17781 [details]
dmesg from 2.6.27-rc2
Noticed this bug while investigating bug #11237. 32 bit kernel doesn't reboot on resume.
I'll find some time to compile and test 2.6.27-rc1 to see if it is the same.
Please also try:
# echo core > /sys/power/pm_test
(requires CONFIG_PM_DEBUG to be set; without it /sys/power/pm_test is not present)
# echo mem > /sys/power/state
and see if that also causes a reboot to happen.
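A minimal sketch of the two steps above as a single helper, for anyone repeating the test. The function name pm_test_suspend and its optional directory argument are made up for illustration; the argument exists only so the function can be dry-run against a scratch directory instead of the real /sys/power.

```shell
# Sketch: run one suspend test cycle at the "core" pm_test level.
# pm_test_suspend is a hypothetical helper name, not an existing tool.
pm_test_suspend() {
    power_dir="${1:-/sys/power}"

    # pm_test only exists when the kernel was built with CONFIG_PM_DEBUG.
    if [ ! -e "$power_dir/pm_test" ]; then
        echo "pm_test not present; is CONFIG_PM_DEBUG set?" >&2
        return 1
    fi

    # Select the test level, then start the suspend sequence.
    echo core > "$power_dir/pm_test" || return 1
    echo mem > "$power_dir/state"
}
```

Run as root with no argument, it writes to the real sysfs files. If the box comes back without resetting, the reboot is probably not triggered by the part of the suspend/resume path that the test cycle exercises.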
Created attachment 17795 [details]
dmesg of pm_test 2.6.27-rc6
The machine does not reset with that set. Here is the dmesg.
We'll need to check at which points it reboots. I'll send you some debug patches for that in the next couple of days.
On Tuesday, 23 of September 2008, Andy Wettstein wrote:
> On Sun, Sep 21, 2008 at 08:54:21PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > The following bug entry is on the current list of known regressions
> > from 2.6.26. Please verify if it still should be listed and let me know
> > (either way).
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11568
> > Subject : spontaneous reboot on resume with 2.6.27
> > Submitter : Andy Wettstein <email@example.com>
> > Date : 2008-09-14 20:00 (8 days old)
> Just verified it is still a problem with 2.6.27-rc7.
Started a bisect. But I get stuck with a kernel panic. Here is my bisect log:
# good: [bce7f793daec3e65ec5c5705d2457b81fe7b5725] Linux 2.6.26
git-bisect good bce7f793daec3e65ec5c5705d2457b81fe7b5725
# bad: [6e86841d05f371b5b9b86ce76c02aaee83352298] Linux 2.6.27-rc1
git-bisect bad 6e86841d05f371b5b9b86ce76c02aaee83352298
# bad: [d20b27478d6ccf7c4c8de4f09db2bdbaec82a6c0] V4L/DVB (8415): gspca: Infinite loop in i2c_w() of etoms.
git-bisect bad d20b27478d6ccf7c4c8de4f09db2bdbaec82a6c0
# bad: [666484f0250db2e016948d63b3ef33e202e3b8d0] Merge branch 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git-bisect bad 666484f0250db2e016948d63b3ef33e202e3b8d0
# bad: [d59fdcf2ac501de99c3dfb452af5e254d4342886] Merge commit 'v2.6.26' into x86/core
git-bisect bad d59fdcf2ac501de99c3dfb452af5e254d4342886
# good: [3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce] Merge branch 'x86/mpparse' into x86/devel
git-bisect good 3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce
Any ideas on proceeding?
A blind shot: can you try to revert commit 5603509137940f4cbc577281cee62110d4097b1b "Make sure to re-enable SCI after an ACPI suspend" and retest, please?
Hmm. That doesn't revert cleanly on 2.6.27-rc9. I did comment out 'acpi_set_register(ACPI_BITREG_SCI_ENABLE, 1);' in drivers/acpi/pci_link.c. Will that have the same effect?
It still rebooted after doing that.
Just to be sure this is not a problem with ACPI being disabled on wake-up, please try the patch from http://marc.info/?l=linux-kernel&m=122307130419753&w=4 .
Created attachment 18244 [details]
biostar motherboard with same problem
I didn't get to test that patch, but I did test out a 2.6.27 kernel on another machine. This time it is a Biostar socket 754 with Award BIOS and nvidia c51 chipset running 64 bit kernel again. This machine also reboots on resume. 2.6.26 is of course working fine. I haven't tried the 32 bit kernel.
Another thing I noticed while trying to make it easier to test things out is that on the MSI motherboard I can't do a WOL on 2.6.27. It works fine on 2.6.26. That makes me wonder if the problem is on suspend and not resume. Could the machine have shut down instead of suspending? It would look the same to me because I have these machines set to power on by keyboard from both S3 and S5.
The WOL problem is actually a different issue IMO. What network adapter is there in the box?
The rebooting during resume on two different boxes is worrisome. I'll prepare some debug patches for you. [I may get distracted by some more urgent things, in which case please ping me.]
Network adapter on the MSI motherboard is an onboard realtek 8169
(In reply to comment #13)
> Network adapter on the MSI motherboard is an onboard realtek 8169
The patches [1/2] and [2/2] from http://marc.info/?w=4&r=1&s=r8169%3A+WoL+fixes&q=t should fix your WOL issue.
The other one requires some serious debugging, though.
Comment #2 indicates that this is a software bug in our 64-bit resume code.
I don't see anything obviously suspicious in there, so I'll need you to try a few debug patches.
Created attachment 18265 [details]
Debug patch #1
With that patch applied, the box(es) should hang during resume instead of rebooting. Please check if that happens.
Ah, please attach kernel configs from the affected systems.
Created attachment 18266 [details]
Debug patch #2
If the patch from Comment #16 hangs the box(es) during resume instead of rebooting, please check if this one does the same.
Only tested on the MSI box. The patch from comment #16 hung the machine. The patch from comment #19, still rebooted. I did revert the first patch before applying the second. Is that what you wanted?
Created attachment 18271 [details]
kernel config
I am using a minimal kernel config on these machines, now (it is a lot faster to compile).
(In reply to comment #20)
> Only tested on the MSI box. The patch from comment #16 hung the machine. The
> patch from comment #19, still rebooted. I did revert the first patch before
> applying the second. Is that what you wanted?
This means that something breaks in the trampoline code or in head_64.S, or the trampoline address we try to use during wake-up is busted.
(In reply to comment #21)
> Created an attachment (id=18271) [details]
> kernel config
> I am using a minimal kernel config on these machines, now (it is a lot faster
> to compile).
I have compared your .config with my working one and nothing really stands out except for virtualization. Can you please test with CONFIG_VIRTUALIZATION unset?
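One way to double-check which options actually ended up in a build is to grep the generated .config. This is only a sketch: config_is_set is a hypothetical helper name, and it relies on the usual .config convention that enabled options appear as CONFIG_FOO=y (or =m) while disabled ones appear as "# CONFIG_FOO is not set".

```shell
# Sketch: succeed if a kernel option is built in (=y) or modular (=m).
# config_is_set is a hypothetical helper; the second argument defaults
# to .config in the current directory.
config_is_set() {
    grep -q "^$1=[ym]" "${2:-.config}"
}
```

For example, from the top of the build tree: config_is_set CONFIG_VIRTUALIZATION && echo "still enabled".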
Created attachment 18272 [details]
Patch checking the trampoline address
With this patch applied, suspend to RAM should fail if the trampoline address is different from the expected value.
Please see what happens with this patch applied (and two previous debug patches reverted). If the suspend fails, please run dmesg and attach its output.
Also, please check if the failure still happens if you compile the kernel without SMP support.
Without SMP and virtualization resume works! I disabled them both at the same time, but probably it was SMP, right? I am recompiling with SMP and the patch from comment #24 now.
SMP enabled with patch from #24 does not abort the suspend, and of course reboots on resume.
On Sunday, 12 of October 2008, firstname.lastname@example.org wrote:
> ------- Comment #27 from email@example.com 2008-10-12 12:25 -------
> SMP enabled with patch from #24 does not abort the suspend, and of course
> reboots on resume.
Andy, thanks for testing.
Ingo, Peter, Pavel,
The regression here is that x86_64 non-SMP systems using SMP kernel fail to
resume from suspend to RAM, most probably due to the trampoline code being
busted. I haven't found anything obviously wrong in the code yet, but I
suspect that something simply writes over the (original) trampoline code.
This is in 2.6.27 as well as in the current mainline and it's quite urgent,
because we'll get hit by it in distro kernels based on 2.6.27 if we don't fix it.
Please have a look at bug #11568.
Created attachment 18273 [details]
Debug patch #3
Andy, please test the SMP kernel with this patch applied and without the previous patches. This should hang the box during resume instead of rebooting, so please check if it does that.
Confirmed. Patch from #29 hangs the machine.
#28: Hmm, generic problem would be bad. Do I need true uniprocessor machine to duplicate this? (I don't have many uniprocessor x86-64s around; I have some old Solo boxes, but they were collecting dust for quite long...)
(In reply to comment #31)
> #28: Hmm, generic problem would be bad.
It almost certainly is generic.
> Do I need true uniprocessor machine to duplicate this?
Probably. I haven't tried with 'nosmp' or 'maxcpus=1' yet, though.
> (I don't have many uniprocessor x86-64s around; I have some old
> Solo boxes, but they were collecting dust for quite long...)
I have one, but it's never resumed from STR.
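For anyone trying to reproduce this, here is a rough sketch of how to tell whether a box matches the suspected setup (an SMP-built kernel running on a single CPU). The helper name smp_kernel_on_up is made up, and both arguments are optional overrides intended only for dry runs. It assumes the x86 /proc/cpuinfo layout with one "processor" line per online CPU, and that an SMP build shows "SMP" in the output of uname -v.

```shell
# Sketch: succeed if the kernel version string says SMP and exactly one
# CPU is listed. smp_kernel_on_up is a hypothetical helper name.
smp_kernel_on_up() {
    version="${1:-$(uname -v)}"
    cpuinfo="${2:-/proc/cpuinfo}"

    # Kernels built with CONFIG_SMP carry "SMP" in their version string.
    case "$version" in
        *SMP*) ;;
        *) return 1 ;;
    esac

    # Count the CPUs the kernel currently lists.
    ncpu=$(grep -c '^processor' "$cpuinfo")
    [ "$ncpu" -eq 1 ]
}
```

Booting a multi-CPU machine with 'maxcpus=1' should make this check pass as well, but whether that is enough to trigger the reboot is exactly what has not been tried yet.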
(In reply to comment #30)
> Confirmed. Patch from #29 hangs the machine.
Thanks for testing.
Well, that means the trampoline code is not busted after all, but the problem happens somewhere in the trampoline code itself or in head_64.S . I'll post a few additional debug patches later today.
Created attachment 18287 [details]
Debug patch #4
Let's see if the switch to Protected Mode works.
Andy, please test this patch without any of the previous patches. It should hang the box on resume instead of rebooting.
Created attachment 18288 [details]
Debug patch #5
If the previous patch works as expected, please test this one. It also should hang the box during resume.
Created attachment 18289 [details]
Debug patch #6
If both previous patches work as expected, please test this one. It also should hang the box during resume.
Created attachment 18290 [details]
Debug patch #7
If the three previous patches work as expected, please test this one. It should hang the box during resume (of course each time please revert all of the previously tested patches before testing the next one).
Actually, you can start testing from patch #7. If it works (i.e. hangs the box during resume), there's no need to test the previous patches.
Patches 4 and 5 hang on resume.
Patches 6 and 7 hang on boot.
(In reply to comment #39)
> Patches 4 and 5 hang on resume.
Thanks for testing.
> Patches 6 and 7 hang on boot.
Ah, sorry for that. I sometimes forget that head_64.S is used during boot too ...
So, in fact we can't narrow it any more this way.
I'm now looking back at your bisect results in Comment #7. So, you hit the panic when trying to bisect between 3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce (good) and d59fdcf2ac501de99c3dfb452af5e254d4342886 (bad). If that's the case, you can try to do 'git reset --hard 26e9e57b106445bbd8c965985e4e8af5293ae005' and continue bisection.
In the meantime, I'll try to find the culprit looking at the code, but that's going to be hard ...
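The bisection restart suggested above could look roughly like the sketch below. bisect_from is a hypothetical helper name, not a git command. It only wraps standard git bisect and git reset invocations, and must be run inside the kernel git tree.

```shell
# Sketch: restart bisection between known endpoints, then move to a
# hand-picked commit instead of the one git bisect suggests.
# bisect_from is a hypothetical helper name.
bisect_from() {
    good="$1"
    bad="$2"
    pick="$3"

    git bisect start || return 1
    git bisect good "$good" || return 1
    git bisect bad "$bad" || return 1

    # Jump to a nearby commit that is expected to build and boot.
    git reset --hard "$pick"
}
```

With the hashes from this report that would be: bisect_from 3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce d59fdcf2ac501de99c3dfb452af5e254d4342886 26e9e57b106445bbd8c965985e4e8af5293ae005. After building and testing that kernel, mark the result with 'git bisect good' or 'git bisect bad' and carry on as usual.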
OK, I think I have found something.
Andy, can you please check if commit 9cf4f298e29abba25c16679fe7be70898223167e ("x86: use stack_start in x86_64") resumes correctly with CONFIG_SMP set?
(In reply to comment #41)
> OK, I think I have found something.
> Andy, can you please check if commit 9cf4f298e29abba25c16679fe7be70898223167e
> ("x86: use stack_start in x86_64") resumes correctly with CONFIG_SMP set?
Resume is working fine with that commit.
OK, thanks for the confirmation.
Can you also check if commit a939098afcfa5f81d3474782ec15c6d114e57763 ("x86: move x86_64 gdt closer to i386") works correctly, please?
Never mind, it doesn't compile.
Created attachment 18332 [details]
Patch to test
OK, here's something that _seems_ to work on my broken Asus L5D with an SMP kernel and, most importantly, it doesn't seem to break my SMP test box.
Andy, can you give it a try, please?
Created attachment 18333 [details]
Corrected patch to test
The previous version wouldn't compile. Please try this one instead.
(In reply to comment #46)
> Created an attachment (id=18333) [details]
> Corrected patch to test
> The previous version wouldn't compile. Please try this one instead.
Resume is working fine with that patch applied to 2.6.27. Only tried the MSI motherboard.
OK, thanks for testing. Now, I only have to understand why it is actually needed and whether we can do anything better to fix this issue. Stay tuned. :-)
Created attachment 18343 [details]
Patch that should fix the problem
Finally, I think I know what's going on. If I'm not mistaken, this patch also should fix the problem and IMO it's nicer. Again, I tested it on the broken Asus L5D and it _seemed_ to work, but I'm not 100% sure since the box didn't resume anyway (for another unknown reason).
Andy, can you please test it too?
OK, sorry. After some more investigation I've decided to submit the previous patch after all, as it's what we should do IMO.
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=3038edabf48f01421c621cb77a712b446d3a5d67
Could you get this into a 2.6.27 point release, since it's going to be a pretty nasty one if it shows up in distribution kernels (for example Debian, where all kernels are SMP)?
Yes, it's on its way to -stable too.