Bug 11568

Summary: spontaneous reboot on resume with 2.6.27
Product: Platform Specific/Hardware
Reporter: Andy Wettstein (ajw1980)
Component: x86-64
Assignee: Rafael J. Wysocki (rjw)
Status: CLOSED CODE_FIX
Severity: normal
CC: ajw1980, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27-rc2
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 11167    
Attachments: dmesg from 2.6.27-rc2
dmesg of pm_test 2.6.27-rc6
biostar motherboard with same problem
Debug patch #1
Debug patch #2
kernel config
Patch checking the trampoline address
Debug patch #3
Debug patch #4
Debug patch #5
Debug patch #6
Debug patch #7
Patch to test
Corrected patch to test
Patch that should fix the problem

Description Andy Wettstein 2008-09-14 20:00:24 UTC
Latest working kernel version: 2.6.26.5
Earliest failing kernel version: 2.6.27-rc2 (haven't tested earlier versions)
Distribution: Debian unstable
Hardware Environment: MSI AMD Socket 754 with VIA K8T800 and AMI BIOS
Software Environment:
Problem Description: Resume from S3 causes machine to reset itself.  

Steps to reproduce:
Enter S3 with "echo mem > /sys/power/state"
Try to resume.
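
Before running the steps above, it may be worth confirming that the kernel actually lists "mem" in /sys/power/state.  A minimal sketch, assuming a POSIX shell; POWER_DIR is a hypothetical override added for this illustration (not something the kernel provides) so the check can be tried against a scratch directory:

```shell
#!/bin/sh
# Sketch: confirm the kernel advertises S3 ("mem") before attempting suspend.
# POWER_DIR is a hypothetical override for this illustration; on a real
# system it defaults to /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

# Returns 0 if the given sleep state is listed in $POWER_DIR/state.
supports_state() {
    wanted="$1"
    [ -r "$POWER_DIR/state" ] || return 1
    for s in $(cat "$POWER_DIR/state"); do
        [ "$s" = "$wanted" ] && return 0
    done
    return 1
}

# Usage (as root, on the affected machine):
#   supports_state mem && echo mem > "$POWER_DIR/state"
```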
Comment 1 Andy Wettstein 2008-09-14 20:04:13 UTC
Created attachment 17781 [details]
dmesg from 2.6.27-rc2
Comment 2 Andy Wettstein 2008-09-14 20:07:21 UTC
I noticed this bug while investigating bug #11237.  A 32-bit kernel doesn't reboot on resume.

I'll find some time to compile and test 2.6.27-rc1 to see if it is the same.
Comment 3 Rafael J. Wysocki 2008-09-14 21:23:52 UTC
Please also try:

# echo core > /sys/power/pm_test
(requires CONFIG_PM_DEBUG to be set; without it /sys/power/pm_test is not present)
# echo mem > /sys/power/state

and see if that also causes a reboot to happen.
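
The two commands above can be wrapped so that a missing pm_test file is reported instead of silently failing.  This is only a sketch: POWER_DIR and the function name are assumptions made for illustration, and on a real system the function has to run as root.

```shell
#!/bin/sh
# Sketch of the "core" test-mode suspend described above.
# POWER_DIR is a hypothetical override for trying this against a scratch
# directory; on a real system it defaults to /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

# Select the "core" pm_test level, then request suspend to RAM.
# Returns 1 without writing anything if pm_test is absent
# (kernel built without CONFIG_PM_DEBUG).
suspend_core_test() {
    if [ ! -e "$POWER_DIR/pm_test" ]; then
        echo "pm_test not present; is CONFIG_PM_DEBUG set?" >&2
        return 1
    fi
    echo core > "$POWER_DIR/pm_test" || return 1
    echo mem > "$POWER_DIR/state"
}
```

If the box survives this but reboots on a real suspend, the problem is more likely in the low-level wake-up path than in the device suspend/resume callbacks.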
Comment 4 Andy Wettstein 2008-09-15 15:49:43 UTC
Created attachment 17795 [details]
dmesg of pm_test 2.6.27-rc6

The machine does not reset with that set.  Here is the dmesg.
Comment 5 Rafael J. Wysocki 2008-09-16 11:36:15 UTC
We'll need to check at which points it reboots.  I'll send you some debug patches for that in the next couple of days.
Comment 6 Rafael J. Wysocki 2008-09-26 15:59:32 UTC
On Tuesday, 23 of September 2008, Andy Wettstein wrote:
> On Sun, Sep 21, 2008 at 08:54:21PM +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.26.  Please verify if it still should be listed and let me know
> > (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=11568
> > Subject             : spontaneous reboot on resume with 2.6.27
> > Submitter   : Andy Wettstein <ajw1980@gmail.com>
> > Date                : 2008-09-14 20:00 (8 days old)
> 
> Just verified it is still a problem with 2.6.27-rc7.
Comment 7 Andy Wettstein 2008-09-29 12:06:25 UTC
Started a bisect.  But I get stuck with a kernel panic.  Here is my bisect log:

git-bisect start
# good: [bce7f793daec3e65ec5c5705d2457b81fe7b5725] Linux 2.6.26
git-bisect good bce7f793daec3e65ec5c5705d2457b81fe7b5725
# bad: [6e86841d05f371b5b9b86ce76c02aaee83352298] Linux 2.6.27-rc1
git-bisect bad 6e86841d05f371b5b9b86ce76c02aaee83352298
# bad: [d20b27478d6ccf7c4c8de4f09db2bdbaec82a6c0] V4L/DVB (8415): gspca: Infinite loop in i2c_w() of etoms.
git-bisect bad d20b27478d6ccf7c4c8de4f09db2bdbaec82a6c0
# bad: [666484f0250db2e016948d63b3ef33e202e3b8d0] Merge branch 'core/softirq' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git-bisect bad 666484f0250db2e016948d63b3ef33e202e3b8d0
# bad: [d59fdcf2ac501de99c3dfb452af5e254d4342886] Merge commit 'v2.6.26' into x86/core
git-bisect bad d59fdcf2ac501de99c3dfb452af5e254d4342886
# good: [3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce] Merge branch 'x86/mpparse' into x86/devel
git-bisect good 3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce

Any ideas on proceeding?
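
One possible way past a step that panics before resume can be tested is to mark that commit as untestable with "git bisect skip" rather than guessing good or bad.  A small sketch of that decision; the outcome labels (resumed, rebooted, panic) are hypothetical names chosen for this illustration:

```shell
#!/bin/sh
# Sketch: map the outcome of one bisect step to a git-bisect subcommand.
# The outcome names are hypothetical labels for this illustration, not
# anything git or the kernel defines.
bisect_verdict() {
    case "$1" in
        resumed)  echo good ;;   # suspend/resume worked
        rebooted) echo bad ;;    # the regression reproduced
        panic)    echo skip ;;   # unrelated failure, commit cannot be judged
        *)        echo "unknown outcome: $1" >&2; return 1 ;;
    esac
}

# Usage, inside the kernel tree after testing the current step:
#   git bisect "$(bisect_verdict panic)"
```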
Comment 8 Rafael J. Wysocki 2008-10-03 12:40:49 UTC
A blind shot: can you try to revert commit 5603509137940f4cbc577281cee62110d4097b1b "Make sure to re-enable SCI after an ACPI suspend" and retest, please?
Comment 9 Andy Wettstein 2008-10-09 07:21:53 UTC
Hmm.  That doesn't revert cleanly on 2.6.27-rc9.  I did comment out 'acpi_set_register(ACPI_BITREG_SCI_ENABLE, 1);' in drivers/acpi/pci_link.c.  Will that have the same effect?

It still rebooted after doing that.
Comment 10 Rafael J. Wysocki 2008-10-09 08:50:56 UTC
OK, thanks.

Just to be sure this is not a problem with ACPI being disabled on wake-up, please try the patch from http://marc.info/?l=linux-kernel&m=122307130419753&w=4 .
Comment 11 Andy Wettstein 2008-10-10 07:55:59 UTC
Created attachment 18244 [details]
biostar motherboard with same problem

I didn't get to test that patch, but I did test out a 2.6.27 kernel on another machine.  This time it is a Biostar socket 754 with Award BIOS and nvidia c51 chipset running 64 bit kernel again.  This machine also reboots on resume.  2.6.26 is of course working fine.  I haven't tried the 32 bit kernel.

Another thing I noticed while trying to make it easier to test things out is that on the MSI motherboard I can't do a WOL on 2.6.27.  It works fine on 2.6.26.  That makes me wonder if the problem is on suspend and not resume.  I wonder if the machine has shut down instead of suspending?  It will act the same for me because I have these machines set to power on by keyboard from both S3 and S5.
Comment 12 Rafael J. Wysocki 2008-10-10 12:16:54 UTC
The WOL problem is actually a different issue IMO.  What network adapter is there in the box?

The rebooting during resume on two different boxes is worrisome.  I'll prepare some debug patches for you.  [I may get distracted by some more urgent things, in which case please ping me.]
Comment 13 Andy Wettstein 2008-10-10 12:53:13 UTC
Network adapter on the MSI motherboard is an onboard realtek 8169
Comment 14 Rafael J. Wysocki 2008-10-10 16:06:47 UTC
(In reply to comment #13)
> Network adapter on the MSI motherboard is an onboard realtek 8169

The patches [1/2] and [2/2] from http://marc.info/?w=4&r=1&s=r8169%3A+WoL+fixes&q=t should fix your WOL issue.

The other one requires some serious debugging, though.
Comment 15 Rafael J. Wysocki 2008-10-11 14:45:02 UTC
Comment #2 indicates that this is a software bug in our 64-bit resume code.

I don't see anything obviously suspicious in there, so I'll need you to try a few debug patches.
Comment 16 Rafael J. Wysocki 2008-10-11 14:52:41 UTC
Created attachment 18265 [details]
Debug patch #1

With that patch applied, the box(es) should hand during resume instead of rebooting.  Please check if that happens.
Comment 17 Rafael J. Wysocki 2008-10-11 14:59:31 UTC
s/hand/hang/ (sorry)
Comment 18 Rafael J. Wysocki 2008-10-11 15:15:04 UTC
Ah, please attach kernel configs from the affected systems.
Comment 19 Rafael J. Wysocki 2008-10-11 15:23:32 UTC
Created attachment 18266 [details]
Debug patch #2

If the patch from Comment #16 hangs the box(es) during resume instead of rebooting, please check if this one does the same.
Comment 20 Andy Wettstein 2008-10-12 06:00:51 UTC
Only tested on the MSI box.  The patch from comment #16 hung the machine.  The patch from comment #19, still rebooted.  I did revert the first patch before applying the second.  Is that what you wanted?
Comment 21 Andy Wettstein 2008-10-12 06:02:52 UTC
Created attachment 18271 [details]
kernel config 

I am using a minimal kernel config on these machines, now (it is a lot faster to compile).
Comment 22 Rafael J. Wysocki 2008-10-12 09:26:36 UTC
(In reply to comment #20)
> Only tested on the MSI box.  The patch from comment #16 hung the machine.
> The patch from comment #19, still rebooted.  I did revert the first patch
> before applying the second.  Is that what you wanted?

Yes, exactly.

This means that something breaks in the trampoline code or in head_64.S, or the trampoline address we try to use during wake-up is busted.
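
The reasoning behind these debug patches can be sketched as follows, assuming each patch makes the box hang at one point of the resume path and that the results are listed in execution order (earliest point first).  The function and the NAME:RESULT argument format are hypothetical, for illustration only:

```shell
#!/bin/sh
# Sketch: given probe results ordered from the earliest to the latest point
# on the resume path, report the interval in which the failure occurs.
# Each argument is NAME:RESULT, where RESULT is "hang" (the probe was
# reached) or "reboot" (the failure happened before the probe).
# Names and format are hypothetical.
narrow_failure() {
    last_reached="start"
    for item in "$@"; do
        name="${item%%:*}"
        result="${item#*:}"
        if [ "$result" = "hang" ]; then
            last_reached="$name"
        else
            echo "failure between $last_reached and $name"
            return 0
        fi
    done
    echo "all probes reached; failure is after $last_reached"
}

# Usage with the results reported so far:
#   narrow_failure patch1:hang patch2:reboot
```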
Comment 23 Rafael J. Wysocki 2008-10-12 09:28:28 UTC
(In reply to comment #21)
> Created an attachment (id=18271) [details]
> kernel config 
> 
> I am using a minimal kernel config on these machines, now (it is a lot faster
> to compile).    

Thanks.

I have compared your .config with my working one and nothing really stands out except for virtualization.  Can you please test with CONFIG_VIRTUALIZATION unset?
Comment 24 Rafael J. Wysocki 2008-10-12 10:33:37 UTC
Created attachment 18272 [details]
Patch checking the trampoline address

With this patch applied, suspend to RAM should fail if the trampoline address is different from the expected value.

Please see what happens with this patch applied (and two previous debug patches reverted).  If the suspend fails, please run dmesg and attach its output.
Comment 25 Rafael J. Wysocki 2008-10-12 10:49:37 UTC
Also, please check if the failure still happens if you compile the kernel without SMP support.
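
To double-check which of the options discussed here are actually set in the kernel being tested, something like the following could be used.  It is a sketch that assumes a plain-text .config in the kernel tree; the function names are made up for this example:

```shell
#!/bin/sh
# Sketch: report whether kernel options are enabled in a .config file.
# The default path ".config" assumes this is run from the kernel tree.
# Only "=y" counts as enabled here, which is enough for boolean options.
config_enabled() {
    option="$1"
    config="${2:-.config}"
    grep -q "^${option}=y\$" "$config"
}

# Print the state of the options discussed in this report.
report_options() {
    config="${1:-.config}"
    for opt in CONFIG_SMP CONFIG_VIRTUALIZATION CONFIG_PM_DEBUG; do
        if config_enabled "$opt" "$config"; then
            echo "$opt=y"
        else
            echo "$opt is not set"
        fi
    done
}

# Usage:
#   report_options /path/to/linux/.config
```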
Comment 26 Andy Wettstein 2008-10-12 12:14:13 UTC
Without SMP and virtualization resume works!  I disabled them both at the same time, but probably it was SMP, right?  I am recompiling with SMP and the patch from comment #24 now.
Comment 27 Andy Wettstein 2008-10-12 12:25:13 UTC
SMP enabled with patch from #24 does not abort the suspend, and of course reboots on resume.
Comment 28 Rafael J. Wysocki 2008-10-12 12:39:49 UTC
On Sunday, 12 of October 2008, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11568
> 
> ------- Comment #27 from ajw1980@gmail.com  2008-10-12 12:25 -------
> SMP enabled with patch from #24 does not abort the suspend, and of course
> reboots on resume.

Andy, thanks for testing.

Ingo, Peter, Pavel,

The regression here is that x86_64 non-SMP systems using SMP kernel fail to
resume from suspend to RAM, most probably due to the trampoline code being
busted.  I haven't found anything obviously wrong in the code yet, but I
suspect that something simply writes over the (original) trampoline code.

This is in 2.6.27 as well as in the current mainline and it's quite urgent,
because we'll get hit by it in distro kernels based on 2.6.27 if we don't fix
it in a timely manner.

Please have a look at bug #11568.

Thanks,
Rafael
Comment 29 Rafael J. Wysocki 2008-10-12 12:47:46 UTC
Created attachment 18273 [details]
Debug patch #3

Andy, please test the SMP kernel with this patch applied and without the previous patches.  This should hang the box during resume instead of rebooting, so please check if it does that.
Comment 30 Andy Wettstein 2008-10-12 19:14:29 UTC
Confirmed.  Patch from #29 hangs the machine.
Comment 31 Pavel Machek 2008-10-13 04:02:35 UTC
#28: Hmm, generic problem would be bad. Do I need true uniprocessor machine to duplicate this? (I don't have many uniprocessor x86-64s around; I have some old Solo boxes, but they were collecting dust for quite long...)
Comment 32 Rafael J. Wysocki 2008-10-13 05:31:07 UTC
(In reply to comment #31)
> #28: Hmm, generic problem would be bad.

It almost certainly is generic.

> Do I need true uniprocessor machine to duplicate this?

Probably.  I haven't tried with 'nosmp' or 'maxcpus=1' yet, though.

> (I don't have many uniprocessor x86-64s around; I have some old
> Solo boxes, but they were collecting dust for quite long...)

I have one, but it's never resumed from STR.
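
For anyone trying to reproduce this on an SMP box with 'nosmp' or 'maxcpus=1', a quick way to confirm that the parameter really reached the kernel is to look at the command line.  A minimal sketch; the function name is an assumption, and it takes the command line as a string so it can be tried without rebooting:

```shell
#!/bin/sh
# Sketch: detect whether a kernel command line contains one of the two
# parameters mentioned above ('nosmp' or 'maxcpus=1').
single_cpu_cmdline() {
    for arg in $1; do
        case "$arg" in
            nosmp|maxcpus=1) return 0 ;;
        esac
    done
    return 1
}

# Usage on a running system:
#   single_cpu_cmdline "$(cat /proc/cmdline)" && echo "booted with one CPU"
```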
Comment 33 Rafael J. Wysocki 2008-10-13 05:34:51 UTC
(In reply to comment #30)
> Confirmed.  Patch from #29 hangs the machine.

Thanks for testing.

Well, that means the trampoline code is not being overwritten after all, and the problem happens somewhere in the trampoline code itself or in head_64.S .  I'll post a few additional debug patches later today.
Comment 34 Rafael J. Wysocki 2008-10-13 13:38:33 UTC
Created attachment 18287 [details]
Debug patch #4

Let's see if the switch to Protected Mode works.

Andy, please test this patch without any of the previous patches.  It should hang the box on resume instead of rebooting.
Comment 35 Rafael J. Wysocki 2008-10-13 13:41:33 UTC
Created attachment 18288 [details]
Debug patch #5

If the previous patch works as expected, please test this one.  It also should hang the box during resume.
Comment 36 Rafael J. Wysocki 2008-10-13 13:46:57 UTC
Created attachment 18289 [details]
Debug patch #6

If both previous patches work as expected, please test this one.  It also should hang the box during resume.
Comment 37 Rafael J. Wysocki 2008-10-13 13:50:52 UTC
Created attachment 18290 [details]
Debug patch #7

If the three previous patches work as expected, please test this one.  It should hang the box during resume (of course each time please revert all of the previously tested patches before testing the next one).
Comment 38 Rafael J. Wysocki 2008-10-13 13:52:21 UTC
Actually, you can start testing from the patch #7.  If it works (ie. hangs the box during resume), there's no need to test the previous patches.
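
The apply, test, revert cycle for these debug patches can be sketched like this.  RUN is a hypothetical hook that defaults to "echo" so the commands are only printed, and the patch file name in the usage line is a placeholder, not the real attachment name:

```shell
#!/bin/sh
# Sketch: apply one debug patch at a time and revert it before the next.
# RUN is a hypothetical hook: it defaults to "echo" so the commands are
# only printed; set RUN= (empty) inside a kernel tree to execute them.
RUN="${RUN-echo}"

apply_debug_patch()  { $RUN patch -p1 -i "$1"; }
revert_debug_patch() { $RUN patch -R -p1 -i "$1"; }

# Print (or run) the apply/revert pair for each patch file given.
patch_cycle() {
    for p in "$@"; do
        apply_debug_patch "$p" || return 1
        # ... rebuild, boot, suspend and resume, note hang or reboot ...
        revert_debug_patch "$p" || return 1
    done
}

# Usage (dry run, prints the two patch commands):
#   patch_cycle debug-7.patch
```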
Comment 39 Andy Wettstein 2008-10-13 20:58:26 UTC
Patches 4 and 5 hang on resume.

Patches 6 and 7 hang on boot.
Comment 40 Rafael J. Wysocki 2008-10-14 07:42:51 UTC
(In reply to comment #39)
> Patches 4 and 5 hang on resume.

Thanks for testing.

> Patches 6 and 7 hang on boot.

Ah, sorry for that.  I sometimes forget that head_64.S is used during boot too ...

So, in fact we can't narrow it any more this way.

I'm now looking back at your bisect results in Comment #7.  So, you hit the panic when trying to bisect between 3de352bbd86f890dd0c5e1c09a6a1b0b29e0f8ce (good) and d59fdcf2ac501de99c3dfb452af5e254d4342886 (bad).  If that's the case, you can try to do 'git reset --hard 26e9e57b106445bbd8c965985e4e8af5293ae005' and continue bisection.

In the meantime, I'll try to find the culprit looking at the code, but that's going to be hard ...
Comment 41 Rafael J. Wysocki 2008-10-14 15:52:02 UTC
OK, I think I have found something.

Andy, can you please check if commit 9cf4f298e29abba25c16679fe7be70898223167e ("x86: use stack_start in x86_64") resumes correctly with CONFIG_SMP set?
Comment 42 Andy Wettstein 2008-10-14 20:13:18 UTC
(In reply to comment #41)
> OK, I think I have found something.
> 
> Andy, can you please check if commit 9cf4f298e29abba25c16679fe7be70898223167e
> ("x86: use stack_start in x86_64") resumes correctly with CONFIG_SMP set?
> 

Resume is working fine with that commit.
Comment 43 Rafael J. Wysocki 2008-10-15 12:15:06 UTC
OK, thanks for the confirmation.

Can you also check if commit a939098afcfa5f81d3474782ec15c6d114e57763 ("x86: move x86_64 gdt closer to i386") works correctly, please?
Comment 44 Rafael J. Wysocki 2008-10-15 12:25:46 UTC
Never mind, it doesn't compile.
Comment 45 Rafael J. Wysocki 2008-10-15 14:23:30 UTC
Created attachment 18332 [details]
Patch to test

OK, here's something that _seems_ to work on my broken Asus L5D with an SMP kernel and, most importantly, it doesn't seem to break my SMP test box.

Andy, can you give it a try, please?
Comment 46 Rafael J. Wysocki 2008-10-15 14:26:27 UTC
Created attachment 18333 [details]
Corrected patch to test

The previous version wouldn't compile.  Please try this one instead.
Comment 47 Andy Wettstein 2008-10-15 20:18:56 UTC
(In reply to comment #46)
> Created an attachment (id=18333) [details]
> Corrected patch to test
> 
> The previous version wouldn't compile.  Please try this one instead.
> 

Resume is working fine with that patch applied to 2.6.27.  Only tried the MSI motherboard.
Comment 48 Rafael J. Wysocki 2008-10-16 06:51:42 UTC
OK, thanks for testing.  Now, I only have to understand why it is actually needed and whether we can do anything better to fix this issue.  Stay tuned. :-)
Comment 49 Rafael J. Wysocki 2008-10-16 14:46:36 UTC
Created attachment 18343 [details]
Patch that should fix the problem

Finally, I think I know what's going on.  If I'm not mistaken, this patch also should fix the problem and IMO it's nicer.  Again, I tested it on the broken Asus L5D and it _seemed_ to work, but I'm not 100% sure since the box didn't resume anyway (for another unknown reason).

Andy, can you please test it too?
Comment 50 Rafael J. Wysocki 2008-10-16 15:29:12 UTC
OK, sorry.  After some more investigation I've decided to submit the previous patch after all, as it's what we should do IMO.
Comment 52 Ken Bloom 2008-10-22 18:32:30 UTC
Could you get this into a 2.6.27 point release, since it's going to be a pretty nasty one if it shows up in distribution kernels (for example Debian, where all kernels are SMP)?
Comment 53 Rafael J. Wysocki 2008-10-22 22:22:16 UTC
Yes, it's on its way to -stable too.