Bug 54911 - Suspend to ram fails for 3.7- and 3.8-series, while working up to 3.6.11 - Samsung Q35
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: All Linux
Importance: P1 normal
Assignee: H. Peter Anvin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-06 12:13 UTC by Ralph Boehm
Modified: 2013-07-27 23:56 UTC

See Also:
Kernel Version: 3.7.*, 3.8.*
Subsystem:
Regression: Yes
Bisected commit-id:


Description Ralph Boehm 2013-03-06 12:13:27 UTC
Hello,

subject says it: On my old Samsung Q35 (Intel, dual core) laptop, resume after suspend to RAM fails, beginning with the 3.7 kernel series, while working up to 3.6.11. The symptoms are: After trying to resume, the backlight switches on and there seems to be HD activity (as indicated by LED, about 30 sec.), but then I'm stuck with a blinking cursor (seemingly in 80x24 VGA mode) in the upper left corner of the screen. The keyboard seems dead, including Alt-SysRq, thus forcing a hard reboot.

Bisecting (1st time, so bear with me) starting from 'linux-stable.git' finally gave me:

73201dbec64aebf6b0dca855b523f437972dc7bb is the first bad commit
commit 73201dbec64aebf6b0dca855b523f437972dc7bb
Author: H. Peter Anvin <hpa@linux.intel.com>
Date:   Wed Sep 26 15:02:34 2012 -0700

    x86, suspend: On wakeup always initialize cr4 and EFER
 ....

which at least sounds relevant to my untrained eye. Don't know how to proceed or what other info you may need, so feel free to advise / ask!

Thanks, Ralph
Comment 1 Aaron Lu 2013-03-11 07:15:44 UTC
Hi Ralph,
Thanks for the report!

Hi HPA,
Can you please take a look? Thanks.
Comment 2 Alejandro 2013-03-20 09:57:07 UTC
Hello,

Exactly the same issue here. I was really hoping 3.8.x would solve it, but it did not.

I'm willing to help solve this bug. I can provide further debug information if required.

Regards
Comment 3 Alejandro 2013-03-20 10:00:11 UTC
I forgot to mention, my laptop is a Toshiba Satellite U200, not a Samsung Q35. It is also an Intel dual core (not Core 2 Duo), with an i915 graphics card (maybe relevant).

Regards
Comment 4 Aaron Lu 2013-03-21 02:51:30 UTC
Hi Alejandro,

Can you please confirm if that commit also breaks your system? Thanks.
Comment 5 Alejandro 2013-03-21 07:15:40 UTC
Hi Aaron,

It started to fail right with the 3.7.x series, and still does with 3.8.x. I would love to help solve this. How do I proceed to check that specific commit? I'm not used to git, so I will need some guidance.

Thanks!
Comment 6 Aaron Lu 2013-03-21 07:28:32 UTC
(In reply to comment #5)
> Hi Aaron,
> 
> It started to fail right with the 3.7.x series, and still does with 3.8.x. I
> would love to help solving this. How do I proceed to check that specific
> commit? I'm not used to git, hence I will need some indications.

First clone Linus's tree:
$ git clone http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Then check out the commit:
$ git reset --hard 73201dbec64aebf6b0dca855b523f437972dc7bb
Now build the kernel and test whether resume fails.
If it fails, check out the previous commit:
$ git reset --hard 5a5a51db78ef24aa61a4cb2ae36f07f6fa37356d
Now build the kernel and test again. If resume works, it means 73201dbec64aebf6b0dca855b523f437972dc7bb is the bad commit that breaks your system. Thanks.
Comment 7 Alejandro 2013-03-21 17:21:45 UTC
Hi Aaron,

thanks for the info. Following the instructions it was really easy to do. After testing both of them I can confirm that the version corresponding to commit 73201dbec64aebf6b0dca855b523f437972dc7bb did not work, while the one corresponding to commit 5a5a51db78ef24aa61a4cb2ae36f07f6fa37356d worked without problems. 

Therefore, I can blame that commit for breaking my suspend/resume support :)

I hope this can be solved quickly. Anything else you may need (e.g. additional HW info), just ask me.

Thanks!
Comment 8 Rafael J. Wysocki 2013-03-21 20:39:49 UTC
Peter, it looks like we need your help.
Comment 9 H. Peter Anvin 2013-03-21 21:07:06 UTC
73201dbec64aebf6b0dca855b523f437972dc7bb is known buggy, but the bug was believed to be fixed in 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e.

So the first thing to find out is if 73201dbec64aebf6b0dca855b523f437972dc7bb with 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e cherry-picked on top of it works or not.
Comment 10 Aaron Lu 2013-03-22 00:49:48 UTC
Hi Alejandro,

Please follow these steps (in the cloned git tree) to test whether commit 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e fixes this problem:

$ git reset --hard 73201dbec64aebf6b0dca855b523f437972dc7bb
$ git cherry-pick 1396adc3c2bdc556d4cdd1cf107aa0b6d59fbb1e
Now build the kernel and test whether resume works. Thanks for your help.
Comment 11 Ralph Boehm 2013-03-22 12:59:04 UTC
Hi all,

no luck for me with Aaron's cherry-pick recipe. Still no resume, same symptoms as before.

Thanks, Ralph
Comment 12 Aaron Lu 2013-03-22 13:14:37 UTC
(In reply to comment #11)
> Hi all,
> 
> no luck for me with Aarons cherry-pick recipe. Still no resume, same symptoms
> as before.
 
Thank you Ralph for the test.

Hi Peter,
Looks like that commit doesn't fix it?
Comment 13 Alejandro 2013-03-22 15:11:21 UTC
It didn't work for me either :(. 
Same behaviour, blinking cursor, no response after trying to resume.
Comment 14 Alejandro 2013-03-23 10:48:54 UTC
Umm, something interesting happened today. My daily distribution is Arch Linux, which AFAIK provides unmodified kernel packages. Hence, their kernel package fails the same way as the one from git, showing similar behaviour.

However, I just tested the Ubuntu 13.04 live image (pre-release), which ships a (probably modified) 3.8.0 kernel. With that kernel, suspend/resume works. Maybe they just reverted commit 73201dbec64aebf6b0dca855b523f437972dc7bb, I don't really know. But it may be interesting to look into that.

Regards,
Alejandro
Comment 15 Aaron Lu 2013-04-02 05:47:55 UTC
(In reply to comment #14)
> Umm, something interesting happened today. My daily distribution is ARCH
> Linux,
> which AFAIK provides unmodified Kernel packages. Hence, their kernel package
> fails the same way the one from git, showing similar behaviour.
> 
> However, I just tested Ubuntu 13.04 live (pre-release) which is shipped with
> a
> (probably modified) 3.8.0 kernel version. Using that kernel suspend/resume is
> working. Maybe they just disabled commit
> 73201dbec64aebf6b0dca855b523f437972dc7bb, I don't really know. But I may be
> interesting to look around that.

Hi Alejandro,

Is there any update about your finding?
Comment 16 Alejandro 2013-04-02 22:33:30 UTC
Sorry, I've been busy, so I couldn't investigate further. I discovered an interesting thing: using the .config from Ubuntu with vanilla 3.8.0, the computer is able to suspend and resume. However, using Arch Linux's configuration, it is not. Therefore, it seems to be related to a kernel configuration option.

I'm trying to discover which one it is. The problem is that the two files (the one from Ubuntu and the one from Arch Linux) are significantly different, and the kernel compilation takes more than 1h on my computer. I'm copying entire sections from Ubuntu's file into Arch's (starting with the ACPI section), aiming to narrow down the offending option. I will keep you posted once I discover which specific option it is. If you (as an expert) think that uploading both files could be useful, just tell me.
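
For readers attempting the same comparison, here is a minimal sketch of how the differing options could be listed instead of being copied section by section. The helper name `config_only_in_second` is hypothetical, and the sketch assumes both files are ordinary kernel .config files. It only reports `CONFIG_...=` assignments, so options that are merely commented out as "is not set" in the second file will not show up.

```shell
# Print CONFIG_ assignments that appear in the second .config but not in
# the first. config_only_in_second is a hypothetical helper name.
config_only_in_second() {
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    # Keep only real assignments and sort them so comm can compare them.
    grep '^CONFIG_' "$1" | LC_ALL=C sort > "$a"
    grep '^CONFIG_' "$2" | LC_ALL=C sort > "$b"
    # -13 suppresses lines unique to the first file and lines common to both.
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

Calling it as `config_only_in_second arch.config ubuntu.config` (file names assumed) would give a candidate list to test in halves rather than one section at a time.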

Regards
Comment 17 Alejandro 2013-04-04 10:16:37 UTC
hello again,

analysing this is taking longer than I thought, as every single change to the .config file requires me to rebuild the whole kernel (2h at 100% CPU).

As commit 73201dbec64aebf6b0dca855b523f437972dc7bb is supposed to make the difference: is there any particular config option involved in that changeset? I need to narrow my search, or I'm afraid I will never find the problem :).

Thanks,
Alejandro
Comment 18 Alejandro 2013-04-06 11:30:12 UTC
Hello there,

I've found the problematic option. It's the PAE related stuff. With pre-3.7.x kernels, everything worked fine without PAE activated. However, it seems that starting from 3.7.0, I need to activate the CONFIG_HIGHMEM64G (and related) options to make it work. 

I only have 2G of RAM, so it does not make sense to me. Does commit 73201dbec64aebf6b0dca855b523f437972dc7bb introduce any change that depends on whether PAE is activated or not? Does it assume 64-bit addresses?

Ralph, can you confirm if having PAE activated solves the problem for you too?

Best regards,
Alejandro
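
A quick way to see which of these options a given kernel configuration uses is sketched below. It assumes the usual 32-bit x86 option names (`NOHIGHMEM`, `HIGHMEM4G`, `HIGHMEM64G`, `X86_PAE`), of which only `CONFIG_HIGHMEM64G` is named in this report. The helper `highmem_summary` is hypothetical.

```shell
# Print which high-memory / PAE options are set to "y" in a .config file.
# highmem_summary is a hypothetical helper; $1 is the path to the .config.
highmem_summary() {
    for opt in NOHIGHMEM HIGHMEM4G HIGHMEM64G X86_PAE; do
        # Match only real assignments, not "# CONFIG_... is not set" lines.
        if grep -q "^CONFIG_${opt}=y" "$1"; then
            echo "$opt"
        fi
    done
}
```

Running `highmem_summary .config` on the failing and on the working configuration should make the PAE difference described above visible at a glance.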
Comment 19 Ralph Boehm 2013-04-08 14:15:45 UTC
Hi Alejandro,

you seem to have found something: Same behaviour here, enabling CONFIG_HIGHMEM64G (which seems unnecessary with my 1G RAM as well) makes resume work again! Do you or someone else know about any undesired side effects of that option that would rule out its use on low-RAM machines?

Thanks a lot for your research, and best regards,
   Ralph
Comment 20 Alejandro 2013-04-08 14:30:41 UTC
You are welcome. I do not think enabling PAE has great disadvantages (Ubuntu ships its kernel images with PAE enabled by default), but nevertheless a fix that looks so unrelated to the symptom surely indicates a bug somewhere in the kernel. I have other computers, and this is the only one exhibiting such behaviour. It could be a buggy BIOS, but then IMO it should have been failing before 3.7.x too (and it worked fine).

It would be great if the kernel developers could locate and fix this bug.

Best regards,
Alejandro
Comment 21 Rafael J. Wysocki 2013-04-08 19:38:14 UTC
On Monday, April 08, 2013 02:15:45 PM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=54911

Hi Peter,

> --- Comment #19 from Ralph Boehm <xurpher@web.de>  2013-04-08 14:15:45 ---
> Hi Alejandro,
> 
> you seem to have found something: Same behaviour here, enabling
> CONFIG_HIGHMEM64G (which seems unnecessary with my 1G RAM as well) makes
> resume
> work again! Do you or someone else know about any undesired side effects of
> that option, dissallowing its usage on low RAM machines?

It looks like some change in 3.7 broke suspend/resume with highmem other than
CONFIG_HIGHMEM64G which worked before and the breakage is still present
(any of you guys can confirm that suspend/resume doesn't work for you with 3.9-rc6?).

Do you have any idea what change might have caused that to happen?
Comment 22 H. Peter Anvin 2013-04-08 19:51:08 UTC
Is this reproducible on arbitrary hardware or is it a specific set of machines?
Comment 23 Rafael J. Wysocki 2013-04-08 19:55:00 UTC
On Monday, April 08, 2013 07:51:08 PM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=54911
> 
> --- Comment #22 from H. Peter Anvin <hpa@zytor.com>  2013-04-08 19:51:08 ---
> Is this reproducible on arbitrary hardware or is it a specific set of
> machines?

Well, I haven't tried to reproduce it.

Aaron, any chance to try to reproduce this issue in a lab?
Comment 24 Alejandro 2013-04-08 19:55:38 UTC
I have several computers, both i686 and x64. It only fails on one of them (I can provide further information upon request). It's a Toshiba Satellite U200:

Vendor ID:             GenuineIntel
CPU family:            6
Model:                 14
Stepping:              8

2G of RAM.

It has been suspending/resuming without problems with both the 2.x and 3.x series.
Comment 25 Alejandro 2013-04-08 21:36:16 UTC
I just confirmed that 3.9-rc6 does not solve the problem.

Regards
Comment 26 H. Peter Anvin 2013-04-08 22:23:20 UTC
First of all: since PAE is required for NX, it is generally the
preferred configuration these days, regardless of amount of memory.

However, it should work, obviously, but why on Earth PAE should have any
impact here *in that direction* is bizarre.

	-hpa
Comment 27 H. Peter Anvin 2013-04-08 22:41:26 UTC
Rafael, can you think of any way we could get the wakeup_header dumped
out at suspend time?

The other thing I can think of is if we can get a message out giving an
idea where it is hanging during startup...

	-hpa
Comment 28 Rafael J. Wysocki 2013-04-09 00:35:55 UTC
On Monday, April 08, 2013 10:41:26 PM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=54911
> 
> --- Comment #27 from H. Peter Anvin <hpa@zytor.com>  2013-04-08 22:41:26 ---
> Rafael, can you think of any way we could get the wakeup_header dumped
> out at suspend time?

That's populated very late, after we've switched every useful output device off.

The only thing we could do would be to use the CMOS RTC memory to store stuff
and read it from there on the next boot.

> The other thing I can think of is if we can get a message out giving an
> idea where it is hanging during startup...

Or add something that will cause the box to reboot to the wakeup code and move
it from one place to another to see when it hangs instead of rebooting.
Comment 29 H. Peter Anvin 2013-04-09 00:42:44 UTC
It sounds like we're in text mode so maybe we can just poke into video memory...
Comment 30 Rafael J. Wysocki 2013-04-09 01:34:05 UTC
No, we aren't in the text mode.  There simply is no graphics at this point.
Comment 31 Rafael J. Wysocki 2013-04-09 01:36:20 UTC
In principle, we can use acpi_sleep=s3_bios to revive the graphics early, but that usually doesn't work.
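
When experimenting with this parameter, it can help to confirm that the running kernel was actually booted with it. The sketch below assumes the option is passed on the kernel command line as `acpi_sleep=s3_bios`, possibly as part of a comma-separated list. The helper name `acpi_sleep_has` is hypothetical.

```shell
# Succeed if the given acpi_sleep= option is present on the kernel
# command line. acpi_sleep_has is a hypothetical helper.
# $1 = option (e.g. s3_bios); $2 = cmdline file, default /proc/cmdline.
acpi_sleep_has() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" |
        sed -n 's/^acpi_sleep=//p' |   # keep only the acpi_sleep value
        tr ',' '\n' |                  # one option per line
        grep -qx "$1"
}
```

For example, `acpi_sleep_has s3_bios && echo "s3_bios active"` reports whether the boot loader entry took effect.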
Comment 32 Aaron Lu 2013-04-09 05:28:39 UTC
(In reply to comment #23)
> > --- Comment #22 from H. Peter Anvin <hpa@zytor.com>  2013-04-08 19:51:08
> ---
> > Is this reproducible on arbitrary hardware or is it a specific set of
> machines?
> 
> Well, I haven't tried to reproduce it.
> 
> Aaron, any chance to try to reproduce this issue in a lab?

No I'm afraid, sorry. I have an old HP Compaq 6531s but it doesn't have this problem.
Comment 33 Alejandro 2013-04-09 06:26:17 UTC
Umm, you said PAE is required for NX. Is it somehow possible that I have NX enabled in the BIOS, and that is making the resume fail? Maybe I'm taking shots in the dark. I cannot try it now anyway, since I'm at work and the failure is happening in my laptop, but I will try when I arrive home.
Comment 34 Alejandro 2013-04-09 13:52:36 UTC
Well, I can confirm this issue. In the BIOS, the Executable bit (NX) option was disabled, and that makes a non-PAE kernel fail to resume on the 3.7, 3.8 and 3.9 series. Enabling NX in the BIOS solves the problem, and resume works again, at least with 3.8.5 and 3.9-rc6.

But, using any of the 3.x (x < 7) series, it was working without problem. 

Regards
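
For anyone wanting to compare machines, the sketch below checks which CPU feature flags the running kernel reports. It is an assumption, not something established in this report, that the `nx` flag in /proc/cpuinfo follows the BIOS setting on a given machine, since the kernel may mask or re-enable features on its own. The helper name `cpu_has_flag` is hypothetical.

```shell
# Succeed if the first "flags" line of a cpuinfo-format file lists the
# given flag. cpu_has_flag is a hypothetical helper.
# $1 = flag name (e.g. nx, pae); $2 = file, default /proc/cpuinfo.
cpu_has_flag() {
    grep '^flags' "${2:-/proc/cpuinfo}" | head -n 1 |
        tr ' \t' '\n\n' |   # one token per line
        grep -qx "$1"       # exact match, so "sse" does not match "sse2"
}
```

Usage would look like `cpu_has_flag nx && echo "nx reported"`, run once with the BIOS option enabled and once with it disabled.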
Comment 35 H. Peter Anvin 2013-04-09 21:41:01 UTC
NX should not matter at all for non-PAE, but clearly that is not actually happening.

Could you install the msr-tools package on your computer and do a:

rdmsr -xc 0xc0000080

... as root, please?
Comment 36 H. Peter Anvin 2013-04-09 21:41:21 UTC
(Specifically while running on the non-PAE kernel with PAE enabled in the BIOS.)
Comment 37 H. Peter Anvin 2013-04-09 21:44:41 UTC
Even better, please do (again, as root):

cd /dev/cpu ; for c in [0-9]*; do rdmsr -p $c -xc 0xc0000080; done
Comment 38 Alejandro 2013-04-09 22:32:15 UTC
Sure. Executed on a non-PAE kernel, with the Executable Bit option activated in the BIOS (the configuration that actually works), the above command gives the following:

$ cd /dev/cpu ; for c in [0-9]*; do rdmsr -p $c -xc 0xc0000080; done
0x0
0x0
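
MSR 0xc0000080 is the extended feature enable register (EFER), so the values above can be read bit by bit. The following is a minimal sketch of such a decoder. The bit positions (SCE bit 0, LME bit 8, LMA bit 10, NXE bit 11) follow the architectural definition of EFER, while the helper name `decode_efer` is hypothetical and not part of msr-tools.

```shell
# Decode an EFER value as printed by `rdmsr -xc 0xc0000080`.
# decode_efer is a hypothetical helper, not part of msr-tools.
decode_efer() {
    val=$(( $1 ))   # accepts 0x-prefixed hex such as 0x800
    out=""
    if [ $(( val & 0x001 )) -ne 0 ]; then out="$out SCE"; fi  # SYSCALL enable
    if [ $(( val & 0x100 )) -ne 0 ]; then out="$out LME"; fi  # long mode enable
    if [ $(( val & 0x400 )) -ne 0 ]; then out="$out LMA"; fi  # long mode active
    if [ $(( val & 0x800 )) -ne 0 ]; then out="$out NXE"; fi  # no-execute enable
    out=${out# }            # drop the leading space
    echo "${out:-none}"     # "none" when no known bit is set
}
```

With the 0x0 reported for both CPUs, `decode_efer 0x0` prints `none`, meaning that NXE is not set even though NX is enabled in the BIOS.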
Comment 39 Zhang Rui 2013-05-13 02:01:40 UTC
Hi Peter,

Any update on this?
Comment 40 Zhang Rui 2013-05-20 03:15:39 UTC
ping...
Comment 41 Rafael J. Wysocki 2013-06-04 01:10:51 UTC
Alejandro, any chance to try 3.10-rc4?
Comment 42 Alejandro 2013-06-04 06:23:44 UTC
Sure, I will try when I have a moment. I suppose I have to disable NX to replicate the error conditions.

I will try tonight (here it is 8:23am).

Regards
Comment 43 Alejandro 2013-06-05 19:40:57 UTC
Hello,

sadly it does not work either. Did you change anything in particular, or were you just trying to find out whether it was fixed by some other changeset?

Regards
Comment 44 Rafael J. Wysocki 2013-06-05 20:11:45 UTC
The 3.10-rc kernels include a fix that might be related to it.  Thanks for testing!
Comment 45 Aaron Lu 2013-07-08 06:29:40 UTC
Hi Alejandro,

Can you please test if v3.10 fixed your problem? Thanks.
Comment 46 Rafael J. Wysocki 2013-07-27 23:56:46 UTC
Fixed by 5ff560f ("x86, suspend: Handle CPUs which fail to #GP on RDMSR").
