Bug 13931

Summary: second resume after s2ram fails
Product: Power Management
Reporter: Can Koy (cankoy)
Component: Hibernation/Suspend
Assignee: power-management_other
Status: CLOSED CODE_FIX
Severity: normal
CC: dave.cabot, enrico.gueli, lenb, maximlevitsky, nalimilan, rjw, rui.zhang, ruslanledesmagarza, yakui.zhao
Priority: P1
Hardware: i386
OS: Linux
Kernel Version: 2.6.31
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  acpidump output for acer aspire 5715z w/ Bios v1.45
  dmesg output after first S3 resume
  dmesg_after1 per comment #17
  The fix (workaround)
  acpidump
  Test app
  dump of NVS before first s2ram attempt
  Dump of NVS after first s2ram
  Store NVS over S3
  Fixed patch
  make_s2ram_work
  writereg.c

Description Can Koy 2009-08-07 11:43:58 UTC
Acer Aspire 5715Z notebook, BIOS v1.45 (latest), kernel 2.6.30.4 without any closed-source drivers. It suspends and resumes once, but does not resume after the second suspend. After brief HD activity, everything seems to be dead: no reaction to CapsLock/NumLock, screen off, etc.
Comment 1 Zhang Rui 2009-08-10 01:25:26 UTC
please set CONFIG_PM_DEBUG,
rebuild and run
1. echo core > /sys/power/pm_test
2. echo mem > /sys/power/state
does the system come back in about 10 seconds?
if yes, is the system still alive after you run the above test several times?

please attach the dmesg output after the first S3 resume.
please attach the acpidump output.
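
For anyone repeating this test, the two steps can be scripted so that several cycles run back to back. A minimal sketch in Python, assuming a kernel built with CONFIG_PM_DEBUG and root privileges; the function name run_core_test and its sysfs_power parameter are illustrative and not part of any tool mentioned in this report.

```python
from pathlib import Path


def run_core_test(cycles: int, sysfs_power: str = "/sys/power") -> None:
    """Run the pm_test "core" suspend test `cycles` times in a row.

    Assumption: `sysfs_power` points at the kernel's /sys/power directory
    and the caller is allowed to write to it (normally root only).
    """
    power = Path(sysfs_power)
    # Step 1: select the "core" test level.
    (power / "pm_test").write_text("core")
    try:
        for i in range(cycles):
            # Step 2: request suspend to RAM. In test mode the write
            # returns once the kernel has finished the test cycle.
            (power / "state").write_text("mem")
            print(f"cycle {i + 1} of {cycles} completed")
    finally:
        # Switch test mode off again so later suspends are real ones.
        (power / "pm_test").write_text("none")
```

Calling run_core_test(5) as root would repeat the test five times. If the machine survives this but still fails a real second resume, the problem probably lies outside the parts of the suspend path that this test level exercises.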
Comment 2 Can Koy 2009-08-10 06:47:21 UTC
(In reply to comment #1)
> please set CONFIG_PM_DEBUG,
> rebuild and run
> 1. echo core > /sys/power/pm_test
> 2. echo mem > /sys/power/state
> does the system come back in about 10 seconds?

yes, it does.

> if yes, is the system still alive after you run the above test several
> times?

yes, I repeated those steps 4-5 times in a row and it came back in about 10 seconds every time, and it seemed to function normally.

> 
> please attach the dmesg output after the first S3 resume.
> please attach the acpidump output.
Comment 3 Can Koy 2009-08-10 06:48:35 UTC
Created attachment 22657
acpidump output for acer aspire 5715z w/ Bios v1.45
Comment 4 Can Koy 2009-08-10 06:50:52 UTC
Created attachment 22658
dmesg output after first S3 resume
Comment 5 Zhang Rui 2009-08-12 07:05:22 UTC
is this a regression? i.e. did the s2ram work in any earlier kernel?
Comment 6 Can Koy 2009-08-12 08:34:30 UTC
(In reply to comment #5)
> is this a regression? i.e. did the s2ram work in any earlier kernel?

No, it's not a regression. I've been testing Linux on this notebook for ~15 months (tried every BIOS from 1.29 on) and have never been able to resume from suspend successfully.
Comment 7 Maxim Levitsky 2009-08-21 05:54:48 UTC
This happens on my notebook too, and I pretty much tried everything.

The core cause of this problem is that the BIOS doesn't pass control to the real-mode handler of Linux.

It chokes on something.
Comment 8 Maxim Levitsky 2009-08-21 05:55:57 UTC
I wish I had contact with their BIOS team. For them it would be a piece of cake to see why it hangs...
Comment 9 Can Koy 2009-08-21 10:13:19 UTC
(In reply to comment #8)
> I wish I had contact with their BIOS team. For them it would be a piece of
> cake to see why it hangs...

The BIOS changelog mentions Linux several times, so I think they're not totally unmindful of it. Maybe we just need to draw their attention to this issue.
Comment 10 Zhang Rui 2009-09-03 06:05:42 UTC
what if you boot with boot option "nosmp" or "maxcpus=1"?
Comment 11 Maxim Levitsky 2009-09-03 22:36:58 UTC
Just in case, I tested this suggestion; nothing changed.
Comment 12 Zhang Rui 2009-09-04 01:52:49 UTC
Maxim, what's the model name of your laptop?

Can Koy, can you try boot option "nosmp"?
Comment 13 Can Koy 2009-09-04 01:58:29 UTC
(In reply to comment #10)
> what if you boot with boot option "nosmp" or "maxcpus=1"?

I just tried with nosmp (observed switching to UP code in dmesg output).
It made no difference, unfortunately.
Comment 14 mrrusof 2009-09-04 21:57:03 UTC
Hi. I have the same problem as Maxim and practically the same machine (his is 5720g). My machine info, as reported by s2ram, is:
    sys_vendor   = "Acer      "
    sys_product  = "Aspire 5720     "
    sys_version  = "V1.14"
    bios_version = "V1.14"
Comment 15 ykzhao 2009-11-18 09:01:52 UTC
Will you please try the following test?
   a. kill the process using /proc/acpi/event (find it with the command "lsof /proc/acpi/event")
   b. echo mem > /sys/power/state; dmesg >dmesg_after1; sync; sleep 10; echo mem > /sys/power/state; dmesg >dmesg_after2; sync;
   c. press the power button to do the first resume
   d. wait for some time and press the power button to see whether the box can be resumed
   e. reboot the system and check whether the files dmesg_after1 and dmesg_after2 were created.

thanks.
Comment 16 Maxim Levitsky 2009-11-18 11:39:25 UTC
It's almost a thought experiment, but anyway.

I don't have /proc/acpi/event; I disabled this interface in the kernel config. I don't have acpid running either.

And the outcome of the experiment is as usual: the first resume works perfectly, the second one fails. dmesg_after1 is created, and contains nothing interesting.
Comment 17 Can Koy 2009-12-03 18:43:53 UTC
(In reply to comment #15)
> Will you please try the following test?
[..]

The second resume did not succeed, and dmesg_after2 was not created.
However, I observed the following lines on console after first resume:
[  360.124067] ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x9 t4
[  360.124070] ata1: irq_stat 0x00400040, connection status changed

dmesg_after1 is attached fyi.
Comment 18 Can Koy 2009-12-03 18:47:51 UTC
Created attachment 24005
dmesg_after1 per comment #17
Comment 19 enrico.gueli 2010-01-10 09:51:58 UTC
I can confirm the presence of this bug. I have found no solution yet.
My machine is a 5720 (no Z) with 1.42 bios.

https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/160763
Comment 20 Maxim Levitsky 2010-03-30 21:47:37 UTC
Can we write a letter to Acer together about this issue?
Comment 21 Dave Cabot 2010-05-16 19:01:33 UTC
This problem affects me too.  Acer Aspire 5720Z.  Firmware 1.42.  Kernel 2.6.32 (Ubuntu 10.04 2.6.32-22-generic).
Comment 22 Maxim Levitsky 2010-05-19 21:21:38 UTC
You know, I tell everyone that there is no problem I can't fix.
There is just no upper bound on how much time it will take me to do that.

Well it took me almost 2 years to fix that problem.

Enjoy!
Comment 23 Maxim Levitsky 2010-05-19 21:24:28 UTC
Created attachment 26442
The fix (workaround)

This makes Linux save and restore the BIOS NVS (non-volatile) area across suspend to RAM.

Since memory is preserved during suspend to RAM, that shouldn't be necessary, but maybe the Other OS restores that area?

And then maybe the BIOS scribbles over that area on resume (or suspend...), so...
Comment 24 Maxim Levitsky 2010-05-19 21:53:21 UTC
Created attachment 26443
acpidump

acpidump of my system (acer aspire 5720G)
BIOS 1.42
Comment 25 Milan Bouchet-Valat 2010-05-20 09:06:13 UTC
Maxim: Thanks for working on that! Let's hope a maintainer will have a look at your patch... Else, you may want to send a mail to LKML.
Comment 26 Matthew Garrett 2010-05-20 14:59:15 UTC
Created attachment 26459
Test app

Maxim:

Can you run some tests using this code? Revert your patch first, then build this code and (as root) run "(appname) read output". That'll create a 99 byte file called output. Suspend and resume, then run "(appname) read output2". Now run "(appname) write output" and see if you can suspend and resume again. If so, attach output and output2.
Comment 27 Maxim Levitsky 2010-05-20 15:53:56 UTC
Created attachment 26462
dump of NVS before first s2ram attempt
Comment 28 Maxim Levitsky 2010-05-20 15:59:27 UTC
Created attachment 26463
Dump of NVS after first s2ram
Comment 29 Maxim Levitsky 2010-05-20 18:31:47 UTC
The irony is that although the NVS area contains several changes, only one of them matters for s2ram, and it is a single byte. Just one byte makes that BIOS hang.

It is at 0x7fe99010.

The change is always the same: 00 is good, 0xFF is bad.
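
Finding such a byte comes down to comparing the dump taken before the first suspend with the one taken after it. A small sketch in Python, assuming both dumps cover the same region and have equal length. The function name diff_dumps and the base address 0x7fe99000 in the demo are illustrative assumptions; only the address 0x7fe99010 and the 00 -> FF change come from this report, and the demo data is synthetic, not the attached dumps.

```python
from typing import List, Tuple


def diff_dumps(before: bytes, after: bytes, base: int = 0) -> List[Tuple[int, int, int]]:
    """Return (address, old, new) for every byte that differs.

    `base` is the physical address of the first byte of the dumps
    (an assumption supplied by the caller).
    """
    if len(before) != len(after):
        raise ValueError("dumps must have the same length")
    return [
        (base + offset, old, new)
        for offset, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


# Synthetic demo: 32 zero bytes, with offset 0x10 flipped to 0xFF afterwards.
demo_before = bytes(32)
demo_after = bytearray(demo_before)
demo_after[0x10] = 0xFF

for address, old, new in diff_dumps(demo_before, bytes(demo_after), base=0x7FE99000):
    print(f"{address:#010x}: {old:02X} -> {new:02X}")
# prints "0x7fe99010: 00 -> FF"
```

On real dumps the list will usually contain several entries, so each candidate still has to be tested one at a time, as was done here.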
Comment 30 Maxim Levitsky 2010-05-20 19:11:45 UTC
I verified that this memory change happens just after resume, because commenting out the code in acpi_enter_sleep_state that finally puts the system into suspend mode doesn't trigger the 00->FF change. I also know that right after resume the system is in a *bad* state (because of many checks I did before).

I also suspect that 00 is the only valid value, because writing 0x10 to that offset also produced a hang.
Comment 31 Maxim Levitsky 2010-05-20 19:21:32 UTC
And in addition to that, setting this byte to 0xFF is enough to trigger a hang on the *first* resume, thus I am confident that this byte alone crashes the BIOS on resume.

I guess that is all I could reverse engineer for now.
Comment 32 Maxim Levitsky 2010-05-20 19:22:41 UTC
Someday, if I install Windows on that system, I will see whether this byte stays clear there.
Comment 33 Can Koy 2010-05-20 22:15:22 UTC
(In reply to comment #26)
> Created an attachment (id=26459)
> Test app

This code has a buffer overflow problem:
it reads 0x63 (99) bytes into a 63-byte buffer.
Comment 34 Maxim Levitsky 2010-05-21 21:37:31 UTC
Ah, sorry for the confusion, this patch applies on top of
https://bugzilla.kernel.org/show_bug.cgi?id=14668

I don't know why it is not upstream yet.
Comment 35 Matthew Garrett 2010-05-24 18:29:25 UTC
Created attachment 26531
Store NVS over S3

Can you confirm that this patch works as well as your version? I haven't made any progress in figuring out what the BIOS is actually doing there, but I have verified that Windows dumps and restores the NVS region.
Comment 36 Maxim Levitsky 2010-05-26 11:17:15 UTC
I will test this on Thursday.

I am very short on time though, so I won't be able to do much more testing.

I am quite sure you forgot to 'git add' the new version of 'hibernate_nvs.c', because the patch only removes it and adds nothing in its place.

Also note that somehow this helps only on my computer.
The answers I have received from users have so far been negative.

This Thursday I will try to figure out whether some other unintended change also made the suspend finally work.

(Note that suspend does work 100% of time on my notebook now)
Comment 37 Matthew Garrett 2010-05-27 13:59:33 UTC
Created attachment 26559
Fixed patch

Original patch missed the nvs.c file. Try this one.
Comment 38 Maxim Levitsky 2010-05-28 02:36:50 UTC
As expected this patch does work.

Note that now
https://bugzilla.kernel.org/show_bug.cgi?id=14668
needs several changes.

I would think that now we need to post both patches to the ACPI mailing list?

Note that I did many test and came to conclusion that restoring NVS works all they way back to 2.6.28, and this doesn't depend on anything else (like .config)
It even works with the Ubuntu kernel if I restore that byte from userspace.
Comment 39 Maxim Levitsky 2010-05-28 02:37:55 UTC
*many tests* *they* -> *the*
Comment 40 Maxim Levitsky 2010-05-31 18:50:46 UTC
If for some reason you can't compile a kernel and/or my patch appears not to work, try the following userspace workaround:

Compile writereg.c (gcc -o writereg writereg.c), put it somewhere in $PATH (/usr/local/bin for example), and then run the attached script (make_s2ram_work) after the first resume.

If you see something like this (note the FF):

writing 0 to <address> (current value = FF)

then it's likely my workaround works.
Try to suspend to RAM again. If it works, execute this script on each resume to make the next resume work (create a script in /etc/pm/sleep.d to do that automatically).
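
For readers who want to see the shape of such a userspace write, here is a rough sketch in Python. It is not the attached writereg.c or make_s2ram_work, whose contents are not reproduced in this report. The function name clear_byte is hypothetical, the message format only imitates the line quoted above, and access to /dev/mem needs root and may be restricted by the kernel configuration.

```python
import os


def clear_byte(path: str, address: int, value: int = 0x00) -> int:
    """Write `value` at byte offset `address` of `path`; return the old byte.

    Assumption: `path` is a seekable file such as /dev/mem, where the
    offset is interpreted as a physical address.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        old = os.pread(fd, 1, address)
        if len(old) != 1:
            raise OSError(f"could not read byte at {address:#x}")
        print(f"writing {value:X} to {address:#x} (current value = {old[0]:02X})")
        os.pwrite(fd, bytes([value]), address)
        return old[0]
    finally:
        os.close(fd)
```

On the Aspire 5720G discussed in comment 29 the call would be clear_byte("/dev/mem", 0x7FE99010). The address is specific to that machine and BIOS version, so on other hardware it has to be found by comparing NVS dumps first.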
Comment 41 Maxim Levitsky 2010-05-31 18:52:11 UTC
Created attachment 26593
make_s2ram_work
Comment 42 Maxim Levitsky 2010-05-31 18:53:57 UTC
Created attachment 26594
writereg.c
Comment 44 Len Brown 2010-07-07 02:50:45 UTC
shipped in linux-2.6.35-rc4


commit 2a6b69765ad794389f2fc3e14a0afa1a995221c2
Author: Matthew Garrett <mjg@redhat.com>
Date:   Fri May 28 16:32:15 2010 -0400

    ACPI: Store NVS state even when entering suspend to RAM


commit dd4c4f17d722ffeb2515bf781400675a30fcead7
Author: Matthew Garrett <mjg@redhat.com>
Date:   Fri May 28 16:32:14 2010 -0400

    suspend: Move NVS save/restore code to generic suspend functionality

closed.