Bug 5555

Summary: suspend/resume unstable between 2.6.11 and 2.6.12/13/14
Product: ACPI Reporter: steve (stevenm)
Component: Power-Sleep-Wake Assignee: Shaohua (shaohua.li)
Status: REJECTED INSUFFICIENT_DATA    
Severity: high CC: acpi-bugzilla, kernel, sziwan
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.11 upgraded to 2.6.12 2.6.13 2.6.14 2.6.15 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: debug
lspci output, 2.6.11, after bootup
lspci output, 2.6.11, after first suspend/resume cycle
lspci output, 2.6.15, after bootup
lspci output, 2.6.15, after first suspend/resume cycle
lspci output, 2.6.15, right before a bad suspend 'cycle'

Description steve 2005-11-05 09:09:49 UTC
I have a Dell 600m x86 laptop here and I am using suspend-to-ram ("suspend").
With kernel 2.6.11, everything works fine for months at a time: the machine
suspends and resumes with no problem.

However, kernels 2.6.12, 2.6.13, and 2.6.14 have all exhibited the following
problem: every few days the machine will not resume from mem sleep. I open the
lid, it powers up, hard drive light flashes once or twice, and nothing else
happens. The screen backlight will not even come back on. It's not like the
screen doesn't wake up - the whole thing is completely frozen. It does not
respond to ACPI events like hitting the power button, or ssh. The only thing I
can do at this point is to completely power it off.

This is not an issue with other software on the system. After 2.6.12 refused to
resume, I went back to 2.6.11 and everything worked fine again. When 2.6.13 came
out, I built it and again, sometimes it will not resume. I went back to 2.6.11
again and everything worked fine again up until I switched to 2.6.14. It worked
fine for two days, and now it too locked up during resume.

I don't want to be stuck on kernel 2.6.11 til the end of time. There must have
been some significant changes between 2.6.11 and 2.6.12 that introduced this...
Does anyone have any ideas? My 2.6.14 config file is basically the same as the 
working 2.6.11 config file, mostly with defaults selected for the new options.
Here is a link to my .config for 2.6.14 if anyone wants to look at it:
http://wam.umd.edu/~stevenm/config2614

I do not have any binary modules loaded. I have tried this with and without
unloading various modules before suspending, but still no luck.
Comment 1 Shaohua 2005-11-06 19:13:13 UTC
Since I don't have the same laptop, I can't reproduce it here. Could you
please narrow the problem down? Between 2.6.11 and 2.6.12 there are 6 -rc
releases (2.6.12-rc1 to 2.6.12-rc6). You can get the patches from
(http://www.kernel.org/pub/linux/kernel/v2.6/testing/). Could you please
figure out which one is the first release that breaks your system? Thanks!
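
One way to organize that search is a binary search over the release list, which
needs at most three test kernels for the eight releases from 2.6.11 to 2.6.12.
The sketch below is illustrative only: first_bad and is_bad are hypothetical
names, and is_bad stands for running a kernel long enough to judge it, which for
a failure that shows up every few days may take a while. It also assumes that
every release after the first bad one is bad as well.

```python
from typing import Callable, List, Sequence


def first_bad(releases: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first release for which is_bad() is true.

    Assumes releases[0] is known good, releases[-1] is known bad, and that
    once a release is bad every later release is bad too.
    """
    lo, hi = 0, len(releases) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid
        else:
            lo = mid
    return releases[hi]


# 2.6.11, the six -rc releases, then 2.6.12 final.
RELEASES: List[str] = (
    ["2.6.11"] + ["2.6.12-rc%d" % n for n in range(1, 7)] + ["2.6.12"]
)
```

Here is_bad would be a manual step (build, boot, suspend and resume for several
days), not something a script can decide on its own.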
Comment 2 steve 2005-11-06 21:44:21 UTC
All right. Given the intermittent nature of this bug, I'll test all of those. It
usually happens once every few days, so it will probably take a while. I just
downloaded RC1 and will try building it. Hopefully we'll find out soon enough...
Comment 3 Karol Kozimor 2006-01-03 14:40:43 UTC
I think I'm hitting this too, the same symptoms and completely erratic  
behaviour. I had 2.6.12-rc6 going through literally hundreds of suspends  
(stress testing for this specific issue, with or without X, USB, C3, etc.) and 
then failing. I tested every -bk snapshot from 2.6.12-rc4 (which, at that
time, seemed stable) to 2.6.12 final, only to come to the conclusion that no
kernel version (including those older than 2.6.11) has ever been 100% stable
on my machine. Some versions are better (lasting hundreds of suspends), some
worse (15% of resumes fail), and sometimes a new build of the same version behaves
differently than the previous one. I finally gave up and moved to 2.6.15-rc, 
which still seems to exhibit this problem. 
 
I'm getting the impression that the bug is obscurely related to the compiler 
version, code alignment, or something similar. For the record, I'm using gcc 
3.3.6, but I remember that 3.4 didn't really help much. I also need 
acpi_sleep=s3_bios for my backlight to work. 
 
David: any idea on how to debug this? Serial console? Is this likely a 
hardware problem? 
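
A rough way to see why hundreds of clean suspends still prove little: if each
resume failed independently with probability p, then n clean cycles in a row
would happen with probability (1 - p)^n. The sketch below computes how many
clean cycles are needed before that probability drops below 5%. The function
name, the 95% threshold, and the independence assumption are illustrative
choices, not something measured on this machine.

```python
import math


def cycles_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n with (1 - failure_rate) ** n <= 1 - confidence.

    Assumes every resume fails independently with the same probability,
    which may not hold for a real machine.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


print(cycles_needed(0.15))  # a kernel failing 15% of resumes
print(cycles_needed(0.01))  # a kernel failing 1% of resumes
```

Under these assumptions a 15% failure rate shows up within about 19 cycles,
while a 1% failure rate needs roughly 300 clean cycles before a kernel can be
called good with the same confidence.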
Comment 4 Shaohua 2006-01-04 17:41:57 UTC
Created attachment 6935
debug

So both of your issues are S3 stress-test failures. The attached patch will
emulate an S3 process. Let's see whether it can pass your stress test.
You might also check whether the lspci -vv output is significantly different
before suspend and after resume in a real S3 cycle.
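
For the lspci comparison, a plain diff of the two captures is enough. A minimal
sketch using Python's difflib follows; lspci_diff is a hypothetical helper, and
the two sample strings are made up for illustration, not taken from any
attachment on this bug.

```python
import difflib
from typing import List


def lspci_diff(before: str, after: str) -> List[str]:
    """Return a unified diff (no context lines) of two lspci -vv captures."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before-suspend",
            tofile="after-resume",
            lineterm="",
            n=0,
        )
    )


# Made-up sample text standing in for real lspci -vv output.
SAMPLE_BEFORE = "00:1d.0 USB Controller\n\tControl: I/O+ Mem+ BusMaster+\n"
SAMPLE_AFTER = "00:1d.0 USB Controller\n\tControl: I/O+ Mem+ BusMaster-\n"

for line in lspci_diff(SAMPLE_BEFORE, SAMPLE_AFTER):
    print(line)
```

An empty result means the two captures are identical; any line starting with
"-" or "+" marks a register or capability that changed across the cycle.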
Comment 5 steve 2006-03-06 11:46:46 UTC
Hello again.
Switched to kernel 2.6.15, this is still happening. First resume freeze happened
2 days after making the switch. Again, 2.6.11 works just fine, months without
errors. I will try to get a serial console going to try to see if it produces
any output, but I do not even know what to look for.

Is anyone still working on this? Any ideas why this happens?
Comment 6 steve 2006-03-06 11:55:43 UTC
Created attachment 7514
lspci output, 2.6.11, after bootup
Comment 7 steve 2006-03-06 11:56:22 UTC
Created attachment 7515
lspci output, 2.6.11, after first suspend/resume cycle
Comment 8 steve 2006-03-06 11:56:45 UTC
Created attachment 7516
lspci output, 2.6.15, after bootup
Comment 9 steve 2006-03-06 11:57:14 UTC
Created attachment 7517
lspci output, 2.6.15, after first suspend/resume cycle
Comment 10 Shaohua 2006-03-06 18:06:35 UTC
I know this is hard, but can you give me the lspci -vv output from just before
the failed suspend/resume cycle? We got several failure reports (two IIRC)
caused by 2.6.11 - 2.6.12 changes, but I looked at the changesets and there
aren't any significant suspend/resume changes.
Comment 11 steve 2006-03-06 18:09:40 UTC
Will do. This will take time, and hopefully I will never have to post it!
I will add a command to echo the output of lspci -vv to a file and sync right
before suspending. Next time it fails, I will post the output.
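
A minimal sketch of such a pre-suspend hook, written in Python: it runs a
command, writes the output to a file, and forces the data to disk before
returning. The function name snapshot and the example path are assumptions for
illustration; how the hook gets called before suspend depends on the suspend
script in use.

```python
import os
import subprocess
from typing import Sequence


def snapshot(path: str, command: Sequence[str] = ("lspci", "-vv")) -> None:
    """Run command and save its standard output to path, synced to disk."""
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=False
    )
    with open(path, "w") as f:
        f.write(result.stdout)
        f.flush()
        # Push the data out of the page cache so it survives a hard power-off.
        os.fsync(f.fileno())


# Example call, with a hypothetical path:
# snapshot("/var/log/lspci-before-suspend.txt")
```

Each call overwrites the previous capture, so after a failed resume the file
holds the state from right before the bad cycle.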
Comment 12 steve 2006-03-13 09:33:50 UTC
It's been about a week and sure enough, when I opened my laptop this morning, I
was greeted by a locked up system. Open it up, the HD spins up, and nothing else
happens. LCD backlight does not turn on, no HD activity during resume (usually
the light blinks a few times during resume). I am posting an output of lspci -vv
right before the bad suspend cycle.

I should probably mention that my video card (Radeon Mobility M9000) does not
get POSTed by the BIOS during resume. This is instead done by the Radeon driver
in Xorg. I highly doubt this is responsible for the lockups, as this all works
fine with kernel 2.6.11. Still... anyways, hope the output helps.
Comment 13 steve 2006-03-13 09:35:18 UTC
Created attachment 7571
lspci output, 2.6.15, right before a bad suspend 'cycle'
Comment 14 Daniel Drake 2006-03-13 12:11:06 UTC
Downstream bug: http://bugs.gentoo.org/126051
(for my reference only, no useful info to add)
Comment 15 steve 2006-03-26 01:02:43 UTC
Switched from letting X wake up the graphics card to using vbetool. Same kind of
behavior still. I open the lid, power light goes solid, but the hard drive
doesn't even click a few times (I guess this happens as tasks begin to resume).

I upgraded from 2.6.15 to 2.6.16, thinking "well, maybe they fixed it."

In 2.6.15, I've been using a serial console to look at the output of various
things before and after suspend. Of course, if the suspend freeze happened
before the console was resumed, then that would be a problem.

Well, problem is, in 2.6.16 the serial console is not properly restored after
suspend until the machine is fully resumed and a userspace command is issued.
This is Bug 6259. How would I capture/store any debug messages that occur during
a failed suspend/resume cycle if getting the console working is contingent on a
successful resume in 2.6.16?
Comment 16 Daniel Drake 2007-04-29 07:20:07 UTC
Is this still an issue on the latest kernel, currently 2.6.21?
Comment 17 Daniel Drake 2007-06-30 15:41:26 UTC
Please reopen when responding to comment #16.