Bug 9659

Summary: [2.6.22-rc1 regression] S3 ACPI wakeup from suspend broken because of commit 9666f4009c22f6520ac3fb8a19c9e32ab973e828
Product: IO/Storage Reporter: devsk (funtoos)
Component: Serial ATAAssignee: Jeff Garzik (jgarzik)
Severity: normal CC: bunk, dex, hancockrwd, htejun, pavel, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22-rc1 Tree: Mainline
Regression: Yes
Bug Depends on:    
Bug Blocks: 7216    
Attachments: uname -a, dmesg, lspci from good kernel
uname -a, dmesg, lspci from 2.6.24-rc6 kernel
dmesg immd. after good resume
dmesg immd. after good resume in 2.6.24-rc6 with adma=0
patch to resume ports from PCI device resume

Description devsk 2007-12-28 23:05:35 UTC
Most recent kernel where this bug did not occur: 2.6.21
Distribution: Gentoo
Hardware Environment: nForce4, SATA disks, AMD
Software Environment: KDE, gcc 4.2
Problem Description:

Can not resume from suspend-to-ram if commit 9666f4009c22f6520ac3fb8a19c9e32ab973e828 is in the kernel. On resume the machine hangs with only hard reset working. Upon reset from such resume failure, one SATA drive (maxtor 300GB) is not detected, while other 2 SATA drives are detected by the BIOS. Machine has to be reset again to see the drive. The spin up time on 2 WD drives is < 5ms, while maxtor has a spin up time of >25ms. My suspicion is that BIOS doesn't detect my maxtor drive on resume if the mentioned commit (containing reimplemented suspend-resume) is in the kernel and hence it hangs. Wrong command during suspend, maybe? I don't know.

This problem doesn't happen in 2.6.21 kernels.

Steps to reproduce:

1. Build kernel 2.6.22-rc1 and tested suspend-to-ram with hibernate-ram utility. It hangs on resume and requires reset.

2. Build the same kernel without the commit 9666f4009c22f6520ac3fb8a19c9e32ab973e828 and suspend-to-resume works as before (as in 2.6.21).

3. The problem happens in 2.6.22, 2.6.23 and 2.6.24RC series. I couldn't take out the patch from >= 2.6.23 kernels. It failed many hunks.
Comment 1 Rafael J. Wysocki 2007-12-29 06:07:34 UTC
Can you test the 2.6.24-rc6 kernel, please?

There are some important fixes related to libata suspend in it.
Comment 2 devsk 2007-12-29 07:11:04 UTC
I had mentioned I tested 24RC but I had tested upto RC5. I went ahead and tested 2.6.24-rc6 as well and it failed to resume in the same fashion: hang requires RESET (although enabled, ctrl-alt-sysrq R/S/U/B doesn't work in this situation while it works in many other situations) after resume and one of the SATA drive is undetected by BIOS upon RESET, until RESET again.
Comment 3 devsk 2007-12-31 14:50:29 UTC
Created attachment 14241 [details]
uname -a, dmesg, lspci from good kernel
Comment 4 devsk 2007-12-31 14:51:18 UTC
Created attachment 14242 [details]
uname -a, dmesg, lspci from 2.6.24-rc6 kernel
Comment 5 devsk 2007-12-31 14:53:01 UTC
Don't see much difference between the two dmesg outputs but I will let you peek at it. Please let me know if you need any more info.
Comment 6 Tejun Heo 2008-01-01 19:56:53 UTC
That's weird.  Other than timing changes, there isn't much difference from drive's POV.  Can you please try the followings?

* During resume, please wait for more than 10 minutes.  Does anything happen?

* Does specifying sata_nv.adma_enabled=0 as kernel parameter make any difference?
Comment 7 devsk 2008-01-01 20:21:07 UTC
I will try both and let you know soon. Do you have some corrective action or fix in mind if one of these succeeds?
Comment 8 Tejun Heo 2008-01-01 20:27:00 UTC
For the second one, I'll have to ask Robert Hancock and Nvidia people.  If the first one works, there will be a lot more information we can look at.

Also, please post the result of dmesg (preferably with printk time stamp turned on) after resuming from the good kernel.  Thanks.
Comment 9 devsk 2008-01-02 08:30:45 UTC
Created attachment 14264 [details]
dmesg immd. after good resume
Comment 10 devsk 2008-01-02 08:31:16 UTC
1. suspended to ram in 2.6.24-rc6 and waited for 20 minutes on resume. It didn't come back.

2. Added sata_nv.adma_enabled=0 to 2.6.24-rc6 kernel and verified after boot up. The symptoms remained the same on resume.

3. dmesg from good resume in 2.6.21 is attached.

So, what now?
Comment 11 Tejun Heo 2008-01-02 17:21:52 UTC
Hmmmm... Robert, any ideas?
Comment 12 Robert Hancock 2008-01-02 17:35:45 UTC
For #2 it's actually adma=0 (you can tell if it worked because you won't see any "Using ADMA mode" lines in dmesg).
Comment 13 devsk 2008-01-02 18:44:18 UTC
Ok. booted 2.6.24-rc6 with sata_nv.adma=0, suspend and resumed fine. SCORE! I will attach the dmesg immd. after resume. It definitely looks different and we (or rather you...;-)) should have clues in there.
Comment 14 devsk 2008-01-02 18:44:56 UTC
Created attachment 14269 [details]
dmesg immd. after good resume in 2.6.24-rc6 with adma=0
Comment 15 devsk 2008-01-02 19:17:08 UTC
Are we looking at the same problem as discussed in this thread: http://lkml.org/lkml/2008/1/1/304
Comment 16 Robert Hancock 2008-01-02 20:42:43 UTC
It's not likely related to that problem.. likely some effect that the patch has on  the ordering of suspend/resume processing. There's some ADMA-specific suspend/resume handling that will get skipped with ADMA disabled.

Tejun, would that patch change the order in which the PCI, host and port resumes happen from what they were before?
Comment 17 Tejun Heo 2008-01-02 23:03:18 UTC
The relative order among host, port and device is preserved.  What changes is parallelism.  Before the patch, after host is resumed, all ports are resumed simultaneously.  After the patch, things are sequential.  Other than that, there shouldn't be much difference from the POV of hardware.

I also have an interesting possibly related report where a user is reporting machine freeze after hotplug.  I asked the reporter to test sata_nv.adma=0.  I'll let you know when I get the answer.  Thanks.
Comment 18 devsk 2008-01-03 02:49:45 UTC
How much of a change is to put it back to simultaneous resume? It is possible that it is a timing issue and in 2.6.21, sdd was scheduled for resume much early and succeeded. Isn't the spin up time important on resume? At least the offending patch's comment seems to suggest that. The fact is that sdd has a spin up time which is 5 times larger than the others.
Comment 19 Tejun Heo 2008-01-03 02:58:35 UTC
A lot.  We basically need new infrastructure to do that.  Timing difference resulting in system lock up should be fixed anyway.  Robert, any ideas on why the behavior is different depending on ADMA mode enable/disable?
Comment 20 devsk 2008-01-03 03:11:57 UTC
Is it possible to configure adma=0 for a particular disk? I see a read slow down with adma=0 both in boot up time (2.6.24 with adma=0 boots 5 seconds slower than 2.6.21) and in bonnie.
Comment 21 Robert Hancock 2008-01-03 06:38:02 UTC
ADMA does have some wierd inter-port dependencies (like with the notifiers, it seems you can't really handle one port independently of the other). It's entirely possible that something breaks if you try to use a port before the other one on the controller was initialized.

As a test, we could effectively nuke the content of nv_adma_port_resume and move it into the nv_pci_device_resume function, and make it execute that code for both ports there, and see if that works..
Comment 22 Tejun Heo 2008-01-03 06:44:50 UTC
Robert, can you please prep a patch?  If the controller needs cross-port ordering, the code definitely has to be called from host resume.

Comment 23 Robert Hancock 2008-01-03 18:07:46 UTC
Created attachment 14273 [details]
patch to resume ports from PCI device resume

Can you test this patch against 2.6.24-rc6, leaving ADMA enabled? It should resume the ports from the PCI device resume to ensure that they're both initialized before either one is used.
Comment 24 devsk 2008-01-04 06:08:38 UTC

Applied the patch on 2.6.24-rc6 and booted new kernel without adma=0. Resume failed like before, with same symptoms.
Comment 25 Robert Hancock 2008-01-04 06:37:54 UTC
Hmm. There should be a way we can get output by that point in the resume cycle to know where it's failing, isn't there?

Tejun, is there any other difference in ordering that that patch causes that would be visible to the driver?
Comment 26 Tejun Heo 2008-01-05 04:43:08 UTC
With consoles offline, getting output is very difficult.  (cc'ing Pavel, Hi!) Pavel, machine locks up hard while resuming ATA controllers.  Are there ways to get debug output?

And for the second question.  AFAIK, no.  The basic sequence is the same.  The only difference is that what used to be done in parallel is now done sequentially (limitation of piggy backing on SCSI).  FWIW, there hasn't been any suspend/resume bug caused directly by this change even though it already went through full release cycle and is being used on major distros.
Comment 27 Robert Hancock 2008-01-07 15:51:35 UTC
devsk, can you try booting with the no_console_suspend kernel command line option and then try suspending and resuming?
Comment 28 devsk 2008-01-08 16:51:52 UTC
@#27, do you want me to reverse the patch u provided before I try with no_console_suspend?
Comment 29 Robert Hancock 2008-01-08 16:55:21 UTC
I would say try it both ways, if however you have it (applied or not) now doesn't work..
Comment 30 Pavel Machek 2008-01-09 05:12:30 UTC
...yep, no_console_suspend is the way, and mdelay(1000) in printk() is additional trick I sometimes had to use.
Comment 31 devsk 2008-01-09 08:35:58 UTC
booted 2.6.24-rc6 with no_console_suspend (both with your patch and without) and still the reset is the only way out of the hang after resume.
Comment 32 Robert Hancock 2008-01-09 16:23:45 UTC
You don't get any output on the console before it hangs?
Comment 33 devsk 2008-01-09 18:47:48 UTC
Monitor doesn't even come to life. No blinking cursor. None of the sysrq or vt switches work.
Comment 34 Robert Hancock 2008-01-09 20:15:13 UTC
Rather curious.. I'd expect the console would be active before the point where libata starts to resume?
Comment 35 devsk 2008-01-09 21:03:15 UTC
I have no clue about the resume sequence. But I would think that if adding adma=0 makes it work, our console is being activated after libata because libata is the only thing that reads that variable.
Comment 36 Rafael J. Wysocki 2008-01-10 08:31:04 UTC
You can use PM_TRACE to find out where it hangs.
Comment 37 devsk 2008-01-12 07:43:59 UTC
isn't the /var/log/messages required for PM_TRACE to be of any use? If console is where its output will go, we still won't see it because console is not there yet after resume.

Or may be I don't understand how I will use PM_TRACE in this scenario. Can you please elaborate?
Comment 38 devsk 2008-01-12 07:55:53 UTC
Ok, scratch comment #37. Realistically, what kind of problems can I expect after enabling PM_TRACE? Is there a possibility of BIOS corruption?
Comment 39 Rafael J. Wysocki 2008-01-12 10:47:24 UTC
No, it only modifies the real time clock data.  You'll have to set the date and time after you're done with the testing.
Comment 40 devsk 2008-01-12 12:08:05 UTC
# dmesg -s1000000 | grep -4 'hash matches'
[   37.526214] swsusp: Resume From Partition /dev/sdc5
[   37.526216] PM: Checking swsusp image.
[   37.526412] PM: Resume from disk failed.
[   37.526499]   Magic number: 8:103:26
[   37.529468]   hash matches device ide0
[   37.532509] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[   37.535583] Freeing unused kernel memory: 264k freed
[   38.648976] md: Autodetecting RAID arrays.
[   38.679435] ieee1394: Host added: ID:BUS[0-00:1023]  GUID[00508d0000d6c84d]

I enabled RTC in kernel but don't know whether I have the hardware or what hardware I have.

# grep RTC .config
# CONFIG_HPET_RTC_IRQ is not set
# CONFIG_RTC_DEBUG is not set
# RTC interfaces
# CONFIG_RTC_DRV_TEST is not set
# I2C RTC drivers
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# SPI RTC drivers
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# Platform RTC drivers
# CONFIG_RTC_DRV_CMOS is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_V3020 is not set
# on-CPU RTC drivers
Comment 41 Rafael J. Wysocki 2008-01-12 13:32:24 UTC
(In reply to comment #40)
> # dmesg -s1000000 | grep -4 'hash matches'
> [   37.526214] swsusp: Resume From Partition /dev/sdc5
> [   37.526216] PM: Checking swsusp image.
> [   37.526412] PM: Resume from disk failed.
> [   37.526499]   Magic number: 8:103:26
> [   37.529468]   hash matches device ide0

Here you go.  ide0 has been matched.
Comment 42 devsk 2008-01-12 14:08:31 UTC
But I think that's bogus. I ran it many times and saw it match some random things:

# grep "hash ma" /var/log/messages
Jan 12 12:00:56 localhost [   37.529468]   hash matches device ide0
Jan 12 13:33:33 localhost [   41.534180]   hash matches device ptyc2
Jan 12 13:36:27 localhost [   43.164574]   hash matches device ptycd
Jan 12 13:52:25 localhost [   44.789298]   hash matches device ptyue
Jan 12 13:52:25 localhost [   44.792356]   hash matches device fb0
Jan 12 13:59:18 localhost [   35.410393]   hash matches device ttyd1

ttyd1 is the latest resume failure. And ptyc2 was a regular boot without a failed resume before it.

Doesn't my hardware need to provide RTC support in hardware for it to work?

Jan 12 13:59:18 localhost [   35.348907] rtc_cmos: probe of 00:04 failed with error -16
Jan 12 13:59:18 localhost [   35.413419] drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Comment 43 devsk 2008-01-15 08:43:46 UTC
Does the fact that BIOS doesn't detect this drive on reset after failed resume give us any clue about what we might have done to the drive on suspend?

And what role does the spin up time has to play in all this? The Maxtor has a spin up time of 25ms, which is like 5 times the drives which get detected fine.

Any more things to try?
Comment 44 Robert Hancock 2008-01-15 15:51:05 UTC
Hmm.. ide0 isn't sata_nv at all, that's the PATA ports driven by pata_amd. It doesn't seem to make much sense that it wouldn't be a consistent point where it hangs, unless we're executing asynchronously at that point.

It seems like console should really be working at that point. Debugging is really a huge pain in the ass if it's not. Any ideas from suspend people?

I don't think that rtc_cmos error is related.

And as far as what the BIOS not detecting the drive on reset means, I'm really not sure other than whatever it is manages to hork the drive and/or controller somehow..
Comment 45 devsk 2008-01-15 17:12:14 UTC
What about "spin up time being high for the failing drive" data point? I see this in the commit which caused the regression:


  Resume now has to wait for disk to spin up before proceeding.  I
  couldn't find easy way out as libata is in EH waiting for the
  disk to be ready and sd is waiting for EH to complete to issue
Comment 46 Daniel Exner 2008-01-24 01:08:16 UTC
Most recent kernel where this bug did not occur: no idea since I had not sata before
Distribution: Frugalware-current
Hardware Environment: Asus A8V (VIA 8237), IDE: Samsung SP1614N, SATA: Samsung HD321KJ
Software Environment: KDE, gcc 4.2, 2.6.24-rc8 &
Problem Description:

I'm getting similar effects since I added the above SATA HDD to my system.
My first suspect was the mdraid mirror I build above this, but even after returning to non raid mode system will not resume. Sometimes after the needed hard reset (no Magic SysRQ) my IDE HDD will not get recognized by BIOS.
no_console_suspend didn't give anything (at least using netconsole, still searching my serial cable ;)
BUT s2disk works perfect.
So I guess the problem is not (only) in sata_nv but above, and only triggerd by s2ram pathway.
Comment 47 devsk 2008-01-24 11:05:12 UTC
(In reply to comment #46)
> BUT s2disk works perfect.
> So I guess the problem is not (only) in sata_nv but above, and only triggerd
> by
> s2ram pathway.

It is not necessary. With s2disk, you are powering on all of your hardware, and hence everything gets reset anyway. With s2ram, it is not so and software has to put everything into a valid state. And in this case, everything else seems to be doing so except for ADMA suspend code path, which when removed with adma=0, makes it resume fine. Did sata_via.adma=0 help you? Or is the ADMA concept only applicable to sata_nv?
Comment 48 Daniel Exner 2008-01-24 12:25:47 UTC
Nope, neither sata nor pata_via got a adma parameter. And source shows no sign of this either.
But I noticed that sata_via does have quite a view functions for my controller.. perhaps some of these doesnt work?
Comment 49 Robert Hancock 2008-01-24 16:37:14 UTC
ADMA setting is sata_nv only. It's not impossible that those controllers have some suspend/resume bug triggered by this change as well, but that would require testing a kernel before the change was made (i.e. 2.6.21). If it still breaks then it's an unrelated problem and a new bug report should be created.
Comment 50 devsk 2008-03-09 18:55:55 UTC
Ok, this bug can probably be closed now. I just rebuilt the 2.6.24 kernel with libata-only config. Involved a major surgery for device paths etc. but the resume works without adma=0.

It seems like some IDE code path doesn't agree with nvidia adma code path and causes resume pain.
Comment 51 Rafael J. Wysocki 2008-03-10 05:11:11 UTC
Well, thanks.  I'm not sure about the resolution, though.

For now, I'll mark it as WILL_FIX_LATER.