Bug 9528

Summary: s2ram hang - 2.6.22 regression - Asus A8N-SLI, Abit KN9 (Nvidia NForce4)
Product: Power Management Reporter: Arthur Erhardt (erhardt)
Component: Hibernation/Suspend Assignee: Zhang Rui (rui.zhang)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, akpm, arbrandes, astarikovskiy, brian, carlos, hbruckynews, lenb, rui.zhang, zarko.pintar
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22+ Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: Debugging facility patch
dmesg output after resume from (debugging) suspend to disk
Trace output from console during suspend
acpidump for Abit KN9
acpidump for Asus A8N SLI Premium, Bios 1303
Simple .config for 2.6.24-rc5-gda8cadb3
/proc/ioports from Abit KN9
dmesg (with debugging) from Abit KN9
lspci -xxxvv from Abit KN9
dmesg with device suspend debugging patch
dmesg output on 2.6.24rc6 w/ PM patches
/proc/acpi/wakeup from Abit KN9
dmidecode output of Asus A8N SLI Premium
DMIDecode output A8N-SLI Premium 1303
patch: dmi workaround for Asus A8N-SLI Premium
dmi workaround for Asus A8N-SLI Premium and Asus A8N-SLI DELUXE

Description Arthur Erhardt 2007-12-08 04:54:35 UTC
Most recent kernel where this bug did not occur: 2.6.24-rc4
Distribution: OpenSuSE 10.2, 10.3
Hardware Environment: Asus A8N SLI-Premium, Opteron 175
Software Environment: OpenSuSE 10.[23], x86_64, console or Xorg 
Problem Description: system freeze instead of suspend to ram

Steps to reproduce: run s2ram -f -a 1 (or any other recommended set of args)
System freezes both from console and Xorg. In 2.6.21 s2ram works fine, even
with nvidia binary driver installed.
Comment 1 Arthur Erhardt 2007-12-08 08:11:54 UTC
My own parse error: "Most recent kernel where this bug did not occur: 2.6.24-rc4" is wrong. I somehow overlooked the "not" in that line, so yes, the bug is still there in 2.6.24-rc4.
Comment 2 Andrew Morton 2007-12-08 09:47:33 UTC
Do you know if the bug was not present in anything later than 2.6.21?
Comment 3 Rafael J. Wysocki 2007-12-08 12:10:30 UTC
I think this is a fallout of the suspend code ordering change between 2.6.22.

Arthur, do you use git for downloading the kernel sources?
Comment 4 Arthur Erhardt 2007-12-08 12:22:35 UTC
(In reply to comment #2)
> Do you know if the bug was not present in anything later than 2.6.21?
> 
The complete history of the problem discussed here can be found at
http://sourceforge.net/mailarchive/forum.php?thread_name=20071205234525.GA3886%40pit.physik.uni-tuebingen.de&forum_name=suspend-devel

I did not try every release, but this is what I have seen so far:
2.6.21.any: s2ram works; this includes 2.6.21-119 from OpenSuSE beta.
2.6.22.10, 2.6.22.13, OpenSuSE 10.3 distro kernel: s2ram doesn't work.
2.6.23.1, 2.6.23.9: s2ram doesn't work.
2.6.24rc4: s2ram doesn't work; slightly different behaviour (console cleared,
no text output before freeze)
Comment 5 Arthur Erhardt 2007-12-08 12:23:53 UTC
(In reply to comment #3)
> I think this is a fallout of the suspend code ordering change between 2.6.22.
> 
> Arthur, do you use git for downloading the kernel sources?
No, and have never used git. But I can learn anything if it helps.

 
Comment 6 Rafael J. Wysocki 2007-12-08 12:38:15 UTC
Well, it might help.

The instructions at http://linux.yyz.us/git-howto.html#getting_started should be sufficient to start with.

Please download the kernel tree as described in the above doc, go to the linux-2.6 directory and do:

$ git checkout -f 7b104bcb8e460e45a1aebe3da9b86aacdb4cab12

Then, compile the kernel (using .config from 2.6.22), install it and see if suspend works.
Comment 7 Arthur Erhardt 2007-12-08 14:25:04 UTC
(In reply to comment #6)
[...]
> 
> Please download the kernel tree as described in the above doc, go to the
> linux-2.6 directory and do:
> 
> $ git checkout -f 7b104bcb8e460e45a1aebe3da9b86aacdb4cab12
> 
> Then, compile the kernel (using .config from 2.6.22), install it and see if
> suspend works.
Done. Suspend works. Resume forgets to restore the console display, i.e.
I could log in via network or blind login and type commands on tty[1-6],
since I suspended from runlevel 3. From a first glance, nothing else seems to be broken after resume.
Comment 8 Arthur Erhardt 2007-12-08 14:26:52 UTC
(In reply to comment #6)
[...]
> 
> Please download the kernel tree as described in the above doc, go to the
> linux-2.6 directory and do:
> 
> $ git checkout -f 7b104bcb8e460e45a1aebe3da9b86aacdb4cab12
> 
> Then, compile the kernel (using .config from 2.6.22), install it and see if
> suspend works.
Done. Suspend works. Resume forgets to restore the console display, i.e.
I could log in via network or blind login and type commands on tty[1-6],
since I suspended from runlevel 3. Monitor stays in power save mode, so nothing is visible. From a first glance, nothing else seems to be broken after resume.
Comment 9 Rafael J. Wysocki 2007-12-08 14:39:48 UTC
OK, thanks.

Now, please go to linux-2.6, do:

$ git checkout -f 52ade9b3b97fd3bea42842a056fe0786c28d0555

using the same .config and see if it breaks.
Comment 10 Arthur Erhardt 2007-12-08 15:33:48 UTC
(In reply to comment #9)
> OK, thanks.
> 
> Now, please go to linux-2.6, do:
> 
> $ git checkout -f 52ade9b3b97fd3bea42842a056fe0786c28d0555
> 
> using the same .config and see if it breaks.
Yes it breaks with the same behaviour as described before.
Comment 11 Rafael J. Wysocki 2007-12-08 15:55:36 UTC
Created attachment 13918 [details]
Debugging facility patch

Well, I was afraid it would.

That's unfortunate, because we can't revert this change and the breakage is almost certainly related to your system's BIOS.

Please switch to the latest mainline kernel, by going to linux-2.6 and doing:

$ git checkout -f master

apply the attached patch on top of it, compile the kernel and install it.

Next, reboot to the fully functional system (with the new kernel) and run:

# echo 8 > /proc/sys/kernel/printk
# echo platform > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state

That will test the suspending of devices and the execution of the platform global control methods (it'll wait 5 sec. in the process).  If it returns to the command prompt, please attach the output of dmesg generated right then.
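
The sequence above can be sketched as a small helper. This is only a sketch and not part of the debugging patch: `run_pm_test` and its `root` and `disk_mode` parameters are hypothetical names, and `root` exists only so the writes can be rehearsed against a scratch directory instead of the live /proc and /sys trees.

```python
from pathlib import Path

def run_pm_test(level, state, root="/", disk_mode=None):
    """Write the same values as the echo commands above, in the same order.

    `root` is an assumption for illustration: point it at a scratch
    directory to rehearse the sequence without suspending anything.
    """
    root = Path(root)
    writes = [("proc/sys/kernel/printk", "8"),   # raise the console log level
              ("sys/power/pm_test", level)]      # e.g. "devices" or "platform"
    if disk_mode is not None:
        writes.append(("sys/power/disk", disk_mode))
    writes.append(("sys/power/state", state))    # "disk" or "mem"; this write starts the test
    for rel, value in writes:
        (root / rel).write_text(value + "\n")
    return [rel for rel, _ in writes]
```

With the default `root` this needs root privileges and really starts the test, so `run_pm_test("platform", "disk", disk_mode="platform")` should only be run on a machine that is allowed to hang.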
Comment 12 Arthur Erhardt 2007-12-10 03:39:31 UTC
Unfortunately the attached patch is partially rejected (power.h) by the master tree.
Comment 13 Rafael J. Wysocki 2007-12-10 10:17:59 UTC
Sorry, my fault.

You need to apply patches 01-09 from the series at:
http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24-rc4-git6/patches/
for this to work.
Comment 14 Arthur Erhardt 2007-12-10 16:10:51 UTC
No problem. I patched the current master with all necessary patches 
and enabled the corresponding debug parts in the kernel. 
After "echo disk > /sys/power/state" the system froze with the same 
symptoms as before, and with the following text output after clearing the console:

Freezing user processes... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks... (elapsed 0.00 seconds) done
Shrinking memory... done (0 pages freed)
Freed 0 kbytes in 0.53 seconds
Suspending console(s)

My journaling file system might have cleaned up anything that might have been interesting, but printing syslog output via network didn't help either. 
I suspect that a serial console won't make any difference. I don't know if it helps, but I have dmesg output from right before crashing the system; I can post it if needed.

Suggestions?
Comment 15 Rafael J. Wysocki 2007-12-10 16:53:22 UTC
Please try:

# echo 8 > /proc/sys/kernel/printk
# echo devices > /sys/power/pm_test
# echo mem > /sys/power/state
Comment 16 Arthur Erhardt 2007-12-11 13:29:03 UTC
Created attachment 13981 [details]
dmesg output after resume from (debugging) suspend to disk
Comment 17 Arthur Erhardt 2007-12-11 13:33:08 UTC
Ok, this worked. dmesg output is in comment #16. 
Comment 18 Rafael J. Wysocki 2007-12-12 16:02:17 UTC
Thanks.

Now, please do:

# echo 8 > /proc/sys/kernel/printk
# echo platform > /sys/power/pm_test
# echo mem > /sys/power/state

and see if that works.
Comment 19 Arthur Erhardt 2007-12-15 06:36:13 UTC
Sorry for the delay, I've been away from that computer for a few days.
The result of your last suggestion looks like this:

Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
PM: Entering mem sleep
Suspending console(s)

*system frozen*
Comment 20 Rafael J. Wysocki 2007-12-15 14:42:51 UTC
That will be hard to fix.

Your system apparently hangs in the ACPI code executing the ACPI _PTS global control method.

Please attach the output of acpidump and I'll try to ask some ACPI people to have a look at it.

Also, please try to boot with no_console_suspend, repeat the steps from Comment #18 and see what's the last message printed before the hang.  Enabling ACPI debugging might be helpful as well.
Comment 21 Rafael J. Wysocki 2007-12-15 14:44:32 UTC
BTW, do you use the NVidia proprietary graphics driver?
Comment 22 Carlos Corbacho 2007-12-15 14:48:33 UTC
This result is easily duplicated without the binary drivers installed (I have the same problem with another nForce 4 board, the Abit KN9).
Comment 23 Rafael J. Wysocki 2007-12-15 15:13:25 UTC
Hm, I have an nForce 4-based board somewhere here.  I'll try to reproduce the problem on it tomorrow.
Comment 24 Carlos Corbacho 2007-12-15 15:27:11 UTC
Created attachment 14059 [details]
Trace output from console during suspend

In the meantime, here's the console output from my machine after suspend-to-RAM (as per your comment #20 instructions) - copied by hand, so apologies for any typos.

Main difference between my machine and Arthur's is that my board is non-SLI.

acpidump to follow.
Comment 25 Carlos Corbacho 2007-12-15 15:31:25 UTC
Created attachment 14060 [details]
acpidump for Abit KN9

acpidump from Abit KN9 (this is the plain, nForce 4 KN9 - not the Ultra or SLI KN9's which are nForce 5). This is with the kn9c12 firmware.
Comment 26 Arthur Erhardt 2007-12-16 03:34:01 UTC
(In reply to comment #20)
> That will be hard to fix.
> 
> Your system apparently hangs in the ACPI code executing the ACPI _PTS global
> control method.
> 
> Please attach the output of acpidump and I'll try to ask some ACPI people to
> have a look at it.
Ok.

> Also, please try to boot with no_console_suspend, repeat the steps from
> Comment
> #18 and see what's the last message printed before the hang.  Enabling ACPI
> debugging might be helpful as well.
The visible text looks like this:
acpi device:10: suspend
acpi device:0f: suspend
acpi device:0e: suspend
acpi device:0d: suspend
acpi device:0c: suspend
acpi device:0b: suspend
acpi device:0a: suspend
acpi device:09: suspend
acpi device:08: suspend
acpi device:07: suspend
acpi device:06: suspend
acpi device:05: suspend
acpi device:04: suspend
acpi device:03: suspend
acpi device:02: suspend
acpi device:01: suspend
acpi PNP0C02:00: suspend
pci_root PNP0A08:00: suspend
button PNP0C0C:00: suspend
acpi device:00: suspend
processor ACPI0007:01: suspend
processor ACPI0007:00: suspend
button LNXPWRBN:00: suspend
acpi LNXSYSTM:00: suspend
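
When the console capture is long, the last device reached before the hang can be pulled out mechanically. A minimal sketch, assuming the capture has been saved as plain text in the `<driver> <device>: suspend` form shown above; `last_suspended` is a hypothetical helper name.

```python
import re

# Matches lines such as "acpi LNXSYSTM:00: suspend" or "pci_root PNP0A08:00: suspend".
SUSPEND_LINE = re.compile(r"^(?P<driver>\S+) (?P<device>\S+): suspend$")

def last_suspended(console_text):
    """Return (driver, device) from the last suspend line, or None if there is none."""
    last = None
    for line in console_text.splitlines():
        m = SUSPEND_LINE.match(line.strip())
        if m:
            last = (m.group("driver"), m.group("device"))
    return last
```

For the listing above this returns `("acpi", "LNXSYSTM:00")`, the last device suspended before the freeze.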

Regarding your question in comment #21: I use nvidia binary drivers in 
production setup (SuSE 10.2, Kernel 2.6.21.7) without any problems and 
with working suspend. I never compiled nvidia's driver for any of the 
kernels discussed in this thread, so I'm very sure no nvidia.ko has been
used or loaded in these tests.
Comment 27 Arthur Erhardt 2007-12-16 03:36:07 UTC
Created attachment 14061 [details]
acpidump for Asus A8N SLI Premium, Bios 1303
Comment 28 Carlos Corbacho 2007-12-16 09:57:42 UTC
Rafael,

After a lot of poking about with some custom DSDTs, I've narrowed down the problem (and after comparing my DSDT to Arthur's, they are pretty similar - probably both based on the nVidia reference BIOS. So this is quite likely also true for Arthur's board).

_PTS() is failing on the call to OSTP(). OSTP() is trying to write a value to SMIP, which is defined to be an IO port (0x142E) and this appears to be what is failing.

(So is something in ACPI or kernel land disabling IO port access before we can write out?)
Comment 29 Rafael J. Wysocki 2007-12-16 10:31:00 UTC
Thanks a lot for narrowing this, outstanding work!

Yes, I think something in the kernel is fiddling with port 0x142E behind the ACPI drivers' back.

Do you use any hwmon drivers by chance?
Comment 30 Carlos Corbacho 2007-12-16 10:46:52 UTC
Not AFAICT on the test kernel (which is basically just enough enabled to boot the box and test suspend).

(According to the DSDT though, this port is only supposed to be used for storing a value that specifies the OS).
Comment 31 Rafael J. Wysocki 2007-12-16 11:00:22 UTC
OK, then probably something's wrong in the ACPI land.

Please attach your .config, it may be useful.
Comment 32 Carlos Corbacho 2007-12-16 11:06:16 UTC
Created attachment 14066 [details]
Simple .config for 2.6.24-rc5-gda8cadb3

This is the .config from my test kernel 2.6.24-rc5-gda8cadb3 + Rafael's PM debugging patches.

(NOTE: the custom DSDT reference is a left over from my experiments to narrow down the issue - this issue _is_ fully reproducible with the BIOS supplied DSDT).
Comment 33 Carlos Corbacho 2007-12-21 06:37:37 UTC
I've also played about with moving the offending IO port write around, but it still breaks suspend.

It doesn't appear to be a generic IO ports issue since there are other port writes happening in _PTS() that don't fail.

How exactly do Linux and ACPI handle IO ports during suspend? I've tried poking around the suspend to RAM code, but to be honest, I'm not any the wiser on this (I've considered things like PNPACPI, since port 0x142E is in a supposedly reserved region; but I tested with pnpacpi=off and it doesn't make any difference).

And is there any way to get more debugging output on what exactly is going on here when ACPI reaches the OSTP() code?
Comment 34 Carlos Corbacho 2007-12-23 04:26:07 UTC
Increasing PCIBIOS_MIN_IO in include/asm-x86/pci.h to 0x142F fixes the problem.

It looks like there were some patches that touched this two years ago:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0b2bfb4e7ff61f286676867c3508569bea6fbf7a

and

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0b2bfb4e7ff61f286676867c3508569bea6fbf7a

Which were then both reverted:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ba84684e8cf6f980e4e95a2300f53a505eb794e

And apparently never heard from since, so this IO ports limit has been stuck at 0x1000 (and all the other IO port writes in _PTS() are to ports lower than 0x1000, hence they don't break).

Perhaps 0x4000 for PCIBIOS_MIN_IO is a bit excessive - but certainly for this nForce board, 0x142F (or, I suppose, we could pick 0x1500 as a nice round number) will also fix the problem.
Comment 35 Carlos Corbacho 2007-12-23 05:13:06 UTC
Sorry for duplicating links, the first patch referred to should have been this one (the second link is correct):

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=71db63acff69618b3d9d3114bd061938150e146b
Comment 36 Anonymous Emailer 2007-12-23 09:56:00 UTC
Reply-To: torvalds@linux-foundation.org



On Sun, 23 Dec 2007, Carlos Corbacho wrote:
>
> Fix suspend-to-RAM on nForce 4 (CK804) boards by increasing
> PCIBIOS_MIN_IO.
> 
> Fixes kernel bugzilla #9528
> 
> Problem:
> 
> Linus' patch (52ade9b3b97fd3bea42842a056fe0786c28d0555) to re-order
> suspend (and fix fall out from Rafael's earlier suspend reordering work)
> broke suspend-to-RAM on nForce 4 (CK804) boards.
> 
> Why:
> 
> After debugging _PTS() in the DSDT, it turns out these nVidia boards are
> trying to write to an IO port > 0x1000 (0x142E) during suspend. Before the
> re-ordering, we got away with this.

Very interesting.

HOWEVER.

I'd much rather figure out what the magic IO resource is that clashes. 

It's almost certainly some hidden and undocumented (or badly documented) 
ACPI IO area that the kernel doesn't know about, because it's not a 
regular PCI BAR resource, but some northbridge (or southbridge) magic 
register range.

Those ranges *should* be reserved by the BIOS in the ACPI tables, but this 
would definitely not be the first time that doesn't happen.

But the right fix would be for us to just figure out what the range is as 
a PCI quirk, and just know to avoid it on purpose, rather than just being 
lucky and happen to avoid it because PCIBIOS_MIN_IO just happens to be 
bigger than the particular address.

So can you:
 - show what your /proc/ioports contains (*with* the bug triggering, ie 
   non-working suspend, so we see what it is that actually ends up using 
   that area)
 - send out 'dmesg' for a boot (same deal)
 - add "lspci -xxxvv" output to the deal too.

and also make them part of the bugzilla history (I'm cc'ing bugzilla here, 
and added the bug number to the subject, so hopefully this thread ends up 
being archived there too).

> There was some previous work in the PCIBIOS_MIN_IO area over two years ago
> (71db63acff69618b3d9d3114bd061938150e146b) which bumped this to 0x4000,
> but this was reverted (2ba84684e8cf6f980e4e95a2300f53a505eb794e) after
> causing new and entirely different problems on another nForce board.

The problem here is classic: these magic ranges tend to be *different* on 
different boards (because they don't tend to be fixed by hardware, they 
are programmed regions set up by firmware), so trying to change 
PCIBIOS_MIN_IO to avoid a problem on one board is almost certain to just 
introduce it on another board instead.

On *your* particular board, 0x142E is used for something, but on somebody 
else's board it might be 0x162E, and now changing PCIBIOS_MIN_IO to 0x1500 
might make that other board hang instead.
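
The point can be illustrated with a toy first-fit allocator. This is not the kernel's resource code: `allocate_io`, `clashes`, the 0x100 alignment and the 0x1000 request size are assumptions chosen for illustration, and 0x162E is the hypothetical second board from the paragraph above.

```python
def allocate_io(min_io, size, align=0x100):
    """Toy first-fit: return the first aligned (start, end) range at or above min_io."""
    start = (min_io + align - 1) // align * align
    return start, start + size - 1

def clashes(rng, port):
    """True if a hidden firmware port falls inside the allocated range."""
    return rng[0] <= port <= rng[1]

for min_io in (0x1000, 0x1500):
    rng = allocate_io(min_io, size=0x1000)
    hit = [hex(p) for p in (0x142E, 0x162E) if clashes(rng, p)]
    print(hex(min_io), "->", [hex(v) for v in rng], "clashes with", hit)
```

In this toy model a floor of 0x1000 produces a range covering both ports, while a floor of 0x1500 avoids 0x142E but still covers 0x162E: moving the floor only moves the problem to a different board.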

So you seem to have debugged this very successfully, and I'm wondering if 
you might be able to find out where that 0x142e comes from, and we could 
fix it for *all* boards using that chipset by just figuring out what the 
*hardware* rules (rather than the random firmware setup that will be 
different on different boards) for that chipset actually are!

For an example of what I mean, see the file "drivers/pci/quirks.c", and 
check out the quirks for various chipsets:

 - quirk_ali7101_acpi()

   Knows about the magic ALI ACPI and SMB IO regions

 - quirk_piix4_acpi(), quirk_ich6_lpc_acpi(), quirk_ich4_lpc_acpi()

   Same thing for the Intel chipsets

 - quirk_vt82c586_acpi(), quirk_vt82c686_acpi()

   VIA chipsets

etc etc.

It would be *wonderful* if somebody could figure out what the equivalent 
quirks for nVidia chipsets are! Because otherwise we'll just end up 
bouncing back and forth between different random IO allocations, and they 
are all almost guaranteed to cause the same problems, just on different 
boards!

It's sometimes possible to even just guess what the registers are, even if 
things are undocumented. In particular, that 142E range is almost 
certainly programmed into the host bridge or possibly a "LPC controller" 
or similar, and it will probably show up as the bytes "20 14" in the 
output from lspci, so we can guess which register it is that sets the 
base. That's not *always* how it works, but it's sometimes possible to 
guess (although you usually need to see a few different cases of the same 
chipset to have any kind of confirmation of the guess).
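
That kind of guess can be started by searching a saved `lspci -xxx` dump for the byte pair. A rough sketch, assuming the usual hex-dump layout (a device line such as `00:01.0 ...` followed by rows such as `40: 20 14 00 ...`); `find_bytes` is a hypothetical helper, and a hit is only a candidate register, not proof.

```python
import re

DEVICE = re.compile(r"^[0-9a-f]{2}:[0-9a-f]{2}\.[0-7] ")            # e.g. "00:01.0 ISA bridge: ..."
HEXROW = re.compile(r"^([0-9a-f]{2,3}): ((?:[0-9a-f]{2} ?)+)$")     # e.g. "10: 00 00 20 14 ..."

def find_bytes(lspci_text, needle):
    """Return (device, config-space offset) for every occurrence of `needle`."""
    configs = {}
    device = None
    for line in lspci_text.splitlines():
        if DEVICE.match(line):
            device = line.split(" ", 1)[0]
            configs[device] = bytearray()
        elif device and (m := HEXROW.match(line.rstrip())):
            # Rows are assumed to be contiguous and to start at offset 0.
            configs[device] += bytes.fromhex(m.group(2))
    hits = []
    for dev, cfg in configs.items():
        pos = cfg.find(needle)
        while pos != -1:
            hits.append((dev, pos))
            pos = cfg.find(needle, pos + 1)
    return hits
```

Calling `find_bytes(dump, bytes([0x20, 0x14]))` lists every place where the little-endian value 0x1420 appears; comparing the hits across several boards with the same chipset is what would turn a guess into something closer to a confirmation.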

		Linus
Comment 37 Anonymous Emailer 2007-12-23 10:01:11 UTC
Reply-To: torvalds@linux-foundation.org



On Sun, 23 Dec 2007, Linus Torvalds wrote:
> 
> For an example of what I mean, see the file "drivers/pci/quirks.c", and 
> check out the quirks for various chipsets:

Side note - we already do have some quirks for the CK804 chipset, we 
probably just don't have enough (ie we have it for some HT stuff, but 
there are probably different ranges for ACPI etc other registers). 

I'm adding Brice Goglin to the Cc, since he was the one that created those 
quirks, which implies that he probably has access to documentation on that 
thing.

Brice: see the unfolding sad story on

	http://bugzilla.kernel.org/show_bug.cgi?id=9528

which doesn't include Carlos' previous email, but does include my reply, 
so you can see what the resource allocation issue is..

			Linus
Comment 38 Arthur Erhardt 2007-12-23 10:28:13 UTC
(In reply to comment #34)
> Increasing PCIBIOS_MIN_IO in include/asm-x86/pci.h to 0x142F fixes the
> problem.
> 
[detailed explanation]
Thanks a lot, Carlos. Unfortunately, on my mainboard this patch doesn't solve 
the problem.
I tried PCIBIOS_MIN_IO == 0x1500 , 0x1800, 0x4000 in that order, rebuilt
kernel[1] and modules and tested s2ram -f -a1 (which works on 2.6.21.7) with the same result as before. System freezes, power LED stays on.

[1] 2.6.24rc4, local copy untouched except for include/asm-x86/pci.h since
comment #27.

I have access to a few other CK804 based boards (Abit KN8 SLI), but these are
in use, so I could test things requiring a reboot once or twice, but not more.
Comment 39 Ingo Molnar 2007-12-23 11:19:56 UTC

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Why:
> > 
> > After debugging _PTS() in the DSDT, it turns out these nVidia boards are
> > trying to write to an IO port > 0x1000 (0x142E) during suspend. Before the
> > re-ordering, we got away with this.
> 
> Very interesting.
> 
> HOWEVER.
> 
> I'd much rather figure out what the magic IO resource is that clashes.

Carlos, could you please run the following script as root:

  http://redhat.com/~mingo/misc/probe-ports.sh

and send us the resulting probe-ports.txt file?

This script will probe all unused ports as per /proc/ioports and will 
list "suspect" IO port areas: ones that do not produce the expected 0xff 
default reply from unclaimed IO ports. Magic chipset register areas can 
potentially be mapped this way.

[ CAREFUL: This probes IO ports which might in theory trigger various
  nastiness such as lockups. I tried this on a few boxes and the script
  worked, but save any work in case you get lockups. ]
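
The reporting step described above (flag ports that do not answer 0xff) can be sketched separately from the probing. This is not the probe-ports.sh script: `suspect_ranges` is a hypothetical helper that works on already recorded readings and never touches hardware.

```python
def suspect_ranges(readings, idle=0xFF):
    """Group consecutive ports whose recorded value differs from `idle`.

    `readings` maps port number -> byte value read earlier.
    Returns a list of (first_port, last_port) tuples.
    """
    ranges = []
    for port in sorted(readings):
        if readings[port] == idle:
            continue
        if ranges and ranges[-1][1] == port - 1:
            ranges[-1][1] = port          # extend the current run
        else:
            ranges.append([port, port])   # start a new run
    return [(first, last) for first, last in ranges]
```

An empty result means every recorded port returned the idle value.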

	Ingo
Comment 40 Anonymous Emailer 2007-12-23 11:32:48 UTC
Reply-To: torvalds@linux-foundation.org



On Sun, 23 Dec 2007, Ingo Molnar wrote:
> 
> This script will probe all unused ports as per /proc/ioports and will 
> list "suspect" IO port areas: ones that do not produce the expected 0xff 
> default reply from unclaimed IO ports. Magic chipset register areas can 
> potentially be mapped this way.

This probably won't work, if the ACPI ports aren't reserved.

The way suspend-to-ram works is that there's a magic port that the CPU 
reads from, which just basically turns the CPU off. 

Same goes for C-states, and while we should recover from that gracefully 
(ie the CPU comes back at wakeup events), if you don't do the right setup, 
that can also just hang the machine..

So this script sounds rather dangerous for this case (while it probably 
tends to work fine for the case of the ACPI ports being properly reserved: 
*most* IO devices tend to try to avoid having too many side effects on 
normal reads - the most common one tends to be to reset any pending 
interrupts from a device, but that one won't matter if we don't have a 
driver listening to that device).

			Linus
Comment 41 Yinghai Lu 2007-12-23 12:43:07 UTC
On Dec 23, 2007 9:53 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Sun, 23 Dec 2007, Carlos Corbacho wrote:
> >
> > Fix suspend-to-RAM on nForce 4 (CK804) boards by increasing
> > PCIBIOS_MIN_IO.
> >
> > Fixes kernel bugzilla #9528
> >
> > Problem:
> >
> > Linus' patch (52ade9b3b97fd3bea42842a056fe0786c28d0555) to re-order
> > suspend (and fix fall out from Rafael's earlier suspend reordering work)
> > broke suspend-to-RAM on nForce 4 (CK804) boards.
> >
> > Why:
> >
> > After debugging _PTS() in the DSDT, it turns out these nVidia boards are
> > trying to write to an IO port > 0x1000 (0x142E) during suspend. Before the
> > re-ordering, we got away with this.
>
> Very interesting.
>
> HOWEVER.
>
> I'd much rather figure out what the magic IO resource is that clashes.
>
> It's almost certainly some hidden and undocumented (or badly documented)
> ACPI IO area that the kernel doesn't know about, because it's not a
> regular PCI BAR resource, but some northbridge (or southbridge) magic
> register range.
>
> Those ranges *should* be reserved by the BIOS in the ACPI tables, but this
> would definitely not be the first time that doesn't happen.
>
> But the right fix would be for us to just figure out what the range is as
> a PCI quirk, and just know to avoid it on purpose, rather than just being
> lucky and happen to avoid it because PCIBIOS_MIN_IO just happens to be
> bigger than the particular address.
>
> So can you:
>  - show what your /proc/ioports contains (*with* the bug triggering, ie
>    non-working suspend, so we see what it is that actually ends up using
>    that area)
>  - send out 'dmesg' for a boot (same deal)
>  - add "lspci -xxxvv" output to the deal too.
>
it looks like BIOS doesn't assign io port in bus 0. ( for PMU? or some 00:01.1)
and kernel try to assign value to it according to PCIBIOS_MIN_IO.

sometime some systems could have several HT chains.
bus: [00,08] on node 0 link 1
bus: 00 index 0 io port: [5000, dfff]
bus: 00 index 1 io port: [e000, efff]
bus: 00 index 2 io port: [0, fff]
bus: 00 index 3 mmio: [de000000, dfffffff]
bus: 00 index 4 mmio: [e0000000, e7ffffff]
bus: 00 index 5 mmio: [a0000, bffff]
bus: 00 index 6 mmio: [f0000000, ffffffff]
bus: [80,86] on node 1 link 2
bus: 80 index 0 io port: [1000, 4fff]
bus: 80 index 1 io port: [f000, ffff]
bus: 80 index 2 mmio: [c0000000, ddffffff]
bus: 80 index 3 mmio: [e8000000, efffffff]

current all the buses will use ioport_resource

@@ -1158,6 +1162,8 @@ struct pci_bus * pci_create_bus(struct d
        b->resource[0] = &ioport_resource;
        b->resource[1] = &iomem_resource;

kernel could try to allocate resource from [0x1000, 0x4fff] for the
device in first HT chain...
..
I met one case: when some cards insert, i can not use mcp55 on die nic.
then i make one patch that could read K8 northbridge pci conf space to
make different peer root bus has right io/iomem resource range.
pci_assign_resource could get right value for the devices that is not
assigned io value by BIOS.
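
To see which HT chain a given address would be routed to, the listing above can be checked mechanically. A small sketch, assuming the `bus: NN index N io port: [lo, hi]` line format shown in this comment; `owners` is a hypothetical helper, and the listing comes from the multi-chain example system, not from the boards in this bug.

```python
import re

WINDOW = re.compile(
    r"^bus: ([0-9a-f]{2}) index \d+ (io port|mmio): \[([0-9a-f]+), ([0-9a-f]+)\]$")

def owners(listing, kind, addr):
    """Return the root buses whose `kind` window ('io port' or 'mmio') contains addr."""
    found = []
    for line in listing.splitlines():
        m = WINDOW.match(line.strip())
        if m and m.group(2) == kind and int(m.group(3), 16) <= addr <= int(m.group(4), 16):
            found.append(m.group(1))
    return found

listing = """\
bus: 00 index 0 io port: [5000, dfff]
bus: 00 index 1 io port: [e000, efff]
bus: 00 index 2 io port: [0, fff]
bus: 80 index 0 io port: [1000, 4fff]
bus: 80 index 1 io port: [f000, ffff]
"""
print(owners(listing, "io port", 0x1000))
```

On that example system an IO address just above a PCIBIOS_MIN_IO of 0x1000 belongs to bus 80, so handing it to a device on bus 00 would place it in the wrong chain's window.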

YH
Comment 42 Carlos Corbacho 2007-12-23 15:37:53 UTC
Created attachment 14161 [details]
/proc/ioports from Abit KN9

Well, first things first; I'm withdrawing the patch because I didn't check I'd removed the custom DSDT which disabled the offending IO port write first...

So, the patch doesn't work and can be safely ignored (feel free to jump off the CC list if this is now no longer relevant to you).

However, we are still stuck with the problem that Linus' suspend commit in question is the cause here of a wedged suspend (as confirmed by Arthur, the original bug submitter).

So, now that I've wasted everyone's time by CC'ing you all for a patch that doesn't work - what do we need to do to try and hunt down the real cause of why the 0x142E port write wedges suspend?

Attached is the output of /proc/ioports

As already noted here though, my first assumption was that PNPACPI was the culprit here - port 0x142E is in a range reserved by PNPACPI - but booting with pnpacpi=off has no effect on suspend (the range is no longer reserved, but suspend still wedges).
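
Checking which entries of /proc/ioports cover a given port can be done with a few lines. A minimal sketch, assuming the usual `start-end : owner` layout with two spaces of indentation per nesting level; `claims` is a hypothetical helper name.

```python
import re

ENTRY = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")

def claims(ioports_text, port):
    """Return (depth, start, end, owner) for every entry containing `port`, outermost first."""
    hits = []
    for line in ioports_text.splitlines():
        m = ENTRY.match(line)
        if not m:
            continue
        start, end = int(m.group(2), 16), int(m.group(3), 16)
        if start <= port <= end:
            hits.append((len(m.group(1)) // 2, start, end, m.group(4)))
    return hits
```

Running `claims(open("/proc/ioports").read(), 0x142E)` once on a normal boot and once with pnpacpi=off would show the reservation disappearing, as described above.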
Comment 43 Carlos Corbacho 2007-12-23 15:46:21 UTC
Created attachment 14162 [details]
dmesg (with debugging) from Abit KN9

Output of dmesg after boot from Abit KN9 (warning - lots of debugging options enabled so there's a lot to trawl through).
Comment 44 Carlos Corbacho 2007-12-23 15:47:46 UTC
Created attachment 14163 [details]
lspci -xxxvv from Abit KN9

lspci -xxxvv from Abit KN9.
Comment 45 Carlos Corbacho 2007-12-23 15:51:02 UTC
Ingo -> The resulting probe-ports.txt file from running your script is empty.
Comment 46 Brice Goglin 2007-12-24 01:05:19 UTC
> I'm adding Brice Goglin to the Cc, since he was the one that created those
> quirks, which implies that he probably has access to documentation on that
> thing.

Sorry guys, I can't help much here, I don't have any doc about this (the MSI config is the only details I had about this chipset).
Comment 47 Rafael J. Wysocki 2007-12-24 05:34:01 UTC
Carlos, please apply the series of patches from http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24-rc6/patches/ on top of
2.6.24-rc6, compile the kernel with CONFIG_PM_VERBOSE set, run:

# echo devices > /sys/power/pm_test
# echo mem > /sys/power/state

and attach the dmesg output taken right after doing that (please make sure you have a big log buffer ;-)).
Comment 48 Carlos Corbacho 2007-12-24 06:48:40 UTC
Created attachment 14169 [details]
dmesg with device suspend debugging patch

Here's the dmesg output after applying your patches and running your given suspend commands.
Comment 49 Robert Hancock 2007-12-24 08:59:41 UTC
Linus Torvalds wrote:
> 
> On Sun, 23 Dec 2007, Carlos Corbacho wrote:
>> Fix suspend-to-RAM on nForce 4 (CK804) boards by increasing
>> PCIBIOS_MIN_IO.
>>
>> Fixes kernel bugzilla #9528
>>
>> Problem:
>>
>> Linus' patch (52ade9b3b97fd3bea42842a056fe0786c28d0555) to re-order
>> suspend (and fix fall out from Rafael's earlier suspend reordering work)
>> broke suspend-to-RAM on nForce 4 (CK804) boards.
>>
>> Why:
>>
>> After debugging _PTS() in the DSDT, it turns out these nVidia boards are
>> trying to write to an IO port > 0x1000 (0x142E) during suspend. Before the
>> re-ordering, we got away with this.
> 
> Very interesting.
> 
> HOWEVER.
> 
> I'd much rather figure out what the magic IO resource is that clashes. 
> 
> It's almost certainly some hidden and undocumented (or badly documented) 
> ACPI IO area that the kernel doesn't know about, because it's not a 
> regular PCI BAR resource, but some northbridge (or southbridge) magic 
> register range.
> 
> Those ranges *should* be reserved by the BIOS in the ACPI tables, but this 
> would definitely not be the first time that doesn't happen.

I'm having trouble sorting out which report is for which BIOS (and some 
of them don't have any dmesg posted), but I believe in these cases that 
memory region is indeed reported as reserved by the BIOS, and no PCI 
resources should end up allocated there. So I'm not sure why fiddling 
with PCIBIOS_MIN_IO would have any effect (other than by accident).

I wonder if this is the culprit (from Arthur Erhardt's dmesg):

pnpacpi: exceeded the max number of mem resources: 12
pnpacpi: exceeded the max number of mem resources: 12

which means we're ignoring some of the memory reservations. I wonder if 
some IO reservations are also being ignored?

Why do we have this silly hard limit of number of resources anyway? If 
we just ignore random reservations provided by the BIOS, we shouldn't be 
surprised if things break randomly. This warning at the very least 
should be much louder (i.e. "Warning: This problem may break your system")..
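
Until the warning is louder, a saved dmesg can at least be scanned for it. A small sketch, assuming the message wording quoted above; only the "mem" variant appears in this report, so matching other resource kinds is an assumption, and `dropped_resource_warnings` is a hypothetical helper name.

```python
import re
from collections import Counter

WARNING = re.compile(
    r"pnpacpi: exceeded the max number of (\w+) resources: (\d+)", re.IGNORECASE)

def dropped_resource_warnings(dmesg_text):
    """Count the pnpacpi 'exceeded the max number' warnings by (kind, limit)."""
    return Counter((m.group(1).lower(), int(m.group(2)))
                   for m in WARNING.finditer(dmesg_text))
```

For the two lines quoted above this yields a count of 2 for `("mem", 12)`.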

> 
> But the right fix would be for us to just figure out what the range is as 
> a PCI quirk, and just know to avoid it on purpose, rather than just being 
> lucky and happen to avoid it because PCIBIOS_MIN_IO just happens to be 
> bigger than the particular address.
> 
> So can you:
>  - show what your /proc/ioports contains (*with* the bug triggering, ie 
>    non-working suspend, so we see what it is that actually ends up using 
>    that area)
>  - send out 'dmesg' for a boot (same deal)
>  - add "lspci -xxxvv" output to the deal too.
> 
> and also make them part of the bugzilla history (I'm cc'ing bugzilla here, 
> and added the bug number to the subject, so hopefully this thread ends up 
> being archived there too).
> 
>> There was some previous work in the PCIBIOS_MIN_IO area over two years ago
>> (71db63acff69618b3d9d3114bd061938150e146b) which bumped this to 0x4000,
>> but this was reverted (2ba84684e8cf6f980e4e95a2300f53a505eb794e) after
>> causing new and entirely different problems on another nForce board.
> 
> The problem here is classic: these magic ranges tend to be *different* on 
> different boards (because they don't tend to be fixed by hardware, they 
> are programmed regions set up by firmware), so trying to change 
> PCIBIOS_MIN_IO to avoid a problem on one board is almost certain to just 
> introduce it on another board instead.
> 
> On *your* particular board, 0x142E is used for something, but on somebody 
> else's board it might be 0x162E, and now changing PCIBIOS_MIN_IO to 0x1500 
> might make that other board hang instead.
> 
> So you seem to have debugged this very successfully, and I'm wondering if 
> you might be able to find out where that 0x142e comes from, and we could 
> fix it for *all* boards using that chipset by just figuring out what the 
> *hardware* rules (rather than the random firmware setup that will be 
> different on different boards) for that chipset actually are!

I suspect it's board specific. Looking at the DSDT for my A8N-SLI 
Deluxe, that SMIP region is defined at 0x442E (and is reported as 
reserved). This BIOS doesn't write there in the _PTS method like the 
ones in the report apparently do though.
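
For anyone wanting to repeat this check on their own board, here is a minimal sketch (Python, not kernel code) of looking up which /proc/ioports entries cover a given port such as 0x442E or 0x142E. The function name and the sample listing are made up for illustration; the sample is not taken from any of the attachments in this bug.

```python
import re

# One /proc/ioports line: optional indent, "start-end : owner" (hex addresses).
LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def owners_of_port(ioports_text, port):
    """Return the owners of every listed range that covers `port`.

    Parents are listed before their children in /proc/ioports, so the
    result runs from the outermost region to the innermost one.
    """
    owners = []
    for line in ioports_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        if start <= port <= end:
            owners.append(match.group(3).strip())
    return owners


# Hypothetical listing, for illustration only.
SAMPLE = """\
0000-001f : dma1
1000-107f : 0000:00:01.1
4000-40ff : PCI Bus 0000:01
4400-447f : pnp 00:01
  4400-4403 : ACPI PM1a_EVT_BLK
"""

print(owners_of_port(SAMPLE, 0x442E))
print(owners_of_port(SAMPLE, 0x142E))
```

An empty result means nothing in the listing claims that port. On a real system the text would be read from /proc/ioports, which on recent kernels may need root to show real addresses.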
Comment 50 Carlos Corbacho 2007-12-24 09:33:34 UTC
Robert ->

Sorry for wasting your time. The PCIBIOS_MIN_IO has no effect, as I've reported earlier (I left a custom DSDT in when experimenting with it).

Arthur, the original bug reporter, also has an Asus A8N-SLI, and in his DSDT there is a call in _PTS() to OSTP(), which in turn does write to 0x442E. My bad there as well - you're right, we do have slightly different port writes. The error still happens here, though, even though on the Abit board at least (in spite of the pnpacpi errors) this region is being correctly reserved, and both ports are in PNPACPI reserved regions on their respective boards. Unless Asus decided to remove this write from _PTS()/OSTP() in a later BIOS revision[1], I would expect it to be there on your board.

[1] This write is quite likely pointless - it appears to set a value based on the detected OS. However, ACPI already writes the correct value to the port on startup, nothing else appears to change the value, and AFAICT, it isn't re-read by the resume methods. So OSTP() being dropped (at least from _PTS()) wouldn't really surprise me.
Comment 51 Carlos Corbacho 2007-12-24 10:20:14 UTC
After a bit of additional hacking (basically, I've added a call to _PTS() in device_suspend() after every suspend_device() to see what breaks it), the culprit here is... ohci_hcd, apparently (unloading it really does let my box suspend - and this is regardless of whether any devices are plugged in or not). Before we start CCing anyone here, this could do with some more confirmation.

Arthur -> Could you test unloading ohci_hcd on your box and see if it will now suspend?
Comment 52 Carlos Corbacho 2007-12-24 16:22:55 UTC
Rafael, Len:

To summarise my discussions so far with Robert Hancock on LKML, and after re-reading all the ACPI spec versions (I always keep 1.0B handy for my 'favourite' Acer laptop anyway), the real problem is this:

Pre ACPI 3.0:

_PTS should be executed _before_ putting devices into a low power state.

Before the PM code rearranging of 2.6.22, this was happening (although by accident, with other side effects, since we weren't properly freezing devices).

ACPI 3.0:

_PTS() should be executed _after_ putting devices into a low power state.

So, 2.6.22 and onwards, we now have a Linux kernel that obeys the ACPI 3.0 spec for suspend to RAM - which at the moment is a bit of a problem, since there are more pre than post ACPI 3.0 systems out there.

So, any pre ACPI 3.0 systems that suspend successfully are doing so by pure luck (and mostly down to their BIOS writers not trying to do things like nVidia, and in turn Abit and Asus, who are calling an SMI trap after Linux has helpfully put all the devices into a low power state...).
Comment 53 Rafael J. Wysocki 2007-12-25 04:55:47 UTC
(In reply to comment #52)
> Rafael, Len:
> 
> To summarise my discussions so far with Robert Hancock on LKML, and
> re-reading
> all the ACPI spec versions (I always keep 1.0B handy for my 'favourite' Acer
> laptop anyway), the real problem is this:
> 
> Pre ACPI 3.0:
> 
> _PTS should be executed _before_ putting devices into a low power state.

Sorry, but you're wrong.  For example, section 9.1.6 of the 2.0c specification says explicitly that:

3. OSPM places all device drivers into their respective Dx state. If the
   device is enabled for wake, it enters the Dx state associated with the wake
   capability. If the device is not enabled to wake the system, it enters the
   D3 state.
4. OSPM executes the _PTS control method, passing an argument that indicates
   the desired sleeping state (1, 2, 3, or 4 representing S1, S2, S3, and S4).
Comment 54 Carlos Corbacho 2007-12-25 05:07:35 UTC
This is that same section from ACPI 1.0B:

3. The OS executes the Prepare To Sleep (_PTS) control method, passing an argument that indicates the desired sleeping state (1, 2, 3, or 4 representing S1, S2, S3, and S4).

4. The OS places all device drivers into their respective Dx state. If the device is enabled for wakeup, it enters the Dx state associated with the wakeup capability. If the device is not enabled to wakeup the system, it enters the D3 state.

Which very clearly states _PTS() is executed before we suspend the devices.

So, I am correct for the 1.0 case (which is what this BIOS claims to be conformant to).

For the 2.0 case, the spec just contradicts itself - section 7.3.2 in 2.0 states _PTS() should be run before device suspend, but as you say, 9.1.6 says the opposite.

So for 1.0 we are still doing the wrong thing (and both the DSDTs here claim to be conformant to ACPI 1.0), and for 2.0, things are unclear (which I presume 3.0 tried to clear up).
Comment 55 Rafael J. Wysocki 2007-12-25 05:43:51 UTC
(In reply to comment #54)
> This is that same section from ACPI 1.0B:
> 
> 3. The OS executes the Prepare To Sleep (_PTS) control method, passing an
> argument that indicates the desired sleeping state (1, 2, 3, or 4
> representing
> S1, S2, S3, and S4).
> 
> 4. The OS places all device drivers into their respective Dx state. If the
> device is enabled for wakeup, it enters the Dx state associated with the
> wakeup capability. If the device is not enabled to wakeup the system, it
> enters the D3 state.
> 
> Which very clearly states _PTS() is executed before we suspend the devices.
> 
> So, I am correct for the 1.0 case (which is what this BIOS claims to be
> conformant to).
> 
> For the 2.0 case, the spec just contradicts itself - section 7.3.2 in 2.0
> states _PTS() should be run before device suspend, but as you say, 9.1.6 says
> the opposite.

Well, that's probably because section 7.3.2 was based on the 1.0 specification.

> So for 1.0 we are still doing the wrong thing (and both the DSDTs here
> claim to be conformant to ACPI 1.0), and for 2.0, things are unclear (which
> I presume 3.0 tried to clear up).

Yes, I think that the 3.0 specification tried to clear things up.

Well, the solution might be to execute _PTS in acpi_pm_set_target() and not to execute it in acpi_pm_prepare() for ACPI 1.0x-compliant systems.  [We'll also need an additional global callback for executing _WAK after resuming devices on ACPI 1.0x systems in that case.]
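
To make the ordering under discussion concrete, here is a toy model in Python. It is an illustration only, not the kernel implementation; the function name and step labels are invented. It assumes the ACPI 1.0x order is used only when the firmware claims 1.0x conformance, and that everything else keeps the order Linux has used since 2.6.22.

```python
def suspend_steps(acpi_major):
    """Return the order of the two suspend steps for an ACPI major version."""
    if acpi_major < 1:
        raise ValueError("ACPI major version must be 1 or greater")
    if acpi_major == 1:
        # ACPI 1.0B: _PTS runs first, then drivers move devices to Dx.
        return ["_PTS", "devices to Dx"]
    # ACPI 2.0c section 9.1.6 and ACPI 3.0: devices to Dx first, then _PTS.
    return ["devices to Dx", "_PTS"]


for major in (1, 2, 3):
    print(major, suspend_steps(major))
```

The model deliberately ignores the resume side (_WAK), which would need its own ordering rule on 1.0x systems.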
Comment 56 Rafael J. Wysocki 2007-12-27 09:08:19 UTC
I have uploaded a new patchset, containing patches that should fix this problem (patches 28-34), into http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24-rc6/patches/ .

Can you test it, please?
Comment 57 Carlos Corbacho 2007-12-27 09:52:00 UTC
Tested here and working fine. Feel free to tack this onto any relevant patches if you want someone to blame/ experiment on later:

Tested-by: Carlos Corbacho <carlos@strangeworlds.co.uk>
Comment 58 Arthur Erhardt 2007-12-27 13:50:38 UTC
(In reply to comment #56)
> I have uploaded a new patchset, containing patches that should fix this
> problem
> (patches 28-34), into
> http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24-rc6/patches/ .
> 
> Can you test it, please?
Tested with following results:
runlevel 3 (console):
s2ram -f -a 1 works, but forgets to restore console, i.e. after resume I 
need to type commands while monitor is in power save mode or log in via 
network.

s2disk works, including resume.

runlevel 5:
  with Xorg and nv driver:
  s2disk works
  s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor stays   
  in powersave mode.

 
  with Xorg and nvidia (14.19 and 169.07) binary drivers:
  s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor stays   
  in powersave mode.

  Forgot to test s2disk. Production system (2.6.21.7) runs with Nvidia 14.19
  binary driver.
Comment 59 Rafael J. Wysocki 2007-12-27 14:17:55 UTC
(In reply to comment #58)
> (In reply to comment #56)
> > I have uploaded a new patchset, containing patches that should fix this
> problem
> > (patches 28-34), into
> > http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24-rc6/patches/ .
> > 
> > Can you test it, please?
> Tested with following results:

Thanks for testing.

> runlevel 3 (console):
> s2ram -f -a 1 works, but forgets to restore console, i.e. after resume I 
> need to type commands while monitor is in power save mode or log in via 
> network.

What's your kernel command line in this configuration?

> s2disk works, including resume.

Fine.

> runlevel 5:
>   with Xorg and nv driver:
>   s2disk works

OK

>   s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor
>   stays in powersave mode.

Well, it seems we're not handling the graphics appropriately.  You may try to pass some other options to s2ram.

>   with Xorg and nvidia (14.19 and 169.07) binary drivers:
>   s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor
>   stays in powersave mode.

Same as above, the NVidia driver doesn't seem to help.

>   Forgot to test s2disk. Production system (2.6.21.7) runs with Nvidia 14.19
>   binary driver.

Please try to compile the kernel with CONFIG_PM_DEBUG set and do:

# echo core > /sys/power/pm_test
# echo mem > /sys/power/state

(that will run through the entire suspend sequence without entering the system sleep state - busy waiting for 5 sec. instead - and through the resume sequence).  If that works, the problem is most probably related to the graphics not being initialized appropriately after the suspend.
Comment 60 Arthur Erhardt 2007-12-28 03:41:09 UTC
Created attachment 14215 [details]
dmesg output on 2.6.24rc6 w/ PM patches
Comment 61 Arthur Erhardt 2007-12-28 03:43:29 UTC
(In reply to comment #59)
> (In reply to comment #58)
> > (In reply to comment #56)
 
> Thanks for testing.
> 
> > runlevel 3 (console):
> > s2ram -f -a 1 works, but forgets to restore console, i.e. after resume I 
> > need to type commands while monitor is in power save mode or log in via 
> > network.
> 
> What's your kernel command line in this configuration?
from grub/menu.lst:
kernel /boot/vmlinuz-2.6.24-rc6-default \
 root=/dev/disk/by-id/scsi-SATA_ST...-part1 resume=/dev/sdb2 showopts

[...]

> >   s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor
> >   stays in powersave mode.
> 
> Well, it seems we're not handling the graphics appropriately.  You may try to
> pass some other options to s2ram.
Ok, I'll try the complete set of suggestions in the s2ram manual, just can't 
do that right now.

> >   with Xorg and nvidia (14.19 and 169.07) binary drivers:
> >   s2ram suspends fine, but after resume Xorg runs at 100% CPU and monitor
> >   stays in powersave mode.
> 
> Same as above, the NVidia driver doesn't seem to help.
One more addition to that: in all cases, killing Xorg and restarting it 
after resume doesn't solve the problem, the new Xorg process also runs at
100% CPU w/o displaying anything.

[...]
> Please try to compile the kernel with CONFIG_PM_DEBUG set and do:
> 
> # echo core > /sys/power/pm_test
> # echo mem > /sys/power/state
Worked fine, even without breaking the console. dmesg output is attached.
Comment 62 Adolfo R. Brandes 2008-01-06 12:23:11 UTC
I have tested 2.6.24-rc6 patched with the series described in comment 56 (the whole series, not just 28-34) on an Asus A8N-SLI Premium BIOS v1303, and now suspend to ram works flawlessly, even with the latest binary Nvidia drivers (169.07).  The only remaining glitch is not directly related to this bug; it deals with the forcedeth net driver, as described here: http://lkml.org/lkml/2008/1/2/253

All this is on an otherwise base install of Ubuntu Gutsy 7.10. I'm using Ubuntu's default suspend scripts, not s2ram.  I did try uswsusp-0.7 in single user mode and init=/bin/bash, but s2ram would cause the symptoms described in comment 58 by Arthur.
Comment 63 Rafael J. Wysocki 2008-01-06 12:29:54 UTC
(In reply to comment #62)
> I have tested 2.6.24-rc6 patched with the series described in comment 56 (the
> whole series, not just 28-34) on an Asus A8N-SLI Premium BIOS v1303, and now
> suspend to ram works flawlessly, even with the latest binary Nvidia drivers
> (169.07).

Great, thanks for testing.

> The only remaining glitch is not directly related to this bug; it
> deals with the forcedeth net driver, as described here:
> http://lkml.org/lkml/2008/1/2/253
> 
> All this is on an otherwise base install of Ubuntu Gutsy 7.10. I'm using
> Ubuntu's default suspend scripts, not s2ram.  I did try uswsusp-0.7 in single
> user mode and init=/bin/bash, but s2ram would cause the symptoms described in
> comment 58 by Arthur.

Hmm.  Do you run s2ram as "s2ram -f" or "s2ram" alone is enough?
Comment 64 Len Brown 2008-01-08 10:12:03 UTC
Re: the hang regression...

Without the patchset from comment #56 that reverts Linux
to ACPI 1.0 suspend order...
If USB is unloaded before suspend is invoked, then suspend works, yes?
So it is possible that the Linux/Windows incompatibility here
isn't the _PTS vs. D3 suspend order, it may be that Linux
is suspending USB differently than Windows.

Please provide the output from cat /proc/acpi/wakeup
As the strings in this file are arbitrary, please echo
each device name into the file to get as many of the
devices enabled for wakeup as you can.  Then see
if that change makes the hang go away.
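
A rough sketch of the first step (finding which entries are still disabled), in Python. It assumes the column layout of the USB0 and USB2 entries reported for the Abit KN9 in this bug (name, S-state, status, sysfs node); the header line and the PWRB entry in the sample are hypothetical, and the function name is made up. As far as I know, writing a name to the file toggles its state rather than setting it, so only names that are currently disabled should be echoed.

```python
def disabled_wakeup_devices(wakeup_text):
    """Return the names of entries whose status column reads 'disabled'."""
    names = []
    for line in wakeup_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        # Some kernels prefix the status with '*'; strip it before comparing.
        status = fields[2].lstrip("*")
        if status == "disabled":
            names.append(fields[0])
    return names


# USB0/USB2 lines as reported in this bug; header and PWRB are hypothetical.
SAMPLE = """\
Device    S-state   Status    Sysfs node
PWRB      S4        enabled
USB0      S3        disabled  pci:0000:00:02.0
USB2      S3        disabled  pci:0000:00:02.1
"""

for name in disabled_wakeup_devices(SAMPLE):
    print("echo %s > /proc/acpi/wakeup" % name)
```

The script only prints the commands; running them (as root) is left to the tester.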

The other thing I wonder about is the mechanism for
the actual hang.  It seems to occur when we are
writing a number reflecting the OS version to an
IO address.  It is likely that this write actually
triggers an SMI and the root failure here is that
the SMM code in the BIOS crashed.  Please see if
you can have any effect on the hang by changing
what number is written here.  While you could
edit the DSDT, it would be better to use bootparams:

The ABIT and ASUS have the same AML code for asking
what OS is running.
You can make them think they are running Windows 2000 this way:

acpi_osi="!Windows 2001"

You can make them think they are running Windows NT this way:

acpi_osi=
Comment 65 Len Brown 2008-01-08 10:15:24 UTC
also...
The reason that I don't think reverting Linux to ACPI 1.0
suspend order is the ultimate solution is because we'll
still be exposed to a hang if the USB were suspended
to D3 before the system suspend was invoked.
Comment 66 Carlos Corbacho 2008-01-08 10:30:22 UTC
Created attachment 14365 [details]
/proc/acpi/wakeup from Abit KN9

Len - Please read the earlier comments from Robert Hancock on this - yes, this is very likely an SMI trap being triggered.

I have also already tried booting with various values of acpi_osi, and they make no difference.

Regardless, the suspend order on Linux is still wrong for ACPI 1.0, even in the unlikely case that EHCI USB may be to blame here. And why would USB be put into D3 before _PTS is executed, if the spec states that the transition to Dx should happen after _PTS?
Comment 67 Carlos Corbacho 2008-01-08 10:40:43 UTC
Unless you're referring to autosuspend putting the USB device into D3?
Comment 68 Carlos Corbacho 2008-01-08 12:19:46 UTC
Although it is interesting to note that Vista has not enabled USB selective suspend for the two USB hubs on this board (whereas, on all my other machines, although running XP, USB selective suspend is enabled on all of them...)

Either this is strange Vista behaviour (but apparently Vista is supposed to always enable USB selective suspend), or there is a known nVidia problem with entering Dx before calling _PTS() and the SMI trap.

So it's possible we may need to blacklist CK804 (nForce 4) from USB autosuspend, as well as the _PTS() fixes, to ensure OHCI never drops into D3 before calling _PTS(), but I'd like some confirmation from other nForce 4 boxes running Windows before I put this forward as a serious suggestion.
Comment 69 Len Brown 2008-01-08 12:39:27 UTC
Carlos, please
# echo USB0 > /proc/acpi/wakeup
# echo USB2 > /proc/acpi/wakeup
see if those lines in the wakeup file change to "enabled" from "disabled"
and see if that has any effect on the hang.

The AML for both of these devices tells us that the OS
should limit how deep a D-state is used for them during system S3.
Though, technically, this should be independent of wakeup capability,
we'll have to check if Linux is that smart...

Yes, the system can suspend USB independent of an S3 request.
What the ASL below shows is that in the case of USB2,
if it is in D3 at the S3 request, Linux should move it to D1
before entering S3.

Also, does the same driver bind to both of these USB devices?
USB0      S3     disabled  pci:0000:00:02.0
USB2      S3     disabled  pci:0000:00:02.1


            Device (USB0)
            {
                Name (_ADR, 0x00020000)
                Method (_S1D, 0, NotSerialized)
                {
                    Return (0x01)
                }

                Method (_S3D, 0, NotSerialized)
                {
                    If (LEqual (OSFL, 0x02))
                    {
                        Return (0x02)
# must not go below D2 in S3
# I think this is Windows ME only case
                    }
                    Else
                    {
                        Return (0x03)
# okay to go to D3 in S3
# I think this non WinME case
                    }
                }
# no _PSW in USB0, so it should be enabled as a wake device always

                Name (_PRW, Package (0x02)
                {
                    0x0D,
                    0x03
# wakeup supported in S3
                })
            }

            Device (USB2)
            {
                Name (_ADR, 0x00020001)
                OperationRegion (P020, PCI_Config, 0x49, 0x01)
                Field (P020, AnyAcc, NoLock, Preserve)
                {
                    U0WK,   1
                }

                Method (_PSW, 1, NotSerialized)
                {
# _PSW should let Linux enable or disable wake capability of USB2
                    If (Arg0)
                    {
                        Store (0x01, U0WK)
                    }
                    Else
                    {
                        Store (0x00, U0WK)
                    }
                }

                Method (_S1D, 0, NotSerialized)
                {
                    Return (0x01)
                }

                Method (_S3D, 0, NotSerialized)
                {
                    Return (0x01)
# USB2 should never be put below D1 when system enters S3
# (this could be qualified by an _S3W, but this device has none,
# so no matter if USB2 is a wake device or not, Linux should
# never put it deeper than D1 when entering S3)
                }

                Name (_PRW, Package (0x02)
                {
                    0x05,
                    0x03
# GPE 5 wakes the system from this device as deep as S3
                })
            }

re: OSFL...

   Name (OSFL, 0x01)

# default is 1
                If (STRC (\_OS, "Microsoft Windows"))
                {
                    Store (0x56, SMIP)
# windows 98 comes here, so it gets OSFL=1
                }
                Else
                {
                    If (STRC (\_OS, "Microsoft Windows NT"))
                    {
#everything NT and later comes here
                        If (CondRefOf (\_OSI, Local0))
                        {
                            If (\_OSI ("Windows 2001"))
                            {
                                Store (0x59, SMIP)
                                Store (0x00, OSFL)
# WinXP and later: OSFL=0
                                Store (0x03, OSFX)
                            }
                        }
                        Else
                        {
                            Store (0x58, SMIP)
                            Store (0x00, OSFX)
                            Store (0x00, OSFL)
# Windows NT: OSFL=0
                        }
                    }
                    Else
                    {
                        Store (0x57, SMIP)
                        Store (0x02, OSFX)
# Not NT. not Win98, I guess Windows ME? comes here: OSFL=2 
                        Store (0x02, OSFL)
                    }
                }
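
Read as straight-line logic, the OSFL snippet above can be modelled like this in Python (a sketch only). It assumes STRC is a plain string-equality helper and that a name which is never stored to keeps whatever value it had before, shown here as None for SMIP and OSFX; the function names are made up. Under this reading there is one path that stores nothing at all: an OS that provides _OSI but does not answer to "Windows 2001" leaves OSFL at its default of 1.

```python
def detect_os(os_name, has_osi, osi_strings):
    """Return the values the OSFL snippet would leave behind.

    None means the snippet never stores to that name on this path.
    """
    state = {"SMIP": None, "OSFL": 0x01, "OSFX": None}  # Name (OSFL, 0x01)
    if os_name == "Microsoft Windows":
        state["SMIP"] = 0x56
    elif os_name == "Microsoft Windows NT":
        if has_osi:  # CondRefOf (\_OSI, Local0)
            if "Windows 2001" in osi_strings:
                state.update(SMIP=0x59, OSFL=0x00, OSFX=0x03)
        else:
            state.update(SMIP=0x58, OSFL=0x00, OSFX=0x00)
    else:
        state.update(SMIP=0x57, OSFL=0x02, OSFX=0x02)
    return state


def usb0_s3d(osfl):
    """USB0's _S3D from the ASL above: 2 when OSFL is 2, otherwise 3."""
    return 0x02 if osfl == 0x02 else 0x03


xp_like = detect_os("Microsoft Windows NT", True, {"Windows 2001"})
print(xp_like, usb0_s3d(xp_like["OSFL"]))
```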
Comment 70 Rafael J. Wysocki 2008-01-08 13:04:30 UTC
*** Bug 9313 has been marked as a duplicate of this bug. ***
Comment 71 Rafael J. Wysocki 2008-01-08 14:23:56 UTC
(In reply to comment #69)
> Carlos, please
> # echo USB0 > /proc/acpi/wakeup
> # echo USB2 > /proc/acpi/wakeup
> see if those lines in the wakeup file change to "enabled" from "disabled"
> and see if that has any effect on the hang.
> 
> The AML for both of these devices tells us that the OS
> should limit how deep a D-state is used for them during system S3.
> Though, technically, this should be independent of wakeup capability,
> we'll have to check if Linux is that smart...
> 
> Yes, the system can suspend USB independent of an S3 request.
> What the ASL below shows is that in the case of USB2,
> if it is in D3 at the S3 request, Linux should move it to D1
> before entering S3.
> 
> Also, does the same driver bind to both of these USB devices?
> USB0      S3     disabled  pci:0000:00:02.0
> USB2      S3     disabled  pci:0000:00:02.1
> 
> 
>             Device (USB0)
>             {
>                 Name (_ADR, 0x00020000)
>                 Method (_S1D, 0, NotSerialized)
>                 {
>                     Return (0x01)
>                 }
> 
>                 Method (_S3D, 0, NotSerialized)
>                 {
>                     If (LEqual (OSFL, 0x02))
>                     {
>                         Return (0x02)
> # must not go below D2 in S3

Er, are you sure?  ACPI 2.0c says:

"7.2.14 _S3D (S3 Device State)
This object evaluates to an integer that conveys to OSPM the highest power (lowest number) D-state supported by this device in the S3 system sleeping state."

So that would be: must not go above D2 in S3.
Comment 72 Rafael J. Wysocki 2008-01-08 14:39:52 UTC
ACPI 1.0b doesn't make this so clear, but the comment in section 7.2.5 (_S0D):

"This particular method is redundant since the device must support D0 while in the S0 state."

suggests the meaning is the same here.

ACPI 3.0b, in turn, contains examples that make this crystal clear (Table 7-8): if _S3D returns 2, we're supposed to put the device into D2 or D3, _not_ D1.
Comment 73 Carlos Corbacho 2008-01-08 15:55:04 UTC
Len,

(In reply to comment #69)
> Carlos, please
> # echo USB0 > /proc/acpi/wakeup
> # echo USB2 > /proc/acpi/wakeup
> see if those lines in the wakeup file change to "enabled" from "disabled"
> and see if that has any effect on the hang.

No effect.

> Also, does the same driver bind to both of these USB devices?

No.

> USB0      S3     disabled  pci:0000:00:02.0

ohci-hcd (the offending device)

> USB2      S3     disabled  pci:0000:00:02.1

ehci-hcd (not the problem)

So unfortunately, your theory about USB2 being put into D3 when it shouldn't falls apart here - perhaps USB2 shouldn't be put into D3, but this doesn't break the system, since the system suspends and resumes just fine with ehci-hcd loaded.

The offending device is USB0, which, as the DSDT points out, can, on XP or above, be put into D3.
Comment 74 Rafael J. Wysocki 2008-01-08 16:10:35 UTC
(In reply to comment #73)
> Len,
> 
> (In reply to comment #69)
> 
> The offending device is USB0, which, as the DSDT points out, can, on XP or
> above, be put into D3.

IMO the DSDT says USB0 must go into D3 when going to S3, or am I getting it all wrong?
Comment 75 Carlos Corbacho 2008-01-08 16:31:21 UTC
The ACPI 1.0b spec (page 200) says that _SxD specifies the highest D-state you can go into for any given device when switching to state x, and that the OS is free to choose a lower state than the highest specified, but must not exceed it.

So in theory, yes, we could pick a lower state than D3, but you cannot go higher than that.
Comment 76 Rafael J. Wysocki 2008-01-08 16:53:05 UTC
(In reply to comment #75)
> The ACPI 1.0b spec (page 200) says that _SxD specifies the highest D-state
> you
> can go into for any given device when switching to state x, and that the OS
> is
> free to choose a lower state than the highest specified, but must not exceed
> it.
> 
> So in theory, yes, we could pick a lower state than D3, but you cannot go
> higher than that.

That depends on what "highest" means, though, highest number or highest power (ie. lowest number).  The citation in Comment #71 says it's the latter, and I think ACPI 1.0 also means that.

So, it would mean that for _S3D returning 3 we're supposed to put the device into D3, for _S3D returning 2 we're supposed to put it into D2 or D3 and so on.
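
Spelled out as a small Python sketch (illustrative only; the function name is invented, and wake-up constraints such as _PRW and _S3W are ignored here):

```python
def d_states_allowed_by_s3d(s3d):
    """D-states allowed in S3 when _S3D is the shallowest permitted state.

    "Highest power" means lowest D number, so the device may sit in the
    returned state or in any deeper (higher-numbered) one.
    """
    if not 0 <= s3d <= 3:
        raise ValueError("_S3D must be between 0 and 3")
    return list(range(s3d, 4))


for value in (3, 2, 1):
    print("_S3D=%d -> D%s" % (value, d_states_allowed_by_s3d(value)))
```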

ACPI 3.0 clarifies it really nicely and I don't think they would've changed something like this between versions ...
Comment 77 Carlos Corbacho 2008-01-08 17:08:25 UTC
That's how I read it as well, so I agree (I think any confusion boils down to lower power state = higher D-state value) - this really hasn't changed at all.

Basically, the issue now as I see it is:

1) We must call _PTS() before we start putting the devices into their low power states (so the suspend re-ordering patches are correct).

2) Len is right to bring up USB though, in the sense that this is the one device that could be in a higher (aka lower power) D-state before we reach _PTS(), and on this chipset it's known to cause problems with the BIOS.

On the nVidia chipsets (at least the CK804 - although I'm wondering if this also affects the Asus P1-AH2 we saw, which was also an nVidia board...) with these bad BIOSes, we may also need to stop ohci-hcd autosuspend to make sure we don't inadvertently move to a higher D-state early and trigger the hang that way (and it's possible Windows may also do this; Rafael, you mentioned before you had a CK804 board somewhere? It may be worth checking to see what a fully patched Vista or XP set USB autosuspend to on it for comparison).

I think it's also becoming clear why _PTS() was changed in ACPI 2.0 - since (ab)uses like this are clearly possible with 1.0...
Comment 78 Rafael J. Wysocki 2008-01-08 17:17:03 UTC
(In reply to comment #77)
> That's how I read it as well, so I agree (I think any confusion boils down to
> lower power state = higher D-state value) - this really hasn't changed at
> all.
> 
> Basically, the issue now as I see it is:
> 
> 1) We must call _PTS() before we start putting the devices into their low
> power states (so the suspend re-ordering patches are correct).

Agreed.

> 2) Len is right to bring up USB though, in the sense that this is the one
> device that could be in a higher (aka lower power) D-state before we reach
> _PTS(), and on this chipset it's known to cause problems with the BIOS.

Yes.  More precisely, I think that putting the USB into D3 cuts power from something on which _PTS relies.
 
> On the nVidia chipsets (at least the CK804 - although I'm wondering if this
> also affects the Asus P1-AH2 we saw, which was also an nVidia board...) with
> these bad BIOSes, we may also need to stop ohci-hcd autosuspend to make sure
> we don't inadvertently move to a higher D-state early and trigger the hang
> that way (and it's possible Windows may also do this; Rafael, you mentioned
> before you had a CK804 board somewhere? It may be worth checking to see what
> a fully patched Vista or XP set USB autosuspend to on it for comparison).

How to check that on Windows?

> I think it's also becoming clear why _PTS() was changed in ACPI 2.0 - since
> (ab)uses like this are clearly possible with 1.0...

Yeah.
Comment 79 Carlos Corbacho 2008-01-08 17:24:12 UTC
To check on Windows; in 'Device Manager', for each of the devices named 'USB Root Hub' (there are two listed on the Abit KN9), check their 'Power Management' tab and see if the option "Allow the computer to turn this device off to save power" is checked or not.

(On the box with the KN9 (Vista), that option is unchecked for both hubs (autosuspend disabled). For all the other boxes in the house (XP), the option is always checked (autosuspend enabled)).
Comment 80 Carlos Corbacho 2008-01-08 17:48:49 UTC
I've now had a report from someone with a Shuttle SN25P (also running Vista) that has an nForce 4 chipset, and Windows has also not enabled USB autosuspend on that either.

So I'm pretty confident in saying that this is a known bug to someone at MS and/ or nVidia, and that nForce 4 is blacklisted for USB autosuspend, probably because of all these buggy BIOSes (I suspect the chipset can handle USB autosuspend just fine, but a lot of the BIOSes don't, so MS just have a blanket USB autosuspend disable on all CK804 boards).

(What I'm not sure of, given the other ACPI 1.0 suspend report for the Asus P1-AH2, which appears to be an nForce 6 system, is whether this blanket disable also extends up, or whether that P1-AH2 just has a bad one-off BIOS.)
Comment 81 Len Brown 2008-01-08 19:10:54 UTC
Carlos,
thanks for clarifying that unloading ohci (USB0)
makes the hang go away, and that unloading ehci (USB2)
has no effect on the problem at hand.

re: comment #71 and comment #72
Yes, I'm sure.

_S3D specifies the shallowest D-state available in S3.
If there were no _PRW=3 (S3 wake capability),
then indeed _S3D=2 means we can choose D2 or (the deeper) D3.

When _PRW=3 is present, _S3D is telling us that
S3 wakeup will work in D2.  But it is also telling us that
"OSPM cannot assume that wake from the S3 system
sleeping state is supported in any lower D-state unless
specified by a corresponding _S3W object."

here "lower" means "deeper", for _S3W clearly
specifies the lowest power = highest number = deepest
D-state that still supports wake.  USB2 in this case
has no _S3W, so "highest" = "lowest" = only = D2.

ACPI 3.0 table 7-8 gives exactly the example of USB2.
_S3D=2 and _PRW=3, and no _S3W present.
So for that device, the OS must put the device into D2
and only into D2 upon S3.  If the device was in D3
before the S3 request, it must first be transitioned to D2.
No, this didn't change between versions of the ACPI spec.
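
The reading above, written as a Python sketch. It illustrates this interpretation only, not what Linux currently does; the function and parameter names are made up.

```python
def s3_d_states(s3d, s3w=None, wakes_from_s3=True):
    """D-states permitted in S3 under the interpretation described above.

    s3d            value returned by _S3D (shallowest state allowed in S3)
    s3w            value returned by _S3W, or None when the object is absent
    wakes_from_s3  True when _PRW says the device can wake the system from S3
    """
    if not 0 <= s3d <= 3:
        raise ValueError("_S3D must be between 0 and 3")
    if not wakes_from_s3:
        # No wake capability to preserve: _S3D or anything deeper is fine.
        return list(range(s3d, 4))
    # Wake must keep working: without _S3W only the _S3D state is known good.
    deepest = s3d if s3w is None else s3w
    if not s3d <= deepest <= 3:
        raise ValueError("_S3W must lie between _S3D and 3")
    return list(range(s3d, deepest + 1))


print(s3_d_states(2))            # _S3D=2, _PRW=3, no _S3W
print(s3_d_states(2, s3w=3))     # same device with _S3W=3
print(s3_d_states(2, wakes_from_s3=False))
```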

The only unknown is if Windows always cares about USB
wake functionality.  I'm going to go out on a limb and
venture that yes, it does, and thus for devices with
a _PRW, it will honor _S3D and _S3W, if present.

The other unknown is if this chipset and its SMM BIOS code
ties together any of the ehci and ohci functionality,
or if they are totally independent.

Re: ohci (USB0)
Yes, the way I read the DSDT, we should put it into D3.
It would be an interesting experiment if putting it into D2
instead had an effect.
Comment 82 Carlos Corbacho 2008-01-08 20:03:50 UTC
(In reply to comment #81)
> The other unknown is if this chipset and its SMM BIOS code
> ties together any of the ehci and ohci functionality,
> or if they are totally independent.

I suspect they are independent. Disabling the SMI trap call on the suspend path didn't appear to break anything in my original testing.

A cursory retest now (disabling all calls to the trap in a custom DSDT) didn't cause anything to break here that I could see.

> Re: ohci (USB0)
> Yes, the way I read the DSDT, we should put it into D3.
> It would be an interesting experiment if putting it into D2
> instead had an effect.

I've hacked drivers/usb/core/hcd-pci.c to try this, but the result is the same - the system still hangs on suspend.

We really cannot let this device be in any mode other than D1 when calling _PTS().

So can we now get Rafael's suspend reordering patches into the ACPI tree, and I'll handle getting the CK804 USB autosuspend disabling side of this sorted?
Comment 83 Carlos Corbacho 2008-01-08 20:11:18 UTC
Well, by 'get into the tree' I mean having them in whichever branch goes up to 2.6.25 (since I see they're already in the suspend branch... I really shouldn't be replying to bugs at 4am).
Comment 84 Carlos Corbacho 2008-01-09 10:35:01 UTC
Ok, so after my discussion with Alan Stern:

1) At the moment, Linux USB autosuspend doesn't actually suspend the PCI OHCI controller (i.e. Linux doesn't change its power state), just the ports; so the issue of the OHCI controller going to D3 behind our backs isn't relevant for now.

2) For the future: 

USB0's _S1D states that the OHCI controller cannot be put into a higher D-state anyway when in S1 - so when Linux gets round to being able to runtime suspend USB PCI controllers (as opposed to just the ports), when in S1 we will need to check _S1D to make sure it is actually safe to go to high D-states (even if we don't want to do this check for S3) to make sure we don't runtime suspend the OHCI controller.

So to summarise: We don't put the OHCI controller into D3 at the moment in S1; we should only do so for S3 after _PTS(); and in the future, the DSDT clearly states that the OHCI PCI controller cannot be runtime suspended, so we should always obey _S1D for any future PCI runtime suspend when in S1 (even if we ignore _S3D for S3).
Comment 85 Carlos Corbacho 2008-01-09 10:57:48 UTC
Erum... to readdress my point 2, since I got myself confused between S1 and S0:

We will still need to blacklist CK804 from PCI EHCI controller runtime (S0) suspend - but we can't do this until the PCI controller runtime suspend functionality is actually added to the USB code in the first place (and there's no ETA on that); so for the present time, the EHCI PCI controller being put into D3 by the USB code in S0 is not an issue.
Comment 86 Rafael J. Wysocki 2008-01-09 11:01:45 UTC
(In reply to comment #84)
> Ok, so after my discussion with Alan Stern:
> 
> 1) At the moment, Linux USB autosuspend doesn't actually suspend the PCI OHCI
> controller (i.e. Linux doesn't change its power state), just the ports; so
> the
> issue of the OHCI controller going to D3 behind our backs isn't relevant for
> now.

Fine.

> 2) For the future: 
> 
> USB0's _S1D states that the OHCI controller cannot be put into a higher
> D-state

Please don't use the "higher D-state" language, it's ambiguous.

Actually, the USB0's _S1D says that we can put the device in either of D1, D2 and D3 before entering S1.

> anyway when in S1 - so when Linux gets round to being able to runtime suspend
> USB PCI controllers (as opposed to just the ports), when in S1 we will need
> to
> check _S1D to make sure it is actually safe to go to high D-states (even if
> we
> don't want to do this check for S3) to make sure we don't runtime suspend the
> OHCI controller.

S1 is a system sleep state (often referred to as standby), it has nothing to do with dynamic pm. 

> So to summarise: We don't put the OHCI controller into D3 at the moment in
> S1;

Yes, we do, in the same way as we do before entering S3.  We don't autosuspend it into D3, which makes quite a difference.
Comment 87 Carlos Corbacho 2008-01-09 11:34:59 UTC
(In reply to comment #86)
> Yes, we do, in the same way as we do before entering S3.  We don't
> autosuspend
> it into D3, which makes quite a difference.

I noticed (I really should read the specs more first...) - see my previous reply from about three minutes before yours.

We will still need to blacklist CK804 from any OHCI PCI controller autosuspend, but at the moment, this isn't an issue, since there is no PCI controller autosuspend for any USB PCI controllers.

So your suspend reordering patches are all that is needed to solve this issue; for any future USB controller autosuspend we don't need to deal with that until it actually gets added to the kernel.
Comment 88 Len Brown 2008-01-10 12:56:33 UTC
re: 2.6.25
yes, these patches are in the suspend branch,
and everything currently in the suspend branch
is targeted for 2.6.25.

I have a very bad feeling about this change, however.
I'll not be surprised to see it break other systems --
at which point we'll need to get a better understanding
why this failure occurs and if there is a different
way to handle these systems.
Comment 89 Rafael J. Wysocki 2008-01-10 14:10:31 UTC
(In reply to comment #88)
> re: 2.6.25
> I have a very bad feeling about this change, however.
> I'll not be surprised to see it break other systems --
> at which point we'll need to get a better understanding
> why this failure occurs and if there is a different
> way to handle these systems.

I'm suspicious too, but I don't really expect things to break (I expect ACPI 2.0+ hardware to be backwards compatible with ACPI 1.0).  Still, some systems may just work better with the 2.0 ordering and that's why the boot option has been provided.
Comment 90 Michael Tokarev 2008-02-01 08:32:43 UTC
Just for record.

I've an ASUS M2NPV-VM motherboard, Geforce6150/Nforce430(?) chipset, and it shows similar behaviour - it never finishes suspend with 2.6.24, while with 2.6.23 all worked ok.  Note it's NOT CK804.

After Rafael pointed me out this thread, I tried using "shutdown" suspend method, and also tried unloading ohci-hcd - both gave me correctly completing suspend.

Later on I discovered a new BIOS for this motherboard on asus.com.tw site (rev. 1201, it was 0901 before).  I updated the BIOS, and the problem went away - this machine is now able to suspend/resume without the above workarounds.
Comment 91 Carlos Corbacho 2008-02-01 09:13:55 UTC
I suspect, based on what we've observed now, that something like this has occurred:

1) Some time ago, an old nVidia reference BIOS had an SMI trap with code that really did need to poke at the OHCI PCI device.

2) nVidia have long since removed this part of the trap from their reference BIOS (certainly pre nForce 4).

3) Many BIOS vendors have not updated their BIOS's accordingly, and the old SMI trap has found its way into more modern machines (so far - nForce 4, 5 & 6 at least).

4) Some vendors have then removed the offending part of the trap later (I suspect, for instance, that Robert Hancock has a much newer BIOS on his board than Arthur, hence could not reproduce the problem). So Asus at least are on the ball and have fixed this in later BIOS releases, I know of a Tyan that cannot reproduce this problem, whilst Abit _still_ haven't fixed this as of their latest BIOS.

So, yes, technically this is a bad BIOS issue, and some vendors have fixed it, but I suspect Windows still implements the old suspend behaviour anyway (since there will be a lot of boards with the older BIOS's still floating about), and blanket bans PCI autosuspend for the OHCI controller on offending chipsets to play it safe (I know at least on nForce 4 this is the case - I'm not sure for newer chipsets).
Comment 92 Arthur Erhardt 2008-02-01 12:58:02 UTC
(In reply to comment #91)
> I suspect, based on what we've observed now, that something like this has
> occurred:
[...]
> 3) Many BIOS vendors have not updated their BIOS's accordingly, and the old
> SMI
> trap has found its way into more modern machines (so far - nForce 4, 5 & 6 at
> least).
> 
> 4) Some vendors have then removed the offending part of the trap later (I
> suspect, for instance, that Robert Hancock has a much newer BIOS on his board
> than Arthur, hence could not reproduce the problem). So Asus at least are on
> the ball and have fixed this in later BIOS releases, I know of a Tyan that
> cannot reproduce this problem, whilst Abit _still_ haven't fixed this as of
> their latest BIOS.
Unfortunately, here is BIOS 1303 released on 2006-08-10 in use on an Asus A8N SLI Premium. There is no newer BIOS for this board.
 
For now, I'm happy with Rafael's patches which make s2disk usable though that
doesn't solve the problem discussed.
Comment 93 Carlos Corbacho 2008-02-01 14:37:30 UTC
Ah, you're right. I was confused by the fact that there are quite a few variations of the A8N, with different BIOS releases - I suspect this may be why Robert didn't see the bug - different A8N variation.

I think the general theory is still right though - that there is a BIOS bug, and that some vendors/ BIOSs have fixed it.

This is more a 'why is this fix necessary' postscript to the bug, rather than an argument against it (I'm all in favour of Rafael's patches, I'd just like some sort of record as to exactly what is wrong, and why it doesn't affect all nForce 4 chipsets - which still seems to boil down to vendors releasing bad BIOS's).
Comment 94 Adolfo R. Brandes 2008-04-25 03:28:39 UTC
Would this break the Asus A8N again?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7731ce63d9a863c987dd87b0425451fff0e6cdc8

Should this bug be reopened for 2.6.26?
Comment 95 Rafael J. Wysocki 2008-04-25 04:47:16 UTC
(In reply to comment #94)
> Would this break the Asus A8N again?
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7731ce63d9a863c987dd87b0425451fff0e6cdc8

Yes.  I have a patch in the works, but it'll need some testing.
Expect 2.6.27 time frame.
 
> Should this bug be reopened for 2.6.26?

Well, it hasn't been closed in the first place.
Comment 96 Adolfo R. Brandes 2008-04-25 06:50:23 UTC
(In reply to comment #95)
> Yes.  I have a patch in the works, but it'll need some testing.
> Expect 2.6.27 time frame.

Roger that.  Holler if you need a test case.
Comment 97 Zhang Rui 2008-11-19 23:19:11 UTC
2.6.27 is released, need to update the status of this bug, ping rafael. :)
Comment 98 Arthur Erhardt 2008-11-20 01:02:54 UTC
(In reply to comment #97)
> 2.6.27 is released, need to update the status of this bug, ping rafael. :)
> 
Vanilla 2.6.27.6 with OpenSUSE 11.0 x86_64 distribution: 
- suspend to disk with shutdown = "shutdown" (not: "platform") works, as did 2.6.25.
- suspend to ram (s2ram -f -a 1 ) freezes instantly, both from console and Xorg. Xorg with nvidia binary driver, console without (i.e. unloaded nvidia.ko, not only "not currently used"). Blinking cursor is still visible on monitor, fans and power LED are still on.
Comment 99 Carlos Corbacho 2008-11-20 10:28:56 UTC
Arthur,

Boot your kernel with acpi_sleep=old_ordering and try again (and also attach the output of dmidecode for your machine so that we can apply the quirk to your motherboard out-of-the-box).
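
To confirm the option actually reached the running kernel, the boot command line (for example the contents of /proc/cmdline) can be checked. A minimal user-space sketch, assuming a hypothetical helper name `cmdline_has_old_ordering` and assuming acpi_sleep= takes a comma-separated list of values:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: returns true if the given kernel
 * command line contains an acpi_sleep= parameter whose comma-separated
 * value list includes "old_ordering".
 */
static bool cmdline_has_old_ordering(const char *cmdline)
{
	static const char key[] = "acpi_sleep=";
	static const char want[] = "old_ordering";
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, key)) != NULL) {
		/* The key must begin a parameter, not sit inside another word. */
		bool at_start = (p == cmdline) || p[-1] == ' ';

		p += sizeof(key) - 1;
		if (!at_start)
			continue;

		/* Walk the comma-separated values up to the next space. */
		while (*p && *p != ' ' && *p != '\n') {
			size_t len = strcspn(p, ", \n");

			if (len == sizeof(want) - 1 && strncmp(p, want, len) == 0)
				return true;
			p += len;
			if (*p == ',')
				p++;
		}
	}
	return false;
}
```

The caller is expected to read /proc/cmdline into a NUL-terminated buffer first; that part is left out to keep the sketch short.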
Comment 100 Arthur Erhardt 2008-11-20 14:30:45 UTC
Created attachment 18956 [details]
dmidecode output of Asus A8N SLI Premium
Comment 101 Arthur Erhardt 2008-11-20 14:34:57 UTC
(In reply to comment #99)
> Arthur,
> 
> Boot your kernel with acpi_sleep=old_ordering and try again (and also attach
> the output of dmidecode for your machine so that we can apply the quirk to
> your
> motherboard out-of-the-box).
Ok, here are some test results:
(1) 
console mode, i.e. no X, no nvidia driver.
s2ram -f -a 1 works
resume works mostly: console remains black (but ssh login works after resume)
starting Xorg via ssh login works, graphics (w/nvidia driver) fine, but tty[1-6] still black, but responding (i.e. can type commands, just not see anything).

(2) from Xorg/fvwm (with nvidia.ko version 177.80):
s2ram -f -a 3
After resume, X11 _and_ console ok.

(3) 
After test (2), w/o rebooting:
killall X (or Ctrl Alt Backspace to zap, or whatever method to end X)

modprobe -r nvidia
s2ram -f -a 3

*resume*
ssh login works. From ssh prompt, 
X :0 works (reproducible) on 2nd try. 1st try says 

(EE) NVIDIA(0): The requested configuration of display devices (CRT-0) in
(EE) NVIDIA(0):     MetaMode "1920x1200_60.00_rb" is not supported on this
(EE) NVIDIA(0):     GPU.
[... same message for all configured resolutions ... ]
(EE) NVIDIA(0): Unable to use default mode "nvidia-auto-select".
(EE) Screen(s) found, but none have a usable configuration.

Fatal server error:
no screens found

Console stays black even after successfully starting Xorg in 2nd try.
Comment 102 Rafael J. Wysocki 2009-02-15 13:01:39 UTC
Do I understand correctly that everything works with the NVidia binary driver and acpi_sleep=old_ordering ?
Comment 103 Javier Kohen 2009-02-15 13:04:36 UTC
(In reply to comment #102)
> Do I understand correctly that everything works with the NVidia binary driver
> and acpi_sleep=old_ordering ?
> 

I just found this report today. I tried that and it worked for me. I was able to suspend my Asus A8N SLI Premium-based computer for the first time with that option.
Comment 104 Javier Kohen 2009-02-15 13:07:19 UTC
(In reply to comment #102)
> Do I understand correctly that everything works with the NVidia binary driver
> and acpi_sleep=old_ordering ?
> 

I just found this report today. I tried that and it worked for me. I was able
to suspend my Asus A8N SLI Premium-based computer for the first time with that
option.

Sorry, and yes, I'm using the nVidia proprietary drivers. Version 180.22.
Comment 105 Arthur Erhardt 2009-02-16 09:58:44 UTC
(In reply to comment #102)
> Do I understand correctly that everything works with the NVidia binary driver
> and acpi_sleep=old_ordering ?
This is correct. Using the "nv" (i.e. non-nvidia) driver does not restore 
console properly (from suspend to ram, IIRC). Since I use nvidia's drivers, this is good enough for me. If anybody wants to investigate further, I'm willing to test.

 
Comment 106 Rafael J. Wysocki 2009-02-16 12:31:41 UTC
Unfortunately, newer NVidia graphics adapters require the binary driver to be used for suspend-resume to work, even from the console.

The binary driver apparently does some kind of "magic" that's necessary to reinitialize the adapter during resume and we don't know what that is.

The only thing we can do is to use the NVidia binary driver with NVidia adapters if you want suspend-resume to work.

We can only add quirks to the kernel for the main boards that require acpi_sleep=old_ordering to suspend-resume, so please send me the output of dmidecode from your systems.
Comment 107 Javier Kohen 2009-02-16 13:03:25 UTC
Created attachment 20264 [details]
DMIDecode output A8N-SLI Premium 1303

(In reply to comment #106)
> We can only add quirks to the kernel for the main boards that require
> acpi_sleep=old_ordering to suspend-resume, so please send me the output of
> dmidecode from your systems.
Comment 108 Zhang Rui 2009-06-23 08:01:32 UTC
Created attachment 22063 [details]
patch: dmi workaround for Asus A8N-SLI Premium

please try this patch on top of the latest git kernel.
Comment 109 Zhang Rui 2009-06-23 08:02:30 UTC
please re-open this bug if the problem still exists with the patch applied.
Comment 110 Heiko Ettelbrück 2009-09-04 15:58:41 UTC
I found this bug some days ago when searching for help regarding the (by then not working) S3 support on my A8N-SLI Premium (with BIOS version 1303 - btw there is a newer one, but I haven't tried it yet).

As there was no feedback from the people previously discussing in this thread regarding Zhang's latest patch, maybe mine is helpful, too:
- With the original code from git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git, S3 didn't work (i.e. the machine didn't arrive in S3, but hung on the way to S3).
- Having applied Zhang's patch, it worked fine (as well as with unpatched kernel and kernel parameter "acpi_sleep=old_ordering", which I assume actually does the same). The only issue left is that the consoles are "black" after resume, which has already been reported by Arthur and which is not that much of a problem for me by now.

Zhang: Just to be sure: Has your patch already been added to the main kernel tree (in the meanwhile), or is this still supposed to happen? Can you estimate when it will probably be integrated?

Apart from that, just many thanks from me to all of you for having analyzed this and provided the patch + kernel parameter :-)

Heiko
Comment 111 Brian Beardall 2010-06-18 19:39:45 UTC
Could this piece of code be added to drivers/acpi/sleep.c in the mainline kernel?

        {
        .callback = init_old_suspend_ordering,
        .ident = "Asus A8N-SLI",
        .matches = {
                DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
                DMI_MATCH(DMI_BOARD_NAME, "A8N-SLI DELUXE"),
        },
        },

This fixes S3 suspend on the A8N-SLI DELUXE that I have. The full board name is:
ASUS A8N-SLI DELUXE ACPI BIOS Revision 1805. I tested this code on the board and the suspend works with this errata workaround. Without this the computer freezes going into the S3 mode.

I'd also go ahead and add the fix for the ASUS A8N-SLI Premium. The board is almost exactly the same as the Deluxe model. I however don't have an A8N-SLI Premium board.
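
For readers unfamiliar with DMI quirk tables, here is a rough user-space sketch of what such an entry does: compare the board's DMI strings against a list and report whether the old suspend ordering should be selected. The names `board_quirk`, `old_ordering_boards` and `needs_old_ordering` are hypothetical, the "A8N-SLI Premium" board string is an assumption not taken from a dmidecode dump, and substring matching is used here on the understanding that DMI_MATCH behaves that way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration; the real table lives in drivers/acpi/sleep.c. */
struct board_quirk {
	const char *vendor;	/* substring expected in the DMI board vendor */
	const char *board;	/* substring expected in the DMI board name */
};

static const struct board_quirk old_ordering_boards[] = {
	{ "ASUSTeK Computer INC.", "A8N-SLI DELUXE" },
	{ "ASUSTeK Computer INC.", "A8N-SLI Premium" },	/* assumed board string */
};

/* Returns true if the given DMI strings match any entry in the table. */
static bool needs_old_ordering(const char *vendor, const char *board)
{
	size_t i;
	size_t n = sizeof(old_ordering_boards) / sizeof(old_ordering_boards[0]);

	assert(vendor != NULL && board != NULL);

	for (i = 0; i < n; i++) {
		const struct board_quirk *q = &old_ordering_boards[i];

		if (strstr(vendor, q->vendor) && strstr(board, q->board))
			return true;
	}
	return false;
}
```

On a match, the real quirk's callback would select the old ordering at boot, which should have the same effect as passing acpi_sleep=old_ordering by hand.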
Comment 112 Zhang Rui 2010-06-21 06:57:45 UTC
Created attachment 26874 [details]
dmi workaround for Asus A8N-SLI Premium and  Asus A8N-SLI DELUXE

so the patch works for both of you, right?
Comment 113 Zhang Rui 2010-06-21 06:58:48 UTC
heiko,
does the black screen issue still exist in the latest git kernel?
Comment 114 Heiko Ettelbrück 2010-06-21 16:42:41 UTC
Unfortunately I'm currently kind of busy, so it might take some time until I can compile and test a new git kernel... :-(

If someone knows whether Ubuntu includes such a patch in their default kernel (currently I have 2.6.32-22-generic) or has some other workaround in place, that might help, as I can use standby (S3) and hibernate modes with it.
Comment 115 Brian Beardall 2010-06-24 20:02:05 UTC
Yes the code fixes S3 suspend for my A8N-SLI Deluxe motherboard.
Comment 116 Brian Beardall 2010-09-29 17:04:22 UTC
When is attachment #26874 [details] going to be submitted for the stable kernel? I was hoping this would have been in the 2.6.35 kernel but was not submitted and neither is it in the 2.6.36 kernel.
Comment 117 Len Brown 2011-07-30 05:41:20 UTC
patch applied to acpi-test
Comment 118 Len Brown 2011-08-06 02:09:42 UTC
pulled in linux 3.1 merge window.

closed.


commit bb0c5ed6ec523199e34e81dcef8e987507553b63
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Sat Jul 30 01:40:48 2011 -0400

    ACPI: DMI workaround for Asus A8N-SLI Premium and Asus A8N-SLI DELUX
    
    DMI workaround for A8N-SLI Premium and A8N-SLI DELUXE
    to enable the s3 suspend old ordering.
    http://bugzilla.kernel.org/show_bug.cgi?id=9528
    
    Tested-by: Heiko Ettelbrück <hbruckynews@gmx.de>
    Tested-by: Brian Beardall <brian@rapsure.net>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
    Signed-off-by: Len Brown <len.brown@intel.com>