|Summary:||Resume broken on iBook|
|Product:||Power Management||Reporter:||Andreas Schwab (schwab)|
|Component:||Hibernation/Suspend||Assignee:||Rafael J. Wysocki (rjw)|
|Bug Depends on:|
|Bug Blocks:||7216, 12398|
Put devices into D0 before restoring their state
Workaround for devices without the PM capability
Read state from the device after trying to change it on early resume
Change current_state to unknown in PMAC resume
Sanitize radeonfb suspend/resume
Description Andreas Schwab 2009-02-01 16:54:50 UTC
Latest working kernel version: 2.6.29-rc2 Earliest failing kernel version: 2.6.29-rc3 Distribution: Hardware Environment: iBook G4 Software Environment: Problem Description: Crash during resume from suspend to ram. Caused by: commit aa8c6c93747f7b55fa11e1624fec8ca33763a805 Author: Rafael J. Wysocki <email@example.com> Date: Fri Jan 16 21:54:43 2009 +0100 PCI PM: Restore standard config registers of all devices early There is a problem in our handling of suspend-resume of PCI devices that many of them have their standard config registers restored with interrupts enabled and they are put into the full power state with interrupts enabled as well. This may lead to the following scenario: * an interrupt vector is shared between two or more devices * one device is resumed earlier and generates an interrupt * the interrupt handler of another device tries to handle it and attempts to access the device the config space of which hasn't been restored yet and/or which still is in a low power state * the system crashes as a result To prevent this from happening we should restore the standard configuration registers of all devices with interrupts disabled and we should put them into the D0 power state right after that. Unfortunately, this cannot be done using the existing pci_set_power_state(), because it can sleep. Also, to do it we have to make sure that the config spaces of all devices were actually saved during suspend.
Comment 1 Rafael J. Wysocki 2009-02-02 03:59:44 UTC
Created attachment 20076 [details] Put devices into D0 before restoring their state Can you try this patch, please?
Comment 2 Andreas Schwab 2009-02-02 11:37:11 UTC
That does not help.
Comment 3 Rafael J. Wysocki 2009-02-02 13:25:22 UTC
Created attachment 20078 [details] Workaround for devices without the PM capability Thanks for testing. Please also try this one (on top of the previous one).
Comment 4 Rafael J. Wysocki 2009-02-02 13:26:29 UTC
First-Bad-Commit: aa8c6c93747f7b55fa11e1624fec8ca33763a805 Handled-By : Rafael J. Wysocki <firstname.lastname@example.org>
Comment 5 Andreas Schwab 2009-02-02 13:59:40 UTC
That didn't help either.
Comment 6 Rafael J. Wysocki 2009-02-02 14:15:40 UTC
Please attach the output of 'lspci -vv'.
Comment 7 Andreas Schwab 2009-02-02 14:35:15 UTC
When suspending with no_console_suspend init=/bin/sh I see Badness at drivers/pci/pci-driver.c:368 (and in this state resume actually works even in vanilla rc3, but the badness only happens with the patches applied).
Comment 9 Rafael J. Wysocki 2009-02-02 15:06:15 UTC
(In reply to comment #7) > When suspending with no_console_suspend init=/bin/sh I see Badness at > drivers/pci/pci-driver.c:368 This only is a warning, no need to worry about it. You can just comment out lines 368 and 369 in drivers/pci/pci-driver.c and it should go away. > (and in this state resume actually works even in > vanilla rc3, but the badness only happens with the patches applied). Ah, good. One of device drivers apparently breaks things, I wonder which one.
Comment 10 Rafael J. Wysocki 2009-02-02 15:13:10 UTC
Which drivers are loaded when you boot with init=/bin/sh ?
Comment 11 Rafael J. Wysocki 2009-02-02 15:24:50 UTC
IOW, please attach a dmesg log containing a successful suspend-resume cycle with init=/bin/sh, if possible.
Comment 12 Rafael J. Wysocki 2009-02-02 15:39:08 UTC
Created attachment 20080 [details] Read state from the device after trying to change it on early resume Linus thinks something like this may help, so please also try it.
Comment 13 Rafael J. Wysocki 2009-02-02 15:47:20 UTC
The last patch is on top of the one from comment #1.
Comment 14 Rafael J. Wysocki 2009-02-02 15:53:31 UTC
Created attachment 20081 [details] Change current_state to unknown in PMAC resume If the patch from comment #12 doesn't help, please also try this one.
Comment 16 Benjamin Herrenschmidt 2009-02-02 15:55:52 UTC
The root problem is radeonfb. It has its own config space restore code (yuck, but heh, that predates all the PCI core stuff) -and- it makes the decision as to whether it must re-POST the card on not ... based on the BAR values having been lost... With your patch, the BAR values are restored, so radeonfb things the card doesn't need re-POSTing and doesn't try. This is fishy, I'll try to come up with something a bit more sane. I'm not sure though what is the best way to "detect" that the card is non posted, I'll figure something. (Note, email temporarily down, so I can't reply on the lkml thread right away)
Comment 17 Benjamin Herrenschmidt 2009-02-02 15:58:00 UTC
Actually, Andreas, did you disable the early resume code in there ? With the early resume, radeonfb should have kicked in before the PCI restore code and thus should have somewhat worked anyway... unless pci_enable_device() -also- causes the BARs to be restored. Now, the early resume hack has some other issues that I mentioned in the email thread on the list, that will need to be sorted out but it seems to be working enough on the albook here.
Comment 18 Rafael J. Wysocki 2009-02-02 16:06:20 UTC
(In reply to comment #17) > Actually, Andreas, did you disable the early resume code in there ? With the > early resume, radeonfb should have kicked in before the PCI restore code and > thus should have somewhat worked anyway... I don't really think it would. The PCI restore code goes before the driver's resume_early(), actually. The rationale is that the PCI restore should not change whatever the driver might have done in resume_early(). > unless pci_enable_device() -also- causes the BARs to be restored. No, it doesn't. > Now, the early resume hack has some other issues that I mentioned in the > email > thread on the list, that will need to be sorted out but it seems to be > working > enough on the albook here. What about reposting the adapter unconditionally on resume? Would that break?
Comment 19 Benjamin Herrenschmidt 2009-02-02 17:07:36 UTC
It might... we don't know how to repost all of them, but that's not necessary the problem, it's also that the repost code was reverse engineered and may not actually work if the chip is still "hot". In fact I know at least of a couple of things that will probably burst, such as the initialization of the chip's memory controller.
Comment 20 Benjamin Herrenschmidt 2009-02-02 17:08:21 UTC
Created attachment 20085 [details] Sanitize radeonfb suspend/resume Andreas, can you test that ? I need to test it on older machines too see if it works with the ones using D2 state, and some of the thinkpads as well...
Comment 21 Rafael J. Wysocki 2009-02-03 00:15:50 UTC
(In reply to comment #19) > It might... we don't know how to repost all of them, but that's not necessary > the problem, it's also that the repost code was reverse engineered and may > not > actually work if the chip is still "hot". In fact I know at least of a couple > of things that will probably burst, such as the initialization of the chip's > memory controller. Hmm. So perhaps we need a "do not restore" flag? Not nice, but would do the trick.
Comment 22 Benjamin Herrenschmidt 2009-02-03 01:00:51 UTC
Let's see first. I've used a different method to detect if the chip needs to be POSTed in the patch, comparing the values of some important PLL registers with the saved values. I need to test on some older machines that don't power off the chip to see if that works fine, and if it does, we can just merge that.
Comment 23 Andreas Schwab 2009-02-03 11:33:16 UTC
Ben's patch works fine here.
Comment 24 Rafael J. Wysocki 2009-02-03 13:24:06 UTC
Great! Thanks for testing. Handled-By : Benjamin Herrenschmidt <email@example.com> Patch : http://bugzilla.kernel.org/attachment.cgi?id=20085&action=view
Comment 25 Rafael J. Wysocki 2009-02-08 11:31:23 UTC
Comment 26 Rafael J. Wysocki 2009-02-08 13:25:08 UTC