Latest working kernel version: 2.6.29-rc2
Earliest failing kernel version: 2.6.29-rc3
Hardware Environment: iBook G4
Crash during resume from suspend to ram.
Author: Rafael J. Wysocki <email@example.com>
Date: Fri Jan 16 21:54:43 2009 +0100
PCI PM: Restore standard config registers of all devices early
There is a problem in our handling of suspend-resume of PCI devices that
many of them have their standard config registers restored with
interrupts enabled and they are put into the full power state with
interrupts enabled as well. This may lead to the following scenario:
* an interrupt vector is shared between two or more devices
* one device is resumed earlier and generates an interrupt
* the interrupt handler of another device tries to handle it and
attempts to access the device the config space of which hasn't been
restored yet and/or which still is in a low power state
* the system crashes as a result
To prevent this from happening we should restore the standard
configuration registers of all devices with interrupts disabled and we
should put them into the D0 power state right after that.
Unfortunately, this cannot be done using the existing
pci_set_power_state(), because it can sleep. Also, to do it we have to
make sure that the config spaces of all devices were actually saved
Created attachment 20076 [details]
Put devices into D0 before restoring their state
Can you try this patch, please?
That does not help.
Created attachment 20078 [details]
Workaround for devices without the PM capability
Thanks for testing.
Please also try this one (on top of the previous one).
Handled-By : Rafael J. Wysocki <firstname.lastname@example.org>
That didn't help either.
Please attach the output of 'lspci -vv'.
When suspending with no_console_suspend init=/bin/sh I see Badness at drivers/pci/pci-driver.c:368 (and in this state resume actually works even in vanilla rc3, but the badness only happens with the patches applied).
Created attachment 20079 [details]
(In reply to comment #7)
> When suspending with no_console_suspend init=/bin/sh I see Badness at
This only is a warning, no need to worry about it. You can just comment out lines 368 and 369 in drivers/pci/pci-driver.c and it should go away.
> (and in this state resume actually works even in
> vanilla rc3, but the badness only happens with the patches applied).
Ah, good. One of device drivers apparently breaks things, I wonder which one.
Which drivers are loaded when you boot with init=/bin/sh ?
IOW, please attach a dmesg log containing a successful suspend-resume cycle with init=/bin/sh, if possible.
Created attachment 20080 [details]
Read state from the device after trying to change it on early resume
Linus thinks something like this may help, so please also try it.
The last patch is on top of the one from comment #1.
Created attachment 20081 [details]
Change current_state to unknown in PMAC resume
If the patch from comment #12 doesn't help, please also try this one.
(On top of the patch from comment #1).
The root problem is radeonfb.
It has its own config space restore code (yuck, but heh, that predates all the
PCI core stuff) -and- it makes the decision as to whether it must re-POST the
card on not ... based on the BAR values having been lost...
With your patch, the BAR values are restored, so radeonfb things the card
doesn't need re-POSTing and doesn't try.
This is fishy, I'll try to come up with something a bit more sane. I'm not sure
though what is the best way to "detect" that the card is non posted, I'll
(Note, email temporarily down, so I can't reply on the lkml thread right away)
Actually, Andreas, did you disable the early resume code in there ? With the early resume, radeonfb should have kicked in before the PCI restore code and thus should have somewhat worked anyway... unless pci_enable_device() -also- causes the BARs to be restored.
Now, the early resume hack has some other issues that I mentioned in the email thread on the list, that will need to be sorted out but it seems to be working enough on the albook here.
(In reply to comment #17)
> Actually, Andreas, did you disable the early resume code in there ? With the
> early resume, radeonfb should have kicked in before the PCI restore code and
> thus should have somewhat worked anyway...
I don't really think it would. The PCI restore code goes before the driver's resume_early(), actually. The rationale is that the PCI restore should not change whatever the driver might have done in resume_early().
> unless pci_enable_device() -also- causes the BARs to be restored.
No, it doesn't.
> Now, the early resume hack has some other issues that I mentioned in the
> thread on the list, that will need to be sorted out but it seems to be
> enough on the albook here.
What about reposting the adapter unconditionally on resume? Would that break?
It might... we don't know how to repost all of them, but that's not necessary the problem, it's also that the repost code was reverse engineered and may not actually work if the chip is still "hot". In fact I know at least of a couple of things that will probably burst, such as the initialization of the chip's memory controller.
Created attachment 20085 [details]
Sanitize radeonfb suspend/resume
Andreas, can you test that ? I need to test it on older machines too see if it works with the ones using D2 state, and some of the thinkpads as well...
(In reply to comment #19)
> It might... we don't know how to repost all of them, but that's not necessary
> the problem, it's also that the repost code was reverse engineered and may
> actually work if the chip is still "hot". In fact I know at least of a couple
> of things that will probably burst, such as the initialization of the chip's
> memory controller.
Hmm. So perhaps we need a "do not restore" flag? Not nice, but would do the trick.
Let's see first. I've used a different method to detect if the chip needs to be POSTed in the patch, comparing the values of some important PLL registers with the saved values. I need to test on some older machines that don't power off the chip to see if that works fine, and if it does, we can just merge that.
Ben's patch works fine here.
Great! Thanks for testing.
Handled-By : Benjamin Herrenschmidt <email@example.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=20085&action=view
Ignore-Patch : http://bugzilla.kernel.org/attachment.cgi?id=20085&action=view
Patch : http://marc.info/?l=linux-kernel&m=123379612011590&w=4
Patch : http://marc.info/?l=linux-kernel&m=123379612111593&w=4
Patch : http://marc.info/?l=linux-kernel&m=123379612111596&w=4
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1fb25cb8b83e85f5bf1a4adb3c9a254c4ce92405