Created attachment 109251 [details]
I use systemd's tmpfiles to power off the Radeon DIS early during boot to save power. This works fine with kernel 3.11.1 but it breaks with 3.12-rc1. I removed the systemd rule and tried to power the card off manually (echo OFF > /sys/kernel/debug/vgaswitcheroo/switch). I got a kernel warning and vga_switcheroo died. See the attached file for full dmesg dump.
Can you bisect?
Weird. I'm guessing something in ACPI is causing a hot unplug we haven't seen before, or did we hook up the radeon release method and are only seeing this now?
Created attachment 109261 [details]
Error on hard freeze
Bisecting git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git revealed that the first bad commit is "bbd34fcdd1b201e996235731a7c98fd5197d9e51". While bisecting I was getting hard freezes from which the only escape was a hard reset; see the attached image for the error message.
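For anyone repeating the bisect by hand, the search itself is a plain binary search over the commit range. This is only a minimal sketch: the commit names and the is_bad callback are hypothetical stand-ins for "build this commit, boot it, try to switch the card off", and it assumes every commit after the first bad one is bad too.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history in which "c5" introduces the regression.
history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
culprit = first_bad_commit(history, lambda c: int(c[1:]) >= 5)
```

A commit that hard-freezes before it can be judged breaks the good/bad assumption, which is why such commits are normally skipped rather than marked.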
Kernel 3.12-rc2 seems to fix the issue for me. (The kernel still crashes with the patches for powerxpress dynamic power switching, but that's obviously another story).
Created attachment 113521 [details]
dmesg from 3.12 final
I just updated to 3.12 final and it is happening again. The problem seemed to be fixed in 3.12-rc2 and I switched back to stable releases then. New dmesg log attached...
As Alex said on FreeDesktop's bugzilla, it looks like this bug might be a duplicate of this one: https://bugs.freedesktop.org/show_bug.cgi?id=70687
There, we can find another call trace of this crash and a bisect. Maybe it can help :-)
I tried to revert my git repo to the 3.12-rc2 tag to see if I could do another round of bisecting, but it turns out that even 3.12-rc2 was broken. I don't know how I missed that. It seems that the previously pinpointed commit is the cause. Unfortunately the commit won't revert cleanly no matter what I try...
"I seem to have discovered the root of the issue.
I've just built 3.13-rc5 kernel which has the dynamic powering of the discrete gpu and all hell broke loose.
I've narrowed the error down to the PCI hotplug driver. My machine loads the shpchp PCI hotplug driver, from what I can see in lsmod output. But the trick is that there is another PCI hotplug driver, the ACPI PCI hotplug one, which seems to be what makes all hell break loose here. Disabling it seems to fix everything for me, at least on kernel 3.13.
# CONFIG_HOTPLUG_PCI_ACPI is not set
This kernel config option is the culprit for this, and that also can be seen from my backtrace:
[ 22.731998] [<ffffffff81343cb1>] ? acpiphp_check_bridge+0x72/0x88
So the trick behind this is that the ACPI PCI hotplug driver conflicts with the shpchp one that my machine uses. And since it is a built-in driver and can't be built as a module, it is always loaded. The other possibility is that this machine doesn't support ACPI hotplug, but does support SHPC PCI hotplug. We need a kernel workaround so that ACPI PCI hotplug is disabled and out of the way when SHPC PCI hotplug is enabled."
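To check how a given kernel was configured with respect to the option quoted above, something along these lines should do. It is a rough sketch that works on the text of a .config file; the helper name and the sample config are made up for illustration, and only the two standard line forms ("CONFIG_X=value" and "# CONFIG_X is not set") are handled.

```python
def config_option_state(config_text: str, option: str) -> str:
    """Return the value assigned to a Kconfig option, or 'not set'."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "not set"
        # Match "OPTION=" exactly so CONFIG_FOO does not match CONFIG_FOO_BAR.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return "not set"  # options absent from the file are treated as disabled


# Illustrative config fragment, not taken from any machine in this report.
sample = """\
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
CONFIG_HOTPLUG_PCI_SHPC=m
"""
acpi_state = config_option_state(sample, "CONFIG_HOTPLUG_PCI_ACPI")
```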
Rafael, any ideas?
"acpiphp.disable=1" in the kernel bootline fixes the problem for me. The Radeon is reported as off in vgaswitcheroo and the laptop draws less power. The DIS even powers up and down correctly with DRI_PRIME.
Created attachment 119691 [details]
Acpidump for HP ProBook4730s
Yes, this most likely is related to PCI hotplug, because ACPIPHP now handles devices it didn't try to handle before. This means that if there are ACPI hotplug events for those devices, it will try to handle them.
What happens is probably that there is a bus check or device check causing ACPIPHP to rescan the bus and during that bus rescan it finds a device that doesn't respond (no wonder), so it decides that the device has gone and tries to remove it.
The solution might be to tell ACPIPHP somehow that the device in question didn't really go away. Or to ignore that device entirely.
I guess we may use a flag in struct acpi_device set for the graphics adapter's ACPI companion by the radeon driver during probe. Or something like that.
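To make the idea concrete, here is a toy model of the rescan behaviour described above. It is not kernel code: the Device class, the rescan function and the no_hotplug field are hypothetical names used only to show how a per-device flag would keep a powered-off GPU from being treated as unplugged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    name: str
    responds: bool            # does the device answer when the bus is rescanned?
    no_hotplug: bool = False  # hypothetical flag set by the GPU driver at probe time


def rescan(devices: List[Device]) -> List[Device]:
    """Return the devices that survive a bus rescan."""
    kept = []
    for dev in devices:
        if not dev.responds and not dev.no_hotplug:
            continue  # silent and not flagged: treated as gone and removed
        kept.append(dev)
    return kept


radeon = Device("radeon", responds=False)  # powered off through switcheroo
intel = Device("intel", responds=True)

before = [d.name for d in rescan([intel, radeon])]  # radeon gets removed
radeon.no_hotplug = True
after = [d.name for d in rescan([intel, radeon])]   # radeon is left alone
```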
email@example.com: Can you please check if reverting commit ab1225901da2 makes any difference for you?
Can anyone please point me to the switcheroo code removing power from the radeon device?
radeon_atpx_set_discrete_state() is the specific function that calls the ACPI method to power off the dGPU.
Thanks. And where is atpx->handler set?
OK, I see. The method is called "ATPX" and I suppose it is device-specific?
ATPX is the AMD specific switching interface for AMD/AMD and AMD/Intel PowerXpress laptops.
Nvidia/Nvidia and Nvidia/Intel laptops use a different interface (called DSM I think).
I created a patch file for commit ab1225901da2 ("Revert ACPI hotplug...") and applied it with -R onto 3.13-rc5 source, but it didn't change anything.
Created attachment 119791 [details]
ACPIPHP / radeon: Avoid removing devices that are not really gone
Maybe something like this helps (for radeon).
Totally untested, may kill your hamster pet.
No, it won't help, sorry.
Created attachment 119801 [details]
ACPIPHP / radeon: Avoid removing devices that are not really gone, v2
Please try this one instead (hamster pet disclaimer still applies).
Applied onto 3.13-rc5, unfortunately this does not fix the problem (dmesg attached). Also KDM failed to start with this patch applied, I didn't try to start X manually...
Created attachment 119811 [details]
Dmesg with fix v2 applied.
Created attachment 119821 [details]
ACPIPHP / radeon: Avoid removing devices that are not really gone, v3
This is a slightly modified version of the patch that should give us a bit more debug information.
Please apply it instead of the previous one, retest and attach dmesg.
Created attachment 119831 [details]
Dmesg with patch 3
My whole system stutters when the card is being powered up / down
This is with runpm=1
Yikes after testing this patch the laptop complains that there is no boot disk attached to the system.
I'm trying sysrescuecd now
Created attachment 119841 [details]
Dmesg from SysRescueCD
I'm thinking either something has happened to the SSD or the controller
Also when X starts the screen goes blank and I don't know if the system remains responsive or not
The drive shows up fine on another system, so it looks like something's happened to the controller or it's not being initialized properly. Is there anything obvious in the above dmesg that might suggest the problem?
Looks like leaving the laptop unplugged with the battery out for a wee while sorted the issue (phew)
Are there any other patches that need testing?
Created attachment 119861 [details]
dmesg with v3 fix applied on clean 3.13-rc5
Thanks! Evidently, the power_removed flag is set for a wrong device (i.e. not the one the hotplug events are signaled for).
Please attach the output of "ls -lR /sys/devices/LNXSYSTM\:00/" from your system.
Created attachment 119871 [details]
Output of "ls -lR /sys/devices/LNXSYSTM\:00/"
@Mike: I have no idea why the system behaves like that with the patch applied, sorry about that.
At this point I need to figure out how this whole thing is supposed to work and that doesn't appear to be straightforward.
Created attachment 119881 [details]
Output of "ls -lR /sys/devices/LNXSYSTM\:00/"
In case mine is different
(In reply to madcatx from comment #37)
> Created attachment 119871 [details]
> Output of "ls -lR /sys/devices/LNXSYSTM\:00/"
and post the output.
(In reply to Mike Lothian from comment #39)
> Created attachment 119881 [details]
> Output of "ls -lR /sys/devices/LNXSYSTM\:00/"
> In case mine is different
It is different, but the layout seems to be analogous.
In your case the files of interest are:
In both cases LNXVIDEO:00 and LNXVIDEO:01 seem to be the Radeon and the Intel graphics, respectively.
If that's the case, the attached dmesg output means that the power_removed flag in the patch is set for the Intel graphics, but the hotplug event is generated for the Radeon. I'm not sure why at the moment.
OK, so GFX0 is the Intel graphics and that's the one having the ATPX method.
These are my two:
Created attachment 119891 [details]
ACPIPHP / radeon: Debug switcheroo problem
OK, thanks! Everything is consistent at least. :-)
This patch doesn't fix anything yet, it just should set the no_hotplug flag for the radeon device (which ACPIPHP will be able to use later) during switcheroo detection.
Please apply it, boot the kernel and send dmesg.
It's not applying cleanly:
patching file drivers/gpu/drm/radeon/radeon_atpx_handler.c
patching file include/acpi/acpi_bus.h
Hunk #1 FAILED at 163.
1 out of 1 hunk FAILED -- saving rejects to file include/acpi/acpi_bus.h.rej
Does it need to be applied in conjunction with one of the other patches?
Should we apply it over the "v3" fix attempt or on a clean source?
Looks like the issue is
In your patch it's 25 changing to 24
Created attachment 119901 [details]
Dmesg with reserved 26
This is the dmesg with reserved 26 - I'm guessing the number is decremented every time you add a new option
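The guess about the number being decremented can be illustrated with a bit of arithmetic. This sketch assumes the flags are one-bit fields packed into a single 32-bit word, with "reserved" being whatever is left over; the flag counts below are examples, not the actual counts from any particular tree.

```python
def reserved_bits(named_flags: int, word_bits: int = 32) -> int:
    """Padding bits left after `named_flags` one-bit flags in a `word_bits` word."""
    if not 0 <= named_flags <= word_bits:
        raise ValueError("flag count does not fit in the word")
    return word_bits - named_flags


# Example counts only: adding one flag shrinks the padding by one bit.
with_six_flags = reserved_bits(6)
with_seven_flags = reserved_bits(7)
```

So a hunk written against a tree with one more flag than yours will expect a "reserved" value that is one lower, and fail to apply.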
(In reply to madcatx from comment #47)
> Should we apply it over the "v3" fix attempt or on a clean source?
Clean source, but I forgot I had some more patches on top of the Linus' tree applied.
Mike did that right.
Created attachment 119911 [details]
Dmesg with runpm off - so switcheroo is enabled properly
Created attachment 119921 [details]
Dmesg with "Debug" patch applied
Created attachment 119931 [details]
ACPIPHP / radeon: Debug (and possibly fix) switcheroo problem
This patch contains the ACPIPHP part too, so hopefully it will help.
Please apply instead of the previous one (should apply cleanly on top of 3.13-rc5), retest and report back (please attach dmesg after a single attempt to switch graphics in any case).
Created attachment 119941 [details]
Seems to work :D
That seems to work for me with radeon.runpm=1
So the system successfully powers up the card only when DRI_PRIME=1 is set when running an application (after xrandr --setprovideroffloadsink radeon Intel)
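If you want to launch test applications on the discrete card from a script rather than typing the variable each time, a small helper like this might be handy. It is a sketch only: prime_env is a made-up name, and it does nothing more than copy an environment and add DRI_PRIME=1 as described above.

```python
import os
from typing import Dict, Mapping, Optional


def prime_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and request rendering on the offload GPU."""
    env = dict(os.environ if base is None else base)
    env["DRI_PRIME"] = "1"
    return env


# The caller's mapping is left untouched; only the copy gains the variable.
original = {"PATH": "/usr/bin"}
offload = prime_env(original)
```

The result can be passed as the env argument of subprocess.run() when starting the application.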
Thanks for this!
Do you think it'll land in 3.13?
Created attachment 119951 [details]
Dmesg with "v4" fix applied
Brilliant! This seems to work for me too. I can finally drop below 10 Watts again:) I'll do some more testing and report back if I come across anything odd. Thanks for taking care of this.
Created attachment 119961 [details]
ACPIPHP / radeon: Fix VGA switcheroo problem related to hotplug events
Thanks for testing!
That was a debug-only version of the patch, though, because the no_hotplug flag also needs to be checked in trim_stale_devices(). The attached one is a candidate for the final version (in addition to extending the ACPIPHP changes, I removed the debug output from it and added a comment explaining what's going on to radeon_atpx_detect()).
Please test this one and report back.
Created attachment 119971 [details]
Powering up and down automatically just fine
Feel free to add my tested by
Looks like we need a similar patch for DSM on nvidia laptops. See bug 64891.
Everything seems fine with the hopefully-final version. Good job!
OK, I'll send the patch to mailing lists later today. Many thanks to everyone involved!
@Alex: I'll have a look at that one too.
I'm running into an issue with my 7970M/Intel muxless in which the discrete GPU doesn't actually power down once I've started X.
With acpiphp disabled I don't get the errors that were indicated in https://bugzilla.kernel.org/show_bug.cgi?id=65761, however vgaswitcheroo/switch remains on DynPwr and never goes to DynOff (until I kill X, anyway).
I tried Rafael's patch in the hopes that this might resolve the issue, but it doesn't seem to have done so -- still stuck in DynPwr when X is started.
Is this patch specific to certain Radeon models, i.e., would it not work with radeonsi? My first guess was no, it shouldn't, but I don't know all that much, so I figured I'd ask. :)
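For anyone collecting the DynPwr / DynOff state for a report, the lines in vgaswitcheroo/switch can be split up as below. This is a sketch that assumes the usual colon-separated layout (index, client type, active marker, power state, PCI name); the sample lines and PCI addresses are made up.

```python
from typing import NamedTuple


class SwitcherooClient(NamedTuple):
    index: int
    kind: str      # e.g. "IGD" or "DIS"
    active: bool   # True when the line carries the '+' marker
    power: str     # e.g. "Pwr", "Off", "DynPwr" or "DynOff"
    pci_name: str


def parse_switch_line(line: str) -> SwitcherooClient:
    # The PCI name itself contains colons, so split only the first four fields.
    index, kind, active, power, pci_name = line.rstrip("\n").split(":", 4)
    return SwitcherooClient(int(index), kind, active == "+", power, pci_name)


# Illustrative lines, not copied from any attachment in this report.
sample = ["0:IGD:+:Pwr:0000:00:02.0", "1:DIS: :DynPwr:0000:01:00.0"]
clients = [parse_switch_line(entry) for entry in sample]
discrete = next(c for c in clients if c.kind == "DIS")
```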
Created attachment 119991 [details]
This is the dmesg output from my 7970M with 3.13rc5 + acpiphp.disable + rafael's latest patch from this bug report
Also both radeon.dpm=1 and radeon.runpm=1 were set in grub
Created attachment 120001 [details]
Also acpidump from 7970M
(In reply to Jack from comment #61)
> Hey guys
> I'm running into an issue with my 7970M/Intel muxless in which the discrete
> GPU doesn't actually power down once I've started X.
The patch in this entry only fixes the problem where ACPIPHP is involved. Moreover, I suppose that the failing removal of radeon may also play a role here, so that bug should be addressed first. Please continue to use bug #65761 to track the issues you have reported.
Patch submitted to mailing lists: https://patchwork.kernel.org/patch/3414401/
Mike, madcatx, since the problem w/ nouveau in bug #64891 is slightly different, I modified the patch slightly and the current version is at:
Can you please double check if it still fixes the problem for you?
Hi Rafael, yes the patch still works - I've not done any HDMI testing at all though - would you like me to hook my laptop up to the TV?
Lastly I think I can see a bug / warning appear whilst the system is shutting down - unfortunately due to systemd being so quick I don't actually see the error - everything is compiled in on my system if that makes a difference
Everything is looking fine with 3.13-rc6 and the latest fix.
@Mike: Perhaps journald will have the problem logged?
(In reply to Mike Lothian from comment #68)
> Hi Rafael, yes the patch still works - I've not done any HDMI testing at all
> though - would you like me to hook my laptop up to the TV?
No, thanks, that's fine.
Thanks for testing!
Unclear if this issue is resolved or not,
because parts of patch above were applied, then reverted:
Author: Bjorn Helgaas <firstname.lastname@example.org>
Date: Wed Sep 10 15:30:08 2014 -0600
ACPIPHP / radeon / nouveau: Remove acpi_bus_no_hotplug()
Revert parts of f244d8b623da ("ACPIPHP / radeon / nouveau: Fix VGA
switcheroo problem related to hotplug").
A previous commit 5493b31f0b55 ("PCI: Add pci_ignore_hotplug() to ignore
hotplug events for a device") added equivalent functionality implemented in
a different way for both acpiphp and pciehp.
Signed-off-by: Bjorn Helgaas <email@example.com>
Acked-by: Alex Deucher <firstname.lastname@example.org>
Acked-by: Rafael J. Wysocki <email@example.com>
Acked-by: Dave Airlie <firstname.lastname@example.org>
Acked-by: Rajat Jain <email@example.com>
Is 3.17 working?
Len: I had some issues starting from v3.16.4 that introduced a reimplementation of Rafael's patch.
Could you test if #86011 is affecting you and if the patch in comments fixes it?
This appears to have been broken again in 3.19. See:
On Tuesday, February 03, 2015 03:36:38 PM firstname.lastname@example.org wrote:
> --- Comment #73 from Alex Deucher <email@example.com> ---
> This appears to have been broken again in 3.19. See:
This appears to be a different bug, in the PCI core somewhere this time.
Please open a new entry for it.
Does this work in 3.18? I might be able to borrow an Intel/AMD machine and bisect as long as there is a reasonable number of commits to go through.
(In reply to Rafael J. Wysocki from comment #74)
> On Tuesday, February 03, 2015 03:36:38 PM
> firstname.lastname@example.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=61891
> > --- Comment #73 from Alex Deucher <email@example.com> ---
> > This appears to have been broken again in 3.19. See:
> > https://bugs.freedesktop.org/show_bug.cgi?id=88927
> This appears to be a different bug, in the PCI core somewhere this time.
> Please open a new entry for it.
Looks like there already is one:
Created attachment 166951 [details]
Patch to fix the missed ignore_hotplug flag on some radeon pci devices
I still think that the bug https://bugs.freedesktop.org/show_bug.cgi?id=88927 is related to the original issue.
I'm the original reporter of the freedesktop bug and I did some more testing on my machine. As far as I can tell the acpiphp_glue.c:slot_no_hotplug function doesn't go deep enough to check the flag 'ignore_hotplug' set by the radeon driver.
Other functions go through the pci_dev->subordinate devices as well. I made a small patch to try this approach for the slot_no_hotplug function and it fixed this problem on my machine.
I'm no kernel developer and I don't have any knowledge on the pci driver system, but maybe my small patch can help to make a proper fix for this problem.
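The difference between the shallow and the deep check can be shown with a toy device tree. This is not the attached patch, only a sketch of the approach it describes: PciDev and any_ignore_hotplug are hypothetical names, and "subordinate" here simply means the devices sitting behind a bridge.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PciDev:
    name: str
    ignore_hotplug: bool = False
    subordinate: List["PciDev"] = field(default_factory=list)  # devices behind a bridge


def any_ignore_hotplug(dev: PciDev) -> bool:
    """True if dev or anything below it asks for hotplug events to be ignored."""
    if dev.ignore_hotplug:
        return True
    return any(any_ignore_hotplug(child) for child in dev.subordinate)


gpu = PciDev("gpu", ignore_hotplug=True)     # flag set by the GPU driver
bridge = PciDev("bridge", subordinate=[gpu])

shallow = bridge.ignore_hotplug      # looks at the bridge only and misses the flag
deep = any_ignore_hotplug(bridge)    # walks down and finds it on the GPU
```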
Created attachment 169601 [details]
PCI / hotplug: Propagate the "ignore hotplug" setting to parent
Can you please check if this patch helps too?
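For testers wondering what "propagate to parent" means in practice, here is a toy model of the idea. It is a sketch under assumptions, not the patch itself: PciNode and mark_ignore_hotplug are made-up names, and only the immediate parent bridge is touched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PciNode:
    name: str
    parent: Optional["PciNode"] = None  # upstream bridge, if any
    ignore_hotplug: bool = False


def mark_ignore_hotplug(dev: PciNode) -> None:
    """Flag dev and its immediate parent bridge as ignoring hotplug events."""
    dev.ignore_hotplug = True
    if dev.parent is not None:
        dev.parent.ignore_hotplug = True


root = PciNode("root")
bridge = PciNode("bridge", parent=root)
gpu = PciNode("gpu", parent=bridge)
mark_ignore_hotplug(gpu)
```

With the flag copied up to the bridge, a check that only looks at the bridge no longer needs to walk down to the GPU to find it.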
no response in a month.
please re-open if this is still an unresolved issue.
(In reply to Rafael J. Wysocki from comment #78)
> Created attachment 169601 [details]
> PCI / hotplug: Propagate the "ignore hotplug" setting to parent
> Can you please check if this patch helps too?
Sorry for the late answer.
Yes, this is still an issue (just tried the latest kernel version 4.0.0-rc7).
I tried your patch (attachment 169601 [details]) and it also fixes the problems on my machine.
I also tried the last patch from Rafael J. Wysocki, on kernel 3.19.0-15-generic on Xubuntu 15.04. Looks like it's working for me too, my laptop is an Acer Aspire TimelineX 4820TG with mixed integrated Intel graphics and ATI Mobility Radeon HD5470.
Created attachment 177371 [details]
PCI / hotplug / ACPI: Check ignore_hotplug for devices without ACPI companions
Maybe we don't need to propagate ignore_hotplug to parents after all.
Anyone with a reproducer, can you please check if this patch helps too?
> Anyone with a reproducer, can you please check if this patch helps too?
I will asap. Should I try your last patch alone, or should I try it on top of 169601, which I'm currently using?
Alone, please. Applies on top of 4.1-rc4, not sure about earlier kernels.
(In reply to Rafael J. Wysocki from comment #85)
> Alone, please. Applies on top of 4.1-rc4, not sure about earlier kernels.
Unfortunately I'm only able to test on top of the Ubuntu kernel, so I tested the patch alone on top of 3.19.0-18. It does not work; the symptoms are the same as with the unpatched kernel (which were originally described here https://bugs.freedesktop.org/show_bug.cgi?id=88927).
I'm reverting to the kernel with patch 169601, which hasn't given me any problems since I tried it at the beginning of May.
Let's wait for tiagdtd-lava's test, maybe he's able to try on top of the latest kernel rc.
Thanks for the testing, this should be fine.
I believe we should just use the patch from Comment #78 then.
The patch proposed at Comment #78 helped me a lot.
I don't want to update my Linux kernel until the fix is included in it.
Does anyone know if the currently distributed linux-image-3.19.0-20-generic contains this fix? If not, when is this expected to happen? If that is unknown, what should I track to notice when it does?
Ping. Rafael, any chance you can send the fix upstream?
Created attachment 182001 [details]
I am encountering a similar acpiphp bug with my Mobility Radeon HD 4330 [RV710] after applying the patch from a report which I filed against the radeon driver:
The patch from this report (comment #78) does not resolve the problem in my case.
The patch from comment #78 shipped in Linux 4.2-rc1:
Author: Rafael J. Wysocki <firstname.lastname@example.org>
Date: Mon Apr 13 16:23:36 2015 +0200
PCI: Propagate the "ignore hotplug" setting to parent
re: comment #90
If I understand it, that is a similar bug, but not the same as this one,
and it is fixed by a patch in the referenced radeon bug report.
So I'm closing this bug -- please re-open if I misunderstood.
*** Bug 67461 has been marked as a duplicate of this bug. ***