Bug 79701

Summary: Dual AMD graphics systems broken by PCIe hotplug in kernel 3.15+
Product: Drivers
Reporter: Jose P. (lbdkmjdf)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX
Severity: normal
CC: a818958, airlied, alexdeucher, bjorn, falsenick, funfunctor, jajadekroon, kristian, kristofer.rye, mfitzpatrick, pali, rajatxjain, rjw, rui.zhang, shawn.starr, tianyu.lan, ying.huang
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.15
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
dmesg
Suggested patch
dmsg commit b1811d2455f32754cc3d8725bf2e961c5eda2a72
Test patch - to verify if the problem has been diagnosed correctly
dmesg for pcie_ports=compat & for patch
ACPI Dump - HP Pavilion dv6-6145ca
test patch
dmesg patch 2
dmesg 3.17-rc6
dmesg messages
dmesg after resume from suspend

Description Jose P. 2014-07-08 21:50:43 UTC
Created attachment 142551 [details]
dmesg

Systems that have dual AMD graphics (APU + dedicated GPU) and use the radeon open source drivers become unusable with kernel 3.15-rc and later, freezing every few seconds.

I talked to a radeon dev (Alex Deucher, agd5f at #radeon in irc.freenode.net) about this problem and, if I understood correctly, he said it's a problem with PCIEHP. There is a patch to fix this, but it's for another module, ACPIPHP. For some reason, PCIEHP is loading (instead of ACPIPHP? not sure) in kernels 3.15+, and this module has not been patched.

This is the patch: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f244d8b623dae7a7bc695b0336f67729b95a9736

A workaround is to add "radeon.runpm=0" to the kernel command line.
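
To confirm after a reboot that the workaround is really on the running kernel's command line, the contents of /proc/cmdline can be checked. The following is only a minimal Python sketch: the helper name kernel_param and the sample command line are made up for illustration, and quoted values containing spaces are not handled.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` in a kernel command line string.

    Returns None if the parameter is absent and "" for a bare flag.
    Hypothetical helper for illustration only.
    """
    value = None
    for token in cmdline.split():
        key, _sep, val = token.partition("=")
        if key == name:
            value = val  # keep the last occurrence seen
    return value


# Illustrative command line; on a real system read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz-3.15.5 root=/dev/sda2 ro quiet radeon.runpm=0"
print(kernel_param(sample, "radeon.runpm"))  # prints: 0
```

On a live system the string would come from open("/proc/cmdline").read(); a result of None means the workaround was not passed to the kernel.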

Logs from IRC, sorted by date (I'm "pepee"):
http://people.freedesktop.org/~cbrill/dri-log/?channel=radeon&date=2014-06-09#t-1138
http://people.freedesktop.org/~cbrill/dri-log/?channel=radeon&date=2014-06-24#t-1847 (read the last 4 lines)
http://people.freedesktop.org/~cbrill/dri-log/?channel=radeon&date=2014-07-08#t-1329

Here are more reports:
https://bbs.archlinux.org/viewtopic.php?pid=1431648
https://bbs.archlinux.org/viewtopic.php?pid=1433713

Hope this helps.
Comment 1 Jan Jasper de Kroon 2014-07-14 10:52:56 UTC
This bug also affects me.
I have a Fedora 20 system with the 3.16-rc4 kernel.
The 'radeon.runpm=0' workaround works for me.
This suggests there may be some problems with power management handling on these types of APUs.
Comment 2 Jose P. 2014-07-14 11:13:02 UTC
I disabled "OSC" from my BIOS (unlocked BIOS) as suggested here: http://www.phoronix.com/forums/showthread.php?103481-Linux-3-16-rc5-Kernel-Released&p=428972#post428972 and, so far, I'm running 3.16-rc2 without any problems.
Comment 3 Kristofer Rye 2014-07-17 16:26:00 UTC
This bug affects me, as well.

I am on an HP Pavilion dv6-6c00 with a Radeon HD 6600M.

I reinstalled Fedora and this bug affects me on kernel 3.15.5. I booted into Windows and reimaged the BIOS of my computer to the latest version provided by HP, to no avail. Kernels up through 3.14 worked, but 3.15 doesn't work for me.

Adding "radeon.runpm=0" to the boot command causes it to progress further in the boot process, but at the point that it normally goes into the login manager for me (i.e. when it starts Xorg), I get a kernel panic and pciehp is mentioned in the Call Trace.

I can't find an unlocked BIOS that will let me enable those things, and I feel like the kernel should support my hardware without the requirement of BIOS hacks, so trying to hack my BIOS is out of the question for now.
Comment 4 Jose P. 2014-07-23 09:36:41 UTC
Does https://bugzilla.kernel.org/show_bug.cgi?id=79621 have anything to do with this bug? If so, can anyone test the patch? (I'm unable to do it.)
Comment 5 Kristofer Rye 2014-07-23 13:34:39 UTC
It's very possible that that has something to do with it. Unfortunately, I can't test it either.
Comment 6 Jose P. 2014-08-05 19:15:53 UTC
@devs: is there any other workaround to either disable pciehp or go back to the old behavior?
Blacklisting it doesn't help. Searching on Google, there seem to be some similar (old, unrelated to radeon) reports, and each one of them needed a kernel patch instead of a simple module option/workaround. Can you code a way to completely disable pciehp, for everyone to use in cases like this?
Just FYI, this issue is still present in kernels 3.16.0 and 3.15.8, rendering the system almost unusable, and disabling _OSC makes a bunch of different bugs (not related to pciehp) appear...
Comment 7 SpacemanSpiff 2014-08-17 06:03:25 UTC
Same problem with HP dv6z with 6755g2. 
My dgpu keeps turning on and off, desktop freezes every few seconds and 2 kworkers are using up 2 cores
https://bugs.archlinux.org/task/38980#comment124837

3.14 lts works great.
Comment 8 Kristian Klausen 2014-08-17 08:34:45 UTC
(In reply to Jose P. from comment #4)
> Does https://bugzilla.kernel.org/show_bug.cgi?id=79621 have anything to do
> with this bug? If so, can anyone test the patch? (I'm unable to do it.)
Just tested with 3.17-rc1: same problem.
If anyone else wants to test, I have uploaded the binary and PKGBUILD to: http://188.228.31.139/dl/aur/linux/
Comment 9 Alex Deucher 2014-08-21 17:31:21 UTC
(In reply to SpacemanSpiff from comment #7)
> Same problem with HP dv6z with 6755g2. 
> My dgpu keeps turning on and off, desktop freezes every few seconds and 2
> kworkers are using up 2 cores
> https://bugs.archlinux.org/task/38980#comment124837
> 
> 3.14 lts works great.

Can you bisect to see what commit changed the hotplug behavior?
Comment 10 Alan 2014-08-21 18:12:59 UTC
*** Bug 82071 has been marked as a duplicate of this bug. ***
Comment 11 Edward O'Callaghan 2014-08-22 09:22:46 UTC
Created attachment 147731 [details]
Suggested patch

Some feedback would be appreciated; I am very unfamiliar with these subsystems.
Comment 12 Shawn Starr 2014-08-23 04:21:32 UTC
I can reproduce this just by triggering a manual GPU reset:

cat /sys/kernel/debug/dri/0/radeon_gpu_reset

This induces a reset, which throws:

Aug 23 00:10:02 segfault kernel: [173022.968555] radeon 0000:01:00.0: GPU softreset: 0x00000040
Aug 23 00:10:02 segfault kernel: [173022.968769] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
Aug 23 00:10:02 segfault kernel: [173022.969047] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
Aug 23 00:10:02 segfault kernel: [173022.969307] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
Aug 23 00:10:02 segfault kernel: [173022.969575] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.969839] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.970116] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.970363] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
Aug 23 00:10:02 segfault kernel: [173022.970622] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Aug 23 00:10:02 segfault kernel: [173023.050288] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00002000
Aug 23 00:10:02 segfault kernel: [173023.052585] radeon 0000:01:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
Aug 23 00:10:02 segfault kernel: [173023.052799] radeon 0000:01:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
Aug 23 00:10:02 segfault kernel: [173023.053033] radeon 0000:01:00.0:   R_000E50_SRBM_STATUS      = 0x200000C0
Aug 23 00:10:02 segfault kernel: [173023.053282] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053495] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053689] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053905] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80100000
Aug 23 00:10:02 segfault kernel: [173023.054142] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Aug 23 00:10:02 segfault kernel: [173023.054386] radeon 0000:01:00.0: GPU pci config reset
Aug 23 00:10:02 segfault kernel: [173023.128744] pciehp 0000:00:01.0:pcie04: Card not present on Slot(1-1)
Aug 23 00:10:02 segfault kernel: [173023.140895] pciehp 0000:00:01.0:pcie04: Card present on Slot(1-1)
Aug 23 00:10:02 segfault kernel: [173023.288485] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Aug 23 00:10:02 segfault kernel: [173023.415478] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
Aug 23 00:10:03 segfault kernel: [173024.057551] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 23 00:10:03 segfault kernel: [173024.218245] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 23 00:10:03 segfault kernel: [173024.219467] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
Aug 23 00:10:03 segfault kernel: [173024.219693] divide error: 0000 [#1] SMP
Aug 23 00:10:03 segfault kernel: [173024.220436] Modules linked in: vhost_net vhost macvtap macvlan tun bridge stp llc arc4 uvcvideo iwldvm snd_usb_audio videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops videobuf2_core v4l2_common snd_rawmidi mmc_block videodev media coretemp kvm_intel sdhci_pci iTCO_wdt mac80211 snd_hda_codec_conexant snd_hda_codec_generic iTCO_vendor_support kvm sdhci mmc_core r592 memstick microcode i2c_i801 snd_hda_intel snd_hda_controller iwlwifi cfg80211 snd_hda_codec thinkpad_acpi lpc_ich mfd_core wmi snd_hwdep shpchp snd_seq tpm_tis mei_me mei tpm snd_seq_device rfkill snd_pcm snd_timer snd soundcore video acpi_cpufreq binfmt_misc sunrpc radeon i2c_algo_bit drm_kms_helper e1000e ttm drm ptp pps_core
Aug 23 00:10:03 segfault kernel: [173024.223037] CPU: 1 PID: 28358 Comm: Xorg.bin Not tainted 3.17.0-0.rc1.git0.1.fc22.x86_64 #1
Aug 23 00:10:03 segfault kernel: [173024.223037] Hardware name: LENOVO 4058CTO/4058CTO, BIOS 6FET93WW (3.23 ) 10/12/2012
Aug 23 00:10:03 segfault kernel: [173024.223037] task: ffff8802181cf500 ti: ffff8801fd648000 task.ti: ffff8801fd648000
Aug 23 00:10:03 segfault kernel: [173024.223037] RIP: 0010:[<ffffffffa013c31a>]  [<ffffffffa013c31a>] r6xx_remap_render_backend+0x6a/0xe0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] RSP: 0018:ffff8801fd64bbd8  EFLAGS: 00010246
Aug 23 00:10:03 segfault kernel: [173024.223037] RAX: 0000000000000002 RBX: 00000000ffffffff RCX: 0000000000000002
Aug 23 00:10:03 segfault kernel: [173024.223037] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000002
Aug 23 00:10:03 segfault kernel: [173024.223037] RBP: ffff8801fd64bc10 R08: 00000000000000ff R09: 0000000000000565
Aug 23 00:10:03 segfault kernel: [173024.223037] R10: 0000000000000000 R11: 0000000000000565 R12: 0000000080000000
Aug 23 00:10:03 segfault kernel: [173024.223037] R13: 00000000000000ff R14: 0000000000000000 R15: 0000000000000000
Aug 23 00:10:03 segfault kernel: [173024.223037] FS:  00007ff86036a9c0(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
Aug 23 00:10:03 segfault kernel: [173024.223037] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 23 00:10:03 segfault kernel: [173024.223037] CR2: 00007fea6e66d000 CR3: 00000000a0893000 CR4: 00000000000427e0
Aug 23 00:10:03 segfault kernel: [173024.223037] Stack:
Aug 23 00:10:03 segfault kernel: [173024.223037]  ffff8800bf6a0000 0000000200000200 ffff8800bf6a0000 000000000000c352
Aug 23 00:10:03 segfault kernel: [173024.223037]  00000000ffffffff 000000000000cb52 0000000000ffff00 ffff8801fd64bc60
Aug 23 00:10:03 segfault kernel: [173024.223037]  ffffffffa013f5ac ffffffff04ea0000 00000000ffffffff 00000000af584ada
Aug 23 00:10:03 segfault kernel: [173024.223037] Call Trace:
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa013f5ac>] r600_startup+0x7ec/0x1b60 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa0140953>] r600_resume+0x33/0x70 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa00e9bf1>] radeon_gpu_reset+0x131/0x2c0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa011c10e>] radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa011cca0>] radeon_gem_wait_idle_ioctl+0x100/0x150 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa0019e5f>] drm_ioctl+0x1df/0x680 [drm]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffff810bf420>] ? wake_up_state+0x10/0x20
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffffa00e704c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffff8121b5c0>] do_vfs_ioctl+0x2d0/0x4b0
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffff81738ecf>] ? __schedule+0x2ef/0x840
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffff8121b821>] SyS_ioctl+0x81/0xa0
Aug 23 00:10:03 segfault kernel: [173024.223037]  [<ffffffff8173de29>] system_call_fastpath+0x16/0x1b
Aug 23 00:10:03 segfault kernel: [173024.223037] Code: b6 ed 45 09 c5 41 80 fd ff 45 0f 44 e8 d3 e7 89 7d d4 44 89 ef e8 97 ff ff ff 8b 4d d4 41 29 c7 44 39 f9 72 6c 89 c8 31 d2 89 cf <41> f7 f7 44 0f af f8 89 c6 48 8b 45 c8 44 29 ff 83 b8 68 01 00
Aug 23 00:10:03 segfault kernel: [173024.223037] RIP  [<ffffffffa013c31a>] r6xx_remap_render_backend+0x6a/0xe0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037]  RSP <ffff8801fd64bbd8>
Aug 23 00:10:03 segfault kernel: [173024.314798] ---[ end trace 76082dc70d248257 ]---
Aug 23 00:10:03 segfault kernel: [173024.390074] pciehp 0000:00:01.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
Aug 23 00:10:03 segfault kernel: [173024.390343] pciehp 0000:00:01.0:pcie04: Cannot add device at 0000:01:00
Comment 13 Shawn Starr 2014-08-23 04:22:19 UTC
This occurs in 3.17-rc1 currently
Comment 14 Edward O'Callaghan 2014-08-23 05:57:56 UTC
(In reply to Shawn Starr from comment #12)
> I can reproduce this just by triggering a manual GPU reset:
> 
> cat /sys/kernel/debug/dri/0/radeon_gpu_reset

That could perhaps be a different issue actually, certainly not good though!
Comment 15 Kristian Klausen 2014-08-23 10:22:55 UTC
(In reply to Edward O'Callaghan from comment #11)
> Created attachment 147731 [details]
> Suggested patch
> 
> Some feedback would be appreciated; I am very unfamiliar with these subsystems.

Can't get it to compile with 3.16.1.

  LD      drivers/net/wireless/built-in.o
  LD      drivers/net/built-in.o
Makefile:901: recipe for target 'drivers' failed
make: *** [drivers] Error 2
Comment 16 Alex Deucher 2014-08-25 18:38:06 UTC
(In reply to Shawn Starr from comment #12)
> I can reproduce this just by triggering a manual GPU reset:
> 

This is unrelated to this bug.
Comment 17 Edward O'Callaghan 2014-08-26 18:28:05 UTC
Alex Deucher,

Can we get this bug labeled as a regression and confirmed, since it is both?

I would also vote for bumping up the importance, given that this can result in a $1000 laptop going down the drain unless the BIOS catches the heating and manages to power it off in time.

Ta,
Comment 18 Alex Deucher 2014-08-26 18:37:47 UTC
Only the person that opened the bug can mark it as a regression.  As to dynamically powering off the dGPU, support for that was added relatively recently; before that it was always left on, so I'm not sure it's really that big of a problem.

Can someone with an affected system bisect?  It would be helpful to identify what commit caused the change.
Comment 19 sergey 2014-08-26 20:41:38 UTC
I have pcie_aspm=off in grub.cfg, so no more freezes.
I still sometimes have problems during boot (black screen only). I don't know whether that is related to this bug or not.
Comment 20 Alex Deucher 2014-08-26 20:50:54 UTC
(In reply to sergey from comment #19)
> I have pcie_aspm=off in grub.cfg, so no more freezes.
> I still sometimes have problems during boot (black screen only). I don't
> know whether that is related to this bug or not.

Unless you are seeing an oops related to pciehp trying to unload the driver while it's running, you are seeing something else.
Comment 21 SpacemanSpiff 2014-08-28 14:12:46 UTC
Bisect result. phew


02e93a8a7c1dcecc1a33ea762a0c041cbb6a0a66 is the first bad commit
commit 02e93a8a7c1dcecc1a33ea762a0c041cbb6a0a66
Author: Rajat Jain <rajatxjain@gmail.com>
Date:   Tue Feb 4 18:30:21 2014 -0800

    PCI: pciehp: Don't check adapter or latch status while disabling
    
    It does not make much sense to refuse to disable a slot if an adapter is
    not present or the latch is open. If an adapter is not present, it provides
    an even better reason to disable the device slot.
    
    This is specially a problem for link state hot-plug, because some ports use
    in band mechanism for presence detection. Thus when link goes down,
    presence detect also goes down. We _want_ that the removal should take
    place in such case.
    
    Thus remove the checks for adapter and latch in pciehp_disable_slot()
    
    Signed-off-by: Rajat Jain <rajatxjain@gmail.com>
    Signed-off-by: Rajat Jain <rajatjain@juniper.net>
    Signed-off-by: Guenter Roeck <groeck@juniper.net>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 4db507fb235c6ded307a6160347b5e79b28c58b5 e54ea37ad48a9e9ca0567ca7a3b26e0193c62e53 M	drivers
Comment 22 Rajat Jain 2014-08-30 02:24:20 UTC
Hi,

I looked at this and wanted to share my observations:

The Basic Issue
==============
There are a bunch of quick hotplug events (unplug followed by the hot-plug) that are received by the hotplug driver. While both the hotplug drivers (pciehp and acpiphp) are fine with it, the radeon driver itself is probably not equipped enough to handle them so well?
[   41.224428] trying to unbind memory from uninitialized GART !

When acpiphp was being used
=======================
As Rafael mentions in this commit log, this is a problem with the VGA subsystem, that requires the hot-plug driver to ignore such hot-plug events associated with a slot that connects to such known Radeon controllers. This was done for acpiphp by introducing a "no_hotplug" flag for the ACPI:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f244d8b623dae7a7bc695b0336f67729b95a9736
The above commit would fix the problem if the acpiphp is used, by ignoring the hot-plug events for that slot.

Switch to using pciehp
=================
1) For some reason, the system now seems to use pciehp for these slots instead of acpiphp. (Can someone please tell me whether this looks OK? I only ask because I see the log message from Rafael's commit being printed, which seems to indicate that he expects acpiphp to control this slot.) But I also see that pciehp has already grabbed the slot by the time this message gets printed:
[    4.419180] VGA switcheroo: detected switching method \_SB_.PCI0.VGA_.ATPX handle

Even with pciehp, things were still all right until commit 02e93a8a7, because pciehp used to ignore hot-unplug events (including loss-of-presence-detect and link-down) if (1) SURPRISE removal is not supported or (2) the ADAPTER is not present (which is what this commit addresses). Thus, when the hot-unplug event came, pciehp_disable_slot() found no adapter and refused to do anything.

Why problem started with pciehp
==========================
Essentially, commits 02e93a8a7 and 2b3940b60 made pciehp handle all hot-unplug events (loss-of-presence-detect and link-down) irrespective of whether SURPRISE removal is supported, and even if the ADAPTER is not present. Now, I would think that both these commits are still valid, because it makes no sense to ignore an unplug event (and let the kernel continue with stale data structures) just because SURPRISE is not set or the ADAPTER is not present (the latter is an even better reason to process the unplug event).
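
The behaviour change described in this section can be illustrated with a small model. This is only a Python sketch of the logic as summarized in this comment, not the kernel code: the function name and its parameters are invented for the illustration, and the real pciehp code paths are more involved.

```python
def unplug_is_processed(adapter_present, surprise_supported, after_commits):
    """Simplified model of whether pciehp acts on a hot-unplug event.

    Per the description in this comment: before commits 02e93a8a7 and
    2b3940b60 the event was dropped if surprise removal was unsupported
    or no adapter was present; afterwards every unplug event is processed.
    All names here are hypothetical.
    """
    if after_commits:
        return True
    return surprise_supported and adapter_present


# A dGPU that has just been powered down no longer shows up as present.
old = unplug_is_processed(adapter_present=False, surprise_supported=True,
                          after_commits=False)
new = unplug_is_processed(adapter_present=False, surprise_supported=True,
                          after_commits=True)
print(old, new)  # prints: False True
```

In this model the old code happened to leave the powered-down dGPU alone, while the new code goes on to process the removal even though the radeon driver is still bound to the device.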

My recommendations / Options
========================
1) I would first like an opinion on whether it is OK to see the pciehp handle these hotplug slots. The radeon code seems to be ACPI intensive, and Rafael's commit also seems to say that this was supposed to be handled by acpiphp.

2) If it is expected to continue using pciehp, maybe we could handle it in the same way as Rafael did for acpiphp. We could add a flag in the pci_dev ("ignore_hp_events" or something) and set it for the hot-pluggable slot from the radeon code, just like acpi_bus_no_hotplug() is called today.

I'll be away on vacation for 3 days, and would be glad to submit a patch if needed.

One question for the gentleman who bisected this (SpacemanSpiff): would it be possible for you to look for the following messages while trying out the image just before commit 02e93a8a7c?
...
No adapter on slot(2)
...

Thanks & Best Regards,

Rajat Jain
Comment 23 SpacemanSpiff 2014-08-30 07:27:50 UTC
Created attachment 148801 [details]
dmsg commit b1811d2455f32754cc3d8725bf2e961c5eda2a72
Comment 24 SpacemanSpiff 2014-08-30 07:35:16 UTC
Yes, I get it. dmesg attached.

[   64.138675] pciehp 0000:00:02.0:pcie04: Card present on Slot(2)
[   64.138792] pciehp 0000:00:02.0:pcie04: slot(2): Link Up event
[   64.242002] pciehp 0000:00:02.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
[   64.242017] pciehp 0000:00:02.0:pcie04: Cannot add device at 0000:01:00

....

[   71.866201] pciehp 0000:00:02.0:pcie04: Card not present on Slot(2)
[   71.866354] pciehp 0000:00:02.0:pcie04: slot(2): Link Down event
[   71.866519] pciehp 0000:00:02.0:pcie04: No adapter on slot(2)        


This was the last kernel I tested that passed during the bisect. Today, 3 times out of 3, I was not able to log in using kdm. The screen just went blank and I had to use tty2. The last three lines were printed then. Weirdly, I didn't get this problem yesterday during the bisect.
Anyway, the dGPU is reported to be off by /sys/kernel/debug/vgaswitcheroo/switch.
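
For anyone comparing their own logs, the pciehp presence messages shown above can be picked out of a dmesg dump mechanically. The following is a rough Python sketch: the regular expression is derived only from the message formats quoted in this report, and the function names are made up for the example.

```python
import re

# Matches pciehp presence messages such as
# "[   64.138675] pciehp 0000:00:02.0:pcie04: Card present on Slot(2)".
PRESENCE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pciehp (?P<port>\S+): "
    r"Card (?P<absent>not )?present on Slot\((?P<slot>[^)]+)\)"
)


def presence_events(lines):
    """Yield (timestamp, port, slot, present) for each presence message."""
    for line in lines:
        m = PRESENCE_RE.search(line)
        if m:
            yield float(m["ts"]), m["port"], m["slot"], m["absent"] is None


def count_toggles(lines):
    """Count, per slot, how often the reported presence state changed."""
    last, toggles = {}, {}
    for _ts, _port, slot, present in presence_events(lines):
        if slot in last and last[slot] != present:
            toggles[slot] = toggles.get(slot, 0) + 1
        last[slot] = present
    return toggles


log = [
    "[   64.138675] pciehp 0000:00:02.0:pcie04: Card present on Slot(2)",
    "[   64.138792] pciehp 0000:00:02.0:pcie04: slot(2): Link Up event",
    "[   71.866201] pciehp 0000:00:02.0:pcie04: Card not present on Slot(2)",
]
print(count_toggles(log))  # prints: {'2': 1}
```

A slot whose toggle count keeps climbing while the system is idle would match the "dGPU keeps turning on and off" symptom reported earlier in this thread.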
Comment 25 Rajat Jain 2014-09-03 01:07:48 UTC
Created attachment 149101 [details]
Test patch - to verify if the problem has been diagnosed correctly
Comment 26 Rajat Jain 2014-09-03 01:17:51 UTC
Hi,

I believe there are a few ways to make forward progress on this issue:

1) Debug why radeon and other graphics card drivers ask / require the hotplug drivers to ignore these hotplug events, and fix this basic issue in these VGA drivers.

2) If the hotplug events from these slots are to be ignored, why mark them as hot-pluggable slots in the first place? Maybe a platform bug?

3) Use acpiphp for these slots (it already ignores these hotplug events). Maybe that is the intent; at least that's what it seems like to me. So we may want to debug why pciehp and not acpiphp is controlling these slots. If someone could please try out the "pcie_ports=compat" kernel argument, which disables pciehp and hopefully will force acpiphp to take ownership of these slots, that would be great.

4) Write a patch to let pciehp also ignore these hotplug events. I have attached a quick and dirty sample patch but don't have the hardware to try it out. If someone can test it for me, that would be a great help! (Please note that this is just a verification patch to see if the problem goes away in such a manner; this is not the final patch.)

I guess at this point, I will just wait for inputs from Bjorn or Rafael on how to go about this.

Thanks,

Rajat
Comment 27 SpacemanSpiff 2014-09-03 15:18:30 UTC
Created attachment 149191 [details]
dmesg for pcie_ports=compat & for patch
Comment 28 SpacemanSpiff 2014-09-03 15:20:57 UTC
pcie_ports=compat worked; I was able to use it with the latest linux git.
The patch didn't work for me. Perhaps someone else can also test it to verify.

I have attached dmesg for both.
Comment 29 Rajat Jain 2014-09-04 13:25:40 UTC
Hi SpacemanSpiff,

Thanks a lot for the testing. Since it works all right when the acpiphp handles it, I think my diagnosis was correct.

I do realize there is a bug in my patch: I set "ignore_hotplug_events" in the VGA device's pci_dev, whereas it actually needs to be set in the parent bridge's pci_dev, since that is the hotplug slot. (Sorry, I'm not as familiar with the Radeon driver code.)
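
Why the flag has to live on the bridge can be shown with a toy model. This is a hedged Python sketch only: PciDev and handle_slot_event are invented stand-ins rather than kernel structures or functions, and the flag name is taken from the test patch mentioned in this comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PciDev:
    """Toy stand-in for a PCI device; not the kernel's struct pci_dev."""
    name: str
    parent: Optional["PciDev"] = None
    ignore_hotplug_events: bool = False  # flag name from the test patch


def handle_slot_event(bridge: PciDev, event: str) -> str:
    """The hotplug driver is bound to the bridge, so it only sees that flag."""
    if bridge.ignore_hotplug_events:
        return "ignored"
    return f"processed {event}"


bridge = PciDev("0000:00:02.0")
gpu = PciDev("0000:01:00.0", parent=bridge)

gpu.ignore_hotplug_events = True                  # what the first patch did
first = handle_slot_event(bridge, "link down")    # still processed

gpu.parent.ignore_hotplug_events = True           # set it on the slot instead
second = handle_slot_event(bridge, "link down")   # now ignored

print(first, "/", second)  # prints: processed link down / ignored
```

The point of the sketch is only the lookup direction: the event arrives at the port that owns the slot, so a flag set on the downstream VGA device is never consulted.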

I think at this time it is better if we wait for inputs from Bjorn, if he wants to take this path (introduce a flag in PCIe hotplug driver to ignore the hotplug events) or explore other options.

Thanks,

Rajat
Comment 30 Alex Deucher 2014-09-08 03:05:19 UTC
(In reply to Rajat Jain from comment #26)
> Hi,
> 
> I believe there are a few ways to make forward progress on this issue:
> 
> 1) Debug these radeon & other graphics card drivers on why do they ask /
> require the hotplug drivers to ignore these hotplug events. And fix this
> basic issue in these VGA drivers.
> 

Just a little background on what's going on on the GPU side.  There are a number of laptops (combinations of Intel, AMD, and Nvidia hardware) that contain both an integrated GPU (iGPU, lower power, lower performance) and a discrete GPU (dGPU, higher power, higher performance).  They are called PowerXpress or Enduro or Optimus depending on the vendor.  Users can use the iGPU for normal tasks and then selectively power up the dGPU when playing games, etc. when they want higher performance.  In order to save power, the dGPU can be completely powered down.  The power control for the dGPU is handled by an ACPI method.  The driver wants to stay loaded and selectively powers down the dGPU when it's idle and powers it back up on demand.  We don't want to unload the driver in this case.
Comment 31 Rajat Jain 2014-09-08 18:53:05 UTC
(In reply to Alex Deucher from comment #30)
> 
> Just a little background on what's going on on the GPU side.  There are a
> number of laptops (combinations of Intel, AMD, and Nvidia hardware) that
> contain both a integrated GPU (iGPU, lower power, lower performance) and a
> discrete GPU (dGPU, higher power, higher performance).  They are called
> PowerXpress or Enduro or Optimus depending on the vendor.  Users can use the
> iGPU for normal tasks and then selectively power up the dGPU when playing
> games, etc. when they want higher performance.  In order to save power, the
> dGPU can be completely powered down.  The power control for the dGPU is
> handled by an ACPI method.  The driver wants to stay loaded and selectively
> powers down the dGPU when it's idle and powers it back up on demand.  We
> don't want to unload the driver in this case.

Thanks for the info. Understood. It seems to me that using acpiphp for the hot-plug makes better sense since you already use ACPI for other stuff too. SpacemanSpiff confirmed that using acpiphp solves the problem (using pcie_ports=compat) because it already has the workaround for ignoring these hotplug events. We might still want to see why pciehp is getting loaded instead of acpiphp.
Comment 32 Bjorn Helgaas 2014-09-08 21:26:59 UTC
Can someone collect an acpidump for one of these systems?

(In reply to SpacemanSpiff from comment #28)
> pcie_ports=compat worked

Are you using a modified DSDT?  Your dmesg shows the same BIOS as Jose P.'s (Hewlett-Packard HP Pavilion dv6 Notebook PC/3590, BIOS F.21 09/13/2011), but his dmesg log shows three acpiphp slots registered, and your log shows none.

Since acpiphp doesn't claim anything on your system, I don't think we know whether it works or not.  Using "pcie_ports=compat" turns off all PCIe services, including PCIe hotplug, and I think that works because there's nothing paying attention to any power or hotplug events related to the device except for the explicit management done by the video driver.


Rafael, I'm confused.  pciehp_acpi_slot_detection_check() determines whether pciehp will handle the hotplug bridge.  This boils down to running pcihp_is_ejectable() on all the device handles in the subtree starting at the bridge.

Here's what it looks like to me.  The logic here seems backwards to me, but it's been this way forever, so I'm probably missing something:

  pcihp_is_ejectable:
    if handle has _EJ0 or (has _RMV and _RMV returns 1)
      pcihp_is_ejectable() returns 1
      => check_hotplug() sets *found = 1
      => acpi_pci_detect_ejectable() returns 1
      => pciehp_acpi_slot_detection_check() returns 0
      => pciehp_probe() handles hotplug events for bridge

I would think in this case we would want acpiphp to handle the bridge, not pciehp.
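
The chain above can be written down as a small Python model to make the "backwards" feeling concrete. This is only a sketch of the steps as listed in this comment, not the kernel implementation. Each handle is represented as a hypothetical `(has_ej0, has_rmv, rmv_value)` tuple, and the nonzero return value is an arbitrary stand-in for an error code.

```python
def pcihp_is_ejectable(has_ej0: bool, has_rmv: bool, rmv_value: int = 0) -> int:
    """1 if the handle has _EJ0, or has _RMV and _RMV returns 1."""
    return 1 if has_ej0 or (has_rmv and rmv_value == 1) else 0


def acpi_pci_detect_ejectable(handles) -> int:
    """Walk the handles in the subtree; models check_hotplug() setting *found."""
    found = 0
    for has_ej0, has_rmv, rmv_value in handles:
        if pcihp_is_ejectable(has_ej0, has_rmv, rmv_value):
            found = 1
            break
    return found


def pciehp_acpi_slot_detection_check(handles) -> int:
    # 0 lets pciehp_probe() go ahead; -1 is an illustrative error value.
    return 0 if acpi_pci_detect_ejectable(handles) else -1


def pciehp_claims_bridge(handles) -> bool:
    return pciehp_acpi_slot_detection_check(handles) == 0


# An ACPI-ejectable device below the bridge makes pciehp claim the bridge.
ejectable_subtree = [(False, False, 0), (True, False, 0)]
plain_subtree = [(False, True, 0)]
```

In this model, `pciehp_claims_bridge(ejectable_subtree)` is true and `pciehp_claims_bridge(plain_subtree)` is false, which is the outcome questioned above.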

In Jose P.'s dmesg log (https://bugzilla.kernel.org/attachment.cgi?id=142551) I see this, which I think means acpiphp wants to manage the slot, but pciehp barges in and claims it anyway:

  pci 0000:00:02.0: PCI bridge to [bus 01]
  acpiphp: Slot [1] registered
  pciehp 0000:00:02.0:pcie04: Slot #2 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+
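
For anyone comparing their own logs, here is a rough Python sketch that pulls the acpiphp and pciehp slot registrations out of dmesg text. The regular expressions are assumptions based only on the three lines quoted above, and the helper name `hotplug_claims` is made up for this example; other kernel versions may format these messages differently.

```python
import re

# Patterns derived from the log lines quoted in this comment (assumption).
ACPIPHP_SLOT = re.compile(r"acpiphp: Slot \[(\w+)\] registered")
PCIEHP_SLOT = re.compile(r"pciehp (\S+?):pcie\d+: Slot #(\d+)")


def hotplug_claims(dmesg_lines):
    """Return the slots announced by acpiphp and by pciehp."""
    claims = {"acpiphp": [], "pciehp": []}
    for line in dmesg_lines:
        match = ACPIPHP_SLOT.search(line)
        if match:
            claims["acpiphp"].append(match.group(1))
            continue
        match = PCIEHP_SLOT.search(line)
        if match:
            # (bridge address, slot number)
            claims["pciehp"].append((match.group(1), match.group(2)))
    return claims


sample = [
    "pci 0000:00:02.0: PCI bridge to [bus 01]",
    "acpiphp: Slot [1] registered",
    "pciehp 0000:00:02.0:pcie04: Slot #2 AttnBtn- AttnInd- PwrInd- "
    "PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+",
]
claims = hotplug_claims(sample)
```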
Comment 33 Rafael J. Wysocki 2014-09-08 21:45:26 UTC
(In reply to Bjorn Helgaas from comment #32)
> Can someone collect an acpidump for one of these systems?
> 
> (In reply to SpacemanSpiff from comment #28)
> > pcie_ports=compat worked
> 
> Are you using a modified DSDT?  Your dmesg shows the same BIOS as Jose P.'s
> (Hewlett-Packard HP Pavilion dv6 Notebook PC/3590, BIOS F.21 09/13/2011),
> but his dmesg log shows three acpiphp slots registered, and your log shows
> none.
> 
> Since acpiphp doesn't claim anything on your system, I don't think we know
> whether it works or not.  Using "pcie_ports=compat" turns off all PCIe
> services, including PCIe hotplug, and I think that works because there's
> nothing paying attention to any power or hotplug events related to the
> device except for the explicit management done by the video driver.
> 
> 
> Rafael, I'm confused.  pciehp_acpi_slot_detection_check() determines whether
> pciehp will handle the hotplug bridge.  This boils down to running
> pcihp_is_ejectable() on all the device handles in the subtree starting at
> the bridge.

I'm not sure what's going on in there.

> Here's what it looks like to me.  The logic here seems backwards to me, but
> it's been this way forever, so I'm probably missing something:
> 
>   pcihp_is_ejectable:
>     if handle has _EJ0 or (has _RMV and _RMV returns 1)
>       pcihp_is_ejectable() returns 1
>       => check_hotplug() sets *found = 1
>       => acpi_pci_detect_ejectable() returns 1
>       => pciehp_acpi_slot_detection_check() returns 0
>       => pciehp_probe() handles hotplug events for bridge
> 
> I would think in this case we would want acpiphp to handle the bridge, not
> pciehp.

acpiphp will handle the bridge if device_is_managed_by_native_pciehp() returns false.

> In Jose P.'s dmesg log
> (https://bugzilla.kernel.org/attachment.cgi?id=142551) I see this, which I
> think means acpiphp wants to manage the slot, but pciehp barges in and
> claims it anyway:
> 
>   pci 0000:00:02.0: PCI bridge to [bus 01]
>   acpiphp: Slot [1] registered

And that means device_is_managed_by_native_pciehp() does return false.

>   pciehp 0000:00:02.0:pcie04: Slot #2 AttnBtn- AttnInd- PwrInd- PwrCtrl-
> MRL- Interlock- NoCompl+ LLActRep+
Comment 34 Bjorn Helgaas 2014-09-08 23:05:20 UTC
(In reply to Alex Deucher from comment #30)
> ...  In order to save power, the
> dGPU can be completely powered down.  The power control for the dGPU is
> handled by an ACPI method.  The driver wants to stay loaded and selectively
> powers down the dGPU when it's idle and powers it back up on demand.  We
> don't want to unload the driver in this case.

When you power it back up, the dGPU has to be re-initialized (BARs restored, PCI config restored, etc.)  That's basically what would happen if pciehp re-enumerated the device.  Is there something special that means this wouldn't work in this case?

If the driver being loaded is the only concern, I would think we could figure out how to keep the driver loaded even if it is unbound and rebound to a device.

I assume we're talking about this path:

    vga_switchoff
      client->ops->set_gpu_state
      radeon_switcheroo_set_state       # vga_switcheroo_client_ops.set_gpu_state
        radeon_suspend_kms
      vgasr_priv.handler->power_state
      radeon_atpx_power_state           # vga_switcheroo_handler.power_state
        radeon_atpx_set_discrete_state
          radeon_atpx_call(..., ATPX_FUNCTION_POWER_CONTROL, ...)

So I think I see where the state is saved (in radeon_suspend_kms(), which calls pci_save_state()), but I'm nervous about this ATPX_FUNCTION_POWER_CONTROL thing.  That seems to run an ATPX method to switch off the power.  But PCI doesn't know anything about that, so don't we now have a device that PCI thinks is in D0, but is actually in D3cold?  This seems like a bad situation.  What happens if the PCI core tries to touch the device (AER config, ASPM config, etc.)?
Comment 35 Dave Airlie 2014-09-08 23:55:49 UTC
(In reply to Bjorn Helgaas from comment #34)
> (In reply to Alex Deucher from comment #30)
> > ...  In order to save power, the
> > dGPU can be completely powered down.  The power control for the dGPU is
> > handled by an ACPI method.  The driver wants to stay loaded and selectively
> > powers down the dGPU when it's idle and powers it back up on demand.  We
> > don't want to unload the driver in this case.
> 
> When you power it back up, the dGPU has to be re-initialized (BARs restored,
> PCI config restored, etc.)  That's basically what would happen if pciehp
> re-enumerated the device.  Is there something special that means this
> wouldn't work in this case?
> 
> If the driver being loaded is the only concern, I would think we could
> figure out how to keep the driver loaded even if it is unbound and rebound
> to a device.

> I assume we're talking about this path:
> 
>     vga_switchoff
>       client->ops->set_gpu_state
>       radeon_switcheroo_set_state       #
> vga_switcheroo_client_ops.set_gpu_state
>         radeon_suspend_kms
>       vgasr_priv.handler->power_state
>       radeon_atpx_power_state           # vga_switcheroo_handler.power_state
>         radeon_atpx_set_discrete_state
>           radeon_atpx_call(..., ATPX_FUNCTION_POWER_CONTROL, ...)
> 
> So I think I see where the state is saved (in radeon_suspend_kms(), which
> calls pci_save_state()), but I'm nervous about this
> ATPX_FUNCTION_POWER_CONTROL thing.  That seems to run an ATPX method to
> switch off the power.  But PCI doesn't know anything about that, so don't we
> now have a device that PCI thinks is in D0, but is actually in D3cold?  This
> seems like a bad situation.  What happens if the PCI core tries to touch the
> device (AER config, ASPM config, etc.)?

There are unfortunately two paths for shutting these devices down, and the one you traced is the non-dynamic, user-driven one.

We also have runtime PM ops for radeon: radeon_pmops_runtime_suspend tells vga_switcheroo that the card is dynamically off, then puts it into D3cold.

Thus we wake the card back up for PCI accesses etc.

The older method was only ever a debugfs hack for users to save power with, and in that case, the pci core would be pretty unhappy!

dave.
Comment 36 Bjorn Helgaas 2014-09-09 04:14:58 UTC
OK, it makes me feel a little better if you're using the usual PCI PM interfaces.

But I guess that means that any caller of pci_set_power_state(D3cold) is susceptible to this problem, doesn't it?  I.e., if a device below a hotplug-capable bridge is put in D3cold, the bridge is likely to report a hot-remove event.

The current acpiphp workaround is to have the driver call acpi_bus_no_hotplug() (see f244d8b623da ("ACPIPHP / radeon / nouveau: Fix VGA switcheroo problem related to hotplug")), but that doesn't seem like a very general solution.

Huang Ying added D3cold support in 448bd857d48e ("PCI/PM: add PCIe runtime D3cold support").  I cc'd him in case he has ideas.

My question about "can we treat a device being put into D3cold as a hot-remove, and it being restored to D0 as a hot-add?" is still on the table.  That seems like the obvious way to handle it, since that's exactly what we do for normal hotplug, so I'd like to push on that a little more before trying to figure out more workarounds.
Comment 37 SpacemanSpiff 2014-09-09 04:30:59 UTC
Created attachment 149591 [details]
ACPI Dump - HP Pavilion dv6-6145ca
Comment 38 SpacemanSpiff 2014-09-09 04:32:11 UTC
acpidump attached. I have an HP Pavilion dv6-6145ca. I am not using a modified DSDT.
Comment 39 Dave Airlie 2014-09-09 04:34:03 UTC
Well, my problem with treating this as a hot-remove is that the driver model expects an unplug to unbind the driver, and we want to keep userspace thinking the device is still there; userspace has the driver open and is sitting doing nothing with it.

I can't see a way if we unbind the driver for it to come back like magic.
Comment 40 Rafael J. Wysocki 2014-09-09 13:34:41 UTC
(In reply to Dave Airlie from comment #39)
> well my problem with doing hot remove as unplug, is the driver model expects
> unplug to unbind, and we want to keep userspace thinking the device is still
> there, and userspace has the driver open and is sitting doing nothing with
> it.
> 
> I can't see a way if we unbind the driver for it to come back like magic.

And if we know that the device is not going away, it's better to use that knowledge in my opinion.

Yes, the platform may send us a device notification in that case, but then we decide what to do about that and that need not mean "hot-remove".
Comment 41 Bjorn Helgaas 2014-09-09 17:12:39 UTC
(In reply to Rafael J. Wysocki from comment #40)
> And if we know that the device is not going away, it's better to use that
> knowledge in my opinion.

How do we know the device isn't going away?  Relying on the driver to tell us seems like it puts too many assumptions in the driver.

I'm a little concerned about D3cold support in general because of this.  If we completely power off a device below a hotplug-capable bridge, how can we have any confidence that it's the same device when we power it back up?

But I guess none of this is helping us resolve this bug.  Let me try to summarize:

- Linux requests control of PCIe native hotplug with _OSC, and it succeeds
- 00:02.0 is a Root Port to bus 01 and supports PCIe native hotplug
- pciehp claims hotplug control for 00:02.0
- 01:00.0 is a dGPU behind the 00:02.0 Root Port
- The dGPU driver uses acpi_bus_no_hotplug() to tell acpiphp to ignore hotplug events
- The dGPU driver uses pci_set_power_state(D3cold) to power off 01:00.0
- The \_SB.PCI0.VGA.ATPX method turns off power and generates a Bus Check notification to \_SB_.PCI0.PB2_
- Because of the Bus Check, hotplug_event() calls acpiphp_check_bridge(), which does nothing because of the previous acpi_bus_no_hotplug() on the slot
- When 01:00.0 is powered off, the upstream bridge (00:02.0) generates a Link Down hotplug interrupt
- Because of the Link Down interrupt, pciehp removes the 01:00.0 dGPU (prior to 02e93a8a7c1d and 2b3940b60626 it would have ignored this interrupt)
- Removing 01:00.0 causes the problems reported in this bugzilla
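
The sequence in this list can be traced with a toy Python model. It is only a sketch of the steps as summarized here: the `System` class and its field names are invented for the example, and the effect of an unsuppressed Bus Check is deliberately left unmodelled.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    # Driver called acpi_bus_no_hotplug(), so acpiphp ignores the slot.
    acpi_no_hotplug: bool = True
    # True models pciehp after 02e93a8a7c1d / 2b3940b60626 (acts on Link Down).
    pciehp_handles_link_events: bool = True
    dgpu_enumerated: bool = True
    log: list = field(default_factory=list)


def power_off_dgpu(state: System) -> None:
    """Model what the summary says happens when 01:00.0 is powered off."""
    # ATPX turns off power and sends a Bus Check to the bridge (acpiphp path).
    if state.acpi_no_hotplug:
        state.log.append("acpiphp: Bus Check ignored")
    else:
        state.log.append("acpiphp: Bus Check processed (effect not modelled)")

    # The Root Port raises a Link Down hotplug interrupt (pciehp path).
    if state.pciehp_handles_link_events:
        state.dgpu_enumerated = False
        state.log.append("pciehp: Link Down, removing 01:00.0")
    else:
        state.log.append("pciehp: Link Down ignored")


new_kernel = System()
power_off_dgpu(new_kernel)

old_kernel = System(pciehp_handles_link_events=False)
power_off_dgpu(old_kernel)
```

In this model the dGPU stays enumerated only in the `old_kernel` case, which matches the regression described in the summary.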

The first question is whether acpiphp or pciehp should handle hotplug events.  My opinion is that the BIOS granted control to the OS via _OSC, so it expects PCIe native hotplug, i.e., pciehp.

If anybody thinks acpiphp should handle them, we need to know how the OS would figure that out.  pciehp_acpi_slot_detection_check() does things along that line, but I can't see how the spec would suggest that we combine _OSC, _ADR, _EJ0, and _RMV and conclude that we should use ACPI hotplug in this case.

If we agree that pciehp should handle them, the only easy fix looks like Rajat's proposal in comment #25.  I would prefer some sort of PCI interface instead of having the driver make both an ACPI call and a PCI call.  From the driver's point of view, this is just a way to modify what happens when it calls pci_set_power_state(D3cold), so it seems like it ought to be related to that call.
Comment 42 Alex Deucher 2014-09-09 17:39:48 UTC
(In reply to Bjorn Helgaas from comment #41)
> (In reply to Rafael J. Wysocki from comment #40)
> > And if we know that the device is not going away, it's better to use that
> > knowledge in my opinion.
> 
> How do we know the device isn't going away?  Relying on the driver to tell
> us seems like it puts too many assumptions in the driver.
> 
> I'm a little concerned about D3cold support in general because of this.  If
> we completely power off a device below a hotplug-capable bridge, how can we
> have any confidence that it's the same device when we power it back up?
> 

This only affects laptops that contain the ATPX ACPI method (for AMD hybrid graphics) or _DSM (for nvidia hybrid graphics).  Since these are laptops, there's not much chance of the user swapping out the dGPU.
Comment 43 Pali Rohár 2014-09-09 18:03:39 UTC
There is one other scenario: some laptops still have an ExpressCard slot, and with a special adapter it is possible to plug a PCIe device into it. There is a project that uses the ExpressCard slot to connect an external PCIe GPU. In this case you can have an Intel GPU, a PowerXpress/Enduro/Optimus GPU, and another Nvidia/AMD GPU, and the last one, connected via ExpressCard, can be hotplugged. I do not know if anybody is using this on Linux, but this configuration works on Windows, and I do not see a reason why it should not work on Linux too (once drivers are loaded).

So when you fix this bug, it would be good not to mark all GPUs in a notebook with ATPX/_DSM as non-swappable, since somebody really can connect another GPU to a PCIe/ExpressCard slot.
Comment 44 Rafael J. Wysocki 2014-09-09 21:25:22 UTC
(In reply to Bjorn Helgaas from comment #41)
> (In reply to Rafael J. Wysocki from comment #40)
> > And if we know that the device is not going away, it's better to use that
> > knowledge in my opinion.
> 
> How do we know the device isn't going away?  Relying on the driver to tell
> us seems like it puts too many assumptions in the driver.
> 
> I'm a little concerned about D3cold support in general because of this.  If
> we completely power off a device below a hotplug-capable bridge, how can we
> have any confidence that it's the same device when we power it back up?

We should get a notification when the device appears again too.  If we don't, that's a platform bug and we can't do much.  If we do, though, we can double check the info in the config header (we don't do that today, but maybe we should).
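
As a rough illustration of what "double check the info in the config header" could look like, here is a Python sketch that compares a few identifying config-space fields saved before power-off with the values read after power-on. The choice of fields, the `ConfigIdentity` type, and the helper names are assumptions for this example; 0x1002 is the AMD/ATI vendor ID, while the device ID and revision are placeholder values.

```python
from typing import NamedTuple


class ConfigIdentity(NamedTuple):
    """Subset of the PCI config header used to recognise a device."""
    vendor_id: int
    device_id: int
    revision_id: int
    class_code: int


def device_present(ident: ConfigIdentity) -> bool:
    # A vendor ID of all ones means nothing answered the config read.
    return ident.vendor_id != 0xFFFF


def same_device(before: ConfigIdentity, after: ConfigIdentity) -> bool:
    """True only if a device is present and its identity is unchanged."""
    return device_present(after) and before == after


# Placeholder values apart from the vendor ID and the VGA class code.
saved = ConfigIdentity(0x1002, 0x6741, 0x00, 0x030000)
reappeared = ConfigIdentity(0x1002, 0x6741, 0x00, 0x030000)
swapped = ConfigIdentity(0x10DE, 0x0DF4, 0xA1, 0x030000)
absent = ConfigIdentity(0xFFFF, 0xFFFF, 0xFF, 0xFFFFFF)
```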

> But I guess none of this is helping us resolve this bug.  Let me try to
> summarize:
> 
> - Linux requests control of PCIe native hotplug with _OSC, and it succeeds

From that point on, the platform should not send notifications to us.

> - 00:02.0 is a Root Port to bus 01 and supports PCIe native hotplug
> - pciehp claims hotplug control for 00:02.0
> - 01:00.0 is a dGPU behind the 00:02.0 Root Port
> - The dGPU driver uses acpi_bus_no_hotplug() to tell acpiphp to ignore
> hotplug events

Which is OK given the above.

> - The dGPU driver uses pci_set_power_state(D3cold) to power off 01:00.0
> - The \_SB.PCI0.VGA.ATPX method turns off power and generates a Bus Check
> notification to \_SB_.PCI0.PB2_

Which is a platform bug.

> - Because of the Bus Check, hotplug_event() calls acpiphp_check_bridge(),
> which does nothing because of the previous acpi_bus_no_hotplug() on the slot
> - When 01:00.0 is powered off, the upstream bridge (00:02.0) generates a
> Link Down hotplug interrupt
> - Because of the Link Down interrupt, pciehp removes the 01:00.0 dGPU (prior
> to 02e93a8a7c1d and 2b3940b60626 it would have ignored this interrupt)
> - Removing 01:00.0 causes the problems reported in this bugzilla

We need to ignore that event, because we have an alternative way of handling it (the switcheroo thing) and we know that.

> The first question is whether acpiphp or pciehp should handle hotplug
> events.  My opinion is that the BIOS granted control to the OS via _OSC, so
> it expects PCIe native hotplug, i.e., pciehp.

If it expected that, it wouldn't send us ACPI device notifications in the first place.

Now, we generally need both, because some platforms are more buggy and not only send us ACPI device notifications in that case, but also do not send PCIe interrupts, so we only get one.  On the other hand, getting both should not be a problem if everything is serialized properly (and I think it is today).

> If anybody thinks acpiphp should handle them, we need to know how the OS
> would figure that out.  pciehp_acpi_slot_detection_check() does things along
> that line, but I can't see how the spec would suggest that we combine _OSC,
> _ADR, _EJ0, and _RMV and conclude that we should use ACPI hotplug in this
> case.
> 
> If we agree that pciehp should handle them, the only easy fix looks like
> Rajat's proposal in comment #25.  I would prefer some sort of PCI interface
> instead of having the driver make both an ACPI call and a PCI call.  From
> the driver's point of view, this is just a way to modify what happens when
> it calls pci_set_power_state(D3cold), so it seems like it ought to be
> related to that call.

We've already told acpiphp to ignore events for that device, so we need to tell PCIe to do the same thing.  At least for consistency, if nothing else.
Comment 45 Bjorn Helgaas 2014-09-09 21:37:02 UTC
(In reply to Rafael J. Wysocki from comment #44)
> (In reply to Bjorn Helgaas from comment #41)
> > The first question is whether acpiphp or pciehp should handle hotplug
> > events.  My opinion is that the BIOS granted control to the OS via _OSC, so
> > it expects PCIe native hotplug, i.e., pciehp.
> 
> If it expected that, it wouldn't send us ACPI device notifications in the
> first place.

I should have said "if the BIOS grants control to the OS, it *should* expect PCIe native hotplug."

In practical terms, I guess I'm asserting that if the OS has PCIe native hotplug control, we should be able to take over hotplug event reporting on all bridges.  If that's the case, we should be able to slice out a lot of the pciehp_acpi_slot_detection_check() mess.

I don't think that would preclude acpiphp from also listening to notifications.

> > If we agree that pciehp should handle them, the only easy fix looks like
> > Rajat's proposal in comment #25.  I would prefer some sort of PCI interface
> > instead of having the driver make both an ACPI call and a PCI call.  From
> > the driver's point of view, this is just a way to modify what happens when
> > it calls pci_set_power_state(D3cold), so it seems like it ought to be
> > related to that call.
> 
> We've already told acpiphp to ignore events for that device, so we need to
> tell PCIe to do the same thing.  At least for consistency, if nothing else.

I'm just suggesting that the driver should only have to call a single function, and either that function should talk to both acpiphp and pciehp, or it should set a pci_dev flag that both acpiphp and pciehp look at.
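
A minimal Python sketch of the second option (one flag that both hotplug drivers look at) might be shaped like this. All names here are hypothetical and chosen for illustration, including `pci_ignore_hotplug` and the `ignore_hotplug` field; the two handler functions only stand in for the acpiphp and pciehp event paths and return a string describing what they would do.

```python
from dataclasses import dataclass


@dataclass
class PciDev:
    """Toy pci_dev carrying a single shared flag (hypothetical)."""
    name: str
    ignore_hotplug: bool = False


def pci_ignore_hotplug(dev: PciDev) -> None:
    """The one call the driver makes (hypothetical name)."""
    dev.ignore_hotplug = True


def acpiphp_on_bus_check(dev: PciDev) -> str:
    # Stand-in for the ACPI hotplug path consulting the shared flag.
    return "ignored" if dev.ignore_hotplug else "rescan"


def pciehp_on_link_down(dev: PciDev) -> str:
    # Stand-in for the native PCIe hotplug path consulting the same flag.
    return "ignored" if dev.ignore_hotplug else "remove"


dgpu = PciDev("0000:01:00.0")
before = (acpiphp_on_bus_check(dgpu), pciehp_on_link_down(dgpu))
pci_ignore_hotplug(dgpu)
after = (acpiphp_on_bus_check(dgpu), pciehp_on_link_down(dgpu))
```

The point of the sketch is only that the driver makes one call and neither hotplug path needs to know about the other.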
Comment 46 Rafael J. Wysocki 2014-09-09 21:55:23 UTC
(In reply to Bjorn Helgaas from comment #45)
> (In reply to Rafael J. Wysocki from comment #44)
> > (In reply to Bjorn Helgaas from comment #41)
> > > > The first question is whether acpiphp or pciehp should handle hotplug
> > > > events.  My opinion is that the BIOS granted control to the OS via _OSC, so
> > > > it expects PCIe native hotplug, i.e., pciehp.
> > 
> > If it expected that, it wouldn't send us ACPI device notifications in the
> > first place.
> 
> I should have said "if the BIOS grants control to the OS, it *should* expect
> PCIe native hotplug."
> 
> In practical terms, I guess I'm asserting that if the OS has PCIe native
> hotplug control, we should be able to take over hotplug event reporting on
> all bridges.

All bridges below the root the _OSC was called for.

> If that's the case, we should be able to slice out a lot of
> the pciehp_acpi_slot_detection_check() mess.
> 
> I don't think that would preclude acpiphp from also listening to
> notifications.

OK, that makes sense.

> > > > If we agree that pciehp should handle them, the only easy fix looks like
> > > > Rajat's proposal in comment #25.  I would prefer some sort of PCI interface
> > > > instead of having the driver make both an ACPI call and a PCI call.  From
> > > > the driver's point of view, this is just a way to modify what happens when
> > > > it calls pci_set_power_state(D3cold), so it seems like it ought to be
> > > > related to that call.
> > 
> > We've already told acpiphp to ignore events for that device, so we need to
> > tell PCIe to do the same thing.  At least for consistency, if nothing else.
> 
> I'm just suggesting that the driver should only have to call a single
> function, and either that function should talk to both acpiphp and pciehp,
> or it should set a pci_dev flag that both acpiphp and pciehp look at.

Sounds reasonable.  I didn't anticipate PCIe hotplug to have the same problem to be honest.
Comment 47 Rafael J. Wysocki 2014-09-09 21:55:35 UTC
BTW, this is not a plain D3cold, because platforms don't send device checks for D3cold transitions as a rule.

In fact, we've had D3cold forever, although it used to be called D3 in ACPI, but the semantics were pretty much the same.  Only later someone noticed the confusion between ACPI device states and PCI device states and decided to do something about that.

The switcheroo thing is just "special" and platforms tend to treat it as hotplug (which may be due to the way it is handled on Windows).
Comment 48 Bjorn Helgaas 2014-09-10 22:46:04 UTC
(In reply to Rafael J. Wysocki from comment #47)
> BTW, this is not a plain D3cold, because platforms don't send device checks
> for D3cold transitions as a rule.

I assume you mean that we don't normally get Bus Check notifications when ACPI puts things in D3cold.  That would make sense to me because ACPI probably doesn't know how to put a removable device in D3cold.  When ACPI does a D3cold transition, I would guess it's for a built-in device and there's no possibility of it being replaced with a different device before it's powered up again.
Comment 49 Bjorn Helgaas 2014-09-10 22:51:00 UTC
Created attachment 149691 [details]
test patch

This is a trial-balloon patch based on Rafael's existing acpiphp work (f244d8b623da) and Rajat's pciehp work from comment #25.

My intent is that this should work for both acpiphp and pciehp.  As long as both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you should be able to test the pciehp path by booting normally, and the acpiphp path by booting with "pcie_ports=compat".
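
When reporting results it helps to state which of the two paths a given boot exercised. The following Python sketch decides that from the kernel command line, assuming (as described above) that "pcie_ports=compat" selects the acpiphp path and anything else the pciehp path. The function name and the sample command lines are made up for the example; on a running system the string could be read from /proc/cmdline.

```python
def hotplug_path_under_test(cmdline: str) -> str:
    """Return which hotplug driver path this boot is expected to exercise."""
    params = cmdline.split()
    return "acpiphp" if "pcie_ports=compat" in params else "pciehp"


# Hypothetical command lines for illustration.
normal_boot = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet"
compat_boot = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro pcie_ports=compat"

paths = (hotplug_path_under_test(normal_boot),
         hotplug_path_under_test(compat_boot))
```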
Comment 50 SpacemanSpiff 2014-09-11 14:38:32 UTC
Created attachment 149801 [details]
dmesg patch 2
Comment 51 SpacemanSpiff 2014-09-11 14:39:30 UTC
The patch works great with and without pcie_ports=compat. dmesg attached. Thanks.
Comment 52 Kristian Klausen 2014-09-14 10:56:28 UTC
(In reply to Bjorn Helgaas from comment #49)
> Created attachment 149691 [details]
> test patch
> 
> This is a trial-balloon patch based on Rafael's existing acpiphp work
> (f244d8b623da) and Rajat's pciehp work from comment #25.
> 
> My intent is that this should work for both acpiphp and pciehp.  As long as
> both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you
> should be able to test the pciehp path by booting normally, and the acpiphp
> path by booting with "pcie_ports=compat".

I don't know if it has anything to do with that patch, but yesterday I compiled 3.6.2 with the patch applied, and it didn't freeze (or whatever we call it). I then used my laptop for 2-3 hours with the patched 3.6.2, after which I suspended it and went to bed.

This morning, when I wanted to use my laptop, it just froze, so I removed the battery and the AC adapter and held the power button for a few seconds, and then I noticed that the Caps Lock light was blinking. A Google search led me to the HP website, where it says:

LEDs blink 3 times	Memory	Module error not functional
http://h10025.www1.hp.com/ewfrf/wc/document?docname=c01732674&tmp_task=solveCategory&cc=us&dlc=en&lc=en&product=5149092&query=QA598EA&tool=#N241

Of course this could just be a coincidence; I just wanted you to know.

My laptop is a HP dv6-6145eo spec here: http://h10025.www1.hp.com/ewfrf/wc/document?docname=c02921653&tmp_task=prodinfoCategory&cc=us&dlc=en&lc=en&product=5153075
Comment 53 Kristian Klausen 2014-09-14 11:19:39 UTC
(In reply to Kristian from comment #52)
> (In reply to Bjorn Helgaas from comment #49)
> > Created attachment 149691 [details]
> > test patch
> > 
> > This is a trial-balloon patch based on Rafael's existing acpiphp work
> > (f244d8b623da) and Rajat's pciehp work from comment #25.
> > 
> > My intent is that this should work for both acpiphp and pciehp.  As long as
> > both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you
> > should be able to test the pciehp path by booting normally, and the acpiphp
> > path by booting with "pcie_ports=compat".
> 
> Dont know if it have anything todo with that patch but. Yesterday I compiled
> 3.6.2 with that patch applied, and it didn't freeze (or what we call it?).
> Then I used my laptop for 2-3 hours with 3.6.2 patched, after that I
> suspended it and went to bed.
> 
> This morning, when I want to use my laptop. It just freeze, so i removed the
> battery and the ac-adapter pressed poower-on-button for some seconds, and
> then I discovered that the Caps-Lock light was blinking. A google search got
> me to HP website where is says it:
> 
> LEDs blink 3 times    Memory  Module error not functional
> > http://h10025.www1.hp.com/ewfrf/wc/document?docname=c01732674&tmp_task=solveCategory&cc=us&dlc=en&lc=en&product=5149092&query=QA598EA&tool=#N241
> 
> Of course this could just be a coincidence, just want you to hear it.
> 
> My laptop is a HP dv6-6145eo spec here:
> > http://h10025.www1.hp.com/ewfrf/wc/document?docname=c02921653&tmp_task=prodinfoCategory&cc=us&dlc=en&lc=en&product=5153075
Maybe I was a little too quick. I just swapped the memory; I just thought it might have something to do with this patch.
Comment 54 Bjorn Helgaas 2014-09-19 20:34:14 UTC
Kristian, that sounds like a different issue.  If it persists, please open a separate bugzilla report for it.
Comment 55 Jose P. 2014-09-22 10:47:27 UTC
Created attachment 151261 [details]
dmesg 3.17-rc6

I'm running 3.17-rc6 from ubuntu's kernel repo ( http://kernel.ubuntu.com/~kernel-ppa/mainline/ ), and, since the patch was already added to it, everything is working great so far. The only outstanding things would be:

- some (hopefully) unrelated bug in my system... for some reason, I have to manually restart (as in, stop completely and then start) KDM / the X server to make my dedicated card appear in 'xrandr --listproviders' or 'DRI_PRIME=1 glxinfo' (I don't know when this started to happen). And,

- this message being spammed in the system logs, which I don't know what it means:
>[ 1889.169800] radeon 0000:00:01.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment
>[ 1889.169827] pci 0000:00:14.4: PCI bridge to [bus 05]
>[ 1889.169833] pci 0000:00:14.4:   bridge window [io  0x6000-0x6fff]
>[ 1889.169840] pci 0000:00:14.4:   bridge window [mem 0xf0d00000-0xf0efffff]
>[ 1889.169846] pci 0000:00:14.4:   bridge window [mem 0xf0f00000-0xf10fffff pref]
>[ 1889.169876] radeon 0000:01:00.0: Max Payload Size 16384, but upstream 0000:00:02.0 set to 128; if necessary, use "pci=pcie_bus_safe" and report a bug

Anyway, I can confirm the patch is working. Attached is dmesg.
Thank you guys, thank you very much!


@Kristian: I've had similar issues many times before... I don't know if it's related to linux (I assumed it is not), but looks like these HP laptop BIOS are really buggy. To fix it, you have to reset the BIOS by removing the battery for 30 seconds or so. Not sure if it's the same issue, though.
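
One arithmetic observation about the "Max Payload Size 16384" message quoted in this comment, offered as a guess rather than a diagnosis: in PCIe, the Max Payload Size is a 3-bit field in bits 7:5 of the Device Control register, decoded as 128 << field. If the register read comes back as all ones, which is what a config read from a powered-off device usually returns, the decoded value is exactly 16384. The Python below only does that arithmetic; the helper name is made up, and whether this is what happens on this system is an assumption.

```python
# Mask for bits 7:5 of the PCIe Device Control register.
PCI_EXP_DEVCTL_PAYLOAD = 0x00E0


def mps_from_devctl(devctl: int) -> int:
    """Decode Max Payload Size in bytes from a Device Control value."""
    return 128 << ((devctl & PCI_EXP_DEVCTL_PAYLOAD) >> 5)


upstream_mps = mps_from_devctl(0x0000)   # field 0
all_ones_mps = mps_from_devctl(0xFFFF)   # field 7, as read from a dead device
```

Here `upstream_mps` is 128 and `all_ones_mps` is 16384, the two numbers in the log line.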
Comment 56 Pali Rohár 2014-09-22 12:06:09 UTC
Created attachment 151271 [details]
dmesg messages

I'm getting the same dmesg messages with kernel 3.17-rc6 when I close the lid of my laptop.
Comment 57 SpacemanSpiff 2014-09-22 16:54:50 UTC
> 
> - some (hopefully) unrelated bug in my system... for some reason, I have to
> manually restart (as in, stop completely and then start) KDM / the X server
> to make my dedicated card appear in 'xrandr --listproviders' or 'DRI_PRIME=1
> glxinfo' (I don't know when did this started to happen). And,

I can see the dGPU with xrandr --listproviders.
Maybe not related, but I think I had a similar problem before when I turned off the dGPU using echo "OFF" > /sys/kernel/debug/vgaswitcheroo/switch
Comment 58 SpacemanSpiff 2014-09-22 17:01:33 UTC
(In reply to Pali Rohár from comment #56)
> Created attachment 151271 [details]
> dmesg messages
> 
> I'm getting same dmesg messages with kernel 3.17-rc6 when I close LID of
> laptop.

Yes, resume from sleep is broken for me with the latest git.

With 3.14 it works if I pass "acpi_sleep=s3_bios" to the kernel (bug: https://bugs.freedesktop.org/show_bug.cgi?id=42960), but with the latest git that no longer works.

After resume, the laptop screen is white for a few seconds and then turns off. The desktop is otherwise OK and I can use an external monitor. Restarting X by logging off and on brings my laptop screen back on. Attaching dmesg after resume with and without the parameter, and after restarting X.

If needed, I can later try to check the behaviour before the patch. What is the patch commit id?
Comment 59 SpacemanSpiff 2014-09-22 17:02:30 UTC
Created attachment 151351 [details]
dmesg after resume from suspend
Comment 60 SpacemanSpiff 2014-09-22 17:20:36 UTC
In dmesg after resume, I see a pciehp message that may be related.

[ 1252.262984] pciehp 0000:00:02.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
[ 1252.262988] pciehp 0000:00:02.0:pcie04: Cannot add device at 0000:01:00

The dGPU is still powered off, so that's OK.
Then some errors:


[ 1252.424006] [drm:radeon_dp_link_train_ce] *ERROR* channel eq failed: 5 tries
[ 1252.424008] [drm:radeon_dp_link_train_ce] *ERROR* channel eq failed
...
[ 1257.444547] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 1257.444551] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E3B0 (len 2585, WS 4, PS 4) @ 0xEA9A


and also the "bogus alignment" message Jose reported (this one is also present on startup).
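
To keep these apart when going through a long dmesg, here is a small sketch that counts lines per message type. The regular expressions are taken only from the messages quoted in this bug; the group labels and the function name classify are my own, not anything the kernel defines.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this report (labels are ad hoc).
PATTERNS = {
    "pciehp hot-add": re.compile(r"pciehp .*(cannot hot-add|Cannot add device)"),
    "DP link training": re.compile(r"radeon_dp_link_train_ce.*channel eq failed"),
    "atombios stuck": re.compile(r"atombios stuck"),
    "bogus alignment": re.compile(r"has\s+bogus alignment"),
}


def classify(lines):
    """Count how many lines match each known message type."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # a line is counted under one label only
    return counts
```

It can be fed any iterable of lines, for example an open file holding saved dmesg output.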
Comment 61 Bjorn Helgaas 2014-09-22 17:57:56 UTC
SpacemanSpiff, unless you think this is the same issue Jose P. originally reported in this bugzilla, can you open new bugzilla(s)?

The "Device already exists, cannot hot-add" is definitely a different problem (probably the same as https://bugzilla.kernel.org/show_bug.cgi?id=74471; we even had some patches to address that, and they probably need to be resurrected).

The "BAR ... has bogus alignment" is another separate problem.
Comment 62 Jose P. 2014-09-22 19:25:50 UTC
(In reply to SpacemanSpiff from comment #58)
> yes resume for sleep is broken for me with latest git. 
> 
> With 3.14 it works if i pass "acpi_sleep=s3_bios" to kernel ( bug -
> https://bugs.freedesktop.org/show_bug.cgi?id=42960 ). But it does not work
> anymore.
Well, I just tested it... after resume from suspend-to-RAM, I got a black screen. I switched to TTY1, ran xrandr to attach the display to an external monitor, switched to X and after some other commands, the laptop monitor turned on...
I guess this could work:  https://bugs.freedesktop.org/show_bug.cgi?id=42960#c47
>sleep 1; xset dpms force standby
Comment 63 Alex Deucher 2014-09-22 19:37:23 UTC
None of these other issues are related to this bug.  Please report them separately or follow up on existing bugs now that this one is fixed.
Comment 64 Jose P. 2014-09-23 19:20:23 UTC
Well, I suppose I have to close it as fixed... fixed in 3.17-rc6.
Thanks radeon & pci devs.
Comment 65 Pali Rohár 2014-09-30 10:28:54 UTC
Problem from comment #56 is reported in new bug #85311