Bug 79701
Summary: | Dual AMD graphics systems broken by PCIe hotplug in kernel 3.15+ | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jose P. (lbdkmjdf) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | a818958, airlied, alexdeucher, bjorn, falsenick, funfunctor, jajadekroon, kristian, kristofer.rye, mfitzpatrick, pali, rajatxjain, rjw, rui.zhang, shawn.starr, tianyu.lan, ying.huang |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.15 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
dmesg
Suggested patch dmsg commit b1811d2455f32754cc3d8725bf2e961c5eda2a72 Test patch - to verify if the problem has been diagnosed correctly dmesg for pcie_ports=compat & for patch ACPI Dump - HP Pavilion dv6-6145ca test patch dmesg patch 2 dmesg 3.17-rc6 dmesg messages dmesg after resume from suspend |
Description
Jose P.
2014-07-08 21:50:43 UTC
This bug also affects me. Got a Fedora 20 system with the 3.16-rc4 kernel. The 'radeon.runpm=0' workaround works for me. This suggests there may be some problems with power management handling on these types of APUs.

I disabled "OSC" in my BIOS (unlocked BIOS) as suggested here: http://www.phoronix.com/forums/showthread.php?103481-Linux-3-16-rc5-Kernel-Released&p=428972#post428972 and, so far, I'm running 3.16-rc2 without any problems.

This bug affects me, as well. I am on an HP Pavilion dv6-6c00 with a Radeon HD 6600M. I reinstalled Fedora and this bug affects me on kernel 3.15.5. I booted into Windows and reimaged the BIOS of my computer to the latest version provided by HP, to no avail. Kernels up through 3.14 worked, but 3.15 doesn't work for me. Adding "radeon.runpm=0" to the boot command line lets it progress further in the boot process, but at the point where it normally goes into the login manager (i.e. when it starts Xorg), I get a kernel panic and pciehp is mentioned in the Call Trace. I can't find an unlocked BIOS that will let me enable those things, and I feel like the kernel should support my hardware without requiring BIOS hacks, so trying to hack my BIOS is out of the question for now.

Does https://bugzilla.kernel.org/show_bug.cgi?id=79621 have anything to do with this bug? If so, can anyone test the patch? (I'm unable to do it.)

It's very possible that that has something to do with it. Unfortunately, I can't test it either.

@devs: is there any other workaround to either disable pciehp, or to go back to the old behavior? Blacklisting it doesn't help. From a Google search, there seem to be some similar (old, unrelated to radeon) reports, and each one of them needed a kernel patch instead of a simple module option/workaround. Can you code a way to completely disable pciehp, for everyone to use in cases like this?

Just FYI, this issue is still present in kernels 3.16.0 and 3.15.8, rendering the system almost unusable, and disabling _OSC makes a bunch of different (not related to pciehp) bugs appear...

Same problem with HP dv6z with 6755g2. My dGPU keeps turning on and off, the desktop freezes every few seconds and 2 kworkers are using up 2 cores.
https://bugs.archlinux.org/task/38980#comment124837

3.14 LTS works great.

(In reply to Jose P. from comment #4)
> Does https://bugzilla.kernel.org/show_bug.cgi?id=79621 have anything to do
> with this bug? If so, can anyone test the patch? (I'm unable to do it.)

Just tested with 3.17-rc1, same problem. If anyone else wants to test, I have uploaded the binary and PKGBUILD to: http://188.228.31.139/dl/aur/linux/

(In reply to SpacemanSpiff from comment #7)
> Same problem with HP dv6z with 6755g2.
> My dgpu keeps turning on and off, desktop freezes every few seconds and 2
> kworkers are using up 2 cores
> https://bugs.archlinux.org/task/38980#comment124837
>
> 3.14 lts works great.

Can you bisect to see what commit changed the hotplug behavior?

*** Bug 82071 has been marked as a duplicate of this bug. ***

Created attachment 147731 [details]
Suggested patch
Some feedback would be appreciated; I am very unfamiliar with these subsystems.
I can reproduce this just by triggering a manual GPU reset:

cat /sys/kernel/debug/dri/0/radeon_gpu_reset

This will induce a reset, which throws:

Aug 23 00:10:02 segfault kernel: [173022.968555] radeon 0000:01:00.0: GPU softreset: 0x00000040
Aug 23 00:10:02 segfault kernel: [173022.968769] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
Aug 23 00:10:02 segfault kernel: [173022.969047] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
Aug 23 00:10:02 segfault kernel: [173022.969307] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200080C0
Aug 23 00:10:02 segfault kernel: [173022.969575] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.969839] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.970116] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
Aug 23 00:10:02 segfault kernel: [173022.970363] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000
Aug 23 00:10:02 segfault kernel: [173022.970622] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
Aug 23 00:10:02 segfault kernel: [173023.050288] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00002000
Aug 23 00:10:02 segfault kernel: [173023.052585] radeon 0000:01:00.0: R_008010_GRBM_STATUS = 0xA0003030
Aug 23 00:10:02 segfault kernel: [173023.052799] radeon 0000:01:00.0: R_008014_GRBM_STATUS2 = 0x00000003
Aug 23 00:10:02 segfault kernel: [173023.053033] radeon 0000:01:00.0: R_000E50_SRBM_STATUS = 0x200000C0
Aug 23 00:10:02 segfault kernel: [173023.053282] radeon 0000:01:00.0: R_008674_CP_STALLED_STAT1 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053495] radeon 0000:01:00.0: R_008678_CP_STALLED_STAT2 = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053689] radeon 0000:01:00.0: R_00867C_CP_BUSY_STAT = 0x00000000
Aug 23 00:10:02 segfault kernel: [173023.053905] radeon 0000:01:00.0: R_008680_CP_STAT = 0x80100000
Aug 23 00:10:02 segfault kernel: [173023.054142] radeon 0000:01:00.0: R_00D034_DMA_STATUS_REG = 0x44C83D57
Aug 23 00:10:02 segfault kernel: [173023.054386] radeon 0000:01:00.0: GPU pci config reset
Aug 23 00:10:02 segfault kernel: [173023.128744] pciehp 0000:00:01.0:pcie04: Card not present on Slot(1-1)
Aug 23 00:10:02 segfault kernel: [173023.140895] pciehp 0000:00:01.0:pcie04: Card present on Slot(1-1)
Aug 23 00:10:02 segfault kernel: [173023.288485] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Aug 23 00:10:02 segfault kernel: [173023.415478] [drm:radeon_pm_resume_dpm] *ERROR* radeon: dpm resume failed
Aug 23 00:10:03 segfault kernel: [173024.057551] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 23 00:10:03 segfault kernel: [173024.218245] radeon 0000:01:00.0: Wait for MC idle timedout !
Aug 23 00:10:03 segfault kernel: [173024.219467] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
Aug 23 00:10:03 segfault kernel: [173024.219693] divide error: 0000 [#1] SMP
Aug 23 00:10:03 segfault kernel: [173024.220436] Modules linked in: vhost_net vhost macvtap macvlan tun bridge stp llc arc4 uvcvideo iwldvm snd_usb_audio videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops videobuf2_core v4l2_common snd_rawmidi mmc_block videodev media coretemp kvm_intel sdhci_pci iTCO_wdt mac80211 snd_hda_codec_conexant snd_hda_codec_generic iTCO_vendor_support kvm sdhci mmc_core r592 memstick microcode i2c_i801 snd_hda_intel snd_hda_controller iwlwifi cfg80211 snd_hda_codec thinkpad_acpi lpc_ich mfd_core wmi snd_hwdep shpchp snd_seq tpm_tis mei_me mei tpm snd_seq_device rfkill snd_pcm snd_timer snd soundcore video acpi_cpufreq binfmt_misc sunrpc radeon i2c_algo_bit drm_kms_helper e1000e ttm drm ptp pps_core
Aug 23 00:10:03 segfault kernel: [173024.223037] CPU: 1 PID: 28358 Comm: Xorg.bin Not tainted 3.17.0-0.rc1.git0.1.fc22.x86_64 #1
Aug 23 00:10:03 segfault kernel: [173024.223037] Hardware name: LENOVO 4058CTO/4058CTO, BIOS 6FET93WW (3.23 ) 10/12/2012
Aug 23 00:10:03 segfault kernel: [173024.223037] task: ffff8802181cf500 ti: ffff8801fd648000 task.ti: ffff8801fd648000
Aug 23 00:10:03 segfault kernel: [173024.223037] RIP: 0010:[<ffffffffa013c31a>] [<ffffffffa013c31a>] r6xx_remap_render_backend+0x6a/0xe0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] RSP: 0018:ffff8801fd64bbd8 EFLAGS: 00010246
Aug 23 00:10:03 segfault kernel: [173024.223037] RAX: 0000000000000002 RBX: 00000000ffffffff RCX: 0000000000000002
Aug 23 00:10:03 segfault kernel: [173024.223037] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000002
Aug 23 00:10:03 segfault kernel: [173024.223037] RBP: ffff8801fd64bc10 R08: 00000000000000ff R09: 0000000000000565
Aug 23 00:10:03 segfault kernel: [173024.223037] R10: 0000000000000000 R11: 0000000000000565 R12: 0000000080000000
Aug 23 00:10:03 segfault kernel: [173024.223037] R13: 00000000000000ff R14: 0000000000000000 R15: 0000000000000000
Aug 23 00:10:03 segfault kernel: [173024.223037] FS: 00007ff86036a9c0(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000
Aug 23 00:10:03 segfault kernel: [173024.223037] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 23 00:10:03 segfault kernel: [173024.223037] CR2: 00007fea6e66d000 CR3: 00000000a0893000 CR4: 00000000000427e0
Aug 23 00:10:03 segfault kernel: [173024.223037] Stack:
Aug 23 00:10:03 segfault kernel: [173024.223037] ffff8800bf6a0000 0000000200000200 ffff8800bf6a0000 000000000000c352
Aug 23 00:10:03 segfault kernel: [173024.223037] 00000000ffffffff 000000000000cb52 0000000000ffff00 ffff8801fd64bc60
Aug 23 00:10:03 segfault kernel: [173024.223037] ffffffffa013f5ac ffffffff04ea0000 00000000ffffffff 00000000af584ada
Aug 23 00:10:03 segfault kernel: [173024.223037] Call Trace:
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa013f5ac>] r600_startup+0x7ec/0x1b60 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa0140953>] r600_resume+0x33/0x70 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa00e9bf1>] radeon_gpu_reset+0x131/0x2c0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa011c10e>] radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa011cca0>] radeon_gem_wait_idle_ioctl+0x100/0x150 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa0019e5f>] drm_ioctl+0x1df/0x680 [drm]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffff810bf420>] ? wake_up_state+0x10/0x20
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffffa00e704c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffff8121b5c0>] do_vfs_ioctl+0x2d0/0x4b0
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffff81738ecf>] ? __schedule+0x2ef/0x840
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffff8121b821>] SyS_ioctl+0x81/0xa0
Aug 23 00:10:03 segfault kernel: [173024.223037] [<ffffffff8173de29>] system_call_fastpath+0x16/0x1b
Aug 23 00:10:03 segfault kernel: [173024.223037] Code: b6 ed 45 09 c5 41 80 fd ff 45 0f 44 e8 d3 e7 89 7d d4 44 89 ef e8 97 ff ff ff 8b 4d d4 41 29 c7 44 39 f9 72 6c 89 c8 31 d2 89 cf <41> f7 f7 44 0f af f8 89 c6 48 8b 45 c8 44 29 ff 83 b8 68 01 00
Aug 23 00:10:03 segfault kernel: [173024.223037] RIP [<ffffffffa013c31a>] r6xx_remap_render_backend+0x6a/0xe0 [radeon]
Aug 23 00:10:03 segfault kernel: [173024.223037] RSP <ffff8801fd64bbd8>
Aug 23 00:10:03 segfault kernel: [173024.314798] ---[ end trace 76082dc70d248257 ]---
Aug 23 00:10:03 segfault kernel: [173024.390074] pciehp 0000:00:01.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
Aug 23 00:10:03 segfault kernel: [173024.390343] pciehp 0000:00:01.0:pcie04: Cannot add device at 0000:01:00

This occurs in 3.17-rc1 currently.

(In reply to Shawn Starr from comment #12)
> I can reproduce this just by triggering a manual GPU reset:
>
> cat /sys/kernel/debug/dri/0/radeon_gpu_reset
>
> This will induce a reset, throws:
>
> ...

That could perhaps be a different issue actually, certainly not good though!

(In reply to Edward O'Callaghan from comment #11)
> Created attachment 147731 [details]
> Suggested patch
>
> Some feedback would be appreciated; I am very unfamiliar with these subsystems.
Can't get it to compile with 3.16.1.

LD drivers/net/wireless/built-in.o
LD drivers/net/built-in.o
Makefile:901: recipe for target 'drivers' failed
make: *** [drivers] Error 2

(In reply to Shawn Starr from comment #12)
> I can reproduce this just by triggering a manual GPU reset:
>

This is unrelated to this bug.

Alex Deucher,

Can we get this bug labeled as a regression and confirmed, since it is both? I would vote for bumping up the importance also, given that this can result in a $1000 laptop going down the drain unless the BIOS catches the heating and manages to power it off in time.

Ta,

Only the person that opened the bug can mark it as a regression. As to dynamically powering off the dGPU, support for that was added relatively recently; before that it was always left on, so I'm not sure it's really that big of a problem. Can someone with an affected system bisect? It would be helpful to identify what commit caused the change.

I have pcie_aspm=off in grub.cfg, so no more freezes. Anyway, I sometimes have problems during boot (black screen only). I don't know whether it is related to this bug or not.

(In reply to sergey from comment #19)
> I have pcie_aspm=off in grub.cfg, so no more freezes.
> Anyway have some problems during boot sometime (black screen only). Don't
> know is it related to this bug or not.

Unless you are seeing an oops related to pciehp trying to unload the driver while it's running, you are seeing something else.

Bisect result. phew

02e93a8a7c1dcecc1a33ea762a0c041cbb6a0a66 is the first bad commit
commit 02e93a8a7c1dcecc1a33ea762a0c041cbb6a0a66
Author: Rajat Jain <rajatxjain@gmail.com>
Date: Tue Feb 4 18:30:21 2014 -0800

    PCI: pciehp: Don't check adapter or latch status while disabling

    It does not make much sense to refuse to disable a slot if an adapter is not present or the latch is open. If an adapter is not present, it provides an even better reason to disable the device slot.

    This is specially a problem for link state hot-plug, because some ports use in band mechanism for presence detection. Thus when link goes down, presence detect also goes down. We _want_ that the removal should take place in such case.

    Thus remove the checks for adapter and latch in pciehp_disable_slot()

    Signed-off-by: Rajat Jain <rajatxjain@gmail.com>
    Signed-off-by: Rajat Jain <rajatjain@juniper.net>
    Signed-off-by: Guenter Roeck <groeck@juniper.net>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 4db507fb235c6ded307a6160347b5e79b28c58b5 e54ea37ad48a9e9ca0567ca7a3b26e0193c62e53 M drivers

Hi,

I looked at this and wanted to share my observations:

The Basic Issue
===============
There are a bunch of quick hotplug events (unplug followed by hot-plug) that are received by the hotplug driver. While both hotplug drivers (pciehp and acpiphp) are fine with it, the radeon driver itself is probably not equipped to handle them so well?

[ 41.224428] trying to unbind memory from uninitialized GART !

When acpiphp was being used
===========================
As Rafael mentions in this commit log, this is a problem with the VGA subsystem, which requires the hot-plug driver to ignore such hot-plug events associated with a slot that connects to such known Radeon controllers. This was done for acpiphp by introducing a "no_hotplug" flag for the ACPI:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f244d8b623dae7a7bc695b0336f67729b95a9736

The above commit would fix the problem if acpiphp is used, by ignoring the hot-plug events for that slot.

Switch to using pciehp
======================
1) For some reason, the system now seems to use pciehp for these slots instead of acpiphp (can someone please tell me if this looks OK? I only ask because I see the relevant message from Rafael's commit getting printed, which seems to indicate that he is expecting acpiphp to control this slot?). But I also see that pciehp has already grabbed the slot by the time this message gets printed:

[ 4.419180] VGA switcheroo: detected switching method \_SB_.PCI0.VGA_.ATPX handle

Even with pciehp in use, things were still all right until commit 02e93a8a7, because pciehp used to ignore the hot-unplug events (including loss-of-presence-detect and link-down) if (1) SURPRISE removal is not supported or (2) ADAPTER is not present (which is what this commit addresses). Thus the hot-unplug event used to come, pciehp_disable_slot() used to find no adapter, and it refused to do anything.

Why problem started with pciehp
===============================
Essentially, commits 02e93a8a7 and 2b3940b60 made pciehp handle all hot-unplug events (loss-of-presence-detect and link-downs) irrespective of whether SURPRISE removal was supported or not, and also if ADAPTER is not present. Now, I would think that both these commits are still valid, because it makes no sense to ignore an unplug event (and let the kernel continue with stale data structures) just because SURPRISE is not set, or the ADAPTER is not present (the latter is an even better reason to process the unplug event).

My recommendations / Options
============================
1) I would first like an opinion on whether it is OK to see pciehp handle these hotplug slots. The radeon code seems to be ACPI intensive, and Rafael's commit also seems to say that this was supposed to be handled by acpiphp.

2) If it is expected to continue using pciehp, maybe we could handle it in the same way as Rafael did for acpiphp. We could add a flag in the pci_dev ("ignore_hp_events" or something) and set it for the hot-pluggable slot from radeon code, just like acpi_bus_no_hotplug() is called today.

I'll be away on vacation for 3 days, and would be glad to submit a patch if needed.

One question to the gentleman who bisected this (SpacemanSpiff): would it be possible for you to look for the following messages while trying out the image just before commit 02e93a8a7c?

... No adapter on slot(2) ...

Thanks & Best Regards,
Rajat Jain

Created attachment 148801 [details]
dmsg commit b1811d2455f32754cc3d8725bf2e961c5eda2a72
Yes, I get it. dmesg attached.

[ 64.138675] pciehp 0000:00:02.0:pcie04: Card present on Slot(2)
[ 64.138792] pciehp 0000:00:02.0:pcie04: slot(2): Link Up event
[ 64.242002] pciehp 0000:00:02.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
[ 64.242017] pciehp 0000:00:02.0:pcie04: Cannot add device at 0000:01:00
....
[ 71.866201] pciehp 0000:00:02.0:pcie04: Card not present on Slot(2)
[ 71.866354] pciehp 0000:00:02.0:pcie04: slot(2): Link Down event
[ 71.866519] pciehp 0000:00:02.0:pcie04: No adapter on slot(2)

This was the last kernel that I had tested which passed during bisect. Today, 3 times out of 3, I was not able to log in using kdm. The screen just went blank and I had to use tty2. The last three lines were printed then. Weirdly, I didn't get this problem yesterday during bisect. Anyway, the dGPU is reported to be off by /sys/kernel/debug/vgaswitcheroo/switch

Created attachment 149101 [details]
Test patch - to verify if the problem has been diagnosed correctly
Hi,

I believe there are a few ways to make forward progress on this issue:

1) Debug why radeon and other graphics card drivers ask / require the hotplug drivers to ignore these hotplug events, and fix this basic issue in these VGA drivers.

2) If the hotplug events from these slots are to be ignored, why mark them as hot-pluggable slots in the first place? Maybe a platform bug?

3) Use acpiphp for these slots (it already ignores these hotplug events). Maybe that is the intent; at least that's what it looks like to me. So we may want to debug why pciehp and not acpiphp is controlling these slots. If someone can please try out the "pcie_ports=compat" kernel argument, which disables pciehp and hopefully will force acpiphp to take ownership of these slots, that would be great.

4) Write a patch to let pciehp also ignore these hotplug events. I have attached a quick and dirty sample patch but don't have the hardware to try it out. If someone can test it for me, that would be a great help! (Please note that this is just a verification patch to see if the problem will go away in such a manner; this is not the final patch.)

I guess at this point, I will just wait for input from Bjorn or Rafael on how to go about this.

Thanks,
Rajat

Created attachment 149191 [details]
dmesg for pcie_ports=compat & for patch
pcie_ports=compat worked, i was able to use over latest linux git. Patch didnt work for me. Perhaps someone else can also test it to verify I have attached dmesg for both. Hi SpacemanSpiff, Thanks a lot for the testing. Since it works all right when the acpiphp handles it, I think my diagnosis was correct. I do realize a bug in my patch (I set the "ignore_hotplug_events" in the VGA device's pci_dev, where as it actually needs to be set in the parent bridge's pci_dev since that is the hotplug slot. Sorry, I'm not as familiar with Radeon driver code). I think at this time it is better if we wait for inputs from Bjorn, if he wants to take this path (introduce a flag in PCIe hotplug driver to ignore the hotplug events) or explore other options. Thanks, Rajat (In reply to Rajat Jain from comment #26) > Hi, > > I believe there are a few ways to make forward progress on this issue: > > 1) Debug these radeon & other graphics card drivers on why do they ask / > require the hotplug drivers to ignore these hotplug events. And fix this > basic issue in these VGA drivers. > Just a little background on what's going on on the GPU side. There are a number of laptops (combinations of Intel, AMD, and Nvidia hardware) that contain both a integrated GPU (iGPU, lower power, lower performance) and a discrete GPU (dGPU, higher power, higher performance). They are called PowerXpress or Enduro or Optimus depending on the vendor. Users can use the iGPU for normal tasks and then selectively power up the dGPU when playing games, etc. when they want higher performance. In order to save power, the dGPU can be completely powered down. The power control for the dGPU is handled by an ACPI method. The driver wants to stay loaded and selectively powers down the dGPU when it's idle and powers it back up on demand. We don't want to unload the driver in this case. (In reply to Alex Deucher from comment #30) > > Just a little background on what's going on on the GPU side. 
There are a > number of laptops (combinations of Intel, AMD, and Nvidia hardware) that > contain both a integrated GPU (iGPU, lower power, lower performance) and a > discrete GPU (dGPU, higher power, higher performance). They are called > PowerXpress or Enduro or Optimus depending on the vendor. Users can use the > iGPU for normal tasks and then selectively power up the dGPU when playing > games, etc. when they want higher performance. In order to save power, the > dGPU can be completely powered down. The power control for the dGPU is > handled by an ACPI method. The driver wants to stay loaded and selectively > powers down the dGPU when it's idle and powers it back up on demand. We > don't want to unload the driver in this case. Thanks for the info. Understood. It seems to me that using acpiphp for the hot-plug makes better sense since you already use ACPI for other stuff too. Spacemanspiff confirmed that using acpiphp solves the problem (using pcie_ports=compat) because it already has the work around for ignoring these hotplug events. We might still want to see that why pciehp is getting loaded instead of acpiphp. Can someone collect an acpidump for one of these systems? (In reply to SpacemanSpiff from comment #28) > pcie_ports=compat worked Are you using a modified DSDT? Your dmesg shows the same BIOS as Jose P.'s (Hewlett-Packard HP Pavilion dv6 Notebook PC/3590, BIOS F.21 09/13/2011), but his dmesg log shows three acpiphp slots registered, and your log shows none. Since acpiphp doesn't claim anything on your system, I don't think we know whether it works or not. Using "pcie_ports=compat" turns off all PCIe services, including PCIe hotplug, and I think that works because there's nothing paying attention to any power or hotplug events related to the device except for the explicit management done by the video driver. Rafael, I'm confused. pciehp_acpi_slot_detection_check() determines whether pciehp will handle the hotplug bridge. 
This boils down to running pcihp_is_ejectable() on all the device handles in the subtree starting at the bridge.

Here's what it looks like to me. The logic here seems backwards to me, but it's been this way forever, so I'm probably missing something:

  pcihp_is_ejectable:
    if handle has _EJ0 or (has _RMV and _RMV returns 1)
      pcihp_is_ejectable() returns 1
        => check_hotplug() sets *found = 1
        => acpi_pci_detect_ejectable() returns 1
        => pciehp_acpi_slot_detection_check() returns 0
        => pciehp_probe() handles hotplug events for bridge

I would think in this case we would want acpiphp to handle the bridge, not pciehp.

In Jose P.'s dmesg log (https://bugzilla.kernel.org/attachment.cgi?id=142551) I see this, which I think means acpiphp wants to manage the slot, but pciehp barges in and claims it anyway:

  pci 0000:00:02.0: PCI bridge to [bus 01]
  acpiphp: Slot [1] registered
  pciehp 0000:00:02.0:pcie04: Slot #2 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+

(In reply to Bjorn Helgaas from comment #32)
> Can someone collect an acpidump for one of these systems?
>
> (In reply to SpacemanSpiff from comment #28)
> > pcie_ports=compat worked
>
> Are you using a modified DSDT? Your dmesg shows the same BIOS as Jose P.'s
> (Hewlett-Packard HP Pavilion dv6 Notebook PC/3590, BIOS F.21 09/13/2011),
> but his dmesg log shows three acpiphp slots registered, and your log shows
> none.
>
> Since acpiphp doesn't claim anything on your system, I don't think we know
> whether it works or not. Using "pcie_ports=compat" turns off all PCIe
> services, including PCIe hotplug, and I think that works because there's
> nothing paying attention to any power or hotplug events related to the
> device except for the explicit management done by the video driver.
>
>
> Rafael, I'm confused. pciehp_acpi_slot_detection_check() determines whether
> pciehp will handle the hotplug bridge.
> This boils down to running
> pcihp_is_ejectable() on all the device handles in the subtree starting at
> the bridge.

I'm not sure what's going on in there.

> Here's what it looks like to me. The logic here seems backwards to me, but
> it's been this way forever, so I'm probably missing something:
>
>   pcihp_is_ejectable:
>     if handle has _EJ0 or (has _RMV and _RMV returns 1)
>       pcihp_is_ejectable() returns 1
>         => check_hotplug() sets *found = 1
>         => acpi_pci_detect_ejectable() returns 1
>         => pciehp_acpi_slot_detection_check() returns 0
>         => pciehp_probe() handles hotplug events for bridge
>
> I would think in this case we would want acpiphp to handle the bridge, not
> pciehp.

acpiphp will handle the bridge if device_is_managed_by_native_pciehp() returns false.

> In Jose P.'s dmesg log
> (https://bugzilla.kernel.org/attachment.cgi?id=142551) I see this, which I
> think means acpiphp wants to manage the slot, but pciehp barges in and
> claims it anyway:
>
>   pci 0000:00:02.0: PCI bridge to [bus 01]
>   acpiphp: Slot [1] registered

And that means device_is_managed_by_native_pciehp() does return false.

>   pciehp 0000:00:02.0:pcie04: Slot #2 AttnBtn- AttnInd- PwrInd- PwrCtrl-
> MRL- Interlock- NoCompl+ LLActRep+

(In reply to Alex Deucher from comment #30)
> ... In order to save power, the
> dGPU can be completely powered down. The power control for the dGPU is
> handled by an ACPI method. The driver wants to stay loaded and selectively
> powers down the dGPU when it's idle and powers it back up on demand. We
> don't want to unload the driver in this case.

When you power it back up, the dGPU has to be re-initialized (BARs restored, PCI config restored, etc.) That's basically what would happen if pciehp re-enumerated the device. Is there something special that means this wouldn't work in this case?

If the driver being loaded is the only concern, I would think we could figure out how to keep the driver loaded even if it is unbound and rebound to a device.
I assume we're talking about this path:

  vga_switchoff
    client->ops->set_gpu_state
      radeon_switcheroo_set_state    # vga_switcheroo_client_ops.set_gpu_state
        radeon_suspend_kms
    vgasr_priv.handler->power_state
      radeon_atpx_power_state        # vga_switcheroo_handler.power_state
        radeon_atpx_set_discrete_state
          radeon_atpx_call(..., ATPX_FUNCTION_POWER_CONTROL, ...)

So I think I see where the state is saved (in radeon_suspend_kms(), which calls pci_save_state()), but I'm nervous about this ATPX_FUNCTION_POWER_CONTROL thing. That seems to run an ATPX method to switch off the power. But PCI doesn't know anything about that, so don't we now have a device that PCI thinks is in D0, but is actually in D3cold? This seems like a bad situation. What happens if the PCI core tries to touch the device (AER config, ASPM config, etc.)?

We tell t

(In reply to Bjorn Helgaas from comment #34)
> (In reply to Alex Deucher from comment #30)
> > ... In order to save power, the
> > dGPU can be completely powered down. The power control for the dGPU is
> > handled by an ACPI method. The driver wants to stay loaded and selectively
> > powers down the dGPU when it's idle and powers it back up on demand. We
> > don't want to unload the driver in this case.
>
> When you power it back up, the dGPU has to be re-initialized (BARs restored,
> PCI config restored, etc.) That's basically what would happen if pciehp
> re-enumerated the device. Is there something special that means this
> wouldn't work in this case?
>
> If the driver being loaded is the only concern, I would think we could
> figure out how to keep the driver loaded even if it is unbound and rebound
> to a device.
> I assume we're talking about this path:
>
>   vga_switchoff
>     client->ops->set_gpu_state
>       radeon_switcheroo_set_state    # vga_switcheroo_client_ops.set_gpu_state
>         radeon_suspend_kms
>     vgasr_priv.handler->power_state
>       radeon_atpx_power_state        # vga_switcheroo_handler.power_state
>         radeon_atpx_set_discrete_state
>           radeon_atpx_call(..., ATPX_FUNCTION_POWER_CONTROL, ...)
>
> So I think I see where the state is saved (in radeon_suspend_kms(), which
> calls pci_save_state()), but I'm nervous about this
> ATPX_FUNCTION_POWER_CONTROL thing. That seems to run an ATPX method to
> switch off the power. But PCI doesn't know anything about that, so don't we
> now have a device that PCI thinks is in D0, but is actually in D3cold? This
> seems like a bad situation. What happens if the PCI core tries to touch the
> device (AER config, ASPM config, etc.)?

There are unfortunately two paths into shutting these devices down, and that is the non-dynamic, user-driven one.

We have runtime PM ops for radeon: radeon_pmops_runtime_suspend tells vga_switcheroo the card is dynamically off, then puts it into D3cold. Thus we wake the card back up for PCI accesses etc.

The older method was only ever a debugfs hack for users to save power with, and in that case, the PCI core would be pretty unhappy!

dave.

OK, it makes me feel a little better if you're using the usual PCI PM interfaces. But I guess that means that any caller of pci_set_power_state(D3cold) is susceptible to this problem, doesn't it? I.e., if a device below a hotplug-capable bridge is put in D3cold, the bridge is likely to report a hot-remove event.

The current acpiphp workaround is to have the driver call acpi_bus_no_hotplug() (see f244d8b623da ("ACPIPHP / radeon / nouveau: Fix VGA switcheroo problem related to hotplug")), but that doesn't seem like a very general solution.

Huang Ying added D3cold support in 448bd857d48e ("PCI/PM: add PCIe runtime D3cold support"). I cc'd him in case he has ideas.
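To make the concern above concrete, here is a small user-space model of the sequence being described: a device below a hotplug-capable port loses power, the link drops, and a hotplug handler that knows nothing about the intentional power-off treats that as a removal. This is only a sketch; every type and function name in it (model_port, model_dev, model_power_off and so on) is hypothetical and does not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these types exist in the kernel. */
enum model_power { MODEL_D0, MODEL_D3COLD };

struct model_port {
    bool hotplug_capable;   /* port reports link changes as hotplug events */
    bool link_up;
};

struct model_dev {
    struct model_port *port;    /* upstream port the device sits behind */
    enum model_power state;
    bool driver_bound;
};

/*
 * Cut power to the device. The link to the upstream port drops as a side
 * effect. Returns true if the port raises a Link Down hotplug event.
 */
static bool model_power_off(struct model_dev *dev)
{
    assert(dev && dev->port);
    bool was_up = dev->port->link_up;

    dev->state = MODEL_D3COLD;
    dev->port->link_up = false;
    return was_up && dev->port->hotplug_capable;
}

/*
 * A hotplug handler with no knowledge of the intentional power-off:
 * it treats Link Down as a removal and unbinds the driver.
 */
static void model_handle_link_down(struct model_dev *dev)
{
    dev->driver_bound = false;
}

/* Run the whole scenario once and report whether the driver stayed bound. */
bool model_driver_survives_power_off(bool port_hotplug_capable)
{
    struct model_port port = { .hotplug_capable = port_hotplug_capable,
                               .link_up = true };
    struct model_dev gpu = { .port = &port, .state = MODEL_D0,
                             .driver_bound = true };

    if (model_power_off(&gpu))
        model_handle_link_down(&gpu);
    return gpu.driver_bound;
}
```

In this model, model_driver_survives_power_off(true) reports that the driver was unbound, while the same power-off behind a port without hotplug capability leaves it bound. That is the asymmetry the thread is worried about.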
My question about "can we treat a device being put into D3cold as a hot-remove, and it being restored to D0 as a hot-add?" is still on the table. That seems like the obvious way to handle it, since that's exactly what we do for normal hotplug, so I'd like to push on that a little more before trying to figure out more workarounds. Created attachment 149591 [details]
ACPI Dump - HP Pavilion dv6-6145ca
acpidump attached. I have an HP Pavilion dv6-6145ca. I am not using a modified DSDT.

Well, my problem with treating hot remove as unplug is that the driver model expects unplug to unbind, and we want to keep userspace thinking the device is still there; userspace has the driver open and is sitting doing nothing with it.

I can't see a way, if we unbind the driver, for it to come back like magic.

(In reply to Dave Airlie from comment #39)
> well my problem with doing hot remove as unplug, is the driver model expects
> unplug to unbind, and we want to keep userspace thinking the device is still
> there, and userspace has the driver open and is sitting doing nothing with
> it.
>
> I can't see a way if we unbind the driver for it to come back like magic.

And if we know that the device is not going away, it's better to use that knowledge in my opinion.

Yes, the platform may send us a device notification in that case, but then we decide what to do about that and that need not mean "hot-remove".

(In reply to Rafael J. Wysocki from comment #40)
> And if we know that the device is not going away, it's better to use that
> knowledge in my opinion.

How do we know the device isn't going away? Relying on the driver to tell us seems like it puts too many assumptions in the driver.

I'm a little concerned about D3cold support in general because of this. If we completely power off a device below a hotplug-capable bridge, how can we have any confidence that it's the same device when we power it back up?

But I guess none of this is helping us resolve this bug.
Let me try to summarize:

- Linux requests control of PCIe native hotplug with _OSC, and it succeeds
- 00:02.0 is a Root Port to bus 01 and supports PCIe native hotplug
- pciehp claims hotplug control for 00:02.0
- 01:00.0 is a dGPU behind the 00:02.0 Root Port
- The dGPU driver uses acpi_bus_no_hotplug() to tell acpiphp to ignore hotplug events
- The dGPU driver uses pci_set_power_state(D3cold) to power off 01:00.0
- The \_SB.PCI0.VGA.ATPX method turns off power and generates a Bus Check notification to \_SB_.PCI0.PB2_
- Because of the Bus Check, hotplug_event() calls acpiphp_check_bridge(), which does nothing because of the previous acpi_bus_no_hotplug() on the slot
- When 01:00.0 is powered off, the upstream bridge (00:02.0) generates a Link Down hotplug interrupt
- Because of the Link Down interrupt, pciehp removes the 01:00.0 dGPU (prior to 02e93a8a7c1d and 2b3940b60626 it would have ignored this interrupt)
- Removing 01:00.0 causes the problems reported in this bugzilla

The first question is whether acpiphp or pciehp should handle hotplug events. My opinion is that the BIOS granted control to the OS via _OSC, so it expects PCIe native hotplug, i.e., pciehp.

If anybody thinks acpiphp should handle them, we need to know how the OS would figure that out. pciehp_acpi_slot_detection_check() does things along that line, but I can't see how the spec would suggest that we combine _OSC, _ADR, _EJ0, and _RMV and conclude that we should use ACPI hotplug in this case.

If we agree that pciehp should handle them, the only easy fix looks like Rajat's proposal in comment #25. I would prefer some sort of PCI interface instead of having the driver make both an ACPI call and a PCI call. From the driver's point of view, this is just a way to modify what happens when it calls pci_set_power_state(D3cold), so it seems like it ought to be related to that call.

(In reply to Bjorn Helgaas from comment #41)
> (In reply to Rafael J.
Wysocki from comment #40)
> > And if we know that the device is not going away, it's better to use that
> > knowledge in my opinion.
>
> How do we know the device isn't going away? Relying on the driver to tell
> us seems like it puts too many assumptions in the driver.
>
> I'm a little concerned about D3cold support in general because of this. If
> we completely power off a device below a hotplug-capable bridge, how can we
> have any confidence that it's the same device when we power it back up?
>

This only affects laptops that contain the ATPX ACPI method (for AMD hybrid graphics) or _DSM (for Nvidia hybrid graphics). Since these are laptops, there's not much chance of the user swapping out the dGPU.

There is one other scenario: some laptops still have an ExpressCard slot, and with a special adapter it is possible to plug a PCIe device into it. There is a project that uses the ExpressCard slot to connect an external PCIe GPU. In this case you can have an Intel GPU, a PowerXpress/Enduro/Optimus GPU, and another Nvidia/AMD GPU, and the last one, connected to the ExpressCard slot, can be hotplugged.

I do not know if anybody is using this on Linux, but this configuration works on Windows, and I do not see a reason why it should not work on Linux too (once drivers are loaded).

So when you fix this bug, it would be good not to mark all GPUs in a notebook with ATPX/_DSM as non-swappable, since somebody really can connect another GPU to a PCIe/ExpressCard slot.

(In reply to Bjorn Helgaas from comment #41)
> (In reply to Rafael J. Wysocki from comment #40)
> > And if we know that the device is not going away, it's better to use that
> > knowledge in my opinion.
>
> How do we know the device isn't going away? Relying on the driver to tell
> us seems like it puts too many assumptions in the driver.
>
> I'm a little concerned about D3cold support in general because of this.
> If we completely power off a device below a hotplug-capable bridge, how can we
> have any confidence that it's the same device when we power it back up?

We should get a notification when the device appears again too. If we don't, that's a platform bug and we can't do much. If we do, though, we can double check the info in the config header (we don't do that today, but maybe we should).

> But I guess none of this is helping us resolve this bug. Let me try to
> summarize:
>
> - Linux requests control of PCIe native hotplug with _OSC, and it succeeds

From that point on, the platform should not send notifications to us.

> - 00:02.0 is a Root Port to bus 01 and supports PCIe native hotplug
> - pciehp claims hotplug control for 00:02.0
> - 01:00.0 is a dGPU behind the 00:02.0 Root Port
> - The dGPU driver uses acpi_bus_no_hotplug() to tell acpiphp to ignore
> hotplug events

Which is OK given the above.

> - The dGPU driver uses pci_set_power_state(D3cold) to power off 01:00.0
> - The \_SB.PCI0.VGA.ATPX method turns off power and generates a Bus Check
> notification to \_SB_.PCI0.PB2_

Which is a platform bug.

> - Because of the Bus Check, hotplug_event() calls acpiphp_check_bridge(),
> which does nothing because of the previous acpi_bus_no_hotplug() on the slot
> - When 01:00.0 is powered off, the upstream bridge (00:02.0) generates a
> Link Down hotplug interrupt
> - Because of the Link Down interrupt, pciehp removes the 01:00.0 dGPU (prior
> to 02e93a8a7c1d and 2b3940b60626 it would have ignored this interrupt)
> - Removing 01:00.0 causes the problems reported in this bugzilla

We need to ignore that event, because we have an alternative way of handling it (the switcheroo thing) and we know that.

> The first question is whether acpiphp or pciehp should handle hotplug
> events. My opinion is that the BIOS granted control to the OS via _OSC, so
> it expects PCIe native hotplug, i.e., pciehp.

If it expected that, it wouldn't send us ACPI device notifications in the first place.
Now, we generally need both, because some platforms are more buggy and not only send us ACPI device notifications in that case, but also do not send PCIe interrupts, so we only get one. On the other hand, getting both should not be a problem if everything is serialized properly (and I think it is today). > If anybody thinks acpiphp should handle them, we need to know how the OS > would figure that out. pciehp_acpi_slot_detection_check() does things along > that line, but I can't see how the spec would suggest that we combine _OSC, > _ADR, _EJ0, and _RMV and conclude that we should use ACPI hotplug in this > case. > > If we agree that pciehp should handle them, the only easy fix looks like > Rajat's proposal in comment #25. I would prefer some sort of PCI interface > instead of having the driver make both an ACPI call and a PCI call. From > the driver's point of view, this is just a way to modify what happens when > it calls pci_set_power_state(D3cold), so it seems like it ought to be > related to that call. We've already told acpiphp to ignore events for that device, so we need to tell PCIe to do the same thing. At least for consistency, if nothing else. (In reply to Rafael J. Wysocki from comment #44) > (In reply to Bjorn Helgaas from comment #41) > > The first question is whether acpiphp or pciehp should handle hotplug > > events. My opinion is that the BIOS granted control to the OS via _OSC, so > > it expects PCIe native hotplug, i.e., pciehp. > > If it expected that, it wouldn't send us ACPI device notifications in the > first place. I should have said "if the BIOS grants control to the OS, it *should* expect PCIe native hotplug." In practical terms, I guess I'm asserting that if the OS has PCIe native hotplug control, we should be able to take over hotplug event reporting on all bridges. If that's the case, we should be able to slice out a lot of the pciehp_acpi_slot_detection_check() mess. 
I don't think that would preclude acpiphp from also listening to notifications. > > If we agree that pciehp should handle them, the only easy fix looks like > > Rajat's proposal in comment #25. I would prefer some sort of PCI interface > > instead of having the driver make both an ACPI call and a PCI call. From > > the driver's point of view, this is just a way to modify what happens when > > it calls pci_set_power_state(D3cold), so it seems like it ought to be > > related to that call. > > We've already told acpiphp to ignore events for that device, so we need to > tell PCIe to do the same thing. At least for consistency, if nothing else. I'm just suggesting that the driver should only have to call a single function, and either that function should talk to both acpiphp and pciehp, or it should set a pci_dev flag that both acpiphp and pciehp look at. (In reply to Bjorn Helgaas from comment #45) > (In reply to Rafael J. Wysocki from comment #44) > > (In reply to Bjorn Helgaas from comment #41) > > > The first question is whether acpiphp or pciehp should handle hotplug > > > events. My opinion is that the BIOS granted control to the OS via _OSC, > so > > > it expects PCIe native hotplug, i.e., pciehp. > > > > If it expected that, it wouldn't send us ACPI device notifications in the > > first place. > > I should have said "if the BIOS grants control to the OS, it *should* expect > PCIe native hotplug." > > In practical terms, I guess I'm asserting that if the OS has PCIe native > hotplug control, we should be able to take over hotplug event reporting on > all bridges. All bridges below the root the _OSC was called for. > If that's the case, we should be able to slice out a lot of > the pciehp_acpi_slot_detection_check() mess. > > I don't think that would preclude acpiphp from also listening to > notifications. OK, that makes sense. > > > If we agree that pciehp should handle them, the only easy fix looks like > > > Rajat's proposal in comment #25. 
I would prefer some sort of PCI > interface > > > instead of having the driver make both an ACPI call and a PCI call. From > > > the driver's point of view, this is just a way to modify what happens > when > > > it calls pci_set_power_state(D3cold), so it seems like it ought to be > > > related to that call. > > > > We've already told acpiphp to ignore events for that device, so we need to > > tell PCIe to do the same thing. At least for consistency, if nothing else. > > I'm just suggesting that the driver should only have to call a single > function, and either that function should talk to both acpiphp and pciehp, > or it should set a pci_dev flag that both acpiphp and pciehp look at. Sounds reasonable. I didn't anticipate PCIe hotplug to have the same problem to be honest. BTW, this is not a plain D3cold, because platforms don't send device checks for D3cold transitions as a rule. In fact, we've had D3cold forever, although it used to be called D3 in ACPI, but the semantics were pretty much the same. Only later someone noticed the confusion between ACPI device states and PCI device states and decided to do something about that. The switcheroo thing is just "special" and platforms tend to treat it as hotplug (which may be due to the way it is handled on Windows). (In reply to Rafael J. Wysocki from comment #47) > BTW, this is not a plain D3cold, because platforms don't send device checks > for D3cold transitions as a rule. I assume you mean that we don't normally get Bus Check notifications when ACPI puts things in D3cold. That would make sense to me because ACPI probably doesn't know how to put a removable device in D3cold. When ACPI does a D3cold transition, I would guess it's for a built-in device and there's no possibility of it being replaced with a different device before it's powered up again. 
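The "single function that sets a pci_dev flag both acpiphp and pciehp look at" idea discussed above can be sketched in user space as follows. This is an illustration under stated assumptions, not the attached test patch and not kernel code: struct model_pci_dev, the ignore_hotplug field name and every model_* function are hypothetical. The propagation to the upstream bridge follows Rajat's earlier remark that the flag belongs on the parent bridge, since that is the hotplug slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only what this sketch needs. */
struct model_pci_dev {
    struct model_pci_dev *upstream_bridge;  /* NULL if there is none */
    bool ignore_hotplug;                    /* assumed flag name */
};

/*
 * The single call a driver would make before powering its device off.
 * The flag is also set on the upstream bridge, because in this model the
 * bridge is the hotplug slot that receives the events.
 */
void model_ignore_hotplug(struct model_pci_dev *dev)
{
    assert(dev != NULL);
    dev->ignore_hotplug = true;
    if (dev->upstream_bridge)
        dev->upstream_bridge->ignore_hotplug = true;
}

/* pciehp-like path: a Link Down interrupt arrives on the bridge. */
static bool model_pciehp_link_down_removes(const struct model_pci_dev *bridge)
{
    return !bridge->ignore_hotplug;
}

/* acpiphp-like path: a Bus Check notification arrives for the bridge. */
static bool model_acpiphp_bus_check_removes(const struct model_pci_dev *bridge)
{
    return !bridge->ignore_hotplug;
}

/* Run one power-off scenario; returns true if the dGPU is not removed. */
bool model_gpu_survives(bool driver_marks_device, bool event_via_acpi)
{
    struct model_pci_dev root_port = { .upstream_bridge = NULL,
                                       .ignore_hotplug = false };
    struct model_pci_dev gpu = { .upstream_bridge = &root_port,
                                 .ignore_hotplug = false };

    if (driver_marks_device)
        model_ignore_hotplug(&gpu);

    bool removed = event_via_acpi
        ? model_acpiphp_bus_check_removes(&root_port)
        : model_pciehp_link_down_removes(&root_port);
    return !removed;
}
```

The point of the design is that the driver makes one call and both event paths consult the same state, so neither the Bus Check nor the Link Down path removes the device once it has been marked.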
Created attachment 149691 [details] test patch This is a trial-balloon patch based on Rafael's existing acpiphp work (f244d8b623da) and Rajat's pciehp work from comment #25. My intent is that this should work for both acpiphp and pciehp. As long as both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you should be able to test the pciehp path by booting normally, and the acpiphp path by booting with "pcie_ports=compat". Created attachment 149801 [details]
dmesg patch 2
patch - works great with and without pcie_ports=compat. dmesg attached. Thanks.

(In reply to Bjorn Helgaas from comment #49)
> Created attachment 149691 [details]
> test patch
>
> This is a trial-balloon patch based on Rafael's existing acpiphp work
> (f244d8b623da) and Rajat's pciehp work from comment #25.
>
> My intent is that this should work for both acpiphp and pciehp. As long as
> both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you
> should be able to test the pciehp path by booting normally, and the acpiphp
> path by booting with "pcie_ports=compat".

I don't know if this has anything to do with that patch, but: yesterday I compiled 3.6.2 with that patch applied, and it didn't freeze (or whatever we call it). Then I used my laptop for 2-3 hours with the patched 3.6.2; after that I suspended it and went to bed.

This morning, when I wanted to use my laptop, it just froze, so I removed the battery and the AC adapter, pressed the power-on button for some seconds, and then I discovered that the Caps Lock light was blinking. A Google search got me to the HP website, where it says:

LEDs blink 3 times: Memory Module error, not functional
http://h10025.www1.hp.com/ewfrf/wc/document?docname=c01732674&tmp_task=solveCategory&cc=us&dlc=en&lc=en&product=5149092&query=QA598EA&tool=#N241

Of course this could just be a coincidence; I just want you to hear about it.

My laptop is an HP dv6-6145eo, spec here:
http://h10025.www1.hp.com/ewfrf/wc/document?docname=c02921653&tmp_task=prodinfoCategory&cc=us&dlc=en&lc=en&product=5153075

(In reply to Kristian from comment #52)
> (In reply to Bjorn Helgaas from comment #49)
> > Created attachment 149691 [details]
> > test patch
> >
> > This is a trial-balloon patch based on Rafael's existing acpiphp work
> > (f244d8b623da) and Rajat's pciehp work from comment #25.
> >
> > My intent is that this should work for both acpiphp and pciehp.
> > As long as
> > both are enabled (CONFIG_HOTPLUG_PCI_ACPI and CONFIG_HOTPLUG_PCI_PCIE), you
> > should be able to test the pciehp path by booting normally, and the acpiphp
> > path by booting with "pcie_ports=compat".
>
> Dont know if it have anything todo with that patch but. Yesterday I compiled
> 3.6.2 with that patch applied, and it didn't freeze (or what we call it?).
> Then I used my laptop for 2-3 hours with 3.6.2 patched, after that I
> suspended it and went to bed.
>
> This morning, when I want to use my laptop. It just freeze, so i removed the
> battery and the ac-adapter pressed power-on-button for some seconds, and
> then I discovered that the Caps-Lock light was blinking. A google search got
> me to HP website where is says it:
>
> LEDs blink 3 times Memory Module error not functional
> http://h10025.www1.hp.com/ewfrf/wc/
> document?docname=c01732674&tmp_task=solveCategory&cc=us&dlc=en&lc=en&product=
> 5149092&query=QA598EA&tool=#N241
>
> Of course this could just be a coincidence, just want you to hear it.
>
> My laptop is a HP dv6-6145eo spec here:
> http://h10025.www1.hp.com/ewfrf/wc/
> document?docname=c02921653&tmp_task=prodinfoCategory&cc=us&dlc=en&lc=en&produ
> ct=5153075

Maybe I was a little too hasty. I just swapped the memory; I just thought it might have something to do with this patch.

Kristian, that sounds like a different issue. If it persists, please open a separate bugzilla report for it.

Created attachment 151261 [details]
dmesg 3.17-rc6

I'm running 3.17-rc6 from Ubuntu's kernel repo ( http://kernel.ubuntu.com/~kernel-ppa/mainline/ ), and, since the patch was already added to it, everything is working great so far.

The only outstanding things would be:

- some (hopefully) unrelated bug in my system... for some reason, I have to manually restart (as in, stop completely and then start) KDM / the X server to make my dedicated card appear in 'xrandr --listproviders' or 'DRI_PRIME=1 glxinfo' (I don't know when this started to happen).
And, - this message being spammed in the system logs, which I don't know what it means: >[ 1889.169800] radeon 0000:00:01.0: BAR 6: [??? 0x00000000 flags 0x2] has >bogus alignment >[ 1889.169827] pci 0000:00:14.4: PCI bridge to [bus 05] >[ 1889.169833] pci 0000:00:14.4: bridge window [io 0x6000-0x6fff] >[ 1889.169840] pci 0000:00:14.4: bridge window [mem 0xf0d00000-0xf0efffff] >[ 1889.169846] pci 0000:00:14.4: bridge window [mem 0xf0f00000-0xf10fffff >pref] >[ 1889.169876] radeon 0000:01:00.0: Max Payload Size 16384, but upstream >0000:00:02.0 set to 128; if necessary, use "pci=pcie_bus_safe" and report a >bug Anyway, I can confirm the patch is working. Attached is dmesg. Thank you guys, thank very much! @Kristian: I've had similar issues many times before... I don't know if it's related to linux (I assumed it is not), but looks like these HP laptop BIOS are really buggy. To fix it, you have to reset the BIOS by removing the battery for 30 seconds or so. Not sure if it's the same issue, though. Created attachment 151271 [details]
dmesg messages
I'm getting the same dmesg messages with kernel 3.17-rc6 when I close the laptop lid.
>
> - some (hopefully) unrelated bug in my system... for some reason, I have to
> manually restart (as in, stop completely and then start) KDM / the X server
> to make my dedicated card appear in 'xrandr --listproviders' or 'DRI_PRIME=1
> glxinfo' (I don't know when did this started to happen). And,
i can see dgpu with xrandr --listproviders.
Maybe not related, but I think i had similar problem before when i turned off dgpu using echo "OFF" > /sys/kernel/debug/vgaswitcheroo/switch
(In reply to Pali Rohár from comment #56)
> Created attachment 151271 [details]
> dmesg messages
>
> I'm getting same dmesg messages with kernel 3.17-rc6 when I close LID of
> laptop.

Yes, resume from sleep is broken for me with the latest git.

With 3.14 it works if I pass "acpi_sleep=s3_bios" to the kernel (bug: https://bugs.freedesktop.org/show_bug.cgi?id=42960 ). But it does not work anymore. After resume, the laptop screen is white for a few seconds and then turns off. The desktop is otherwise OK and I can use an external monitor. Restarting X by logging off and on brings my laptop screen back on.

Attaching dmesg after resume with and without the parameter and after restarting X.

If needed, I can later try to check the behaviour before the patch. What is the patch commit id?

Created attachment 151351 [details]
dmesg after resume from suspend
In dmesg after resume, I see a pciehp message that may be related:

[ 1252.262984] pciehp 0000:00:02.0:pcie04: Device 0000:01:00.0 already exists at 0000:01:00, cannot hot-add
[ 1252.262988] pciehp 0000:00:02.0:pcie04: Cannot add device at 0000:01:00

The dGPU is still powered off, so that's OK.

Then some ERROR lines:

[ 1252.424006] [drm:radeon_dp_link_train_ce] *ERROR* channel eq failed: 5 tries
[ 1252.424008] [drm:radeon_dp_link_train_ce] *ERROR* channel eq failed
...
[ 1257.444547] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 1257.444551] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E3B0 (len 2585, WS 4, PS 4) @ 0xEA9A

and also the "bogus alignment" message Jose reported (this one is also present on startup).

SpacemanSpiff, unless you think this is the same issue Jose P. originally reported in this bugzilla, can you open new bugzilla(s)?

The "Device already exists, cannot hot-add" is definitely a different problem (probably the same as https://bugzilla.kernel.org/show_bug.cgi?id=74471; we even had some patches to address that, and they probably need to be resurrected).

The "BAR ... has bogus alignment" is another separate problem.

(In reply to SpacemanSpiff from comment #58)
> yes resume for sleep is broken for me with latest git.
>
> With 3.14 it works if i pass "acpi_sleep=s3_bios" to kernel ( bug -
> https://bugs.freedesktop.org/show_bug.cgi?id=42960 ). But it does not work
> anymore.

Well, I just tested it... after resume from suspend-to-RAM, I got a black screen. I switched to TTY1, ran xrandr to attach the display to an external monitor, switched to X and after some other commands, the laptop monitor turned on...

I guess this could work:
https://bugs.freedesktop.org/show_bug.cgi?id=42960#c47
>sleep 1; xset dpms force standby

None of these other issues are related to this bug. Please report them separately or follow up on existing bugs now that this one is fixed.

Well, I suppose I have to close it as fixed...
Fixed in 3.17-rc6. Thanks radeon & pci devs.

Problem from comment #56 is reported in new bug #85311