Created attachment 296647 [details] The Full DSDT ASL Hi, We have some Intel laptops (11th generation CPU) with NVIDIA GPU suffering the same GPU falling off the bus problem while exiting s2idle with external display connected. These laptops connect the external display via the HDMI/DisplayPort on a USB Type-C interfaced dock. If we enter and exit s2idle with the dock connected, the NVIDIA GPU (confirmed on 10de:24b6 and 10de:25b8) and the PCIe port can come back to D0 w/o problem. If we enter the s2idle, disconnect the dock, then exit the s2idle, both external display and the panel will remain with no output for ~30 seconds. The dmesg as follows shows the "nvidia 0000:01:00.0:can't change power state from D3cold to D0 (config space inaccessible)" due to the following ACPI error [ 154.446781] [ 154.446783] [ 154.446783] Initialized Local Variables for Method [IPCS]: [ 154.446784] Local0: 000000009863e365 <Obj> Integer 00000000000009C5 [ 154.446790] [ 154.446791] Initialized Arguments for Method [IPCS]: (7 arguments defined for method invocation) [ 154.446792] Arg0: 0000000025568fbd <Obj> Integer 00000000000000AC [ 154.446795] Arg1: 000000009ef30e76 <Obj> Integer 0000000000000000 [ 154.446798] Arg2: 00000000fdf820f0 <Obj> Integer 0000000000000010 [ 154.446801] Arg3: 000000009fc2a088 <Obj> Integer 0000000000000001 [ 154.446804] Arg4: 000000003a3418f7 <Obj> Integer 0000000000000001 [ 154.446807] Arg5: 0000000020c4b87c <Obj> Integer 0000000000000000 [ 154.446810] Arg6: 000000008b965a8a <Obj> Integer 0000000000000000 [ 154.446813] [ 154.446815] ACPI Error: Aborting method \IPCS due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446824] ACPI Error: Aborting method \MCUI due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446829] ACPI Error: Aborting method \SPCX due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446835] ACPI Error: Aborting method \_SB.PC00.PGSC due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446841] ACPI Error: Aborting method \_SB.PC00.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446846] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446852] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529) [ 154.446860] acpi device:02: Failed to change power state to D0 [ 154.690760] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown) The IPCS is the last function called from \_SB.PC00.PEG1.PG01._ON which we expect it to prepare everything before bringing back the NVIDIA GPU but it's stuck in the infinite loop as described below. Please refer to the attached full dsdt.sal While (One) { If ((!IBSY || (IERR == One))) { Break } If ((Local0 > TMOV)) { RPKG [Zero] = 0x03 Return (RPKG) /* \IPCS.RPKG */ } Sleep (One) Local0++ } And the upstream PCIe port of NVIDIA seems to become inaccessible due to the messages as follows. [ 292.746508] pcieport 0000:00:01.0: waiting 100 ms for downstream link, after activation [ 292.882296] pci 0000:01:00.0: waiting additional 100 ms to become accessible [ 316.876997] pci 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible) Since the IPCS is the Intel Reference Code and we don't really know why the never-end loop happens just because we unplug the dock while the system still stays in s2idle. Can anyone from Intel suggest what happens here? And one thing also worth mentioning, if we unplug the display cable from the dock before entering the s2idle, NVIDIA GPU can come back w/o problem even if we disconnect the dock before exiting s2idle. Here's the lspci information in the attached lspci.log and the dmesg.log with ACPI trace_state enabled and dynamic debug on for drivers/pci/pci.c, drivers/acpi/device_pm.c for the whole s2idle enter/exit with IPCS timeout. Any suggestion would be appreciated. Thanks. Chris
Created attachment 296649 [details] lspci -nnvt output
Created attachment 296651 [details] dmesg with dynamic debug and ACPI trace_state on
I found a timing issue which might be worth investigation. The TBT WD dock I have here with the latest BIOS 0.5.57 can be manually waked up w/o AE_AML_LOOP_TIMEOUT problem. So I dump the acpi log and do cross-comparison between the WD19SC and WD19TB dock. A key resource and function in ssdt6.dsl are PowerResource (D3C) and \_SB.PC00.TCON (). The \_SB.PC00.D3C._ON and \_SB.PC00.TCON begins after the \_SB.PC00.PGON. Then during the period of TCON, the IPCS will run in parallel with TCON but the timing between TB and SC are different. Please refer to the TCON code in https://gist.github.com/mschiu77/05b9b8ff3ba795f11dd467660952f6e5#file-ssdt7-dsl-L884 In WD19SC case, the IPCS() will begin running before L911 (line 911 in the above link) per the ACPI log as follows. [ 262.733726] extrace-0138 ex_trace_point : Method Begin [0x0000000024236d48:\MCUI] execution. [ 262.733882] extrace-0138 ex_trace_point : Method Begin [0x000000003ff97c7a:\IPCS] execution. [ 262.744064] extrace-0138 ex_trace_point : Method Begin [0x000000004edb7213:\_SB.PC00.DSGS] execution. [ 262.744233] extrace-0138 ex_trace_point : Method End [0x000000004edb7213:\_SB.PC00.DSGS] execution. [ 262.744245] extrace-0138 ex_trace_point : Method Begin [0x0000000073d60801:\_SB.PC00.DSCR] execution. [ 262.744424] extrace-0138 ex_trace_point : Method Begin [0x000000004edb7213:\_SB.PC00.DSGS] execution. In WD19TB case, the IPCS will begin to run in parallel with L913 per the following ACPI log. [ 147.643528] extrace-0138 ex_trace_point : Method Begin [0x000000008fde276b:\MCUI] execution. [ 147.643927] extrace-0138 ex_trace_point : Method Begin [0x0000000006dad187:\_SB.PC00.DSGS] execution. [ 147.644164] extrace-0138 ex_trace_point : Method End [0x0000000006dad187:\_SB.PC00.DSGS] execution. [ 147.644175] extrace-0138 ex_trace_point : Method Begin [0x000000003af322ca:\_SB.PC00.DSCR] execution. [ 147.644338] extrace-0138 ex_trace_point : Method Begin [0x00000000dd5b6bc8:\IPCS] execution. [ 147.644788] extrace-0138 ex_trace_point : Method Begin [0x0000000006dad187:\_SB.PC00.DSGS] execution. [ 147.644927] extrace-0138 ex_trace_point : Method End [0x0000000006dad187:\_SB.PC00.DSGS] execution. [ 147.644949] extrace-0138 ex_trace_point : Method End [0x000000003af322ca:\_SB.PC00.DSCR] execution. From the timestamp of WD19SC case, the IPCS starts 11ms ahead of L911, which may indicate the TCON is still in the while loop of L893. According to the debug message, the "TACK after D3 cold exit" is not printed which means TACK is not asserted yet. Is it possible that the IPCS also needs to wait for the TACK or the "ADBG ("D3Cold Exit")"?
Created attachment 296739 [details] PMC info when IPSC timeout
Created attachment 296741 [details] PMC info of normal s2idle exit
Don't know whether the PCM information in /sys/kernel/debug/pmc_core is helpful or not. The file includes `ltr_show`, `pch_ip_power_gating_status`, I`substate_live_status_registers` and `substate_status_registers`. I also attached the same information when exit s2idle w/o problem. The major difference is mainly XHCI/USB related (because the dock is removed). And maybe the ISH? Attach the information for your reference.
Provide ACPI verbose log: Pass log with TBT dock: https://files.support.canonical.com/f/550898f54203b96d Fail log with DC dock: https://files.support.canonical.com/f/5aedaec721232a34 All based on the latest BIOS. Capture method under Ubuntu Linux: sudo su cd /sys/kernel/debug/dynamic_debug echo "file drivers/pci/pci.c +p" > control echo "file drivers/acpi/device_pm.c +p" > control cd /sys/module/acpi/parameters/ echo 0xffff00f0 > debug_layer echo 0x600 > debug_level Start to suspend Exit suspend Use dmesg to get the output file
Created attachment 296745 [details] fail log (DC docking) with kernel parameter "drm.debug=0xe" fail log (DC docking) with kernel parameter "drm.debug=0xe"
Created attachment 296847 [details] TGL-U log with kernel parameter "drm.debug=0xe" TGL-U log with kernel parameter "drm.debug=0xe" Issue cannot reproduce. "PM: suspend entry (s2idle)" which indicates the suspend start and "ACPI: EC: interrupt unblocked" which indicate the system starts the resume.
Created attachment 296849 [details] ACPI debug log on TGL-U ACPI debug log on TGL-U with dual C docking. Issue cannot reproduce. "PM: suspend entry (s2idle)" which indicates the suspend start and "ACPI: EC: interrupt unblocked" which indicate the system starts the resume. Note: TGL-U system seems a little bit slow to resume but it still can resume back okay.
Verbose ACPI debug log with drm.debug=0xe https://files.support.canonical.com/f/208f43fd63f86041 echo 0xffff00f0 > debug_layer echo 0x600 > debug_level
Created attachment 296855 [details] ACPI debug log with enabling trace_state ACPI debug log with enabling trace_state
I got this problem fixed with the following hacky code and I think it should be a finding that could help. ``` --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1399,6 +1399,15 @@ static int i915_pm_poweroff_late(struct device *kdev) return i915_drm_suspend_late(&i915->drm, true); } +static int i915_pm_resume_noirq(struct device *kdev) +{ + struct drm_i915_private *i915 = kdev_to_i915(kdev); + + if (!IS_VALLEYVIEW(i915) && !IS_CHERRYVIEW(i915)) + intel_hpd_init(i915); + return 0; +} + static int i915_pm_resume_early(struct device *kdev) { struct drm_i915_private *i915 = kdev_to_i915(kdev); @@ -1626,6 +1635,7 @@ const struct dev_pm_ops i915_pm_ops = { .prepare = i915_pm_prepare, .suspend = i915_pm_suspend, .suspend_late = i915_pm_suspend_late, + .resume_noirq = i915_pm_resume_noirq, .resume_early = i915_pm_resume_early, .resume = i915_pm_resume, ``` Because this APCI method IPCS lockup only happens when we remove the dock with external display connected(or the USB-C dongle with external display connected) when suspended and then exit s2idle. It should be related to either graphic drivers or hpd events handling. I did lots of experiments and found if the \_SB.PC00.PEG1.PG01._ON exeutes after i915_pm_resume_early, the IPCS will not be locked by the IBSY and IERR. Since the \_SB.PC00.PEG1.PG01._ON is invoked by pci_power_up() in pci_pm_resume_noirq and it's in the context of dpm_resume_noirq(). The i915_pm_resume_early is in dpm_resume_early() context and that's why the Sleep(One) in the IPCS lasts for >30 seconds but i915 still has no chance to early resume to free the IPCS lock. I assume that it's due to the unplug action is not handled properly during resume, so I add the .resume_noirq() in i915 driver to let the hpd have chance to do something during in the same dpm_resume_noirq context. It seems to work fine and pci 0000:01:00.0 (PEG1) can transition to D0 w/o problem. However, it's a dirty hack because I'll hit the kernel WARN in hpd_init(). Please refer to the attached hpd_init_dmesg.log. Can anyone from Intel help on this? I'll be glad to collect logs or test any PATCH. Thanks
Created attachment 296897 [details] Log of doing hpd_init during .resume_noirq
Thanks for the information and your debug patch, that is very helpful to illustrate the problem. Given this is graphics related, can you please file a bug to the graphics experts following this page? https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs
Thanks. I've filed a bug on https://gitlab.freedesktop.org/drm/intel/-/issues/3500 and still get no information about the progress.