Bug 212951 - System freeze for 30 seconds after exiting s2idle
Summary: System freeze for 30 seconds after exiting s2idle
Status: CLOSED MOVED
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS (show other bugs)
Hardware: Intel Linux
: P1 blocking
Assignee: acpi_bios
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-05-05 07:20 UTC by Chris Chiu
Modified: 2021-06-01 07:41 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.12
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The Full DSDT ASL (2.61 MB, text/x-csrc)
2021-05-05 07:20 UTC, Chris Chiu
Details
lspci -nnvt output (1.30 KB, text/plain)
2021-05-05 07:22 UTC, Chris Chiu
Details
dmesg with dynamic debug and ACPI trace_state on (1.13 MB, text/plain)
2021-05-05 07:23 UTC, Chris Chiu
Details
PMC info when IPSC timeout (2.04 KB, application/x-compressed-tar)
2021-05-13 12:37 UTC, Chris Chiu
Details
PMC info of normal s2idle exit (2.08 KB, application/x-compressed-tar)
2021-05-13 12:38 UTC, Chris Chiu
Details
fail log (DC docking) with kernel parameter "drm.debug=0xe" (270.01 KB, text/plain)
2021-05-14 11:10 UTC, Hsiao-Ting Wang, Tiffany
Details
TGL-U log with kernel parameter "drm.debug=0xe" (243.60 KB, text/plain)
2021-05-19 04:37 UTC, Hsiao-Ting Wang, Tiffany
Details
ACPI debug log on TGL-U (252.59 KB, text/plain)
2021-05-19 04:47 UTC, Hsiao-Ting Wang, Tiffany
Details
ACPI debug log with enabling trace_state (695.79 KB, text/plain)
2021-05-19 09:04 UTC, Hsiao-Ting Wang, Tiffany
Details
Log of doing hpd_init during .resume_noirq (733.11 KB, text/plain)
2021-05-20 11:33 UTC, Chris Chiu
Details

Description Chris Chiu 2021-05-05 07:20:48 UTC
Created attachment 296647 [details]
The Full DSDT ASL

Hi,
    We have some Intel laptops (11th generation CPU) with NVIDIA GPU
suffering the same GPU falling off the bus problem while exiting
s2idle with external display connected. These laptops connect the
external display via the HDMI/DisplayPort on a USB Type-C interfaced
dock. If we enter and exit s2idle with the dock connected, the NVIDIA
GPU (confirmed on 10de:24b6 and 10de:25b8) and the PCIe port can come
back to D0 w/o problem. If we enter the s2idle, disconnect the dock,
then exit the s2idle, both external display and the panel will remain
with no output for ~30 seconds. The dmesg as follows shows the "nvidia 0000:01:00.0:can't change power state from D3cold to D0 (config space
inaccessible)" due to the following ACPI error
[ 154.446781]
[ 154.446783]
[ 154.446783] Initialized Local Variables for Method [IPCS]:
[ 154.446784] Local0: 000000009863e365 <Obj> Integer 00000000000009C5
[ 154.446790]
[ 154.446791] Initialized Arguments for Method [IPCS]: (7 arguments
defined for method invocation)
[ 154.446792] Arg0: 0000000025568fbd <Obj> Integer 00000000000000AC
[ 154.446795] Arg1: 000000009ef30e76 <Obj> Integer 0000000000000000
[ 154.446798] Arg2: 00000000fdf820f0 <Obj> Integer 0000000000000010
[ 154.446801] Arg3: 000000009fc2a088 <Obj> Integer 0000000000000001
[ 154.446804] Arg4: 000000003a3418f7 <Obj> Integer 0000000000000001
[ 154.446807] Arg5: 0000000020c4b87c <Obj> Integer 0000000000000000
[ 154.446810] Arg6: 000000008b965a8a <Obj> Integer 0000000000000000
[ 154.446813]
[ 154.446815] ACPI Error: Aborting method \IPCS due to previous error
(AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446824] ACPI Error: Aborting method \MCUI due to previous error
(AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446829] ACPI Error: Aborting method \SPCX due to previous error
(AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446835] ACPI Error: Aborting method \_SB.PC00.PGSC due to
previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446841] ACPI Error: Aborting method \_SB.PC00.PGON due to
previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446846] ACPI Error: Aborting method \_SB.PC00.PEG1.NPON due to
previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446852] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due
to previous error (AE_AML_LOOP_TIMEOUT) (20200925/psparse-529)
[ 154.446860] acpi device:02: Failed to change power state to D0
[ 154.690760] video LNXVIDEO:00: Cannot transition to power state D0
for parent in (unknown)

The IPCS is the last function called from \_SB.PC00.PEG1.PG01._ON
which we expect it to prepare everything before bringing back the
NVIDIA GPU but it's stuck in the infinite loop as described below.
Please refer to the attached full dsdt.sal
           While (One)
            {
                If ((!IBSY || (IERR == One)))
                {
                    Break
                }

                If ((Local0 > TMOV))
                {
                    RPKG [Zero] = 0x03
                    Return (RPKG) /* \IPCS.RPKG */
                }

                Sleep (One)
                Local0++
            }

And the upstream PCIe port of NVIDIA seems to become inaccessible due
to the messages as follows.
[ 292.746508] pcieport 0000:00:01.0: waiting 100 ms for downstream
link, after activation
[ 292.882296] pci 0000:01:00.0: waiting additional 100 ms to become accessible
[ 316.876997] pci 0000:01:00.0: can't change power state from D3cold
to D0 (config space inaccessible)

Since the IPCS is the Intel Reference Code and we don't really know
why the never-end loop happens just because we unplug the dock while
the system still stays in s2idle. Can anyone from Intel suggest what
happens here?

And one thing also worth mentioning, if we unplug the display cable
from the dock before entering the s2idle, NVIDIA GPU can come back w/o
problem even if we disconnect the dock before exiting s2idle. Here's
the lspci information in the attached lspci.log and the dmesg.log with ACPI trace_state enabled and dynamic debug on for drivers/pci/pci.c, drivers/acpi/device_pm.c for the whole s2idle enter/exit with IPCS timeout.

Any suggestion would be appreciated. Thanks.

Chris
Comment 1 Chris Chiu 2021-05-05 07:22:11 UTC
Created attachment 296649 [details]
lspci -nnvt output
Comment 2 Chris Chiu 2021-05-05 07:23:49 UTC
Created attachment 296651 [details]
dmesg with dynamic debug and ACPI trace_state on
Comment 3 Chris Chiu 2021-05-06 06:45:05 UTC
I found a timing issue which might be worth investigation. The TBT WD dock I have here with the latest BIOS 0.5.57 can be manually waked up w/o AE_AML_LOOP_TIMEOUT problem. So I dump the acpi log and do cross-comparison between the WD19SC and WD19TB dock.

A key resource and function in ssdt6.dsl are PowerResource (D3C) and \_SB.PC00.TCON ().
The \_SB.PC00.D3C._ON and \_SB.PC00.TCON begins after the \_SB.PC00.PGON. Then during the period of TCON, the IPCS will run in parallel with TCON but the timing between TB and SC are different.

Please refer to the TCON code in https://gist.github.com/mschiu77/05b9b8ff3ba795f11dd467660952f6e5#file-ssdt7-dsl-L884

In WD19SC case, the IPCS() will begin running before L911 (line 911 in the above link) per the ACPI log as follows.

[ 262.733726] extrace-0138 ex_trace_point : Method Begin [0x0000000024236d48:\MCUI] execution.
[ 262.733882] extrace-0138 ex_trace_point : Method Begin [0x000000003ff97c7a:\IPCS] execution.
[ 262.744064] extrace-0138 ex_trace_point : Method Begin [0x000000004edb7213:\_SB.PC00.DSGS] execution.
[ 262.744233] extrace-0138 ex_trace_point : Method End [0x000000004edb7213:\_SB.PC00.DSGS] execution.
[ 262.744245] extrace-0138 ex_trace_point : Method Begin [0x0000000073d60801:\_SB.PC00.DSCR] execution.
[ 262.744424] extrace-0138 ex_trace_point : Method Begin [0x000000004edb7213:\_SB.PC00.DSGS] execution.

In WD19TB case, the IPCS will begin to run in parallel with L913 per the following ACPI log.

[ 147.643528] extrace-0138 ex_trace_point : Method Begin [0x000000008fde276b:\MCUI] execution.
[ 147.643927] extrace-0138 ex_trace_point : Method Begin [0x0000000006dad187:\_SB.PC00.DSGS] execution.
[ 147.644164] extrace-0138 ex_trace_point : Method End [0x0000000006dad187:\_SB.PC00.DSGS] execution.
[ 147.644175] extrace-0138 ex_trace_point : Method Begin [0x000000003af322ca:\_SB.PC00.DSCR] execution.
[ 147.644338] extrace-0138 ex_trace_point : Method Begin
[0x00000000dd5b6bc8:\IPCS] execution.
[ 147.644788] extrace-0138 ex_trace_point : Method Begin [0x0000000006dad187:\_SB.PC00.DSGS] execution.
[ 147.644927] extrace-0138 ex_trace_point : Method End [0x0000000006dad187:\_SB.PC00.DSGS] execution.
[ 147.644949] extrace-0138 ex_trace_point : Method End [0x000000003af322ca:\_SB.PC00.DSCR] execution.

From the timestamp of WD19SC case, the IPCS starts 11ms ahead of L911, which may indicate the TCON is still in the while loop of L893. According to the debug message, the "TACK after D3 cold exit" is not printed which means TACK is not asserted yet. Is it possible that the IPCS also needs to wait for the TACK or the "ADBG ("D3Cold Exit")"?
Comment 4 Chris Chiu 2021-05-13 12:37:51 UTC
Created attachment 296739 [details]
PMC info when IPSC timeout
Comment 5 Chris Chiu 2021-05-13 12:38:48 UTC
Created attachment 296741 [details]
PMC info of normal s2idle exit
Comment 6 Chris Chiu 2021-05-13 12:42:02 UTC
Don't know whether the PCM information in /sys/kernel/debug/pmc_core is helpful or not.
The file includes `ltr_show`, `pch_ip_power_gating_status`,  I`substate_live_status_registers` and `substate_status_registers`. I also attached the same information when exit s2idle w/o problem. The major difference is mainly XHCI/USB related (because the dock is removed). And maybe the ISH?

Attach the information for your reference.
Comment 7 Hsiao-Ting Wang, Tiffany 2021-05-14 06:36:42 UTC
Provide ACPI verbose log:

Pass log with TBT dock:
https://files.support.canonical.com/f/550898f54203b96d

Fail log with DC dock:
https://files.support.canonical.com/f/5aedaec721232a34

All based on the latest BIOS.


Capture method under Ubuntu Linux:

sudo su
cd /sys/kernel/debug/dynamic_debug
echo "file drivers/pci/pci.c +p" > control
echo "file drivers/acpi/device_pm.c +p" > control

cd /sys/module/acpi/parameters/
echo 0xffff00f0  >  debug_layer
echo 0x600 > debug_level

Start to suspend
Exit suspend

Use dmesg to get the output file
Comment 8 Hsiao-Ting Wang, Tiffany 2021-05-14 11:10:50 UTC
Created attachment 296745 [details]
fail log (DC docking) with kernel parameter "drm.debug=0xe"

fail log (DC docking) with kernel parameter "drm.debug=0xe"
Comment 9 Hsiao-Ting Wang, Tiffany 2021-05-19 04:37:59 UTC
Created attachment 296847 [details]
TGL-U log with kernel parameter "drm.debug=0xe"

TGL-U log with kernel parameter "drm.debug=0xe"

Issue cannot reproduce.

"PM: suspend entry (s2idle)" which indicates the suspend start
and
"ACPI: EC: interrupt unblocked" which indicate the system starts the resume.
Comment 10 Hsiao-Ting Wang, Tiffany 2021-05-19 04:47:32 UTC
Created attachment 296849 [details]
ACPI debug log on TGL-U

ACPI debug log on TGL-U with dual C docking.

Issue cannot reproduce.

"PM: suspend entry (s2idle)" which indicates the suspend start
and
"ACPI: EC: interrupt unblocked" which indicate the system starts the resume.

Note:
TGL-U system seems a little bit slow to resume but it still can resume back okay.
Comment 11 Hsiao-Ting Wang, Tiffany 2021-05-19 09:02:57 UTC
Verbose ACPI debug log with drm.debug=0xe

https://files.support.canonical.com/f/208f43fd63f86041

echo 0xffff00f0 > debug_layer
echo 0x600 > debug_level
Comment 12 Hsiao-Ting Wang, Tiffany 2021-05-19 09:04:24 UTC
Created attachment 296855 [details]
ACPI debug log with enabling trace_state

ACPI debug log with enabling trace_state
Comment 13 Chris Chiu 2021-05-20 11:31:45 UTC
I got this problem fixed with the following hacky code and I think it should be a finding that could help.

```
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1399,6 +1399,15 @@ static int i915_pm_poweroff_late(struct device *kdev)
        return i915_drm_suspend_late(&i915->drm, true);
 }
 
+static int i915_pm_resume_noirq(struct device *kdev)
+{
+       struct drm_i915_private *i915 = kdev_to_i915(kdev);
+
+       if (!IS_VALLEYVIEW(i915) && !IS_CHERRYVIEW(i915))
+               intel_hpd_init(i915);
+       return 0;
+}
+
 static int i915_pm_resume_early(struct device *kdev)
 {
        struct drm_i915_private *i915 = kdev_to_i915(kdev);
@@ -1626,6 +1635,7 @@ const struct dev_pm_ops i915_pm_ops = {
        .prepare = i915_pm_prepare,
        .suspend = i915_pm_suspend,
        .suspend_late = i915_pm_suspend_late,
+       .resume_noirq = i915_pm_resume_noirq,
        .resume_early = i915_pm_resume_early,
        .resume = i915_pm_resume,

```

Because this APCI method IPCS lockup only happens when we remove the dock with external display connected(or the USB-C dongle with external display connected) when suspended and then exit s2idle. It should be related to either graphic drivers or hpd events handling. I did lots of experiments and found if the \_SB.PC00.PEG1.PG01._ON exeutes after i915_pm_resume_early, the IPCS will not be locked by the IBSY and IERR.

Since the \_SB.PC00.PEG1.PG01._ON is invoked by pci_power_up() in pci_pm_resume_noirq and it's in the context of dpm_resume_noirq(). The i915_pm_resume_early is in dpm_resume_early() context and that's why the Sleep(One) in the IPCS lasts for >30 seconds but i915 still has no chance to early resume to free the IPCS lock.

I assume that it's due to the unplug action is not handled properly during resume, so I add the .resume_noirq() in i915 driver to let the hpd have chance to do something during in the same dpm_resume_noirq context. It seems to work fine and pci 0000:01:00.0 (PEG1) can transition to D0 w/o problem. However, it's a dirty hack because I'll hit the kernel WARN in hpd_init(). Please refer to the attached hpd_init_dmesg.log.

Can anyone from Intel help on this? I'll be glad to collect logs or test any PATCH. Thanks
Comment 14 Chris Chiu 2021-05-20 11:33:00 UTC
Created attachment 296897 [details]
Log of doing hpd_init during .resume_noirq
Comment 15 Zhang Rui 2021-05-28 06:11:21 UTC
Thanks for the information and your debug patch, that is very helpful to illustrate the problem.
Given this is graphics related, can you please file a bug to the graphics experts following this page?
https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs
Comment 16 Chris Chiu 2021-06-01 07:41:43 UTC
Thanks. I've filed a bug on https://gitlab.freedesktop.org/drm/intel/-/issues/3500 and still get no information about the progress.

Note You need to log in before you can comment on or make changes to this bug.