Bug 213717

Summary: S0ix: can't reach s0ix - HP Spectre 13-aw2002
Product: Power Management
Reporter: Adam Štrauch (cx)
Component: Hibernation/Suspend
Assignee: David Box (david.e.box)
Status: ASSIGNED
Severity: normal
CC: bugzilla, david.e.box, felash, fkrueger, jonah, leho, me, michael.a.bottini, rui.zhang, wendy.wang
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.13.0
Regression: No
Attachments:
lspci -tvv
dmesg after suspend
turbostat
BIOS setup VMD related options
lspci -n
PCI-ASPM-Enable-ASPM-for-links-under-VMD-domain
PCI-ASPM-Enable-LTR-for-endpoints-behind-VMD
dmesg2
tc2.out
Add root port A0B2
lspci-10000:e0:1d.2.txt
lspci-10000:e2:00.0.txt
s2idle test script
PCI-ASPM-Create-ASPM-override
PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices
lspci -vvv
Patch that adds pci_warn debug messages
Corrected patch-0003
dmesg 5.14.11 after suspend
0001-PCI-ASPM-Create-ASPM-override.patch
0002-PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices.patch
0003-Add-pci-warnings-everywhere.patch
dmesg 5.15-rc5 after suspend

Description Adam Štrauch 2021-07-13 16:47:16 UTC
Created attachment 297823 [details]
lspci -tvv

Hi,


I bought a new notebook and noticed that there is something wrong with suspend. When I put the notebook to sleep it stays slightly warm and the battery drains at 2-3 % per hour. On my previous notebook and my wife's notebook the drain is around 0.5 % per hour in s2idle.

After suspend, /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us stays at zero while /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us is around 95 %, so it looks like part of the system is simply not sleeping.

The notebook is: HP Spectre 13-aw2002nc

CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Based on my testing, the problem occurs on 5.13.0 and 5.14-rc1. I haven't tried older kernels because sound is broken on those.


I found this issue: https://bugzilla.kernel.org/show_bug.cgi?id=207893#c2

It is pretty similar to mine. It mentions that unconnected ports could prevent the system from sleeping, so I tried:

echo 1 | tee /sys/bus/pci/devices/0000:00:07.0/remove
echo 1 | tee /sys/bus/pci/devices/0000:00:07.1/remove


But it didn't help.

I'm attaching the output of turbostat, lspci, and dmesg with pm_debug_messages enabled, and I would like to ask for a little help with this issue.
Comment 1 Adam Štrauch 2021-07-13 16:47:53 UTC
Created attachment 297825 [details]
dmesg after suspend
Comment 2 Adam Štrauch 2021-07-13 16:48:16 UTC
Created attachment 297827 [details]
turbostat
Comment 3 wendy.wang 2021-07-15 15:07:04 UTC
Please disable the VMD controller in your BIOS, then double-check.
Comment 4 Adam Štrauch 2021-07-15 15:18:23 UTC
I disabled VMD at the very beginning, even before I installed Fedora.
Comment 5 wendy.wang 2021-07-15 15:28:09 UTC
From lspci log I see:
+-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller

If VMD is enabled, there is a known VMD issue that will block s0ix.
Fix patches are WIP:
https://bugzilla.kernel.org/show_bug.cgi?id=213047
Comment 6 Adam Štrauch 2021-07-15 15:37:52 UTC
You are right. I went into the BIOS and it looks different now; I can no longer disable VMD there, probably since I upgraded it to the latest version. I will give the patch you shared a shot. Thank you very much.
Comment 7 Adam Štrauch 2021-07-15 23:22:05 UTC
I applied the patch on 5.13.2 but unfortunately it doesn't work for me :-( The battery drain is still the same and low_power_idle_system_residency_us remains zero after resume.
Comment 8 David Box 2021-07-19 20:59:10 UTC
VMD is used in RAID mode. To disable it you would want to switch to AHCI mode. Is that option not available?
Comment 9 Adam Štrauch 2021-07-19 21:11:39 UTC
Yep, that's the problem. HP started blocking the advanced BIOS setup features a few years ago, and they are no longer available even as a hidden menu accessible via a shortcut. I remember that at the beginning I chose not to use RAID, but I guess that didn't turn VMD off, just stopped combining the two drives together.
Comment 10 Adam Štrauch 2021-07-19 21:13:27 UTC
Created attachment 297939 [details]
BIOS setup VMD related options

Only these three screens are available for VMD in the BIOS. If I enter a specific drive, it just gives me info about what it is. There is no way to turn VMD off.
Comment 11 David Box 2021-07-20 18:23:09 UTC
Okay, it seems that is not configurable. Let's see which root port this is; your lspci output didn't actually capture that. Can you attach just the output from 'lspci -n'?
Comment 12 Adam Štrauch 2021-07-20 18:31:05 UTC
Created attachment 297947 [details]
lspci -n

Sure, here it is.
Comment 13 David Box 2021-07-20 19:13:36 UTC
Created attachment 297951 [details]
PCI-ASPM-Enable-ASPM-for-links-under-VMD-domain
Comment 14 David Box 2021-07-20 19:14:13 UTC
Created attachment 297953 [details]
PCI-ASPM-Enable-LTR-for-endpoints-behind-VMD
Comment 15 David Box 2021-07-20 19:29:56 UTC
Check the following commands. You should see that ASPM is disabled on both the root port and the NVMe device, and that the LTR on the NVMe device is set to 0.

sudo lspci -vv -s 10000:e0:1d.0 
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		...
		...
		...
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+

sudo lspci -vv -s 10000:e1:00.0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		...
		...
		...
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+

	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns

Try the attached patches from Canonical. When running the new kernel you should see ASPM enabled and a high value for the LTR. Then check to see if you get s0ix.
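
For reference, the LTR half of these patches amounts to programming the endpoint's LTR extended capability when firmware has left it zeroed. Below is a minimal sketch of that step using the standard kernel PCI accessors; the helper name set_ltr and the exact flow are illustrative rather than the patches' actual code, though the 0x1003 encoding matches the 3145728 ns value that shows up in lspci later in this thread.

    #include <linux/pci.h>

    /*
     * Sketch: give an endpoint whose LTR registers were left at zero a
     * non-zero latency so that its link can enter the deeper L1 substates.
     */
    static void set_ltr(struct pci_dev *pdev)
    {
            u16 ltr = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR);
            u16 max_snoop, max_nosnoop;

            if (!ltr)
                    return;

            pci_read_config_word(pdev, ltr + PCI_LTR_MAX_SNOOP_LAT, &max_snoop);
            pci_read_config_word(pdev, ltr + PCI_LTR_MAX_NOSNOOP_LAT, &max_nosnoop);
            if (max_snoop || max_nosnoop)
                    return; /* firmware already programmed the LTR */

            /* latency value 3, scale 1048576 ns: 3 * 1048576 = 3145728 ns */
            pci_write_config_word(pdev, ltr + PCI_LTR_MAX_SNOOP_LAT, 0x1003);
            pci_write_config_word(pdev, ltr + PCI_LTR_MAX_NOSNOOP_LAT, 0x1003);
    }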
Comment 16 Adam Štrauch 2021-07-20 21:41:51 UTC
Created attachment 297955 [details]
dmesg2
Comment 17 Adam Štrauch 2021-07-20 21:42:31 UTC
Created attachment 297957 [details]
tc2.out

I applied those two patches; CPU%LPI is now 0.25 and SYS%LPI remains zero. I think it's worse, because CPU%LPI used to be ~95.
Comment 18 Adam Štrauch 2021-07-20 21:47:10 UTC
I forgot to mention that everything was as you described; after applying the patches, ASPM is enabled and the latencies are non-zero. I attached the dmesg and turbostat output and I hope they are helpful.
Comment 19 David Box 2021-07-20 22:20:09 UTC
Is the kernel new? Was *this* kernel getting 95% CPU%LPI before the patches?

Also I missed that you have 2 NVMe drives. Is the 2nd pair (root port and drive) still showing the same initial values?

sudo lspci -vv -s 10000:e0:1d.2
sudo lspci -vv -s 10000:e2:00.0
Comment 20 David Box 2021-07-20 22:21:39 UTC
Created attachment 297959 [details]
Add root port A0B2

Use this patch on top of the other two to add the workaround for the 2nd drive.
Comment 21 Adam Štrauch 2021-07-20 22:35:57 UTC
It's a clean 5.13.2 kernel with only those two patches applied.

Yes, it was getting 95% CPU%LPI before.

I think it's a single NVMe module with an integrated Optane cache, but to the system the SSD and the Optane part behave like two different NVMe devices. Probably handled by VMD somehow.

$ sudo parted /dev/nvme0n1 p
Model: INTEL HBRPEKNX0202AH (nvme)
Disk /dev/nvme0n1: 512GB

$ sudo parted /dev/nvme1n1 p
Model: INTEL HBRPEKNX0202AHO (nvme)                                       
Disk /dev/nvme1n1: 29,3GB

I will post the lspci output soon. The initial values look the same.
Comment 22 Adam Štrauch 2021-07-20 22:36:32 UTC
Created attachment 297961 [details]
lspci-10000:e0:1d.2.txt
Comment 23 Adam Štrauch 2021-07-20 22:36:45 UTC
Created attachment 297963 [details]
lspci-10000:e2:00.0.txt
Comment 24 David Box 2021-07-20 23:10:01 UTC
Created attachment 297965 [details]
s2idle test script

Since you are using a 5.13 kernel, we should be able to see which s0ix requirements aren't being met. But for this to work you need to be reaching PC10 (CPU%LPI). Try the 3rd patch to see if that gets you to 95% again. If it does, run this script (assuming you're still not getting s0ix after the patch). If the 3rd patch still doesn't get you back to 95%, undo all of the patches to get back to the working state and run this script.
Comment 25 Adam Štrauch 2021-07-20 23:30:26 UTC
You are awesome. It looks like the second quirk did the job:

$ cat /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
58959985

CPU%LPI 95.65
SYS%LPI 95.56

I am going to run more tests tomorrow, especially battery usage while the machine is sleeping, and I will let you know how it goes.

Are those patches going into the kernel, or are they just a workaround?
Comment 26 David Box 2021-07-21 00:39:20 UTC
Great. There was an attempt by Canonical to upstream these patches but it didn't happen due to disagreement over the solution.

Rafael, can you take a look at these patches and see what can be done for a proper solution? There is a fix in the works to ensure these values get programmed by firmware, but I'm not sure if it will apply to Tiger Lake, and we have this issue of VMD-enabled systems in the field without this support.
Comment 27 Adam Štrauch 2021-07-21 10:05:08 UTC
It's a shame about those patches, but I'm happy that I at least have a way to keep the notebook asleep now.

I left the notebook sleeping for three hours; the battery remained at 100 % and its body wasn't warm anymore.

There is probably one more issue. Entering s0ix sometimes takes more time than it should, like a minute or even more. The system does seem to go to sleep, because Bluetooth speakers disconnect almost immediately, but the power LED stays on for a longer time. Is there a way I can debug this?
Comment 28 Rafael J. Wysocki 2021-07-22 18:22:31 UTC
(In reply to David Box from comment #26)
> Great. There was an attempt by Canonical to upstream these patches but it
> didn't happen due to disagreement over the solution.
> 
> Rafael, can you take a look at these patches and see what can be done for a
> proper solution?

It looks like the VMD host bridge driver needs to walk the PCI bus segment exposed by it and enable ASPM/LTR for all of the devices in there having these features in their config space.
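
In code terms, that suggestion is roughly the sketch below, built around the generic pci_walk_bus() helper. The callback and wrapper names here are hypothetical, and actually flipping the link states would need a helper exported by the ASPM core (which is the sticking point in the patch discussion), so treat this as the shape of the approach rather than a working fix:

    /*
     * Sketch: from the VMD driver, walk the PCI segment the host bridge
     * exposes and set up ASPM/LTR on every device that advertises the
     * capabilities.
     */
    static int vmd_setup_aspm_ltr(struct pci_dev *pdev, void *userdata)
    {
            if (pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_LTR))
                    set_ltr(pdev); /* see the LTR sketch in comment 15 */

            /* enabling the ASPM link states would need an ASPM-core helper here */
            return 0;
    }

    /* hypothetical call site, after the VMD child bus has been scanned */
    static void vmd_enable_pm_features(struct vmd_dev *vmd)
    {
            pci_walk_bus(vmd->bus, vmd_setup_aspm_ltr, NULL);
    }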
Comment 29 Adam Štrauch 2021-08-18 13:45:26 UTC
There is one big problem with the three patches enabling the ASPM/LTR features. My computer started crashing about once a day when they were applied to the vanilla 5.13.x kernel. It always happens after the computer resumes, but not right away. The system just freezes, so I am not able to get any logs. I know it's hard to say what the problem could be, but is there a way to obtain logs from the system? I can ping the machine, but I am unable to get into it via SSH.
Comment 31 Michael Bottini 2021-10-01 23:04:51 UTC
Created attachment 299049 [details]
PCI-ASPM-Create-ASPM-override

Hi Adam, can you try applying the following two patches and let me know if they solve the issue?
Comment 32 Michael Bottini 2021-10-01 23:05:26 UTC
Created attachment 299051 [details]
PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices
Comment 33 Adam Štrauch 2021-10-03 23:30:43 UTC
Hi Michael,

I applied the patches on 5.14.9.

/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us is still zero


`$ turbostat rtcwake -m freeze -s 30` returns:

CPU%LPI   91.50
SYS%LPI   0.00

`$ lspci -vv -s 10000:e0:1d.0 | grep ASPM` returns:

...
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
...


`$ lspci -vv -s 10000:e1:00.0` returns


...
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
...

and

...
Capabilities: [178 v1] Latency Tolerance Reporting
Max snoop latency: 3145728ns
Max no snoop latency: 3145728ns
...

If there is anything I can do to debug it please let me know.
Comment 34 Michael Bottini 2021-10-04 19:06:46 UTC
Hi Adam,

Can you attach the results of doing `lspci -vvv`? We need more information on these devices.
Comment 35 Adam Štrauch 2021-10-04 19:54:51 UTC
Created attachment 299089 [details]
lspci -vvv

Hi Michael,

yes, sure, here it is.
Comment 36 Michael Bottini 2021-10-11 19:55:12 UTC
Created attachment 299183 [details]
Patch that adds pci_warn debug messages

Hi Adam,

Two questions:

1. Can you print the results of doing

    sudo cat /sys/module/pcie_aspm/parameters/policy

I want to make sure that it's not in "performance."

2. Can you apply the attached patch in addition to the previous two and print the results of doing

    sudo dmesg | grep VMD_DEBUG

I want to see if something weird is going on with the classes and links.
Comment 37 Michael Bottini 2021-10-11 19:58:01 UTC
Created attachment 299185 [details]
Corrected patch-0003

I made a typo. Please apply the attached patch instead.
Comment 38 Adam Štrauch 2021-10-13 14:17:13 UTC
Hi Michael,

sorry for the delay. It took me a while to find time to deal with the patch and build the kernel. It didn't apply cleanly to 5.14.11, so I added the lines manually.

here it is:

1.

❯ sudo cat /sys/module/pcie_aspm/parameters/policy
[default] performance powersave powersupersave


I don't think there is anything that would change this in my system.

2. Here are the debug messages:

[    2.485951] pci 10000:e0:1d.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 394240
[    2.485952] pci 10000:e1:00.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 67586
[    2.485957] pci 10000:e1:00.0: VMD_DEBUG: We've found a quirk, finding parent...
[    2.485957] pci 10000:e1:00.0: VMD_DEBUG: parent = 000000002435851c, pos = 376
[    2.485959] pci 10000:e1:00.0: VMD_DEBUG: Parent and capability exist, setting LTR and policy...
[    2.485965] pci 10000:e0:1d.2: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 394240
[    2.485966] pci 10000:e2:00.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 67586
[    2.485973] pci 10000:e2:00.0: VMD_DEBUG: We've found a quirk, finding parent...
[    2.485973] pci 10000:e2:00.0: VMD_DEBUG: parent = 000000003cfa6410, pos = 720
[    2.485974] pci 10000:e2:00.0: VMD_DEBUG: Parent and capability exist, setting LTR and policy...

I will attach the full dmesg in case it helps.
Comment 39 Adam Štrauch 2021-10-13 14:18:03 UTC
Created attachment 299195 [details]
dmesg 5.14.11 after suspend
Comment 40 Michael Bottini 2021-10-14 19:53:39 UTC
Created attachment 299203 [details]
0001-PCI-ASPM-Create-ASPM-override.patch

Hi Adam, please apply these three patches to Linux 5.15-rc6. If it doesn't work, please do

    sudo dmesg | grep VMD_DEBUG

as before.
Comment 41 Michael Bottini 2021-10-14 19:54:18 UTC
Created attachment 299205 [details]
0002-PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices.patch
Comment 42 Michael Bottini 2021-10-14 19:54:52 UTC
Created attachment 299207 [details]
0003-Add-pci-warnings-everywhere.patch
Comment 43 Adam Štrauch 2021-10-15 11:04:44 UTC
Created attachment 299215 [details]
dmesg 5.15-rc5 after suspend

Hi Michael,

I applied the patches on 5.15-rc5 because rc6 is not out yet. Do you want me to use the master branch instead?

/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us is still zero, but ASPM looks to be enabled on both PCIe devices.

# sudo lspci -vv -s 10000:e1:00.0 | grep ASPM
LnkCap:	Port #0, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 <8us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1+

# sudo lspci -vv -s 10000:e0:1d.0 | grep ASPM
LnkCap:	Port #9, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 <16us
ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1+

This is the dmesg output:

[    2.405165] pci 10000:e0:1d.0: VMD_DEBUG: features = 46
[    2.405166] pci 10000:e1:00.0: VMD_DEBUG: features = 46
[    2.405171] pci 10000:e1:00.0: VMD_DEBUG: pos = 376
[    2.405172] pci 10000:e1:00.0: VMD_DEBUG: Setting LTRs and policy...
[    2.405178] pci 10000:e1:00.0: VMD_DEBUG: Inside aspm_override_policy, link = 00000000beebc0fe
[    2.405179] pci 10000:e1:00.0: VMD_DEBUG: Found link, setting policy...
[    2.405209] pci 10000:e0:1d.2: VMD_DEBUG: features = 46
[    2.405210] pci 10000:e2:00.0: VMD_DEBUG: features = 46
[    2.405217] pci 10000:e2:00.0: VMD_DEBUG: pos = 720
[    2.405218] pci 10000:e2:00.0: VMD_DEBUG: Setting LTRs and policy...
[    2.405965] pci 10000:e2:00.0: VMD_DEBUG: Inside aspm_override_policy, link = 00000000ce58e343
[    2.405967] pci 10000:e2:00.0: VMD_DEBUG: Found link, setting policy...

Attaching full dmesg too.

Btw: the third patch didn't apply cleanly, so I had to pick the changed lines manually.
Comment 44 Adhitya Mohan 2021-11-20 07:56:22 UTC
Hello, I don't mean to bump in, but the patches that Michael sent fixed my suspend issue on the Dell XPS 13 (9305). I opened another bug report (https://bugzilla.kernel.org/show_bug.cgi?id=215063) to keep from cluttering this one.
Comment 45 David Box 2021-11-20 19:00:49 UTC
(In reply to Adhitya Mohan from comment #44)

Thanks Adhitya. We were waiting on Adam to test the last revision, which works on the systems we've tested and is now confirmed to work for you. May we add a tested-by tag for you to this patch set?
Comment 46 Adam Štrauch 2021-11-20 19:16:48 UTC
Hi David,

I didn't know you had been waiting for me. Which patches do you have in mind? I tested the latest ones from Michael without success: low_power_idle_system_residency_us was still zero. It's possible there is another problem with this specific notebook.
Comment 47 Adhitya Mohan 2021-11-20 19:21:15 UTC
(In reply to David Box from comment #45)
> (In reply to Adhitya Mohan from comment #44)
> 
> Thanks Adhitya. We were waiting on Adam to test the last revision, which
> works on the systems we've tested and is now confirmed to work for you.
> May we add a tested-by tag for you to this patch set?

Yes, of course. You can see the dmesg and the result of the self-test tool that I posted in my earlier bug report. Happy to help.
Comment 48 David Box 2021-11-20 23:32:07 UTC
(In reply to Adam Štrauch from comment #46)
> Hi David,
> 
> I didn't know you had been waiting for me. Which patches do you have in
> mind? I tested the latest ones from Michael without success:
> low_power_idle_system_residency_us was still zero. It's possible there is
> another problem with this specific notebook.

Oh. Well, there certainly could be a different issue. I'd run the same script Adhitya ran and see what it says: https://github.com/qwang59/S0ixSelftestTool

If you are not able to get S0ix, but lspci shows that ASPM is enabled and LTR is set on both drives, then it's unlikely to be due to VMD.
Comment 49 Adhitya Mohan 2021-11-25 12:24:46 UTC
Hi David,
Any idea which release cycle these patches might make it into?
Comment 50 Julien Wajsberg 2023-12-28 10:40:10 UTC
My understanding is that David's work has been merged in v6.3.