Bug 218394

Summary: Enabling VMD on ASUS B1400CEAE blocks entry into deeper CPU Package C-states during s2idle
Product: Drivers
Reporter: Jian-Hong Pan (jhp)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: david.e.box, drake, leho, mika.westerberg, nirmal.patel, rafael, stereomato
Priority: P3
Hardware: Intel
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: The kernel log with enabled VMD and s2idle
PCI devices information
Power on the Bus' PCI devices before enable link states
PCI information for comment #3
Power on the Bus' PCI devices before enable link states and don't reset the PCI Bus
PCI information for comment #5
PCI devices information with disabled VMD
The kernel log with disabled VMD and s2idle
Power on the Bus' PCI devices before enable link states, don't reset the PCI Bus, NVMe simple suspend quirk and power off the dummy PCI devices
warn_on_s0ix_failures output

Description Jian-Hong Pan 2024-01-19 04:04:59 UTC
Created attachment 305728 [details]
The kernel log with enabled VMD and s2idle

We have an ASUS B1400CEAE laptop equipped with Intel i3-1115G4 CPU and Sandisk Corp WD Blue SN550 NVMe SSD.

Testing with kernel 6.7, we found that the system drains the battery quickly during s2idle if Intel® RST VMD is enabled.  The battery only survives a few hours; after that, the system cannot wake up because the battery is completely drained.

However, if VMD is disabled, the battery has 90% of its energy left after 24 hours of s2idle.

The BIOS has already been updated to the latest version (314) from the ASUS website.

I also checked with S0ixSelftestTool.
* If VMD is disabled, it shows "Congratulations! Your system achieved the deepest S0ix substate!"
* If VMD is enabled, here is the test output:


---Check S2idle path S0ix Residency---:

The system OS Kernel version is:
Linux endless 6.7.0+ #97 SMP PREEMPT_DYNAMIC Thu Jan 18 16:52:17 CST 2024 x86_64 GNU/Linux

---Check whether your system supports S0ix or not---:

Low Power S0 Idle is:1
Your system supports low power S0 idle capability.



---Check whether intel_pmc_core sysfs files exit---:

The pmc_core debug sysfs files are OK on your system.



---Judge PC10, S0ix residency available status---:

Test system supports S0ix.y substate

S0ix substate before S2idle:
  S0i2.0

S0ix substate residency before S2idle:
  0

Turbostat output: 
15.791949 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.81	0.00	94.93	659.86	91.61	0.00	0.00	0.00	0.00	0.00	0.00	0.00
3.84	0.00	93.93	659.92	91.62	0.00	0.00	0.00	0.00	0.00	0.00	0.00
2.62
3.26	0.00	95.93
1.51

CPU Core C7 residency after S2idle is: 94.93
GFX RC6 residency after S2idle is: 659.86
CPU Package C-state 2 residency after S2idle is: 91.61
CPU Package C-state 3 residency after S2idle is: 0.00
CPU Package C-state 8 residency after S2idle is: 0.00
CPU Package C-state 9 residency after S2idle is: 0.00
CPU Package C-state 10 residency after S2idle is: 0.00
S0ix residency after S2idle is: 0.00

Your system achieved PC2 residency: 91.61, but no PC8 residency during S2idle: 0.00

---Debug no PC8 residency scenario---:
modprobe cpufreq_stats failedLoaded 0 prior measurements
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
Devfreq not enabled
glob returned GLOB_ABORTED
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
Leaving PowerTOP

Turbostat output: 

16.186215 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.52	0.00	95.42	783.82	91.97	0.35	0.00	0.00	0.00	0.00	0.00	0.00
0.88	0.00	95.37	783.93	91.98	0.35	0.00	0.00	0.00	0.00	0.00	0.00
4.05
1.22	0.00	95.47
3.92

Your CPU Core C7 residency is available: 95.42

Your system Intel graphics RC6 residency is available:783.82

Checking PCIe Device D state and Bridge Link state:

Available bridge device: 0000:00:07.0

0000:00:07.0 Link is in Detect

The link power management state of PCIe bridge: 0000:00:07.0 is OK.

Your system default pcie_aspm policy setting is OK.

Here is D3 log:
 [  151.977362] nvme 10000:e1:00.0: PCI PM: Suspend power state: D0
[  151.977365] nvme 10000:e1:00.0: PCI PM: Skipped
[  151.977426] pcieport 10000:e0:06.0: PCI PM: Suspend power state: D0
[  151.977429] pcieport 10000:e0:06.0: PCI PM: Skipped
[  151.989417] ahci 10000:e0:17.0: PCI PM: Suspend power state: D3hot
[  151.989577] sof-audio-pci-intel-tgl 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[  151.989583] i915 0000:00:02.0: PCI PM: Suspend power state: D3hot
[  152.000784] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[  152.000785] i801_smbus 0000:00:1f.4: PCI PM: Skipped
[  152.013242] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[  152.014780] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[  152.014784] vmd 0000:00:0e.0: PCI PM: Suspend power state: D3hot
[  152.014790] iwlwifi 0000:00:14.3: PCI PM: Suspend power state: D3hot
[  152.014795] proc_thermal 0000:00:04.0: PCI PM: Suspend power state: D3hot
[  152.014959] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[  152.015075] xhci_hcd 0000:00:14.0: PCI PM: Suspend power state: D3hot
[  152.018464] xhci_hcd 0000:00:0d.0: PCI PM: Suspend power state: D3cold
[  152.023926] pcieport 0000:00:07.0: PCI PM: Suspend power state: D3cold
[  152.055546] thunderbolt 0000:00:0d.2: PCI PM: Suspend power state: D3cold

Checking PCI Devices D3 States:
[  151.977362] nvme 10000:e1:00.0: PCI PM: Suspend power state: D0
[  151.977365] nvme 10000:e1:00.0: PCI PM: Skipped
[  151.977426] pcieport 10000:e0:06.0: PCI PM: Suspend power state: D0
[  151.977429] pcieport 10000:e0:06.0: PCI PM: Skipped
[  151.989417] ahci 10000:e0:17.0: PCI PM: Suspend power state: D3hot
[  151.989577] sof-audio-pci-intel-tgl 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[  151.989583] i915 0000:00:02.0: PCI PM: Suspend power state: D3hot
[  152.000784] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[  152.000785] i801_smbus 0000:00:1f.4: PCI PM: Skipped
[  152.013242] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[  152.014780] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[  152.014784] vmd 0000:00:0e.0: PCI PM: Suspend power state: D3hot
[  152.014790] iwlwifi 0000:00:14.3: PCI PM: Suspend power state: D3hot
[  152.014795] proc_thermal 0000:00:04.0: PCI PM: Suspend power state: D3hot
[  152.014959] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[  152.015075] xhci_hcd 0000:00:14.0: PCI PM: Suspend power state: D3hot
[  152.018464] xhci_hcd 0000:00:0d.0: PCI PM: Suspend power state: D3cold
[  152.023926] pcieport 0000:00:07.0: PCI PM: Suspend power state: D3cold
[  152.055546] thunderbolt 0000:00:0d.2: PCI PM: Suspend power state: D3cold


Checking PCI Devices tree diagram:
-+-[0000:00]-+-00.0  Intel Corporation Device 9a04
 |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
 |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant
 |           +-06.0  Intel Corporation RST VMD Managed Controller
 |           +-07.0-[01-2b]--
 |           +-08.0  Intel Corporation GNA Scoring Accelerator module
 |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
 |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller
 |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
 |           +-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller
 |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller
 |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
 |           +-14.3  Intel Corporation Wi-Fi 6 AX201
 |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
 |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
 |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
 |           +-17.0  Intel Corporation RST VMD Managed Controller
 |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
 |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
 |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
 |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
 |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
 \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
              \-17.0  Intel Corporation Tiger Lake-LP SATA Controller

pcieport 10000:e0:06.0: PCI PM: Suspend power state: D0
pcieport 10000:e0:06.0: PCI PM: Skipped
pcieport 0000:00:07.0: PCI PM: Suspend power state: D3cold
The pcieport 10000:e0:06.0 ASPM enable status:
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+

Pcieport is not in D3cold:          
10000:e0:06.0
Comment 1 Jian-Hong Pan 2024-01-19 04:09:39 UTC
Created attachment 305729 [details]
PCI devices information

I noticed that PCI-PM_L1.2 in L1SubCtl1 is disabled on both the "10000:e0:06.0 PCI bridge" and the "10000:e1:00.0 Non-Volatile memory controller":

10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller (rev 01) (prog-if 00 [Normal decode])
...
        Capabilities: [200 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=45us LTR1.2_Threshold=101376ns
                L1SubCtl2: T_PwrOn=50us

10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express])
...
        Capabilities: [900 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                          PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us

In addition, the 10000:e1:00.0 Non-Volatile memory controller's LTR1.2_Threshold is 0ns.
Comment 2 Jian-Hong Pan 2024-01-19 05:54:54 UTC
After tracing the code, I noticed that "10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core Processor PCIe Controller [8086:9a09] (rev 01)", "10000:e0:17.0 SATA controller [0106]: Intel Corporation Tiger Lake-LP SATA Controller [8086:a0d3] (rev 20)" and "10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01)" all go through vmd_pm_enable_quirk() [1], which invokes pci_enable_link_state_locked() to configure the link states, including the L1 substates.

Then I traced pci_enable_link_state_locked() more deeply and noticed the comment  /* Spec says both ports must be in D0 before enabling PCI PM substates */  [2] in pcie_config_aspm_link().  However, the "10000:e0:06.0 PCI bridge" is in the 'unknown' power state, not D0, so the PCI-PM L1.1 & L1.2 substates are left disabled here.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/vmd.c?h=v6.7#n744
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/aspm.c?h=v6.7#n741
Comment 3 Jian-Hong Pan 2024-01-19 06:03:24 UTC
Created attachment 305730 [details]
Power on the Bus' PCI devices before enable link states

I prepared the patch in this attachment to power on the PCI devices before enabling the link states, then tested the newly built kernel with S0ixSelftestTool again.
Comment 4 Jian-Hong Pan 2024-01-19 06:14:06 UTC
Created attachment 305731 [details]
PCI information for comment #3

With the patch from comment #3, PCI-PM L1.2 is enabled on both the 10000:e0:06.0 PCI bridge and the 10000:e1:00.0 Non-Volatile memory controller.

10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal decode])
...
        Capabilities: [200 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=45us LTR1.2_Threshold=101376ns
                L1SubCtl2: T_PwrOn=50us


10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
...
        Capabilities: [900 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                          PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us

The full PCI information is in the attachment.

However, S0ixSelftestTool's test result is the same as in comment #0:

pcieport 10000:e0:06.0: PCI PM: Suspend power state: D0
pcieport 10000:e0:06.0: PCI PM: Skipped
pcieport 0000:00:07.0: PCI PM: Suspend power state: D3cold
The pcieport 10000:e0:06.0 ASPM enable status:
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+

Pcieport is not in D3cold:          
10000:e0:06.0

In addition, the 10000:e1:00.0 Non-Volatile memory controller's LTR1.2_Threshold is still 0ns.
Comment 5 Jian-Hong Pan 2024-01-19 06:42:13 UTC
Created attachment 305732 [details]
Power on the Bus' PCI devices before enable link states and don't reset the PCI Bus

With more tracing, I found that the 10000:e1:00.0 Non-Volatile memory controller's LTR1.2_Threshold is cleared after the VMD PCI bus is reset by pci_reset_bus(dev) [1].

I prepared the test patch in the attachment to power on the bus' PCI devices before enabling the link states and to skip resetting the PCI bus.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/vmd.c?h=v6.7#n944
Comment 6 Jian-Hong Pan 2024-01-19 06:50:23 UTC
Created attachment 305733 [details]
PCI information for comment #5

With the patch in comment #3, the PCI PM L1.2 is enabled on both 10000:e0:06.0 PCI bridge and 10000:e1:00.0 Non-Volatile memory controller.  And, they all have LTR1.2_Threshold=101376ns.

10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal decode])
...
        Capabilities: [200 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=45us LTR1.2_Threshold=101376ns
                L1SubCtl2: T_PwrOn=50us


10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
...
        Capabilities: [900 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                          PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=101376ns
                L1SubCtl2: T_PwrOn=50us


However, S0ixSelftestTool's test result is again the same as in comment #0.
Comment 7 Jian-Hong Pan 2024-01-19 06:51:48 UTC
(In reply to Jian-Hong Pan from comment #6)
> Created attachment 305733 [details]
> PCI information for comment #5
> 
> With the patch in comment #3, the PCI PM L1.2 is enabled on both
> 10000:e0:06.0 PCI bridge and 10000:e1:00.0 Non-Volatile memory controller. 
> And, they all have LTR1.2_Threshold=101376ns.
Sorry, typo.  This should read: with the patch in comment #5.
Comment 8 Jian-Hong Pan 2024-01-19 08:28:07 UTC
I just noticed that S0ixSelftestTool only checks the L1 substates of PCIe devices whose domain is "0000"! [1]

So I replaced PCI_SYSFS with "/sys/bus/pci/devices":

diff --git a/s0ix-selftest-tool.sh b/s0ix-selftest-tool.sh
index a56aca0..88e021a 100755
--- a/s0ix-selftest-tool.sh
+++ b/s0ix-selftest-tool.sh
@@ -800,7 +800,7 @@ debug_pcie_link_aspm() {
 
 #Function to check PCIe bridge Link Power Management state
 debug_pcie_bridge_lpm() {
-  PCI_SYSFS="/sys/devices/pci0000:00"
+  PCI_SYSFS="/sys/bus/pci/devices"
   PCI_BRIDGE_CLASS="0x060400"
   devices=$(ls ${PCI_SYSFS})
   bridge_devices=()

Going to test with the fixed S0ixSelftestTool in the condition as comment #0 again.

[1]: https://github.com/intel/S0ixSelftestTool/blob/d392b2eedd61da92486c7f285d62383b747907f2/s0ix-selftest-tool.sh#L803
Comment 9 Jian-Hong Pan 2024-01-19 08:31:36 UTC
(In reply to Jian-Hong Pan from comment #8) 
> Going to test with the fixed S0ixSelftestTool in the condition as comment #0
> again.

Here is the test result from the fixed S0ixSelftestTool.  It can now check all PCIe device D-states and bridge link states, including the domain "10000" created by VMD:

---Check S2idle path S0ix Residency---:

The system OS Kernel version is:
Linux endless 6.7.0+ #102 SMP PREEMPT_DYNAMIC Fri Jan 19 16:17:23 CST 2024 x86_64 GNU/Linux

---Check whether your system supports S0ix or not---:

Low Power S0 Idle is:1
Your system supports low power S0 idle capability.



---Check whether intel_pmc_core sysfs files exit---:

The pmc_core debug sysfs files are OK on your system.



---Judge PC10, S0ix residency available status---:

Test system supports S0ix.y substate

S0ix substate before S2idle:
  S0i2.0

S0ix substate residency before S2idle:
  0

Turbostat output: 
15.399998 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
3.15	0.00	94.42	400.56	90.98	0.00	0.00	0.00	0.00	0.00	0.00	0.00
2.09	0.00	96.98	400.47	90.97	0.00	0.00	0.00	0.00	0.00	0.00	0.00
1.64
5.67	0.00	91.85
3.21

CPU Core C7 residency after S2idle is: 94.42
GFX RC6 residency after S2idle is: 400.56
CPU Package C-state 2 residency after S2idle is: 90.98
CPU Package C-state 3 residency after S2idle is: 0.00
CPU Package C-state 8 residency after S2idle is: 0.00
CPU Package C-state 9 residency after S2idle is: 0.00
CPU Package C-state 10 residency after S2idle is: 0.00
S0ix residency after S2idle is: 0.00

Your system achieved PC2 residency: 90.98, but no PC8 residency during S2idle: 0.00

---Debug no PC8 residency scenario---:
modprobe cpufreq_stats failedLoaded 0 prior measurements
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
Devfreq not enabled
glob returned GLOB_ABORTED
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
Leaving PowerTOP

Turbostat output: 

16.160568 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.76	0.00	95.15	521.90	91.67	0.38	0.00	0.00	0.00	0.00	0.00	0.00
1.62	0.00	97.66	521.97	91.67	0.38	0.00	0.00	0.00	0.00	0.00	0.00
1.66
2.83	0.00	92.65
4.94

Your CPU Core C7 residency is available: 95.15

Your system Intel graphics RC6 residency is available:521.90

Checking PCIe Device D state and Bridge Link state:

Available bridge device: 0000:00:07.0 10000:e0:06.0

0000:00:07.0 Link is in Detect

The link power management state of PCIe bridge: 0000:00:07.0 is OK.

10000:e0:06.0 Link is in L1.2

The link power management state of PCIe bridge: 10000:e0:06.0 is OK.

Your system default pcie_aspm policy setting is OK.


Checking PCI Devices D3 States:
[  107.787829] nvme 10000:e1:00.0: PCI PM: Suspend power state: D0
[  107.787833] nvme 10000:e1:00.0: PCI PM: Skipped
[  107.787912] pcieport 10000:e0:06.0: PCI PM: Suspend power state: D0
[  107.787914] pcieport 10000:e0:06.0: PCI PM: Skipped
[  107.799780] ahci 10000:e0:17.0: PCI PM: Suspend power state: D3hot
[  107.801722] sof-audio-pci-intel-tgl 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[  107.801833] i915 0000:00:02.0: PCI PM: Suspend power state: D3hot
[  107.811276] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[  107.811278] i801_smbus 0000:00:1f.4: PCI PM: Skipped
[  107.822783] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[  107.823888] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[  107.824057] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[  107.826033] iwlwifi 0000:00:14.3: PCI PM: Suspend power state: D3hot
[  107.826132] vmd 0000:00:0e.0: PCI PM: Suspend power state: D3hot
[  107.826136] proc_thermal 0000:00:04.0: PCI PM: Suspend power state: D3hot
[  107.826458] xhci_hcd 0000:00:14.0: PCI PM: Suspend power state: D3hot
[  107.830225] xhci_hcd 0000:00:0d.0: PCI PM: Suspend power state: D3cold
[  107.838416] thunderbolt 0000:00:0d.2: PCI PM: Suspend power state: D3cold
[  107.867145] pcieport 0000:00:07.0: PCI PM: Suspend power state: D3cold


Checking PCI Devices tree diagram:
-+-[0000:00]-+-00.0  Intel Corporation Device 9a04
 |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
 |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant
 |           +-06.0  Intel Corporation RST VMD Managed Controller
 |           +-07.0-[01-2b]--
 |           +-08.0  Intel Corporation GNA Scoring Accelerator module
 |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
 |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller
 |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
 |           +-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller
 |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller
 |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
 |           +-14.3  Intel Corporation Wi-Fi 6 AX201
 |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
 |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
 |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
 |           +-17.0  Intel Corporation RST VMD Managed Controller
 |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
 |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
 |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
 |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
 |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
 \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
              \-17.0  Intel Corporation Tiger Lake-LP SATA Controller

The pcieport 10000:e0:06.0 ASPM enable status:
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+

Pcieport is not in D3cold:          
10000:e0:06.0
Comment 10 David Box 2024-01-19 20:43:58 UTC
Hi Jian-Hong,

1. Was this working on an older kernel?
2. Can you boot with VMD disabled and dump the detailed lspci report?
Comment 11 Luis Ortega 2024-01-22 02:16:15 UTC
IIRC, VMD has had this "nothing but PC2 is reachable when enabled" bug for a couple of years now. I first ran into it in 2021 or 2022 via a Bugzilla report.
Comment 12 Jian-Hong Pan 2024-01-22 03:28:19 UTC
(In reply to David Box from comment #10) 
> 1. Was this working on an older kernel?
The s2idle power-consumption issue with VMD enabled can be traced back to kernel 5.11, which we had installed and tested before.
Comment 13 Jian-Hong Pan 2024-01-22 03:37:45 UTC
Created attachment 305742 [details]
PCI devices information with disabled VMD

(In reply to David Box from comment #10)
> 2. Can you boot with VMD disabled and dump the detailed lspci report?
This attachment is the PCI devices information with disabled VMD.
Comment 14 Jian-Hong Pan 2024-01-22 03:43:54 UTC
Created attachment 305743 [details]
The kernel log with disabled VMD and s2idle

This attachment is the kernel log for s2idle with VMD disabled, following comment #13.

Here is S0ixSelftestTool's output with VMD disabled:

---Check S2idle path S0ix Residency---:

The system OS Kernel version is:
Linux endless 6.7.0+ #109 SMP PREEMPT_DYNAMIC Fri Jan 19 18:46:35 CST 2024 x86_64 GNU/Linux

---Check whether your system supports S0ix or not---:

Low Power S0 Idle is:1
Your system supports low power S0 idle capability.



---Check whether intel_pmc_core sysfs files exit---:

The pmc_core debug sysfs files are OK on your system.



---Judge PC10, S0ix residency available status---:

Test system supports S0ix.y substate

S0ix substate before S2idle:
  S0i2.0

S0ix substate residency before S2idle:
  14341100

Turbostat output: 
15.857775 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.80	0.00	94.91	2112.95	2.26	0.82	0.02	0.15	0.30	1.81	86.22	84.06
3.13	0.00	96.01	2113.15	2.26	0.82	0.02	0.15	0.30	1.81	86.23	84.07
1.35
3.77	0.00	93.82
2.95

CPU Core C7 residency after S2idle is: 94.91
GFX RC6 residency after S2idle is: 2112.95
CPU Package C-state 2 residency after S2idle is: 2.26
CPU Package C-state 3 residency after S2idle is: 0.82
CPU Package C-state 8 residency after S2idle is: 0.30
CPU Package C-state 9 residency after S2idle is: 1.81
CPU Package C-state 10 residency after S2idle is: 86.22
S0ix residency after S2idle is: 84.06

S0ix substate after S2idle:
  S0i2.0

S0ix substate residency after S2idle:
  27668959

S0ix substates residency delta value: S0i2.0 13327859

Congratulations! Your system achieved the deepest S0ix substate!           
Here is the S0ix substates status: 
Substate   Residency      
S0i2.0     27668959
Comment 15 Jian-Hong Pan 2024-01-22 04:09:33 UTC
Here is the PCI bus tree with VMD disabled, following comment #13:

-[0000:00]-+-00.0  Intel Corporation Device [8086:9a04]
           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4] [8086:9a78]
           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant [8086:9a03]
           +-06.0-[01]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009]
           +-07.0-[02-2c]--
           +-08.0  Intel Corporation GNA Scoring Accelerator module [8086:9a11]
           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver [8086:9a0d]
           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller [8086:9a13]
           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0 [8086:9a1b]
           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller [8086:a0ed]
           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM [8086:a0ef]
           +-14.3  Intel Corporation Wi-Fi 6 AX201 [8086:a0f0]
           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0 [8086:a0e8]
           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1 [8086:a0e9]
           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface [8086:a0e0]
           +-17.0  Intel Corporation Tiger Lake-LP SATA Controller [8086:a0d3]
           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller [8086:a082]
           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller [8086:a0c8]
           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller [8086:a0a3]
           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller [8086:a0a4]
           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V [8086:15fc]
Comment 16 Jian-Hong Pan 2024-01-23 06:10:38 UTC
Created attachment 305754 [details]
Power on the Bus' PCI devices before enable link states, don't reset the PCI Bus, NVMe simple suspend quirk and power off the dummy PCI devices

(In reply to Jian-Hong Pan from comment #9)
> The pcieport 10000:e0:06.0 ASPM enable status:
>               LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
> 
> Pcieport is not in D3cold:          
> 10000:e0:06.0
To make pcieport 10000:e0:06.0 enter D3 during s2idle, I tried a test patch that does the following:
* Power on the Bus' PCI devices before enable link states
* Don't reset the PCI Bus
* NVMe with simple suspend quirk
* Power off the dummy PCI devices [8086:09ab] [1]

[1] https://www.intel.com/content/www/us/en/support/articles/000088762/technologies/intel-rapid-storage-technology-intel-rst.html

Then, here is S0ixSelftestTool's output with VMD enabled:

---Check S2idle path S0ix Residency---:

The system OS Kernel version is:
Linux endless 6.7.0+ #112 SMP PREEMPT_DYNAMIC Tue Jan 23 12:05:25 CST 2024 x86_64 GNU/Linux

---Check whether your system supports S0ix or not---:

Low Power S0 Idle is:1
Your system supports low power S0 idle capability.



---Check whether intel_pmc_core sysfs files exit---:

The pmc_core debug sysfs files are OK on your system.



---Judge PC10, S0ix residency available status---:

Test system supports S0ix.y substate

S0ix substate before S2idle:
  S0i2.0

S0ix substate residency before S2idle:
  0

Turbostat output: 
15.753671 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.42	0.00	95.62	1982.66	92.93	0.00	0.00	0.00	0.00	0.00	0.00	0.00
5.42	0.00	93.68	1982.68	92.93	0.00	0.00	0.00	0.00	0.00	0.00	0.00
1.32
1.26	0.00	97.55
1.68

CPU Core C7 residency after S2idle is: 95.62
GFX RC6 residency after S2idle is: 1982.66
CPU Package C-state 2 residency after S2idle is: 92.93
CPU Package C-state 3 residency after S2idle is: 0.00
CPU Package C-state 8 residency after S2idle is: 0.00
CPU Package C-state 9 residency after S2idle is: 0.00
CPU Package C-state 10 residency after S2idle is: 0.00
S0ix residency after S2idle is: 0.00

Your system achieved PC2 residency: 92.93, but no PC8 residency during S2idle: 0.00

---Debug no PC8 residency scenario---:
modprobe cpufreq_stats failedLoaded 0 prior measurements
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
RAPL device for cpu 0
RAPL Using PowerCap Sysfs : Domain Mask d
Devfreq not enabled
glob returned GLOB_ABORTED
Cannot load from file /var/cache/powertop/saved_parameters.powertop
File will be loaded after taking minimum number of measurement(s) with battery only 
Leaving PowerTOP

Turbostat output: 

15.945095 sec
CPU%c1	CPU%c6	CPU%c7	GFX%rc6	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	SYS%LPI
2.09	0.00	96.27	2101.49	94.11	0.00	0.00	0.00	0.00	0.00	0.00	0.00
1.68	0.00	97.56	2101.41	94.11	0.00	0.00	0.00	0.00	0.00	0.00	0.00
1.53
4.08	0.00	94.99
1.08

Your CPU Core C7 residency is available: 96.27

Your system Intel graphics RC6 residency is available:2101.49

Checking PCIe Device D state and Bridge Link state:

Available bridge device: 0000:00:07.0 10000:e0:06.0

0000:00:07.0 Link is in Detect

The link power management state of PCIe bridge: 0000:00:07.0 is OK.

10000:e0:06.0 Link is in L1.2

The link power management state of PCIe bridge: 10000:e0:06.0 is OK.

Your system default pcie_aspm policy setting is OK.


Checking PCI Devices D3 States:
[  361.445113] nvme 10000:e1:00.0: PCI PM: Suspend power state: D3hot
[  361.445116] ahci 10000:e0:17.0: PCI PM: Suspend power state: D3hot
[  361.446603] sof-audio-pci-intel-tgl 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[  361.446718] i915 0000:00:02.0: PCI PM: Suspend power state: D3hot
[  361.456849] pcieport 10000:e0:06.0: PCI PM: Suspend power state: D3cold
[  361.456997] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[  361.457000] i801_smbus 0000:00:1f.4: PCI PM: Skipped
[  361.469639] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[  361.470892] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[  361.471149] xhci_hcd 0000:00:14.0: PCI PM: Suspend power state: D3hot
[  361.471362] vmd 0000:00:0e.0: PCI PM: Suspend power state: D3hot
[  361.471365] proc_thermal 0000:00:04.0: PCI PM: Suspend power state: D3hot
[  361.471529] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[  361.471922] iwlwifi 0000:00:14.3: PCI PM: Suspend power state: D3hot
[  361.475818] xhci_hcd 0000:00:0d.0: PCI PM: Suspend power state: D3cold
[  361.505092] thunderbolt 0000:00:0d.2: PCI PM: Suspend power state: D3cold


Checking PCI Devices tree diagram:
-+-[0000:00]-+-00.0  Intel Corporation Device 9a04
 |           +-02.0  Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
 |           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant
 |           +-06.0  Intel Corporation RST VMD Managed Controller
 |           +-07.0-[01-2b]--
 |           +-08.0  Intel Corporation GNA Scoring Accelerator module
 |           +-0a.0  Intel Corporation Tigerlake Telemetry Aggregator Driver
 |           +-0d.0  Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller
 |           +-0d.2  Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
 |           +-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller
 |           +-14.0  Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller
 |           +-14.2  Intel Corporation Tiger Lake-LP Shared SRAM
 |           +-14.3  Intel Corporation Wi-Fi 6 AX201
 |           +-15.0  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
 |           +-15.1  Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
 |           +-16.0  Intel Corporation Tiger Lake-LP Management Engine Interface
 |           +-17.0  Intel Corporation RST VMD Managed Controller
 |           +-1f.0  Intel Corporation Tiger Lake-LP LPC Controller
 |           +-1f.3  Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
 |           +-1f.4  Intel Corporation Tiger Lake-LP SMBus Controller
 |           +-1f.5  Intel Corporation Tiger Lake-LP SPI Controller
 |           \-1f.6  Intel Corporation Ethernet Connection (13) I219-V
 \-[10000:e0]-+-06.0-[e1]----00.0  Sandisk Corp WD Blue SN550 NVMe SSD
              \-17.0  Intel Corporation Tiger Lake-LP SATA Controller

pcieport 10000:e0:06.0 went to D3cold during s2idle, because nvme 10000:e1:00.0 took the simple suspend path and stayed in D3hot.

However, the system still stops at CPU Package C-state 2.
Comment 17 Daniel Drake 2024-01-23 16:00:01 UTC
I'm working alongside Jian-Hong on this issue.

(In reply to Luis Ortega from comment #11)
> IIRC, VMD has had the "nothing but PC2 is reachable if enabled" bug for a
> couple of years now. I discovered this in 2021/2(?) from a bugzilla bug
> report.

Thanks Luis. Indeed, I can now see a bunch of discussions that expose that particular point.

I think I've just discovered something new though:

As you say, with VMD on, s2idle cannot go beyond C2. /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us is 0 after resume.

However, if you run this command: echo 2 > /sys/kernel/debug/pmc_core/ltr_ignore

then after suspend/resume, /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us is a positive value!

/sys/kernel/debug/slp_s0_residency_usec is still 0, which means that the system is not getting into the deepest sleep that we are aiming for. But this ltr_ignore thing is at least getting the CPU into a lower power state (C10), and may provide a clue to the problem at the system level?

It looks like ltr_ignore bit 2 corresponds to the SATA IP block. This product does not have any SATA storage as shipped, although I believe there is a SATA connector on the motherboard.

We'll keep investigating and would greatly appreciate any pointers around what this ltr_ignore thing is doing and what we could check next. I'd also like to understand if there's a way of querying which PCH block is preventing SLP_S0# from being activated. I will investigate if Chipset Initialization Register 34C (CIR34C) can help with this.
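The steps above can be scripted as a minimal sketch (assuming the standard intel_pmc_core debugfs and cpuidle sysfs paths; root is required, and rtcwake is used here to enter s2idle with an automatic wake):

```shell
# Sketch of the ltr_ignore experiment: ignore the SATA LTR, suspend,
# then check whether the CPU got past PC2.
ltr_ignore_sata() {
    pmc=/sys/kernel/debug/pmc_core
    res=/sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us
    if [ -w "$pmc/ltr_ignore" ] && [ -r "$res" ]; then
        echo 2 > "$pmc/ltr_ignore"   # bit 2 = SATA IP block on this platform
        rtcwake -m freeze -s 30      # s2idle for 30 s, wake via RTC alarm
        cat "$res"                   # positive value => CPU left C2 in suspend
        cat "$pmc/slp_s0_residency_usec" 2>/dev/null  # deepest-sleep residency
    else
        echo "pmc_core interface not available"
    fi
}
ltr_ignore_sata
```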
Comment 18 David Box 2024-01-23 18:37:42 UTC
(In reply to Daniel Drake from comment #17)
> 
> It looks like ltr_ignore bit 2 corresponds to the SATA IP block. This
> product does not have any SATA storage as shipped, although I believe there
> is a SATA connector on the motherboard.

There is a SATA controller in the SoC. It is device 17 on the PCI bus. Being a storage controller, it is also remapped along with the NVMe when VMD is enabled. Is there a SATA drive attached to this system?

> 
> We'll keep investigating and would greatly appreciate any pointers around
> what this ltr_ignore thing is doing and what we could check next.

LTR is the latency tolerance of a device. Devices in the PCH report their latency tolerance to the PMC. The PMC takes these values and reports an aggregate maximum latency to the CPU. The CPU uses this value (among many others) to determine how deep of a package c state it can enter that will allow saving the most power while still being able to wake up in time to service requests. ltr_ignore is a debug/testing mechanism to allow the PMC to ignore the LTR value of a device.

> I'd also like to understand if there's a way of querying which PCH block is
> preventing SLP_S0# from being activated.

From the OS point of view, we aim for PC10 to achieve SLP_S0. The other requirements are generally handled by hardware & firmware. The PMC driver provides some insight into blocking reasons. A basic check can be done by enabling "warn_on_s0ix_failures" under /sys/module/intel_pmc_core/parameters/. In order for this to work, you must be able to get PC10. If you do get PC10 but not SLP_S0, the kernel log will display the conditions that prevented entry.
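The check described here can be sketched from the shell (assuming the module parameter path given above; root required, and rtcwake is used to enter s2idle with an automatic wake):

```shell
# Enable the pmc_core S0ix-failure diagnostics, suspend, then inspect dmesg.
s0ix_diag() {
    param=/sys/module/intel_pmc_core/parameters/warn_on_s0ix_failures
    if [ -w "$param" ]; then
        echo Y > "$param"
        rtcwake -m freeze -s 30    # s2idle for 30 s, wake via RTC alarm
        dmesg | grep -iA5 's0ix'   # if PC10 was reached but SLP_S0 was not,
                                   # the blocking conditions are logged here
    else
        echo "intel_pmc_core parameter not present"
    fi
}
s0ix_diag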

David
Comment 19 Daniel Drake 2024-01-23 20:30:39 UTC
Created attachment 305768 [details]
warn_on_s0ix_failures output

Thanks for the hints. There's no SATA drive attached.

You say the PCH devices report their latency tolerance to the PMC - is there a way to know or check what tolerance they are reporting, or how is that defined/controlled?

Here is the output from warn_on_s0ix_failures. The CPU did enter C10 during suspend - but only because I did the SATA ltr_ignore trick. Not sure if that invalidates the test, would appreciate your quick analysis in any case!
Comment 20 Daniel Drake 2024-01-23 21:01:04 UTC
> You say the PCH devices report their latency tolerance to the PMC - is
> there a way to know or check what tolerance they are reporting, or how is
> that defined/controlled?

I found ltr_show.

Comparing the working config (VMD off in BIOS, CPU enters PC10 and system gets into SLP_S0 without any further poking):
PMC0:SATA                 LTR: RAW: 0x9001     Non-Snoop(ns): 0   Snoop(ns): 1048576
PMC0:AGGREGATED_SYSTEM    LTR: RAW: 0x7fffdff  Non-Snoop(ns): 0   Snoop(ns): 0

to the failing config (VMD on in BIOS, CPU gets stuck in C2 unless poking ltr_ignore):
PMC0:SATA                 LTR: RAW: 0x880a     Non-Snoop(ns): 0   Snoop(ns): 10240
PMC0:AGGREGATED_SYSTEM    LTR: RAW: 0x7fc0dff  Non-Snoop(ns): 0   Snoop(ns): 0

There are no other differences in the ltr_show output between working and failing setups.
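These RAW values can be decoded by hand. Going by the decoding in the intel_pmc_core driver (an assumption from the upstream source: bits 9:0 are the latency value, bits 12:10 the scale, the multiplier is 32^scale, and bit 15 marks the requirement valid), a small sketch reproduces both Linux numbers:

```shell
# Decode a PMC snoop-LTR raw value into nanoseconds, following the bit
# layout used by intel_pmc_core: value(9:0) * 32^scale(12:10).
ltr_ns() {
    raw=$(( $1 ))
    val=$(( raw & 0x3ff ))          # bits 9:0  - latency value
    scale=$(( (raw >> 10) & 0x7 ))  # bits 12:10 - scale
    echo $(( val << (5 * scale) ))  # 32^scale == 1 << (5 * scale)
}

ltr_ns 0x880a   # -> 10240   (VMD on, Linux)
ltr_ns 0x9001   # -> 1048576 (VMD off, Linux)
```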
Comment 21 Daniel Drake 2024-01-24 15:55:20 UTC
Thanks for the hints David, I think we are on the right track.

As a next step I think we need to understand precisely how the ltr_show value for the SATA IP block is determined and controlled. Can you help us understand this?

Today I have tested the following configuration:
 * BIOS version 304 (the one originally shipped on these products)
 * VMD enabled in firmware setup menu 
 * Windows 11 freshly installed

Using WinDbg to read physical memory, plus Intel socwatch & VTune, I have observed: **Windows does not get into SLP_S0 nor PC10!** It only reaches PC8 during sleep. So, in order to solve the original issue (rapid battery drain in suspend mode), I believe we are not actually chasing SLP_S0/PC10 as an end goal (although we can see if we can get there too), instead we should first figure out how to get it as low as PC8 to match Windows, and see if that gets us an equivalent in-suspend battery drain result.

As you have explained, LTR values are one important factor in determining how deep of a Package C-state we can achieve. On Linux under this same configuration, the LTR value for SATA is 0x880a (snoop 10240ns). On Windows, it is 0x9009 (snoop 294912ns - much higher). There are no other LTR value differences apart from the aggregate one. We also know that if we turn off VMD in the firmware setup menu, the Linux SATA LTR value becomes 0x9001 (snoop 1048576ns), and in this configuration we can get PC10 & SLP_S0.

So there's a decent indication that if we can increase the SATA LTR value for the problematic case when VMD is enabled, we'll get to a deeper package C-state and enjoy reduced power drain in suspend. Now just need to understand how to affect that specific LTR value.
Comment 22 David Box 2024-01-24 20:24:01 UTC
(In reply to Daniel Drake from comment #21)
> Thanks for the hints David, I think we are on the right track.
> 
> As a next step I think we need to understand precisely how the ltr_show
> value for the SATA IP block is determined and controlled. Can you help us
> understand this?

The LTR is reported by devices to the PMC. The value a device reports is generally based on its activity level, and it can change dynamically. If you continually cat ltr_show you may see the value change for some devices. Some devices don't ever report any requirement to the PMC (the zeros). Of the ones that do, there are different mechanisms for how the values may be set. PCIe devices have capabilities in configuration space for setting LTRs. I am not sure how it's managed for the SATA controller.

> 
> Today I have tested the following configuration:
>  * BIOS version 304 (the one originally shipped on these products)
>  * VMD enabled in firmware setup menu 
>  * Windows 11 freshly installed
> 
> Using WinDbg to read physical memory, plus Intel socwatch & VTune, I have
> observed: **Windows does not get into SLP_S0 nor PC10!** It only reaches PC8
> during sleep. So, in order to solve the original issue (rapid battery drain
> in suspend mode), I believe we are not actually chasing SLP_S0/PC10 as an
> end goal (although we can see if we can get there too), instead we should
> first figure out how to get it as low as PC8 to match Windows, and see if
> that gets us an equivalent in-suspend battery drain result.

On Linux, you can get PC8 with ltr_ignore, but still only PC2 during suspend, correct? The suspend issue may not be LTR related at all. ltr_ignore is permanent over suspend-to-idle (or at least should be). I'll provide you a patch to confirm. We often use ltr_ignore to confirm that we can achieve SLP_S0 by ignoring a problematic LTR. The actual fix, of course, is to address the proper setting of the LTR under idle conditions of the device.

> 
> As you have explained, LTR values are one important factor in determining
> how deep of a Package C-state we can achieve. On Linux under this same
> configuration, the LTR value for SATA is 0x880a (snoop 10240ns). On Windows,
> it is 0x9009 (snoop 294912ns - much higher). There are no other LTR value
> differences apart from the aggregate one. We also know that if we turn off
> VMD in the firmware setup menu, the Linux SATA LTR value becomes 0x9001
> (snoop 1048576ns), and in this configuration we can get PC10 & SLP_S0.

One thing you can try is enabling ALPM for all of the links. The s0ixselftest tool should have done this already, but you can do it yourself with powertop. When you run powertop as root you can tab over to the Tunables page and see the runtime power management status of the devices on your system. The ones that say Bad are not enabled. You can try enabling them on all the SATA hosts you see by pressing the space bar. When you do, powertop will show you the command it wrote to enable it so that you can do it yourself from the shell. Do this before running ltr_ignore (fresh boot) to see if the value changes such that you can get PC8 or deeper. A shortcut for this is to run 'powertop --auto-tune'. That will enable everything. But fair warning, some devices may not respond well to this (like increased response delay with USB devices, an example of why the kernel doesn't just turn everything on).
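The ALPM part of this advice can be sketched as a one-shot script (assuming the standard scsi_host sysfs knob; med_power_with_dipm is the value newer powertop versions suggest, min_power the older, more aggressive one):

```shell
# Enable SATA ALPM on all host links, as powertop's Tunables page would.
enable_alpm() {
    found=0
    for p in /sys/class/scsi_host/host*/link_power_management_policy; do
        [ -w "$p" ] || continue
        echo med_power_with_dipm > "$p"   # enable aggressive link PM
        found=1
    done
    [ "$found" -eq 1 ] || echo "no writable SATA hosts found"
}
enable_alpm
```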

However, I don't know why this would work with AHCI mode and not VMD mode. The other thing that's weird is that there is even any LTR requirement at all given there are no SATA devices attached. This seems like a firmware programming issue, or maybe something that needs to be handled in the VMD driver.
Comment 23 David Box 2024-01-24 20:33:09 UTC
BTW, the kernel log does confirm that SATA was not power gated during suspend.

[  425.969747] intel_pmc_core INT33A1:00: PMC0:SATA_PG_STS                    0 

This is not directly related to LTR though. LTR reporting allows the CPU to set the overall power state of the package based on the need for access to core resources. But a high or even no-requirement LTR doesn't mean the device was power gated. In fact, this status is not even collected until PC10 is reached.
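The power-gating status quoted above can be read directly from pmc_core's debugfs on a machine with the driver loaded (a sketch; the file name is the one exposed by recent intel_pmc_core drivers, root required):

```shell
# Read the PCH IP power-gating status exposed by intel_pmc_core.
pg_status() {
    f=/sys/kernel/debug/pmc_core/pch_ip_power_gating_status
    if [ -r "$f" ]; then
        grep -i sata "$f"    # SATA_PG_STS 0 => SATA was not power gated
    else
        echo "pmc_core debugfs not available"
    fi
}
pg_status
```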
Comment 24 Daniel Drake 2024-01-24 22:55:02 UTC
(In reply to David Box from comment #22)
> The LTR is reported by devices to the PMC. The value it reports is generally
> based on the device's activity level. It can change dynamically. If you
> continually cat ltr_show you may see the value change for some devices.

In this case I have sampled the LTR value a good few times under these different configurations and it seems constant within each configuration. Perhaps not surprising given that there is no SATA activity, as there are no drives attached.

> On Linux, you can get PC8 with ltr_ignore, but still only PC2 during suspend
> correct?

Almost. Under this configuration (VMD on in firmware setup, BIOS 304) the behaviour is:
- Linux: only reaches PC2 in suspend
- Linux, with ltr_ignore 2: reaches PC10 (and no SLP_S0)
- Windows 11: reaches PC8 in suspend (and no SLP_S0)

> We often use ltr_ignore to confirm that we can achieve
> SLP_S0 by ignoring a problematic LTR. The actual fix of course is to address
> the proper setting of the LTR under idle conditions of the device.

Agreed.

> One thing you can try is enabling ALPM for all of the links. The
> s0ixselftest tool should have done this already, but you can do it yourself
> with powertop. When you run powertop as root you can tab over to the
> Tunables pages and see the runtime power management status of the devices on
> your system. The ones that say Bad are not enabled. You can try enabling
> them on all the SATA hosts you see by pressing the space bar. When you do,
> powertop will show you the command it wrote to enable it so that you can
> do it yourself from the shell. Do this before running ltr_ignore (fresh boot)
> to see if the value changes such that you can get PC8 or deeper. A shortcut
> for this is to run 'powertop --auto-tune'.

We have already tried this at an earlier point in the investigation and it didn't affect the drain issue.

> However, I don't know why this would work with AHCI mode and not VMD mode.
> The other thing that's weird is that there is even any LTR requirement at
> all given there are no SATA devices attached. This seems like a firmware
> programming issue, or maybe something that needs to be handled in the VMD
> driver.

Yes, this AHCI vs VMD factor is the really suspect thing. And one clear difference that emerges is the SATA LTR value. That's why I think it would be super useful to know precisely how that specific LTR value is influenced or controlled, if it's possible to find out.

It's entirely possible that we are facing firmware misbehaviour, but do remember, under this problematic VMD=on configuration, Windows has a 294912ns tolerance and Linux has a 10240ns tolerance - so it seems highly likely that the OS *can* control or influence this value. Googling around also indicates that other products hit this issue with identical symptoms (VMD=on only reaches PC2, VMD=off reaches deeper state). So I think/hope there is something fairly generic we can do to solve this.
Comment 25 David Box 2024-01-25 02:23:01 UTC
(In reply to Daniel Drake from comment #24)
> Almost. Under this configuration (VMD on in firmware setup, BIOS 304) the
> behaviour is:
> - Linux: only reaches PC2 in suspend
> - Linux, with ltr_ignore 2: reaches PC10 (and no SLP_S0)
> - Windows 11: reaches PC8 in suspend (and no SLP_S0)

Okay. This looks like two problems, SATA LTR blocking PC10 and SATA IP not power gated during suspend. Though I'm aware you're primarily concerned with the LTR because you can get significant power savings in PC10 even without SLP_S0.

I've cc'd some folks to chime in on the SATA LTR issue.
Comment 26 Daniel Drake 2024-01-25 13:13:04 UTC
Thanks. Mika, any chance you can help with this specific question:

On TGL, what precisely controls or influences the SATA LTR value visible in /sys/kernel/debug/pmc_core/ltr_show? This corresponds to physical address 0xfe001b68.

Context: on multiple platforms including this one, when VMD is enabled, battery drains fast in s2idle suspend because CPU only reaches PC2. On Windows, battery drain in suspend is acceptable and CPU reaches PC8.

There's a strong indication that the SATA LTR value is a major part of this issue. On Linux, this is 0x880a (snoop 10240ns). On Windows, it is 0x9009 (snoop 294912ns - much higher). We would like to match the Windows behaviour to advance on this issue. (There are no SATA drives attached.)
Comment 27 Mika Westerberg 2024-01-25 13:25:22 UTC
I don't think we program LTR at all in Linux for the ahci. So it is whatever the BIOS programmed, if I'm not mistaken.

One thing that came to mind, not sure if you already looked at this or not (sorry, not reading through the whole bug): the ahci should be marked as "low power", but this is not done for TGL because of a regression. You could try reverting this revert and see if it has any effect.

6210038aeaf4 (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller")
Comment 28 Daniel Drake 2024-01-25 13:36:09 UTC
Bingo! With that reverted, as soon as SATA is probed the problematic LTR value shifts from 0x880a (very low latency tolerance) to 0x9001.

And, CPU enters PC10 in suspend, SLP_S0 entered too. Better than Windows.

That detail would have been rather hard to spot. Thanks for your fast advice!
Comment 29 David Box 2024-01-25 15:35:40 UTC
Thanks Mika.

Daniel, can you confirm all the changes you used? Are Jian-Hong's VMD changes also needed to reach PC10 and SLP_S0?
Comment 30 Daniel Drake 2024-01-25 18:20:10 UTC
Jian-Hong's changes are not needed.

For Asus B1400 I am testing with these changes:

1. "echo s2idle > /sys/power/mem_sleep" to switch from S3 to s2idle. S3 is the default on this platform only via a Linux quirk because of the NVMe D3 issue explained below. However, we have now found that S3 is buggy (firmware fails to wake up sometimes), so that's why we came back to revisit this whole situation.

2. With s2idle re-enabled and VMD=off, the system fails to resume the NVMe devices after sleep. Instead of going with a workaround like was done originally with the S3 product quirk, we now found the root cause: The NVMe device's parent PCI bridge cannot be put into D3cold - it cannot be revived from that state. From a platform perspective, I imagine that particular D3cold transition is untested by the vendor, because we also observed that Windows leaves the NVMe device (and hence the parent bridge) in D0 throughout suspend. The path forward here is either to quirk to set d3cold_allowed=0 on the parent bridge (D3hot is fine), or to quirk the NVMe device not to change PCI power state in suspend. (Incidentally, a new BIOS update made available after we shipped 1000 of these products now achieves this by setting StorageD3Enable=0)

3. The SATA LPM change mentioned by Mika, which looks like it can also be achieved by writing a suitable value to /sys/class/scsi_host/*/link_power_management_policy.

Now that we have detailed understanding of the issues, we will seek upstream solutions for all of the above.
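Changes 1 and 3 above can be applied at runtime from the shell; a minimal sketch (BRIDGE is a hypothetical placeholder for the NVMe device's parent bridge address, for the d3cold_allowed quirk of change 2 — substitute the real address on the target machine):

```shell
# Runtime sketch of the three changes listed above. Root required.
s2idle_fixes() {
    BRIDGE=0000:00:00.0   # placeholder, NOT the real bridge address

    # 1. Switch from S3 to s2idle
    [ -w /sys/power/mem_sleep ] && echo s2idle > /sys/power/mem_sleep

    # 2. Keep the NVMe parent bridge out of D3cold (D3hot is fine)
    d3=/sys/bus/pci/devices/$BRIDGE/d3cold_allowed
    [ -w "$d3" ] && echo 0 > "$d3"

    # 3. Enable SATA link power management (runtime equivalent of the revert)
    for p in /sys/class/scsi_host/host*/link_power_management_policy; do
        [ -w "$p" ] && echo med_power_with_dipm > "$p"
    done
    echo done
}
s2idle_fixes
```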