Bug 215063

Summary: S0ix: can't reach S0ix with VMD and raid mode - Dell XPS 13 9305
Product: Power Management
Component: Hibernation/Suspend
Reporter: Adhitya Mohan (me)
Assignee: David Box (david.e.box)
Status: ASSIGNED
Severity: normal
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.17.1
Regression: No
CC: bugzilla, felash, fkrueger, jonah, leho, rui.zhang
URL: https://patchwork.kernel.org/project/linux-pci/cover/20220301041943.2935892-1-david.e.box@linux.intel.com/
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=213717
Attachments:
  [PATCH 1/3] PCI/ASPM: Create ASPM override
  [PATCH 2/3] PCI: vmd: Override ASPM on TGL/ADL VMD devices
  [PATCH 3/3] Add pci warnings everywhere
  patch for 5.16.18

Description Adhitya Mohan 2021-11-18 13:44:43 UTC
Hi, 

So I used https://github.com/qwang59/S0ixSelftestTool to discover that my NVMe SSD (Non-Volatile memory controller: KIOXIA Corporation Device 0001) can't reach S0ix with VMD and raid mode enabled ("Pcieport is not in D3cold"), resulting in a huge battery drain on suspend while running the mainline 5.15 kernel.

Then I came across https://bugzilla.kernel.org/show_bug.cgi?id=213717, built my own kernel with the three patches proposed there, and that fixed the issue. Running the self-test tool again, I got:

S0ix substates residency delta value: S0i2.0 13284122

Congratulations! Your system achieved the deepest S0ix substate!    

CPU Core C7 residency after S2idle is: 95.59
GFX RC6 residency after S2idle is: 617.37
CPU Package C-state 2 residency after S2idle is: 1.96
CPU Package C-state 3 residency after S2idle is: 1.55
CPU Package C-state 8 residency after S2idle is: 3.35
CPU Package C-state 9 residency after S2idle is: 0.00
CPU Package C-state 10 residency after S2idle is: 86.32

versus before:

CPU Core C7 residency after S2idle is: 95.59
GFX RC6 residency after S2idle is: 617.37
CPU Package C-state 2 residency after S2idle is: 1.96
CPU Package C-state 3 residency after S2idle is: 86.32
CPU Package C-state 8 residency after S2idle is: 0.00
CPU Package C-state 9 residency after S2idle is: 0.00
CPU Package C-state 10 residency after S2idle is: 0.00

The patches also included some debug lines, whose output was:

[adhitya@XPS13 self]$ sudo dmesg | grep VMD_DEBUG
[    0.624100] pci 10000:e0:06.0: VMD_DEBUG: features = 46
[    0.624101] pci 10000:e1:00.0: VMD_DEBUG: features = 46
[    0.624103] pci 10000:e1:00.0: VMD_DEBUG: pos = 608
[    0.624104] pci 10000:e1:00.0: VMD_DEBUG: Setting LTRs and policy...
[    0.624107] pci 10000:e1:00.0: VMD_DEBUG: Inside aspm_override_policy, link = 0000000033512dab
[    0.624109] pci 10000:e1:00.0: VMD_DEBUG: Found link, setting policy...
[adhitya@XPS13 self]$ cat /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
199014273
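
For reference, here is a minimal way to reproduce this residency check without the full self-test tool. This is only a sketch: it assumes the platform defaults to s2idle (check /sys/power/mem_sleep) and that rtcwake is available.

```
# Read the S0ix residency counter, suspend-to-idle for 30 seconds using an
# RTC alarm as the wakeup source, then read the counter again. If the delta
# stays at 0, the system never entered S0ix while suspended.
RES=/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
before=$(cat "$RES")
sudo rtcwake -m freeze -s 30
after=$(cat "$RES")
echo "S0ix residency delta: $((after - before)) us"
```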

I opened this separate report since my device is different, so I wasn't entirely sure whether I should include my results in that report.

TL;DR: the mainline kernel has broken suspend behavior; the patches from https://bugzilla.kernel.org/show_bug.cgi?id=213717 fix it.


I am also attaching the patch files in case they help anyone else with the same device.
Comment 1 Adhitya Mohan 2021-11-18 13:55:36 UTC
Created attachment 299631 [details]
Subject: [PATCH 1/3] PCI/ASPM: Create ASPM override
Comment 2 Adhitya Mohan 2021-11-18 13:56:05 UTC
Created attachment 299633 [details]
Subject: [PATCH 2/3] PCI: vmd: Override ASPM on TGL/ADL VMD devices
Comment 3 Adhitya Mohan 2021-11-18 13:56:39 UTC
Created attachment 299635 [details]
Subject: [PATCH 3/3] Add pci warnings everywhere
Comment 4 Julien Wajsberg 2022-01-18 20:57:29 UTC
I also had issues with this on my Dell XPS 15 9510 (2021), so probably a machine very similar to the one described in this bug.
After applying the same patches, and according to the S0ixSelftestTool script, it looks like the computer can now reach an S0ix substate. I'll report back if the computer still seems to lose too much battery.
Comment 5 Julien Wajsberg 2022-04-08 14:10:18 UTC
This is still happening with 5.15.32 and 5.16.18 for me (XPS 15 9510 from 2021). I even booted into Windows to make sure all firmware was updated.

I was able to un-bitrot this patch and apply it on top of 5.16.18, and it still fixes the problem for me.

How can we move forward?
Comment 6 Julien Wajsberg 2022-04-08 14:13:35 UTC
Created attachment 300726 [details]
patch for 5.16.18

Here is the updated patch for 5.16.18, a concatenation of the three originals.

Note that I don't know what I'm doing :-)
Comment 7 Adhitya Mohan 2022-04-08 14:15:13 UTC
Hi,
This patch has changed significantly since the version I posted; I recommend checking the mailing list for the patch series under the title

[PATCH V5 0/3] PCI: vmd: Enable PCIE ASPM and LTR

Unfortunately it doesn't look like the module maintainer is very interested in working with Intel on merging the patches.

(In reply to Julien Wajsberg from comment #5)
> This is still happening with 5.15.32 and 5.16.18 for me (XPS 15 9510 from
> 2021). I even booted into Windows to make sure all firmware was updated.
> 
> I was able to un-bitrot this patch and apply it on top of 5.16.18, and it
> still fixes the problem for me.
> 
> How can we move forward?
Comment 8 Julien Wajsberg 2022-04-08 14:24:56 UTC
Ah that's good to know, thanks.

I'm a complete newbie at kernel development; is there a better way to import the patches than copy-pasting them from the mailing list archives? In particular, do you know whether some public git tree contains them?
Comment 9 Julien Wajsberg 2022-04-08 14:36:59 UTC
Ah, I even found the v6 version of the patch; the three parts are at https://patchwork.kernel.org/project/linux-pci/cover/20220301041943.2935892-1-david.e.box@linux.intel.com/

v5 doesn't apply cleanly; let's see with v6.
Comment 10 Julien Wajsberg 2022-04-08 14:54:43 UTC
The first patch seems to be in-tree now.
The second and third patches need some editing to apply, but nothing crazy.
Let's see how it goes...
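
For anyone else doing the same edits: when a series doesn't apply cleanly, git's three-way merge often resolves the drift automatically. A sketch (the mbox filename is a placeholder):

```
# Attempt a 3-way merge using the blob information recorded in the patches;
# this often succeeds where a plain apply fails on minor context drift.
git am -3 series.mbx

# If a hunk still fails, fix the file by hand, stage it, and continue.
git add -u
git am --continue
```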
Comment 11 Julien Wajsberg 2022-04-08 15:16:47 UTC
I confirm this patchset also works for me! Great!
Since patch 1 is in-tree now, let's hope 2 and 3 will land too.
Comment 12 Julien Wajsberg 2022-04-08 15:24:08 UTC
Patches 2 and 3 both apply cleanly on 5.17.1 (not sure why I missed this stable version).
Comment 13 Adhitya Mohan 2022-04-08 15:29:57 UTC
Great! Maybe send the updated patch files to this thread? They would be useful for someone else.
Comment 14 Julien Wajsberg 2022-04-08 15:33:38 UTC
Forget about my previous messages: patch 1 isn't in-tree. I believe I had it applied locally and got confused.

The full patchset applies cleanly on 5.17.1.
Comment 15 Julien Wajsberg 2022-04-08 15:39:58 UTC
Here is how I applied it:

1. Go to https://patchwork.kernel.org/project/linux-pci/cover/20220301041943.2935892-1-david.e.box@linux.intel.com/
2. Click on "series" at the top right to download the whole series as one file.
3. Run `git am download_file.patch` on it.

Then build the kernel as usual. For me this is:
```
# update config by running make menuconfig and immediately exiting
make menuconfig

# build updated debian packages
make bindeb-pkg -j$(nproc)
```
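
As an alternative to downloading from the patchwork UI, the b4 tool (used by kernel developers, available in most distros) can fetch a whole series from lore.kernel.org using the message-id embedded in the patchwork URL above. A sketch; the output filename is approximate:

```
# b4 downloads the thread and writes an am-ready mbox, named something
# like v6_20220301_david_e_box_....mbx, into the current directory.
b4 am 20220301041943.2935892-1-david.e.box@linux.intel.com

# Apply the whole series on top of the current branch.
git am v6_*.mbx
```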
Comment 16 Leho Kraav 2022-04-08 17:18:44 UTC
I took a shot at backporting vmd.c to 5.15.y, but the PCI device definitions hunk appeared too complex, so I'm bumping myself to 5.17.y to see if it helps bug 215367.
Comment 17 Adhitya Mohan 2022-04-08 17:23:47 UTC
(In reply to Leho Kraav from comment #16)
> I took a shot at backporting vmd.c to 5.15.y, but the PCI device
> definitions hunk appeared too complex, so I'm bumping myself to 5.17.y
> to see if it helps bug 215367.

This bug is present on any device that uses the VMD driver, i.e. any Alder Lake or Tiger Lake platform where the NVMe drive is placed behind Intel VMD, so yes, it should help. To make this easier: if you are on 5.15, use the patches I posted; if you are on 5.17 (which I recommend bumping to), use the newer patches from David.
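
For anyone unsure whether their machine is affected, a quick check (a sketch; the 10000: domain number matches the dmesg output earlier in this report but may differ on other machines):

```
# The VMD controller itself enumerates as an Intel "Volume Management Device".
lspci | grep -i "volume management device"

# Devices behind VMD live in a separate PCI domain (10000: on this machine);
# -D makes lspci print the domain, so the NVMe drive should show up here.
lspci -D | grep "^10000:"
```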
Comment 18 Leho Kraav 2022-04-08 19:06:27 UTC
I have now figured out that my XPS 13 7390 2-in-1 is running NVMe in plain AHCI mode, not VMD.

Adding this patchset unfortunately does nothing for it.

```
The pcieport 0000:00:1d.0
0000:00:1d.7 ASPM enable status:

Pcieport is not in D3cold:    
0000:00:1d.0
0000:00:1d.7

The PCIe bridge link power management state is:
0000:00:1d.0 Link is in L0

The link power management state of PCIe bridge: 0000:00:1d.0 is not expected. 
which is expected to be L1.1 or L1.2, or user would run this script again.
```
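
For what it's worth, the states the script complains about can also be inspected directly. A sketch, using the bridge address 0000:00:1d.0 from the output above:

```
# Show the link capabilities and the ASPM state actually enabled on the port.
sudo lspci -vvv -s 0000:00:1d.0 | grep -E "LnkCap:|LnkCtl:"

# Show the current PCI power state of the port; the tool expects this
# port to reach D3cold during suspend.
cat /sys/bus/pci/devices/0000:00:1d.0/power_state
```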

I'll continue my investigation in bug 215367. Glad y'all got something moving forward for VMD though.
Comment 19 Julien Wajsberg 2022-10-25 08:55:47 UTC
I see there's a new version of the patches: https://patchwork.kernel.org/project/linux-pci/cover/20221025004411.2910026-1-david.e.box@linux.intel.com/

I'm applying them on the latest stable and am currently building.
Comment 20 Julien Wajsberg 2023-01-29 19:49:51 UTC
I forgot to comment, but I've been running this v7 series since then with no issues.
Comment 21 Julien Wajsberg 2023-12-28 10:41:31 UTC
My understanding is that David's work has been merged in v6.3.