Bug 215832

Summary: S0ix: PCIE ASPM-related trouble with S0ix on Thinkpad X1 (NVME-related?)
Product: Power Management
Reporter: Toke Høiland-Jørgensen (toke)
Component: Hibernation/Suspend
Assignee: David Box (david.e.box)
Status: ASSIGNED
Severity: normal
CC: david.e.box, jwrdegoede, lenb, melnikovsky, rajvi.jingar, rui.zhang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.17.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: lspci output
S0ixSelftestTool output
PCIe link state script
PCIe link state script
turbostat output
PCIE link port status
turbostat output after ltr_ignore
contents of /sys/kernel/debug/pmc_core/ltr_show
dmesg after cold boot (working PC10)
dmesg after a reboot (stuck in pc3)
dmesg during suspend with debug enabled
acpidump output after cold boot
acpidump output after reboot
lspci output in PS3-only state
lspci output in PS10 state
lspci output in PS3-only state (as root)
lspci output in PS10 state (as root)
[PATCH] Test using NVMe APST for shutdown instead of D3.
Test using NVMe APST for shutdown instead of D3

Description Toke Høiland-Jørgensen 2022-04-12 10:30:08 UTC
Created attachment 300748 [details]
lspci output

I'm trying to get S0ix idle to work properly on my Thinkpad X1 (9th
gen), and am having some trouble which leads to high battery usage on
suspend. I've been using the S0ixSelftestTool[0] which is telling me
that it's related to PCI ASPM.

Specifically, the selftest script is telling me that:

"The pcieroot port 0000:00:06.0 ASPM setting is Enabled, its D state and
Link PM are not expected."

This appears to be the slot my NVME is in:

Checking PCI Devices tree diagram:
-[0000:00]-+-00.0  Intel Corporation 11th Gen Core Processor Host Bridge/DRAM Registers
           +-02.0  Intel Corporation TigerLake-LP GT2 [Iris Xe Graphics]
           +-04.0  Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant
           +-06.0-[04]----00.0  Seagate Technology PLC FireCuda 530 SSD

[snip - full output attached]

According to the manufacturer[1], the NVME device in question should
support suspending to L1.2. The S0ix troubleshooting guide[2] mentions
that the 5.3 kernel added special handling for NVME devices, but I'm
trying this on a 5.17 kernel, so that should already be there?

Attaching the full output of the S0ixSelftestTool and the output of lspci -vvv.

[0] https://github.com/intel/S0ixSelftestTool
[1] https://www.seagate.com/files/www-content/datasheets/pdfs/firecuda-530-ssd-DS2059-3-2112GB-en_GB.pdf
[2] https://01.org/blogs/qwang59/2020/linux-s0ix-troubleshooting
Comment 1 Toke Høiland-Jørgensen 2022-04-12 10:31:06 UTC
Created attachment 300749 [details]
S0ixSelftestTool output
Comment 2 David Box 2022-04-27 17:08:54 UTC
Created attachment 300830 [details]
PCIe link state script
Comment 3 David Box 2022-04-27 17:27:03 UTC
Created attachment 300831 [details]
PCIe link state script

Updated to collect over 10 seconds.
Comment 4 David Box 2022-04-27 17:36:42 UTC
This doesn't look like an ASPM issue to me. The nvme at least shows that ASPM is enabled. To be sure, I've added the script used by the S0ix Self Test tool so that you can manually check it. It will sample the link states across the PCIe ports over a 10 second period. You can then grep the output by port to see if any are staying in L0.

Please also run a turbostat collection for about a minute while your system is idle. You can do this with:

sudo turbostat -n 12 -out ts.out

This is to observe how your runtime package c-state residencies compare to those during suspend.
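
For anyone following along, here is a rough sketch of pulling just the package C-state columns out of ts.out for that comparison. It assumes the log is tab-separated, that the system-wide summary row has "-" in its first column, and that the columns are named Pkg%pcN (Pk%pc10 for PC10), which may differ between turbostat versions. pkg_cstate_summary is a made-up helper name, not part of turbostat.

```shell
# Hypothetical helper: print the package C-state residency columns from each
# summary row of a turbostat log (assumed tab-separated).
pkg_cstate_summary() {
    awk -F'\t' '
        /%pc[0-9]+/ {                 # header line: remember column positions
            n = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^Pkg?%pc[0-9]+$/) { n++; col[n] = i; name[n] = $i }
            next
        }
        $1 == "-" && n > 0 {          # assumed system-wide summary row
            line = ""
            for (j = 1; j <= n; j++)
                line = line (j > 1 ? " " : "") name[j] "=" $(col[j])
            print line
        }
    ' "$1"
}
```

Running pkg_cstate_summary ts.out should print one line per sample, which makes it easier to see whether only the PC2/PC3 columns are non-zero.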
Comment 5 Toke Høiland-Jørgensen 2022-04-27 19:37:47 UTC
Created attachment 300832 [details]
turbostat output
Comment 6 Toke Høiland-Jørgensen 2022-04-27 19:38:15 UTC
Created attachment 300833 [details]
PCIE link port status
Comment 7 Toke Høiland-Jørgensen 2022-04-27 19:41:10 UTC
You're right that there doesn't seem to be any PCIe ports that stay in L0; attached the output, as well as that from turbostat.
Comment 8 David Box 2022-04-27 20:27:39 UTC
Yeah, based on this I don't think the issue is with your nvme. Looking at the lspci output again, I see it could be WiFi. It's reporting 0s for its LTR values. Since this is a PCH-connected device, the small LTR value may appear in the output of:

cat /sys/kernel/debug/pmc_core/ltr_show

Please attach this output. There's a mechanism to ignore these LTRs for testing purposes. You can do this with:

for i in {0..25}; do echo $i > /sys/kernel/debug/pmc_core/ltr_ignore; done

You may see some failed writes. That's okay. The actual number of IPs that can be ignored depends on the platform. 25 is a large number to make sure all of them are captured. The rest will fail to write.
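
A variant of the loop above that also reports which indexes were rejected might look like the sketch below. The function name and its arguments are illustrative only. The path is a parameter so that the loop can be tried on a scratch file first.

```shell
# Hypothetical helper: write indexes 0..max to an ltr_ignore-style file and
# print the ones whose write failed. Needs root for the real debugfs file.
ltr_ignore_all() {
    file=$1 max=${2:-25}
    i=0
    while [ "$i" -le "$max" ]; do
        if ! echo "$i" 2>/dev/null > "$file"; then
            echo "index $i rejected"
        fi
        i=$((i + 1))
    done
}

# Intended use on the machine under test (as root):
#   ltr_ignore_all /sys/kernel/debug/pmc_core/ltr_ignore
```

The lowest rejected index then gives a rough idea of how many IPs the platform exposes.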

After this you can run the turbostat command again to see if you can enter deeper package c-states. You can alternatively continuously cat the package_state file in the same pmc_core folder to see which states are updating.
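
Instead of repeatedly cat-ing the file, two snapshots can be compared. The sketch below assumes the file meant here is package_cstate_show and that it contains lines of the form "Package C3 : 1234". Both are assumptions about the pmc_core debugfs layout, and updated_states is a made-up name.

```shell
# Hypothetical helper: given two snapshots of a "name : counter" file, print
# the names whose counter changed between the first and the second.
updated_states() {
    awk -F' *: *' '
        NR == FNR { seen[$1] = $2; next }          # first snapshot
        ($1 in seen) && seen[$1] != $2 { print $1 } # second snapshot
    ' "$1" "$2"
}

# Intended use (as root), with the assumed file name:
#   cat /sys/kernel/debug/pmc_core/package_cstate_show > before.txt
#   sleep 10
#   cat /sys/kernel/debug/pmc_core/package_cstate_show > after.txt
#   updated_states before.txt after.txt
```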
Comment 9 Toke Høiland-Jørgensen 2022-04-27 20:38:50 UTC
Created attachment 300834 [details]
turbostat output after ltr_ignore

Hmm, doesn't seem to help? IIRC the S0ixSelftestTool also does the ltr_ignore dance, doesn't it? Don't think that helped either.

I got two write errors writing to the ltr_ignore file, BTW, so I suppose it goes up to 23?
Comment 10 Toke Høiland-Jørgensen 2022-04-27 20:39:18 UTC
Created attachment 300835 [details]
contents of /sys/kernel/debug/pmc_core/ltr_show
Comment 11 David Box 2022-04-29 19:52:45 UTC
Sorry for the delay. Stuck in Package C3 is difficult to debug without hardware tools. We'll have to search for the cause by trial and error. Here are some other things to try.

1. Check in the log that DMC firmware is loaded for the i915 driver.
2. Check in the log that firmware is loaded for wifi/bluetooth.
3. Report any errors in dmesg.

4. Run powertop --auto-tune to turn on runtime pm on all devices and check the package residency.
5. Unplug any attached USB devices and check the package residency.
6. Check the package residency with the display off, running turbostat in the background.
7. Physically remove the WiFi module and check the package residency. It's still suspect that the LTR is zero.
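
Steps 1 to 3 can be approximated by filtering a saved dmesg log, as in the sketch below. The grep patterns are loose guesses rather than exact driver messages, so they will both over-match and under-match. The helper names are invented for illustration.

```shell
# Hypothetical helpers that read a saved log file, or stdin if no file is given.

# Steps 1 and 2: keep the lines that mention firmware loading (i915 DMC,
# WiFi, Bluetooth).
firmware_lines() {
    cat "$@" | grep -i -E 'firmware|dmc'
}

# Step 3: keep the lines that look like errors.
error_lines() {
    cat "$@" | grep -i -E 'error|fail'
}

# Intended use:
#   dmesg > dmesg.txt
#   firmware_lines dmesg.txt
#   error_lines dmesg.txt
```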
Comment 12 Toke Høiland-Jørgensen 2022-05-04 22:00:14 UTC
Created attachment 300879 [details]
dmesg after cold boot (working PC10)
Comment 13 Toke Høiland-Jørgensen 2022-05-04 22:00:36 UTC
Created attachment 300880 [details]
dmesg after a reboot (stuck in pc3)
Comment 14 Toke Høiland-Jørgensen 2022-05-04 22:00:58 UTC
Okay, so I played around with enabling/disabling the WLAN module (in BIOS). At first I actually thought this helped, since after disabling it, the machine went into pc10 just fine. However, after re-enabling it, that was *still* the case. And then after rebooting, it got stuck in pc3 again.

Playing around a bit more, it seems it was simply *the act of toggling* the WLAN module that "fixed" things. Or rather, whether the machine gets stuck in pc3 depends on how it was booted: If I cold boot it (i.e., shut it off completely, then start it back up) I get pc10 working, but if I then issue a "systemctl reboot", it's stuck in pc3 again. This is reproducible with or without the WLAN module enabled, and seems to be quite consistent.

So I guess toggling the WLAN module causes a hardware reset that corresponds to a full power-off?

I'm attaching the dmesg output after both a cold boot and a reboot; I couldn't really spot any meaningful difference, but maybe you can?
Comment 15 David Box 2022-05-05 21:50:24 UTC
Well, the act of changing any setting in BIOS causes a hardware reset, so it may not be related to WiFi at all. Did you do any suspends before rebooting? If you're not sure, can you confirm that after cold boot you get PC10 and then after reboot (without having suspended/resumed first) you only get PC3?
Comment 16 Toke Høiland-Jørgensen 2022-05-21 22:01:24 UTC
Ah, totally missed your comment, sorry about that!

Yeah, cold boot gets me PC10, then rebooting without suspending gets me only PC3. Just suspending (after a cold boot) also gets me stuck in PC3...
Comment 17 Toke Høiland-Jørgensen 2022-07-18 17:55:20 UTC
Ping? Any updates on this?
Comment 18 Rajvi Jingar 2022-07-28 14:46:39 UTC
Sorry about the delayed reply. We are setting up the system in the lab to debug this issue further.
Comment 19 Rajvi Jingar 2022-09-13 17:36:01 UTC
Can you please confirm when it stops getting PC10 residency? Check multiple times after reboot and confirm the PC10 counter is still incrementing.

Also, what distribution are you using?
Comment 20 Toke Høiland-Jørgensen 2022-10-12 18:02:54 UTC
Yeah, seems pretty consistent: cold boot, goes into PC10 every time I run turbostat (over a period of ~10-15 minutes after boot). Reboot, only PC3.

I'm running Arch Linux; these latest tests were on the 5.19.13-arch1-1 kernel.
Comment 21 David Box 2022-10-21 01:14:56 UTC
Can you do a cold boot? Make sure you're getting PC10. Then write 1 to /sys/power/pm_debug_messages. Also enable some pci debug messages by doing the following as root:

echo -n "file pci-driver.c +p" > /sys/kernel/debug/dynamic_debug/control

Then do a 1 minute suspend. When you come back check that you are now only getting PC3. Send the dmesg log of the suspend/resume cycle.
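
To trim the log down to just the suspend/resume cycle, something like the sketch below could be used. It assumes the kernel brackets the cycle with "PM: suspend entry" and "PM: suspend exit" lines. The rtcwake call in the comments is only one possible way to get a timed s2idle suspend, and suspend_cycle is a made-up name.

```shell
# Hypothetical helper: print only the part of a dmesg log between the assumed
# "PM: suspend entry" and "PM: suspend exit" markers (inclusive).
suspend_cycle() {
    cat "$@" | awk '
        /PM: suspend entry/ { keep = 1 }
        keep                { print }
        /PM: suspend exit/  { keep = 0 }'
}

# Intended sequence after a cold boot (as root):
#   echo 1 > /sys/power/pm_debug_messages
#   echo -n "file pci-driver.c +p" > /sys/kernel/debug/dynamic_debug/control
#   rtcwake -m freeze -s 60
#   dmesg | suspend_cycle > suspend-dmesg.txt
```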

Also, as a separate test, please set nvme.noacpi=1 on the grub kernel command line. Before this change, /sys/module/nvme/parameters/noacpi is N. It will be Y after this change. There's a message in your log that indicates a quirk is being used for your nvme device during suspend. This will disable that quirk.
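
A quick way to confirm that the parameter actually took effect on a given boot is sketched below. The two default paths are the standard procfs/sysfs locations. The function name and its optional arguments are illustrative and exist only so it can be pointed at test files.

```shell
# Hypothetical helper: report whether nvme.noacpi=1 is on the kernel command
# line and what the nvme module parameter currently reads.
nvme_noacpi_status() {
    param=${1:-/sys/module/nvme/parameters/noacpi}
    cmdline=${2:-/proc/cmdline}
    if grep -q -E '(^| )nvme\.noacpi=1( |$)' "$cmdline"; then
        echo "cmdline: nvme.noacpi=1 present"
    else
        echo "cmdline: nvme.noacpi=1 missing"
    fi
    echo "module parameter: $(cat "$param")"
}
```

Calling nvme_noacpi_status with no arguments after each cold boot and reboot rules out the parameter silently going missing.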
Comment 22 Toke Høiland-Jørgensen 2022-12-21 19:59:57 UTC
Created attachment 303448 [details]
dmesg during suspend with debug enabled

Attaching the dmesg with debug enabled as requested.

I also tried the nvme.noacpi=1 kernel parameter. With this, I get PC10 even after a suspend; but after a (soft) reboot, I'm stuck in PC3 again...
Comment 23 David Box 2022-12-21 22:18:46 UTC
Did you modify your bootloader to apply the kernel parameter every time? It otherwise looks like this is the issue. Please attach a copy of your acpi dump `sudo acpidump > acpidump.out`. The flag that the driver uses to apply this quirk is in one of the acpi tables.
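
For reference, a sketch of looking for that flag locally. The comment above does not name the flag. The sketch assumes it is the StorageD3Enable device property, which is what the kernel's ACPI storage D3 check reads as far as I know. The acpidump/iasl steps in the comments are the usual ACPICA workflow, and find_storage_d3 is a made-up name.

```shell
# Hypothetical helper: list the decompiled tables (*.dsl) in a directory that
# mention the assumed StorageD3Enable property.
find_storage_d3() {
    grep -l 'StorageD3Enable' "${1:-.}"/*.dsl
}

# Intended use in an empty scratch directory (acpidump needs root):
#   acpidump -b          # one binary *.dat file per table
#   iasl -d *.dat        # decompile to *.dsl
#   find_storage_d3 .
```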
Comment 24 Toke Høiland-Jørgensen 2022-12-22 09:54:00 UTC
Created attachment 303451 [details]
acpidump output after cold boot

Yeah, the noacpi parameter is in the bootloader; I double-checked the module param even after a reboot.

So, to summarise, with nvme.noacpi:
- Cold boot: PC10
- Cold boot + suspend: PC10
- Reboot: PC3
- Reboot + suspend: PC3

Attaching the acpidump output for both a cold boot and after a reboot (there's a couple of bytes that are different, not sure if they're significant).
Comment 25 Toke Høiland-Jørgensen 2022-12-22 09:54:25 UTC
Created attachment 303452 [details]
acpidump output after reboot
Comment 26 Toke Høiland-Jørgensen 2023-01-31 17:34:49 UTC
Friendly ping? :)
Comment 27 David Box 2023-02-01 19:14:19 UTC
Sorry for the delay. Your tables show that your nvme is being forced to use D3 for suspend rather than ASPM, which is the default for an s0ix system. The noacpi option ignores the BIOS request for D3, and this seemed to help at least for suspends after a cold boot. The presumption was that the D3 flow is not restoring the device properly on resume. But if this is true I don't know why it's not helping with reboots. That flow is different so it's likely the device is being put into D3, but I'd expect the restart to reset the device configuration.

Anyway, to see if any of this is the case please capture lspci -vvv -xxxx output from your system under both PC10 and PC3 conditions.
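
Once both captures exist, the devices whose dumps differ can be listed with a sketch like the one below. It assumes lspci separates devices with blank lines and starts each block with the bus address, so the hex dump from -xxxx stays in the same block. changed_devices and the file names are invented for illustration.

```shell
# Hypothetical helper: print the bus address of every device whose lspci
# block differs between two captures.
changed_devices() {
    awk 'BEGIN { RS = "" }                    # paragraph mode: one device per record
         NR == FNR { block[$1] = $0; next }   # first capture
         block[$1] != $0 { print $1 }         # second capture
    ' "$1" "$2"
}

# Intended use (as root, so the extended config space is readable):
#   lspci -vvv -xxxx > lspci-pc10.txt     # after a cold boot
#   lspci -vvv -xxxx > lspci-pc3.txt      # after a reboot
#   changed_devices lspci-pc10.txt lspci-pc3.txt
```

A plain diff -u of the two files then shows the exact registers for the devices it lists.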
Comment 28 Toke Høiland-Jørgensen 2023-02-01 19:51:24 UTC
Created attachment 303673 [details]
lspci output in PS3-only state
Comment 29 Toke Høiland-Jørgensen 2023-02-01 19:52:48 UTC
Created attachment 303674 [details]
lspci output in PS10 state

Doesn't appear to be any difference on the NVME device itself, but there's a flag on the host bridge that's different between the two?
Comment 30 David Box 2023-02-02 00:47:05 UTC
Sorry, I should have mentioned that you need to run the command as root to get the extended config space detail.
Comment 31 Toke Høiland-Jørgensen 2023-02-02 11:33:55 UTC
Created attachment 303675 [details]
lspci output in PS3-only state (as root)

Ah, doh, should have realised. Okay, trying again... :)
Comment 32 Toke Høiland-Jørgensen 2023-02-02 11:34:14 UTC
Created attachment 303676 [details]
lspci output in PS10 state (as root)
Comment 33 David Box 2023-02-17 01:52:09 UTC
The diffs showed that the extended config space of both the PCIe root port and NVMe device changed but it's not clear what the cause is. The change didn't happen in the area I was suspecting. But, since it looks like not doing D3 for NVMe suspend allows your system to continue getting PC10 on resume, let's try something similar for reboot.

Please try the attached patch. It will treat shutdown the same as suspend and use NVMe APST (PCIe ASPM) instead of D3. With this patch you don't need to use the noacpi flag as it will not do D3 even if it's set. This patch is intended solely to test this theory. I've tested it a few times with no observable issues on the drive but I can't say that it's safe. If you have important items on your drive you may want to change it first. But while on the topic, the drive could very well be the issue too. If you have another, preferably a different model or vendor, then you should try it without this patch.
Comment 34 David Box 2023-02-17 01:55:07 UTC
Created attachment 303741 [details]
[PATCH] Test using NVMe APST for shutdown instead of D3.
Comment 35 Toke Høiland-Jørgensen 2023-02-17 12:44:12 UTC
Okay, that sounds a little scary :)

What's the risk here? Frying the drive, or overheating, or something? If I'm understanding you correctly your patch should just apply the same as the noacpi flag does, but for reboot as well? I've been running with the noacpi flag on since you suggested it back in December which does not appear to have broken anything, so in that case I guess it should be relatively safe? Or is there some additional risk in applying this to reboots?

I do have an old Samsung NVME drive lying around from a previous laptop (the current one is a Seagate drive), but this is my daily driver laptop, so it's not quite trivial for me to find the time to replace the drive and do a fresh install to test it out; probably easier to try the patch, assuming it doesn't fry the drive :)
Comment 36 David Box 2023-02-18 01:48:59 UTC
Created attachment 303745 [details]
Test using NVMe APST for shutdown instead of D3
Comment 37 David Box 2023-02-18 04:17:02 UTC
Sorry. It should be fine. On shutdown the NVMe is normally put into D3 (off) and the system powered down from that state. With this patch it is instead put into a very low power idle state and powered down from there. There are no accesses to the device in either instance. It's just a different condition under which the plug is getting pulled as it were. The caution is just that this is not the standard flow. But to your NVMe, it would be the equivalent of removing the power from your laptop while your system is suspended (like if the battery died).
Comment 38 Lev Melnikovsky 2023-02-21 11:17:59 UTC
Hi. I have a similar (?) problem on ThinkPad T14s (3Gen Intel). Resume from S3 suspend leads to immediate kernel panic and the battery drain in S2idle was inconsistent (ranging from 10% to 50% overnight). It turned out to be a side effect of the laptop-mode-tools playing with /sys/module/pcie_aspm/parameters/policy. For some reason policies "default" and "powersupersave" correspond to low battery drain, while "performance" and "powersave" prevent PC8. I can provide S0ixSelftestTool -s logs for all cases if necessary. Sorry, if this is unrelated and I should rather have opened a separate ticket.
Comment 39 David Box 2023-02-21 15:24:57 UTC
(In reply to Lev Melnikovsky from comment #38)
> Hi. I have a similar (?) problem on ThinkPad T14s (3Gen Intel). Resume from
> S3 suspend leads to immediate kernel panic and the battery drain in S2idle
> was inconsistent (ranging from 10% to 50% overnight). It turned out to be a
> side effect of the laptop-mode-tools playing with
> /sys/module/pcie_aspm/parameters/policy. For some reason policies "default"
> and "powersupersave" correspond to low battery drain, while "performance"
> and "powersave" prevent PC8. I can provide S0ixSelftestTool -s logs for all
> cases if necessary. Sorry, if this is unrelated and I should rather have
> opened a separate ticket.

Please do create a new ticket. The symptoms you shared are not the same as this one.