Created attachment 297823 [details]
lspci -tvv

Hi,

I bought a new notebook and noticed that something is wrong with suspend. When I put the notebook to sleep it stays a little warm and the battery drains by 2-3 % per hour. On my previous notebook and my wife's notebook the battery usage is around 0.5 % per hour in s2idle.

After suspend, /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us stays at zero while /sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us shows around 95 %, so it looks like part of the system is simply not sleeping.

The notebook is:
HP Spectre 13-aw2002nc
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz

Based on my testing the problem occurs on 5.13.0 and 5.14-rc1. I haven't tried older kernels because sound is broken in those.

I found this issue, which is pretty similar to mine:
https://bugzilla.kernel.org/show_bug.cgi?id=207893#c2

It mentions that non-connected ports could prevent the system from sleeping, so I tried:

echo 1 | tee /sys/bus/pci/devices/0000:00:07.0/remove
echo 1 | tee /sys/bus/pci/devices/0000:00:07.1/remove

But it didn't help. I'm attaching the output of turbostat, lspci and dmesg with pm_debug_messages enabled, and I would like to ask for a little help with this issue.
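For anyone who wants to reproduce the check, a minimal sketch (Python, run as root; the 30-second rtcwake duration is arbitrary) that compares the two counters across one suspend cycle:

#!/usr/bin/env python3
# Minimal sketch: snapshot the S0ix residency counters, suspend for 30 s via
# rtcwake, and print how much each counter advanced. Run as root.
import subprocess

FILES = [
    "/sys/devices/system/cpu/cpuidle/low_power_idle_cpu_residency_us",
    "/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us",
]

def read_counters():
    return {f: int(open(f).read()) for f in FILES}

before = read_counters()
subprocess.run(["rtcwake", "-m", "freeze", "-s", "30"], check=True)
after = read_counters()

for f in FILES:
    print(f"{f}: +{after[f] - before[f]} us")

If the system counter does not advance at all while the CPU counter does, the package reaches PC10 but S0ix is never entered.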
Created attachment 297825 [details] dmesg after suspend
Created attachment 297827 [details] turbostat
Please disable the VMD controller in your BIOS, then double check.
I disabled VMD at the very beginning, even before I installed Fedora.
From the lspci log I see:

 +-0e.0  Intel Corporation Volume Management Device NVMe RAID Controller

If VMD is enabled, there is a known issue with VMD that blocks s0ix. Fix patches are WIP:
https://bugzilla.kernel.org/show_bug.cgi?id=213047
You are right. I went into the BIOS and it looks different now; I cannot disable VMD there anymore, probably since I upgraded it to the latest version. I will give the patch you shared a shot. Thank you very much.
I applied the patch on 5.13.2 but unfortunately it doesn't work for me :-( The battery drain is still the same and low_power_idle_system_residency_us remains zero after resume.
VMD is used in RAID mode. To disable it you would want to switch to AHCI mode. Is that option not available?
Yep, that's the problem. HP started blocking advanced BIOS setup features a few years ago, and the option isn't available even as a hidden menu accessible via a shortcut anymore. I remember that at the beginning I chose not to use RAID, but I guess that didn't turn VMD off; it just stopped combining the two drives together.
Created attachment 297939 [details]
BIOS setup VMD related options

Only those three screens are available for VMD in the BIOS. If I enter a specific drive it just gives me info about what it is. There is no way to turn VMD off.
Okay. It seems that is not configurable. Let's see which root port this is; your lspci output didn't actually capture that. Can you attach just the output of 'lspci -n'?
Created attachment 297947 [details] lspci -n Sure, here it is.
Created attachment 297951 [details] PCI-ASPM-Enable-ASPM-for-links-under-VMD-domain
Created attachment 297953 [details] PCI-ASPM-Enable-LTR-for-endpoints-behind-VMD
Check the following commands. You should see ASPM disabled on both the root port and the NVMe device, and you should see the LTR on the NVMe device set to 0.

sudo lspci -vv -s 10000:e0:1d.0

	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		...
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+

sudo lspci -vv -s 10000:e1:00.0

	Capabilities: [70] Express (v2) Endpoint, MSI 00
		...
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns

Try the attached patches from Canonical. When running the new kernel you should see ASPM enabled and a high value for the LTR. Then check to see if you get s0ix.
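If it's easier, here is a rough Python sketch of the same check (illustration only; the device addresses are the ones from your lspci output, and it needs to run with sudo so the capability lists are visible):

#!/usr/bin/env python3
# Rough sketch: pull the LnkCtl ASPM state and the LTR latencies out of
# `lspci -vv` for the VMD root port and the NVMe endpoint.
import re
import subprocess

DEVICES = ["10000:e0:1d.0", "10000:e1:00.0"]

for bdf in DEVICES:
    out = subprocess.run(["lspci", "-vv", "-s", bdf],
                         capture_output=True, text=True, check=True).stdout
    aspm = re.findall(r"LnkCtl:\s*(ASPM [^;]+)", out)
    ltr = re.findall(r"Max (?:no )?snoop latency:\s*(\S+)", out)
    print(bdf, "| LnkCtl:", ", ".join(aspm) or "n/a",
          "| LTR:", ", ".join(ltr) or "n/a")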
Created attachment 297955 [details] dmesg2
Created attachment 297957 [details]
tc2.out

I applied those two patches; CPU%LPI is now 0.25 and SYS%LPI remained zero. I think it's worse, because CPU%LPI used to be ~95.
I forgot to mention that everything was as you described; after applying those patches ASPM is enabled and the latencies are non-zero. I attached the dmesg and turbostat output and I hope they are helpful.
Is the kernel new? Was *this* kernel getting 95% CPU%LPI before the patches?

Also, I missed that you have 2 NVMe drives. Is the 2nd pair (root port and drive) still showing the same initial values?

sudo lspci -vv -s 10000:e0:1d.2
sudo lspci -vv -s 10000:e2:00.0
Created attachment 297959 [details]
Add root port A0B2

Use this patch on top of the other two to add the workaround to the 2nd drive.
It's a clean 5.13.2 kernel with only those two patches applied. Yes, it was getting 95% CPU%LPI before.

I think it's a single NVMe module with an integrated Optane cache, but in the system the SSD and the Optane behave as two different NVMe devices, probably both covered by VMD somehow.

$ sudo parted /dev/nvme0n1 p
Model: INTEL HBRPEKNX0202AH (nvme)
Disk /dev/nvme0n1: 512GB

$ sudo parted /dev/nvme1n1 p
Model: INTEL HBRPEKNX0202AHO (nvme)
Disk /dev/nvme1n1: 29,3GB

I will post the lspci output soon. The initial values look the same.
Created attachment 297961 [details] lspci-10000:e0:1d.2.txt
Created attachment 297963 [details] lspci-10000:e2:00.0.txt
Created attachment 297965 [details]
s2idle test script

Since you are using a 5.13 kernel, we should be able to see which s0ix requirements aren't being met. But for this to work you need to be reaching PC10 (CPU%LPI). Try the 3rd patch and see if that gets you to 95% again. If it does, then run this script (assuming you're still not getting s0ix after the patch). If the 3rd patch still doesn't get you back to 95%, undo all of them to get back to the working state and run the script.
You are awesome. It looks like the second quirk did the job:

$ cat /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us
58959985

CPU%LPI 95.65
SYS%LPI 95.56

I am going to run more tests tomorrow, especially battery usage while the machine is sleeping, and I will let you know how it goes. Are those patches going into the kernel, or are they just a workaround?
Great. There was an attempt by Canonical to upstream these patches but it didn't happen due to disagreement over the solution. Rafael, can you take a look at these patches and see what can be done for a proper solution? There is a fix in the works to ensure these values get programmed by firmware, but I'm not sure if it will apply to Tiger Lake, and we have this issue of VMD-enabled systems in the field without this support.
It's a shame about those patches, but I am happy that I at least have a way to keep the notebook sleeping now. I left the notebook sleeping for three hours; the battery stayed at 100 % and its body wasn't warm anymore.

There is probably one more issue: initiating s0ix sometimes takes more time than it should, like a minute or even more. It seems that the system does go to sleep, because the Bluetooth speakers are disconnected almost immediately, but the power LED stays on for a longer period of time. Is there a way I can debug this issue?
(In reply to David Box from comment #26)
> Great. There was an attempt by Canonical to upstream these patches but it
> didn't happen due to disagreement over the solution.
>
> Rafael, can you take a look at these patches and see what can be done for a
> proper solution?

It looks like the VMD host bridge driver needs to walk the PCI bus segment exposed by it and enable ASPM/LTR for all of the devices in there having these features in their config space.
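To make the scope of that walk concrete, here is a userspace sketch (illustration only, not the proposed kernel change) that lists every device on the VMD-exposed segment on this machine (domain 10000) and whether its config space advertises the LTR extended capability; run it as root so the full config space is readable:

#!/usr/bin/env python3
# Userspace illustration: enumerate the devices on the VMD segment (domain
# 10000 here) and report which of them expose the LTR extended capability
# (ID 0x18) in their config space. Not the kernel-side fix, just a way to see
# which devices such a walk would touch.
import os
import struct

PCI_SYSFS = "/sys/bus/pci/devices"
LTR_CAP_ID = 0x18

def has_ltr(cfg):
    if len(cfg) <= 0x100:            # no extended config space visible
        return False
    pos, seen = 0x100, set()
    while pos and pos not in seen and pos + 4 <= len(cfg):
        seen.add(pos)
        header, = struct.unpack_from("<I", cfg, pos)
        if header & 0xffff == LTR_CAP_ID:
            return True
        pos = header >> 20           # offset of the next extended capability
    return False

for dev in sorted(os.listdir(PCI_SYSFS)):
    if not dev.startswith("10000:"):
        continue
    with open(os.path.join(PCI_SYSFS, dev, "config"), "rb") as f:
        cfg = f.read()
    print(dev, "LTR capability:", "yes" if has_ltr(cfg) else "no")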
There is one big problem with the three ASPM/LTR patches: my computer started crashing about once a day when they were applied to the vanilla 5.13.x kernel. It always happens some time after the computer has resumed, not right away. The system just freezes, so I am not able to get any logs. I know it's hard to say what the problem could be, but is there a way to obtain logs from the system? I can ping the machine, but I am unable to get into it via SSH.
Hello Rafael, is this patch a fix for my issue?

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20210910&id=59dc33252ee777e02332774fbdf3381b1d5d5f5d
Created attachment 299049 [details]
PCI-ASPM-Create-ASPM-override

Hi Adam, can you try applying the following two patches and let me know if they solve the issue?
Created attachment 299051 [details] PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices
Hi Michael,

I applied the patches on 5.14.9. /sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us is still zero.

`$ turbostat rtcwake -m freeze -s 30` returns:

CPU%LPI 91.50
SYS%LPI 0.00

`$ lspci -vv -s 10000:e0:1d.0 | grep ASPM` returns:

...
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
...

`$ lspci -vv -s 10000:e1:00.0` returns:

...
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
...

and

...
Capabilities: [178 v1] Latency Tolerance Reporting
	Max snoop latency: 3145728ns
	Max no snoop latency: 3145728ns
...

If there is anything I can do to debug it, please let me know.
Hi Adam, Can you attach the results of doing `lspci -vvv`? We need more information on these devices.
Created attachment 299089 [details] lspci -vvv Hi Michael, yes, sure, here it is.
Created attachment 299183 [details]
Patch that adds pci_warn debug messages

Hi Adam,

Two questions:

1. Can you print the results of doing

   sudo cat /sys/module/pcie_aspm/parameters/policy

   I want to make sure that it's not in "performance".

2. Can you apply the attached patch in addition to the previous two and print the results of doing

   sudo dmesg | grep VMD_DEBUG

   I want to see if something weird is going on with the classes and links.
Created attachment 299185 [details] Corrected patch-0003 I made a typo. Please apply the attached patch instead.
Hi Michael,

Sorry for the delay. It took me a moment to find some time to deal with the patch and build the kernel. It couldn't be applied to 5.14.11, so I added the lines manually. Here it is:

1. ❯ sudo cat /sys/module/pcie_aspm/parameters/policy
   [default] performance powersave powersupersave

   I don't think there is anything in my system that would change this.

2. Here are the debug messages:

[    2.485951] pci 10000:e0:1d.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 394240
[    2.485952] pci 10000:e1:00.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 67586
[    2.485957] pci 10000:e1:00.0: VMD_DEBUG: We've found a quirk, finding parent...
[    2.485957] pci 10000:e1:00.0: VMD_DEBUG: parent = 000000002435851c, pos = 376
[    2.485959] pci 10000:e1:00.0: VMD_DEBUG: Parent and capability exist, setting LTR and policy...
[    2.485965] pci 10000:e0:1d.2: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 394240
[    2.485966] pci 10000:e2:00.0: VMD_DEBUG: Inside vmd_enable_aspm, features = 46, class = 67586
[    2.485973] pci 10000:e2:00.0: VMD_DEBUG: We've found a quirk, finding parent...
[    2.485973] pci 10000:e2:00.0: VMD_DEBUG: parent = 000000003cfa6410, pos = 720
[    2.485974] pci 10000:e2:00.0: VMD_DEBUG: Parent and capability exist, setting LTR and policy...

I will attach the full dmesg if it helps.
Created attachment 299195 [details] dmesg 5.14.11 after suspend
Created attachment 299203 [details]
0001-PCI-ASPM-Create-ASPM-override.patch

Hi Adam, please apply these three patches to Linux 5.15-rc6. If it doesn't work, please do

sudo dmesg | grep VMD_DEBUG

as before.
Created attachment 299205 [details] 0002-PCI-vmd-Override-ASPM-on-TGL-ADL-VMD-devices.patch
Created attachment 299207 [details] 0003-Add-pci-warnings-everywhere.patch
Created attachment 299215 [details]
dmesg 5.15-rc5 after suspend

Hi Michael,

I applied the patches on 5.15-rc5 because rc6 is not out yet. Do you want me to use the master branch instead?

/sys/devices/system/cpu/cpuidle/low_power_idle_system_residency_us is still zero, but ASPM looks enabled on both PCIe devices.

# sudo lspci -vv -s 10000:e1:00.0 | grep ASPM
		LnkCap:	Port #0, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1+

# sudo lspci -vv -s 10000:e0:1d.0 | grep ASPM
		LnkCap:	Port #9, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 <16us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1+

This is the dmesg output:

[    2.405165] pci 10000:e0:1d.0: VMD_DEBUG: features = 46
[    2.405166] pci 10000:e1:00.0: VMD_DEBUG: features = 46
[    2.405171] pci 10000:e1:00.0: VMD_DEBUG: pos = 376
[    2.405172] pci 10000:e1:00.0: VMD_DEBUG: Setting LTRs and policy...
[    2.405178] pci 10000:e1:00.0: VMD_DEBUG: Inside aspm_override_policy, link = 00000000beebc0fe
[    2.405179] pci 10000:e1:00.0: VMD_DEBUG: Found link, setting policy...
[    2.405209] pci 10000:e0:1d.2: VMD_DEBUG: features = 46
[    2.405210] pci 10000:e2:00.0: VMD_DEBUG: features = 46
[    2.405217] pci 10000:e2:00.0: VMD_DEBUG: pos = 720
[    2.405218] pci 10000:e2:00.0: VMD_DEBUG: Setting LTRs and policy...
[    2.405965] pci 10000:e2:00.0: VMD_DEBUG: Inside aspm_override_policy, link = 00000000ce58e343
[    2.405967] pci 10000:e2:00.0: VMD_DEBUG: Found link, setting policy...

I'm attaching the full dmesg too.

Btw: the third patch didn't apply cleanly either, so I had to pick the changed lines manually.
Hello, I don't mean to butt in, but the patches that Michael sent fixed my suspend issue on the Dell XPS 13 (9305). I opened another bug report (https://bugzilla.kernel.org/show_bug.cgi?id=215063) to keep from cluttering this one.
(In reply to Adhitya Mohan from comment #44)

Thanks Adhitya. We were waiting on Adam to test the last revision, which works on the systems we've tested and is now confirmed to work for you as well. May we add a Tested-by tag for you to this patch set?
Hi David,

I didn't know you had been waiting for me. Which patches do you have in mind? I tested the latest ones from Michael with no success - low_power_idle_system_residency_us was still zero. It's possible there is another problem with this specific notebook.
(In reply to David Box from comment #45)
> (In reply to Adhitya Mohan from comment #44)
>
> Thanks Adhitya. We were waiting on Adam to test the last revision, which
> works on the systems we've tested and is now confirmed to work for you as
> well. May we add a Tested-by tag for you to this patch set?

Yes, of course. You can see the dmesg and the result of the self-test tool that I posted in my earlier bug report. Happy to help.
(In reply to Adam Štrauch from comment #46)
> Hi David,
>
> I didn't know you had been waiting for me. Which patches do you have in
> mind? I tested the latest ones from Michael with no success -
> low_power_idle_system_residency_us was still zero. It's possible there is
> another problem with this specific notebook.

Oh. Well, there certainly could be a different issue. I'd run the same script Adhitya ran and see what it says:

https://github.com/qwang59/S0ixSelftestTool

If you are not able to get S0ix, but lspci shows that ASPM is enabled and LTR is set on both drives, then it's unlikely to be due to VMD.
Hi David, Any idea which release cycle these patches might make it into?
My understanding is that David's work has been merged in v6.3.