Bug 217321
Summary: | Intel platforms can't sleep deeper than PC3 during long idle | ||
---|---|---|---|
Product: | Drivers | Reporter: | KobaKo (koba.ko) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | NEEDINFO --- | ||
Severity: | normal | CC: | binli, bjorn, david.e.box, mika.westerberg, neo.wong, regressions |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.1.0-1007-oem (kobako@barbatos) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | lspci_vvv_202304061800, turbostat_202304111627, dmesg_202304111633, dmesg_202304121049, lspci_nvvv_202304121041, lspci_vvnn_202306090022.txt, lspci_xxxx_202306100805 |
Description
KobaKo
2023-04-11 08:32:04 UTC
Created attachment 304116 [details]
lspci_vvv_202304061800
Created attachment 304117 [details]
turbostat_202304111627
Created attachment 304118 [details]
dmesg_202304111633
This patch must have been reverted; please check kernel 6.1.23 or 6.2.10. https://www.spinics.net/lists/stable-commits/msg285503.html

@Atem, this patch reverts "PCI/ASPM: Save L1 PM Substates Capability for suspend/resume", and with the revert applied the CPU can't sleep deeper than PC3. After reverting that revert (as you also mentioned), the issue is gone.

Created attachment 304124 [details]
dmesg_202304121049
Created attachment 304125 [details]
lspci_nvvv_202304121041
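This report measures "deeper than PC3" using package C-state residency. Sources are the turbostat attachment above and the pmc_core debugfs file `package_cstate_show` quoted later in this report. The following is a minimal Python sketch for reducing that debugfs text to one answer. It assumes the `Package C<N> : <count>` line format seen in this report and treats any non-zero counter as "reached". The function names are illustrative, not an existing tool:

```python
from __future__ import annotations

import re

# Assumed line format, copied from the pmc_core output quoted in this report:
#   "Package C<N> : <residency counter>"
_LINE = re.compile(r"^Package C(\d+)\s*:\s*(\d+)$")


def parse_package_cstates(text: str) -> dict[int, int]:
    """Map package C-state number -> raw residency counter."""
    states: dict[int, int] = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            states[int(match.group(1))] = int(match.group(2))
    return states


def deepest_package_cstate(states: dict[int, int]) -> int | None:
    """Deepest package C-state whose residency counter is non-zero, if any."""
    reached = [state for state, residency in states.items() if residency > 0]
    return max(reached) if reached else None


if __name__ == "__main__":
    # Counters from the rtcwake/pmc_core session in this report.
    sample = "\n".join([
        "Package C2 : 181494097",
        "Package C3 : 45220243",
        "Package C6 : 0",
        "Package C7 : 0",
        "Package C8 : 0",
        "Package C9 : 0",
        "Package C10 : 0",
    ])
    deepest = deepest_package_cstate(parse_package_cstates(sample))
    print(f"deepest package C-state reached: PC{deepest}")
```

On the counters quoted in this report (PC2 and PC3 non-zero, PC6 to PC10 zero), this reports PC3. These residency counters accumulate, so comparing snapshots taken before and after a long idle period says more than a single read.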
KobaKo, FWIW, Bjorn had some questions for you here: https://lore.kernel.org/all/20230411204229.GA4168208@bhelgaas/

@Bjorn,
1. Just boot the machine and observe the package C-state (PCx) residency; the issue shows up without triggering suspend/resume.
2. I will try to find which device is responsible.
3. sudo lspci output uploaded (lspci_nvvv_202304121041).

I'm removing the "regression" tag on this bug report. This bug is definitely real. We want to be able to sleep deeper than PC3 (I don't actually know what PC3 is, but I take your word for it :)). 4ff116d0d5fd ("PCI/ASPM: Save L1 PM Substates Capability for suspend/resume") gave us that deep sleep capability temporarily, but 4ff116d0d5fd *also* broke suspend/resume completely on some machines. So 4ff116d0d5fd is basically just a patch that needs a little more work before we merge it again.

@KobaKo, please help to replicate this issue on the latest RC kernel and see whether the issue remains.

@Neo, has any solution landed in the vanilla kernel?

No, but that's what the engineering team requested in order to take this issue. Thanks.

@Neo, I tried the latest generic and PCI next kernels before, and could try one more time.

@Koba, thanks, please make sure to upload the log once you see the same issue.

I believe the patch that was reverted should only have corrected the system state after the first suspend/resume, since not saving state on suspend was the issue. Do you find that you can get deeper than PC3 after boot but not after resume from suspend? If you can get deeper than PC3 after boot (before suspend), can you send the lspci output from that state?

@David, I tried with the patch:
1. After boot, the system can't get deeper than PC3.
2. After suspend/resume, it can't either.
I could revert the patch and get the lspci output; is that OK?

(In reply to KobaKo from comment #18)
> @David, i tried w/ patch
> 1. boot, can't get into deeper than pc3.
> 2. suspend/resume, cant get too.
>
> i could revert the patch and get lspci output, is it ok?

Sure.

Created attachment 304384 [details]
lspci_vvnn_202306090022.txt
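David's reasoning is that a save/restore patch can only change behaviour after a suspend/resume cycle; before the first suspend, whatever the kernel programmed at boot is what applies. Below is a toy Python model of that argument. It assumes the device returns from suspend with power-on register defaults, and the register names and the value written are hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical register names; power-on defaults assumed to be zero.
POWER_ON_DEFAULTS = {"L1SS_CTL1": 0, "L1SS_CTL2": 0}


@dataclass
class ToyDevice:
    regs: dict[str, int] = field(default_factory=lambda: dict(POWER_ON_DEFAULTS))
    saved: dict[str, int] | None = None

    def save_state(self) -> None:
        self.saved = dict(self.regs)

    def lose_power(self) -> None:
        # Assumption: the device returns from suspend with power-on defaults.
        self.regs = dict(POWER_ON_DEFAULTS)

    def restore_state(self) -> None:
        if self.saved is not None:
            self.regs.update(self.saved)


def suspend_resume(dev: ToyDevice, save_restore: bool) -> None:
    if save_restore:
        dev.save_state()
    dev.lose_power()
    if save_restore:
        dev.restore_state()


if __name__ == "__main__":
    with_patch, without_patch = ToyDevice(), ToyDevice()
    for dev in (with_patch, without_patch):
        dev.regs["L1SS_CTL1"] = 0x1234  # arbitrary value "programmed at boot"

    # Before any suspend the two devices are indistinguishable.
    assert with_patch.regs == without_patch.regs

    suspend_resume(with_patch, save_restore=True)
    suspend_resume(without_patch, save_restore=False)
    print(hex(with_patch.regs["L1SS_CTL1"]), hex(without_patch.regs["L1SS_CTL1"]))
```

The two devices are identical until the first suspend; only the one without save/restore then loses its L1SS settings. That is why David asks whether deeper states are reachable after boot but not after resume.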
@David, please check lspci_vvnn_202306090022.txt. I also tried 6.4-rc5, VP.

@David, it seems the LTR1.2 threshold has been configured during boot, but the NVMe device's LTR1.2_Threshold still reads as zero in lspci.

~~~
[ 0.000000] Linux version 6.1.0-1014-oem (kobako@barbatos) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #14 SMP PREEMPT_DYNAMIC Fri Jun 9 08:03:27 UTC 2023
[ 0.897741] pci 0000:02:00.0: aspm l1ss 120
[ 0.897749] pci 0000:02:00.0: l1_2_threshold 118
[ 2.165297] pci 10000:e1:00.0: aspm l1ss 112
[ 2.165313] pci 10000:e1:00.0: l1_2_threshold 616
~~~

~~~
$ sudo lspci -s 10000:e1:00.0 -vvv
10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp Device 5015 (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Sandisk Corp Device 5015
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 0
    NUMA node: 0
    Region 0: Memory at 72000000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [b0] MSI-X: Enable+ Count=65 Masked-
        Vector table: BAR=0 offset=00003000
        PBA: BAR=0 offset=00002000
    Capabilities: [c0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
            10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [1b8 v1] Latency Tolerance Reporting
        Max snoop latency: 3145728ns
        Max no snoop latency: 3145728ns
    Capabilities: [300 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [900 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
            PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
            T_CommonMode=0us LTR1.2_Threshold=0ns
        L1SubCtl2: T_PwrOn=10us
    Capabilities: [910 v1] Data Link Feature <?>
    Capabilities: [920 v1] Lane Margining at the Receiver <?>
    Capabilities: [9c0 v1] Physical Layer 16.0 GT/s <?>
    Kernel driver in use: nvme
    Kernel modules: nvme
~~~

With David's patch, which only saves and restores the L1.2 threshold, VP.

~~~
u@u-Precision-3460:~$ sudo lspci -s 10000:e1:00.0 -vvv
10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp Device 5015 (rev 01) (prog-if 02 [NVM Express])
    Subsystem: Sandisk Corp Device 5015
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 0
    NUMA node: 0
    Region 0: Memory at 72000000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [80] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [b0] MSI-X: Enable+ Count=65 Masked-
        Vector table: BAR=0 offset=00003000
        PBA: BAR=0 offset=00002000
    Capabilities: [c0] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
            10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS- TPHComp- ExtTPHComp-
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
            AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
            EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [1b8 v1] Latency Tolerance Reporting
        Max snoop latency: 3145728ns
        Max no snoop latency: 3145728ns
    Capabilities: [300 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [900 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
            PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
            T_CommonMode=0us LTR1.2_Threshold=616448ns
        L1SubCtl2: T_PwrOn=500us
    Capabilities: [910 v1] Data Link Feature <?>
    Capabilities: [920 v1] Lane Margining at the Receiver <?>
    Capabilities: [9c0 v1] Physical Layer 16.0 GT/s <?>
    Kernel driver in use: nvme
    Kernel modules: nvme
u@u-Precision-3460:~$ sudo rtcwake -m mem -s 10
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Fri Jun 9 23:59:57 2023
u@u-Precision-3460:~$ sudo cat /sys/kernel/debug/pmc_core/package_cstate_show
Package C2 : 181494097
Package C3 : 45220243
Package C6 : 0
Package C7 : 0
Package C8 : 0
Package C9 : 0
Package C10 : 0
~~~

Created attachment 304392 [details]
lspci_xxxx_202306100805
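The `l1_2_threshold` debug lines and the patched lspci output can be cross-checked against each other. The following Python sketch makes these assumptions:

- The debug value is in microseconds.
- The kernel rounds the threshold up into the 10-bit value / 3-bit scale encoding used by LTR_L1.2_THRESHOLD.
- That encoding shares its scale multipliers with the Max Snoop / Max No-Snoop Latency registers of the LTR capability.

The helper names are hypothetical:

```python
from __future__ import annotations

# Nanoseconds per unit for scale values 0..5 (1, 32, 1024, 32768, 1048576,
# 33554432 ns). Assumed shared by LTR_L1.2_THRESHOLD (L1 PM Substates
# Control 1) and the Max Snoop / Max No-Snoop Latency registers.
SCALE_NS = [1 << (5 * scale) for scale in range(6)]
VALUE_MAX = 0x3FF  # both fields carry a 10-bit value


def encode_threshold_us(threshold_us: int) -> tuple[int, int]:
    """Encode a threshold in microseconds as (scale, value).

    Picks the smallest scale that fits and rounds the value up, so the
    encoded threshold is never below the requested one (assumed to mirror
    the kernel's encode_l12_threshold()).
    """
    threshold_ns = threshold_us * 1000
    for scale, unit in enumerate(SCALE_NS):
        if threshold_ns <= unit * VALUE_MAX:
            return scale, -(-threshold_ns // unit)  # ceiling division
    return len(SCALE_NS) - 1, VALUE_MAX  # clamp to the largest encodable value


def decode_ns(scale: int, value: int) -> int:
    """Turn a (scale, value) pair back into nanoseconds."""
    if not 0 <= scale < len(SCALE_NS):
        raise ValueError(f"scale {scale} is not a valid encoding")
    return SCALE_NS[scale] * value


def decode_ltr_latency(raw: int) -> int:
    """Decode a 16-bit LTR latency register: value in bits 9:0, scale in 12:10."""
    return decode_ns((raw >> 10) & 0x7, raw & VALUE_MAX)


if __name__ == "__main__":
    for dev, us in (("0000:02:00.0", 118), ("10000:e1:00.0", 616)):
        scale, value = encode_threshold_us(us)
        print(f"{dev}: {us} us -> scale {scale}, value {value} = {decode_ns(scale, value)} ns")
    # 3145728 ns, as in the lspci "Max snoop latency" line, is scale 4, value 3.
    print(decode_ltr_latency((4 << 10) | 3))
```

Under these assumptions, 616 us encodes as scale 2 (1024 ns units) with value 602. That gives 602 × 1024 = 616448 ns, exactly the LTR1.2_Threshold lspci shows with David's patch. In both dumps the L1SubCtl1 enable bits (PCI-PM_L1.2, ASPM_L1.2) are still printed as disabled, so a correct threshold on its own does not turn L1.2 on.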
Can you try the patch in: https://bugzilla.kernel.org/show_bug.cgi?id=216782#c73

It should add back the save/restore functionality but also follow the spec more closely in the restore phase. It also skips the restore on the ASUS system involved, just in case.

(In reply to Mika Westerberg from comment #26)
> Can you try the patch in:

Mika, what happened to all of this? It looks like it fell through the cracks, but maybe I'm missing something.

Hi, it is still being worked on (but thanks for the reminder). There is one system (see: https://bugzilla.kernel.org/show_bug.cgi?id=216877) where this fails, and we are trying to figure out what might be causing it.
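Mika's description ("follow the spec more closely in the restore phase") is about the order in which saved registers are written back. Below is a Python sketch of one such order that records writes instead of touching hardware; "upstream" is the root port and "downstream" the endpoint. The sequence rests on general PCIe ASPM ordering rules, which should be checked against the L1 PM Substates configuration rules in the PCIe Base Specification:

- ASPM stays off on the link while L1 PM Substates are reprogrammed.
- The link is disabled downstream-first and enabled upstream-first.
- T_POWER_ON and LTR_L1.2_THRESHOLD are written before the L1.2 enable bits.

This is not the patch from bug 216782; `restore_l1ss` and the register keys are hypothetical names:

```python
from __future__ import annotations

from typing import Callable, Dict

ASPM_CTRL = 0x3      # Link Control ASPM Control field: bit 0 L0s, bit 1 L1
L1_2_ENABLES = 0x5   # L1SS Control 1: bit 0 PCI-PM L1.2, bit 2 ASPM L1.2

Regs = Dict[str, int]
Writer = Callable[[str, str, int], None]  # (port, register, value)


def restore_l1ss(upstream: Regs, downstream: Regs, write: Writer) -> None:
    """Write saved LNKCTL / L1SS_CTL1 / L1SS_CTL2 back in an assumed safe order."""
    # 1. Turn ASPM off on the link while L1SS is reprogrammed (downstream first).
    write("downstream", "LNKCTL", downstream["LNKCTL"] & ~ASPM_CTRL)
    write("upstream", "LNKCTL", upstream["LNKCTL"] & ~ASPM_CTRL)
    # 2. T_POWER_ON lives in Control 2 and goes in before any L1.2 enable.
    write("upstream", "L1SS_CTL2", upstream["L1SS_CTL2"])
    write("downstream", "L1SS_CTL2", downstream["L1SS_CTL2"])
    # 3. Common_Mode_Restore_Time and LTR_L1.2_THRESHOLD, L1.2 enables held clear.
    write("upstream", "L1SS_CTL1", upstream["L1SS_CTL1"] & ~L1_2_ENABLES)
    write("downstream", "L1SS_CTL1", downstream["L1SS_CTL1"] & ~L1_2_ENABLES)
    # 4. Now the saved enable bits, upstream component first.
    write("upstream", "L1SS_CTL1", upstream["L1SS_CTL1"])
    write("downstream", "L1SS_CTL1", downstream["L1SS_CTL1"])
    # 5. Finally restore the saved ASPM control, upstream first.
    write("upstream", "LNKCTL", upstream["LNKCTL"])
    write("downstream", "LNKCTL", downstream["LNKCTL"])


if __name__ == "__main__":
    log = []
    # Illustrative register contents, not read from real hardware.
    root_port = {"LNKCTL": 0x42, "L1SS_CTL1": 0x425A0005, "L1SS_CTL2": 0x28}
    nvme = {"LNKCTL": 0x42, "L1SS_CTL1": 0x425A0005, "L1SS_CTL2": 0x28}
    restore_l1ss(root_port, nvme, lambda port, reg, val: log.append((port, reg, val)))
    for port, reg, val in log:
        print(f"{port:<10} {reg:<9} {val:#010x}")
```

Passing a recording `write` callback keeps the ordering testable without hardware. A real implementation would issue config-space writes and, like the patch Mika mentions, might need per-platform exceptions.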