Bug 215467
| Summary: | BISECTED nvme blocks PC10 since v5.15 | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | MarcelHB (m.heingbecker) |
| Component: | Other | Assignee: | drivers_other |
| Status: | NEEDINFO --- | | |
| Severity: | normal | CC: | kbusch, leo, regressions, rjw |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 5.15.0 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Attachments:
- output of the grep over /sys/devices/system/cpu/cpu*/cpuidle/state
- turbostat output over rtcwake -s 15
- dmesg output until waking up again
- dmesg with NVMe mode
Description
MarcelHB
2022-01-08 22:22:41 UTC
Created attachment 300242 [details]
turbostat output over rtcwake -s 15
Created attachment 300243 [details]
dmesg output until waking up again
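The attached turbostat and cpuidle data can be gathered with commands along these lines (a sketch, not necessarily the reporter's exact invocation; the rtcwake mode and turbostat's column names, e.g. Pk%pc10, vary between setups and tool versions):

    # Package C-state residency across a short suspend/resume cycle
    sudo turbostat rtcwake -m mem -s 15

    # Per-CPU cpuidle state bookkeeping (name, usage count, time per state)
    grep . /sys/devices/system/cpu/cpu*/cpuidle/state*/name \
           /sys/devices/system/cpu/cpu*/cpuidle/state*/usage \
           /sys/devices/system/cpu/cpu*/cpuidle/state*/time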
What is the latest kernel that worked properly -- 5.14? Can you bisect to find the patch that broke deep C-states on this machine? https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

Yes, the latest working kernel was 5.14.21. Skipping the RCs, the problem appeared in 5.15.0. I'm familiar with bisection, I just need to figure out how to build and launch custom kernel versions on my distro first, so I'm confident I can report that commit soon.

I got something: git tells me that the first bad rev is e5ad96f388b765fe6b52f64f37e910c0ba4f3de7 ("nvme-pci: disable hmb on idle suspend"), but we need the patch from the successor rev a5df5e79c43c84d9fb88f56b707c5ff52b27ccca ("nvme: allow user toggling hmb usage") as well to make the kernel compile without error. The latter one mentions the ability to toggle `/sys/class/nvme/nvme0/hmb`, but changing that value seems to make no difference here w.r.t. successful sleep mode.

To be more precise: `/sys/class/nvme/nvme0/hmb` is reset to 1 after an attempt to go into suspend mode, even when I set it to 0 right before.

Great job bisecting! You have found 2 issues with that code:

1. the nvme patch breaks PC10
2. the hmb setting gets reset across suspend/resume

Thank you for identifying the issues. It looks easy enough to preserve the user HMB setting across resets, but that's certainly not going to help with the low power mode. And truthfully, users probably shouldn't toggle this setting unless they really want to sacrifice storage performance to gain more system memory.

Getting the correct nvme power state is less obvious. The behavior the driver currently implements was specifically requested by other OEMs because it happened to get better power saving and faster resume compared to full shutdowns on their platforms. There doesn't seem to be a programmatic way to make everyone happy. I'll consult with some folks internally and see if we can come up with anything better than quirk lists.
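A quick way to double-check the reset behaviour described above (a sketch, assuming the `hmb` attribute added by a5df5e79c43c and a controller named nvme0; the suspend command may need a different mode on other machines):

    cat /sys/class/nvme/nvme0/hmb            # expect 1 (HMB enabled)
    echo 0 | sudo tee /sys/class/nvme/nvme0/hmb
    sudo rtcwake -m mem -s 15                # suspend for 15 seconds
    cat /sys/class/nvme/nvme0/hmb            # reads 1 again if the setting was lost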
Thanks for your reply. In case you need specific device information, this is from `lspci`:

10000:e1:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0001 (prog-if 02 [NVM Express])
    Subsystem: KIOXIA Corporation Device 0001
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 0
    NUMA node: 0
    Region 0: Memory at 72000000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [40] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
        DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <32us
                ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
                 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                 FRS- TPHComp- ExtTPHComp-
                 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                 AtomicOpsCtl: ReqEn-
        LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                 Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                 Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [80] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [90] MSI: Enable- Count=1/32 Maskable+ 64bit+
        Address: 0000000000000000  Data: 0000
        Masking: 00000000  Pending: 00000000
    Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00003000
    Capabilities: [100 v2] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [150 v1] Virtual Channel
        Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:   ArbSelect=Fixed
        Status: InProgress-
        VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                Status: NegoPending- InProgress-
    Capabilities: [260 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [300 v1] Secondary PCI Express
        LnkCtl3: LnkEquIntrruptEn- PerformEqu-
        LaneErrStat: 0
    Capabilities: [400 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                  PortCommonModeRestoreTime=60us PortTPowerOnTime=10us
        L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
                   T_CommonMode=0us LTR1.2_Threshold=98304ns
        L1SubCtl2: T_PwrOn=50us
    Kernel driver in use: nvme
    Kernel modules: nvme
EOD

If you need anything else, let me know.

I've proposed to default to the "simple" shutdown instead of trying other power-saving methods. The patch was sent to the developer mailing list here: http://lists.infradead.org/pipermail/linux-nvme/2022-February/029644.html

I'm sure someone will complain, but the more complicated power saving seems to have caused problems for more platforms than it has helped. I'll send an update on this bz if there's any movement on the proposed solution.

(In reply to MarcelHB from comment #9)
> Thanks for your reply. In case you need specific device information, this is
> from `lspci`:
>
> 10000:e1:00.0

Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to fail even with VMD disabled? The other nvme pci maintainer insists on quirking platforms for this behavior as we discover them, so I just want to constrain this correctly. The offending behavior was in fact requested by the same OEM for a different platform :(

(In reply to Keith Busch from comment #11)
> Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to
> fail even with VMD disabled?

Indeed, when I switch from this Intel thing to NVMe mode in BIOS, everything is green.

(In reply to MarcelHB from comment #12)
> (In reply to Keith Busch from comment #11)
> > Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to
> > fail even with VMD disabled?
>
> Indeed, when I switch from this Intel thing to NVMe mode in BIOS, everything
> is green.

Could you possibly attach the dmesg with NVMe mode enabled?

Created attachment 300425 [details]
dmesg with NVMe mode
Sure, find it attached.
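For readers trying to reproduce this: whether the SSD sits behind Intel VMD can be seen from a running system, since with VMD enabled the controller is enumerated in a separate PCI domain (10000: in the lspci output above) behind the VMD root port. A sketch:

    # Show PCI domain and bound driver for the NVMe controller
    lspci -Dk | grep -i -A 2 "non-volatile memory"

    # Devices claimed by the vmd driver (directory absent when VMD is disabled)
    ls /sys/bus/pci/drivers/vmd/ 2>/dev/null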
Hm, the dmesg didn't show what I expected. So low power is successful in NVMe mode, and fails in Intel RAID mode. I'm not seeing the message for the special suspend method in either mode, so they should operate the same way. I'm not entirely sure right now what to make of this.

@MarcelHB: Can you please test this patch and report back: https://lore.kernel.org/linux-nvme/20220216084313.GA11360@lst.de/raw

I built an affected kernel version with this patch, reverted to Intel VMD mode and did a test run, but the device still fails to enter deeper suspend states, so no difference yet.

Not sure if this is helpful information, but I noticed the following as well: with VMD enabled, package power states remain no deeper than PC2, no matter the machine state. Without VMD, power states go down to PC8 without suspending, and to PC9 and PC10 when doing so. So what I did not notice until now is that Intel VMD obviously also prevents the runtime power saving. Personally, I do not see any benefit from VMD on this low-to-mid-range machine with its single NVMe device anyway, so I see no problem choosing NVMe mode. It was obviously just a factory setting by Dell, maybe for something on Windows? I cannot check this anymore.

Indeed, some platforms have VMD enabled by default. This mode does not provide any benefit to Linux.

What's the status wrt getting this regression resolved? It looks like nothing of substance has happened for quite a while.

Since we've found a workaround for the original problem without sacrificing anything for me, I can live with this even without fixing that regression. Not sure how much Intel VMD is useful for end users anyway, so recommending switching to native NVMe interface mode is probably even a far better solution, see comment #18.

(In reply to MarcelHB from comment #21)
> Since we've found a workaround for the original problem without sacrificing
> anything for me, I can live with this even without fixing that regression.

Well, normally I'd now say "fine for me, I'll drop this from the list of tracked regressions". But in this case I'd first like to know: how likely is it that this is a general problem that affects a lot of (all?) other users of machines where VMD is enabled by default? I fear a lot of people might not even notice. I would just hate to discard a report for a regression that later turns out to needlessly increase power consumption on many devices.

My simple workaround wasn't accepted by the co-maintainers, so that's currently a dead end. I think we'd need someone on the Intel side to explain why nvme power management works fine without VMD, but fails to achieve the same system power state when it's enabled. That kind of information may help guide us to a more acceptable patch.

(In reply to Keith Busch from comment #23)
> I think we'd need someone on the Intel side to explain why nvme power
> management works fine without VMD, but fails to achieve the same system
> power state when it's enabled.

Thx for the answer. To me that sounds like you assume the problem might happen on all systems where VMD is enabled. In that case I'll continue to track this regression -- and thus poke developers when things look stalled from the outside (and for now assume someone will try or has already tried to contact Intel about this).

Forgot: should the culprit maybe be reverted now and reapplied later once the VMD issue is solved? Or is that out of the question, as it might lead to a regression for systems where the culprit reduces power consumption?

We are not considering a revert at this time.
The identified patch brings significant power and latency improvements to sane platforms.

Is there any working patch for this? I have an issue where I cannot disable VMD because the OEM has locked down the BIOS, so is there any solution for me? My problem is a broader battery drain from VMD... Thanks!