Bug 215467 - BISECTED nvme blocks PC10 since v5.15
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_other
Depends on:
Reported: 2022-01-08 22:22 UTC by MarcelHB
Modified: 2022-03-10 18:31 UTC

See Also:
Kernel Version: 5.15.0
Tree: Mainline
Regression: Yes

output of the grep over /sys/devices/system/cpu/cpu*/cpuidle/state (21.48 KB, text/plain)
2022-01-08 22:22 UTC, MarcelHB
turbostat output over rtcwake -s 15 (5.27 KB, text/plain)
2022-01-08 22:23 UTC, MarcelHB
dmesg output until waking up again (78.35 KB, text/plain)
2022-01-08 22:24 UTC, MarcelHB
dmesg with NVMe mode (71.34 KB, text/plain)
2022-02-10 17:36 UTC, MarcelHB

Description MarcelHB 2022-01-08 22:22:41 UTC
Created attachment 300241
output of the grep over /sys/devices/system/cpu/cpu*/cpuidle/state


Starting with Kernel 5.15, my Intel TGL notebook no longer enters power
states deeper than PC2. Previous kernels did so.


On my notebook, I recently started to observe that the battery drained
quickly while in a presumed suspend mode. In the past, there was no such
drain.

I was able to bisect the problem on a tag-basis, analyzing deeper
information on what looks good and what does not.

In short: From at least v5.11.x until v5.14.21, the
system operates as expected when closing the lid. Starting with 5.15.0, I observed the anomalies as reported below.


* Dell Inspiron 5402, BIOS v1.80
* Intel Core i5 1135G7 (Tiger Lake)
* no discrete GPU chip
* Kioxia NVMe disk
* Qualcomm WiFi qca6174, module: ath10k_pci
* Realtek ALC3204 sound


* Looking at `/sys/kernel/debug/pmc_core/package_cstate_show` shows that
  the histogram is always 100% PC2 now, never anything below. Previous
  kernel versions went down to PC10.
* Using FirmwareTestSuite (FWTS) and running `fwts s3` starts reporting
  one failure:

  > s3: Expected /sys/kernel/debug/pmc_core/slp_s0_residency_usec to increase from 0, got 0.

  Previous kernel versions passed all tests.
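
For anyone who wants to compare the two symptoms above between kernels, here is a minimal sketch. It assumes that `package_cstate_show` prints one counter per line in the form `Package C2 : 123456` and that `slp_s0_residency_usec` holds a single integer; neither format is confirmed in this report, and the function names are made up for illustration.

```python
import re

# Assumption: each line of package_cstate_show looks like "Package C2 : 123456".
_PKGC_LINE = re.compile(r"Package\s+(C\d+)\s*:\s*(\d+)")


def parse_pkgc(text):
    """Return {state: counter} for one snapshot of package_cstate_show."""
    return {m.group(1): int(m.group(2)) for m in _PKGC_LINE.finditer(text)}


def pkgc_shares(before, after):
    """Percentage of the counter growth that each state received between snapshots."""
    deltas = {state: after[state] - before.get(state, 0) for state in after}
    total = sum(deltas.values())
    if total == 0:
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * delta / total for state, delta in deltas.items()}


def deepest_state(shares):
    """Deepest package C-state with a non-zero share, or None if nothing moved."""
    active = [state for state, pct in shares.items() if pct > 0]
    return max(active, key=lambda s: int(s[1:])) if active else None


def slp_s0_advanced(before_text, after_text):
    """True if slp_s0_residency_usec grew across the suspend attempt."""
    return int(after_text.strip()) > int(before_text.strip())
```

Take one snapshot of each file before the suspend attempt and one after waking up (reading debugfs normally requires root). On the affected kernel, `deepest_state` would be expected to report `C2` and `slp_s0_advanced` to return `False`.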

Diagnosis from troubleshooting so far:

I know this guide [1] from the past to check some things first:

* PowerTop reports core C-states for 98%+ at cc7, good.
* PowerTop reports GPU C-states at 99%+ at RC6, good.
* PowerTop shows nothing else left for tuning, good.
* `dmesg` reports nothing marked as error around a suspend run,
  with `/sys/power/pm_debug_messages` enabled, good.
* There aren't any excessive numbers of interrupts during freeze-state,
  just a few hundred after 15s of this state, and I removed some
  optional hardware modules that were listed, but this made no overall
  difference, so good I presume.
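
The interrupt check in the last bullet can be made repeatable by diffing two copies of `/proc/interrupts`, one saved before the freeze and one after. This is only a sketch: the helper names are illustrative, and it assumes the usual layout of a `CPUn` header row followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Sum the per-CPU counters of every row in a /proc/interrupts snapshot."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        count = 0
        # Only the first ncpus tokens can be counters; stop at the first non-number.
        for token in rest.split()[:ncpus]:
            if not token.isdigit():
                break
            count += int(token)
        totals[label.strip()] = count
    return totals


def interrupt_deltas(before, after):
    """Interrupt sources whose total grew between the snapshots, with the growth."""
    grown = {}
    for label, count in after.items():
        delta = count - before.get(label, 0)
        if delta > 0:
            grown[label] = delta
    return grown
```

Summing the values of `interrupt_deltas(...)` gives the number of interrupts that arrived during the freeze, which can be compared with the "few hundred in 15s" figure above.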

As recommended in [1] in section "Check CPU core 7 residency", I
considered filing a bug.

* CPU Idle driver is `intel_idle`.
* `cpuidle_states.out` is the output of the grep over
  `/sys/devices/system/cpu/cpu*/cpuidle/state`.
* `dmesg.log` is the full output of `dmesg` until waking up from that `turbostat`
  command, with `/sys/power/pm_debug_messages` enabled.
* `ts.out` is the output of `turbostat` itself.
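
The same per-state data as in the grep attachment can also be collected programmatically. The sketch below assumes the standard cpuidle sysfs attributes `name`, `usage` and `time` (in microseconds) under each `stateN` directory; the function name and the `root` parameter are illustrative only.

```python
from pathlib import Path


def read_cpuidle_states(root="/sys/devices/system/cpu"):
    """Collect name, usage and time for every cpuidle state of every CPU."""
    rows = []
    for state_dir in sorted(Path(root).glob("cpu[0-9]*/cpuidle/state[0-9]*")):
        rows.append({
            "cpu": state_dir.parts[-3],      # e.g. "cpu0"
            "state": state_dir.name,         # e.g. "state3"
            "name": (state_dir / "name").read_text().strip(),
            "usage": int((state_dir / "usage").read_text()),
            "time_us": int((state_dir / "time").read_text()),
        })
    return rows
```

Calling it once before and once after a suspend attempt and comparing the `usage` and `time_us` values shows which core idle states were actually entered in between.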

Other things relevant for debugging:

* Reports above are from running kernel v5.15.13.
* I always used the precompiled AMD64 generic kernel images by Ubuntu
  [2], and used a `turbostat` with a matching config and source.
* Kernel reports no indication of taintedness.
* PCIe ASPM is set to `powersupersave` and needs to be set for me
* Disabling WiFi and sound via BIOS makes no difference here
* DKMS only manages the module required for FWTS that I need here

Thanks for your help.

[1] https://01.org/blogs/qwang59/2020/linux-s0ix-troubleshooting
[2] https://kernel.ubuntu.com/~kernel-ppa/mainline/
Comment 1 MarcelHB 2022-01-08 22:23:35 UTC
Created attachment 300242
turbostat output over rtcwake -s 15
Comment 2 MarcelHB 2022-01-08 22:24:11 UTC
Created attachment 300243
dmesg output until waking up again
Comment 3 Len Brown 2022-01-13 15:20:22 UTC
What is the latest kernel that worked properly -- 5.14?
Can you bisect to find the patch that broke deep C-states on this machine?

Comment 4 MarcelHB 2022-01-13 18:54:25 UTC
Yes, the latest working kernel was 5.14.21. Skipping the RCs, the problem appeared in 5.15.0.

I'm familiar with bisection, I just need to figure out how to build and launch custom kernel versions on my distro first. So I'm confident I can report that commit soon.
Comment 5 MarcelHB 2022-01-14 21:36:50 UTC
I got something:

git tells me that the first bad rev is e5ad96f388b765fe6b52f64f37e910c0ba4f3de7 ("nvme-pci: disable hmb on idle suspend") but we also need the patch from the successor rev a5df5e79c43c84d9fb88f56b707c5ff52b27ccca ("nvme: allow user toggling hmb usage") to make the kernel compile without error.

The latter one mentions the ability to specify /sys/class/nvme/nvme0/hmb but toggling that value seems to make no difference here w.r.t. successful sleep mode.
Comment 6 MarcelHB 2022-01-14 21:47:37 UTC
To be more precise: `/sys/class/nvme/nvme0/hmb` is reset to 1 after an attempt to go into suspend mode, even when I set it to 0 right before.
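
The reset described here can be checked with a small helper. This is a sketch only: the attribute path and the way the machine is suspended are both supplied by the caller, and the function name is hypothetical.

```python
from pathlib import Path


def hmb_survives_suspend(attr_path, suspend):
    """Write 0 to the HMB attribute, run one suspend cycle, report whether 0 is kept.

    attr_path: path of the sysfs attribute discussed in comment 5 (caller-supplied).
    suspend:   callable that suspends the machine and returns after resume.
    """
    attr = Path(attr_path)
    attr.write_text("0\n")
    suspend()
    return attr.read_text().strip() == "0"
```

One possible `suspend` callable, assuming `rtcwake` from util-linux is installed and the script runs as root, is `lambda: subprocess.run(["rtcwake", "-m", "freeze", "-s", "15"], check=True)`. With the behaviour reported above, the helper would return `False`.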
Comment 7 Len Brown 2022-01-20 15:46:22 UTC
great job bisecting!

You have found 2 issues with that code:

1. nvme patch breaks PC10

2. hmb setting gets reset across suspend/resume
Comment 8 Keith Busch 2022-01-28 23:09:48 UTC
Thank you for identifying the issues. It looks easy enough to preserve the user HMB setting across resets, but that's certainly not going to help with the low power mode. And truthfully, users probably shouldn't toggle this setting unless they really want to sacrifice storage performance to gain more system memory.

Getting the correct nvme power state is less obvious. What the driver currently does was specifically requested by other OEMs because it happened to get better power saving and faster resume compared to full shutdowns on their platforms. There doesn't seem to be a programmatic way to make everyone happy.

I'll consult with some folks internally and see if we can come up with anything better than quirk lists.
Comment 9 MarcelHB 2022-01-29 14:27:39 UTC
Thanks for your reply. In case you need specific device information, this is from `lspci`:

10000:e1:00.0 Non-Volatile memory controller: KIOXIA Corporation Device 0001 (prog-if 02 [NVM Express])
	Subsystem: KIOXIA Corporation Device 0001
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 0
	NUMA node: 0
	Region 0: Memory at 72000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <32us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [b0] MSI-X: Enable+ Count=32 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [260 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [400 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=60us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=98304ns
		L1SubCtl2: T_PwrOn=50us
	Kernel driver in use: nvme
	Kernel modules: nvme


If you need anything else, let me know.
Comment 10 Keith Busch 2022-02-01 19:14:05 UTC
I've proposed to default to the "simple" shutdown instead of trying other power saving methods. The patch was sent to the developer mailing list here:


I'm sure someone will complain, but the more complicated power savings seems to have caused problems for more platforms than it helped. I'll send an update on this bz if there's any movement on the proposed solution.
Comment 11 Keith Busch 2022-02-09 20:08:12 UTC
(In reply to MarcelHB from comment #9)
> Thanks for your reply. In case you need specific device information, this is
> from `lspci`:
> 10000:e1:00.0

Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to fail even with VMD disabled?

The other nvme pci maintainer insists on quirking platforms for this behavior as we discover them, so I just want to constrain this correctly. The offending behavior was in fact requested by the same OEM for a different platform :(
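
For readers wondering how the VMD domain is visible from the address alone: a minimal sketch, assuming the Linux VMD driver convention of numbering its PCI domains from 0x10000 upward so that they cannot collide with 16-bit ACPI segment numbers. The helper names are illustrative.

```python
def pci_domain(address):
    """Return the PCI domain of 'DDDD:BB:DD.F', or 0 for the short 'BB:DD.F' form."""
    parts = address.split(":")
    return int(parts[0], 16) if len(parts) == 3 else 0


def looks_like_vmd(address):
    """Heuristic: a domain above the 16-bit range suggests a device behind VMD."""
    # Assumption: VMD-created domains start at 0x10000.
    return pci_domain(address) >= 0x10000
```

Applied to the address quoted above, `looks_like_vmd("10000:e1:00.0")` returns `True`, while an ordinary address such as `0000:01:00.0` yields `False`.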
Comment 12 MarcelHB 2022-02-10 08:44:42 UTC
(In reply to Keith Busch from comment #11)
> Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to
> fail even with VMD disabled?

Indeed, when I switch from this Intel thing to NVMe mode in BIOS, everything is green.
Comment 13 Keith Busch 2022-02-10 15:24:09 UTC
(In reply to MarcelHB from comment #12)
> (In reply to Keith Busch from comment #11)
> > Eew, it's on a VMD domain?! If you disable VMD in BIOS, does it continue to
> > fail even with VMD disabled?
> Indeed, when I switch from this Intel thing to NVMe mode in BIOS, everything
> is green.

Could you possibly attach the dmesg with NVMe mode enabled?
Comment 14 MarcelHB 2022-02-10 17:36:43 UTC
Created attachment 300425
dmesg with NVMe mode

Sure, find it attached.
Comment 15 Keith Busch 2022-02-10 18:51:35 UTC
Hm, the dmesg didn't show what I expected.

So low power is successful in NVMe mode, and fails in Intel RAID mode. I'm not seeing the message for the special suspend method in either mode, so they should operate the same way. I'm not entirely sure right now what to make of this.
Comment 16 Rafael J. Wysocki 2022-02-16 12:43:36 UTC
@MarcelHB: Can you please test this patch and report back:

Comment 17 MarcelHB 2022-02-16 15:02:41 UTC
I built an affected kernel version with this patch, reverted to Intel VMD mode and did a test run, but the device still fails to enter deeper suspend states, so no difference yet.
Comment 18 MarcelHB 2022-02-28 20:11:19 UTC
Not sure if this is helpful information, but I noticed the following as well:

With VMD enabled power states remain no deeper than PC2, no matter the machine state.

Without VMD, power states go even down to PC8 without suspending, and PC9 and PC10 when doing so.

So what I did not notice until now is that obviously Intel VMD also prevents the online power saving.

Personally, I do not see any benefit from VMD on this low-to-mid-range machine with its single device anyway, so I see no problem choosing NVMe. It was just a factory setting by Dell, obviously, maybe for something on Windows? I cannot check this anymore.
Comment 19 Keith Busch 2022-02-28 20:21:03 UTC
Indeed, some platforms have VMD enabled by default. This mode does not provide any benefit to Linux.
Comment 20 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-09 08:28:24 UTC
What's the status wrt getting this regression resolved? It looks like nothing of substance happened for quite a while.
Comment 21 MarcelHB 2022-03-09 10:58:24 UTC
Since we've found a workaround for the original problem without sacrificing anything for me, I can live with this even without fixing that regression.

Not sure how much Intel VMD is useful for end-users anyway, so recommending switching to native NVMe interface mode is probably even a far better solution, see #18.
Comment 22 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-09 13:53:29 UTC
(In reply to MarcelHB from comment #21)
> Since we've found a workaround for the original problem without sacrificing
> anything for me, I can live with this even without fixing that regression.

Well, normally I'd now say "fine for me, I'll drop this from the list of tracked regressions". But in this case I'd first like to know: how likely is it that this is a general problem that affects a lot of (all?) other users of machines where VMD is enabled by default? I fear a lot of people might not even notice. I would just hate to discard a report for a regression that later turns out to needlessly increase power consumption on many devices.
Comment 23 Keith Busch 2022-03-09 16:08:25 UTC
My simple work-around wasn't accepted by the co-maintainers, so that's currently a dead end.

I think we'd need someone on Intel side to explain why nvme power management works fine without VMD, but fails to achieve the same system power state when it's enabled. That kind of information may help guide us to a more acceptable patch.
Comment 24 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-09 16:27:49 UTC
(In reply to Keith Busch from comment #23)

> I think we'd need someone on Intel side to explain why nvme power management
> works fine without VMD, but fails to achieve the same system power state
> when it's enabled.

Thx for the answer. To me that sounds like you assume the problem might happen on all systems where VMD is enabled. In that case I'll continue to track this regression -- and thus poke developers when things look stalled from the outside (and for now assume someone will try or already tried to contact Intel about this).
Comment 25 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-10 10:08:50 UTC
Forgot: Should the culprit maybe be reverted now and reapplied later once the VMD issue is solved? Or is that out of the question, as that might lead to a regression for systems where the culprit reduces the power consumption?
Comment 26 Keith Busch 2022-03-10 18:31:48 UTC
We are not considering a revert at this time. The identified patch brings significant power and latency improvements to sane platforms.
