Commit 7e4fdea changes the ACPI power subsystem's initialization to turn off any unused power resource, even one whose state is unknown. The net effect is that every unused power resource ends up off, which is great in theory.
I have recently encountered an issue on 5.13.0+ where this behavior powers down the NVMe SSD before its driver can initialize, resulting in the following in the kernel log:
acpi LNXPOWER:02: Turning OFF
pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)
(See bug 214025 for a discussion about making this error message more clear)
What is happening is that the LNXPOWER:02 resource controls power to the PCIe port where the NVMe SSD is attached, but no other ACPI object claims it in its power_resources_D* lists, and so:
$ cat /sys/bus/acpi/devices/LNXPOWER:02/resource_in_use
0
This causes the acpi_turn_off_unused_power_resources() function to conclude that the resource is fair game and turn off the PCIe port in the window between when the PCIe device is discovered and when the driver gets a chance to probe it.
I'm currently working around this by bypassing acpi_turn_off_unused_power_resources() entirely, but a proper fix will require flagging the power resource as "in use." I don't know whether this is a problem with the device's ACPI tables or if Linux should be claiming all LNXPOWER:* resources under each PCI bridge's firmware_node.
Happy to do any additional debugging steps.
https://git.kernel.org/linus/7e4fdeafa61f ("ACPI: power: Turn off unused power resources unconditionally")
Is this a regression caused by 7e4fdeafa61f, i.e., did this device work correctly in v5.12 and break in v5.13? If so, this is a much higher priority problem.
Could you attach the complete dmesg log and the output of acpidump?
Created attachment 298279
Created attachment 298281
dmesg output up to when NVMe driver is running (with acpi_turn_off_unused_power_resources disabled)
I do know that this worked fine on a 5.12.x kernel and the issue appeared when I attempted to boot 5.13.0. I can see if 7e4fdeafa61f itself introduced the problem (by trying a boot with 7e4fdeafa61f and 7e4fdeafa61f^) if desired.
I'll spend some time configuring a more minimalist kernel that I can kexec to test patches and do any additional debug steps.
The improper device poweroff DOES happen with 7e4fdeafa61f, but NOT with 4b9ee772eaa8 (7e4fdea's parent commit).
So it's a regression with 7e4fdeafa61f, although the resource_in_use=0 is not new.
I don't know whether this is related, but with kernel 5.13.x my ASUS notebook sometimes fails to power off and I have to turn it off manually. Also, while it hangs after the last kernel log message, the notebook overheats a lot.
I will try to recompile the kernel without this patch to check whether that resolves it.
This issue has now made it into the 5.14.0 release.
I'm also affected by this bug.
Using kernel 5.11, my NVMe drive was detected properly.
Now using 5.13 or 5.14 I'm getting:
pci 0000:02:00.0: CLS mismatch (64 != 1020), using 64 bytes
nvme 0000:02:00.0: can't change power state from D3hot to D0 (config space inaccessible)
A small update.
I've replaced the SSSTC NVMe SSD with another one from WD, and the same thing happens.
So for the record, the first (internal) one is now from WD and the second one is a Samsung 970 EVO SSD.
So it seems the NVMe SSD type/brand is not to blame here.
It just doesn't initialize properly when it's in the first slot, regardless of brand.
A rather large regression.
If you don't have a second ssd installed, you could try to move it to the optional slot which is right above it.
Alternatively, enable the Intel VMD Rapid Storage chipset (you don't have to create a RAID set), but that will require you to reinstall Windows with the floppy driver from Intel.
(In reply to Sam Edwards from comment #7)
> This issue has now made it into the 5.14.0 release
There were several improvements after commit 7e4fdeafa61f, and they all shipped before 5.14:
6381195ad7d0 ACPI: power: Rework turning off unused power resources
9b7ff25d129d ACPI: power: Refine turning off unused power resources
29038ae2ae56 Revert "Revert "ACPI: scan: Turn off unused power resources during initialization""
5db91e9cb5b3 Revert "ACPI: scan: Turn off unused power resources during initialization"
7e4fdeafa61f ACPI: power: Turn off unused power resources unconditionally
So do you mean that the problem still occurs in 5.14 final release?
> So do you mean that the problem still occurs in 5.14 final release?
Yes, precisely that.
Trying to make sense of the conversation here: I have NVMe hardware that appears to be affected by this issue.
There appear to be no kernel command line parameters that work around the issue, and it's still present in the latest kernel release?
Would it not be better to revert the problematic change? It's almost been a quarter of a year at this point.
I'd be happy to help if possible.
So can you test 5.16-rc2, please?
This has been reworked in 5.15 and 5.16-rc.
Sorry to reply so late.
But for me the problem resolved itself after upgrading to kernel 5.15.
Both my nvme drives are being detected at all times now.
No need to enable the Intel VMD Rapid Storage controller any longer.
I can confirm that with 5.15.5 I am now able to boot.
Same here - had this problem on 5.13.0, on 5.15.4 the SSD is detected correctly.
Good to know.
I'm experiencing exactly the same problem again now, with kernel 6.1.11. My laptop has two NVMe SSD slots; when I plug my SSD into slot 1, it works fine, but if I use slot 2, the SSD is shut down:
nvme 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
Also, when I plug the SSD into slot 2, not only is the SSD shut down, my Nvidia GPU is shut down too. This makes the Nvidia GPU unusable until I remove the SSD from slot 2. journalctl log for the Nvidia GPU:
nvidia 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
I tried many kernel versions: 5.13 has this problem, 5.15 doesn't, but it came back with kernel 5.18, and now with the near-latest 6.1.11 it's the same thing.