Shujun Wang <wsj20369@163.com> has reported a regression caused by commit 4b9ee772ea("ACPI: scan: Turn off unused power resources during initialization") on a Lenovo Thinkpad X1 > > It may cause some NVMe device probes to fail, and the system may get stuck > > when using an NVMe device as the root filesystem. > > > > In the function nvme_pci_enable(struct nvme_dev *dev), as shown below, > > readl(NVME_REG_CSTS) always returns -1 with the commit, which results in > > the probe failed. > > > > if (readl(dev->bar + NVME_REG_CSTS) == -1) { > > result = -ENODEV; > > goto disable; > > } > > > > dmesg: > > [ 1.106280] nvme 0000:04:00.0: platform quirk: setting simple suspend > > [ 1.109111] nvme nvme0: pci function 0000:04:00.0 > > [ 1.113066] nvme 0000:04:00.0: enabling device (0000 -> 0002) > > [ 1.121040] nvme nvme0: Removing after probe failure status: -19 While I'm experiencing a similar problem and the sata disk fails caused by this commit. The root cause of my issue is that BIOS exposes a Power Resource which shows it is OFF, but actually it is not. Because explicitly power off this power resource by invoking its _OFF method causes the system failure, which is what the offending commit does. This means that the platform firmware implementation doesn't follow the ACPI specification. Currently, for me issue, I'm trying to push a BIOS fix. For the issue reported by Shujun Wang <wsj20369@163.com>, it probably caused by a similar BIOS issue. So, Shujun Wang <wsj20369@163.com>, please attach 1. the full acpidump output so that I can verify if it is the same issue 2. the dmidecode output, which we can use to have a quirk for this particular platform.
Created attachment 296755 [details] dmidecode Microsoft Surface Pro (2017) Microsoft's Surface Pro (2017) seems to hit this regression too, failing the probe the exact same way, resulting in a failed boot because the filesystem could not be mounted.
Created attachment 296757 [details] acpidump Microsoft Surface Pro (2017)
Created attachment 296805 [details] acpidump for Thinkpad X1 Sorry for late, here is the acpidump in my computer, which has this BIOS issue.
Created attachment 296807 [details] dmidecode for Thinkpad X1 And this is the dmidecode output
Created attachment 296827 [details] debug patch for power resource issue From the acpidump attached, I can not confirm if it is exactly the same BIOS bug. So, please do 1. git revert 5db91e9cb5b3f645a9540d2ab67a19e464d89754 2. git am the-debug-patch-attached 3. boot with acpi_pr_force_on=all 4. check the dmesg which shows a series of devices that would be powered off during boot by default 5. copy the strings somewhere, and boot with specified power resource forced_on during boot, one by one, to narrow down which power resource device causes the problem when it is powered off during boot. We'd better confirm they are BIOS problems before we revert the "revert" patch.
(In reply to Zhang Rui from comment #5) > Created attachment 296827 [details] > debug patch for power resource issue > > From the acpidump attached, I can not confirm if it is exactly the same BIOS > bug. > > So, please do > 1. git revert 5db91e9cb5b3f645a9540d2ab67a19e464d89754 > 2. git am the-debug-patch-attached > 3. boot with acpi_pr_force_on=all > 4. check the dmesg which shows a series of devices that would be powered off > during boot by default > 5. copy the strings somewhere, and boot with specified power resource > forced_on during boot, one by one, to narrow down which power resource > device causes the problem when it is powered off during boot. > You will get boot failures if the broken power resource is still powered off, so when you boot with some power resource specified by acpi_pr_force_on="\xx.yy.zz" parameter, it means that the \xx.yy.zz is the device that causes the problem when it is powered off. BTW, acpi_pr_force_on parameter support specifying multiple devices at one time, e.g. acpi_pr_force_on="\xx,\yy.zz"
Created attachment 296831 [details] ACPI: power: Rework turning off unused power resources While this can be regarded as a platform firmware issue (because it means that the platform firmware doesn't follow the ACPI spec), it turns out to be quite popular and so the list of quirked systems could end up being quite long. On the other hand, the code in 5.13-rc2 does more than it needs to, because it attempts to turn off all of the power resources with reference counts equal to zero at the end of the namespace walks, which may include power resources with users that happen to be in low-power states at that point, while it needs to turn off power resources without any users at all. Everyone who has responded to this report, please check if the attached patch makes a difference for you.
Created attachment 296843 [details] ACPI: power: Turn off unused ACPI power resources during s2idle suspend Please test this patch instead of the previous one. The main difference is that the unused power resources are now turned off during the first s2idle suspend transition instead of being turned off right away after the enumeration of devices.
Everyone. please test both patches from the two previous comments if possible and let me know the results.
Is the second patch broken? It can not be applied to vanilla kernel. $ patch -p1 < ~/Downloads/acpi-power-unused-resources.patch patching file drivers/acpi/scan.c Reversed (or previously applied) patch detected! Assume -R? [n] n Apply anyway? [n] y Hunk #1 FAILED at 2356. 1 out of 1 hunk FAILED -- saving rejects to file drivers/acpi/scan.c.rej patching file drivers/acpi/sleep.h patching file drivers/acpi/internal.h patching file drivers/acpi/sleep.c patching file drivers/acpi/power.c patching file drivers/acpi/x86/s2idle.c
Created attachment 296857 [details] ACPI: power: Turn off unused ACPI power resources during s2idle suspend Yes, the hunk in scan.c is broken, so here goes a fixed patch to test.
Created attachment 296863 [details] ACPI: power: Rework turning off unused power resources Unfortunately, the first one was broken too, because it was missing the hunk in scan.c entirely. My bad, sorry about that. Here goes an updated patch to test.
According to my reading of the ASL in the acpidump files, the patch from comment #12 should reduce the number of unused power resources turned off during initialization on both Surface Pro 2017 and Thinkpad X1, so it is likely to help, but please test it.
(In reply to Rafael J. Wysocki from comment #13) > According to my reading of the ASL in the acpidump files, the patch from > comment #12 should reduce the number of unused power resources turned off > during initialization on both Surface Pro 2017 and Thinkpad X1, so it is > likely to help, but please test it. I can confirm this works on the Surface Pro 2017, applying the patch of comment#12 on top of upstream commit 8ac91e6c6033.
I tested the patch of comment#12 and confirm it works on the Thinkpad X1, on top of upstream commit 7a42b92b6d30c3.
Created attachment 296899 [details] ACPI: power: Get the state of each power resource once Thanks! Please also test this patch. This applies a "lazy update" approach to ACPI power resources (the state of each power resource is only acquired from _STA once, at the time of its first use) and turns off the unused power resources whose state is not "off" (that is, they are "on" or haven't been used yet at all).
I tested the patch of comment#16 on top of upstream commit 7a42b92b6d30c3, It works on Thinkpad X1.
I tested the patch of comment#16 on top of upstream commit d07f6ca923ea, It works on the Surface Pro 2017.
Thank you! So the plan is to apply the patch from comment #12 as a temporary fix for 5.13-rc and then switch over to the one from comment #16 during the 5.14 development cycle.