Bug 213019

Summary: regression: system can not boot with 4b9ee772ea("ACPI: scan: Turn off unused power resources during initialization")
Product: ACPI Reporter: Zhang Rui (rui.zhang)
Component: BIOSAssignee: Rafael J. Wysocki (rjw)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: dave, mika.westerberg, rjw, wendy.wang, wsj20369
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.13-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmidecode Microsoft Surface Pro (2017)
acpidump Microsoft Surface Pro (2017)
acpidump for Thinkpad X1
dmidecode for Thinkpad X1
debug patch for power resource issue
ACPI: power: Rework turning off unused power resources
ACPI: power: Turn off unused ACPI power resources during s2idle suspend
ACPI: power: Turn off unused ACPI power resources during s2idle suspend
ACPI: power: Rework turning off unused power resources
ACPI: power: Get the state of each power resource once

Description Zhang Rui 2021-05-10 13:56:45 UTC
Shujun Wang <wsj20369@163.com> has reported a regression caused by commit 4b9ee772ea("ACPI: scan: Turn off unused power resources during initialization") on a Lenovo Thinkpad X1

> > It may cause some NVMe device probes to fail, and the system may get stuck
> > when using an NVMe device as the root filesystem.
> >
> > In the function nvme_pci_enable(struct nvme_dev *dev), as shown below,
> > readl(NVME_REG_CSTS) always returns -1 with the commit, which results in
> > the probe failed.
> >
> >   if (readl(dev->bar + NVME_REG_CSTS) == -1) {
> >       result = -ENODEV;
> >       goto disable;
> >   }
> >
> > dmesg:
> >   [    1.106280] nvme 0000:04:00.0: platform quirk: setting simple suspend
> >   [    1.109111] nvme nvme0: pci function 0000:04:00.0
> >   [    1.113066] nvme 0000:04:00.0: enabling device (0000 -> 0002)
> >   [    1.121040] nvme nvme0: Removing after probe failure status: -19

While I'm experiencing a similar problem and the sata disk fails caused by this commit.

The root cause of my issue is that BIOS exposes a Power Resource which shows it is OFF, but actually it is not. Because explicitly power off this power resource by invoking its _OFF method causes the system failure, which is what the offending commit does.

This means that the platform firmware implementation doesn't follow the ACPI
specification.

Currently, for me issue, I'm trying to push a BIOS fix.

For the issue reported by Shujun Wang <wsj20369@163.com>, it probably caused by a similar BIOS issue.

So, Shujun Wang <wsj20369@163.com>, please attach
1. the full acpidump output so that I can verify if it is the same issue
2. the dmidecode output, which we can use to have a quirk for this particular platform.
Comment 1 Dave Olsthoorn 2021-05-15 09:53:19 UTC
Created attachment 296755 [details]
dmidecode Microsoft Surface Pro (2017)

Microsoft's Surface Pro (2017) seems to hit this regression too, failing the probe the exact same way, resulting in a failed boot because the filesystem could not be mounted.
Comment 2 Dave Olsthoorn 2021-05-15 09:54:53 UTC
Created attachment 296757 [details]
acpidump Microsoft Surface Pro (2017)
Comment 3 Shujun Wang 2021-05-17 09:39:09 UTC
Created attachment 296805 [details]
acpidump for Thinkpad X1

Sorry for late, here is the acpidump in my computer, which has this BIOS issue.
Comment 4 Shujun Wang 2021-05-17 09:40:45 UTC
Created attachment 296807 [details]
dmidecode for Thinkpad X1

And this is the dmidecode output
Comment 5 Zhang Rui 2021-05-18 09:21:51 UTC
Created attachment 296827 [details]
debug patch for power resource issue

From the acpidump attached, I can not confirm if it is exactly the same BIOS bug.

So, please do
1. git revert 5db91e9cb5b3f645a9540d2ab67a19e464d89754
2. git am the-debug-patch-attached
3. boot with acpi_pr_force_on=all
4. check the dmesg which shows a series of devices that would be powered off during boot by default
5. copy the strings somewhere, and boot with specified power resource forced_on during boot, one by one, to narrow down which power resource device causes the problem when it is powered off during boot.

We'd better confirm they are BIOS problems before we revert the "revert" patch.
Comment 6 Zhang Rui 2021-05-18 09:26:03 UTC
(In reply to Zhang Rui from comment #5)
> Created attachment 296827 [details]
> debug patch for power resource issue
> 
> From the acpidump attached, I can not confirm if it is exactly the same BIOS
> bug.
> 
> So, please do
> 1. git revert 5db91e9cb5b3f645a9540d2ab67a19e464d89754
> 2. git am the-debug-patch-attached
> 3. boot with acpi_pr_force_on=all
> 4. check the dmesg which shows a series of devices that would be powered off
> during boot by default
> 5. copy the strings somewhere, and boot with specified power resource
> forced_on during boot, one by one, to narrow down which power resource
> device causes the problem when it is powered off during boot.
> 
You will get boot failures if the broken power resource is still powered off, so when you boot with some power resource specified by acpi_pr_force_on="\xx.yy.zz" parameter, it means that the \xx.yy.zz is the device that causes the problem when it is powered off.

BTW, acpi_pr_force_on parameter support specifying multiple devices at one time, e.g. acpi_pr_force_on="\xx,\yy.zz"
Comment 7 Rafael J. Wysocki 2021-05-18 11:14:05 UTC
Created attachment 296831 [details]
ACPI: power: Rework turning off unused power resources

While this can be regarded as a platform firmware issue (because it means that the platform firmware doesn't follow the ACPI spec), it turns out to be quite popular and so the list of quirked systems could end up being quite long.

On the other hand, the code in 5.13-rc2 does more than it needs to, because it attempts to turn off all of the power resources with reference counts equal to zero at the end of the namespace walks, which may include power resources with users that happen to be in low-power states at that point, while it needs to turn off power resources without any users at all.

Everyone who has responded to this report, please check if the attached patch makes a difference for you.
Comment 8 Rafael J. Wysocki 2021-05-18 16:53:11 UTC
Created attachment 296843 [details]
ACPI: power: Turn off unused ACPI power resources during s2idle suspend

Please test this patch instead of the previous one.

The main difference is that the unused power resources are now turned off during the first s2idle suspend transition instead of being turned off right away after the enumeration of devices.
Comment 9 Rafael J. Wysocki 2021-05-18 18:48:32 UTC
Everyone. please test both patches from the two previous comments if possible and let me know the results.
Comment 10 Zhang Rui 2021-05-19 07:45:13 UTC
Is the second patch broken? It can not be applied to vanilla kernel.

$ patch -p1 < ~/Downloads/acpi-power-unused-resources.patch 
patching file drivers/acpi/scan.c
Reversed (or previously applied) patch detected!  Assume -R? [n] n
Apply anyway? [n] y
Hunk #1 FAILED at 2356.
1 out of 1 hunk FAILED -- saving rejects to file drivers/acpi/scan.c.rej
patching file drivers/acpi/sleep.h
patching file drivers/acpi/internal.h
patching file drivers/acpi/sleep.c
patching file drivers/acpi/power.c
patching file drivers/acpi/x86/s2idle.c
Comment 11 Rafael J. Wysocki 2021-05-19 10:04:47 UTC
Created attachment 296857 [details]
ACPI: power: Turn off unused ACPI power resources during s2idle suspend

Yes, the hunk in scan.c is broken, so here goes a fixed patch to test.
Comment 12 Rafael J. Wysocki 2021-05-19 10:23:20 UTC
Created attachment 296863 [details]
ACPI: power: Rework turning off unused power resources

Unfortunately, the first one was broken too, because it was missing the hunk in scan.c entirely.  My bad, sorry about that.

Here goes an updated patch to test.
Comment 13 Rafael J. Wysocki 2021-05-19 11:03:42 UTC
According to my reading of the ASL in the acpidump files, the patch from comment #12 should reduce the number of unused power resources turned off during initialization on both Surface Pro 2017 and Thinkpad X1, so it is likely to help, but please test it.
Comment 14 Dave Olsthoorn 2021-05-19 18:29:46 UTC
(In reply to Rafael J. Wysocki from comment #13)
> According to my reading of the ASL in the acpidump files, the patch from
> comment #12 should reduce the number of unused power resources turned off
> during initialization on both Surface Pro 2017 and Thinkpad X1, so it is
> likely to help, but please test it.

I can confirm this works on the Surface Pro 2017, applying the patch of comment#12 on top of upstream commit 8ac91e6c6033.
Comment 15 Shujun Wang 2021-05-20 12:27:12 UTC
I tested the patch of comment#12 and confirm it works on the Thinkpad X1, on top of upstream commit 7a42b92b6d30c3.
Comment 16 Rafael J. Wysocki 2021-05-20 13:36:12 UTC
Created attachment 296899 [details]
ACPI: power: Get the state of each power resource once

Thanks!

Please also test this patch.

This applies a "lazy update" approach to ACPI power resources (the state of each power resource is only acquired from _STA once, at the time of its first use) and  turns off the unused power resources whose state is not "off" (that is, they are "on" or haven't been used yet at all).
Comment 17 Shujun Wang 2021-05-21 04:59:43 UTC
I tested the patch of comment#16 on top of upstream commit 7a42b92b6d30c3,
It works on Thinkpad X1.
Comment 18 Dave Olsthoorn 2021-05-21 08:39:18 UTC
I tested the patch of comment#16 on top of upstream commit d07f6ca923ea, It works on the Surface Pro 2017.
Comment 19 Rafael J. Wysocki 2021-05-21 09:14:40 UTC
Thank you!

So the plan is to apply the patch from comment #12 as a temporary fix for 5.13-rc and then switch over to the one from comment #16 during the 5.14 development cycle.