Bug 217966 - Weird battery detection issue triggered by extra PCIe slot/port
Summary: Weird battery detection issue triggered by extra PCIe slot/port
Status: NEW
Alias: None
Product: ACPI
Classification: Unclassified
Component: EC
Hardware: All Linux
Importance: P3 normal
Assignee: acpi_ec
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-02 08:42 UTC by Tom Yan
Modified: 2023-10-11 13:00 UTC
CC List: 5 users

See Also:
Kernel Version: 6.5.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Tom Yan 2023-10-02 08:42:38 UTC
So I have a laptop with two M.2 slots / PCIe ports:

00:06.0 PCI bridge: Intel Corporation 12th Gen Core Processor PCI
Express x4 Controller #0 (rev 04)
00:1d.0 PCI bridge: Intel Corporation Device 51b0 (rev 01)

Apparently 00:06.0 is wired directly to the CPU and 00:1d.0 is
wired to the PCH. Both of them can be disabled independently in UEFI
settings, and both have an NVMe drive installed.

Currently only the drive on the "CPU slot" is used, with both Windows
and Linux installed. The other drive has been wiped and remains unused
for now due to the issue I'm reporting here.

The problem I am having is that, for some reason, when the "PCH slot"
is enabled, Linux has an estimated < 50% chance of detecting the
battery. If detection fails, it seems I have to "cold reboot"
(that is, shut down normally and power on again, not just
reboot or S3 suspend) to get another try, and the success
rate on that boot is the same.

I tracked it down to the following sysfs findings:

[tom@corebook ~]$ ls -Al '/sys/devices/pci0000:00/0000:00:1f.0/PNP0C09:00/'
total 0
-rw-r--r-- 1 root root 4096 Sep 17 11:01 driver_override
lrwxrwxrwx 1 root root    0 Sep 17 11:01 firmware_node ->
../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:0b/PNP0C09:00
drwxr-xr-x 3 root root    0 Sep 17 11:01 INT33D3:00
drwxr-xr-x 3 root root    0 Sep 17 11:01 INT33D4:00
drwxr-xr-x 3 root root    0 Sep 17 11:01 INTC1046:01
drwxr-xr-x 3 root root    0 Sep 17 11:01 INTC1046:02
drwxr-xr-x 3 root root    0 Sep 17 11:01 INTC1046:03
drwxr-xr-x 3 root root    0 Sep 17 11:01 INTC1048:00
-r--r--r-- 1 root root 4096 Sep 17 11:01 modalias
drwxr-xr-x 3 root root    0 Sep 17 11:01 PNP0C0D:00
drwxr-xr-x 2 root root    0 Sep 17 11:01 power
lrwxrwxrwx 1 root root    0 Sep 17 11:01 subsystem -> ../../../../bus/platform
-rw-r--r-- 1 root root 4096 Sep 17 11:01 uevent
-r--r--r-- 1 root root 4096 Sep 17 11:01 waiting_for_supplier
[tom@corebook ~]$ ls -Al
'/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:0b/PNP0C09:00/PNP0C0A:00/'
total 0
-r--r--r-- 1 root root 4096 Sep 17 11:02 hid
-r--r--r-- 1 root root 4096 Sep 17 11:02 modalias
-r--r--r-- 1 root root 4096 Sep 17 11:02 path
drwxr-xr-x 2 root root    0 Sep 17 11:02 power
-r--r--r-- 1 root root 4096 Sep 17 11:02 status
lrwxrwxrwx 1 root root    0 Sep 17 11:01 subsystem ->
../../../../../../../bus/acpi
-rw-r--r-- 1 root root 4096 Sep 17 11:01 uevent
-r--r--r-- 1 root root 4096 Sep 17 11:02 uid

whereas if the "PCH slot" is disabled, or if it succeeded in the
"detection trial":

[tom@corebook ~]$ ls -Al '/sys/devices/pci0000:00/0000:00:1f.0/PNP0C09:00/'
total 0
-rw-r--r-- 1 root root 4096 Sep 17 11:56 driver_override
lrwxrwxrwx 1 root root    0 Sep 17 11:56 firmware_node ->
../../../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:0b/PNP0C09:00
drwxr-xr-x 3 root root    0 Sep 17 11:53 INT33D3:00
drwxr-xr-x 3 root root    0 Sep 17 11:53 INT33D4:00
drwxr-xr-x 3 root root    0 Sep 17 11:53 INTC1046:01
drwxr-xr-x 3 root root    0 Sep 17 11:53 INTC1046:02
drwxr-xr-x 3 root root    0 Sep 17 11:53 INTC1046:03
drwxr-xr-x 3 root root    0 Sep 17 11:53 INTC1048:00
-r--r--r-- 1 root root 4096 Sep 17 11:56 modalias
drwxr-xr-x 3 root root    0 Sep 17 11:53 PNP0C0A:00
drwxr-xr-x 3 root root    0 Sep 17 11:53 PNP0C0D:00
drwxr-xr-x 2 root root    0 Sep 17 11:56 power
lrwxrwxrwx 1 root root    0 Sep 17 11:53 subsystem -> ../../../../bus/platform
-rw-r--r-- 1 root root 4096 Sep 17 11:53 uevent
-r--r--r-- 1 root root 4096 Sep 17 11:56 waiting_for_supplier
[tom@corebook ~]$ ls -Al
'/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:0b/PNP0C09:00/PNP0C0A:00/'
total 0
lrwxrwxrwx 1 root root    0 Sep 17 11:53 driver ->
../../../../../../../bus/acpi/drivers/battery
-r--r--r-- 1 root root 4096 Sep 17 11:56 hid
-r--r--r-- 1 root root 4096 Sep 17 11:56 modalias
-r--r--r-- 1 root root 4096 Sep 17 11:56 path
lrwxrwxrwx 1 root root    0 Sep 17 11:56 physical_node ->
../../../../../../pci0000:00/0000:00:1f.0/PNP0C09:00/PNP0C0A:00
drwxr-xr-x 2 root root    0 Sep 17 11:56 power
drwxr-xr-x 3 root root    0 Sep 17 11:53 power_supply
-r--r--r-- 1 root root 4096 Sep 17 11:56 status
lrwxrwxrwx 1 root root    0 Sep 17 11:53 subsystem ->
../../../../../../../bus/acpi
-rw-r--r-- 1 root root 4096 Sep 17 11:53 uevent
-r--r--r-- 1 root root 4096 Sep 17 11:56 uid
drwxr-xr-x 3 root root    0 Sep 17 11:53 wakeup

As you can see, the "physical node" `PNP0C0A:00` is missing in the
failing case, so its "firmware node" has nothing to
"attach"(?) to and the battery driver therefore sees nothing. (The
parent device `PNP0C09:00` is managed by the driver `ec`, for the
record.)
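
A minimal sketch for checking this state from a script rather than by eye, assuming the sysfs paths from the listings above (they are specific to this machine and will differ elsewhere). The function name `battery_node_state` is made up for illustration. It only reports whether the entries that differ between the two listings exist.

```python
import os

# Firmware-node path copied from the listings above; machine-specific.
ACPI_BATTERY = ("/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/"
                "device:0b/PNP0C09:00/PNP0C0A:00")


def battery_node_state(acpi_node):
    """Report which of the entries that differ between the two cases exist."""
    def present(name):
        # lexists() also counts symlinks, which is what driver and
        # physical_node are in the successful listing.
        return os.path.lexists(os.path.join(acpi_node, name))

    return {
        "acpi_node": os.path.isdir(acpi_node),
        "physical_node": present("physical_node"),
        "driver": present("driver"),
        "power_supply": present("power_supply"),
    }


if __name__ == "__main__":
    for key, value in battery_node_state(ACPI_BATTERY).items():
        print(f"{key}: {value}")
```

In the failing case shown above, this would be expected to report the ACPI node as present while `physical_node`, `driver` and `power_supply` are all absent.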

I don't know if this issue is caused by a certain bug or flaw in the
UEFI/EC firmware, but the problem does not seem to occur in Windows.
Either way, I'm writing to see if I can get any insight from you guys
on what might be the potential reason/rationale here. (Note that as
mentioned, it does not always occur in Linux either, so it looks like
some kind of "mapping race"(?) to me.)

P.S. While there are some ACPI errors (as there are on most laptops
these days), I don't see any potentially relevant difference in the
kernel log between the successful and failing cases. The only
difference is that in the successful case, there's the extra expected
line of "battery detected". Anyway, I'm adding the warnings/errors in
the kernel log that might be remotely relevant, but please do note
that I see them all in both cases:

ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.I2C0.TPD0],
AE_NOT_FOUND (20230331/dswload2-162)
ACPI Error: AE_NOT_FOUND, During name lookup/catalog (20230331/psobject-220)
pnp 00:02: disabling [mem 0xc0000000-0xcfffffff] because it overlaps
0000:00:02.0 BAR 9 [mem 0x00000000-0xdfffffff 64bit pref]
hpet_acpi_add: no address or irqs in _CRS
i8042: PNP: PS/2 appears to have AUX port disabled, if this is
incorrect please boot with i8042.nopnp
ACPI BIOS Error (bug): Could not resolve symbol
[\_SB.PC00.LPCB.HEC.TSR1], AE_NOT_FOUND (20230331/psargs-330)
ACPI Error: Aborting method \_SB.PC00.LPCB.H_EC.SEN1._TMP due to
previous error (AE_NOT_FOUND) (20230331/psparse-529)
ACPI BIOS Error (bug): Could not resolve symbol
[\_SB.PC00.LPCB.HEC.TSR1], AE_NOT_FOUND (20230331/psargs-330)
ACPI Error: Aborting method \_SB.PC00.LPCB.H_EC.SEN1._TMP due to
previous error (AE_NOT_FOUND) (20230331/psparse-529)
intel-hid INTC1070:00: failed to enable HID power button
resource: resource sanity check: requesting [mem
0x00000000fedc0000-0x00000000fedcffff], which spans more than pnp
00:02 [mem 0xfedc0000-0xfedc7fff]
caller igen6_probe+0x1a0/0x8d0 [igen6_edac] mapping multiple BARs
i2c i2c-11: Systems with more than 4 memory slots not supported yet,
not instantiating SPD
Comment 1 Tom Yan 2023-10-08 04:41:42 UTC
It seems this might, to some extent, also have something to do with the drive in the extra/PCH slot. When the battery is not detected, I see something like this from `lsblk`:

NAME             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme1n1          259:0    0 476.9G  0 disk 
nvme0n1          259:1    0 931.5G  0 disk 

and when the battery is detected:

NAME             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme0n1          259:0    0 476.9G  0 disk 
nvme1n1          259:1    0 931.5G  0 disk 

Perhaps some kind of slowness in the drive's firmware triggered the race/bug?
Comment 2 Tom Yan 2023-10-08 04:47:10 UTC
Just to clarify, the odd thing isn't that the drive on the PCH slot could get enumerated later than the drive on the CPU slot, but rather that it seems to be partially detected before the CPU drive and then responds more slowly than it does.

I'm also not sure if this is more of a cause (trigger) or a result of the main issue here though.
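
To record which controller name landed on which slot for a given boot, something like the following sketch could be used. It assumes the entries under `/sys/class/nvme` are symlinks into the PCI device tree, and it uses the bridge addresses 00:06.0 and 00:1d.0 from the description; the names `nvme_pci_paths` and `slot_of` are made up for illustration. It also assumes that namespace names such as `nvme0n1` follow the controller names, which may not hold when native NVMe multipath is in use.

```python
import os

# Bridge addresses from the lspci output in the description (machine-specific).
BRIDGES = {"0000:00:06.0": "CPU slot", "0000:00:1d.0": "PCH slot"}


def nvme_pci_paths(class_dir="/sys/class/nvme"):
    """Map each NVMe controller name to its resolved sysfs device path."""
    result = {}
    if not os.path.isdir(class_dir):
        return result
    for name in sorted(os.listdir(class_dir)):
        # For PCIe drives the resolved path includes the PCI address of
        # the upstream bridge as one of its components.
        result[name] = os.path.realpath(os.path.join(class_dir, name))
    return result


def slot_of(path):
    """Label a resolved device path by the bridge it sits behind."""
    for addr, label in BRIDGES.items():
        if f"/{addr}/" in path:
            return label
    return "unknown"


if __name__ == "__main__":
    for name, path in nvme_pci_paths().items():
        print(f"{name}: {slot_of(path)} ({path})")
```

Running this on both a successful and a failing boot would show whether the name swap seen in `lsblk` really tracks the PCH drive.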
Comment 3 Bjorn Helgaas 2023-10-09 17:31:04 UTC
1) I'm not sure the ACPI folks monitor bugzilla, so I suggest forwarding this bug report to the ACPI maintainers at:

  "Rafael J. Wysocki" <rafael@kernel.org>
  linux-acpi@vger.kernel.org

2) Also attach to this bugzilla (don't paste) the complete dmesg logs for both cases (battery detected and not detected) when booting with this kernel command line argument:

  dyndbg="file drivers/acpi/* +p"
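
Once both logs are captured, comparing them is easier with the timestamps removed. The following is a minimal sketch of such a comparison, assuming the usual `[ seconds.micros]` dmesg prefix; the function names and the two file arguments are hypothetical. Lines that merely moved because of probe ordering will also show up, so the result still needs reading by hand.

```python
import difflib
import re
import sys

# Matches the default dmesg timestamp prefix, e.g. "[    0.123456] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(lines):
    """Drop the timestamp prefix and trailing newline from each log line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(good_lines, bad_lines):
    """Return (lines only in the good log, lines only in the bad log)."""
    good = strip_timestamps(good_lines)
    bad = strip_timestamps(bad_lines)
    matcher = difflib.SequenceMatcher(a=good, b=bad, autojunk=False)
    only_good, only_bad = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            only_good.extend(good[i1:i2])
            only_bad.extend(bad[j1:j2])
    return only_good, only_bad


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: script.py <dmesg-battery-detected> <dmesg-battery-missing>
    with open(sys.argv[1]) as f_good, open(sys.argv[2]) as f_bad:
        only_good, only_bad = diff_logs(f_good.readlines(), f_bad.readlines())
    for line in only_good:
        print(f"- {line}")
    for line in only_bad:
        print(f"+ {line}")
```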
Comment 4 Tom Yan 2023-10-11 08:19:02 UTC
Thanks for the pointer. Will check with dyndbg once I have the time.

I actually sent an email to Rafael and the linux-acpi mailing list before I filed the report here. So far I haven't gotten any response over there.
Comment 5 Bjorn Helgaas 2023-10-11 13:00:42 UTC
Here's the report of this issue on the mailing lists: https://lore.kernel.org/all/CAGnHSE=KP8rArKmNbgo3iG489PXrwjqWXLTmUp+nCOPd4VVRhA@mail.gmail.com/
