Created attachment 300589 [details] dmesg output from error case We got a Elo AIO i2 device which most current BIOS where WiFi was working with Kernel 5.12.x but stopped working since Kernel 5.13.19 and newer (5.17.0-rc5). The kernel error message is: [ 3.419766] iwlwifi 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible) [ 3.419781] iwlwifi 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible) [ 3.419975] iwlwifi 0000:01:00.0: HW_REV=0xFFFFFFFF, PCI issues? [ 3.420911] iwlwifi: probe of 0000:01:00.0 failed with error -5 The issue can be solved by adding the iwlwifi driver to the initramfs so it is loaded much earlier and seems to work. To narrow down the issue I did an git bisect between v5.13.18 and v5.13.19 mainline kernel version which got me to following commit which if reverted fix the issue on this device. Reverting commit d0660d8ab123ea471064f0828f290bec9593e16b : PCI: Use pci_update_current_state() in pci_enable_device_flags() resolve the issue also in the newer kernels. Seems like the function platform_pci_get_power_state is always returning PCI_D3cold on this device also if this is not true.
Created attachment 300590 [details] dmidecode of affected device
Created attachment 300591 [details] acpidump of affected device
Created attachment 300592 [details] lshw output of affected device
Can you please enable dynamic debug in drives/acpi/device_pm.c and attach dmesg output with that enabled?
Created attachment 300638 [details] PCI: ACPI: PM: Power up ACPI companion of PCI device during registration Please apply this patch to the newer kernel without reverting the problematic commit and check if it makes any difference.
Created attachment 300639 [details] PCI: ACPI: PM: Power up ACPI companion of PCI device during registration (v2) The previous version is broken, please test this one.
Created attachment 300647 [details] Dmesg with dynamic debugging active for device_pm.c
Created attachment 300648 [details] Dmesg with dynamic debugging active for device_pm.c (Patch applied) The Patch does not fix the issue as it seems.
Created attachment 300649 [details] Stock 5.15.31 dmesg where iwlwfi and iwlmvm were added to the initramfs
The problematic commit somehow allows the PCIe port holding the WiFi device to go into D3cold via PM-runtime which apparently doesn't happen without that commit (it is unclear to me why this is the case yet). When the driver loads and probe triggers, the port goes back into D0 as it should, but the WiFi device remains inaccessible which appears to be a platform issue. Please collect dmesg with the problematic commit reverted, with dynamic debugging active for device_pm.c enabled and without adding the iwl* modules to initramfs.
Created attachment 300650 [details] The dmesg with reverted commit and dynamic debug active The iwlwifi and iwlmvm are not in the initramfs anymore. Hope this helps.
Created attachment 300651 [details] PCI: PM: Rework pci_update_current_state() Please also collect dmesg in the failing configuration with the attached patch applied.
(In reply to Stefan Gottwald from comment #11) > Created attachment 300650 [details] > The dmesg with reverted commit and dynamic debug active > > The iwlwifi and iwlmvm are not in the initramfs anymore. Hope this helps. As it turns out, the port holding the WiFi device is able to runtime-suspend even in this case and it resumes when the driver starts probing, but the WiFi device is accessible this time. It looks like reading the PCI_PM_CTRL register in pci_update_current_state() makes a difference.
Created attachment 300652 [details] Dmesg with the latest patch Sorry took a little longer as the: #define PCI_ERROR_RESPONSE (~0ULL) #define PCI_SET_ERROR_RESPONSE(val) (*(val) = ((typeof(*(val))) PCI_ERROR_RESPONSE)) #define PCI_POSSIBLE_ERROR(val) ((val) == ((typeof(val)) PCI_ERROR_RESPONSE)) is not present in 5.15.31 so I needed to add it manually.
Created attachment 300653 [details] PCI: PM: Rework pci_update_current_state() (v2) OK, thanks! Let's try this one (please collect dmesg with it applied).
Created attachment 300654 [details] PCI: PM: Rework pci_update_current_state() (v3) Sorry, broken dev_info(). Please test this one.
Created attachment 300655 [details] Here the dmesg output with correct dev_info.
Created attachment 300657 [details] Did extent the logging a bit but now I am completely lost Did the following changes to the code (quick and dirty only) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a101faf3e88a..bce9bc5dbbc7 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1163,15 +1163,17 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state) */ void pci_update_current_state(struct pci_dev *dev, pci_power_t state) { - if (platform_pci_get_power_state(dev) == PCI_D3cold || - !pci_device_is_present(dev)) { + u16 pmcsr; + + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (platform_pci_get_power_state(dev) == PCI_D3cold) { dev->current_state = PCI_D3cold; + dev_info(&dev->dev, "%s: From platform_pci_get_power_state current_state = PCI_D3cold (%d) - (pmcsr %d)\n", __func__, dev->current_state, (pmcsr & PCI_PM_CTRL_STATE_MASK)); } else if (dev->pm_cap) { - u16 pmcsr; - - pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); - dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); + dev_info(&dev->dev, "%s: From pmcsr current_state = %d - (platform_pci_get_power_state %d)\n", __func__, (pmcsr & PCI_PM_CTRL_STATE_MASK), platform_pci_get_power_state(dev)); + dev->current_state = pmcsr & PCI_PM_CTRL_STATE_MASK; } else { + dev_info(&dev->dev, "%s: From given state current_state = %d - (pmcsr %d)\n", __func__, state, (pmcsr & PCI_PM_CTRL_STATE_MASK)); dev->current_state = state; } } @@ -1917,6 +1919,13 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags) * (e.g. if the device really is in D0 at enable time). */ pci_update_current_state(dev, dev->current_state); + if (dev->pm_cap) { + u16 pmcsr; + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); + dev_info(&dev->dev, "%s: From current_state = %d (dev->current_state %d)\n", __func__, (pmcsr & PCI_PM_CTRL_STATE_MASK), dev->current_state); + //dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); + } + if (atomic_inc_return(&dev->enable_cnt) > 1) return 0; /* already enabled */
Created attachment 300658 [details] Why this version works is beyond me The only change here is I now set dev->current_state to the patch before but this changes everything more or less also for the things happen before (see dmesg). diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index a101faf3e88a..bce9bc5dbbc7 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1163,15 +1163,17 @@ static int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state) */ void pci_update_current_state(struct pci_dev *dev, pci_power_t state) { - if (platform_pci_get_power_state(dev) == PCI_D3cold || - !pci_device_is_present(dev)) { + u16 pmcsr; + + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (platform_pci_get_power_state(dev) == PCI_D3cold) { dev->current_state = PCI_D3cold; + dev_info(&dev->dev, "%s: From platform_pci_get_power_state current_state = PCI_D3cold (%d) - (pmcsr %d)\n", __func__, dev->current_state, (pmcsr & PCI_PM_CTRL_STATE_MASK)); } else if (dev->pm_cap) { - u16 pmcsr; - - pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); - dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); + dev_info(&dev->dev, "%s: From pmcsr current_state = %d - (platform_pci_get_power_state %d)\n", __func__, (pmcsr & PCI_PM_CTRL_STATE_MASK), platform_pci_get_power_state(dev)); + dev->current_state = pmcsr & PCI_PM_CTRL_STATE_MASK; } else { + dev_info(&dev->dev, "%s: From given state current_state = %d - (pmcsr %d)\n", __func__, state, (pmcsr & PCI_PM_CTRL_STATE_MASK)); dev->current_state = state; } } @@ -1917,6 +1919,13 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags) * (e.g. if the device really is in D0 at enable time). */ pci_update_current_state(dev, dev->current_state); + if (dev->pm_cap) { + u16 pmcsr; + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr); + dev_info(&dev->dev, "%s: From current_state = %d (dev->current_state %d)\n", __func__, (pmcsr & PCI_PM_CTRL_STATE_MASK), dev->current_state); + dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); + } + if (atomic_inc_return(&dev->enable_cnt) > 1) return 0; /* already enabled */
Okay seems like my fault was looking only on the iwlwifi parts should have looked at 0000:00:14.0 here without dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); we have: [ 0.324212] pci 0000:00:14.0: [8086:31d6] type 01 class 0x060400 [ 0.324328] pci 0000:00:14.0: PME# supported from D0 D3hot D3cold [ 0.324436] pci 0000:00:14.0: Wakeup disabled by ACPI [ 0.336686] pci 0000:00:14.0: PCI bridge to [bus 01] [ 0.336699] pci 0000:00:14.0: bridge window [mem 0xa1200000-0xa12fffff] [ 0.418353] pci 0000:00:14.0: PCI bridge to [bus 01] [ 0.418367] pci 0000:00:14.0: bridge window [mem 0xa1200000-0xa12fffff] [ 0.441019] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 0) [ 0.441035] pcieport 0000:00:14.0: pci_enable_device_flags: From current_state = 0 (dev->current_state 4) [ 0.465313] pcieport 0000:00:14.0: pci_update_current_state: From pmcsr current_state = 0 - (platform_pci_get_power_state 0) [ 0.465767] pcieport 0000:00:14.0: PME: Signaling with IRQ 122 [ 4.450461] pcieport 0000:00:14.0: Wakeup enabled by ACPI [ 4.501329] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 3) [ 4.527994] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 3) [ 4.563280] pcieport 0000:00:14.0: pci_update_current_state: From pmcsr current_state = 3 - (platform_pci_get_power_state 0) [ 4.581889] pcieport 0000:00:14.0: Wakeup disabled by ACPI [ 4.929344] pcieport 0000:00:14.0: Wakeup enabled by ACPI [ 5.005310] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 3) and with the dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK); we have: [ 0.328727] pci 0000:00:14.0: [8086:31d6] type 01 class 0x060400 [ 0.328827] pci 0000:00:14.0: PME# supported from D0 D3hot D3cold [ 0.328936] pci 0000:00:14.0: Wakeup disabled by ACPI [ 0.341424] pci 0000:00:14.0: PCI bridge to [bus 01] [ 0.341436] pci 0000:00:14.0: bridge window [mem 0xa1200000-0xa12fffff] [ 0.422445] pci 0000:00:14.0: PCI bridge to [bus 01] [ 0.422461] pci 0000:00:14.0: bridge window [mem 0xa1200000-0xa12fffff] [ 0.445018] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 0) [ 0.445033] pcieport 0000:00:14.0: pci_enable_device_flags: From current_state = 0 (dev->current_state 4) [ 0.445500] pcieport 0000:00:14.0: PME: Signaling with IRQ 122 [ 4.351586] pcieport 0000:00:14.0: Wakeup enabled by ACPI [ 4.369279] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 3) [ 4.477428] pcieport 0000:00:14.0: pci_update_current_state: From platform_pci_get_power_state current_state = PCI_D3cold (4) - (pmcsr 3) [ 4.501413] pcieport 0000:00:14.0: pci_update_current_state: From pmcsr current_state = 3 - (platform_pci_get_power_state 0) [ 4.521397] pcieport 0000:00:14.0: Wakeup disabled by ACPI Seems like we did not proper enable the PCI bridge but this is only a guess.
Well, I think I know what's going on. Without the problematic commit, the root port holding the WiFi device doesn't really change the power state at all, because the first pci_enable_device_flags() finds that it is in D0 by reading the PCI_PM_CTRL register (which is not in sync with ACPI power resources). Then, while suspending, it goes into D3cold, but the state of ACPI power resources indicates D3cold already, so nothing happens. Next, when resuming, it goes back to D0, but the only thing that changes is the ACPI power resources state and the port's PCI_PM_CTRL register doesn't change. In contrast, with the problematic commit, the first pci_update_current_state() in pci_enable_device_flags() finds the root port in D3cold, so it is put into D0 (which basically causes the ACPI power resources to sync up with the port's PCI_PM_CTRL register) and while suspending D3hot is programmed into its PCI_PM_CTRL register and then it doesn't work when put back into D0. So on this platform we probably should not touch the PCI_PM_CTRL registers of root ports.
Created attachment 300661 [details] PCI: ACPI: Power up PCI devices with ACPI companions during initialization Please collect dmesg with this patch applied.
Created attachment 300662 [details] Only pci power up patch applied. I reverted all other patches and only applied the last one but I'm not sure if this was correct.
Yes, it was. The information collected so far indicates that reverting the problematic commit from the kernel with the patch from comment #22 applied will not make the WiFi device work. Can you please verify this?
Okay I applied the patch from comment #22 and did the git revert d0660d8ab123ea471064f0828f290bec9593e16b which normally get it working. And yes this combination does not work.
Thanks for the confirmation! Please try to run the configuration from comment #23 with pcie_port_pm=off in the kernel command line. It should make the WiFi work again if I'm not mistaken.
Created attachment 300663 [details] PCI: PM: Special case bridges when setting power state to D3cold Also please test the configuration from comment #23 with the attached patch applied (and no other changes).
Tested configuration from comment #23 and pcie_port_pm=off works fine. Tested Patch from comment #27 (also with comment #22 patch) without the pcie_port_pm=off and this did not work.
Well, this looks like a firmware issue unfortunately exposed by an otherwise correct commit. It may be necessary to add a quirk to permanently disable PCIe port power management on this platform, but there is one more thing to try (stay tuned).
Created attachment 300669 [details] PCI: ACPI: PM: Power up PCI devices on subordinate buses before scanning them Please test this patch on top of a failing kernel without making any other changes to it.
Hi thanks for the effort but also the last patch still not solve the issue. I also think the firmware of this device is not really good but there is no update available (already asked for this). So the configuration with pcie_port_pm=off seems here the way to go.
Created attachment 300670 [details] PCI: PM: Disable bridge D3 on Elo i2 Thanks for testing! There is a list of systems where PCI bridge D3 is known to fail, so this one can be added to it. The effect of this patch should be equivalent to adding pcie_port_pm=off to the kernel command line. If it works for you, I'll send it for applying.
Thanks for the work the Patch from comment #32 is working quite well.
Thanks for your patience! Patch posted: https://patchwork.kernel.org/project/linux-pm/patch/11980172.O9o76ZdvQC@kreacher/
Created attachment 300676 [details] PCI: PM: Force completio of transitions to D0 in some cases Can you please additionally test this patch on top of a failing kernel? In principle, the root port's BARs may need to be restored when bringing it back into D0.
Created attachment 300677 [details] Tested new patch did not work see dmesg log
OK, thanks for testing! This pretty much exhausted my ideas for what could be done to avoid adding the quirk.
This problem also affects EloPOS E2/S2/H2 RevA (EPS22H2 in my case) based on the same Geminilake Continental Z2 board: BIOS Information Vendor: Elo Touch Solutions Inc. Version: B203.12B Release Date: 03/12/2020 Address: 0xF0000 Runtime Size: 64 kB ROM Size: 5 MB Characteristics: ... BIOS Revision: 5.13 Firmware Revision: 0.24 Handle 0x0001, DMI type 1, 27 bytes System Information Manufacturer: Elo Touch Solutions Product Name: EloPOS E2/S2/H2 Version: RevA Serial Number: E203011216 UUID: 0071dd3d-0c34-e111-8000-1ceec92f0ef7 Wake-up Type: Power Switch SKU Number: E937154 Family: Windows EloPOS Handle 0x0002, DMI type 2, 15 bytes Base Board Information Manufacturer: Elo Touch Solutions Product Name: Geminilake Version: Continental Z2 Serial Number: Default string Asset Tag: Default string Features: Board is a hosting board Board is replaceable Location In Chassis: Default string Chassis Handle: 0x0003 Type: Motherboard The patch should probably be modified to match against DMI_BOARD_VENDOR, DMI_BOARD_NAME and DMI_BOARD_VERSION.
Done: https://lore.kernel.org/lkml/20230519085853.1079-1-linux@zary.sk/