Bug 219731 - Commit 1db806ec06b7 regressed the igc driver when L1 substates are enabled in EFI
Summary: Commit 1db806ec06b7 regressed the igc driver when L1 substates are enabled in...
Status: RESOLVED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Run-Time-PM (show other bugs)
Hardware: Intel Linux
: P3 low
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-01-28 13:49 UTC by Niklāvs Koļesņikovs
Modified: 2025-02-09 23:05 UTC (History)
2 users (show)

See Also:
Kernel Version: 6.14-pre-rc1
Subsystem:
Regression: Yes
Bisected commit-id: 1db806ec06b7c6e08e8af57088da067963ddf117


Attachments
dmesg of the broken igc with dyndbg and lockdown=confidentially (191.29 KB, text/plain)
2025-01-28 16:34 UTC, Niklāvs Koļesņikovs
Details
lspci -vvv with broken igc; L1.1 and L1.2 are enabled in EFI settings (69.11 KB, text/plain)
2025-01-28 16:35 UTC, Niklāvs Koļesņikovs
Details
lspci -vvv with working igc; L1 substates are disabled in EFI settings (72.45 KB, text/plain)
2025-01-28 16:36 UTC, Niklāvs Koļesņikovs
Details
lspci -vvv with working igc and L1 substates; commit reverted (73.06 KB, text/plain)
2025-01-29 16:08 UTC, Niklāvs Koļesņikovs
Details
dmesg of the working igc with dyndbg and commit reverted (172.58 KB, text/plain)
2025-01-29 16:09 UTC, Niklāvs Koļesņikovs
Details
L1SS saving fix (2.40 KB, patch)
2025-01-30 15:02 UTC, Ilpo Järvinen
Details | Diff

Description Niklāvs Koļesņikovs 2025-01-28 13:49:25 UTC
Since commit 1db806ec06b7c6e08e8af57088da067963ddf117 the current Linux git master, which will become 6.14, prints the following log message:

`igc 0000:06:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached`

After bisection led me to that commit, I found an EFI/BIOS configuration key mentioning L1 substates and set that to the `disabled` state, which, IIRC, should be the default value and therefore probably is the reason why no one else has hit this regression, yet.

After disabling the L1 substates, now the current git master, which contains the regression, prints the following (expected?) messages instead:

`igc 0000:06:00.0 (unnamed net_device) (uninitialized): PHC added`
`igc 0000:06:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)`
Comment 1 Niklāvs Koļesņikovs 2025-01-28 13:53:50 UTC
I forgot to mention that the motherboard is Asus ROG Strix Z690-G WiFi and using BIOS version 4001 from late 2024.

I hope it's okay, if I try to add the relevant people to the CC list. And sorry if I put this in the wrong category.
Comment 2 Ilpo Järvinen 2025-01-28 14:19:56 UTC
Could you please be more specific when it prints that message? Did you do something at that point?

Please provide lspci -vvv from working and non-working configuration.

We might also need to look at dmesg but that should be taken with PCI dynamic debugging on ("file drivers/pci/*.c +p" written into /sys/kernel/debug/dynamic_debug/control or put into dyndbg on the kernel command line if the former is too late).
Comment 3 Niklāvs Koļesņikovs 2025-01-28 16:34:26 UTC
Created attachment 307543 [details]
dmesg of the broken igc with dyndbg and lockdown=confidentially
Comment 4 Niklāvs Koļesņikovs 2025-01-28 16:35:45 UTC
Created attachment 307544 [details]
lspci -vvv with broken igc; L1.1 and L1.2 are enabled in EFI settings
Comment 5 Niklāvs Koļesņikovs 2025-01-28 16:36:33 UTC
Created attachment 307545 [details]
lspci -vvv with working igc; L1 substates are disabled in EFI settings
Comment 6 Niklāvs Koļesņikovs 2025-01-28 16:38:47 UTC
Sorry for the confusion. The messages get printed during regular bootup and presumably come from either kernel or systemd-udev device discovery procedures.

I have captured dmesg with the required kernel cmdline which seems to have worked but only later did I realize that the lockdown LSM was still set to confidentially mode. If required, I can redo the dmesg with lockdown disabled.

I also have lspci -vvv output with only L1.1 enabled but I did not include it, since I felt it's probably not very relevant over L1.1&L1.2 case (though igc is likewise broken).
Comment 7 Ilpo Järvinen 2025-01-29 14:33:00 UTC
With the working case, I meant a kernel prior to the commit 1db806ec06b7c6e08e8af57088da067963ddf117 (or that commit reverted if that too works). That is, do not disable L1 substates because I cannot compare how their configuration is different when the entire L1 PM substates capability is removed.

Could you also take a dmesg from the working case so I can see if the D3hot->D0 pattern for the 1c.2 bridge is the same.
Comment 8 Ilpo Järvinen 2025-01-29 14:49:11 UTC
It might be worth a try to add this as the last thing in pci_enable_bridge():

pci_bridge_wait_for_secondary_bus(dev, "pm wakeup");

(Just let me know if you want my to provide a patch instead.)

It seems to me that 1c.2 has just woken up from D3hot due to igc_probe() calling pci_enable_device_mem() but AFAICT, there's nothing that ensures the PCIe link has come up before the pci_enable_bridge() returns and probe continues and finds the device is not yet accessible.
Comment 9 Niklāvs Koļesņikovs 2025-01-29 16:08:41 UTC
Created attachment 307551 [details]
lspci -vvv with working igc and L1 substates; commit reverted
Comment 10 Niklāvs Koļesņikovs 2025-01-29 16:09:51 UTC
Created attachment 307552 [details]
dmesg of the working igc with dyndbg and commit reverted
Comment 11 Niklāvs Koļesņikovs 2025-01-29 16:13:57 UTC
I have uploaded the updated lspci and dmesg of the current git master with the commit reverted.

The `pci_bridge_wait_for_secondary_bus(dev, "pm wakeup");` modification on top of the current git master compiled and booted but igc was still broken with the same PCIe link lost message as before.
Comment 12 Ilpo Järvinen 2025-01-30 15:02:44 UTC
Created attachment 307555 [details]
L1SS saving fix

It seems I found at least one problem from the commit 1db806ec06b7. The fix patch attached, please test.
Comment 13 Niklāvs Koļesņikovs 2025-01-30 23:22:14 UTC
I can confirm that igc driver is again working with the fix applied on top of the current git master. Thank you for looking into this and providing a fix.
Comment 14 Ilpo Järvinen 2025-01-31 14:43:07 UTC
Great, thanks for testing.

Can I put you into a Tested-by tag?
Comment 15 Niklāvs Koļesņikovs 2025-01-31 14:56:52 UTC
Yes, thank you. Sorry for not specifying that in the previous reply.

Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>

Or yo can use Reported-and-tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com> if that's an official thing and not a made up tag.
Comment 16 Niklāvs Koļesņikovs 2025-02-03 04:35:31 UTC
I have re-opened the bug, because the fix hasn't landed in time for 6.14-rc1 and a bug marked as resolved or verified does not show up in default search results. Sorry for the noise.
Comment 17 Niklāvs Koļesņikovs 2025-02-09 23:05:46 UTC
The fix has been released as part of Linux 6.14-rc2. Thanks again to everyone involved for the quick fix and seeing it through to a successful merge.

I'd also like to note that there's remarkably few known reports of things breaking even after rc1 was released, which highlights just how unlikely it was to get caught during developer testing before merging upstream.

Cheers.

Note You need to log in before you can comment on or make changes to this bug.