Bug 219143

Summary: [BISECTED][REGRESSION] igc does not function anymore after second resume from standby
Product: Drivers
Reporter: Martin (mwolf)
Component: Network
Assignee: drivers_network (drivers_network)
Status: NEW
Severity: high
CC: anthony.l.nguyen, dima.ruinskiy, frequent_doing807, gaaf, kubakici, metron, regressions, sasha.neftin, vitaly.lifshits
Priority: P3
Hardware: All
OS: Linux
Kernel Version: 6.10
Subsystem:
Regression: Yes
Bisected commit-id: 6f31d6b643a32cc126cf86093fca1ea575948bf0
Attachments: dmesg log after two suspends with no working network
lspci -v
enp4s0
enp6s0
ifconfig before suspend
ifconfig after two suspends resulting in no network
lspci -t ran after a fresh boot
lspci -s 06:00.0 -vvv after suspend
lspci -s 04:00.0 -vvv after suspend
dmesg netdev debug including two standby cycles
attachment-10352-0.html
dmidecode
System Info
dmidecode output
dmidecode output
lspci -vvv output
config file of last bisect from vanilla kernel (make localmodconfig)
attachment-17823-0.html
extra debug statements
with extra debug statements compiled in
with extra debug statements compiled in - annotation of suspend events
check if open is failing and whether ethtool races
dbg for open and ethtool
change msi interrupts to msix
debug prints in current state of __igc_open
dmesg of two suspend cycles with both patches applied
dmesg of two suspend cycles with all three patches applied
attachment-19617-0.html
igc_resume debug prints
dmesg igc resume patch applied
dmesg aftet two suspend cycles with latest patch
dmesg-6.11.0 boot and modules reload
fresh boot, two standby cycles
two suspends booting with suspend kprintf
kprintfs for igc_suspend flow
I added here more prints in netdev and PM flows.
two time suspend/resume with attachment 306975
using latest print patch
network-manager logs
attachment-18845-0.html
patch file from mailing list
attachment-5297-0.html
attachment-2741-0.html
attachment-26861-0.html
attachment-29439-0.html

Description Martin 2024-08-09 15:17:49 UTC
Starting with Kernel 6.10.x I experienced network connection problems after resuming my system for the second time.

My system contains two Intel I225-V (rev2 and rev3) cards.

I ran a bisection and got a hit: 6f31d6b643a32cc126cf86093fca1ea575948bf0

rmmod igc ; modprobe igc remedies the issue till the next but one resume.
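
Until a fix lands, the module reload described above could be automated. The following is a minimal sketch, assuming a systemd-based distribution where executables placed in /usr/lib/systemd/system-sleep/ are invoked with `pre` or `post` as their first argument. The function name `igc_sleep_hook` and the `RUN` dry-run variable are hypothetical and are not taken from any attachment in this report.

```sh
#!/bin/sh
# Hypothetical sleep hook: reload the igc module after every resume.
# RUN is an illustrative dry-run switch: RUN=echo prints the commands
# instead of executing them (executing them requires root).

igc_sleep_hook() {
    # $1 is "pre" before sleeping and "post" after waking up.
    case "${1:-}" in
        post)
            ${RUN:-} modprobe -r igc
            ${RUN:-} modprobe igc
            ;;
    esac
}

igc_sleep_hook "$@"
```

Running `RUN=echo igc_sleep_hook post suspend` only prints the two modprobe commands, which is a safe way to check the hook before installing it. Reloading on every resume is a blunt workaround and hides the underlying driver state problem rather than fixing it.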
Comment 1 Artem S. Tashkinov 2024-08-10 19:40:36 UTC
Sasha,

Please take a look, it's your commit.
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-14 10:33:51 UTC
Thx for CCing, I already had picked this up for tracking, but it's time to do more. 

@martin: can I CC you while forwarding this by mail? That would expose your email address to the public.
Comment 3 Martin 2024-08-14 10:36:25 UTC
Of course!

Thank you for your assistance here!
Comment 4 Vitaly Lifshits 2024-08-15 07:13:14 UTC
Hello @Martin,

Thank you for reporting this issue.
Please share the following information:
1. dmesg log
2. lspci -v dump
3. ifconfig dump before and after the failure
4. NVM versions of the NICs (ethtool -i <interface name>)
5. Do both NICs fail?
6. Does a link come up after resuming? (check LEDs)
7. What is the reproduction rate? Does it always happen?
Comment 5 Martin 2024-08-15 09:13:29 UTC
Created attachment 306731 [details]
dmesg log after two suspends with no working network

I suspended the PC twice and then ran dmesg
Comment 6 Martin 2024-08-15 09:13:45 UTC
Created attachment 306732 [details]
lspci -v
Comment 7 Martin 2024-08-15 09:14:35 UTC
Created attachment 306733 [details]
enp4s0

This card is in one of the pci-e slots
Comment 8 Martin 2024-08-15 09:15:15 UTC
Created attachment 306734 [details]
enp6s0

This is the onboard card from my Asrock B550 Taichi Mainboard
Comment 9 Martin 2024-08-15 09:15:52 UTC
Created attachment 306735 [details]
ifconfig before suspend

ifconfig pre failure
Comment 10 Martin 2024-08-15 09:16:29 UTC
Created attachment 306736 [details]
ifconfig after two suspends resulting in no network

ifconfig post failure
Comment 11 Martin 2024-08-15 09:22:39 UTC
I used Kernel 6.10.4-200.fc40.x86_64 for testing.

to 4.
I added the ethtool output in two separate files, but as a test I also checked what happens if I run the command while the problem occurs. This is the result:
ethtool -i enp4s0 
Cannot get driver information: No such device
ethtool -i enp6s0 
Cannot get driver information: No such device

to 5.
yes

to 6.
no leds on both cards

to 7.
yes, every second standby

If you need anything else, please let me know!
Comment 12 Martin 2024-08-15 10:28:31 UTC
Is there maybe a firmware update for my cards?
Comment 13 Vitaly Lifshits 2024-08-15 11:31:15 UTC
(In reply to Martin from comment #12)
> Is there maybe a firmware update for my cards?

Yes, there is, the firmware versions on both of your cards are old. You can try contacting your board's vendor to get a newer version.

Anyway, I would like to ask you for some more logs:
1. lspci -t on boot.
2. lspci -s 06:00.0 -vvv and lspci -s 04:00.0 -vvv after reproduction
3. disconnect the PCI-E NIC and see if the issue reproduces on one card only.
4. dmesg logs with netdev debug prints (echo "module igc +p" | sudo tee /sys/kernel/debug/dynamic_debug/control), please disable console suspend (echo N | sudo tee /sys/module/printk/parameters/console_suspend)
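
The two settings from item 4 can be wrapped in a small helper. This is a sketch only, assuming a kernel built with dynamic debug support and debugfs mounted at /sys/kernel/debug. The function name `enable_igc_debug` and its optional sysfs-root argument (useful for a dry run against a scratch directory) are hypothetical.

```sh
#!/bin/sh
# Sketch: turn on igc dynamic-debug output and keep the console alive
# across suspend. The optional first argument is a sysfs root used only
# for dry runs; it defaults to /sys (writing there requires root).

enable_igc_debug() {
    root="${1:-/sys}"
    # Enable the dynamic debug call sites of the igc module.
    echo "module igc +p" > "$root/kernel/debug/dynamic_debug/control"
    # Do not suspend the console, so messages around suspend are kept.
    echo N > "$root/module/printk/parameters/console_suspend"
}

# Usage (as root): enable_igc_debug
```

Both settings are lost on reboot, so the helper has to be run again after a fresh boot and before the first suspend.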
Comment 14 Martin 2024-08-15 11:48:29 UTC
I doubt that I will get a firmware update from Asrock or the NIC reseller.
I asked Asrock for that a while ago, and they did not respond.

Could you please supply me with a firmware upgrade?

I have to work now, I will do the testing on the next boot, and yes, I can remove the card.
Comment 15 Martin 2024-08-15 14:50:06 UTC
Created attachment 306739 [details]
lspci -t ran after a fresh boot
Comment 16 Martin 2024-08-15 14:54:29 UTC
Created attachment 306740 [details]
lspci -s 06:00.0 -vvv after suspend
Comment 17 Martin 2024-08-15 14:54:46 UTC
Created attachment 306741 [details]
lspci -s 04:00.0 -vvv after suspend
Comment 18 Martin 2024-08-15 15:09:36 UTC
I have to postpone the removal of the NIC.
Comment 19 Martin 2024-08-15 15:50:42 UTC
You are lucky, my customer cancelled the appointment, and I was able to remove the card.
This led to some complications. Now enp6s0 got renamed to enp5s0, which led to problems with my bridge. 

I will run the desired dmesg logs.
Comment 20 Martin 2024-08-15 15:53:48 UTC
to point 3

the problem also occurs with just one card.
Comment 21 Martin 2024-08-15 16:00:57 UTC
Created attachment 306742 [details]
dmesg netdev debug including two standby cycles
Comment 22 Martin 2024-08-15 20:25:03 UTC
I reinstalled the PCI-E Card and the network cards got renamed back to enp4s0 and enp6s0.

A friend told me, that Intel itself provides firmware-updates for its network cards, so I downloaded the complete driver pack 29.2.1.

Sadly, it does not contain firmware updates for my device ID 15F3, probably only for 15F2, judging by the firmware files:
FoxPond1_I225_15F2_2MB_1p94_800003BB.bin
Foxpond1_I225_15F2_LM_1MB_1p94_800003BC.bin


Num Description                          Ver.(hex)  DevId S:B    Status
=== ================================== ============ ===== ====== ==============
01) Intel(R) Ethernet Controller (3)      N/A(N/A)   15F3 00:004 Update not    
    I225-V                                                       available     
02) Intel(R) Ethernet Controller (2)      N/A(N/A)   15F3 00:006 Update not    
    I225-V                                                       available
Comment 23 Martin 2024-08-16 16:53:46 UTC
Vitaly,
I contacted Asrock again and mentioned this bug, but they said that they cannot supply me with a newer firmware.
Would you please be so kind as to try to reach out within Intel?

Best Regards
Martin
Comment 24 amir.avivi 2024-08-16 16:54:12 UTC
Created attachment 306748 [details]
attachment-10352-0.html

Hello,
Thank you for your email. I am currently out of the office on PTO until August 25th.
For any urgent matters, please contact my cover, Rex Tsai, at rex.tsai@intel.com. Alternatively, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com.
I appreciate your understanding and will get back to you as soon as possible upon my return.

Best regards,
Amir
Comment 25 metron@gmail.com 2024-08-17 21:07:47 UTC
I have the same problem with a slightly newer I226-V. It's a built-in and breaks on the second resume as well. Thanks for the `rmmod`+`modprobe` workaround Martin. 

firmware-version: 2017:888d
bus-info: 0000:04:00.0

It looks like there is no change to lspci. The NIC is stuck in a bad state and seems to think the cable is unplugged. If you want the same information from me Amir, I'd be happy to add it.
Comment 26 Vitaly Lifshits 2024-08-18 06:34:08 UTC
Hi Metron,

Can you please share your system information?
You can even share the output of dmidecode for this.
Comment 27 Martin 2024-08-18 10:15:39 UTC
Created attachment 306753 [details]
dmidecode
Comment 28 Martin 2024-08-18 10:18:01 UTC
Created attachment 306754 [details]
System Info

inxi -F
Comment 29 Martin 2024-08-18 10:43:51 UTC
sorry, I misread ;)
Comment 30 metron@gmail.com 2024-08-18 15:35:25 UTC
Created attachment 306756 [details]
dmidecode output

It's an MSI Tomahawk WiFi board (Z790 chipset) with an integrated I226-V
Comment 31 metron@gmail.com 2024-08-18 15:51:31 UTC
I am willing to test a change too if that's helpful. I can probably build a patched kernel from source successfully.
Comment 32 Martin 2024-08-25 10:17:09 UTC
Anything new?
Comment 33 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-29 15:31:57 UTC
As this seems to be taking some time to fix: did anyone check whether a revert in latest mainline is able to fix this (without causing other regressions)?
Comment 34 Alex Hermann 2024-08-29 18:54:14 UTC
I have a similar, but not exactly the same problem. In my case, the network devices are unusable in 6.10.6 after every boot, not just after suspend/resume. No problems with 6.9.8 and kernels before that.

The devices are created and renamed by udev but trying to use the device results in:

networking[959]: warning: wan: netlink: wan: ip link set dev wan up: operation failed with 'No such device' (19)

I built the kernel with the earlier mentioned commit reverted. Now my network devices are fully working with kernel 6.10.6. So far no other regressions found, but I have only just booted the system.

Because my system has  broken DMI tables, I also reverted 0ef11f604503b1862a21597436283f158114d77e (firmware: dmi: Stop decoding on broken entry). I don't know if this is necessary to fix the problem.

Unfortunately, I have no time to do further tests in the next few weeks. I just wanted to let you know that reverting the commit fixes the problem for me.
Comment 35 Alex Hermann 2024-08-29 18:56:46 UTC
Created attachment 306792 [details]
dmidecode output

The system has 4x I226-V (rev 04) controllers.
Comment 36 Alex Hermann 2024-08-29 18:57:55 UTC
Created attachment 306793 [details]
lspci -vvv output

The system has 4x I226-V (rev 04) controllers, this is the lspci output of one of them.
Comment 37 Alex Hermann 2024-08-29 20:20:40 UTC
(In reply to Alex Hermann from comment #34)
> Because my system has  broken DMI tables, I also reverted
> 0ef11f604503b1862a21597436283f158114d77e (firmware: dmi: Stop decoding on
> broken entry). I don't know if this is necessary to fix the problem.

Because this might have skewed the result, I also tested without this last revert and the system still works.

To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes the problem for me. Without any side effects so far.
Comment 38 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-30 08:09:08 UTC
(In reply to Alex Hermann from comment #34)
> No problems with 6.9.8 and kernels before that.

So 6.9.9 or a later 6.9.y release was also broken? Then this might be another problem, as 6f31d6b643a32cc126cf86093fca1ea575948bf0 was not backported to 6.9.y.

(In reply to Alex Hermann from comment #37)
> To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes
> the problem for me. Without any side effects so far.

Thx for that, but you missed an important detail: where did you perform the revert? With a recent mainline or something else?
Comment 39 Alex Hermann 2024-08-30 08:17:32 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #38)
> (In reply to Alex Hermann from comment #34)
> > No problems with 6.9.8 and kernels before that.
> 
> So 6.9.9 or a later 6.9.y release was also broken? Then this might be
> another problem, as 6f31d6b643a32cc126cf86093fca1ea575948bf0 was not
> backported to 6.9.y.

Sorry for the confusion. It just means I haven't tested any 6.9 kernel beyond 6.9.8. I noticed the problem when upgrading from 6.9.8 to 6.10.6.
 

> (In reply to Alex Hermann from comment #37)
> > To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes
> > the problem for me. Without any side effects so far.
> 
> Thx for that, but you missed an important detail: where did you perform the
> revert? With a recent mainline or something else?

The revert was against 6.10.6.
Comment 40 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-30 08:33:10 UTC
(In reply to Alex Hermann from comment #39)
> The revert was against 6.10.6.

This is good to know, but as mentioned earlier: it would be really important to know if a revert on mainline works and resolves the problem, as then Linus can fix this with a revert himself if he wants to.
Comment 41 metron@gmail.com 2024-09-01 03:31:50 UTC
I'm not sure if it's helpful but I built a kernel reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0. The network card works but I get no video after suspend/resume so I'm unable to confirm the revert is enough.

Dmesg looks better in the sense that now it just has nvidia in there instead of kernel taints. Looking at the revert patch, I wonder if this is just that the refactor changed what happens on resume. If you read carefully, __igc_open no longer does netif_set_real_num_tx_queues but that's the only thing that's called by the resume function.
Comment 42 metron@gmail.com 2024-09-01 03:55:39 UTC
Ok with a nvidia-beta-dkms, I got it to work. The problem is fixed with the revert.
Comment 43 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-09-01 06:18:35 UTC
(In reply to metron@gmail.com from comment #42)
> Ok with a nvidia-beta-dkms, I got it to work. The problem is fixed with the
> revert.

Never use out-of-tree drivers when reporting bugs upstream; they could be causing your problems or interfering. Depending on the developer, they might now ignore all your previous input.

See also: https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/#your-kernel-apparently-loaded-add-on-drivers
Comment 44 Vitaly Lifshits 2024-09-01 08:04:21 UTC
We are still not able to reproduce this issue.
I wonder if it is related to a kernel configuration.
Can someone share their .config file?

I suggest not to rush into reverting this patch since originally it fixed a deadlock.
Comment 45 Martin 2024-09-01 08:56:48 UTC
Created attachment 306800 [details]
config file of last bisect from vanilla kernel (make localmodconfig)
Comment 46 metron@gmail.com 2024-09-03 04:25:20 UTC
I built mine using Manjaro's config (which I was already using) and their PKGBUILD; link https://gitlab.manjaro.org/packages/core/linux611/-/blob/master/config?ref_type=heads 

I removed Manjaro's ROG Ally patches since I was compiling anyway and tested with and without the revert. I didn't upload logs from this test but they look like Martin's dmesg except that I have a lot of UI junk in there and only one intel adapter. From both logs though, it looks like trouble starts on suspend. Could it be that the code setting up wake-on-lan can't be run twice?

My system is just a home hobby system so there's no rush on my account. I was just trying to be helpful in my spare time since Martin helped me a lot with this bug report.
Comment 47 amir.avivi 2024-09-03 04:25:42 UTC
Created attachment 306808 [details]
attachment-17823-0.html



Hello,

Thank you for your email. I am currently out of the office on Reserve Duty until September 5th.

For any urgent matters, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com.

I appreciate your understanding and will get back to you as soon as possible upon my return.


Best regards,
Amir
Comment 48 metron@gmail.com 2024-09-09 12:36:06 UTC
Created attachment 306836 [details]
extra debug statements
Comment 49 metron@gmail.com 2024-09-09 12:36:39 UTC
Created attachment 306837 [details]
with extra debug statements compiled in
Comment 50 metron@gmail.com 2024-09-09 12:37:06 UTC
Created attachment 306838 [details]
with extra debug statements compiled in - annotation of suspend events
Comment 51 metron@gmail.com 2024-09-09 12:41:01 UTC
I added a patch that prints out the structures as the events are processed. I'm not sure why the adapter works after the first suspend but I think the issue is with the state variable. I'm guessing there is a guard somewhere and it shouldn't get stuck in state==4.
Comment 52 Vitaly Lifshits 2024-09-10 05:33:10 UTC
We are still working on a reproduction of this issue; unfortunately, we have not been able to reproduce it yet. We even tried different distributions with different runtime D3 configurations.

We would like to do the following:
1. Force MSI or MSI-X interrupts and see if it influences the results.
2. Add prints in the changed flow of the "igc: Refactor runtime power management flow" patch to see which pathway a successful resume follows and what happens when the resume fails.
Comment 53 Martin 2024-09-10 11:05:41 UTC
Hello Vitaly,
thank you for letting us know.

Can you please elaborate, how we should test that?
Comment 54 Jakub Kicinski 2024-09-10 13:53:25 UTC
Created attachment 306850 [details]
check if open is failing and whether ethtool races

Could you test with my debug patch applied and share the output?
Comment 55 Jakub Kicinski 2024-09-10 13:56:00 UTC
Created attachment 306851 [details]
dbg for open and ethtool
Comment 56 Vitaly Lifshits 2024-09-10 15:53:00 UTC
Created attachment 306852 [details]
change msi interrupts to msix

This patch will force interrupts to msix.
To force msi, change msix = false to true
Comment 57 Vitaly Lifshits 2024-09-10 15:56:41 UTC
Created attachment 306853 [details]
debug prints in current state of __igc_open

This patch will print the return values in a case of an error in __igc_open.
This patch might give us a direction as to what fails when the Foxville device fails to resume. I didn't add the print in the error condition in the igc_resume function since Kuba had already done it in his patch.
Comment 58 Martin 2024-09-11 14:21:25 UTC
I will try to build a fedora kernel with the added two patches.
Comment 59 Martin 2024-09-11 19:04:54 UTC
Created attachment 306864 [details]
dmesg of two suspend cycles with both patches applied
Comment 60 Martin 2024-09-11 19:08:22 UTC
(damn, I totally did not notice Jakub's patch, I will rebuild)
sorry
Comment 61 Martin 2024-09-11 20:05:25 UTC
 ethtool -i enp6s0 
Cannot get driver information: No such device 
(after two resume cycles)
Comment 62 Martin 2024-09-11 20:30:21 UTC
Created attachment 306865 [details]
dmesg of two suspend cycles with all three patches applied
Comment 63 Vitaly Lifshits 2024-09-18 08:20:45 UTC
Hi Martin,

I went over your dmesg logs.
From what I understand the igc driver is still running after igc_resume is called. I understand this from these prints:
[  101.584797] igc: adapter->flags = 0x109
[  101.584799] igc: adapter->flags = 0x109

However, it fails before reaching the __igc_open call.

Since there are no other prints I assume that either:
1. pci_device_is_present returns false, in that case maybe there is a race condition that the driver is resuming while the device is in D3 cold state.

2. netif_running(netdev) returns false so the function is not being called at all.

Can you add two debug prints in the igc_resume function as follows?
1:

-        if (!pci_device_is_present(pdev))
+        if (!pci_device_is_present(pdev)) {
+                printk("igc: pci_device_is_present = false\n");
                 return -ENODEV;
+        }


2:

        if (netif_running(netdev)) {
                err = __igc_open(netdev, true);
                if (!err)
                        netif_device_attach(netdev);
-        }
+        } else {
+                printk("igc: netif_running(netdev) = false\n");
+        }


Let me know if you want me to generate the patch for you.
Comment 64 amir.avivi 2024-09-18 08:21:02 UTC
Created attachment 306888 [details]
attachment-19617-0.html



Hello,

Thank you for your email. I am currently out of the office on PTO until September 22nd.

For any urgent matters, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com.

I appreciate your understanding and will get back to you as soon as possible upon my return.


Best regards,
Amir
Comment 65 Martin 2024-09-18 09:28:12 UTC
Hello Vitaly,
it would be nice if you could please generate a patch, that applies on Kernel 6.9.x
Comment 66 Vitaly Lifshits 2024-09-18 10:06:14 UTC
(In reply to Martin from comment #65)
> Hello Vitaly,
> it would be nice if you could please generate a patch, that applies on
> Kernel 6.9.x

Didn't you mean 6.10? You mentioned that 6.9 doesn't have an issue.
Comment 67 Martin 2024-09-18 10:27:48 UTC
yes, that is correct. I mistyped.
I apologize.
Comment 68 Vitaly Lifshits 2024-09-18 11:57:42 UTC
Created attachment 306889 [details]
igc_resume debug prints
Comment 69 Martin 2024-09-18 12:21:29 UTC
I am sorry, the patch does not apply to Kernel 6.10.10

Patch3: 0001-debug-prints-in-igc_resume.patch
+ case "$patch" in
+ git --work-tree=. apply
error: patch failed: drivers/net/ethernet/intel/igc/igc_main.c:7255
error: drivers/net/ethernet/intel/igc/igc_main.c: patch does not apply
error: Bad exit status from /var/tmp/rpm-tmp.AvJAl2 (%prep)
Comment 70 Martin 2024-09-18 12:27:19 UTC
odd, if I do it with a freshly downloaded Kernel 6.10.10 it works, maybe there is a problem with my rpm build process. I will use the standard way to build it.
Comment 71 Martin 2024-09-18 12:28:25 UTC
oh wait, I found the issue, the force-msi patch does not work with your last patch.

Now I would like to know which patches should I apply.

Sorry for the confusion.
Comment 72 Vitaly Lifshits 2024-09-18 13:06:48 UTC
(In reply to Martin from comment #71)
> oh wait, I found the issue, the force-msi patch does not work with your last
> patch.
> 
> Now I would like to know which patches should I apply.
> 
> Sorry for the confusion.

You don't need the MSI patch anymore as we know that it doesn't fix the issue.
My goal in the latest patch is to understand where the failure occurs in the resume flow. Therefore, applying only the last patch should be sufficient.

BTW, did you try to force msix and see if it helps?
Comment 73 metron@gmail.com 2024-09-18 14:12:10 UTC
Created attachment 306890 [details]
dmesg igc resume patch applied

6.11.0-3-DEBUG - hits igc: netif_running returns false
Comment 74 Martin 2024-09-18 14:33:14 UTC
Created attachment 306891 [details]
dmesg aftet two suspend cycles with latest patch

less igc_suspend.log | grep -i igc
[   16.650998] igc 0000:04:00.0: enabling device (0000 -> 0002)
[   16.651140] igc 0000:04:00.0: PCIe PTM not supported by PCIe bus/controller
[   16.707941] igc 0000:04:00.0 (unnamed net_device) (uninitialized): PHC added
[   16.736402] igc 0000:04:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
[   16.736407] igc 0000:04:00.0 eth0: MAC: 68:54:5a:62:a3:01
[   16.736539] igc 0000:06:00.0: enabling device (0000 -> 0002)
[   16.736627] igc 0000:06:00.0: PCIe PTM not supported by PCIe bus/controller
[   16.792831] igc 0000:06:00.0 (unnamed net_device) (uninitialized): PHC added
[   16.819624] igc 0000:06:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
[   16.819628] igc 0000:06:00.0 eth1: MAC: a8:a1:59:36:84:43
[   16.824232] igc 0000:04:00.0 enp4s0: renamed from eth0
[   16.824360] igc 0000:06:00.0 enp6s0: renamed from eth1
[   24.591160] igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   24.613173] igc 0000:06:00.0 enp6s0: entered allmulticast mode
[   24.613349] igc 0000:06:00.0 enp6s0: entered promiscuous mode
[   68.407229] igc: Enter igc_resume
[   68.407260] igc: Enter igc_resume
[   70.921273] igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   88.055377] igc 0000:06:00.0 enp6s0: left allmulticast mode
[   88.055417] igc 0000:06:00.0 enp6s0: left promiscuous mode
[   93.900387] igc: Enter igc_resume
[   93.900446] igc: Enter igc_resume
[   93.925461] igc: netif_running returns false
[   93.925462] igc: netif_running returns false
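
The pattern in this log (every igc_resume call on the second cycle ends in the netif_running branch) can be counted mechanically. This is a minimal sketch, assuming the log was produced by a kernel carrying the igc_resume debug patch from this thread, since the matched strings do not exist in an unpatched kernel. The function name `igc_resume_summary` is hypothetical.

```sh
#!/bin/sh
# Sketch: summarize igc resume attempts from a dmesg dump on stdin.
# The message strings are the ones produced by the debug patch in this
# thread; they do not exist in an unpatched kernel.

igc_resume_summary() {
    awk '
        /igc: Enter igc_resume/            { resumes++ }
        /igc: netif_running returns false/ { skipped++ }
        END {
            printf "resume calls: %d\n", resumes
            printf "resumes that skipped __igc_open: %d\n", skipped
        }
    '
}

# Usage: dmesg | igc_resume_summary
```

With two NICs, each suspend cycle contributes two igc_resume calls, so the counts are per device rather than per cycle.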
Comment 75 Alex Hermann 2024-09-19 15:21:28 UTC
I had some time to test some things today. Remember, in my case there is no suspend/resume cycle, the problems exist directly after a boot.

6.9.8: All OK.

6.10.6: No devices usable after boot
6.10.6 + 6f31d6b643a reverted: All OK

6.11.0: No devices usable, but all OK after a modprobe -r igc; modprobe igc cycle.
6.11.0 + 6f31d6b643a reverted: No devices usable after boot, but all OK after a modprobe -r igc; modprobe igc cycle.


So whereas reverting 6f31d6b643a fixed the problem in 6.10, it does not in 6.11.

dmesg follows.
Comment 76 Alex Hermann 2024-09-19 15:25:24 UTC
Created attachment 306898 [details]
dmesg-6.11.0 boot and modules reload

Kernel 6.11.0 + logging patches
After boot, all interfaces broken.
After module reload (160 sec), all interfaces working.

Repeatability: 3 out of 3 attempts.
Comment 77 Martin 2024-09-19 15:43:06 UTC
Alex, since you have an Intel I226-V, I would strongly recommend creating a separate bug report, since your issue differs from my original report.
Comment 78 Vitaly Lifshits 2024-09-23 08:14:50 UTC
I would like to test whether there is a race condition between the pci driver and the net core. Specifically since the device is removed from the PCI tree, I would like to ask you to try to reproduce the issue without D3cold support.

Please disable D3cold support in sysfs and then try to reproduce the issue:
 > echo 0 > /sys/bus/pci/devices/.../d3cold_allowed
Comment 79 Martin 2024-09-23 08:26:01 UTC
Created attachment 306911 [details]
fresh boot, two standby cycles

echo 0 > /sys/bus/pci/devices/0000\:06\:00.0/d3cold_allowed 
echo 0 > /sys/bus/pci/devices/0000\:04\:00.0/d3cold_allowed 

cat /sys/bus/pci/devices/0000\:06\:00.0/d3cold_allowed 
0
cat /sys/bus/pci/devices/0000\:04\:00.0/d3cold_allowed 
0
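
The per-device writes above can be generalized to every device bound to the igc driver. This is a sketch under the assumption that /sys/bus/pci/drivers/igc/ contains one symlink per bound device, named by its PCI address. The function name `disable_igc_d3cold` and its optional sysfs-root argument (for dry runs against a scratch directory) are hypothetical.

```sh
#!/bin/sh
# Sketch: clear d3cold_allowed for every PCI device bound to igc.
# The optional first argument is a sysfs root for dry runs; it
# defaults to /sys (writing there requires root).

disable_igc_d3cold() {
    root="${1:-/sys}"
    for dev in "$root"/bus/pci/drivers/igc/*:*; do
        [ -e "$dev" ] || continue          # no igc devices bound
        addr=${dev##*/}                    # e.g. 0000:04:00.0
        echo 0 > "$root/bus/pci/devices/$addr/d3cold_allowed"
        # Read the value back to confirm the write took effect.
        echo "$addr: $(cat "$root/bus/pci/devices/$addr/d3cold_allowed")"
    done
}

# Usage (as root): disable_igc_d3cold
```

The setting does not persist across reboots, so it has to be applied after each fresh boot and before the first suspend.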
Comment 80 metron@gmail.com 2024-09-25 17:17:10 UTC
Comment on attachment 306891 [details]
dmesg aftet two suspend cycles with latest patch

I added some more prints for igc_suspend and I think they are interesting.

first suspend (these are from the igc_suspend flow):
[   55.843617]  igc: netif_running was true, calling __igc_close
[   55.884575]  igc: wufc was true
[   55.884636]  igc: wake is true, igc_power_up_link

second suspend (these are from the igc_suspend flow):
[   81.194595]  igc: netif_running was false, skipping __igc_close
[   81.195310]  igc: wufc & IGC_WUFC_MC was  false
[   81.195314]  igc: wake is false, igc_power_down_phy_copper_base

and unsurprisingly when you come up from that it's in a bad state. I'll attach the logs and patch, maybe it will give you guys new ideas.
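
The difference between the two suspends can be extracted from a longer log with a small filter. This is a minimal sketch, assuming the log comes from a kernel with the extra igc_suspend prints described in this comment. The function name `igc_suspend_paths` is hypothetical.

```sh
#!/bin/sh
# Sketch: for each igc_suspend call recorded in a dmesg dump on stdin,
# report whether __igc_close was called or skipped. The matched strings
# come from the extra igc_suspend prints described in this comment.

igc_suspend_paths() {
    awk '
        /netif_running was true, calling __igc_close/ {
            printf "igc_suspend call %d: interface running, __igc_close called\n", ++n
        }
        /netif_running was false, skipping __igc_close/ {
            printf "igc_suspend call %d: interface not running, __igc_close skipped\n", ++n
        }
    '
}

# Usage: dmesg | igc_suspend_paths
```

On a system with several igc devices, each suspend produces one line per device, so the call counter does not map one-to-one to suspend cycles.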
Comment 81 metron@gmail.com 2024-09-25 17:17:52 UTC
Created attachment 306920 [details]
two suspends booting with suspend kprintf
Comment 82 metron@gmail.com 2024-09-25 17:18:27 UTC
Created attachment 306921 [details]
kprintfs for igc_suspend flow
Comment 83 metron@gmail.com 2024-10-05 16:11:35 UTC
Vitaly, from my understanding of kernel drivers, which is admittedly super limited, you need the rtnl_lock earlier in the resume flow. Just looking at the i40e driver, it takes the lock much earlier (before i40e_restore_interrupt_scheme), but the igc driver takes it just around __igc_open. That might be why things don't work correctly.
Comment 84 Vitaly Lifshits 2024-10-06 15:36:45 UTC
(In reply to metron@gmail.com from comment #83)
> Vitaly, from my understanding of kernel drivers which is admittedly super
> limited you need the rtnl_lock earlier in the resume flow. Just looking at
> the i40e driver, and it takes out a lock much earlier (before
> i40e_restore_interrupt_scheme) but the igc driver does it just around
> __igc_open. That might be why things don't work correctly.

The reason I believe it is fine not to hold the lock is that we don't handle the queues in __igc_open or in igc_resume, but rather in igc_open, which runs in an RTNL-locked context.

By the way, thanks for the previous patch, I don't think that the wakes are relevant to this issue. However, I believe that this print might be indeed related:
"igc: netif_running was false, skipping __igc_close"

I'll upload a new patch with more prints soon.
Comment 85 Vitaly Lifshits 2024-10-06 15:38:54 UTC
Created attachment 306975 [details]
I added here more prints in netdev and PM flows.
Comment 86 Martin 2024-10-06 18:36:14 UTC
Created attachment 306978 [details]
two time suspend/resume with attachment 306975 [details]
Comment 87 metron@gmail.com 2024-10-06 20:58:55 UTC
Created attachment 306980 [details]
using latest print patch

Mine is a little different and you can see igc_close is called.
[   56.203323] igc: Enter igc_close

I had to work hard to do this with nouveau; after one suspend, video did not return but I could blindly type terminal commands :)
Comment 88 metron@gmail.com 2024-10-06 21:08:16 UTC
(In reply to metron@gmail.com from comment #87)
> Created attachment 306980 [details]
> using latest print patch
> 
> Mine is a little different and you can see igc_close is called.
> [   56.203323] igc: Enter igc_close
> 
> I had to work hard to do this with nouveau; after one suspend, video did not
> return but I could blindly type terminal commands :)

Ah nevermind, Martin's has that too
Comment 89 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-10-09 10:07:51 UTC
Vitaly Lifshits, what's the status here? Is there any hope that this two months old regression will be fixed soon?
Comment 90 Vitaly Lifshits 2024-10-09 10:47:00 UTC
Hi Thorsten,

We are on it.

This issue does not reproduce on any system in our lab and we tried on many systems with many different configurations. It appears to be affecting only a small number of users, represented in this thread.

Thus, we are doing a lot of back-and-forth triages and analysis with the folks that reported the issue and are helping us debug it.

I can assure you that as soon as we have a root-cause and a possible fix, we will share it.

Currently, we see that there is a difference in the way the network stack operates. I am consulting with Jakub Kicinski on this.
Comment 91 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-10-09 10:55:15 UTC
thx for the update!
Comment 92 Vitaly Lifshits 2024-10-09 16:13:40 UTC
Which OS distribution do you use? What application manages networking on your system?
Comment 93 Martin 2024-10-09 16:30:10 UTC
Fedora 40 with NetworkManager
Comment 94 metron@gmail.com 2024-10-09 21:45:14 UTC
Created attachment 306995 [details]
network-manager logs

I'm using Manjaro, also with NetworkManager.

I ran this and did two suspend cycles to generate logs. As a bonus there is one with linux 6.6 LTS.

before: sudo nmcli g logging level DEBUG 

after: journalctl -u NetworkManager --since "today"

I did a suspend with 6.6 at the start and two with 6.11 after rebooting (line marker is "-- Boot d77d5cd258d94681a2f69c5d1e3401d1 --")
Comment 95 Vitaly Lifshits 2024-10-20 07:52:54 UTC
I attempted to reproduce the issue on a system running Fedora 40 with NetworkManager. Despite setting up a bridge connection, I was unable to replicate the problem.

Could you please check if the issue persists when NetworkManager is stopped? If possible, share also the dmesg log.
Comment 97 Martin 2024-10-20 08:59:35 UTC
I will give it a try. May I ask which hardware you are using?
Are you testing on an AM4 System?
Comment 98 Vitaly Lifshits 2024-10-20 09:43:36 UTC
I don't have an AM4 system, but according to Metron, the issue was reproduced on a Z790 system, which we also tested.
I doubt that this issue is related to a specific system because our debug prints indicate that the network stack behaves differently when the issue occurs.
On all of my systems, during suspend and resume flows, the driver is not being opened and closed:

[504092.919367] igc: Enter igc_suspend
[504092.919371] igc: __igc_shutdown: netif_running was true, calling __igc_close
[504093.394002] igc: Enter igc_resume
[504097.888682] igc 0000:ae:00.0 enp174s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[504120.025521] igc: Enter igc_suspend
[504120.025526] igc: __igc_shutdown: netif_running was true, calling __igc_close
[504120.500791] igc: Enter igc_resume
[504124.737054] igc 0000:ae:00.0 enp174s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX


I believe there might be a user-level application that calls ndo->close before suspending, and without the rtnl_lock, it doesn't call ndo->open on resume before igc_resume is invoked.
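
To check this hypothesis against a captured dmesg, a rough sketch like the following could flag every igc_suspend that was preceded by an igc_close with no igc_resume in between. It assumes the log was produced with the debug-print patches attached to this bug (the "Enter ..." strings are not in the mainline driver); the function names are hypothetical.

```python
import re

# Debug-print strings as they appear in the logs in this thread; they come from
# the debug patches attached to this bug, not from the mainline igc driver.
EVENTS = {
    "Enter igc_suspend": "suspend",
    "Enter igc_resume": "resume",
    "Enter igc_close": "close",
}
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+igc: (.*)$")


def parse_events(dmesg_text):
    """Extract (timestamp, event) pairs for the three debug prints above."""
    events = []
    for raw in dmesg_text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            name = EVENTS.get(match.group(2).strip())
            if name:
                events.append((float(match.group(1)), name))
    return events


def suspends_preceded_by_close(events):
    """Timestamps of igc_suspend calls that follow an igc_close with no resume in between."""
    flagged = []
    closed = False
    for stamp, name in events:
        if name == "close":
            closed = True
        elif name == "resume":
            closed = False
        elif name == "suspend":
            if closed:
                flagged.append(stamp)
            closed = False
    return flagged
```

Run on the eight lines pasted above, `suspends_preceded_by_close(parse_events(log))` returns an empty list, since that log contains no "Enter igc_close" line.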
Comment 99 metron@gmail.com 2024-10-21 21:18:07 UTC
Could it be related to the high core counts? 5950x and 13900kf have that in common.
Comment 100 metron@gmail.com 2024-10-22 03:48:19 UTC
I switched my system to systemd-networkd with a static IP and it works after the second suspend. 

Do you think that this is a bug in NetworkManager or a configuration issue perhaps? 

Increasing carrier-wait-timeout from 6s might help. It seems like it gives up if the carrier isn't detected within that time. https://www.networkmanager.dev/docs/api/latest/NetworkManager.conf.html
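
A minimal sketch of how such a setting could be rendered as a drop-in file, assuming the documented `[device*]` section with `match-device` and `carrier-wait-timeout` (in milliseconds). The interface name, the 15000 ms value and the helper name `carrier_wait_dropin` are illustrative assumptions, not tested recommendations.

```python
import configparser
import io


def carrier_wait_dropin(interface, timeout_ms):
    """Render a NetworkManager drop-in that sets carrier-wait-timeout for one interface."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names exactly as written
    # NetworkManager treats any section whose name starts with "device" as a
    # device section; the suffix here is arbitrary.
    section = f"device-{interface}-carrier"
    parser[section] = {
        "match-device": f"interface-name:{interface}",
        "carrier-wait-timeout": str(timeout_ms),  # milliseconds
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

The text returned by, say, `carrier_wait_dropin("enp4s0", 15000)` would go into a file under /etc/NetworkManager/conf.d/ (the file name is up to you), followed by a NetworkManager restart.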
Comment 101 Vitaly Lifshits 2024-10-22 16:44:06 UTC
Hi Metron,

I don't think that it is related to a high core count, because we see different suspend and resume flows. Though I might be wrong here.

Regarding your suggestion: (In reply to metron@gmail.com from comment #100)
> I switched my system to systemd-networkd with a static IP and it works after
> the second suspend. 
> 
> Do you think that this is a bug in NetworkManager or a configuration issue
> perhaps? 

Honestly, I am not entirely sure. We see that there is a race condition between PM suspend/resume and NDO open/close. It might be that there is some issue in the NetworkManager related to the NDO flows.

> 
> Increasing carrier-wait-timeout from 6s might help. It seems like it gives
> up if the carrier isn't detected within that time.
> https://www.networkmanager.dev/docs/api/latest/NetworkManager.conf.html

Can I ask you to increase this timeout and see if it helps?
Comment 102 metron@gmail.com 2024-10-26 23:00:22 UTC
It didn't seem to make a difference when I increased the timeout. I suppose that makes sense, the problems in dmesg don't take six seconds to appear.
Comment 103 Martin 2024-10-28 17:15:48 UTC
I am sorry for having disappeared for a while.
I have tested it with a higher timeout as well, and had no success either.
Is there anything else we can do?
Comment 104 Paul 2024-10-29 09:29:31 UTC
I think I am also having the same issue, only difference is that it happens after every resume from suspend, not every second resume.   
  
My system:  
Ethernet Controller I225-V (rev 03) - (on Asus X670E extreme mobo)  
firmware-version: 1082:8770  
   
Kernel: 6.11.0-9-generic (on Ubuntu 24.10, also using NetworkManager)  
  
CPU: AMD Ryzen 9 7950X
Comment 105 Vitaly Lifshits 2024-10-29 09:49:41 UTC
Please try to reproduce with this patch:
https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20241028195243.52488-3-jdamato@fastly.com/
Comment 106 metron@gmail.com 2024-11-09 13:39:22 UTC
Created attachment 307189 [details]
patch file from mailing list

I made this patch file from the mailing list you linked but I haven't had time to try it yet.
Comment 108 Martin 2024-11-09 14:16:00 UTC
My apologies, I have missed the patch, building it right now. Will test later today.
Comment 109 Martin 2024-11-09 15:07:57 UTC
sadly this patch does not help :(
I am still offline after the second standby cycle.

Metron, were you successful?
Comment 110 metron@gmail.com 2024-11-09 22:25:17 UTC
So, it's kind of a funny story but I had invested some time getting systemd-boot working instead of grub and when I tried to test this patch I ended up completely breaking my system. After a few tries, I decided to give up on Manjaro and put on Fedora 41.

Now that I am running Fedora, I can no longer reproduce the problem.

One thing that's different in my Fedora 41 system is that I didn't configure my wifi networks. Now I get kernel oopses from iwlwifi but the ethernet interface works fine.
Comment 111 Martin 2024-11-09 23:11:58 UTC
That is odd, since I am on F41 too (though upgraded release by release since F24).
Comment 112 Martin 2024-11-11 11:31:39 UTC
@metron
I ran rpmconf -a to verify that I am on the most recent config files, but nothing besides the usual java security enhancements came up.
I am really curious what a new Fedora 41 installation changes. 
I might just boot the Fedora 41 installation stick and send it to standby twice for testing.
Comment 114 metron@gmail.com 2024-11-11 13:40:12 UTC
I can't test my theory, but I think the wifi interface being configured caused NetworkManager to give up on the igc interface. Unfortunately my wifi no longer works at all so I can't test this idea. The interface takes a really long time to come up, so I think there's some code in NetworkManager that is giving up on it. 

I'll try installing and booting a 6.10 kernel today to see if the issue reappears with and without wifi.

Manjaro was on 6.11.2 and my wifi used to work. Fedora has 6.11.6, which brought in this other regression: https://bugzilla.kernel.org/show_bug.cgi?id=219447

I see you have a bridge and I assume you are using both interfaces. Have you tried with just one?
Comment 115 Martin 2024-11-11 14:13:13 UTC
My Mainboard does have Wifi as well, but I disabled it in UEFI, so I guess this should not affect linux.

I am only using one of the two cards, the second card is just for projects where I need to access a second router I am programming for a customer, so it is mostly inactive.
Comment 116 Martin 2024-11-18 15:58:15 UTC
I tested with a current live image of Fedora 41 (https://dl.fedoraproject.org/pub/alt/live-respins/) and the error did not occur after resume number two.

Now what does that mean?
I ran rpmconf -a, which updates config files, and I found nothing that could be related to the issue.

I really really do not want to reinstall.
Comment 118 Martin 2024-12-04 16:02:06 UTC
I "diffed" the entire /etc folder to the one from the Fedora 41 live-cd and found no indicators of differences in the installed components.
Of course I do have a large variety of additional software on my installed system.

Do you have some hints how I could explore the problems?

I am thinking of mapping all installed packages, taking another SSD and installing a fresh system, but my time before Christmas is really the issue.

That's why I'm asking whether you are still confident and determined to find the issue.
Comment 119 metron@gmail.com 2024-12-15 20:40:11 UTC
I'm pretty sure this problem is an interaction between NetworkManager and the driver. I'm not sure why NetworkManager gives up on the interface, but it seems to do so if there is another interface configured. I think that's why there's not many people affected. I noticed issues on their gitlab don't go so well, but you could try starting a thread there.

Switching from NetworkManager to systemd-networkd is probably an option; it 'solved' the problem for me. You lose out on the network icon (at least in gnome) but everything works.

This bug wasn't as problematic for me as all my nvidia/intel-cpu problems, so I bought a whole new computer that's more linux friendly. Now I have new problems but suspend and resume works.
Comment 121 Martin 2024-12-16 00:53:41 UTC
@metron
I created a systemd service to automatically unload and load the module.

I still hope this gets fixed. When I browsed around, I found a few other users with this issue on Reddit.
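
For others who need the same workaround: the unit itself is not shown here, so the following is only a sketch of one alternative approach, a systemd-sleep hook rather than a service. It assumes the script is installed as an executable under /usr/lib/systemd/system-sleep/, where systemd-sleep calls it with "pre" or "post" as the first argument; the helper name `reload_commands` is made up.

```python
import subprocess
import sys


def reload_commands(phase, module="igc"):
    """Commands to run for one hook invocation; empty unless we are resuming."""
    if phase != "post":
        return []
    # Unload the module, then load it again.
    return [["modprobe", "-r", module], ["modprobe", module]]


def main(argv):
    # systemd-sleep passes "pre" or "post" as the first argument.
    phase = argv[1] if len(argv) > 1 else ""
    for command in reload_commands(phase):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    main(sys.argv)
```

This only papers over the problem by reloading the driver after every resume; it does not address the race discussed above.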