Bug 219143
Description
Martin
2024-08-09 15:17:49 UTC
Sasha, please take a look; it's your commit.

Thx for CCing. I had already picked this up for tracking, but it's time to do more. @martin: can I CC you while forwarding this by mail? That would expose your email address to the public.

Of course! Thank you for your assistance here!

Hello @Martin,
Thank you for reporting this issue. Please share the following information:
1. dmesg log
2. lspci -v dump
3. ifconfig dump before and after the failure
4. NVM versions of the NICs (ethtool -i <interface name>)
5. Do both NICs fail?
6. Does a link come up after resuming? (check LEDs)
7. What is the reproduction rate? Does it always happen?

Created attachment 306731 [details]
dmesg log after two suspends with no working network
I suspended the PC twice and then ran dmesg
Created attachment 306732 [details]
lspci -v
Created attachment 306733 [details]
enp4s0
This card is in one of the pci-e slots
Created attachment 306734 [details]
enp6s0
This is the onboard card from my Asrock B550 Taichi Mainboard
Created attachment 306735 [details]
ifconfig before suspend
ifconfig pre failure
Created attachment 306736 [details]
ifconfig after two suspends resulting in no network
ifconfig post failure
I used kernel 6.10.4-200.fc40.x86_64 for testing.

To 4: I added the ethtool output in two separate files, but for testing I also looked at what happens if I run the command while the problem occurs. This is the result:
ethtool -i enp4s0
Cannot get driver information: No such device
ethtool -i enp6s0
Cannot get driver information: No such device
To 5: yes
To 6: no LEDs on both cards
To 7: yes, every second standby

If you need anything else, please let me know!

Is there maybe a firmware update for my cards?

(In reply to Martin from comment #12)
> Is there maybe a firmware update for my cards?
Yes, there is; the firmware versions on both of your cards are old. You can try contacting your board's vendor to get a newer version.

Anyway, I would like to ask you for some more logs:
1. lspci -t on boot.
2. lspci -s 06:00.0 -vvv and lspci -s 04:00.0 -vvv after reproduction
3. Disconnect the PCI-E NIC and see if the issue reproduces on one card only.
4. dmesg logs with netdev debug prints (echo "module igc +p" | sudo tee /sys/kernel/debug/dynamic_debug/control); please also disable console suspend (echo N | sudo tee /sys/module/printk/parameters/console_suspend)

I doubt that I will get a firmware update from Asrock or the NIC reseller. I asked Asrock for that a while ago, and they did not respond. Could you please supply me with a firmware upgrade? I have to work now; I will do the testing on the next boot, and yes, I can remove the card.

Created attachment 306739 [details]
lspci -t ran after a fresh boot
Created attachment 306740 [details]
lspci -s 06:00.0 -vvv after suspend
Created attachment 306741 [details]
lspci -s 04:00.0 -vvv after suspend
I have to postpone the removal of the NIC.

You are lucky: my customer cancelled the appointment, and I was able to remove the card. This led to some complications. Now enp6s0 got renamed to enp5s0, which led to problems with my bridge. I will run the requested dmesg logs.

To point 3: the problem also occurs with just one card.

Created attachment 306742 [details]
dmesg netdev debug including two standby cycles
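For anyone repeating this log collection, the two debug steps requested earlier in the thread (enabling igc dynamic debug and disabling console suspend) can be wrapped in one helper. This is only a sketch: the function name `enable_igc_debug` and its overridable path arguments are made up for illustration, the default paths are the ones quoted in the thread, and it must run as root on a kernel built with dynamic debug.

```shell
# Sketch only: wraps the two debug steps requested in this thread.
# enable_igc_debug and its optional path arguments are hypothetical names;
# the defaults are the sysfs/debugfs paths quoted in the comments above.
enable_igc_debug() {
    dyndbg_control="${1:-/sys/kernel/debug/dynamic_debug/control}"
    console_suspend="${2:-/sys/module/printk/parameters/console_suspend}"

    # Enable the debug print call sites of the igc module.
    echo "module igc +p" > "$dyndbg_control" || return 1

    # Keep the console active across suspend so late messages are not lost.
    echo N > "$console_suspend" || return 1
}
```

Calling `enable_igc_debug` with no arguments before the suspend cycles, then saving `dmesg` afterwards, should give a log comparable to the attachment above.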
I reinstalled the PCI-E card and the network cards got renamed back to enp4s0 and enp6s0.

A friend told me that Intel itself provides firmware updates for its network cards, so I downloaded the complete driver pack 29.2.1. Sadly, it does not contain firmware updates for my device ID 15F3, probably only for 15F2, according to the firmware files:
FoxPond1_I225_15F2_2MB_1p94_800003BB.bin
Foxpond1_I225_15F2_LM_1MB_1p94_800003BC.bin

Num Description                              Ver.(hex)    DevId S:B    Status
=== ======================================== ============ ===== ====== ====================
01) Intel(R) Ethernet Controller (3) I225-V  N/A(N/A)     15F3  00:004 Update not available
02) Intel(R) Ethernet Controller (2) I225-V  N/A(N/A)     15F3  00:006 Update not available

Vitaly, I contacted Asrock again and mentioned this bug, but they said that they cannot supply me with a newer firmware. Would you please be so kind as to try to reach out at Intel?
Best Regards
Martin

Created attachment 306748 [details]
attachment-10352-0.html
Hello, Thank you for your email. I am currently out of the office on PTO until August 25th. For any urgent matters, please contact my cover, Rex Tsai, at rex.tsai@intel.com<mailto:rex.tsai@intel.com>. Alternatively, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com<mailto:shmuel.ben-nisan@intel.com>. I appreciate your understanding and will get back to you as soon as possible upon my return. Best regards, Amir

I have the same problem with a slightly newer I226-V. It's a built-in and breaks on the second resume as well. Thanks for the `rmmod`+`modprobe` workaround Martin.
firmware-version: 2017:888d
bus-info: 0000:04:00.0
It looks like there is no change to lspci. The NIC is stuck in a bad state and seems to think the cable is unplugged. If you want the same information from me Amir, I'd be happy to add it.

Hi Metron, Can you please share your system information? You can even share the output of dmidecode for this.

Created attachment 306753 [details]
dmidecode
Created attachment 306754 [details]
System Info
inxi -F
Sorry, I misread ;)
Created attachment 306756 [details]
dmidecode output
It's an MSI Tomahawk WiFi board (Z790 chipset) with an integrated I226-V
I am willing to test a change too if that's helpful. I can probably build a patched kernel from source successfully.

Anything new?

As this seems to be taking some time to fix: did anyone check whether a revert in latest mainline fixes this (without causing other regressions)?

I have a similar, but not exactly the same, problem. In my case, the network devices are unusable in 6.10.6 after every boot, not just after suspend/resume. No problems with 6.9.8 and kernels before that. The devices are created and renamed by udev, but trying to use a device results in:
networking[959]: warning: wan: netlink: wan: ip link set dev wan up: operation failed with 'No such device' (19)
I built the kernel with the earlier mentioned commit reverted. Now my network devices are fully working with kernel 6.10.6. So far no other regressions found, but I have only just booted the system. Because my system has broken DMI tables, I also reverted 0ef11f604503b1862a21597436283f158114d77e (firmware: dmi: Stop decoding on broken entry). I don't know if this is necessary to fix the problem. Unfortunately, I have no time for further tests in the next few weeks; I just wanted to let you know that reverting the commit fixes the problem for me.

Created attachment 306792 [details]
dmidecode output
The system has 4x I226-V (rev 04) controllers.
Created attachment 306793 [details]
lspci -vvv output
The system has 4x I226-V (rev 04) controllers, this is the lspci output of one of them.
(In reply to Alex Hermann from comment #34)
> Because my system has broken DMI tables, I also reverted
> 0ef11f604503b1862a21597436283f158114d77e (firmware: dmi: Stop decoding on
> broken entry). I don't know if this is necessary to fix the problem.
Because this might have skewed the result, I also tested without this last revert and the system still works. To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes the problem for me. Without any side effects so far.

(In reply to Alex Hermann from comment #34)
> No problems with 6.9.8 and kernels before that.
So 6.9.9 or a later 6.9.y release was also broken? Then this might be another problem, as 6f31d6b643a32cc126cf86093fca1ea575948bf0 was not backported to 6.9.y.
(In reply to Alex Hermann from comment #37)
> To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes
> the problem for me. Without any side effects so far.
Thx for that, but you missed an important detail: where did you perform the revert? With a recent mainline or something else?

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #38)
> (In reply to Alex Hermann from comment #34)
> > No problems with 6.9.8 and kernels before that.
>
> So 6.9.9 or a later 6.9.y release was also broken? Then this might be
> another problem, as 6f31d6b643a32cc126cf86093fca1ea575948bf0 was not
> backported to 6.9.y.
Sorry for the confusion. It just means I haven't tested any 6.9 kernel beyond 6.9.8. I noticed the problem when upgrading from 6.9.8 to 6.10.6.
> (In reply to Alex Hermann from comment #37)
> > To reiterate, just reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0 fixes
> > the problem for me. Without any side effects so far.
>
> Thx for that, but you missed an important detail: where did you perform the
> revert? With a recent mainline or something else?
The revert was against 6.10.6.

(In reply to Alex Hermann from comment #39)
> The revert was against 6.10.6.
This is good to know, but as mentioned earlier: it would be really important to know if a revert on mainline works and resolves the problem, as then Linus can fix this with a revert himself if he wants to.

I'm not sure if it's helpful, but I built a kernel reverting 6f31d6b643a32cc126cf86093fca1ea575948bf0. The network card works, but I get no video after suspend/resume, so I'm unable to confirm the revert is enough. Dmesg looks better in the sense that now it just has nvidia in there instead of kernel taints.
Looking at the revert patch, I wonder if this is just that the refactor changed what happens on resume. If you read carefully, __igc_open no longer does netif_set_real_num_tx_queues, but that's the only thing that's called by the resume function.

Ok, with a nvidia-beta-dkms, I got it to work. The problem is fixed with the revert.

(In reply to metron@gmail.com from comment #42)
> Ok with a nvidia-beta-dkms, I got it to work. The problem is fixed with the
> revert.
Never use out of tree drivers when reporting bugs upstream, they could be causing your problems or interfere. It depends on the developer, they might now all your previous. See also: https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/#your-kernel-apparently-loaded-add-on-drivers

We are still not able to reproduce this issue. I wonder if it is related to a kernel configuration. Can someone share their .config file? I suggest not rushing into reverting this patch, since it originally fixed a deadlock.

Created attachment 306800 [details]
config file of last bisect from vanilla kernel (make localmodconfig)
I built mine using Manjaro's config (which I was already using) and their PKGBUILD; link: https://gitlab.manjaro.org/packages/core/linux611/-/blob/master/config?ref_type=heads
I removed Manjaro's ROG Ally patches since I was compiling anyway, and tested with and without the revert. I didn't upload logs from this test, but they look like Martin's dmesg except that I have a lot of UI junk in there and only one Intel adapter. From both logs, though, it looks like trouble starts on suspend. Could it be that the code setting up wake-on-lan can't be run twice?

My system is just a home hobby system, so there's no rush on my account. I was just trying to be helpful in my spare time since Martin helped me a lot with this bug report.

Created attachment 306808 [details]
attachment-17823-0.html
Hello, Thank you for your email. I am currently out of the office on Reserve Duty until September 5th. For any urgent matters, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com<mailto:shmuel.ben-nisan@intel.com>. I appreciate your understanding and will get back to you as soon as possible upon my return. Best regards, Amir

Created attachment 306836 [details]
extra debug statements
Created attachment 306837 [details]
with extra debug statements compiled in
Created attachment 306838 [details]
with extra debug statements compiled in - annotation of suspend events
I added a patch that prints out the structures as the events are processed. I'm not sure why the adapter works after the first suspend, but I think the issue is with the state variable. I'm guessing there is a guard somewhere and it shouldn't get stuck in state==4.

We are still working on a reproduction of this issue; unfortunately, we haven't been able to do so yet. We even tried different distributions with different runtime D3 configurations. We would like to do the following:
1. Force MSI or MSI-X interrupts and see if it influences the results.
2. Add prints in the changed flow of the "igc: Refactor runtime power management flow" patch to see which pathway a successful resume follows and what happens when the resume fails.

Hello Vitaly, thank you for letting us know. Can you please elaborate on how we should test that?

Created attachment 306850 [details]
check if open is failing and whether ethtool races
Could you test with my debug patch applied and share the output?
Created attachment 306851 [details]
dbg for open and ethtool
Created attachment 306852 [details]
change msi interrupts to msix
This patch will force interrupts to msix.
To force msi, change msix = false to true
Created attachment 306853 [details]
debug prints in current state of __igc_open
This patch will print the return values in a case of an error in __igc_open.
This patch might give us a direction as to what fails when the Foxville device fails to resume. I didn't add the print in the error condition in the igc_resume function since Kuba had already done it in his patch.
I will try to build a Fedora kernel with the two added patches.
Created attachment 306864 [details]
dmesg of two suspend cycles with both patches applied
(Damn, I totally did not notice Jakub's patch; I will rebuild.) Sorry.

ethtool -i enp6s0
Cannot get driver information: No such device
(after two resume cycles)

Created attachment 306865 [details]
dmesg of two suspend cycles with all three patches applied
Hi Martin, I went over your dmesg logs. From what I understand, the igc driver is still running after igc_resume is called. I understand this from these prints:
[ 101.584797] igc: adapter->flags = 0x109
[ 101.584799] igc: adapter->flags = 0x109
However, it fails before reaching the __igc_open call. Since there are no other prints, I assume that either:
1. pci_device_is_present returns false; in that case, maybe there is a race condition in which the driver is resuming while the device is in D3 cold state.
2. netif_running(netdev) returns false, so the function is not being called at all.

Can you add two debug prints in the igc_resume function as follows?
1:
-	if (!pci_device_is_present(pdev))
		return -ENODEV;
+	if (!pci_device_is_present(pdev)) {
+		printk("igc: pci_device_is_present = false\n");
		return -ENODEV;
+	}
2:
	if (netif_running(netdev)) {
		err = __igc_open(netdev, true);
		if (!err)
			netif_device_attach(netdev);
-	}
+	} else {
+		printk("igc: netif_running(netdev) = false\n");
+	}
Let me know if you want me to generate the patch for you.

Created attachment 306888 [details]
attachment-19617-0.html
Hello, Thank you for your email. I am currently out of the office on PTO until September 22nd. For any urgent matters, you can reach out to my manager, Shmuel Ben-Nisan, at shmuel.ben-nisan@intel.com<mailto:shmuel.ben-nisan@intel.com>. I appreciate your understanding and will get back to you as soon as possible upon my return. Best regards, Amir

Hello Vitaly, it would be nice if you could please generate a patch that applies on kernel 6.9.x

(In reply to Martin from comment #65)
> Hello Vitaly,
> it would be nice if you could please generate a patch, that applies on
> Kernel 6.9.x
Didn't you mean 6.10? You mentioned that 6.9 doesn't have an issue.

Yes, that is correct. I mistyped. I apologize.

Created attachment 306889 [details]
igc_resume debug prints
I am sorry, the patch does not apply to kernel 6.10.10:
Patch3: 0001-debug-prints-in-igc_resume.patch
+ case "$patch" in
+ git --work-tree=. apply
error: patch failed: drivers/net/ethernet/intel/igc/igc_main.c:7255
error: drivers/net/ethernet/intel/igc/igc_main.c: patch does not apply
error: Bad exit status from /var/tmp/rpm-tmp.AvJAl2 (%prep)

Odd: if I do it with a freshly downloaded kernel 6.10.10, it works. Maybe there is a problem with my rpm build process. I will use the standard way to build it.

Oh wait, I found the issue: the force-msi patch does not work with your last patch.
Now I would like to know which patches I should apply.
Sorry for the confusion.

(In reply to Martin from comment #71)
> oh wait, I found the issue, the force-msi patch does not work with your last
> patch.
>
> Now I would like to know which patches should I apply.
>
> Sorry for the confusion.
You don't need the MSI patch anymore, as we know that it doesn't fix the issue. My goal in the latest patch is to understand where the failure occurs in the resume flow. Therefore, applying only the last patch should be sufficient. BTW, did you try to force msix and see if it helps?

Created attachment 306890 [details]
dmesg igc resume patch applied
6.11.0-3-DEBUG - hits igc: netif_running returns false
Created attachment 306891 [details]
dmesg after two suspend cycles with latest patch
less igc_suspend.log | grep -i igc
[ 16.650998] igc 0000:04:00.0: enabling device (0000 -> 0002)
[ 16.651140] igc 0000:04:00.0: PCIe PTM not supported by PCIe bus/controller
[ 16.707941] igc 0000:04:00.0 (unnamed net_device) (uninitialized): PHC added
[ 16.736402] igc 0000:04:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
[ 16.736407] igc 0000:04:00.0 eth0: MAC: 68:54:5a:62:a3:01
[ 16.736539] igc 0000:06:00.0: enabling device (0000 -> 0002)
[ 16.736627] igc 0000:06:00.0: PCIe PTM not supported by PCIe bus/controller
[ 16.792831] igc 0000:06:00.0 (unnamed net_device) (uninitialized): PHC added
[ 16.819624] igc 0000:06:00.0: 4.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x1 link)
[ 16.819628] igc 0000:06:00.0 eth1: MAC: a8:a1:59:36:84:43
[ 16.824232] igc 0000:04:00.0 enp4s0: renamed from eth0
[ 16.824360] igc 0000:06:00.0 enp6s0: renamed from eth1
[ 24.591160] igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 24.613173] igc 0000:06:00.0 enp6s0: entered allmulticast mode
[ 24.613349] igc 0000:06:00.0 enp6s0: entered promiscuous mode
[ 68.407229] igc: Enter igc_resume
[ 68.407260] igc: Enter igc_resume
[ 70.921273] igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 88.055377] igc 0000:06:00.0 enp6s0: left allmulticast mode
[ 88.055417] igc 0000:06:00.0 enp6s0: left promiscuous mode
[ 93.900387] igc: Enter igc_resume
[ 93.900446] igc: Enter igc_resume
[ 93.925461] igc: netif_running returns false
[ 93.925462] igc: netif_running returns false
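The pattern in this log (each resume enters igc_resume, but only the second one reports netif_running as false) can be counted mechanically when comparing logs from several machines. The following is a rough sketch; `summarize_igc_resumes` is a hypothetical helper name, and the two search strings are simply the messages emitted by the debug patches from this thread, so they will not match on an unpatched kernel.

```shell
# Hypothetical helper: summarize igc resume outcomes from a saved dmesg log.
# It only matches the messages added by the debug patches in this thread.
summarize_igc_resumes() {
    log="$1"
    # grep -c prints the number of matching lines; "|| true" swallows the
    # non-zero exit status that grep returns when nothing matches.
    resumes=$(grep -c 'igc: Enter igc_resume' "$log" || true)
    skipped=$(grep -c 'igc: netif_running returns false' "$log" || true)
    echo "resume calls: $resumes, netif_running false: $skipped"
}
```

For example, `summarize_igc_resumes igc_suspend.log` on the log above would report four resume calls (two NICs, two cycles), two of which found the interface not running.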
I had some time to test some things today. Remember, in my case there is no suspend/resume cycle; the problems exist directly after a boot.
6.9.8: All OK.
6.10.6: No devices usable after boot
6.10.6 + 6f31d6b643a reverted: All OK
6.11.0: No devices usable, but all OK after a modprobe -r igc; modprobe igc cycle.
6.11.0 + 6f31d6b643a reverted: No devices usable after boot, but all OK after a modprobe -r igc; modprobe igc cycle.
So where reverting 6f31d6b643a fixed the problem in 6.10, it does not in 6.11. dmesg follows.

Created attachment 306898 [details]
dmesg-6.11.0 boot and modules reload
Kernel 6.11.0 + logging patches
After boot, all interfaces broken.
After module reload (160 sec), all interfaces working.
Repeatability: 3 out of 3 attempts.
Alex, since you have an Intel I226-V, I would strongly recommend creating a separate bug report, since your issue differs from my original report.

I would like to test whether there is a race condition between the PCI driver and the net core. Specifically, since the device is removed from the PCI tree, I would like to ask you to try to reproduce the issue with D3cold support disabled in sysfs:
> echo 0 > /sys/bus/pci/devices/.../d3cold_allowed

Created attachment 306911 [details]
fresh boot, two standby cycles
echo 0 > /sys/bus/pci/devices/0000\:06\:00.0/d3cold_allowed
echo 0 > /sys/bus/pci/devices/0000\:04\:00.0/d3cold_allowed
cat /sys/bus/pci/devices/0000\:06\:00.0/d3cold_allowed
0
cat /sys/bus/pci/devices/0000\:04\:00.0/d3cold_allowed
0
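The per-device commands above can be generalized so that D3cold is disabled for every device bound to the igc driver, whatever its PCI address. This is a minimal sketch under stated assumptions: `disable_d3cold_for_driver` and its arguments are hypothetical names, it relies on the usual sysfs layout in which bound devices appear as entries under `/sys/bus/pci/drivers/<driver>/`, and it needs root. The setting does not persist across reboots.

```shell
# Sketch: disable D3cold for every PCI device bound to a given driver.
# disable_d3cold_for_driver and its arguments are hypothetical names.
disable_d3cold_for_driver() {
    driver="${1:-igc}"
    sysfs_pci="${2:-/sys/bus/pci}"
    # Bound devices show up as entries named after their PCI address
    # (domain:bus:device.function), hence the "*:*" glob.
    for dev in "$sysfs_pci/drivers/$driver"/*:*; do
        [ -e "$dev/d3cold_allowed" ] || continue
        echo 0 > "$dev/d3cold_allowed"
        # Read the value back, as done manually with cat above.
        echo "$(basename "$dev"): d3cold_allowed=$(cat "$dev/d3cold_allowed")"
    done
}
```

Running `disable_d3cold_for_driver igc` before the suspend cycles is meant to be equivalent to the two echo commands shown above.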
Comment on attachment 306891 [details]
dmesg after two suspend cycles with latest patch
I added some more prints for igc_suspend and I think they are interesting.
first suspend (these are from the igc_suspend flow):
[ 55.843617] igc: netif_running was true, calling __igc_close
[ 55.884575] igc: wufc was true
[ 55.884636] igc: wake is true, igc_power_up_link
second suspend (these are from the igc_suspend flow):
[ 81.194595] igc: netif_running was false, skipping __igc_close
[ 81.195310] igc: wufc & IGC_WUFC_MC was false
[ 81.195314] igc: wake is false, igc_power_down_phy_copper_base
and unsurprisingly when you come up from that it's in a bad state. I'll attach the logs and patch, maybe it will give you guys new ideas.
Created attachment 306920 [details]
two suspends booting with suspend kprintf
Created attachment 306921 [details]
kprintfs for igc_suspend flow
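Since the second suspend finds netif_running already false, it may help to record from userspace whether the interface is administratively up just before each suspend. The sketch below reads the interface flags from sysfs and tests the IFF_UP bit (0x1). Treating IFF_UP as a stand-in for the kernel's netif_running() is an assumption: the two are set and cleared together when an interface is opened or closed, but they are not the same variable. `iface_admin_state` is a hypothetical helper name.

```shell
# Hypothetical helper: report whether an interface is administratively up,
# as a userspace approximation of netif_running() (assumption, see above).
iface_admin_state() {
    iface="$1"
    net_root="${2:-/sys/class/net}"
    flags_file="$net_root/$iface/flags"
    if [ ! -r "$flags_file" ]; then
        echo "$iface: missing"
        return 1
    fi
    flags=$(cat "$flags_file")   # hexadecimal, e.g. 0x1003
    # IFF_UP is bit 0x1 of the interface flags.
    if [ $(( $flags & 1 )) -ne 0 ]; then
        echo "$iface: up"
    else
        echo "$iface: down"
    fi
}
```

Logging `iface_admin_state enp6s0` before the first and second suspend would show whether something in userspace took the interface down in between.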
Vitaly, from my understanding of kernel drivers, which is admittedly super limited, you need the rtnl_lock earlier in the resume flow. Just looking at the i40e driver, it takes out a lock much earlier (before i40e_restore_interrupt_scheme), but the igc driver does it just around __igc_open. That might be why things don't work correctly.

(In reply to metron@gmail.com from comment #83)
> Vitaly, from my understanding of kernel drivers which is admittedly super
> limited you need the rtnl_lock earlier in the resume flow. Just looking at
> the i40e driver, and it takes out a lock much earlier (before
> i40e_restore_interrupt_scheme) but the igc driver does it just around
> __igc_open. That might be why things don't work correctly.
The reason I believe it is fine not to hold the lock is that we don't handle the queues in __igc_open or in igc_resume, but rather in igc_open, which is in an RTNL-locked context.
By the way, thanks for the previous patch. I don't think that the wakes are relevant to this issue. However, I believe that this print might indeed be related:
"igc: netif_running was false, skipping __igc_close"
I'll upload a new patch with more prints soon.

Created attachment 306975 [details]
I added here more prints in netdev and PM flows.
Created attachment 306978 [details]
two time suspend/resume with attachment 306975 [details]
Created attachment 306980 [details]
using latest print patch
Mine is a little different and you can see igc_close is called.
[ 56.203323] igc: Enter igc_close
I had to work hard to do this with nouveau; after one suspend, video did not return but I could blindly type terminal commands :)
(In reply to metron@gmail.com from comment #87)
> Created attachment 306980 [details]
> using latest print patch
>
> Mine is a little different and you can see igc_close is called.
> [ 56.203323] igc: Enter igc_close
>
> I had to work hard to do this with nouveau; after one suspend, video did not
> return but I could blindly type terminal commands :)
Ah nevermind, Martin's has that too

Vitaly Lifshits, what's the status here? Is there any hope that this two months old regression will be fixed soon?

Hi Thorsten, We are on it. This issue does not reproduce on any system in our lab, and we tried on many systems with many different configurations. It appears to be affecting only a small number of users, represented in this thread. Thus, we are doing a lot of back-and-forth triage and analysis with the folks who reported the issue and are helping us debug it. I can assure you that as soon as we have a root cause and a possible fix, we will share it. Currently, we see that there is a difference in the way the network stack operates. I am consulting with Jakub Kicinski on this.

thx for the update!

Which OS distribution do you use? What application manages networking on your system?

Fedora 40 with NetworkManager

Created attachment 306995 [details]
network-manager logs
I'm using Manjaro, also with NetworkManager.
I ran this and did two suspend cycles to generate logs. As a bonus there is one with linux 6.6 LTS.
before: sudo nmcli g logging level DEBUG
after: journalctl -u NetworkManager --since "today"
I did a suspend with 6.6 at the start and two with 6.11 after rebooting (line marker is "-- Boot d77d5cd258d94681a2f69c5d1e3401d1 --")
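Because the attached journal covers two boots (6.6 first, then 6.11), it can be convenient to cut out the lines belonging to one boot using the "-- Boot <id> --" marker mentioned above. A small sketch follows; `journal_for_boot` is a hypothetical helper, and it assumes the log was saved from plain `journalctl` output, where each boot is introduced by such a marker line.

```shell
# Hypothetical helper: print only the journal lines that follow the
# "-- Boot <id> --" marker for the given boot ID, up to the next marker.
journal_for_boot() {
    log="$1"
    boot_id="$2"
    awk -v id="$boot_id" '
        # A marker line switches the filter on or off; field 3 is the boot ID.
        /^-- Boot [0-9a-f]+ --$/ { in_boot = ($3 == id); next }
        in_boot { print }
    ' "$log"
}
```

For the attachment above, `journal_for_boot nm.log d77d5cd258d94681a2f69c5d1e3401d1` (with `nm.log` standing in for wherever the log was saved) would leave only the two 6.11 suspend cycles.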
I attempted to reproduce the issue on a system running Fedora 40 with NetworkManager. Despite setting up a bridge connection, I was unable to replicate the problem. Could you please check if the issue persists when NetworkManager is stopped? If possible, share also the dmesg log.
I will give it a try. May I ask which hardware you are using? Are you testing on an AM4 system?

I don't have an AM4 system, but according to Metron, the issue was reproduced on a Z790 system, which we also tested. I doubt that this issue is related to a specific system, because our debug prints indicate that the network stack behaves differently when the issue occurs. On all of my systems, during suspend and resume flows, the driver is not being opened and closed:
[504092.919367] igc: Enter igc_suspend
[504092.919371] igc: __igc_shutdown: netif_running was true, calling __igc_close
[504093.394002] igc: Enter igc_resume
[504097.888682] igc 0000:ae:00.0 enp174s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[504120.025521] igc: Enter igc_suspend
[504120.025526] igc: __igc_shutdown: netif_running was true, calling __igc_close
[504120.500791] igc: Enter igc_resume
[504124.737054] igc 0000:ae:00.0 enp174s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
I believe there might be a user-level application that calls ndo->close before suspending, and without the rtnl_lock, it doesn't call ndo->open on resume before igc_resume is invoked.

Could it be related to the high core counts? 5950x and 13900kf have that in common.

I switched my system to systemd-networkd with a static IP and it works after the second suspend.
Do you think that this is a bug in NetworkManager or a configuration issue perhaps?
Increasing carrier-wait-timeout from 6s might help. It seems like it gives up if the carrier isn't detected within that time.
https://www.networkmanager.dev/docs/api/latest/NetworkManager.conf.html

Hi Metron, I don't think that it is related to a high core count, because we see different suspend and resume flows. Though I might be wrong here. Regarding your suggestion:
(In reply to metron@gmail.com from comment #100)
> I switched my system to systemd-networkd with a static IP and it works after
> the second suspend.
>
> Do you think that this is a bug in NetworkManager or a configuration issue
> perhaps?
Honestly, I am not entirely sure. We see that there is a race condition between PM suspend/resume and NDO open/close. It might be that there is some issue in NetworkManager related to the NDO flows.
>
> Increasing carrier-wait-timeout from 6s might help. It seems like it gives
> up if the carrier isn't detected within that time.
> https://www.networkmanager.dev/docs/api/latest/NetworkManager.conf.html
Can I ask you to increase this timeout and see if it helps?

It didn't seem to make a difference when I increased the timeout. I suppose that makes sense; the problems in dmesg don't take six seconds to appear.

I am sorry for having disappeared for a while. I have tested it with a higher timeout as well, and had no success either. Is there anything else we can do?

I think I am also having the same issue; the only difference is that it happens after every resume from suspend, not every second resume.
My system:
Ethernet Controller I225-V (rev 03) - (on Asus X670E extreme mobo)
firmware-version: 1082:8770
Kernel: 6.11.0-9-generic (on Ubuntu 24.10, also using NetworkManager)
CPU: AMD Ryzen 9 7950X

Please try to reproduce with this patch: https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20241028195243.52488-3-jdamato@fastly.com/

Created attachment 307189 [details]
patch file from mailing list
I made this patch file from the mailing list you linked but I haven't had time to try it yet.
My apologies, I missed the patch; building it right now. Will test later today.

Sadly, this patch does not help :( I am still offline after the second standby cycle. Metron, were you successful?

So, it's kind of a funny story, but I had invested some time getting systemd-boot working instead of grub, and when I tried to test this patch I ended up completely breaking my system. After a few tries, I decided to give up on Manjaro and put on Fedora 41. Now that I am running Fedora, I can no longer reproduce the problem. One thing that's different in my Fedora 41 system is that I didn't configure my wifi networks. Now I get kernel oopses from iwlwifi, but the ethernet interface works fine.

That is odd, since I am on F41 (but upgraded from F24 and following). @metron I ran rpmconf -a to verify that I am on the most recent config files, but nothing besides the usual java security enhancements came up. I am really curious what a new Fedora 41 installation changes. I might just boot the Fedora 41 installation stick and send it to standby twice for testing.

Created attachment 307204 [details]
attachment-2741-0.html
Hello and Thank you for your email. I am currently out of the office on Army Reserve Duty. During this time, my ability to respond to emails will be limited, and there may be a delay in my responses. I appreciate your understanding and will do my best to reply as soon as possible. For urgent cases, please contact my manager: shmuel.ben-nisan@intel.com Thank you for your patience. Best regards

I can't test my theory, but I think the wifi interface being configured caused NetworkManager to give up on the igc interface. Unfortunately, my wifi no longer works at all, so I can't test this idea. The interface takes a really long time to come up, so I think there's some code in NetworkManager that is giving up on it. I'll try installing and booting a 6.10 kernel today to see if the issue reappears with and without wifi. Manjaro was on 6.11.2 and my wifi used to work.
Fedora has 6.11.6, which brought in this other regression: https://bugzilla.kernel.org/show_bug.cgi?id=219447
I see you have a bridge and I assume you are using both interfaces. Have you tried with just one?

My mainboard does have WiFi as well, but I disabled it in UEFI, so I guess this should not affect Linux. I am only using one of the two cards; the second card is just for projects where I need to access a second router I am programming for a customer, so it is mostly inactive.

I tested with a current live image of Fedora 41 (https://dl.fedoraproject.org/pub/alt/live-respins/) and the error did not occur after resume number two. Now what does that mean? I ran rpmconf -a, which updates config files, and I found nothing that could be related to the issue. I really, really do not want to reinstall.

Created attachment 307227 [details]
attachment-26861-0.html
Hello, I am currently out of the office on Army Reserve Duty, which limits my ability to respond to emails. Please do not communicate through email for existing or new issues; instead, use the IPS system. Instructions on opening an IPS ticket are attached to this auto-reply. For urgent matters, you may contact my manager at shmuel.ben-nisan@intel.com<mailto:shmuel.ben-nisan@intel.com>.
To ensure effective monitoring and resolution of this issue, we request that you open an IPS ticket and provide the following information:
1. A comprehensive and detailed description of the failing scenario, broken down into steps, e.g.,
· Boot to OS
· Enter S4
· Return from S4
· Issue reproduced
2. Confirm if this issue is reproducible on Vpro or Non Vpro SKU (V/LM).
3. Provide fail rate and number of systems that can reproduce the issue, along with the number of tests conducted and failures, e.g.,
· 2 out of 15 systems failed
· System 1 fail rate: 1 out of 20
· System 2 fail rate: 1 out of 15
4. Specify the test environment (Windows/Linux/UEFI/PXE). If Windows, include the LAN device driver version (e.g., 20.0.2.8).
5. Provide the LAN NVM version.
6. Indicate whether the LAN cable is connected or disconnected and if it is part of the test flow.
7. Describe the expected behavior in this scenario.
8. State the pass criteria for this scenario.
Failure to provide this information will result in automatic resolution and closure of the issue. Thank you for your cooperation. Best regards,

I "diffed" the entire /etc folder against the one from the Fedora 41 live CD and found no indicators of differences in the installed components. Of course, I do have a large variety of additional software on my installed system. Do you have some hints on how I could explore the problem? I am thinking of mapping all installed packages, taking another SSD and installing a fresh system, but my time before Christmas is really the issue. That's why I ask whether you are still confident and determined to find the issue.

I'm pretty sure this problem is an interaction between NetworkManager and the driver. I'm not sure why NetworkManager gives up on the interface, but it seems to do so if there is another interface configured. I think that's why there are not many people affected. I noticed issues on their gitlab don't go so well, but you could try starting a thread there. Switching from NetworkManager to systemd-networkd is probably an option; that 'solved' the problem for me. You lose out on the network icon (at least in gnome), but everything works. This bug wasn't as problematic for me as all my nvidia/intel-cpu problems, so I bought a whole new computer that's more Linux friendly. Now I have new problems, but suspend and resume works.

Created attachment 307361 [details]
attachment-29439-0.html
Hello, Thank you for your email. I am currently out of the office until December 22nd. For urgent matters, you may contact my manager at shmuel.ben-nisan@intel.com<mailto:shmuel.ben-nisan@intel.com>. Thank you for your cooperation. Best regards,

@metron I created a systemd service to automatically unload and load the module.
I still hope this gets fixed. When I searched, I found a few other users with this issue on reddit.
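The service mentioned above was not posted, so the following is not Martin's unit but a sketch of a different mechanism with a similar effect: a systemd system-sleep hook, which systemd runs with "pre" or "post" as its first argument around each suspend. The function name `igc_sleep_hook` and the `MODPROBE` override are hypothetical, and unloading before suspend (rather than only reloading afterwards) is an assumption about what the workaround needs.

```shell
# Sketch of a system-sleep hook body for the unload/reload workaround.
# igc_sleep_hook and the MODPROBE override (for dry runs) are hypothetical.
igc_sleep_hook() {
    phase="$1"                             # "pre" or "post", passed by systemd
    modprobe_cmd="${MODPROBE:-modprobe}"
    case "$phase" in
        pre)  "$modprobe_cmd" -r igc ;;    # unload the driver before sleeping
        post) "$modprobe_cmd" igc ;;       # load it again after resume
    esac
}
```

To try it, the function plus a final line `igc_sleep_hook "$1"` would go into an executable script under /usr/lib/systemd/system-sleep/. Note that any bridge or other configuration on top of the interface has to cope with the device disappearing and reappearing.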