Created attachment 303473 [details] dmesg
Created attachment 303474 [details] lspci
Windows probably doesn't have this issue because AER isn't enabled on the external-facing TBT root port.
Created attachment 303486 [details] suspend/resume dmesg from an Asus Z790-I & i9 13900K
I have the same issue on a brand new Asus Z790-I + i9 13900K, although igc works just fine after resume. IOW, it is completely harmless in my case.
The dmesg I attached is from kernel 6.1.1.
Javier, the dmesg you attached is truncated, so it's hard to understand what happened.
Created attachment 304459 [details] dmesg on 6.4-rc7
From the comment #1 dmesg:

  [ 41.128385] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
  [ 41.128525] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
  [ 41.128529] pcieport 0000:00:1d.0: device [8086:7ab0] error status/mask=00100000/00004000
  [ 41.128534] pcieport 0000:00:1d.0: [20] UnsupReq (First)
  [ 41.128538] pcieport 0000:00:1d.0: AER: TLP Header: 34000000 0a000052 00000000 00000000
  [ 41.128543] pcieport 0000:00:1d.0: AER: Error of this Agent is reported first
  [ 41.128562] pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
  [ 41.128566] pcieport 0000:04:01.0: device [8086:1136] error status/mask=00300000/00000000
  [ 41.128570] pcieport 0000:04:01.0: [20] UnsupReq (First)
  [ 41.128573] pcieport 0000:04:01.0: [21] ACSViol
  [ 41.128576] pcieport 0000:04:01.0: AER: TLP Header: 34000000 04000052 00000000 00000000

Decoding the 00:1d.0 TLP Header 34000000 0a000052 per PCIe r6.0:

  34000000:
    001        Fmt: 4 DW header, no data (sec 2.2.1.1)
    1_0100     Type: Msg, Local - Terminate at Receiver (sec 2.2.1.1)
  0a000052:
    0a00       Requester ID 0a:00.0 (sec 2.2.8.10)
    0101_0010  PTM Request (sec 2.2.8.10)

Decoding the 04:01.0 TLP Header 34000000 04000052 (same except Requester ID):

  04000052:
    0400       Requester ID 04:00.0

Both UnsupReq errors seem to be caused by a PTM Request when the receiver has PTM disabled (see sec 6.21.3).

I don't understand the 0a:00.0 or 04:00.0 Requester IDs because there's no 0a:00.0 device, and 04:00.0 wouldn't send a PTM request to 04:01.0. But all PTM Messages use "Local" message routing, so they terminate at the other end of the link and no addressing is necessary, so maybe these Requester IDs aren't important.
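For anyone who wants to repeat the decode on their own logs, here is a minimal Python sketch of the same bit slicing. The function and table names are mine, not from the kernel or the spec, and the tables only cover the values that show up in this report (Message TLPs, the two PTM message codes, and three AER Uncorrectable Error Status bits), so treat it as an illustration rather than a general TLP decoder.

```python
# Tables limited to what this log needs (PCIe r6.0 sec 2.2.1.1 and 2.2.8.10).
FMT = {
    0b000: "3 DW header, no data",
    0b001: "4 DW header, no data",
    0b010: "3 DW header, with data",
    0b011: "4 DW header, with data",
}
MSG_ROUTING = {
    0b000: "Routed to Root Complex",
    0b001: "Routed by Address",
    0b010: "Routed by ID",
    0b011: "Broadcast from Root Complex",
    0b100: "Local - Terminate at Receiver",
    0b101: "Gathered and routed to Root Complex",
}
MSG_CODE = {0x52: "PTM Request", 0x53: "PTM Response"}
# AER Uncorrectable Error Status bits seen in the status/mask values above.
UNCOR_BITS = {14: "CmpltTO", 20: "UnsupReq", 21: "ACSViol"}


def decode_msg_header(dw0, dw1):
    """Decode the first two DWs of a Message TLP header as logged by AER."""
    fmt = (dw0 >> 29) & 0x7          # DW0 bits 31:29
    tlp_type = (dw0 >> 24) & 0x1F    # DW0 bits 28:24
    if tlp_type >> 3 != 0b10:        # Message TLPs have Type 10rrr
        raise ValueError("not a Message TLP")
    req_id = (dw1 >> 16) & 0xFFFF    # DW1 bits 31:16, bus/dev/fn
    code = dw1 & 0xFF                # DW1 bits 7:0, Message Code
    return {
        "fmt": FMT.get(fmt, "unknown"),
        "routing": MSG_ROUTING.get(tlp_type & 0x7, "reserved"),
        "requester": "%02x:%02x.%x" % (req_id >> 8, (req_id >> 3) & 0x1F, req_id & 0x7),
        "message": MSG_CODE.get(code, "code 0x%02x" % code),
    }


def decode_uncor_status(value):
    """Name the set bits of an AER Uncorrectable Error Status/Mask value."""
    return [UNCOR_BITS.get(bit, "bit %d" % bit)
            for bit in range(32) if value & (1 << bit)]


if __name__ == "__main__":
    print(decode_msg_header(0x34000000, 0x0A000052))  # 00:1d.0
    print(decode_msg_header(0x34000000, 0x04000052))  # 04:01.0
    print(decode_uncor_status(0x00300000))            # 04:01.0 status
```

Both headers come out as a 4 DW, no-data, locally terminated PTM Request, differing only in the Requester ID, which matches the hand decode.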
We know the hierarchy here is:

  00:1d.0 Root Port to [bus 03-6c]
  03:00.0 Switch Upstream Port to [bus 04-6c]
  04:01.0 Switch Downstream Port to [bus 06-38]
  06:00.0 Switch Upstream Port to [bus 07-38]
  07:04.0 Switch Downstream Port to [bus 38]
  38:00.0 igc I225 NIC

So any PTM Request received by 00:1d.0 must have been sent by 03:00.0, and any request received by 04:01.0 must have come from 06:00.0.

IIUC, the PTM link protocol only involves the two components on a single link. In other words, a PTM request from 38:00.0 is not forwarded all the way to the Root Port. 38:00.0 and 07:04.0 trade request/response messages, 06:00.0 and 04:01.0 trade their own request/response messages, and 03:00.0 and 00:1d.0 trade their own. These are all separate conversations that might happen to be close in time.

00:1d.0 logged a UR, so it had PTM disabled when it received a PTM Request from 03:00.0, which must have had PTM enabled.

04:01.0 also logged a UR, so it had PTM disabled when it received a PTM Request from 06:00.0. The PTM Capability in 03:00.0 controls PTM for the entire switch, so 03:00.0 must have had PTM *disabled*.

All AER interrupts come from the Root Port, so we got one interrupt from 00:1d.0. Then I think we traversed the hierarchy below 00:1d.0 searching AER Capabilities for any logged errors, but we don't know the ordering of the 00:1d.0 UR versus the 04:01.0 UR.

It seems possible that PTM is being enabled in the wrong order. I think we would see the comment #1 dmesg logging if we had this sequence:

  - PTM disabled in all devices
  - Software enables 06:00.0 PTM
  - 06:00.0 sends PTM Request to 04:01.0
  - 04:01.0 logs UR error because it has PTM disabled
  - Software enables 03:00.0 PTM
  - 03:00.0 sends PTM Request to 00:1d.0
  - 00:1d.0 logs UR error because it has PTM disabled
  - 00:1d.0 generates AER interrupt
  - AER handler finds both UR errors logged
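The ordering argument above can be played through with a toy model. This is only a sketch under stated assumptions: the names are made up, a port is assumed to send a PTM Request to its link partner as soon as its PTM is enabled, and 07:04.0 is assumed to be governed by the 06:00.0 PTM Capability in the same way 04:01.0 is governed by 03:00.0. It says nothing about what the hardware or the kernel actually does.

```python
# Requester (upstream-facing port) -> responder (downstream-facing port
# at the other end of the same link). PTM dialogs never cross a link.
LINK_PARTNER = {
    "03:00.0": "00:1d.0",
    "06:00.0": "04:01.0",
    "38:00.0": "07:04.0",
}

# Which function's PTM Capability governs each port. A switch's upstream
# port controls PTM for the whole switch (assumed for the dock switch too).
CONTROLLED_BY = {
    "00:1d.0": "00:1d.0",
    "03:00.0": "03:00.0",
    "04:01.0": "03:00.0",
    "06:00.0": "06:00.0",
    "07:04.0": "06:00.0",
    "38:00.0": "38:00.0",
}


def simulate(enable_order):
    """Return the ports that would log an Unsupported Request, in order.

    Toy assumption: a port sends one PTM Request to its link partner the
    moment software enables PTM on it.
    """
    enabled = set()
    ur_logged = []
    for dev in enable_order:
        enabled.add(dev)
        responder = LINK_PARTNER.get(dev)
        if responder is None:
            continue  # Root Port: nothing upstream to send a request to
        if CONTROLLED_BY[responder] not in enabled:
            ur_logged.append(responder)  # receiver has PTM disabled -> UR
    return ur_logged


if __name__ == "__main__":
    # Bottom-up enable, as in the sequence above.
    print(simulate(["06:00.0", "03:00.0"]))
    # Top-down enable: every receiver is ready before its requester.
    print(simulate(["00:1d.0", "03:00.0", "06:00.0", "38:00.0"]))
```

In this model the bottom-up order logs URs at 04:01.0 and then 00:1d.0, the same two ports as in the comment #1 dmesg, while the top-down order logs none.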
Is the following scenario possible:

  - PTM disabled in all devices
  - System suspends to S3
  - System resumes from S3
  - Thunderbolt switch, I225 NIC, etc. get power cycled
  - 06:00.0 sends PTM Request to 04:01.0
  - 04:01.0 logs UR error because PTM is still disabled
  - 03:00.0 sends PTM Request to 00:1d.0
  - 00:1d.0 logs UR error because PTM is still disabled
  - 00:1d.0 generates AER interrupt
  - AER handler finds both UR errors logged
Which devices are in the dock? Obviously 00:1d.0 is in the laptop. What about the switch with 03:00.0 and 04:xx.x?
Created attachment 304673 [details] lspci without dock
Created attachment 304674 [details] lspci with dock
So 03:00.0 and 04:xx.x are in the laptop, and the following are in the dock:

> 06:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:01.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:02.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:03.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:04.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 38:00.0 Ethernet controller: Intel Corporation Ethernet Controller (2) I225-LMvP (rev 03)
It looks like there's already a quirk for I225-V in the kernel, but it's not applied for other i225 IDs:
https://github.com/torvalds/linux/blob/9b2ffa6148b1e4468d08f7e0e7e371c43cac9ffe/drivers/net/ethernet/intel/igc/igc_ptp.c#L932

In igc_is_crosststamp_supported:

> FIXME: it was noticed that enabling support for PCIe PTM in
> some i225-V models could cause lockups when bringing the
> interface up/down. There should be no downsides to
> disabling crosstimestamping support for i225-V, as it
> doesn't have any PTP support. That way we gain some time
> while root causing the issue.

This quirk really needs to have other i225 models added.

From the CalDigit TS4 dock:
6e:00.0 Intel Corporation Ethernet Controller (2) I225-LMvP [8086:5502] (rev 03)

From my machine itself, the NIC likes to disappear, so I couldn't check the subvendor and subdevice IDs, but the main ID is:
6f:00.0 Intel Corporation Ethernet Controller I225-LM [8086:15f2]

Weirdly, the onboard i225-LM seems to sometimes disappear from the PCIe bus entirely when the dock i225-LMvP is connected. Perhaps vPro makes the dock shadow the onboard, or perhaps the onboard merely got wedged.

An excerpt from the journal when it deadlocks:

> Dec 24 19:13:23 nuc kernel: INFO: task .NET TP Worker:24685 blocked for more than 122 seconds.
> Dec 24 19:13:23 nuc kernel: Tainted: P O 6.8.12-5-pve #1
> Dec 24 19:13:23 nuc kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 24 19:13:23 nuc kernel: task:.NET TP Worker state:D stack:0 pid:24685 tgid:1946 ppid:1 flags:0x00004002
> Dec 24 19:13:23 nuc kernel: Call Trace:
> Dec 24 19:13:23 nuc kernel: <TASK>
> Dec 24 19:13:23 nuc kernel: __schedule+0x401/0x15e0
> Dec 24 19:13:23 nuc kernel: schedule+0x33/0x110
> Dec 24 19:13:23 nuc kernel: schedule_preempt_disabled+0x15/0x30
> Dec 24 19:13:23 nuc kernel: __mutex_lock.constprop.0+0x3f8/0x7a0
> Dec 24 19:13:23 nuc kernel: ? igc_tsn_reset+0x45c/0x600 [igc]
> Dec 24 19:13:23 nuc kernel: ? __pfx_pci_pm_runtime_resume+0x10/0x10
> Dec 24 19:13:23 nuc kernel: __mutex_lock_slowpath+0x13/0x20
> Dec 24 19:13:23 nuc kernel: mutex_lock+0x3c/0x50
> Dec 24 19:13:23 nuc kernel: rtnl_lock+0x15/0x20
> Dec 24 19:13:23 nuc kernel: igc_resume+0xfd/0x220 [igc]
> Dec 24 19:13:23 nuc kernel: igc_runtime_resume+0xe/0x20 [igc]
> Dec 24 19:13:23 nuc kernel: pci_pm_runtime_resume+0xa0/0x100
> Dec 24 19:13:23 nuc kernel: __rpm_callback+0x4d/0x170
> Dec 24 19:13:23 nuc kernel: rpm_callback+0x3b/0x80
> Dec 24 19:13:23 nuc kernel: ? __pfx_pci_pm_runtime_resume+0x10/0x10
> Dec 24 19:13:23 nuc kernel: rpm_resume+0x594/0x7e0
> Dec 24 19:13:23 nuc kernel: ? sock_do_ioctl+0x118/0x140
> Dec 24 19:13:23 nuc kernel: ? kmalloc_trace+0x139/0x360
> Dec 24 19:13:23 nuc kernel: __pm_runtime_resume+0x4e/0x80
> Dec 24 19:13:23 nuc kernel: dev_ethtool+0x153/0x2f20
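Separately from the trace, if others want to report which i225 variants they see this on, the vendor:device pair can be pulled out of `lspci -nn` style lines like the two quoted in this comment. The following is just a small Python sketch; the regex and the function name are my own, and it only reads the text, it does not query the PCI bus.

```python
import re

# Matches "<bus>:<dev>.<fn> <description> [vvvv:dddd]" as printed by lspci -nn.
LSPCI_ID = re.compile(
    r"^(?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+.*?"
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def parse_lspci_id(line):
    """Return (bdf, vendor_id, device_id) or None if the line has no ID."""
    m = LSPCI_ID.match(line.strip())
    if m is None:
        return None
    return m["bdf"], int(m["vendor"], 16), int(m["device"], 16)


if __name__ == "__main__":
    lines = [
        "6e:00.0 Intel Corporation Ethernet Controller (2) I225-LMvP [8086:5502] (rev 03)",
        "6f:00.0 Intel Corporation Ethernet Controller I225-LM [8086:15f2]",
    ]
    for line in lines:
        bdf, vendor, device = parse_lspci_id(line)
        print("%s %04x:%04x" % (bdf, vendor, device))
```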