Bug 216850 - I225 device on TBT dock stops working on S3 resume
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-12-26 15:26 UTC by Kai-Heng Feng
Modified: 2023-07-21 02:13 UTC

See Also:
Kernel Version: mainline, linux-next
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (160.64 KB, text/plain)
2022-12-26 15:27 UTC, Kai-Heng Feng
lspci (106.26 KB, text/plain)
2022-12-26 15:27 UTC, Kai-Heng Feng
suspend/resume dmesg from an Asus Z790-I & i9 13900K (10.44 KB, text/plain)
2022-12-27 17:06 UTC, Javier Marcet
dmesg on 6.4-rc7 (500.16 KB, text/plain)
2023-06-20 12:35 UTC, Kai-Heng Feng
lspci without dock (2.54 KB, text/plain)
2023-07-21 02:08 UTC, Kai-Heng Feng
lspci with dock (3.15 KB, text/plain)
2023-07-21 02:09 UTC, Kai-Heng Feng

Description Kai-Heng Feng 2022-12-26 15:26:56 UTC

    
Comment 1 Kai-Heng Feng 2022-12-26 15:27:18 UTC
Created attachment 303473 [details]
dmesg
Comment 2 Kai-Heng Feng 2022-12-26 15:27:36 UTC
Created attachment 303474 [details]
lspci
Comment 3 Kai-Heng Feng 2022-12-26 15:28:28 UTC
Windows probably doesn't have this issue because AER isn't enabled on the external-facing TBT root port.
Comment 4 Javier Marcet 2022-12-27 17:06:19 UTC
Created attachment 303486 [details]
suspend/resume dmesg from an Asus Z790-I & i9 13900K
Comment 5 Javier Marcet 2022-12-27 17:09:11 UTC
I have the same issue on a brand new Asus Z790-I + i9 13900K, although igc works just fine after resume. IOW, it is completely harmless in my case.
Comment 6 Javier Marcet 2022-12-27 17:11:16 UTC
The dmesg I attached is from kernel 6.1.1.
Comment 7 Kai-Heng Feng 2023-06-20 05:39:06 UTC
Javier,

The dmesg you attached is truncated, so it's hard to understand what happened.
Comment 8 Kai-Heng Feng 2023-06-20 12:35:32 UTC
Created attachment 304459 [details]
dmesg on 6.4-rc7
Comment 9 Bjorn Helgaas 2023-07-15 00:01:29 UTC
From the comment #1 dmesg:

[   41.128385] pcieport 0000:00:1d.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:1d.0
[   41.128525] pcieport 0000:00:1d.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   41.128529] pcieport 0000:00:1d.0:   device [8086:7ab0] error status/mask=00100000/00004000
[   41.128534] pcieport 0000:00:1d.0:    [20] UnsupReq               (First)
[   41.128538] pcieport 0000:00:1d.0: AER:   TLP Header: 34000000 0a000052 00000000 00000000
[   41.128543] pcieport 0000:00:1d.0: AER:   Error of this Agent is reported first
[   41.128562] pcieport 0000:04:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[   41.128566] pcieport 0000:04:01.0:   device [8086:1136] error status/mask=00300000/00000000
[   41.128570] pcieport 0000:04:01.0:    [20] UnsupReq               (First)
[   41.128573] pcieport 0000:04:01.0:    [21] ACSViol
[   41.128576] pcieport 0000:04:01.0: AER:   TLP Header: 34000000 04000052 00000000 00000000
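
As a cross-check on the "error status" values in the log above, here is a minimal Python sketch that maps set bits of the AER Uncorrectable Error Status register to the names the kernel prints. The table UNCOR_BITS and the helper decode_uncor_status are illustrative names, not kernel code, and only the two bits that appear in this log are named; any other bit is reported generically.

```python
# Bit positions in the AER Uncorrectable Error Status register.
# Only the two bits seen in the log above are named here.
UNCOR_BITS = {20: "UnsupReq", 21: "ACSViol"}


def decode_uncor_status(status: int) -> list[str]:
    """Return the names of all bits set in an uncorrectable status value."""
    return [
        UNCOR_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if (status >> bit) & 1
    ]


print(decode_uncor_status(0x00100000))  # ['UnsupReq']            (00:1d.0)
print(decode_uncor_status(0x00300000))  # ['UnsupReq', 'ACSViol'] (04:01.0)
```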

Decoding the 00:1d.0 TLP Header 34000000 0a000052 per PCIe r6.0:

  34000000:
    001        Fmt: 4 DW header, no data (sec 2.2.1.1)
    1_0100     Type: Msg, Local - Terminate at Receiver (sec 2.2.1.1)

  0a000052:
    0a00       Requester ID 0a:00.0 (sec 2.2.8.10)
    0101_0010  PTM Request (sec 2.2.8.10)

Decoding the 04:01.0 TLP Header 34000000 04000052 (same except Requester ID):

  04000052:
    0400       Requester ID 04:00.0
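
The manual decoding above can be reproduced with a short Python sketch. It only extracts the fields discussed here (Fmt, Type, Requester ID, Message Code) from the first two DWORDs of a Msg TLP header; decode_msg_header is a hypothetical helper name, and the interpretation of message code 0x52 as a PTM Request is taken from the decoding above.

```python
def decode_msg_header(dw0: int, dw1: int) -> dict:
    """Extract the fields of interest from DW0/DW1 of a Msg TLP header."""
    fmt = (dw0 >> 29) & 0x7         # byte 0, bits 7:5
    tlp_type = (dw0 >> 24) & 0x1F   # byte 0, bits 4:0
    req_id = (dw1 >> 16) & 0xFFFF   # bytes 4-5: bus[15:8], dev[7:3], fn[2:0]
    bus = req_id >> 8
    dev = (req_id >> 3) & 0x1F
    fn = req_id & 0x7
    return {
        "fmt": f"{fmt:03b}",
        "type": f"{tlp_type:05b}",
        "requester": f"{bus:02x}:{dev:02x}.{fn}",
        "msg_code": dw1 & 0xFF,     # byte 7
    }


for dw0, dw1 in [(0x34000000, 0x0A000052), (0x34000000, 0x04000052)]:
    print(decode_msg_header(dw0, dw1))
```

Both headers decode to Fmt 001, Type 10100 and message code 0x52; only the Requester ID differs (0a:00.0 versus 04:00.0).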

Both UnsupReq errors seem to be caused by a PTM Request when the receiver has PTM disabled (see sec 6.21.3).

I don't understand the 0a:00.0 or 04:00.0 Requester IDs because there's no 0a:00.0 device, and 04:00.0 wouldn't send a PTM request to 04:01.0.  But all PTM Messages use "Local" message routing, so they terminate at the other end of the link and no addressing is necessary, so maybe these Requester IDs aren't important.

We know the hierarchy here is:

  00:1d.0 Root Port to [bus 03-6c]
  03:00.0 Switch Upstream Port to [bus 04-6c]
  04:01.0 Switch Downstream Port to [bus 06-38]
  06:00.0 Switch Upstream Port to [bus 07-38]
  07:04.0 Switch Downstream Port to [bus 38]
  38:00.0 igc I225 NIC

So any PTM Request received by 00:1d.0 must have been sent by 03:00.0, and any request received by 04:01.0 must have come from 06:00.0.
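
The link-partner reasoning can be written down as a small Python sketch. PATH and link_partner_below are illustrative names; the list simply restates the hierarchy above, on the assumption that entries alternate between a downstream-facing port and the upstream-facing port (or endpoint) at the other end of its link.

```python
# Path from the Root Port to the NIC, in the order listed above.
# Even indices are downstream-facing ports; the next entry is the
# upstream-facing port or endpoint at the other end of that link.
PATH = ["00:1d.0", "03:00.0", "04:01.0", "06:00.0", "07:04.0", "38:00.0"]


def link_partner_below(port: str) -> str:
    """Return the device on the other end of the link below `port`."""
    i = PATH.index(port)
    if i % 2 != 0:
        raise ValueError(f"{port} is not a downstream-facing port")
    return PATH[i + 1]


print(link_partner_below("00:1d.0"))  # 03:00.0
print(link_partner_below("04:01.0"))  # 06:00.0
```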

IIUC, the PTM link protocol only involves the two components on a single link.  In other words, a PTM request from 38:00.0 is not forwarded all the way to the Root Port.  38:00.0 and 07:04.0 trade request/response messages, 06:00.0 and 04:01.0 trade their own request/response messages, and 03:00.0 and 00:1d.0 trade their own.  These are all separate conversations that might happen to be close in time.

00:1d.0 logged a UR, so it had PTM disabled when it received a PTM Request from 03:00.0, which must have had PTM enabled.

04:01.0 also logged a UR, so it had PTM disabled when it received a PTM Request from 06:00.0.  The PTM Capability in 03:00.0 controls PTM for the entire switch, so 03:00.0 must have had PTM *disabled*.

All AER interrupts come from the Root Port, so we got one interrupt from 00:1d.0.  Then I think we traversed the hierarchy below 00:1d.0 searching AER Capabilities for any logged errors, but we don't know the ordering of the 00:1d.0 UR versus the 04:01.0 UR.

It seems possible that PTM is being enabled in the wrong order.  I think we would see the comment #1 dmesg logging if we had this sequence:

  - PTM disabled in all devices
  - Software enables 06:00.0 PTM
  - 06:00.0 sends PTM Request to 04:01.0
  - 04:01.0 logs UR error because it has PTM disabled
  - Software enables 03:00.0 PTM
  - 03:00.0 sends PTM Request to 00:1d.0
  - 00:1d.0 logs UR error because it has PTM disabled
  - 00:1d.0 generates AER interrupt
  - AER handler finds both UR errors logged
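
The hypothesized sequence can be sketched as a toy Python model. This is not kernel code: LINKS and simulate are made-up names, and the model assumes a requester sends one PTM Request the moment software enables it, and that a switch's Downstream Ports follow the PTM Enable bit of the switch's Upstream Port, as described above.

```python
# requester -> (responder on the other end of the link,
#               function whose PTM Enable governs that responder)
LINKS = {
    "03:00.0": ("00:1d.0", "00:1d.0"),
    "06:00.0": ("04:01.0", "03:00.0"),
    "38:00.0": ("07:04.0", "06:00.0"),
}


def simulate(enable_order: list[str]) -> list[str]:
    """Return the ports that would log UnsupReq for a given enable order."""
    enabled = set()
    ur_logged = []
    for dev in enable_order:
        enabled.add(dev)
        if dev in LINKS:
            responder, control = LINKS[dev]
            # Toy assumption: the request goes out immediately on enable.
            if control not in enabled:
                ur_logged.append(responder)
    return ur_logged


# Bottom-up order, as in the sequence above: both URs appear.
print(simulate(["06:00.0", "03:00.0", "00:1d.0"]))  # ['04:01.0', '00:1d.0']
# Top-down order: no responder is caught with PTM disabled.
print(simulate(["00:1d.0", "03:00.0", "06:00.0", "38:00.0"]))  # []
```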
Comment 10 Kai-Heng Feng 2023-07-17 07:51:02 UTC
Is the following scenario possible:
  - PTM disabled in all devices
  - System suspends to S3
  - System resumes from S3
  - The Thunderbolt switch, I225 NIC, etc. get power cycled
  - 06:00.0 sends PTM Request to 04:01.0
  - 04:01.0 logs UR error because PTM is still disabled
  - 03:00.0 sends PTM Request to 00:1d.0
  - 00:1d.0 logs UR error because PTM is still disabled
  - 00:1d.0 generates AER interrupt
  - AER handler finds both UR errors logged
Comment 11 Bjorn Helgaas 2023-07-17 15:51:33 UTC
Which devices are in the dock?  Obviously 00:1d.0 is in the laptop.  What about the switch with 03:00.0 and 04:xx.x?
Comment 12 Kai-Heng Feng 2023-07-21 02:08:45 UTC
Created attachment 304673 [details]
lspci without dock
Comment 13 Kai-Heng Feng 2023-07-21 02:09:07 UTC
Created attachment 304674 [details]
lspci with dock
Comment 14 Kai-Heng Feng 2023-07-21 02:13:25 UTC
So 03:00.0 and 04:xx.x are in the laptop, and the following are in the dock:
> 06:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:00.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:01.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:02.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:03.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 07:04.0 PCI bridge: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] (rev 03)
> 38:00.0 Ethernet controller: Intel Corporation Ethernet Controller (2) I225-LMvP (rev 03)
