Bug 214259

Summary: Discrete Thunderbolt Controller 8086:1137 throws DMAR and XHCI errors and is only partially functional
Product: Drivers Reporter: Werner Sembach [TUXEDO] (wse)
Component: PCI Assignee: drivers_pci (drivers_pci)
Status: REOPENED ---    
Severity: normal CC: bjorn, calvin.walton, dion, jwrdegoede, kjhambrick, mika.westerberg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.13.12 and 5.14-rc7 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmsg of boot without tb dock connected
dmsg after connecting tb dock
dmsg of boot with tb dock connected
lspci of boot without tb dock connected
lspci after connecting tb dock
lspci of boot with tb dock connected
dmsg of boot without tb dock connected (5.13.12)
dmsg after connecting tb dock (5.13.12)
dmsg of boot with tb dock connected (5.13.12)
lspci of boot without tb dock connected (5.13.12)
lspci after connecting tb dock (5.13.12)
lspci of boot with tb dock connected (5.13.12)
Linux 5.15.77 with Discrete Thunderbolt Enabled ( no patch )
This is 5.15.77 with your patch applied
Linux 5.15.77 with Discrete Thunderbolt ( no patch )
5.15.77 with your Patch with Discrete Thunderbolt Disabled
lspci -vvv for my Sager NP9672M / Clevo X170KM-G
dmesg after boot (gui did not load but that seems an unrelated bug because rc)
connecting a dock after boot, worked just fine
Clevo X170KM-G ACPI Dump BIOS 1.07.09RTR6
intel_iommu=off iommu=off
intel_iommu=off iommu=off after attaching thunderbolt dock
dmesg after boot
lspci after boot
dmesg tbt dock disconnected
lspci tbt dock disconnected
dmesg tbt dock reconnected
lspci tbt dock reconnected
Anker Apex, 12-in-1, Thunderbolt 4 dmesg and lspci output
Same tests with pci=noaer
blacklist thunderbolt ; boot unplugged
Add debugging to ICM startup
dmesg with debug output patch and no dock connected
dmesg with debug output patch and dock connected during boot
no dock no iommu
dock during boot no iommu
More retries with 50ms timeout for ICM driver ready
6.1.16 patched with attachment 303919
TBolt 4 Plugged at boot then Unplugged then Replugged
new patch, no dock during boot
new patch, dock during boot
ASUS TUF GAMING B550-PLUS discrete Maple Ridge dmesg
Workaround for IOMMU faults on certain systems (updated)
Thunderbolt unplugged at boot - 6.1.46 with 0001-thunderbolt-Workaround-an-IOMMU-fault-on-certain-sys.patch
6.1.46 with your latest patches and thunderbolt plugged at boot
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-g8ff95b7fc9f5 (304888)
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-gb09f7209b080 (303919)
Workaround for IOMMU faults on certain systems (v2)
non working state (kernel 6.2)
working state (kernel 6.5 with latest workaround applied)
6.1.47 + Aug 22 Patch - TB unplugged at boot
6.1.47 + Mika Aug 22 Patch ; TB Plugged at boot -- OOPS
6.1.47 with the Aug 18 Patch
dmesg for Linux 6.1.52 with Mika's Sep 12 Patch cold boot ; plugged
Linux 6.1.53 + Mika Sep 12 Patch cold boot plugged
reduce timeout to 250ms
6.1.65 - no patch
6.1.66 with Werner's 250 ms patch

Description Werner Sembach [TUXEDO] 2021-09-01 13:44:11 UTC
Affected devices: Clevo X170KM Barebone (I have one here) and, according to this reddit thread that describes the exact same problem, a Thunderbolt PCIe expansion card: https://www.reddit.com/r/Thunderbolt/comments/ohjakr/asus_thunderboltex_4_linux_compatability/

The Clevo does not use the built-in Thunderbolt controller of the CPU but a discrete Thunderbolt controller, which seems to be the exact same one as on that expansion card, with the PCI ID 8086:1137.

High-level problem description: I have a Thunderbolt dock with DP-out, USB ports and an Ethernet port. When I plug it in, only the DP port works. When it's plugged in before boot, Ethernet also works. The USB ports on the dock never work.

dmesg shows several errors regarding DMAR and xhci. Since the DMAR errors show up first, I suspect them to be the root cause that makes everything afterwards fail as well.

The error is slightly different between 5.13 and 5.14

5.14-rc7:
[    3.148557] DMAR: DRHD: handling fault status reg 2
[    3.148561] DMAR: [DMA Write NO_PASID] Request device [0x04:0x00.0] fault addr 0x69974000 [fault reason 0x05] PTE Write access is not set

5.13.12:
[    3.737783] DMAR: DRHD: handling fault status reg 2
[    3.737790] DMAR: [DMA Write] Request device [04:00.0] PASID ffffffff fault addr 69974000 [fault reason 05] PTE Write access is not set

04:00.0 is the Thunderbolt controller:
04:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137] (prog-if 40 [USB4 Host Interface])
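
The two fault lines above carry the same information in slightly different layouts. As a minimal sketch (the regex and the `parse_dmar_fault` helper are illustrative assumptions, written only against the two lines quoted in this report, not against every format the kernel can emit), both variants can be normalized like this:

```python
import re

# Written against the two layouts quoted above:
#   5.14: [DMA Write NO_PASID] Request device [0x04:0x00.0] fault addr 0x69974000 [fault reason 0x05]
#   5.13: [DMA Write] Request device [04:00.0] PASID ffffffff fault addr 69974000 [fault reason 05]
DMAR_FAULT = re.compile(
    r"DMAR: \[(?P<op>DMA (?:Read|Write))[^\]]*\] "
    r"Request device \[(?:0x)?(?P<bus>[0-9a-f]{2}):(?:0x)?(?P<dev>[0-9a-f]{2})\.(?P<fn>[0-7])\]"
    r".*?fault addr (?:0x)?(?P<addr>[0-9a-f]+) "
    r"\[fault reason (?:0x)?(?P<reason>[0-9a-f]+)\]",
    re.IGNORECASE,
)


def parse_dmar_fault(line):
    """Return a dict describing a DMAR fault line, or None if the line is not one."""
    m = DMAR_FAULT.search(line)
    if m is None:
        return None
    return {
        "op": m["op"],
        # Normalized to the bus:device.function form that lspci prints.
        "device": f"{m['bus']}:{m['dev']}.{m['fn']}".lower(),
        "addr": int(m["addr"], 16),
        "reason": int(m["reason"], 16),
    }
```

With this, the 5.13.12 and 5.14-rc7 lines parse to the same device (04:00.0), fault address and fault reason, which makes it easier to compare logs across kernel versions.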
Comment 1 Werner Sembach [TUXEDO] 2021-09-01 13:45:20 UTC
Created attachment 298567 [details]
dmsg of boot without tb dock connected
Comment 2 Werner Sembach [TUXEDO] 2021-09-01 13:45:58 UTC
Created attachment 298569 [details]
dmsg after connecting tb dock
Comment 3 Werner Sembach [TUXEDO] 2021-09-01 13:46:41 UTC
Created attachment 298571 [details]
dmsg of boot with tb dock connected
Comment 4 Werner Sembach [TUXEDO] 2021-09-01 13:47:45 UTC
Created attachment 298573 [details]
lspci of boot without tb dock connected
Comment 5 Werner Sembach [TUXEDO] 2021-09-01 13:48:30 UTC
Created attachment 298575 [details]
lspci after connecting tb dock
Comment 6 Werner Sembach [TUXEDO] 2021-09-01 13:48:57 UTC
Created attachment 298577 [details]
lspci of boot with tb dock connected
Comment 7 Werner Sembach [TUXEDO] 2021-09-01 13:50:01 UTC
These logs are all for kernel 5.14-rc7
Comment 8 Werner Sembach [TUXEDO] 2021-09-01 13:52:01 UTC
Created attachment 298579 [details]
dmsg of boot without tb dock connected (5.13.12)
Comment 9 Werner Sembach [TUXEDO] 2021-09-01 13:52:22 UTC
Created attachment 298581 [details]
dmsg after connecting tb dock (5.13.12)
Comment 10 Werner Sembach [TUXEDO] 2021-09-01 13:52:43 UTC
Created attachment 298583 [details]
dmsg of boot with tb dock connected (5.13.12)
Comment 11 Werner Sembach [TUXEDO] 2021-09-01 13:53:12 UTC
Created attachment 298585 [details]
lspci of boot without tb dock connected (5.13.12)
Comment 12 Werner Sembach [TUXEDO] 2021-09-01 13:53:42 UTC
Created attachment 298587 [details]
lspci after connecting tb dock (5.13.12)
Comment 13 Werner Sembach [TUXEDO] 2021-09-01 13:53:59 UTC
Created attachment 298589 [details]
lspci of boot with tb dock connected (5.13.12)
Comment 14 Werner Sembach [TUXEDO] 2021-09-01 17:40:00 UTC
New info: The intel_iommu=off boot flag makes the DMAR errors go away. However, the xhci errors stay and the USB ports of the dock are still dysfunctional.
So they are separate issues after all.
Comment 15 Werner Sembach [TUXEDO] 2021-09-28 16:24:03 UTC
For reference: A reddit thread discussing the discrete Asus Thunderbolt PCIe card failing in the exact same way: https://www.reddit.com/r/Thunderbolt/comments/ohjakr/asus_thunderboltex_4_linux_compatability/
Comment 16 Werner Sembach [TUXEDO] 2021-10-08 08:32:37 UTC
Found a preexisting hack, originally for tb3, that also fixes the issue on this tb4 controller: https://bugzilla.kernel.org/show_bug.cgi?id=206459#c59
Comment 17 Werner Sembach [TUXEDO] 2021-10-08 08:33:24 UTC

*** This bug has been marked as a duplicate of bug 206459 ***
Comment 18 Hans de Goede 2022-05-19 18:32:52 UTC
Thank you for your bug report.

I've prepared a patch series fixing bug 206459 as well as this bug:

https://lore.kernel.org/linux-pci/20220519152150.6135-1-hdegoede@redhat.com/T/#t

This series is using DMI matching to identify affected systems and to enable the workaround only on affected systems.

I've used DMI_MATCH(DMI_BOARD_NAME, "X170KM-G") as match for this Clevo Barebone.

Can you confirm that:

cat /sys/class/dmi/id/board_name

outputs "X170KM-G" ?

Or even better, give this patch series a try ? Note the series is based on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource
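
For anyone who wants to check a machine from userspace before building a kernel, here is a rough sketch of the same match. The `AFFECTED_BOARDS` list and both helper names are assumptions made for illustration; the only board name taken from this report is "X170KM-G". The substring comparison mirrors how the kernel's DMI_MATCH compares strings (DMI_EXACT_MATCH would require equality instead).

```python
from pathlib import Path

# Hypothetical list for illustration; "X170KM-G" is the only name from this report.
AFFECTED_BOARDS = ("X170KM-G",)


def read_board_name(path="/sys/class/dmi/id/board_name"):
    """Read the DMI board name the same way `cat` would, minus the trailing newline."""
    return Path(path).read_text().strip()


def is_affected(board_name, patterns=AFFECTED_BOARDS):
    """Substring match, like DMI_MATCH: the pattern only has to occur in the name."""
    return any(pattern in board_name for pattern in patterns)


if __name__ == "__main__":
    name = read_board_name()
    print(f"{name}: {'affected' if is_affected(name) else 'not in list'}")
```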
Comment 19 Werner Sembach [TUXEDO] 2022-05-23 13:25:17 UTC
Thank you for the patch.

Yes, "X170KM-G" is the exact match for /sys/class/dmi/id/board_name on the affected device.

Kernel with patch series is compiling atm. Will add another post on whether or not it worked.
Comment 20 Werner Sembach [TUXEDO] 2022-05-23 16:33:28 UTC
Successfully tested the patchset: Works like a charm.
Comment 21 Konrad J Hambrick 2022-07-07 21:13:16 UTC
Hans --

I have exactly the same Hardware and when I Enable my Discrete ThunderBolt in the BIOS, I experience a series of TimeOut delays during Boot.

I am running 5.15.53 on Slackware64 15.0 ...

I don't have a Dock for my LapTop so I am not sure this bug is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=206459

Q1: Is your Patch Series still viable ?

If so, your second link above for the patch series no longer works:  
https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource

Q2:  Was this patch ever included in the kernel source tree ?

If so, what version should I try ?

Thanks !

-- kjh
Comment 22 Konrad J Hambrick 2022-07-07 21:16:50 UTC
(In reply to Konrad J Hambrick from comment #21)
> Hans --
> 
> I have exactly the same Hardware and when I Enable my Discrete ThunderBolt
> in the BIOS, I experience a series of TimeOut delays during Boot.
> 
> I am running 5.15.53 on Slackware64 15.0 ...
> 
> I don't have a Dock for my LapTop so I am not sure this bug is a duplicate
> of https://bugzilla.kernel.org/show_bug.cgi?id=206459
> 
> Q1: Is your Patch Series still viable ?
> 
> If so, your second link above for the patch series no longer works:  
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/
> resource
> 
> Q2:  Was this patch ever included in the kernel source tree ?
> 
> If so, what version should I try ?
> 
> Thanks !
> 
> -- kjh

p.s. this is my dmesg with the Discrete Thunderbolt Enabled:

[    9.359600] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio2/input/input15
[   27.953327] DMAR: DRHD: handling fault status reg 2
[   27.959653] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   48.433330] DMAR: DRHD: handling fault status reg 2
[   48.438777] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   68.913306] DMAR: DRHD: handling fault status reg 2
[   68.919850] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   89.387726] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
[   89.394835] thunderbolt: probe of 0000:07:00.0 failed with error -110
Comment 23 Werner Sembach [TUXEDO] 2022-07-28 16:32:51 UTC
(In reply to Konrad J Hambrick from comment #21)
> Hans --
> 
> I have exactly the same Hardware and when I Enable my Discrete ThunderBolt
> in the BIOS, I experience a series of TimeOut delays during Boot.
> 
> I am running 5.15.53 on Slackware64 15.0 ...
> 
> I don't have a Dock for my LapTop so I am not sure this bug is a duplicate
> of https://bugzilla.kernel.org/show_bug.cgi?id=206459
> 
> Q1: Is your Patch Series still viable ?
> 
> If so, your second link above for the patch series no longer works:  
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/
> resource
> 
> Q2:  Was this patch ever included in the kernel source tree ?
> 
> If so, what version should I try ?
> 
> Thanks !
> 
> -- kjh

The series got merged in mainline: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=Add+kernel+cmdline+options+to+use%2Fignore+E820+reserved+regions

the kernel.org cgit is always a good option to search if a patch got applied in the end

Also don't forget to check linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=grep&q=Add+kernel+cmdline+options+to+use%2Fignore+E820+reserved+regions

or subsystem-specific branches like drm-tip (on the freedesktop cgit: https://cgit.freedesktop.org/)
Comment 24 Konrad J Hambrick 2022-07-28 20:56:32 UTC
Thanks for the info and the tip, wse !

-- kjh
Comment 25 Bjorn Helgaas 2022-11-03 15:45:11 UTC
I don't know the connection to the DMAR faults, but from the first log (https://bugzilla.kernel.org/attachment.cgi?id=298567):

  BIOS-e820: [mem 0x000000006bc00000-0x00000000efffffff] reserved
  pci_bus 0000:00: root bus resource [mem 0x71000000-0xdfffffff window]

This entire PCI host bridge aperture is "reserved" in the E820 map, which means we won't allocate any PCI BARs in that area, which means hot-add won't work.
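
To make the arithmetic explicit, here is a simplified model of that clipping, using the two ranges from the log above. The `clip` helper is an illustrative assumption and not the kernel's actual implementation; it only shows that nothing of the window is left once the reserved range is removed.

```python
def clip(window, reserved):
    """Return the parts of `window` not covered by `reserved`.

    Both arguments are (start, end) tuples with inclusive ends, matching the
    way the ranges are printed in dmesg.
    """
    w_start, w_end = window
    r_start, r_end = reserved
    if r_end < w_start or r_start > w_end:
        return [window]  # no overlap, window is untouched
    remaining = []
    if r_start > w_start:
        remaining.append((w_start, r_start - 1))
    if r_end < w_end:
        remaining.append((r_end + 1, w_end))
    return remaining


e820_reserved = (0x6BC00000, 0xEFFFFFFF)       # BIOS-e820 "reserved" entry
host_bridge_window = (0x71000000, 0xDFFFFFFF)  # root bus resource window

window_mib = (host_bridge_window[1] - host_bridge_window[0] + 1) // (1 << 20)
print(f"window size: {window_mib} MiB")
print(f"usable after clipping: {clip(host_bridge_window, e820_reserved)}")
```

The window lies completely inside the reserved entry, so the list of usable ranges comes out empty.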

The current workaround for this is https://git.kernel.org/linus/d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks"), which appeared in v5.19.

I think the underlying issue is that this machine has EFI, Linux converts the EFI memory map to E820 format, and it converts EFI_MEMORY_MAPPED_IO to E820_TYPE_RESERVED.  EFI_MEMORY_MAPPED_IO means "the OS must map this memory for use by EFI runtime services."  It does *not* mean "the OS can never use this memory."  I think Linux should omit EFI_MEMORY_MAPPED_IO areas completely from the E820 map.

This is basically the same issue as bug #216565.  I attached a patch there to omit EFI_MEMORY_MAPPED_IO.

I would love to hear from anybody with a Clevo machine that shows similar problems.  If you can, please boot with the patch at https://bugzilla.kernel.org/attachment.cgi?id=303123 with the "efi=debug" kernel parameter, open new bugzilla with the complete dmesg log, and assign it to me.
Comment 26 Konrad J Hambrick 2022-11-03 17:19:35 UTC
Bjorn --

The referenced efi_mmio-2 Patch applied cleanly on 5.15.77 and I am building a 5.15.77_bjorn_p Kernel now.

I'll have to wait until after work to install and boot but I'll enable Discrete Thunderbolt in my Insyde EFI Bios and boot 5.15.bjorn_p

Thank You !

-- kjh
Comment 27 Konrad J Hambrick 2022-11-04 10:16:14 UTC
Created attachment 303131 [details]
Linux 5.15.77 with Discrete Thunderbolt Enabled ( no patch )

Bjorn --

This dmesg for linux 5.15.77 without your Patch.

When I enable 'Discrete Thunderbolt' in the Insyde BIOS there is a long delay for each of the three DMAR Errors and a final delay for Thunderbolt ( see time stamps [11.534560] thru [88.380241] )
Comment 28 Konrad J Hambrick 2022-11-04 10:21:55 UTC
Created attachment 303132 [details]
This is 5.15.77 with your patch applied

This is linux 5.15.77 with your patch applied.

There are extra log entries before event 'do_add_efi_memmap' at timestamp [0.000000]

The same three DMAR faults and timeouts still happen between timestamps [26.938364] and [88.372760] and there is still a Discrete Thunderbolt error at [88.378716]
Comment 29 Konrad J Hambrick 2022-11-04 10:42:48 UTC
Created attachment 303133 [details]
Linux 5.15.77 with Discrete Thunderbolt ( no patch )

Bjorn --

For reference, this is dmesg for 5.15.77 with Discrete Thunderbolt disabled and without your Patch
Comment 30 Konrad J Hambrick 2022-11-04 10:44:37 UTC
Created attachment 303134 [details]
5.15.77 with your Patch with Discrete Thunderbolt Disabled

And again for reference, this is 5.15.77 with your patch with Discrete Thunderbolt Disabled in the INSYDE BIOS
Comment 31 Bjorn Helgaas 2022-11-04 21:13:04 UTC
Thank you very much for all this work, Konrad!

I'm confused.  Comments #16 and #17 suggest that this DMAR issue was fixed by https://git.kernel.org/linus/d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks"), which appeared in v5.19.  Can you confirm that?
Comment 32 Konrad J Hambrick 2022-11-05 18:28:48 UTC
Bjorn --

You're welcome.  

I was unable to successfully apply those patches against 5.15.y and then I believe the patches were reverted or partially reverted in 5.19.y.

I will consider running a newer Long-Term Kernel when it is announced next month, especially if it fixes Discrete Thunderbolt and especially since 5.15.y will go EoL in Oct 2023.

Thanks for all you do for the Linux Kernel, Bjorn !

-- kjh
Comment 33 Werner Sembach [TUXEDO] 2022-11-07 13:15:13 UTC
I think the long delay is a different issue than the e820 region patch and has more to do with this:

[   88.372827] thunderbolt 0000:07:00.0: failed to send driver ready to ICM

the 20s delay between the 4 retries comes from this:

https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1024

or this

https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1630

respectively.

Having the icm not correctly initialized seems to be no problem however? I read somewhere that there is a software fallback for it. But I don't really know what the icm actually does in the first place xD.
Comment 34 Werner Sembach [TUXEDO] 2022-11-07 13:16:13 UTC
But I think we should probably open another bug report for this.
Comment 36 Konrad J Hambrick 2022-11-07 14:31:48 UTC
Thanks for all the info wse@tuxedocomputers.com

The comments in Hans de Goede's three patches look familiar but I'll try all three patches against 5.15.y again.

Or should I be running a different Kernel Version ?

The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...

I'll test next weekend when I've got time to build ; boot ; dmesg ; repeat 

Then I will open a new bug report.

Thank you and thank you and Tuxedo Computers for the tuxedo-keyboard-master and tuxedo-keyboard-ite kernel modules ( they work great for me ) !

-- kjh
Comment 37 Bjorn Helgaas 2022-11-07 16:15:28 UTC
Regarding the delay, i.e., this from Konrad's attachment at comment 27:

[    6.837625] DMAR: DRHD: handling fault status reg 2
[    6.837629] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   26.937996] DMAR: DRHD: handling fault status reg 2
[   26.944592] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   47.417982] DMAR: DRHD: handling fault status reg 2
[   47.424711] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set 
[   67.897999] DMAR: DRHD: handling fault status reg 2
[   67.904639] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[   88.372827] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
[   88.380241] thunderbolt: probe of 0000:07:00.0 failed with error -110

This all involves the same device (07:00.0) and it's all likely related, so I think we should reserve *this* bug report for this DMAR/timeout issue.

As far as I can tell, this is unrelated to bug 206459 and the E820 issues addressed by Hans de Goede's patches, so I'm going to reopen this bug report.

If we're lucky, maybe the DMAR/timeout stuff has been fixed by something between v5.15 and v6.0.
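
For reference, the spacing of those events can be pulled out of the log mechanically. This is only a sketch; the `timestamps` and `gaps` helpers are made up for illustration, and the embedded log is an abbreviated copy of the excerpt above.

```python
import re

TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamps(lines, needle):
    """Collect the dmesg timestamps (in seconds) of all lines containing `needle`."""
    found = []
    for line in lines:
        m = TIMESTAMP.match(line)
        if m and needle in line:
            found.append(float(m.group(1)))
    return found


def gaps(values):
    """Differences between consecutive timestamps, rounded to milliseconds."""
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


log = """\
[    6.837625] DMAR: DRHD: handling fault status reg 2
[   26.937996] DMAR: DRHD: handling fault status reg 2
[   47.417982] DMAR: DRHD: handling fault status reg 2
[   67.897999] DMAR: DRHD: handling fault status reg 2
[   88.372827] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
""".splitlines()

events = timestamps(log, "handling fault status") + timestamps(log, "failed to send driver ready")
print(gaps(events))
```

Each of the four gaps comes out a little over 20 seconds, so the whole sequence costs roughly 80 seconds of boot time.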
Comment 38 Werner Sembach [TUXEDO] 2022-11-07 16:34:09 UTC
(In reply to Konrad J Hambrick from comment #36)
> Thanks for all the info wse@tuxedocomputers.com
> 
> The comments in Hans de Goede's three patches look familiar but I'll try all
> three patches against 5.15.y again.
> 
> Or should I be running a different Kernel Version ?
> 
> The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...
> 
> I'll test next weekend when I've got time to build ; boot ; dmesg ; repeat 
> 
> Then I will open a new bug report.
> 
> Thank you and thank you and Tuxedo Computers for the tuxedo-keyboard-master
> and tuxedo-keyboard-ite kernel modules ( they work great for me ) !
> 
> -- kjh

the two lines are the same in 5.15.77

https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1024

https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1630
Comment 39 Konrad J Hambrick 2022-11-07 16:52:32 UTC
Created attachment 303142 [details]
lspci -vvv for my Sager NP9672M / Clevo X170KM-G

Thanks Bjorn

Attached is lspci -vvv on the unpatched 5.15.77 Kernel if it helps.

It is large-ish but more may be better than less.

Should I try Linux 6.0.7 or maybe even 6.1.rc3 ?

-- kjh

# uname -a

Linux kjhlt7.kjh.home 5.15.77.kjh #1 SMP PREEMPT Thu Nov 3 10:27:53 CDT 2022 x86_64 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz GenuineIntel GNU/Linux
Comment 40 Konrad J Hambrick 2022-11-07 16:55:37 UTC
(In reply to wse from comment #38)
> (In reply to Konrad J Hambrick from comment #36)
> > Thanks for all the info wse@tuxedocomputers.com
> > 
> > The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...
> > 
> 
> the two lines are the same in 5.15.77
> 
> https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.
> c#L1024
> 
> https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.
> c#L1630

Thanks wse@tuxedocomputers.com 

See Bjorn's post -- I'll wait and see what he finds.

-- kjh
Comment 41 Werner Sembach [TUXEDO] 2022-11-07 17:40:46 UTC
Just tested 6.1-rc4 (from ubuntu mainline ppa)

Still same behaviour: 4*20s delay during boot, but works just fine afterwards.
Comment 42 Werner Sembach [TUXEDO] 2022-11-07 17:44:54 UTC
Created attachment 303144 [details]
dmesg after boot (gui did not load but that seems an unrelated bug because rc)
Comment 43 Werner Sembach [TUXEDO] 2022-11-07 17:45:24 UTC
Created attachment 303145 [details]
connecting a dock after boot, worked just fine
Comment 44 Bjorn Helgaas 2022-11-07 17:47:24 UTC
Reassigning to PCI because I don't think this is a USB issue.
Comment 45 Mika Westerberg 2022-11-08 06:20:38 UTC
The 20s timeouts happen because the IOMMU faults prevent the driver from doing DMA so first we need to figure out why is that happening. I have seen similar but they typically were BIOS issues related to ACPI DMAR table setup so I suggest people first to find out whether there is a BIOS upgrade and try that one. In addition can someone attach acpidump from such system so we can look at the DMAR table?
Comment 46 Werner Sembach [TUXEDO] 2022-11-08 11:05:12 UTC
Created attachment 303147 [details]
Clevo X170KM-G ACPI Dump BIOS 1.07.09RTR6
Comment 47 Werner Sembach [TUXEDO] 2022-11-08 11:09:23 UTC
Thanks for looking into this.

ACPI Dump is attached.

The BIOS is the newest one I could find (There is also a RTR7-G2 version, but that one is identical to RTR6 except for a slower RAM clock speed for compatibility reasons).
Comment 48 Bjorn Helgaas 2022-11-08 13:24:13 UTC
Tangent: *almost* all the info in the dmar.dat is already in the dmesg log.  Maybe the dmesg logging could be tightened up a bit (four copies of the base address, random "0x" vs not, etc) and any missing bits added?  It would be really nice if we could connect this to the PCI hierarchy, but that part seems missing.

  DMAR: DRHD base: 0x000000fed91000 flags: 0x1
  DMAR: dmar0: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
  DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 0
  DMAR-IR: HPET id 0 under DRHD base 0xfed91000
  DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
  DMAR-IR: Enabled IRQ remapping in x2apic mode
Comment 49 Werner Sembach [TUXEDO] 2022-11-08 13:28:48 UTC
note that when setting intel_iommu=off and iommu=off the dmar errors disappear, but the delay and the icm error stay. Dock still functional
dmesg attached
Comment 50 Werner Sembach [TUXEDO] 2022-11-08 13:29:21 UTC
Created attachment 303148 [details]
intel_iommu=off iommu=off
Comment 51 Werner Sembach [TUXEDO] 2022-11-08 13:30:10 UTC
Created attachment 303149 [details]
intel_iommu=off iommu=off after attaching thunderbolt dock
Comment 52 Mika Westerberg 2022-11-08 13:42:02 UTC
Hi, Okay I see now in the dmesg that even if IOMMU is turned off the driver fails to communicate with the TBT firmware. This is unexpected. Have you tried Windows on that system and if yes, does the TBT driver work? The workaround for this is to blacklist the driver but that also prevents power management (which probably does not matter because it is not working anyway). One additional suggestion is to check if there is TBT firmware upgrade available for that system.
Comment 53 Werner Sembach [TUXEDO] 2022-11-08 15:03:48 UTC
Thunderbolt works on both windows and linux, at least the basic stuff like usb devices, audio, network, screens attached to a thunderbolt only dock.

I'm not familiar enough with windows to know where to look for this icm error if it happens there too.

We already reached out to clevo, but the firmware is already the latest they offer for the tbt chip in this barebone (something ending in .29 if I remember correctly).

Is there some generic intel firmware update(r) I can try? didn't find one when I searched a while back.
Comment 54 Mika Westerberg 2022-11-08 15:19:30 UTC
There is no generic firmware unfortunately.

You should be able to see in Windows device manager that after a while (when there is nothing connected) the PCIe root port that has the Thunderbolt controller connected is in D3 Power State. If that's the case then the Windows side of the driver is working as it enters low power states.
Comment 55 Konrad J Hambrick 2022-11-08 15:46:56 UTC
Mika --

I've still got a bootable, up-to-date Win11 Partition but like wse, I am not sure
what info to gather for comparison.

I've also got an acpidump if you want it for my slightly different Sager NP9672M-G1 ( a rebranded CLEVO X170KM-G ).

Thank you very much for looking into this :)

-- kjh
Comment 56 Werner Sembach [TUXEDO] 2022-11-08 18:29:58 UTC
(In reply to Mika Westerberg from comment #54)
> There is no generic firmware unfortunately.
> 
> You should be able to see in Windows device manager that after a while (when
> there is nothing connected) the PCIe root port that has the Thunderbolt
> controller connected is in D3 Power State. If that's the case then the
> Windows side of the driver is working as it enters low power states.

I assume you mean: System Devices -> Thunderbolt(tm) Controller - 1137 -> Properties -> Details -> Power data Current power state

Strange behaviour: When I freshly installed windows 11 with only the thunderbolt driver, device manager was showing me constant D3 there, even when using a thunderbolt dock. I then proceeded to install all the other drivers for the device (and windows update was doing stuff in the background), now it's in a constant D0 state.
Comment 57 Mika Westerberg 2022-11-09 06:42:24 UTC
Yes correct. If you don't plug in any devices and wait, does it then enter D3?
Comment 58 Werner Sembach [TUXEDO] 2022-11-09 12:24:18 UTC
No.

Vanilla Windows Install -> The Thunderbolt controller does not show up in the device manager (not under a recognizable name at least), but the thunderbolt only docking station still seems to be functional?

Installing the "Thunderbolt" driver from the Clevo driver package on this Vanilla Windows install -> A thunderbolt controller shows up in the device manager. But regardless of if a dock is plugged in or not it shows as D3 power state. Dock functions correctly however.

Installing the rest of the clevo driver package (including a driver labeled "Chipset") -> The thunderbolt device is now permanently in D0 power state.
Comment 59 Werner Sembach [TUXEDO] 2022-11-09 12:26:11 UTC
Here are the drivers for reference: https://mytuxedo.de/index.php/s/ZeB8FTf8CrpEtJr?path=%2FXUX_Series%2FXUX7_Gen13
Comment 60 Mika Westerberg 2022-11-09 12:38:31 UTC
Okay so it is not entirely clear whether the Windows "driver" works any better than the Linux one (I'm not too familiar with the Windows side of things but my understanding is that the driver does the exact same flow to establish communication with the firmware).

The reason why the "dock is functional" is that on these configurations there is really not much the Thunderbolt driver needs to do, so all the PCIe/USB3/DP tunneling is done in the firmware. The only thing the Thunderbolt driver needs to do is make the controller enter D3 when idle to allow the whole Thunderbolt add-in-card to enter low power state.

It could also be that the "Chipset" package provides some driver to a device on that dock that does not support Windows version of runtime PM so it basically keeps the thing in D0.
Comment 61 Konrad J Hambrick 2022-11-09 14:15:29 UTC
wse --

I am curious.

Does your Thunderbolt Dock work if you disable Discrete Thunderbolt in the INSYDE BIOS > Advanced Screen ?

The reason I wonder is when I disable Discrete Thunderbolt in the BIOS, there are no DMAR: DRHD: handling fault status' timeout delays but I've still got  Thunderbolt Driver Modules.

I don't have a Dock so I can't test that but my Thunderbolt External SSD does work.


Mika --

I 'found' a windows command ( msinfo32 ) that can print all resources as an XML File.

C:\> %CommonProgramFiles%\Microsoft Shared\MSInfo\msinfo32.exe /nfo c:\temp\msinfo32-ascii.xml

I enabled Discrete Thunderbolt, booted Win11 and dumped the output to a large .xml file ( 1.75 MB as Unicode / 0.88 MB with the null bytes stripped )

It does show Memory Resources for all Devices just about everything in the Device Manager Interface.

I can append the output if it would be useful.

HTH and thanks to all !

-- kjh
Comment 62 Werner Sembach [TUXEDO] 2022-11-09 15:30:56 UTC
Found the "culprit" by installing the drivers one by one:

The (Clevo) Control Center is what's causing the TBT Controller to stay in D0.

Clevo Control Center installed -> TBT always D0
Clevo Control Center not installed -> TBT always D3

Regardless of whether or not I have anything connected to the thunderbolt port or am actively using it.
Comment 63 Werner Sembach [TUXEDO] 2022-11-09 15:32:59 UTC
Since this is only a power saving feature, can we quirk it so that the kernel just skips the initialisation? To at least eliminate the long boot delay?
Comment 64 Werner Sembach [TUXEDO] 2022-11-09 16:04:46 UTC
Looking at the code maybe above here: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1968

(Pseudo code) Adding a "if (tb->root_switch->quirks & QUIRK_SKIP_ICM_INIT) {return 0;}" and adding this quirk to the quirktable here: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/quirks.c
Comment 65 Mika Westerberg 2022-11-10 06:11:47 UTC
You can disable the TBT driver also by blacklisting so adding "modprobe.blacklist=thunderbolt" in the kernel command line should work this around.
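
To verify after a reboot that the option actually reached the kernel, the command line can be inspected. This is a small sketch: both helper names are made up for illustration, and it does not handle quoted arguments on the command line.

```python
def blacklisted_modules(cmdline):
    """Return the module names given via modprobe.blacklist= on a kernel command line."""
    modules = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == "modprobe.blacklist":
            # The option takes a comma-separated list of module names.
            modules.update(name for name in value.split(",") if name)
    return modules


def thunderbolt_blacklisted(cmdline_path="/proc/cmdline"):
    """True if the running kernel was booted with thunderbolt blacklisted."""
    with open(cmdline_path) as f:
        return "thunderbolt" in blacklisted_modules(f.read())


if __name__ == "__main__":
    print("thunderbolt blacklisted:", thunderbolt_blacklisted())
```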

Is this only Clevo X170M system that has this issue? I will talk to our Windows driver folks if they have seen this.
Comment 66 Mika Westerberg 2022-11-10 06:38:56 UTC
BTW, does the xHCI work normally? If you plug in a USB 3.x device to the Thunderbolt Type-C ports, do they show up and are functional?
Comment 67 Werner Sembach [TUXEDO] 2022-11-10 12:41:41 UTC
(In reply to Mika Westerberg from comment #65)
> You can disable the TBT driver also by blacklisting so adding
> "modprobe.blacklist=thunderbolt" in the kernel command line should work this
> around.

thanks, with the driver blacklisted the delay is gone and the tbt dock is still functional
 
> Is this only Clevo X170M system that has this issue? I will talk to our
> Windows driver folks if they have seen this.

That Clevo is the only device with a discrete thunderbolt controller I have at hand. So I can't really say if it's the only one or not.

(In reply to Mika Westerberg from comment #66)
> BTW, does the xHCI work normally? If you plug in a USB 3.x device to the
> Thunderbolt Type-C ports, do they show up and are functional?

Yes, a USB 3.0 thumbdrive is correctly recognized as such both connected directly or through the tbt dock.
Comment 68 Werner Sembach [TUXEDO] 2022-11-10 12:53:52 UTC
(In reply to Konrad J Hambrick from comment #61)
> wse --
> 
> I am curious.
> 
> Does your Thunderbolt Dock work if you disable Discrete Thunderbolt in the
> INSYDE BIOS > Advanced Screen ?
> 
> The reason I wonder is when I disable Discrete Thunderbolt in the BIOS,
> there are no DMAR: DRHD: handling fault status' timeout delays but I've
> still got  Thunderbolt Driver Modules.
> 
> I don't have a Dock so I can't test that but my Thunderbolt External SSD
> does work.

With Discrete Thunderbolt disabled in the BIOS the ports are reduced to USB 2.0.
Comment 69 Werner Sembach [TUXEDO] 2022-11-10 12:54:32 UTC
so no: the tbt only dock does not work anymore
Comment 70 Konrad J Hambrick 2022-11-10 14:47:14 UTC
(In reply to wse from comment #69)
> so no: the tbt only dock does not work anymore

OMG !

I see now that my USB 3.2 Drive is connected as USB 2.0 !

Sorry about that rabbit hole.

OTOH, I believe I understand from the posts above that if I append modprobe.blacklist=thunderbolt to my Kernel Commandline that I can connect at USB 3.2 Speed ?

In addition, your TBT 4.0 Dock still works too ?

That's great news !

I need to wrap up some Source Code Edits for work then I'll set up the blacklist and reboot.

Thank you wse and Mika !

-- kjh
Comment 71 Werner Sembach [TUXEDO] 2022-11-10 18:28:21 UTC
it is a tbt 3 dock
Comment 72 Konrad J Hambrick 2022-11-11 11:19:55 UTC
My system works perfectly as far as I can tell in this config:

   1. Append modprobe.blacklist=thunderbolt to the kernel commandline
   2. Enable Discrete Thunderbolt in the INSYDE BIOS
   3. reboot -- no delays due to the DMAR/timeout issue.
   
My USB 3.2 drive works and I will order a TBT 4 Dock for further testing.

Thanks to all !

-- kjh

p.s. I updated my kernel to 5.15.78 last night at the same time.  I built it without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)

Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?

Thanks again to all !
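
For anyone reproducing the configuration in steps 1 to 3 above, here is a minimal sketch for confirming that the blacklist parameter really reached the kernel. The function name is made up for illustration; it only parses a command-line string such as the contents of /proc/cmdline and does not touch the driver.

```python
def blacklisted_modules(cmdline):
    """Return the set of module names given via modprobe.blacklist=
    on a kernel command line (comma-separated lists are accepted)."""
    modules = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == "modprobe.blacklist":
            # Several modules may be listed as mod1,mod2,...
            modules.update(name for name in value.split(",") if name)
    return modules
```

Typical use would be to read /proc/cmdline and check whether "thunderbolt" is in the returned set before drawing conclusions from a boot log.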
Comment 73 Konrad J Hambrick 2022-11-11 11:22:12 UTC
(In reply to Konrad J Hambrick from comment #72)
> My system works perfectly as far as I can tell in this config:
> 
>    1. Append modprobe.blacklist=thunderbolt to the kernel commandline
>    2. Enable Discrete Thunderbolt in the INSYDE BIOS
>    3. reboot -- no delays due to the DMAR/timeout issue.
 
Sorry ... Markdown ate my list.  Looked OK in preview mode.
Comment 74 Mika Westerberg 2022-11-11 11:32:22 UTC
Okay, good to know. However, the blacklist thing is just a workaround and it causes your system to draw more energy than it has to, because it keeps the whole Maple Ridge controller in D0 (powered on). The issue should still be root-caused and then fixed properly. We are looking into this.
Comment 75 Bjorn Helgaas 2022-11-11 15:41:50 UTC
(In reply to Konrad J Hambrick from comment #72)

> p.s. I updated my kernel to 5.15.78 last night at the same time.  I built it
> without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment
> #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)
> 
> Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?

No.  That patch (https://bugzilla.kernel.org/attachment.cgi?id=303123) is dead and will never go anywhere, so don't waste your time on it.
Comment 76 Konrad J Hambrick 2022-11-11 16:27:29 UTC
(In reply to Bjorn Helgaas from comment #75)
> (In reply to Konrad J Hambrick from comment #72)
> 
> > p.s. I updated my kernel to 5.15.78 last night at the same time.  I built it
> > without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment
> > #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)
> > 
> > Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?
> 
> No.  That patch (https://bugzilla.kernel.org/attachment.cgi?id=303123) is
> dead and will never go anywhere, so don't waste your time on it.

Thanks Bjorn.

-- kjh
Comment 77 Mika Westerberg 2022-11-14 13:42:16 UTC
Tried to reproduce this on my local setup (it is an Alder Lake-S based system with the same Thunderbolt controller) but I was not able to trigger the issue.

However, I went through some of the logs you attached and it seems that at least sometimes the driver is able to communicate with the firmware. It looks like it works when you boot with the dock already connected. I wonder if you could try with the v6.x kernel so that you add "thunderbolt.dyndbg=+p intel_iommu=off" in the kernel command line and boot the system up with the dock connected. Then unplug the dock, wait for 1 minute and plug it back. Does it still work? And can you attach dmesg and lspci (sudo lspci -vv) outputs here?
Comment 78 Konrad J Hambrick 2022-11-16 22:09:46 UTC
Mika --

I don't have a TBT Dock but I do have a USB 3.2 External Enclosure with a 1TB Samsung 980 Pro NVMe installed.

I can build 6.0.9 or 6.1-rc5 and boot as requested if that will help.

Now that TBT seems to work, I am saving up for a Thunderbolt 4 Dock but that won't be until later in the month.

If you would like me to do the tests with the USB 3.2 Drive, which Kernel would you prefer ( 6.0.y or 6.1.rcX ) ?

Thanks

-- kjh
Comment 79 Mika Westerberg 2022-11-17 06:26:52 UTC
Hi, unfortunately I don't think USB 3.x devices do the same. For this experiment we would need to get the TBT/USB4 link up before the driver loads.

BTW, it is not guaranteed that TBT actually works on that system fully. Because the driver is blacklisted there is no power management among other things (networking over USB4 etc).
Comment 80 Konrad J Hambrick 2022-11-17 13:25:27 UTC
Thanks Mika.

I will let you know when I have the TB4 Dock in hand and request your recommendations for which Kernel to test when I am good to go.

Thanks again.

-- kjh
Comment 81 Werner Sembach [TUXEDO] 2022-11-17 16:50:00 UTC
Created attachment 303196 [details]
dmesg after boot
Comment 82 Werner Sembach [TUXEDO] 2022-11-17 16:50:30 UTC
Created attachment 303197 [details]
lspci after boot
Comment 83 Werner Sembach [TUXEDO] 2022-11-17 16:51:14 UTC
Created attachment 303198 [details]
dmesg tbt dock disconnected
Comment 84 Werner Sembach [TUXEDO] 2022-11-17 16:51:49 UTC
Created attachment 303199 [details]
lspci tbt dock disconnected
Comment 85 Werner Sembach [TUXEDO] 2022-11-17 16:52:15 UTC
Created attachment 303200 [details]
dmesg tbt dock reconnected
Comment 86 Werner Sembach [TUXEDO] 2022-11-17 16:52:47 UTC
Created attachment 303201 [details]
lspci tbt dock reconnected
Comment 87 Werner Sembach [TUXEDO] 2022-11-17 16:54:33 UTC
(In reply to Mika Westerberg from comment #77)
> Tried to reproduce this on my local setup (it is an Alder Lake-S based system
> with the same Thunderbolt controller) but I was not able to trigger the
> issue.
> 
> However, I went through some of the logs you attached and it seems that at
> least sometimes the driver is able to communicate with the firmware. It
> looks like it works when you boot with the dock already connected. I wonder
> if you could try with the v6.x kernel so that you add "thunderbolt.dyndbg=+p
> intel_iommu=off" in the kernel command line and boot the system up with the
> dock connected. Then unplug the dock, wait for 1 minute and plug it back.
> Does it still work? And can you attach dmesg and lspci (sudo lspci -vv)
> outputs here?

sorry for the long wait, i uploaded the logs now, i used 6.1-rc5

Unplugging and replugging works just fine
Comment 88 Werner Sembach [TUXEDO] 2022-11-17 16:58:38 UTC
i realized: here the wait is just ~35 seconds and i don't get the icm error
Comment 89 Mika Westerberg 2022-11-18 12:48:18 UTC
Thanks for the logs! I will go through them.

Can you also check what the /sys/bus/thunderbolt/devices/0-0/nvm_version shows?
Comment 90 Werner Sembach [TUXEDO] 2022-11-18 14:11:58 UTC
0-0/nvm_version is 29.0

good to know now where in linux i can check the tbt firmware version ^^
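
A small sketch for collecting that value from every Thunderbolt device at once, assuming the sysfs layout mentioned above (/sys/bus/thunderbolt/devices/<device>/nvm_version). The function name is hypothetical, and the version is kept as the raw string rather than interpreted as a number.

```python
from pathlib import Path


def read_nvm_versions(devices_dir="/sys/bus/thunderbolt/devices"):
    """Map device name (e.g. "0-0") to its nvm_version string.

    Entries without a readable nvm_version attribute are skipped.
    """
    versions = {}
    root = Path(devices_dir)
    if not root.is_dir():
        return versions
    for dev in sorted(root.iterdir()):
        try:
            versions[dev.name] = (dev / "nvm_version").read_text().strip()
        except OSError:
            # Attribute missing (e.g. a domain entry) or not readable right now.
            continue
    return versions
```

The directory is a parameter so the function can be pointed at a copy of the sysfs tree when comparing machines.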
Comment 91 Werner Sembach [TUXEDO] 2022-11-18 14:21:39 UTC
that's the newest and only tbt firmware clevo is offering for the device: https://www.clevo.com.tw/en/e-services/download/ftpOut.asp?Lmodel=X170KM-G&ltype=9
Comment 92 Werner Sembach [TUXEDO] 2022-11-18 14:25:20 UTC
hm. the dl link there is broken and i'm unsure if that is the firmware update or just the driver ...
Comment 93 Konrad J Hambrick 2022-11-27 11:29:49 UTC
Created attachment 303303 [details]
Anker Apex, 12-in-1, Thunderbolt 4 dmesg and lspci output

Mika --

Attached is anker-tbt4-testing.tgz

I hope that's OK.  It seemed easier than a bunch of separate .txt file attachments.

These are the contents:

Kernel Configs for 6.0.10 

   kernel-config/.config-6.0.10-generic

TBT4 Dock Turned on and Plugged in at boot

   test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on.txt
   test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on.txt
   test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on-after-lspci-hang.txt
   test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on-2.txt
   test-1-with-tbt4-on-at-boot/dmesg-unplugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/lspci-unplugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/dmesg-replugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/lspvi-vv-replugged.txt

TBT4 not Plugged in at boot

   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-plugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-plugged.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-unplugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-unplugged.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-replugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-replugged.txt

Repeated Test 1 - TBT4 Plugged in at boot.

   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-after-lspci.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-unplugged.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged-after-lspci.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-replugged.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged-after-lspci.txt

sys_bus_thunderbolt_devices_0-0_nvm_version.txt -> 26.0

Each of the test-[1-3]-* directories contain dmesg and lspci output for one boot session on a new 6.0.10 Kernel.

Please let me know if I need to further explain the contents of the files.

When the TBT4 is Plugged in at boot, lspci -vv hung for about 90 seconds, which shows in the test-3 lspci-vv-*.txt files where I ran ( date ; lspci -vv ; date )

One note:  There is a USB3.2 NVMe SSD Enclosure plugged into the USB3.2 Port in the TBT Dock and it never powered up but it seems to have been detected ?

Please let me know if I can do any more testing -- I would love to make this work!

Thank you and wse !

-- kjh
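
As an alternative to wrapping the command in ( date ; lspci -vv ; date ), the hang can be measured directly. This is only a sketch; the helper name is made up, and it simply times whatever command it is given.

```python
import subprocess
import time


def timed_run(argv, timeout=None):
    """Run a command and return (elapsed_seconds, returncode, stdout)."""
    start = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    elapsed = time.monotonic() - start
    return elapsed, proc.returncode, proc.stdout
```

Calling it with ["lspci", "-vv"] would give the duration of the hang in seconds alongside the captured output.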
Comment 94 Konrad J Hambrick 2022-11-27 11:35:20 UTC
I have the hardest time with this interface !

Test 1 - initial session with TBT4 attached and turned on at boot

   test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on.txt
   test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on.txt
   test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on-after-lspci-hang.txt
   test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on-2.txt
   test-1-with-tbt4-on-at-boot/dmesg-unplugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/lspci-unplugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/dmesg-replugged-tbt4.txt
   test-1-with-tbt4-on-at-boot/lspvi-vv-replugged.txt

Test 2 - second session without TBT 4 attached

   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-plugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-plugged.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-unplugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-unplugged.txt
   test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-replugged.txt
   test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-replugged.txt

Test 3 - third session with TBT4 attached and on at boot ( repeat #1 )

   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-after-lspci.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-unplugged.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged-after-lspci.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged.txt
   test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-replugged.txt
   test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged-after-lspci.txt

Sorry !

-- kjh
Comment 95 Mika Westerberg 2022-11-28 08:19:04 UTC
Thanks for the logs! I see there is an AER error when the controller is runtime resumed, so I wonder if you can try to boot with "pci=noaer" (with the device connected too) and see if that makes it work any better?

I contacted Clevo last week but so far haven't got any reply from them.
Comment 96 Konrad J Hambrick 2022-11-28 13:03:06 UTC
Mika --

I've got a project due this morning but I'll append pci=noaer to my boot args and run the same tests later today after my code review.

I also received a BIOS Update from Sager which originally came from Clevo ( the file name is X170KM-G08_LS1 )

Should I update the BIOS and then gather logs or should I run the tests on the same old BIOS version ?

Finally, I still have a bootable Win11 partition.

Are there any tests I could run on the Windows Side ?

Thanks for all you're doing !

-- kjh

p.s. The BIOS Update is a .zip file and I could attach it here if that's helpful.
Comment 97 Mika Westerberg 2022-11-28 13:13:53 UTC
It's fine to update the BIOS prior to testing. Hopefully the problem goes away (but most likely not). I don't know of any additional testing to be done in Windows. Our Windows folks confirmed that the timeout is the same on the Windows side.
Comment 98 Konrad J Hambrick 2022-11-28 20:02:18 UTC
Created attachment 303313 [details]
Same tests with pci=noaer

Mika --

Attached is anker-tests-4..7.tar.gz ... with 4 test cases

Test 4 - appended pci=noaer boot with TBT4 Plugged ( old BIOS - 107.06.LS1 )

test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-after-lspci-vv.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-unplugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot-unplugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-replugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot-replugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-plugged-shutdown.txt

Test 5 - appended pci=noaer boot with TBT4 Unplugged ( old BIOS - 107.06.LS1 )

test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-plugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-plugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-unplugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-unplugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-replugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-replugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-plugged-shutdown.txt

Test 6 - appended pci=noaer boot with TBT4 Plugged ( new BIOS - 107.08.LS1 )

test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-unplugged-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-replugged.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot-replugged.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-replugged-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-replugged-shutdown.txt

Test 7 - appended pci=noaer boot with TBT4 Unplugged ( new BIOS - 107.08.LS1 )

test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-plugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-plugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-unplugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-unplugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-replugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-replugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/sys_bus_thunderbolt_devices_0-0_nvm_version.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-replugged-shutdown.txt

Kernel Parameters:

pci=noaer             [PCIE] If the PCIEAER kernel config parameter is
                      enabled, this kernel boot option can be used to
                      disable the use of PCIE advanced error reporting.

intel_iommu=off       [DMAR] Disable Intel IOMMU driver (DMAR) option

thunderbolt.dyndbg=+p Enable debug messages at boot time.  See
                      Documentation/admin-guide/dynamic-debug-howto.rst
                      for details.
-- kjh
Comment 99 Mika Westerberg 2022-11-29 07:45:17 UTC
Thanks for the logs! At least the AER error is now gone (temporarily). However, even after the BIOS upgrade the issue persists. Can we try yet another experiment?

1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
2. Boot the system up, nothing connected
3. Wait a couple of minutes
4. Load the driver manually

  # modprobe thunderbolt dyndbg

(it may be that it does not load because of the blacklist, I did not try myself but if that's the case you can run insmod instead).

The idea here is to delay the driver load to see if that gives the Maple Ridge firmware enough time to come up properly.
Comment 100 Konrad J Hambrick 2022-11-29 11:50:21 UTC
Created attachment 303317 [details]
blacklist thunderbolt ; boot unplugged

Mika --

modprobe failed so I rmmod'd thunderbolt and insmod'd thunderbolt.ko 

Both modprobe and insmod need CONFIG_DYNAMIC_DEBUG 

I also included my Kernel .config.  I will rebuild with CONFIG_DYNAMIC_DEBUG ... are there other configs I should set ?

Thanks !

-- kjh

test-8-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt/

   # modprobe then plug 

   dmesg-boot-without-tbt4.txt
   lspci-vv-boot-without-tbt4.txt
   lsmod-boot-without-tbt4.txt

   dmesg-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt
   lspci-vv-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt
   lsmod-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt

   dmesg-boot-without-tbt4-plugged.txt
   lspci-vv-boot-without-tbt4-plugged.txt

   dmesg-boot-without-tbt4-unplugged.txt
   lspci-vv-boot-without-tbt4-unplugged.txt

   # rmmod thunderbolt && insmod /lib/modules/6.0.10.kjh/kernel/drivers/thunderbolt/thunderbolt.ko dyndbg

   dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg
   lspci-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg

   dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg-plugged
   lspcii-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg-plugged

   dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg-unplugged
   lspci-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg-unplugged

test-9-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt-insmod/

   # repeat test-8 without modprobe ( insmod then plug )

   dmesg-boot-without-tbt4.txt
   lspci-vv-boot-without-tbt4.txt
   lsmod-boot-without-tbt4.txt

   dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg
   lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg
   lsmod-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg

   dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-plugged
   lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-plugged

   dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-unplugged
   lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-unplugged

   dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-replugged
   lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-replugged

test-a-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt-insmod-after-plug/

   # plug then insmod

   dmesg-boot-without-tbt4.txt
   lspci-vv-boot-without-tbt4.txt
   lsmod-boot-without-tbt4.txt

   dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt
   lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt
   lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt

   insmod-thunderbolt.ko_dyndbg.txt
   dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt
   lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt
   lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt

   dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt
   lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt
   lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt

   dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt
   lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt
   lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt

Kernel Parameters

   modprobe.blacklist=thunderbolt 
   intel_iommu=off 
   pci=noaer

Mika Westerberg 2022-11-29 07:45:17 UTC
Thanks for the logs! At least the AER error is now gone (temporarily).
However, even after the BIOS upgrade the issue persists.
Can we try yet another experiment?

1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
2. Boot the system up, nothing connected
3. Wait a couple of minutes
4. Load the driver manually

  # modprobe thunderbolt dyndbg

(it may be that it does not load because of the blacklist,
I did not try myself but if that's the case you can run insmod instead).

The idea here is to delay the driver load if that gives the Maple Ridge
firmware enough time to come up properly.
Comment 101 Werner Sembach [TUXEDO] 2022-11-29 19:01:51 UTC
(In reply to Mika Westerberg from comment #99)
> Thanks for the logs! At least the AER error is now gone (temporarily).
> However, even after the BIOS upgrade the issue persists. Can we try yet
> another experiment?
> 
> 1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
> 2. Boot the system up, nothing connected
> 3. Wait a couple of minutes
> 4. Load the driver manually
> 
>   # modprobe thunderbolt dyndbg
> 
> (it may be that it does not load because of the blacklist, I did not try
> myself but if that's the case you can run insmod instead).
> 
> The idea here is to delay the driver load if that gives the Maple Ridge
> firmware enough time to come up properly.

No change in behaviour, still the same timeout and error as if it was loaded during boot.
Comment 102 Konrad J Hambrick 2022-11-29 21:40:56 UTC
Mika --

Are there any other tests I can run ?

I am building a 6.0.10 Kernel with CONFIG_DYNAMIC_DEBUG=y

I am at your service, sir :)

Thanks for all your time and attention !

-- kjh
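
Since the dyndbg options only take effect on a kernel built with CONFIG_DYNAMIC_DEBUG, a quick check of the config before booting can save a rebuild. This is a sketch with a made-up function name; it works on the text of a kernel .config, for example a file such as /boot/config-<release> where the distribution installs one.

```python
import re


def kconfig_value(config_text, option):
    """Return the value of a kernel config option ("y", "m", ...),
    "n" if it is explicitly not set, or None if it does not appear."""
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_line = "# %s is not set" % option
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if line == unset_line:
            return "n"
    return None
```

A result other than "y" for CONFIG_DYNAMIC_DEBUG would suggest that thunderbolt.dyndbg=+p will not produce the extra debug output.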
Comment 103 Konrad J Hambrick 2022-11-29 21:54:31 UTC
Mika --

Way back in Comment #45, you asked about an acpidump.

Would that be helpful with the latest Clevo BIOS ?

If so, do you recommend any specific commandline args ?

Thanks again.

-- kjh
Comment 104 Mika Westerberg 2022-11-30 13:13:55 UTC
No need for acpidump from the latest BIOS. It probably does not have anything interesting over the previous one.
Comment 105 Konrad J Hambrick 2022-12-01 13:10:00 UTC
Thanks Mika.

Looking at wse's dmesg files from Bug 206459 - thinkpad thunderbolt 3 dock gen2 pci memory allocation errors on Yoga C940 unless plugged in before boot ...

I am more than a little disappointed to see that the 'latest' BIOS File I received from Sager is from January of 2020 where wse's Clevo BIOS File is from March, 2022 ...

Oh well ...

Thanks again for all that you're doing !

-- kjh
Comment 106 Werner Sembach [TUXEDO] 2022-12-02 09:45:56 UTC
(In reply to Konrad J Hambrick from comment #105)
> Thanks Mika.
> 
> Looking at wse's dmesg files from Bug 206459 - thinkpad thunderbolt 3 dock
> gen2 pci memory allocation errors on Yoga C940 unless plugged in before boot
> ...
> 
> I am more than a little disappointed to see that the 'latest' BIOS File I
> received from Sager is from January of 2020 where wse's Clevo BIOS File is
> from March, 2022 ...
> 
> Oh well ...
> 
> Thanks again for all that you're doing !
> 
> -- kjh

Have you compared version numbers?
Maybe the dates only differ because of when the respective branding was inserted. (Does Sager use the generic "StyleNote" branding? The BIOS I used, for example, has Tuxedo branding.)
Comment 107 Werner Sembach [TUXEDO] 2022-12-02 09:51:09 UTC
There is an unofficial mirror of all the unbranded, unmodified Clevo BIOSes: repo:repo@repo.palkeo.com/clevo-mirror/X170KM-G/

Sadly not always the best version as sometimes clevo pushes fixes out to the vendors individually.
Comment 108 Konrad J Hambrick 2022-12-02 10:32:15 UTC
Wow !

Thanks for Info wse.

My latest Sager BIOS Version is 'lower' than yours:

Mine is 1.07.08LS1:

# grep ' DMI:' dmesg-boot-without-tbt4-on-plugged.txt
[    0.000000] DMI: Notebook X170KM-G/X170KM-G, BIOS 1.07.08LS1 01/11/2020

Yours is 1.07.09RTR6:

# grep ' DMI:' ../wse/attached_after_boot2
[    0.000000] DMI: TUXEDO TUXEDO Book XUX7 Gen13/X170KM-G, BIOS 1.07.09RTR6 03/04/2022

So ... 

Q1:  if you had to guess ... do you believe it would be safe to install a Generic Clevo BIOS on a Sager Rebranded Laptop ?

Q2:  does the Clevo BIOS Update blow away your EFI Directory ?

Maybe I did it wrong, but when I installed the Sager BIOS Update, it replaced my GRUB2 EFI Partition with Windows 11 Only and I had to restore GRUB2 so I am a tad afraid of Sager's BIOS Updates now.

Thank you, wse !!

-- kjh
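
The comparison above can also be done mechanically when going through many dmesg files. This is a sketch under two assumptions: that the "DMI:" line has the shape shown in the two examples above, and that only the leading dotted number of the BIOS version is comparable (the LS1 / RTR6 suffix is treated as a vendor tag and ignored). The names are made up for illustration.

```python
import re

# Matches e.g. "DMI: Notebook X170KM-G/X170KM-G, BIOS 1.07.08LS1 01/11/2020"
DMI_RE = re.compile(
    r"DMI: (?P<system>.+)/(?P<board>[^/,]+), "
    r"BIOS (?P<version>\S+) (?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_dmi_line(line):
    """Return system, board, version and date from a dmesg DMI line, or None."""
    match = DMI_RE.search(line)
    return match.groupdict() if match else None


def numeric_prefix(version):
    """Turn the leading dotted number of a version into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()
```

Comparing the tuples from numeric_prefix() orders the versions without being confused by the vendor suffix or by the BIOS date.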
Comment 109 Konrad J Hambrick 2022-12-02 10:42:19 UTC
wse --

I checked the repo site that you referenced in comment 107 and the latest BIOS File that I see is 1.07.07 which is 'lower' than my 1.07.08 and your 1.07.09

Thanks again for that link though, it could be handy some day !

-- kjh
Comment 110 Konrad J Hambrick 2022-12-02 10:54:29 UTC
wse --

Sorry about the dangling replies,

Sager seems to use 'Notebook' for its Style Branding ( sorry, that's a new word for me :)

I do like your 'TUXEDO Book XUX7 Gen13/X170KM-G' String better :)

Given your active participation in Linux Development, you guys are definitely at the top of my short list for my next Laptop :)

-- kjh
Comment 111 Werner Sembach [TUXEDO] 2022-12-02 13:52:34 UTC
(In reply to Konrad J Hambrick from comment #108)
> Wow !
> 
> Thanks for Info wse.
> 
> My latest Sager BIOS Version is 'lower' than yours:
> 
> Mine is 1.07.08LS1:
> 
> # grep ' DMI:' dmesg-boot-without-tbt4-on-plugged.txt
> [    0.000000] DMI: Notebook X170KM-G/X170KM-G, BIOS 1.07.08LS1 01/11/2020
> 
> Yours is 1.07.09RTR6:
> 
> # grep ' DMI:' ../wse/attached_after_boot2
> [    0.000000] DMI: TUXEDO TUXEDO Book XUX7 Gen13/X170KM-G, BIOS 1.07.09RTR6
> 03/04/2022
> 
> So ... 
> 
> Q1:  if you had to guess ... do you believe it would be safe to install a
> Generic Clevo BIOS on a Sager Rebranded Laptop ?
Only if you know that your vendor did not request any hardware alterations. Sometimes there are subtle or not so subtle differences, like some vendors have a barebone with one fan and others have the "same" barebone but with 2 fans.
Seeing that your BIOS string has LS1 appended suggests that at least some software alteration happened (could be only the BIOS logo).
> 
> Q2:  does the Clevo BIOS Update blow away your EFI Directory ?
from my experience, you can never tell beforehand what exactly gets reset at the BIOS level when flashing a new BIOS. so expect to rewrite the efi entry again ^^
> 
> Maybe I did it wrong, but when I installed the Sager BIOS Update, it
> replaced my GRUB2 EFI Partition with Windows 11 Only and I had to restore
> GRUB2 so I am a tad afraid of Sager's BIOS Updates now.
> 
> Thank you, wse !!
> 
> -- kjh

(In reply to Konrad J Hambrick from comment #109)
> wse --
> 
> I checked the repo site that you referenced in comment 107 and the latest
> BIOS File that I see is 1.07.07 which is 'lower' than my 1.07.08 and your
> 1.07.09
> 
> Thanks again for that link though, it could be handy some day !
> 
> -- kjh

Yeah, that's what I meant by vendor-specific fixes that Clevo never rolls out in general.
Comment 112 Werner Sembach [TUXEDO] 2022-12-02 13:53:31 UTC
(In reply to Konrad J Hambrick from comment #110)
> wse --
> 
> Sorry about the dangling replies,
> 
> Sager seems to use 'Notebook' for its Style Branding ( sorry, that's a new
> word for me :)
> 
> I do like your 'TUXEDO Book XUX7 Gen13/X170KM-G' String better :)
> 
> Given your active participation in Linux Development, you guys are
> definitely at the top of my short list for my next Laptop :)
> 
> -- kjh

Motivating to hear, thanks ^^
Comment 113 Werner Sembach [TUXEDO] 2023-02-22 14:05:40 UTC
Since it doesn't seem that this device will get any more firmware updates, what about introducing a quirk to at least get rid of the delay? (The device is semi-stationary anyway, so the missing energy saving is no big loss.)

I'm wondering where exactly to put such a quirk. There seem to be 2 independent quirk lists in the thunderbolt driver: the obvious quirks.c and this one: https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/icm.c#L1119

quirks.c seems to be related only to tb_switch, which doesn't seem to be related to icm; the other list seems to be related to icm because the icm probe function gets passed an nhi object.

I have no idea what these acronyms stand for or how exactly they are connected xD

My plan, however, is to include a PCI ID based quirk list like this https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/quirks.c#L31 here https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/icm.c#L1119 and add an if case for that quirk here https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/nhi.c#L1251 or here https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L2548, returning NULL to abort the initialization.
Comment 114 Konrad J Hambrick 2023-02-22 20:46:48 UTC
wse --

I am running 6.1.13 on my Clevo.

The 6.1.13 source is identical to the code you referenced above.

I would be more than happy to apply any Patches you come up with and boot the patched Kernel :)

I will be watching this BUG.

Thanks for resurrecting this thread !

-- kjh
Comment 115 Mika Westerberg 2023-02-27 07:30:45 UTC
If a quirk is needed then it should use DMI strings to identify the exact system, as in general Maple Ridge works in Linux. The simplest way is to just disable CONFIG_USB4 in the kernel .config. If that does not work for you, can you attach the output of the dmidecode command so we can look into adding that quirk?
Comment 116 Werner Sembach [TUXEDO] 2023-02-27 12:07:37 UTC
Changing the kernel .config would not suffice, as I want to fix the device across different distros and their prebuilt kernels and ISO installs.

I thought the subproduct id is also per device. But ofc it is also possible using DMI: the board_name is "X170KM-G"; the system_vendor and board_vendor are reseller specific, e.g. Tuxedo devices have system_vendor "TUXEDO" and board_vendor "NB01", while a lot of other resellers, but not all, leave both at the default "Notebook" value.
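
To see which strings such a DMI match would have to key on without running dmidecode, the values can be read from sysfs. This is a userspace sketch only, not the kernel quirk itself: the names are made up, the match table is hypothetical, and it compares strings exactly. It assumes the usual /sys/class/dmi/id directory, where the system vendor is exposed as sys_vendor.

```python
from pathlib import Path

DMI_DIR = "/sys/class/dmi/id"

# Hypothetical match table: only board_name is stable across resellers.
HYPOTHETICAL_QUIRK = {"board_name": "X170KM-G"}


def read_dmi(fields, base=DMI_DIR):
    """Read the given DMI id files; unreadable fields map to None."""
    values = {}
    for field in fields:
        try:
            values[field] = (Path(base) / field).read_text().strip()
        except OSError:
            values[field] = None
    return values


def dmi_matches(quirk, base=DMI_DIR):
    """True if every field in the quirk table equals the system's value."""
    ids = read_dmi(quirk.keys(), base)
    return all(ids[field] == wanted for field, wanted in quirk.items())
```

Running read_dmi(["sys_vendor", "board_vendor", "board_name"]) on differently branded units would show which fields vary per reseller.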
Comment 117 Mika Westerberg 2023-02-27 12:56:59 UTC
Okay, but before adding any quirks, I wonder if we can perhaps try a few additional things? I checked the lspci output again and it seems the BIOS has enabled ASPM L1 substates, so can we try disabling them? If I read the code right, passing "pcie_aspm.policy=performance" in the kernel command line should disable all ASPM states. Can you try that out (and check in the lspci output under the 00:1c.0 root port that the L1SubCtl1 fields are -, and also that the LnkCtl fields say that ASPM is disabled)?
Comment 118 Werner Sembach [TUXEDO] 2023-03-06 16:18:37 UTC
$ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-6.1.0-1009-tuxedo root=UUID=14ff4402-0e6b-4601-9522-f81175e2548f ro pcie_aspm.policy=performance

L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1-
           T_CommonMode=40us LTR1.2_Threshold=90112ns

LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

Boot delay sadly still present.
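
For anyone repeating this check on their own lspci output, here is a minimal parsing sketch. The function names are made up for illustration and the sample text is the excerpt pasted above; it assumes the flags appear as `NAME+` / `NAME-` tokens on the L1SubCtl1 line, as they do in that excerpt.

```python
import re

SAMPLE = """\
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1-
           T_CommonMode=40us LTR1.2_Threshold=90112ns
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
"""


def parse_l1ss_flags(lspci_text):
    """Map each flag on the L1SubCtl1 line to True ('+') or False ('-')."""
    match = re.search(r"L1SubCtl1:\s*(.*)", lspci_text)
    if not match:
        return {}
    flags = {}
    for token in match.group(1).split():
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def aspm_disabled_in_lnkctl(lspci_text):
    """True if the LnkCtl line reports 'ASPM Disabled'."""
    match = re.search(r"LnkCtl:\s*ASPM\s+([^;]+);", lspci_text)
    return bool(match) and match.group(1).strip() == "Disabled"


if __name__ == "__main__":
    print(parse_l1ss_flags(SAMPLE))
    print(aspm_disabled_in_lnkctl(SAMPLE))
```

On the excerpt above this reports the two ASPM_L1.x flags as off and LnkCtl as "ASPM Disabled", while the two PCI-PM_L1.x flags still show as on.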
Comment 119 Mika Westerberg 2023-03-09 11:11:34 UTC
Created attachment 303906 [details]
Add debugging to ICM startup

Okay, let's try to see if there is something unexpected in the startup case then. Can you apply the attached patch, enable CONFIG_DYNAMIC_DEBUG, and add thunderbolt.dyndbg=+p in the command line. Then boot with no device connected, and then with device connected and see if there is difference.
Comment 120 Werner Sembach [TUXEDO] 2023-03-09 15:49:22 UTC
Created attachment 303909 [details]
dmesg with debug output patch and no dock connected
Comment 121 Werner Sembach [TUXEDO] 2023-03-09 15:50:06 UTC
Created attachment 303910 [details]
dmesg with debug output patch and dock connected during boot
Comment 122 Mika Westerberg 2023-03-09 16:00:58 UTC
Thanks! Sorry I forgot to ask you to disable iommu (intel_iommu=off). If the output is still the same then no need to attach new dmesg but can you check that? We are looking at the REG_FW_STS log.
Comment 123 Werner Sembach [TUXEDO] 2023-03-09 17:27:53 UTC
Created attachment 303912 [details]
no dock no iommu
Comment 124 Werner Sembach [TUXEDO] 2023-03-09 17:28:26 UTC
Created attachment 303913 [details]
dock during boot no iommu
Comment 125 Werner Sembach [TUXEDO] 2023-03-09 17:30:22 UTC
REG_FW_STS is the same in all 4 cases
Comment 126 Mika Westerberg 2023-03-10 15:57:54 UTC
Created attachment 303919 [details]
More retries with 50ms timeout for ICM driver ready

Thanks for the logs! I went through my old emails and noticed that there was a similar issue on certain systems but it was never root caused properly because a BIOS fix made it go away (so it was assumed to be a BIOS issue). I raised it again on our side but at the time I also did a hack patch that made the thing "work" by sending the driver ready several times instead of just 3 with smaller timeout. I wonder if you can try this patch and see if it makes any difference? It should not matter whether IOMMU is enabled or not.
Comment 127 Konrad J Hambrick 2023-03-10 19:34:20 UTC
Created attachment 303923 [details]
6.1.16 patched with attachment 303919 [details]

Mika --

I think you're my HERO :)

I applied your Patch to 6.1.16 and made a few dmesg files.

There is no delay at boot !

Attached is boot with TBolt 4 Unplugged, then plugged.

Thank you for all your efforts !

-- kjh
Comment 128 Konrad J Hambrick 2023-03-10 19:41:57 UTC
Created attachment 303924 [details]
TBolt 4 Plugged at boot then Unplugged then Replugged

This is the same patched kernel.

These are hdparm tests of an NVMe Drive in the replugged TBolt 4 Dock
```
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda

/dev/sda:
 Timing O_DIRECT disk reads: 2622 MB in  3.00 seconds = 873.65 MB/sec
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda

/dev/sda:
 Timing O_DIRECT disk reads: 2664 MB in  3.00 seconds = 887.94 MB/sec
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda

/dev/sda:
 Timing O_DIRECT disk reads: 2668 MB in  3.00 seconds = 889.21 MB/sec
```
Reasonable results.

Thanks again !

-- kjh
Comment 129 Konrad J Hambrick 2023-03-10 20:00:14 UTC
Hmmm ... 

Looking thru the logs the thunderbolt ICM response is consistently -110 

Does not matter if plugged or unplugged on boot.

Not sure why I can access the NVMe drive in the ANKER Thunderbolt 4 Dock ?

-- kjh
Comment 130 Mika Westerberg 2023-03-11 06:36:59 UTC
Hi, yes the timeouts are expected but eventually they go away and the "driver ready" message passes through and the TBT domain becomes accessible. This is what I was after. So this is the same issue we saw some time ago. We will look at this internally.
Comment 131 Werner Sembach [TUXEDO] 2023-03-14 13:11:19 UTC
Created attachment 303948 [details]
new patch, no dock during boot
Comment 132 Werner Sembach [TUXEDO] 2023-03-14 13:13:36 UTC
Created attachment 303949 [details]
new patch, dock during boot
Comment 133 Werner Sembach [TUXEDO] 2023-03-14 13:15:11 UTC
was away for the weekend

attached is the dmesg output with the patch -> more retries also fixes the problem for me
Comment 134 Calvin Walton 2023-04-28 16:47:49 UTC
Created attachment 304195 [details]
ASUS TUF GAMING B550-PLUS discrete Maple Ridge dmesg

I've got an AMD system on a TUF GAMING B550-PLUS (WI-FI) motherboard, using ASUS's "THUNDERBOLTEX 4" discrete Maple Ridge add-on card.
Kernel 6.2.13, Motherboard BIOS 3002, Thunderbolt NVM 36.0.

I was also getting long delays during boot with the card installed. Applying the patch from attachment 303919 [details] has fixed the boot delay, but I haven't had a chance to test anything more than basic USB functionality on the card yet.

I've attached dmesg with debug output from attachment 303906 [details] also enabled. I'm getting some errors that there is insufficient space for some PCI BARs, and it is also reporting some IOMMU faults during initialization (I haven't tried with IOMMU disabled yet).
Comment 135 Mika Westerberg 2023-05-29 13:32:28 UTC
Thanks for the logs! I'm still working on this but it has been quite hard to get the exact setup on our internal reference systems. I have one now but I need to flash firmwares and everything, and I'm currently busy with other things. I hope to get it up and running (and the bug reproduced) in a couple of weeks. If we manage to reproduce it we can hopefully root cause it locally. Apologies for the long delay with this one.
Comment 136 Konrad J Hambrick 2023-05-29 14:26:24 UTC
Thanks Mika 

Just in case this is useful info ...

I've been applying your Patch from Comment #126 to the 6.1.y Kernels as they're released and the patches have applied cleanly so far thru 6.1.30 ( installed on May 25 ).

Like Calvin, all I've tested in a long time is USB which works OK.

I do have HDMI and Ethernet in my Anker Hub if testing would be useful.

Thanks for looking at this !

-- kjh
Comment 137 Mika Westerberg 2023-08-18 12:52:41 UTC
Created attachment 304888 [details]
Workaround for IOMMU faults on certain systems (updated)

Hi all,

I'm sorry for the long delay. I tried to reproduce the issue on the Intel reference board that was used as the basis for these systems, with the BIOS that should have the issue, but the issue still did not reproduce :( This makes it hard to debug. So instead let's try to work around it by increasing the number of retries. I've updated the patch to do this only when it sends driver ready to the firmware. I wonder if you could try it out and let me know if it still helps. If it does, I will send it upstream.
Comment 138 Konrad J Hambrick 2023-08-18 21:21:32 UTC
Mika --

I'll build 6.1.46 or better tomorrow and boot and let you know.

Do you want dmesg files for any specific set of boot args ?

Thank you !

-- kjh
Comment 139 Mika Westerberg 2023-08-19 05:14:49 UTC
If you can pass "thunderbolt.dyndbg=+p" in the command line and attach full dmesg that would be perfect, thanks!
Comment 140 Konrad J Hambrick 2023-08-19 08:47:24 UTC
Created attachment 304908 [details]
Thunderbolt unplugged at boot - 6.1.46 with 0001-thunderbolt-Workaround-an-IOMMU-fault-on-certain-sys.patch

Mika --

I applied your new patches to 6.1.46 and it applied without errors and built a 6.1.46.kjh_p kernel ( .kjh means it is not a Slackware Kernel and _p is for patched ).

I have been running your older patches up to now because they eliminated the 90-sec hang during boot.

Your latest patches also eliminated the boot-time hang ( ! woo hoo ! )

I am running 6.1.46.kjh_p and I'll keep running it until 6.1.47 is released when I will continue to apply your latest patches unless they're merged into upstream.

This is with the thunderbolt unplugged at boot.

-- kjh
Comment 141 Konrad J Hambrick 2023-08-19 08:48:31 UTC
Created attachment 304909 [details]
6.1.46 with your latest patches and thunderbolt plugged at boot

this is 6.1.46 with your latest patches and thunderbolt plugged at boot
Comment 142 Calvin Walton 2023-08-19 16:21:23 UTC
Created attachment 304911 [details]
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-g8ff95b7fc9f5 (304888)

I've tried the updated patch from Attachment 304888 [details] on my affected AMD system with no thunderbolt devices connected, and the new patch does not completely solve the startup delay problem. The delay is reduced compared to no patch, but I'm still seeing a total of 10 seconds of delay during the thunderbolt device probe as it appears to perform 5 retries each 2 seconds apart.

The patch from attachment 303919 [details] reduced the boot delay to around 250ms.
Comment 143 Calvin Walton 2023-08-19 16:23:30 UTC
Created attachment 304912 [details]
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-gb09f7209b080 (303919)

For comparison, here is the same kernel version with the patch from attachment 303919 [details] applied instead.
Comment 144 Calvin Walton 2023-08-19 16:33:54 UTC
Looking at kjh's logs, I see the same thing - when probing the thunderbolt controller with no devices connected, the new version of the patch has a 10 second delay while the old patch was able to complete in around 200ms.
Comment 145 Mika Westerberg 2023-08-21 07:59:33 UTC
Thanks for testing! Yes, the patch tries to get the system functional not speed up the boot time (although it does that too). I suppose we could change the timeout to be shorter still like this:

icm_tr_driver_ready():
	...
	ret = icm_request(tb, &request, sizeof(request), &reply, sizeof(reply),
			  1, 40, 500);

Would this work for you?
Comment 146 Konrad J Hambrick 2023-08-21 09:04:54 UTC
Mika --

I didn't notice the 10-sec delay while watching my laptop boot because there was no obvious hang like I saw with no patches when the Discrete Thunderbolt was enabled in the Insyde BIOS.

But after reading Calvin's messages and looking closely at the logs I see that it does take an additional 8 -to- 10 sec to boot with the latest patch compared to the original patch that eliminated the hang.

I would be happy to test any patches you come up with.

Thanks again for working on this !

-- kjh
Comment 147 Mika Westerberg 2023-08-21 09:37:40 UTC
Hi, yes there is the 5 * 2s = 10s delay now in the log:


[    3.926875] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030]
...
[    6.081338] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030]
[    8.214284] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030]
[   10.347715] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030]
[   12.481318] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030]
[   14.621068] thunderbolt 0000:06:00.0: USB4 proxy operations supported

this means the IOMMU blocks the DMA 5 times before it magically starts working. If we cut that 2s to 500ms then it would take 2.5s in theory to get the DMA working (if it always takes 5 times but sometimes it might take more and sometimes less).
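
As a rough cross-check of that arithmetic, the spacing between the logged events can be computed directly from the timestamps quoted above. This is only a sketch: the helper names are invented, and the projection simply multiplies the number of failed attempts by the timeout, which ignores any per-attempt overhead.

```python
import re

FAULT_LOG = """\
[    3.926875] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030]
...
[    6.081338] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030]
[    8.214284] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030]
[   10.347715] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030]
[   12.481318] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030]
[   14.621068] thunderbolt 0000:06:00.0: USB4 proxy operations supported
"""


def timestamps(log_text):
    """Extract the leading dmesg timestamps (in seconds) from each log line."""
    pattern = re.compile(r"^\[\s*(\d+\.\d+)\]", re.MULTILINE)
    return [float(m.group(1)) for m in pattern.finditer(log_text)]


def gaps(values):
    """Differences between consecutive timestamps, rounded to milliseconds."""
    return [round(b - a, 3) for a, b in zip(values, values[1:])]


def projected_delay_s(failed_attempts, timeout_s):
    """Idealized delay: every failed attempt waits out the full timeout."""
    return failed_attempts * timeout_s


if __name__ == "__main__":
    print(gaps(timestamps(FAULT_LOG)))
    print(projected_delay_s(5, 2.0), projected_delay_s(5, 0.5))
```

Each measured gap comes out slightly above 2 s, so the real total is a bit more than the idealized 10 s.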
Comment 148 Calvin Walton 2023-08-21 14:37:51 UTC
Even with a 500ms retry, that's still a 2.5s delay at boot, a rather long time compared to the ~200-300ms total that it takes to initialize with the older patch. Looking at the timing, it appears that the older patch retries immediately - i.e. there is no noticeable delay or sleep (<1ms) before it tries again:

[    3.403260] thunderbolt 0000:06:00.0: enabling interrupt at register 0x38200 bit 12 (0x1 -> 0x1001)
[    3.404331] thunderbolt 0000:06:00.0: sending ICM request
[    3.409092] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030]
[    3.456684] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.456741] thunderbolt 0000:06:00.0: sending ICM request
[    3.460190] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030]
[    3.510018] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.510075] thunderbolt 0000:06:00.0: sending ICM request
[    3.513519] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030]
[    3.563351] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.563407] thunderbolt 0000:06:00.0: sending ICM request
[    3.566853] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030]
[    3.616683] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.616740] thunderbolt 0000:06:00.0: sending ICM request
[    3.620247] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030]
[    3.673353] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.673412] thunderbolt 0000:06:00.0: sending ICM request
[    3.677120] thunderbolt 0000:06:00.0: ICM request finished 0
[    3.683640] thunderbolt 0000:06:00.0: USB4 proxy operations supported

It's really strange that it takes exactly 5 retries on both my system and kjh's system - *regardless of the amount of time between retries*. I wonder what the cause of that is… Also, kind of curious that it is giving IOMMU errors on the first 5 failed requests on both of our systems (both with the Intel and AMD IOMMU).
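
To make the per-attempt timing easier to see, here is a small sketch that pairs each "sending ICM request" line with the following "ICM request finished" line and reports the elapsed time and result code. The function name is invented for illustration, and the sample is the log above with the IO_PAGE_FAULT lines left out. The -110 result corresponds to -ETIMEDOUT on Linux.

```python
import re

ICM_LOG = """\
[    3.404331] thunderbolt 0000:06:00.0: sending ICM request
[    3.456684] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.456741] thunderbolt 0000:06:00.0: sending ICM request
[    3.510018] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.510075] thunderbolt 0000:06:00.0: sending ICM request
[    3.563351] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.563407] thunderbolt 0000:06:00.0: sending ICM request
[    3.616683] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.616740] thunderbolt 0000:06:00.0: sending ICM request
[    3.673353] thunderbolt 0000:06:00.0: ICM request finished -110
[    3.673412] thunderbolt 0000:06:00.0: sending ICM request
[    3.677120] thunderbolt 0000:06:00.0: ICM request finished 0
"""

LINE_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+thunderbolt \S+: "
    r"(sending ICM request|ICM request finished (-?\d+))"
)


def icm_attempts(dmesg_text):
    """Return a list of (duration_seconds, result_code) per ICM request."""
    attempts = []
    start = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        stamp = float(match.group(1))
        if match.group(3) is None:
            start = stamp  # a "sending ICM request" line
        elif start is not None:
            attempts.append((round(stamp - start, 6), int(match.group(3))))
            start = None
    return attempts


if __name__ == "__main__":
    for duration, code in icm_attempts(ICM_LOG):
        print(f"{duration * 1000:.1f} ms -> {code}")
```

On this log the five failed attempts each take a little over 50 ms and the sixth completes in a few milliseconds.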

As far as a workaround, is there any downside to doing an immediate retry? Would it be ok to do a limited number of immediate retries - might as well do 5, since that's the number that's needed on both my system and kjh's system - before enabling a delay for the remaining retries?
Comment 149 Mika Westerberg 2023-08-21 14:58:11 UTC
The IOMMU is sitting between the DMA (the one used by the Thunderbolt driver) and the memory, and that is blocking all the requests (the first ~5). This is hard to debug without means to reproduce this locally and I tried all possible ways I could think of. Ideally once reproduced we would plug in PCIe analyzer and similar devices to figure out what exactly is happening. The Intel based systems at least got fixed with BIOS upgrade (except for the ones here, of course, where there does not seem to be an upgrade available; for the AMD one, I'm not sure).

Anyways boot time typically does not matter too much as that is not done too often but if you really think it matters here, we can make the timeout 50ms and retry max say, 10 times (to be on the safe side) but then I think we need to separate the Maple Ridge path from Titan Ridge to avoid any possible issues (need to check that still).
Comment 150 Mika Westerberg 2023-08-22 12:05:40 UTC
Created attachment 304924 [details]
Workaround for IOMMU faults on certain systems (v2)

Can you try the attached patch? This one tries 10 times with 50ms timeout and then reverts to the longer timeout.
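
The effect of a schedule like this can be modelled roughly as below. This is a toy model, not the patch: only the "10 tries at 50ms" part comes from the description above, while the slow phase of 3 tries at 2000 ms is a placeholder assumption, and the function names are invented.

```python
def retry_schedule(fast_tries, fast_timeout_ms, slow_tries, slow_timeout_ms):
    """Per-attempt timeouts: a fast phase followed by a slow phase."""
    return [fast_timeout_ms] * fast_tries + [slow_timeout_ms] * slow_tries


def time_until_success_ms(schedule, failures_before_success):
    """Time spent in timed-out attempts before the first successful one.

    Returns None if the schedule runs out before an attempt can succeed.
    """
    if failures_before_success >= len(schedule):
        return None
    return sum(schedule[:failures_before_success])


if __name__ == "__main__":
    # Slow-phase values (3 tries, 2000 ms) are assumed for illustration only.
    schedule = retry_schedule(10, 50, 3, 2000)
    print(time_until_success_ms(schedule, 5))   # five failures in the fast phase
    print(time_until_success_ms(schedule, 12))  # spills into the slow phase
```

With five failures before success, as seen in the logs in this thread, all waiting happens in the fast phase.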
Comment 151 Konrad J Hambrick 2023-08-22 12:13:25 UTC
Mika --

I've got a deadline at work and I can't be without my Laptop for now.

I should be done tomorrow, or on Thursday at the latest, and I'll build a patched Kernel and post the two dmesg files ASAP.

Sorry about the delay !

-- kjh
Comment 152 Werner Sembach [TUXEDO] 2023-08-23 16:38:14 UTC
sorry for the slow reply, was on holiday ^^

here is my output

on the fail case it gives out

DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Write NO_PASID] Request device [05:00.0] fault addr 0x69974000 [fault reason 0x05] PTE Write access is not set

3 times and fails 20 secs later

in the successful case with the workaround it gives out

DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Write NO_PASID] Request device [05:00.0] fault addr 0x69974000 [fault reason 0x05] PTE Write access is not set

3 times and an additional

DMAR: DRHD: handling fault status reg 2

without the other line, but then successfully initializes the thunderbolt things
Comment 153 Werner Sembach [TUXEDO] 2023-08-23 16:38:52 UTC
Created attachment 304926 [details]
non working state (kernel 6.2)
Comment 154 Werner Sembach [TUXEDO] 2023-08-23 16:39:39 UTC
Created attachment 304927 [details]
working state (kernel 6.5 with latest workaround applied)
Comment 155 Mika Westerberg 2023-08-24 08:10:35 UTC
Do I understand correctly that the last patch does not work for you? Or did I miss something? In the second log you attached it seems things work fine, but in the v6.2 kernel log there is still the timeout, so do you have the patch applied in both cases?
Comment 156 Werner Sembach [TUXEDO] 2023-08-24 08:41:21 UTC
the patch does work
the 6.2 log was without the patch applied and is just here as a reference regarding the error messages that happen in both cases
sorry for the confusion
with the patch applied it works as intended (some fails and then a successful initialization)
Comment 157 Mika Westerberg 2023-08-24 09:48:38 UTC
Got it, thanks for clarification. Okay, let's wait for others to try it out and then I will send it upstream after the merge window closes.
Comment 158 Konrad J Hambrick 2023-08-24 09:48:56 UTC
Mika --

I was able to apply your latest patch to 6.1.47 without any issues.

Everything booted fine when I booted with the Thunderbolt Hub UNPLUGGED.

However there was a long delay from 11.5 to 28.9 secs followed by a pair of Kernel OOPS starting at 49.3 sec when I booted with TB PLUGGED.

Rebooted 6.1.46.kjh_p with the Aug 18 Patches ... no issues there.

The two dmesg files ( unplugged and plugged ) for 6.1.47 with the Aug 22 patches are attached.

Thanks again, Mika !

-- kjh
Comment 159 Konrad J Hambrick 2023-08-24 09:51:21 UTC
Created attachment 304931 [details]
6.1.47 + Aug 22 Patch - TB unplugged at boot

6.1.47 + Mika Aug 22 Patch ; TB Unplugged at boot
Comment 160 Konrad J Hambrick 2023-08-24 09:53:00 UTC
Created attachment 304932 [details]
6.1.47 + Mika Aug 22 Patch ; TB Plugged at boot -- OOPS

6.1.47 + Mika Aug 22 Patch ; TB Plugged at boot resulted in a kernel OOPS
Comment 161 Mika Westerberg 2023-08-24 11:09:22 UTC
Hmm, it is pretty unlikely that the patch affects that part. Your log pretty much shows that the PCIe link from the root port to the Thunderbolt host controller went down. Does that happen 100% of the time if you boot with the patch applied and the dock connected?
Comment 162 Konrad J Hambrick 2023-08-24 13:41:16 UTC
Mika --

I'll have to try again later today.

I am running 6.1.46 with your Aug 18 patch right now but I can reboot 6.1.47 with your Aug 22 Patch later today when I finish with a remote server setup.

I'll let you know when I can run some more tests.

-- kjh
Comment 163 Konrad J Hambrick 2023-08-25 09:53:03 UTC
Mika --

I rebooted ( cold boot ) three times and each time resulted in more or less the same OOPS.

NOTE:  there is a similar OOPS on shutdown

I saved dmesg files for each if you want them but I don't know where to insert a dmesg command in the shutdown scripts to capture the shutdown OOPS.

I will remove 6.1.47.kjh_p and rebuild it with the Aug 18 patch to determine if the OOPS is due to an unrelated kernel issue and report back later ...

-- kjh
Comment 164 Konrad J Hambrick 2023-08-25 10:41:51 UTC
Created attachment 304942 [details]
6.1.47 with the Aug 18 Patch

Mika --

This is 6.1.47 with the Aug 18 Patch.

There is no OOPS on boot or shutdown.

Maybe I messed up when I applied the Aug 22 patch ?

I don't think so but I'll build 6.1.47 again with the Aug 22 patch this weekend and report my results.

-- kjh
Comment 165 Konrad J Hambrick 2023-08-25 10:42:42 UTC
p.s. Thunderbolt was plugged when I booted.

sorry about the noise ...
Comment 166 Mika Westerberg 2023-08-25 11:54:17 UTC
All the last three patches are pretty much the same; the timeout and the number of retries vary, but the end result should be the same. I see in your system, based on the last two logs you attached, that after 4 DMAR faults the driver ready messages start passing through and things work as expected.

The "OOPS" you mention is not actually an OOPS; it is just the driver complaining because the underlying hardware went away, e.g. the PCIe link from the root port to the Maple Ridge is down. I agree that trying several times on a clean build with the last patch applied would tell us if this is indeed somehow related to the patch or not.
Comment 167 Mika Westerberg 2023-08-26 06:04:51 UTC
I realized that the patch might actually affect this. The new patch makes the boot faster but at the same time it allows the driver to runtime suspend earlier and may cause the whole thing to enter D3cold. Now, this should be fine in normal cases but these systems are "special" anyway.

If we go back to the Aug 18 patch with the longer timeout then this should not happen, but I wonder if it works well over system sleep etc? Once you have checked the Aug 22 patch again with a clean build, and if you see the same behavior that the link goes down, can you try with the Aug 18 one again too and also run some suspend/resume cycles just in case?
Comment 168 Konrad J Hambrick 2023-08-26 12:16:51 UTC
Mika --

My wife reminded me that we're going out of town this weekend and we won't return until Sun nite.

I'll be able to get back on the testing early Monday morning America/Chicago time.

I'll do the sleep tests first with the existing Aug 18 patches.

Then I'll rebuild 6.1.48 with the Aug 22 patch to see if I messed up the patches, including the sleep testing.

Thank you for all your work on this ! 

-- kjh
Comment 169 Calvin Walton 2023-09-11 14:35:35 UTC
I've been using the patch from attachment 304924 [details] for a while with good results. The initialization speed is pretty much indistinguishable from the initial patch that retried without delay. I'm using a desktop system and don't use suspend to ram, so I can't comment on that aspect of the behaviour.

Feel free to add a Tested-by: Calvin Walton <calvin.walton@kepstin.ca> to the patch.
Comment 170 Mika Westerberg 2023-09-11 14:59:00 UTC
Hi, I went for the more conservative version to avoid the issue Konrad saw. It slows down the boot but in the end both systems should be working. I Cc'd you all with the patches (sent them out today morning).
Comment 171 Mika Westerberg 2023-09-12 05:03:59 UTC
(In reply to Mika Westerberg from comment #170)
> Hi, I went for the more conservative version to avoid the issue Konrad saw.
> It slows down the boot but in the end both systems should be working. I Cc'd
> you all with the patches (sent them out today morning).

Here's the patch link:

https://lore.kernel.org/linux-usb/20230911100445.3612655-2-mika.westerberg@linux.intel.com/
Comment 172 Konrad J Hambrick 2023-09-12 08:38:19 UTC
Mika --

Thank you for the patches.

Do I need to apply the linked patch IN ADDITION TO the four patches I received via email ?

Or is the linked patch all I need ?

Thanks again !

-- kjh
Comment 173 Mika Westerberg 2023-09-12 08:54:01 UTC
The linked one is all you need. I CC'd you as well with the patch so you should have it in your inbox too.
Comment 174 Konrad J Hambrick 2023-09-12 10:08:43 UTC
Thanks Mika.

Yes, it is attached to an email in my inbox.

Building 6.1.52 now with only the linked patch for `diff --git a/drivers/thunderbolt/icm.c b/drivers/thunderbolt/icm.c`

Skipping these:

diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
diff --git a/drivers/thunderbolt/tmu.c b/drivers/thunderbolt/tmu.c
diff --git a/drivers/thunderbolt/quirks.c b/drivers/thunderbolt/quirks.c
diff --git a/drivers/thunderbolt/xdomain.c b/drivers/thunderbolt/xdomain.c

Will build today, install tonite and follow up tomorrow morning.

Thanks again !

-- kjh
Comment 175 Konrad J Hambrick 2023-09-13 08:29:58 UTC
Created attachment 305098 [details]
dmesg for Linux 6.1.52 with Mika's Sep 12 Patch cold boot ; plugged

Mika --

Attached is the dmesg output for 6.1.52, where I applied the single Sep 12 patch. I cold-booted with the Thunderbolt Hub turned on and plugged.

Thank you for all your work on the Discrete Thunderbolt Issue !

-- kjh
Comment 176 Mika Westerberg 2023-09-13 10:38:07 UTC
Thanks a lot for testing! Does it also work if you boot with no device connected? (no need to attach dmesg if it does).
Comment 177 Konrad J Hambrick 2023-09-13 11:55:36 UTC
Mika --

Yes, linux 6.1.52  boots fine when I cold boot without the Thunderbolt Hub Plugged in.

Everything also works fine when I subsequently turn on the Hub after booting unplugged.

Linux 6.1.53 was released this morning so I'll be building that later today or tomorrow morning.

I'll apply the same Sep 12 patches and let you know what I see with 6.1.53

Thanks again Mika !

-- kjh
Comment 178 Konrad J Hambrick 2023-09-14 08:42:23 UTC
Created attachment 305108 [details]
Linux 6.1.53 + Mika Sep 12 Patch cold boot plugged

Mika --

Attached is 6.1.53 + your Sep 12 Patch after a cold boot.

The old Linux 6.1.52 shutdown fine without any delay.

Linux 6.1.53 Booted just fine without a noticeable delay.

I forgot to grab a dmesg file until quite some time after booting and entering my KDE Desktop.

When I did, I noted the verbose periodic Kernel comms with Thunderbolt toward EoF in the dmesg file.

Thanks Mika !

-- kjh
Comment 179 Mika Westerberg 2023-09-14 09:09:30 UTC
Thanks for testing!

It seems that the TBT wakes up periodically, but that is probably related to the same issue that causes the DMAR faults, which we cannot fix properly on the OS side without understanding the root cause (and this requires that we would be able to reproduce this locally). I suspect simply a firmware update (if it were available) would fix this. Anyways, the workaround seems to work so we are good. I will try to get this patch to v6.6-rc and stable trees.
Comment 180 Konrad J Hambrick 2023-09-14 09:33:02 UTC
There are no obvious effects on the performance of my Slackware Desktop.

And it makes sense that the TBT would wake up periodically.

I'll watch for Firmware updates but I am not hopeful about something useful trickling down from Clevo to Sager for this particular set of hardware ( Clevo X170M / Sager NP9672m-g1 ) but one never knows ...

I'll be watching the kernel logs for your patches and maybe Greg KH will eventually pull the patches back into the 6.1.y ( or the next ) longterm tree !

Thanks again Mika, you're my hero !

-- kjh
Comment 181 Calvin Walton 2023-09-18 07:11:14 UTC
So, I've tested the patch posted to the ML, and it's performing as intended on my system. With 5 attempts and 2 seconds between attempts, my boot is slowed down by nearly 8s (it's unclear why waiting for thunderbolt to initialize is in the critical boot path), but the hang is significantly reduced compared to no patch, and the thunderbolt controller is functional.

I'll probably keep using the reduced timeout patch for faster boots, since I'm not hitting/triggering the power management related problem that Konrad reported.

Assuming this is a firmware problem - which firmware is affected? I've got the latest available TB controller firmware update from ASUS applied (NVM 36). As far as I know there's unlikely to be much other firmware in common between my system (an AMD desktop board from ASUS) and Konrad's system (an Intel laptop from Clevo), so it seems odd that a firmware bug would cause us to see the same symptoms, right down to the same number of timing-independent retries needed before the adapter works.

For what it's worth, I dual-boot Windows on my system, and the Thunderbolt controller is working there - but when I looked into it more closely, it turns out that the Windows "Kernel DMA Protection" feature is off (despite having the IOMMU setting enabled in the BIOS). From what I've read, this indicates that the BIOS is not setting a flag to tell Windows that DMA protection is supported. So the combination of Thunderbolt controller installed and DMA protection enabled on this board is probably untested or unsupported by ASUS :/

I have a second ASUS board (Pro B550M-C/CSM) which is also compatible with this ASUS Thunderbolt card - same AMD chipset generation, but from a business oriented model line (instead of gaming) and with a different type of CPU installed (Ryzen 5 PRO 4650G APU). At some point I'll test that system to see if it also has the same problem.
Comment 182 Mika Westerberg 2023-09-18 07:43:45 UTC
Thanks for testing! Yeah, I had to choose a version that works on both although in your system it slows down the boot (but ends up with functional TBT).

The firmware in question is the BIOS. There was a Dell system with similar symptoms and that went away with BIOS upgrade. I contacted Clevo regarding this but they do not support the older models anymore so if I understood correctly it is unlikely (unfortunately) that there is any BIOS upgrade for that system. For ASUS of course it might be different story.
Comment 183 Werner Sembach [TUXEDO] 2023-09-26 21:07:56 UTC
Sorry took me some time to look at the final patch.

With 2000ms wait and 4-5 retries that's still 8-10 seconds longer boot time which is quite noticeable.

I'm currently in home office, but end of next week I can look at the test device again and see if the original less delay patch does cause the oops for me too.

If not, maybe the delay time can be quirked?
Comment 184 Mika Westerberg 2023-09-27 07:51:39 UTC
Okay let me know how it goes. Please check connected and not connected cases and over suspend/resume.
Comment 185 Werner Sembach [TUXEDO] 2023-11-08 18:35:22 UTC
Sorry, it took me some time to come back to this device: I could not reproduce the issue Konrad described, even when going as low as 50ms for the delay.

Looking into the dmesg timestamps: for the TUXEDO device the thunderbolt initialization starts at around 1.7s, and other stuff that is seemingly initialized in parallel is finished at around 2.8-3s. So as long as the 4 known-to-fail retries take less than 1s, there should be no noticeable delay, at least for this device.
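
The reasoning can be written down as a small budget calculation. This is a sketch under the assumption stated above, namely that the Thunderbolt probe only adds visible boot time if it finishes after the other parallel initialization; the function name is invented and the 1.7 s and 2.8 s figures are the approximate values read from the dmesg timestamps.

```python
def added_boot_delay_s(tb_start_s, other_init_done_s, failed_retries, timeout_ms):
    """Extra boot time caused by the Thunderbolt retries, if any.

    Assumes the probe runs in parallel with other initialization and only
    matters once it outlasts that other work.
    """
    tb_ready_s = tb_start_s + failed_retries * timeout_ms / 1000.0
    return max(0.0, tb_ready_s - other_init_done_s)


if __name__ == "__main__":
    for timeout_ms in (50, 250, 2000):
        delay = added_boot_delay_s(1.7, 2.8, 4, timeout_ms)
        print(f"{timeout_ms} ms timeout -> {delay:.1f} s added")
```

Under these numbers the 50 ms and 250 ms timeouts stay within the budget, while the 2000 ms timeout does not.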

@Konrad Did you try with a timeout of 250ms?
Comment 186 Werner Sembach [TUXEDO] 2023-12-07 13:24:58 UTC
A gentle bump to ask if this reduced timeout would also be acceptable?
Comment 187 Mika Westerberg 2023-12-07 13:28:37 UTC
Sure, it is acceptable if it does not break the systems that started working with the initial fix.
Comment 188 Konrad J Hambrick 2023-12-07 21:50:55 UTC
@wse --

Sorry ... we're getting busy at work and I've been distracted.

No I never did try the 250 ms timeout because the small lag has not been bothering me.

All the 6.1.y kernels have run good enough since Mika's patches were merged.

Is the 250 ms timeout something I can try with a kernel commandline arg or would it require changing the source and recompiling ?

Thanks !

-- kjh
Comment 189 Werner Sembach [TUXEDO] 2023-12-07 21:52:46 UTC
The value is hardcoded so a patch and recompile is required. I can prep a patch for you later or tomorrow.
Comment 190 Werner Sembach [TUXEDO] 2023-12-07 23:42:23 UTC
Created attachment 305567 [details]
reduce timeout to 250ms
Comment 191 Werner Sembach [TUXEDO] 2023-12-07 23:44:36 UTC
There's a patch based on 6.7-rc4 but it also applies cleanly to v6.1.65
Comment 192 Konrad J Hambrick 2023-12-08 17:40:11 UTC
(In reply to Werner Sembach [TUXEDO] from comment #191)
> There's a patch based on 6.7-rc4 but it also applies cleanly to v6.1.65

Werner --

I'll apply your patch to 6.1.65 or later this weekend and boot it.

If it runs well, I'll keep it going until something newer comes along :)

I'll let you know.

Are there any debug configs you would like me to set ?

Are there any logs that you would like to see ?

Thanks for this !

-- kjh
Comment 193 Werner Sembach [TUXEDO] 2023-12-08 18:36:47 UTC
Let's see if it runs without issues
Comment 194 Konrad J Hambrick 2023-12-09 13:01:21 UTC
Created attachment 305571 [details]
6.1.65 - no patch

Werner --

This was 6.1.65 before I applied your 250 ms patch.

This is for reference ...

It had been running several days so I trimmed out lines after boot up to where the Thunderbolt code threw a kernel trace into the log.

This happens occasionally ever since I started using Mika's patches ( reported here in this thread ).

I also occasionally see the kernel hang for a few seconds at shutdown but it eventually lets go and shuts down normally.

Thanks for looking at this.

The next attachment will be 6.1.66 with your 250 ms patch.

-- kjh
Comment 195 Konrad J Hambrick 2023-12-09 13:06:00 UTC
Created attachment 305572 [details]
6.1.66 with Werner's 250 ms patch

Werner --

This is 6.1.66 with your 250 ms patch.

It runs fine.

-- kjh
Comment 196 Werner Sembach [TUXEDO] 2023-12-20 15:44:35 UTC
Thx, on its way to upstream: https://lore.kernel.org/all/20231220150956.230227-1-wse@tuxedocomputers.com/