Bug 214259
Description
Werner Sembach [TUXEDO]
2021-09-01 13:44:11 UTC
Created attachment 298567 [details]
dmsg of boot without tb dock connected
Created attachment 298569 [details]
dmsg after connecting tb dock
Created attachment 298571 [details]
dmsg of boot with tb dock connected
Created attachment 298573 [details]
lspci of boot without tb dock connected
Created attachment 298575 [details]
lspci after connecting tb dock
Created attachment 298577 [details]
lspci of boot with tb dock connected
These logs are all for kernel 5.14-rc7.
Created attachment 298579 [details]
dmsg of boot without tb dock connected (5.13.12)
Created attachment 298581 [details]
dmsg after connecting tb dock (5.13.12)
Created attachment 298583 [details]
dmsg of boot with tb dock connected (5.13.12)
Created attachment 298585 [details]
lspci of boot without tb dock connected (5.13.12)
Created attachment 298587 [details]
lspci after connecting tb dock (5.13.12)
Created attachment 298589 [details]
lspci of boot with tb dock connected (5.13.12)
New info: The intel_iommu=off boot flag makes the DMAR errors go away. However, the xhci errors stay and the USB ports of the dock are still dysfunctional. So they are separate issues after all.

For reference: a reddit thread discussing the discrete Asus Thunderbolt PCIe card failing in the exact same way: https://www.reddit.com/r/Thunderbolt/comments/ohjakr/asus_thunderboltex_4_linux_compatability/

Found a preexisting hack, originally for tb3, that also fixes the issue on this tb4 controller: https://bugzilla.kernel.org/show_bug.cgi?id=206459#c59

*** This bug has been marked as a duplicate of bug 206459 ***

Thank you for your bug report. I've prepared a patch series fixing bug 206459 as well as this bug: https://lore.kernel.org/linux-pci/20220519152150.6135-1-hdegoede@redhat.com/T/#t

This series is using DMI matching to identify affected systems and to enable the workaround only on affected systems. I've used DMI_MATCH(DMI_BOARD_NAME, "X170KM-G") as match for this Clevo Barebone. Can you confirm that:

cat /sys/class/dmi/id/board_name

outputs "X170KM-G"? Or even better, give this patch series a try?

Note the series is based on top of: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource

Thank you for the patch. Yes, "X170KM-G" is the exact match for /sys/class/dmi/id/board_name on the affected device. The kernel with the patch series is compiling at the moment. I will add another post on whether or not it worked.

Successfully tested the patchset: Works like a charm.

Hans --
I have exactly the same Hardware and when I Enable my Discrete ThunderBolt in the BIOS, I experience a series of TimeOut delays during Boot.
I am running 5.15.53 on Slackware64 15.0 ...
I don't have a Dock for my LapTop so I am not sure this bug is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=206459
Q1: Is your Patch Series still viable ?
If so, your second link above for the patch series no longer works: https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource
Q2: Was this patch ever included in the kernel source tree ?
If so, what version should I try ?
Thanks !
-- kjh

(In reply to Konrad J Hambrick from comment #21)
> Hans --
>
> I have exactly the same Hardware and when I Enable my Discrete ThunderBolt
> in the BIOS, I experience a series of TimeOut delays during Boot.
>
> I am running 5.15.53 on Slackware64 15.0 ...
>
> I don't have a Dock for my LapTop so I am not sure this bug is a duplicate
> of https://bugzilla.kernel.org/show_bug.cgi?id=206459
>
> Q1: Is your Patch Series still viable ?
>
> If so, your second link above for the patch series no longer works:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource
>
> Q2: Was this patch ever included in the kernel source tree ?
>
> If so, what version should I try ?
>
> Thanks !
>
> -- kjh

p.s. this is my dmesg with the Discrete Thunderbolt Enabled:

[ 9.359600] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio2/input/input15
[ 27.953327] DMAR: DRHD: handling fault status reg 2
[ 27.959653] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 48.433330] DMAR: DRHD: handling fault status reg 2
[ 48.438777] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 68.913306] DMAR: DRHD: handling fault status reg 2
[ 68.919850] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 89.387726] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
[ 89.394835] thunderbolt: probe of 0000:07:00.0 failed with error -110

(In reply to Konrad J Hambrick from comment #21)
> Hans --
>
> I have exactly the same Hardware and when I Enable my Discrete ThunderBolt
> in the BIOS, I experience a series of TimeOut delays during Boot.
>
> I am running 5.15.53 on Slackware64 15.0 ...
>
> I don't have a Dock for my LapTop so I am not sure this bug is a duplicate
> of https://bugzilla.kernel.org/show_bug.cgi?id=206459
>
> Q1: Is your Patch Series still viable ?
>
> If so, your second link above for the patch series no longer works:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource
>
> Q2: Was this patch ever included in the kernel source tree ?
>
> If so, what version should I try ?
>
> Thanks !
>
> -- kjh

The series got merged in mainline: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=Add+kernel+cmdline+options+to+use%2Fignore+E820+reserved+regions

The kernel.org cgit is always a good option to search if a patch got applied in the end.

Also don't forget to check linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=grep&q=Add+kernel+cmdline+options+to+use%2Fignore+E820+reserved+regions or subsystem-specific branches like drm-tip (on the freedesktop cgit: https://cgit.freedesktop.org/).

Thanks for the info and the tip, wse !
-- kjh

I don't know the connection to the DMAR faults, but from the first log (https://bugzilla.kernel.org/attachment.cgi?id=298567):

BIOS-e820: [mem 0x000000006bc00000-0x00000000efffffff] reserved
pci_bus 0000:00: root bus resource [mem 0x71000000-0xdfffffff window]

This entire PCI host bridge aperture is "reserved" in the E820 map, which means we won't allocate any PCI BARs in that area, which means hot-add won't work. The current workaround for this is https://git.kernel.org/linus/d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks"), which appeared in v5.19.

I think the underlying issue is that this machine has EFI, Linux converts the EFI memory map to E820 format, and it converts EFI_MEMORY_MAPPED_IO to E820_TYPE_RESERVED.
EFI_MEMORY_MAPPED_IO means "the OS must map this memory for use by EFI runtime services." It does *not* mean "the OS can never use this memory." I think Linux should omit EFI_MEMORY_MAPPED_IO areas completely from the E820 map.

This is basically the same issue as bug #216565. I attached a patch there to omit EFI_MEMORY_MAPPED_IO.

I would love to hear from anybody with a Clevo machine that shows similar problems. If you can, please boot with the patch at https://bugzilla.kernel.org/attachment.cgi?id=303123 with the "efi=debug" kernel parameter, open a new bugzilla with the complete dmesg log, and assign it to me.

Bjorn --
The referenced efi_mmio-2 Patch applied cleanly on 5.15.77 and I am building a 5.15.77_bjorn_p Kernel now.
I'll have to wait until after work to install and boot but I'll enable Discrete Thunderbolt in my Insyde EFI Bios and boot 5.15.bjorn_p
Thank You !
-- kjh

Created attachment 303131 [details]
Linux 5.15.77 with Discrete Thunderbolt Enabled ( no patch )
Bjorn --
This is the dmesg for linux 5.15.77 without your Patch.
When I enable 'Discrete Thunderbolt' in the Insyde BIOS there is a long delay for each of the three DMAR Errors and a final delay for Thunderbolt ( see time stamps [11.534560] thru [88.380241] ).
Created attachment 303132 [details]
This is 5.15.77 with your patch applied
This is linux 5.15.77 with your patch applied.
There are extra log entries before event 'do_add_efi_memmap' at timestamp [0.000000]
The same three DMAR faults and timeouts still happen between timestamps [26.938364] and [88.372760] and there is still a Discrete Thunderbolt error at [88.378716]
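
To make the spacing of those delays easier to see, here is a minimal Python sketch that pulls the timestamps of the fault and ICM-failure lines out of a dmesg excerpt and prints the gaps between them. The sample lines are copied from the 5.15.53 dmesg quoted earlier in this thread; the helper names (`event_times`, `gaps`) and the marker strings chosen are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Lines copied from the dmesg excerpt quoted earlier in this thread
# (kernel 5.15.53, Discrete Thunderbolt enabled).
SAMPLE_DMESG = """\
[ 27.953327] DMAR: DRHD: handling fault status reg 2
[ 48.433330] DMAR: DRHD: handling fault status reg 2
[ 68.913306] DMAR: DRHD: handling fault status reg 2
[ 89.387726] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
"""

# "[ 27.953327] message" -> timestamp, message
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

# Substrings that mark the events of interest (taken from the log above).
MARKERS = ("DRHD: handling fault status", "failed to send driver ready to ICM")


def event_times(dmesg_text):
    """Return the timestamps (seconds) of lines containing one of MARKERS."""
    times = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if match and any(marker in match.group(2) for marker in MARKERS):
            times.append(float(match.group(1)))
    return times


def gaps(times):
    """Return the differences between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(event_times(SAMPLE_DMESG)):
    print(f"{gap:.2f} s")
```

On the sample above, each of the three gaps comes out at a little over 20 seconds. To run the same check against a saved log, pass the contents of the file to `event_times` instead of `SAMPLE_DMESG`.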
Created attachment 303133 [details]
Linux 5.15.77 with Discrete Thunderbolt ( no patch )
Bjorn --
For reference, this is dmesg for 5.15.77 with Discrete Thunderbolt disabled and without your Patch
Created attachment 303134 [details]
5.15.77 with your Patch with Discrete Thunderbolt Disabled
And again for reference, this is 5.15.77 with your patch with Discrete Thunderbolt Disabled in the INSYDE BIOS
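
As a side note on the E820 observation made earlier in this thread (the host bridge window lying entirely inside a "reserved" E820 entry), the check can be reproduced from the two quoted log lines with a short Python sketch. The helper names `mem_range` and `covers` are hypothetical, and the regular expression only targets the `[mem A-B` field as it appears in those two lines.

```python
import re

# The two lines quoted earlier in this thread from the first dmesg attachment.
E820_LINE = "BIOS-e820: [mem 0x000000006bc00000-0x00000000efffffff] reserved"
BRIDGE_LINE = "pci_bus 0000:00: root bus resource [mem 0x71000000-0xdfffffff window]"

# Matches the "[mem START-END" field used by both log formats above.
MEM_RE = re.compile(r"\[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)")


def mem_range(line):
    """Extract the inclusive (start, end) address range from a log line."""
    match = MEM_RE.search(line)
    if match is None:
        raise ValueError(f"no [mem ...] range in: {line!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def covers(outer, inner):
    """True if the inclusive range `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


reserved = mem_range(E820_LINE)
window = mem_range(BRIDGE_LINE)
print(covers(reserved, window))
```

For the two lines above this prints `True`, which matches the statement that the whole aperture is covered by the reserved region.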
Thank you very much for all this work, Konrad!

I'm confused. Comments #16 and #17 suggest that this DMAR issue was fixed by https://git.kernel.org/linus/d341838d776a ("x86/PCI: Disable E820 reserved region clipping via quirks"), which appeared in v5.19. Can you confirm that?

Bjorn --
You're welcome.
I was unable to successfully apply those patches against 5.15.y and then I believe the patches were reverted or partially reverted in 5.19.y.
I will consider running a newer Long-Term Kernel when it is announced next month, especially if it fixes Discrete Thunderbolt and especially since 5.15.y will go EoL in Oct 2023.
Thanks for all you do for the Linux Kernel, Bjorn !
-- kjh

I think the long delay is a different issue than the e820 region patch and has more to do with this:

[ 88.372827] thunderbolt 0000:07:00.0: failed to send driver ready to ICM

The 20s delay between the 4 retries comes from this: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1024 or this: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1630, respectively.

Having the icm not correctly initialized seems to be no problem, however? I read somewhere that there is a software fallback for it. But I don't really know what the icm actually does in the first place xD. But I think we should probably open another bug report for this.

For reference: what fixed the Clevo X170KM for me, except for the delay, was this patch series:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa6dae5d82081e8d9f8e6a2baf7149442a6c1ba5
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d341838d776abadb3ac48abdd2f1f40df5a4fc10
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0ae084d5a6744b1318407d8e20fb88ac0fd85d47

Thanks for all the info wse@tuxedocomputers.com
The comments in Hans de Goede's three patches look familiar but I'll try all three patches against 5.15.y again.
Or should I be running a different Kernel Version ?
The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...
I'll test next weekend when I've got time to build ; boot ; dmesg ; repeat
Then I will open a new bug report.
Thank you and thank you and Tuxedo Computers for the tuxedo-keyboard-master and tuxedo-keyboard-ite kernel modules ( they work great for me ) !
-- kjh

Regarding the delay, i.e., this from Konrad's attachment at comment 27:

[ 6.837625] DMAR: DRHD: handling fault status reg 2
[ 6.837629] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 26.937996] DMAR: DRHD: handling fault status reg 2
[ 26.944592] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 47.417982] DMAR: DRHD: handling fault status reg 2
[ 47.424711] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 67.897999] DMAR: DRHD: handling fault status reg 2
[ 67.904639] DMAR: [DMA Write NO_PASID] Request device [07:00.0] fault addr 0x69935000 [fault reason 0x05] PTE Write access is not set
[ 88.372827] thunderbolt 0000:07:00.0: failed to send driver ready to ICM
[ 88.380241] thunderbolt: probe of 0000:07:00.0 failed with error -110

This all involves the same device (07:00.0) and it's all likely related, so I think we should reserve *this* bug report for this DMAR/timeout issue.

As far as I can tell, this is unrelated to bug 206459 and the E820 issues addressed by Hans de Goede's patches, so I'm going to reopen this bug report.

If we're lucky, maybe the DMAR/timeout stuff has been fixed by something between v5.15 and v6.0.

(In reply to Konrad J Hambrick from comment #36)
> Thanks for all the info wse@tuxedocomputers.com
>
> The comments in Hans de Goede's three patches look familiar but I'll try all
> three patches against 5.15.y again.
>
> Or should I be running a different Kernel Version ?
>
> The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...
>
> I'll test next weekend when I've got time to build ; boot ; dmesg ; repeat
>
> Then I will open a new bug report.
>
> Thank you and thank you and Tuxedo Computers for the tuxedo-keyboard-master
> and tuxedo-keyboard-ite kernel modules ( they work great for me ) !
>
> -- kjh

The two lines are the same in 5.15.77:
https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1024
https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1630

Created attachment 303142 [details]
lspci -vvv for my Sager NP9672M / Clevo X170KM-G
Thanks Bjorn
Attached is lspci -vvv on the unpatched 5.15.77 Kernel if it helps.
It is large-ish but more may be better than less.
Should I try Linux 6.0.7 or maybe even 6.1.rc3 ?
-- kjh
# uname -a
Linux kjhlt7.kjh.home 5.15.77.kjh #1 SMP PREEMPT Thu Nov 3 10:27:53 CDT 2022 x86_64 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz GenuineIntel GNU/Linux
(In reply to wse from comment #38)
> (In reply to Konrad J Hambrick from comment #36)
> > Thanks for all the info wse@tuxedocomputers.com
> >
> > The Code referenced at elixir.bootlin.com is from Kernel Version 6.0.7 ...
> >
>
> the two lines are the same in 5.15.77
>
> https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1024
>
> https://elixir.bootlin.com/linux/v5.15.77/source/drivers/thunderbolt/icm.c#L1630

Thanks wse@tuxedocomputers.com
See Bjorn's post -- I'll wait and see what he finds.
-- kjh

Just tested 6.1-rc4 (from ubuntu mainline ppa). Still the same behaviour: 4*20s delay during boot, but works just fine afterwards.

Created attachment 303144 [details]
dmesg after boot (gui did not load but that seems an unrelated bug because rc)
Created attachment 303145 [details]
connecting a dock after boot, worked just fine
Reassigning to PCI because I don't think this is a USB issue.

The 20s timeouts happen because the IOMMU faults prevent the driver from doing DMA, so first we need to figure out why that is happening. I have seen similar faults, but they typically were BIOS issues related to ACPI DMAR table setup, so I suggest people first find out whether there is a BIOS upgrade and try that one.

In addition, can someone attach an acpidump from such a system so we can look at the DMAR table?

Created attachment 303147 [details]
Clevo X170KM-G ACPI Dump BIOS 1.07.09RTR6
Thanks for looking into this. ACPI Dump is attached. The BIOS is the newest one I could find (there is also a RTR7-G2 version, but that one is identical to RTR6 except for a slower clock speed for RAM for compatibility reasons).

Tangent: *almost* all the info in the dmar.dat is already in the dmesg log. Maybe the dmesg logging could be tightened up a bit (four copies of the base address, random "0x" vs not, etc) and any missing bits added? It would be really nice if we could connect this to the PCI hierarchy, but that part seems missing.

DMAR: DRHD base: 0x000000fed91000 flags: 0x1
DMAR: dmar0: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 0
DMAR-IR: HPET id 0 under DRHD base 0xfed91000
DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
DMAR-IR: Enabled IRQ remapping in x2apic mode

Note that when setting intel_iommu=off and iommu=off the DMAR errors disappear, but the delay and the icm error stay. The dock is still functional. dmesg attached.

Created attachment 303148 [details]
intel_iommu=off iommu=off
Created attachment 303149 [details]
intel_iommu=off iommu=off after attaching thunderbolt dock
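
When comparing logs taken with different boot flags, it helps to record which parameters were actually active. Below is a minimal Python sketch that splits a kernel command line into key/value pairs so flags such as intel_iommu=off and iommu=off can be checked. On a running Linux system the command line can be read from /proc/cmdline; the sample string and the function name `parse_cmdline` are assumptions for illustration, and the simple whitespace split does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict.

    "key=value" tokens map key -> value; bare tokens map to None.
    If a key is repeated, the last occurrence wins in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Hypothetical sample; on a live system use open("/proc/cmdline").read().
SAMPLE_CMDLINE = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro intel_iommu=off iommu=off"

params = parse_cmdline(SAMPLE_CMDLINE)
for flag in ("intel_iommu", "iommu"):
    print(flag, "=", params.get(flag))
```

Saving this output next to each dmesg attachment makes it unambiguous which configuration a given log belongs to.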
Hi,

Okay, I see now in the dmesg that even if IOMMU is turned off the driver fails to communicate with the TBT firmware. This is unexpected. Have you tried Windows on that system and, if yes, does the TBT driver work?

The workaround for this is to blacklist the driver, but that also prevents power management (which probably does not matter because it is not working anyway).

One additional suggestion is to check if there is a TBT firmware upgrade available for that system.

Thunderbolt works on both windows and linux, at least the basic stuff like usb devices, audio, network, screens attached to a thunderbolt only dock. I'm not familiar enough with windows to know where to look for this icm error if it happens there too.

We already reached out to clevo, but the firmware is already the latest they offer for the tbt chip in this barebone (something ending in .29 if I remember correctly). Is there some generic intel firmware update(r) I can try? I didn't find one when I searched a while back.

There is no generic firmware unfortunately.

You should be able to see in Windows device manager that after a while (when there is nothing connected) the PCIe root port that has the Thunderbolt controller connected is in D3 Power State. If that's the case then the Windows side of the driver is working as it enters low power states.

Mika --
I've still got a bootable, up-to-date Win11 Partition but like wse, I am not sure what info to gather for comparison.
I've also got an acpidump if you want it for my slightly different Sager NP9672M-G1 ( a rebranded CLEVO X170KM-G ).
Thank you very much for looking into this :)
-- kjh

(In reply to Mika Westerberg from comment #54)
> There is no generic firmware unfortunately.
>
> You should be able to see in Windows device manager that after a while (when
> there is nothing connected) the PCIe root port that has the Thunderbolt
> controller connected is in D3 Power State. If that's the case then the
> Windows side of the driver is working as it enters low power states.

I assume you mean: System Devices -> Thunderbolt(tm) Controller - 1137 -> Properties -> Details -> Power data
Current power state

Strange behaviour: When I freshly installed windows 11 with only the thunderbolt driver, device manager was showing me constant D3 there, even when using a thunderbolt dock. I then proceeded to install all the other drivers for the device (and windows update was doing stuff in the background); now it's in a constant D0 state.

Yes correct. If you don't plug in any devices and wait, does it then enter D3?

No.

Vanilla Windows Install -> The Thunderbolt controller does not show up in the device manager (not under a recognizable name at least), but the thunderbolt only docking station still seems to be functional?

Installing the "Thunderbolt" driver from the Clevo driver package on this Vanilla Windows install -> A thunderbolt controller shows up in the device manager. But regardless of whether a dock is plugged in or not it shows as D3 power state. Dock functions correctly however.

Installing the rest of the clevo driver package (including a "Chipset" labeled driver) -> The thunderbolt device is now permanently in D0 power state.

Here are the drivers for reference: https://mytuxedo.de/index.php/s/ZeB8FTf8CrpEtJr?path=%2FXUX_Series%2FXUX7_Gen13

Okay so it is not entirely clear whether the Windows "driver" works any better than the Linux one (I'm not too familiar with the Windows side of things but my understanding is that the driver does the exact same flow to establish communication with the firmware).

The reason why the "dock is functional" is that on these configurations there is really not much the Thunderbolt driver needs to do, so all the PCIe/USB3/DP tunneling is done in the firmware. The only thing the Thunderbolt driver is needed for is to make the controller enter D3 when idle to allow the whole Thunderbolt add-in-card to enter low power state.
It could also be that the "Chipset" package provides some driver to a device on that dock that does not support Windows version of runtime PM so it basically keeps the thing in D0.

wse --
I am curious.
Does your Thunderbolt Dock work if you disable Discrete Thunderbolt in the INSYDE BIOS > Advanced Screen ?
The reason I wonder is when I disable Discrete Thunderbolt in the BIOS, there are no DMAR: DRHD: handling fault status' timeout delays but I've still got Thunderbolt Driver Modules.
I don't have a Dock so I can't test that but my Thunderbolt External SSD does work.

Mika --
I 'found' a windows command ( msinfo32 ) that can print all resources as an XML File.

C:\> %CommonProgramFiles%\Microsoft Shared\MSInfo\msinfo32.exe /nfo c:\temp\msinfo32-ascii.xml

I enabled Discrete Thunderbolt, booted Win11 and dumped the output to a large .xml file ( 1.75 MB as Unicode / 0.88 MB with the null bytes stripped )
It does show Memory Resources for all Devices and just about everything in the Device Manager Interface.
I can append the output if it would be useful.
HTH and thanks to all !
-- kjh

Found the "culprit" by installing the drivers one by one: The (Clevo) Control Center is what's causing the TBT Controller to stay in D0.

Clevo Control Center installed -> TBT always D0
Clevo Control Center not installed -> TBT always D3

This holds regardless of whether or not I have anything connected to the thunderbolt port or am actively using it.

Since this is only a power saving feature, can we quirk it so that the kernel just skips the initialisation? To at least eliminate the long boot delay?
Looking at the code, maybe above here: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L1968

(Pseudo code) Adding an "if (tb->root_switch->quirks & QUIRK_SKIP_ICM_INIT) {return 0;}" and adding this quirk to the quirk table here: https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/quirks.c

You can disable the TBT driver also by blacklisting, so adding "modprobe.blacklist=thunderbolt" in the kernel command line should work around this.

Is this Clevo X170M the only system that has this issue? I will talk to our Windows driver folks if they have seen this.

BTW, does the xHCI work normally? If you plug in a USB 3.x device to the Thunderbolt Type-C ports, do they show up and are functional?

(In reply to Mika Westerberg from comment #65)
> You can disable the TBT driver also by blacklisting so adding
> "modprobe.blacklist=thunderbolt" in the kernel command line should work this
> around.

Thanks, with the driver blacklisted the delay is gone and the tbt dock is still functional.

> Is this only Clevo X170M system that has this issue? I will talk to our
> Windows driver folks if they have seen this.

That Clevo is the only device with a discrete thunderbolt controller I have at hand. So I can't really say if it's the only one or not.

(In reply to Mika Westerberg from comment #66)
> BTW, does the xHCI work normally? If you plug in a USB 3.x device to the
> Thunderbolt Type-C ports, do they show up and are functional?

Yes, a USB 3.0 thumbdrive is correctly recognized as such, both connected directly and through the tbt dock.

(In reply to Konrad J Hambrick from comment #61)
> wse --
>
> I am curious.
>
> Does your Thunderbolt Dock work if you disable Discrete Thunderbolt in the
> INSYDE BIOS > Advanced Screen ?
>
> The reason I wonder is when I disable Discrete Thunderbolt in the BIOS,
> there are no DMAR: DRHD: handling fault status' timeout delays but I've
> still got Thunderbolt Driver Modules.
>
> I don't have a Dock so I can't test that but my Thunderbolt External SSD
> does work.

With Discrete Thunderbolt disabled in the bios the ports are reduced down to usb 2.0.

So no: the tbt only dock does not work anymore.

(In reply to wse from comment #69)
> so no: the tbt only dock does not work anymore

OMG !
I see now that my USB 3.2 Drive is connected as USB 2.0 !
Sorry about that rabbit hole.
OTOH, I believe I understand from the posts above that if I append modprobe.blacklist=thunderbolt to my Kernel Commandline that I can connect at USB 3.2 Speed ?
In addition, your TBT 4.0 Dock still works too ?
That's great news !
I need to wrap up some Source Code Edits for work then I'll set up the blacklist and reboot.
Thank you wse and Mika !
-- kjh

It is a tbt 3 dock.

My system works perfectly as far as I can tell in this config: 1. Append modprobe.blacklist=thunderbolt to the kernel commandline 2. Enable Discrete Thunderbolt in the INSYDE BIOS 3. reboot -- no delays due to the DMAR/timeout issue.
My USB 3.2 drive works and I will order a TBT 4 Dock for further testing
Thanks to all !
-- kjh
p.s. I updated my kernel to 5.15.78 last night at the same time. I built it without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)
Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?
Thanks again to all !

(In reply to Konrad J Hambrick from comment #72)
> My system works perfectly as far as I can tell in this config:
>
> 1. Append modprobe.blacklist=thunderbolt to the kernel commandline
> 2. Enable Discrete Thunderbolt in the INSYDE BIOS
> 3. reboot -- no delays due to the DMAR/timeout issue.

Sorry ... Markdown ate my list. Looked OK in preview mode.

Okay, good to know. However, the blacklist thing is just a workaround and it causes your system to draw more energy than it has to because it keeps the whole Maple Ridge controller in D0 (powered on).
The issue should still be root-caused and then fixed properly. We are looking into this.

(In reply to Konrad J Hambrick from comment #72)
> p.s. I updated my kernel to 5.15.78 last night at the same time. I built it
> without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment
> #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)
>
> Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?

No. That patch (https://bugzilla.kernel.org/attachment.cgi?id=303123) is dead and will never go anywhere, so don't waste your time on it.

(In reply to Bjorn Helgaas from comment #75)
> (In reply to Konrad J Hambrick from comment #72)
>
> > p.s. I updated my kernel to 5.15.78 last night at the same time. I built it
> > without Bjorn's "omit EFI_MEMORY_MAPPED_IO" Patch from [comment
> > #25](https://bugzilla.kernel.org/show_bug.cgi?id=214259#c25)
> >
> > Should I try the patch on 5.15.78 and append efi=debug to the CommandLine ?
>
> No. That patch (https://bugzilla.kernel.org/attachment.cgi?id=303123) is
> dead and will never go anywhere, so don't waste your time on it.

Thanks Bjorn.
-- kjh

Tried to reproduce this on my local setup (it is an Alder Lake-S based system with the same Thunderbolt controller) but I was not able to trigger the issue.

However, I went through some of the logs you attached and it seems that at least sometimes the driver is able to communicate with the firmware. It looks like it works when you boot with the dock already connected. I wonder if you could try with the v6.x kernel so that you add "thunderbolt.dyndbg=+p intel_iommu=off" in the kernel command line and boot the system up with the dock connected. Then unplug the dock, wait for 1 minute and plug it back. Does it still work? And can you attach dmesg and lspci (sudo lspci -vv) outputs here?

Mika --
I don't have a TBT Dock but I do have a USB 3.2 External Enclosure with a 1TB Samsung 980 Pro NVMe installed.
I can build 6.0.9 or 6.1-rc5 and boot as requested if that will help.
Now that TBT seems to work, I am saving up for a Thunderbolt 4 Dock but that won't be until later in the month.
If you would like me to do the tests with the USB 3.2 Drive, which Kernel would you prefer ( 6.0.y or 6.1.rcX ) ?
Thanks
-- kjh

Hi,

unfortunately I don't think USB 3.x devices do the same. For this experiment we would need to get the TBT/USB4 link up before the driver loads.

BTW, it is not guaranteed that TBT actually works on that system fully. Because the driver is blacklisted there is no power management among other things (networking over USB4 etc).

Thanks Mika.
I will let you know when I have the TB4 Dock in hand and request your recommendations for which Kernel to test when I am good to go.
Thanks again.
-- kjh

Created attachment 303196 [details]
dmesg after boot
Created attachment 303197 [details]
lspci after boot
Created attachment 303198 [details]
dmesg tbt dock disconnected
Created attachment 303199 [details]
lspci tbt dock disconnected
Created attachment 303200 [details]
dmesg tbt dock reconnected
Created attachment 303201 [details]
lspci tbt dock reconnected
(In reply to Mika Westerberg from comment #77)
> Tried to reproduce this on my local setup (it is Alder Lake-S based system
> with the same Thunderbolt controller) but I was not able to trigger the
> issue.
>
> However, I went through some of the logs you attached and it seems that at
> least sometimes the driver is able to communicate with the firmware. It
> looks like it works when you boot with the dock already connected. I wonder
> if you could try with the v6.x kernel so that you add "thunderbolt.dyndbg=+p
> intel_iommu=off" in the kernel command line and boot the system up with the
> dock connected. Then unplug the dock, wait for 1 minute and plug it back.
> Does it still work? And can you attach dmesg and lspci (sudo lspci -vv)
> outputs here?

Sorry for the long wait, I uploaded the logs now. I used 6.1-rc5.

Unplugging and replugging works just fine.

I realized: here the wait is just ~35 seconds and I don't get the icm error.

Thanks for the logs! I will go through them. Can you also check what the /sys/bus/thunderbolt/devices/0-0/nvm_version shows?

0-0/nvm_version is 29.0

Good to know now where in linux I can check the tbt firmware version ^^

That's the newest and only tbt firmware clevo is offering for the device: https://www.clevo.com.tw/en/e-services/download/ftpOut.asp?Lmodel=X170KM-G&Ltype=9

Hm. The dl link there is broken and I'm unsure if that is the firmware update or just the driver ...

Created attachment 303303 [details]
Anker Apex, 12-in-1, Thunderbolt 4 dmesg and lspci output
Mika --
Attached is anker-tbt4-testing.tgz
I hope that's OK. It seemed easier than a bunch of separate .txt file attachments.
These are the contents:
Kernel Configs for 6.0.10
kernel-config/.config-6.0.10-generic
TBT4 Dock Turned on and Plugged in at boot
test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on.txt
test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on.txt
test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on-after-lspci-hang.txt
test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on-2.txt
test-1-with-tbt4-on-at-boot/dmesg-unplugged-tbt4.txt
test-1-with-tbt4-on-at-boot/lspci-unplugged-tbt4.txt
test-1-with-tbt4-on-at-boot/dmesg-replugged-tbt4.txt
test-1-with-tbt4-on-at-boot/lspvi-vv-replugged.txt
TBT4 not Plugged in at boot
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-plugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-plugged.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-unplugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-unplugged.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-replugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-replugged.txt
Repeated Test 1 - TBT4 Plugged in at boot.
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-after-lspci.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-unplugged.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged-after-lspci.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-replugged.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged-after-lspci.txt
sys_bus_thunderbolt_devices_0-0_nvm_version.txt -> 26.0
Each of the test-[1-3]-* directories contains dmesg and lspci output for one boot session on a new 6.0.10 Kernel.
Please let me know if I need to further explain the contents of the files.
When the TBT4 is Plugged in at boot, lspci -vv hung for about 90 seconds, which shows in the test-3 lspci-vv-*.txt files where I ran ( date ; lspci -vv ; date )
One note: There is a USB3.2 NVMe SSD Enclosure plugged into the USB3.2 Port in the TBT Dock and it never powered up but it seems to have been detected ?
Please let me know if I can do any more testing -- I would love to make this work!
Thank you and wse !
-- kjh
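For anyone who wants to collect the NVM version the same way as the sys_bus_thunderbolt_devices_0-0_nvm_version.txt file above, here is a minimal Python sketch. It only assumes the sysfs path quoted in this report (/sys/bus/thunderbolt/devices/<device>/nvm_version); the function name read_nvm_versions is made up for illustration, and entries without the attribute are simply skipped.

```python
from pathlib import Path


def read_nvm_versions(base="/sys/bus/thunderbolt/devices"):
    """Return {device_name: nvm_version} for every entry exposing the attribute.

    `base` defaults to the sysfs directory mentioned in this report; it is a
    parameter so the function can be pointed at a copy of the tree as well.
    """
    versions = {}
    root = Path(base)
    if not root.is_dir():
        # Driver not loaded or no controller present: nothing to report.
        return versions
    for dev in sorted(root.iterdir()):
        try:
            versions[dev.name] = (dev / "nvm_version").read_text().strip()
        except OSError:
            # Entry has no readable nvm_version attribute; skip it.
            continue
    return versions


if __name__ == "__main__":
    for name, version in read_nvm_versions().items():
        print(f"{name}: nvm_version {version}")
```

On a machine without the thunderbolt driver loaded this prints nothing rather than failing.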
I have the hardest time with this interface !

Test 1 - initial session with TBT4 attached and turned on at boot
test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on.txt
test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on.txt
test-1-with-tbt4-on-at-boot/dmesg-boot-with-tbt4-on-after-lspci-hang.txt
test-1-with-tbt4-on-at-boot/lspci-vv-boot-with-tbt4-on-2.txt
test-1-with-tbt4-on-at-boot/dmesg-unplugged-tbt4.txt
test-1-with-tbt4-on-at-boot/lspci-unplugged-tbt4.txt
test-1-with-tbt4-on-at-boot/dmesg-replugged-tbt4.txt
test-1-with-tbt4-on-at-boot/lspvi-vv-replugged.txt
Test 2 - second session without TBT 4 attached
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-plugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-plugged.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-unplugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-unplugged.txt
test-2-without-tbt4-on-at-boot/dmesg-boot-without-tbt4-replugged.txt
test-2-without-tbt4-on-at-boot/lspci-vv-boot-without-tbt4-replugged.txt
Test 3 - third session with TBT4 attached and on at boot ( repeat #1 )
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-after-lspci.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-unplugged.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-unplugged-after-lspci.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged.txt
test-3-with-tbt4-on-at-boot/lspci-vv-tbt4-on-at-boot-replugged.txt
test-3-with-tbt4-on-at-boot/dmesg-tbt4-on-at-boot-replugged-after-lspci.txt
Sorry !
-- kjh

Thanks for the logs!
I see there is an AER error when the controller is runtime resumed so I wonder if you can try to boot with "pci=noaer" (with device connected too) and see if that makes it work any better? I contacted Clevo last week but so far haven't got any reply from them.

Mika --
I've got a project due this morning but I'll append pci=noaer to my boot args and run the same tests later today after my code review.
I also received a BIOS Update from Sager which originally came from Clevo ( the file name is X170KM-G08_LS1 )
Should I update the BIOS and then gather logs or should I run the tests on the same old BIOS version ?
Finally, I still have a bootable Win11 partition. Are there any tests I could run on the Windows Side ?
Thanks for all you're doing !
-- kjh
p.s. The BIOS Update is a .zip file and I could attach it here if that's helpful.

It's fine to update the BIOS prior to testing. Hopefully the problem goes away (but most likely not). I don't know any additional testing to be done in Windows. Our Windows folks confirmed that the timeout is the same on the Windows side.

Created attachment 303313 [details]
Same tests with pci=noaer
Mika --
Attached is anker-tests-4..7.tar.gz ... with 4 test cases
Test 4 - appended pci-noaer boot with TBT4 Plugged ( old BIOS - 107.06.LS1 )
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-after-lspci-vv.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-unplugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot-unplugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-on-at-boot-replugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/lspci-vv-tbt4-on-at-boot-replugged.txt
test-4-with-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-plugged-shutdown.txt
Test 5 - appended pci-noaer boot with TBT4 Unplugged ( old BIOS - 107.06.LS1 )
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-plugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-plugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-unplugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-unplugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-boot-without-tbt4-replugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/lspci-vv-boot-without-tbt4-replugged.txt
test-5-without-tbt4-on-at-boot-pci-noaer/dmesg-tbt4-plugged-shutdown.txt
Test 6 - appended pci-noaer boot with TBT4 Plugged ( new BIOS - 107.08.LS1 )
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-unplugged-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-replugged.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-tbt4-on-at-boot-replugged.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-on-at-boot-replugged-after-lspci-vv.txt
test-6-with-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-replugged-shutdown.txt
Test 7 - appended pci-noaer boot with TBT4 Unplugged ( new BIOS - 107.08.LS1 )
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-plugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-plugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-unplugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-unplugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-boot-without-tbt4-replugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/lspci-vv-boot-without-tbt4-replugged.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/sys_bus_thunderbolt_devices_0-0_nvm_version.txt
test-7-without-tbt4-on-at-boot-pci-noaer-new-bios/dmesg-tbt4-replugged-shutdown.txt
Kernel Parameters:
pci=noaer [PCIE] If the PCIEAER kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of PCIE advanced error reporting.
intel_iommu=off [DMAR] Disable Intel IOMMU driver (DMAR) option
thunderbolt.dyndbg=+p Enable debug messages at boot time. See
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
-- kjh
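Since each test run above depends on which of these boot parameters were actually in effect, it can help to check a saved copy of /proc/cmdline before comparing logs. Below is a small Python sketch of such a check. The helper names (parse_cmdline, missing_params) and the WANTED table are made up for illustration, and the parser is deliberately simple: it splits on whitespace and does not handle quoted values.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted 'name=value' settings that are absent from cmdline.

    Values such as pci=noaer,nomsi are treated as comma-separated lists.
    """
    have = parse_cmdline(cmdline)
    missing = []
    for name, value in wanted.items():
        present = have.get(name)
        if present is None or value not in present.split(","):
            missing.append(f"{name}={value}")
    return missing


# The three parameters listed above.
WANTED = {
    "pci": "noaer",
    "intel_iommu": "off",
    "thunderbolt.dyndbg": "+p",
}

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            current = handle.read()
    except OSError:
        current = ""
    print("missing:", missing_params(current, WANTED))
```

Saving the output next to each dmesg file makes it easier to tell later which boot used which options.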
Thanks for the logs! At least the AER error is now gone (temporarily). However, even after the BIOS upgrade the issue persists. Can we try yet another experiment?
1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
2. Boot the system up, nothing connected
3. Wait a couple of minutes
4. Load the driver manually
# modprobe thunderbolt dyndbg
(it may be that it does not load because of the blacklist, I did not try myself but if that's the case you can run insmod instead).
The idea here is to delay the driver load to see if that gives the Maple Ridge firmware enough time to come up properly.

Created attachment 303317 [details]
blacklist thunderbolt ; boot unplugged
Mika --
modprobe failed so I rmmod'd thunderbolt and insmod'd thunderbolt.ko
Both modprobe and insmod need CONFIG_DYNAMIC_DEBUG
I also included my Kernel .config. I will rebuild with CONFIG_DYNAMIC_DEBUG ... are there other configs I should set ?
Thanks !
-- kjh
test-8-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt/
# modprobe then plug
dmesg-boot-without-tbt4.txt
lspci-vv-boot-without-tbt4.txt
lsmod-boot-without-tbt4.txt
dmesg-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt
lspci-vv-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt
lsmod-boot-without-tbt4-after-modprobe-thunderbolt_dyndbg.txt
dmesg-boot-without-tbt4-plugged.txt
lspci-vv-boot-without-tbt4-plugged.txt
dmesg-boot-without-tbt4-unplugged.txt
lspci-vv-boot-without-tbt4-unplugged.txt
# rmmod thunderbolt && insmod /lib/modules/6.0.10.kjh/kernel/drivers/thunderbolt/thunderbolt.ko dyndbg
dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg
lspci-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg
dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg-plugged
lspcii-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg-plugged
dmesg-after-rmmod-insmod-thunderbolt.ko_dyndbg-unplugged
lspci-vv-after-rmmod-insmod-thunderbolt.ko_dyndbg-unplugged
test-9-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt-insmod/
# repeat test-8 without modprobe ( insmod then plug )
dmesg-boot-without-tbt4.txt
lspci-vv-boot-without-tbt4.txt
lsmod-boot-without-tbt4.txt
dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg
lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg
lsmod-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg
dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-plugged
lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-plugged
dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-unplugged
lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-unplugged
dmesg-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-replugged
lspci-vv-tbt-unplugged-after-insmod-thunderbolt.ko_dyndbg-replugged
test-a-without-tbt4-at-boot-pci-noaer-blacklist-thunderbolt-insmod-after-plug/
# plug then insmod
dmesg-boot-without-tbt4.txt
lspci-vv-boot-without-tbt4.txt
lsmod-boot-without-tbt4.txt
dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt
lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt
lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg.txt
insmod-thunderbolt.ko_dyndbg.txt
dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt
lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt
lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod.txt
dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt
lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt
lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-unplugged.txt
dmesg-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt
lspci-vv-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt
lsmod-tbt-plugged-before-insmod-thunderbolt.ko_dyndbg-insmod-replugged.txt
Kernel Parameters
modprobe.blacklist=thunderbolt
intel_iommu=off
pci=noaer
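Because the dyndbg option above only takes effect when the kernel is built with CONFIG_DYNAMIC_DEBUG, a quick check of the kernel .config before rebuilding can save a boot cycle. The Python sketch below is only an illustration: config_value is a made-up helper, and it assumes the usual .config text format (OPTION=value lines, with unset options written as comments).

```python
def config_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...) or None.

    Options that are unset appear as '# CONFIG_FOO is not set' comments in a
    .config file, so they never match 'CONFIG_FOO=' and yield None here.
    """
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


# Tiny made-up excerpt in .config format, for demonstration only.
SAMPLE_CONFIG = """\
CONFIG_USB4=m
# CONFIG_DYNAMIC_DEBUG is not set
CONFIG_PCIEAER=y
"""

for option in ("CONFIG_DYNAMIC_DEBUG", "CONFIG_USB4", "CONFIG_PCIEAER"):
    print(option, "->", config_value(SAMPLE_CONFIG, option))
```

To use it on a real build, read the text from the .config used for that kernel (for example the kernel-config file included in the earlier tarball) and pass it in place of SAMPLE_CONFIG.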
Mika Westerberg 2022-11-29 07:45:17 UTC
Thanks for the logs! At least the AER error is now gone (temporarily).
However, even after the BIOS upgrade the issue persists.
Can we try yet another experiment?
1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
2. Boot the system up, nothing connected
3. Wait a couple of minutes
4. Load the driver manually
# modprobe thunderbolt dyndbg
(it may be that it does not load because of the blacklist,
I did not try myself but if that's the case you can run insmod instead).
The idea here is to delay the driver load to see if that gives the Maple Ridge
firmware enough time to come up properly.
(In reply to Mika Westerberg from comment #99)
> Thanks for the logs! At least the AER error is now gone (temporarily).
> However, even after the BIOS upgrade the issue persist. Can we try yet
> another experiment?
>
> 1. Put "modprobe.blacklist=thunderbolt" in the kernel command line
> 2. Boot the system up, nothing connected
> 3. Wait a couple of minutes
> 4. Load the driver manually
>
> # modprobe thunderbolt dyndbg
>
> (it may be that it does not load because of the blacklist, I did not try
> myself but if that's the case you can run insmod instead).
>
> The idea here is to delay the driver load if that gives the Maple Ridge
> firmware enough time to come up properly.

No change in behaviour, still the same timeout and error as if it was loaded during boot.

Mika --
Are there any other tests I can run ?
I am building a 6.0.10 Kernel with CONFIG_DYNAMIC_DEBUG=y
I am at your service, sir :)
Thanks for all your time and attention !
-- kjh

Mika --
Way back in Comment #45, you asked about an acpidump.
Would that be helpful with the latest Clevo BIOS ?
If so, do you recommend any specific commandline args ?
Thanks again.
-- kjh

No need for acpidump from the latest BIOS. It probably does not have anything interesting over the previous one.

Thanks Mika.
Looking at wse's dmesg files from Bug 206459 - thinkpad thunderbolt 3 dock gen2 pci memory allocation errors on Yoga C940 unless plugged in before boot ...
I am more than a little disappointed to see that the 'latest' BIOS File I received from Sager is from January of 2020 where wse's Clevo BIOS File is from March, 2022 ...
Oh well ...
Thanks again for all that you're doing !
-- kjh

(In reply to Konrad J Hambrick from comment #105)
> Thanks Mika.
>
> Looking at wse's dmesg files from Bug 206459 - thinkpad thunderbolt 3 dock
> gen2 pci memory allocation errors on Yoga C940 unless plugged in before boot
> ...
>
> I am more than a little disappointed to see that the 'latest' BIOS File I
> received from Sager is from January of 2020 where wse's Clevo BIOS File is
> from March, 2022 ...
>
> Oh well ...
>
> Thanks again for all that you're doing !
>
> -- kjh

Have you compared version numbers? Maybe the dates only changed because of the time the respective branding was inserted. (Does Sager use the generic "StyleNote" branding? The BIOS I used, for example, has Tuxedo branding.)
There is an unofficial mirror of all the unbranded, unmodified Clevo BIOSes: repo:repo@repo.palkeo.com/clevo-mirror/X170KM-G/
Sadly it does not always have the best version, as Clevo sometimes pushes fixes out to the vendors individually.

Wow !
Thanks for Info wse.
My latest Sager BIOS Version is 'lower' than yours:
Mine is 1.07.08LS1:
# grep ' DMI:' dmesg-boot-without-tbt4-on-plugged.txt
[ 0.000000] DMI: Notebook X170KM-G/X170KM-G, BIOS 1.07.08LS1 01/11/2020
Yours is 1.07.09RTR6:
# grep ' DMI:' ../wse/attached_after_boot2
[ 0.000000] DMI: TUXEDO TUXEDO Book XUX7 Gen13/X170KM-G, BIOS 1.07.09RTR6 03/04/2022
So ...
Q1: if you had to guess ... do you believe it would be safe to install a Generic Clevo BIOS on a Sager Rebranded Laptop ?
Q2: does the Clevo BIOS Update blow away your EFI Directory ?
Maybe I did it wrong, but when I installed the Sager BIOS Update, it replaced my GRUB2 EFI Partition with Windows 11 Only and I had to restore GRUB2 so I am a tad afraid of Sager's BIOS Updates now.
Thank you, wse !!
-- kjh

wse --
I checked the repo site that you referenced in comment 107 and the latest BIOS File that I see is 1.07.07 which is 'lower' than my 1.07.08 and your 1.07.09
Thanks again for that link though, it could be handy some day !
-- kjh

wse --
Sorry about the dangling replies,
Sager seems to use 'Notebook' for its Style Branding ( sorry, that's a new word for me :)
I do like your 'TUXEDO Book XUX7 Gen13/X170KM-G' String better :)
Given your active participation in Linux Development, you guys are definitely at the top of my short list for my next Laptop :)
-- kjh

(In reply to Konrad J Hambrick from comment #108)
> Wow !
>
> Thanks for Info wse.
>
> My latest Sager BIOS Version is 'lower' than yours:
>
> Mine is 1.07.08LS1:
>
> # grep ' DMI:' dmesg-boot-without-tbt4-on-plugged.txt
> [ 0.000000] DMI: Notebook X170KM-G/X170KM-G, BIOS 1.07.08LS1 01/11/2020
>
> Yours is 1.07.09RTR6:
>
> # grep ' DMI:' ../wse/attached_after_boot2
> [ 0.000000] DMI: TUXEDO TUXEDO Book XUX7 Gen13/X170KM-G, BIOS 1.07.09RTR6
> 03/04/2022
>
> So ...
>
> Q1: if you had to guess ... do you believe it would be safe to install a
> Generic Clevo BIOS on a Sager Rebranded Laptop ?

Only if you know that your vendor did not request any hardware alterations. Sometimes there are subtle or not so subtle differences, like some vendors have a barebone with one fan and others have the "same" barebone but with 2 fans. Seeing that your BIOS string has an LS1 appended says that at least some software alteration happened (could be only the BIOS logo).

>
> Q2: does the Clevo BIOS Update blow away your EFI Directory ?

from my experience, you can never tell beforehand what exactly on the BIOS level gets reset when flashing a new BIOS. so expect to rewrite the efi entry again ^^

>
> Maybe I did it wrong, but when I installed the Sager BIOS Update, it
> replaced my GRUB2 EFI Partition with Windows 11 Only and I had to restore
> GRUB2 so I am a tad afraid of Sager's BIOS Updates now.
>
> Thank you, wse !!
>
> -- kjh

(In reply to Konrad J Hambrick from comment #109)
> wse --
>
> I checked the repo site that you referenced in comment 107 and the latest
> BIOS File that I see is 1.07.07 which is 'lower' than my 1.07.08 and your
> 1.07.09
>
> Thanks again for that link though, it could be handy some day !
>
> -- kjh

Yeah, that's what I mean with vendor specific fixes that clevo never rolls out in general.

(In reply to Konrad J Hambrick from comment #110)
> wse --
>
> Sorry about the dangling replies,
>
> Sager seems to use 'Notebook' for it's Style Branding ( sorry, that's a new
> word for me :)
>
> I do like your 'TUXEDO Book XUX7 Gen13/X170KM-G' String better :)
>
> Given your active participation in Linux Development, you guys are
> definitely at the top of my short list for my next Laptop :)
>
> -- kjh

Motivating to hear, thanks ^^

Since it doesn't seem that this device will get any more firmware updates, what about introducing a quirk to at least get rid of the delay? (The device is semi stationary anyways so the missing energy saving is no big loss)

I'm wondering where exactly to put such a quirk, there are 2 independent quirk lists in the thunderbolt driver it seems: the obvious quirks.c and here: https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/icm.c#L1119
The quirks.c does seem to only be related to tb_switch which doesn't seem to be related with icm, the other one seems to be related to icm because the icm probe function gets passed a nhi object.
I have no idea what these acronyms stand for or how exactly they are connected xD

My plan is, however, to include a pci id based quirk list like this https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/quirks.c#L31 here https://elixir.bootlin.com/linux/v6.2/source/drivers/thunderbolt/icm.c#L1119 and add an if case for that quirk here https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/nhi.c#L1251 or here https://elixir.bootlin.com/linux/latest/source/drivers/thunderbolt/icm.c#L2548 returning null to abort the initialization.

wse --
I am running 6.1.13 on my Clevo.
The 6.1.13 source is identical to the code you referenced above.
I would be more than happy to apply any Patches you come up with and boot the patched Kernel :)
I will be watching this BUG.
Thanks for resurrecting this thread !
-- kjh

If a quirk is needed then it should be using DMI strings to identify the exact system as in general Maple Ridge works in Linux. Simplest way is to just disable CONFIG_USB4 in kernel .config. If that does not work for you, can you attach the output of the dmidecode command so we can look into adding that quirk.

Changing the kernel .config would not suffice as I want to fix the device across different distros and their prebuilt kernels and iso installs.
I thought the subproduct id is also per device. But ofc it is also possible using dmi: the board_name is "X170KM-G", the system_vendor and board_vendor are reseller specific, e.g. Tuxedo devices have system vendor "TUXEDO" and board_vendor "NB01"; a lot of other resellers, but not all, leave both at the default "Notebook" value.

Okay but before adding any quirks, I wonder if we can perhaps try a few additional things? I checked the lspci output again and it seems the BIOS has enabled ASPM L1 substates so can we try to disable them? If I read the code right, passing "pcie_aspm.policy=performance" in the kernel command line should disable all ASPM states.
Can you try that out (and check lspci output under the 00:1c.0 root port that L1SubCtl1 fields are -, and also LnkCtl fields should say that ASPM is disabled).

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.1.0-1009-tuxedo root=UUID=14ff4402-0e6b-4601-9522-f81175e2548f ro pcie_aspm.policy=performance
L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1- T_CommonMode=40us LTR1.2_Threshold=90112ns
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
Boot delay sadly still present.

Created attachment 303906 [details]
Add debugging to ICM startup
Okay, let's try to see if there is something unexpected in the startup case then. Can you apply the attached patch, enable CONFIG_DYNAMIC_DEBUG, and add thunderbolt.dyndbg=+p in the command line. Then boot with no device connected, and then with device connected and see if there is difference.
Created attachment 303909 [details]
dmesg with debug output patch and no dock connected
Created attachment 303910 [details]
dmesg with debug output patch and dock connected during boot
Thanks! Sorry I forgot to ask you to disable iommu (intel_iommu=off). If the output is still the same then no need to attach new dmesg but can you check that? We are looking at the REG_FW_STS log.

Created attachment 303912 [details]
no dock no iommu
Created attachment 303913 [details]
dock during boot no iommu
REG_FW_STS is the same in all 4 cases

Created attachment 303919 [details]
More retries with 50ms timeout for ICM driver ready
Thanks for the logs! I went through my old emails and noticed that there was a similar issue on certain systems but it was never root caused properly because a BIOS fix made it go away (so it was assumed to be a BIOS issue). I raised it again on our side but at the time I also did a hack patch that made the thing "work" by sending the driver ready message more times (instead of just 3) with a smaller timeout. I wonder if you can try this patch and see if it makes any difference? It should not matter whether IOMMU is enabled or not.
Created attachment 303923 [details]
6.1.16 patched with attachment 303919 [details]
Mika --
I think you're my HERO :)
I applied your Patch to 6.1.16 and made a few dmesg files.
There is no delay at boot !
Attached is boot with TBolt 4 Unplugged, then plugged.
Thank you for all your efforts !
-- kjh

Created attachment 303924 [details]
TBolt 4 Plugged at boot then Unplugged then Replugged
This is the same patched kernel.
These are hdparm tests of an NVMe Drive in the replugged TBolt 4 Dock
```
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda
/dev/sda:
Timing O_DIRECT disk reads: 2622 MB in 3.00 seconds = 873.65 MB/sec
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda
/dev/sda:
Timing O_DIRECT disk reads: 2664 MB in 3.00 seconds = 887.94 MB/sec
[root@kjhlt7 ~]# hdparm -t --direct /dev/sda
/dev/sda:
Timing O_DIRECT disk reads: 2668 MB in 3.00 seconds = 889.21 MB/sec
```
Reasonable results.
Thanks again !
-- kjh
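For comparing such runs across kernels, the three hdparm results above can be summarised with a few lines of Python. This is only a convenience sketch: the regular expression assumes the "= <number> MB/sec" form shown in the output above, and the function name throughputs is made up.

```python
import re

# Matches the "= 873.65 MB/sec" tail of an hdparm timing line.
RATE_RE = re.compile(r"=\s*([0-9.]+)\s*MB/sec")


def throughputs(text):
    """Extract every MB/sec figure from hdparm -t output."""
    return [float(value) for value in RATE_RE.findall(text)]


# The three result lines quoted above.
SAMPLE = """\
Timing O_DIRECT disk reads: 2622 MB in 3.00 seconds = 873.65 MB/sec
Timing O_DIRECT disk reads: 2664 MB in 3.00 seconds = 887.94 MB/sec
Timing O_DIRECT disk reads: 2668 MB in 3.00 seconds = 889.21 MB/sec
"""

rates = throughputs(SAMPLE)
mean = sum(rates) / len(rates)
print(f"{len(rates)} runs: min {min(rates):.2f}, max {max(rates):.2f}, mean {mean:.2f} MB/sec")
```

For the three runs above the mean comes out at about 883.6 MB/sec.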
Hmmm ...
Looking thru the logs the thunderbolt ICM response is consistently -110
Does not matter if plugged or unplugged on boot.
Not sure why I can access the NVMe drive in the ANKER Thunderbolt 4 Dock ?
-- kjh

Hi, yes the timeouts are expected but eventually they go away and the "driver ready" message passes through and the TBT domain becomes accessible. This is what I was after. So this is the same issue we saw some time ago. We will look at this internally.

Created attachment 303948 [details]
new patch, no dock during boot
Created attachment 303949 [details]
new patch, dock during boot
was away for the weekend
attached is the dmesg output with the patch -> more retries also fixes the problem for me

Created attachment 304195 [details]
ASUS TUF GAMING B550-PLUS discrete Maple Ridge dmesg
I've got an AMD system on a TUF GAMING B550-PLUS (WI-FI) motherboard, using ASUS's "THUNDERBOLTEX 4" discrete Maple Ridge add-on card. Kernel 6.2.13, Motherboard BIOS 3002, Thunderbolt NVM 36.0.
I was also getting long delays during boot with the card installed. Applying the patch from attachment 303919 [details] has fixed the boot delay, but I haven't had a chance to test anything more than basic USB functionality on the card yet.
I've attached dmesg with debug output from attachment 303906 [details] also enabled. I'm getting some errors that there is insufficient space for some PCI BARs, and it is also reporting some IOMMU faults during initialization (I haven't tried with IOMMU disabled yet).

Thanks for the logs! I'm still working on this but it has been quite hard to get the exact setup on our internal reference systems. I have one now but I need to flash firmwares and everything and I am currently busy with other things. I hope to get it up and running (and the bug reproduced) in a couple of weeks. If we manage to reproduce it we can hopefully root cause it locally. Apologies for the long delay with this one.

Thanks Mika
Just in case this is useful info ... I've been applying your Patch from Comment #126 to the 6.1.y Kernels as they're released and the patches have applied cleanly so far thru 6.1.30 ( installed on May 25 ).
Like Calvin, all I've tested in a long time is USB which works OK.
I do have HDMI and Ethernet in my Anker Hub if testing would be useful.
Thanks for looking at this !
-- kjh

Created attachment 304888 [details]
Workaround for IOMMU faults on certain systems (updated)
Hi all,
I'm sorry for the long delay. I tried to reproduce the issue on the Intel reference board that was used as the basis for these systems, with the BIOS that should have the issue, but the issue still did not reproduce :( This makes it hard to debug. So instead let's try to work around it by increasing the number of retries. I've updated the patch to do this only when it sends driver ready to the firmware. I wonder if you could try it out and let me know if it still helps. If it does, I will send it upstream.
Mika --
I'll build 6.1.46 or better tomorrow and boot and let you know.
Do you want dmesg files for any specific set of boot args ?
Thank you !
-- kjh

If you can pass "thunderbolt.dyndbg=+p" in the command line and attach full dmesg that would be perfect, thanks!

Created attachment 304908 [details]
Thunderbolt unplugged at boot - 6.1.46 with 0001-thunderbolt-Workaround-an-IOMMU-fault-on-certain-sys.patch
Mika --
I applied your new patches to 6.1.46 and it applied without errors and built a 6.1.46.kjh_p kernel ( .kjh means it is not a Slackware Kernel and _p is for patched ).
I have been running your older patches up to now because they eliminated the 90-sec hang during boot.
Your latest patches also eliminated the boot-time hang ( ! woo hoo ! )
I am running 6.1.46.kjh_p and I'll keep running it until 6.1.47 is released when I will continue to apply your latest patches unless they're merged into upstream.
This is with the thunderbolt unplugged at boot.
-- kjh
Created attachment 304909 [details]
6.1.46 with your latest patches and thunderbolt plugged at boot
this is 6.1.46 with your latest patches and thunderbolt plugged at boot
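To compare how long the thunderbolt probe stalls in dmesg files like the two attached here, the timestamps can be pulled out mechanically. The Python sketch below is one way to do that under simple assumptions: lines carry the usual "[ seconds ] message" prefix, and the helper names (timestamps, gaps) are made up. The sample lines are copied from a dmesg excerpt quoted elsewhere in this report (the AMD system), not from these attachments.

```python
import re

# dmesg lines look like "[    3.926875] message"; capture seconds and message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def timestamps(lines, needle):
    """Return the timestamps (seconds since boot) of messages containing needle."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and needle in match.group(2):
            found.append(float(match.group(1)))
    return found


def gaps(stamps):
    """Return the differences between consecutive timestamps."""
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]


SAMPLE = """\
[ 3.926875] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030]
[ 6.081338] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030]
[ 8.214284] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030]
[ 10.347715] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030]
[ 12.481318] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030]
[ 14.621068] thunderbolt 0000:06:00.0: USB4 proxy operations supported
""".splitlines()

faults = timestamps(SAMPLE, "IO_PAGE_FAULT")
ready = timestamps(SAMPLE, "USB4 proxy operations supported")
print("gaps between faults (s):", gaps(faults))
print("first fault to ready (s): %.3f" % (ready[0] - faults[0]))
```

Running the same two calls on a full dmesg file (read with open(...).read().splitlines()) gives a quick number to compare between patch versions.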
Created attachment 304911 [details]
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-g8ff95b7fc9f5 (304888)
I've tried the updated patch from Attachment 304888 [details] on my affected AMD system with no thunderbolt devices connected, and the new patch does not completely solve the startup delay problem. The delay is reduced compared to no patch, but I'm still seeing a total of 10 seconds of delay during the thunderbolt device probe as it appears to perform 5 retries each 2 seconds apart. The patch from attachment 303919 [details] reduced the boot delay to around 250ms.

Created attachment 304912 [details]
dmesg ASUS TUF GAMING B550-PLUS 6.4.11_kepstin-00001-gb09f7209b080 (303919)
For comparison, here is the same kernel version with the patch from attachment 303919 [details] applied instead.

Looking at kjh's logs, I see the same thing - when probing the thunderbolt controller with no devices connected, the new version of the patch has a 10 second delay while the old patch was able to complete in around 200ms.

Thanks for testing! Yes, the patch tries to get the system functional, not speed up the boot time (although it does that too). I suppose we could change the timeout to be shorter still like this:
icm_tr_driver_ready():
...
ret = icm_request(tb, &request, sizeof(request), &reply, sizeof(reply), 1, 40, 500);
Would this work for you?

Mika --
I didn't notice the 10-sec delay while watching my laptop boot because there was no obvious hang like I saw with no patches when the Discrete Thunderbolt is enabled in the Insyde BIOS.
But after reading Calvin's messages and looking closely at the logs I see that it does take an additional 8 -to- 10 sec to boot with the latest patch compared to the original patch that eliminated the hang.
I would be happy to test any patches you come up with.
Thanks again for working on this !
-- kjh Hi, yes there is the 5 * 2s = 10s delay now in the log: [ 3.926875] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030] ... [ 6.081338] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030] [ 8.214284] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030] [ 10.347715] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030] [ 12.481318] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030] [ 14.621068] thunderbolt 0000:06:00.0: USB4 proxy operations supported this means the IOMMU blocks the DMA 5 times before it magically starts working. If we cut that 2s to 500ms then it would take 2.5s in theory to get the DMA working (if it always takes 5 times but sometimes it might take more and sometimes less). Even with a 500ms retry, that's still a 2.5s delay at boot, a rather long time compared to the ~200-300ms total that it takes to initialize with the older patch. Looking at the timing, it appears that the older patch retries immediately - i.e. 
there is no noticeable delay or sleep (<1ms) before it tries again: [ 3.403260] thunderbolt 0000:06:00.0: enabling interrupt at register 0x38200 bit 12 (0x1 -> 0x1001) [ 3.404331] thunderbolt 0000:06:00.0: sending ICM request [ 3.409092] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030] [ 3.456684] thunderbolt 0000:06:00.0: ICM request finished -110 [ 3.456741] thunderbolt 0000:06:00.0: sending ICM request [ 3.460190] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030] [ 3.510018] thunderbolt 0000:06:00.0: ICM request finished -110 [ 3.510075] thunderbolt 0000:06:00.0: sending ICM request [ 3.513519] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030] [ 3.563351] thunderbolt 0000:06:00.0: ICM request finished -110 [ 3.563407] thunderbolt 0000:06:00.0: sending ICM request [ 3.566853] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030] [ 3.616683] thunderbolt 0000:06:00.0: ICM request finished -110 [ 3.616740] thunderbolt 0000:06:00.0: sending ICM request [ 3.620247] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030] [ 3.673353] thunderbolt 0000:06:00.0: ICM request finished -110 [ 3.673412] thunderbolt 0000:06:00.0: sending ICM request [ 3.677120] thunderbolt 0000:06:00.0: ICM request finished 0 [ 3.683640] thunderbolt 0000:06:00.0: USB4 proxy operations supported It's really strange that it takes exactly 5 retries on both my system and kjh's system - *regardless of the amount of time between retries*. I wonder what the cause of that is… Also, kind of curious that it is giving IOMMU errors on the first 5 failed requests on both of our systems (both with the Intel and AMD IOMMU). As far as a workaround, is there any downside to doing an immediate retry? 
Would it be ok to do a limited number of immediate retries - might as well do 5, since that's the number that's needed on both my system and kjh's system - before enabling a delay for the remaining retries?
The IOMMU is sitting between the DMA (the one used by the Thunderbolt driver) and the memory and that is blocking all the requests (the first ~5). This is hard to debug without means to reproduce this locally and I tried all possible ways I could think of. Ideally once reproduced we would plug in a PCIe analyzer and similar devices to figure out what exactly is happening. The Intel based systems at least got fixed with a BIOS upgrade (except the ones here, of course, where there does not seem to be an upgrade available. For the AMD one, I'm not sure).
Anyways boot time typically does not matter too much as that is not done too often, but if you really think it matters here, we can make the timeout 50ms and retry max say, 10 times (to be on the safe side), but then I think we need to separate the Maple Ridge path from Titan Ridge to avoid any possible issues (need to check that still).
Created attachment 304924 [details]
Workaround for IOMMU faults on certain systems (v2)
Can you try the attached patch? This one tries 10 times with 50ms timeout and then reverts to the longer timeout.
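To make the retry schedule described above concrete, here is a minimal Python sketch (not the driver code) of how long the driver would wait before the first successful request. It assumes that every failed attempt costs one full timeout and that the "longer timeout" is the 2s value mentioned earlier in this thread; the function name and its parameters are made up for illustration.

```python
def wait_before_success_ms(failures, fast_tries=10, fast_timeout_ms=50,
                           slow_timeout_ms=2000):
    """Estimate the time spent on timed-out requests before the first success.

    failures        -- number of requests that time out before one succeeds
    fast_tries      -- how many attempts use the short timeout (assumed 10)
    fast_timeout_ms -- short timeout per attempt (assumed 50 ms)
    slow_timeout_ms -- timeout used after the fast attempts are exhausted
                       (assumed 2000 ms, the value discussed in this thread)
    """
    fast = min(failures, fast_tries)          # failures covered by short timeouts
    slow = max(failures - fast_tries, 0)      # remaining failures use the long one
    return fast * fast_timeout_ms + slow * slow_timeout_ms


if __name__ == "__main__":
    # 5 failures, as seen on both systems in this thread
    print(wait_before_success_ms(5))
    # same 5 failures with only the long timeout (no fast attempts)
    print(wait_before_success_ms(5, fast_tries=0))
```

With the 5 failures seen on both systems, the short timeout costs 5 * 50ms = 250ms, compared to 5 * 2s = 10s when only the long timeout is used.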
Mika --
I've got a deadline at work and I can't be without my Laptop for now. I should be done tomorrow on Thursday at the latest and I'll build a patched Kernel and post the two dmesg files ASAP.
Sorry about the delay !
-- kjh
sorry for the slow reply, was on holiday ^^
here is my output
on the fail case it gives out
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Write NO_PASID] Request device [05:00.0] fault addr 0x69974000 [fault reason 0x05] PTE Write access is not set
3 times and fails 20 secs later
in the successful case with the workaround it gives out
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Write NO_PASID] Request device [05:00.0] fault addr 0x69974000 [fault reason 0x05] PTE Write access is not set
3 times and an additional
DMAR: DRHD: handling fault status reg 2
without the other line, but then successfully initializes the thunderbolt things
Created attachment 304926 [details]
non working state (kernel 6.2)
Created attachment 304927 [details]
working state (kernel 6.5 with latest workaround applied)
Do I understand correctly that the last patch does not work for you? Or did I miss something? In the second log you attached things seem to work fine, but in the v6.2 kernel log there is still the timeout. Did you have the patch applied in both cases?
the patch does work
the 6.2 log was without the patch applied and is just here as a reference regarding the error messages that happen in both cases
sorry for the confusion
with the patch applied it works as intended (some fails and then a successful initialization)
Got it, thanks for the clarification. Okay, let's wait for others to try it out and then I will send it upstream after the merge window closes.
Mika --
I was able to apply your latest patch to 6.1.47 without any issues.
Everything was fine when I booted with the Thunderbolt Hub UNPLUGGED.
However there was a long delay from 11.5 to 28.9 secs followed by a pair of Kernel OOPS starting at 49.3 sec when I booted with TB PLUGGED.
Rebooted 6.1.46.kjh_p with the Aug 18 Patches ... no issues there.
The two dmesg files ( unplugged and plugged ) for 6.1.47 with the Aug 22 patches are attached.
Thanks again, Mika !
-- kjh
Created attachment 304931 [details]
6.1.47 + Aug 22 Patch - TB unplugged at boot
6.1.47 + Mika Aug 22 Patch ; TB Unplugged at boot
Created attachment 304932 [details]
6.1.47 + Mika Aug 22 Patch ; TB Plugged at boot -- OOPS
6.1.47 + Mika Aug 22 Patch ; TB Plugged at boot resulted in a kernel OOPS
Hmm, it is pretty unlikely that the patch affects that part. Your log pretty much shows that the PCIe link from the root port to the Thunderbolt host controller went down. Does that happen 100% of the time if you boot with the patch applied and the dock connected?
Mika --
I'll have to try again later today.
I am running 6.1.46 with your Aug 18 patch right now but I can reboot 6.1.47 with your Aug 23 Patch later today when I finish with a remote server setup.
I'll let you know when I can run some more tests.
-- kjh
Mika --
I rebooted ( cold boot ) three times and each time resulted in more or less the same OOPS.
NOTE: there is a similar OOPS on shutdown
I saved dmesg files for each if you want them but I don't know where to insert a dmesg command in the shutdown scripts to capture the shutdown OOPS.
I will remove 6.1.47.kjh_p and rebuild it with the Aug 18 patch to determine if the OOPS is due to an unrelated kernel issue and report back later ...
-- kjh
Created attachment 304942 [details]
6.1.47 with the Aug 18 Patch
Mika --
This is 6.1.47 with the Aug 18 Patch.
There is no OOPS on boot or shutdown.
Maybe I messed up when I applied the Aug 23 patch ?
I don't think so but I'll build 6.1.47 again with the Aug 23 patch this weekend and report my results.
-- kjh
p.s. Thunderbolt was plugged when I booted. sorry about the noise ...
All of the last three patches are pretty much the same: the timeout and the number of retries vary but the end result should be the same, and I see on your system, based on the last two logs you attached, that after 4 DMAR faults the driver ready messages start passing through and things work as expected. The "OOPS" you mention is not actually an OOPS; it is just the driver complaining because the underlying hardware went away, e.g. the PCIe link from the root port to the Maple Ridge is down. I agree that trying several times on a clean build with the last patch applied would tell us if this is indeed somehow related to the patch or not.
I realized that the patch might actually affect this. The new patch makes the boot faster but at the same time it allows the driver to runtime suspend earlier and may cause the whole thing to enter D3cold. Now, this should be fine in normal cases but these systems are "special" anyway. If we go back to the Aug 18 patch with the longer timeout then this should not happen, but I wonder if it works well over system sleep etc? Once you have checked the Aug 23 patch again with a clean build, and if you see the same behavior that the link goes down, can you try with 18 Aug again too and also run some suspend/resume cycles just in case?
Mika --
My wife reminded me that we're going out of town this weekend and we won't return until Sun nite.
I'll be able to get back on the testing early Monday morning America/Chicago time.
I'll do the sleep tests first with the existing Aug 18 patches.
Then I'll rebuild 6.1.48 with Aug 23 to see if I messed up the patches, including the sleep testing.
Thank you for all your work on this !
-- kjh
I've been using the patch from attachment 304924 [details] for a while with good results. The initialization speed is pretty much indistinguishable from the initial patch that retried without delay.
I'm using a desktop system and don't use suspend to ram, so I can't comment on that aspect of the behaviour. Feel free to add a Tested-by: Calvin Walton <calvin.walton@kepstin.ca> to the patch.
Hi, I went for the more conservative version to avoid the issue Konrad saw. It slows down the boot but in the end both systems should be working. I Cc'd you all with the patches (sent them out today morning).
(In reply to Mika Westerberg from comment #170)
> Hi, I went for the more conservative version to avoid the issue Konrad saw.
> It slows down the boot but in the end both systems should be working. I Cc'd
> you all with the patches (sent them out today morning).
Here's the patch link: https://lore.kernel.org/linux-usb/20230911100445.3612655-2-mika.westerberg@linux.intel.com/
Mika --
Thank you for the patches.
Do I need to apply the linked patch IN ADDITION TO the four patches I received via email ?
Or is the linked patch all I need ?
Thanks again !
-- kjh
The linked one is all you need. I CC'd you as well with the patch so you should have it in your inbox too.
Thanks Mika.
Yes, it is attached to an email in my inbox.
Building 6.1.52 now with only the linked patch for `diff --git a/drivers/thunderbolt/icm.c b/drivers/thunderbolt/icm.c`
Skipping these:
diff --git a/drivers/thunderbolt/switch.c b/drivers/thunderbolt/switch.c
diff --git a/drivers/thunderbolt/tmu.c b/drivers/thunderbolt/tmu.c
diff --git a/drivers/thunderbolt/quirks.c b/drivers/thunderbolt/quirks.c
diff --git a/drivers/thunderbolt/xdomain.c b/drivers/thunderbolt/xdomain.c
Will build today, install tonite and follow up tomorrow morning.
Thanks again !
-- kjh
Created attachment 305098 [details]
dmesg for Linux 6.1.52 with Mika's Sep 12 Patch cold boot ; plugged
Mika --
Attached is the dmesg output for 6.1.52 where I applied the single Sep 12 patch. I cold-booted with the Thunderbolt Hub turned on and plugged.
Thank you for all your work on the Discrete Thunderbolt Issue !
-- kjh
Thanks a lot for testing! Does it also work if you boot with no device connected? (no need to attach dmesg if it does).
Mika --
Yes, linux 6.1.52 boots fine when I cold boot without the Thunderbolt Hub Plugged in.
Everything also works fine when I subsequently turn on the Hub after booting unplugged.
Linux 6.1.53 was released this morning so I'll be building that later today or tomorrow morning.
I'll apply the same Sep 12 patches and let you know what I see with 6.1.53
Thanks again Mika !
-- kjh
Created attachment 305108 [details]
Linux 6.1.53 + Mika Sep 12 Patch cold boot plugged
Mika --
Attached is 6.1.53 + your Sep 12 Patch after a cold boot.
The old Linux 6.1.52 shut down fine without any delay.
Linux 6.1.53 booted just fine without a noticeable delay.
I forgot to grab a dmesg file until quite some time after booting and entering my KDE Desktop.
When I did, I noted the verbose periodic Kernel comms with Thunderbolt toward EoF in the dmesg file.
Thanks Mika !
-- kjh
Thanks for testing! It seems that the TBT wakes up periodically, but that is probably related to the same issue that causes the DMAR faults, which we cannot fix properly on the OS side without understanding the root cause (and this requires that we would be able to reproduce this locally). I suspect that simply a firmware update (if it were available) would fix this. Anyways, the workaround seems to work so we are good. I will try to get this patch to v6.6-rc and stable trees.
There are no obvious effects on the performance of my Slackware Desktop.
And it makes sense that the TBT would wake up periodically.
I'll watch for Firmware updates but I am not hopeful about something useful trickling down from Clevo to Sager for this particular set of hardware ( Clevo X170M / Sager NP9672m-g1 ) but one never knows ...
I'll be watching the kernel logs for your patches and maybe Greg KH will eventually pull the patches back into the 6.1.y ( or the next ) longterm tree !
Thanks again Mika, you're my hero !
-- kjh
So, I've tested the patch posted to the ML, and it's performing as intended on my system. With 5 attempts and 2 seconds between attempts, my boot is slowed down by nearly 8s (it's unclear why waiting for thunderbolt to initialize is in the critical boot path), but the hang is significantly reduced compared to no patch, and the thunderbolt controller is functional. I'll probably keep using the reduced timeout patch for faster boots, since I'm not hitting/triggering the power management related problem that Konrad reported.
Assuming this is a firmware problem - which firmware is affected? I've got the latest available TB controller firmware update from ASUS applied (NVM 36).
As far as I know there's unlikely to be much other firmware in common between my system (an AMD desktop board from ASUS) and Konrad's system (an Intel laptop from Clevo), so it seems odd that a firmware bug would cause us to see the same symptoms, right down to the same number of timing-independent retries needed before the adapter works.
For what it's worth, I dual-boot Windows on my system, and the Thunderbolt controller is working there - but when I looked into it more closely, it turns out that the Windows "Kernel DMA Protection" feature is off (despite having the IOMMU setting enabled in the BIOS). From what I've read, this indicates that the BIOS is not setting a flag to tell Windows that DMA protection is supported. So the combination of Thunderbolt controller installed and DMA protection enabled on this board is probably untested or unsupported by ASUS :/
I have a second ASUS board (Pro B550M-C/CSM) which is also compatible with this ASUS Thunderbolt card - same AMD chipset generation, but from a business oriented model line (instead of gaming) and with a different type of CPU installed (Ryzen 5 PRO 4650G APU). At some point I'll test that system to see if it also has the same problem.
Thanks for testing! Yeah, I had to choose a version that works on both, although on your system it slows down the boot (but ends up with functional TBT). The firmware in question is the BIOS. There was a Dell system with similar symptoms and that went away with a BIOS upgrade. I contacted Clevo regarding this but they do not support the older models anymore, so if I understood correctly it is unlikely (unfortunately) that there is any BIOS upgrade for that system. For ASUS of course it might be a different story.
Sorry took me some time to look at the final patch. With 2000ms wait and 4-5 retries that's still 8-10 seconds longer boot time which is quite noticeable.
I'm currently in home office, but end of next week I can look at the test device again and see if the original less delay patch does cause the oops for me too. If not, maybe the delay time can be quirked?
Okay let me know how it goes. Please check connected and not connected cases and over suspend/resume.
Sorry took me some time to come back to this device: I could not reproduce the issue Konrad described even when going as low as 50ms for the delay.
Looking into the dmesg timestamps: For the TUXEDO device the thunderbolt initialization starts at around 1.7s and other stuff that is seemingly initialized in parallel is finished at around 2.8-3s. So as long as the 4 known-to-fail retries take less than 1s there should be no noticeable delay, at least for this device.
@Konrad Did you try with a timeout of 250ms?
A gentle bump to ask if this reduced timeout would also be acceptable?
Sure, it is acceptable if it does not break the systems that started working with the initial fix.
@wse --
Sorry ... we're getting busy at work and I've been distracted.
No I never did try the 250 ms timeout because the small lag has not been bothering me.
All the 6.1.y kernels have run good enough since Mika's patches were merged.
Is the 250 ms timeout something I can try with a kernel commandline arg or would it require changing the source and recompiling ?
Thanks !
-- kjh
The value is hardcoded so a patch and recompile is required. I can prep a patch for you later or tomorrow.
Created attachment 305567 [details]
reduce timeout to 250ms
There's a patch based on 6.7-rc4 but it also applies cleanly to v6.1.65
(In reply to Werner Sembach [TUXEDO] from comment #191)
> There's a patch based on 6.7-rc4 but it also applies cleanly to v6.1.65
Werner --
I'll apply your patch to 6.1.65 or later this weekend and boot it.
If it runs well, I'll keep it going until something newer comes along :)
I'll let you know.
Are there any debug configs you would like me to set ?
Are there any logs that you would like to see ?
Thanks for this !
-- kjh
Let's see if it runs without issues
Created attachment 305571 [details]
6.1.65 - no patch
Werner --
This was 6.1.65 before I applied your 250 ms patch.
This is for reference ...
It had been running several days so I trimmed out lines after boot up to where the Thunderbolt code threw a kernel trace into the log.
This happens occasionally ever since I started using Mika's patches ( reported here in this thread ).
I also occasionally see the kernel hang for a few seconds at shutdown but it eventually lets go and shuts down normally.
Thanks for looking at this.
The next attachment will be 6.1.66 with your 250 ms patch.
-- kjh
Created attachment 305572 [details]
6.1.66 with Werner's 250 ms patch
Werner --
This is 6.1.66 with your 250 ms patch.
It runs fine.
-- kjh
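For anyone comparing dmesg logs like the ones attached here, the following is a rough Python sketch that counts the IOMMU faults logged for the thunderbolt device and measures the time from the first fault until "USB4 proxy operations supported" shows up. The sample lines are the AMD-Vi ones quoted earlier in this thread; the function name and the matched marker strings are my own assumptions and would need adjusting for the Intel DMAR messages.

```python
import re

# dmesg lines look like "[    3.926875] message"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Lines quoted earlier in this thread (2s timeout case)
SAMPLE = """\
[ 3.926875] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b300 flags=0x0030]
[ 6.081338] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b400 flags=0x0030]
[ 8.214284] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b500 flags=0x0030]
[ 10.347715] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b600 flags=0x0030]
[ 12.481318] thunderbolt 0000:06:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x800004b700 flags=0x0030]
[ 14.621068] thunderbolt 0000:06:00.0: USB4 proxy operations supported
"""


def fault_summary(log, fault_marker="IO_PAGE_FAULT",
                  done_marker="USB4 proxy operations supported"):
    """Return (number of faults, seconds from first fault to done_marker).

    The delay is None if either the faults or the done_marker are missing.
    Both marker strings are assumptions taken from the logs in this thread.
    """
    fault_times = []
    done_time = None
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines without a dmesg timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if "thunderbolt" not in msg:
            continue  # only look at messages from the thunderbolt device
        if fault_marker in msg:
            fault_times.append(ts)
        elif done_marker in msg and done_time is None:
            done_time = ts
    delay = None
    if fault_times and done_time is not None:
        delay = done_time - fault_times[0]
    return len(fault_times), delay


if __name__ == "__main__":
    count, delay = fault_summary(SAMPLE)
    print(f"{count} faults, {delay:.3f}s from first fault to init done")
```

On the sample above this reports 5 faults and roughly 10.7s, which matches the 5 * 2s delay discussed earlier. Running it on a log taken with the 250 ms patch should show whether the gap shrinks accordingly.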
Thx, on its way to upstream: https://lore.kernel.org/all/20231220150956.230227-1-wse@tuxedocomputers.com/