Created attachment 261219 [details]
dmesg output

Hey,

I have a CalDigit TS3 Lite Thunderbolt 3 dock that I use with a Dell Alienware 13. When plugged in, the dock works fine, except that no USB devices are recognized. After some googling I found that I need to authorize the dock via sysfs:

https://lkml.org/lkml/2017/5/18/848

After I run

echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized

I can make full use of the dock. However, if I unplug it, I hit a kernel BUG at drivers/pci/msi.c:352 (dmesg attached). This is in the free_msi_irqs function, but I don't understand the code.

Any explanation or suggestions?

Cheers,
Nick
The igb driver tries to disable MSI(X) after the device has been unplugged:

[ 97.290181] pci_disable_msix+0xf6/0x120
[ 97.290186] igb_reset_interrupt_capability+0x5d/0x60 [igb]
[ 97.290189] igb_remove+0xb3/0x160 [igb]

In general, drivers ought to check whether their device is still present and, if it is not, avoid touching the hardware. This is especially important for hotpluggable devices, which all Thunderbolt devices typically are.

Here, I think the correct fix is to change the igb driver to handle hot-unplug properly (avoid touching the hardware if the device is already gone).

You may prevent the crash by unloading the igb driver before you unplug the dock:

$ sudo rmmod igb

Or, if you don't use Ethernet, you can just blacklist the module for the time being.
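To illustrate the general rule stated above (check for presence before touching the hardware), here is a minimal user-space sketch in C. It is not igb code and not the kernel API: `struct fake_pci_dev`, `fake_device_is_present()`, `fake_disable_interrupts()` and `fake_driver_remove()` are hypothetical names invented for this example. In a real driver the presence check would be something like the kernel's `pci_device_is_present()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct fake_pci_dev {
    bool present;      /* false once the dock has been unplugged */
    int  hw_accesses;  /* counts simulated register/config space accesses */
};

/* Models a presence check such as the kernel's pci_device_is_present(). */
static bool fake_device_is_present(const struct fake_pci_dev *dev)
{
    return dev->present;
}

/* Simulated hardware access, e.g. disabling interrupts in the device. */
static void fake_disable_interrupts(struct fake_pci_dev *dev)
{
    dev->hw_accesses++;
}

/*
 * Remove callback of the hypothetical driver. Software state is always
 * torn down, but the hardware is only touched if the device is still there.
 * Returns the number of hardware accesses performed during removal.
 */
static int fake_driver_remove(struct fake_pci_dev *dev)
{
    int before;

    assert(dev != NULL);
    before = dev->hw_accesses;

    if (fake_device_is_present(dev))
        fake_disable_interrupts(dev);

    /* Freeing memory, unregistering the net device, etc. would go here. */
    return dev->hw_accesses - before;
}
```

Calling `fake_driver_remove()` on a device with `present == false` performs no hardware accesses, which is the behavior one would want from a driver after a surprise removal.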
Hey Mika,

Removing the igb module with rmmod indeed fixes the issue. Should I submit a new bug to the igb maintainers or CC someone on this one?

Cheers,
Nick
Good to know that it helped. I've contacted the Intel igb people and pointed them to the issue. Let's see if they'll have a fix for it soon.
Created attachment 273411 [details] Mark stale PCI devices disconnected
Nikolay, Could you try the attached patch and see if the crash still happens?
Hey Mika, Thank you for the patch! I will be away from my dock until next Saturday. I will post an update after that. Cheers, Nick
Hi Mika and Nikolay,

I will try this patch as soon as possible (this weekend or Monday) with my CalDigit TS2 (and Lenovo T430s), which is affected too.

Thanks for the patch!

Ferenc
Created attachment 273457 [details]
Same error after 0001 patch

Linux: 4.14.12
Distro: Arch Linux
Hardware: Lenovo T430s with Thunderbolt port, CalDigit TS2 (Ethernet, USB keyboard and mouse, DisplayPort connected)
Created attachment 273511 [details] Updated patch to mark stale PCI devices disconnected Can you try the updated patch? It should mark the igb device disconnected and prevent the crash.
Unfortunately I got the same error. This is the log before the crash occurs when disconnecting the Thunderbolt dock:

[ 57.473929] igb 0000:11:00.0 enp17s0: PCIe link lost, device now detached
[ 57.741216] ata7: failed to stop engine (-5)
[ 58.241158] ata8: failed to stop engine (-5)
[ 58.241893] xhci_hcd 0000:12:00.0: remove, state 1
[ 58.241909] usb usb6: USB disconnect, device number 1
[ 58.242291] xhci_hcd 0000:12:00.0: USB bus 6 deregistered
[ 58.242303] xhci_hcd 0000:12:00.0: xHCI host controller not responding, assume dead
[ 58.242321] xhci_hcd 0000:12:00.0: remove, state 1
[ 58.242328] usb usb5: USB disconnect, device number 1
[ 58.242330] usb 5-1: USB disconnect, device number 2
[ 58.378296] usb 5-2: USB disconnect, device number 3
[ 58.548397] usb 5-4: USB disconnect, device number 4
[ 58.549254] xhci_hcd 0000:12:00.0: Host halt failed, -19
[ 58.549258] xhci_hcd 0000:12:00.0: Host not accessible, reset failed.
[ 58.549416] xhci_hcd 0000:12:00.0: USB bus 5 deregistered
[ 58.549751] igb 0000:11:00.0: removed PHC on enp17s0
[ 58.727965] ------------[ cut here ]------------
[ 58.727970] kernel BUG at drivers/pci/msi.c:352!
OK, thanks for testing. I was hoping that it is enough to mark the device disconnected during hot-unplug but apparently it does not help. We need to fix it in the igb driver then. I'll see if I can make a patch doing so.
Ok, thanks!
What happens with your second patch is that when I try to rmmod igb I get the same kernel bug as when I unplug the dock itself... Weird.
Created attachment 273631 [details] igb: don't call netif_device_detach() when PCIe link is gone Can you try the attached patch in addition to the previous patch and see if the crash is gone now? The previous patch is still needed to make sure PCI config space access is disabled when the device is disconnected.
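The attachment itself is not reproduced in this thread, so the following is only a toy user-space model in C of the idea suggested by the patch title and by the "PCIe link lost, device now detached" message in the earlier log. It rests on two assumptions that are not confirmed here: that a register read returning all ones is treated as "link gone", and that detaching the net device at that point causes the later close path (which would free the interrupts) to be skipped. All names (`struct toy_nic`, `toy_read_reg()`, `toy_unregister()`, `toy_remove()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a NIC driver's state; not igb's real structures. */
struct toy_nic {
    bool link_present;     /* false after surprise removal */
    bool hw_accessible;    /* models "register mapping still usable" */
    bool netdev_attached;  /* models the net device being attached */
    bool irqs_allocated;   /* models interrupt vectors still requested */
};

/*
 * Register read. A removed PCIe device reads back as all ones, so the
 * model treats 0xFFFFFFFF as "link lost" and stops further hardware access.
 * detach_on_loss selects the old (true) or patched (false) behavior
 * ASSUMED by this sketch.
 */
static uint32_t toy_read_reg(struct toy_nic *nic, bool detach_on_loss)
{
    uint32_t value;

    assert(nic != NULL);
    value = nic->link_present ? 0x1234u : 0xFFFFFFFFu;

    if (value == 0xFFFFFFFFu) {
        nic->hw_accessible = false;
        if (detach_on_loss)
            nic->netdev_attached = false;
    }
    return value;
}

/* ASSUMPTION: the close path only releases IRQs for an attached device. */
static void toy_unregister(struct toy_nic *nic)
{
    if (nic->netdev_attached)
        nic->irqs_allocated = false;
}

/*
 * Driver removal. Returns true if removal is clean, false if interrupts
 * are still allocated (the situation in which the real kernel hits the
 * BUG in drivers/pci/msi.c).
 */
static bool toy_remove(struct toy_nic *nic)
{
    assert(nic != NULL);
    toy_unregister(nic);
    return !nic->irqs_allocated;
}
```

In this model, reading a register on an unplugged device with `detach_on_loss` set to `true` makes `toy_remove()` report leftover interrupts, while the same sequence with `false` lets removal finish cleanly.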
(In reply to Mika Westerberg from comment #14)
> Created attachment 273631 [details]
> igb: don't call netif_device_detach() when PCIe link is gone
>
> Can you try the attached patch in addition to the previous patch and see if
> the crash is gone now? The previous patch is still needed to make sure PCI
> config space access is disabled when the device is disconnected.

Yes, this works! It lets me plug and unplug the dock without removing the igb module first. rmmod no longer triggers the kernel BUG either.

Thanks,
Nick
Instant freeze for me. :( I will check one more thing and get some log somehow.
This is interesting. Everything works perfectly if I use the first version of the ACPI / hotplug patch (https://bugzilla.kernel.org/attachment.cgi?id=273411&action=diff) together with the igb patch. I think the main difference between Nick's config and mine (Arch Linux with the 4.14.13 kernel) is a Thunderbolt 3 vs. a Thunderbolt 2 device.

Mika, which log or dump would be most useful for you?

This is the dmesg output from the working setup (ACPI patch v1 and igb patch) when disconnecting:

> [ 141.118347] ata7: failed to stop engine (-5)
> [ 141.617857] ata8: failed to stop engine (-5)
> [ 141.618404] xhci_hcd 0000:12:00.0: remove, state 1
> [ 141.618416] usb usb6: USB disconnect, device number 1
> [ 141.618777] xhci_hcd 0000:12:00.0: USB bus 6 deregistered
> [ 141.618789] xhci_hcd 0000:12:00.0: xHCI host controller not responding, assume dead
> [ 141.618807] xhci_hcd 0000:12:00.0: remove, state 1
> [ 141.618816] usb usb5: USB disconnect, device number 1
> [ 141.618819] usb 5-1: USB disconnect, device number 2
> [ 141.618821] usb 5-1.1: USB disconnect, device number 4
> [ 141.771220] usb 5-1.2: USB disconnect, device number 6
> [ 141.898338] usb 5-2: USB disconnect, device number 3
> [ 141.898673] usb 5-4: USB disconnect, device number 5
> [ 141.899296] xhci_hcd 0000:12:00.0: Host halt failed, -19
> [ 141.899299] xhci_hcd 0000:12:00.0: Host not accessible, reset failed.
> [ 141.899452] xhci_hcd 0000:12:00.0: USB bus 5 deregistered
> [ 141.899736] igb 0000:11:00.0: removed PHC on enp17s0
> [ 141.899742] igb 0000:11:00.0 enp17s0: PCIe link lost
> [ 142.595875] pci_bus 0000:0e: busn_res: [bus 0e] is released
> [ 142.596327] pci_bus 0000:11: busn_res: [bus 11] is released
> [ 142.596609] pci_bus 0000:12: busn_res: [bus 12] is released
> [ 142.596820] pci_bus 0000:13: busn_res: [bus 13] is released
> [ 142.596937] pci_bus 0000:14: busn_res: [bus 14] is released
> [ 142.597025] pci_bus 0000:15: busn_res: [bus 15] is released
> [ 142.597121] pci_bus 0000:10: busn_res: [bus 10-3c] is released
> [ 142.597219] pci_bus 0000:0f: busn_res: [bus 0f-3c] is released
> [ 142.597312] pci_bus 0000:3d: busn_res: [bus 3d] is released
> [ 142.597401] pci_bus 0000:3e: busn_res: [bus 3e] is released
> [ 142.597500] pci_bus 0000:3f: busn_res: [bus 3f] is released
> [ 142.597595] pci_bus 0000:0d: busn_res: [bus 0d-3f] is released
Just to let you know I'm also running Arch on 4.14.13... But my device is indeed a TB3 one. Cheers, Nick
I don't think TBT generation matters here as the igb device is the same. Ferenc, could you try with both patches when it freezes and if the system is still somewhat functional take dmesg and attach it to the bug? It is possible that something else crashes in that case.
Yes of course, it will take a little time.
I made a fresh kernel build with the latest patches (ACPI and igb) and now everything went well. I tried many disconnect/reconnect cycles without any problem. Thanks for the fix.
Cool, thanks for testing. I'll send the patches upstream soon (I have a couple of other fixes for acpiphp that I'm currently testing, and I'm hoping to include them as well).
Thanks for fixing the driver. :)
Hey Mika, Did you get around to submitting the patches to mainline? Should the bug be marked as resolved? I no longer have the device so I can't test it...
Yes, it is fixed by 17a0b9add6e9 ("igb: Do not call netif_device_detach() when PCIe link goes missing"). I don't have rights to change state of this bug so feel free to mark the bug as resolved :)
Fixed thanks to Mika (17a0b9add6e9)!