Bug 198181 - BUG at drivers/pci/msi.c:352 when unplugging a sysfs authorized thunderbolt 3 dock
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-17 15:04 UTC by Nikolay Bogoychev
Modified: 2018-05-23 10:01 UTC

See Also:
Kernel Version: 4.14.4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output (25.58 KB, text/plain)
2017-12-17 15:04 UTC, Nikolay Bogoychev
Mark stale PCI devices disconnected (993 bytes, patch)
2018-01-05 10:53 UTC, Mika Westerberg
Same error after 0001 patch (4.01 KB, text/plain)
2018-01-08 14:39 UTC, Ferenc Boldog
Updated patch to mark stale PCI devices disconnected (1.09 KB, patch)
2018-01-10 10:12 UTC, Mika Westerberg
igb: don't call netif_device_detach() when PCIe link is gone (2.49 KB, patch)
2018-01-15 16:55 UTC, Mika Westerberg

Description Nikolay Bogoychev 2017-12-17 15:04:51 UTC
Created attachment 261219 [details]
dmesg output

Hey,

I have a CalDigit TS3 Lite Thunderbolt 3 Dock that I use with a Dell Alienware 13.

When plugged in, the dock works fine, except that no USB devices are recognized. After some googling I found that I need to authorize the dock via sysfs: https://lkml.org/lkml/2017/5/18/848

After I run echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized

I can make full use of the dock. However, if I unplug it, I hit a kernel BUG at:

drivers/pci/msi.c:352 (dmesg attached). 

This is in the free_msi_irqs function, but I don't understand the code. Any explanation or suggestions?

Cheers,

Nick
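
For anyone scripting the authorization step described above, here is a minimal Python sketch of the same sysfs write. The helper name `authorize_thunderbolt_device` and the `sysfs_root` parameter are illustrative assumptions, not part of any existing tool; only the `/sys/bus/thunderbolt/devices/<device>/authorized` path and the value `1` come from the report. Writing to the real sysfs tree requires root.

```python
from pathlib import Path

# Default location taken from the report above.
THUNDERBOLT_SYSFS = Path("/sys/bus/thunderbolt/devices")


def authorize_thunderbolt_device(device, sysfs_root=THUNDERBOLT_SYSFS):
    """Write "1" to <sysfs_root>/<device>/authorized.

    Returns True if the write was performed and False if the attribute
    already held a non-zero value. Raises FileNotFoundError if the device
    has no `authorized` attribute (for example, it is not plugged in).
    """
    attr = Path(sysfs_root) / device / "authorized"
    if attr.read_text().strip() != "0":
        return False  # already authorized, nothing to do
    attr.write_text("1")
    return True
```

Run as root, `authorize_thunderbolt_device("0-1")` should have the same effect as the `echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized` command in the report.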
Comment 1 Mika Westerberg 2017-12-18 08:06:22 UTC
The igb driver tries to disable MSI(X) after the device has been plugged out:

[   97.290181]  pci_disable_msix+0xf6/0x120
[   97.290186]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
[   97.290189]  igb_remove+0xb3/0x160 [igb]

In general, drivers ought to check whether their device is still present and, if it is not, avoid touching the hardware. This is especially important for hot-pluggable devices, which is what Thunderbolt devices typically are.

Here, I think the correct fix is to change the igb driver to handle hot-unplug properly (avoid touching the hardware if the device is already gone).

You may prevent the crash by unloading the igb driver before you unplug the dock:

  $ sudo rmmod igb

Or if you don't use ethernet, you can just blacklist the module for the time being.
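
The rule in this comment (check whether the device is still present and stay off the hardware if it is not) can be illustrated with a small userspace model. This is a Python sketch, not kernel code: `FakeNic`, `read32`, `write32`, and `remove` are hypothetical names. It relies on one property of PCIe, that reads from a device that has been removed complete with all bits set (`0xFFFFFFFF`), and it assumes the modelled status register never legitimately reads as all ones. In the kernel, a driver could use a helper such as `pci_device_is_present()` for the same purpose.

```python
ALL_ONES = 0xFFFFFFFF  # value seen when reading from a removed PCIe device


class FakeNic:
    """Toy stand-in for a hot-pluggable NIC (hypothetical, for illustration)."""

    def __init__(self):
        self.plugged = True
        self.status_reg = 0x00000001
        self.write_attempts = []  # every register write the driver attempted

    def read32(self, reg):
        # A removed device cannot answer, so the read comes back as all ones.
        return self.status_reg if self.plugged else ALL_ONES

    def write32(self, reg, value):
        # Record the attempt so callers can see what the driver touched.
        self.write_attempts.append((reg, value))

    def is_present(self):
        return self.read32("STATUS") != ALL_ONES

    def remove(self):
        """Driver remove path: only quiesce the hardware if it still exists."""
        if self.is_present():
            self.write32("IRQ_MASK", 0)  # mask interrupts on live hardware
            return "hardware quiesced"
        return "device gone, skipped hardware access"
```

The point of the design is that the presence check sits at the top of the teardown path, so a surprise removal turns every later hardware access into a no-op instead of an operation on a device that cannot respond.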
Comment 2 Nikolay Bogoychev 2017-12-18 14:53:28 UTC
Hey Mika,

Removing the igb module with rmmod indeed fixes the issue.

Should I submit a new bug to igb maintainers or CC someone on this?

Cheers,

Nick
Comment 3 Mika Westerberg 2017-12-18 20:50:49 UTC
Good to know that it helped.

I've contacted the Intel igb people and pointed them to the issue. Let's see if they'll have a fix for it soon.
Comment 4 Mika Westerberg 2018-01-05 10:53:32 UTC
Created attachment 273411 [details]
Mark stale PCI devices disconnected
Comment 5 Mika Westerberg 2018-01-05 10:54:05 UTC
Nikolay,

Could you try the attached patch and see if the crash still happens?
Comment 6 Nikolay Bogoychev 2018-01-05 15:57:24 UTC
Hey Mika,

Thank you for the patch!

I will be away from my dock until next Saturday. I will post an update after that.

Cheers,

Nick
Comment 7 Ferenc Boldog 2018-01-05 17:56:05 UTC
Hi Mika and Nikolay,

I will try this patch as soon as possible (this weekend or Monday) with my CalDigit TS2 (and Lenovo T430s), which is affected too.

Thanks for the patch!

Ferenc
Comment 8 Ferenc Boldog 2018-01-08 14:39:51 UTC
Created attachment 273457 [details]
Same error after 0001 patch

linux - 4.14.12
distro: archlinux
lenovo t430s with thunderbolt port
caldigit ts2 (ethernet, usb keyboard and mouse, displayport connected)
Comment 9 Mika Westerberg 2018-01-10 10:12:08 UTC
Created attachment 273511 [details]
Updated patch to mark stale PCI devices disconnected

Can you try the updated patch? It should mark the igb device disconnected and prevent the crash.
Comment 10 Ferenc Boldog 2018-01-11 09:22:05 UTC
Unfortunately, I got the same error.

This is the log before the crash occurs when disconnecting the thunderbolt dock:

[   57.473929] igb 0000:11:00.0 enp17s0: PCIe link lost, device now detached
[   57.741216] ata7: failed to stop engine (-5)
[   58.241158] ata8: failed to stop engine (-5)
[   58.241893] xhci_hcd 0000:12:00.0: remove, state 1
[   58.241909] usb usb6: USB disconnect, device number 1
[   58.242291] xhci_hcd 0000:12:00.0: USB bus 6 deregistered
[   58.242303] xhci_hcd 0000:12:00.0: xHCI host controller not responding, assume dead
[   58.242321] xhci_hcd 0000:12:00.0: remove, state 1
[   58.242328] usb usb5: USB disconnect, device number 1
[   58.242330] usb 5-1: USB disconnect, device number 2
[   58.378296] usb 5-2: USB disconnect, device number 3
[   58.548397] usb 5-4: USB disconnect, device number 4
[   58.549254] xhci_hcd 0000:12:00.0: Host halt failed, -19
[   58.549258] xhci_hcd 0000:12:00.0: Host not accessible, reset failed.
[   58.549416] xhci_hcd 0000:12:00.0: USB bus 5 deregistered
[   58.549751] igb 0000:11:00.0: removed PHC on enp17s0
[   58.727965] ------------[ cut here ]------------
[   58.727970] kernel BUG at drivers/pci/msi.c:352!
Comment 11 Mika Westerberg 2018-01-11 09:32:49 UTC
OK, thanks for testing. I was hoping it would be enough to mark the device disconnected during hot-unplug, but apparently that does not help. We need to fix it in the igb driver then. I'll see if I can make a patch doing so.
Comment 12 Ferenc Boldog 2018-01-11 09:42:55 UTC
Ok, thanks!
Comment 13 Nikolay Bogoychev 2018-01-14 13:30:32 UTC
What happens with your second patch is that when I try to rmmod igb I get the same kernel bug as when I unplug the dock itself... Weird.
Comment 14 Mika Westerberg 2018-01-15 16:55:10 UTC
Created attachment 273631 [details]
igb: don't call netif_device_detach() when PCIe link is gone

Can you try the attached patch in addition to the previous patch and see if the crash is gone now? The previous patch is still needed to make sure PCI config space access is disabled when the device is disconnected.
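
One plausible reading of why the igb change is needed, based on the patch title and the logs in this thread, can be sketched as a toy model. Everything here is a hypothetical Python stand-in rather than driver code: `ToyNetdev`, its methods, and the `detach_on_link_loss` switch are illustrative names. The model rests on two assumptions: (1) the check at drivers/pci/msi.c:352 fires when MSI vectors are freed while an interrupt handler is still registered, and (2) a net device that has been marked detached skips the close path that would otherwise release its interrupts.

```python
class MsiTeardownBug(Exception):
    """Stands in for the kernel BUG: vectors freed with a handler still attached."""


class ToyNetdev:
    def __init__(self, detach_on_link_loss):
        self.detach_on_link_loss = detach_on_link_loss
        self.present = True    # "device present" flag on the net device
        self.irq_handlers = 1  # handlers registered when the interface came up

    def link_lost(self):
        # The driver notices the PCIe link is gone.
        if self.detach_on_link_loss:
            self.present = False  # models marking the net device detached

    def close(self):
        # Assumption (2): the close path only runs for a device marked present.
        if self.present:
            self.irq_handlers = 0  # models releasing the interrupt handlers

    def disable_msix(self):
        # Assumption (1): freeing vectors with a handler still registered is fatal.
        if self.irq_handlers:
            raise MsiTeardownBug("handler still registered")

    def remove(self):
        self.close()  # unregistering the net device closes it first
        self.disable_msix()
        return "removed cleanly"


def unplug(detach_on_link_loss):
    dev = ToyNetdev(detach_on_link_loss)
    dev.link_lost()
    return dev.remove()
```

In this model only the variant that detaches on link loss reaches the failing check, which is consistent with the log in comment 10, where "PCIe link lost, device now detached" is printed shortly before the BUG.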
Comment 15 Nikolay Bogoychev 2018-01-15 19:26:03 UTC
(In reply to Mika Westerberg from comment #14)
> Created attachment 273631 [details]
> igb: don't call netif_device_detach() when PCIe link is gone
> 
> Can you try the attached patch in addition to the previous patch and see if
> the crash is gone now? The previous patch is still needed to make sure PCI
> config space access is disabled when the device is disconnected.

Yes, this works! This allows me to plug/unplug the dock without specifically removing the igb module. No more crashes on rmmod either.

Thanks,

Nick
Comment 16 Ferenc Boldog 2018-01-16 08:13:00 UTC
Instant freeze for me. :(
I will check one more thing and get some log somehow.
Comment 17 Ferenc Boldog 2018-01-16 09:42:52 UTC
This is interesting.

Everything works perfectly if I use the first version of the ACPI / hotplug patch (https://bugzilla.kernel.org/attachment.cgi?id=273411&action=diff) together with the igb patch.

I think the main difference between Nick's config and mine (Arch Linux with the 4.14.13 kernel) is a Thunderbolt 3 versus a Thunderbolt 2 device.


Mika, which log or dump would be most useful to you?


This is the working (acpi patch v1 and igb patch) dmesg output when disconnecting:

> [  141.118347] ata7: failed to stop engine (-5)
> [  141.617857] ata8: failed to stop engine (-5)
> [  141.618404] xhci_hcd 0000:12:00.0: remove, state 1
> [  141.618416] usb usb6: USB disconnect, device number 1
> [  141.618777] xhci_hcd 0000:12:00.0: USB bus 6 deregistered
> [  141.618789] xhci_hcd 0000:12:00.0: xHCI host controller not responding, assume dead
> [  141.618807] xhci_hcd 0000:12:00.0: remove, state 1
> [  141.618816] usb usb5: USB disconnect, device number 1
> [  141.618819] usb 5-1: USB disconnect, device number 2
> [  141.618821] usb 5-1.1: USB disconnect, device number 4
> [  141.771220] usb 5-1.2: USB disconnect, device number 6
> [  141.898338] usb 5-2: USB disconnect, device number 3
> [  141.898673] usb 5-4: USB disconnect, device number 5
> [  141.899296] xhci_hcd 0000:12:00.0: Host halt failed, -19
> [  141.899299] xhci_hcd 0000:12:00.0: Host not accessible, reset failed.
> [  141.899452] xhci_hcd 0000:12:00.0: USB bus 5 deregistered
> [  141.899736] igb 0000:11:00.0: removed PHC on enp17s0
> [  141.899742] igb 0000:11:00.0 enp17s0: PCIe link lost
> [  142.595875] pci_bus 0000:0e: busn_res: [bus 0e] is released
> [  142.596327] pci_bus 0000:11: busn_res: [bus 11] is released
> [  142.596609] pci_bus 0000:12: busn_res: [bus 12] is released
> [  142.596820] pci_bus 0000:13: busn_res: [bus 13] is released
> [  142.596937] pci_bus 0000:14: busn_res: [bus 14] is released
> [  142.597025] pci_bus 0000:15: busn_res: [bus 15] is released
> [  142.597121] pci_bus 0000:10: busn_res: [bus 10-3c] is released
> [  142.597219] pci_bus 0000:0f: busn_res: [bus 0f-3c] is released
> [  142.597312] pci_bus 0000:3d: busn_res: [bus 3d] is released
> [  142.597401] pci_bus 0000:3e: busn_res: [bus 3e] is released
> [  142.597500] pci_bus 0000:3f: busn_res: [bus 3f] is released
> [  142.597595] pci_bus 0000:0d: busn_res: [bus 0d-3f] is released
Comment 18 Nikolay Bogoychev 2018-01-16 10:13:54 UTC
Just to let you know I'm also running Arch on 4.14.13... But my device is indeed a TB3 one.

Cheers,

Nick
Comment 19 Mika Westerberg 2018-01-16 10:20:20 UTC
I don't think TBT generation matters here as the igb device is the same.

Ferenc, could you try again with both patches and, if the system is still somewhat functional when it freezes, capture dmesg and attach it to the bug? It is possible that something else crashes in that case.
Comment 20 Ferenc Boldog 2018-01-16 12:36:32 UTC
Yes of course, it will take a little time.
Comment 21 Ferenc Boldog 2018-01-17 18:48:38 UTC
I made a fresh kernel build with the latest patches (acpi and igb), and now everything works.

I tried many disconnect/reconnect cycles without any problem.

Thanks for the fix.
Comment 22 Mika Westerberg 2018-01-18 10:23:10 UTC
Cool, thanks for testing. I'll send the patches upstream soon (I have a couple of other fixes for acpiphp that I'm currently testing, hoping to include them as well).
Comment 23 Ferenc Boldog 2018-01-22 10:16:57 UTC
Thanks for fixing the driver. :)
Comment 24 Nikolay Bogoychev 2018-05-20 07:36:09 UTC
Hey Mika,

Did you get around to submitting the patches to mainline? Should the bug be marked as resolved? I no longer have the device so I can't test it...
Comment 25 Mika Westerberg 2018-05-21 05:33:16 UTC
Yes, it is fixed by 17a0b9add6e9 ("igb: Do not call netif_device_detach() when PCIe link goes missing"). I don't have rights to change state of this bug so feel free to mark the bug as resolved :)
Comment 26 Nikolay Bogoychev 2018-05-23 10:01:39 UTC
Fixed thanks to Mika (17a0b9add6e9)!
