Bug 199631
Summary: | WARNING: at drivers/thunderbolt/switch.c:587 tb_switch_add+0x63e/0x730 [thunderbolt] | ||
---|---|---|---|
Product: | Drivers | Reporter: | Chris Murphy (bugzilla) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | NEW --- | ||
Severity: | normal | CC: | gicmo, mika.westerberg, superm1, superm1 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.17.0-0.rc3.git4.1.fc29.x86_64 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg
lspci vvnn acpidump -o acpidump.out dmesg 4.16.7 no devices connected dmesg partial force_power with sleep 0.6 udevadminmonitor |
Description
Chris Murphy
2018-05-06 05:20:27 UTC
Created attachment 275785 [details]
dmesg
Created attachment 275787 [details]
lspci vvnn
Few additional questions: - Do you have anything connected to the Thunderbolt/USB-C ports? - Can you revert 13d3047c8150 ("ACPI / hotplug / PCI: Check presence of slot itself in get_slot_status()") and see if the issue goes away? In addition can you attach acpidump output from that system to this bug? Created attachment 275805 [details]
acpidump -o acpidump.out
Nothing is connected other than power (which is USB-C). Attaching acpidump.
Pretty sure this is a downstream bug, this is with 4.16.2. https://bugzilla.redhat.com/show_bug.cgi?id=1568826 Did you try to revert commit 13d3047c8150? Also you say "does not happen with 4.16.6" but in comment #5 you imply that the same issue happens in "4.16.2". Is there some clear version that always works? In that case, you could also try to bisect the commit that caused this (in case it was not 13d3047c8150). (In reply to Mika Westerberg from comment #6) > Did you try to revert commit 13d3047c8150? No it's a production system so I don't immediately have a way to do that. Hopefully in the next week. Also in my case, the WARNING is non-deterministic, it doesn't always happen. (In reply to Mika Westerberg from comment #7) > Also you say "does not happen with 4.16.6" but in comment #5 you imply that > the same issue happens in "4.16.2". Is there some clear version that always > works? In that case, you could also try to bisect the commit that caused > this (in case it was not 13d3047c8150). It does not happen *for me* on 4.16.6, but there is a downstream bug where someone else has it happening on 4.16.2. I have no idea how the hardware involved differ. If it is happens on 4.16.2 then reverting that commit does not help because it is not in that release anyway. Can you check the NVM version of the controller? You can do this: # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power This should bring up the controller (you may also see the warning again). Then check what following sysfs attribute contains: # cat /sys/bus/thunderbolt/devices/0-0/nvm_version You can then put the controller back to low power state: # echo 0 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power No warning happens. [root@f28h ~]# cat /sys/bus/thunderbolt/devices/0-0/nvm_version 16.0 I'm also seeing this WARNING and call trace, sometimes shortly after the WARNING in this bug report; sometimes on its own. I also don't see a nearby instigator in the full journal. I have no idea if it's related or not. May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x2 May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: 0: tb_eeprom_read_rom failed May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: stopping RX ring 0 May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0x0 -> 0x0) May 08 08:47:29 f28h.local kernel: ------------[ cut here ]------------ May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: interrupt for RX ring 0 is already disabled May 08 08:47:29 f28h.local kernel: WARNING: CPU: 3 PID: 1453 at drivers/thunderbolt/nhi.c:106 ring_interrupt_active+0x20c/0x260 [thunderbolt] No bugs for ring_interrupt_active search term in kernel.org bugzilla, but found this one in RHBZ for Fedora. https://bugzilla.redhat.com/show_bug.cgi?id=1571360 I have a theory about this. Fwupd is using intel-wmi-thunderbolt to power on/off the controller during boot when it enumerates devices which support firmware upgrades. I don't know how long it keeps the controller powered so could it be that in this particular Thunderbolt controller the firmware (which is pretty old, v16) takes bit longer to start up and while the driver is in the middle of initializing everything the power is suddenly cut. @Mario, how long it keeps the controller powered on? @Chris, you should be able to reproduce this by doing something like below as well: # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power && sleep 1 && echo 0 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power If you then change that sleep to say 60s does it always work or does it still fail? @Mika, Fwupd is configured to set a timeout on the intel-thunderbolt-wmi interface of 20 seconds in fwupd 1.0.6 and later. (https://github.com/hughsie/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L313) Thanks Mario. @Chris, can you also check what following command returns? $ fwupdmgr --version FYI the value that's selected in fwupd for this tiemeout reset is fairly arbitrary. It was based upon empirical testing with the systems I had on hand when making the fwupd plugin. If it turns out to be the root cause of this issue, I'm OK with us turning it up to something more workable in upstream fwupd. Also in that case we should probably also update kernel documentation to warn this can happen if it's set too low. (In reply to Mika Westerberg from comment #12) > @Chris, you should be able to reproduce this by doing something like below > as well: > > # echo 1 > > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power && > sleep 1 && echo 0 > > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power > > If you then change that sleep to say 60s does it always work or does it > still fail? I can't get the warning to come up with any sleep value. With a sleep value of 1, there is a 22 second delay for powering off. [10622.917707] thunderbolt 0000:03:00.0: NFC Credits: 0x800000 [10644.368918] thunderbolt 0000:03:00.0: stopping RX ring 0 No warning or call trace though. This is with 4.17.0-0.rc4.git1.1.fc29.x86_64 which is git f142f08bf7ec and has a bunch of DEBUG options set in the kernel config. I will have time starting tomorrow to build custom kernels with whatever options and patches; and test with and without connected devices. So far I haven't found a pattern. [root@f28h ~]# fwupdmgr --version client version: 1.0.7 daemon version: 1.0.7 compile-time dependency versions appstream-glib: 0.7.8 gusb: 0.3.0 fwupdate: 10 efivar: 34 [root@f28h ~]# Assuming Mika's theory is correct the pattern will likely align along lack of Thunderbolt devices plugged into the Type-C port. You can readily reproduce it at boot up through right? Would you be able to rebuild fwupd with a local patch? Please change https://github.com/hughsie/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L42 from '2' to '6'. That should change it up to 60 seconds before it turns off the force power. (In reply to Mario Limonciello from comment #17) > Assuming Mika's theory is correct the pattern will likely align along lack > of Thunderbolt devices plugged into the Type-C port. OK that might be the problem today, I have multiple devices connected. > You can readily reproduce it at boot up through right? Non-deterministic. With somewhat low confidence it appears to not happen when devices are connected, and sometimes happens when no devices are connected. > Would you be able to rebuild fwupd with a local patch? Yes, let me see if I can approximate some kind of fail % without devices connected, and without the patch. Then see about testing with patch. (In reply to Chris Murphy from comment #18) > Yes, let me see if I can approximate some kind of fail % without devices > connected, and without the patch. Then see about testing with patch. You don't need to test with devices connected. Then the situation should not happen because the controller does not get powered down unexpectedly (at least if the current theory is correct). The point is that fwupd may cut the power too early from the controller (when no devices are connected) and we would like to get verification if it is the real root cause. Created attachment 275909 [details]
dmesg 4.16.7 no devices connected
Just as a data point, this is the usual working scenario and the messages I see, including this somewhat recent (since I think 4.16) "scary" message since it's highlighted in red.
[ 62.029130] xhci_hcd 0000:37:00.0: xHCI host controller not responding, assume dead
What about boltd as a possible suspect? Whether warm or cold boot, today for whatever reason I can't reproduce this warning so far, with either the originally reported kernel or rc4. But looking at journal instead of dmesg, I saw this: [ 58.181711] f28h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x2 [ 58.181726] f28h.local kernel: thunderbolt 0000:03:00.0: 0: tb_eeprom_read_rom failed [ 58.183292] f28h.local kernel: thunderbolt 0000:03:00.0: stopping RX ring 0 [ 58.183325] f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff) [ 58.183482] f28h.local kernel: thunderbolt 0000:03:00.0: stopping TX ring 0 [ 58.183509] f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe) [ 58.183671] f28h.local kernel: thunderbolt 0000:03:00.0: control channel stopped [ 58.184685] f28h.local kernel: thunderbolt 0000:03:00.0: freeing RX ring 0 [ 58.184777] f28h.local kernel: thunderbolt 0000:03:00.0: freeing TX ring 0 [ 58.186155] f28h.local kernel: thunderbolt 0000:03:00.0: shutdown [ 58.191564] f28h.local kernel: thunderbolt: probe of 0000:03:00.0 failed with error -110 [ 58.207796] f28h.local kernel: pci_bus 0000:03: busn_res: [bus 03] is released [ 57.853625] f28h.local boltd[1385]: probing: started [1000] [ 58.211049] f28h.local kernel: pci_bus 0000:04: busn_res: [bus 04-36] is released [ 58.216211] f28h.local kernel: pci_bus 0000:37: busn_res: [bus 37] is released [ 58.217261] f28h.local kernel: pci_bus 0000:38: busn_res: [bus 38-6b] is released [ 58.218156] f28h.local kernel: pci_bus 0000:02: busn_res: [bus 02-6b] is released In approximately the same location sequence wise the warning appeared in, is boltd. Unfortunately I had systemd-journald set to volatile storage at the time of the originally reported call trace so I can't be certain of the boltd timing related to warning. I've switched to persistent logging for now. Good finding. Maybe a combination of fwupd behavior + boltd startup could explain it. Boltd starting up and trying to read sysfs files when the power has been cut using thunderbolt force power API. @mika: If that's true, I wonder if what we need is a notifier from intel-wmi-thunderbolt.ko to thunderbolt.ko that force power was set to 0 and that it should rip all sysfs nodes out in advance of the controller disappearing. There's reference counts for that so even if the sysfs node is open we can remove the device underneath. However, if the power is cut in the middle when the driver is initializing it may cause warnings like that to trigger. But I would like to get it confirmed first before thinking how to fix it ;-) The driver actually just fails and bails out which is pretty much what it should do in such situations when hardware disappears. Dumping stack trace may be too much, though so we may "fix" this by just printing warning instead. (In reply to Chris Murphy from comment #20) > [ 62.029130] xhci_hcd 0000:37:00.0: xHCI host controller not responding, > assume dead That is fine. When the Thunderbolt host router is hot-removed the xHCI is removed as well and the xHCI driver notices it printing the above harmless warning. I can trigger this reliably on my Thinkpad Carbon X1 by doing this: # echo 1 > force_power && sleep 0.6 && echo 0 > force_power Created attachment 275927 [details]
dmesg partial force_power with sleep 0.6
4.17.0-0.rc4.git1.1.fc29.x86_64
This is all I get. No warning or trace.
The downstream bug has a Lenovo and Dell make in addition to the two HP makes (now) in this bug. I've asked in that downstream bug for journalctl -o short-monotonic to see if there's any correlation between boltd fwupd or other and the warning.
It depends on which phase of initialization the driver currently is in. Using 0.6 s in your system (with the old v16 firmware) seems to fail earlier: 12677.936235] thunderbolt 0000:03:00.0: failed to send driver ready to ICM [12677.936253] thunderbolt 0000:03:00.0: stopping RX ring 0 [12677.936280] thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff) [12677.936339] thunderbolt 0000:03:00.0: stopping TX ring 0 [12677.936366] thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe) [12677.936416] thunderbolt 0000:03:00.0: control channel stopped [12677.936492] thunderbolt 0000:03:00.0: freeing RX ring 0 [12677.936544] thunderbolt 0000:03:00.0: freeing TX ring 0 [12677.936703] thunderbolt 0000:03:00.0: shutdown [12677.938617] thunderbolt: probe of 0000:03:00.0 failed with error -110 Here we timeout even before we have done handshake with the firmware. I would say this is definitely due to fwupd force_power and because some controllers and firmware it sometimes (not always) takes longer time to initialize. The 20s timeout is too short in those cases. Can you try the 60s patch from Mario (or alternatively you can blacklist the force_power module from fwupd) and see if the issue goes away? Regarding boltd interfering, it really cannot happen at this point because we have not yet announced any devices on /sys/bus/thunderbolt/devices so it has no knowledge of any Thunderbolt device. @Mika, At what point do the uvents start to come through? After all that handshaking is done and the sysfs files are created? Or sooner? Something maybe we can do in fwupd is to listen for the relevant uevents and if we catch one then delay turning off force power until we see the device fully initialized. uevents are triggered when a device is added to the thunderbolt bus. If possible, I think it makes sense to wait for the root switch (name X-0 on the bus where X is number of the domain starting from 0) to appear. Initially you get two uevents: one for the domainX and one for the root switch X-0. After that you get one uevent per Thunderbolt device. @Chris
Can you please do the following?
# systemctl stop fwupd.service
In one terminal split screen:
#udevadm monitor
In another terminal
# echo 1 >
> /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
Start a stopwatch.
Can you please tell me if you see the domain0 event come in before 20 seconds and the 0-0 event come after the 20 seconds mark? I think that's what's going to happen.
Based on what Mika has said my thought is that I'll configure thunderbolt-power to continue to default to 20 seconds but also add a listener for the uevents. If a domain uevent comes through, remove the timer and wait until the matching root switch to come through before turning off force power.
On my test system I see all the domain and root switch uvents coming within a second, so I would be able to verify this on my side.
Created attachment 276003 [details]
udevadminmonitor
I get the attached output pretty much right away, within a second. Nothing happens for the following 5 minutes. I see both domain0 and domain0/0-0 happen within a 100th of a second.
Steps
1. Reboot 4.17.0-0.rc4.git4.1.fc29.x86_64 with no devices attached but with power
2. $ sudo systemctl stop fwupd.service
3. $ sudo udevadm monitor
4. $ sudo -i
5. # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
I think you would need to repeat that several times. Occasionally the firmware might take a bit longer to get up and running. It does not happen always. I've seen this few times on systems with older firmware. You may also try to power down the system completely (shutdown) instead of rebooting. 3 warm, 3 cold boots, repeat steps in c32 for each. Same results. I also haven't seen the tb_switch_add warning since first reporting it. Is the thunderbolt firmware part of UEFI firmware, or separate? The UEFI firmware is current, BIOS F.40 02/26/2018. It was updated to this version soon after that date, so the firmware hasn't changed during this testing. Neither has fwupd or boltd. Do you currently have fwupd configured to blacklist thunderbolt-power? I think you'll need to otherwise it's going to turn on the TBT controller at bootup and back off, so this cold boot idea for the controller won't really be a cold boot. (In reply to Mario Limonciello from comment #35) > Do you currently have fwupd configured to blacklist thunderbolt-power? ? [chris@f28h ~]$ cat /etc/fwupd/daemon.conf [fwupd] # Allow blacklisting specific devices by their GUID # Uses semicolons as delimiter BlacklistDevices= # Allow blacklisting specific plugins # Uses semicolons as delimiter BlacklistPlugins=test # Maximum archive size that can be loaded in Mb, with 0 for the default ArchiveSizeMax=0 [chris@f28h ~]$ - BlacklistDevices= + BlacklistDevices=thunderbolt-power Then poweroff, boot and retest? Is it still necessary to stop fwupd.service? It needs to be like this: BlacklistPlugins=test;thunderbolt_power And yeah, poweroff, boot and retest. No need to stop fwupd.service, it won't use force power when the plugin is blacklisted. 3 more cold boots with that change, and still get the same results in c32 You mentioned you haven't seen the warning since you reported it so doing 3 + 3 + 3 boots probably does not trigger the issue yet. It would be more like 100+ times. Actually in this case you don't always need to boot the system, it should be fine to toggle force_power on and off and wait some time between (for example with a simple script). That said, I think changing fwupd to wait for the root switch to appear after force power with something like 60s timeout should make the problem go away :) Mario, is there any progress in fwupd side of this? I don't think we can do much in the Thunderbolt driver to besides handling disappearing device gracefully. @Mika, Sorry this fell off my radar. I've produced this PR that I believe should help: https://github.com/hughsie/fwupd/pull/605 It will reset the timeout to 20 seconds every time a new thunderbolt device pops up. So hopefully that should mean the following chain of events: 1) Run WMI force power. 2) Set timeout to 20 seconds <time passes> 3) domain0 shows up 4) Reset timeout to 20 seconds <time passes> 5) 0-0 shows up 6) Reset timeout to 20 seconds 7) Thunderbolt plugin reads NVM <time passes) 8) Timer expires; reset force power to off @Chris, If you can please check if this helps? 1) Checkout that branch 2) Build packages using the appropriate distro script (as described in https://github.com/hughsie/fwupd/tree/master/contrib) 3) Install the packages 4) Remove your blacklist for Thunderbolt Power 5) Modify the fwupd.service systemd unit to add "-v" to ExecStart 6) Power off your machine and power it back on. See if the behavior continues to happen. 7) If it does happen please share the journal output generated by fwupd.service (the verbose in it will help to see if this timing is still off). Thanks Mario! That looks definitely like a good way to handle it. Some recent discussion also brought up the potential that bolt and fwupd are fighting over control of this force power. @Chris, Before you test that branch, please confirm if you also have bolt in the system and what version. It would be good to note if that's playing a factor in this problem. Anyone still encountering this, please update to fwupd 1.1.2 or later and confirm if it still happens. It's expected to be fixed there. Also if you are encountering it, please note the version of bolt you are using. |