Bug 199631 - WARNING: at drivers/thunderbolt/switch.c:587 tb_switch_add+0x63e/0x730 [thunderbolt]
Summary: WARNING: at drivers/thunderbolt/switch.c:587 tb_switch_add+0x63e/0x730 [thun...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-06 05:20 UTC by Chris Murphy
Modified: 2018-09-25 13:19 UTC (History)
4 users (show)

See Also:
Kernel Version: 4.17.0-0.rc3.git4.1.fc29.x86_64
Tree: Mainline
Regression: No


Attachments
dmesg (113.13 KB, text/plain)
2018-05-06 05:21 UTC, Chris Murphy
Details
lspci vvnn (30.10 KB, text/plain)
2018-05-06 05:21 UTC, Chris Murphy
Details
acpidump -o acpidump.out (938.52 KB, text/plain)
2018-05-07 21:47 UTC, Chris Murphy
Details
dmesg 4.16.7 no devices connected (81.86 KB, text/plain)
2018-05-10 18:45 UTC, Chris Murphy
Details
dmesg partial force_power with sleep 0.6 (7.16 KB, text/plain)
2018-05-11 17:14 UTC, Chris Murphy
Details
udevadminmonitor (7.16 KB, text/plain)
2018-05-16 03:52 UTC, Chris Murphy
Details

Description Chris Murphy 2018-05-06 05:20:27 UTC
Soon after boot, I'm getting a warning and a call trace

[   36.051674] thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x36
[   36.051757] ------------[ cut here ]------------
[   36.051769] thunderbolt 0000:03:00.0: 0:1: non switch port without a PHY
[   36.051881] WARNING: CPU: 3 PID: 1511 at drivers/thunderbolt/switch.c:587 tb_switch_add+0x63e/0x730 [thunderbolt]


Does not happen with 4.16.6.
Comment 1 Chris Murphy 2018-05-06 05:21:00 UTC
Created attachment 275785 [details]
dmesg
Comment 2 Chris Murphy 2018-05-06 05:21:19 UTC
Created attachment 275787 [details]
lspci vvnn
Comment 3 Mika Westerberg 2018-05-07 14:21:40 UTC
Few additional questions:

  - Do you have anything connected to the Thunderbolt/USB-C ports?

  - Can you revert 13d3047c8150 ("ACPI / hotplug / PCI: Check presence of slot itself in get_slot_status()") and see if the issue goes away?

In addition can you attach acpidump output from that system to this bug?
Comment 4 Chris Murphy 2018-05-07 21:47:32 UTC
Created attachment 275805 [details]
acpidump -o acpidump.out

Nothing is connected other than power (which is USB-C). Attaching acpidump.
Comment 5 Chris Murphy 2018-05-07 23:59:21 UTC
Pretty sure this is a downstream bug, this is with 4.16.2.

https://bugzilla.redhat.com/show_bug.cgi?id=1568826
Comment 6 Mika Westerberg 2018-05-08 11:02:48 UTC
Did you try to revert commit 13d3047c8150?
Comment 7 Mika Westerberg 2018-05-08 11:37:19 UTC
Also you say "does not happen with 4.16.6" but in comment #5 you imply that the same issue happens in "4.16.2". Is there some clear version that always works? In that case, you could also try to bisect the commit that caused this (in case it was not 13d3047c8150).
Comment 8 Chris Murphy 2018-05-08 19:08:50 UTC
(In reply to Mika Westerberg from comment #6)
> Did you try to revert commit 13d3047c8150?

No it's a production system so I don't immediately have a way to do that. Hopefully in the next week.

Also in my case, the WARNING is non-deterministic, it doesn't always happen.

(In reply to Mika Westerberg from comment #7)
> Also you say "does not happen with 4.16.6" but in comment #5 you imply that
> the same issue happens in "4.16.2". Is there some clear version that always
> works? In that case, you could also try to bisect the commit that caused
> this (in case it was not 13d3047c8150).

It does not happen *for me* on 4.16.6, but there is a downstream bug where someone else has it happening on 4.16.2. I have no idea how the hardware involved differ.
Comment 9 Mika Westerberg 2018-05-09 15:32:39 UTC
If it is happens on 4.16.2 then reverting that commit does not help because it is not in that release anyway.

Can you check the NVM version of the controller? You can do this:

  # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power

This should bring up the controller (you may also see the warning again).

Then check what following sysfs attribute contains:

  # cat /sys/bus/thunderbolt/devices/0-0/nvm_version

You can then put the controller back to low power state:

  # echo 0 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
Comment 10 Chris Murphy 2018-05-09 16:12:10 UTC
No warning happens.

[root@f28h ~]# cat /sys/bus/thunderbolt/devices/0-0/nvm_version 
16.0
Comment 11 Chris Murphy 2018-05-09 17:27:54 UTC
I'm also seeing this WARNING and call trace, sometimes shortly after the WARNING in this bug report; sometimes on its own. I also don't see a nearby instigator in the full journal. I have no idea if it's related or not.


May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x2
May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: 0: tb_eeprom_read_rom failed
May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: stopping RX ring 0
May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0x0 -> 0x0)
May 08 08:47:29 f28h.local kernel: ------------[ cut here ]------------
May 08 08:47:29 f28h.local kernel: thunderbolt 0000:03:00.0: interrupt for RX ring 0 is already disabled
May 08 08:47:29 f28h.local kernel: WARNING: CPU: 3 PID: 1453 at drivers/thunderbolt/nhi.c:106 ring_interrupt_active+0x20c/0x260 [thunderbolt]


No bugs for ring_interrupt_active search term in kernel.org bugzilla, but found this one in RHBZ for Fedora.
https://bugzilla.redhat.com/show_bug.cgi?id=1571360
Comment 12 Mika Westerberg 2018-05-10 08:07:32 UTC
I have a theory about this. Fwupd is using intel-wmi-thunderbolt to power on/off the controller during boot when it enumerates devices which support firmware upgrades. I don't know how long it keeps the controller powered so could it be that in this particular Thunderbolt controller the firmware (which is pretty old, v16) takes bit longer to start up and while the driver is in the middle of initializing everything the power is suddenly cut.

@Mario, how long it keeps the controller powered on?

@Chris, you should be able to reproduce this by doing something like below as well:

  # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power && sleep 1 && echo 0 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power

If you then change that sleep to say 60s does it always work or does it still fail?
Comment 13 Mario Limonciello 2018-05-10 13:01:03 UTC
@Mika,

Fwupd is configured to set a timeout on the intel-thunderbolt-wmi interface of 20 seconds in fwupd 1.0.6 and later. (https://github.com/hughsie/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L313)
Comment 14 Mika Westerberg 2018-05-10 13:59:50 UTC
Thanks Mario.

@Chris, can you also check what following command returns?

  $ fwupdmgr --version
Comment 15 Mario Limonciello 2018-05-10 14:02:38 UTC
FYI the value that's selected in fwupd for this tiemeout reset is fairly arbitrary.  It was based upon empirical testing with the systems I had on hand when making the fwupd plugin.

If it turns out to be the root cause of this issue, I'm OK with us turning it up to something more workable in upstream fwupd.  

Also in that case we should probably also update kernel documentation to warn this can happen if it's set too low.
Comment 16 Chris Murphy 2018-05-10 17:47:37 UTC
(In reply to Mika Westerberg from comment #12)
> @Chris, you should be able to reproduce this by doing something like below
> as well:
> 
>   # echo 1 >
> /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power &&
> sleep 1 && echo 0 >
> /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
> 
> If you then change that sleep to say 60s does it always work or does it
> still fail?

I can't get the warning to come up with any sleep value. With a sleep value of 1, there is a 22 second delay for powering off.

[10622.917707] thunderbolt 0000:03:00.0:   NFC Credits: 0x800000
[10644.368918] thunderbolt 0000:03:00.0: stopping RX ring 0

No warning or call trace though. This is with 4.17.0-0.rc4.git1.1.fc29.x86_64 which is git f142f08bf7ec and has a bunch of DEBUG options set in the kernel config.

I will have time starting tomorrow to build custom kernels with whatever options and patches; and test with and without connected devices. So far I haven't found a pattern.

[root@f28h ~]# fwupdmgr --version
client version:	1.0.7
daemon version:	1.0.7
compile-time dependency versions
	appstream-glib:	0.7.8
	gusb:	0.3.0
	fwupdate:	10
	efivar:	34
[root@f28h ~]#
Comment 17 Mario Limonciello 2018-05-10 17:59:06 UTC
Assuming Mika's theory is correct the pattern will likely align along lack of Thunderbolt devices plugged into the Type-C port.

You can readily reproduce it at boot up through right?

Would you be able to rebuild fwupd with a local patch?

Please change https://github.com/hughsie/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L42 from '2' to '6'.

That should change it up to 60 seconds before it turns off the force power.
Comment 18 Chris Murphy 2018-05-10 18:04:41 UTC
(In reply to Mario Limonciello from comment #17)
> Assuming Mika's theory is correct the pattern will likely align along lack
> of Thunderbolt devices plugged into the Type-C port.

OK that might be the problem today, I have multiple devices connected.

 > You can readily reproduce it at boot up through right?

Non-deterministic. With somewhat low confidence it appears to not happen when devices are connected, and sometimes happens when no devices are connected.

> Would you be able to rebuild fwupd with a local patch?

Yes, let me see if I can approximate some kind of fail % without devices connected, and without the patch. Then see about testing with patch.
Comment 19 Mika Westerberg 2018-05-10 18:17:03 UTC
(In reply to Chris Murphy from comment #18)
> Yes, let me see if I can approximate some kind of fail % without devices
> connected, and without the patch. Then see about testing with patch.

You don't need to test with devices connected. Then the situation should not happen because the controller does not get powered down unexpectedly (at least if the current theory is correct).

The point is that fwupd may cut the power too early from the controller (when no devices are connected) and we would like to get verification if it is the real root cause.
Comment 20 Chris Murphy 2018-05-10 18:45:15 UTC
Created attachment 275909 [details]
dmesg 4.16.7 no devices connected

Just as a data point, this is the usual working scenario and the messages I see, including this somewhat recent (since I think 4.16) "scary" message since it's highlighted in red.

[   62.029130] xhci_hcd 0000:37:00.0: xHCI host controller not responding, assume dead
Comment 21 Chris Murphy 2018-05-10 19:03:10 UTC
What about boltd as a possible suspect? Whether warm or cold boot, today for whatever reason I can't reproduce this warning so far, with either the originally reported kernel or rc4. But looking at journal instead of dmesg, I saw this:

[   58.181711] f28h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x2
[   58.181726] f28h.local kernel: thunderbolt 0000:03:00.0: 0: tb_eeprom_read_rom failed
[   58.183292] f28h.local kernel: thunderbolt 0000:03:00.0: stopping RX ring 0
[   58.183325] f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff)
[   58.183482] f28h.local kernel: thunderbolt 0000:03:00.0: stopping TX ring 0
[   58.183509] f28h.local kernel: thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe)
[   58.183671] f28h.local kernel: thunderbolt 0000:03:00.0: control channel stopped
[   58.184685] f28h.local kernel: thunderbolt 0000:03:00.0: freeing RX ring 0
[   58.184777] f28h.local kernel: thunderbolt 0000:03:00.0: freeing TX ring 0
[   58.186155] f28h.local kernel: thunderbolt 0000:03:00.0: shutdown
[   58.191564] f28h.local kernel: thunderbolt: probe of 0000:03:00.0 failed with error -110
[   58.207796] f28h.local kernel: pci_bus 0000:03: busn_res: [bus 03] is released
[   57.853625] f28h.local boltd[1385]: probing: started [1000]
[   58.211049] f28h.local kernel: pci_bus 0000:04: busn_res: [bus 04-36] is released
[   58.216211] f28h.local kernel: pci_bus 0000:37: busn_res: [bus 37] is released
[   58.217261] f28h.local kernel: pci_bus 0000:38: busn_res: [bus 38-6b] is released
[   58.218156] f28h.local kernel: pci_bus 0000:02: busn_res: [bus 02-6b] is released


In approximately the same location sequence wise the warning appeared in, is boltd. Unfortunately I had systemd-journald set to volatile storage at the time of the originally reported call trace so I can't be certain of the boltd timing related to warning. I've switched to persistent logging for now.
Comment 22 Mario Limonciello 2018-05-10 19:10:35 UTC
Good finding.
Maybe a combination of fwupd behavior + boltd startup could explain it.  Boltd starting up and trying to read sysfs files when the power has been cut using thunderbolt force power API.

@mika:
If that's true, I wonder if what we need is a notifier from intel-wmi-thunderbolt.ko to thunderbolt.ko that force power was set to 0 and that it should rip all sysfs nodes out in advance of the controller disappearing.
Comment 23 Mika Westerberg 2018-05-10 20:25:11 UTC
There's reference counts for that so even if the sysfs node is open we can remove the device underneath.

However, if the power is cut in the middle when the driver is initializing it may cause warnings like that to trigger. But I would like to get it confirmed first before thinking how to fix it ;-) The driver actually just fails and bails out which is pretty much what it should do in such situations when hardware disappears. Dumping stack trace may be too much, though so we may "fix" this by just printing warning instead.
Comment 24 Mika Westerberg 2018-05-10 20:37:39 UTC
(In reply to Chris Murphy from comment #20)
> [   62.029130] xhci_hcd 0000:37:00.0: xHCI host controller not responding,
> assume dead

That is fine. When the Thunderbolt host router is hot-removed the xHCI is removed as well and the xHCI driver notices it printing the above harmless warning.
Comment 25 Mika Westerberg 2018-05-11 06:53:59 UTC
I can trigger this reliably on my Thinkpad Carbon X1 by doing this:

  # echo 1 > force_power && sleep 0.6 && echo 0 > force_power
Comment 26 Chris Murphy 2018-05-11 17:14:29 UTC
Created attachment 275927 [details]
dmesg partial force_power with sleep 0.6

4.17.0-0.rc4.git1.1.fc29.x86_64

This is all I get. No warning or trace. 

The downstream bug has a Lenovo and Dell make in addition to the two HP makes (now) in this bug. I've asked in that downstream bug for journalctl -o short-monotonic to see if there's any correlation between boltd fwupd or other and the warning.
Comment 27 Mika Westerberg 2018-05-14 05:29:13 UTC
It depends on which phase of initialization the driver currently is in. Using 0.6 s in your system (with the old v16 firmware) seems to fail earlier:

12677.936235] thunderbolt 0000:03:00.0: failed to send driver ready to ICM
[12677.936253] thunderbolt 0000:03:00.0: stopping RX ring 0
[12677.936280] thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff)
[12677.936339] thunderbolt 0000:03:00.0: stopping TX ring 0
[12677.936366] thunderbolt 0000:03:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe)
[12677.936416] thunderbolt 0000:03:00.0: control channel stopped
[12677.936492] thunderbolt 0000:03:00.0: freeing RX ring 0
[12677.936544] thunderbolt 0000:03:00.0: freeing TX ring 0
[12677.936703] thunderbolt 0000:03:00.0: shutdown
[12677.938617] thunderbolt: probe of 0000:03:00.0 failed with error -110

Here we timeout even before we have done handshake with the firmware.

I would say this is definitely due to fwupd force_power and because some controllers and firmware it sometimes (not always) takes longer time to initialize. The 20s timeout is too short in those cases.

Can you try the 60s patch from Mario (or alternatively you can blacklist the force_power module from fwupd) and see if the issue goes away?
Comment 28 Mika Westerberg 2018-05-14 10:15:45 UTC
Regarding boltd interfering, it really cannot happen at this point because we have not yet announced any devices on /sys/bus/thunderbolt/devices so it has no knowledge of any Thunderbolt device.
Comment 29 Mario Limonciello 2018-05-14 20:13:03 UTC
@Mika,

At what point do the uvents start to come through?  After all that handshaking is done and the sysfs files are created?  Or sooner?
Something maybe we can do in fwupd is to listen for the relevant uevents and if we catch one then delay turning off force power until we see the device fully initialized.
Comment 30 Mika Westerberg 2018-05-15 05:33:02 UTC
uevents are triggered when a device is added to the thunderbolt bus. If possible, I think it makes sense to wait for the root switch (name X-0 on the bus where X is number of the domain starting from 0) to appear. Initially you get two uevents: one for the domainX and one for the root switch X-0. After that you get one uevent per Thunderbolt device.
Comment 31 Mario Limonciello 2018-05-15 19:40:46 UTC
@Chris

Can you please do the following?

# systemctl stop fwupd.service

In one terminal split screen:
#udevadm monitor

In another terminal

# echo 1 >
> /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power

Start a stopwatch.

Can you please tell me if you see the domain0 event come in before 20 seconds and the 0-0 event come after the 20 seconds mark?  I think that's what's going to happen.

Based on what Mika has said my thought is that I'll configure thunderbolt-power to continue to default to 20 seconds but also add a listener for the uevents.  If a domain uevent comes through, remove the timer and wait until the matching root switch to come through before turning off force power.

On my test system I see all the domain and root switch uvents coming within a second, so I would be able to verify this on my side.
Comment 32 Chris Murphy 2018-05-16 03:52:32 UTC
Created attachment 276003 [details]
udevadminmonitor

I get the attached output pretty much right away, within a second. Nothing happens for the following 5 minutes. I see both domain0 and domain0/0-0 happen within a 100th of a second.

Steps
1. Reboot 4.17.0-0.rc4.git4.1.fc29.x86_64 with no devices attached but with power
2. $ sudo systemctl stop fwupd.service
3. $ sudo udevadm monitor
4. $ sudo -i
5. # echo 1 > /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
Comment 33 Mika Westerberg 2018-05-16 05:18:56 UTC
I think you would need to repeat that several times. Occasionally the firmware might take a bit longer to get up and running. It does not happen always. I've seen this few times on systems with older firmware. You may also try to power down the system completely (shutdown) instead of rebooting.
Comment 34 Chris Murphy 2018-05-16 16:16:43 UTC
3 warm, 3 cold boots, repeat steps in c32 for each. Same results. I also haven't seen the tb_switch_add warning since first reporting it.

Is the thunderbolt firmware part of UEFI firmware, or separate? The UEFI firmware is current, BIOS F.40 02/26/2018. It was updated to this version soon after that date, so the firmware hasn't changed during this testing. Neither has fwupd or boltd.
Comment 35 Mario Limonciello 2018-05-16 16:18:23 UTC
Do you currently have fwupd configured to blacklist thunderbolt-power?  I think you'll need to otherwise it's going to turn on the TBT controller at bootup and back off, so this cold boot idea for the controller won't really be a cold boot.
Comment 36 Chris Murphy 2018-05-16 16:23:13 UTC
(In reply to Mario Limonciello from comment #35)
> Do you currently have fwupd configured to blacklist thunderbolt-power?

?

[chris@f28h ~]$ cat /etc/fwupd/daemon.conf 
[fwupd]

# Allow blacklisting specific devices by their GUID
# Uses semicolons as delimiter
BlacklistDevices=

# Allow blacklisting specific plugins
# Uses semicolons as delimiter
BlacklistPlugins=test

# Maximum archive size that can be loaded in Mb, with 0 for the default
ArchiveSizeMax=0
[chris@f28h ~]$ 


- BlacklistDevices=
+ BlacklistDevices=thunderbolt-power

Then poweroff, boot and retest? Is it still necessary to stop fwupd.service?
Comment 37 Mario Limonciello 2018-05-16 16:26:08 UTC
It needs to be like this:

BlacklistPlugins=test;thunderbolt_power

And yeah, poweroff, boot and retest.  No need to stop fwupd.service, it won't use force power when the plugin is blacklisted.
Comment 38 Chris Murphy 2018-05-16 17:14:39 UTC
3 more cold boots with that change, and still get the same results in c32
Comment 39 Mika Westerberg 2018-05-16 18:35:04 UTC
You mentioned you haven't seen the warning since you reported it so doing 3 + 3 + 3 boots probably does not trigger the issue yet. It would be more like 100+ times. Actually in this case you don't always need to boot the system, it should be fine to toggle force_power on and off and wait some time between (for example with a simple script).

That said, I think changing fwupd to wait for the root switch to appear after force power with something like 60s timeout should make the problem go away :)
Comment 40 Mika Westerberg 2018-07-22 07:37:16 UTC
Mario, is there any progress in fwupd side of this? I don't think we can do much in the Thunderbolt driver to besides handling disappearing device gracefully.
Comment 41 Mario Limonciello 2018-07-23 14:49:28 UTC
@Mika,

Sorry this fell off my radar.  I've produced this PR that I believe should help:

https://github.com/hughsie/fwupd/pull/605

It will reset the timeout to 20 seconds every time a new thunderbolt device pops up.  So hopefully that should mean the following chain of events:

1) Run WMI force power.
2) Set timeout to 20 seconds
<time passes>
3) domain0 shows up
4) Reset timeout to 20 seconds
<time passes>
5) 0-0 shows up
6) Reset timeout to 20 seconds
7) Thunderbolt plugin reads NVM
<time passes)
8) Timer expires; reset force power to off

@Chris,
If you can please check if this helps?
1) Checkout that branch
2) Build packages using the appropriate distro script (as described in https://github.com/hughsie/fwupd/tree/master/contrib)
3) Install the packages
4) Remove your blacklist for Thunderbolt Power
5) Modify the fwupd.service systemd unit to add "-v" to ExecStart
6) Power off your machine and power it back on.  See if the behavior continues to happen.
7) If it does happen please share the journal output generated by fwupd.service (the verbose in it will help to see if this timing is still off).
Comment 42 Mika Westerberg 2018-07-23 15:47:55 UTC
Thanks Mario! That looks definitely like a good way to handle it.
Comment 43 Mario Limonciello 2018-07-26 17:29:31 UTC
Some recent discussion also brought up the potential that bolt and fwupd are fighting over control of this force power.

@Chris,
Before you test that branch, please confirm if you also have bolt in the system and what version.  It would be good to note if that's playing a factor in this problem.
Comment 44 Mario Limonciello 2018-09-25 13:19:29 UTC
Anyone still encountering this, please update to fwupd 1.1.2 or later and confirm if it still happens.  It's expected to be fixed there.

Also if you are encountering it, please note the version of bolt you are using.

Note You need to log in before you can comment on or make changes to this bug.