Bug 201227

Summary: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x1
Product: Drivers Reporter: Chris Murphy (bugzilla)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: bjorn, lukas, mika.westerberg, superm1
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.19.0-0.rc5.git0.1.fc30.x86_64 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg only
journal full
dmesg only copy 2
Add debugging to Intel WMI Thunderbolt module
HP spectre acpidump

Description Chris Murphy 2018-09-24 22:03:33 UTC
I'm seeing a regression in rc5 on the very first boot that I haven't seen in a week of 4.19.0-0.rc4.git3.1.fc30.x86_64 (git ae596de1a0c8).

[   51.172922] f29h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x1
[   51.172925] f29h.local kernel: ------------[ cut here ]------------
[   51.172927] f29h.local kernel: thunderbolt 0000:03:00.0: 0:3: non switch port without a PHY
[   51.172954] f29h.local kernel: WARNING: CPU: 2 PID: 2036 at drivers/thunderbolt/switch.c:594 tb_switch_add+0x69b/0x780 [thunderbolt]
Comment 1 Chris Murphy 2018-09-24 22:05:02 UTC
Created attachment 278745 [details]
dmesg only
Comment 2 Chris Murphy 2018-09-24 22:05:43 UTC
Created attachment 278747 [details]
journal full

Including in case something user space related is triggering this, like fwupd or boltd - but I don't see proximity of any of their messages with this warning.
Comment 3 Bjorn Helgaas 2018-09-24 22:22:02 UTC
Thanks for the report!  I marked this as a regression.  I don't see any PCI- or Thunderbolt-related changes between v4.19-rc4 and v4.19-rc5.  That *is* the interval you've identified, isn't it (-rc4 works and -rc5 fails)?

Can you please post this to linux-kernel and linux-pci (and include the URL of this bugzilla)?  Also cc these folks (from MAINTAINERS for Thunderbolt):

M:      Andreas Noever <andreas.noever@gmail.com>
M:      Michael Jamet <michael.jamet@intel.com>
M:      Mika Westerberg <mika.westerberg@linux.intel.com>
M:      Yehezkel Bernat <YehezkelShB@gmail.com>
Comment 4 Chris Murphy 2018-09-24 22:22:13 UTC
[   52.642045] f29h.local kernel: thunderbolt 0000:03:00.0: timeout writing config space 2 to 0x2c
[   73.122341] f29h.local kernel: thunderbolt 0000:03:00.0: timeout reading config space 1 from 0x2
[   73.122357] f29h.local kernel: thunderbolt 0000:03:00.0: 0: tb_eeprom_read_rom failed


The tb_eeprom_read_rom failed message does sometimes happen with 4.19.0-0.rc4.git3.1.fc30.x86_64, but not every boot. Whereas with rc5 it looks like it's every boot so far. I don't see a pattern.
Comment 5 Chris Murphy 2018-09-24 22:24:42 UTC
Created attachment 278749 [details]
dmesg only copy 2

This is the same configuration, just a reboot with rc5, and the call trace doesn't happen. Therefore it's a transient problem; which means it's plausible it can happen on rc4 and I just haven't hit it yet.
Comment 6 Mika Westerberg 2018-09-25 04:52:15 UTC
I suspect this is the same issue we had previously with fwupd powering down the controller while the driver is in the middle of initialization. Chris, can you check the version of fwupd (fwupdmgr --version)?

I also assume you do not have any TBT devices connected, right?
Comment 7 Mika Westerberg 2018-09-25 05:54:54 UTC
I suspect this is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=199631.
Comment 8 Mario Limonciello 2018-09-25 13:18:07 UTC
Newer fwupd was configured to keep the lock open for a period of time.  So if this is an newer fwupd then it does sound like it could be bolt doing it.

[   29.673264] f29h.local boltd[2032]: power: force_power support: yes
[   29.673343] f29h.local boltd[2032]: power: setting force_power to ON

You can certainly experiment with blacklisting thunderbolt and thunderbolt_power plugins in fwupd and with disabling bolt.

Please confirm your fwupd version too.
Comment 9 Chris Murphy 2018-09-25 14:58:59 UTC
fwupd-1.1.2-1.fc29.x86_64
bolt-0.4-2.fc29.x86_64

There are two USB-C/Thunderbolt 3 connectors on this laptop, and each has one device connected:
- USB-C to USB-A (USB 3.0) adapter with an old USB 1.1 keyboard attached
- USB-C to HDMI, HDMI to DVI, NEC Multisync PA241W

I'm not sure if the display connection is thunderbolt or displayport as it's passed through the USB-C to HDMI connection.
Comment 10 Chris Murphy 2018-09-25 15:01:49 UTC
Also, this bug report is the first instance of the tb_switch_add warning I've seen since bug 199631 back in May.
Comment 11 Chris Murphy 2018-09-25 15:20:20 UTC
This is the same laptop as in bug 199631. Newer manufacturer firmware has been applied since that bug report: [    0.000000] DMI: HP HP Spectre Notebook/81A0, BIOS F.41 06/15/2018

But still the same thunderbolt controller firmware.
# cat /sys/bus/thunderbolt/devices/0-0/nvm_version
16.0

Re-reading bug 199631, in that case nothing was connected at the time. In this bug 201227 case, I had two devices connected, listed in comment 9. Otherwise they appear to be dups. And while it taints the kernel, it appears to be non-fatal.
Comment 12 Mika Westerberg 2018-09-26 07:35:41 UTC
Created attachment 278769 [details]
Add debugging to Intel WMI Thunderbolt module
Comment 13 Mika Westerberg 2018-09-26 07:39:25 UTC
Hi Chris,

The Thunderbolt firmware is typically not part of the system BIOS.

Yes, the warning you see is not fatal but it merely means that the controller got powered off during the driver initialization and the driver does what it can to bail out.

However, it would be good to understand why you are seeing this issue. I wonder if you could apply the patch I attached and when the situation happens attach full dmesg here. That should show when the controller is force powered.

In addition can you attach acpidump here as well? I would like to check how the ACPI force power method is implemented.
Comment 14 Chris Murphy 2018-09-26 15:22:24 UTC
Created attachment 278777 [details]
HP spectre acpidump
Comment 15 Chris Murphy 2018-09-26 15:26:38 UTC
(In reply to Mika Westerberg from comment #13)
> I
> wonder if you could apply the patch I attached and when the situation
> happens attach full dmesg here. That should show when the controller is
> force powered.

It seems to be really rare, since it last happened in May. I can run it with the patch for maybe a week, but generally I'm running Fedora kernels not building my own.
Comment 16 Mika Westerberg 2018-09-27 08:59:46 UTC
OK, I checked the acpidump and the force power method looks pretty standard.

It could be just a bug in the TBT firmware because yours is quite old (v16) or another explanation could be that there is something (fwupd, bolt) that powers the controller down early. It would be good if you could run with the patch some time (week is fine) and if you see the warning again attach full dmesg. Maybe we can spot what happens.