Bug 210761 - Input/output error on Thunderbolt devices
Summary: Input/output error on Thunderbolt devices
Status: RESOLVED WILL_NOT_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_other
Reported: 2020-12-17 22:03 UTC by Joe Borg
Modified: 2021-01-11 19:30 UTC

Kernel Version: 5.8.0-33-generic
Regression: No

Attachments
dmesg with tb debug enabled (13.41 KB, text/plain)
2020-12-18 16:03 UTC, Joe Borg
lspci -vv out (81.35 KB, text/plain)
2020-12-18 16:04 UTC, Joe Borg
dmesg with tb debug enabled no tbt devices plugged in at boot (5.28 KB, text/plain)
2020-12-22 17:18 UTC, Joe Borg
Dmesg with dock plugged in from boot (112.60 KB, text/plain)
2021-01-07 15:17 UTC, Joe Borg
Dmesg with dock plugged in after boot (181.41 KB, text/plain)
2021-01-07 15:17 UTC, Joe Borg
device fw versions (7.32 KB, application/octet-stream)
2021-01-07 15:36 UTC, Joe Borg
Dmesg with dock plugged in from boot and BIOS (107.37 KB, text/plain)
2021-01-08 14:19 UTC, Joe Borg
Second attempt at BIOS defaults (251.56 KB, text/plain)
2021-01-08 15:31 UTC, Joe Borg
Dmesg after modprobe (202.51 KB, text/plain)
2021-01-08 18:06 UTC, Joe Borg
Dmesg during sucessful fwupd update (133.28 KB, text/plain)
2021-01-11 15:38 UTC, Joe Borg

Description Joe Borg 2020-12-17 22:03:12 UTC
This surfaced when trying to fwupd update the firmware on my Dell XPS 13 9370 (https://github.com/fwupd/fwupd/issues/2689).

When trying to access nvmem on any of the TB devices, I get IO timeouts:

$ sudo cat /sys/bus/thunderbolt/devices/domain0/0-0/device_name 
XPS 13 9370

$ sudo cat /sys/bus/thunderbolt/devices/domain0/0-0/nvm_active0/nvmem 
cat: /sys/bus/thunderbolt/devices/domain0/0-0/nvm_active0/nvmem: Input/output error

$ sudo cat /sys/bus/thunderbolt/devices/domain0/0-0/0-3/device_name 
WD19TB Thunderbolt Dock

$ sudo cat /sys/bus/thunderbolt/devices/domain0/0-0/0-3/nvm_active1/nvmem 
cat: /sys/bus/thunderbolt/devices/domain0/0-0/0-3/nvm_active1/nvmem: Input/output error
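
A minimal sketch of the same check as a loop over every Thunderbolt device, so that all of them can be tested in one go. The function name `tb_nvm_check` and its optional root argument are illustrative only (the argument exists so the loop can be pointed at a test directory); the sysfs file names are the ones used in the commands above. Reading nvmem normally requires root.

```shell
# Sketch: for each Thunderbolt device, report whether its active NVM can be read.
tb_nvm_check() {
    root="${1:-/sys/bus/thunderbolt/devices}"
    for dev in "$root"/*; do
        # Domains (domain0, ...) have no device_name; skip them.
        [ -f "$dev/device_name" ] || continue
        name=$(cat "$dev/device_name")
        for nvm in "$dev"/nvm_active*/nvmem; do
            # An unmatched glob stays literal, so make sure the path exists.
            [ -e "$nvm" ] || continue
            if cat "$nvm" >/dev/null 2>&1; then
                echo "$name: $nvm readable"
            else
                echo "$name: $nvm NOT readable"
            fi
        done
    done
}
```

Calling `tb_nvm_check` with no arguments (as root) walks the real sysfs tree. On the machine in this report, both the host and the dock would be expected to show up as NOT readable.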
Comment 1 Mika Westerberg 2020-12-18 07:45:24 UTC
Can you add "thunderbolt.dyndbg" to the kernel command line and then attach full dmesg here? Also please attach output of 'sudo lspci -vv'.
Comment 2 Joe Borg 2020-12-18 16:03:49 UTC
Created attachment 294211 [details]
dmesg with tb debug enabled
Comment 3 Joe Borg 2020-12-18 16:04:57 UTC
Created attachment 294213 [details]
lspci -vv out
Comment 4 Joe Borg 2020-12-18 16:32:52 UTC
Looking at boltctl, it does seem that the dock boots as "connected and authorized" and then drops to "disconnected".  I can't tell whether or not this ties in with anything in dmesg.  I have tried with TLP disabled, but it's the same story.
Comment 5 Joe Borg 2020-12-18 17:03:48 UTC
(In reply to Joe Borg from comment #4)
> Looking at boltctl, it does seem that the dock boots as "connected and
> authorized" and then drops to "disconnected".  I can't tell whether or not
> this ties in with anything in dmesg.  I have tried with TLP disabled, but
> it's the same story.

After a few reboots and trying to cat nvmem, the fall to "disconnected" doesn't always happen but I can never read nvmem.
Comment 6 Mika Westerberg 2020-12-21 08:07:23 UTC
What do you mean by "TLP disabled"?

Can you completely power off the machine and boot it up without the device connected? Then check if you can read the host controller NVM (and also check if nvm_version is there and is readable).

What does /sys/bus/thunderbolt/devices/domain0/security contain?
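
The items asked for here (the security level and whether each nvm_version is present and readable) could be collected in one pass with something like the sketch below. `tb_summary` and its optional root argument are made-up names for illustration; only the sysfs file names come from this thread.

```shell
# Sketch: print the domain security level and every nvm_version that exists.
tb_summary() {
    root="${1:-/sys/bus/thunderbolt/devices}"
    for f in "$root"/domain*/security "$root"/*/nvm_version; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$f" ] || continue
        if value=$(cat "$f" 2>/dev/null); then
            printf '%s: %s\n' "$f" "$value"
        else
            printf '%s: <read failed>\n' "$f"
        fi
    done
}
```

If the function prints nothing at all, the devices directory is empty, which is itself useful information on a system that powers the controller down when nothing is connected.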
Comment 7 Mika Westerberg 2020-12-21 08:38:04 UTC
I realized that XPS 9370 is probably using ACPI based PCI enumeration so you don't see the TBT controller once you boot up the system with nothing connected. It might be there for a while as bolt powers it on at boot. This makes me wonder why the driver is runtime suspending the controller as in your log? That should not be happening.

@Mario, can you confirm that the 9370 is not an RTD3 system?
Comment 8 Mario Limonciello 2020-12-21 16:04:20 UTC
I recall that the circuitry for supporting RTD3 was introduced for the Whiskey Lake generation of laptops - that is the 9380.  

So your hypothesis is a driver bug: the driver advertises that it can support runtime power management on this machine when it shouldn't, so something (like TLP) is turning the knob and suspending the controller, which then can't wake up.

IIRC the 9370 should also be controllable via the force power pin, which is what bolt may be doing at bootup using the WMI driver. As a workaround, can you try toggling it as described in https://www.kernel.org/doc/html/v5.10/admin-guide/thunderbolt.html?highlight=intel-wmi-thunderbolt in a situation where the NVM can't normally be read?  See if that kicks it awake.
Comment 9 Mika Westerberg 2020-12-22 10:29:26 UTC
Right, the driver should only enable runtime PM if the TBT firmware says it supports RTD3 but it may be that something is wrong somewhere ;-)

I can try to reproduce this with 9370 after the holidays.

@Joe, can you check the BIOS version too? And also did you change any TBT related BIOS settings?
Comment 10 Joe Borg 2020-12-22 14:58:19 UTC
I mean to say I've edited /etc/tlp.conf, set TLP_ENABLE=0, saved and rebooted.

I've checked the settings in BIOS and they are open security wise, everything else looks sane.  I'll now reboot to get the BIOS version.
Comment 11 Joe Borg 2020-12-22 15:06:49 UTC
BIOS version is 1.13.1, which seems to be the latest according to the Dell downloads page.

If I reboot the system with no TBT devices plugged in, I get an empty /sys/bus/thunderbolt/devices.

/sys/bus/thunderbolt/devices/domain0/security is "none" when a TBT device is plugged in from boot.
Comment 12 Mika Westerberg 2020-12-22 15:56:25 UTC
Can you turn on the TBT security from the BIOS? It is typically under the "Thunderbolt" menu, where you can choose the security level. Select "User Approval" or similar.

Once you have done that, boot the system up, no device connected. Then plug in the device (you may need to authorize it from the UI or settings panel depending on which distro you are running). Then check again if you can read the NVM contents of the device or the host and attach full dmesg here. Please also check the nvm_version files under those devices.
Comment 13 Joe Borg 2020-12-22 16:12:45 UTC
Tried that and I get the same results as “open” permissions.  Interestingly, when I do this, I get the message “Thunderbolt security level could not be determined”.
Comment 14 Mika Westerberg 2020-12-22 16:31:56 UTC
Can you attach the full dmesg of that run? Preferably with "thunderbolt.dyndbg" in the kernel command line so we can hopefully see a bit more of what it is doing.
Comment 15 Joe Borg 2020-12-22 17:16:49 UTC
Attaching dmesg with
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash thunderbolt.dyndbg"
```
to clarify, this is with no TBT devices plugged in at boot and then plugged in after login.
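
One way to confirm that the parameter actually reached the running kernel is to look for it in /proc/cmdline (on Ubuntu the GRUB change normally only takes effect after `sudo update-grub` and a reboot). A small sketch; the function name `has_tb_dyndbg` and the optional file argument are illustrative:

```shell
# Sketch: succeed if thunderbolt.dyndbg (with or without a value) is on the command line.
has_tb_dyndbg() {
    # Put each parameter on its own line, then match whole lines only.
    tr ' ' '\n' < "${1:-/proc/cmdline}" |
        grep -q -x -e 'thunderbolt\.dyndbg' -e 'thunderbolt\.dyndbg=.*'
}
```

For example, `has_tb_dyndbg && echo "dyndbg is on the command line"`.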
Comment 16 Joe Borg 2020-12-22 17:18:09 UTC
Created attachment 294299 [details]
dmesg with tb debug enabled no tbt devices plugged in at boot
Comment 17 Mika Westerberg 2020-12-23 11:17:43 UTC
Unfortunately it seems to be missing a lot of information :( If you are building your own kernel, please increase CONFIG_LOG_BUF_SHIFT so that it has room for all the messages. Then do these steps (keep the thunderbolt.dyndbg in the command line too):

1. Boot up the system, nothing connected
2. Once the system is booted up, plug in the WD19TB dock
3. Run 'dmesg' and attach the full output here.

You can remove/scramble lines that include MAC addresses and similar more "sensitive" information if you like. What we are interested here is to see the initial PCI configuration, the ACPI _OSC call and the hotplug so that it might help to debug this issue further.

I'm sure this has been working for XPS 9370 (as I tested this myself some time ago) so there must be some regression in the TBT driver side.
Comment 18 Joe Borg 2021-01-04 14:51:14 UTC
My kernel is from Ubuntu, it's not built by me so I'm not sure I can change that.

I'm not removing anything from these logs, they're the direct result of 

```
sudo dmesg -l debug > dmesg.log

```

I've checked again, this time comparing before plugging in vs after and I can't see any new lines added after plugging the device in after boot.  Also, if I check `boltctl`, the device is marked as disconnected.  So I don't even think it's getting initialised.  When I plug it in before boot, it shows as connected.
Comment 19 Mika Westerberg 2021-01-04 15:27:53 UTC
Can you please run dmesg without the -l option and attach the full output here? Like this:

  $ sudo dmesg > dmesg.log

But also do the steps in my previous comment (boot up with nothing connected etc) before that so we can hopefully see something useful in the full dmesg.
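
A quick way to tell whether a saved log is complete is to check that it starts at boot. The sketch below assumes the usual dmesg timestamp format, where the earliest lines carry `[    0.000000]`; the function name is illustrative. If the first line has a later timestamp, the ring buffer has probably wrapped (or the output was filtered) and the early PCI and ACPI messages are missing.

```shell
# Sketch: succeed if the first line of a saved dmesg log has the 0.000000 timestamp.
dmesg_starts_at_boot() {
    head -n 1 "$1" | grep -q '^\[ *0\.000000]'
}
```

For example, `dmesg_starts_at_boot dmesg.log || echo "log does not start at boot"`.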
Comment 20 Mika Westerberg 2021-01-05 16:58:15 UTC
I now tried with XPS 9370, BIOS version 1.13.1, TBT NVM 40 and Dell WD19TB dock with NVM 40 and followed my steps above but I don't see any issues. I also tried booting up with the dock connected, but both NVMs are still accessible and the device works fine. I do have a newer kernel though (mainline v5.11-rc2 + a few patches, but those should not matter).

I can also answer my own question: this system is using ACPI enumeration and I checked that the _OSC call is fine.

I will try to install Ubuntu on that system and see if that matters; for now I'm using my buildroot "distro".

BTW, which port do you connect the dock to? I'm using the left side ports marked with the lightning symbol.
Comment 21 Mario Limonciello 2021-01-05 17:11:38 UTC
As a non-obvious potential thought, Joe, do you have a smaller power supply connected to the dock than is suggested?  The firmware does emit a message to let users know whether too small of a power supply is connected but nothing in Linux picks it up at the moment.  In Windows you'll see a pop-up along those lines from Dell Power Manager.
Comment 22 Mika Westerberg 2021-01-07 09:11:07 UTC
Just for the record. I now booted Ubuntu 20.10 and tried the same steps but I cannot reproduce the issue. Both NVMs are accessible and also fwupd is able to read them. The Ubuntu kernel version is 5.8.0-25.
Comment 23 Joe Borg 2021-01-07 15:16:13 UTC
Sorry for the tardy reply, it's been a busy week.

I am using the power supply that Dell provided with the dock, so I hope it's correct but maybe it's not or failing.

I have attached both dmesg.log, which is boot unplugged and plug in, and dmesg-dockboot.log, which is from boot.

In the plug in after boot, I tried all 3 ports.  I usually run it on the left side, which has the TB logos on (the right has the DP logo).

Thanks for your continuing help!
Comment 24 Joe Borg 2021-01-07 15:17:07 UTC
Created attachment 294547 [details]
Dmesg with dock plugged in from boot
Comment 25 Joe Borg 2021-01-07 15:17:22 UTC
Created attachment 294549 [details]
Dmesg with dock plugged in after boot
Comment 26 Mario Limonciello 2021-01-07 15:32:29 UTC
(In that case I have no worry about the power supply)

Can you please confirm your firmware versions match what Mika has been trying with success on Ubuntu 20.10?
Comment 27 Joe Borg 2021-01-07 15:36:32 UTC
Yes, my BIOS is 1.13.1

I'll attach a log of the fw versions of all the TB devices.
Comment 28 Joe Borg 2021-01-07 15:36:49 UTC
Created attachment 294551 [details]
device fw versions
Comment 29 Mika Westerberg 2021-01-08 07:29:52 UTC
Thanks for the logs. What I can see is that the TBT controller is also present when you boot without a device connected, and that is not expected. It should be completely powered down at that point. The other log, where you boot with the device connected, gets a bit farther but it tries to runtime suspend the controller, and that's also something that should not happen with this system. My guess is the host TBT firmware is somehow wrong. It looks like it is the RTD3 firmware instead of the non-RTD3 one. It is possible that some BIOS settings affect this too. Did you change any BIOS settings from the defaults?
Comment 30 Joe Borg 2021-01-08 14:19:18 UTC
Created attachment 294565 [details]
Dmesg with dock plugged in from boot and BIOS

I've set all BIOS settings back to factory.  It now seems that the device doesn't get detected at all anymore but I guess it's falling back on USB because my devices work.
Comment 31 Mika Westerberg 2021-01-08 14:49:45 UTC
There is still some issue as the driver fails to talk to the firmware:

[   68.584139] thunderbolt 0000:05:00.0: failed to send driver ready to ICM

and then the xHCI dies too:

[   92.432655] xhci_hcd 0000:39:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   92.432716] xhci_hcd 0000:39:00.0: can't change power state from D3hot to D0 (config space inaccessible)
[   92.432760] xhci_hcd 0000:39:00.0: Controller not ready at resume -19
[   92.432763] xhci_hcd 0000:39:00.0: PCI post-resume error -19!
[   92.432767] xhci_hcd 0000:39:00.0: HC died; cleaning up

I think the factory defaults for Dell systems are that the TBT is in "user" security mode and that is not the case here. Do you have:

  System Configuration -> Dell Type-C Dock Configuration

checked? My system has it checked and also the following:

  System Configuration -> Thunderbolt Adapter Configuration
     Enable Thunderbolt Technology Support
     Security Level - User Authorization
Comment 32 Joe Borg 2021-01-08 14:54:02 UTC
Dell Type-C Dock Configuration is enabled
Security Level is set to None on mine though (by default).
Comment 33 Mario Limonciello 2021-01-08 15:02:16 UTC
Dell BIOS has a variety of different "defaults" profiles.  They're explained here: https://github.com/torvalds/linux/blob/master/Documentation/ABI/testing/sysfs-class-firmware-attributes#L245

I'm a bit surprised if factory had security set to "none" by default, it should have been user.

Please make sure that you have the same ones as Mika checked.

>It looks like it is the RTD3 firmware instead of the non-RTD3 one

I'm starting to suspect a corrupted TBT firmware.  One more thought though to try to force everything awake.  Can you please:

1) Make sure boltd service is disabled (so it doesn't mess with the force power pin)
2) Unload thunderbolt.ko
3) Use the force power driver to force the Thunderbolt controller awake
4) Try to reload thunderbolt.ko

Share the messages that came up then.  If we're still not seeing things work properly I think we're at a corrupted TBT SPI.
Comment 34 Joe Borg 2021-01-08 15:31:46 UTC
Created attachment 294569 [details]
Second attempt at BIOS defaults

I tried again, this time resetting BIOS defaults and factory defaults (both in BIOS oddly), that seems to align now (i.e. user auth).  However, I still see exactly the same problems.

Could you explain the process of 2, 3 and 4 please?  I'm not sure how to do those steps.
Comment 35 Mario Limonciello 2021-01-08 15:42:46 UTC
Here you go:

$ sudo rmmod thunderbolt
$ sudo modprobe intel-wmi-thunderbolt
$ echo "1" | sudo tee /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
$ sudo modprobe thunderbolt
Comment 36 Joe Borg 2021-01-08 16:40:24 UTC
➜ sudo systemctl stop bolt 

~ 
➜ sudo rmmod thunderbolt  

~ 
➜ sudo modprobe intel-wmi-thunderbolt

~ 
➜ echo "1" | sudo tee /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power
1

~ 
➜ sudo modprobe thunderbolt

Nothing errored
Comment 37 Mario Limonciello 2021-01-08 16:43:52 UTC
Can you check dmesg for any new messages?  Did the ICM come up this time?
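
For reference, the failure seen earlier in this report shows up in dmesg as "failed to send driver ready to ICM". A sketch of a check for that line in a saved log; the function name `icm_failed` is illustrative, and the message text is the one quoted in comment 31:

```shell
# Sketch: succeed if the saved log contains the ICM handshake failure.
icm_failed() {
    grep -q 'failed to send driver ready to ICM' "$1"
}
```

For example, after saving the log with `sudo dmesg > dmesg.log`, run `icm_failed dmesg.log && echo "ICM still not responding"`. The absence of the message does not by itself prove the ICM started, so the thunderbolt lines in the log are still worth reading.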
Comment 38 Joe Borg 2021-01-08 18:06:32 UTC
Created attachment 294571 [details]
Dmesg after modprobe

This is with the rmmod and modprobe.
Comment 39 Mario Limonciello 2021-01-08 21:18:52 UTC
OK - my thought at this point is unfortunately corrupted firmware or failing thunderbolt controller.  I would suggest working on getting your MB replaced if you're under warranty or a service contract.

Mika, do you agree?
Comment 40 Mika Westerberg 2021-01-11 09:29:30 UTC
Yes, I agree.
Comment 41 Joe Borg 2021-01-11 15:37:13 UTC
Oddly, whilst (I assume) the dock is in USB rather than TB mode, I can update it with fwupd:

Upgrade available for Package level of Dell dock from 01.00.14.00 to 01.00.14.01
Downloading 01.00.14.01 for Package level of Dell dock...
Decompressing…           [***************************************]
Authenticating…          [***************************************]
Updating Package level of Dell dock…                             ]
Restarting device…       [***************************************]
Successfully installed firmware
Upload message: 90fc13732f343a41ea50b5282142865613197665 replaces old report
Successfully uploaded 1 report
• RTS5413 in Dell dock has the latest available firmware version
• RTS5487 in Dell dock has the latest available firmware version
• VMM5331 in Dell dock has the latest available firmware version
• WD19TB has the latest available firmware version
• KXG50ZNV1T02 NVMe TOSHIBA 1024GB has no available firmware updates
• System Firmware has the latest available firmware version
• TPM 2.0 has no available firmware updates
• Touchpad has no available firmware updates


However, even after a reboot and powercycle of the dock, it still doesn't show under boltctl as a device.


Jan 11 10:27:09 mia boltd[6594]: dbus: exported domain at /org/freedesktop/bolt/domains/e0010000_0070_6508_a321_b602e2d28a1e
Jan 11 10:27:09 mia systemd[1]: Started Thunderbolt system service.
Jan 11 10:27:09 mia boltd[6594]: power: state changed: supported/on
Jan 11 10:27:09 mia boltd[6594]: power: guard '2' for 'fwupd' active
Jan 11 10:27:28 mia boltd[6594]: power: got event for guard '2' (10)
Jan 11 10:27:28 mia boltd[6594]: power: guard '2' for 'fwupd' deactivated
Jan 11 10:27:28 mia boltd[6594]: power: shutdown scheduled (T-20.00s)
Jan 11 10:27:28 mia boltd[6594]: power: state changed: supported/wait
Jan 11 10:28:10 mia boltd[6594]: power: setting force_power to OFF
Jan 11 10:28:10 mia boltd[6594]: power: state changed: supported/off

I'll attach the dmesg just in case but, if not, thanks for all the help, it's really appreciated.  Looks like I'll need a new laptop :(
Comment 42 Joe Borg 2021-01-11 15:38:00 UTC
Created attachment 294601 [details]
Dmesg during sucessful fwupd update
Comment 43 Mario Limonciello 2021-01-11 19:29:15 UTC
I mean, if you're super adventurous and it's out of warranty, then depending upon the socket type you can certainly experiment with a flash programmer clip on the TBT SPI chip.

And yes - the dock has the ability to update all components over USB mode if the host doesn't have a working thunderbolt interface.  Nothing new pops up from your dmesg, same situation that the ICM (which runs on the host TBT chip) isn't starting up.
Comment 44 Joe Borg 2021-01-11 19:30:50 UTC
Okay, thanks Mario!
