Bug 206493

Summary: Hotplug Thunderbolt daisy chain (TBT host -> TBT monitor -> TBT storage), TBT storage may not appear.
Product: Drivers Reporter: Kai-Heng Feng (kai.heng.feng)
Component: OtherAssignee: drivers_other
Severity: normal CC: mika.westerberg
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: mainline Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg
lspci -vvv
Deactivate EEPROM access
Retry DROM read
dmesg with "Retry DROM read" patch
Modified patch which can solve the issue.
Add some logging around DROM parsing
Retry DROM read once if the parsing fails (new version)

Description Kai-Heng Feng 2020-02-11 08:25:46 UTC
The storage is connected to the monitor. When hotplug the monitor to the host, sometimes the storage behind monitor can't be detected. This error can be spotted in dmesg:
[ 1464.264628] thunderbolt 0000:04:00.0: 3: drom size mismatch, aborting
[ 1464.264629] thunderbolt 0-3: reading DROM failed
[ 1464.264637] thunderbolt 0000:04:00.0: failed to find parent switch for 303

The monitor itself works fine when this issue happens.
Comment 1 Kai-Heng Feng 2020-02-11 08:26:14 UTC
Created attachment 287297 [details]
Comment 2 Kai-Heng Feng 2020-02-11 08:26:35 UTC
Created attachment 287299 [details]
lspci -vvv
Comment 3 Kai-Heng Feng 2020-02-11 08:27:27 UTC
The TBT NVM version is 55.
Comment 4 Mika Westerberg 2020-02-11 10:09:13 UTC
Hi, what monitor this is? I remember seeing something like this with LG 5k monitor.
Comment 5 Kai-Heng Feng 2020-02-11 10:33:13 UTC
Comment 6 Kai-Heng Feng 2020-02-14 06:22:23 UTC
So what's next step should I do?
Comment 7 Mika Westerberg 2020-02-14 12:22:50 UTC
What is the NVM version of the monitor and can you check if there is an update available?
Comment 8 Kai-Heng Feng 2020-02-14 12:40:43 UTC
Indeed there's an update firmware on Asus' website. Now I know monitors also need firmware update :)

I'll report back once I've done the test.
Comment 9 Kai-Heng Feng 2020-02-17 06:11:57 UTC
I can still reproduce the same issue after upgrading the firmware on the Asus monitor.
Comment 10 Mika Westerberg 2020-02-18 07:46:06 UTC
Can you add "thunderbolt.dyndbg" to the kernel command line, reproduce the issue and attach dmesg here? What is the NVM version (nvm_version) of the monitor after the upgrade?
Comment 11 Mika Westerberg 2020-02-21 09:21:58 UTC
Created attachment 287531 [details]
Deactivate EEPROM access

Can you also try this patch (on top of v5.6-rc2) and see if changes anything?
Comment 12 Kai-Heng Feng 2020-02-26 13:15:35 UTC
TBT display:
  DeviceId:             d958fa2449a859e6afe1937e6316b30f760a5b8b
  Guid:                 67d422af-4240-56b1-9c9d-aad8013ba010
  Plugin:               thunderbolt
  Flags:                updatable|registered
  Vendor:               ASUS-Display
  VendorId:             TBT:0x0031
  Version:              33.00
  Icon:                 audio-card
  Created:              2020-02-21

  DeviceId:             4ee2fd1be53f8393edbe2d8411074f8726743e29
  Guid:                 ecde9711-de91-56a2-bcfb-345c46f2cb8f
  Plugin:               thunderbolt
  Flags:                updatable|registered
  Vendor:               HP
  VendorId:             TBT:0x01EC
  Version:              29.00
  Icon:                 audio-card
  Created:              2020-02-21
Comment 13 Kai-Heng Feng 2020-02-26 13:19:08 UTC
I haven't tried your patch, but I found that adding a delay can workaround the issue:

diff --git a/drivers/thunderbolt/eeprom.c b/drivers/thunderbolt/eeprom.c
index 921d164b3f35..0122799a3215 100644
--- a/drivers/thunderbolt/eeprom.c
+++ b/drivers/thunderbolt/eeprom.c
@@ -9,6 +9,7 @@
 #include <linux/crc32.h>
 #include <linux/property.h>
 #include <linux/slab.h>
+#include <linux/delay.h>
 #include "tb.h"
@@ -576,6 +577,8 @@ int tb_drom_read(struct tb_switch *sw)
        sw->drom = kzalloc(size, GFP_KERNEL);
        if (!sw->drom)
                return -ENOMEM;
+       msleep(100);
        res = tb_drom_read_n(sw, 0, sw->drom, size);
        if (res)
                goto err;
Comment 14 Mika Westerberg 2020-02-27 14:55:12 UTC
Thanks for the information :)

Does it work every time or just from time to time? Maybe it makes sense to retry the DROM read once?
Comment 15 Kai-Heng Feng 2020-02-27 15:18:34 UTC
Smaller numbers like 50ms may still fail, but 100ms seems to be pretty reliable.

I'll see if retry the DROM works or not.
Comment 16 Kai-Heng Feng 2020-03-02 08:17:50 UTC
Retry reading DROM doesn't help. And the issue is still reproducible with patch in #11:
[   33.687862] thunderbolt 0000:04:00.0: failed to find parent switch for 301

100ms delay seems to be really reliable, haven't seen this issue after several dozen times hotplugging.
Comment 17 Mika Westerberg 2020-03-02 10:42:57 UTC
Hmm, I wonder if we could do so that try the read first and if it fails then wait for the 100ms and retry once, does it work? Then we would not need to do that for every device.
Comment 18 Kai-Heng Feng 2020-03-03 11:57:27 UTC
The read-delay-retry loop will make subsequent hotplug stops working.

Does Windows driver have the delay?
Comment 19 Mika Westerberg 2020-03-03 12:10:05 UTC
AFAIK Windows driver does not read DROM at all. Linux reads it because it contains information about the ports etc. Let me try to come up another patch to test.
Comment 20 Mika Westerberg 2020-03-03 12:21:54 UTC
Created attachment 287777 [details]
Retry DROM read

Can you try this one?
Comment 21 Mika Westerberg 2020-03-04 10:01:22 UTC
Also please add "thunderbolt.dyndbg" to the kernel command line when you try the patch and attach full dmesg.
Comment 22 Kai-Heng Feng 2020-03-05 11:10:56 UTC
Created attachment 287805 [details]
dmesg with "Retry DROM read" patch
Comment 23 Kai-Heng Feng 2020-03-05 11:11:30 UTC
It doesn't work all the time.

Normal cases:
[  133.478769] thunderbolt 0-303: new device found, vendor=0x1ec device=0x4257
[  133.478770] thunderbolt 0-303: HP P800
[  133.637115] pcieport 0000:00:1d.0: PME: Spurious native interrupt!
[  133.637162] pcieport 0000:03:04.0: pciehp: Slot(4): Card present
[  133.637166] pcieport 0000:03:04.0: pciehp: Slot(4): Link Up

No "Card present":
[  179.723429] thunderbolt 0-303: new device found, vendor=0x1ec device=0x4257
[  179.723429] thunderbolt 0-303: HP P800
Comment 24 Kai-Heng Feng 2020-05-23 12:14:38 UTC
Created attachment 289243 [details]
Modified patch which can solve the issue.

Based on your patch, a slightly modified one that can solve the issue.
Comment 25 Mika Westerberg 2020-05-25 09:24:20 UTC
Can you check the error the first tb_eeprom_read_n() returns when it fails?
Comment 26 Kai-Heng Feng 2020-05-30 05:55:57 UTC
Now tb_eeprom_read_n() always return 0. The error always happen at crc32 mismatch:
[  157.748881] thunderbolt 0000:04:00.0: 1: reading drom (length: 0x6c)
[  158.165393] thunderbolt 0000:04:00.0: 1: drom data crc32 mismatch (expected: 0x600a2071, got: 0x24b59097), continuing
[  158.165773] thunderbolt 0000:04:00.0: 1: drom buffer overrun, aborting
[  158.165775] thunderbolt 0-1: reading DROM failed
Comment 27 Kai-Heng Feng 2020-05-30 05:57:55 UTC
Created attachment 289431 [details]
Comment 28 Kai-Heng Feng 2020-07-02 12:06:04 UTC
Comment 29 Mika Westerberg 2020-07-02 12:34:44 UTC
Thanks for the reminder, forgot this one. My vacation starts tomorrow so I don't think I can do anything before that but once I'm back I'll take a look again. I think the DROM read bit-banging somehow goes wrong with this particular device.
Comment 30 Kai-Heng Feng 2020-07-02 12:36:55 UTC
Alright, have a nice vacation!
Comment 31 Mika Westerberg 2020-08-19 13:19:56 UTC
Created attachment 292027 [details]
Add some logging around DROM parsing

Can you try the attached patch and then attached full dmesg so we can maybe see what it looks like when it is corrupted?
Comment 32 Kai-Heng Feng 2020-08-25 07:52:27 UTC
Created attachment 292165 [details]

[  398.255734] thunderbolt 0000:04:00.0: 3: drom data crc32 mismatch (expected: 0x600a2071, got: 0xf1f0368a), continuing
[  398.255858] thunderbolt 0000:04:00.0: 3: drom buffer overrun at 0x1e, aborting
[  398.255860] thunderbolt 0-3: reading DROM failed
[  398.609700] thunderbolt 0000:04:00.0: failed to find parent switch for 303
Comment 33 Mika Westerberg 2020-08-25 11:26:12 UTC
Can you add "thunderbolt.dyndbg" (note it is thunderbolt.dyndbg not thunderbolt.dyndbg=1) and try again so we get all the logs?
Comment 34 Mika Westerberg 2020-08-26 10:22:03 UTC
There also seems to be a new firmware update for this monitor:


I wonder if you could try it as well?
Comment 35 Kai-Heng Feng 2020-08-27 03:59:30 UTC
Created attachment 292185 [details]

The issue persists after upgrading the TBT monitor's firmware.

Here's the correct dmesg.
Comment 36 Mika Westerberg 2020-08-28 09:52:24 UTC
Created attachment 292201 [details]
Retry DROM read once if the parsing fails (new version)

Can you try the attached patch?
Comment 37 Kai-Heng Feng 2020-09-04 03:53:11 UTC
The patch solves the issue

Please collect my Tested-by tag:
Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Comment 38 Mika Westerberg 2020-09-08 09:58:23 UTC
Posted slightly modified patch upstream here:

Comment 39 Kai-Heng Feng 2020-12-08 16:41:52 UTC
Fixed by commit f022ff7bf377ca94367be05de61277934d42ea74 ("thunderbolt: Retry DROM read once if parsing fails")