The storage is connected to the monitor. When the monitor is hotplugged to the host, the storage behind the monitor sometimes can't be detected. This error can be spotted in dmesg:
[ 1464.264628] thunderbolt 0000:04:00.0: 3: drom size mismatch, aborting
[ 1464.264629] thunderbolt 0-3: reading DROM failed
[ 1464.264637] thunderbolt 0000:04:00.0: failed to find parent switch for 303
The monitor itself works fine when this issue happens.
Created attachment 287297 [details]
Created attachment 287299 [details]
The TBT NVM version is 55.
Hi, what monitor is this? I remember seeing something like this with the LG 5K monitor.
So what's the next step I should take?
What is the NVM version of the monitor and can you check if there is an update available?
Indeed, there's a firmware update on Asus' website. Now I know monitors also need firmware updates :)
I'll report back once I've done the test.
I can still reproduce the same issue after upgrading the firmware on the Asus monitor.
Can you add "thunderbolt.dyndbg" to the kernel command line, reproduce the issue and attach dmesg here? What is the NVM version (nvm_version) of the monitor after the upgrade?
Created attachment 287531 [details]
Deactivate EEPROM access
Can you also try this patch (on top of v5.6-rc2) and see if it changes anything?
I haven't tried your patch, but I found that adding a delay can work around the issue:
diff --git a/drivers/thunderbolt/eeprom.c b/drivers/thunderbolt/eeprom.c
index 921d164b3f35..0122799a3215 100644
@@ -9,6 +9,7 @@
@@ -576,6 +577,8 @@ int tb_drom_read(struct tb_switch *sw)
sw->drom = kzalloc(size, GFP_KERNEL);
res = tb_drom_read_n(sw, 0, sw->drom, size);
Thanks for the information :)
Does it work every time or just from time to time? Maybe it makes sense to retry the DROM read once?
Smaller numbers like 50ms may still fail, but 100ms seems to be pretty reliable.
I'll see whether retrying the DROM read works.
Retrying the DROM read doesn't help, and the issue is still reproducible with the patch in #11:
[ 33.687862] thunderbolt 0000:04:00.0: failed to find parent switch for 301
The 100ms delay seems to be really reliable; I haven't seen this issue after several dozen hotplugs.
Hmm, I wonder if we could try the read first and, if it fails, wait 100ms and retry once. Does that work? Then we would not need to add the delay for every device.
The read-delay-retry loop makes subsequent hotplugs stop working.
Does the Windows driver have the delay?
AFAIK the Windows driver does not read the DROM at all. Linux reads it because it contains information about the ports etc. Let me try to come up with another patch to test.
Created attachment 287777 [details]
Retry DROM read
Can you try this one?
Also please add "thunderbolt.dyndbg" to the kernel command line when you try the patch and attach full dmesg.
Created attachment 287805 [details]
dmesg with "Retry DROM read" patch
It doesn't work all the time.
[ 133.478769] thunderbolt 0-303: new device found, vendor=0x1ec device=0x4257
[ 133.478770] thunderbolt 0-303: HP P800
[ 133.637115] pcieport 0000:00:1d.0: PME: Spurious native interrupt!
[ 133.637162] pcieport 0000:03:04.0: pciehp: Slot(4): Card present
[ 133.637166] pcieport 0000:03:04.0: pciehp: Slot(4): Link Up
No "Card present":
[ 179.723429] thunderbolt 0-303: new device found, vendor=0x1ec device=0x4257
[ 179.723429] thunderbolt 0-303: HP P800
Created attachment 289243 [details]
Modified patch which can solve the issue.
Based on your patch, here is a slightly modified one that solves the issue.
Can you check the error the first tb_eeprom_read_n() returns when it fails?
tb_eeprom_read_n() now always returns 0. The error always happens at the crc32 mismatch:
[ 157.748881] thunderbolt 0000:04:00.0: 1: reading drom (length: 0x6c)
[ 158.165393] thunderbolt 0000:04:00.0: 1: drom data crc32 mismatch (expected: 0x600a2071, got: 0x24b59097), continuing
[ 158.165773] thunderbolt 0000:04:00.0: 1: drom buffer overrun, aborting
[ 158.165775] thunderbolt 0-1: reading DROM failed
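For readers following the log above: the read call returns 0, yet the bytes it delivered do not match the checksum stored with them. A minimal user-space sketch of that kind of check follows. It assumes the CRC-32C (Castagnoli) variant; the log only says "crc32", so the variant is an assumption, and `crc32c()` and `drom_crc_ok()` are hypothetical names, not the driver's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(data != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		/* Process one bit at a time; the mask is all ones when the LSB is set. */
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Returns 1 when the payload matches the checksum stored alongside it. */
static int drom_crc_ok(const uint8_t *payload, size_t len, uint32_t stored)
{
	return crc32c(payload, len) == stored;
}
```

A successful transfer with a failing checksum, as seen here, means the data was altered somewhere between the EEPROM and the buffer, which is why the return value of the read alone says nothing about the problem.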
Created attachment 289431 [details]
Thanks for the reminder, forgot this one. My vacation starts tomorrow so I don't think I can do anything before that but once I'm back I'll take a look again. I think the DROM read bit-banging somehow goes wrong with this particular device.
Alright, have a nice vacation!
Created attachment 292027 [details]
Add some logging around DROM parsing
Can you try the attached patch and then attach the full dmesg so we can maybe see what it looks like when it is corrupted?
Created attachment 292165 [details]
[ 398.255734] thunderbolt 0000:04:00.0: 3: drom data crc32 mismatch (expected: 0x600a2071, got: 0xf1f0368a), continuing
[ 398.255858] thunderbolt 0000:04:00.0: 3: drom buffer overrun at 0x1e, aborting
[ 398.255860] thunderbolt 0-3: reading DROM failed
[ 398.609700] thunderbolt 0000:04:00.0: failed to find parent switch for 303
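The "buffer overrun at 0x1e" message is what one would expect when a corrupted length byte makes an entry claim more bytes than remain in the buffer. A rough sketch of such a bounds check, assuming each entry starts with a one-byte total length and a two-byte header (both assumptions for illustration; `walk_entries()` is a hypothetical helper, not the driver's parser):

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRY_HEADER_SIZE 2 /* assumed minimum size of one entry */

/*
 * Walk length-prefixed entries in buf[start..size).
 * Returns 0 if the entries tile the buffer exactly. Returns -1 and stores
 * the offending offset in *bad_pos at the first entry that does not fit.
 */
static int walk_entries(const uint8_t *buf, size_t size, size_t start,
			size_t *bad_pos)
{
	size_t pos = start;

	while (pos < size) {
		uint8_t len = buf[pos]; /* first byte of an entry: its total length */

		/* Too short would stall the walk; too long runs past the end. */
		if (len < ENTRY_HEADER_SIZE || len > size - pos) {
			*bad_pos = pos;
			return -1;
		}
		pos += len;
	}
	return 0;
}
```

With this shape, a single flipped bit in one length byte is enough to abort the whole walk, even when every other byte was read correctly.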
Can you add "thunderbolt.dyndbg" (note it is thunderbolt.dyndbg not thunderbolt.dyndbg=1) and try again so we get all the logs?
There also seems to be a new firmware update for this monitor:
I wonder if you could try it as well?
Created attachment 292185 [details]
The issue persists after upgrading the TBT monitor's firmware.
Here's the correct dmesg.
Created attachment 292201 [details]
Retry DROM read once if the parsing fails (new version)
Can you try the attached patch?
The patch solves the issue.
Please collect my Tested-by tag:
Tested-by: Kai-Heng Feng <email@example.com>
Posted a slightly modified patch upstream here:
Fixed by commit f022ff7bf377ca94367be05de61277934d42ea74 ("thunderbolt: Retry DROM read once if parsing fails")
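As a summary of the approach named in the commit title, here is a user-space simulation of "retry once if parsing fails". It is only a sketch, not the upstream patch: `struct fake_dev`, `fake_read()`, `fake_delay_100ms()`, `parse()` and `read_drom_retry_once()` are all hypothetical, the image layout is invented, and the 100ms wait is replaced by a counter so the example runs instantly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DROM_SIZE 8

/* Hypothetical device: the first `bad_reads` reads deliver corrupted data. */
struct fake_dev {
	int bad_reads;
	int reads_done;
	int delays_done;
};

/* Invented image: three length-prefixed entries of 2, 3 and 3 bytes. */
static const uint8_t good_image[DROM_SIZE] = { 2, 0, 3, 0, 0, 3, 0, 0 };

static int fake_read(struct fake_dev *dev, uint8_t *buf)
{
	memcpy(buf, good_image, DROM_SIZE);
	if (dev->reads_done < dev->bad_reads)
		buf[2] = 0xff; /* corrupt one length byte */
	dev->reads_done++;
	return 0; /* the read itself reports success, as observed in the thread */
}

/* Stand-in for a real 100ms sleep; only counts how often it was called. */
static void fake_delay_100ms(struct fake_dev *dev)
{
	dev->delays_done++;
}

/* Returns 0 if the entries tile the buffer exactly, -1 otherwise. */
static int parse(const uint8_t *buf, size_t size)
{
	size_t pos = 0;

	while (pos < size) {
		uint8_t len = buf[pos];

		if (len < 2 || len > size - pos)
			return -1;
		pos += len;
	}
	return 0;
}

/* Read and parse; if parsing fails, wait and re-read exactly once. */
static int read_drom_retry_once(struct fake_dev *dev, uint8_t *buf)
{
	int retries = 1;
	int res = fake_read(dev, buf);

	if (res)
		return res;

	for (;;) {
		res = parse(buf, DROM_SIZE);
		if (res == 0 || retries-- == 0)
			return res;

		fake_delay_100ms(dev);
		res = fake_read(dev, buf);
		if (res)
			return res;
	}
}
```

The delay is only paid when parsing fails, so devices whose first read is good are not slowed down, which was the concern raised earlier in the thread about adding a delay for every device.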