Created attachment 285807 [details] kernel log just before/during/after update Filing this bug on request from https://github.com/dell/thunderbolt-nvm-linux/issues/23 (which may have some additional context) Note: I wasn't sure if drivers would be a better component or not here. Overview: When using fwupd to update the Thunderbolt firmware of my laptop, the thunderbolt interface disappears from the system, but does not re-appear after the update has completed. The display-port screen that is connected to a thunderbolt dock (which is itself connected to the machine) does turn back on, but no other device comes back. Steps taken: - Thunderbolt dock is connected - fwupdmgr -v install file-here.cab - <update happens> Actual Results: Thunderbolt controller/interface is missing until complete computer restart. fwupdmgr get-devices no longer shows the thunderbolt controller device (until restart) Expected Results: Have the thunderbolt interface up and functional after an update Build Date & Hardware: Kernel: 5.3.8 Machine: DELL XPS 13 9370 (latest EFI 0.1.11.1) TB Dock: DELL TB16 (latest firmware from August 3.12.002) fwupd: 1.3.3 Tested on 2019-11-06 Thunderbolt controller information: ├─Thunderbolt Controller: │ Device ID: c600676d09338e9fd2094171b628603b85841d99 │ Summary: Unmatched performance for high-speed I/O │ Current version: 40.00 │ Vendor: Dell (TBT:0x00D4) │ GUIDs: 4eeb9d07-a96c-56d6-92d3-4a23ee7a6e4a ← TBT-00d407e6 │ 31dedd69-1429-5224-87d4-a8fb8072a50e ← TBT-00d407e6-0000:05:00.0 │ Device Flags: • Internal device │ • Updatable │ • Requires AC power │ • Device stages updates See also attached kernel logs (just before/during/after the update occurred)
Can you try the upgrade so that you disable fwupd (not sure how it is done, Mario can help us here) and then follow instructions here: https://www.kernel.org/doc/html/latest/admin-guide/thunderbolt.html#upgrading-nvm-on-thunderbolt-device-or-host If it still fails, please attach full dmesg and output of 'sudo lspci -vv'.
Also please disable bolt as well so it does not do any force power things.
I did some experiments on my end since I have both 9370 and TB16 dock. So I booted to shell (minimal userspace, no bolt or fwupd running) with the dock connected and did several times NVM upgrade from v40 -> v40 and each time the controller comes back after a while (takes ~15-20s for it to come back the second time). BIOS version in this system is the latest (1.11.1). So one guess is that the force power thing is somehow affecting this.
> Can you try the upgrade so that you disable fwupd (not sure how it is done, > Mario can help us here) and then follow instructions here: Basically would be these steps: # gcab --extract firmware.cab # dd if=firmware.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate > Also please disable bolt as well so it does not do any force power things. CC gicmo to confirm, but I expect that bolt shouldn't be leaving anything force powered, especially with a TBT dock hooked up. > So one guess is that the force power thing is somehow affecting this. OP: did you have the dock plugged in from the moment you booted up all the way to when you ran the FW upgrade that caused the failure? Or did you hotplug it some time after booting up? If it's some time after booting up nothing should have force powered. Otherwise we'll need to check fwupd and bolt caching of force power stuff to see if they could have re-applied it. Could you check your system journal? I think bolt will be messaging something to the journal at this time if fwupd or bolt used force power.
(In reply to Mario Limonciello from comment #4) > > Can you try the upgrade so that you disable fwupd (not sure how it is done, > > Mario can help us here) and then follow instructions here: > > Basically would be these steps: > > # gcab --extract firmware.cab > # dd if=firmware.bin > of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem > # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate > > > Also please disable bolt as well so it does not do any force power things. > > CC gicmo to confirm, but I expect that bolt shouldn't be leaving anything > force powered, especially with a TBT dock hooked up. boltd itself only force-powers directly when boltd is started and it cannot find a controller (e.g. nothing is plugged in). Afterwards only on request (D-Bus API). > If it's some time after booting up nothing should have force powered. > Otherwise we'll need to check fwupd and bolt caching of force power stuff to > see if they could have re-applied it. Could you check your system journal? > I think bolt will be messaging something to the journal at this time if > fwupd or bolt used force power. It indeed should. journalctl -u bolt -b (for the current boot). Also "boltctl power -q" shows the current force power status.
Created attachment 285823 [details] bolt logs (nov 6)
(In reply to Mario Limonciello from comment #4) > OP: did you have the dock plugged in from the moment you booted up all the > way to when you ran the FW upgrade that caused the failure? Or did you > hotplug it some time after booting up? Hotplug some time after booting up > > If it's some time after booting up nothing should have force powered. > Otherwise we'll need to check fwupd and bolt caching of force power stuff to > see if they could have re-applied it. Could you check your system journal? > I think bolt will be messaging something to the journal at this time if > fwupd or bolt used force power. These are the only logs i found for boltd in the same time window: ``` Nov 06 11:13:52 xps13 kernel: thunderbolt: probe of 0000:05:00.0 failed with error -108 Nov 06 11:13:52 xps13 boltd[5033]: domain: unregistered domain removed at /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/0000:04:00.0/0> Nov 06 11:13:52 xps13 kernel: thunderbolt 0000:05:00.0: 0: power cycling the switch now Nov 06 11:13:52 xps13 boltd[5033]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots Nov 06 11:13:52 xps13 boltd[5033]: probing: started [1000] Nov 06 11:13:52 xps13 kernel: pci 0000:03:00.0: bridge window [mem 0x60000000-0xa9ffffff 64bit pref] Nov 06 11:13:52 xps13 kernel: pci 0000:03:00.0: bridge window [mem 0xac000000-0xda0fffff] ``` These are the "full" bolt log for context for the same time window: https://bugzilla.kernel.org/attachment.cgi?id=285823 This is the current output of `boltctl power -q` (i.e. Nov 7 / not during update or anything like that) ``` supported: yes power state: unset 0 active power guards ``` I did see the request to retry update without bolt/fwupd running, I'll see if i can do that later today (this is my work machine right now :)
I don't see any mention in that log of bolt touching force power (Should see this message): https://github.com/gicmo/bolt/blob/7259ecdc4b93b27a0cb319b3c8bdbd7615b44616/boltd/bolt-power.c#L1183 fwupd uses bolt to control force power, and only does it if the sysfs files are missing (https://github.com/fwupd/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L377) So it's a pretty good assumption that force power wasn't involved with this particular problem. I wonder if we should try reproducing this by specifically the jump from NVM 28 to 40? I'd have to dig around and see if I can find NVM 28 for that test.
Here's NVM 28: https://fwupd.org/downloads/2ab30ba0769a84029af4e055270444f9391c71c8-3H3DP_NVM28.00.cab
Using these steps: # systemctl stop fwupd # systemctl stop bolt ( Confirmed they're not running ) # gcab --extract firmware.cab # dd if=firmware.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate (Replaced firmware.cab/firmware.bin with actual names) With NVM 28 cab (failure 40=>28): - same result, had to reboot - Noted that after "echo 1 > ... nvm_authenticate" it said something like "write error" (in CLI) - when rebooting the NVM version still said 40 on the Thunderbolt controller With NVM 40 cab (success 40=>40): - no problem, TB came back after update, no write error, no reboot was necessary - I noticed that boltd auto-started (i guess some systemd magic there) Here's the log in reverse order: Nov 07 11:13:52 xps13 kernel: thunderbolt 0-303: Dell Dell Thunderbolt Dock Nov 07 11:13:52 xps13 kernel: thunderbolt 0-303: new device found, vendor=0xd4 device=0xb054 Nov 07 11:13:51 xps13 kernel: thunderbolt 0-3: Dell Dell Thunderbolt Cable Nov 07 11:13:51 xps13 kernel: thunderbolt 0-3: new device found, vendor=0xd4 device=0xb051 Nov 07 11:13:51 xps13 kernel: thunderbolt 0000:05:00.0: 3: drom data crc32 mismatch (expected: 0xaf438340, got: 0xaf4383c0), continuing Nov 07 11:13:51 xps13 kernel: pcieport 0000:04:01.0: BAR 13: failed to assign [io size 0x1000] Nov 07 11:13:51 xps13 kernel: pcieport 0000:04:01.0: BAR 13: no space for [io size 0x1000] [...] Nov 07 11:13:49 xps13 kernel: pci 0000:04:01.0: [8086:15d3] type 01 class 0x060400 Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: supports D1 D2 Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: enabling Extended Tags Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: [8086:15d3] type 01 class 0x060400 Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: PME# supported from D0 D1 D2 D3hot D3cold Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: supports D1 D2 Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: enabling Extended Tags Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: [8086:15d3] type 01 class 0x060400 Nov 07 11:13:41 xps13 boltd[2737]: probing: timeout, done: [2917493] (2000000) Nov 07 11:13:38 xps13 boltd[2737]: probing: removing /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 from roots Nov 07 11:13:38 xps13 kernel: pci_bus 0000:04: busn_res: [bus 04-6d] is released Nov 07 11:13:38 xps13 kernel: pci_bus 0000:3a: busn_res: [bus 3a-6d] is released Nov 07 11:13:38 xps13 kernel: pci_bus 0000:39: busn_res: [bus 39] is released Nov 07 11:13:38 xps13 kernel: pci_bus 0000:06: busn_res: [bus 06-38] is released Nov 07 11:13:38 xps13 kernel: pci_bus 0000:05: busn_res: [bus 05] is released Nov 07 11:13:38 xps13 kernel: pcieport 0000:04:02.0: Refused to change power state, currently in D3 Nov 07 11:13:38 xps13 kernel: thunderbolt: probe of 0000:05:00.0 failed with error -108 Nov 07 11:13:38 xps13 boltd[2737]: domain: unregistered domain removed at /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/0000:04:00.0/0000:05:00.0/domain0 Nov 07 11:13:38 xps13 kernel: thunderbolt 0000:05:00.0: 0: power cycling the switch now Nov 07 11:13:38 xps13 boltd[2737]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots Nov 07 11:13:38 xps13 boltd[2737]: probing: started [1000] Nov 07 11:13:38 xps13 boltd[2737]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots Nov 07 11:13:38 xps13 kernel: pci 0000:03:00.0: bridge window [mem 0x60000000-0xa9ffffff 64bit pref]
out of curiosity i re-ran NVM 40=>40 update with fwupd and boltd running, and got a kernel panic (but i dont think it was related to that, i think it was related to flashing twice without reboot maybe?) System went +- dead after that, thunderbolt controller did not come back, services wouldnt stop properly, the "echo 1 > ... nvm_authenticate" never returned - had to force reboot Logs: Nov 07 11:24:32 xps13 kernel: pcieport 0000:04:02.0: Refused to change power state, currently in D3 Nov 07 11:24:32 xps13 kernel: pcieport 0000:3b:01.0: Refused to change power state, currently in D3 Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: USB bus 3 deregistered Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: Host not accessible, reset failed. Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: Host halt failed, -19 [..] Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: xHCI host controller not responding, assume dead Nov 07 11:24:32 xps13 kernel: usb 4-1.2: USB disconnect, device number 3 Nov 07 11:24:32 xps13 kernel: usb 4-1: USB disconnect, device number 2 Nov 07 11:24:32 xps13 kernel: usb usb4: USB disconnect, device number 1 Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: remove, state 1 Nov 07 11:24:32 xps13 avahi-daemon[952]: Leaving mDNS multicast group on interface enp63s0u1u2.IPv6 with address fe80::db8:da03:a82c:88c5. Nov 07 11:24:32 xps13 avahi-daemon[952]: Interface enp63s0u1u2.IPv6 no longer relevant for mDNS. Nov 07 11:24:32 xps13 kernel: CR2: 0000305b71c95000 CR3: 000000042f922005 CR4: 00000000003606e0 Nov 07 11:24:32 xps13 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 07 11:24:32 xps13 kernel: FS: 0000000000000000(0000) GS:ffffa2ec5e480000(0000) knlGS:0000000000000000 Nov 07 11:24:32 xps13 kernel: R13: ffffa2ec5e4af700 R14: 0000000000000000 R15: 0ffffa2ec5e4af70 Nov 07 11:24:32 xps13 kernel: R10: ffffa2ec58a09a2c R11: 0000000000000018 R12: ffffa2ec5e4aa240 Nov 07 11:24:32 xps13 kernel: RBP: ffffa2ec4403e5b0 R08: 000073746e657665 R09: 8080808080808080 Nov 07 11:24:32 xps13 kernel: RDX: dead000000000100 RSI: 0000000000000080 RDI: ffffa2ec4403e5b0 Nov 07 11:24:32 xps13 kernel: RAX: dead000000000122 RBX: ffffa2ec439b56c0 RCX: 0000000000000000 Nov 07 11:24:32 xps13 kernel: RSP: 0018:ffffa5d34043be58 EFLAGS: 00010246 Nov 07 11:24:32 xps13 kernel: Code: 40 00 0f 1f 44 00 00 55 53 48 8b 47 08 48 89 fb 48 8d 68 70 48 89 ef e8 56 49 01 e4 48 8b 83 b0 00 00 00 48 8b 93 a8 00 00 00 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 a8 00 Nov 07 11:24:32 xps13 kernel: RIP: 0010:tb_cfg_request_dequeue+0x28/0x90 [thunderbolt] Nov 07 11:24:32 xps13 kernel: ---[ end trace 03e49a1cbc4072d6 ]--- Nov 07 11:24:32 xps13 kernel: crc32_pclmul ghash_clmulni_intel rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 aesni_intel aes_x86_64 crypto_simd xhci_pci cryptd glue_helper xhci_hcd rtsx_pci i8042 serio ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 Nov 07 11:24:32 xps13 kernel: snd_timer intel_uncore cfg80211 snd drm intel_rapl_perf intel_gtt rfkill input_leds psmouse pcspkr soundcore i2c_i801 agpgart libarc4 processor_thermal_device mei_me syscopyarea idma64 rtsx_pci_ms sysfillrect intel_rapl_common intel_lpss_pci intel_xhci_usb_role_switch sysimgblt memstick ucsi_acpi mei fb_sys_fops roles intel_soc_dts_iosf intel_lp> Nov 07 11:24:32 xps13 kernel: Modules linked in: hid_plantronics snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device usbhid cdc_ether usbnet r8152 mii thunderbolt overlay uvcvideo videobuf2_vmalloc videobuf2_memops snd_soc_skl videobuf2_v4l2 snd_soc_hdac_hda videobuf2_common snd_hda_ext_core videodev snd_soc_skl_ipc btusb btrtl snd_hda_codec_hdmi btbcm btintel snd_hda_co> Nov 07 11:24:32 xps13 kernel: ret_from_fork+0x35/0x40 Nov 07 11:24:32 xps13 kernel: ? kthread_park+0x80/0x80 Nov 07 11:24:32 xps13 kernel: ? process_one_work+0x3a0/0x3a0 Nov 07 11:24:32 xps13 kernel: kthread+0xfb/0x130 Nov 07 11:24:32 xps13 kernel: worker_thread+0x1fe/0x3d0 Nov 07 11:24:32 xps13 kernel: process_one_work+0x1d1/0x3a0 Nov 07 11:24:32 xps13 kernel: tb_cfg_request_work+0x2a/0x30 [thunderbolt] Nov 07 11:24:32 xps13 kernel: Call Trace: Nov 07 11:24:32 xps13 kernel: CR2: 0000305b71c95000 CR3: 000000042f922005 CR4: 00000000003606e0 Nov 07 11:24:32 xps13 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 07 11:24:32 xps13 kernel: FS: 0000000000000000(0000) GS:ffffa2ec5e480000(0000) knlGS:0000000000000000 Nov 07 11:24:32 xps13 kernel: R13: ffffa2ec5e4af700 R14: 0000000000000000 R15: 0ffffa2ec5e4af70 Nov 07 11:24:32 xps13 kernel: R10: ffffa2ec58a09a2c R11: 0000000000000018 R12: ffffa2ec5e4aa240 Nov 07 11:24:32 xps13 kernel: RBP: ffffa2ec4403e5b0 R08: 000073746e657665 R09: 8080808080808080 Nov 07 11:24:32 xps13 kernel: RDX: dead000000000100 RSI: 0000000000000080 RDI: ffffa2ec4403e5b0 Nov 07 11:24:32 xps13 kernel: RAX: dead000000000122 RBX: ffffa2ec439b56c0 RCX: 0000000000000000 Nov 07 11:24:32 xps13 kernel: RSP: 0018:ffffa5d34043be58 EFLAGS: 00010246 Nov 07 11:24:32 xps13 kernel: Code: 40 00 0f 1f 44 00 00 55 53 48 8b 47 08 48 89 fb 48 8d 68 70 48 89 ef e8 56 49 01 e4 48 8b 83 b0 00 00 00 48 8b 93 a8 00 00 00 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 a8 00 Nov 07 11:24:32 xps13 kernel: RIP: 0010:tb_cfg_request_dequeue+0x28/0x90 [thunderbolt] Nov 07 11:24:32 xps13 kernel: Workqueue: events tb_cfg_request_work [thunderbolt] Nov 07 11:24:32 xps13 kernel: Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.11.1 07/11/2019 Nov 07 11:24:32 xps13 kernel: CPU: 2 PID: 209 Comm: kworker/2:1 Tainted: G U OE 5.3.8-arch1-1 #1 Nov 07 11:24:32 xps13 kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI
>With NVM 28 cab (failure 40=>28): >- same result, had to reboot >- Noted that after "echo 1 > ... nvm_authenticate" it said something like >"write error" (in CLI) The 40->28 downgrade might have been blocked by the firmware itself (some firmware versions will prevent downgrades), which is why authentication failed. > but i dont think it was related to that, i think it was related to flashing > twice without reboot maybe? That should be a use case that works. I would say a kernel bug was encountered.
Indeed the downgrade is prevented from v40 to v28. I think that the driver misses power cycling in case we get an error immediately back from the update auth operation.
Created attachment 285831 [details] Power cycle Thunderbolt host router if authentication fails Can you try the attached patch and do both v40 -> v40, and v40 -> v28 NVM upgrades. The controller should always come back.
(In reply to Mika Westerberg from comment #14) > Created attachment 285831 [details] > Power cycle Thunderbolt host router if authentication fails > > Can you try the attached patch and do both v40 -> v40, and v40 -> v28 NVM > upgrades. The controller should always come back. Forgot to say the patch is based on my thunderbolt.git/next branch [1]. [1] https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git/log/?h=next
> Power cycle Thunderbolt host router if authentication fails Yes, I experienced the need for this when working on TBT NVM update over I2C as well, I think this patch makes sense.
Just ran kernel 5.4.0-rc6 from your branch with the batch from comment 4 I flashed firmware 40.0 twice in a row with no ill effect, everything came back. Boltd and fwupd were running at the same time. Also tested 40.0=>28.0 no write error this time (not sure why) but update did fail (its still at 40.0) and everything came back. Also no kernel panic Thanks :)
OK thanks for checking. I'll submit a proper patch soon.
Submitted the patch now: https://lore.kernel.org/lkml/20191112092452.70789-1-mika.westerberg@linux.intel.com/