Bug 205457 - Thunderbolt bus disappears after fw update (until reboot)
Summary: Thunderbolt bus disappears after fw update (until reboot)
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-06 23:22 UTC by zang
Modified: 2019-11-12 09:26 UTC (History)
4 users (show)

See Also:
Kernel Version: 5.3.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel log just before/during/after update (10.72 KB, text/plain)
2019-11-06 23:22 UTC, zang
Details
bolt logs (nov 6) (3.23 KB, text/plain)
2019-11-07 17:24 UTC, zang
Details
Power cycle Thunderbolt host router if authentication fails (647 bytes, patch)
2019-11-08 12:12 UTC, Mika Westerberg
Details | Diff

Description zang 2019-11-06 23:22:39 UTC
Created attachment 285807 [details]
kernel log just before/during/after update

Filing this bug on request from https://github.com/dell/thunderbolt-nvm-linux/issues/23 (which may have some additional context)
Note: I wasn't sure if drivers would be a better component or not here.


Overview:

When using fwupd to update the Thunderbolt firmware of my laptop, the thunderbolt interface disappears from the system, but does not re-appear after the update has completed. The display-port screen that is connected to a thunderbolt dock (which is itself connected to the machine) does turn back on, but no other device comes back. 

Steps taken:
- Thunderbolt dock is connected
- fwupdmgr -v install file-here.cab
- <update happens>

Actual Results:
Thunderbolt controller/interface is missing until complete computer restart.
fwupdmgr get-devices no longer shows the thunderbolt controller device (until restart)

Expected Results:
Have the thunderbolt interface up and functional after an update

Build Date & Hardware:
Kernel: 5.3.8
Machine: DELL XPS 13 9370 (latest EFI 0.1.11.1)
TB Dock: DELL TB16 (latest firmware from August 3.12.002)
fwupd: 1.3.3
Tested on 2019-11-06

Thunderbolt controller information:
├─Thunderbolt Controller:
│     Device ID:           c600676d09338e9fd2094171b628603b85841d99
│     Summary:             Unmatched performance for high-speed I/O
│     Current version:     40.00
│     Vendor:              Dell (TBT:0x00D4)
│     GUIDs:               4eeb9d07-a96c-56d6-92d3-4a23ee7a6e4a ← TBT-00d407e6
│                          31dedd69-1429-5224-87d4-a8fb8072a50e ← TBT-00d407e6-0000:05:00.0
│     Device Flags:        • Internal device
│                          • Updatable
│                          • Requires AC power
│                          • Device stages updates

See also attached kernel logs (just before/during/after the update occurred)
Comment 1 Mika Westerberg 2019-11-07 08:28:31 UTC
Can you try the upgrade so that you disable fwupd (not sure how it is done, Mario can help us here) and then follow instructions here:

  https://www.kernel.org/doc/html/latest/admin-guide/thunderbolt.html#upgrading-nvm-on-thunderbolt-device-or-host

If it still fails, please attach full dmesg and output of 'sudo lspci -vv'.
Comment 2 Mika Westerberg 2019-11-07 08:39:38 UTC
Also please disable bolt as well so it does not do any force power things.
Comment 3 Mika Westerberg 2019-11-07 11:12:48 UTC
I did some experiments on my end since I have both 9370 and TB16 dock. So I booted to shell (minimal userspace, no bolt or fwupd running) with the dock connected and did several times NVM upgrade from v40 -> v40 and each time the controller comes back after a while (takes ~15-20s for it to come back the second time).

BIOS version in this system is the latest (1.11.1).

So one guess is that the force power thing is somehow affecting this.
Comment 4 Mario Limonciello 2019-11-07 13:08:50 UTC
> Can you try the upgrade so that you disable fwupd (not sure how it is done,
> Mario can help us here) and then follow instructions here:

Basically would be these steps:

# gcab --extract firmware.cab
# dd if=firmware.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem
# echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate

> Also please disable bolt as well so it does not do any force power things.

CC gicmo to confirm, but I expect that bolt shouldn't be leaving anything force powered, especially with a TBT dock hooked up.

> So one guess is that the force power thing is somehow affecting this.

OP: did you have the dock plugged in from the moment you booted up all the way to when you ran the FW upgrade that caused the failure?  Or did you hotplug it some time after booting up?

If it's some time after booting up nothing should have force powered.  Otherwise we'll need to check fwupd and bolt caching of force power stuff to see if they could have re-applied it.  Could you check your system journal?  I think bolt will be messaging something to the journal at this time if fwupd or bolt used force power.
Comment 5 Christian Kellner 2019-11-07 13:21:23 UTC
(In reply to Mario Limonciello from comment #4)
> > Can you try the upgrade so that you disable fwupd (not sure how it is done,
> > Mario can help us here) and then follow instructions here:
> 
> Basically would be these steps:
> 
> # gcab --extract firmware.cab
> # dd if=firmware.bin
> of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem
> # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate
> 
> > Also please disable bolt as well so it does not do any force power things.
> 
> CC gicmo to confirm, but I expect that bolt shouldn't be leaving anything
> force powered, especially with a TBT dock hooked up.
boltd itself only force-powers directly when boltd is started and it cannot find a controller (e.g. nothing is plugged in). Afterwards only on request (D-Bus API).


> If it's some time after booting up nothing should have force powered. 
> Otherwise we'll need to check fwupd and bolt caching of force power stuff to
> see if they could have re-applied it.  Could you check your system journal? 
> I think bolt will be messaging something to the journal at this time if
> fwupd or bolt used force power.

It indeed should. journalctl -u bolt -b (for the current boot). Also "boltctl power -q" shows the current force power status.
Comment 6 zang 2019-11-07 17:24:45 UTC
Created attachment 285823 [details]
bolt logs (nov 6)
Comment 7 zang 2019-11-07 17:25:40 UTC
(In reply to Mario Limonciello from comment #4)

> OP: did you have the dock plugged in from the moment you booted up all the
> way to when you ran the FW upgrade that caused the failure?  Or did you
> hotplug it some time after booting up?

Hotplug some time after booting up

> 
> If it's some time after booting up nothing should have force powered. 
> Otherwise we'll need to check fwupd and bolt caching of force power stuff to
> see if they could have re-applied it.  Could you check your system journal? 
> I think bolt will be messaging something to the journal at this time if
> fwupd or bolt used force power.

These are the only logs i found for boltd in the same time window:


```
Nov 06 11:13:52 xps13 kernel: thunderbolt: probe of 0000:05:00.0 failed with error -108
Nov 06 11:13:52 xps13 boltd[5033]: domain: unregistered domain removed at /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/0000:04:00.0/0>
Nov 06 11:13:52 xps13 kernel: thunderbolt 0000:05:00.0: 0: power cycling the switch now
Nov 06 11:13:52 xps13 boltd[5033]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots
Nov 06 11:13:52 xps13 boltd[5033]: probing: started [1000]
Nov 06 11:13:52 xps13 kernel: pci 0000:03:00.0:   bridge window [mem 0x60000000-0xa9ffffff 64bit pref]
Nov 06 11:13:52 xps13 kernel: pci 0000:03:00.0:   bridge window [mem 0xac000000-0xda0fffff]
```

These are the "full" bolt log for context for the same time window: https://bugzilla.kernel.org/attachment.cgi?id=285823


This is the current output of `boltctl power -q` (i.e. Nov 7 / not during update or anything like that)


```
supported: yes
power state: unset
0 active power guards
```

I did see the request to retry update without bolt/fwupd running, I'll see if i can do that later today (this is my work machine right now :)
Comment 8 Mario Limonciello 2019-11-07 17:40:34 UTC
I don't see any mention in that log of bolt touching force power (Should see this message):
https://github.com/gicmo/bolt/blob/7259ecdc4b93b27a0cb319b3c8bdbd7615b44616/boltd/bolt-power.c#L1183

fwupd uses bolt to control force power, and only does it if the sysfs files are missing (https://github.com/fwupd/fwupd/blob/master/plugins/thunderbolt-power/fu-plugin-thunderbolt-power.c#L377)
So it's a pretty good assumption that force power wasn't involved with this particular problem.

I wonder if we should try reproducing this by specifically the jump from NVM 28 to 40?  I'd have to dig around and see if I can find NVM 28 for that test.
Comment 10 zang 2019-11-07 19:22:42 UTC
Using these steps:
# systemctl stop fwupd
# systemctl stop bolt

( Confirmed they're not running )

# gcab --extract firmware.cab
# dd if=firmware.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem
# echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate

(Replaced firmware.cab/firmware.bin with actual names)

With NVM 28 cab (failure 40=>28):
- same result, had to reboot
- Noted that after "echo 1 > ... nvm_authenticate" it said something like "write error" (in CLI)
- when rebooting the NVM version still said 40 on the Thunderbolt controller

With NVM 40 cab (success 40=>40):
- no problem, TB came back after update, no write error, no reboot was necessary
- I noticed that boltd auto-started (i guess some systemd magic there)

Here's the log in reverse order:

Nov 07 11:13:52 xps13 kernel: thunderbolt 0-303: Dell Dell Thunderbolt Dock
Nov 07 11:13:52 xps13 kernel: thunderbolt 0-303: new device found, vendor=0xd4 device=0xb054
Nov 07 11:13:51 xps13 kernel: thunderbolt 0-3: Dell Dell Thunderbolt Cable
Nov 07 11:13:51 xps13 kernel: thunderbolt 0-3: new device found, vendor=0xd4 device=0xb051
Nov 07 11:13:51 xps13 kernel: thunderbolt 0000:05:00.0: 3: drom data crc32 mismatch (expected: 0xaf438340, got: 0xaf4383c0), continuing
Nov 07 11:13:51 xps13 kernel: pcieport 0000:04:01.0: BAR 13: failed to assign [io  size 0x1000]
Nov 07 11:13:51 xps13 kernel: pcieport 0000:04:01.0: BAR 13: no space for [io  size 0x1000]

[...]

Nov 07 11:13:49 xps13 kernel: pci 0000:04:01.0: [8086:15d3] type 01 class 0x060400
Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: PME# supported from D0 D1 D2 D3hot D3cold
Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: supports D1 D2
Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: enabling Extended Tags
Nov 07 11:13:49 xps13 kernel: pci 0000:04:00.0: [8086:15d3] type 01 class 0x060400
Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: PME# supported from D0 D1 D2 D3hot D3cold
Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: supports D1 D2
Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: enabling Extended Tags
Nov 07 11:13:49 xps13 kernel: pci 0000:03:00.0: [8086:15d3] type 01 class 0x060400
Nov 07 11:13:41 xps13 boltd[2737]: probing: timeout, done: [2917493] (2000000)
Nov 07 11:13:38 xps13 boltd[2737]: probing: removing /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 from roots
Nov 07 11:13:38 xps13 kernel: pci_bus 0000:04: busn_res: [bus 04-6d] is released
Nov 07 11:13:38 xps13 kernel: pci_bus 0000:3a: busn_res: [bus 3a-6d] is released
Nov 07 11:13:38 xps13 kernel: pci_bus 0000:39: busn_res: [bus 39] is released
Nov 07 11:13:38 xps13 kernel: pci_bus 0000:06: busn_res: [bus 06-38] is released
Nov 07 11:13:38 xps13 kernel: pci_bus 0000:05: busn_res: [bus 05] is released
Nov 07 11:13:38 xps13 kernel: pcieport 0000:04:02.0: Refused to change power state, currently in D3
Nov 07 11:13:38 xps13 kernel: thunderbolt: probe of 0000:05:00.0 failed with error -108
Nov 07 11:13:38 xps13 boltd[2737]: domain: unregistered domain removed at /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0/0000:04:00.0/0000:05:00.0/domain0
Nov 07 11:13:38 xps13 kernel: thunderbolt 0000:05:00.0: 0: power cycling the switch now
Nov 07 11:13:38 xps13 boltd[2737]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots
Nov 07 11:13:38 xps13 boltd[2737]: probing: started [1000]
Nov 07 11:13:38 xps13 boltd[2737]: probing: adding /sys/devices/pci0000:00/0000:00:1c.4/0000:03:00.0 to roots
Nov 07 11:13:38 xps13 kernel: pci 0000:03:00.0:   bridge window [mem 0x60000000-0xa9ffffff 64bit pref]
Comment 11 zang 2019-11-07 19:33:30 UTC
out of curiosity i re-ran NVM 40=>40 update with fwupd and boltd running, and got a kernel panic (but i dont think it was related to that, i think it was related to flashing twice without reboot maybe?)
System went +- dead after that, thunderbolt controller did not come back, services wouldnt stop properly, the "echo 1 > ... nvm_authenticate" never returned - had to force reboot

Logs:

Nov 07 11:24:32 xps13 kernel: pcieport 0000:04:02.0: Refused to change power state, currently in D3
Nov 07 11:24:32 xps13 kernel: pcieport 0000:3b:01.0: Refused to change power state, currently in D3
Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: USB bus 3 deregistered
Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: Host not accessible, reset failed.
Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: Host halt failed, -19
[..]
Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: xHCI host controller not responding, assume dead
Nov 07 11:24:32 xps13 kernel: usb 4-1.2: USB disconnect, device number 3
Nov 07 11:24:32 xps13 kernel: usb 4-1: USB disconnect, device number 2
Nov 07 11:24:32 xps13 kernel: usb usb4: USB disconnect, device number 1
Nov 07 11:24:32 xps13 kernel: xhci_hcd 0000:3f:00.0: remove, state 1
Nov 07 11:24:32 xps13 avahi-daemon[952]: Leaving mDNS multicast group on interface enp63s0u1u2.IPv6 with address fe80::db8:da03:a82c:88c5.
Nov 07 11:24:32 xps13 avahi-daemon[952]: Interface enp63s0u1u2.IPv6 no longer relevant for mDNS.
Nov 07 11:24:32 xps13 kernel: CR2: 0000305b71c95000 CR3: 000000042f922005 CR4: 00000000003606e0
Nov 07 11:24:32 xps13 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 07 11:24:32 xps13 kernel: FS:  0000000000000000(0000) GS:ffffa2ec5e480000(0000) knlGS:0000000000000000
Nov 07 11:24:32 xps13 kernel: R13: ffffa2ec5e4af700 R14: 0000000000000000 R15: 0ffffa2ec5e4af70
Nov 07 11:24:32 xps13 kernel: R10: ffffa2ec58a09a2c R11: 0000000000000018 R12: ffffa2ec5e4aa240
Nov 07 11:24:32 xps13 kernel: RBP: ffffa2ec4403e5b0 R08: 000073746e657665 R09: 8080808080808080
Nov 07 11:24:32 xps13 kernel: RDX: dead000000000100 RSI: 0000000000000080 RDI: ffffa2ec4403e5b0
Nov 07 11:24:32 xps13 kernel: RAX: dead000000000122 RBX: ffffa2ec439b56c0 RCX: 0000000000000000
Nov 07 11:24:32 xps13 kernel: RSP: 0018:ffffa5d34043be58 EFLAGS: 00010246
Nov 07 11:24:32 xps13 kernel: Code: 40 00 0f 1f 44 00 00 55 53 48 8b 47 08 48 89 fb 48 8d 68 70 48 89 ef e8 56 49 01 e4 48 8b 83 b0 00 00 00 48 8b 93 a8 00 00 00 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 a8 00
Nov 07 11:24:32 xps13 kernel: RIP: 0010:tb_cfg_request_dequeue+0x28/0x90 [thunderbolt]
Nov 07 11:24:32 xps13 kernel: ---[ end trace 03e49a1cbc4072d6 ]---
Nov 07 11:24:32 xps13 kernel:  crc32_pclmul ghash_clmulni_intel rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 aesni_intel aes_x86_64 crypto_simd xhci_pci cryptd glue_helper xhci_hcd rtsx_pci i8042 serio ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2
Nov 07 11:24:32 xps13 kernel:  snd_timer intel_uncore cfg80211 snd drm intel_rapl_perf intel_gtt rfkill input_leds psmouse pcspkr soundcore i2c_i801 agpgart libarc4 processor_thermal_device mei_me syscopyarea idma64 rtsx_pci_ms sysfillrect intel_rapl_common intel_lpss_pci intel_xhci_usb_role_switch sysimgblt memstick ucsi_acpi mei fb_sys_fops roles intel_soc_dts_iosf intel_lp>
Nov 07 11:24:32 xps13 kernel: Modules linked in: hid_plantronics snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device usbhid cdc_ether usbnet r8152 mii thunderbolt overlay uvcvideo videobuf2_vmalloc videobuf2_memops snd_soc_skl videobuf2_v4l2 snd_soc_hdac_hda videobuf2_common snd_hda_ext_core videodev snd_soc_skl_ipc btusb btrtl snd_hda_codec_hdmi btbcm btintel snd_hda_co>
Nov 07 11:24:32 xps13 kernel:  ret_from_fork+0x35/0x40
Nov 07 11:24:32 xps13 kernel:  ? kthread_park+0x80/0x80
Nov 07 11:24:32 xps13 kernel:  ? process_one_work+0x3a0/0x3a0
Nov 07 11:24:32 xps13 kernel:  kthread+0xfb/0x130
Nov 07 11:24:32 xps13 kernel:  worker_thread+0x1fe/0x3d0
Nov 07 11:24:32 xps13 kernel:  process_one_work+0x1d1/0x3a0
Nov 07 11:24:32 xps13 kernel:  tb_cfg_request_work+0x2a/0x30 [thunderbolt]
Nov 07 11:24:32 xps13 kernel: Call Trace:
Nov 07 11:24:32 xps13 kernel: CR2: 0000305b71c95000 CR3: 000000042f922005 CR4: 00000000003606e0
Nov 07 11:24:32 xps13 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 07 11:24:32 xps13 kernel: FS:  0000000000000000(0000) GS:ffffa2ec5e480000(0000) knlGS:0000000000000000
Nov 07 11:24:32 xps13 kernel: R13: ffffa2ec5e4af700 R14: 0000000000000000 R15: 0ffffa2ec5e4af70
Nov 07 11:24:32 xps13 kernel: R10: ffffa2ec58a09a2c R11: 0000000000000018 R12: ffffa2ec5e4aa240
Nov 07 11:24:32 xps13 kernel: RBP: ffffa2ec4403e5b0 R08: 000073746e657665 R09: 8080808080808080
Nov 07 11:24:32 xps13 kernel: RDX: dead000000000100 RSI: 0000000000000080 RDI: ffffa2ec4403e5b0
Nov 07 11:24:32 xps13 kernel: RAX: dead000000000122 RBX: ffffa2ec439b56c0 RCX: 0000000000000000
Nov 07 11:24:32 xps13 kernel: RSP: 0018:ffffa5d34043be58 EFLAGS: 00010246
Nov 07 11:24:32 xps13 kernel: Code: 40 00 0f 1f 44 00 00 55 53 48 8b 47 08 48 89 fb 48 8d 68 70 48 89 ef e8 56 49 01 e4 48 8b 83 b0 00 00 00 48 8b 93 a8 00 00 00 <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 48 89 83 a8 00
Nov 07 11:24:32 xps13 kernel: RIP: 0010:tb_cfg_request_dequeue+0x28/0x90 [thunderbolt]
Nov 07 11:24:32 xps13 kernel: Workqueue: events tb_cfg_request_work [thunderbolt]
Nov 07 11:24:32 xps13 kernel: Hardware name: Dell Inc. XPS 13 9370/0F6P3V, BIOS 1.11.1 07/11/2019
Nov 07 11:24:32 xps13 kernel: CPU: 2 PID: 209 Comm: kworker/2:1 Tainted: G     U     OE     5.3.8-arch1-1 #1
Nov 07 11:24:32 xps13 kernel: general protection fault: 0000 [#1] PREEMPT SMP PTI
Comment 12 Mario Limonciello 2019-11-07 21:47:23 UTC
>With NVM 28 cab (failure 40=>28):
>- same result, had to reboot
>- Noted that after "echo 1 > ... nvm_authenticate" it said something like
>"write error" (in CLI)

The 40->28 downgrade might have been blocked by the firmware itself (some firmware versions will prevent downgrades), which is why authentication failed.

> but i dont think it was related to that, i think it was related to flashing
> twice without reboot maybe?

That should be a use case that works.  I would say a kernel bug was encountered.
Comment 13 Mika Westerberg 2019-11-08 12:09:48 UTC
Indeed the downgrade is prevented from v40 to v28. I think that the driver misses power cycling in case we get an error immediately back from the update auth operation.
Comment 14 Mika Westerberg 2019-11-08 12:12:22 UTC
Created attachment 285831 [details]
Power cycle Thunderbolt host router if authentication fails

Can you try the attached patch and do both v40 -> v40, and v40 -> v28 NVM upgrades. The controller should always come back.
Comment 15 Mika Westerberg 2019-11-08 12:13:07 UTC
(In reply to Mika Westerberg from comment #14)
> Created attachment 285831 [details]
> Power cycle Thunderbolt host router if authentication fails
> 
> Can you try the attached patch and do both v40 -> v40, and v40 -> v28 NVM
> upgrades. The controller should always come back.

Forgot to say the patch is based on my thunderbolt.git/next branch [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt.git/log/?h=next
Comment 16 Mario Limonciello 2019-11-08 15:34:28 UTC
> Power cycle Thunderbolt host router if authentication fails

Yes, I experienced the need for this when working on TBT NVM update over I2C as well, I think this patch makes sense.
Comment 17 zang 2019-11-08 19:37:03 UTC
Just ran kernel 5.4.0-rc6 from your branch with the batch from comment 4
I flashed firmware 40.0 twice in a row with no ill effect, everything came back. Boltd and fwupd were running at the same time.

Also tested 40.0=>28.0 no write error this time (not sure why) but update did fail (its still at 40.0) and everything came back.

Also no kernel panic

Thanks :)
Comment 18 Mika Westerberg 2019-11-11 11:08:15 UTC
OK thanks for checking. I'll submit a proper patch soon.
Comment 19 Mika Westerberg 2019-11-12 09:26:22 UTC
Submitted the patch now:

  https://lore.kernel.org/lkml/20191112092452.70789-1-mika.westerberg@linux.intel.com/

Note You need to log in before you can comment on or make changes to this bug.