Hi.

I'm running a Debian Bookworm host with the Xanmod 6.9.1 kernel.

This motherboard: https://www.supermicro.com/en/products/motherboard/a2sdi-16c-tp8f

With this USB controller in the 4x PCIe slot: https://www.startech.com/en-gb/cards-adapters/pexusb3s2ei

The USB card is based on the Renesas uPD720201 USB 3.0 Host Controller and reports the latest firmware.

I have a Debian Bookworm VM running on this host, which I intend to pass the entire PCIe card through to (Gnome+Plexamp->USB-SPDIF).

If I configure the VM to do this, the VM fails to start and I get the following errors from the kernel. The card then becomes seemingly unrecoverable without at least a warm reboot.

I have tried many kernel and BIOS options regarding PCIe but nothing has helped so far. I'll attach a boot log.

This is the error when I start the VM:

May 19 09:24:46 kryten kernel: VFIO - User Level meta-driver version: 0.3
May 19 09:24:46 kryten kernel: xhci_hcd 0000:02:00.0: remove, state 1
May 19 09:24:46 kryten kernel: usb usb4: USB disconnect, device number 1
May 19 09:24:46 kryten kernel: xhci_hcd 0000:02:00.0: USB bus 4 deregistered
May 19 09:24:46 kryten kernel: xhci_hcd 0000:02:00.0: remove, state 1
May 19 09:24:46 kryten kernel: usb usb3: USB disconnect, device number 1
May 19 09:24:46 kryten kernel: usb 3-4: USB disconnect, device number 2
May 19 09:24:46 kryten kernel: xhci_hcd 0000:02:00.0: USB bus 3 deregistered
May 19 09:24:47 kryten kernel: usb 1-1.2: USB disconnect, device number 4
May 19 09:24:53 kryten kernel: pcieport 0000:00:09.0: broken device, retraining non-functional downstream link at 2.5GT/s
May 19 09:24:54 kryten kernel: pcieport 0000:00:09.0: retraining failed
May 19 09:24:55 kryten kernel: pcieport 0000:00:09.0: broken device, retraining non-functional downstream link at 2.5GT/s
May 19 09:24:56 kryten kernel: pcieport 0000:00:09.0: retraining failed
May 19 09:24:56 kryten kernel: vfio-pci 0000:02:00.0: not ready 1023ms after bus reset; waiting
May 19 09:24:57 kryten kernel: vfio-pci 0000:02:00.0: not ready 2047ms after bus reset; waiting
May 19 09:24:59 kryten kernel: vfio-pci 0000:02:00.0: not ready 4095ms after bus reset; waiting
May 19 09:25:04 kryten kernel: vfio-pci 0000:02:00.0: not ready 8191ms after bus reset; waiting
May 19 09:25:12 kryten kernel: vfio-pci 0000:02:00.0: not ready 16383ms after bus reset; waiting
May 19 09:25:29 kryten kernel: vfio-pci 0000:02:00.0: not ready 32767ms after bus reset; waiting
May 19 09:26:05 kryten kernel: pcieport 0000:00:09.0: broken device, retraining non-functional downstream link at 2.5GT/s
May 19 09:26:06 kryten kernel: pcieport 0000:00:09.0: retraining failed
May 19 09:26:08 kryten kernel: pcieport 0000:00:09.0: broken device, retraining non-functional downstream link at 2.5GT/s
May 19 09:26:09 kryten kernel: pcieport 0000:00:09.0: retraining failed
May 19 09:26:09 kryten kernel: vfio-pci 0000:02:00.0: not ready 1023ms after bus reset; waiting
May 19 09:26:10 kryten kernel: vfio-pci 0000:02:00.0: not ready 2047ms after bus reset; waiting
May 19 09:26:12 kryten kernel: vfio-pci 0000:02:00.0: not ready 4095ms after bus reset; waiting
May 19 09:26:16 kryten kernel: vfio-pci 0000:02:00.0: not ready 8191ms after bus reset; waiting
May 19 09:26:25 kryten kernel: vfio-pci 0000:02:00.0: not ready 16383ms after bus reset; waiting
May 19 09:26:43 kryten kernel: vfio-pci 0000:02:00.0: not ready 32767ms after bus reset; waiting
May 19 09:27:18 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D0 to D3hot, device inaccessible
May 19 09:27:19 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: Invalid ROM..
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: xHCI Host Controller
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 3
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: Host halt failed, -19
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: can't setup: -19
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: USB bus 3 deregistered
May 19 09:27:19 kryten kernel: xhci_hcd 0000:02:00.0: init 0000:02:00.0 fail, -19

Thanks for your time and help.
Created attachment 306324 [details] Boot log and what happens when I try to start vm with VFIO PCIe passthrough
This could be a power management issue:

vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible

Could you add the outputs of these commands?

# detailed PCI device info
sudo lspci -vvnnk

# power reset methods of PCI devices
find /sys/devices/pci0000\:00/ -name reset_method | while read -r f; do echo -e "$f = $(cat $f)"; done
Created attachment 306326 [details] sudo lspci -vvnnk
Created attachment 306327 [details] power reset methods of PCI devices

find /sys/devices/pci0000\:00/ -name reset_method | while read -r f; do echo -e "$f = $(cat $f)"; done
In case it's useful, I have tried with pcie_aspm=off but it didn't seem to fix it, same behaviour.
Thanks for the logs.

An overview of how the devices are connected: the PCIe Root Port at address 00:09.0 (Bus:Device.Function) is the 'parent' of the USB host controller at 02:00.0.

The issue here appears to be that when the USB host controller is removed it may actually go into the D3cold state. This actually removes power and, currently, the Linux kernel has no mechanism to control power on the PCI bus [0].

There are three possible work-arounds I can think of worth testing:

1. Remove 00:09.0 and then rescan its parent root complex, since that *may* trigger power to be restored to the port (use the script I shared with you on IRC via termbin).

2. Unbind [1] the xhci_hcd driver from the device *before* trying to start the VM or loading vfio-pci (this could be scripted) so the device remains powered:

# echo 0000:02:00.0 > /sys/bus/pci/drivers/xhci_hcd/unbind

3. Avoid this altogether, if the USB host controller is only ever wanted for use in the guest virtual machine, by reserving the device on the kernel command line so the host's xHCI controller driver never claims it; something like:

vfio-pci.ids=1912:0014

But this would require ensuring that vfio-pci is loaded *very* early in the initrd processing to take control of the USB host controller before xhci_hcd gets to it! I don't think there is an easy way to ensure the module load order for that, aside from a custom udev rule that loads vfio-pci for this device ID.

[0] https://www.kernel.org/doc/html/latest/power/pci.html#native-pci-power-management
[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-bus-pci
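
Work-around 2 could be scripted roughly as follows. This is only a minimal sketch, not the script shared on IRC: the function name unbind_pci_device and the SYSFS_ROOT override (useful for a dry run against a scratch directory) are hypothetical, and it assumes the device's sysfs directory has the usual "driver" link whose target exposes an "unbind" file.

```shell
#!/bin/sh
# Sketch: detach a PCI device from whatever host driver currently owns it.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# unbind_pci_device BDF
# Writes the device address to the bound driver's "unbind" file.
# Returns 0 if an unbind was requested or no driver was bound, 1 on error.
unbind_pci_device() {
    bdf="$1"
    dev="$SYSFS_ROOT/bus/pci/devices/$bdf"

    if [ ! -e "$dev" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi

    # The "driver" link only exists while a driver is bound.
    if [ ! -e "$dev/driver" ]; then
        echo "$bdf: no driver bound"
        return 0
    fi

    printf '%s\n' "$bdf" > "$dev/driver/unbind" || return 1
    echo "$bdf: unbind requested"
}
```

Run as root before starting the VM, for example: unbind_pci_device 0000:02:00.0. Afterwards lspci -nnk -s 02:00.0 should show no "Kernel driver in use" line if the unbind took effect.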
I've done some basic analysis and testing here to develop a udev rule. This looks like it ought to do the job.

# this is /etc/udev/rules.d/00-vfiio-pci.rules
SUBSYSTEM=="pci", ATTR{endor}=="1912", ATTR{device}=="0014", RUN+="modprobe vfio-pci ids=1912:0014"

It needs testing, so what I'd suggest is:

1. Use the unbind method in (2) in my previous comment to detach the xhci_hcd driver and check there is no "Kernel driver in use" with:

$ lspci -nnk -d 1912:0014

2. Tell the kernel to replay events to test if the rule reacts as expected:

# udevadm trigger --type=subsystems --subsystem-match=pci

3. Check if VFIO is bound to the device ("Kernel driver in use") with:

$ lspci -nnk -d 1912:0014

If that works then the module and rule need adding to the initrd.img with:

# echo "vfio-pci" >> /etc/initramfs-tools/modules
# update-initramfs -u

and then do a reboot test when convenient.
Argh! Noticed typos in the rule name and the rule!

# this is /etc/udev/rules.d/00-vfio-pci.rules
SUBSYSTEM=="pci", ATTR{vendor}=="1912", ATTR{device}=="0014", RUN+="modprobe vfio-pci ids=1912:0014"
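
Before relying on the rule, it may be worth confirming the exact strings sysfs reports for this card, since udev's ATTR{} keys are compared against the contents of the sysfs attribute. As far as I know, sysfs prints the PCI vendor and device IDs with a 0x prefix (for example 0x1912), in which case the match values in the rule would need the same prefix. A minimal sketch for checking; print_pci_ids and SYSFS_ROOT are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: print the vendor and device ID strings exactly as sysfs exposes them.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# print_pci_ids BDF
# Prints "vendor=<string> device=<string>" for the given device address.
print_pci_ids() {
    dev="$SYSFS_ROOT/bus/pci/devices/$1"

    if [ ! -r "$dev/vendor" ] || [ ! -r "$dev/device" ]; then
        echo "cannot read IDs for $1" >&2
        return 1
    fi

    printf 'vendor=%s device=%s\n' "$(cat "$dev/vendor")" "$(cat "$dev/device")"
}
```

For example: print_pci_ids 0000:02:00.0. The same strings can also be seen with udevadm info --attribute-walk on the device's sysfs path.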
(In reply to TJ from comment #2)
> This could be a power management issue:
>
> vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0,
> device inaccessible

I wouldn't rule out a power management issue, but I think the device has already disappeared by the time we're seeing these error logs. We're hitting the timeouts waiting for the device after bus reset, and we trigger quirks that try to retrain the link at a reduced rate. Unfortunately this device only supports bus reset.

Is it possible to test assigning another single-function device installed in this slot? We'd want to make sure that the reset_method is still "bus", which can be selected via the same sysfs file if the other device supports more reset mechanisms.

If it is a power management issue, we can also restrict vfio-pci's use of power management by passing the disable_idle_d3=1 module option when loading the vfio-pci driver. If that works, we might want to quirk the ID for this old NEC controller to avoid D3 states.
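
The reset_method check and selection described above could be sketched as a pair of small shell helpers. This assumes a kernel that exposes the per-device reset_method attribute in sysfs (the file read earlier in this thread); the function names and the SYSFS_ROOT override are hypothetical, introduced only for illustration.

```shell
#!/bin/sh
# Sketch: inspect and restrict the reset methods the kernel may use for a device.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# show_reset_method BDF
# Prints the currently enabled reset methods for the device.
show_reset_method() {
    cat "$SYSFS_ROOT/bus/pci/devices/$1/reset_method"
}

# set_reset_method BDF METHOD
# Writes METHOD (for example "bus") to the device's reset_method attribute.
set_reset_method() {
    f="$SYSFS_ROOT/bus/pci/devices/$1/reset_method"

    if [ ! -w "$f" ]; then
        echo "reset_method not writable for $1" >&2
        return 1
    fi

    printf '%s\n' "$2" > "$f"
}
```

For a replacement card that offers several methods, something like set_reset_method 0000:02:00.0 bus followed by show_reset_method 0000:02:00.0 (as root) would confirm that only the bus reset is in play before starting the VM.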
I tried adding the module option but I still get the same behaviour.

cat /etc/modprobe.d/vfio.conf
options vfio-pci disable_idle_d3=1

(reboot)

cat /proc/modules | cut -f 1 -d " " | while read module; do echo "Module: $module"; if [ -d "/sys/module/$module/parameters" ]; then ls /sys/module/$module/parameters/ | while read parameter; do echo -n "Parameter: $parameter --> "; cat /sys/module/$module/parameters/$parameter; done; fi; echo; done | grep -A 5 vfio
[snip]
Parameter: disable_idle_d3 --> Y
[snip]

When I try launching the VM with the PCI device passed through, I get the same retrain failure.

I have this card to try next: https://www.amazon.co.uk/dp/B087G7T234

Sorry for being a little slow - I have disabilities that can put a stop to my activities regardless of what my brain thinks about it.

Thanks for all the help.
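
A shorter way to check a single module parameter is to read it straight from /sys/module. A minimal sketch, assuming the module is loaded and exposes its parameters there; module_param and SYSFS_ROOT are hypothetical names, and the hyphen-to-underscore conversion reflects how module directories are named under /sys/module.

```shell
#!/bin/sh
# Sketch: print the current value of one parameter of a loaded kernel module.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# module_param MODULE PARAMETER
# Accepts the module name with hyphens or underscores (vfio-pci or vfio_pci).
module_param() {
    # Directories under /sys/module use underscores in module names.
    mod=$(printf '%s' "$1" | sed 's/-/_/g')
    f="$SYSFS_ROOT/module/$mod/parameters/$2"

    if [ ! -r "$f" ]; then
        echo "parameter $2 not found for module $1" >&2
        return 1
    fi

    cat "$f"
}
```

For example: module_param vfio-pci disable_idle_d3, which should agree with the value shown in the loop output above.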
Just tried with the new card:

02:00.0 USB controller: ASMedia Technology Inc. ASM2142/ASM3142 USB 3.1 Host Controller

I added it to the VM in virt-manager and I get the same error when I launch it.

[snip]
kernel: pcieport 0000:00:09.0: broken device, retraining non-functional downstream link at 2.5GT/s
[snip]
vfio-pci 0000:02:00.0: not ready 4095ms after bus reset; waiting

Thanks.