Bug 202055
Summary: | Failed to PCI passthrough SSD with SMI SM2262 controller. | |
---|---|---|---|
Product: | Virtualization | Reporter: | Alex (coffmaker)
Component: | kvm | Assignee: | virtualization_kvm
Status: | NEW | |
Severity: | normal | CC: | alex.williamson, coffmaker, dongli.zhang, Felix.leclair123, maximlevitsky, nicholas.pomee, plantroon, tomm
Priority: | P1 | |
Hardware: | Other | |
OS: | Linux | |
Kernel Version: | 4.19.12-arch1-1-ARCH | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: |
- Prefer secondary bus reset over FLR
- Prefer secondary bus reset over FLR
- trace
- Test patch, NVMe shutdown + delay to avoid ACS violation
- linux config
- NVMe subsystem reset with ACS masking
- Debug patch
- trace
Description
Alex
2018-12-24 19:34:46 UTC
There's been another report[1] that this device presents an invalid MSI-X capability in which the vector table and PBA overlap. The user there reports:

    Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00002100

Each vector table entry is 16 bytes, so a 22-entry vector table based at 0x2000 would extend to at least 0x2160, yet the PBA is claimed to start at 0x2100. We have different results here:

    Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
        Vector table: BAR=0 offset=00002000
        PBA: BAR=0 offset=00002100

This capability appears sane and should pass QEMU's sanity test, but clearly it did not, so did the capability report the same values when vfio read it? Note that in the first case MSI-X is not enabled, while in the latter case it is enabled and the device is bound to the nvme driver. Perhaps this suggests there are states where this device reports a valid MSI-X capability and states where it does not. I would suggest:

a) Unbind the device from the nvme driver, bind it to vfio-pci, look at lspci in the host, and see whether the Count value in the MSI-X capability has changed.

b) If the device still reports Count=16 after the steps in a), continue from that point by resetting the device via pci-sysfs (e.g. echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset). Look again at lspci in the host to see if the Count value has changed. Thanks

[1] https://patchwork.kernel.org/patch/10707761/

A wild guess, based on the suspicious number of MSI IRQs in both cases: the device reports the same number of MSI-X vectors as the number of I/O queues configured (via the 'Number of Queues' set-features command). So the NVMe driver enables the device, sends the number-of-queues set-features command, and MSI-X starts 'working'. Do you happen to have 16 logical CPUs?

I rebound the device; the full lspci output is below. It reports Count=16.
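(Aside, before the full lspci dump below: the overlap check described above reduces to a simple bounds test. The following is a minimal illustrative sketch in C, not QEMU's actual code; it assumes the standard PBA sizing of one pending bit per vector, rounded up to 8-byte units.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Each MSI-X table entry is 16 bytes; the PBA holds one pending bit per
 * vector, padded to 8-byte units.  The two regions must not overlap. */
static bool msix_layout_sane(uint32_t table_off, uint32_t pba_off, unsigned count)
{
	uint32_t table_end = table_off + 16u * count;
	uint32_t pba_end = pba_off + ((count + 63u) / 64u) * 8u;

	return table_end <= pba_off || pba_end <= table_off;
}

int main(void)
{
	/* Values from the two lspci captures quoted above */
	printf("Count=22: %s\n", msix_layout_sane(0x2000, 0x2100, 22) ? "sane" : "overlap");
	printf("Count=16: %s\n", msix_layout_sane(0x2000, 0x2100, 16) ? "sane" : "overlap");
	return 0;
}
```

With Count=22 the table ends at 0x2160 and collides with the PBA at 0x2100; with Count=16 it ends exactly at 0x2100 and the layout is legal.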
``` root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b] Physical Slot: 2-1 Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 42 NUMA node: 0 Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+ Address: 0000000000000000 Data: 0000 Masking: 00000000 Pending: 00000000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled AtomicOpsCtl: ReqEn- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [158 v1] Secondary PCI Express <?> Capabilities: [178 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [180 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=10us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Kernel driver in use: 
vfio-pci ``` Resetted with vfio-pci in use. This time it reports Count=22. ``` root@localhost /home/alex # echo 1 | tee /sys/bus/pci/devices/0000:06:00.0/reset 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b] Physical Slot: 2-1 Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 42 NUMA node: 0 Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/16 Maskable+ 64bit+ Address: 0000000000000000 Data: 0000 Masking: 00000000 Pending: 00000000 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled AtomicOpsCtl: ReqEn- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable- Count=22 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [158 v1] Secondary PCI Express <?> Capabilities: [178 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [180 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=10us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Kernel driver in use: vfio-pci ``` I am 
owner of 12 thread cpu. ``` alex@localhost ~ % lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 63 Model name: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Stepping: 2 CPU MHz: 1499.984 CPU max MHz: 3200.0000 CPU min MHz: 1200.0000 BogoMIPS: 8003.18 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 15360K NUMA node0 CPU(s): 0-11 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts flush_l1d ``` (In reply to Alex from comment #4) > I rebound device and there is a lspci output. It reports Count=16. > > ``` > root@localhost /home/alex # echo 0000:06:00.0 | tee > /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind > root@localhost /home/alex # echo 8086 f1a6 | tee > /sys/bus/pci/drivers/vfio-pci/new_id > 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro > 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) > Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- >Stepping- SERR+ FastB2B- DisINTx- > Capabilities: [b0] MSI-X: Enable- Count=16 Masked- > Vector table: BAR=0 offset=00002000 > PBA: BAR=0 offset=00002100 > ``` > > Resetted with vfio-pci in use. This time it reports Count=22. > ``` > root@localhost /home/alex # echo 1 | tee > /sys/bus/pci/devices/0000:06:00.0/reset > 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro > 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) > Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- >Stepping- SERR+ FastB2B- DisINTx- > Capabilities: [b0] MSI-X: Enable- Count=22 Masked- > Vector table: BAR=0 offset=00002000 > PBA: BAR=0 offset=00002100 Ok, interesting. So likely when QEMU is analyzing the device it's seeing this 22 value which is why it throws an error at the sanity test. With the nvme driver bound, we seem to get a sane number of MSI-X entries, though it still confuses me how the reporter in [1] claimed their system was making use of 17 vectors, which would mean that Count=16 is still bogus. In any case, let's see if we can figure out what we can poke on the device to make these fields within the register change. Start with the device in the state you have it above where it reports Count=22. First let's test if the vector table size is really read-only: # setpci -s 06:00.0 CAP_MSIX+2.w This should return 0016 as 0x16 is 22. Try to write it: # setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff And read it back again: # setpci -s 06:00.0 CAP_MSIX+2.w Did the value change? 
Next, we already have memory enabled on the device, but the nvme driver also enables bus master before enabling interrupts, so let's check if setting bus master triggers a change in the MSI-X capability: # setpci -s 06:00.0 COMMAND This should report 0102 based on the lspci output, to enable bus master: # setpci -s 06:00.0 COMMAND=4:4 Does the Count value in the MSI-X capability change? (To return it back to the previous state: setpci -s 06:00.0 COMMAND=0:4) I'd really hope that one of the above helps to indicate a next step, but we could also try enabling MSI-X (with it masked), so we could try: # setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000 This should cause lspci to report Enable+ and Masked+, but does the Count value change? (To return it back to the previous state: setpci -s 06:00.0 CAP_MSIX+2.w=0:c000) root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w 0015 root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w 0015 root@localhost /home/alex # setpci -s 06:00.0 COMMAND 0400 root@localhost /home/alex # setpci -s 06:00.0 COMMAND=4:4 After this lspci reports "MSI-X: Enable- Count=22 Masked-" root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000 lspci reports "MSI-X: Enable+ Count=22 Masked+" Did previous step today again. Got a little different results. root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w 0015 root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w 0015 root@localhost /home/alex # setpci -s 06:00.0 COMMAND 0102 root@localhost /home/alex # setpci -s 06:00.0 COMMAND=4:4 root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000 Tt every step a got Count=22 from lspci. If you have the device in a state where it reports Count=22 and bind it back to the nvme driver, is Count restored to 16 or does it require a host reset to restore the device to its default state? I've been assuming there's a path back from Count=22, but perhaps there's not without resetting the host. It stays at Count=22 as I rebind to nvme. ``` 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) ... Capabilities: [b0] MSI-X: Enable+ Count=22 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 ... Kernel driver in use: nvme ``` It continues to stay at Count=22 after resetting echo 1 | tee /sys/bus/pci/devices/0000:06:00.0/reset ``` 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) ... Capabilities: [b0] MSI-X: Enable+ Count=22 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 ... Kernel driver in use: nvme ``` After reboot it comes back to Count=16 ``` 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) ... Capabilities: [b0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 ... Kernel driver in use: nvme ``` Ok, how about we try a secondary bus reset then. For testing purposes we're going to trigger a secondary bus reset outside of the control of the kernel, so the device state will not be restored after this. We can look at the PCI config space, but don't expect the device to work until the system is rebooted. 
To start we need to identify the upstream port for the device. My system will be different from yours, so extrapolate as needed: # lspci -tv | grep -i nvme +-1c.4-[04]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 This shows my Samsung NVMe drive at 4:00.0 is attached to the root port at 00:1c.4, which is the bridge we'll be using to generate the reset. Replace with the device above your NVMe controller at 6:00.0. We can then read the bridge control register using: # setpci -s 00:1c.4 BRIDGE_CONTROL 0000 The bus reset procedure is to set the bus reset bit briefly, clear it, then wait for the bus to recover, therefore: # setpci -s 00:1c.4 BRIDGE_CONTROL=40:40; sleep 0.1; setpci -s 00:1c.4 BRIDGE_CONTROL=0:40; sleep 1 (don't forget to replace each occurrence of 00:1c.4 with the port the NVMe drive is attached in your system) From here check the MSI-X Count of the NVMe device. It would be interesting to test starting with Count=16, binding to vfio-pci, if you replace the 'echo 1 > reset' with the above, what does Count report. And also, after resetting the system, put the device back into a state where it reports Count=22, then try the secondary bus reset above to see if it returns the device to Count=16. If this is a better reset method for this device we can implement a device specific reset in the kernel that does this rather than an FLR. Rebind to vfio-pci and reset device. root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind 0000:06:00.0 root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id 8086 f1a6 root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind 0000:06:00.0 root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/nvme/new_id 8086 f1a6 root@localhost /home/alex # echo 1 | teee /sys/bus/pci/devices/0000:06:00.0/reset 1 At this point got Count=22 as expected with nvme in use. 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) ... Capabilities: [b0] MSI-X: Enable+ Count=22 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 ... Kernel driver in use: nvme root@localhost /home/alex # lspci -tvv | grep SSD +-01.1-[06]----00.0 Intel Corporation SSD Pro 7600p/760p/E 6100p Series root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL 0010 root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL=40:40 && sleep 0.1 && setpci -s 00:01.1 BRIDGE_CONTROL=0:40 && sleep 1 This time lspci reports Count=16. Capabilities: [b0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Rebooted. Rebound to vfio-pci. root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind 0000:06:00.0 root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id 8086 f1a6 At this point before bus reset lspci reports Count=16 as expected root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL=40:40 && sleep 0.1 && setpci -s 00:01.1 BRIDGE_CONTROL=0:40 && sleep 1 Now lspci still reports Count=16 with vfio-pci in use. 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) ... Capabilities: [b0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 ... 
Kernel driver in use: vfio-pci

Created attachment 280179 [details]
Prefer secondary bus reset over FLR
Please test the attached patch against a recent Linux kernel tree on the host. It causes the reset function interface to prefer a secondary bus reset over FLR for this device, which should resolve both the 'echo 1 > reset' failure and the assignment to QEMU using vfio-pci. If you prefer a different Reported-by/Tested-by attribution in the patch, please let me know and I'll correct it before posting upstream, assuming this works.
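For readers following along, the general shape of such a device-specific reset quirk in drivers/pci/quirks.c is roughly the following. This is only a sketch: it assumes pci_parent_bus_reset() is made callable from quirks.c (as the attached patch does), and the function and table names mirror those quoted later in this thread; the attachment itself is authoritative.

```c
/* Sketch only; the attached patch is authoritative.  Route the reset method
 * for this device to a secondary bus reset instead of its broken FLR. */
static int prefer_bus_reset(struct pci_dev *dev, int probe)
{
	return pci_parent_bus_reset(dev, probe);
}

static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
	/* ... existing entries ... */
	{ 0x8086, 0xf1a6, prefer_bus_reset },	/* Intel 760p/7600p (SM2262) */
	{ 0 }
};
```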
The following errors are hit with above patch: ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0 WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest # dmesg [ 124.940551] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 124.940557] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) [ 124.940561] pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00200000/00010000 [ 124.940563] pcieport 0000:00:1b.0: [21] ACSViol (First) [ 125.920253] pcieport 0000:00:1b.0: AER: Device recovery successful [ 125.920261] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 125.920277] pcieport 0000:00:1b.0: can't find device of ID00d8 [ 125.920386] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x158 [ 125.920394] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x180 [ 126.010862] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 126.010877] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) [ 126.010914] pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00200000/00010000 [ 126.010923] pcieport 0000:00:1b.0: [21] ACSViol (First) [ 127.008662] pcieport 0000:00:1b.0: AER: Device recovery successful [ 127.008671] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 127.008682] pcieport 0000:00:1b.0: can't find device of ID00d8 [ 150.603263] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 150.603270] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) [ 150.603274] pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00200000/00010000 [ 150.603277] pcieport 0000:00:1b.0: [21] ACSViol (First) [ 151.598132] pcieport 0000:00:1b.0: AER: Device recovery successful [ 151.598139] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 151.598146] pcieport 0000:00:1b.0: can't find device of ID00d8 Although above errors are encountered, the msix count is 16. # lspci -s 01:00.0 -vv Capabilities: [b0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Dongli Zhang Hi Dongli, you're getting an ACS violation, I wonder if it's related to an issue your colleague resolved recently: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa667c6408d20a84c7637420bc3b7aa0abab59a2 Is there an IDT switch in your topology or is the NVMe drive connected directly to the Intel root port? If the former, perhaps James' patch doesn't account for the invalid source ID propagating upstream. If directly connected to the Intel root port, perhaps IDT isn't the only downstream port with the issue. 
You could try disabling Source Validation on the root port via setpci to see if we're dealing with a similar issue: # setpci -s 1b.0 ECAP_ACS+6.w=0:1 However, you're using an Intel system with a non-standard (aka broken) ACS capability, therefore the ACS capability and control registers are actually dwords, so I think the correct command would be: # setpci -s 1b.0 ECAP_ACS+8.l=0:1 Also you won't be able to trust lspci for decoding of the ACS capability. Patch from above works just fine for me. I was able to passthrough device to linux and windows guests. There is a lspci from host. 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b] Physical Slot: 2-1 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 42 NUMA node: 0 Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+ Address: 0000000000000000 Data: 0000 Masking: 00000000 Pending: 00000000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled AtomicOpsCtl: ReqEn- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Capabilities: [100 v2] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Capabilities: [158 v1] 
Secondary PCI Express <?> Capabilities: [178 v1] Latency Tolerance Reporting Max snoop latency: 0ns Max no snoop latency: 0ns Capabilities: [180 v1] L1 PM Substates L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=10us PortTPowerOnTime=10us L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1- T_CommonMode=0us LTR1.2_Threshold=0ns L1SubCtl2: T_PwrOn=10us Kernel driver in use: vfio-pci And from a guest. 00:07.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b] Physical Slot: 7 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 11 NUMA node: 0 Region 0: Memory at fc074000 (64-bit, non-prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+ Address: 0000000000000000 Data: 0000 Masking: 00000000 Pending: 00000000 Capabilities: [70] Express (v2) Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend- LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported AtomicOpsCap: 32bit- 64bit- 128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled AtomicOpsCtl: ReqEn- LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+ EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest- Capabilities: [b0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Kernel driver in use: nvme Device still does not bind to vfio-pci on boot though and needs to be rebinded manually. With following .conf file options vfio-pci ids=8086:f1a6 softdep nvme pre: vfio vfio-pci Created attachment 280237 [details]
Prefer secondary bus reset over FLR
Include the native Silicon Motion PCI ID as used on the ADATA XPG SX8200 and hopefully others.
Created attachment 280239 [details]
trace
I went back into the logs and found the following trace.
I was not able to reproduce it again.
Besides this, nvme reports the following every time I start the guest:
nvme nvme0: failed to set APST feature (-19)
(In reply to Alex Williamson from comment #14) > Hi Dongli, you're getting an ACS violation, I wonder if it's related to an > issue your colleague resolved recently: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/ > ?id=aa667c6408d20a84c7637420bc3b7aa0abab59a2 > > Is there an IDT switch in your topology or is the NVMe drive connected > directly to the Intel root port? If the former, perhaps James' patch > doesn't account for the invalid source ID propagating upstream. If directly > connected to the Intel root port, perhaps IDT isn't the only downstream port > with the issue. > > You could try disabling Source Validation on the root port via setpci to see > if we're dealing with a similar issue: > > # setpci -s 1b.0 ECAP_ACS+6.w=0:1 > > However, you're using an Intel system with a non-standard (aka broken) ACS > capability, therefore the ACS capability and control registers are actually > dwords, so I think the correct command would be: > > # setpci -s 1b.0 ECAP_ACS+8.l=0:1 > > Also you won't be able to trust lspci for decoding of the ACS capability. Hi Alex, The kernel I use is the most recent upstream version including commit aa667c6408d20a84c7637420bc3b7aa0abab59a2. Is there a way to know if IDT switch is in the topology? The env is an dell desktop I use at home to debug program myself. # lspci 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05) 00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04) 00:14.0 USB controller: Intel Corporation Device a2af 00:14.2 Signal processing controller: Intel Corporation Device a2b1 00:16.0 Communication controller: Intel Corporation Device a2ba 00:17.0 SATA controller: Intel Corporation Device a282 00:1b.0 PCI bridge: Intel Corporation Device a2e7 (rev f0) 00:1d.0 PCI bridge: Intel Corporation Device a298 (rev f0) 00:1f.0 ISA bridge: Intel Corporation Device a2c6 00:1f.2 Memory controller: Intel Corporation Device a2a1 00:1f.3 Audio device: Intel Corporation Device a2f0 00:1f.4 SMBus: Intel Corporation Device a2a3 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (5) I219-V 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03) 02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01) 02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01) Dongli Zhang Hi Dongli, (In reply to Dongli Zhang from comment #19) > > The kernel I use is the most recent upstream version including commit > aa667c6408d20a84c7637420bc3b7aa0abab59a2. > > Is there a way to know if IDT switch is in the topology? No IDT switch in this system, so you shouldn't have that issue. > The env is an dell desktop I use at home to debug program myself. 
> > # lspci > 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05) > 00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04) > 00:14.0 USB controller: Intel Corporation Device a2af > 00:14.2 Signal processing controller: Intel Corporation Device a2b1 > 00:16.0 Communication controller: Intel Corporation Device a2ba > 00:17.0 SATA controller: Intel Corporation Device a282 > 00:1b.0 PCI bridge: Intel Corporation Device a2e7 (rev f0) > 00:1d.0 PCI bridge: Intel Corporation Device a298 (rev f0) > 00:1f.0 ISA bridge: Intel Corporation Device a2c6 > 00:1f.2 Memory controller: Intel Corporation Device a2a1 > 00:1f.3 Audio device: Intel Corporation Device a2f0 > 00:1f.4 SMBus: Intel Corporation Device a2a3 > 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (5) I219-V > 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev > 03) > 02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network > Connection (rev 01) > 02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network > Connection (rev 01) I bought an ADATA XPG SX8200 drive to debug further, in some systems it works fine with the attached patch, but in another I think I'm getting something similar to what you see. My system has Downstream Port Containment (DPC) support, so I think that catches the error before AER, but if I disable ACS Source Validation on the root port it avoids any errors, so I think we're still dealing with the ACS violation that you see. A clue though is that triggering the bus reset via setpci as in comment 10 does not trigger the fault. I then stumbled on adding a delay in the kernel code path prior to the bus reset to avoid the issue. Long story short, could you try adding a delay to the previous patch, for example make the new function in drivers/pci/quirks.c look like this: static int prefer_bus_reset(struct pci_dev *dev, int probe) { msleep(100); return pci_parent_bus_reset(dev, probe); } I look forward to seeing if this works around the AER fault in your system as well. Actually, msleep(100) may be a few orders of magnitude longer than we need, I continue to see errors with udelay(10), but it seems to work perfectly with udelay(100). Dongli, please test the above using udelay(100) rather than msleep(100). Thanks The delay in comment 20 allows the device to reset when it's already quiesced, but after the VM makes use of the device I'm finding that it will still trigger the fault. I've got another version that follows the path of the Samsung nvme quirk to test and disable the nvme controller before performing a reset. Coupling with the delay, this seems to address both the previously active and previously idle reset cases. I'll attach a new patch implementing this for testing. (In reply to Alex Williamson from comment #21) > Actually, msleep(100) may be a few orders of magnitude longer than we need, > I continue to see errors with udelay(10), but it seems to work perfectly > with udelay(100). Dongli, please test the above using udelay(100) rather > than msleep(100). 
Thanks Hi Alex, While waiting for the patch mentioned by Comment 22, I have tested the below by adding udelay(100): 3828 static int prefer_bus_reset(struct pci_dev *dev, int probe) 3829 { 3830 udelay(100); 3831 return pci_parent_bus_reset(dev, probe); 3832 } I got the below error again: QEMU: # ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0 WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest KERNEL: [ 69.715224] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 69.715230] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) [ 69.715234] pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00200000/00010000 [ 69.715236] pcieport 0000:00:1b.0: [21] ACSViol (First) [ 70.742423] pcieport 0000:00:1b.0: AER: Device recovery successful [ 70.742430] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 70.742442] pcieport 0000:00:1b.0: can't find device of ID00d8 [ 70.742554] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x158 [ 70.742562] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x180 [ 70.834427] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 70.834440] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID) [ 70.834448] pcieport 0000:00:1b.0: device [8086:a2e7] error status/mask=00200000/00010000 [ 70.834453] pcieport 0000:00:1b.0: [21] ACSViol (First) [ 71.822627] pcieport 0000:00:1b.0: AER: Device recovery successful [ 71.822634] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0 [ 71.822645] pcieport 0000:00:1b.0: can't find device of ID00d8 Dongli Zhang Created attachment 280535 [details]
Test patch, NVMe shutdown + delay to avoid ACS violation
Here's the patch for testing; it avoids all the ACS violation faults on my system with the ADATA XPG SX8200. Please test.
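For context, a rough sketch of what "NVMe shutdown + delay before bus reset" amounts to is below. The register offsets come from the NVMe spec and are redefined locally; sm2262_reset is the quirk name referenced later in this thread, pci_parent_bus_reset() is assumed to be made callable from quirks.c by the patch, and the attachment remains the authoritative version.

```c
#include <linux/delay.h>
#include <linux/io.h>
#include <linux/jiffies.h>
#include <linux/pci.h>

/* NVMe controller registers (per the NVMe spec), redefined here for the sketch */
#define NVME_REG_CC	0x14	/* Controller Configuration */
#define NVME_REG_CSTS	0x1c	/* Controller Status */
#define NVME_CC_ENABLE	0x1
#define NVME_CSTS_RDY	0x1

static int sm2262_reset(struct pci_dev *dev, int probe)
{
	void __iomem *bar;
	unsigned long timeout;

	if (probe)
		return 0;

	/* Quiesce the controller first: clear CC.EN and wait for CSTS.RDY to drop */
	bar = pci_iomap(dev, 0, NVME_REG_CSTS + 4);
	if (bar) {
		writel(readl(bar + NVME_REG_CC) & ~NVME_CC_ENABLE, bar + NVME_REG_CC);
		timeout = jiffies + msecs_to_jiffies(500);
		while ((readl(bar + NVME_REG_CSTS) & NVME_CSTS_RDY) &&
		       time_before(jiffies, timeout))
			msleep(10);
		pci_iounmap(dev, bar);
	}

	/* Brief settling delay that was observed to avoid the ACS violation,
	 * then fall through to the secondary bus reset */
	udelay(100);
	return pci_parent_bus_reset(dev, probe);
}
```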
Created attachment 280555 [details]
linux config
Hi Alex, The patch does not work for me :( Here is how I reproduce the issue. The attached file is my kernel config. qemu commit: 6f2f34177a25bffd6fd92a05e6e66c8d22d97094 linux commit: 1c7fc5cbc33980acd13d668f1c8f0313d6ae9fd8 To build qemu: # ./configure --target-list=x86_64-softmmu # make -j8 > /dev/null To build linux: use the attached config # make -j8 > /dev/null To reproduce, boot into the linux kernel. I always use qemu at where it is built. I do not run "make install" for qemu. # modprobe vfio # modprobe vfio-pci # echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind # echo "8086 f1a6" > /sys/bus/pci/drivers/vfio-pci/new_id # ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0 # ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0 WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest Dongli Zhang I don't know what more I can do here, I've since tested the ADATA XPG SX8200 in an Intel laptop with 200-series chipset and it behaves just fine with the latest patch. It's possible the additional issues are unique to the Intel 760p implementation of the SM2262 or only exposed in configurations similar to yours. I'm out of options to investigate further. You could potentially boot with pci=noaer to disable Advanced Error Reporting in your configuration, but that's never a good long term solution. (In reply to Alex Williamson from comment #27) > I don't know what more I can do here, I've since tested the ADATA XPG SX8200 > in an Intel laptop with 200-series chipset and it behaves just fine with the > latest patch. It's possible the additional issues are unique to the Intel > 760p implementation of the SM2262 or only exposed in configurations similar > to yours. I'm out of options to investigate further. You could potentially > boot with pci=noaer to disable Advanced Error Reporting in your > configuration, but that's never a good long term solution. Hi Alex, Thank you very much for the help. Perhaps it is only specific to this hardware or my machine. Perhaps I should upgrade the firmware. I would try to debug it a little bit in my spare time. So far to disabled aer in grub would boot guest VM successfully. With the patch, the entires of msix is not 22 any more. Dongli Zhang Hi Alex, The "Prefer secondary bus reset over FLR" patch works for devices you added in pci_dev_reset_methods[]. Will this patch work correctly for a SM2263 controller as well? One such device (Crucial P1 CT500P1SSD8) has PCI ID [c0a9:2263], just a matter of adding this ID? Also, should we be using the "Test patch, NVMe shutdown + delay to avoid ACS violation" patch instead? thanks, Tom (In reply to LimeTech from comment #29) > Hi Alex, > > The "Prefer secondary bus reset over FLR" patch works for devices you added > in pci_dev_reset_methods[]. Will this patch work correctly for a SM2263 > controller as well? One such device (Crucial P1 CT500P1SSD8) has PCI ID > [c0a9:2263], just a matter of adding this ID? 
> > > Also, should we be using the "Test patch, NVMe shutdown + delay to avoid ACS > violation" patch instead? Hi Tom, The second patch is intended to be a replacement of the original, it at least enables the ADATA drive on a server where the first patch did not, even if that turned out to be not exactly the same issue as Dongli experiences. To add the SM2263 just add a new ID, ex: { 0xc0a9, 0x2263, sm2262_reset }, Add it to the code where the last chunk of the patch includes the known SM2262 variants, in the pci_dev_reset_methods array. Please report back the results. It's really unfortunate that there's such a fundamental bug in a whole family of controllers that's getting rebranded with different PCI IDs by so many vendors. Thanks, Alex (In reply to Alex Williamson from comment #30) > (In reply to LimeTech from comment #29) > > Hi Alex, > > > > The "Prefer secondary bus reset over FLR" patch works for devices you added > > in pci_dev_reset_methods[]. Will this patch work correctly for a SM2263 > > controller as well? One such device (Crucial P1 CT500P1SSD8) has PCI ID > > [c0a9:2263], just a matter of adding this ID? > > > > > > Also, should we be using the "Test patch, NVMe shutdown + delay to avoid > ACS > > violation" patch instead? > > Hi Tom, > > The second patch is intended to be a replacement of the original, it at > least enables the ADATA drive on a server where the first patch did not, > even if that turned out to be not exactly the same issue as Dongli > experiences. To add the SM2263 just add a new ID, ex: > > { 0xc0a9, 0x2263, sm2262_reset }, > > Add it to the code where the last chunk of the patch includes the known > SM2262 variants, in the pci_dev_reset_methods array. Please report back the > results. > It's really unfortunate that there's such a fundamental bug in a whole > family of controllers that's getting rebranded with different PCI IDs by so > many vendors. Thanks, > > Alex Thank you, applied patch, will report back. (In reply to LimeTech from comment #31) > (In reply to Alex Williamson from comment #30) > > (In reply to LimeTech from comment #29) > > > Hi Alex, > > > > > > The "Prefer secondary bus reset over FLR" patch works for devices you > added > > > in pci_dev_reset_methods[]. Will this patch work correctly for a SM2263 > > > controller as well? One such device (Crucial P1 CT500P1SSD8) has PCI ID > > > [c0a9:2263], just a matter of adding this ID? > > > > > > > > > Also, should we be using the "Test patch, NVMe shutdown + delay to avoid > > ACS > > > violation" patch instead? > > > > Hi Tom, > > > > The second patch is intended to be a replacement of the original, it at > > least enables the ADATA drive on a server where the first patch did not, > > even if that turned out to be not exactly the same issue as Dongli > > experiences. To add the SM2263 just add a new ID, ex: > > > > { 0xc0a9, 0x2263, sm2262_reset }, > > > > Add it to the code where the last chunk of the patch includes the known > > SM2262 variants, in the pci_dev_reset_methods array. Please report back > the > > results. > > It's really unfortunate that there's such a fundamental bug in a whole > > family of controllers that's getting rebranded with different PCI IDs by so > > many vendors. Thanks, > > > > Alex > > Thank you, applied patch, will report back. The report is that the patch solved the problem with the Crucial P1 using the SM2263 controller, and also passthrough works perfectly now. thanks Tom Created attachment 280913 [details]
NVMe subsystem reset with ACS masking
Dongli, I'd appreciate testing of this patch series. The differences from the previous version are:
1) Use an NVMe subsystem reset rather than a secondary bus reset; this simplifies some of the hotplug slot code from the previous version.
2) Mask ACS Source Validation around the reset; this eliminates some of the magic voodoo that avoided the fault on my system but not yours.
This grew into a several-patch series in order to simplify the ACS masking, but it should still apply easily. Testing by others is obviously welcome as well. Thanks
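To make point 2 concrete, masking Source Validation around a reset looks roughly like the sketch below. It assumes the standard ACS capability layout (Alex noted earlier that the Intel root ports involved use a non-standard dword layout, which this sketch does not handle), and pci_reset_function() stands in for the NVMe subsystem reset performed by the actual series; the attachment is authoritative.

```c
#include <linux/pci.h>

/* Sketch: temporarily clear ACS Source Validation on the upstream port so
 * that transactions carrying an unexpected requester ID during the reset do
 * not trip an ACS violation, then restore the original control value. */
static void reset_with_acs_sv_masked(struct pci_dev *dev)
{
	struct pci_dev *parent = pci_upstream_bridge(dev);
	u16 ctrl = 0;
	int pos = 0;

	if (parent)
		pos = pci_find_ext_capability(parent, PCI_EXT_CAP_ID_ACS);

	if (pos) {
		pci_read_config_word(parent, pos + PCI_ACS_CTRL, &ctrl);
		pci_write_config_word(parent, pos + PCI_ACS_CTRL,
				      ctrl & ~PCI_ACS_SV);
	}

	pci_reset_function(dev);	/* stand-in for the NVMe subsystem reset */

	if (pos)
		pci_write_config_word(parent, pos + PCI_ACS_CTRL, ctrl);
}
```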
(In reply to Alex Williamson from comment #33) > Created attachment 280913 [details] > NVMe subsystem reset with ACS masking > > Dongli, I'd appreciate testing of this patch series. The differences from > the previous version are: > > 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies > some of the hotplug slot code from the previous version > 2) Mask ACS Source Validation around reset, this eliminates some of the > magic voodoo that avoided the fault on my system, but not yours > > This exploded into a several patch series to simplify the ACS masking, but > it should still apply easily. Testing by others obviously welcome as well. > Thanks A user is reporting a flood of syslog messages as a result of running fstrim on one of these devices: 02:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:2262] (rev 03) Subsystem: Silicon Motion, Inc. Device [126f:2262] Kernel driver in use: nvme Kernel modules: nvme Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr ef321000 [fault reason 06] PTE Read access is not set Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr f0a19000 [fault reason 06] PTE Read access is not set Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr efe93000 [fault reason 06] PTE Read access is not set Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 Jan 27 07:00:17 unRAID kernel: dmar_fault: 77 callbacks suppressed Do you think your latest patch might fix this? (In reply to LimeTech from comment #34) > > A user is reporting a flood of syslog messages as a result of running fstrim > on one of these devices: > > 02:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device > [126f:2262] (rev 03) > Subsystem: Silicon Motion, Inc. Device [126f:2262] > Kernel driver in use: nvme > Kernel modules: nvme > > Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 > Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] > fault addr ef321000 [fault reason 06] PTE Read access is not set > Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 > Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] > fault addr f0a19000 [fault reason 06] PTE Read access is not set > Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 > Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] > fault addr efe93000 [fault reason 06] PTE Read access is not set > Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3 > Jan 27 07:00:17 unRAID kernel: dmar_fault: 77 callbacks suppressed > > Do you think your latest patch might fix this? Not likely. Gosh, how many ways can these devices be broken? This was while the device was in use by the host or within a guest? Those faults indicate the device is trying to do a DMA read from an IOVA it doesn't have mapped through the IOMMU. Based on the addresses, I'd guess this is not a VM use case. Either way, it's not the issue this bug is tracking. (In reply to Alex Williamson from comment #33) > Created attachment 280913 [details] > NVMe subsystem reset with ACS masking > > Dongli, I'd appreciate testing of this patch series. 
The differences from > the previous version are: > > 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies > some of the hotplug slot code from the previous version > 2) Mask ACS Source Validation around reset, this eliminates some of the > magic voodoo that avoided the fault on my system, but not yours > > This exploded into a several patch series to simplify the ACS masking, but > it should still apply easily. Testing by others obviously welcome as well. > Thanks Hi Alex, I am on vacation and could not access the test machine with the nvme (with issue). I will test it next week. Thank you very much for creating the patch. Dongli Zhang Added another PCI ID to quirks.c (2019-01-16): + { 0x126f, 0x2263, sm2262_reset }, Also your latest patch (2019-02-01) will not apply against 4.19 kernel. (The 2019-01-16 patch doesn't either but that's easy to fix). What kernel should this be applied to? -Tom (In reply to LimeTech from comment #37) > Added another PCI ID to quirks.c (2019-01-16): > > + { 0x126f, 0x2263, sm2262_reset }, > > Also your latest patch (2019-02-01) will not apply against 4.19 kernel. > (The 2019-01-16 patch doesn't either but that's easy to fix). What kernel > should this be applied to? Added. It's against v4.20. Thanks. (In reply to Alex Williamson from comment #33) > Created attachment 280913 [details] > NVMe subsystem reset with ACS masking > > Dongli, I'd appreciate testing of this patch series. The differences from > the previous version are: > > 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies > some of the hotplug slot code from the previous version > 2) Mask ACS Source Validation around reset, this eliminates some of the > magic voodoo that avoided the fault on my system, but not yours > > This exploded into a several patch series to simplify the ACS masking, but > it should still apply easily. Testing by others obviously welcome as well. > Thanks Hi Alex, I have tested the 5-patch 280913 (as below). Unfortunately, I encountered the initial problem again, that is, the msix count changed from 16 to 22 again. There is no AER message this time. https://bugzilla.kernel.org/attachment.cgi?id=280913 ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0 WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: vfio error: 0000:01:00.0: failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align The msix count changed from 16 to 22 again. 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation Device 390b Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 16 ... ... Capabilities: [b0] MSI-X: Enable- Count=22 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002100 Dongli Zhang Created attachment 281113 [details]
Debug patch
Dongli, if we're unable to perform the NVMe subsystem reset, we fall back to other resets, including the known-bad FLR, which seems like what might be happening here. Could you please apply this patch on top of the previous one to add some debugging that shows where the detection is failing? I can only guess this might mean your device does not support an NVMe subsystem reset, but I can't imagine why the Intel variant would remove this while the ADATA version has it. Ugh. Thanks
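For reference, "does the device support an NVMe subsystem reset" comes down to CAP.NSSRS (bit 36 of the 64-bit CAP register at BAR0 offset 0x0); when set, the reset is initiated by writing 0x4E564D65 ("NVMe") to the NSSR register at offset 0x20. Below is a minimal detection sketch, assuming BAR0 is mappable; it is not the attached debug patch itself.

```c
#include <linux/io.h>
#include <linux/pci.h>

#define NVME_REG_CAP	0x00	/* 64-bit Controller Capabilities */
#define NVME_REG_NSSR	0x20	/* NVM Subsystem Reset (write 0x4E564D65 to trigger) */

static bool nvme_has_subsystem_reset(struct pci_dev *pdev)
{
	void __iomem *bar;
	u64 cap;

	bar = pci_iomap(pdev, 0, NVME_REG_NSSR + 4);
	if (!bar)
		return false;

	/* Read CAP as two 32-bit halves; NSSRS is bit 36 */
	cap = readl(bar + NVME_REG_CAP) |
	      ((u64)readl(bar + NVME_REG_CAP + 4) << 32);

	pci_iounmap(pdev, bar);
	return !!(cap & (1ULL << 36));
}
```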
Created attachment 281173 [details]
trace
The NVMe subsystem reset patch does not quite work for me either.
MSI-X stays at 16, but the device appears on the guest side as:
04:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
Physical Slot: 0-2
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin ? routed to IRQ 20
NUMA node: 0
Region 0: Memory at 98000000 (64-bit, non-prefetchable) [size=16K]
It seems there's a partial workaround available since QEMU v2.12 hiding under our noses. That version adds support for relocating the MSI-X vector table on vfio-pci devices, which recreates the MSI-X MMIO space elsewhere on the device. A side-effect of this is that the vector table and PBA are properly sized so as not to collide. The size of the tables remains wrong, but this only becomes a problem if the nvme code attempts to allocate >16 vectors, which requires >15 vCPU (or host CPUs if the device is returned to host drivers after being assigned)(nvme appears to allocate 1 admin queue, plus a queue per CPU, each making use of an IRQ vector). The QEMU vfio-pci device option is x-msix-relocation= which allows specifying the bar to use for the MSI-X tables, ex. bar0...bar5. Since this device uses a 64bit bar0, we can either extend that BAR or choose another, excluding bar1, which is consumed by the upper half of bar0. For instance, I tested with: <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'> ... <hostdev mode='subsystem' type='pci' managed='yes'> <source> <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </source> <alias name='ua-sm2262'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </hostdev> ... <qemu:commandline> <qemu:arg value='-set'/> <qemu:arg value='device.ua-sm2262.x-msix-relocation=bar2'/> </qemu:commandline> (NB: "ua-" is a required prefix when specifying an alias) A new virtual BAR appears in the guest hosting the MSI-X table and QEMU starts normally so long as the guest doesn't exceed 15 vCPUs. The vCPU/pCPU count limitations are obviously not ideal, but hopefully this provides some degree of workaround for typical configurations. (In reply to Alex Williamson from comment #42) > It seems there's a partial workaround available since QEMU v2.12 hiding > under our noses. That version adds support for relocating the MSI-X vector > table on vfio-pci devices, which recreates the MSI-X MMIO space elsewhere on > the device. A side-effect of this is that the vector table and PBA are > properly sized so as not to collide. The size of the tables remains wrong, > but this only becomes a problem if the nvme code attempts to allocate >16 > vectors, which requires >15 vCPU (or host CPUs if the device is returned to > host drivers after being assigned)(nvme appears to allocate 1 admin queue, > plus a queue per CPU, each making use of an IRQ vector). The QEMU vfio-pci > device option is x-msix-relocation= which allows specifying the bar to use > for the MSI-X tables, ex. bar0...bar5. Since this device uses a 64bit bar0, > we can either extend that BAR or choose another, excluding bar1, which is > consumed by the upper half of bar0. For instance, I tested with: > > <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'> > ... > <hostdev mode='subsystem' type='pci' managed='yes'> > <source> > <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> > </source> > <alias name='ua-sm2262'/> > <address type='pci' domain='0x0000' bus='0x02' slot='0x00' > function='0x0'/> > </hostdev> > ... > <qemu:commandline> > <qemu:arg value='-set'/> > <qemu:arg value='device.ua-sm2262.x-msix-relocation=bar2'/> > </qemu:commandline> > > > (NB: "ua-" is a required prefix when specifying an alias) > > A new virtual BAR appears in the guest hosting the MSI-X table and QEMU > starts normally so long as the guest doesn't exceed 15 vCPUs. 
> > The vCPU/pCPU count limitations are obviously not ideal, but hopefully this > provides some degree of workaround for typical configurations. Hi Alex, are you saying the above coding in the VM xml is all that is necessary (with noted vCPU/pCPU count limitations) to successfully pass-through sm2262-based controllers, without above kernel patch? Or is a kernel patch also necessary (if so, which one)? thx, -tom (In reply to LimeTech from comment #43) > > Hi Alex, are you saying the above coding in the VM xml is all that is > necessary (with noted vCPU/pCPU count limitations) to successfully > pass-through sm2262-based controllers, without above kernel patch? Or is a > kernel patch also necessary (if so, which one)? Yes, without kernel patch. Sorry if this is a dumb question but I think I am experiencing exactly this issue. But I cannot work out how to apply the patch. Can anyone explain it or point me in the direction of how to apply the patch Hey gents! Trying to apply this to a proxmox VM, any advice on how to do so? I'm running on top of the 5.4.x linux kernel. I'm experiencing the same bug as in topic name, using the SMI 2263 controller. (Adata as above) Anything I can provide in terms of debugging data? Otherwise, any advice on how to use the above non kernel patch? also comfortable doing a kernel recompile if thats the better solution. Will be going to a FreeBSD (FreeNAS 11.3) VM with 4 threads and 72GB of ram alongside an LSI SAS controller. Platform is a Dual Xeon 5670 on a HP ml350 G6 mother board. Also attached but on a different VM is a Radeon 5600XT, an intel USB controller and an intel Nic to a windows VM (In reply to FCL from comment #46) > Hey gents! > > Trying to apply this to a proxmox VM, any advice on how to do so? I'm > running on top of the 5.4.x linux kernel. I'm experiencing the same bug as > in topic name, using the SMI 2263 controller. (Adata as above) > > Anything I can provide in terms of debugging data? Otherwise, any advice on > how to use the above non kernel patch? also comfortable doing a kernel > recompile if thats the better solution. Will be going to a FreeBSD (FreeNAS > 11.3) VM with 4 threads and 72GB of ram alongside an LSI SAS controller. > > Platform is a Dual Xeon 5670 on a HP ml350 G6 mother board. > > Also attached but on a different VM is a Radeon 5600XT, an intel USB > controller and an intel Nic to a windows VM Found the solution without the need for a kernel recompile. in my Qemu conf file added: args: -set device.hostpci1.x-msix-relocation=bar2 Where hostpci1 is hostpci1: 08:00.0 For Intel SSD 660p series, the latest firmware is required for pass-through to work without issues. The latest one should be from mid 2020 at the time of writing this. Not even any workarounds or special settings for qemu/libvirt are needed, just a simple pass-through setup as with any other PCIe device. |