Bug 202055

Summary: Failed to PCI passthrough SSD with SMI SM2262 controller.
Product: Virtualization    Reporter: Alex (coffmaker)
Component: kvm    Assignee: virtualization_kvm
Status: NEW
Severity: normal CC: alex.williamson, coffmaker, dongli.zhang, Felix.leclair123, maximlevitsky, nicholas.pomee, plantroon, tomm
Priority: P1    
Hardware: Other   
OS: Linux   
Kernel Version: 4.19.12-arch1-1-ARCH
Regression: No
Attachments:
    Prefer secondary bus reset over FLR
    Prefer secondary bus reset over FLR
    trace
    Test patch, NVMe shutdown + delay to avoid ACS violation
    linux config
    NVMe subsystem reset with ACS masking
    Debug patch
    trace

Description Alex 2018-12-24 19:34:46 UTC
Trying to PCI passthrough an Intel SSD 760p 256G, which is built with the SMI SM2262 controller, fails with the following error:
> qemu-system-x86_64: -device vfio-pci,host=06:00.0: vfio 0000:06:00.0: failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align.

According to [this](https://forums.unraid.net/topic/72036-nvme-m2-passthrough/) thread, it happens with every SSD that uses the SM2262 controller.

It happens regardless of whether the nvme or vfio-pci driver is in use.

Besides this, I had problems binding the device to vfio-pci.

Here is my vfio.conf:
options vfio-pci ids=8086:f1a6

But according to lspci it still used the nvme driver.

I tried adding "softdep nvme pre: vfio vfio-pci" to vfio.conf with the same result: nvme was still used.

So I tried to rebind the device manually. That worked, but QEMU still failed with the error above.
> echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
> echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id


uname -a:
> Linux localhost 4.19.12-arch1-1-ARCH #1 SMP PREEMPT Fri Dec 21 13:56:54 UTC 2018 x86_64 GNU/Linux

Kernel command line:
> root=/dev/nvme1n1p2 rw transparent_hugepage=never intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1

QEMU 3.1.0

/etc/modprobe.d/kvm.conf:
> options kvm_intel nested=1
> options kvm allow_unsafe_assigned_interrupts=1
> options kvm ignore_msrs=1

/etc/modprobe.d/vfio.conf:
> options vfio-pci ids=8086:f1a6
> softdep nvme pre: vfio vfio-pci

```
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 2-1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 42
	NUMA node: 0
	Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [158 v1] Secondary PCI Express <?>
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: nvme
```
Comment 1 Alex Williamson 2018-12-26 03:08:03 UTC
There's been another report[1] that this device presents an invalid MSI-X capability where the vector table and PBA overlap.  The user there reports:

	Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100

Each vector table entry is 16 bytes, therefore a 22-entry vector table based at 0x2000 would extend to 0x2160, but the PBA is claimed to start at 0x2100.  We have different results here:

	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100

It appears that this capability is sane and should pass the QEMU sanity test, but clearly it did not, so did this capability report the same values when vfio read it?  Note that in the first case MSI-X is not enabled while in the latter case it is enabled and we can see the device is bound to the nvme driver.  Perhaps this suggests there are states where this device reports a valid MSI-X capability and states where it does not.
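
For illustration, a minimal sketch of the overlap arithmetic involved (illustrative only, not QEMU's actual code): each vector table entry is 16 bytes and the PBA needs one bit per vector, rounded up to 8-byte QWORDs.

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: the kind of layout check QEMU performs when adding
 * the MSI-X capability.  Table entries are 16 bytes each; the PBA holds
 * one bit per vector, rounded up to 8-byte QWORDs.
 */
static bool msix_layout_sane(uint32_t table_off, uint32_t pba_off,
                             uint32_t count, uint32_t bar_size)
{
    uint32_t table_end = table_off + count * 16;
    uint32_t pba_end   = pba_off + ((count + 63) / 64) * 8;

    if (table_end > bar_size || pba_end > bar_size)
        return false;   /* doesn't fit in the BAR */
    if (table_off < pba_end && pba_off < table_end)
        return false;   /* table and PBA overlap  */
    return true;
}

int main(void)
{
    /* Count=16: table 0x2000-0x2100, PBA at 0x2100 -> sane    */
    printf("Count=16: %d\n", msix_layout_sane(0x2000, 0x2100, 16, 0x4000));
    /* Count=22: table 0x2000-0x2160, PBA at 0x2100 -> overlap */
    printf("Count=22: %d\n", msix_layout_sane(0x2000, 0x2100, 22, 0x4000));
    return 0;
}
```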

I would suggest:

a) Unbind the device from the nvme driver, bind it to vfio-pci, look at lspci in the host and see if the Count value in the MSI-X capability has changed.

b) If the device still reports Count=16 after the steps in a), continue from that point by resetting the device via pci-sysfs (ex. echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset).  Look again at lspci in the host to see if the Count value has changed.

Thanks


[1]https://patchwork.kernel.org/patch/10707761/
Comment 2 Maxim Levitsky 2018-12-26 13:15:33 UTC
A wild guess based - the device reports the same number of MSI-X vectors as the number of I/O queues configured (using the 'Number of Queues' feature).

So the NVMe driver enables the device, sends the 'Number of Queues' Set Features command, and MSI-X starts 'working'.

Do you happen to have 16 logical CPUs?
Comment 3 Maxim Levitsky 2018-12-26 13:17:22 UTC
s/A wild guess based /A wild guess based on the suspicious number of MSI IRQs in both cases/
Comment 4 Alex 2018-12-26 21:08:23 UTC
I rebound the device; here is the lspci output.  It reports Count=16.

```
root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 2-1
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 42
	NUMA node: 0
	Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D3 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [158 v1] Secondary PCI Express <?>
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: vfio-pci
```

Reset with vfio-pci in use.  This time it reports Count=22.
```
root@localhost /home/alex # echo 1 | tee /sys/bus/pci/devices/0000:06:00.0/reset
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 2-1
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 42
	NUMA node: 0
	Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/16 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 01
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [158 v1] Secondary PCI Express <?>
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: vfio-pci
```

My CPU has 12 threads.
```
alex@localhost ~ % lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
Stepping:            2
CPU MHz:             1499.984
CPU max MHz:         3200.0000
CPU min MHz:         1200.0000
BogoMIPS:            8003.18
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            15360K
NUMA node0 CPU(s):   0-11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts flush_l1d
```
Comment 5 Alex Williamson 2018-12-26 22:43:32 UTC
(In reply to Alex from comment #4)
> I rebound device and there is a lspci output. It reports Count=16.
> 
> ```
> root@localhost /home/alex # echo 0000:06:00.0 | tee
> /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
> root@localhost /home/alex # echo 8086 f1a6 | tee
> /sys/bus/pci/drivers/vfio-pci/new_id
> 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro
> 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
>       Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
>Stepping- SERR+ FastB2B- DisINTx-
>       Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
>               Vector table: BAR=0 offset=00002000
>               PBA: BAR=0 offset=00002100
> ```
> 
> Resetted with vfio-pci in use. This time it reports Count=22.
> ```
> root@localhost /home/alex # echo 1 | tee
> /sys/bus/pci/devices/0000:06:00.0/reset
> 06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro
> 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
>       Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
>Stepping- SERR+ FastB2B- DisINTx-
>       Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
>               Vector table: BAR=0 offset=00002000
>               PBA: BAR=0 offset=00002100


Ok, interesting.  So likely when QEMU is analyzing the device it's seeing this 22 value which is why it throws an error at the sanity test.  With the nvme driver bound, we seem to get a sane number of MSI-X entries, though it still confuses me how the reporter in [1] claimed their system was making use of 17 vectors, which would mean that Count=16 is still bogus.  In any case, let's see if we can figure out what we can poke on the device to make these fields within the register change.

Start with the device in the state you have it above where it reports Count=22.

First let's test if the vector table size is really read-only:

# setpci -s 06:00.0 CAP_MSIX+2.w

This should return 0015, since the Table Size field encodes the vector count minus one (21 = 0x15 for 22 vectors).  Try to write it:

# setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff

And read it back again:

# setpci -s 06:00.0 CAP_MSIX+2.w

Did the value change?

Next, we already have memory enabled on the device, but the nvme driver also enables bus master before enabling interrupts, so let's check if setting bus master triggers a change in the MSI-X capability:

# setpci -s 06:00.0 COMMAND

This should report 0102 based on the lspci output.  To enable bus master:

# setpci -s 06:00.0 COMMAND=4:4

Does the Count value in the MSI-X capability change?

(To return it back to the previous state: setpci -s 06:00.0 COMMAND=0:4)

I'd really hope that one of the above helps indicate a next step, but we could also try enabling MSI-X (with it masked):

# setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000

This should cause lspci to report Enable+ and Masked+, but does the Count value change?

(To return it back to the previous state: setpci -s 06:00.0 CAP_MSIX+2.w=0:c000)
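
For reference, a minimal sketch (illustrative only, based on the PCI spec layout) of the Message Control word these setpci commands read and write: bits 10:0 hold the Table Size encoded as N-1, bit 14 is Function Mask, bit 15 is MSI-X Enable, which is why the c000:c000 write toggles Enable/Masked without touching the size.

```
#include <stdint.h>
#include <stdio.h>

/* MSI-X Message Control word (capability offset + 2), per the PCI spec:
 *   bits 10:0  Table Size (encoded as N-1)
 *   bit  14    Function Mask
 *   bit  15    MSI-X Enable
 */
static void decode_msix_ctrl(uint16_t ctrl)
{
    printf("Count=%u Enable%c Masked%c\n",
           (ctrl & 0x07ff) + 1,
           (ctrl & 0x8000) ? '+' : '-',
           (ctrl & 0x4000) ? '+' : '-');
}

int main(void)
{
    decode_msix_ctrl(0x0015);  /* as read back below: Count=22, Enable-, Masked- */
    decode_msix_ctrl(0xc015);  /* after =c000:c000:  Count=22, Enable+, Masked+  */
    return 0;
}
```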
Comment 6 Alex 2018-12-26 23:45:38 UTC
root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w
0015

root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff
root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w
0015

root@localhost /home/alex # setpci -s 06:00.0 COMMAND
0400

root@localhost /home/alex # setpci -s 06:00.0 COMMAND=4:4
After this lspci reports "MSI-X: Enable- Count=22 Masked-"

root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000
lspci reports "MSI-X: Enable+ Count=22 Masked+"
Comment 7 Alex 2018-12-27 12:16:44 UTC
Did the previous steps again today.  Got slightly different results.

root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w
0015
root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=10:7ff
root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w
0015
root@localhost /home/alex # setpci -s 06:00.0 COMMAND
0102
root@localhost /home/alex # setpci -s 06:00.0 COMMAND=4:4
root@localhost /home/alex # setpci -s 06:00.0 CAP_MSIX+2.w=c000:c000

At every step I got Count=22 from lspci.
Comment 8 Alex Williamson 2018-12-27 14:07:46 UTC
If you have the device in a state where it reports Count=22 and bind it back to the nvme driver, is Count restored to 16 or does it require a host reset to restore the device to its default state?  I've been assuming there's a path back from Count=22, but perhaps there's not without resetting the host.
Comment 9 Alex 2018-12-27 14:58:08 UTC
It stays at Count=22 when I rebind to nvme.

```
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
...
	Capabilities: [b0] MSI-X: Enable+ Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
...
	Kernel driver in use: nvme
```

It continues to stay at Count=22 after resetting:
echo 1 | tee /sys/bus/pci/devices/0000:06:00.0/reset
```
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
...
	Capabilities: [b0] MSI-X: Enable+ Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
...
	Kernel driver in use: nvme
```

After a reboot it comes back to Count=16:
```
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
...
	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
...
	Kernel driver in use: nvme
```
Comment 10 Alex Williamson 2018-12-28 16:47:58 UTC
Ok, how about we try a secondary bus reset then.  For testing purposes we're going to trigger a secondary bus reset outside of the control of the kernel, so the device state will not be restored after this.  We can look at the PCI config space, but don't expect the device to work until the system is rebooted.  To start we need to identify the upstream port for the device.  My system will be different from yours, so extrapolate as needed:

# lspci -tv | grep -i nvme
           +-1c.4-[04]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981

This shows my Samsung NVMe drive at 4:00.0 is attached to the root port at 00:1c.4, which is the bridge we'll be using to generate the reset.  Replace it with the bridge above your NVMe controller at 06:00.0.

We can then read the bridge control register using:

# setpci -s 00:1c.4 BRIDGE_CONTROL
0000

The bus reset procedure is to set the bus reset bit briefly, clear it, then wait for the bus to recover, therefore:

# setpci -s 00:1c.4 BRIDGE_CONTROL=40:40; sleep 0.1; setpci -s 00:1c.4 BRIDGE_CONTROL=0:40; sleep 1

(don't forget to replace each occurrence of 00:1c.4 with the port the NVMe drive is attached in your system)

From here, check the MSI-X Count of the NVMe device.  It would be interesting to test starting with Count=16, binding to vfio-pci, and replacing the 'echo 1 > reset' with the above: what does Count report then?  Also, after rebooting the system, put the device back into a state where it reports Count=22, then try the secondary bus reset above to see if it returns the device to Count=16.

If this is a better reset method for this device, we can implement a device-specific reset in the kernel that does this rather than an FLR.
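
For context, a rough sketch (illustrative only, not the eventual patch) of the kernel-side equivalent of that setpci sequence, using the standard Bridge Control register bit:

```
#include <linux/delay.h>
#include <linux/pci.h>

/* Sketch of a secondary bus reset via the parent bridge, mirroring the
 * setpci BRIDGE_CONTROL=40:40 ... =0:40 sequence above.  The in-kernel
 * helpers handle the required timing more carefully.
 */
static void sketch_secondary_bus_reset(struct pci_dev *bridge)
{
	u16 ctrl;

	pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL,
			      ctrl | PCI_BRIDGE_CTL_BUS_RESET);
	msleep(2);		/* hold reset briefly     */
	pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);
	msleep(1000);		/* let the device recover */
}
```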
Comment 11 Alex 2018-12-28 18:18:25 UTC
Rebound the device (to vfio-pci and back to nvme) and reset it.
root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
0000:06:00.0
root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id
8086 f1a6
root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
0000:06:00.0
root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/nvme/new_id
8086 f1a6
root@localhost /home/alex # echo 1 | tee /sys/bus/pci/devices/0000:06:00.0/reset
1

At this point I got Count=22 as expected, with nvme in use.
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
...
	Capabilities: [b0] MSI-X: Enable+ Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
...
	Kernel driver in use: nvme

root@localhost /home/alex # lspci -tvv | grep SSD  
             +-01.1-[06]----00.0  Intel Corporation SSD Pro 7600p/760p/E 6100p Series

root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL 
0010

root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL=40:40 && sleep 0.1 && setpci -s 00:01.1 BRIDGE_CONTROL=0:40 && sleep 1

This time lspci reports Count=16.
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100


Rebooted. Rebound to vfio-pci.
root@localhost /home/alex # echo 0000:06:00.0 | tee /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
0000:06:00.0
root@localhost /home/alex # echo 8086 f1a6 | tee /sys/bus/pci/drivers/vfio-pci/new_id
8086 f1a6

At this point before bus reset lspci reports Count=16 as expected

root@localhost /home/alex # setpci -s 00:01.1 BRIDGE_CONTROL=40:40 && sleep 0.1 && setpci -s 00:01.1 BRIDGE_CONTROL=0:40 && sleep 1

Now lspci still reports Count=16 with vfio-pci in use.
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
...
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
...
	Kernel driver in use: vfio-pci
Comment 12 Alex Williamson 2018-12-28 21:27:49 UTC
Created attachment 280179 [details]
Prefer secondary bus reset over FLR

Please test the attached patch against a recent Linux kernel tree on the host.  This will cause the reset function interface to prefer a secondary bus reset over FLR for this device, which should resolve both the 'echo 1 > reset' failure and the assignment to QEMU using vfio-pci.  If you prefer a different Reported-by/Tested-by attribution in the patch, please let me know and I'll correct it before posting upstream, assuming this works.
Comment 13 Dongli Zhang 2018-12-29 11:40:16 UTC
The following errors are hit with the above patch:

./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0
WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest


# dmesg
[  124.940551] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  124.940557] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  124.940561] pcieport 0000:00:1b.0:   device [8086:a2e7] error status/mask=00200000/00010000
[  124.940563] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[  125.920253] pcieport 0000:00:1b.0: AER: Device recovery successful
[  125.920261] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  125.920277] pcieport 0000:00:1b.0: can't find device of ID00d8
[  125.920386] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x158
[  125.920394] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x180
[  126.010862] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  126.010877] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  126.010914] pcieport 0000:00:1b.0:   device [8086:a2e7] error status/mask=00200000/00010000
[  126.010923] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[  127.008662] pcieport 0000:00:1b.0: AER: Device recovery successful
[  127.008671] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  127.008682] pcieport 0000:00:1b.0: can't find device of ID00d8
[  150.603263] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  150.603270] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[  150.603274] pcieport 0000:00:1b.0:   device [8086:a2e7] error status/mask=00200000/00010000
[  150.603277] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[  151.598132] pcieport 0000:00:1b.0: AER: Device recovery successful
[  151.598139] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[  151.598146] pcieport 0000:00:1b.0: can't find device of ID00d8


Although the above errors are encountered, the MSI-X count is 16.

# lspci -s 01:00.0 -vv
	Capabilities: [b0] MSI-X: Enable- Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100

Dongli Zhang
Comment 14 Alex Williamson 2018-12-29 17:35:13 UTC
Hi Dongli, you're getting an ACS violation, I wonder if it's related to an issue your colleague resolved recently:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aa667c6408d20a84c7637420bc3b7aa0abab59a2

Is there an IDT switch in your topology or is the NVMe drive connected directly to the Intel root port?  If the former, perhaps James' patch doesn't account for the invalid source ID propagating upstream.  If directly connected to the Intel root port, perhaps IDT isn't the only downstream port with the issue.

You could try disabling Source Validation on the root port via setpci to see if we're dealing with a similar issue:

# setpci -s 1b.0 ECAP_ACS+6.w=0:1

However, you're using an Intel system with a non-standard (aka broken) ACS capability, therefore the ACS capability and control registers are actually dwords, so I think the correct command would be:

# setpci -s 1b.0 ECAP_ACS+8.l=0:1

Also you won't be able to trust lspci for decoding of the ACS capability.
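
For reference, a small sketch (illustrative only) of what disabling Source Validation looks like on a port with a spec-compliant ACS capability, i.e. the standard layout behind the first setpci command above:

```
#include <linux/pci.h>

/* Sketch: clear ACS Source Validation on a downstream port that follows
 * the spec layout (control word at cap + PCI_ACS_CTRL).  As noted above,
 * this Intel root port uses dword-sized registers instead, which is why
 * the setpci offset becomes +8 there.
 */
static void sketch_disable_acs_sv(struct pci_dev *port)
{
	u16 ctrl;
	int pos = pci_find_ext_capability(port, PCI_EXT_CAP_ID_ACS);

	if (!pos)
		return;
	pci_read_config_word(port, pos + PCI_ACS_CTRL, &ctrl);
	pci_write_config_word(port, pos + PCI_ACS_CTRL, ctrl & ~PCI_ACS_SV);
}
```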
Comment 15 Alex 2018-12-30 15:14:02 UTC
The patch from above works just fine for me.

I was able to pass the device through to Linux and Windows guests.

Here is lspci from the host.
06:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 2-1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 42
	NUMA node: 0
	Region 0: Memory at fba00000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [158 v1] Secondary PCI Express <?>
	Capabilities: [178 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [180 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: vfio-pci

And from a guest.
00:07.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 7
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 11
	NUMA node: 0
	Region 0: Memory at fc074000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100
	Kernel driver in use: nvme
Comment 16 Alex 2018-12-30 15:39:18 UTC
The device still does not bind to vfio-pci on boot, though, and needs to be rebound manually.

With the following .conf file:
options vfio-pci ids=8086:f1a6
softdep nvme pre: vfio vfio-pci
Comment 17 Alex Williamson 2019-01-02 02:52:28 UTC
Created attachment 280237 [details]
Prefer secondary bus reset over FLR

Includes the native Silicon Motion PCI ID, as used on the ADATA XPG SX8200 and hopefully others.
Comment 18 Alex 2019-01-02 14:10:07 UTC
Created attachment 280239 [details]
trace

Went back through the logs and found the following trace.
I was not able to reproduce it again.

Besides this, nvme reports the following every time I start the guest:
nvme nvme0: failed to set APST feature (-19)
Comment 19 Dongli Zhang 2019-01-02 15:21:05 UTC
(In reply to Alex Williamson from comment #14)
> Hi Dongli, you're getting an ACS violation, I wonder if it's related to an
> issue your colleague resolved recently:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=aa667c6408d20a84c7637420bc3b7aa0abab59a2
> 
> Is there an IDT switch in your topology or is the NVMe drive connected
> directly to the Intel root port?  If the former, perhaps James' patch
> doesn't account for the invalid source ID propagating upstream.  If directly
> connected to the Intel root port, perhaps IDT isn't the only downstream port
> with the issue.
> 
> You could try disabling Source Validation on the root port via setpci to see
> if we're dealing with a similar issue:
> 
> # setpci -s 1b.0 ECAP_ACS+6.w=0:1
> 
> However, you're using an Intel system with a non-standard (aka broken) ACS
> capability, therefore the ACS capability and control registers are actually
> dwords, so I think the correct command would be:
> 
> # setpci -s 1b.0 ECAP_ACS+8.l=0:1
> 
> Also you won't be able to trust lspci for decoding of the ACS capability.

Hi Alex,

The kernel I use is the most recent upstream version including commit aa667c6408d20a84c7637420bc3b7aa0abab59a2.

Is there a way to know if an IDT switch is in the topology?

The environment is a Dell desktop I use at home for my own debugging.

# lspci
00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
00:14.0 USB controller: Intel Corporation Device a2af
00:14.2 Signal processing controller: Intel Corporation Device a2b1
00:16.0 Communication controller: Intel Corporation Device a2ba
00:17.0 SATA controller: Intel Corporation Device a282
00:1b.0 PCI bridge: Intel Corporation Device a2e7 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Device a298 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Device a2c6
00:1f.2 Memory controller: Intel Corporation Device a2a1
00:1f.3 Audio device: Intel Corporation Device a2f0
00:1f.4 SMBus: Intel Corporation Device a2a3
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (5) I219-V
01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03)
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

Dongli Zhang
Comment 20 Alex Williamson 2019-01-15 20:54:48 UTC
Hi Dongli,

(In reply to Dongli Zhang from comment #19)
> 
> The kernel I use is the most recent upstream version including commit
> aa667c6408d20a84c7637420bc3b7aa0abab59a2.
> 
> Is there a way to know if IDT switch is in the topology?

No IDT switch in this system, so you shouldn't have that issue.

> The env is an dell desktop I use at home to debug program myself.
> 
> # lspci
> 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> 00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
> 00:14.0 USB controller: Intel Corporation Device a2af
> 00:14.2 Signal processing controller: Intel Corporation Device a2b1
> 00:16.0 Communication controller: Intel Corporation Device a2ba
> 00:17.0 SATA controller: Intel Corporation Device a282
> 00:1b.0 PCI bridge: Intel Corporation Device a2e7 (rev f0)
> 00:1d.0 PCI bridge: Intel Corporation Device a298 (rev f0)
> 00:1f.0 ISA bridge: Intel Corporation Device a2c6
> 00:1f.2 Memory controller: Intel Corporation Device a2a1
> 00:1f.3 Audio device: Intel Corporation Device a2f0
> 00:1f.4 SMBus: Intel Corporation Device a2a3
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (5) I219-V
> 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev
> 03)
> 02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
> Connection (rev 01)
> 02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
> Connection (rev 01)

I bought an ADATA XPG SX8200 drive to debug further.  In some systems it works fine with the attached patch, but in another I think I'm getting something similar to what you see.  My system has Downstream Port Containment (DPC) support, so I think that catches the error before AER, but if I disable ACS Source Validation on the root port it avoids any errors, so I think we're still dealing with the ACS violation that you see.

A clue though is that triggering the bus reset via setpci as in comment 10 does not trigger the fault.  I then stumbled on the fact that adding a delay in the kernel code path prior to the bus reset avoids the issue.  Long story short, could you try adding a delay to the previous patch, for example making the new function in drivers/pci/quirks.c look like this:

static int prefer_bus_reset(struct pci_dev *dev, int probe)
{
       msleep(100);
       return pci_parent_bus_reset(dev, probe);
}

I look forward to seeing if this works around the AER fault in your system as well.
Comment 21 Alex Williamson 2019-01-15 22:30:20 UTC
Actually, msleep(100) may be a few orders of magnitude longer than we need.  I continue to see errors with udelay(10), but it seems to work perfectly with udelay(100).  Dongli, please test the above using udelay(100) rather than msleep(100).  Thanks
Comment 22 Alex Williamson 2019-01-16 06:48:58 UTC
The delay in comment 20 allows the device to reset when it's already quiesced, but after the VM makes use of the device I'm finding that it will still trigger the fault.  I've got another version to test that follows the path of the Samsung nvme quirk and disables the nvme controller before performing a reset.  Coupled with the delay, this seems to address both the previously active and previously idle reset cases.  I'll attach a new patch implementing this for testing.
Comment 23 Dongli Zhang 2019-01-16 14:31:36 UTC
(In reply to Alex Williamson from comment #21)
> Actually, msleep(100) may be a few orders of magnitude longer than we need,
> I continue to see errors with udelay(10), but it seems to work perfectly
> with udelay(100).  Dongli, please test the above using udelay(100) rather
> than msleep(100).  Thanks

Hi Alex,

While waiting for the patch mentioned by Comment 22, I have tested the below by adding udelay(100):

static int prefer_bus_reset(struct pci_dev *dev, int probe)
{
        udelay(100);
        return pci_parent_bus_reset(dev, probe);
}

I got the below error again:

QEMU:

# ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0
WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest



KERNEL:

[   69.715224] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[   69.715230] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   69.715234] pcieport 0000:00:1b.0:   device [8086:a2e7] error status/mask=00200000/00010000
[   69.715236] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[   70.742423] pcieport 0000:00:1b.0: AER: Device recovery successful
[   70.742430] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[   70.742442] pcieport 0000:00:1b.0: can't find device of ID00d8
[   70.742554] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x158
[   70.742562] vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x180
[   70.834427] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[   70.834440] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[   70.834448] pcieport 0000:00:1b.0:   device [8086:a2e7] error status/mask=00200000/00010000
[   70.834453] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
[   71.822627] pcieport 0000:00:1b.0: AER: Device recovery successful
[   71.822634] pcieport 0000:00:1b.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1b.0
[   71.822645] pcieport 0000:00:1b.0: can't find device of ID00d8


Dongli Zhang
Comment 24 Alex Williamson 2019-01-16 15:57:42 UTC
Created attachment 280535 [details]
Test patch, NVMe shutdown + delay to avoid ACS violation

Here's the patch for testing; it avoids all the ACS violation faults on my system with the ADATA XPG SX8200.  Please test.
Comment 25 Dongli Zhang 2019-01-17 13:01:55 UTC
Created attachment 280555 [details]
linux config
Comment 26 Dongli Zhang 2019-01-17 13:02:15 UTC
Hi Alex,

The patch does not work for me :(

Here is how I reproduce the issue. The attached file is my kernel config.

qemu commit: 6f2f34177a25bffd6fd92a05e6e66c8d22d97094

linux commit: 1c7fc5cbc33980acd13d668f1c8f0313d6ae9fd8

To build qemu:

# ./configure --target-list=x86_64-softmmu
# make -j8 > /dev/null

To build linux:

use the attached config

# make -j8 > /dev/null


To reproduce, boot into the Linux kernel.  I always run qemu from where it is built; I do not run "make install" for qemu.

# modprobe vfio
# modprobe vfio-pci

# echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
# echo "8086 f1a6" > /sys/bus/pci/drivers/vfio-pci/new_id

# ./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0


# ./x86_64-softmmu/qemu-system-x86_64  -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0
WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
qemu-system-x86_64: vfio_err_notifier_handler(0000:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

Dongli Zhang
Comment 27 Alex Williamson 2019-01-19 17:01:10 UTC
I don't know what more I can do here; I've since tested the ADATA XPG SX8200 in an Intel laptop with a 200-series chipset and it behaves just fine with the latest patch.  It's possible the additional issues are unique to the Intel 760p implementation of the SM2262 or only exposed in configurations similar to yours.  I'm out of options to investigate further.  You could potentially boot with pci=noaer to disable Advanced Error Reporting in your configuration, but that's never a good long term solution.
Comment 28 Dongli Zhang 2019-01-20 11:06:07 UTC
(In reply to Alex Williamson from comment #27)
> I don't know what more I can do here, I've since tested the ADATA XPG SX8200
> in an Intel laptop with 200-series chipset and it behaves just fine with the
> latest patch.  It's possible the additional issues are unique to the Intel
> 760p implementation of the SM2262 or only exposed in configurations similar
> to yours.  I'm out of options to investigate further.  You could potentially
> boot with pci=noaer to disable Advanced Error Reporting in your
> configuration, but that's never a good long term solution.

Hi Alex,

Thank you very much for the help.

Perhaps it is only specific to this hardware or to my machine. Perhaps I should upgrade the firmware.

I will try to debug it a little in my spare time.

So far, disabling AER in grub lets the guest VM boot successfully.

With the patch, the MSI-X entry count is not 22 any more.

Dongli Zhang
Comment 29 LimeTech 2019-01-24 18:02:14 UTC
Hi Alex,

The "Prefer secondary bus reset over FLR" patch works for devices you added in pci_dev_reset_methods[].  Will this patch work correctly for a SM2263 controller as well?  One such device (Crucial P1 CT500P1SSD8) has PCI ID [c0a9:2263], just a matter of adding this ID?


Also, should we be using the "Test patch, NVMe shutdown + delay to avoid ACS violation" patch instead?

thanks,
Tom
Comment 30 Alex Williamson 2019-01-24 19:16:43 UTC
(In reply to LimeTech from comment #29)
> Hi Alex,
> 
> The "Prefer secondary bus reset over FLR" patch works for devices you added
> in pci_dev_reset_methods[].  Will this patch work correctly for a SM2263
> controller as well?  One such device (Crucial P1 CT500P1SSD8) has PCI ID
> [c0a9:2263], just a matter of adding this ID?
> 
> 
> Also, should we be using the "Test patch, NVMe shutdown + delay to avoid ACS
> violation" patch instead?

Hi Tom,

The second patch is intended as a replacement for the original; it at least enables the ADATA drive on a server where the first patch did not, even if that turned out not to be exactly the same issue as Dongli experiences.  To add the SM2263 just add a new ID, ex:

 { 0xc0a9, 0x2263, sm2262_reset },

Add it to the code where the last chunk of the patch includes the known SM2262 variants, in the pci_dev_reset_methods array.  Please report back the results. 
It's really unfortunate that there's such a fundamental bug in a whole family of controllers that's getting rebranded with different PCI IDs by so many vendors.  Thanks,
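
Concretely, the new entry slots into that array roughly like this (sketch only; the surrounding entries from the attached patch are elided, and the exact context may differ):

```
static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
	/* ... existing entries, including the SM2262 variants from the patch ... */
	{ 0xc0a9, 0x2263, sm2262_reset },	/* Crucial P1 CT500P1SSD8 (SM2263) */
	{ 0 }
};
```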

Alex
Comment 31 LimeTech 2019-01-24 21:10:17 UTC
(In reply to Alex Williamson from comment #30)
> (In reply to LimeTech from comment #29)
> > Hi Alex,
> > 
> > The "Prefer secondary bus reset over FLR" patch works for devices you added
> > in pci_dev_reset_methods[].  Will this patch work correctly for a SM2263
> > controller as well?  One such device (Crucial P1 CT500P1SSD8) has PCI ID
> > [c0a9:2263], just a matter of adding this ID?
> > 
> > 
> > Also, should we be using the "Test patch, NVMe shutdown + delay to avoid
> ACS
> > violation" patch instead?
> 
> Hi Tom,
> 
> The second patch is intended to be a replacement of the original, it at
> least enables the ADATA drive on a server where the first patch did not,
> even if that turned out to be not exactly the same issue as Dongli
> experiences.  To add the SM2263 just add a new ID, ex:
> 
>  { 0xc0a9, 0x2263, sm2262_reset },
> 
> Add it to the code where the last chunk of the patch includes the known
> SM2262 variants, in the pci_dev_reset_methods array.  Please report back the
> results. 
> It's really unfortunate that there's such a fundamental bug in a whole
> family of controllers that's getting rebranded with different PCI IDs by so
> many vendors.  Thanks,
> 
> Alex

Thank you, applied patch, will report back.
Comment 32 LimeTech 2019-01-25 18:16:02 UTC
(In reply to LimeTech from comment #31)
> (In reply to Alex Williamson from comment #30)
> > (In reply to LimeTech from comment #29)
> > > Hi Alex,
> > > 
> > > The "Prefer secondary bus reset over FLR" patch works for devices you
> added
> > > in pci_dev_reset_methods[].  Will this patch work correctly for a SM2263
> > > controller as well?  One such device (Crucial P1 CT500P1SSD8) has PCI ID
> > > [c0a9:2263], just a matter of adding this ID?
> > > 
> > > 
> > > Also, should we be using the "Test patch, NVMe shutdown + delay to avoid
> > ACS
> > > violation" patch instead?
> > 
> > Hi Tom,
> > 
> > The second patch is intended to be a replacement of the original, it at
> > least enables the ADATA drive on a server where the first patch did not,
> > even if that turned out to be not exactly the same issue as Dongli
> > experiences.  To add the SM2263 just add a new ID, ex:
> > 
> >  { 0xc0a9, 0x2263, sm2262_reset },
> > 
> > Add it to the code where the last chunk of the patch includes the known
> > SM2262 variants, in the pci_dev_reset_methods array.  Please report back
> the
> > results. 
> > It's really unfortunate that there's such a fundamental bug in a whole
> > family of controllers that's getting rebranded with different PCI IDs by so
> > many vendors.  Thanks,
> > 
> > Alex
> 
> Thank you, applied patch, will report back.

The report is that the patch solved the problem with the Crucial P1 using the SM2263 controller, and also passthrough works perfectly now.

thanks
Tom
Comment 33 Alex Williamson 2019-02-01 18:23:09 UTC
Created attachment 280913 [details]
NVMe subsystem reset with ACS masking

Dongli, I'd appreciate testing of this patch series.  The differences from the previous version are:

1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies some of the hotplug slot code from the previous version
2) Mask ACS Source Validation around reset, this eliminates some of the magic voodoo that avoided the fault on my system, but not yours

This exploded into a several-patch series to simplify the ACS masking, but it should still apply easily.  Testing by others is obviously welcome as well.  Thanks
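
For context, a minimal sketch of the NVMe-spec mechanism (illustrative only, not code from the attached series): an NVM subsystem reset amounts to writing the magic value 0x4E564D65 ("NVMe") to the NSSR register at offset 0x20 of the controller's BAR 0, and only takes effect when the controller advertises CAP.NSSRS.

```
#include <linux/io.h>
#include <linux/pci.h>

#define NVME_REG_NSSR	0x20		/* NVM Subsystem Reset register */
#define NVME_NSSR_MAGIC	0x4e564d65	/* "NVMe"                       */

/* Sketch: trigger an NVM subsystem reset through BAR 0.  Real code would
 * first check CAP.NSSRS and handle the ACS masking this series adds.
 */
static int sketch_nvme_subsystem_reset(struct pci_dev *pdev)
{
	void __iomem *bar = pci_iomap(pdev, 0, 0);

	if (!bar)
		return -ENOMEM;
	writel(NVME_NSSR_MAGIC, bar + NVME_REG_NSSR);
	pci_iounmap(pdev, bar);
	return 0;
}
```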
Comment 34 LimeTech 2019-02-01 20:02:11 UTC
(In reply to Alex Williamson from comment #33)
> Created attachment 280913 [details]
> NVMe subsystem reset with ACS masking
> 
> Dongli, I'd appreciate testing of this patch series.  The differences from
> the previous version are:
> 
> 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies
> some of the hotplug slot code from the previous version
> 2) Mask ACS Source Validation around reset, this eliminates some of the
> magic voodoo that avoided the fault on my system, but not yours
> 
> This exploded into a several patch series to simplify the ACS masking, but
> it should still apply easily.  Testing by others obviously welcome as well. 
> Thanks

A user is reporting a flood of syslog messages as a result of running fstrim on one of these devices:

02:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:2262] (rev 03)
	Subsystem: Silicon Motion, Inc. Device [126f:2262]
	Kernel driver in use: nvme
	Kernel modules: nvme

Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr ef321000 [fault reason 06] PTE Read access is not set
Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr f0a19000 [fault reason 06] PTE Read access is not set
Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0] fault addr efe93000 [fault reason 06] PTE Read access is not set
Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
Jan 27 07:00:17 unRAID kernel: dmar_fault: 77 callbacks suppressed

Do you think your latest patch might fix this?
Comment 35 Alex Williamson 2019-02-01 20:18:44 UTC
(In reply to LimeTech from comment #34)
> 
> A user is reporting a flood of syslog messages as a result of running fstrim
> on one of these devices:
> 
> 02:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device
> [126f:2262] (rev 03)
>       Subsystem: Silicon Motion, Inc. Device [126f:2262]
>       Kernel driver in use: nvme
>       Kernel modules: nvme
> 
> Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
> Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0]
> fault addr ef321000 [fault reason 06] PTE Read access is not set
> Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
> Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0]
> fault addr f0a19000 [fault reason 06] PTE Read access is not set
> Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
> Jan 27 07:00:11 unRAID kernel: DMAR: [DMA Read] Request device [02:00.0]
> fault addr efe93000 [fault reason 06] PTE Read access is not set
> Jan 27 07:00:11 unRAID kernel: DMAR: DRHD: handling fault status reg 3
> Jan 27 07:00:17 unRAID kernel: dmar_fault: 77 callbacks suppressed
> 
> Do you think your latest patch might fix this?

Not likely.  Gosh, how many ways can these devices be broken?  Was this while the device was in use by the host, or within a guest?  Those faults indicate the device is trying to do a DMA read from an IOVA it doesn't have mapped through the IOMMU.  Based on the addresses, I'd guess this is not a VM use case.  Either way, it's not the issue this bug is tracking.
Comment 36 Dongli Zhang 2019-02-05 09:40:10 UTC
(In reply to Alex Williamson from comment #33)
> Created attachment 280913 [details]
> NVMe subsystem reset with ACS masking
> 
> Dongli, I'd appreciate testing of this patch series.  The differences from
> the previous version are:
> 
> 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies
> some of the hotplug slot code from the previous version
> 2) Mask ACS Source Validation around reset, this eliminates some of the
> magic voodoo that avoided the fault on my system, but not yours
> 
> This exploded into a several patch series to simplify the ACS masking, but
> it should still apply easily.  Testing by others obviously welcome as well. 
> Thanks

Hi Alex,

I am on vacation and cannot access the test machine with the affected NVMe device.

I will test it next week. Thank you very much for creating the patch.

Dongli Zhang
Comment 37 LimeTech 2019-02-09 20:32:41 UTC
Added another PCI ID to quirks.c (2019-01-16):

+       { 0x126f, 0x2263, sm2262_reset },

Also, your latest patch (2019-02-01) will not apply against the 4.19 kernel.  (The 2019-01-16 patch doesn't either, but that's easy to fix.)  What kernel should this be applied to?

-Tom
Comment 38 Alex Williamson 2019-02-09 20:49:27 UTC
(In reply to LimeTech from comment #37)
> Added another PCI ID to quirks.c (2019-01-16):
> 
> +       { 0x126f, 0x2263, sm2262_reset },
> 
> Also your latest patch (2019-02-01) will not apply against 4.19 kernel. 
> (The 2019-01-16 patch doesn't either but that's easy to fix).  What kernel
> should this be applied to?

Added.  It's against v4.20.  Thanks.
Comment 39 Dongli Zhang 2019-02-12 12:36:01 UTC
(In reply to Alex Williamson from comment #33)
> Created attachment 280913 [details]
> NVMe subsystem reset with ACS masking
> 
> Dongli, I'd appreciate testing of this patch series.  The differences from
> the previous version are:
> 
> 1) Use NVMe subsystem reset rather than secondary bus reset, this simplifies
> some of the hotplug slot code from the previous version
> 2) Mask ACS Source Validation around reset, this eliminates some of the
> magic voodoo that avoided the fault on my system, but not yours
> 
> This exploded into a several patch series to simplify the ACS masking, but
> it should still apply easily.  Testing by others obviously welcome as well. 
> Thanks

Hi Alex,

I have tested the 5-patch series from attachment 280913 (as below). Unfortunately, I encountered the initial problem again; that is, the MSI-X count changed from 16 to 22. There is no AER message this time.

https://bugzilla.kernel.org/attachment.cgi?id=280913

./x86_64-softmmu/qemu-system-x86_64 -hda /home/zhang/img/ubuntu/disk.img -smp 2 -m 2000M -enable-kvm -vnc :0 -device vfio-pci,host=0000:01:00.0
WARNING: Image format was not specified for '/home/zhang/img/ubuntu/disk.img' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: vfio error: 0000:01:00.0: failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align


The MSI-X count changed from 16 to 22 again.

01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation Device 390b
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 16
         ... ...
	Capabilities: [b0] MSI-X: Enable- Count=22 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00002100



Dongli Zhang
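
If it helps to double-check, here is a small sketch for reading the advertised MSI-X table size straight from config space, so the 16 -> 22 change can be observed independently of lspci.  It assumes the MSI-X capability sits at offset 0xb0 as in the lspci output above; setpci needs root.

```
# Sketch: read the MSI-X Message Control register (cap offset + 2) and
# decode the table size (low 11 bits hold "table size - 1").
BDF=0000:01:00.0
ctrl=$(setpci -s "$BDF" b2.w)
echo "MSI-X table size: $(( (0x$ctrl & 0x7ff) + 1 ))"

# Compare with lspci's decoding (Count=16 when healthy, Count=22 after the
# bad reset on the affected controllers):
lspci -vvv -s "$BDF" | grep 'MSI-X'
```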
Comment 40 Alex Williamson 2019-02-12 17:02:09 UTC
Created attachment 281113 [details]
Debug patch

Dongli, if we're unable to perform the NVMe subsystem reset, we fall back to other resets, including the known-bad FLR, which seems like what might be happening here.  Could you please apply this patch on top of the previous series to add some debugging showing where the detection is failing?  I can only guess this might mean your device does not support an NVMe subsystem reset, but I can't imagine why the Intel variant would remove this while the ADATA version has it.  Ugh.  Thanks
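
As a quick sanity check from userspace (a sketch, not part of the patch): nvme-cli can decode the controller's CAP register, which includes the NSSRS bit the subsystem-reset path depends on.  This assumes the device is bound to the host nvme driver at the time.

```
# Check whether the controller advertises NVMe Subsystem Reset support
# (CAP.NSSRS).  Requires nvme-cli and root; the exact label text may vary
# between nvme-cli versions.
nvme show-regs -H /dev/nvme0 | grep -i nssrs
# If NSSRS reads as not supported, the subsystem reset in the patch cannot
# be used and the kernel falls back to other reset methods (including FLR),
# matching the behaviour described above.
```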
Comment 41 Alex 2019-02-16 22:19:24 UTC
Created attachment 281173 [details]
trace

The NVMe subsystem reset patch does not quite work for me either.

MSI-X stays at 16, but the device appears on the guest side as:
 
04:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03) (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:390b]
	Physical Slot: 0-2
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin ? routed to IRQ 20
	NUMA node: 0
	Region 0: Memory at 98000000 (64-bit, non-prefetchable) [size=16K]
Comment 42 Alex Williamson 2019-04-08 04:16:10 UTC
It seems there's a partial workaround available since QEMU v2.12, hiding under our noses.  That version adds support for relocating the MSI-X vector table on vfio-pci devices, which recreates the MSI-X MMIO space elsewhere on the device.  A side effect of this is that the vector table and PBA are properly sized so as not to collide.  The size of the tables remains wrong, but this only becomes a problem if the nvme code attempts to allocate >16 vectors, which requires >15 vCPUs (or host CPUs, if the device is returned to host drivers after being assigned); nvme appears to allocate 1 admin queue plus a queue per CPU, each making use of an IRQ vector.

The QEMU vfio-pci device option is x-msix-relocation=, which allows specifying the BAR to use for the MSI-X tables, e.g. bar0...bar5.  Since this device uses a 64-bit bar0, we can either extend that BAR or choose another, excluding bar1, which is consumed by the upper half of bar0.  For instance, I tested with:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
...
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='ua-sm2262'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </hostdev>
...
  <qemu:commandline>
    <qemu:arg value='-set'/>
    <qemu:arg value='device.ua-sm2262.x-msix-relocation=bar2'/>
  </qemu:commandline>


(NB: "ua-" is a required prefix when specifying an alias)

A new virtual BAR appears in the guest hosting the MSI-X table and QEMU starts normally so long as the guest doesn't exceed 15 vCPUs.

The vCPU/pCPU count limitations are obviously not ideal, but hopefully this provides some degree of workaround for typical configurations.
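
For anyone running QEMU directly rather than through libvirt, the same property can be set on the vfio-pci device itself.  A minimal sketch; the host BDF, disk image, and the rest of the VM definition are placeholders:

```
# Same workaround on a plain QEMU command line (sketch).  Keep the guest at
# 15 vCPUs or fewer so the nvme driver never needs more than 16 vectors;
# bar2 is a free BAR on this device (bar0/bar1 are taken by the 64-bit BAR0).
qemu-system-x86_64 \
    -enable-kvm -machine q35 -smp 8 -m 8G \
    -drive file=/path/to/guest.img,format=raw \
    -device vfio-pci,host=0000:01:00.0,x-msix-relocation=bar2
```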
Comment 43 LimeTech 2019-04-09 14:43:02 UTC
(In reply to Alex Williamson from comment #42)
> It seems there's a partial workaround available since QEMU v2.12 hiding
> under our noses.  That version adds support for relocating the MSI-X vector
> table on vfio-pci devices, which recreates the MSI-X MMIO space elsewhere on
> the device.  A side-effect of this is that the vector table and PBA are
> properly sized so as not to collide.  The size of the tables remains wrong,
> but this only becomes a problem if the nvme code attempts to allocate >16
> vectors, which requires >15 vCPU (or host CPUs if the device is returned to
> host drivers after being assigned)(nvme appears to allocate 1 admin queue,
> plus a queue per CPU, each making use of an IRQ vector).  The QEMU vfio-pci
> device option is x-msix-relocation= which allows specifying the bar to use
> for the MSI-X tables, ex. bar0...bar5.  Since this device uses a 64bit bar0,
> we can either extend that BAR or choose another, excluding bar1, which is
> consumed by the upper half of bar0.  For instance, I tested with:
> 
> <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> ...
>     <hostdev mode='subsystem' type='pci' managed='yes'>
>       <source>
>         <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>       <alias name='ua-sm2262'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00'
> function='0x0'/>
>     </hostdev>
> ...
>   <qemu:commandline>
>     <qemu:arg value='-set'/>
>     <qemu:arg value='device.ua-sm2262.x-msix-relocation=bar2'/>
>   </qemu:commandline>
> 
> 
> (NB: "ua-" is a required prefix when specifying an alias)
> 
> A new virtual BAR appears in the guest hosting the MSI-X table and QEMU
> starts normally so long as the guest doesn't exceed 15 vCPUs.
> 
> The vCPU/pCPU count limitations are obviously not ideal, but hopefully this
> provides some degree of workaround for typical configurations.

Hi Alex, are you saying the above configuration in the VM XML is all that is necessary (with the noted vCPU/pCPU count limitations) to successfully pass through SM2262-based controllers, without the above kernel patch?  Or is a kernel patch also necessary (if so, which one)?

thx,
-tom
Comment 44 Alex Williamson 2019-04-09 14:57:46 UTC
(In reply to LimeTech from comment #43)
> 
> Hi Alex, are you saying the above coding in the VM xml is all that is
> necessary (with noted vCPU/pCPU count limitations) to successfully
> pass-through sm2262-based controllers, without above kernel patch?  Or is a
> kernel patch also necessary (if so, which one)?

Yes, without a kernel patch.
Comment 45 Nick P 2019-08-03 09:35:07 UTC
Sorry if this is a dumb question, but I think I am experiencing exactly this issue. However, I cannot work out how to apply the patch. Can anyone explain it or point me in the direction of how to apply it?
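
For anyone in the same position, here is a rough sketch of one way to apply the attached series against a v4.20 tree and rebuild.  Paths, the config handling, and the exact apply command are placeholders and depend on the distribution and on whether the attachment is a plain diff or an mbox; the attachment ID is the one referenced earlier in this bug.

```
# Fetch a v4.20 tree, apply the attached series, and rebuild (sketch only).
git clone --branch v4.20 --depth 1 \
    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
wget -O sm2262.patch 'https://bugzilla.kernel.org/attachment.cgi?id=280913'
patch -p1 < sm2262.patch              # or "git am" if the attachment is an mbox
cp /boot/config-"$(uname -r)" .config # start from the running kernel's config
make olddefconfig
make -j"$(nproc)" && sudo make modules_install install
```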
Comment 46 FCL 2020-07-28 15:41:10 UTC
Hey gents!

Trying to apply this to a Proxmox VM; any advice on how to do so? I'm running on top of the 5.4.x Linux kernel. I'm experiencing the same bug as in the topic name, using the SMI SM2263 controller (ADATA, as above).

Is there anything I can provide in terms of debugging data? Otherwise, any advice on how to use the above non-kernel workaround? I'm also comfortable doing a kernel recompile if that's the better solution. The device will be passed to a FreeBSD (FreeNAS 11.3) VM with 4 threads and 72 GB of RAM alongside an LSI SAS controller.

The platform is a dual Xeon 5670 on an HP ML350 G6 motherboard.

Also attached, but to a different VM (a Windows VM), are a Radeon 5600 XT, an Intel USB controller, and an Intel NIC.
Comment 47 FCL 2020-07-29 14:34:36 UTC
(In reply to FCL from comment #46)
> Hey gents! 
> 
> Trying to apply this to a proxmox VM, any advice on how to do so? I'm
> running on top of the 5.4.x linux kernel. I'm experiencing the same bug as
> in topic name, using the SMI 2263 controller. (Adata as above)
> 
> Anything I can provide in terms of debugging data? Otherwise, any advice on
> how to use the above non kernel patch? also comfortable doing a kernel
> recompile if thats the better solution. Will be going to a FreeBSD (FreeNAS
> 11.3) VM with 4 threads and 72GB of ram alongside an LSI SAS controller. 
> 
> Platform is a Dual Xeon 5670 on a HP ml350 G6 mother board. 
> 
> Also attached but on a different VM is a Radeon 5600XT, an intel USB
> controller and an intel Nic to a windows VM

Found the solution without the need for a kernel recompile. In my QEMU conf file I added:
args: -set device.hostpci1.x-msix-relocation=bar2

where hostpci1 is:
hostpci1: 08:00.0
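
For completeness, a sketch of how those two lines sit together in the Proxmox VM config (conventionally /etc/pve/qemu-server/<vmid>.conf); the device address is the one from the comment above and everything else is a placeholder:

```
# /etc/pve/qemu-server/<vmid>.conf (excerpt, sketch only)
# hostpci1 must refer to the SM226x NVMe controller; keep the guest at
# 15 vCPUs or fewer, as noted earlier in this bug.
hostpci1: 08:00.0
args: -set device.hostpci1.x-msix-relocation=bar2
```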
Comment 48 plantroon 2020-12-29 01:00:42 UTC
For the Intel SSD 660p series, the latest firmware is required for passthrough to work without issues. At the time of writing, the latest firmware is from mid-2020.

No workarounds or special settings for qemu/libvirt are needed; just a simple passthrough setup, as with any other PCIe device.
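
A quick way to confirm which firmware revision a drive is actually running before and after updating (a sketch assuming nvme-cli; the update itself is normally done with the vendor's own tool or firmware image):

```
# Check the running firmware revision and the firmware slot log.
nvme list                              # FW Rev column per controller
nvme id-ctrl /dev/nvme0 | grep -w fr   # firmware revision field only
nvme fw-log /dev/nvme0                 # revisions stored in each slot
```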