Bug 118941

Summary: PCI/ASPM: PCIe endpoint is randomly reset due to improper ASPM L0s setting
Product: Drivers
Reporter: Ocean He (hehy1)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: CLOSED INVALID
Severity: normal
CC: bjorn
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.6
Subsystem:
Regression: No
Bisected commit-id:

Description Ocean He 2016-05-25 12:35:06 UTC
In the test machine IBM System x3250 M5, Server RAID M5110(20:00.0) is connected to PCI bridge(00:01.0).

#lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller
           +-01.0-[20]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
           +-01.1-[10]--
           +-14.0  Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI
           +-1a.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2
		   .............................................

Following pcie_aspm_check_latency() and pcie_config_aspm_link(), the kernel ASPM driver enabled L0s on the PCI bridge while leaving L0s disabled on the M5110.
                                     PCI bridge   M5110
DevCap Acceptable L0s Exit Latency:  64ns         64ns
LnkCap L0s Exit Latency:             256ns        64ns
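
For reference, a minimal Python sketch of the latency comparison behind these numbers. It is not the kernel code: the helper names are made up for illustration, the 3-bit field encodings are inferred from the lspci output in this report, and it assumes the check is made against the acceptable latency advertised by the endpoint (M5110). It does not model how the kernel maps each direction onto the two Link Control registers.

```python
def l0s_exit_latency_ns(encoding):
    """Upper bound (ns) for a 3-bit LnkCap L0s Exit Latency field."""
    if encoding == 0b111:
        return 5000  # field means "more than 4 us"; 5000 is an arbitrary stand-in
    return 64 << encoding  # 0 -> 64 ns, 1 -> 128 ns, 2 -> 256 ns, ...


def l0s_acceptable_latency_ns(encoding):
    """Limit (ns) for a 3-bit DevCap Endpoint L0s Acceptable Latency field."""
    if encoding == 0b111:
        return float("inf")  # field means "no limit"
    return 64 << encoding


def exit_latency_acceptable(exit_enc, acceptable_enc):
    """True if the exit latency does not exceed the acceptable latency."""
    return l0s_exit_latency_ns(exit_enc) <= l0s_acceptable_latency_ns(acceptable_enc)


# Encodings inferred from the lspci output in this report (assumption).
BRIDGE_EXIT = 0b010       # LnkCap "<256ns" on 00:01.0
M5110_EXIT = 0b000        # LnkCap "<64ns" on 20:00.0
M5110_ACCEPTABLE = 0b000  # DevCap "L0s <64ns" on 20:00.0

bridge_ok = exit_latency_acceptable(BRIDGE_EXIT, M5110_ACCEPTABLE)
m5110_ok = exit_latency_acceptable(M5110_EXIT, M5110_ACCEPTABLE)
print("bridge exit latency acceptable:", bridge_ok)
print("M5110 exit latency acceptable:", m5110_ok)
```

With these values only the 256ns exit latency of the PCI bridge exceeds the 64ns limit, so L0s ends up allowed in one direction of the link and not the other.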

This causes the Server RAID M5110 to reset randomly, and dmesg shows:
megaraid_sas 0000:20:00.0: 2614 (517500295s/0x0020/CRIT) - Controller encountered a fatal error and was reset.

If L0s is disabled on both the PCI bridge and the M5110, the issue does not occur.
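
To confirm the L0s state on each end of the link, the LnkCtl line of `lspci -vv` can be scanned. The following is only a rough Python sketch: the helper names are hypothetical, and it assumes the "LnkCtl: ASPM ...;" format seen in the output in this report.

```python
import re

# Matches e.g. "LnkCtl:\tASPM Disabled;" or "LnkCtl:\tASPM L0s Enabled;"
_LNKCTL_ASPM = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_state(lspci_vv_text):
    """Return the ASPM setting from the first LnkCtl line, or None if absent."""
    match = _LNKCTL_ASPM.search(lspci_vv_text)
    return match.group(1).strip() if match else None


def l0s_enabled(lspci_vv_text):
    """True if the first LnkCtl line reports L0s as enabled."""
    state = aspm_state(lspci_vv_text)
    return state is not None and "L0s" in state and "Enabled" in state


# LnkCtl lines copied from the lspci output in this report.
bridge_lnkctl = "\t\tLnkCtl:\tASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- CommClk+"
m5110_lnkctl = "\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+"

print("00:01.0 L0s enabled:", l0s_enabled(bridge_lnkctl))
print("20:00.0 L0s enabled:", l0s_enabled(m5110_lnkctl))
```

In practice the text would come from `lspci -s 00:01.0 -vv` and `lspci -s 20:00.0 -vv`, run as root so that the capability registers are readable.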

#lspci -s 20:00.0 -vv
20:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
	Subsystem: IBM ServeRAID M5110 SAS/SATA Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: I/O ports at 3000 [size=256]
	Region 1: Memory at 82b40000 (64-bit, non-prefetchable) [size=16K]
	Region 3: Memory at 82b00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at 80100000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	....................................
#lspci -s 00:01.0 -vv
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Bus: primary=00, secondary=20, subordinate=20, sec-latency=0
	I/O behind bridge: 00003000-00003fff
	Memory behind bridge: 82b00000-82bfffff
	Prefetchable memory behind bridge: 0000000080100000-00000000801fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [88] Subsystem: Intel Corporation Device 1999
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00418  Data: 0000
	Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #2, Speed 8GT/s, Width x8, ASPM L0s L1, Latency L0 <256ns, L1 <8us
			ClockPM- Surprise- LLActRep- BwNot+
		LnkCtl:	ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
	........................................................

Ocean.
Comment 1 Ocean He 2016-07-05 13:24:20 UTC
This turned out to be a firmware bug, so I am closing it.

Ocean.
Comment 2 Bjorn Helgaas 2016-07-15 17:16:37 UTC
I assume you mean the Server RAID M5110 random reset problem was caused by a firmware bug.  Can you add details about what firmware version contains the fix?  If anybody else trips over the problem, that will help debug it.