Bug 118941 - PCI/ASPM: PCI/E endpoint got randomly reset due to improper ASPM L0s setting.
Status: CLOSED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-25 12:35 UTC by Ocean He
Modified: 2016-07-15 17:16 UTC

See Also:
Kernel Version: 4.6
Subsystem:
Regression: No
Bisected commit-id:



Description Ocean He 2016-05-25 12:35:06 UTC
On the test machine, an IBM System x3250 M5, the ServeRAID M5110 (20:00.0) is connected to the PCI bridge (00:01.0).

# lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller
           +-01.0-[20]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt]
           +-01.1-[10]--
           +-14.0  Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI
           +-1a.0  Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2
		   .............................................

Following pcie_aspm_check_latency() and pcie_config_aspm_link(), the kernel ASPM driver enabled L0s on the PCI bridge and left L0s disabled on the M5110.
                                     PCI bridge    M5110
DevCap acceptable L0s latency:       64ns          64ns
LnkCap L0s exit latency:             256ns         64ns
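
To make the asymmetric result easier to follow, here is a minimal model of a per-direction L0s latency check. It is a sketch, not kernel code: `LinkSide` and `check_l0s` are hypothetical names, and the mapping of which exit latency gates which side's LnkCtl bit is an assumption chosen to reproduce the outcome reported above (bridge L0s enabled, M5110 L0s disabled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSide:
    """One end of a PCIe link and the L0s exit latency from its LnkCap."""
    name: str
    l0s_exit_ns: int


def check_l0s(upstream: LinkSide, downstream: LinkSide, acceptable_ns: int) -> dict:
    """Return, per device, whether L0s may stay enabled in its LnkCtl.

    Assumption: the upstream port's exit latency gates the L0s bit of the
    downstream device, and the downstream device's exit latency gates the
    L0s bit of the upstream port. An exit latency is rejected only when it
    is strictly greater than the endpoint's acceptable latency.
    """
    downstream_ok = upstream.l0s_exit_ns <= acceptable_ns
    upstream_ok = downstream.l0s_exit_ns <= acceptable_ns
    return {upstream.name: upstream_ok, downstream.name: downstream_ok}


# Values taken from the table above.
bridge = LinkSide("00:01.0", l0s_exit_ns=256)
m5110 = LinkSide("20:00.0", l0s_exit_ns=64)

result = check_l0s(bridge, m5110, acceptable_ns=64)
for device, enabled in result.items():
    print(device, "L0s enabled" if enabled else "L0s disabled")
```

With these inputs, only the 256ns exit latency exceeds the 64ns limit, so one direction is dropped while the other is kept, which matches the LnkCtl values in the lspci output below.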

This causes the ServeRAID M5110 to reset randomly, and dmesg shows:
megaraid_sas 0000:20:00.0: 2614 (517500295s/0x0020/CRIT) - Controller encountered a fatal error and was reset.

If L0s is disabled on both the PCI bridge and the M5110, the issue does not occur.
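
As a rough illustration of what disabling L0s means at the register level: in the PCIe Link Control register, bits 1:0 hold the ASPM control field (bit 0 for L0s, bit 1 for L1), and the register sits at offset 0x10 inside the PCI Express capability. The sketch below only computes values and does not touch hardware. The helper names are hypothetical, and the sample value 0x0041 is an assumption reconstructed from the bridge's "ASPM L0s Enabled ... CommClk+" flags, not a value read from the machine.

```python
ASPM_L0S = 0x1          # Link Control bit 0: L0s entry enabled
ASPM_L1 = 0x2           # Link Control bit 1: L1 entry enabled
LNKCTL_OFFSET = 0x10    # Link Control offset within the PCI Express capability


def lnkctl_address(cap_offset: int) -> int:
    """Config-space address of Link Control for a given capability offset."""
    return cap_offset + LNKCTL_OFFSET


def decode_aspm(lnkctl: int) -> str:
    """Describe the ASPM control field of a Link Control value."""
    names = {0: "Disabled", ASPM_L0S: "L0s", ASPM_L1: "L1", ASPM_L0S | ASPM_L1: "L0s L1"}
    return names[lnkctl & (ASPM_L0S | ASPM_L1)]


def disable_l0s(lnkctl: int) -> int:
    """Return the 16-bit Link Control value with only the L0s bit cleared."""
    return lnkctl & ~ASPM_L0S & 0xFFFF


# Capability offsets [a0] and [68] come from the lspci output in this report.
bridge_lnkctl_addr = lnkctl_address(0xA0)
m5110_lnkctl_addr = lnkctl_address(0x68)

sample = 0x0041  # assumed: CommClk+ (bit 6) with ASPM L0s (bit 0)
print(hex(bridge_lnkctl_addr), decode_aspm(sample), "->", decode_aspm(disable_l0s(sample)))
```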

# lspci -s 20:00.0 -vv
20:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
	Subsystem: IBM ServeRAID M5110 SAS/SATA Controller
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: I/O ports at 3000 [size=256]
	Region 1: Memory at 82b40000 (64-bit, non-prefetchable) [size=16K]
	Region 3: Memory at 82b00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at 80100000 [disabled] [size=128K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L0s, Latency L0 <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch+ ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	....................................
# lspci -s 00:01.0 -vv
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Bus: primary=00, secondary=20, subordinate=20, sec-latency=0
	I/O behind bridge: 00003000-00003fff
	Memory behind bridge: 82b00000-82bfffff
	Prefetchable memory behind bridge: 0000000080100000-00000000801fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [88] Subsystem: Intel Corporation Device 1999
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00418  Data: 0000
	Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 256 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #2, Speed 8GT/s, Width x8, ASPM L0s L1, Latency L0 <256ns, L1 <8us
			ClockPM- Surprise- LLActRep- BwNot+
		LnkCtl:	ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
	........................................................

Ocean.
Comment 1 Ocean He 2016-07-05 13:24:20 UTC
It proved to be a firmware bug, so I am closing this.

Ocean.
Comment 2 Bjorn Helgaas 2016-07-15 17:16:37 UTC
I assume you mean the Server RAID M5110 random reset problem was caused by a firmware bug.  Can you add details about what firmware version contains the fix?  If anybody else trips over the problem, that will help debug it.
