Bug 211187

Summary: GL9750 SD Host Controller slows down suspend and resume
Product: Drivers Reporter: Victor Ding (victording)
Component: PCI    Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal CC: victording
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: PCIe analyzer trace showing GL9750 inactive for ~4.2us.

Description Victor Ding 2021-01-14 08:55:52 UTC

    
Comment 1 Victor Ding 2021-01-14 08:56:07 UTC
GL9750 has a 3100us PortTPowerOnTime (Port T_POWER_ON); however, it enters L1.2 after only ~4us of inactivity, per a PCIe analyzer trace. During a suspend/resume cycle, PCI access operations are frequently more than 4us apart.

Therefore, the device frequently enters and leaves L1.2 during this process, causing longer than desirable suspend/resume times. The total time cost due to these repeated L1.2 exits could add up to ~200ms.

In theory, other PCIe devices could encounter this issue as well. It is more likely to occur on devices that enter L1 aggressively but have a high port power-on time. GL9750 is the only device with such characteristics that I have access to.

One way to fix the issue is to disable ASPM during suspend, or at least during part of the suspend/resume process*, mainly for two reasons:
  1. The expected power saving is small. During suspend/resume, we access PCI registers frequently, making it unlikely for the device to stay in low-power states;
  2. It could cause performance degradation. Unfortunate timing could put the device into a low-power state and wake it up very soon after, resulting in noticeable delays.
 * By "part of the suspend/resume process", I refer to the [suspend/resume]_noirq phases, where PCI register accesses are frequent and the issue is most severe.

Ideally, the L1.2 entry timeout could be tuned to be long enough that the device would not enter low-power states during suspend; however, this may not be feasible, mainly because:
  1. Hardware limitations;
  2. The best timing for suspend/resume may not be the best timing for other workloads. Considering suspend/resume is a rare task, we probably do not want to penalize other tasks for it;
  3. If the goal is to avoid entering low-power states during suspend, it might be better simply to disable ASPM there.


In case it is needed, `lspci -vvv` shows:
```
02:00.0 SD Host controller: Genesys Logic, Inc Device 9750 (rev 01) (prog-if 01)
	Subsystem: Genesys Logic, Inc Device 9750
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 38
	Region 0: Memory at e0700000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #80, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message/WAKE#
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee08004  Data: 4025
	Capabilities: [f8] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=17a0 Rev=1 Len=008 <?>
	Capabilities: [108 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [110 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=3100us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
			   T_CommonMode=0us LTR1.2_Threshold=163840ns
		L1SubCtl2: T_PwrOn=3100us
	Capabilities: [200 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 04000001 0000000f 02070000 e99cbdfc
	Kernel driver in use: sdhci-pci
```
Comment 2 Victor Ding 2022-02-09 07:49:32 UTC
Created attachment 300420 [details]
PCIe analyzer trace showing GL9750 inactive for ~4.2us.
Comment 3 Victor Ding 2022-02-09 08:03:49 UTC
As Bjorn Helgaas pointed out on the LKML, 4us is more than enough time for a lot of code execution, especially since PCI config space accesses to the same device tend to be fairly close together. However, because device resumes occur in parallel and a global mutex is required for each access, the config space accesses from multiple devices are serialized arbitrarily, which can push accesses to the same device far apart from each other.

The hypothesis came from my observation that PCI config space accesses from different devices were always interleaved in the slow runs, while they were mostly grouped together in the fast runs. To further test the hypothesis, I added a global lock around PCI state save/restore, forcing the related config space accesses from the same device to be grouped together; with that change I always got fast runs.