Bug 211187
| Summary: | GL9750 SD Host Controller slows down suspend and resume | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Victor Ding (victording) |
| Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | victording |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 5.4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | PCIe analyzer trace showing GL9750 inactive for ~4.2us. | | |
Description
Victor Ding
2021-01-14 08:55:52 UTC
GL9750 has a 3100us PortTPowerOnTime (Port T_POWER_ON); however, per a PCIe analyzer trace, it enters L1.2 after only ~4us of inactivity. During a suspend/resume process, PCI access operations are frequently more than 4us apart. Therefore, the device repeatedly enters and leaves L1.2 during this process, causing a longer-than-desirable suspend/resume time. The total time cost due to this could add up to ~200ms (at 3100us per L1.2 exit, that is on the order of 65 exits across one suspend/resume cycle).

In theory, other PCIe devices could encounter this issue as well. It is more likely to occur on devices that enter L1 aggressively but have a high port power-on time; GL9750 is the only one I have access to with such characteristics.

One way to fix the issue is to disable ASPM during suspend, or at least during part of the suspend/resume process* (see the sketch after the attachment below), mainly for two reasons:

1. The expected power saving is small. During suspend/resume, we access PCI registers frequently, making it unlikely for the device to stay in low power states;
2. It could cause performance degradation. Unfortunate timing could put the device into a low power state and wake it up very soon after, resulting in noticeable delays.

* By "part of the suspend/resume process", I refer to the [suspend/resume]_noirq phases, where PCI register accesses are frequent and which suffer most from this issue.

Ideally, the L1.2 entry time could be tuned so that it is long enough that the device would not go into low power states during suspend; however, this may not be feasible, mainly because:

1. Hardware limitations;
2. The best timing for suspend/resume may not be the best timing for other tasks. Considering suspend/resume is a rare task, we probably do not want to sacrifice other tasks;
3. If the goal is to avoid entering low power states during suspend, it might be better just to disable them.

In case it is needed, `lspci -vvv` showed:

```
02:00.0 SD Host controller: Genesys Logic, Inc Device 9750 (rev 01) (prog-if 01)
	Subsystem: Genesys Logic, Inc Device 9750
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 38
	Region 0: Memory at e0700000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #80, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message/WAKE#
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee08004  Data: 4025
	Capabilities: [f8] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Vendor Specific Information: ID=17a0 Rev=1 Len=008 <?>
	Capabilities: [108 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [110 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=3100us
		L1SubCtl1: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+
			   T_CommonMode=0us LTR1.2_Threshold=163840ns
		L1SubCtl2: T_PwrOn=3100us
	Capabilities: [200 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 04000001 0000000f 02070000 e99cbdfc
	Kernel driver in use: sdhci-pci
```

Created attachment 300420 [details]
PCIe analyzer trace showing GL9750 inactive for ~4.2us.
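For illustration only, here is a minimal, untested sketch of the ASPM-disable approach described above, written as hypothetical noirq callbacks for the endpoint's driver. The callback names and the `gl9750_saved_aspmc` variable are made up; it clears the ASPM control bits in the endpoint's Link Control register for the noirq window and restores them on resume. A complete change would also need to handle the upstream bridge's end of the link.

```c
#include <linux/pci.h>

static u16 gl9750_saved_aspmc;	/* hypothetical storage for the saved ASPM bits */

static int gl9750_suspend_noirq(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	u16 lnkctl;

	/* Remember which ASPM states were enabled before suspend. */
	pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl);
	gl9750_saved_aspmc = lnkctl & PCI_EXP_LNKCTL_ASPMC;

	/* Clear the ASPM control field so the link stays in L0 while the
	 * noirq phase performs its frequent config space accesses. */
	pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
					   PCI_EXP_LNKCTL_ASPMC, 0);
	return 0;
}

static int gl9750_resume_noirq(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	/* Re-enable whatever ASPM states were active before suspend. */
	pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL,
					   PCI_EXP_LNKCTL_ASPMC,
					   gl9750_saved_aspmc);
	return 0;
}
```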
As Bjorn Helgaas pointed out on LKML, 4us is more than enough time for a lot of code execution, especially since PCI config space accesses for the same device appear to be fairly close to each other. However, because device resumes occur in parallel and a global mutex is required for each access, config space accesses from multiple devices are serialized in arbitrary order, which can leave accesses to the same device far apart from each other. The hypothesis came from my observation that in the slow runs, PCI config space accesses from different devices were always interleaved, while in the fast runs they were mostly grouped together. To further test the hypothesis, I added a global lock around saving/restoring PCI state, forcing the related config space accesses for a single device to stay grouped together, and I always got fast runs (sketched below).
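A minimal sketch of that experiment, assuming wrappers around the existing pci_save_state()/pci_restore_state() helpers; the wrapper names and the global lock are hypothetical, not the actual patch.

```c
#include <linux/mutex.h>
#include <linux/pci.h>

static DEFINE_MUTEX(pci_state_lock);	/* hypothetical global lock */

static int grouped_pci_save_state(struct pci_dev *pdev)
{
	int ret;

	/* Hold the lock across the whole save so all config reads for
	 * this device happen back to back, not interleaved with other
	 * devices suspending in parallel. */
	mutex_lock(&pci_state_lock);
	ret = pci_save_state(pdev);
	mutex_unlock(&pci_state_lock);
	return ret;
}

static void grouped_pci_restore_state(struct pci_dev *pdev)
{
	/* Same idea on resume: the config writes for one device stay
	 * grouped, so the link need not exit L1.2 between them. */
	mutex_lock(&pci_state_lock);
	pci_restore_state(pdev);
	mutex_unlock(&pci_state_lock);
}
```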