Created attachment 277581 [details] lspci -vv (partial) How to reproduce bug: In bad kernel versions, do: $ modprobe iwlmvm; sleep .1; modprobe -r iwlmvm; sleep .1 # line 1 $ echo auto > /sys/devices/pci0000\:00/0000\:00\:1c.5/0000\:02\:00.0/power/control # line 2 $ echo "mem" > /sys/power/state # line 3 $ modprobe iwlmvm # line 4 Then the system freezes. 00:1c.5 is a PCI Express Root Port, and Wifi controler is pci 02:00.0 (details will follow) The lines 1 and 2 cannot be swapped; if the order is line 2, 1, 3, 4 it does'n freeze. (However 1 3 2 4 also causes a freeze.) It seems short sleeps (or else) are necessary to cause a hang. The bad commit is: ------------------------------------------------------------------------ commit 877b3729ca03b00800b99ac0c076e9456ef3ae6b Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Date: Wed Jan 3 01:38:27 2018 +0100 PCI / PM: Use SMART_SUSPEND and LEAVE_SUSPENDED flags for PCIe ports Make the PCIe port driver set DPM_FLAG_SMART_SUSPEND and DPM_FLAG_LEAVE_SUSPENDED for the devices handled by it to benefit from the opportunistic optimizations in the PCI layer enabled by these flags. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> :040000 040000 01e99438c1643bbe861fbd29b00750827f349de6 dd82628e48b1351a6ac6123b900e4164599aeb5d M drivers ------------------------------------------------------------------------ Best regards.
Created attachment 277583 [details] .config for the mentioned commit
Is this issue present in 4.18-rc6?
Yep. Thanks kernel developers.
Ah, sorry for the ugly pathname with backslashes. It's /sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/power/control
Created attachment 277599 [details] PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default It looks like the optimization made by the problematic commit is overly aggressive. Please test the attached patch and let me know if it makes a difference.
Unfortunately, it doesn't work. Thanks anyway.
So does it work if you revert the problematic commit completely from 4.18-rc6 (or -rc7)?
Yes. (Tried 4.18-rc6)
What's there initially in /sys/devices/pci0000\:00/0000\:00\:1c.5/0000\:02\:00.0/power/control ?
"on" on boot. FYI: Insertion and deletion of the moudule iwlmvm do not reset it.
Well, if this is "on" on boot, then there should be no difference between reverting the problematic commit and applying the patch from comment #5.
Created attachment 277645 [details] PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default Please boot with this patch applied and provide the output of $ dmesg | grep pcie_portdrv_probe on boot (unless it is empty, in which case please let me know).
Created attachment 277647 [details] entire "lspci -vv" Dmesg says: pcieport 0000:00:1c.0: pcie_portdrv_probe: Setting PM flags pcieport 0000:00:1c.5: pcie_portdrv_probe: Setting PM flags "lspci -vv" for 0000:00:1c.0 is: ------------------------------------------------------------------------ 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 I/O behind bridge: 00003000-00003fff Memory behind bridge: b1200000-b12fffff Prefetchable memory behind bridge: 00000000b1000000-00000000b10fffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend- LnkCap: Port #5, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+ LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise- Slot #4, PowerLimit 10.000W; Interlock- NoCompl+ SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg- Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock- SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock- Changed: MRL- PresDet- LinkState+ RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest- Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit- Address: 00000000 Data: 0000 Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise Point-LP PCI Express Root Port Capabilities: [a0] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn- Capabilities: [140 v1] Access Control Services ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans- ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans- Capabilities: [220 v1] #19 Kernel driver in use: pcieport ------------------------------------------------------------------------
In case this helps: Other initial values: (1) /sys/devices/pci0000:00/0000:00:1c.0/power/control : auto (2) /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/power/control : on (3) /sys/devices/pci0000:00/0000:00:1c.5/power/control : auto (4) /sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/power/control : on (As already reported) It seems that the above device (4) is the wifi controller. The value of /sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/device is 0x3165, and hwinfo says, ------------------------------------------------------------------------ 26: PCI 200.0: 0280 Network controller ... SysFS ID: /devices/pci0000:00/0000:00:1c.5/0000:02:00.0 ... Model: "Intel WLAN controller" ... Device: pci 0x3165 ------------------------------------------------------------------------ Similarly, the device (2) must be an ethernet controller. No other device seems to be there under these two PCIe controllers. Your patch clarified that the flag of the pcieports is set. ("modprobe iwlmvm" does not emit any counterpart message.)
I've done a similar experiment with the other PCIe port & the ethernet driver, under a "bad" kernel, and it turns out that it doesn't cause a hang! ------------------------------------------------------------------------ $ modprobe r8169; sleep .1; modprobe -r r8169 ; sleep .1 # It's the ethernet driver. $ echo "auto" > /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/power/control $ echo "mem" > /sys/power/state $ modprobe r8169 ; sleep .1 ------------------------------------------------------------------------
It looks like we need to debug this a bit. I'll prepare a couple of debug patches to see what's up, stay tuned!
Created attachment 277859 [details] PCI / ACPI / PM: Resume all bridges on system suspend to S3 Sorry for the delay. Please let me know if the issue is reproducible with this patch applied.
It works! To be sure: 1. I used the last patch alone. Not with the obsolete patch on Jul 30. 2. I tried linux-4.17.y and 4.18.0-rc6. In fact, Rafael, you saved my previous buggy PC years ago, (Sorry I can't find its log.) so this is the second time. I deeply thank your years of work in the linux kernel. > stay tuned! Radios are wireless, after all. (Forgive me for a joke for elderly folks. ;-) Kind regards.
(In reply to teika kazura from comment #18) > It works! OK, thanks! > To be sure: > 1. I used the last patch alone. Not with the obsolete patch on Jul 30. > 2. I tried linux-4.17.y and 4.18.0-rc6. OK > In fact, Rafael, you saved my previous buggy PC years ago, (Sorry I can't > find its log.) so this is the second time. I deeply thank your years of work > in the linux kernel. Appreciated, thank you! Let me submit the patch with a changelog to the mailing lists.
Patch submitted: https://patchwork.kernel.org/patch/10567315/
Pulled into linux-next[1] and linux-4.18.5[2]: [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/pci-acpi.c?h=next-20180823&id=9d64b539b738fc181442caab95f1f76d9bd58539 [2] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/pci/pci-acpi.c?h=v4.18.5&id=4d4306a283a1be22e4342c5a04d99f65f256d157 Thanks all.
Patch merged as commit 9d64b539b738 PCI / ACPI / PM: Resume all bridges on suspend-to-RAM Closing.