Bug 200675 - Bisected: "modprobe iwlmvm" causes a hang under some condition.
Summary: Bisected: "modprobe iwlmvm" causes a hang under some condition.
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-28 12:31 UTC by teika kazura
Modified: 2018-09-13 20:54 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.15.0-rc3+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
lspci -vv (partial) (6.46 KB, text/plain)
2018-07-28 12:31 UTC, teika kazura
Details
.config for the mentioned commit (88.45 KB, text/plain)
2018-07-28 12:41 UTC, teika kazura
Details
PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default (894 bytes, patch)
2018-07-30 09:07 UTC, Rafael J. Wysocki
Details | Diff
PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default (959 bytes, patch)
2018-08-01 09:53 UTC, Rafael J. Wysocki
Details | Diff
entire "lspci -vv" (22.35 KB, text/plain)
2018-08-01 10:59 UTC, teika kazura
Details
PCI / ACPI / PM: Resume all bridges on system suspend to S3 (1.06 KB, patch)
2018-08-14 10:49 UTC, Rafael J. Wysocki
Details | Diff

Description teika kazura 2018-07-28 12:31:56 UTC
Created attachment 277581 [details]
lspci -vv (partial)

How to reproduce bug: In bad kernel versions, do:
$ modprobe iwlmvm; sleep .1; modprobe -r iwlmvm; sleep .1   # line 1
$ echo auto > /sys/devices/pci0000\:00/0000\:00\:1c.5/0000\:02\:00.0/power/control # line 2
$ echo "mem" > /sys/power/state # line 3
$ modprobe iwlmvm # line 4

Then the system freezes.

00:1c.5 is a PCI Express Root Port, and Wifi controler is pci 02:00.0 (details will follow)

The lines 1 and 2 cannot be swapped; if the order is line 2, 1, 3, 4 it does'n freeze. (However 1 3 2 4 also causes a freeze.) It seems short sleeps (or else) are necessary to cause a hang.

The bad commit is:
------------------------------------------------------------------------
commit 877b3729ca03b00800b99ac0c076e9456ef3ae6b
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Wed Jan 3 01:38:27 2018 +0100

    PCI / PM: Use SMART_SUSPEND and LEAVE_SUSPENDED flags for PCIe ports

    Make the PCIe port driver set DPM_FLAG_SMART_SUSPEND and
    DPM_FLAG_LEAVE_SUSPENDED for the devices handled by it to benefit
    from the opportunistic optimizations in the PCI layer enabled by
    these flags.

    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Acked-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 01e99438c1643bbe861fbd29b00750827f349de6 dd82628e48b1351a6ac6123b900e4164599aeb5d M	drivers
------------------------------------------------------------------------

Best regards.
Comment 1 teika kazura 2018-07-28 12:41:16 UTC
Created attachment 277583 [details]
.config for the mentioned commit
Comment 2 Rafael J. Wysocki 2018-07-29 11:10:01 UTC
Is this issue present in 4.18-rc6?
Comment 3 teika kazura 2018-07-29 12:46:21 UTC
Yep. 

Thanks kernel developers.
Comment 4 teika kazura 2018-07-29 12:48:56 UTC
Ah, sorry for the ugly pathname with backslashes. It's 
/sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/power/control
Comment 5 Rafael J. Wysocki 2018-07-30 09:07:40 UTC
Created attachment 277599 [details]
PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default

It looks like the optimization made by the problematic commit is overly aggressive.

Please test the attached patch and let me know if it makes a difference.
Comment 6 teika kazura 2018-07-30 10:10:34 UTC
Unfortunately, it doesn't work. Thanks anyway.
Comment 7 Rafael J. Wysocki 2018-07-31 09:22:29 UTC
So does it work if you revert the problematic commit completely from 4.18-rc6 (or -rc7)?
Comment 8 teika kazura 2018-07-31 12:48:14 UTC
Yes. (Tried 4.18-rc6)
Comment 9 Rafael J. Wysocki 2018-07-31 16:53:36 UTC
What's there initially in /sys/devices/pci0000\:00/0000\:00\:1c.5/0000\:02\:00.0/power/control ?
Comment 10 teika kazura 2018-07-31 23:22:38 UTC
"on" on boot.  
FYI: Insertion and deletion of the moudule iwlmvm do not reset it.
Comment 11 Rafael J. Wysocki 2018-08-01 09:49:00 UTC
Well, if this is "on" on boot, then there should be no difference between reverting the problematic commit and applying the patch from comment #5.
Comment 12 Rafael J. Wysocki 2018-08-01 09:53:52 UTC
Created attachment 277645 [details]
PCI / PM: Only set driver flags for PCIe ports if enabling runtime PM by default

Please boot with this patch applied and provide the output of

$ dmesg | grep pcie_portdrv_probe

on boot (unless it is empty, in which case please let me know).
Comment 13 teika kazura 2018-08-01 10:59:32 UTC
Created attachment 277647 [details]
entire "lspci -vv"

Dmesg says:
pcieport 0000:00:1c.0: pcie_portdrv_probe: Setting PM flags
pcieport 0000:00:1c.5: pcie_portdrv_probe: Setting PM flags

"lspci -vv" for 0000:00:1c.0 is:
------------------------------------------------------------------------
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 00003000-00003fff
	Memory behind bridge: b1200000-b12fffff
	Prefetchable memory behind bridge: 00000000b1000000-00000000b10fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #5, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
			ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM L0s L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #4, PowerLimit 10.000W; Interlock- NoCompl+
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [80] MSI: Enable- Count=1/1 Maskable- 64bit-
		Address: 00000000  Data: 0000
	Capabilities: [90] Subsystem: Hewlett-Packard Company Sunrise Point-LP PCI Express Root Port
	Capabilities: [a0] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [140 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [220 v1] #19
	Kernel driver in use: pcieport
------------------------------------------------------------------------
Comment 14 teika kazura 2018-08-03 02:27:42 UTC
In case this helps:

Other initial values:
(1) /sys/devices/pci0000:00/0000:00:1c.0/power/control : auto
(2) /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/power/control : on
(3) /sys/devices/pci0000:00/0000:00:1c.5/power/control : auto
(4) /sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/power/control : on (As already reported)

It seems that the above device (4) is the wifi controller. The value of
/sys/devices/pci0000:00/0000:00:1c.5/0000:02:00.0/device
is 0x3165, and hwinfo says,
------------------------------------------------------------------------
26: PCI 200.0: 0280 Network controller
...
SysFS ID: /devices/pci0000:00/0000:00:1c.5/0000:02:00.0
...
Model: "Intel WLAN controller"
...
Device: pci 0x3165
------------------------------------------------------------------------
Similarly, the device (2) must be an ethernet controller. No other device seems to be there under these two PCIe controllers.

Your patch clarified that the flag of the pcieports is set. ("modprobe iwlmvm" does not emit any counterpart message.)
Comment 15 teika kazura 2018-08-05 22:18:56 UTC
I've done a similar experiment with the other PCIe port & the ethernet driver, under a "bad" kernel, and it turns out that it doesn't cause a hang!
------------------------------------------------------------------------
$ modprobe r8169; sleep .1; modprobe -r r8169 ; sleep .1   # It's the ethernet driver.
$ echo "auto" > /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/power/control
$ echo "mem" > /sys/power/state
$ modprobe r8169 ; sleep .1
------------------------------------------------------------------------
Comment 16 Rafael J. Wysocki 2018-08-06 08:42:05 UTC
It looks like we need to debug this a bit.  I'll prepare a couple of debug patches to see what's up, stay tuned!
Comment 17 Rafael J. Wysocki 2018-08-14 10:49:57 UTC
Created attachment 277859 [details]
PCI / ACPI / PM: Resume all bridges on system suspend to S3

Sorry for the delay.

Please let me know if the issue is reproducible with this patch applied.
Comment 18 teika kazura 2018-08-14 13:00:27 UTC
It works!

To be sure:
1. I used the last patch alone. Not with the obsolete patch on Jul 30.
2. I tried linux-4.17.y and 4.18.0-rc6.

In fact, Rafael, you saved my previous buggy PC years ago, (Sorry I can't find its log.) so this is the second time. I deeply thank your years of work in the linux kernel.

> stay tuned!
Radios are wireless, after all. (Forgive me for a joke for elderly folks. ;-)

Kind regards.
Comment 19 Rafael J. Wysocki 2018-08-16 07:12:29 UTC
(In reply to teika kazura from comment #18)
> It works!

OK, thanks!

> To be sure:
> 1. I used the last patch alone. Not with the obsolete patch on Jul 30.
> 2. I tried linux-4.17.y and 4.18.0-rc6.

OK

> In fact, Rafael, you saved my previous buggy PC years ago, (Sorry I can't
> find its log.) so this is the second time. I deeply thank your years of work
> in the linux kernel.

Appreciated, thank you!

Let me submit the patch with a changelog to the mailing lists.
Comment 20 Rafael J. Wysocki 2018-08-16 21:30:37 UTC
Patch submitted: https://patchwork.kernel.org/patch/10567315/
Comment 22 Rafael J. Wysocki 2018-09-13 20:54:09 UTC
Patch merged as commit

9d64b539b738 PCI / ACPI / PM: Resume all bridges on suspend-to-RAM

Closing.

Note You need to log in before you can comment on or make changes to this bug.