Bug 99751 - Enormous resume delay, unless pcie_ports=compat - Dell XPS11 - 9P33
Summary: Enormous resume delay, unless pcie_ports=compat - Dell XPS11 - 9P33
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 178231
  Show dependency tree
 
Reported: 2015-06-10 16:25 UTC by Len Brown
Modified: 2019-11-01 03:29 UTC (History)
6 users (show)

See Also:
Kernel Version: 3.15 through 4.1rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv output (9.50 KB, text/plain)
2015-06-10 16:25 UTC, Len Brown
Details
3.14.44 success case -- analyze_suspend.py output (60.55 KB, text/html)
2015-06-10 17:13 UTC, Len Brown
Details
3.15.0 failure case - analyze_suspend.py output (71.40 KB, text/html)
2015-06-10 17:17 UTC, Len Brown
Details
3.15.0 failure case - dmesg (46.82 KB, text/plain)
2015-06-10 17:20 UTC, Len Brown
Details
4.1.0-rc5+ failure case - analyze_suspend.py (206.06 KB, text/html)
2015-06-10 17:33 UTC, Len Brown
Details
4.1.0-rc5+ failure case - dmesg (47.43 KB, text/plain)
2015-06-10 17:35 UTC, Len Brown
Details
4.1.0-rc5+ failure case - ftrace (407.43 KB, text/plain)
2015-06-10 17:42 UTC, Len Brown
Details
4.1.0-rc5+ dmesg from boot (149.62 KB, application/octet-stream)
2015-06-10 17:47 UTC, Len Brown
Details
4.1.0-rc5+ lspci -vvv as root (24.06 KB, text/plain)
2015-06-10 17:47 UTC, Len Brown
Details
4.1.0-rc5+ dmesg from boot, including suspend/resume (198.32 KB, text/plain)
2015-06-10 17:52 UTC, Len Brown
Details
debug patch (1.82 KB, patch)
2015-06-10 22:46 UTC, Bjorn Helgaas
Details | Diff
4.1.0-rc5+ dmesg from boot, including suspend/resume with patch from comment #11 applied (199.88 KB, text/plain)
2015-06-11 03:43 UTC, Len Brown
Details
debug patch (1.18 KB, patch)
2015-06-17 16:12 UTC, Bjorn Helgaas
Details | Diff
debug patch (1.18 KB, patch)
2015-06-18 03:37 UTC, Bjorn Helgaas
Details | Diff
analyze suspend output (314.21 KB, text/html)
2016-10-13 21:23 UTC, Len Brown
Details
testing patch (1.66 KB, patch)
2016-10-14 21:49 UTC, Bjorn Helgaas
Details | Diff
dmesg with patch (72.55 KB, text/plain)
2016-10-21 19:28 UTC, Len Brown
Details
Still a 3,500 ms delay in Linux-4.14.2 (632.78 KB, text/html)
2017-11-26 03:30 UTC, Len Brown
Details

Description Len Brown 2015-06-10 16:25:34 UTC
Created attachment 179451 [details]
lspci -vvv output

On a Dell XPS11 laptop, pcieport 0000:00:1c.0 takes
several seconds to resume from suspend.

Booting with "pcie_ports=compat", or
building with  CONFIG_HOTPLUG_PCI_PCIE=n
makes this delay go away.

I've tested Linux-3.14.44 and do not see this delay.
I've tested Linux-3.15.0 and others up through 4.1-rc5
and I do see this delay.  So I'll mark this as a regression.

while I use analyze_suspend.py to trace resume,
initcall_debug is sufficient to show this delay in dmesg,
and I notice some additional PCI warnings there:

[   55.897986] pciehp 0000:00:1c.0:pcie04: Link Training Error occurs 
[   55.897988] pciehp 0000:00:1c.0:pcie04: Failed to check link status
[   55.897992] call 0000:00:1c.0+ returned 0 after 2197660 usecs

(yes, that is 2.19 SECONDS -- where anything> 50ms for resume is considered slow)

Unclear (to me) what pcie04 is...

$ lspci
00:00.0 Host bridge: Intel Corporation Haswell-ULT DRAM Controller (rev 0b)
00:02.0 VGA compatible controller: Intel Corporation Device 0a1e (rev 0b)
00:03.0 Audio device: Intel Corporation Haswell-ULT HD Audio Controller (rev 0b)
00:04.0 Signal processing controller: Intel Corporation Device 0a03 (rev 0b)
00:14.0 USB controller: Intel Corporation 8 Series USB xHCI HC (rev 04)
00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)
00:1b.0 Audio device: Intel Corporation 8 Series HD Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 8 Series PCI Express Root Port 1 (rev e4)
00:1c.1 PCI bridge: Intel Corporation 8 Series PCI Express Root Port 2 (rev e4)
00:1c.3 PCI bridge: Intel Corporation 8 Series PCI Express Root Port 4 (rev e4)
00:1d.0 USB controller: Intel Corporation 8 Series USB EHCI #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation 8 Series LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 8 Series SATA Controller 1 [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 8 Series SMBus Controller (rev 04)
00:1f.6 Signal processing controller: Intel Corporation 8 Series Thermal (rev 04)
05:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5249 PCI Express Card Reader (rev 01)
07:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)

This 04 item appears under 1c.0 when I ask lspci about topology:
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Haswell-ULT DRAM Controller
           +-02.0  Intel Corporation Device 0a1e
           +-03.0  Intel Corporation Haswell-ULT HD Audio Controller
           +-04.0  Intel Corporation Device 0a03
           +-14.0  Intel Corporation 8 Series USB xHCI HC
           +-16.0  Intel Corporation 8 Series HECI #0
           +-1b.0  Intel Corporation 8 Series HD Audio Controller
           +-1c.0-[04]--
           +-1c.1-[05]----00.0  Realtek Semiconductor Co., Ltd. RTS5249 PCI Express Card Reader
           +-1c.3-[07]----00.0  Intel Corporation Wireless 7260
           +-1d.0  Intel Corporation 8 Series USB EHCI #1
           +-1f.0  Intel Corporation 8 Series LPC Controller
           +-1f.2  Intel Corporation 8 Series SATA Controller 1 [AHCI mode]
           +-1f.3  Intel Corporation 8 Series SMBus Controller
           \-1f.6  Intel Corporation 8 Series Thermal
Comment 1 Len Brown 2015-06-10 17:13:05 UTC
Created attachment 179461 [details]
3.14.44 success case -- analyze_suspend.py output

0000:00:1c.0 is not seen on the resume side of the graph
Comment 2 Len Brown 2015-06-10 17:17:58 UTC
Created attachment 179471 [details]
3.15.0 failure case - analyze_suspend.py output

notice the 2250ms 0000:00:1c.0 in the yellow resume section
click on "device detail" and notice it has no children
Comment 3 Len Brown 2015-06-10 17:20:35 UTC
Created attachment 179481 [details]
3.15.0 failure case - dmesg

in this dmesg:

$ grep 0000:00:1c.0 *dmesg*
[   35.291619] calling  0000:00:1c.0+ @ 106, parent: pci0000:00
[   35.291624] call 0000:00:1c.0+ returned 0 after 2 usecs
[   35.529334] calling  0000:00:1c.0+ @ 6, parent: pci0000:00
[   35.529577] call 0000:00:1c.0+ returned 0 after 235 usecs
[   35.619421] calling  0000:00:1c.0+ @ 6, parent: pci0000:00
[   35.619561] call 0000:00:1c.0+ returned 0 after 134 usecs
[   35.635103] calling  0000:00:1c.0+ @ 1730, parent: pci0000:00
[   37.885840] pciehp 0000:00:1c.0:pcie04: Link Training Error occurs
[   37.885842] pciehp 0000:00:1c.0:pcie04: Failed to check link status
[   37.885846] call 0000:00:1c.0+ returned 0 after 2197684 usecs
[   37.957392] pcieport 0000:00:1c.0: no hotplug settings from platform
[   37.958538] pcieport 0000:00:1c.0: no hotplug settings from platform
Comment 4 Len Brown 2015-06-10 17:33:51 UTC
Created attachment 179491 [details]
4.1.0-rc5+ failure case - analyze_suspend.py

see in the yellow resume section, 0000:00:1c.0 {pcieport},
which takes 2249ms -- the slowest device in the system, by far.
Comment 5 Len Brown 2015-06-10 17:35:57 UTC
Created attachment 179501 [details]
4.1.0-rc5+ failure case - dmesg

$ grep 0000:00:1c.0 *dmesg*
[   34.624011] calling  0000:00:1c.0+ @ 107, parent: pci0000:00
[   34.624032] call 0000:00:1c.0+ returned 0 after 18 usecs
[   34.755665] calling  0000:00:1c.0+ @ 96, parent: pci0000:00
[   34.755735] call 0000:00:1c.0+ returned 0 after 67 usecs
[   34.795443] calling  0000:00:1c.0+ @ 97, parent: pci0000:00
[   34.795675] call 0000:00:1c.0+ returned 0 after 222 usecs
[   34.816632] calling  0000:00:1c.0+ @ 1668, parent: pci0000:00
[   37.065898] pciehp 0000:00:1c.0:pcie04: Link Training Error occurs
[   37.065899] pciehp 0000:00:1c.0:pcie04: Failed to check link status
[   37.065907] call 0000:00:1c.0+ returned 0 after 2198217 usecs
Comment 6 Bjorn Helgaas 2015-06-10 17:39:36 UTC
Hi Len, would you please:

  - attach a complete dmesg log from a recent kernel, including the initial boot, and
  - attach "lspci -vvv" output collected as root?

"pcie04" is the hotplug service (PCIE_PORT_SERVICE_HP).  The "service" idea is a way to have separate drivers bind to different services (AER, PME, hotplug, etc.) associated with the single 00:1c.0 Root Port.  The "04" in the "lspci -tv" output is the secondary bus number of the 1c.0 Root Port and it's only coincidence that it resembles pcie04.

From other XPS 11 lspci output on the web, I think 1c.0 does not have a slot, so I think pciehp should not touch it (but I haven't yet convinced myself that the pciehp probe path does this correctly).
Comment 7 Len Brown 2015-06-10 17:42:54 UTC
Created attachment 179511 [details]
4.1.0-rc5+ failure case - ftrace

analyze_suspend.py also grabs some ftrace points in the kernel.
Curiously, this ftrace output shows:

 pci_bus 0000:04, parent: 0000:00:1c.0, [suspend]

but lspci does not show device 0000:04
Comment 8 Len Brown 2015-06-10 17:47:18 UTC
Created attachment 179521 [details]
4.1.0-rc5+ dmesg from boot
Comment 9 Len Brown 2015-06-10 17:47:56 UTC
Created attachment 179531 [details]
4.1.0-rc5+ lspci -vvv as root
Comment 10 Len Brown 2015-06-10 17:52:29 UTC
Created attachment 179541 [details]
4.1.0-rc5+ dmesg from boot, including suspend/resume

oops, previous dmesg was collected before the suspend,
this one is a super-set, including boot and suspend/resume
Comment 11 Bjorn Helgaas 2015-06-10 22:46:54 UTC
Created attachment 179551 [details]
debug patch

I'm stumped.  Re: comment #7, 0000:04 refers to a PCI bus, not to a device.  This one is the secondary bus of the 00:1c.0 root port, and there is no device on that bus.

On 4.1.0-rc5, I think you should see a delay of 1.1sec or so.  Here are the capabilities of this slot:

  pciehp 0000:00:1c.0:pcie04: Slot #0 AttnBtn- AttnInd- PwrInd- PwrCtrl- MRL- Interlock- NoCompl+ LLActRep+

This is the path we should be in:

  pciehp_resume
    pcie_enable_notification
      pcie_write_cmd
        pcie_wait_cmd
          if (NO_CMD_CMPL)
            return                                  # A
    pciehp_get_adapter_status(..., &status)
    if (status)                                     # B
      pciehp_enable_slot
        pciehp_get_adapter_status
        board_added
          pciehp_check_link_status
            if (ctrl->link_active_reporting)        # C
              pcie_wait_link_active                 # D 1 sec delay
            msleep(100)                             # E 0.1 sec delay
            "Link Training Error occurs"
          "Failed to check link status"

At (A), we don't wait at all because this Root Port has the No Command Completed Support bit set ("NoCmpl+" above and in lspci), so we can write commands with no delay.  This path has changed quite a bit since v3.14, so I wouldn't be surprised if we did wait for 1 sec here on some kernels between v3.15 and v4.0.

At (B), we apparently read a Slot Status with the Presence Detect State bit set.  This is a puzzle because your lspci output shows this bit cleared, and I assume you didn't add a device while we were suspended.

At (C), this Root Port has the Data Link Layer Link Active Reporting Capable bit set ("LLActRep+" above and in lspci), so we can poll for Link Active instead of always waiting 1 sec.

At (D), we poll for up to 1 sec, waiting for Data Link Layer Link Active to be set in the Link Status register.  But it never becomes set, probably because there's no device on the other end, so we timeout after 1 sec.

At (E), we always wait 100ms, presumably because of some PCIe spec requirement.
Comment 12 Bjorn Helgaas 2015-06-10 22:49:07 UTC
Oops, I meant to ask if you could test with the debug patch in comment #11.  Maybe there's some slot capability that changed over suspend/resume, and we aren't refreshing it when we should be.
Comment 13 Len Brown 2015-06-11 03:43:30 UTC
Created attachment 179611 [details]
4.1.0-rc5+ dmesg from boot, including suspend/resume with patch from comment #11 applied

4.1.0-rc5+ dmesg from boot, including suspend/resume, with patch from comment #11 applied
Comment 14 Bjorn Helgaas 2015-06-11 16:45:53 UTC
[    0.431618] pcieport 0000:00:1c.0: in pciehp_probe:
[    0.431621] pcieport 0000:00:1c.0: PCI_EXP_FLAGS  0x0142
[    0.431637] pcieport 0000:00:1c.0: PCI_EXP_SLTSTA 0x0000

[   34.361722] pcieport 0000:00:1c.0: in pciehp_resume:
[   34.361727] pcieport 0000:00:1c.0: PCI_EXP_FLAGS  0x0042
[   34.361765] pcieport 0000:00:1c.0: PCI_EXP_SLTSTA 0x0040

#define PCI_EXP_FLAGS_SLOT      0x0100
#define  PCI_EXP_SLTSTA_PDS     0x0040

At boot, 00:1c.0 claims to have a slot (PCI_EXP_FLAGS_SLOT is set) but there's no device in the slot (PCI_EXP_SLTSTA_PDS is clear).

At resume, 00:1c.0 now claims there's no slot (PCI_EXP_FLAGS_SLOT is clear) but there *is* a device (PCI_EXP_SLTSTA_PDS is set).

pciehp doesn't re-read PCI_EXP_FLAGS at resume-time, so it still thinks there's a slot.  It does read Slot Status, which now claims there's a device, so it tries to bring up the link.  But there isn't really a device, so the link never comes up and we time out.

PCI_EXP_FLAGS_SLOT is HwInit, so it's read-only from our point of view.  I'm suspicious of the fact that it changed between boot and resume.  Is this a hardware defect, a BIOS problem, some kind of Linux re-initialization problem?

I don't know how we should handle this or why it seemed to work in v3.14.  There were several pciehp changes between v3.14 and v3.15, but nothing obvious jumps out to me.
Comment 15 Len Brown 2015-06-12 05:27:56 UTC
This may have always been broken.

> changes between v3.14 and v3.15

intending to bisect, I discovered that 3.14.0 fails too.
so I re-built and re-tested 3.14.44 and it still fails.
Although I do have analyze_suspend results for the 3.14.44 "success" case,
it is possible that I mixed that result up with a "pcie_ports=compat" test.
Unfortunately, analyze_suspend does not save the whole dmesg
and so the cmdline for that test wasn't saved.

I know when I first got this machine, that pcieport was not the slowest
resume device (i915 was), so I do think this is a regression.  However,
it could be that what I used to run a year ago had this driver disabled?

That said, what is the benefit of the pcieport driver on
a laptop with no exposed PCIE ports?
Comment 16 Bjorn Helgaas 2015-06-12 16:31:08 UTC
> That said, what is the benefit of the pcieport driver on
> a laptop with no exposed PCIE ports?

The question is how pciehp could tell whether it is needed.  The code that does this is a bit of a mess, but I think pciehp claims a port if:

  - _OSC has granted native PCIe hotplug control to the OS,
  - PCIe Capabilities Register has "Slot Implemented" set, and
  - PCIe Slot Capabilities has "Hot-Plug Capable" set.

Based on your dmesg and lspci output, these are true for 00:1c.0 and 00:1c.1, but not for 00:1c.3 (where Hot-Plug Capable is not set).

Is there some other indication pciehp should look at to determine whether a port is exposed?  I don't see anything else in PCIe, but maybe there's something in ACPI?
Comment 17 Len Brown 2015-06-12 17:27:33 UTC
FWIW, I updated the BIOS to A02 from A00 today, and it made no difference.

I've been unable to discover what scenario caused this issue to not appear
in an earlier experiment -- other than booting with the pcieport driver disabled;
so I'm clearing the "regression" bit on this report -- it seems Linux
has had this trouble with the XPS11 forever.
Comment 18 Rajat Jain 2015-06-12 18:52:19 UTC
Ummm... Len mentions that he has seen the same kernel (3.14.44) work in the past, and has this issue now. Since the pciehp purely relies on slot status & capabilities, and we see that those bits have been fiddling in the logs here (after a resume, the controller says it doesn't have a slot, but now has a device), do you think this could be an intermittent issue where because of some HW issue (or voltage / noise / spikes) some of the bits intermittently or transiently get flipped? Is there another XPS11 that can be tried?
Comment 19 Yinghai Lu 2015-06-12 22:13:14 UTC
(In reply to Len Brown from comment #17)
> FWIW, I updated the BIOS to A02 from A00 today, and it made no difference.
> 
> I've been unable to discover what scenario caused this issue to not appear
> in an earlier experiment -- other than booting with the pcieport driver
> disabled;
> so I'm clearing the "regression" bit on this report -- it seems Linux
> has had this trouble with the XPS11 forever.

I think that is silicon bug. The 00:1c.0 is from ICH, and that slot is
a dummy hotplug ...

We checked with Intel before, and they agreed that is silicon bug.
Comment 20 Bjorn Helgaas 2015-06-12 22:25:51 UTC
> I think that is silicon bug. The 00:1c.0 is from ICH, and that slot is
> a dummy hotplug ...
> 
> We checked with Intel before, and they agreed that is silicon bug.

Can you give us any more hints?  A URL to an erratum?  If we knew, for example, that function 0 of certain Root Ports can never have a slot, but it sometimes reports one, maybe we could work around that in pciehp.
Comment 21 Yinghai Lu 2015-06-12 23:15:07 UTC
(In reply to Bjorn Helgaas from comment #20)
> 
> Can you give us any more hints?  A URL to an erratum?  If we knew, for
> example, that function 0 of certain Root Ports can never have a slot, but it
> sometimes reports one, maybe we could work around that in pciehp.

Maybe Len could check with that internally.
Comment 22 Len Brown 2015-06-17 03:00:53 UTC
Intel's silicon validation team responded:

"At least the Sunrise Point Spec clearly says that if XCAP.SI is 0 the PDS bit is set to “1”. The Linux driver should re-read SI bit to make a decision and then look at the PDS"

Presence Detect State (PDS): If XCAP.SI is set (indicating that this root port
spawns a slot), then this bit indicates whether a device is connected ('1')
or empty ('0').  If XCAP.SI is cleared, this bit is a '1'.

Slot Implemented (SI): Indicates whether the root port is connected to a slot.
Slot support is platform specific.  BIOS programs this field,
and it is maintained until a platform reset.
Comment 23 Rajat Jain 2015-06-17 04:52:04 UTC
> 
> Slot Implemented (SI): Indicates whether the root port is connected to a
> slot.
> Slot support is platform specific.  BIOS programs this field,
> and it is maintained until a platform reset.

But from comment #14 above, the SI bit is getting cleared between the suspend and resume? 

[    0.431618] pcieport 0000:00:1c.0: in pciehp_probe:
[    0.431621] pcieport 0000:00:1c.0: PCI_EXP_FLAGS  0x0142
[    0.431637] pcieport 0000:00:1c.0: PCI_EXP_SLTSTA 0x0000

[   34.361722] pcieport 0000:00:1c.0: in pciehp_resume:
[   34.361727] pcieport 0000:00:1c.0: PCI_EXP_FLAGS  0x0042
[   34.361765] pcieport 0000:00:1c.0: PCI_EXP_SLTSTA 0x0040

The SI bit value is not getting maintained apparently..
Comment 24 Bjorn Helgaas 2015-06-17 16:12:56 UTC
Created attachment 180191 [details]
debug patch

(In reply to comment #22):

1) I assume XCAP.SI is another name for the "Slot Implemented" bit in the PCI Express Capabilities Register (PCIe spec r3.0, sec 7.8.2).

2) I agree that if SI is 0, PDS in the Slot Status register must be hardwired to 1 (PCIe spec, sec 7.8).  The real question is why SI is changing.

3) The PCIe SI bit is HwInit, and I assume on this hardware that translates to "write-once by firmware, then read-only and can only be reset by Fundamental Reset" (PCIe spec sec 7.4).

4) I suppose the implication is that suspend/resume involves a PCIe Fundamental Reset, and Linux should re-read SI when it resumes.  If that's true, there may be other places in PCI where we need to re-read things on resume.

5) No hardware changed during suspend/resume, so the fact that SI isn't set again during resume sounds like a BIOS bug.

I don't know much about suspend/resume ... does it involve a Fundamental Reset?
Comment 25 Bjorn Helgaas 2015-06-17 16:17:08 UTC
Dangit, I wasn't finished editing that.  Bugzilla is over-eager to submit when adding comments to an attachment.

Anyway, if there is a Fundamental Reset, maybe we could do something like the patch in comment #24.  I think there would be more to do, too.  For example, the PCIe Capability Register is cached in pcie_flags_reg and is now stale.
Comment 26 Len Brown 2015-06-18 02:38:25 UTC
Bjorn, the patch in comment #24 doesn't build for me --
does it apply to a particular tree?
Comment 27 Bjorn Helgaas 2015-06-18 03:37:11 UTC
Created attachment 180261 [details]
debug patch

Sorry, I botched the patch.  This applies on v4.1-rc2 (but should apply on any later kernel), and this time I actually did build it.
Comment 28 Len Brown 2015-06-19 23:41:17 UTC
The patch in comment #27 eliminates the big pcieport delay
from resume.  dmesg after the resume shows:

pcieport 0000:00:1c.0: slot disappeared!

However, the patch causes subsequent suspend/resumes now fail --
not making it back far enough to resume the screen.

Tested on top of 4.1-rc8, which can suspend/resume
repeatedly, without failure.
Comment 29 Len Brown 2016-10-13 21:23:31 UTC
Created attachment 241671 [details]
analyze suspend output

still broken in Linux 4.8.0, as shipped by Ubuntu 16.10

this analyze_suspend.py was run used the new -dev option,
and so it shows many msleep(10)'s an msleep(100),
and then many msleep(20)'s invoked by pciehp_check_link_status()
Comment 30 Bjorn Helgaas 2016-10-14 21:49:01 UTC
Created attachment 241731 [details]
testing patch

Here's my theory about the hang in comment #28: the first resume succeeds and since 00:1c.0 no longer advertises a slot, we call pciehp_remove(), which deregisters the slot and frees some data structures.  When we attempt a second suspend/resume, pciehp is still bound to 00:1c.0, so we call pciehp_resume() again, and we again think the slot disappeared and we again call pciehp_remove().  This time it fails because we try to use some of the data structures we freed during the first resume.

This patch clears out the service_data pointer the first time, so subsequent resumes should basically be no-ops.
Comment 31 Len Brown 2016-10-21 02:59:32 UTC
Confirmed on Linux-4.9-rc1+, that the patch in comment #30 removes the pcieport delay from resume, and allows multiple resumes to succeed w/o the hang mentioned in comment #28.
Comment 32 Len Brown 2016-10-21 19:28:22 UTC
Created attachment 242151 [details]
dmesg with patch

Also verified the patch works as applied to vanilla Linux-4.8.0
Attached is the dmesg resulting from boot with 2 suspend/resume cycles.

Note that when the patch is present, there are couple of additional lines:

[   55.927208] pcieport 0000:00:1c.0: PCIe Capabilities was 0x0142, now 0x0042
[   55.986801] pcieport 0000:00:1c.0: No longer supports hotplug
...
[   97.024614] pcieport 0000:00:1c.0: No longer supports hotplug
Comment 33 Len Brown 2016-11-21 23:49:38 UTC
Bjorn,
I don't see this patch in linux-next.
Do you have plans to upstream it?
Is there any more information I can provide?
Comment 34 Bjorn Helgaas 2016-11-22 00:38:44 UTC
If you have the machine handy, can you try this after a resume (without the patch above)?

  setpci -s 00:1c.0 42.w=142
  setpci -s 00:1c.0 42.w

The spec says the PCI_EXP_FLAGS_SLOT bit (0x100) is R/WO, and suggests that it becomes read-only after the first write.  I think this problem happens because BIOS didn't write the bit after resume.  So maybe we can do the write in Linux.
Comment 35 Len Brown 2017-11-26 03:27:00 UTC
root@xps11-9p33:/# setpci -s 00:1c.0 42.w=142
root@xps11-9p33:/# setpci -s 00:1c.0 42.w
0142
Comment 36 Len Brown 2017-11-26 03:30:29 UTC
Created attachment 260867 [details]
Still a 3,500 ms delay in Linux-4.14.2
Comment 37 Len Brown 2019-11-01 03:29:47 UTC
Upgraded the BIOS on this machine to A04 (1/2/2019).
Installed Ubuntu 19.10 (Linux 5.3)
This issue is no longer reproducible.

Note You need to log in before you can comment on or make changes to this bug.