Bug 86421 - BISECTED - Machine crashes right *after* ~successful resume - Intel(R) Core(TM) i7-3770K
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: Intel Linux
Importance: P1 normal
Assignee: Yinghai Lu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-16 20:55 UTC by Wilmer van der Gaast
Modified: 2015-03-10 00:56 UTC
CC List: 6 users

See Also:
Kernel Version: 3.12+ (from git rev 928bea964827d7824b548c1f8e06eccbbc4d0d7d)
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
enhance pci_pm_reenable_device (859 bytes, patch)
2014-11-03 22:19 UTC, Yinghai Lu
enable pci ite bridge (1.70 KB, patch)
2015-01-28 04:40 UTC, Yinghai Lu

Description Wilmer van der Gaast 2014-10-16 20:55:27 UTC
ref: https://lkml.org/lkml/2014/10/7/790 https://lkml.org/lkml/2014/10/16/184

I've filed this as a PM bug since the observed problem is a machine not resuming from suspend properly, but the change that caused it is actually PCI-related: 928bea964827d7824b548c1f8e06eccbbc4d0d7d

To summarise the thread: since that change, my machine can successfully resume from suspend-to-RAM only twice. The third time, it resumes and claims success at waking up all processes, but then the machine crashes. For a few seconds during this process, the networking stack is already up and the machine responds to ICMP echo requests. Also, for a very brief period, bits of userland are alive. For example, the ping running on this machine (with a 0.01s interval) manages to send out a few packets, though at a much slower rate than one per 0.01s.

928bea96... could not be reverted cleanly on 3.17 (or even on the 3.12 release tarball), but Yinghai produced a modified rollback change that did work. A first attempt at working around the issue was not successful.

Some logs are on http://gaast.net/~wilmer/.lkml/
Comment 1 Len Brown 2014-10-28 04:52:22 UTC
Marking as a regression on Intel HW.
Comment 2 Len Brown 2014-11-03 21:28:52 UTC
928bea964827d7824b548c1f8e06eccbbc4d0d7d
Author: Yinghai Lu <yinghai@kernel.org>  2013-07-22 17:37:17
Committer: Bjorn Helgaas <bhelgaas@google.com>  2013-07-25 14:35:03
Follows: v3.11-rc2
Precedes: v3.12-rc1

    PCI: Delay enabling bridges until they're needed
    
    We currently enable PCI bridges after scanning a bus and assigning
    resources.  This is often done in arch code.
    
    This patch changes this so we don't enable a bridge until necessary, i.e.,
    until we enable a PCI device behind the bridge.  We do this in the generic
    pci_enable_device() path, so this also removes the arch-specific code to
    enable bridges.
    
    [bhelgaas: changelog]
    Signed-off-by: Yinghai Lu <yinghai@kernel.org>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Comment 3 Yinghai Lu 2014-11-03 22:19:11 UTC
Created attachment 156431
enhance pci_pm_reenable_device

pci_pm_reenable_device() does not call pci_reenable_device(), because enable_cnt and is_busmaster are not set when bridge enabling is delayed and there is no driver for those bridges.

This happens even though the BIOS had already enabled the bridge and bus mastering before suspend.
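
A toy model of the situation described in this comment may help readers who do not know the PCI core. It is a sketch in Python, not kernel code: `ToyBridge`, `firmware_setup`, `toy_enable`, `toy_suspend`, `toy_resume` and `survives_cycle` are hypothetical names, and the model assumes that the bridge's hardware state is lost across suspend and is restored on resume only if the software bookkeeping (`enable_cnt`, `is_busmaster`) says it was enabled.

```python
from dataclasses import dataclass


@dataclass
class ToyBridge:
    """Toy stand-in for a PCI bridge; not the kernel's struct pci_dev."""
    enable_cnt: int = 0         # software bookkeeping: number of enables
    is_busmaster: bool = False  # software bookkeeping: bus mastering set
    hw_enabled: bool = False    # stands in for the hardware enable bits
    hw_busmaster: bool = False  # stands in for the hardware bus master bit


def firmware_setup(dev: ToyBridge) -> None:
    # The BIOS programs the hardware directly; the bookkeeping is untouched.
    dev.hw_enabled = True
    dev.hw_busmaster = True


def toy_enable(dev: ToyBridge) -> None:
    # An explicit enable updates both the bookkeeping and the hardware.
    dev.enable_cnt += 1
    dev.is_busmaster = True
    dev.hw_enabled = True
    dev.hw_busmaster = True


def toy_suspend(dev: ToyBridge) -> None:
    # Assumption of this model: hardware state is lost across suspend.
    dev.hw_enabled = False
    dev.hw_busmaster = False


def toy_resume(dev: ToyBridge) -> None:
    # Restore only what the bookkeeping says was enabled before suspend.
    if dev.enable_cnt > 0:
        dev.hw_enabled = True
    if dev.is_busmaster:
        dev.hw_busmaster = True


def survives_cycle(dev: ToyBridge) -> bool:
    # Run one suspend/resume cycle and report whether the bridge works after.
    toy_suspend(dev)
    toy_resume(dev)
    return dev.hw_enabled and dev.hw_busmaster


bios_only = ToyBridge()
firmware_setup(bios_only)         # enabled by firmware, never by software
print(survives_cycle(bios_only))  # False

tracked = ToyBridge()
toy_enable(tracked)               # enabled through the bookkeeping path
print(survives_cycle(tracked))    # True
```

In this model the firmware-enabled bridge comes back disabled because nothing recorded that it was ever enabled, which is the gap the comment points at.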
Comment 4 Zhang Rui 2014-11-10 07:08:11 UTC
Wilmer, can you please apply the patch and check if it works for you?
Comment 5 Wilmer van der Gaast 2014-11-12 21:37:12 UTC
Sorry, I didn't realise there was another patch on this bug. :-/

I compiled a kernel with it applied and rebooted, then https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1289977 rendered my machine unbootable. I don't have a USB drive handy and have no time to get the thing back into a bootable state right now, so I will get back to this bug later.

TYVM, Grub... :<
Comment 6 Wilmer van der Gaast 2014-11-16 22:11:34 UTC
That machine finally boots again, and the patch worked! Machine finished six suspend+resume cycles.
Comment 7 Zhang Rui 2014-11-17 03:28:46 UTC
Bug resolved.

Yinghai, what is the status of the patch?
Comment 8 Wilmer van der Gaast 2015-01-14 23:17:33 UTC
Looks like at least the most recent patch is not merged yet? Is any other fix merged? I'd like to stop running 3.10 kernels on this machine.. :-(
Comment 9 Bjorn Helgaas 2015-01-14 23:49:14 UTC
Yinghai sent at least two patches for testing, but they didn't have changelogs, signed-off-by, etc., and weren't posted to linux-pci.  That means they don't officially exist and didn't make it onto my reviewing queue.  I'm reopening the bug.
Comment 10 Yinghai Lu 2015-01-28 04:40:05 UTC
Created attachment 164981
enable pci ite bridge

Just enable the bridge ...

Wilmer,

Can you please try this patch instead?
Comment 11 Wilmer van der Gaast 2015-02-09 00:37:59 UTC
My apologies for the late response! :-(

I haven't even tried the patch yet, because annoyingly and very very confusingly, the problem is no longer showing. :-(

I've tried 3.19-rc7, 3.17.8 and the most recent 3.16 Debian package, and the machine now resumes reliably almost every time. I only saw it hang once, when I tried really hard - the only difference on my system between now and then is that I got rid of the Microsoft wireless kb/mouse receiver. When I plugged it back in, two resumes later the machine did hang.

However another reboot later, and I now could do six suspend-resume cycles with that stick plugged in.

I should look in the logs to see whether the PCI device info really doesn't get corrupted anymore, or whether the corruption is somehow no longer causing crashes (which seems unlikely). I will try to do that tomorrow; it's getting too late here now.

I have no idea how this could change - the crash always happened super-reliably on a whole bunch of kernels. I haven't made any other changes to my system (last time I opened the case was back in Oct/Nov while troubleshooting this very issue).
Comment 12 Rafael J. Wysocki 2015-03-10 00:56:40 UTC
OK, closing for now.  Please reopen if you find more information.
