Bug 86241

Summary: Regression: Suspend to RAM randomly fails on Dell XPS 13
Product: Drivers Reporter: Vincent Petry (PVince81)
Component: Other Assignee: Tomas Winkler (tomas.winkler)
Status: RESOLVED CODE_FIX    
Severity: high CC: alan, conflatulence, gabriele.mzt, jiri.tyr, lenb, mamadontgodaddycomehome, mattia.b89, r087r70, rui.zhang, tomas.winkler, tomasw, vbourachot
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 3.16.4-1.g7a8842b-desktop Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg from simulated suspend
All loaded modules
screen hang at shutdown
mei dmesg debug message from real s2ram
mei dmesg debug message from successful s2ram
Disable mei on XPS13 9333
suspend issue possible fix

Description Vincent Petry 2014-10-14 12:35:13 UTC
I have a Dell XPS 13 9333 (Intel i7-4650U), 8GB ram

This worked properly before using openSUSE 13.1 x86_64 with kernel 3.11.10-21-desktop and KDE 4.11.5 

I've installed openSUSE Factory from scratch (13.2 beta, then switched to factory repos) with kernel 3.16.4-1.g7a8842b-desktop and KDE 4.14.1.

Since that last upgrade, suspend to RAM randomly hangs.

There's another guy, Roberto, who owns exactly the same laptop and had similar issues; the original report is here: https://bugs.kde.org/show_bug.cgi?id=339391

We both confirmed that the problem also happens when running "sudo pm-suspend" from the command line, so it is unlikely to be a KDE bug.

My root filesystem is BTRFS on SSD, swap is enabled.

The problem cannot be reproduced consistently, but according to Roberto we need to wait several hours before attempting a suspend to RAM to increase the chances of it happening. I tried doing it several times in a row after a cold boot and it did not happen.


So far I tried setting "pm_trace" to 1, but after it hung the dmesg logs didn't contain any new messages.

Some logs can be found in the linked report here: https://bugs.kde.org/show_bug.cgi?id=339391



Please let me know what kind of information is needed to debug this.
I'm trying to find out what components I could disable to try and isolate the problems.
Comment 1 Vincent Petry 2014-10-14 12:43:12 UTC
Copying Roberto's environment info here:
System: dell-xps13-sputnik-2014 (Intel i7-4650U), 8GB ram, NO SWAP SPACE (maybe relevant?) OS: Kubuntu 14.04 x86_64 with kernel 3.16.0-17

It is likely that the bug is related to kernel 3.16 as we both have this major version.
Comment 2 Roberto 2014-10-14 12:48:41 UTC
I'm also affected by this bug (same laptop and OS as Vincent).

In my case I also have problems shutting down the system: it freezes with the screen saying something like: "The system will halt NOW..."
Comment 3 Vincent Petry 2014-10-15 07:42:37 UTC
For the suspend to RAM issue, whenever we know it's likely to happen soon (after a day of laptop use), here are a few things we could try:
- Log out from KDE and shut down the X server, then run pm-suspend
- Unload vboxdrv and other modules like bluetooth

Based on these results we can at least find out whether it's anything in KDE that is causing the issue, which would give us directions about what to try next.

Roberto, do you also have Virtualbox installed ?

And obviously something else to try would be to upgrade the kernel to 3.17.
Comment 4 Roberto 2014-10-15 07:49:49 UTC
I do have virtualbox installed, even though I never use it.

And I'm with mainline kernel 3.17.
Comment 5 Vincent Petry 2014-10-15 12:13:59 UTC
Just had the crash again: unloading all virtualbox modules didn't help.
Comment 6 Roberto 2014-10-15 13:57:38 UTC
Tried with the X server shut down, doesn't help.
Comment 7 Vincent Petry 2014-10-16 08:35:14 UTC
Yesterday I shut down KDE but forgot to shut down Xorg, and also got the crash.

So now I guess the next step is trying to stop services, then also manually unload kernel modules.

Another idea would be to downgrade the kernel until it works again to find the version in which the regression was introduced (might be tedious).
Comment 8 Vincent Petry 2014-10-16 08:41:27 UTC
I just found this: https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Scroll down to "Testing suspend to RAM".

It seems it's possible to test different phases of suspend to identify what's going wrong. I'll try this in a few hours when the likelihood of a crash is higher.
Comment 9 Vincent Petry 2014-10-16 11:34:02 UTC
Created attachment 153951 [details]
dmesg from simulated suspend

I tried the following:

# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
# s2ram

and later

# echo core > /sys/power/pm_test
# s2ram

All worked. I've attached the resulting dmesg from the "core" test in case it contains clues.

However after that I set it back to "none" for the real suspend to ram, and then the system hung again.

There is a suspicious stack trace in that log, but not sure if related to the issue itself.
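
For reference, the staged tests from basic-pm-debugging.txt can be walked through one level at a time. The following is a minimal sketch, not a verified procedure for this laptop: the function name `run_pm_test_levels` and the `SYS_POWER` variable are made up for illustration (the variable only exists so the loop can be dry-run against a scratch directory), and on a real system it needs root.

```shell
#!/bin/sh
# Sketch: step through the pm_test levels listed in basic-pm-debugging.txt.
# SYS_POWER is a hypothetical override; on a real system it is /sys/power.
SYS_POWER="${SYS_POWER:-/sys/power}"

run_pm_test_levels() {
    for level in freezer devices platform processors core; do
        echo "testing pm_test level: $level"
        echo "$level" > "$SYS_POWER/pm_test" || return 1
        # Writing "mem" starts a suspend-to-RAM cycle; in test mode the
        # kernel stops at the selected stage, waits, and resumes.
        echo mem > "$SYS_POWER/state" || return 1
    done
    # Switch test mode off again so the next suspend is a real one.
    echo none > "$SYS_POWER/pm_test"
}
```

If one level hangs while the previous ones pass, that narrows the problem to the stage added by that level.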
Comment 10 Vincent Petry 2014-10-18 17:20:27 UTC
Today when closing the lid it hung but this time it didn't shut down the screen.
When I reopened I could still see the desktop but everything was frozen.

Just mentioning this in case it's another subtle clue.



I think next up to test would be unloading kernel modules like wifi, usb and others, before running suspend to ram.
Comment 11 Vincent Petry 2014-10-22 07:55:49 UTC
I had two more unsuccessful tries:
1) rmmod of iwlwifi,mac80211,iwlmvm
2) rmmod of xhci_hcd, sdhci, sdhci_acpi, mmc_core

Maybe next time I'll try and remove even more modules.

Note. openSUSE Factory upgraded me to kernel 3.16.4-1.g7a8842b-desktop but the problem persists here.
Comment 12 Vincent Petry 2014-10-22 12:32:31 UTC
Another failed test run where I removed snd, iwlwifi, bluetooth, *hci.
Here are the remaining modules:
Module                  Size  Used by
ctr                    13049  0 
ccm                    17773  0 
af_packet              39991  0 
fuse                  100461  0 
6lowpan_iphc           18702  0 
binfmt_misc            17468  1 
hid_logitech_dj        18469  0 
uvcvideo               89131  0 
videobuf2_vmalloc      13216  1 uvcvideo
videobuf2_memops       13362  1 videobuf2_vmalloc
videobuf2_core         63200  1 uvcvideo
v4l2_common            15265  1 videobuf2_core
videodev              157329  3 uvcvideo,v4l2_common,videobuf2_core
hid_multitouch         17419  0 
dell_wmi               12681  0 
sparse_keymap          13948  1 dell_wmi
iTCO_wdt               13480  0 
iTCO_vendor_support    13718  1 iTCO_wdt
hid_rmi                17528  0 
x86_pkg_temp_thermal    14205  0 
intel_powerclamp       18823  0 
coretemp               13441  0 
dcdbas                 14978  0 
crct10dif_pclmul       14268  0 
crc32_pclmul           13133  0 
ghash_clmulni_intel    13230  0 
aesni_intel           152552  0 
aes_x86_64             17131  1 aesni_intel
lrw                    13286  1 aesni_intel
gf128mul               14951  1 lrw
glue_helper            13990  1 aesni_intel
ablk_helper            13597  1 aesni_intel
cryptd                 16263  3 ghash_clmulni_intel,aesni_intel,ablk_helper
arc4                   12608  0 
serio_raw              13434  0 
lpc_ich                21093  0 
mfd_core               13435  1 lpc_ich
i2c_i801               22454  0 
shpchp                 32951  0 
wmi                    19193  1 dell_wmi
thermal                22971  0 
battery                23237  0 
i2c_hid                18726  0 
mei_me                 23664  0 
ac                     13335  0 
i2c_designware_platform    12979  0 
soundcore              15047  0 
8250_dw                13551  0 
i2c_designware_core    14768  1 i2c_designware_platform
mei                    96067  1 mei_me
processor              40484  0 
dm_mod                111114  0 
btrfs                1006855  1 
xor                    21411  1 btrfs
raid6_pq              106004  1 btrfs
crc32c_intel           22094  1 
i915                  983484  1 
i2c_algo_bit           13413  1 i915
video                  24419  1 i915
drm_kms_helper         65670  1 i915
drm                   335594  3 i915,drm_kms_helper
button                 13971  1 i915
sg                     40630  0
Comment 13 Vincent Petry 2014-10-22 12:33:33 UTC
Created attachment 154551 [details]
All loaded modules

For reference I've attached the list of ALL loaded modules on the Dell XPS 13 when on the desktop (KDE).
Comment 14 Roberto 2014-10-22 15:33:56 UTC
Yesterday I tested the standby/shutdown issue using a USB key with a live Linux distribution (tried both kali-live and kubuntu-live).
They worked flawlessly.

But again, it may depend on how long the system has been active, so I'll test this again thoroughly in the next few days...
Comment 15 Alan 2014-10-23 14:08:54 UTC
Does it happen if you fill up ram with loads of stuff then suspend/resume even early on ?
Comment 16 Roberto 2014-10-23 20:26:28 UTC
With 8GB ram I have to think how to fill it up. I'll try.
Comment 17 Vincent Petry 2014-10-23 21:37:17 UTC
You could mount ramdisks (tmpfs) and fill them with files.

Another thing to try: write down the memory usage after login and before suspend.
That could at least confirm whether it's related to the ram usage somehow.

Like I said previously, suspend to ram used to work flawlessly in openSUSE 13.1 with kernel 3.11.10-21 so it's not like it never worked before.

Roberto, what kernel version did you have on that USB stick ?
Comment 18 Alan 2014-10-23 21:47:45 UTC
find /usr -exec grep -H Bananas {} \;

or similar with a very large file system of big files will fill memory up with tons and tons of cached file data. That may not be the same as tons and tons of dirty app data, but it's a first test.
Comment 19 Roberto 2014-10-25 09:21:32 UTC
after boot, pm-suspend worked fine after filling up /dev/shm with some large files

cached memory should not be the problem here.
Comment 20 Vincent Petry 2014-10-27 13:38:44 UTC
This is free -h before the suspend crash after a few hours use:

             total       used       free     shared    buffers     cached
Mem:          7.3G       7.1G       254M       340M         8K       3.4G
-/+ buffers/cache:       3.7G       3.7G
Swap:          15G       240K        15G


After the crash and directly after the reboot I created a tmpfs virtual drive:
sudo mount -t tmpfs -o size=6024M tmpfs temp3

Then I copied a VM inside until it said "no more free space".

The used memory was also around 7G (forgot to save it) and the suspend did not crash.

I'll try Alan's approach later.
Comment 21 Len Brown 2014-10-28 04:44:40 UTC
Marked as a regression on Intel HW.

Can you reproduce using the upstream Linux kernel?
Can you identify the latest kernel that worked
(and the first that failed), say, by using git-bisect?
Comment 22 Roberto 2014-10-28 10:05:55 UTC
Hello,

as stated in comment #4 I've tried with 3.17 mainline kernel.

In my case, everything earlier than the factory-installed ubuntu-12.04 produces the error.
Comment 23 Vincent Petry 2014-10-29 14:27:07 UTC
git bisect is an idea but will take quite some time as we still don't know how to trigger the issue reliably, apart from using the laptop for a few hours. So I would do that only as a last resort, or as soon as we know how to produce the issue quickly.

In my case I tried running "free -h" before every suspend, and this is the result so far:

Failed suspends:
             total       used       free     shared    buffers     cached
Mem:          7.3G       4.1G       3.2G       2.2M       452K       4.0G
-/+ buffers/cache:       135M       7.2G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       6.5G       910M       329M       880K       4.1G
-/+ buffers/cache:       2.4G       5.0G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       6.1G       1.3G       260M       880K       4.0G
-/+ buffers/cache:       2.1G       5.3G
Swap:          15G         0B        15G


Succeeded suspends:
             total       used       free     shared    buffers     cached
Mem:          7.3G       3.7G       3.7G       268M       880K       2.1G
-/+ buffers/cache:       1.6G       5.8G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       4.5G       2.9G       261M       880K       2.0G
-/+ buffers/cache:       2.5G       4.9G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       5.0G       2.3G       284M       880K       2.3G
-/+ buffers/cache:       2.8G       4.6G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       5.4G       2.0G       273M       880K       2.4G
-/+ buffers/cache:       2.9G       4.4G
Swap:          15G         0B        15G
             total       used       free     shared    buffers     cached
Mem:          7.3G       5.4G       2.0G       266M       880K       2.5G
-/+ buffers/cache:       2.9G       4.4G
Swap:          15G         0B        15G


I wonder if there's a limit (MAX_INT?) after which the cached memory will cause suspend to fail. (if it's really memory-related)
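
The per-suspend bookkeeping above could be automated so that no data point gets lost. This is only a sketch: the function names, the `MEMINFO` override, and the `LOG` location are assumptions for illustration, and it records just the page-cache figure rather than the full `free -h` table.

```shell
#!/bin/sh
# Sketch: record the page-cache size right before each suspend.
# MEMINFO and LOG are hypothetical overrides chosen for this example.
MEMINFO="${MEMINFO:-/proc/meminfo}"
LOG="${LOG:-$HOME/suspend-mem.log}"

# Print the "Cached:" value from meminfo, converted from kB to MiB.
cached_mib() {
    awk '$1 == "Cached:" { print int($2 / 1024) }' "$MEMINFO"
}

# Append a timestamped snapshot; call this just before suspending.
log_before_suspend() {
    printf '%s cached=%sMiB\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(cached_mib)" >> "$LOG"
}
```

Calling `log_before_suspend` and then `pm-suspend` leaves a log that can be annotated afterwards with whether the suspend succeeded.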
Comment 24 Vincent Petry 2014-10-30 08:37:47 UTC
I tried Alan's technique with:
find /usr -exec grep -H Bananas {} \;

Firstly, I didn't find any Bananas ;-)

Secondly, the cached memory did fill up as expected, but running suspend to RAM after that still worked properly...

             total       used       free     shared    buffers     cached
Mem:          7.3G       7.2G       131M       299M         8K       5.2G
-/+ buffers/cache:       2.0G       5.3G
Swap:          15G       2.2M        15G

Other failures I had were with cached memory > 4 GB.
So it looks like the issue is not memory related.

Usually it happens after 3 hours waking time.

What else could build up over time ?
Comment 25 Vincent Petry 2014-10-30 08:41:17 UTC
Additional note: yesterday I had another hanging case where the screen stayed on (this is quite rare), I saw KDE's lock screen but everything was frozen.

Which means that the part that causes the system to freeze might sometimes happen even before the screen is suspended.

Also I still don't understand why running the suspend tests all work fine but only the actual suspend will freeze. Is there something that is not covered by the tests from here https://bugzilla.kernel.org/show_bug.cgi?id=86241#c9 ?
Comment 26 Vincent Petry 2014-10-30 12:58:06 UTC
Tried to drop the ram caches with 
echo 3 > /proc/sys/vm/drop_caches
before suspending. Still hung.
It confirms that it's probably not memory related.

Tried booting 3.18.0-rc2-g7a8842b but had trouble booting.
Will try with older versions and see if I can identify which version introduced the issue. Might take several days, stay tuned ;-)
Comment 27 Vincent Petry 2014-10-31 18:36:09 UTC
I compiled and installed 3.15.0-1.g7a8842b-desktop

Today I didn't experience any issues with suspend to RAM.

So the regression is between 3.15.0 and 3.16.4.

Will try and find a narrower interval.
Comment 28 Roberto 2014-10-31 19:45:34 UTC
Vincent: please test the 3.15 thoroughly, we need to be totally sure about it functioning!

thank you
Comment 29 Vincent Petry 2014-11-03 13:58:59 UTC
Kernel 3.16.0 failed.

Next up: 3.15.10

Roberto: don't worry, once I find the threshold I'll double check.
Comment 30 Vincent Petry 2014-11-04 17:59:45 UTC
Laptop survived one work day with 2 suspends with 3.15.10.

The breaking change is likely to be between 3.15.10 and 3.16.0
Comment 31 Vincent Petry 2014-11-05 18:36:31 UTC
Got a crash today with 3.16-rc2.
I couldn't boot 3.16-rc1.

So now I'd say the breaking change is between 3.15.10 and 3.16-rc2.
I see many changes related to ACPICA have been introduced.

I'll try one more day with 3.15.10 to confirm that it's stable.

Not sure how to proceed from there, I could try and bisect individual commits ? But that will likely take a month (one day per step). Also compiling the kernel seems to take 2 hours even on my modern machine.
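
To get a feel for how long such a bisect takes at one step per day, the number of steps can be approximated from the number of candidate revisions. The helper below is a rough sketch: `bisect_steps` is a made-up name, and it simply computes ceil(log2(n)), which is not the exact formula git uses for its own "roughly N steps" message.

```shell
#!/bin/sh
# Sketch: approximate the number of bisect steps for n candidate revisions
# as ceil(log2(n)). This is an approximation, not git's own estimate.
bisect_steps() {
    n="$1"
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}
```

In a kernel checkout, something like `bisect_steps "$(git rev-list --count v3.15..v3.16-rc2)"` would give the estimate for that range, assuming both tags are present locally.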
Comment 32 Vincent Petry 2014-11-07 13:18:26 UTC
I started bisecting between 3.15.10 and 3.16-rc2
There are about 13 steps.

Here's the result so far:

git bisect start
# good: [f35b5e46feabab668a44df5b33f3558629f94dfc] Linux 3.15.10
git bisect good f35b5e46feabab668a44df5b33f3558629f94dfc
# bad: [a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee] Linux 3.16-rc2
git bisect bad a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [d09cc3659db494aca4b3bb2393c533fb4946b794] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad d09cc3659db494aca4b3bb2393c533fb4946b794
Comment 33 Zhang Rui 2014-11-10 07:06:42 UTC
any chance to get the first bad commit?
Comment 34 Vincent Petry 2014-11-10 16:57:57 UTC
Zhang Rui, I'm still bisecting... As stated above, the laptop needs to be used actively for at least 3 hours before the bug can be reproduced. Which means I'm roughly doing one step a day.

Here is the current updated log:
git bisect start
# good: [f35b5e46feabab668a44df5b33f3558629f94dfc] Linux 3.15.10
git bisect good f35b5e46feabab668a44df5b33f3558629f94dfc
# bad: [a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee] Linux 3.16-rc2
git bisect bad a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [d09cc3659db494aca4b3bb2393c533fb4946b794] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad d09cc3659db494aca4b3bb2393c533fb4946b794
# bad: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect bad 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [4046136afbd1038d776bad9c59e1e4cca78186fb] Merge tag 'char-misc-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc into next
git bisect bad 4046136afbd1038d776bad9c59e1e4cca78186fb
# good: [825f4e0271b0de3f7f31d963dcdaa0056fe9b73a] Merge tag 'soc-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into next
git bisect good 825f4e0271b0de3f7f31d963dcdaa0056fe9b73a

9 steps to go.
Comment 35 Vincent Petry 2014-11-10 17:02:17 UTC
To be clearer, here is how I test:
1. compile the kernel (takes about 1-2 hours)
2. boot the kernel
3. use the computer from 10 am until around 1 pm
4. suspend to ram
5. after lunch, continue using the computer until 6-7 pm
6. suspend to ram

For bad commits, the computer will hang either at step 4 or at step 6. Suspending the computer earlier than 2 hours after boot usually works correctly.

For good commits, it will always suspend properly.

But to be sure I prefer to let it run the whole day to avoid false positives, as it is still not clear what exactly is causing the freezing to happen.
Comment 36 Roberto 2014-11-10 19:51:49 UTC
Created attachment 157201 [details]
screen hang at shutdown
Comment 37 Roberto 2014-11-10 19:53:05 UTC
I've attached a screenshot of the hung system after shutdown.
Comment 38 Vincent Petry 2014-11-11 08:40:40 UTC
Roberto, I'd suggest raising the shutdown issue as a separate ticket, as I believe it is a completely different issue.

I never had such shutdown issues with this laptop.
Comment 39 Roberto 2014-11-11 09:46:08 UTC
Strangely, shutdown freeze occurs only after several hours of work, just like suspend-to-ram.
This makes me think that they might be related. However, we will see once the s2ram problem is solved...
Comment 40 Vincent Petry 2014-11-11 21:47:32 UTC
Next step done:

# good: [70bc6bb3f254c1cf605a30a2d5bb18eff90a9584] Merge tag 'zynq-dt-for-3.16' of git://git.xilinx.com/linux-xlnx into next/dt
git bisect good 70bc6bb3f254c1cf605a30a2d5bb18eff90a9584
Comment 41 Vincent Petry 2014-11-12 19:06:18 UTC
I see some new people have joined us here.

Welcome to the daily bisect ! Please have a seat.

Here is today's update:

# good: [755a9ba7bf24a45b6dbf8bb15a5a56c8ed12461a] Merge tag 'dt-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into next
git bisect good 755a9ba7bf24a45b6dbf8bb15a5a56c8ed12461a

there are about 7 steps to go.

Stay tuned :-)
Comment 42 Vincent Petry 2014-11-14 17:54:57 UTC
# good: [182434f748885e169554fba410aebfef6bdf21ed] Merge tag 'exynos-cpuidle' of http://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into next/drivers
git bisect good 182434f748885e169554fba410aebfef6bdf21ed


Bisecting: 36 revisions left to test after this (roughly 6 steps)
Comment 43 Vincent Petry 2014-11-17 18:01:35 UTC
# bad: [6a57bad6e78ba0355f0f6df8cca1f7df42b58bfd] Merge tag 'extcon-next-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next
git bisect bad 6a57bad6e78ba0355f0f6df8cca1f7df42b58bfd
# good: [0604002cde72cd60a11013daf2d9f456d4895ce8] extcon: max14577: Use devm_extcon_dev_allocate for extcon_dev
git bisect good 0604002cde72cd60a11013daf2d9f456d4895ce8

Bisecting: 14 revisions left to test after this (roughly 4 steps)
Comment 44 Vincent Petry 2014-11-17 21:54:58 UTC
Got "lucky" once more this evening with a bad one:

# bad: [86113500c060bccb0f08bdcadcecc0bd267fd25a] mei: make return values consistent across the driver
git bisect bad 86113500c060bccb0f08bdcadcecc0bd267fd25a

3 steps to go, yay!
Comment 45 Vincent Petry 2014-11-18 17:55:14 UTC
Bisecting: 3 revisions left to test after this (roughly 2 steps)

# good: [a532bbedc85ff3b834ba81e49163a3f543be1775] mei: add function to check write queues
git bisect good a532bbedc85ff3b834ba81e49163a3f543be1775
Comment 46 Vincent Petry 2014-11-21 12:39:34 UTC
# good: [e13fa90ce42d8e7ee501426ea414c8ae4a5366ef] mei: me: use runtime PG pm domain for non wakeable devices
git bisect good e13fa90ce42d8e7ee501426ea414c8ae4a5366ef
# bad: [61a1aea7c7cb40de071e202cfaa31fa2c1fca8ba] mei: me: bump hbm version to 1.1 to support power gating
git bisect bad 61a1aea7c7cb40de071e202cfaa31fa2c1fca8ba


Bisecting: 0 revisions left to test after this (roughly 0 steps)
Comment 47 Vincent Petry 2014-11-21 18:39:43 UTC
# bad: [d2d56faebaed1dd9bc011fcceed7df6b1bea8fac] mei: txe: use runtime PG pm domain for non wakeable devices
git bisect bad d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
# first bad commit: [d2d56faebaed1dd9bc011fcceed7df6b1bea8fac] mei: txe: use runtime PG pm domain for non wakeable devices

And, after one month of bisecting, ladies and gentlemen, here is the breaking commit:

commit d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
Author: Alexander Usyskin <alexander.usyskin@intel.com>
Date:   Tue Mar 18 22:52:06 2014 +0200

    mei: txe: use runtime PG pm domain for non wakeable devices
    
    For non wakeable devices we can't use pci runtime framework
    as we are not able to wakeup from D3 states.
    Instead we create new pg runtime domain that only drives TXE power
    gating protocol to reduce the power consumption.
    
    Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
    Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

:040000 040000 3a6e2f73692e0747af2148a103588367e9aca3e2 c9c5bfdce6535d7d714caf93c480e19e1709cadc M      drivers


I will do another set of tests, one with that commit and one with the one before, to make sure that there was no false positive.
Comment 48 Vincent Petry 2014-11-25 07:54:39 UTC
d2d56faebaed1dd9bc011fcceed7df6b1bea8fac confirmed broken. Note that the two times I got it hanging the screen was still on and showed the KDE lock screen (which isn't the case in all failures). I hope it's not yet another similar issue.

Next up:
1) Test the commit before that: e13fa90ce42d8e7ee501426ea414c8ae4a5366ef and make sure s2ram doesn't hang over a span of several days (in progress)
2) Test with 3.17.4 to see whether the bug was fixed
3) Test with 3.17.4 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
4) Test with 3.18-rc6
5) Test with 3.18-rc6 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
6) Bisect the broken commit itself, removing lines one by one...

In the meantime I hope a kernel dev could chime in with some ideas/theories as to why this would hang s2ram after 3 hours of laptop use (but not before), and not fail the blank tests as shown here https://bugzilla.kernel.org/show_bug.cgi?id=86241#c9.

If needed I can also test patches.
Comment 49 Vincent Petry 2014-11-27 13:21:29 UTC
Some update:

1) The commit before the breaking one, which is e13fa90ce42d8e7ee501426ea414c8ae4a5366ef worked fine for me over several days: did not hang. This confirms that d2d56faebaed1dd9bc011fcceed7df6b1bea8fac is the breaking commit.

2) Test with 3.17.4 (in progress)

4) Test with 3.18-rc5: s2ram still hangs, so the problem was not fixed there
Comment 50 Vincent Petry 2014-11-28 07:44:54 UTC
3.17.4 broken too.

Next up: will try to revert the commit.
Comment 51 Vincent Petry 2014-11-28 13:36:43 UTC
Bad news: 3.17.4 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac is STILL broken.

This means that d2d56faebaed1dd9bc011fcceed7df6b1bea8fac probably isn't the only commit that breaks suspend. It is likely that there is a bunch of commits after that that will trigger the issue.

I have no idea how I can efficiently find out which it is. I could try redoing the full bisect but by reverting d2d56faebaed1dd9bc011fcceed7df6b1bea8fac for every step... but this is going a bit too far.

Any ideas what I could do or how I could debug this ? Can devs provide a patch that logs more information ?

The hardest part is that the logs do not contain any useful/extra information, especially since the laptop crashes directly and log entries, if any, are probably not flushed to disk. There is no way I could hook the laptop to a serial console as suggested on some websites.

Any suggestions ?
Comment 52 Vincent Petry 2014-11-28 18:09:00 UTC
Okay, so that one commit at the beginning of the breaking change is related to Intel's MEI.

And I noticed that "lsmod | grep mei" shows this:
mei_me                 19527  0 
mei                    88055  1 mei_me

Next time I could try unloading that driver before sleep (not sure if possible or recommended though).
Comment 53 Vincent Petry 2014-12-01 22:12:15 UTC
I have unloaded both "mei_me" and "mei" (which seem to be related to the AMT management service from Intel, which I don't use and which doesn't seem to work anyway).

After unloading these modules, I've been able to suspend properly for at least one day.

I'll continue for one more day to make sure it's not a false positive.

Roberto, could you try the same on your laptop ? (blacklist "mei" and "mei_me")
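
For anyone wanting to try the same workaround, a minimal sketch of the blacklist step follows. The function name, the `MODPROBE_DIR` override, and the file name `blacklist-mei.conf` are assumptions for illustration; distributions may prefer a different file name under /etc/modprobe.d.

```shell
#!/bin/sh
# Sketch: keep the MEI drivers from being auto-loaded at boot.
# MODPROBE_DIR and the file name are hypothetical choices for this example.
MODPROBE_DIR="${MODPROBE_DIR:-/etc/modprobe.d}"

write_mei_blacklist() {
    cat > "$MODPROBE_DIR/blacklist-mei.conf" <<'EOF'
# Workaround for suspend hangs: do not auto-load the MEI drivers.
blacklist mei_me
blacklist mei
EOF
}

# To unload the modules for the current session only (mei_me first,
# because it depends on mei), run as root:
#   modprobe -r mei_me mei
```

The blacklist only takes effect on the next boot; the `modprobe -r` line covers the running session.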
Comment 54 Alan 2014-12-02 13:06:54 UTC
+Tomas Winkler for mei/amt
Comment 55 Tomas Winkler 2014-12-02 13:38:15 UTC
Can you please also attach the output of lspci -vnx 00:16.0

Thanks
Comment 56 Vincent Petry 2014-12-02 18:56:03 UTC
The command didn't work with the 00:16.0 argument so I copied the matching part from the output:

00:16.0 0780: 8086:9c3a (rev 04)
        Subsystem: 1028:060a
        Flags: fast devsel, IRQ 16
        Memory at f0519000 (64-bit, non-prefetchable) [size=32]
        Capabilities: [50] Power Management version 3
        Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
        Kernel modules: mei_me
00: 86 80 3a 9c 02 00 18 00 04 00 80 07 00 00 80 00
10: 04 90 51 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 0a 06
30: 00 00 00 00 50 00 00 00 00 00 00 00 0b 01 00 00


And this is when not verbose:
00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)


Note: I haven't found anywhere in the BIOS how to enable AMT and didn't find any docs about it for the Dell XPS 13. When I bought the laptop it came with Ubuntu so it is likely that AMT was already permanently disabled by default.
I read somewhere online that disabling AMT was a permanent operation.
Not sure if this could affect the module somehow.
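
A likely reason the bare address was rejected is that lspci expects a device filter to be passed with the -s option. The wrapper below is a small sketch under that assumption; `show_pci_slot` is a made-up name, and the format check only covers the short bus:device.function form used in this thread.

```shell
#!/bin/sh
# Sketch: show one PCI device verbosely, selecting it with lspci's -s filter.
show_pci_slot() {
    slot="$1"
    case "$slot" in
        [0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-7]) ;;
        *) echo "unexpected slot format: $slot" >&2; return 2 ;;
    esac
    # -v verbose, -n numeric IDs, -x hex dump of config space, -s slot filter
    lspci -vnx -s "$slot"
}
```

With this, `show_pci_slot 00:16.0` should print only the HECI device instead of the full listing.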
Comment 57 Roberto 2014-12-02 22:22:48 UTC
Holy shit, I confirm It was that fucking mei module!

By blacklisting mei+mei_me I can finally suspend AND shutdown properly!

One year of trouble on a 1300+ euros laptop for a useless module, WTF intel!!
Comment 58 Roberto 2014-12-02 22:24:07 UTC
BTW, a huge thanks to Vincent: excellent work! thank you!!
Comment 59 Zhang Rui 2014-12-03 01:35:32 UTC
re-assign to mei experts.
Comment 60 Tomas Winkler 2014-12-03 08:14:22 UTC
(In reply to Vincent Petry from comment #56)
> The command didn't work with the 00:16.0 argument so I copied the matching
> part from the output:
> 
> 00:16.0 0780: 8086:9c3a (rev 04)
>         Subsystem: 1028:060a
>         Flags: fast devsel, IRQ 16
>         Memory at f0519000 (64-bit, non-prefetchable) [size=32]
>         Capabilities: [50] Power Management version 3
>         Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
>         Kernel modules: mei_me
> 00: 86 80 3a 9c 02 00 18 00 04 00 80 07 00 00 80 00
> 10: 04 90 51 f0 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 0a 06
> 30: 00 00 00 00 50 00 00 00 00 00 00 00 0b 01 00 00
> 
> 
> And this is when not verbose:
> 00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)

Thanks for the output and for the hard work on bisecting. I will try to find a similar product to reproduce on, though I'll be glad if you can help me nail this issue, as I do not have immediate access to all possible Intel-based laptops.
There might also be some BIOS incompatibility. The BIOS is usually not finalized by Intel and there might be inconsistencies.

To focus on the issue and your bisection finding, can you please check whether runtime PM is enabled for this device?

cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_status
cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_enabled

To get verbose debug output we can enable dynamic debug logs:
cat /etc/modprobe.d/mei.conf
options mei dyndbg="+pltf"
options mei_me dyndbg="+pltf"

Maybe this will give us some more hints, since I do not see anything suspicious in the log you've attached.


> 
> Note: I haven't found anywhere in the BIOS how to enable AMT and didn't find
> any docs about it for the Dell XPS 13. When I bought the laptop it came with
> Ubuntu so it is likely that AMT was already permanently disabled by default.
> I read somewhere online that disabling AMT was a permanent operation.
> Not sure if this could affect the module somehow.

AMT is only available on vPro machines, and the MEI driver also has uses beyond AMT, so this is not the issue here.

For your personal use you can also blacklist the module though I'll be glad if you can help me to validate a possible fix.
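
As an alternative to the modprobe.d options, dynamic debug can usually be switched on at runtime through debugfs, without reloading the modules. The sketch below assumes debugfs is mounted at /sys/kernel/debug and the kernel was built with dynamic debug support; `enable_module_dyndbg` and the `DYNDBG_CONTROL` override are names invented for this example.

```shell
#!/bin/sh
# Sketch: enable dynamic debug for one module at runtime (needs root).
# DYNDBG_CONTROL is a hypothetical override; the default is the usual
# debugfs location of the dynamic debug control file.
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

enable_module_dyndbg() {
    # +p enables the messages; l, t, f add line number, thread id, function.
    echo "module $1 +pltf" > "$DYNDBG_CONTROL"
}
```

Running `enable_module_dyndbg mei` followed by `enable_module_dyndbg mei_me` would mirror the two options lines above for the current boot.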
Comment 61 Vincent Petry 2014-12-03 17:10:50 UTC
Created attachment 159581 [details]
mei dmesg debug message from real s2ram

I've attached mei-debug.txt which contains the dmesg journal from before the suspend and a few lines after it (which unfortunately contain nothing about mei).

And here the result from sys before the suspend:

  % cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_status
suspended

  % cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_enabled
enabled

After booting the computer the states are the same.

So still no clue why this only hangs after ~3 hours of laptop usage.

I hope you'll find something :-)
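
To make it easier to capture both values before every suspend, the two reads can be wrapped in one helper. This is just a convenience sketch; `runtime_pm_state` and the `PCI_SYSFS` override are hypothetical names, and the default path is the one used in the commands above.

```shell
#!/bin/sh
# Sketch: print runtime_status and runtime_enabled for one PCI device.
# PCI_SYSFS is a hypothetical override for the sysfs PCI root.
PCI_SYSFS="${PCI_SYSFS:-/sys/devices/pci0000:00}"

runtime_pm_state() {
    dev="$PCI_SYSFS/$1/power"
    printf '%s %s\n' "$(cat "$dev/runtime_status")" "$(cat "$dev/runtime_enabled")"
}
```

For the MEI device this would be called as `runtime_pm_state 0000:00:16.0`.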
Comment 62 Vincent Petry 2014-12-03 17:13:57 UTC
Created attachment 159591 [details]
mei dmesg debug message from successful s2ram

And here for reference the dmesg of a SUCCESSFUL suspend (from when laptop usage time is just a few minutes). So here you can see what is supposed to happen but doesn't happen in the other case.
Comment 63 Vincent Petry 2014-12-03 17:21:26 UTC
Just to avoid confusion:

- the fail log is here: https://bugzilla.kernel.org/attachment.cgi?id=159581

There is actually a successful suspend in the middle of that log, it was from a few hours before when the threshold time wasn't reached yet. The actual failing suspend is obviously from the bottom of that log when it hung.

- the success log is here (suspending a few minutes after boot): https://bugzilla.kernel.org/attachment.cgi?id=159591

Hope it helps.
Comment 64 Gabriele Mazzotta 2015-04-21 21:01:09 UTC
Created attachment 174711 [details]
Disable mei on XPS13 9333

Hi,

would preventing the module from loading automatically on the XPS13 9333 be an acceptable (and temporary) solution?

I confirm that either mei or mei_me is causing issues (I don't know exactly which one, I haven't done any tests). Ever since I blacklisted them, I haven't had a single failed suspend.

Since a few months have passed since this bug was reported, wouldn't it be wise to simply disable it?

Something like the patch here attached should work.

Gabriele
Comment 65 Tomas Winkler 2015-04-21 21:21:19 UTC
Sorry, this totally fell off my radar. I'm looking into the logs.
Tomas
Comment 66 Tomas Winkler 2015-04-28 08:36:10 UTC
Looks like we are hitting an unexpected interrupt during the suspend flow, which we interpret as a request for reset due to the register setting.
Need to find out why we are hitting an interrupt here.

Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_stop:289: mei_me 0000:00:16.0: stopping the device.
Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_reset:119: mei_me 0000:00:16.0: remove iamthif and wd from the file list.
Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_reset:136: mei_me 0000:00:16.0: powering down: end of reset
Dec 03 14:30:02 vvortex.ttv kernel: [476] mei_me_irq_thread_handler:643: mei_me 0000:00:16.0: function called after ISR to handle the interrupt processing.
Dec 03 14:30:02 vvortex.ttv kernel: mei_me 0000:00:16.0: FW not ready: resetting.
Dec 03 14:30:02 vvortex.ttv kernel: [476] mei_me_irq_thread_handler:708: mei_me 0000:00:16.0: interrupt thread end ret = 0
Dec 03 14:30:02 vvortex.ttv kernel: [175] mei_reset:119: mei_me 0000:00:16.0: remove iamthif and wd from the file list.
Dec 03 14:30:02 vvortex.ttv kernel: PM: suspend of devices complete after 594.944 msecs
Dec 03 14:30:02 vvortex.ttv kernel: PM: late suspend of devices complete after 18.910 msecs
Dec 03 14:30:02 vvortex.ttv kernel: ehci-pci 0000:00:1d.0: System wakeup enabled by ACPI
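
To check whether another log shows the same pattern, one could search for a "FW not ready: resetting" message that arrives after "powering down: end of reset" but before the PM core reports that device suspend has completed. The helper below is a rough heuristic sketch, not a definitive test: the function name mei_find_late_reset is made up, and it relies on the message texts exactly as they appear in the excerpt above with mei dynamic debug enabled.

```shell
# Sketch: print "FW not ready" lines that show up after the driver has
# already finished powering down, within the same suspend attempt.
# Exit status is 0 if at least one such line was found, 1 otherwise.
mei_find_late_reset() {
    awk '
        /powering down: end of reset/     { armed = 1; next }
        /PM: suspend of devices complete/ { armed = 0 }
        armed && /FW not ready: resetting/ { print; found = 1 }
        END { exit found ? 0 : 1 }
    ' "$1"
}

# Usage, with the dmesg output saved to a file:
#   mei_find_late_reset mei-debug.txt
```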
Comment 67 Tomas Winkler 2015-05-04 12:00:57 UTC
Currently I suspect this is a FW/BIOS issue, and I'm checking the bug database to see whether it has perhaps been fixed already. I believe that the FW version is pretty old on that machine.
Comment 68 Zachary Warren 2015-05-04 13:44:23 UTC
Thinkpad yoga has a similar problem:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770397
Comment 69 Tomas Winkler 2015-06-09 14:59:09 UTC
Created attachment 179251 [details]
suspend issue possible fix
Comment 70 Tomas Winkler 2015-06-09 15:00:02 UTC
I'll be glad if someone can check if the attached patch fixes the issue.
Thanks
Comment 71 Gabriele Mazzotta 2015-06-10 22:08:43 UTC
I've been able to successfully suspend my laptop with no issues twice already.
The first time was after it had been up for 10 hours, the second time after 4 hours (no cold boot in between). So it seems to work, but it's too soon to be sure.

I'll report any issue that may arise in the next days.
Comment 72 Tomas Winkler 2015-06-11 21:24:19 UTC
May I add you to the Tested-by: ?
Comment 73 Gabriele Mazzotta 2015-06-12 18:24:39 UTC
Sure.

Just for record, everything is still working fine.
Comment 74 Hongjoo Lee 2015-06-20 09:28:00 UTC
Not experiencing the problem on 3.16.0-41 since I blacklisted the "mei" family of modules. Thanks to those who have spent a lot of time on testing, especially Vincent Petry.
Comment 75 Tomas Winkler 2015-07-05 11:29:23 UTC
commit 3dc196eae1db548f05e53e5875ff87b8ff79f249
Author: Alexander Usyskin <alexander.usyskin@intel.com>
Date:   Sat Jun 13 08:51:17 2015 +0300

    mei: me: wait for power gating exit confirmation

    Fix the hbm power gating state machine so it will wait till it receives
    confirmation interrupt for the PG_ISOLATION_EXIT message.

    In process of the suspend flow the devices first have to exit from the
    power gating state (runtime pm resume).
    If we do not handle the confirmation interrupt after sending
    PG_ISOLATION_EXIT message, we may receive it already after the suspend
    flow has changed the device state and interrupt will be interpreted as a
    spurious event, consequently link reset will be invoked which will
    prevent the device from completing the suspend flow

    kernel: [6603] mei_reset:136: mei_me 0000:00:16.0: powering down: end of reset
    kernel: [476] mei_me_irq_thread_handler:643: mei_me 0000:00:16.0: function called after ISR to handle the interrupt processing.
    kernel: mei_me 0000:00:16.0: FW not ready: resetting

    Cc: <stable@vger.kernel.org> #3.18+
    Cc: Gabriele Mazzotta <gabriele.mzt@gmail.com>
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=86241
    Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770397
    Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
    Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
    Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Comment 76 Tomas Winkler 2015-07-05 11:40:19 UTC
Can we close this issue? I don't have permission to change the status of the bug
Comment 77 Vincent Petry 2016-07-19 18:47:01 UTC
I have been getting similar symptoms again with Kernel 4.6.3-1-default.

Will need more time to gather information.
Comment 78 Vincent Petry 2016-07-19 18:48:06 UTC
My previous kernel before the update that made this happen again was 4.6.0-1, all on openSUSE Tumbleweed. Would be good to know if anyone else with a Dell XPS 13 is seeing similar symptoms with 4.6.3.
Comment 79 Roberto 2016-07-20 07:28:59 UTC
Dear All,

   the solution has been posted many times in many places. You have to blacklist the modules mei and mei_me, simply like this:

sudo tee /etc/modprobe.d/blacklist.suspend-bug.conf > /dev/null << EOF
blacklist mei
blacklist mei_me
EOF
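
After a reboot it is worth confirming that the modules really stayed out, since a blacklist entry only stops automatic loading by alias and does not prevent a module from being loaded explicitly or as a dependency. A minimal sketch of such a check follows; the function name mei_loaded_modules is made up, and it simply reads /proc/modules.

```shell
# Sketch: list loaded modules named "mei" or starting with "mei_".
# Prints nothing if the blacklist took effect.
mei_loaded_modules() {
    modules_file="${1:-/proc/modules}"
    awk '$1 == "mei" || $1 ~ /^mei_/ { print $1 }' "$modules_file"
}

# Usage:
#   mei_loaded_modules
```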
Comment 80 Gabriele Mazzotta 2016-07-21 17:52:04 UTC
Vincent, I haven't been using 4.6.3 that much, but I've tried to suspend my laptop running 4.6.3 after more than 5 hours of uptime and it didn't hang.
Comment 81 mattia.b89 2019-02-24 14:20:25 UTC
Here (XPS 9343 i5-5200U w/ kernel 4.20.12) it still happens sometimes (about one in every ten suspends):
I suspend the laptop, then I try to resume it but:
- black screen
- keyboard unresponsive
- Ethernet LED lights up

Now, I have blacklisted the two `mei` modules in order to see if they are really the problem!
Comment 82 mattia.b89 2019-03-02 20:54:52 UTC
UPDATE: after one week of daily use of the laptop (one suspend and resume per day), the issue has shown up again, EVEN THOUGH I blacklisted the `mei` and `mei_me` modules.

Please reopen the bug.