Bug 102091 - Asus UX32LN ultrabook freeze on suspend when mei_me runtime PM turned on
Summary: Asus UX32LN ultrabook freeze on suspend when mei_me runtime PM turned on
Status: ASSIGNED
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 high
Assignee: Tomas Winkler
URL:
Keywords:
Duplicates: 105301
Depends on:
Blocks:
 
Reported: 2015-07-28 14:20 UTC by Petter Krossbakken
Modified: 2018-11-22 20:07 UTC
CC List: 16 users

See Also:
Kernel Version: >3.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output when pci device 00:00.0. disappeared and i915 errors occured. (at 8547) (14.64 KB, text/plain)
2016-01-12 17:31 UTC, Manuel D.
get MEI fw version (1.20 KB, application/octet-stream)
2016-02-29 07:48 UTC, Tomas Winkler
mei_me debug from syslog (36.33 KB, text/plain)
2016-03-13 19:56 UTC, Manuel D.
inxi device log (2.88 KB, text/plain)
2017-09-20 22:23 UTC, Alexey Kharlamov
ooups picture (179.69 KB, image/jpeg)
2017-09-20 22:23 UTC, Alexey Kharlamov
dmidecode output (4.95 KB, text/plain)
2017-10-03 11:31 UTC, Alexey Kharlamov
dmesg after first suspend (56.51 KB, text/plain)
2017-10-03 11:34 UTC, Alexey Kharlamov

Description Petter Krossbakken 2015-07-28 14:20:13 UTC
Running Arch Linux x64 on an Asus ultrabook with kernels later than the current LTS kernel (3.14) causes the laptop to randomly freeze when suspending. This happens about 60-70% of the times the laptop is put to sleep: the screen turns black, the fans run at 100%, and a hard shutdown is required.

This also happened from time to time on my older Asus UX51Vz. Both machines have the latest BIOS revisions from Asus.
Comment 1 Petter Krossbakken 2015-07-28 14:23:17 UTC
HW specs: Intel Core i7-4510U, 8GB RAM, Intel HD graphics with NVIDIA Optimus(840M), Intel 7260 AC Wifi/Bluetooth.
Comment 2 Aaron Lu 2015-08-05 07:00:11 UTC
Please follow this document to do some debugging:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Also, this looks like a regression. It's not clear from your description so I have to ask: is it that the problem starts to occur from v3.15? What about the latest upstream kernel, v4.1?
Comment 3 Petter Krossbakken 2015-08-08 11:42:33 UTC
This seems to depend on whether I close the lid fast or slowly.

On kernels later than 3.14, closing the lid slowly makes the computer freeze on suspend (one can tell because the power LED is not blinking, the screen goes black and the fans race).

On kernel 3.14, closing the lid slowly causes no problem and the computer suspends.

This happens on kernels later than 3.14, up to and including 4.0.9.

I have not tried kernels later than 4.0.9.

I am aware that this is not an easy problem for you to investigate; however, I thought it better to file a bug in case others experience similar problems.
Comment 4 Aaron Lu 2015-08-21 07:25:05 UTC
What if you issue the following command to enter suspend?
# echo mem > /sys/power/state
Comment 5 Aaron Lu 2015-09-01 02:54:55 UTC
ping
Comment 6 Petter Krossbakken 2015-09-01 07:47:50 UTC
This works 10/10 times with the following kernel:

echo mem > /sys/power/state

uname -r

4.1.6.1-ck
Comment 7 Aaron Lu 2015-09-01 08:30:27 UTC
I don't have any idea why a "slow" lid close would cause this problem. Perhaps you can do a git bisect between the good (v3.14) and bad (v3.15) kernels to find the offending commit?
Comment 8 Petter Krossbakken 2015-09-01 10:24:10 UTC
I will look into it. I have also read about others having the same issue on this particular laptop. The UX32LN and UX303LN are 99% the same.

https://bbs.archlinux.org/viewtopic.php?id=194413
Comment 9 Maxime Martineau 2015-09-17 08:21:47 UTC
To be more specific, this bug occurs either after prolonged use or after a large number of suspends.

Generally speaking, the probability of this bug happening increases over time from the moment you boot the computer.

Also, on a UX303LN, I was able to prevent the bug from occurring by disabling a PM option related to PCI (Runtime Power Management for PCI).

If you use TLP, the corresponding option is RUNTIME_PM_ON_BAT.
Comment 10 Aaron Lu 2015-09-17 09:11:52 UTC
Hi Maxime,

Just to make sure, do you also have the suspend problem caused by slow move of the LID?
Comment 11 Maxime Martineau 2015-09-17 09:16:30 UTC
Hi Aaron,

I'll check this weekend.
Also, I am on 4.1.4-1-ARCH but I'm not sure I can reproduce the bug on this kernel.
Comment 12 Maxime Martineau 2015-09-21 23:14:07 UTC
Hi Aaron,

Still the same problem on 4.1.4 when RUNTIME_PM_ON_BAT=auto.

But I couldn't trigger the bug by closing the lid slowly.
Comment 13 Aaron Lu 2015-09-28 02:59:00 UTC
Hi Maxime,

Then you have a different problem, please file a new bug about yours.
Comment 14 J. Alexander 2015-11-05 13:42:06 UTC
Just want to chime in here. I have an Asus UX303LN with the same specs as Petter K. and I *had* problems suspending; it was, however, not related to the speed with which I closed the lid. I had this problem on kernels 3.13, 3.16 and 4.1.0 and there was no pattern as to when it would occur.

What solved it for me was uninstalling tlp. After removing tlp I have had dozens of successful suspend/resume cycles and I can again achieve weeks of uptime on my laptop without problems. To confirm that some power-saving setting is the cause of the problem, I used powertop to enable all power saving features and boom! suspend failed again. Checking each power saving feature individually will hopefully identify the culprit. I don't have time for that kind of testing right now, but will try to get around to it later.
Comment 15 Maxime Martineau 2015-11-11 19:29:33 UTC
Hi J. Alexander,

Here are a few tickets about my case and probably yours :

https://bugzilla.kernel.org/show_bug.cgi?id=105301

https://github.com/linrunner/TLP/issues/162
Comment 16 Aaron Lu 2015-11-12 04:49:39 UTC
Thanks for the info Maxime.

Meanwhile, it doesn't seem we can do anything about the slow lid close problem, unless someone is willing to do the git bisect(according to comment #3, this only occurs on kernels > v3.14), so I'll close it for now. If anyone has some new findings, please re-open it.
Comment 17 Petter Krossbakken 2015-11-12 09:11:46 UTC
Hi Aaron.

It seems sometimes it doesn't matter whether the lid is closed slowly or fast anymore. Running the latest kernel for Arch.

Still freeze and fans rev up.

This happens regardless of AC/BAT and not only when closing the lid slow.
Sorry for the mistake.
Comment 18 Aaron Lu 2015-11-13 01:39:15 UTC
(In reply to Petter K. from comment #17)
> Hi Aaron.
> 
> It seems sometimes it doesn't matter whether the lid is closed slowly or fast
> anymore. Running the latest kernel for Arch.
> 
> Still freeze and fans rev up.

Does the suspend always fail or sometimes?

> 
> This happens regardless of AC/BAT and not only when closing the lid slow.
> Sorry for the mistake.

Do you have TLP enabled? If yes, please disable or remove it. This bug here doesn't deal with TLP related issues.
Comment 19 Petter Krossbakken 2015-11-13 12:25:51 UTC
I have TLP enabled, but TLP is disabled in AC mode (at least the problematic PCI power management and runtime_pm settings), and the bug still occurs SOMETIMES. I can try to uninstall TLP entirely, but this also happened before I switched to TLP from laptop-mode-tools.
Comment 20 Aaron Lu 2015-11-16 02:14:54 UTC
If you do not use TLP and laptop-mode-tools, does the problem still occur?
Comment 21 Petter Krossbakken 2015-11-16 12:05:11 UTC
(In reply to Aaron Lu from comment #20)
> If you do not use TLP and laptop-mode-tools, does the problem still occur?

Sometimes, yes. TLP is disabled in AC mode (at least most of the options).

I can try to disable it completely.
Comment 22 Andrzej 2015-11-22 07:31:27 UTC
Just wanted to say that I have exactly the same problem on a UX301LA. Freezes happen mostly after a period of heavy load; usually everything works great for a while after a restart. When the problem occurs, the power LED starts blinking as it does for a working suspend, but the fan keeps running. Pressing a key then makes the power LED stop blinking, but the laptop does not wake up.
Comment 23 Manuel D. 2015-12-11 21:20:26 UTC
Hello,

I'm also having the described problem; Andrzej's description couldn't be better.
The system freezes after a longer usage time and some standby/resume cycles. Sometimes even on the first try.

Running Ubuntu 15.10:

Linux zen 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

and latest BIOS (211) on UX301LA.
Comment 24 Andrzej 2015-12-12 00:16:11 UTC
I experimented with disabling Runtime Power Management and it seems to make a huge difference. Previously I used Laptop Mode Tools or TLP and this problem occurred. Since I disabled runtime PM for PCI devices (in TLP: RUNTIME_PM_ON_BAT=on), I didn't have a single freeze. It remains to be seen which device caused the problem.
Comment 25 Manuel D. 2015-12-13 23:07:36 UTC
Thanks for that. I also didn't experience a freeze since disabling PM for PCI devices. 

Wasn't using TLP or laptop-mode-tools but mis-tuned these settings manually via powertop...
Comment 26 Aaron Lu 2015-12-16 02:45:58 UTC
Please find out which PCI device's runtime PM caused this issue, thanks.
Comment 27 Andrzej 2015-12-20 16:41:07 UTC
So far I know that after enabling PM for these three devices, the problem returned:
00:1b.0 Audio device: Intel Corporation 8 Series HD Audio Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)
02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)

Will investigate further.
Comment 28 BorisH 2015-12-25 21:56:42 UTC
I only have PM for 

02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 6b)

disabled and the problem has not occurred on my UX301LA for a few days.
Comment 29 Andrzej 2015-12-26 17:39:54 UTC
I did exactly the same and can also confirm no freezes over the last couple of days.
Comment 30 Aaron Lu 2015-12-28 01:53:22 UTC
Then I'll move the bug to Drivers/Wireless.
Comment 31 Aaron Lu 2015-12-28 01:54:17 UTC
BTW, which driver module is it for that network wireless controller?
lspci -v should tell you that:
Kernel driver in use: XXX
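
For reference, the bound driver is also visible in sysfs as a "driver" symlink under each PCI device directory. A minimal shell sketch, assuming that layout; the helper name pci_driver is made up for illustration and is not part of any tool mentioned in this report:

```shell
# pci_driver DEVICES_DIR ADDR
# Print the name of the driver bound to a PCI device, or "(none)".
# DEVICES_DIR is normally /sys/bus/pci/devices; it is a parameter so the
# helper can be tried against a scratch directory first.
pci_driver() {
    link="$1/$2/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/iwlwifi
        basename "$(readlink "$link")"
    else
        echo "(none)"
    fi
}
```

On a real system this would be called as: pci_driver /sys/bus/pci/devices 0000:02:00.0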
Comment 32 BorisH 2015-12-28 03:06:17 UTC
it is iwlwifi.
Comment 33 Aaron Lu 2015-12-28 05:18:46 UTC
Add Intel Linux Wireless <linuxwifi@intel.com> (supporter:INTEL WIRELESS WIFI LINK (iwlwifi))
Comment 34 Emmanuel Grumbach 2015-12-28 06:42:31 UTC
Hi :)

I maintain iwlwifi.
I am not exactly an expert in Runtime PM, but iwlwifi doesn't support it (yet):

static SIMPLE_DEV_PM_OPS(iwl_dev_pm_ops, iwl_pci_suspend, iwl_pci_resume);

#define IWL_PM_OPS      (&iwl_dev_pm_ops)

#else

#define IWL_PM_OPS      NULL

#endif

static struct pci_driver iwl_pci_driver = {
        .name = DRV_NAME,
        .id_table = iwl_hw_card_ids,
        .probe = iwl_pci_probe,
        .remove = iwl_pci_remove,
        .driver.pm = IWL_PM_OPS,
};

So I wonder how iwlwifi could be the culprit?
Has someone tried to get logs through netconsole or maybe that can be reproduced in a VM?
Comment 35 Aaron Lu 2015-12-28 07:03:18 UTC
It sounds like an interoperation problem between runtime PM and system PM. The system suspend of iwlwifi is OK if runtime PM is not turned on for the network controller, but something goes wrong otherwise.
Comment 36 Emmanuel Grumbach 2015-12-28 07:10:18 UTC
I see. We are now working on runtime PM enablement for iwlwifi and things seem to work.
I am not sure how I can help here unfortunately...
Comment 37 Aaron Lu 2015-12-28 07:33:50 UTC
If it is a driver related issue, read the below document:
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Then debugging at the devices and platform levels should trigger some error:
1 boot system normally, do not use TLP, but enable runtime PM for the wireless network controller(I assume this setup should cause system suspend freeze?);
2 do the test like this as root:
# cd /sys/power
# echo devices > pm_test
# echo mem > state
Wait till the system is back by itself, check dmesg to see if there is anything wrong, like XXX device fail to suspend, etc.
You may need to do this multiple times to trigger some error.
If still no problem, continue to the next level "platform":
# echo platform > pm_test
# echo mem > state
and redo the check.
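
Since the test may need many repetitions, the steps above can be looped. A rough shell sketch, assuming a kernel built with CONFIG_PM_DEBUG (so that pm_test exists) and a root shell; the function name run_pm_test is hypothetical:

```shell
# run_pm_test POWER_DIR LEVEL COUNT
# Select a pm_test level (e.g. devices, platform, processors, core), request
# suspend COUNT times, restore pm_test to "none", and print the number of
# suspend requests that were written successfully.
# POWER_DIR is normally /sys/power.
run_pm_test() {
    power_dir=$1
    level=$2
    count=$3
    echo "$level" > "$power_dir/pm_test" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # In test mode the kernel resumes by itself after a short delay,
        # so this write returns once the test cycle has finished.
        echo mem > "$power_dir/state" || break
        i=$((i + 1))
    done
    echo none > "$power_dir/pm_test"
    echo "$i"
}
```

For example, run_pm_test /sys/power devices 20, followed by a look at dmesg for devices that failed to suspend or resume.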
Comment 38 Manuel D. 2015-12-28 21:00:24 UTC
Hello,

I experienced two further freezes after enabling PM for all PCI devices except the wifi card:

 echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.2/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:02.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:04.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:14.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:16.0/power/control'; 
# echo 'auto' > '/sys/bus/pci/devices/0000:02:00.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.3/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:1f.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:1c.3/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:1c.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:03.0/power/control'; 
 echo 'auto' > '/sys/bus/pci/devices/0000:00:1b.0/power/control'; 

It sometimes takes up to five days for the bug to occur. In my experience it disappeared after changing any of the PM settings, but after some days of uptime the system spontaneously froze again. :/

It will be hard to figure out which device or combination is causing this. Still trying. So far I have definitely had the best results without PM at all.
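
The per-device echo lines above can be generalized when testing different exclusions. A minimal sketch, assuming the standard power/control attribute under each PCI device directory; the function name set_runtime_pm is hypothetical, and excluded devices are pinned to "on" (runtime PM disabled) rather than left untouched:

```shell
# set_runtime_pm DEVICES_DIR MODE [EXCLUDED_ADDR...]
# Write MODE ("auto" or "on") to power/control of every PCI device under
# DEVICES_DIR (normally /sys/bus/pci/devices), except for the listed
# addresses, which are set to "on".
set_runtime_pm() {
    devices_dir=$1
    mode=$2
    shift 2
    for ctl in "$devices_dir"/*/power/control; do
        [ -e "$ctl" ] || continue       # glob matched nothing
        addr=${ctl%/power/control}      # strip the attribute path
        addr=${addr##*/}                # keep only the device address
        value=$mode
        for excluded in "$@"; do
            if [ "$addr" = "$excluded" ]; then
                value=on
            fi
        done
        echo "$value" > "$ctl"
    done
}
```

As root, set_runtime_pm /sys/bus/pci/devices auto 0000:02:00.0 would correspond to the configuration listed in this comment.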
Comment 39 Emmanuel Grumbach 2015-12-30 06:51:11 UTC
So I understand that this is not related to WiFi, right?
Aaron, can you please take the bug back to the original component?
Comment 40 BorisH 2015-12-30 14:09:57 UTC
Sorry, I noticed that I have PM disabled for the wifi adapter and for

00:00.0 Host bridge: Intel Corporation Haswell-ULT DRAM Controller (rev 09)
Kernel driver in use: hsw_uncore

I will enable PM for the wifi device and tell you in a few days if the crashes come back.
Comment 41 Manuel D. 2015-12-31 11:13:42 UTC
By following Aaron's PM debugging description I could not find any strange messages in dmesg or even provoke a freeze. But as I already said, I think we're all testing far too little.
Sometimes the freeze happens after the 50th suspend cycle or so. Randomly...

But after the "platform" tests, the whole keyboard backlight keeps blinking in sync with the power LED, which of course should have stopped blinking after wakeup; instead it takes the whole keyboard along to blink with it ;)

Had PM enabled for all devices. 

That stops after pressing any key. Somewhat strange.

The wifi card also seems to have issues after suspend (sometimes), as you have to restart network-manager to properly reconnect. But I'm not sure whether this is an Ubuntu/network-manager specific bug.

Here are some log entries from this morning, when I was woken up because "the Internet isn't working...".

Dec 31 07:14:09 zen NetworkManager[1266]: <info>  wake requested (sleeping: yes  enabled: yes)
Dec 31 07:14:09 zen systemd[1]: Started Run anacron jobs at resume.
Dec 31 07:14:09 zen NetworkManager[1266]: <info>  waking up...
Dec 31 07:14:09 zen NetworkManager[1266]: <info>  (wlan0): device state change: unmanaged -> unavailable (reason 'managed') [10 20 2]

Dec 31 07:14:10 zen NetworkManager[1266]: <info>  NetworkManager state is now DISCONNECTED
Dec 31 07:14:10 zen wpa_supplicant[1406]: dbus: wpa_dbus_get_object_properties: failed to get object properties: (none) none
Dec 31 07:14:10 zen wpa_supplicant[1406]: dbus: Failed to construct signal
Dec 31 07:14:10 zen wpa_supplicant[1406]: Could not read interface p2p-dev-wlan0 flags: No such device
Dec 31 07:14:10 zen NetworkManager[1266]: <info>  (wlan0): supplicant interface state: starting -> ready
Dec 31 07:14:10 zen NetworkManager[1266]: <info>  (wlan0): device state change: unavailable -> disconnected (reason 'supplicant-available') [20 30 42]
Dec 31 07:14:10 zen NetworkManager[1266]: <info>  Device 'wlan0' has no connection; scheduling activate_check in 0 seconds.

Restarting network-manager helps most times; a second suspend doesn't. Sometimes only a reboot helps.

Dec 31 10:01:50 zen systemd[1]: Stopping Network Manager...
Dec 31 10:01:50 zen NetworkManager[1266]: <info>  caught SIGTERM, shutting down normally.
Dec 31 10:01:50 zen NetworkManager[1266]: <info>  (wlan0): device state change: disconnected -> unmanaged (reason 'unmanaged') [30 10 3]
Dec 31 10:01:50 zen NetworkManager[1266]: <info>  exiting (success)
Dec 31 10:01:50 zen wpa_supplicant[1406]: nl80211: deinit ifname=p2p-dev-wlan0 disabled_11b_rates=0
Dec 31 10:01:50 zen systemd[1]: Stopped Network Manager.
Dec 31 10:01:50 zen systemd[1]: Starting Network Manager...
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  NetworkManager (version 1.0.4) is starting...
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  Read config: /etc/NetworkManager/NetworkManager.conf
Dec 31 10:01:50 zen systemd[1]: Started Network Manager.
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  VPN: loaded org.freedesktop.NetworkManager.pptp
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  DNS: loaded plugin dnsmasq
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  init!
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  update_system_hostname
Dec 31 10:01:50 zen NetworkManager[5509]: <info>        interface-parser: parsing file /etc/network/interfaces
Dec 31 10:01:50 zen NetworkManager[5509]: <info>        interface-parser: finished parsing file /etc/network/interfaces
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  management mode: unmanaged
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  devices added (path: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/net/wlan0, iface: wlan0)
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  device added (path: /sys/devices/pci0000:00/0000:00:1c.3/0000:02:00.0/net/wlan0, iface: wlan0): no ifupdown configuration found.
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  devices added (path: /sys/devices/virtual/net/lo, iface: lo)
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  device added (path: /sys/devices/virtual/net/lo, iface: lo): no ifupdown configuration found.
Dec 31 10:01:50 zen NetworkManager[5509]: <info>  end _init.
Comment 42 Manuel D. 2015-12-31 11:25:13 UTC
Before the problem with the wifi device/network-manager occurs, I found the following:

Dec 31 01:42:25 zen systemd-udevd[4642]: Process '/sbin/crda' failed with exit code 249.

That happened during suspend.
Comment 43 Manuel D. 2016-01-02 15:28:09 UTC
Just tried to enable netconsole to see what happens when it occurs. Unfortunately I can't get it configured:

[  143.101399] netpoll: netconsole: wlan0 doesn't support polling, aborting

The USB-LAN adapter I have seems to have the same problem:

[  595.435335] netpoll: netconsole: enx0023574c2122 doesn't support polling, aborting

So no luck on that. 

Tailing log files via ssh doesn't work either, as the messages from the suspend process get sent after wake-up, which probably won't work in case of a freeze. I'll still try; maybe the wifi connection still comes up.

Regarding my experience with PM debugging: the system was stable for the two days I left "platform" in "pm_test". After changing back to "none", it froze after the first cycle. Again, I had PM enabled for all PCI devices.
Comment 44 Aaron Lu 2016-01-04 01:56:07 UTC
If platform test mode can not trigger problem, you may try further: processors and core
Comment 45 Petter Krossbakken 2016-01-04 12:53:14 UTC
I can verify that disabling runtime_pm through TLP on a UX32LN with exact same specs as UX303LN has solved the issue with the computer being frozen at suspend, so something has to do with it (TLP).

This is with iwlwifi too, but disabled RPM on all devices.
Comment 46 Manuel D. 2016-01-09 12:57:03 UTC
BorisH mentioned that device 

00:00.0 Host bridge: Intel Corporation Haswell-ULT DRAM Controller (rev 09)

When I read his post, I also checked the PM settings for it, but no such device was found on my UX301LA. Today it's here.

So it seems that the device sometimes disappears (at least from the lspci listing).

I just had a crash after rebooting to see if the device reappears after a reboot (answer: yes). Before the last suspend, the device still existed.

Before that reboot, the device was missing and I had been testing with PM enabled on all remaining PCI devices, unable to produce a freeze for 2 days.

It seems difficult to get into a state where it is missing (without a freeze). It takes many (good) suspend cycles.
Hopefully I'll be there again soon to grab some logs or so.
Comment 47 BorisH 2016-01-09 14:39:12 UTC
It's the same for me too. lspci does not always show the device; that's why I didn't mention it in my first post.
Comment 48 Manuel D. 2016-01-12 17:29:50 UTC
Okay. I wrote a little script to notify me when that device goes away, and just had some success. But I still have no idea what happens to PCI device 00:00.0 or what is causing the freezes.

The device went away after the UX301LA had been suspended for about 9h, immediately after waking it up.

I found some messages in dmesg concerning the graphics card. I've attached the output including two suspend cycles. Strange things happen from there on:

[ 8547.160634] i915 0000:00:02.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment

Anyway, it survived the wakeup, even if PM was enabled for all devices again.
I have no clue if this error has anything to do with the freezes.

Also, I just noticed the following message appearing sometimes during wakeup:

[10142.635421] xhci_hcd 0000:00:14.0: port 6 resume PLC timeout

There are no connected USB devices.
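
For anyone who wants to watch for the same symptom, a watcher along these lines might do. This is only a sketch and not the script mentioned above; it assumes that a device missing from lspci is also missing from /sys/bus/pci/devices, and the function names are hypothetical:

```shell
# pci_device_present DEVICES_DIR ADDR
# Succeeds if the device directory exists (DEVICES_DIR is normally
# /sys/bus/pci/devices).
pci_device_present() {
    [ -d "$1/$2" ]
}

# watch_pci_device DEVICES_DIR ADDR INTERVAL_SECONDS
# Poll until the device disappears, then print a timestamped notice.
watch_pci_device() {
    root=$1
    addr=$2
    interval=$3
    while pci_device_present "$root" "$addr"; do
        sleep "$interval"
    done
    echo "$(date) PCI device $addr is gone"
}
```

For example, watch_pci_device /sys/bus/pci/devices 0000:00:00.0 10 could be left running in a terminal, or piped into a notification tool of choice.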
Comment 49 Manuel D. 2016-01-12 17:31:17 UTC
Created attachment 199451 [details]
dmesg output when pci device 00:00.0. disappeared and i915 errors occured. (at 8547)
Comment 50 BorisH 2016-02-22 21:25:24 UTC
Could someone test if the crashes stop when adding the device:

00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)

At least with these devices on the list the crashes did still occur:

00:00.0 00:1f.3 00:1f.2 00:1f.0 00:1c.3 00:1c.0 00:1b.0

After adding the Communication controller to this list I haven't had a crash for a few days.
Comment 51 BorisH 2016-02-22 21:27:55 UTC
(In reply to BorisH from comment #50)
> Could someone test if the crashes stop when adding the device:
> 
> 00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)
> 
> At least with these devices on the list the crashes did still occur:
> 
> 00:00.0 00:1f.3 00:1f.2 00:1f.0 00:1c.3 00:1c.0 00:1b.0
> 
> After adding the Communication controller to this list i hadn't had a crash
> since a few days.

By adding I meant putting it on the exclusion list of tlp.
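
A sketch of what such an exclusion might look like in /etc/default/tlp, assuming a TLP version that supports the RUNTIME_PM_BLACKLIST setting and that 00:16.0 is the address of the HECI controller as in the lspci output quoted above (addresses may differ on other machines):

```shell
# /etc/default/tlp (fragment)
# Keep runtime PM enabled on battery for PCI devices in general ...
RUNTIME_PM_ON_BAT=auto
# ... but exclude the HECI/mei_me controller (address as reported by lspci).
RUNTIME_PM_BLACKLIST="00:16.0"
```

TLP has to be restarted (or the machine rebooted) for the change to take effect.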
Comment 52 Petter Krossbakken 2016-02-23 11:11:12 UTC
So far I have had no crashes after adding the "mei_me" device to the exclusion list in /etc/default/tlp on my UX32LN.

Normally my PC would have had at least a couple of crashes by now because of frequent suspends.
Comment 53 Aaron Lu 2016-02-24 02:50:29 UTC
Adding maintainer of this driver: Tomas Winkler

Hi Tomas,

People are suffering problems when enabling runtime PM for PCI device mei_me.
Comment 54 Aaron Lu 2016-02-24 02:57:29 UTC
*** Bug 105301 has been marked as a duplicate of this bug. ***
Comment 55 Tomas Winkler 2016-02-24 07:33:24 UTC
I'm looking into this; so far I see this is only relevant to the Haswell generation on kernels 3.14-4.2.0.
Has anybody tried this on the latest kernel?
Comment 56 Petter Krossbakken 2016-02-24 08:10:14 UTC
I am running 4.3.6 and have blacklisted the runtime_pm for the device because of freezes.
Comment 57 BorisH 2016-02-24 10:08:05 UTC
Without blacklisting the freezes also occur on kernel 4.4.1-2.
Comment 58 Manuel D. 2016-02-26 10:16:25 UTC
I also have not had a single freeze on my UX301LA since I re-enabled PM for all PCI devices except the "mei_me" one:
00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)
Comment 59 Tomas Winkler 2016-02-29 07:48:22 UTC
Created attachment 206401 [details]
get MEI fw version
Comment 60 Tomas Winkler 2016-02-29 07:50:35 UTC
I couldn't reproduce the issue on my systems so I would like to get the MEI firmware version of this particular platform. The  version is either available from the BIOS menu or one can use a simple script getver-sa.py
Comment 61 Petter Krossbakken 2016-02-29 09:24:53 UTC
ME Code Firmware Version: 9.5.20.1742
ME NFTP Firmware Version: 9.5.20.1742
ME FITC Firmware Version: 9.5.15.1730

on a Asus UX32LN (identical to the UX303LN but with lower screen resolution).
Comment 62 Manuel D. 2016-02-29 13:08:08 UTC
From my UX301LA:

ME Code Firmware Version: 9.5.30.1808
ME NFTP Firmware Version: 9.5.30.1808
ME FITC Firmware Version: 9.5.15.1730
Comment 63 Andrzej 2016-03-11 20:05:52 UTC
From my UX301L:

ME Code Firmware Version: 9.5.3.1520
ME NFTP Firmware Version: 9.5.3.1520
ME FITC Firmware Version: 9.5.3.1520
Comment 64 Maxime Martineau 2016-03-13 10:10:54 UTC
From my UX303LN :

ME Code Firmware Version: 9.5.20.1742
ME NFTP Firmware Version: 9.5.20.1742
ME FITC Firmware Version: 9.5.15.1730
Comment 65 Tomas Winkler 2016-03-13 11:21:25 UTC
Thanks for posting. We cannot reproduce the issue on our setup with those particular fw versions, and we have no reports from other vendors, so I believe the issue is somehow specific to this vendor. It will take time until I have access to such a laptop, so I would be glad if someone could supply some more debug info.
More verbose logging can be enabled by

echo -n 'module mei +lfp' > /sys/kernel/debug/dynamic_debug/control
echo -n 'module mei_me +lfp' > /sys/kernel/debug/dynamic_debug/control

Register tracing can be enabled by
echo 1 > /sys/kernel/debug/tracing/events/mei/enable
cat /sys/kernel/debug/tracing/trace

I hope that something relevant can be captured during suspend/resume cycles.
Comment 66 Manuel D. 2016-03-13 19:56:27 UTC
Created attachment 208911 [details]
mei_me debug from syslog
Comment 67 Zhang Rui 2016-03-15 06:14:48 UTC
reassign to mei_me expert.
Comment 68 Tomas Winkler 2016-03-15 07:38:29 UTC
(In reply to Manuel D. from comment #66)
> Created attachment 208911 [details]
> mei_me debug from syslog

Did you hit the freeze here? At a quick glance the trace in the syslog looks like a good run.
Comment 69 Manuel D. 2016-03-15 08:14:59 UTC
Yes, this was a good run. 

I don't think I'll be able to catch the output for a bad run, as the laptop will freeze completely.
I tried netconsole with the wifi and a USB ethernet dongle. Neither device is supported for that.

Maybe I can try to redirect the kernel's serial console to a USB-to-serial adapter and record with another device?
Comment 70 Tomas Winkler 2016-03-17 22:10:03 UTC
Can't be captured in the pstore ? 
https://www.kernel.org/doc/Documentation/ABI/testing/pstore
Comment 71 Manuel D. 2016-03-19 10:28:31 UTC
Hopefully. 

What a great "feature" of the kernel. I didn't know this was possible.

Unfortunately CONFIG_PSTORE_CONSOLE is not enabled in Ubuntu kernels. Just compiling one with it enabled.

Found a nice article on stackoverflow how to test if pstore logging works.
May take some more days for the bug to appear then.
Comment 72 Daniel 2016-04-23 09:18:30 UTC
I tried to enable the pstore, but it doesn't seem to be working (at least not according to the testing procedure I found).
Could someone assist me with getting it working?
Some more details are in my stackoverflow post; http://unix.stackexchange.com/questions/273352/how-to-enable-kernel-pstore
Comment 73 Tomas Winkler 2016-05-10 14:12:47 UTC
I was finally able to get my hands on a UX303LN. I installed Ubuntu 16.04 LTS (kernel 4.4.0-21), added the TLP package, and enabled runtime PM on AC as well, but so far I have not been able to reproduce the issue.
With ME FW version
ME Code Firmware Version: 9.5.20.1742
ME NFTP Firmware Version: 9.5.20.1742
ME FITC Firmware Version: 9.5.15.1730
BIOS revision 4.6

I will try to stress this more in the ways described in the report; if someone has a reliable method of reproduction please let me know.

Thanks
Comment 74 Manuel D. 2016-05-15 13:13:07 UTC
I also upgraded to 16.04 in the meantime.
Just re-enabled PM for the suspected device to see if it happens again.

There is no reliable way to reproduce this. I think it's a good indicator of a coming freeze when the device disappears, as I described earlier.

Whenever I deliberately tried to reproduce it, it didn't happen at all. Uptime and activity seem to increase the chance of a freeze. I recommend normal usage. Don't forget to save things before closing the lid.
Comment 75 Manuel D. 2016-05-15 20:32:04 UTC
Just had a freeze after re-enabling PM for the mei_me device. It happened during resume on battery after charging.

Linux zen 4.4.0-22-generic #39-Ubuntu SMP Thu May 5 16:53:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Comment 76 Tomas Winkler 2016-06-19 09:32:18 UTC
This setup worked for us to enable pstore on Ubuntu 16.04 LTS:
1. edit /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="mem=2G ramoops.mem_address=0x80000000 ramoops.mem_size=0x1000000 ramoops.ramoops_ecc=1"
and run update-grub
2. add ramoops as the last line of /etc/modules
3. after boot, the stack trace can be found in /sys/fs/pstore.

Still was not able to reproduce the issue, though.
Comment 77 Manuel D. 2016-06-21 16:44:24 UTC
Followed your instructions to enable pstore and tested by triggering a kernel panic via 

echo c > /proc/sysrq-trigger

That worked. After two further days I just had a freeze after opening the lid, but nothing in /sys/fs/pstore this time.

I'll leave it running this way waiting for the next freeze.
Comment 78 Tomas Winkler 2016-06-21 18:37:36 UTC
Finally, today we also hit the issue, and pstore was empty as well, meaning there is no kernel panic happening. We need to devise a different way of capturing what is happening.
Comment 79 Maxime Martineau 2016-08-11 13:59:35 UTC
Additional information: the hard drive is shut down during the bug. Only the fan is left running.
Comment 80 murat 2016-12-27 20:33:33 UTC
I can reproduce this bug on a Sony Vaio Pro 13 which I believe has the identical Haswell chipset. I have used this machine with Arch Linux and without this issue for over a year and it just started happening a few months ago. Proving really hard to debug.
Comment 81 Alexey Kharlamov 2017-09-20 22:22:44 UTC
Hello there. I've hit a similar problem on a ThinkPad X230 with coreboot (inxi log attached).

The first suspend is always ok; the second one is always freezing.

Unloading all mei* modules helps

Using a 120fps camera I was able to take a photo of the kernel log (picture attached). A short excerpt from it (flag values are probably not correct):

genirq: Flags mismatch irq 0: 00000080 (mei_me) vs 08915a00 (timer)
mei_me: ... request_threading_irq failed: irq: 0
dpm_run_callback: pci_pm_resume+0x070xa0 returns -16
PM: Device 0000:00:16.0 failed to resume async: error -16
BUG: Unable to handle kernel NULL pointer deference at ...
OOUPS
...
Call trace:
mei_me_pci_suspend+0x4d
...
Comment 82 Alexey Kharlamov 2017-09-20 22:23:03 UTC
Created attachment 258543 [details]
inxi device log
Comment 83 Alexey Kharlamov 2017-09-20 22:23:59 UTC
Created attachment 258545 [details]
ooups picture
Comment 84 Tomas Winkler 2017-10-02 07:37:53 UTC
(In reply to derlafff@ya.ru from comment #83)
> Created attachment 258545 [details]
> ooups picture

What kernel version are you running?
Can you please provide the dmidecode output?
Comment 85 Alexey Kharlamov 2017-10-03 11:31:16 UTC
Created attachment 258705 [details]
dmidecode output

Kernel is 

Linux 4.13.3-1-ARCH
Comment 86 Alexey Kharlamov 2017-10-03 11:33:40 UTC
This part of the kernel log appears after the first suspend, so I am able to retrieve it:

[  560.723960] mei_me 0000:00:16.0: Refused to change power state, currently in D3
[  560.784691] genirq: Flags mismatch irq 0. 00000080 (mei_me) vs. 00015a00 (timer)
[  560.784754] mei_me 0000:00:16.0: request_threaded_irq failed: irq = 0.
Comment 87 Alexey Kharlamov 2017-10-03 11:34:28 UTC
Created attachment 258707 [details]
dmesg after first suspend
Comment 88 Tomas Winkler 2017-10-03 12:23:58 UTC
If this is 4.12, then you should apply this patch; it has still not been applied:
https://patchwork.kernel.org/patch/9967845/
Comment 89 Tyler S 2018-05-15 12:26:58 UTC
I am experiencing this issue on an ASUS ZENBOOK UX301LAA (BIOS 211 06/05/2015).

In my case, the issue is extremely repeatable: I can do 'echo mem > /sys/power/state' and wake up exactly once with everything working as expected. After that, I have to power cycle the system before suspend works reliably again.

If I try to do 'echo mem > /sys/power/state' a second time, the laptop appears to suspend but the fans keep spinning. The light on the keyboard pulses as it normally would in suspend during this time. When I press a key on the keyboard in this state (to exit S3), the keyboard light turns solid white to indicate power, but neither the screen nor the network ever comes back, so I am unable to debug.

Furthermore, if I do enter/exit S3 exactly once and then reboot/shutdown, the system will hang right before the system effectively reboots/goes back to POST and I will have to manually power cycle it anyways.

I am not running tlp or any kind of software which alters runtime-pm settings in any way. This behavior happens with a bone-stock Debian 9 install (it also happened with backports kernel 4.16.x, which contains the patch Tomas provided above).

I have tried so many permutations of BIOS settings, cleaning up my DSDT tables, different kernels, etc... but the result is always the same: I can only suspend one time in a reliable fashion.

---

I read over this ticket and noticed Aaron's pm_test suggestion.  When the system comes back from suspend (and thus is in the 'wonky' state) I did:
echo devices > /sys/power/pm_test
echo mem > /sys/power/state

The suspend test completed successfully even though it was the second time, but the GPU locked up and my X session crashed. Fortunately, the system was responsive enough to let me collect info:

Maybe this hints that either the GPU or mei is the problem? I have i915 power optimizations turned on now; I will try with them turned off if needed, or go back to stock... the mei part is worrisome though.
[  974.039189] Restarting tasks ... done.
[  977.546063] wlp2s0: authenticate with ...
<OMITTED FOR PRIVACY>
[  977.565371] wlp2s0: associated
[  985.449148] [drm] GPU HANG: ecode 7:0:0x87c3ffff, in Xorg [549], reason: Hang on render ring, action: reset
[  985.449149] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  985.449150] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  985.449150] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  985.449150] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  985.449151] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  985.449177] drm/i915: Resetting chip after gpu hang
[  997.416363] drm/i915: Resetting chip after gpu hang
[ 1003.816187] mei_me 0000:00:16.0: timer: init clients timeout hbm_state = 2.
[ 1003.816201] mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS fw status = 1E000045 60002106 00000200 00004400 00000000 40000010
[ 1034.055911] mei_me 0000:00:16.0: timer: init clients timeout hbm_state = 2.
[ 1034.055937] mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS fw status = 1E000045 60002106 00000200 00004400 00000000 40000010
[ 1064.295470] mei_me 0000:00:16.0: timer: init clients timeout hbm_state = 2.
[ 1064.295484] mei_me 0000:00:16.0: unexpected reset: dev_state = INIT_CLIENTS fw status = 1E000045 60002106 00000200 00004400 00000000 40000010
[ 1064.295485] mei_me 0000:00:16.0: reset: reached maximal consecutive resets: disabling the device
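
The pm_test procedure used earlier in this comment can be wrapped in a small script so that the test level is always reset afterwards. This is a minimal sketch under stated assumptions: the PM_DIR variable and the function names are hypothetical, and /sys/power/pm_test is only present when the kernel is built with PM debugging support.

```shell
#!/bin/sh
# Hypothetical override point so the functions can be tried on a scratch
# directory; on a real system this is /sys/power and writing needs root.
PM_DIR="${PM_DIR:-/sys/power}"

# Print the test levels the kernel advertises, one per line.
# The file marks the active level with [brackets]; they are stripped here.
pm_test_levels() {
    tr -d '[]' < "$PM_DIR/pm_test" | tr -s ' ' '\n'
}

# Run one suspend test at the given level, then restore normal suspend.
run_pm_test() {
    level="$1"
    pm_test_levels | grep -qx "$level" || {
        echo "unsupported pm_test level: $level" >&2
        return 1
    }
    echo "$level" > "$PM_DIR/pm_test" &&
        echo mem > "$PM_DIR/state"
    status=$?
    # Always go back to "none" so the next suspend is a real one.
    echo none > "$PM_DIR/pm_test"
    return $status
}
```

For example, 'run_pm_test devices' corresponds to the two echo commands above, and 'run_pm_test freezer' to the level mentioned in comment 90.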
Comment 90 Tyler S 2018-05-15 13:04:40 UTC
Sorry for double-posting.

GPU hang is only evident with 4.9 kernel: it is gone with 4.16.5... so progress is being made!

mei messages/issues after suspending once and subsequently fiddling with pm_test persist, but now I must use 'freezer' for pm_test rather than 'devices' to reliably trigger the same mei error messages seen above.
Comment 91 Tyler S 2018-05-15 15:23:00 UTC
I finally got suspend working reliably on this platform (and with runtime-pm to boot!). I had a suspicion about that funny ME activity above and followed through on it...

I 'only' had to temporarily nuke my entire system, install Windows 8, and flash the latest version of Intel ME firmware using a package from another vendor (because Intel doesn't provide images directly and my vendor has not released an update in years).

I'll stop my ranting now ;) -- but you wonder why people hate this bloody ME thing...

If you're looking to repro the issue, I am including the firmware versions below. After flashing, the entire platform and all power saving features work perfectly.

Before:
ME Code Firmware Version: 9.5.3.1520
ME NFTP Firmware Version: 9.5.3.1520
ME FITC Firmware Version: 9.5.3.1520

After:
ME Code Firmware Version: 9.5.60.1952
ME NFTP Firmware Version: 9.5.60.1952
ME FITC Firmware Version: 9.5.3.1520
Comment 92 Manuel D. 2018-07-14 22:05:34 UTC
I have not had a single freeze over the past two years with PM enabled for all devices except this one (as described in comment 58):

# echo 'auto' > '/sys/bus/pci/devices/0000:00:16.0/power/control';  # mei_me

Almost forgot that commented line and this issue...

But I just tested if I could still trigger the issue when (re-)activating PM for it - and yes, it happened twice within two days. It only happened when closing the lid - standby doesn't seem to trigger it.

I am running Ubuntu 18.04 (kernel 4.15.0-23-generic) today and still have the same ME firmware versions as in comment 62.

@Tyler S
Could you tell me which vendor's package you used?

I'd like to confirm if the upgrade you've described also fixes it for me. At least we have the exact same model. 
And I already have Windows (10) installed in parallel.
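
For anyone who wants to check or toggle the runtime PM setting discussed in this comment, a small helper along these lines might be convenient. It is a sketch only: the MEI_DEV variable and the function name are hypothetical, and the PCI address 0000:00:16.0 is the one reported in this thread and may differ on other machines.

```shell
#!/bin/sh
# Hypothetical override point; the default is the mei_me PCI device path
# used in this thread.
MEI_DEV="${MEI_DEV:-/sys/bus/pci/devices/0000:00:16.0}"

# With no argument, print the current setting.
# "on" keeps the device powered (runtime PM off); "auto" allows runtime PM.
mei_runtime_pm() {
    case "${1:-}" in
        "")      cat "$MEI_DEV/power/control" ;;
        on|auto) echo "$1" > "$MEI_DEV/power/control" ;;
        *)       echo "usage: mei_runtime_pm [on|auto]" >&2; return 2 ;;
    esac
}
```

Writing the setting requires root. 'mei_runtime_pm on' matches the configuration that has been stable here, while 'mei_runtime_pm auto' is the one that triggers the freeze.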
Comment 93 Alexey Loukianov 2018-11-20 17:09:37 UTC
@Manuel D:

I have an Asus Zenbook UX32LN and have been suffering from the "suspend and hibernate are unreliable" problem for years now. The problem has occurred under Debian Jessie, Stretch and now Buster. I also tried Arch, Ubuntu and Fedora, all with the same result. Currently I'm using LMDE 3 Cindy (based on Debian Stretch) with the stock kernel being 4.9.0-8-amd64 and the latest backported kernel being 4.18.0-2-amd64.

Using this laptop under Linux has been a painful experience due to some long-standing bugs. The non-working brightness keys were fixed only in 4.10, while the unstable suspend/resume is still not fixed.

Today I stumbled upon this bug, and with a hint from @Tyler S I was able to update the Intel ME firmware to a more recent version. The hint was to find a Lenovo notebook model with a similar CPU and chipset and download the Intel ME FW update for it. WARNING: do this at your own risk, you might brick your laptop.

I used the files from here: https://pcsupport.lenovo.com/ru/ru/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x240s/20aj/downloads/ds038194

After the update:
ME Code Firmware Version: 9.5.60.1952
ME NFTP Firmware Version: 9.5.60.1952
ME FITC Firmware Version: 9.5.15.1730


I will check whether this has finally fixed the suspend/resume unreliability and report any findings back here. I really hope it has, as it is a huge struggle to use an ultrabook without hibernate functionality.
