Bug 200455

Summary: (Dell Latitude E7440 laptop with Intel Core-i7 CPU) Laptop freeze on hibernate (MEI error)
Product: Drivers
Component: Other
Status: NEEDINFO
Severity: normal
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 4.17
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Yakov Sh. (yman)
Assignee: Tomas Winkler (tomas.winkler)
CC: alexander.usyskin, bugrfeaturer, georgmueller, pizza, tomasw, yu.c.chen
Attachments: MEI error log photo
kernel panic photo
MEI error log photo
attachment-18619-0.html
fix hw module get/put balance
panic fix

Description Yakov Sh. 2018-07-09 09:18:18 UTC
Running Arch Linux on a Dell Latitude E7440 laptop with an Intel Core-i7 CPU. Previously, closing the lid reliably sent the system into hibernation. After upgrading to 4.17.1, it fails 8 times out of 10.


When it fails, the "power" LED stays lit steadily, whereas normally it blinks slowly. When I open the lid I see a blank screen, and there is no reaction to key presses or anything else. The only thing that "helps" is holding the "power" button for 5 seconds, which turns the laptop off. After booting up I can't find anything useful in the logs.


Upgrading to 4.17.2 or 4.17.4 didn't help (I skipped 4.17.3), while downgrading to 4.16.13 solves the issue. I found a post on reddit where a user mentions the exact same problem on Fedora 28 with a 4.17 kernel.


I run the GNOME desktop environment, if that helps, and will be happy to give any additional information about the problem.
Comment 1 bugrfeaturer 2018-07-09 12:38:55 UTC
Getting logs/workaround:
* wait a while (about 5 minutes), then switch to another virtual terminal;
* also try changing various options of the 'i915' module and the 'intel' driver.
Comment 2 Chen Yu 2018-07-11 05:48:43 UTC
1. Please blacklist the i915 module and check whether hibernation works.
2. You can also try the different pm_test modes to narrow it down.
3. Besides, you can try test_resume mode to check whether it works.
Please refer to 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/power/basic-pm-debugging.txt
for details. Thanks.
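The pm_test and test_resume suggestions above can be sketched as two small shell helpers. This is a rough sketch based on the linked basic-pm-debugging document, not a tested recipe for this laptop: the function names and the PM_ROOT override are made up for illustration, the commands need root, and /sys/power/pm_test only exists on kernels built with CONFIG_PM_DEBUG.

```shell
# Directory holding the PM control files; override it to rehearse on scratch files.
PM_ROOT="${PM_ROOT:-/sys/power}"

# Select a pm_test level, then start a hibernation cycle that stops at that level.
# Note: level "none" disables test mode, so the cycle becomes a real hibernation.
pm_test_hibernate() {
    case "$1" in
        freezer|devices|platform|processors|core|none) ;;
        *) echo "unknown pm_test level: $1" >&2; return 1 ;;
    esac
    echo "$1" > "$PM_ROOT/pm_test" &&
    echo disk > "$PM_ROOT/state"
}

# Write the hibernation image and resume from it right away, without powering off.
test_resume_hibernate() {
    echo test_resume > "$PM_ROOT/disk" &&
    echo disk > "$PM_ROOT/state"
}
```

A possible sequence is `pm_test_hibernate freezer`, then `devices`, `platform`, `processors` and `core`, noting the first level at which the machine hangs. Writing `none` to /sys/power/pm_test directly switches test mode off again without starting a cycle.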
Comment 3 Chen Yu 2018-07-11 05:50:16 UTC
Besides, you can try using git to download kernel code via
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
and leverage git bisect to find the bad commit.
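A bisect session along these lines might be started as follows. This is only a sketch: the `run` and `bisect_start` helpers are hypothetical, and using the mainline tags v4.16 (good) and v4.17 (bad) is an assumption derived from the reported 4.16.13 and 4.17.1 stable kernels, which are not themselves tagged in Linus' tree.

```shell
# Prefix each command with "echo" when DRY_RUN is set, to preview it.
run() { ${DRY_RUN:+echo} "$@"; }

# Start a bisect inside a linux.git checkout.
# $1 = known-bad revision, $2 = known-good revision.
bisect_start() {
    bad="$1"; good="$2"
    run git bisect start &&
    run git bisect bad "$bad" &&
    run git bisect good "$good"
}
```

For example, `bisect_start v4.17 v4.16` from inside the cloned tree. After each step, build and boot the kernel git checks out, test hibernation, and report the result with `git bisect good` or `git bisect bad` until git names the first bad commit.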
Comment 4 Yakov Sh. 2018-07-14 17:41:12 UTC
A correction to my observations:
After the system boots, the first hibernation works smoothly; the second one freezes the system.

First of all I tried the latest kernel available in Arch, 4.17.5. Nothing changed.

Blacklisting i915 didn't help. But by following the debugging guide and working from the console I managed to get these messages on freeze:
mei_wdt mei::{LONG_STRING_LIKE_SERIAL_NUMBER}: get hw module failed
mei_wdt mei::{LONG_SAME_STRING}: Could not enable cl device

After blacklisting the mei and mei_me modules the problem disappears. Interestingly, over the last month, while struggling with hibernation, I got a kernel panic twice while rebooting. On the second occurrence (today) I read through the text and it mentions mei too. I can upload a couple of photos if necessary.
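For anyone wanting to reproduce the workaround, the blacklisting step could look roughly like this. The helper name and the default file name are assumptions for illustration (any `*.conf` file under /etc/modprobe.d is read by modprobe), and writing there needs root.

```shell
# Write a modprobe blacklist file for the two MEI modules named above.
# The default path is a placeholder; pass another file as $1 to try it out safely.
write_mei_blacklist() {
    conf="${1:-/etc/modprobe.d/mei-blacklist.conf}"
    printf 'blacklist %s\n' mei mei_me > "$conf"
}
```

A reboot is needed for the blacklist to take effect. If the modules are included in the initramfs, the image may have to be regenerated as well.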
Comment 5 Chen Yu 2018-08-15 06:46:30 UTC
Thanks for your investigation. So this seems to be mei related. Could you please upload the full log (picture) for mei?
Comment 6 Tomas Winkler 2018-08-15 08:29:03 UTC
There should be a fix for it in v4.17.6 


commit 7ac3afe1341e20359c0d1f9e1f358d175e52bcf4
Author: Alexander Usyskin <alexander.usyskin@intel.com>
Date:   Thu Jun 7 00:31:48 2018 +0300

    mei: discard messages from not connected client during power down.

    commit b7a020bff31318fc8785e6f96b1d38c1625cf1fb upstream.

    This fixes regression introduced by
    commit 8d52af6795c0 ("mei: speed up the power down flow")

    In power down or suspend flow a message can still be received
    from the FW because the clients fake disconnection.
    In normal case we interpret messages w/o destination as corrupted
    and link reset is performed in order to clean the channel,
    but during power down link reset is already in progress resulting
    in endless loop. To resolve the issue under power down flow we
    discard messages silently.
Comment 7 Yakov Sh. 2018-08-15 11:07:00 UTC
Created attachment 277869 [details]
MEI error log photo
Comment 8 Yakov Sh. 2018-08-15 11:10:09 UTC
Created attachment 277871 [details]
kernel panic photo
Comment 9 Yakov Sh. 2018-08-15 11:11:14 UTC
Created attachment 277873 [details]
MEI error log photo
Comment 10 Yakov Sh. 2018-08-15 11:12:15 UTC
Comment on attachment 277873 [details]
MEI error log photo

Messages I got when trying to hibernate the system manually
Comment 11 Yakov Sh. 2018-08-15 11:13:09 UTC
Comment on attachment 277871 [details]
kernel panic photo

A kernel panic mentioning MEI, which happened once when I tried to reboot the laptop
Comment 12 Yakov Sh. 2018-08-15 11:24:19 UTC
Unblacklisted both modules while running 4.17.14-arch1-1-ARCH; the problem persists.
Comment 13 Tomas Winkler 2018-08-15 15:34:21 UTC
Okay, this really looks like another issue.
We are trying to analyze this.
Is the crash on 4.17.14 exactly the same as on 4.17.1?
Comment 14 Alexander Usyskin 2018-08-15 15:43:57 UTC
Created attachment 277877 [details]
attachment-18619-0.html

I’m on vacation with no email access till Sep 2.
For LMS and PRT tools issues contact Oren Weil jer.
For Mei driver - Tomas Winkler.
Comment 15 Tomas Winkler 2018-08-15 16:51:47 UTC
You can try reverting this one and check whether it helps:
commit 257355a44b9929e55d6fd47bfff66971dc4de948
Author: Tomas Winkler <tomas.winkler@intel.com>
Date:   Sun Feb 25 20:07:04 2018 +0200

    mei: make module referencing local to the bus.c

    Module reference counting is relevant only to the
    mei client devices. Make the implementation clean
    and move it to bus.c
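Reverting that commit in a local kernel tree could be done roughly as below. This is a sketch: the `run` and `revert_suspect` helpers and the branch name `mei-revert-test` are placeholders, and it assumes the revert applies cleanly to whichever kernel version is checked out, which may not hold.

```shell
# Prefix each command with "echo" when DRY_RUN is set, to preview it.
run() { ${DRY_RUN:+echo} "$@"; }

# Revert the suspected commit on a scratch branch of a kernel checkout.
# "mei-revert-test" is a placeholder branch name.
revert_suspect() {
    commit="${1:-257355a44b9929e55d6fd47bfff66971dc4de948}"
    run git checkout -b mei-revert-test &&
    run git revert --no-edit "$commit"
}
```

After the revert, rebuild and boot the kernel, then hibernate at least twice, since the freeze was reported on the second hibernation after boot.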
Comment 16 Georg Müller 2018-08-22 11:18:26 UTC
I also have a Dell Latitude E7440 with the same issue, and reverting commit 257355a44b9929e55d6fd47bfff66971dc4de948 solved it.
Comment 17 Tomas Winkler 2018-08-22 13:03:54 UTC
Thanks for the report, Georg. I'm working on the fix.
Comment 18 Georg Müller 2018-08-22 18:59:53 UTC
If I can help by testing with additional debug messages, just tell me what to test.
Comment 19 Tomas Winkler 2018-08-24 08:47:03 UTC
Created attachment 278055 [details]
fix hw module get/put balance

fix hw module get/put balance
Comment 20 Tomas Winkler 2018-08-24 08:47:57 UTC
Created attachment 278057 [details]
panic fix

need to unlink client before freeing
Comment 21 Tomas Winkler 2018-08-24 08:48:31 UTC
Georg, it would be great if you could check those two patches.
Comment 22 Georg Müller 2018-08-24 12:00:53 UTC
Applying the two patches fixes the issue.
Comment 23 Tomas Winkler 2018-08-24 19:28:47 UTC
I really appreciate your help. Can I add your Tested-by: to the submission?
Comment 24 Georg Müller 2018-08-25 09:11:36 UTC
Sure. Thank you very much for the patches.
Comment 25 Georg Müller 2018-09-04 18:31:14 UTC
I have seen that you posted the patches to LKML, but they are still neither in Linus' tree nor queued for the next stable release.

This bug has "Regression: No", even though it is a fairly serious regression.

Should I rather ask in the Red Hat Bugzilla to include the patches? They normally track stable.
Comment 26 Tomas Winkler 2018-09-04 20:54:48 UTC
I hope it will be picked up for 4.19-rc3; it looks like Greg missed 4.19-rc2. Greg KH is a very busy maintainer.
I believe I marked the patches for stable, so they will eventually land there. I'm sorry we caused this issue. I made an effort to provide the fix ASAP, but I'm not sure how to speed up the process from this point on.
Comment 27 Georg Müller 2018-09-26 12:34:55 UTC
Kernel 4.18.10 and 4.19-rc5 both contain the patches for the issue.

Thank you very much. I think the issue can now be marked as resolved.
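A quick way to check whether a running kernel is at least 4.18.10 is sketched below. The helper name is made up, it relies on `sort -V` from GNU coreutils, and it compares only the numeric part of the version, so it is a rough heuristic: early 4.19 release candidates would also pass the check even though the patches only arrived in 4.19-rc5.

```shell
# Succeed if version $1 is at least version $2.
# Anything after the first "-" in $1 (e.g. "-arch1-1-ARCH") is ignored.
version_at_least() {
    have="${1%%-*}"; want="$2"
    [ "$(printf '%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}
```

For example: `version_at_least "$(uname -r)" 4.18.10 && echo "kernel should contain the MEI fixes"`.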
Comment 28 Tomas Winkler 2018-09-26 13:23:02 UTC
Thanks for the report. 
Tomas
Comment 29 Yakov Sh. 2018-09-30 11:41:33 UTC
Works as expected on 4.18.10 with the MEI modules unblacklisted. Thanks everyone for the help!