Bug 178641 - (Desktop) Freeze on the third time on resume from suspend to RAM (all pm_test mode ok, BIOS issue suspected)
Status: CLOSED INVALID
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 high
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-20 10:39 UTC by paviluf
Modified: 2018-03-28 16:37 UTC
CC List: 3 users

See Also:
Kernel Version: 4.8.15
Subsystem:
Regression: No
Bisected commit-id:


Attachments
journalctl -r (1.05 MB, text/plain)
2016-10-20 10:40 UTC, paviluf
blacklist mod (1.65 KB, text/plain)
2016-10-20 10:42 UTC, paviluf
systemd-suspend (1.32 KB, application/x-shellscript)
2016-10-20 10:43 UTC, paviluf
lspci (1.81 KB, text/plain)
2016-10-22 10:14 UTC, paviluf
dmesg after two good suspend cycles (65.14 KB, text/plain)
2016-10-27 11:03 UTC, paviluf
dmesg without any out of tree driver (66.33 KB, text/plain)
2016-10-31 08:23 UTC, paviluf
dmesg (66.07 KB, text/plain)
2017-06-18 11:55 UTC, paviluf
journalctl (4.08 MB, text/plain)
2017-06-18 11:55 UTC, paviluf
dmesg after a succeed suspend/resume (55.62 KB, text/plain)
2018-01-14 15:17 UTC, paviluf
kernel config file (206.63 KB, text/plain)
2018-01-14 15:18 UTC, paviluf
echo freeze > /sys/power/state (64.13 KB, text/plain)
2018-01-21 14:14 UTC, paviluf
echo mem > /sys/power/state (64.13 KB, text/plain)
2018-01-21 14:15 UTC, paviluf
echo freeze > /sys/power/state (73.82 KB, text/plain)
2018-01-21 14:16 UTC, paviluf
lsmod (2.29 KB, text/plain)
2018-01-22 20:32 UTC, paviluf
dmesg after hot reboot (49.89 KB, text/plain)
2018-01-30 23:02 UTC, paviluf
hot reboot dmesg (54.01 KB, text/plain)
2018-02-20 12:05 UTC, paviluf
dmesg excerpt on fail (pc frozen) (1.29 MB, image/jpeg)
2018-02-20 12:08 UTC, paviluf

Description paviluf 2016-10-20 10:39:07 UTC
Hello,

I have had my desktop for about 5 years and suspend to RAM has never worked. I never tried to make it work until now. I tried everything I found on the web with no luck. The first two times, suspend / resume works perfectly, but the third time the PC freezes. It can freeze while going to suspend, which it then never seems to reach (the PC stays on), or, more frequently, on resume, ending up with a black screen or a frozen desktop. Very often the keyboard doesn't respond (Caps Lock LED).

Here is what I tried:

- it works with Windows
- tried to suspend in init 3
- disabled a lot of things in the BIOS
- tried to blacklist all kernel modules
- tried the nouveau and nvidia drivers (GT430)
- tried another graphics card (HD 5750) with the radeon driver
- tried other SATA ports for my SSD (Samsung 840)
- tried to downgrade / upgrade the motherboard BIOS
- tried with only one RAM module, and in other RAM slots too
- unplugged everything I could, externally and internally
- tried Manjaro (KDE 5.7), Fedora 24 (GNOME 3.20), Fedora 25 (live), Ubuntu 12.04, 14.04, 16.04, 16.10 (all live)
- tried to follow some parts of https://01.org/blogs/rzhang/2015/best-practice-debug-linux-suspend/hibernate-issues and https://www.kernel.org/doc/Documentation/power/s2ram.txt
- tried to unbind / bind some drivers with a systemd-suspend script
- etc...

The motherboard is this one http://www.gigabyte.eu/products/product-page.aspx?pid=3766&dl=1#ov

I will be happy to provide any info you need to help me get suspend to RAM working.

Thanks !
Comment 1 paviluf 2016-10-20 10:40:30 UTC
Created attachment 242041 [details]
journalctl -r

Here is a log of "journalctl -r" with a successful suspend/resume cycle and a failed one
Comment 2 paviluf 2016-10-20 10:42:47 UTC
Created attachment 242051 [details]
blacklist mod

Here is a blacklist module conf file that I tried
Comment 3 paviluf 2016-10-20 10:43:26 UTC
Created attachment 242061 [details]
systemd-suspend

Here is a systemd-suspend script that I tried
Comment 4 paviluf 2016-10-22 10:14:31 UTC
Created attachment 242191 [details]
lspci
Comment 5 paviluf 2016-10-22 10:23:10 UTC
Using this script to suspend

#!/bin/sh
sync
echo 1 > /sys/power/pm_trace
echo mem > /sys/power/state

gives this in dmesg after the reboot

[    0.903446]   Magic number: 0:422:762
[    0.903450]   hash matches drivers/base/power/main.c:818
[    0.903456] atkbd serio0: hash matches

and "cat /sys/power/pm_trace_dev_match" gives

input
serio

I tried to rmmod serio_raw but that doesn't change anything. atkbd can't be unloaded it seems.
Comment 6 paviluf 2016-10-22 12:45:11 UTC
I tried the "freezer", "devices", "platform", "processors" and "core" tests with this script and it seems that every test worked

#!/bin/sh
sync
echo 1 > /sys/power/pm_trace
echo core > /sys/power/pm_test
echo mem > /sys/power/state


cat /sys/kernel/debug/suspend_stats
success: 26
fail: 0
failed_freeze: 0
failed_prepare: 0
failed_suspend: 0
failed_suspend_late: 0
failed_suspend_noirq: 0
failed_resume: 0
failed_resume_early: 0
failed_resume_noirq: 0
failures:
  last_failed_dev:	
			
  last_failed_errno:	0
			0
  last_failed_step:

What does that mean ?
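
For reference, the per-level runs described above could be repeated in a loop along these lines. This is only a sketch: the function name run_pm_tests and the POWER_DIR override are hypothetical conveniences (the override lets the loop be tried against plain files), and on a real system it needs root and a kernel that exposes /sys/power/pm_test.

```sh
#!/bin/sh
# Sketch: run every pm_test level a given number of times.
# POWER_DIR defaults to the real sysfs directory; overriding it is a
# hypothetical convenience for dry runs against ordinary files.
POWER_DIR="${POWER_DIR:-/sys/power}"

run_pm_tests() (
    repeats="${1:-4}"
    for level in freezer devices platform processors core; do
        i=1
        while [ "$i" -le "$repeats" ]; do
            sync
            echo "$level" > "$POWER_DIR/pm_test" || exit 1
            # In a pm_test mode the kernel stops at the chosen level,
            # waits a few seconds, then resumes on its own.
            echo mem > "$POWER_DIR/state" || exit 1
            echo "pm_test=$level run $i/$repeats returned"
            i=$((i + 1))
        done
    done
    # Put pm_test back to none so a later suspend is a real one.
    echo none > "$POWER_DIR/pm_test"
)
```

Calling `run_pm_tests 4` as root would run each level four times; /sys/kernel/debug/suspend_stats can then be read as shown above.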
Comment 7 paviluf 2016-10-22 16:52:27 UTC
I tried with "acpi=off" boot option but that doesn't change anything.
Comment 8 Zhang Rui 2016-10-25 06:17:38 UTC
please attach the dmesg after two good suspend cycles.
Comment 9 paviluf 2016-10-27 11:03:41 UTC
Created attachment 242901 [details]
dmesg after two good suspend cycles

Here it is. Thank you !
Comment 10 Niemand 2016-10-27 12:32:24 UTC
Hello Rui (Zhang),

I need assistance from Rafael (Wysocki), email: rjw@wysocki.net, also rafael.j.wysocki@intel.com . You are INTEL also, I 100% know. You need to contact him in your own department, INTEL OTC, since this is something I never saw before!

We have here INTEL possible ugly problem, with CORE2 family Sandy Bridge. The data are here:
Sandy Bridge: CPUID = 0x206A7, ark.intel.com: http://ark.intel.com/products/52227/Intel-Core-i7-2820QM-Processor-8M-Cache-up-to-3_40-GHz

So, we have failing log for PM suspend from kernel.org derived kernel... And log is here:  https://bugzilla.kernel.org/attachment.cgi?id=242051

Here are very confusing lines from it:

[ 181.127124] PM: noirq resume of devices complete after 11.258 msecs
[ 181.127463] PM: early resume of devices complete after 0.311 msecs
[ 181.127509] usb usb1: root hub lost power or was reset
[ 181.127702] usb usb2: root hub lost power or was reset
[ 181.131402] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
[ 181.131419] rtc_cmos 00:01: System wakeup disabled by ACPI
[ 181.131568] pcieport 0000:00:1c.2: System wakeup disabled by ACPI
[ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported

Please, do note the following two lines: 
[ 181.131402] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
[ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported

The cache line size on x86/x86_64 (IA-32 and Intel 64) is ONLY/SOLELY 64 bytes (you should know this).

I also need this info, ASAP!

Thank you,
_nobody_
Comment 11 Zhang Rui 2016-10-31 03:25:11 UTC
(In reply to Niemand from comment #10)
> Hello Rui (Zhang),
> 
> I need assistance from Rafael (Wysocki), email: rjw@wysocki.net, also
> rafael.j.wysocki@intel.com . You are INTEL also, I 100% know. You need to
> contact him in your own department, INTEL OTC, since this is something I
> never saw before!

Yes, I work with Rafael, and both of us work on all kinds of suspend/resume related issues. :)
> 
> We have here INTEL possible ugly problem, with CORE2 family Sandy Bridge.
> The data are here:
> Sandy Bridge: CPUID = 0x206A7, ark.intel.com:
> http://ark.intel.com/products/52227/Intel-Core-i7-2820QM-Processor-8M-Cache-
> up-to-3_40-GHz
> 
Usually, you just need to file a new bug report here, so the problem will be in our radar.

> So, we have failing log for PM suspend from kernel.org derived kernel...

what kernel version are you using?
please try the latest upstream kernel, say, 4.9-rc3

> And
> log is here:  https://bugzilla.kernel.org/attachment.cgi?id=242051
> 
this looks like the driver blacklist, are you sure this is the right link?
hmmm, it seems that you already have a bug report in kernel bugzilla, right? can you tell me the bug id?

> Here are very confusing lines from it:
> 
> [ 181.127124] PM: noirq resume of devices complete after 11.258 msecs
> [ 181.127463] PM: early resume of devices complete after 0.311 msecs
> [ 181.127509] usb usb1: root hub lost power or was reset
> [ 181.127702] usb usb2: root hub lost power or was reset
> [ 181.131402] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
> [ 181.131419] rtc_cmos 00:01: System wakeup disabled by ACPI
> [ 181.131568] pcieport 0000:00:1c.2: System wakeup disabled by ACPI
> [ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported
> 
> Please, do note the following two lines: 
> [ 181.131402] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
> [ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported
> 

This seems like a device/driver problem to me.
what's the symptom of the problem? again, please show me the bug report that describes the problem in details.
Comment 12 Zhang Rui 2016-10-31 03:30:11 UTC
(In reply to paviluf from comment #9)
> Created attachment 242901 [details]
> dmesg after two good suspend cycles
> 
[    4.174888] nvidia: module license 'NVIDIA' taints kernel.
[    4.174891] Disabling lock debugging due to kernel taint
[    4.178792] nvidia: module verification failed: signature and/or required key missing - tainting kernel

please retest without any out of tree driver.
Comment 13 paviluf 2016-10-31 08:23:30 UTC
Created attachment 243341 [details]
dmesg without any out of tree driver

I already tried with nouveau, and even with another graphics card (HD 5750) with the radeon driver, with the same result. Here is the dmesg without any out-of-tree driver.
Comment 14 Niemand 2016-10-31 10:12:48 UTC
Hello Rui,

I see that you completely missed my points in my comment 10. All Good, I'll take other approach, so you can understand me much better.

I am here talking not about other (use) case, other driver, but I am trying to support paviluf's case using paviluf's logs. And I supplied logs from his blacklisted log in my comment 10.

But now I supply his normal logs from https://bugzilla.kernel.org/show_bug.cgi?id=178641#c13 - Comment 13

And here, again, in his normal log: attachment 243341 [details], we have the following:

[   78.858081] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
[   78.858201] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported

I am not saying that these two lines are causing resume hang-up, but I NEVER saw (I'll repeat) such comment: "cache line size of 4 is not supported"!

If you start googling on "cache line size of 4 is not supported", you never find one on the net (or I did not look enough carefully/in-depth)!

You should try on ANY INTEL platform on ANY Linux to execute the following commands:
$ cat /proc/cpuinfo | grep cache_alignment
$ getconf LEVEL1_ICACHE_LINESIZE
$ getconf LEVEL1_DCACHE_LINESIZE

And tell me what number do you see? You should see ONLY one number, guess which? ;-)

_nobody_
Comment 15 paviluf 2016-10-31 13:10:55 UTC
(In reply to Niemand from comment #14)
Hello Niemand,

Thanks for helping me. Don't know if it's good.

$ cat /proc/cpuinfo | grep cache_alignment
cache_alignment	: 64
cache_alignment	: 64
cache_alignment	: 64
cache_alignment	: 64

$ getconf LEVEL1_ICACHE_LINESIZE
64

$ getconf LEVEL1_DCACHE_LINESIZE
64
Comment 16 Niemand 2016-10-31 16:22:08 UTC
Yes, paviluf, this is exactly what the magic number is... 64 (bytes)! Neither 16, 32, or 128, even 256... And INTEL does know this (they had created this cache alignment in HW long time ago)! ;-)

So, let us see if Arjan van de Ven <arjan@linux.intel.com> (INTEL kernel maintainer) knows answers to my quest?! ;-)

Cache line size of 4 is not supported
https://bugzilla.redhat.com/show_bug.cgi?id=1390298

So, I have created the new thread. And I expect that mainly INTEL people (OTC) will come back to me on this one... Don't they? ;-)

_nobody_
Comment 17 paviluf 2016-11-04 12:04:34 UTC
(In reply to Zhang Rui from comment #12)
> please retest without any out of tree driver.

I hope you can look at this problem. Thanks !
Comment 18 Chen Yu 2016-11-07 09:59:31 UTC
(In reply to Niemand from comment #10)
> [ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported
> 
> Please, do note the following two lines: 
> [ 181.131402] ehci-pci 0000:00:1a.0: cache line size of 4 is not supported
> [ 181.131588] ehci-pci 0000:00:1d.0: cache line size of 4 is not supported
> 
> The cache size in x86/x86_64 (IA32 and 64) are ONLY/SOLELY 64 bytes (you
> should know this).
> 
It looks like all the pci devices in your system reported that they only support cache line size of 4 bytes, which is read from pci config space. The io space might already be broken.
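
As a side note for anyone wanting to look at the raw value in PCI config space: the Cache Line Size register sits at offset 0x0C and, per the PCI specification, is counted in 32-bit words, so a raw value of 0x10 means 64 bytes. The sketch below only does that conversion; the function name cls_to_bytes is hypothetical, and how the "4" in the kernel message relates to the register contents is not established in this thread.

```sh
#!/bin/sh
# Convert a raw Cache Line Size register value (hex digits, counted in
# 32-bit words) into bytes.
cls_to_bytes() {
    echo $(( 0x$1 * 4 ))
}

# Example on real hardware (needs pciutils, usually root):
#   cls_to_bytes "$(setpci -s 00:1a.0 CACHE_LINE_SIZE)"
```

Here 00:1a.0 is one of the ehci-pci devices named in the log lines quoted above.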
Comment 19 Niemand 2016-11-07 16:54:23 UTC
> It looks like all the pci devices in your system reported that they only
> support cache line size of 4 bytes, which is read from pci config space.
> The io space might already be broken.

Hello Yu (Chen),

I need clarification on your statement (since your statement is very vague). Namely, the following questions to be answered?

[1] Did you ever experience that pci devices on INTEL platform reported that they only support cache line size of 4 bytes? If YES, I need very hard/real proof of that! Hard evidences! Net pointers... Use cases, presented. You name it!
[2] The statement: "The IO space might already be broken" needs further explanation. I'll give you the examples:
    [A] Broken by factory (silicon invalidated/useless) assembly/process?
    [B] Broken by in operational use (platform use)?
    [C] Broken by FW/SW use?
    [D] You name it!

Are you sure you have enough expertise about this problem??? Please, give me some assurances?!

Thank you,
_nobody_
Comment 20 paviluf 2016-11-07 18:24:31 UTC
(In reply to Chen Yu from comment #18)
> The io space might already be broken.

What do you mean ? My computer has worked very well on Linux for years, except for this problem.
Comment 21 paviluf 2016-11-18 17:25:54 UTC
Any news ?
Comment 22 paviluf 2016-12-11 06:30:26 UTC
It will be great if we can find a fix. Windows works very well for example.
I hope someone is interested in the challenge.
Comment 23 Niemand 2016-12-11 11:55:43 UTC
paviluf,

I have proposal to you. Could you, please, upgrade your system to Fedora 25? And see if this solves your problem?

As I remember, you are with Fedora 24 (I again went through logs to confirm that).

Maybe this will solve your problem, but I doubt it.

Thank you,
_nobody_
Comment 24 paviluf 2016-12-12 05:48:35 UTC
Thanks for the proposal but, as I said in my first post, I already tried with Fedora 25 live and it was the same :(
Comment 25 Niemand 2016-12-12 15:36:40 UTC
This is an excellent info (sorry I am forgetting the details), do NOT forget, that this Bugzilla has limited lifespan, and within 6 months from today will be closed without the proper solution (when Fedora 24 gets outdated).

And you want to keep it alive (I want to keep it alive, since it is interesting use case). And my part:
Cache line size of 4 is not supported
https://bugzilla.redhat.com/show_bug.cgi?id=1390298

I WILL (for 100%) keep alive, for sure! I suggest, you keep this one alive as well. :-)

_nobody_
Comment 26 paviluf 2017-01-07 16:17:41 UTC
Thank you Niemand !

So, developers, is there anything I can do to help make this work ? I remind you that it works on Windows.

Thanks !
Comment 27 paviluf 2017-03-18 06:03:37 UTC
I will sell this computer soon, so if you want to fix this bug it's now.
Thanks.
Comment 28 Niemand 2017-03-18 08:17:02 UTC
Jeremy,

I had lot of issues with INTEL (past decades). They (INTEL) are ignorant and arrogant. They are not going to help you. Not at all...

Their managers are getting $200K+ USD/per year for NOTHING (for bare bull shit). INTEL First Level Managers (INTEL FLM) do not care. And INTEL OTC, as you are one of the kind, will not care.

You'll lose a couple of hundred $$ USD. INTEL with me, proud to announce, lost already at least $10M USD. For Good. ;-)

They'll lose more. Much more. I commit! :thumb up:

_nobody_
Comment 29 Chen Yu 2017-03-19 08:58:37 UTC
Paviluf,
please let me confirm:
1. Are you testing on the latest vanilla kernel (Linus' tree at kernel.org)? The kernel bugzilla mainly focuses on that upstream version.
2. When the system hangs during resume, is it possible to 'ping' it? (You can try either usb-mac-switcher or just your native mac interface.)
3. There isn't much information we can get from attachment 242041 [details];
   we should add "no_console_suspend ignore_loglevel" to get more information.
4. Since pm_test works well according to Comment 6, I suspect this might either be related to the BIOS, or the system just cannot be woken up.
5. Do not touch /sys/power/pm_trace for now (it might scribble on the RTC).

So please use the following steps to test:

1. boot the system with
"no_console_suspend ignore_loglevel"
2. make sure your network is OK
3. provide the output of /proc/acpi/wakeup before suspending
4. test:  rtcwake -m freeze -s 30
5. if step 4 returns successfully after 30 seconds, provide the dmesg log
6. continue to test: rtcwake -m mem -s 30
   if it does not wake up after 30s, please try to ping it, or
   reboot the system and provide the journalctl
7. if step 6 fails, please test with your graphics driver disabled (you mentioned that during resume the desktop is there but unresponsive?)
Comment 30 paviluf 2017-04-09 08:24:10 UTC
Hello Chen,

Thank you for looking at this ! Unfortunately I haven't had time, but as soon as I can I will provide the info needed.

Thanks !
Comment 31 Zhang Rui 2017-04-10 02:10:42 UTC
(In reply to paviluf from comment #30)
> Hello Chen,
> 
> Thank you for looking at this ! Unfortunately I didn't had time but has soon
> as I can I will give the infos needed.
> 
thanks.
PS: usually, we will close the bug report if we don't have a valid response from the reporter for longer than a month, so that we can focus on the active ones. In this case, please feel free to reopen it at any time, once the reporter can provide the information required.
Comment 32 paviluf 2017-06-18 11:55:25 UTC
Created attachment 257063 [details]
dmesg
Comment 33 paviluf 2017-06-18 11:55:55 UTC
Created attachment 257065 [details]
journalctl
Comment 34 paviluf 2017-06-18 12:11:00 UTC
(In reply to Chen Yu from comment #29)
> 1. Are you testing on latest vanilla kernel?( which is of Linus tree at
> kernel.org), as bugzilla kernel mainly focus on that upstream version.

I test with the kernel provided by the distros I have tested.

> 2. When the system hangs during resume, is it possible to 'ping' it?(you can
> try either usb-mac-switcher or just your native mac interface) 

I don't know how to do that.

> So please use the following step to test:
> 
> 1. boot up the system with
> "no_console_suspend ignore_loglevel"

Done.

> 2. make sure your network is OK
> 3. provide the output of /proc/acpi/wakeup before suspended 

cat /proc/acpi/wakeup
Device	S-state	  Status   Sysfs node
PCI0	  S5	*disabled  no-bus:pci0000:00
PEX0	  S5	*disabled  pci:0000:00:1c.0
PEX1	  S5	*disabled  pci:0000:00:1c.1
PEX2	  S5	*disabled  pci:0000:00:1c.2
PEX3	  S5	*disabled  pci:0000:00:1c.3
PEX4	  S5	*disabled
PEX5	  S5	*disabled
PEX6	  S5	*disabled
PEX7	  S5	*disabled
HUB0	  S5	*disabled
UAR1	  S3	*disabled  pnp:00:02
USBE	  S3	*enabled   pci:0000:00:1d.0
USE2	  S3	*enabled   pci:0000:00:1a.0
AZAL	  S5	*disabled  pci:0000:00:1b.0

> 4. test:  rtcwake -m freeze -s 30

That works well.

> 5. if step 4 succeed to return after 30 seconds, provide the dmesg log

attachment 257063 [details]

> 6. continue to test: rtcwake -m mem -s 30

That doesn't work, even on the first try. The system suspends; after 30s it starts to wake up, but the motherboard starts to beep continuously. According to the manual, that corresponds to a power problem (remember that suspend / wake-up works perfectly well on Windows).

>    if it does not woken up after 30s, please try to ping it, or
>    reboot the system and provide the journalctl

attachment 257065 [details] (journalctl after a reboot)

> 7. if step 6 failed, please  test with your graphic driver disabled(you
> mentioned during resume the desktop is there but no response?)

Not sure what to do.

Thanks !
Comment 35 Zhang Rui 2017-06-19 03:38:48 UTC
(In reply to paviluf from comment #34)
> (In reply to Chen Yu from comment #29)
> > 1. Are you testing on latest vanilla kernel?( which is of Linus tree at
> > kernel.org), as bugzilla kernel mainly focus on that upstream version.
> 
> I test with the kernel provided by the distros I have tested.

then please either use the latest Linus git tree, or download the latest vanilla kernel source from kernel.org, rebuild your kernel, and see if the problem still exists.

> 
> > 2. When the system hangs during resume, is it possible to 'ping' it?(you
> can
> > try either usb-mac-switcher or just your native mac interface) 
> 
> I don't know how to do that.
> 
> > So please use the following step to test:
> > 
> > 1. boot up the system with
> > "no_console_suspend ignore_loglevel"
> 
> Done.
> 
> > 2. make sure your network is OK
> > 3. provide the output of /proc/acpi/wakeup before suspended 
> 
> cat /proc/acpi/wakeup
> Device        S-state   Status   Sysfs node
> PCI0    S5    *disabled  no-bus:pci0000:00
> PEX0    S5    *disabled  pci:0000:00:1c.0
> PEX1    S5    *disabled  pci:0000:00:1c.1
> PEX2    S5    *disabled  pci:0000:00:1c.2
> PEX3    S5    *disabled  pci:0000:00:1c.3
> PEX4    S5    *disabled
> PEX5    S5    *disabled
> PEX6    S5    *disabled
> PEX7    S5    *disabled
> HUB0    S5    *disabled
> UAR1    S3    *disabled  pnp:00:02
> USBE    S3    *enabled   pci:0000:00:1d.0
> USE2    S3    *enabled   pci:0000:00:1a.0
> AZAL    S5    *disabled  pci:0000:00:1b.0
> 
> > 4. test:  rtcwake -m freeze -s 30
> 
> That works well.
> 
> > 5. if step 4 succeed to return after 30 seconds, provide the dmesg log
> 
> attachment 257063 [details]
> 
> > 6. continue to test: rtcwake -m mem -s 30
> 
> That doesn't work on first try. The system suspend, after 30s it start to
> wake up but the motherboard start to beep continuously. According to the
> manual that correspond to a power problem (remember that suspend / wake up
> work perfectly well on Windows).
> 
> >    if it does not woken up after 30s, please try to ping it, or
> >    reboot the system and provide the journalctl
> 
> attachment 257065 [details] (journalctl after a reboot)
> 
> > 7. if step 6 failed, please  test with your graphic driver disabled(you
> > mentioned during resume the desktop is there but no response?)
> 
> Not sure what to do.
> 
a simple way is to make sure the i915 driver is built as a module, and then rename this file: /lib/modules/your_kernel_name/kernel/drivers/gpu/drm/i915/i915.ko
Comment 36 Zhang Rui 2017-07-01 05:20:36 UTC
ping...
Comment 37 paviluf 2017-07-01 07:58:29 UTC
Don't worry, I won't forget this :) I just had little time and I don't know how to build a kernel. Is it ok if I use the linux-mainline package from arch aur ?

> a simple way is to make sure the i915 driver is build as module, and then
> rename this file /lib/modules/your_kernel_name/kernel/drivers/gpu/drm 
> /i915/i915.ko

I have an Nvidia card (no video output on the motherboard; I can test with an AMD one), so I don't think that's the right way.
Comment 38 Chen Yu 2017-09-04 07:25:09 UTC
(In reply to paviluf from comment #37)
> Don't worry, I won't forget this :) I just had little time and I don't know
> how to build a kernel. Is it ok if I use the linux-mainline package from
> arch aur ?
> 
> > a simple way is to make sure the i915 driver is build as module, and then
> > rename this file /lib/modules/your_kernel_name/kernel/drivers/gpu/drm 
> > /i915/i915.ko
> 
> I have a Nvidia card (no video output on motherboard - can test with an Amd
> one) so I don't think it's the right way.

Ok, then please blacklist your nvidia driver and test with the latest upstream kernel. I'm not sure if the package from Arch would be a pure kernel version without any other modification, but I think it would also be easy to install a new kernel from scratch from source code.
1. download the source code from https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.13.tar.xz
2. copy the kernel config at /boot/config-xxxx to your source code root dir, and rename it to .config (as a dot file it is then hidden from a plain ls)
3. run 'make menuconfig' in your source code dir, and disable CONFIG_DEBUG_INFO by unchecking the option at:
Kernel hacking  --->
Compile-time checks and compiler options  --->
[ ] Compile the kernel with debug info
4. make all -j4; make modules_install; make install
5. after it has finished, reboot into the new kernel.

first, test the normal suspend to mem by
rtcwake -m mem -s 30


If it fails, test
echo core > /sys/power/pm_test
echo deep > /sys/power/mem_sleep
echo mem > /sys/power/state
and wait for 5 seconds
Comment 39 Zhang Rui 2017-10-10 08:13:05 UTC
please attach the kernel config file.

please try to echo the USB roothub wakeup via sysfs and attach the dmesg after a successful suspend/resume.
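
One possible reading of the sysfs request is toggling the power/wakeup attribute of the USB root hubs (usb1, usb2, ...). The sketch below assumes that reading; the function name set_roothub_wakeup and the USB_DEVICES override are hypothetical, and whether "enabled" or "disabled" is the value wanted here is not stated in this comment.

```sh
#!/bin/sh
# Set the wakeup policy of every USB root hub (usb1, usb2, ...).
# USB_DEVICES defaults to the real sysfs directory; overriding it is a
# hypothetical convenience for dry runs against ordinary files.
USB_DEVICES="${USB_DEVICES:-/sys/bus/usb/devices}"

set_roothub_wakeup() (
    value="$1"    # "enabled" or "disabled"
    for f in "$USB_DEVICES"/usb[0-9]*/power/wakeup; do
        # Skip the unexpanded pattern when no root hub is present.
        [ -e "$f" ] || continue
        echo "$value" > "$f" && echo "$f -> $value"
    done
)
```

Run as root, `set_roothub_wakeup disabled` followed by a suspend/resume cycle and `dmesg > dmesg.txt` would give a log to attach.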
Comment 40 Chen Yu 2017-10-23 02:28:24 UTC
ping ..
Comment 41 paviluf 2017-10-23 11:14:56 UTC
Sorry, I haven't found the time to provide the needed info. I will try ASAP.
Thank you.
Comment 42 Chen Yu 2017-12-11 08:14:10 UTC
Let me close this for now. Please feel free to reopen if this issue can be reproduced on latest vanilla kernel.
Comment 43 paviluf 2018-01-14 15:16:49 UTC
Hello,

I finally found some time ! So I installed Ubuntu 17.10 to start fresh. 

- I tested "rtcwake -m mem -s 30" and it failed on 4th attempt.
- I built kernel 4.13.16 as you indicate, rebooted on it and tested "rtcwake -m mem -s 30" and it failed on 4th attempt too.
- I tried 
 echo core > /sys/power/pm_test 
 echo deep > /sys/power/mem_sleep
 echo mem > /sys/power/state
 and that worked.

I don't know how to echo the USB roothub wakeup via sysfs, but here is the dmesg after a successful suspend/resume, along with the kernel config file.
Comment 44 paviluf 2018-01-14 15:17:34 UTC
Created attachment 273595 [details]
dmesg after a succeed suspend/resume
Comment 45 paviluf 2018-01-14 15:18:09 UTC
Created attachment 273597 [details]
kernel config file
Comment 46 Chen Yu 2018-01-15 01:09:44 UTC
(In reply to paviluf from comment #43)
> Hello,
> 
> I finally found some time ! So I installed Ubuntu 17.10 to start fresh. 
> 
> - I tested "rtcwake -m mem -s 30" and it failed on 4th attempt.
> - I built kernel 4.13.16 as you indicate, rebooted on it and tested "rtcwake
> -m mem -s 30" and it failed on 4th attempt too.
> - I tried 
>  echo core > /sys/power/pm_test 
>  echo deep > /sys/power/mem_sleep
>  echo mem > /sys/power/state
>  and that worked.
> 
> I don't know how to echo the USB roothub wakeup via sysfs but I here is
> dmesg after a succeed suspend/resume and the kernel config file.
There's no obvious error/warning in the log, and since the test mode 'core' works well, let's bypass the BIOS and test suspend-to-freeze and see what happens:
1. append "no_console_suspend ignore_loglevel" to the kernel command line in grub
2. echo freeze > /sys/power/state
3. press any key on the keyboard (or the power button) to wake the system up
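
On kernels that expose /sys/power/mem_sleep (the file used in Comment 38), the mode shown in brackets there is the one that "echo mem" will use, so it may be worth recording it alongside the freeze result. A minimal sketch; the function name active_mem_sleep is hypothetical.

```sh
#!/bin/sh
# Print the active mode from a mem_sleep-style line such as "s2idle [deep]".
# With no argument, the real sysfs file is read.
active_mem_sleep() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/mem_sleep}"
}
```

If this prints s2idle rather than deep, a plain "echo mem" would not be exercising suspend to RAM at all.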
Comment 47 paviluf 2018-01-15 09:34:38 UTC
That works.
Comment 48 paviluf 2018-01-18 08:25:52 UTC
Do you need more infos ?
Comment 49 Chen Yu 2018-01-19 06:16:24 UTC
(In reply to paviluf from comment #48)
> Do you need more infos ?

Hmm, I don't have much of a clue now.
1. Please check whether you have a serial port on your desktop. If so, we can root cause this issue
    within 1 day :-)
2. As you described in Comment 43, you have tried all the pm_test modes; could you confirm that you have tested each pm_test mode at least 4 times (because you said rtcwake fails on the 4th try)? Be aware, you should echo none > /sys/power/pm_test before trying rtcwake.

3. In most cases the graphics drivers might be the offenders, so please blacklist them (nouveau or the native nvidia driver) and try again.

always provide the dmesg after each test, with no_console_suspend ignore_loglevel appended in the grub file
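
To double-check that a blacklist actually took effect before a test run, something like the following could be used. It is only a sketch: the function name loaded_modules and the MODULES_FILE override are hypothetical, and it simply looks for the module names at the start of each /proc/modules line.

```sh
#!/bin/sh
# Print each named module that is still loaded, reading a
# /proc/modules-style file (one module per line, name first).
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

loaded_modules() {
    for m in "$@"; do
        if grep -q "^$m " "$MODULES_FILE"; then
            echo "$m"
        fi
    done
}
```

For example, `loaded_modules nouveau nvidia` printing nothing means neither driver is loaded.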
Comment 50 Chen Yu 2018-01-19 06:18:16 UTC
(In reply to Chen Yu from comment #49)
> (In reply to paviluf from comment #48)
> > Do you need more infos ?
> 
> Humm, I don't have much clue now
> 1. Please check if you have serial port on your desktop? If this, we can
> root cause this issue
>     within 1 day :-)
> 2. As you described in Comment 43, you have tried all the pm_test mode,
> could you confirm if you have tested each pm_test mode for at least 4
> times?(because you said the rtcwake will failed at 4th try) Be aware, you
> should echo none > /sys/power/pm_test because trying rtcwake.
> 
> 3. In most cases the graphic drivers might be the offenders, so please
> blacklist them nouveau/or nvidia native driver and try again. 
> 
You might get black screen after resumed, please check if your keyboard is working by pressing Num Lock/Caps Lock, also the PS/2 keyboard is preferred to USB keyboard.
> always provide the dmesg after each test, with no_console_suspend
> ignore_loglevel appended in grub file
Comment 51 paviluf 2018-01-21 14:14:37 UTC
Created attachment 273775 [details]
echo freeze > /sys/power/state
Comment 52 paviluf 2018-01-21 14:15:50 UTC
Created attachment 273777 [details]
echo mem > /sys/power/state
Comment 53 paviluf 2018-01-21 14:16:12 UTC
Created attachment 273779 [details]
echo freeze > /sys/power/state
Comment 54 paviluf 2018-01-21 14:26:10 UTC
> Humm, I don't have much clue now
> 1. Please check if you have serial port on your desktop? If this, we can
> root cause this issue within 1 day :-)

I don't have a serial port.

> 2. As you described in Comment 43, you have tried all the pm_test mode,
> could you confirm if you have tested each pm_test mode for at least 4
> times?(because you said the rtcwake will failed at 4th try) Be aware, you
> should echo none > /sys/power/pm_test because trying rtcwake.
> 3. In most cases the graphic drivers might be the offenders, so please
> blacklist them nouveau/or nvidia native driver and try again. 

So I blacklisted nouveau and nvidia drivers. I did "echo none > /sys/power/pm_test". I again did :

 echo core > /sys/power/pm_test 
 echo deep > /sys/power/mem_sleep
 echo mem > /sys/power/state

5 times and that worked.

I also tried "echo freeze > /sys/power/state" 5 times and that worked.

> You might get black screen after resumed, please check if your keyboard is
> working by pressing Num Lock/Caps Lock, also the PS/2 keyboard is preferred
> to USB keyboard.

I don't have PS/2 keyboard.

I tried "rtcwake -m mem -s 30" again, but now it fails on the first try: suspend is OK, but on wake-up I get a black screen and the keyboard doesn't work.

> always provide the dmesg after each test, with no_console_suspend
> ignore_loglevel appended in grub file

I provided them
Comment 55 paviluf 2018-01-21 14:29:01 UTC
> 1. Please check if you have serial port on your desktop? If this, we can
> root cause this issue within 1 day :-)

Well I have one internal serial port. Don't know if I can use it but if it's possible what should I do ?
Comment 56 Chen Yu 2018-01-22 03:33:32 UTC
(In reply to paviluf from comment #55)
> > 1. Please check if you have serial port on your desktop? If this, we can
> > root cause this issue within 1 day :-)
> 
> Well I have one internal serial port. Don't know if I can use it but if it's
> possible what should I do ?

OK, according to Comment 54, do you mean:
1. with the nouveau and nvidia drivers blacklisted,
Q: can you confirm with lsmod that nouveau/nvidia/i915 have been blacklisted?
2. based on 1, with 'core' pm_test mode, all 5 suspend-to-mem attempts work.
3. based on 1, with pm_test set to none, rtcwake -m mem -s 30 gives a black screen and no
    response after resume.
4. Have you tested: based on 1, with pm_test set to none, echo mem > /sys/power/state and wake up via the power button or another key?
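
A quick way to answer the lsmod question is to filter the loaded-module list for the graphics drivers. This is a minimal sketch; the function name loaded_gpu_modules is made up for illustration, and it assumes the usual /proc/modules layout (module name in the first field), which is what lsmod reads.

```shell
#!/bin/sh
# loaded_gpu_modules [FILE]
# Print every loaded module named nouveau, i915 or nvidia* (nvidia_drm etc.).
# FILE defaults to /proc/modules; an empty result means none of them is loaded.
loaded_gpu_modules() {
    modules_file="${1:-/proc/modules}"
    awk '$1 ~ /^(nouveau|i915|nvidia.*)$/ { print $1 }' "$modules_file"
}
```

If "loaded_gpu_modules" prints nothing, the blacklist took effect; any name it prints is a driver that is still loaded.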


OK, you have a serial port, so please append 'no_console_suspend ignore_loglevel console=ttyS0,115200 console=tty' to the kernel boot command line and try to capture the output from it. Then try step 3 above and record the serial port log; let's find out at which stage it fails.
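
Before running the test it may be worth checking that the parameters actually reached the running kernel. The following is just a sketch under that assumption; missing_cmdline_params is a made-up helper name, and it only compares whole words in /proc/cmdline against the parameters requested above.

```shell
#!/bin/sh
# missing_cmdline_params [FILE]
# Print each requested debug parameter that is NOT on the kernel command line.
# FILE defaults to /proc/cmdline; no output means everything is present.
missing_cmdline_params() {
    cmdline_file="${1:-/proc/cmdline}"
    for want in no_console_suspend ignore_loglevel console=ttyS0,115200; do
        found=no
        for have in $(cat "$cmdline_file"); do
            [ "$have" = "$want" ] && found=yes
        done
        [ "$found" = yes ] || echo "$want"
    done
}
```

Run "missing_cmdline_params" after rebooting; if it lists anything, the grub change did not take effect (for example because the grub configuration was not regenerated).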
Comment 57 paviluf 2018-01-22 20:31:13 UTC
(In reply to Chen Yu from comment #56)
> OK, according to Comment 54, do you mean:
> 1. with the nouveau and nvidia drivers blacklisted,
> Q: can you confirm with lsmod that nouveau/nvidia/i915 have been blacklisted?

I blacklisted nouveau/nvidia and will provide lsmod. Should I also blacklist i915 even if it's not used (there is no video output on the motherboard)?

> 2. based on 1, with 'core' pm_test mode, all 5 suspend-to-mem attempts work.

Yes

> 3. based on 1, with pm_test set to none, rtcwake -m mem -s 30 gives a black
> screen and no response after resume.

Yes

> 4. Have you tested: based on 1, with pm_test set to none, echo mem >
> /sys/power/state and wake up via the power button or another key?

I get a black screen and no response after resume there too.

> OK, you have a serial port, so please append 'no_console_suspend
> ignore_loglevel console=ttyS0,115200 console=tty' to the kernel boot command
> line and try to capture the output from it. Then try step 3 above and record
> the serial port log; let's find out at which stage it fails.

Do you have any information on how to do that? I don't know if I will be able to use that internal port, since it's just a pin header (not a proper serial port).
Comment 58 paviluf 2018-01-22 20:32:03 UTC
Created attachment 273795 [details]
lsmod
Comment 59 paviluf 2018-01-30 10:34:25 UTC
(In reply to Chen Yu from comment #56)
> OK, you have a serial port, so please append 'no_console_suspend
> ignore_loglevel console=ttyS0,115200 console=tty' to the kernel boot command
> line and try to capture the output from it. Then try step 3 above and record
> the serial port log; let's find out at which stage it fails.

Well, I spent a whole day trying to capture the kernel output, and it didn't work at all. I have already spent far too much time on this. If we don't quickly find out why it isn't working when it works on Windows, I will drop Linux and only use Windows, or buy a Mac. I'm so sick of there always being something that doesn't work on GNU/Linux...
Comment 60 Chen Yu 2018-01-30 14:17:33 UTC
OK, no need to enable the serial port output if it is too hard. Please bear in mind that if you want to benefit from the flexibility and high performance of Linux, you also take on the risk of using it. Actually, a workaround is to use suspend to freeze, which comes close to what you need.
Reading Comment 5 again, you mentioned that you were using pm_trace, which might be a clue. Do you have a 'reset' button on your desktop, I mean, one that does a hot reboot without powering off? This is important: we can use it for blind debugging.

If yes, please disable all the graphics drivers, then run the pm_trace test and provide the dmesg after a hot reboot; let's check whether the offender is atkbd.

And, most importantly, please always try the latest upstream kernel from
https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.15.tar.xz
Comment 61 paviluf 2018-01-30 23:01:59 UTC
Thank you for your help, and sorry, but this is so annoying and time-consuming!
So, I have a 'reset' button on my desktop. I built kernel 4.15, disabled all the graphics drivers (nvidia, nouveau, i915) and ran:

sync
echo 1 > /sys/power/pm_trace
echo mem > /sys/power/state

That fails on the first try (it can't wake up: black screen and unresponsive keyboard).

Here is the output of "cat /sys/power/pm_trace_dev_match":

bdi

I will attach the dmesg taken after a hot reboot.
Comment 62 paviluf 2018-01-30 23:02:59 UTC
Created attachment 273937 [details]
dmesg after hot reboot
Comment 63 paviluf 2018-02-06 17:50:07 UTC
I'm done. I managed to get a motherboard, processor and RAM for free. The hardware is older and less powerful, but it works. I'm really disappointed.
Comment 64 paviluf 2018-02-19 10:32:16 UTC
Well, the replacement motherboard died... I'm back to the motherboard that has the suspend problem...
Comment 65 Chen Yu 2018-02-19 13:33:26 UTC
OK, let's continue, and stick to 4.15.
[    1.257020]   Magic number: 0:320:41
[    1.257094] bdi 7:6: hash matches
This suggests there might be something wrong with the block device driver, and
this might be a new problem...
And sorry, I forgot to mention: before doing any test, please
1. remove
    'quiet splash vt.handoff=7' from your grub command line
2.   echo 0 > /sys/power/pm_async
    to disable async suspend so that we can confirm the offender.

Please provide dmesg and cat /sys/power/pm_trace_dev_match again.
If it is still bdi, let me think about adding a hack debug patch based on pm_trace.
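
To pull the relevant pm_trace lines out of a dmesg taken after the hot reboot, something like the following could be used. It is only a sketch: pm_trace_hits is a made-up name, and it simply greps for the two kinds of line quoted above ("Magic number:" and "hash matches").

```shell
#!/bin/sh
# pm_trace_hits [FILE]
# Print the pm_trace lines from a saved dmesg log, or from the live
# kernel log when no FILE is given.
pm_trace_hits() {
    if [ -n "${1:-}" ]; then
        grep -E 'Magic number:|hash matches' "$1"
    else
        dmesg | grep -E 'Magic number:|hash matches'
    fi
}
```

For example, "pm_trace_hits dmesg.txt" on a saved log, together with "cat /sys/power/pm_trace_dev_match", gives the two pieces of information requested here.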
Comment 66 paviluf 2018-02-20 12:04:54 UTC
I tried KDE neon, so for now I'm running the tests on a clean install (no blacklists or anything else) with kernel 4.13, which comes with the distro.

So I removed "quiet splash" from the grub command line, added "no_console_suspend ignore_loglevel" to it, and updated grub. I ran "echo 0 > /sys/power/pm_async". I rebooted and ran:

sync
echo 1 > /sys/power/pm_trace
echo mem > /sys/power/state

That fails on the first try.

"cat /sys/power/pm_trace_dev_match" output:

acpi

So I blacklisted acpi, rebuilt the initramfs and retried.

pm_trace_dev_match output:

block
rtc_cmos

So I blacklisted block and rtc_cmos, rebuilt the initramfs and retried.

The pm_trace_dev_match output is sometimes

bdi

and sometimes

block
rtc_cmos
Comment 67 paviluf 2018-02-20 12:05:22 UTC
Created attachment 274283 [details]
hot reboot dmesg
Comment 68 paviluf 2018-02-20 12:08:06 UTC
Created attachment 274285 [details]
dmesg excerpt on fail (pc frozen)
Comment 69 paviluf 2018-03-28 16:34:37 UTC
I use Windows 10 on this computer, so I'm closing this...
Thanks.