Bug 196907 - [Regression] s2idle does not work with PC300 NVMe SK hynix 512GB - Dell XPS 13 9360
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: All Linux
Importance: P1 normal
Assignee: Rafael J. Wysocki
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-11 17:04 UTC by Paul Menzel
Modified: 2019-12-25 17:34 UTC
CC List: 12 users

See Also:
Kernel Version: 4.13
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Linux configuration (208.44 KB, text/plain)
2017-10-02 12:19 UTC, Paul Menzel
Details
XPS13 9360 config (195.13 KB, text/plain)
2017-10-02 12:51 UTC, Rafael J. Wysocki
Details
XPS13 9360 kernel config (based on the one from Paul) (208.44 KB, text/plain)
2017-10-02 21:43 UTC, Rafael J. Wysocki
Details
Linux messages for ACPI S3 (deep) (385.51 KB, text/plain)
2017-10-04 15:48 UTC, Paul Menzel
Details
full dmesg output from working s2idle after 4.14-rc3 + xorg-hwe package (70.56 KB, text/plain)
2017-10-06 03:16 UTC, Mario Limonciello
Details
packages installed on Ubuntu (315.95 KB, text/plain)
2017-10-06 03:16 UTC, Mario Limonciello
Details
Output of `dmidecode` (28.69 KB, text/plain)
2017-10-06 11:47 UTC, Paul Menzel
Details
Difference in Linux messages from non-working vs. working Note, the non-working example is captured from *deep*, as *s2idle* does not work. (65.00 KB, text/plain)
2017-10-06 12:07 UTC, Paul Menzel
Details
Mario Downgraded to 1.3.5, installed microcode, plugged in WD15 (79.45 KB, text/plain)
2017-10-06 13:00 UTC, Mario Limonciello
Details
Linux 4.11-rc1 messages (dmesg) from two suspends (s2idle) (65.42 KB, text/plain)
2017-10-09 16:29 UTC, Paul Menzel
Details
ACPI / PM: Prevent some Dells XPS13 9360 from using Low Power S0 Idle (1.94 KB, patch)
2017-10-24 13:56 UTC, Rafael J. Wysocki
Details | Diff
Picture of Linux messages with Linux 4.14-rc6 built from Ubuntu (1019.61 KB, image/jpeg)
2017-10-25 12:25 UTC, Paul Menzel
Details
Picture of Linux messages with Linux 4.14-rc6 built from Ubuntu [2/2] (1019.61 KB, image/jpeg)
2017-10-25 12:26 UTC, Paul Menzel
Details
ACPI / PM: Prevent some Dells XPS13 9360 from using Low Power S0 Idle (2.01 KB, patch)
2017-10-25 14:33 UTC, Rafael J. Wysocki
Details | Diff
ACPI / PM: Prevent Dell XPS13 9360 from using Low Power S0 Idle (1.97 KB, patch)
2017-10-26 08:57 UTC, Rafael J. Wysocki
Details | Diff
Linux 4.14-rc7+ with patch messages (dmesg) with one suspend (deep) (73.70 KB, text/plain)
2017-11-03 15:58 UTC, Paul Menzel
Details
attachment-21170-0.html (1.55 KB, text/html)
2018-01-20 17:41 UTC, Keith Busch
Details
attachment-13309-0.html (1.62 KB, text/html)
2019-07-11 02:26 UTC, Keith Busch
Details

Description Paul Menzel 2017-09-11 17:04:16 UTC
With commit f007cad (Revert "firmware: add sanity check on shutdown/suspend") from Linus’ master branch, resuming the Dell XPS 13 9360 from suspend does not work.

For whatever reason *s2idle* seems to be the default again. The same problem was reported in the past [1], and therefore the default was changed back to *deep*. So I wonder whether this was tested thoroughly.

Please revert back to *deep* by default until *s2idle* has been really tested.

Bug 192591 (Suspend to idle & ram issues on Dell XPS 13 9365) [2] seems to be a similar issue, and deals with *s2idle* versus *deep* too.

[1] https://lkml.org/lkml/2017/1/17/609
[2] https://bugzilla.kernel.org/show_bug.cgi?id=192591
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-10-01 09:27:48 UTC
What's the status of this? FWIW: My XPS13 (9360) with rc2+ uses deep by default and s2idle seems to work as well; but yes, I saw a few hiccups during the early devel phase of 4.14, too.
Comment 2 Paul Menzel 2017-10-02 12:02:22 UTC
(In reply to Thorsten Leemhuis from comment #1)
> What's the status of this? FWIW: My XPS13 (9360) with rc2+ uses deep by
> default

Not for me. With v4.14-rc2-165-g770b782 it is s2idle by default.

```
commit e870c6c87cf9484090d28f2a68aa29e008960c93
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Mon Jul 31 23:43:18 2017 +0200

    ACPI / PM: Prefer suspend-to-idle over S3 on some systems
    
    Modify the ACPI system sleep support setup code to select
    suspend-to-idle as the default system sleep state if
    (1) the ACPI_FADT_LOW_POWER_S0 flag is set in the FADT and
    (2) the Low Power Idle S0 _DSM interface has been discovered and
    (3) the default sleep state was not selected from the kernel command
    line.
    
    The main motivation for this change is that systems where the (1) and
    (2) conditions are met typically ship with OSes that don't exercise
    the S3 path in the platform firmware which remains untested and turns
    out to be non-functional at least in some cases.
    
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Tested-by: Mario Limonciello <mario.limonciello@dell.com>
```
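For reference, the variant currently selected for `mem` suspend can be read from `/sys/power/mem_sleep`, where the active entry is shown in square brackets (for example `s2idle [deep]`). The following is only a minimal sketch, not part of the original report; the helper name `active_mem_sleep` is made up for illustration.

```shell
# Print the bracketed (currently selected) entry of a mem_sleep-style
# line read from stdin, e.g. "s2idle [deep]" -> "deep".
# Hypothetical helper name, for illustration only.
active_mem_sleep() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Only read sysfs when the file is present on this kernel.
if [ -r /sys/power/mem_sleep ]; then
    active_mem_sleep < /sys/power/mem_sleep
fi
```

Running this before and after changing the setting shows whether the kernel picked *s2idle* or *deep* on a given boot.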

Again, despite having asked this already, the system this was tested on is *not* noted. Mario, Rafael, could we please have this information in the commit message in the future?

> and s2idle seems to work as well; but yes, I saw a few hickups
> during the early devel phase of 4.14, too.

I tested it again, and contrary to your results, *s2idle* does *not* work. *deep* works. (And is slow as hell, as Dell refuses to fix the firmware as documented in #185611 [1].)

[1] https://bugzilla.kernel.org/show_bug.cgi?id=185611
Comment 3 Paul Menzel 2017-10-02 12:19:37 UTC
Created attachment 258695 [details]
Linux configuration
Comment 4 Rafael J. Wysocki 2017-10-02 12:25:51 UTC
(In reply to Paul Menzel from comment #3)
> Created attachment 258695 [details]
> Linux configuration

So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3 later today).

What exactly doesn't work for you in s2idle?
Comment 5 Rafael J. Wysocki 2017-10-02 12:47:54 UTC
BTW, on my 9360 "deep" works too (AFAICS) and it is not "slow as hell".
Comment 6 Rafael J. Wysocki 2017-10-02 12:51:35 UTC
Created attachment 258697 [details]
XPS13 9360 config

Here's my .config for the XPS13 9360, FWIW.
Comment 7 Mario Limonciello 2017-10-02 13:01:50 UTC
I also saw suspend regressions (with deep) early in 4.14 (sometime between rc1 and rc2) but I have not observed any other regressions since those were fixed.

I would also like to hear what "doesn't work" in s2idle for you, Paul.
Comment 8 Paul Menzel 2017-10-02 13:04:04 UTC
(In reply to Rafael J. Wysocki from comment #4)
> (In reply to Paul Menzel from comment #3)
> > Created attachment 258695 [details]
> > Linux configuration
> 
> So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3 later
> today).

Mine is with touch screen, but that shouldn’t matter. Thank you for attaching your configuration in comment 6.

> What exactly doesn't work for you in s2idle?

After pressing the power button to “resume”, the screen stays black. I then have to hold it for ten seconds to power off the system.

(In reply to Rafael J. Wysocki from comment #5)
> BTW, on my 9360 "deep" works too (AFAICS) and it is not "slow as hell".

Well, three to six seconds for resume is very slow compared to Google Chromebooks or Apple devices. Opening the screen, I want to continue working and not stare at a black screen.
Comment 9 Rafael J. Wysocki 2017-10-02 13:09:58 UTC
(In reply to Paul Menzel from comment #8)
> (In reply to Rafael J. Wysocki from comment #4)
> > (In reply to Paul Menzel from comment #3)
> > > Created attachment 258695 [details]
> > > Linux configuration
> > 
> > So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3 later
> > today).
> 
> Mine is with touch screen, but that shouldn’t matter. Thank you for
> attaching your configuration in comment 6.
> 
> > What exactly doesn't work for you in s2idle?
> 
> After pressing the power button to “resume”, the screen stays black. I have
> to hold it for then seconds then to power off the system.

Something seems to be missing in your configuration.

Please check if the intel_hid module is loaded, for one (that's needed for the power button wakeups to work).

> (In reply to Rafael J. Wysocki from comment #5)
> > BTW, on my 9360 "deep" works too (AFAICS) and it is not "slow as hell".
> 
> Well, three to six seconds for resume is very slow compared to Google
> Chromebooks or Apple devices. Opening the screen, I want to continue working
> and not stare on a black screen.

Mine resumes within a second or so from "deep".  Suspicious ...
Comment 10 Rafael J. Wysocki 2017-10-02 13:15:55 UTC
(In reply to Rafael J. Wysocki from comment #9)
> (In reply to Paul Menzel from comment #8)
> > (In reply to Rafael J. Wysocki from comment #4)
> > > (In reply to Paul Menzel from comment #3)
> > > > Created attachment 258695 [details]
> > > > Linux configuration
> > > 
> > > So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3
> later
> > > today).
> > 
> > Mine is with touch screen, but that shouldn’t matter. Thank you for
> > attaching your configuration in comment 6.
> > 
> > > What exactly doesn't work for you in s2idle?
> > 
> > After pressing the power button to “resume”, the screen stays black. I have
> > to hold it for then seconds then to power off the system.
> 
> Something seems to be missing in your configuration.
> 
> Please check if the intel_hid module is loaded, for one (that's needed for
> the power button wakeups to work).

Well, please check intel_vbtn too.  Both load here and I don't quite remember which one handles the events on this machine.
Comment 11 Rafael J. Wysocki 2017-10-02 13:19:29 UTC
BTW, you know that you can use the keyboard to wake up from s2idle?
Comment 12 Mario Limonciello 2017-10-02 13:24:19 UTC
> BTW, you know that you can use the keyboard to wake up from s2idle?

You need to configure it as a wakeup device though right?
Comment 13 Paul Menzel 2017-10-02 13:47:22 UTC
(In reply to Rafael J. Wysocki from comment #9)
> (In reply to Paul Menzel from comment #8)
> > (In reply to Rafael J. Wysocki from comment #4)
> > > (In reply to Paul Menzel from comment #3)
> > > > Created attachment 258695 [details]
> > > > Linux configuration
> > > 
> > > So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3
> later
> > > today).
> > 
> > Mine is with touch screen, but that shouldn’t matter. Thank you for
> > attaching your configuration in comment 6.
> > 
> > > What exactly doesn't work for you in s2idle?
> > 
> > After pressing the power button to “resume”, the screen stays black. I have
> > to hold it for then seconds then to power off the system.
> 
> Something seems to be missing in your configuration.

Comparing my configuration with yours, I do not see anything that is missing on my side. It’d be great if you built a Linux image with my configuration just to rule out any problems in that regard.

> Please check if the intel_hid module is loaded, for one (that's needed for
> the power button wakeups to work).

Well, the power button press is recognized, and the light in the power button goes on after pressing it. The screen stays black though.

The modules *intel_hid* and *intel_vbtn* are both loaded.

I tested it again. Now the screen comes back but nothing else works; that is, I see the status bar at the top, but the password dialog does not pop up, the mouse cursor is not there, and pressing keys does nothing. Using the function key to control the keyboard light works though.

(In reply to Mario Limonciello from comment #12)
> > BTW, you know that you can use the keyboard to wake up from s2idle?
> 
> You need to configure it as a wakeup device though right?

Right. According to `grep enabled /proc/acpi/wakeup` only *XHC*, *LID0* and *PBTN* are enabled.

> > (In reply to Rafael J. Wysocki from comment #5)
> > > BTW, on my 9360 "deep" works too (AFAICS) and it is not "slow as hell".
> > 
> > Well, three to six seconds for resume is very slow compared to Google
> > Chromebooks or Apple devices. Opening the screen, I want to continue
> working
> > and not stare on a black screen.
> 
> Mine resumes within a second or so from "deep".  Suspicious ...

So you mean, after one second you can enter your password to unlock the screen? What distribution do you use? I am on Ubuntu 16.04.3 LTS (Xenial Xerus).
Comment 14 Rafael J. Wysocki 2017-10-02 15:34:58 UTC
On Monday, October 2, 2017 3:47:22 PM CEST bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196907
> 
> --- Comment #13 from Paul Menzel (pmenzel+bugzilla.kernel.org@molgen.mpg.de)
> ---
> (In reply to Rafael J. Wysocki from comment #9)
> > (In reply to Paul Menzel from comment #8)
> > > (In reply to Rafael J. Wysocki from comment #4)
> > > > (In reply to Paul Menzel from comment #3)
> > > > > Created attachment 258695 [details]
> > > > > Linux configuration
> > > > 
> > > > So I have a 9360 here and it works for me as of -rc2 (I'll test -rc3
> > later
> > > > today).
> > > 
> > > Mine is with touch screen, but that shouldn’t matter. Thank you for
> > > attaching your configuration in comment 6.
> > > 
> > > > What exactly doesn't work for you in s2idle?
> > > 
> > > After pressing the power button to “resume”, the screen stays black. I
> have
> > > to hold it for then seconds then to power off the system.
> > 
> > Something seems to be missing in your configuration.
> 
> Comparing my configuration with mine, I do not see anything that is missing
> on
> my side. It’d be great if you built a Linux image with my configuration just
> to
> rule out any problems in that regard.
> 
> > Please check if the intel_hid module is loaded, for one (that's needed for
> > the power button wakeups to work).
> 
> Well, the power button press is recognized, and the light in the power button
> goes on after pressing it.

OK, so the wakeup actually works.

> The screen stays black though.
> 
> The modules *intel_hid* and *intel_vbtn* are both loaded.
> 
> I tested it again. Now the screen comes back but nothing else works, that
> means
> I see the status bar at the top, but the password dialog does not pop up, the
> mouse cursor is not there, and pressing the keyboard does nothing. Using the
> function key to control the keyboard light works though.

Well, that's unusual.

It looks like something crashes during resume, then.

Can you set /sys/power/pm_test to "devices", try to suspend and see what
happens?  [It will resume automatically if it works.]
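As a rough sketch of the suggested test: `/sys/power/pm_test` lists the available test levels with the active one in brackets (for example `[none] core processors platform devices freezer`), and the file is only present when the kernel was built with PM debugging support. The helper name `pm_test_supports` below is hypothetical and only illustrates checking that a level is offered before selecting it.

```shell
# Succeed if test level $1 appears in a pm_test-style listing on stdin,
# e.g. "[none] core processors platform devices freezer".
# Hypothetical helper name, for illustration only.
pm_test_supports() {
    tr -d '[]' | tr ' ' '\n' | grep -qxF -- "$1"
}

# Report whether the "devices" level can be selected on this kernel.
if [ -r /sys/power/pm_test ]; then
    if pm_test_supports devices < /sys/power/pm_test; then
        echo "pm_test level 'devices' is available"
    else
        echo "pm_test level 'devices' is not offered"
    fi
fi
```

If the level is available, it can be selected as root with `echo devices > /sys/power/pm_test` before suspending, and reset afterwards by writing `none`.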

> (In reply to Mario Limonciello from comment #12)
> > > BTW, you know that you can use the keyboard to wake up from s2idle?
> > 
> > You need to configure it as a wakeup device though right?
> 
> Right. According to `grep enabled /proc/acpi/wakeup` only *XHC*, *LID0* and
> *PBTN* are enabled.

/proc/acpi/wakeup has nothing to do with that.

You need to do

# echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup

(as root) for it to work.
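To see which devices currently have wakeup enabled through their `power/wakeup` sysfs attribute, something along these lines may help. It is only a sketch: the function name `list_wakeup_enabled` is made up, and the i8042 path is the one quoted above, which may differ on other machines.

```shell
# List devices below a sysfs directory whose power/wakeup attribute
# reads "enabled". Hypothetical helper name, for illustration only.
list_wakeup_enabled() {
    root=${1:-/sys/devices}
    find "$root" -path '*/power/wakeup' -type f 2>/dev/null |
    while IFS= read -r f; do
        if [ "$(cat "$f" 2>/dev/null)" = "enabled" ]; then
            # Print the device directory, without the attribute suffix.
            printf '%s\n' "${f%/power/wakeup}"
        fi
    done
}

# Check the keyboard controller mentioned above, if it exists here.
if [ -d /sys/devices/platform/i8042 ]; then
    list_wakeup_enabled /sys/devices/platform/i8042
fi
```

An empty result for the i8042 subtree would mean the keyboard is not yet configured as a wakeup source.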

> > > (In reply to Rafael J. Wysocki from comment #5)
> > > > BTW, on my 9360 "deep" works too (AFAICS) and it is not "slow as hell".
> > > 
> > > Well, three to six seconds for resume is very slow compared to Google
> > > Chromebooks or Apple devices. Opening the screen, I want to continue
> > working
> > > and not stare on a black screen.
> > 
> > Mine resumes within a second or so from "deep".  Suspicious ...
> 
> So you mean, after one second you can enter your password to unlock the
> screen?

Yes, after a second or so.

> What distribution do you use? I am on Ubuntu 16.04.3 LTS (Xenial Xerus).

I use openSUSE Leap 42.3 ATM, but there was some kind of Ubuntu installed on it
and it didn't have problems with s2idle or S3 either.
Comment 15 Thorsten Leemhuis 2017-10-02 15:51:44 UTC
/me puts his "regression tracker hat" down and takes his "just a user hat"

TWIMC: I have no idea why my machine (XPS13, 9360, without touch) uses S3 by default; I can try to investigate. I recently installed the latest firmware (Version: 2.2.1; Release Date: 08/18/2017), but I doubt that's the reason. I can enable S2I manually and it works. I will give it a shot to see how reliably it works. 

To add to the confusion another data point: S3 suddenly became unreliable again;  maybe it's the BIOS update, maybe 4.14 (4.13 seems to work reliably so far). And I have trouble with wifi as well (http://lists.infradead.org/pipermail/ath10k/2017-October/010189.html). #sigh
Comment 16 Paul Menzel 2017-10-02 16:02:14 UTC
(In reply to Thorsten Leemhuis from comment #15)
> /me puts his "regression tracker hat" down and takes his "just a user hat"
> 
> TWIMC: I have no idea why my machine (XPS13, 9360, without touch) uses S3 by
> default; I can try to investigate. I recently installed the latest firmware
> (Version: 2.2.1; Release Date: 08/18/2017), but I doubt that's the reason. I
> can enable S2I manually and it works. I will give it a shot to see how
> reliable it works.

I am still using 1.3.5 05/08/2017. I’ll test the update too. Thank you for the information.

> To add to the confusion another data point: S3 suddenly became unreliable
> again;  maybe it's the BIOS update, maybe 4.14 (4.13 seems to work reliable
> so far). And I have trouble with wifi as well
> (http://lists.infradead.org/pipermail/ath10k/2017-October/010189.html). #sigh

Please add me to the CC of that discussion. I am also having wireless trouble: NetworkManager is unable to connect, but connecting manually with `wpa_supplicant` works. I am also seeing the messages below.

```
ath10k_pci 0000:3a:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:3a:00.0.bin failed with error -2
ath10k_pci 0000:3a:00.0: Direct firmware load for ath10k/cal-pci-0000:3a:00.0.bin failed with error -2
```

I do not see `failed to extract amsdu: -11`, but that might only be because I didn’t use the wireless that extensively.
Comment 17 Paul Menzel 2017-10-02 16:16:23 UTC
(In reply to Rafael J. Wysocki from comment #14)
> On Monday, October 2, 2017 3:47:22 PM CEST
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=196907
> > 
> > --- Comment #13 from Paul Menzel
> (pmenzel+bugzilla.kernel.org@molgen.mpg.de)
> > ---
> > (In reply to Rafael J. Wysocki from comment #9)
> > > (In reply to Paul Menzel from comment #8)
> > > > (In reply to Rafael J. Wysocki from comment #4)
> > > > > (In reply to Paul Menzel from comment #3)

[…]

> > I tested it again. Now the screen comes back but nothing else works, that
> means
> > I see the status bar at the top, but the password dialog does not pop up,
> the
> > mouse cursor is not there, and pressing the keyboard does nothing. Using
> the
> > function key to control the keyboard light works though.
> 
> Well, that's unusual.
> 
> It looks like something crashes during resume, then.
> 
> Can you set /sys/power/pm_test to "devices", try to suspend and see what
> happens?  [It will resume automatically if it works.]

Thank you. That worked. Does 

It also looks like the module i915 doesn’t find the DMC firmware despite it being present in `/lib/firmware/i915` and `/lib/firmware/4.12.0-rc2+`.

```
kernel: i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_01.bin failed with error -2
kernel: i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/downloads/firmware], disabling runtime power management
```
Comment 18 Paul Menzel 2017-10-02 18:17:16 UTC
(In reply to Paul Menzel from comment #17)
> (In reply to Rafael J. Wysocki from comment #14)
> > On Monday, October 2, 2017 3:47:22 PM CEST
> > bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=196907
> > > 
> > > --- Comment #13 from Paul Menzel
> > (pmenzel+bugzilla.kernel.org@molgen.mpg.de)
> > > ---
> > > (In reply to Rafael J. Wysocki from comment #9)
> > > > (In reply to Paul Menzel from comment #8)
> > > > > (In reply to Rafael J. Wysocki from comment #4)
> > > > > > (In reply to Paul Menzel from comment #3)
> 
> […]
> 
> > > I tested it again. Now the screen comes back but nothing else works, that
> > means
> > > I see the status bar at the top, but the password dialog does not pop up,
> > the
> > > mouse cursor is not there, and pressing the keyboard does nothing. Using
> > the
> > > function key to control the keyboard light works though.
> > 
> > Well, that's unusual.
> > 
> > It looks like something crashes during resume, then.
> > 
> > Can you set /sys/power/pm_test to "devices", try to suspend and see what
> > happens?  [It will resume automatically if it works.]
> 
> Thank you. That worked. Does 
> 
> It also looks like the module i915 doesn’t find the DMC firmware despite it
> being present in `/lib/firmware/i915` and `/lib/firmware/4.12.0-rc2+`.
> 
> ```
> kernel: i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_01.bin
> failed with error -2
> kernel: i915 0000:00:02.0: Failed to load DMC firmware
> [https://01.org/linuxgraphics/downloads/firmware], disabling runtime power
> management
> ```

Now, I built Linux 4.14-rc3, and the firmware errors are gone. (Maybe because the + is not in there?) Anyway, now resuming from S0ix, this time I could even see the password window appear, but it froze right away. The mouse cursor doesn’t move anymore, and keyboard presses do nothing. Pressing the power button also does not help; only holding it for ten seconds does. I guess I could see the background as during suspend (which took ten seconds or so), I moved the cursor, causing the password dialog to appear.

Hints for debugging are much appreciated. But right now it’s a regression, which, if not fixed before the 4.14 release, should result in having the commit enabling s2idle reverted.
Comment 19 Mario Limonciello 2017-10-02 18:26:11 UTC
Since you're running on Ubuntu 16.04 userspace, are you also running the accompanied X stack that goes with 4.13ish?  It should be the one from Artful.

Normally an out of sync userspace and kernel don't matter significantly, but I'd start there.
Comment 20 Rafael J. Wysocki 2017-10-02 20:20:50 UTC
(In reply to Thorsten Leemhuis from comment #15)
> /me puts his "regression tracker hat" down and takes his "just a user hat"
> 
> TWIMC: I have no idea why my machine (XPS13, 9360, without touch) uses S3 by
> default; I can try to investigate. I recently installed the latest firmware
> (Version: 2.2.1; Release Date: 08/18/2017), but I doubt that's the reason. I
> can enable S2I manually and it works. I will give it a shot to see how
> reliable it works. 

I'd look into the UEFI platform firmware setup.

It looks like the "Low Power S0 Idle" bit is not set in the ACPI tables on your machine and there is a switch for that in the platform firmware setup IIRC.
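Whether the firmware advertises this bit can be checked from the raw FADT, which the kernel exposes as `/sys/firmware/acpi/tables/FACP` (readable by root). The sketch below rests on two assumptions taken from my reading of the ACPI specification and the kernel's ACPICA headers: the 32-bit Flags field sits at byte offset 112 of the table, and `ACPI_FADT_LOW_POWER_S0` is bit 21 of it. The function name `fadt_low_power_s0` is made up for illustration.

```shell
# Succeed if the Low Power S0 Idle flag is set in a raw FADT image.
# Assumptions: Flags is a little-endian 32-bit field at byte offset 112,
# and the Low Power S0 Idle capability is bit 21 of that field.
# Hypothetical helper name, for illustration only.
fadt_low_power_s0() {
    # Read the four flag bytes as unsigned decimals into $1..$4.
    set -- $(od -An -tu1 -j112 -N4 "$1" 2>/dev/null)
    [ $# -eq 4 ] || return 2
    flags=$(( $1 | ($2 << 8) | ($3 << 16) | ($4 << 24) ))
    [ $(( (flags >> 21) & 1 )) -eq 1 ]
}

# Only attempt this when the table is readable (normally root only).
if [ -r /sys/firmware/acpi/tables/FACP ]; then
    if fadt_low_power_s0 /sys/firmware/acpi/tables/FACP; then
        echo "Low Power S0 Idle flag: set"
    else
        echo "Low Power S0 Idle flag: not set"
    fi
fi
```

Note that the flag alone does not decide the default: per the commit message quoted earlier, the Low Power Idle S0 _DSM interface must also be found and no sleep state may have been selected on the kernel command line.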
Comment 21 Rafael J. Wysocki 2017-10-02 20:23:10 UTC
@Paul: What happens if you suspend to idle and wake up using the keyboard?

Does it still behave as described in comment #18?
Comment 22 Rafael J. Wysocki 2017-10-02 20:27:15 UTC
Also please set /sys/power/pm_test to "platform", try to suspend and see what happens.
Comment 23 Rafael J. Wysocki 2017-10-02 20:33:47 UTC
Moreover, please try to disable the strong stack protection and retest.
Comment 24 Rafael J. Wysocki 2017-10-02 21:43:24 UTC
Created attachment 258703 [details]
XPS13 9360 kernel config (based on the one from Paul)

@Paul: I built the kernel using your .config, but I had to disable the "strong stack protection" feature, because my compiler doesn't support it, and that might result in further changes, so the .config actually used for the build is attached.

According to ./scripts/diffconfig the difference is minimal, though.

Everything works for me with this .config, care to try it?
Comment 25 Thorsten Leemhuis 2017-10-03 12:51:19 UTC
(In reply to Rafael J. Wysocki from comment #20)
> (In reply to Thorsten Leemhuis from comment #15)
> > TWIMC: I have no idea why my machine (XPS13, 9360, without touch) uses S3
> by
> > default; I can try to investigate. 
> I'd look into the UEFI platform firmware setup.

FWIW: Nothing there, but I found it's a layer 8 error: I still had "echo deep > /sys/power/mem_sleep" in rc.local from the old days (4.10 or something) when s2idle was first introduced, but didn't work properly yet (and was not yet disabled by default). Oops. Sorry for the noise.

/me grumbles at himself
Comment 26 Paul Menzel 2017-10-04 13:46:39 UTC
(In reply to Rafael J. Wysocki from comment #24)
> Created attachment 258703 [details]
> XPS13 9360 kernel config (based on the one from Paul)
> 
> @Paul: I built the kernel using your .config, but I had to disable the
> "strong stack protection" feature, because my compiler doesn't support it,
> and that might result in further changes, so the .config actually used for
> the build is attached.
> 
> According to ./scripts/diffconfig the difference is minimum, though.
> 
> Everything works for me with this .config, care to try it?

Thank you, I used that configuration (4.4-rc2+-config-9360-alternative), and here are the results.

1.  Booting that system, then doing

```
$ echo s2idle | sudo tee /sys/power/mem_sleep
$ systemctl suspend
```

causes the system to quickly go off. Pressing the power button I see the same desktop background as before, meaning the screen lock didn’t kick in. Unfortunately, it is frozen.

2.  Reboot, doing

```
$ echo s2idle | sudo tee /sys/power/mem_sleep
$ echo enabled | sudo tee /sys/devices/platform/i8042/serio0/power/wakeup
$ echo platform | sudo tee /sys/power/pm_test
$ systemctl suspend
```

the system sleeps, and shortly afterwards the power button LED goes off for half a second or less and turns on again. The monitor stays black, but waiting and pressing keys, the password dialog is shown. So it more or less worked.

3.  After that, during the same boot

```
$ echo platform | sudo tee /sys/power/pm_test
$ systemctl suspend
```

the screen stays black.
Comment 27 Paul Menzel 2017-10-04 13:47:57 UTC
(In reply to Mario Limonciello from comment #19)
> Since you're running on Ubuntu 16.04 userspace, are you also running the
> accompanied X stack that goes with 4.13ish?  It should be the one from
> Artful.
> 
> Normally an out of sync userspace and kernel don't matter significantly, but
> I'd start there.

No, I just use the shipped userspace. The Linux kernel’s no-regression policy demands that it keeps working.
Comment 28 Mario Limonciello 2017-10-04 14:43:07 UTC
> No, I just use the shipped userspace. The Linux kernel’s no regression policy
> demands, that it keeps working.

Whether or not anyone wants to admit it, there are userspace dependencies on the graphics stack, and some fixes that land in userspace resolve freezes that occurred in the kernel stack.  There is a reason that Canonical backports upgraded versions of the X stack with their HWE releases of new kernels.

Anyway I know this policy exists, and if we don't have this figured out in time for 4.14 I would much rather see a commit that quirks your system to deep than one that reverts the default behavior on FADT low power idle bit.  There are other systems that this fixes.

Anyway that discussion aside for now let's keep digging.

Can you VT switch before going into s2idle and see if there is anything informative with a panic or stack trace?  Depending on how hard it's down, maybe you can SSH into the box during this time to look at what's going on.

Furthermore, have you confirmed that with rc3, Rafael's config and adjusting sysfs attribute mem_sleep to "deep" things are actually resolved when you go down?

Do you encounter any other freezes on the system?
Comment 29 Paul Menzel 2017-10-04 15:48:30 UTC
Created attachment 258713 [details]
Linux messages for ACPI S3 (deep)

(In reply to Mario Limonciello from comment #28)
> >No, I just use the shipped userspace. The Linux kernel’s no regression
> policy
> >>demands, that it keeps working.
> Whether or not anyone will want to admit it there are userspace dependencies
> on the graphics stack, and fixes that happen to land in userspace that fix
> freezes that occurred in the kernel stack.  There is a reason that Canonical
> backports upgraded versions of the X stack with their HWE releases of new
> kernels.

If you mean the LTS enablement stacks [1], these are already installed.

```
$ LANG=C sudo apt-get install --install-recommends linux-generic-hwe-16.04 xserver-xorg-hwe-16.04
Reading package lists... Done
Building dependency tree       
Reading state information... Done
linux-generic-hwe-16.04 is already the newest version (4.10.0.35.37).
xserver-xorg-hwe-16.04 is already the newest version (1:7.7+16ubuntu3~16.04.1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
```

> Anyway I know this policy exists, and if we don't have this figured out in
> time for 4.14 I would much rather see a commit that quirks your system to
> deep than one that reverts the default behavior on FADT low power idle bit. 
> There are other systems that this fixes.
> 
> Anyway that discussion aside for now let's keep digging.
> 
> Can you VT switch before going into s2idle and see if there is anything
> informative with a panic or stack trace?

Switching to a VT beforehand indeed changes something. The screen comes back on, and I can read what was written there before. But I cannot insert anything. Interestingly, I am able to switch VTs, but still no dice when entering anything. When reaching the “X VT”, it stops working, and I cannot switch back (or the screen is not updated).

> Depending how hard it's down, maybe you can SSH into the box during this time
> to look what's going on.

Due to network trouble with this Linux kernel, NetworkManager does not assign an IP address; I have to stop NetworkManager and WPA Supplicant and assign the IP address manually with `dhclient en…`. After resume, the system loses the IP address.

> Furthermore, have you confirmed that with rc3, rafael's config and adjusting
> sysfs attribute mem_sleep to "deep" things are actually resolved when you go
> down?

Yes, normal ACPI S3 (deep) always works. Please find the Linux messages (journalctl -k) attached.

> Do you encounter any other freezes on the system?

No.

[1] https://wiki.ubuntu.com/Kernel/LTSEnablementStack
Comment 30 Paul Menzel 2017-10-04 15:54:15 UTC
Just an update: I was still able to ping the system, but I am unable to log in with SSH.
Comment 31 Mario Limonciello 2017-10-04 16:03:42 UTC
> If you mean the LTS enablement stacks, these are already installed.

OK, great to hear.  You're in much better shape then; I'm not as worried about this.

> Due to network trouble with this Linux kernel, NetworkManager does not assign
> an IP address, I have to stop NetworkManager and WPA Supplicant, and assign
> the IP address manually with `dhclient en…`. After resume, the system loses
> the IP address.

It looks like you might have a USB Ethernet dongle according to your log.  Maybe you'll have more luck with that staying up (or coming back up) after resume.


> Switching to a VT beforehand indeed changes something. The screen comes back
> on, and I can read what was written there before. But I cannot insert
> anything. Interestingly, I am able to switch VTs, but still no dice when
> entering anything. When reaching the “X VT”, it stops working, and I cannot
> switch back (or the screen is not updated).

So the system is still somewhat alive; this feels like a regression in runtime PM somewhere in the graphics stack.  The S2I code path exercises the same code paths that runtime PM does.


Looking at your log, I think you might actually have a DA200 HDMI/VGA/ethernet/USB type-C dongle.
Am I right?  Do you normally keep it plugged in all the time?  Does not having it plugged in while going into s2idle help?
Comment 32 Paul Menzel 2017-10-04 16:07:57 UTC
(In reply to Mario Limonciello from comment #31)

[…]

> Looking at your log, I think you might actually have a DA200
> HDMI/VGA/ethernet/USB type-C dongle.
> Am I right?

Yes, you are.

> Do you normally keep it plugged in all the time?  Does not
> having it plugged in while going into s2idle help?

No, I only plugged it in to get debug information. The problem is also there without the dongle plugged in.
Comment 33 Mario Limonciello 2017-10-04 16:11:18 UTC
> No, I only plugged it in to get debug information. The problem is also there
> without the dongle plugged in.

OK.

Since you were able to VT switch, that makes me suspect that the system was able to log things after the wakeup for s2idle.  Can you share that too?  Anything interesting about the state of i915 would be good to see.
Comment 34 Rafael J. Wysocki 2017-10-04 20:24:28 UTC
(In reply to Paul Menzel from comment #27)
> (In reply to Mario Limonciello from comment #19)
> > Since you're running on Ubuntu 16.04 userspace, are you also running the
> > accompanied X stack that goes with 4.13ish?  It should be the one from
> > Artful.
> > 
> > Normally an out of sync userspace and kernel don't matter significantly,
> but
> > I'd start there.
> 
> No, I just use the shipped userspace. The Linux kernel’s no regression
> policy demands, that it keeps working.

Suspend-to-idle doesn't really depend on user space, so it shouldn't matter.
Comment 35 Mario Limonciello 2017-10-04 20:26:35 UTC
> Suspend-to-idle doesn't really depend on user space, so it shouldn't matter.
What I was getting at is that there may be userspace fixes related to i915 that could help with this freeze if Paul were on the original 16.04 X stack.  This is N/A though.
Comment 36 Rafael J. Wysocki 2017-10-04 20:27:13 UTC
(In reply to Paul Menzel from comment #26)
> (In reply to Rafael J. Wysocki from comment #24)

[cut]

> 3.  After that, during the same boot
> 
> ```
> $ echo platform | sudo tee /sys/power/pm_test
> $ systemctl suspend
> ```
> 
> the screen stays black.

This means that suspend/resume quite fundamentally doesn't work for you at all if the target system state is S0.

IOW, the problem is not with the default setting, but with the whole thing being non-functional on your machine.
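
For anyone repeating this test: the `pm_test` setting stays in effect until it is changed back, so it can be worth confirming which level is armed before and after a suspend attempt. A minimal sketch, assuming the usual sysfs format in which the active choice is shown in square brackets; the helper name `selected_option` is made up for illustration.

```shell
# Print the option shown in [brackets] in a sysfs selector file such as
# /sys/power/pm_test or /sys/power/mem_sleep.
# The file is passed as an argument so the helper can be tried on a copy.
# (selected_option is a hypothetical name, not an existing tool.)
selected_option() {
    # One word per line, then keep only the word wrapped in brackets.
    tr ' ' '\n' < "$1" | sed -n 's/^\[\(.*\)\]$/\1/p'
}

# Usage on a real system (reading needs no root):
#   selected_option /sys/power/pm_test
#   selected_option /sys/power/mem_sleep
# Resetting the test level afterwards:
#   echo none | sudo tee /sys/power/pm_test
```

The same helper works for `/sys/power/mem_sleep`, which uses the same bracket convention.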
Comment 37 Rafael J. Wysocki 2017-10-04 20:38:13 UTC
Which basically is limited to your machine ATM as all of the users of 9360 I know about (including myself) except for you don't see any s2idle problems on it.

Anyway, please try:

$ sudo su -
# echo s2idle > /sys/power/mem_sleep
# echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup
# echo 1 > /sys/power/pm_debug_messages
# echo mem > /sys/power/state

and then press any key on the keyboard to wake up.

If it does wake up and if you are able to get anything out of it then, please save the output of dmesg and attach it here.
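
The sequence above could also be wrapped in a small function, sketched here under assumptions: the function name and its optional root-prefix argument are hypothetical and exist only so the steps can be dry-run against a scratch directory. The one addition is a check that the kernel offers s2idle at all.

```shell
# Sketch of the steps listed above, plus a guard.
# $1 is a hypothetical path prefix for dry runs; leave it empty on a real
# system and run as root.
s2idle_debug_cycle() {
    root="${1:-}"
    # Refuse to continue if s2idle is not listed as an available variant.
    grep -qw 's2idle' "$root/sys/power/mem_sleep" || {
        echo "s2idle not offered in $root/sys/power/mem_sleep" >&2
        return 1
    }
    echo s2idle  > "$root/sys/power/mem_sleep"
    echo enabled > "$root/sys/devices/platform/i8042/serio0/power/wakeup"
    echo 1       > "$root/sys/power/pm_debug_messages"
    # On a real system this write returns only after the machine wakes up.
    echo mem     > "$root/sys/power/state"
}

# After wakeup, on a real system:
#   dmesg > s2idle-dmesg.txt
```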
Comment 38 Rafael J. Wysocki 2017-10-04 20:41:12 UTC
(In reply to Mario Limonciello from comment #35)
> > Suspend-to-idle doesn't really depend on user space, so it shouldn't
> matter.
> I was getting at potentially there are userspace fixes related to i915 to
> maybe help with this freeze if Paul was on the original 16.04 X stack.  This
> is N/A though.

I'm still suspecting a graphics stack issue which triggers when X starts to talk to the GPU after resume.  And I suspect that it happens because different pieces of it don't match.

The way S3 helps is likely because of resetting the HW or similar.
Comment 39 Rafael J. Wysocki 2017-10-04 20:44:05 UTC
@Paul: If you have a 4.13 4.13-rc still installed, it would be good to check if s2idle works for you with that.
Comment 40 Paul Menzel 2017-10-05 17:12:52 UTC
(In reply to Rafael J. Wysocki from comment #39)
> @Paul: If you have a 4.13 4.13-rc still installed, it would be good to check
> if s2idle works for you with that.

So, S0ix does not work with Linux 4.13-rc1+, Linus’ master branch from July 18, 2017. During suspend the image flashes; on resume the image comes back, but the system is frozen. No password dialog is shown.

It also does not work with Linux 4.13+ (“4.14-rc0”) from Linus’ master branch from September 11th, 2017. Suspending takes quite long until the power LED turns off. On resume, the screen stays black.

PS: OT, but Dell doesn’t share what bugs are fixed in their firmware update [1].

[1] https://twitter.com/DellCares/status/915510808158855168
Comment 41 Paul Menzel 2017-10-05 18:01:07 UTC
(In reply to Rafael J. Wysocki from comment #37)
> Which basically is limited to your machine ATM as all of the users of 9360 I
> know about (including myself) except for you don't see any s2idle problems
> on it.
> 
> Anyway, please try:
> 
> $ sudo su -
> # echo s2idle > /sys/power/mem_sleep
> # echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup
> # echo 1 > /sys/power/pm_debug_messages
> # echo mem > /sys/power/state
> 
> and then press any key on the keyboard to wake up.
> 
> If it does wake up and if you are able to get anything out of it then,
> please save the output of dmesg and attach it here.

OK, tested with Linux 4.14-rc3, LightDM stopped, and on a VT. The system suspends and resumes, and I can press keys, which are shown, but it hangs. No Linux messages are visible at all though.

```
$ echo mem | sudo tee /sys/power/state
mem
[my enter keys]

niaedtuniae^C^C^C^C^C^C^C^C^C^C^C
uniaend
```
Comment 42 Paul Menzel 2017-10-05 18:06:52 UTC
If you have an idea, how I can access or print out the logs with no working network connection after resume, I am all ears. (I guess it needs to be printed to the console.)
Comment 43 Rafael J. Wysocki 2017-10-05 20:54:19 UTC
(In reply to Paul Menzel from comment #41)
> (In reply to Rafael J. Wysocki from comment #37)
> > Which basically is limited to your machine ATM as all of the users of 9360
> I
> > know about (including myself) except for you don't see any s2idle problems
> > on it.
> > 
> > Anyway, please try:
> > 
> > $ sudo su -
> > # echo s2idle > /sys/power/mem_sleep
> > # echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup
> > # echo 1 > /sys/power/pm_debug_messages
> > # echo mem > /sys/power/state
> > 
> > and then press any key on the keyboard to wake up.
> > 
> > If it does wake up and if you are able to get anything out of it then,
> > please save the output of dmesg and attach it here.
> 
> Ok, test with Linux 4.14.-rc3, LightDM stopped, and on VT. System suspend
> and resumes, and I can press keys, which are show, but it hangs. No Linux
> messages are visible at all though.

And under X it just hangs, right?

Please try to disable the C8-C10 idle states via

/sys/devices/system/cpu/cpu[0-3]/cpuidle/state[6-8]/disabled

and see if the problem is still there with that.
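
A sketch of how those files could be written in one go. The function name and the optional directory argument are assumptions for illustration; the state indexes 6-8 are taken from the path above, and since the index-to-C-state mapping can differ between machines, the sketch prints each state's `name` before disabling it.

```shell
# Write 1 to the "disabled" file of cpuidle states 6-8 on every CPU.
# $1 defaults to the real sysfs directory; pass another directory to dry-run.
# (disable_deep_idle_states is a hypothetical helper name.)
disable_deep_idle_states() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpuidle/state[6-8]/disabled; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Show which idle state this index corresponds to on this machine.
        printf '%s: %s\n' "$f" "$(cat "${f%/disabled}/name" 2>/dev/null)"
        echo 1 > "$f"
    done
}

# Usage on a real system (as root):
#   disable_deep_idle_states
```

The setting is not persistent, so it has to be reapplied after a reboot.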
Comment 44 Rafael J. Wysocki 2017-10-05 21:01:17 UTC
I should have mentioned that there is a relatively easy workaround for non-working s2idle which is to make the boot loader append "mem_sleep_default=deep" to the kernel command line.
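
On an Ubuntu-style setup this would typically mean editing `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`. A rough sketch, assuming GRUB is the boot loader and that file layout; the helper name is made up, and it is worth keeping a backup of the file before trying it.

```shell
# Append mem_sleep_default=deep to GRUB_CMDLINE_LINUX_DEFAULT unless some
# mem_sleep_default= setting is already mentioned in the file.
# $1 defaults to /etc/default/grub (assumption: GRUB on an Ubuntu-like system).
# (add_mem_sleep_default is a hypothetical helper name.)
add_mem_sleep_default() {
    grub_file="${1:-/etc/default/grub}"
    # Already present: nothing to do.
    grep -q 'mem_sleep_default=' "$grub_file" && return 0
    # Insert the parameter just before the closing quote of the variable.
    sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 mem_sleep_default=deep"/' \
        "$grub_file" > "$grub_file.new" && mv "$grub_file.new" "$grub_file"
}

# Usage on a real system (as root), followed by regenerating the config:
#   add_mem_sleep_default && update-grub
# After rebooting, the result can be checked with:
#   cat /proc/cmdline
#   cat /sys/power/mem_sleep
```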
Comment 45 Rafael J. Wysocki 2017-10-05 21:27:05 UTC
Which reminds me of one more thing: please attach the output of dmidecode from your system.
Comment 46 Mario Limonciello 2017-10-06 03:15:15 UTC
So I spent some time trying to replicate Paul's setup to reproduce his problem.

1) Brand new XPS 9360 with Ubuntu 16.04 factory image.
2) Ran all regular Ubuntu updates.
3) Ran BIOS update to 2.2.1 using fwupd.
4) Compiled kernel 4.14-rc3 using the most recent config posted by Rafael.
5) Installed, rebooted
[    0.000000] Linux version 4.14.0-rc3 (test@test-XPS-13-9360) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)) #1 SMP Thu Oct 5 19:04:05 EDT 2017

6) systemctl suspend
7) Power button to wake up. No problems.
[   97.551459] PM: suspend entry (s2idle)
[   97.551460] PM: Syncing filesystems ... done.
[   97.581561] Freezing user space processes ... (elapsed 0.001 seconds) done.
[   97.583183] OOM killer disabled.
[   97.583184] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   97.584281] Suspending console(s) (use no_console_suspend to debug)
[ 1488.148940] OOM killer enabled.
[ 1488.148942] Restarting tasks ... done.
[ 1488.150759] audit: type=1400 audit(1507258475.839:123): apparmor="DENIED" operation="create" profile="/usr/sbin/cups-browsed" pid=960 comm="cups-browsed" family="unix" sock_type="stream" protocol=0 requested_mask="create" denied_mask="create"
[ 1488.161360] [drm] RC6 on
[ 1488.200752] PM: suspend exit

8) I noticed that I'm *not* using the latest Xorg HWE stack, just the regular one that came with 16.04.1 so I upgraded to try to get closer to Paul's config.
Before:
ii  xorg                                          1:7.7+13ubuntu3                              amd64        X.Org X Window System
ii  xorg-docs-core                                1:1.7.1-1ubuntu1                             all          Core documentation for the X.org X Window System
ii  xorriso                                       1.4.2-4ubuntu1                               amd64        command line ISO-9660 and Rock Ridge manipulation tool
ii  xserver-common                                2:1.18.4-0ubuntu0.4                          all          common files used by various X servers
ii  xserver-xorg                                  1:7.7+13ubuntu3                              amd64        X.Org X server
ii  xserver-xorg-core                             2:1.18.4-0ubuntu0.4                          amd64        Xorg X server - core server
ii  xserver-xorg-input-all                        1:7.7+13ubuntu3                              amd64        X.Org X server -- input driver metapackage
ii  xserver-xorg-input-evdev                      1:2.10.1-1ubuntu2                            amd64        X.Org X server -- evdev input driver
ii  xserver-xorg-input-synaptics                  1.8.2-1ubuntu3                               amd64        Synaptics TouchPad driver for X.Org server
ii  xserver-xorg-input-vmmouse                    1:13.1.0-1ubuntu2                            amd64        X.Org X server -- VMMouse input driver to use with VMWare
ii  xserver-xorg-input-wacom                      1:0.32.0-0ubuntu3                            amd64        X.Org X server -- Wacom input driver
ii  xserver-xorg-video-all                        1:7.7+13ubuntu3                              amd64        X.Org X server -- output driver metapackage
ii  xserver-xorg-video-amdgpu                     1.1.2-0ubuntu0.16.04.1                       amd64        X.Org X server -- AMDGPU display driver
ii  xserver-xorg-video-ati                        1:7.7.0-1                                    amd64        X.Org X server -- AMD/ATI display driver wrapper
ii  xserver-xorg-video-fbdev                      1:0.4.4-1build5                              amd64        X.Org X server -- fbdev display driver
ii  xserver-xorg-video-intel                      2:2.99.917+git20160325-1ubuntu1.2            amd64        X.Org X server -- Intel i8xx, i9xx display driver
ii  xserver-xorg-video-nouveau                    1:1.0.12-1build2                             amd64        X.Org X server -- Nouveau display driver
ii  xserver-xorg-video-qxl                        0.1.4-3ubuntu3                               amd64        X.Org X server -- QXL display driver
ii  xserver-xorg-video-radeon                     1:7.7.0-1                                    amd64        X.Org X server -- AMD/ATI Radeon display driver
ii  xserver-xorg-video-vesa                       1:2.3.4-1build2                              amd64        X.Org X server -- VESA display driver
ii  xserver-xorg-video-vmware                     1:13.1.0-2ubuntu3                            amd64        X.Org X server -- VMware display driver

After:
ii  xorg                                          1:7.7+13ubuntu3                              amd64        X.Org X Window System
ii  xorg-docs-core                                1:1.7.1-1ubuntu1                             all          Core documentation for the X.org X Window System
ii  xorriso                                       1.4.2-4ubuntu1                               amd64        command line ISO-9660 and Rock Ridge manipulation tool
ii  xserver-common                                2:1.18.4-0ubuntu0.4                          all          common files used by various X servers
rc  xserver-xorg                                  1:7.7+13ubuntu3                              amd64        X.Org X server
rc  xserver-xorg-core                             2:1.18.4-0ubuntu0.4                          amd64        Xorg X server - core server
ii  xserver-xorg-core-hwe-16.04                   2:1.19.3-1ubuntu1~16.04.2                    amd64        Xorg X server - core server
ii  xserver-xorg-hwe-16.04                        1:7.7+16ubuntu3~16.04.1                      amd64        X.Org X server
ii  xserver-xorg-input-all-hwe-16.04              1:7.7+16ubuntu3~16.04.1                      amd64        X.Org X server -- input driver metapackage
ii  xserver-xorg-input-evdev-hwe-16.04            1:2.10.5-1ubuntu1~16.04.1                    amd64        X.Org X server -- evdev input driver
ii  xserver-xorg-input-synaptics-hwe-16.04        1.9.0-1ubuntu1~16.04.1                       amd64        Synaptics TouchPad driver for X.Org server
ii  xserver-xorg-input-wacom-hwe-16.04            1:0.34.0-0ubuntu2~16.04.1                    amd64        X.Org X server -- Wacom input driver
ii  xserver-xorg-legacy-hwe-16.04                 2:1.19.3-1ubuntu1~16.04.2                    amd64        setuid root Xorg server wrapper
ii  xserver-xorg-video-all-hwe-16.04              1:7.7+16ubuntu3~16.04.1                      amd64        X.Org X server -- output driver metapackage
ii  xserver-xorg-video-amdgpu-hwe-16.04           1.3.0-0ubuntu1~16.04.1                       amd64        X.Org X server -- AMDGPU display driver
ii  xserver-xorg-video-ati-hwe-16.04              1:7.9.0-0ubuntu1~16.04.1                     amd64        X.Org X server -- AMD/ATI display driver wrapper
ii  xserver-xorg-video-fbdev-hwe-16.04            1:0.4.4-1build6~16.04.1                      amd64        X.Org X server -- fbdev display driver
rc  xserver-xorg-video-intel                      2:2.99.917+git20160325-1ubuntu1.2            amd64        X.Org X server -- Intel i8xx, i9xx display driver
ii  xserver-xorg-video-intel-hwe-16.04            2:2.99.917+git20170309-0ubuntu1~16.04.1      amd64        X.Org X server -- Intel i8xx, i9xx display driver
ii  xserver-xorg-video-nouveau-hwe-16.04          1:1.0.14-0ubuntu1~16.04.1                    amd64        X.Org X server -- Nouveau display driver
ii  xserver-xorg-video-qxl-hwe-16.04              0.1.5-2build1~16.04.1                        amd64        X.Org X server -- QXL display driver
ii  xserver-xorg-video-radeon-hwe-16.04           1:7.9.0-0ubuntu1~16.04.1                     amd64        X.Org X server -- AMD/ATI Radeon display driver
ii  xserver-xorg-video-vesa-hwe-16.04             1:2.3.4-1build3~16.04.1                      amd64        X.Org X server -- VESA display driver
rc  xserver-xorg-video-vmware                     1:13.1.0-2ubuntu3                            amd64        X.Org X server -- VMware display driver
ii  xserver-xorg-video-vmware-hwe-16.04           1:13.2.1-1build1~16.04.1                     amd64        X.Org X server -- VMware display driver

9) Rebooted, tried systemctl suspend again.
10) Wake up from power button.  No problems again.
[   69.715546] PM: suspend entry (s2idle)
[   69.715549] PM: Syncing filesystems ... done.
[   69.726396] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   69.728961] OOM killer disabled.
[   69.728962] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[   69.730254] Suspending console(s) (use no_console_suspend to debug)
[   69.934204] psmouse serio1: Failed to disable mouse on isa0060/serio1
[   75.933261] ACPI: button: The lid device is not compliant to SW_LID.
[   76.177890] OOM killer enabled.
[   76.177896] Restarting tasks ... done.
[   76.197786] [drm] RC6 on
[   76.224637] PM: suspend exit

I'm not sure what else about my setup can be any different than Paul's to lead to his problems. What else have you changed from the factory Ubuntu image?
Funky DKMS packages?  TLP configuration?
Comment 47 Mario Limonciello 2017-10-06 03:16:28 UTC
Created attachment 258733 [details]
full dmesg output from working s2idle after 4.14-rc3 + xorg-hwe package
Comment 48 Mario Limonciello 2017-10-06 03:16:57 UTC
Created attachment 258735 [details]
packages installed on Ubuntu
Comment 49 Paul Menzel 2017-10-06 10:01:07 UTC
(In reply to Rafael J. Wysocki from comment #43)
> (In reply to Paul Menzel from comment #41)
> > (In reply to Rafael J. Wysocki from comment #37)
> > > Which basically is limited to your machine ATM as all of the users of
> 9360
> > I
> > > know about (including myself) except for you don't see any s2idle
> problems
> > > on it.
> > > 
> > > Anyway, please try:
> > > 
> > > $ sudo su -
> > > # echo s2idle > /sys/power/mem_sleep
> > > # echo enabled > /sys/devices/platform/i8042/serio0/power/wakeup
> > > # echo 1 > /sys/power/pm_debug_messages
> > > # echo mem > /sys/power/state
> > > 
> > > and then press any key on the keyboard to wake up.
> > > 
> > > If it does wake up and if you are able to get anything out of it then,
> > > please save the output of dmesg and attach it here.
> > 
> > Ok, test with Linux 4.14.-rc3, LightDM stopped, and on VT. System suspend
> > and resumes, and I can press keys, which are show, but it hangs. No Linux
> > messages are visible at all though.
> 
> And under X it just hangs, right?
> 
> Please try to disable the C8-C10 idle states via
> 
> /sys/devices/system/cpu/cpu[0-3]/cpuidle/state[6-8]/disabled
> 
> and see if the problem is still there with that.

Yes, the problem is still there with that. I stopped LightDM beforehand, and from a VT, I executed `echo mem | sudo tee /sys/power/state`.

One more thing, on the VT where there was a login screen (TTY2 in my case), the key presses were not shown. Switching the TTY was still possible though.
Comment 50 Paul Menzel 2017-10-06 10:02:43 UTC
(In reply to Rafael J. Wysocki from comment #44)
> I should have mentioned that there is a relatively easy workaround for
> non-working s2idle which is to make the boot loader append
> "mem_sleep_default=deep" to the kernel command line.

Yes, that is currently used to work around the problem.
Comment 51 Paul Menzel 2017-10-06 10:16:19 UTC
I booted with `splash quiet` removed from the Linux command line, and added `nomodeset`. Unfortunately, that didn’t change a thing (LightDM stopped, issued `echo mem | sudo tee /sys/power/state`).
Comment 52 Paul Menzel 2017-10-06 11:47:28 UTC
Created attachment 258743 [details]
Output of `dmidecode`

(In reply to Rafael J. Wysocki from comment #45)
> Which reminds me of one more thing: please attach the output of dmidecode
> from your system.

Please find it attached.
Comment 53 Paul Menzel 2017-10-06 11:53:25 UTC
(In reply to Mario Limonciello from comment #46)
> So I spent some time trying to replicate Paul's setup to reproduce his
> problem.

[…]

Here is the list of installed packages with the prefix `xserver-xorg`.

```
$ LANG=C dpkg -l xserver-xorg* | grep ^ii
ii  xserver-xorg-core-hwe-16.04            2:1.19.3-1ubuntu1~16.04.2               amd64        Xorg X server - core server
ii  xserver-xorg-hwe-16.04                 1:7.7+16ubuntu3~16.04.1                 amd64        X.Org X server
ii  xserver-xorg-input-all-hwe-16.04       1:7.7+16ubuntu3~16.04.1                 amd64        X.Org X server -- input driver metapackage
ii  xserver-xorg-input-evdev-hwe-16.04     1:2.10.5-1ubuntu1~16.04.1               amd64        X.Org X server -- evdev input driver
ii  xserver-xorg-input-synaptics-hwe-16.04 1.9.0-1ubuntu1~16.04.1                  amd64        Synaptics TouchPad driver for X.Org server
ii  xserver-xorg-input-wacom-hwe-16.04     1:0.34.0-0ubuntu2~16.04.1               amd64        X.Org X server -- Wacom input driver
ii  xserver-xorg-legacy                    2:1.18.4-0ubuntu0.4                     amd64        setuid root Xorg server wrapper
ii  xserver-xorg-video-all-hwe-16.04       1:7.7+16ubuntu3~16.04.1                 amd64        X.Org X server -- output driver metapackage
ii  xserver-xorg-video-amdgpu-hwe-16.04    1.3.0-0ubuntu1~16.04.1                  amd64        X.Org X server -- AMDGPU display driver
ii  xserver-xorg-video-ati-hwe-16.04       1:7.9.0-0ubuntu1~16.04.1                amd64        X.Org X server -- AMD/ATI display driver wrapper
ii  xserver-xorg-video-fbdev-hwe-16.04     1:0.4.4-1build6~16.04.1                 amd64        X.Org X server -- fbdev display driver
ii  xserver-xorg-video-intel-hwe-16.04     2:2.99.917+git20170309-0ubuntu1~16.04.1 amd64        X.Org X server -- Intel i8xx, i9xx display driver
ii  xserver-xorg-video-nouveau-hwe-16.04   1:1.0.14-0ubuntu1~16.04.1               amd64        X.Org X server -- Nouveau display driver
ii  xserver-xorg-video-qxl-hwe-16.04       0.1.5-2build1~16.04.1                   amd64        X.Org X server -- QXL display driver
ii  xserver-xorg-video-radeon-hwe-16.04    1:7.9.0-0ubuntu1~16.04.1                amd64        X.Org X server -- AMD/ATI Radeon display driver
ii  xserver-xorg-video-vesa-hwe-16.04      1:2.3.4-1build3~16.04.1                 amd64        X.Org X server -- VESA display driver
ii  xserver-xorg-video-vmware-hwe-16.04    1:13.2.1-1build1~16.04.1                amd64        X.Org X server -- VMware display driver
```
Comment 54 Paul Menzel 2017-10-06 12:07:55 UTC
Created attachment 258745 [details]
Difference in Linux messages from non-working vs. working

Note, the non-working example is captured from *deep*, as *s2idle* does not work.

> I'm not sure what else about my setup can be any different than Paul's to
> lead to his problems. What else have you changed from the factory Ubuntu
> image?
>
> Funky DKMS packages?  TLP configuration?

Please find the differences in the Linux messages attached. Here are the differences I see.

1. I still use the firmware 1.3.5. Mario uses 2.2.1. (I won’t use that without a published change-log.)

2. Despite Mario using the *latest* firmware, I have the newer microcode updates. (Why doesn’t Dell *latest* firmware ship them?)

3. I have Thunderbolt messages in my Linux messages. (That could be from when having the DA200 adapter plugged in?)

4. Oh, and I forgot, the drive is LUKS encrypted here. I thought, that’s standard by default.

Maybe you spot more differences.
Comment 55 Mario Limonciello 2017-10-06 12:59:53 UTC
Well I can make a few changes to get closer to your config.  My goal is to reproduce.

> 1. I still use the firmware 1.3.5. Mario uses 2.2.1. (I won’t use that
> without a published change-log.)

I downgraded to 1.3.5.

> 2. Despite Mario using the *latest* firmware, I have the newer microcode
> updates. (Why doesn’t Dell *latest* firmware ship them?)

I installed intel-microcode from xenial-updates/restricted.

> 3. I have Thunderbolt messages in my Linux messages. (That could be from when
> having the DA200 adapter plugged in?)

I don't have a DA200 handy, but I tried to do my S2I test with a WD15 plugged in.

> 4. Oh, and I forgot, the drive is LUKS encrypted here. I thought, that’s
> standard by default.

Well this would require a full re-install.  I don't suspect LUKS to be the cause as S4 isn't in use.


With those changes to the configuration I still can't reproduce.
Comment 56 Mario Limonciello 2017-10-06 13:00:27 UTC
Created attachment 258747 [details]
Mario Downgraded to 1.3.5, installed microcode, plugged in WD15
Comment 57 Mario Limonciello 2017-10-06 13:17:47 UTC
Checked all DKMS packages on the system, none are built for 4.14-rc3.
Purged TLP to ensure it's not messing with going down.

Still no problems with S2I for me.

The other thing I can think of would be the versions of the graphics firmware.  Did you modify anything by hand?

$ apt policy linux-firmware
linux-firmware:
  Installed: 1.157.12
  Candidate: 1.157.12
  Version table:
 *** 1.157.12 500
        500 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu xenial-updates/main i386 Packages
        100 /var/lib/dpkg/status

OEM image does make the following other modifications relative to stock Ubuntu, but I don't think they're relevant.
* diversion of /lib/firmware/ath10k/QCA6174/hw3.0/board.bin to /lib/firmware/ath10k/QCA6174/hw3.0/board.bin.wifi-qca6174 by wifi-qca6174-killer
* diversion of /lib/udev/rules.d/50-bluetooth-hci-auto-poweron.rules to /lib/udev/rules.d/50-bluetooth-hci-auto-poweron.rules.oem-bluez-autoenable by oem-bluez-autoenable
* diversion of /lib/firmware/ath10k/QCA6174/hw3.0/board-2.bin to /lib/firmware/ath10k/QCA6174/hw3.0/board-2.bin.wifi-qca6174 by wifi-qca6174-killer
* diversion of /etc/bluetooth/main.conf to /etc/bluetooth/main.conf.oem-bluez-autoenable by oem-bluez-autoenable
Comment 58 Mario Limonciello 2017-10-06 13:23:17 UTC
Regarding network problems, I did notice that NetworkManager and dhclient aren't getting along with 4.14-rc3:

[   21.634528] audit: type=1400 audit(1507295458.955:58): apparmor="DENIED" operation="create" profile="/sbin/dhclient" pid=1969 comm="dhclient" family="unix" sock_type="stream" protocol=0 requested_mask="create" denied_mask="create"
[   21.637698] audit: type=1400 audit(1507295458.959:59): apparmor="DENIED" operation="create" profile="/usr/lib/NetworkManager/nm-dhcp-helper" pid=1970 comm="nm-dhcp-helper" family="unix" sock_type="stream" protocol=0 requested_mask="create" denied_mask="create"

I would recommend doing the following before S2I:

# /etc/init.d/apparmor teardown
# /etc/init.d/apparmor stop

Hopefully your network will then come back up after leaving S2I.
Comment 59 Mario Limonciello 2017-10-06 13:39:26 UTC
Looking at your diff closer I notice that you *do* have different firmware than is in the linux-firmware package in Ubuntu.

 i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
-[drm] Finished loading DMC firmware i915/kbl_dmc_ver1_01.bin (v1.1)
+i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_01.bin failed with error -2
+i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/downloads/firmware], disabling runtime power management.

So I fetched https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/commit/i915/kbl_dmc_ver1_01.bin?id=9facc31d772a3ed399760d2168d644f9de84d6db and put it in place on my system.

[    1.267409] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_01.bin (v1.1)

I still can't reproduce.
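
To compare this quickly across machines, the kernel log can be filtered for the two DMC messages shown in the diff above. A small sketch, assuming the message wording stays as quoted there; the function name is hypothetical.

```shell
# Read kernel log text on stdin and report whether i915 loaded its DMC
# firmware, based on the two message strings quoted in this comment.
# (dmc_firmware_status is a hypothetical helper name.)
dmc_firmware_status() {
    awk '
        /Finished loading DMC firmware/ { status = "loaded" }
        /Failed to load DMC firmware/   { status = "failed" }
        END { print (status ? status : "unknown") }
    '
}

# Usage on a real system:
#   dmesg | dmc_firmware_status
#   journalctl -k | dmc_firmware_status
```

If both messages appear in one log, the later one wins, which matches what the driver ended up doing.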
Comment 60 Rafael J. Wysocki 2017-10-06 13:57:53 UTC
OK

Thanks Mario for doing all this work!

Paul, please let us know what else we can do to reproduce the problem that you are seeing.
Comment 61 Paul Menzel 2017-10-06 14:29:19 UTC
(In reply to Rafael J. Wysocki from comment #60)
> OK
> 
> Thanks Mario for doing all this work!

Seconded. Thank you Mario. Also for the AppArmor help to fix networking.

> Paul, please let us know what else we can do to reproduce the problem that
> you are seeing.

Well, I am at a loss too as to what is going on. You have my Linux messages, so I have no idea what other difference there is besides the LUKS encrypted system.

Removing the wireless modules before suspend also didn’t help. Despite the working networking, I am unable to log in after resume.

Any hints on how to print something to the screen (`no_console_suspend` didn’t help) would be really great. I do have a BeagleBone Black, which supposedly works as a USB debug dongle. But no idea, how to build a Linux kernel to fix this.
Comment 62 Mario Limonciello 2017-10-06 14:39:40 UTC
Do you have anything else on your system that you've added that adjusts tunables in sysfs?  
Any options that you've put in /etc/modprobe.d that could relate to module options?
Anything you've put in /etc/sysctl.conf or /etc/sysctl.d?
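
A quick way to answer the modprobe.d question is to list only the active `options` lines rather than the files themselves. A sketch, assuming the standard `options <module> <params>` syntax; the function name is made up.

```shell
# Print every uncommented "options ..." line from the .conf files in a
# modprobe configuration directory, prefixed with the file name.
# $1 defaults to /etc/modprobe.d.
# (list_module_options is a hypothetical helper name.)
list_module_options() {
    dir="${1:-/etc/modprobe.d}"
    grep -H -E '^[[:space:]]*options[[:space:]]' "$dir"/*.conf 2>/dev/null
}

# Usage:
#   list_module_options
```

No output means no module options are being set from that directory.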
Comment 63 Paul Menzel 2017-10-06 14:51:51 UTC
(In reply to Mario Limonciello from comment #62)
> Do you have anything else on your system that you've added that adjusts
> tunables in sysfs?

Not that I know of.

> Any options that you've put in /etc/modprobe.d that could related to module
> options?

```
$ ls -l --full-time /etc/modprobe.d/
total 56
-rw-r--r-- 1 root root 2507 2015-07-31 05:42:17.000000000 +0200 alsa-base.conf
-rw-r--r-- 1 root root  325 2016-03-13 14:36:35.000000000 +0100 blacklist-ath_pci.conf
-rw-r--r-- 1 root root 1603 2016-03-13 14:36:35.000000000 +0100 blacklist.conf
-rw-r--r-- 1 root root  210 2016-03-13 14:36:35.000000000 +0100 blacklist-firewire.conf
-rw-r--r-- 1 root root  697 2016-03-13 14:36:35.000000000 +0100 blacklist-framebuffer.conf
-rw-r--r-- 1 root root  156 2015-07-31 05:42:17.000000000 +0200 blacklist-modem.conf
lrwxrwxrwx 1 root root   41 2017-03-20 18:30:19.027495898 +0100 blacklist-oss.conf -> /lib/linux-sound-base/noOSS.modprobe.conf
-rw-r--r-- 1 root root  583 2016-03-13 14:36:35.000000000 +0100 blacklist-rare-network.conf
-rw-r--r-- 1 root root 1077 2016-03-13 14:36:35.000000000 +0100 blacklist-watchdog.conf
-rw-r--r-- 1 root root  127 2015-03-11 19:00:32.000000000 +0100 dkms.conf
-rw-r--r-- 1 root root  390 2016-04-12 12:06:37.000000000 +0200 fbdev-blacklist.conf
-rw-r--r-- 1 root root  154 2015-11-10 06:13:21.000000000 +0100 intel-microcode-blacklist.conf
-rw-r--r-- 1 root root  347 2016-03-13 14:36:35.000000000 +0100 iwlwifi.conf
-rw-r--r-- 1 root root  104 2016-03-13 14:36:35.000000000 +0100 mlx4.conf
-rw-r--r-- 1 root root   30 2017-01-26 19:31:13.000000000 +0100 vmwgfx-fbdev.conf
$ md5sum /etc/modules
8b5141eea356759244831b4ed2df9aee  /etc/modules
```

> Anything you've put in /etc/sysctl.conf or /etc/sysctl.d?

```
$ ls -l --full-time /etc/sysctl.conf
-rw-r--r-- 1 root root 2084 2015-09-06 07:30:20.000000000 +0200 /etc/sysctl.conf
$ md5sum /etc/sysctl.conf
76c1d8285c578d5e827c3e07b9738112  /etc/sysctl.conf
$ ls -l /etc/sysctl.d/
total 40
-rw-r--r-- 1 root root   77 Jan 13  2016 10-console-messages.conf
-rw-r--r-- 1 root root  490 Jan 13  2016 10-ipv6-privacy.conf
-rw-r--r-- 1 root root  726 Jan 13  2016 10-kernel-hardening.conf
-rw-r--r-- 1 root root  257 Jan 13  2016 10-link-restrictions.conf
-rw-r--r-- 1 root root 1184 Jan 13  2016 10-magic-sysrq.conf
-rw-r--r-- 1 root root  509 Jan 13  2016 10-network-security.conf
-rw-r--r-- 1 root root 1292 Jan 13  2016 10-ptrace.conf
-rw-r--r-- 1 root root  506 Jan 13  2016 10-zeropage.conf
-rw-r--r-- 1 root root   69 Oct 24  2015 30-tracker.conf
lrwxrwxrwx 1 root root   14 Jul 19 01:56 99-sysctl.conf -> ../sysctl.conf
-rw-r--r-- 1 root root  519 Jan 13  2016 README
```

```
$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Device [8086:5904] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5916] (rev 02)
00:04.0 Signal processing controller [1180]: Intel Corporation Skylake Processor Thermal Subsystem [8086:1903] (rev 02)
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31] (rev 21)
00:15.0 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:9d60] (rev 21)
00:15.1 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:9d61] (rev 21)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-LP CSME HECI [8086:9d3a] (rev 21)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1)
00:1c.4 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d14] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d15] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:9d18] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:9d58] (rev 21)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-LP PMC [8086:9d21] (rev 21)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:9d71] (rev 21)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-LP SMBus [8086:9d23] (rev 21)
3a:00.0 Network controller [0280]: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter [168c:003e] (rev 32)
3b:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a] (rev 01)
3c:00.0 Non-Volatile memory controller [0108]: Device [1c5c:1284]
```
Comment 64 Paul Menzel 2017-10-06 14:53:01 UTC
(In reply to Paul Menzel from comment #61)
> (In reply to Rafael J. Wysocki from comment #60)
> > OK
> > 
> > Thanks Mario for doing all this work!
> 
> Seconded. Thank you Mario. Also for the AppArmor help to fix networking.
> 
> > Paul, please let us know what else we can do to reproduce the problem that
> > you are seeing.
> 
> Well, I am at a loss too, what is going on. You have my Linux messages, so
> no idea, what other difference there is besides the LUKS encrypted system.
> 
> Removing the wireless modules before suspend, also didn’t help. Despite the
> working networking, I am unable to login after resume.
> 
> Any hints on how to print something to the screen (`no_console_suspend`
> didn’t help) would be really great. I do have a BeagleBone Black, which
> supposedly works as a USB debug dongle. But no idea, how to build a Linux
> kernel to fix this.

Wasn’t there also a way to save something in EFI variables?
Comment 65 Mario Limonciello 2017-10-06 15:06:50 UTC
Thanks for sharing all that, I don't see anything different in module options.

I do however notice that we have different NVMe disks.  Yours is a Hynix, mine is a Toshiba.

Given that your system is in such poor shape even when you tear out pieces of graphics, I'm wondering if perhaps we're looking at an APST issue with that particular disk.  My explanation would be that the disk typically doesn't stay resident in PS4 at runtime, but during S2I it does.

Try booting with the nvme core's default_ps_max_latency_us parameter set to 0.

Share your output from: # nvme id-ctrl /dev/nvme0n1 in that boot.


Another way to check whether it's APST would be to break into your initramfs and invoke S2I from there.  The root on /dev/nvme0n1 won't be mounted, so hopefully if it's a problem with the disk it shouldn't bring userspace down with it.
Comment 66 Paul Menzel 2017-10-06 15:46:05 UTC
(In reply to Mario Limonciello from comment #65)
> Thanks for sharing all that, I don't see anything different in module
> options.
> 
> I do however notice that we have different NVMe disks.  Yours is a Hynix,
> mine is a Toshiba.
> 
> Given state of your system not being so great even when you tear out pieces
> of graphics, I'm wondering if perhaps we're looking at an APST issue with
> that particular disk.  My explanation would be that you aren't going
> resident to PS4 at runtime typically, but S2I that does happen.
> 
> Try to boot up with setting nvme core's default_ps_max_latency to 0.

Setting `nvme.default_ps_max_latency` on the Linux command line did not take effect. I guess the module is nvme-core, but I have no time to check now.

Setting it on the running system through sysfs worked (the parameter name ends in `_us`), but resume still didn’t work.

> Share your output from: #nvme id-ctrl /dev/nvm0n1 in that boot.

```
NVME Identify Controller:
vid     : 0x1c5c
ssvid   : 0x1c5c
sn      : FS6BN01071010B54P   
mn      : PC300 NVMe SK hynix 512GB               
fr      : 20004A00
rab     : 1
ieee    : ace42e
cmic    : 0
mdts    : 5
cntlid  : 0
ver     : 10200
rtd3r   : 90f560
rtd3e   : ea60
oaes    : 0
oacs    : 0x16
acl     : 3
aerl    : 3
frmw    : 0x16
lpa     : 0x2
elpe    : 254
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 361
cctemp  : 363
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
ps    0 : mp:5.87W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:2.40W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:1.90W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.1000W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0060W non-operational enlat:1000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
```

Bingo! (I tried below first.)

> Another possibility to look at this if it's APST would be break into your
> initramfs and to invoke S2I from the initramfs.  The root on /dev/nvmen0
> won't be mounted, so hopefully if it's a problem with the disk it shouldn't
> bring userspace down with it.

I removed the options and started with `ro init=/bin/sh no_console_suspend`. That gave the following output without the time-stamps.

```
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
nvme 0000:3c:00.0: Refused to change power state, currently in D3
nvme 0000:3c:00.0: Refused to change power state, currently in D3
nvme 0000:3c:00.0: Refused to change power state, currently in D3
nvme 0000:3c:00.0: Refused to change power state, currently in D3
nvme 0000:3c:00.0: Refused to change power state, currently in D3
nvme nvme0: Removing after probe failure status: -19
nvme0n1: detected capacity change from 512… to 0
ACPI: button: The lid device is not compliant to SW_LID.
psmouse serio1: synaptics: queried max coordinates: x [..5666], y [..4734]
psmouse serio1: synaptics: queried min coordinates: x [1276..], y [1118..]
```

After that several stack traces appeared.

Thorsten, Len, Rafael, what devices do you have?

Thank you Mario. The latest theory looks promising. I am off over the weekend though.
Comment 67 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-10-08 12:47:14 UTC
(In reply to Paul Menzel from comment #66)
> Thorsten, Len, Rafael, what devices do you have?

3c:00.0 Non-Volatile memory controller [0108]: Device [1c5c:1283]

For details see below. FWIW: After using S2Idle for a few days now (earlier I had used S3) with mainline, I got the impression that both S2Idle and S3 have worked unreliably on my machine since I switched to 4.14. Most of the time it resumes, but sometimes it doesn't (screen stays blank). I could not find a way to reproduce it (but I haven't tried hard yet) :-/ FWIW: S3 worked fine with 4.13. Will keep an eye on it.

$ sudo nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid     : 0x1c5c
ssvid   : 0x1c5c
sn      : FJ6BN19201CYCC32A   
mn      : PC300 NVMe SK hynix 256GB               
fr      : 20005A00
rab     : 1
ieee    : ace42e
cmic    : 0
mdts    : 5
cntlid  : 0
ver     : 10200
rtd3r   : 90f560
rtd3e   : ea60
oaes    : 0
oacs    : 0x16
acl     : 3
aerl    : 3
frmw    : 0x16
lpa     : 0x2
elpe    : 254
npss    : 4
avscc   : 0x1
apsta   : 0x1
wctemp  : 361
cctemp  : 363
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 1
acwu    : 0
sgls    : 0
subnqn  : 
ps    0 : mp:5.87W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:2.40W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:1.90W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.1000W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0060W non-operational enlat:1000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
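
For comparing drives, the non-operational power states (the low-power states that the APST theory from comment #65 is about) can be pulled out of `nvme id-ctrl` output like the above. This is only a minimal sketch: it assumes the text layout that nvme-cli prints in this thread, and the helper name `nonop_power_states` is made up for illustration.

```shell
# Hypothetical helper: list the non-operational power states from
# `nvme id-ctrl` text read on stdin, with entry/exit latency in microseconds.
nonop_power_states() {
    awk '
        $1 == "ps" && /non-operational/ {
            enlat = ""; exlat = ""
            for (i = 1; i <= NF; i++) {
                # Fields look like "enlat:1000"; keep the part after the colon.
                if ($i ~ /^enlat:/) { enlat = substr($i, 7) }
                if ($i ~ /^exlat:/) { exlat = substr($i, 7) }
            }
            print "ps" $2, "enlat=" enlat, "exlat=" exlat
        }
    '
}
```

Usage would be `sudo nvme id-ctrl /dev/nvme0n1 | nonop_power_states`. Fed the PC300 output above, it prints `ps3 enlat=1000 exlat=1000` and `ps4 enlat=1000 exlat=5000`.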
Comment 68 Paul Menzel 2017-10-09 09:16:15 UTC
Rafael, Mario, hopefully you have everything you need from my side and can come up with a fix. It’d also be great if one of you could get hold of the same drive as the one built into the device I have access to here.

Nevertheless, please tell me if you need anything else or if I can test patches.
Comment 69 Mario Limonciello 2017-10-09 13:18:43 UTC
I don't think we have all the details needed yet.

Paul I'm in particular having a hard time understanding your comment 66.  You mean that switching the APST latency on kernel command line did work to fix resume or not?  

From the past few comments I have to suspect that there was possibly a regression in the NVMe stack with relation to the Hynix SSD.

Within the last few releases there were changes to this stuff, and there are now a bunch of special-cased quirks.
Examples:
https://github.com/torvalds/linux/blob/master/drivers/nvme/host/core.c#L1698
https://github.com/torvalds/linux/blob/master/drivers/nvme/host/pci.c#L2287

It's entirely possible a similar quirk will be needed for Hynix SSD.

Paul can you please do the following bisection on your setup?
1) Start with 4.11-rc1
2) Set the mem sleep default using sysfs to s2idle.  If it doesn't exist in a given kernel, echo "freeze" into /sys/power/state.
3) Set keyboard wakeup.
4) Go to s2idle
5) Wakeup using keyboard (power button won't easily work for some kernels).

If you can bisect down to a given commit that would point closer to where the regression in NVMe stack was.  If 4.11-rc1 also reproduces it then it's not a regression in the NVMe stack, but it might still be a problem that needs to be quirked.
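
Step 2 above can be sketched roughly as follows. This is an illustration only: the function name `s2idle_command` is hypothetical, it only prints the command instead of suspending the machine, and the keyboard wakeup setup from step 3 is left out because the device path depends on the system.

```shell
# Hypothetical helper: print the command that requests suspend-to-idle.
# $1 is the sysfs power directory (default /sys/power); passing another
# directory allows a dry run without touching the real system.
s2idle_command() {
    power_dir=${1:-/sys/power}
    if [ -e "$power_dir/mem_sleep" ]; then
        # Kernels with mem_sleep: select s2idle as the "mem" variant, then suspend.
        echo "echo s2idle > $power_dir/mem_sleep && echo mem > $power_dir/state"
    else
        # Kernels without mem_sleep: "freeze" requests suspend-to-idle directly.
        echo "echo freeze > $power_dir/state"
    fi
}
```

Running `s2idle_command` and then executing the printed line as root performs steps 2 and 4 for whichever kernel is currently booted during the bisection.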
Comment 70 Paul Menzel 2017-10-09 15:15:07 UTC
(In reply to Mario Limonciello from comment #69)
> I don't think we have all the details needed yet.
> 
> Paul I'm in particular having a hard time understanding your comment 66. 
> You mean that switching the APST latency on kernel command line did work to
> fix resume or not?

No, it did *not* work. Today, I tried it again by setting `nvme.default_ps_max_latency_us=0` on the Linux kernel command line, and verifying that this value is stored in `/sys`.

> From the past few comments I have to suspect that there was possibly a
> regression in the NVMe stack with relation to the Hynix SSD.

At least to my particular model (512 GB) as it works with Thorsten’s 256 GB device.

> Within the last few releases there was changes to this stuff, and there are
> now a bunch of special cased quirks.
> Examples:
> https://github.com/torvalds/linux/blob/master/drivers/nvme/host/core.c#L1698
> https://github.com/torvalds/linux/blob/master/drivers/nvme/host/pci.c#L2287
> 
> It's entirely possible a similar quirk will be needed for Hynix SSD.

Well, it works with Microsoft Windows, doesn’t it?

> Paul can you please do the following bisection on your setup?
> 1) Start with 4.11-rc1
> 2) Set the mem sleep default using sysfs to s2idle.  If it doesn't exist in
> a given kernel, echo "freeze" into sys/power/state.
> 3) Set keyboard wakeup.
> 4) Go to s2idle
> 5) Wakeup using keyboard (power button won't easily work for some kernels).
> 
> If you can bisect down to a given commit that would point closer to where
> the regression in NVMe stack was.  If 4.11-rc1 also reproduces it then it's
> not a regression in the NVMe stack, but it might still be a problem that
> needs to be quirked.

I’ll do that test, but it’d be great if you got yourself such a Hynix SSD to test with.
Comment 71 Mario Limonciello 2017-10-09 15:43:22 UTC
> No, it did *not* work. Today, I tried it again by setting
> `nvme.default_ps_max_latency_us=0` on the Linux kernel command line, and
> verifying that this value is stored in `/sys`.

OK, that's unfortunate.  So it's not APST, but it is something else NVMe related.

> Well, it works with Microsoft Windows, doesn’t it?

So?  What's that got to do with a problem on the Linux kernel?  
That doesn't mean that there aren't quirks like those I linked above in the drivers used with Windows.  Also the exact codepaths and timing used in Windows definitely DO vary.

The Windows inbox NVMe stack runs power management differently.  Intel's RAID driver (Rapid start) makes different policy decisions than the Linux NVMe stack does.

I don't feel comparing this to Windows is useful.

> I’ll do that test, but it’d be great if you got yourself such a Hynix SSD to
> test with.

I'm looking around, but so far all of the SSDs I've got access to are Toshiba and Samsung.


@Rafael - can you add some of the NVMe folks to this issue?  I think they would be able to get to the bottom of it more effectively than we can.
Comment 72 Paul Menzel 2017-10-09 16:29:49 UTC
Created attachment 258775 [details]
Linux 4.11-rc1 messages (dmesg) from two suspends (s2idle)

(In reply to Paul Menzel from comment #70)
> (In reply to Mario Limonciello from comment #69)

[…]

> > Paul can you please do the following bisection on your setup?
> > 1) Start with 4.11-rc1
> > 2) Set the mem sleep default using sysfs to s2idle.  If it doesn't exist in
> > a given kernel, echo "freeze" into sys/power/state.
> > 3) Set keyboard wakeup.
> > 4) Go to s2idle
> > 5) Wakeup using keyboard (power button won't easily work for some kernels).
> > 
> > If you can bisect down to a given commit that would point closer to where
> > the regression in NVMe stack was.  If 4.11-rc1 also reproduces it then it's
> > not a regression in the NVMe stack, but it might still be a problem that
> > needs to be quirked.
> 
> I’ll do that test, […]

Building Linux 4.11-rc1, and testing s2idle as described by your steps, the power button LED doesn’t turn off. But the screen after a flicker goes black, and pressing a key wakes the system up. According to the attached Linux messages, the system actually slept, but I am not sure because the LED didn’t turn off.
Comment 73 Mario Limonciello 2017-10-09 16:34:05 UTC
LED not turning off with 4.11-rc1 is to be expected.  There was some code that was added between then and now that adjusts the behavior of the EC.  So that does mean there is an NVMe regression between 4.11-rc1 and now that caused this behavior.

Can you proceed with the bisect?
Comment 74 Rafael J. Wysocki 2017-10-09 21:20:24 UTC
The SSD in my 9360 is a Samsung one.
Comment 75 Rafael J. Wysocki 2017-10-09 21:47:05 UTC
@Paul: Since we already know that it is failing in 4.13, I would bisect the non-merge changes in drivers/nvme/ between 4.11-rc1 and 4.13 to start with.  That's around 240 commits, so it should take just a few steps.
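
As a rough check of "just a few steps": bisection halves the candidate range each round, so about ceil(log2(N)) builds are needed for N commits. The sketch below works that out for the figure of around 240 quoted above. The function name is made up, git's real step count can differ slightly, and the git commands are shown as comments because they need a kernel tree to run in.

```shell
# Rough upper bound on bisection steps: ceil(log2(N)) for N candidate commits.
bisect_steps() {
    n=$1
    steps=0
    span=1
    # Double the covered span until it reaches the number of candidates.
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Count the candidates and start a path-limited bisection (inside a kernel tree):
#   git rev-list --count --no-merges v4.11-rc1..v4.13 -- drivers/nvme/
#   git bisect start v4.13 v4.11-rc1 -- drivers/nvme/

bisect_steps 240   # prints 8
```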
Comment 76 Rafael J. Wysocki 2017-10-09 22:26:56 UTC
Keith (CCed) is advising me that default_ps_max_latency_us is a parameter of nvme_core, not nvme.

@Paul: Can you please test again with nvme_core.default_ps_max_latency_us=0 and see if that makes any difference?
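
Whether the parameter actually took effect can be read back from sysfs after boot. A minimal sketch, assuming the usual `/sys/module/<module>/parameters/` layout; the helper name `apst_latency_us` is hypothetical.

```shell
# Hypothetical helper: print the APST latency limit the running kernel uses.
# $1 is the sysfs module directory (default /sys/module), overridable so the
# function can be tried against a scratch directory.
apst_latency_us() {
    module_dir=${1:-/sys/module}
    param="$module_dir/nvme_core/parameters/default_ps_max_latency_us"
    if [ -r "$param" ]; then
        cat "$param"
    else
        # The file is missing if the parameter does not exist in this kernel.
        echo "unknown"
        return 1
    fi
}
```

If `apst_latency_us` prints `0` after booting with `nvme_core.default_ps_max_latency_us=0`, the setting was picked up. A parameter given under the wrong module name would not show up there.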
Comment 77 Paul Menzel 2017-10-10 01:05:47 UTC
(In reply to Rafael J. Wysocki from comment #76)
> Keith (CCed) is advising me that default_ps_max_latency_us is a parameter of
> nvme_core, not nvme.

Yes, I mentioned that in my comment.

> @Paul: Can you please test again with nvme_core.default_ps_max_latency_us=0
> and see if that makes any difference?

Sorry, my bad: I copied it from the laptop screen by typing it on a different computer. In comment #70 I actually did use `nvme_core`, so it should have read:

> No, it did *not* work. Today, I tried it again by setting
> `nvme_core.default_ps_max_latency_us=0` on the Linux kernel command line,
> and verifying that this value is stored in `/sys`.

Sorry for the confusion.
Comment 78 Chen Yu 2017-10-10 05:51:54 UTC
Here's my test result: I have a TOSHIBA 256GB NVMe drive, and s2i works well on top of 4.14-rc4.

NVME Identify Controller:
vid     : 0x1179
ssvid   : 0x1179
sn      : Y6TS10UGT18T        
mn      : THNSN5256GPUK NVMe TOSHIBA 256GB        
fr      : 5KDA4101
rab     : 1
ieee    : 00080d
cmic    : 0
mdts    : 0
cntlid  : 0
ver     : 0
rtd3r   : 0
rtd3e   : 0
oaes    : 0
oacs    : 0x17
acl     : 3
aerl    : 3
frmw    : 0x2
lpa     : 0x2
elpe    : 127
npss    : 4
avscc   : 0
apsta   : 0x1
wctemp  : 351
cctemp  : 355
mtfa    : 0
hmpre   : 0
hmmin   : 0
tnvmcap : 0
unvmcap : 0
rpmbs   : 0
sqes    : 0x66
cqes    : 0x44
nn      : 1
oncs    : 0x1e
fuses   : 0
fna     : 0
vwc     : 0x1
awun    : 255
awupf   : 0
nvscc   : 0
acwu    : 0
sgls    : 0
ps    0 : mp:6.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:2.40W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:1.90W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0120W non-operational enlat:5000 exlat:25000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0060W non-operational enlat:100000 exlat:70000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
Comment 79 Chen Yu 2017-10-10 09:54:03 UTC
It's a little weird that the 'devices' and 'platform' modes work, as the real suspend-to-idle does mostly the same thing as 'platform', at least for device drivers.
Comment 80 Keith Busch 2017-10-10 16:15:23 UTC
Okay, thanks for confirming the param. I thought it was likely a mistake since you mentioned it was confirmed in sysfs. I can't think of anything else particularly obvious that changed in the nvme driver that can account for this. Maybe something in pcie changed? That might make the bisect take more steps if we need to account for that.
Comment 81 Rafael J. Wysocki 2017-10-10 16:59:24 UTC
(In reply to Chen Yu from comment #79)
> It's a little weird that 'devices' and 'platforms' mode works, as the real
> suspend-to-idle did mostly the same thing as 'platform', at least for device
> drivers.

"platform" only worked once in a row, however.
Comment 82 Rafael J. Wysocki 2017-10-10 17:03:21 UTC
@Paul: We have a regression that broke s2idle for you between 4.11-rc1 and 4.13 and it appears to be related to NVMe.  I guess at this point the most straightforward way to find the problematic change is to carry out a bisection, especially since you seem to be able to reproduce the issue 100% of the time.
Comment 83 Paul Menzel 2017-10-12 12:46:40 UTC
(In reply to Rafael J. Wysocki from comment #82)
> @Paul: We have a regression that broke s2idle for you between 4.11-rc1 and
> 4.13 and it appears to be related to NVMe.  I guess at this point the most
> straightforward way to find the problematic change is to carry out a
> bisection, especially that you seem to be able to reproduce the issue 100%
> of the time.

Sorry, debugging this already brought me behind schedule with other tasks, so I won’t be able to get to this until the end of next week.

If you know of a service that builds a kernel from an uploaded config and a given revision, and provides the result through a repository, that would be great.
Comment 84 Rafael J. Wysocki 2017-10-12 15:29:08 UTC
(In reply to Paul Menzel from comment #83)
> (In reply to Rafael J. Wysocki from comment #82)
> > @Paul: We have a regression that broke s2idle for you between 4.11-rc1 and
> > 4.13 and it appears to be related to NVMe.  I guess at this point the most
> > straightforward way to find the problematic change is to carry out a
> > bisection, especially that you seem to be able to reproduce the issue 100%
> > of the time.
> 
> Sorry, debugging this already brought me behind schedule with other tasks,
> so I won’t be able to get to this until the end of next week.

That's OK, please take your time.

> If you know a service, building me uploaded configs and a revision, and
> providing that over a repository, that would be great.

You can look at SUSE's Open Build Service (OBS), but (disclaimer) I haven't used it myself for quite a while, so I'm not sure how suitable it really is.
Comment 85 Paul Menzel 2017-10-13 11:09:57 UTC
(In reply to Rafael J. Wysocki from comment #84)
> (In reply to Paul Menzel from comment #83)
> > (In reply to Rafael J. Wysocki from comment #82)
> > > @Paul: We have a regression that broke s2idle for you between 4.11-rc1
> > > and 4.13 and it appears to be related to NVMe.  I guess at this point the
> > > most straightforward way to find the problematic change is to carry out a
> > > bisection, especially that you seem to be able to reproduce the issue
> > > 100% of the time.
> > 
> > Sorry, debugging this already brought me behind schedule with other tasks,
> > so I won’t be able to get to this until the end of next week.
> 
> That's OK, please take your time.

Just a small update. The regression happened between 4.12 and 4.13-rc1.

> > If you know a service, building me uploaded configs and a revision, and
> > providing that over a repository, that would be great.
> 
> You can look at the SUSE's Open Build Service (OBS), but (disclaimer) I
> haven't used it myself for quite a while, so not sure how suitable that is
> really.

Also, Ubuntu builds each Linux kernel in some repository, don’t they? It’d be great if somebody knew how to access these packages easily.
Comment 86 Rafael J. Wysocki 2017-10-13 13:11:23 UTC
(In reply to Paul Menzel from comment #85)
> (In reply to Rafael J. Wysocki from comment #84)
> > (In reply to Paul Menzel from comment #83)
> > > (In reply to Rafael J. Wysocki from comment #82)
> > > > @Paul: We have a regression that broke s2idle for you between 4.11-rc1
> > > > and 4.13 and it appears to be related to NVMe.  I guess at this point the
> > > > most straightforward way to find the problematic change is to carry out a
> > > > bisection, especially that you seem to be able to reproduce the issue
> > > > 100% of the time.
> > > 
> > > Sorry, debugging this already brought me behind schedule with other
> > > tasks, so I won’t be able to get to this until the end of next week.
> > 
> > That's OK, please take your time.
> 
> Just a small update. The regression happened between 4.12 and 4.13-rc1.

Thanks!

> > > If you know a service, building me uploaded configs and a revision, and
> > > providing that over a repository, that would be great.
> > 
> > You can look at the SUSE's Open Build Service (OBS), but (disclaimer) I
> > haven't used it myself for quite a while, so not sure how suitable that is
> > really.
> 
> Also, Ubuntu builds each Linux kernel in some repository, don’t they. It’d
> be great if somebody knew, how to access these packages easily.

OBS could build Ubuntu packages too last time I looked at it, though.
Comment 87 Mario Limonciello 2017-10-13 13:59:07 UTC
> Also, Ubuntu builds each Linux kernel in some repository, don’t they. It’d be
> great if somebody knew, how to access these packages easily.

As noted on the Ubuntu kernel team wiki's page:
https://wiki.ubuntu.com/Kernel/MainlineBuilds
There is an archive with debs here:
http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D

I suppose it would be good to confirm that 4.12 works with that and 4.13-rc1 fails since it will be a different kernel config than you are using.
Comment 88 Paul Menzel 2017-10-23 16:50:25 UTC
Here is the result.

```
$ git bisect bad
8110dd281e155e5010ffd657bba4742ebef7a93f is the first bad commit
commit 8110dd281e155e5010ffd657bba4742ebef7a93f
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Fri Jun 23 15:24:32 2017 +0200

    ACPI / sleep: EC-based wakeup from suspend-to-idle on recent systems
    
    Some recent Dell laptops, including the XPS13 model numbers 9360 and
    9365, cannot be woken up from suspend-to-idle by pressing the power
    button which is unexpected and makes that feature less usable on
    those systems.  Moreover, on the 9365 ACPI S3 (suspend-to-RAM) is
    not expected to be used at all (the OS these systems ship with never
    exercises the ACPI S3 path in the firmware) and suspend-to-idle is
    the only viable system suspend mechanism there.
    
    The reason why the power button wakeup from suspend-to-idle doesn't
    work on those systems is because their power button events are
    signaled by the EC (Embedded Controller), whose GPE (General Purpose
    Event) line is disabled during suspend-to-idle transitions in Linux.
    That is done on purpose, because in general the EC tends to be noisy
    for various reasons (battery and thermal updates and similar, for
    example) and all events signaled by it would kick the CPUs out of
    deep idle states while in suspend-to-idle, which effectively might
    defeat its purpose.
    
    Of course, on the Dell systems in question the EC GPE must be enabled
    during suspend-to-idle transitions for the button press events to
    be signaled while suspended at all, but fortunately there is a way
    out of this puzzle.
    
    First of all, those systems have the ACPI_FADT_LOW_POWER_S0 flag set
    in their ACPI tables, which means that the OS is expected to prefer
    the "low power S0 idle" system state over ACPI S3 on them.  That
    causes the most recent versions of other OSes to simply ignore ACPI
    S3 on those systems, so it is reasonable to expect that it should not
    be necessary to block GPEs during suspend-to-idle on them.
    
    Second, in addition to that, the systems in question provide a special
    firmware interface that can be used to indicate to the platform that
    the OS is transitioning into a system-wide low-power state in which
    certain types of activity are not desirable or that it is leaving
    such a state and that (in principle) should allow the platform to
    adjust its operation mode accordingly.
    
    That interface is a special _DSM object under a System Power
    Management Controller device (PNP0D80).  The expected way to use it
    is to invoke function 0 from it on system initialization, functions
    3 and 5 during suspend transitions and functions 4 and 6 during
    resume transitions (to reverse the actions carried out by the
    former).  In particular, function 5 from the "Low-Power S0" device
    _DSM is expected to cause the platform to put itself into a low-power
    operation mode which should include making the EC less verbose (so to
    speak).  Next, on resume, function 6 switches the platform back to
    the "working-state" operation mode.
    
    In accordance with the above, modify the ACPI suspend-to-idle code
    to look for the "Low-Power S0" _DSM interface on platforms with the
    ACPI_FADT_LOW_POWER_S0 flag set in the ACPI tables.  If it's there,
    use it during suspend-to-idle transitions as prescribed and avoid
    changing the GPE configuration in that case.  [That should reflect
    what the most recent versions of other OSes do.]
    
    Also modify the ACPI EC driver to make it handle events during
    suspend-to-idle in the usual way if the "Low-Power S0" _DSM interface
    is going to be used to make the power button events work while
    suspended on the Dell machines mentioned above
    
    Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

:040000 040000 ed00fcd01bf8bb1db1026b16701996e5af257cf7 ca1507d67f8ac3b4d5e2fe9e0aba466cb52a1f15 M	drivers
```

Here is the bisection log.

```
$ git bisect log
# bad: [5771a8c08880cdca3bfb4a3fc6d309d6bba20877] Linux v4.13-rc1
# good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12
git bisect start 'v4.13-rc1' 'v4.12'
# bad: [e5f76a2e0e84ca2a215ecbf6feae88780d055c56] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect bad e5f76a2e0e84ca2a215ecbf6feae88780d055c56
# bad: [1849f800fba32cd5a0b647f824f11426b85310d8] Merge tag 'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 1849f800fba32cd5a0b647f824f11426b85310d8
# good: [cbcd4f08aa637b74f575268770da86a00fabde6d] Merge tag 'staging-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good cbcd4f08aa637b74f575268770da86a00fabde6d
# bad: [408c9861c6979db974455b9e7a9bcadd60e0934c] Merge tag 'pm-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 408c9861c6979db974455b9e7a9bcadd60e0934c
# good: [650fc870a2ef35b83397eebd35b8c8df211bff78] Merge tag 'docs-4.13' of git://git.lwn.net/linux
git bisect good 650fc870a2ef35b83397eebd35b8c8df211bff78
# good: [d62eb5edf6643ede7e48b4d03ba972c0e8949acc] Merge tag 'regulator-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
git bisect good d62eb5edf6643ede7e48b4d03ba972c0e8949acc
# bad: [8f8e5c3e2796eaf150d6262115af12707c2616dd] Merge branch 'acpi-pm'
git bisect bad 8f8e5c3e2796eaf150d6262115af12707c2616dd
# good: [301f8d7463b1f3d1fdb56ee1cb4abb674094531d] Merge branch 'pm-sleep'
git bisect good 301f8d7463b1f3d1fdb56ee1cb4abb674094531d
# good: [9a5f2c871af4cf6bd63ddb20061faa7049103350] Merge branches 'pm-domains', 'pm-avs' and 'powercap'
git bisect good 9a5f2c871af4cf6bd63ddb20061faa7049103350
# good: [d07ff6523b1ed24d636365f8479b0db70946dc14] Merge branch 'uuid-types'
git bisect good d07ff6523b1ed24d636365f8479b0db70946dc14
# bad: [a1a66393e39a97433bcc1737133ba7478993d247] ACPI / PM: Drop run_wake from struct acpi_device_wakeup_flags
git bisect bad a1a66393e39a97433bcc1737133ba7478993d247
# good: [ef884112e55c60d9e208b6524ae1841ae7e2fb2c] platform: x86: intel-hid: Wake up the system from suspend-to-idle
git bisect good ef884112e55c60d9e208b6524ae1841ae7e2fb2c
# bad: [8110dd281e155e5010ffd657bba4742ebef7a93f] ACPI / sleep: EC-based wakeup from suspend-to-idle on recent systems
git bisect bad 8110dd281e155e5010ffd657bba4742ebef7a93f
# first bad commit: [8110dd281e155e5010ffd657bba4742ebef7a93f] ACPI / sleep: EC-based wakeup from suspend-to-idle on recent systems
```
Comment 89 Mario Limonciello 2017-10-23 18:03:21 UTC
Thanks, I'm glad you did a bisect.  

Frankly that's really surprising to me (as I'm sure it will be to Rafael as well).  That particular commit is used to exercise an AML codepath that puts the EC into low power mode and when this happens the EC turns off the power LED.  The EC doesn't do anything to the NVMe disk, so this result is fairly odd to me.  

Did you by chance purchase your XPS 9360 with Windows and have a dual boot?  If so, it would be interesting to know if Windows is also failing in this area (which would indicate some most likely failing NVMe hardware).

From an ACPI and EC perspective the AML codepath is actually the same on Windows when the machine enters low power idle resiliency phase from this commit.
Comment 90 Rafael J. Wysocki 2017-10-23 20:26:51 UTC
Well, it may be surprising, but it doesn't leave a lot of room for what can be done to address it.

It looks like the 9360 should be blacklisted from using the "Low Power Idle S0" _DSM by default, which would also make it use S3 instead of suspend-to-idle by default.

That actually shouldn't hurt people as we know that 9360 *can* do S3.

I'll prepare a patch for that tomorrow.
Comment 91 Mario Limonciello 2017-10-23 20:29:14 UTC
Thanks Rafael, I agree that's the right solution for 4.14.  I'd still like to get to the bottom of this though on the 9360, so can we continue to keep this open to try to root cause?
Comment 92 Paul Menzel 2017-10-24 09:28:58 UTC
Unfortunately, reverting that patch is not easy on current master. Could you please provide a patch reverting this?

Also, skimming the commit [1], I do not see how it is just affecting the embedded controller, as the _DSM stuff seems to be a generic mechanism affecting all components [2].


[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8110dd281e155e5010ffd657bba4742ebef7a93f
[2] http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
    (No HTTPS.)
Comment 93 Mario Limonciello 2017-10-24 13:54:11 UTC
I don't believe reverting that patch is the proper solution.  As Rafael said we can just blacklist all XPS 9360s.  I wish we could just blacklist against your SSD + system combination as everyone else with a 9360 has had success, but I don't believe that's easily doable.
That particular patch *does* fix (at least) 3 other systems that I know about.
So until we know why your SSD + system combination has problems that's the best solution.

You'd have to walk down your ASL execution path to see the connection on affected devices.  The _DSM method on the PEPD device for the 9360 when called with the arguments in that patch will cause the GUAM method to be executed.  The GUAM method will send a message to the EC to go into low power mode or leave low power mode.  In low power mode is also when the power LED will be turned off.
Comment 94 Rafael J. Wysocki 2017-10-24 13:56:13 UTC
Created attachment 260361 [details]
ACPI / PM: Prevent some Dells XPS13 9360 from using Low Power S0 Idle

Reverting commits is not the only way to make things work. :-)

Can you please check if this causes your machine to go back to using S3?  If so, suspend-to-idle should work again, too.
Comment 95 Rafael J. Wysocki 2017-10-24 14:03:08 UTC
(In reply to Paul Menzel from comment #92)
> Unfortunately, reverting that patch is not easy on current master. Could you
> please provide a patch reverting this?
> 
> Also, skimming the commit [1], I do not see how it is just affecting the
> embedded controller, as the _DSM stuff seems to be a generic mechanism
> affecting all components [2].

That actually depends on how the _DSM is implemented and Mario said what it was really doing in comment #89.
Comment 96 Paul Menzel 2017-10-24 14:50:07 UTC
(In reply to Rafael J. Wysocki from comment #94)
> Created attachment 260361 [details]
> ACPI / PM: Prevent some Dells XPS13 9360 from using Low Power S0 Idle
> 
> Reverting commits is not the only way to make things work. :-)

I wanted to revert the patch on master to see if this fixes the issue, so the bisected commit is really the culprit.

> Can you please check if this causes your machine go back to using S3?  If
> so, suspend-to-idle should work again, too.

Sure, I’ll try to get to it as soon as possible.

Additionally, it looks like there is some ACPI debug message switch I need to toggle to get more information about what happens in the culprit commit.
Comment 97 Rafael J. Wysocki 2017-10-24 19:37:41 UTC
Yes, you probably mean /sys/power/pm_debug_messages
Comment 98 Paul Menzel 2017-10-25 11:20:46 UTC
Here is the log with your patch. Despite `echo 1 | sudo tee /sys/power/pm_debug_messages`, there do not seem to be any debug messages written. The power button LED doesn’t turn off, but that seems to be expected.

```
$ journalctl -k
[…]
Okt 25 13:14:55 Ixpees kernel: PM: suspend entry (s2idle)
Okt 25 13:14:55 Ixpees kernel: PM: Syncing filesystems ... done.
Okt 25 13:14:55 Ixpees kernel: PM: Preparing system for sleep (s2idle)
Okt 25 13:15:41 Ixpees kernel: Freezing user space processes ... (elapsed 0.002 seconds) done.
Okt 25 13:15:41 Ixpees kernel: OOM killer disabled.
Okt 25 13:15:41 Ixpees kernel: Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Okt 25 13:15:41 Ixpees kernel: PM: Suspending system (s2idle)
Okt 25 13:15:41 Ixpees kernel: Suspending console(s) (use no_console_suspend to debug)
Okt 25 13:15:41 Ixpees kernel: psmouse serio1: Failed to disable mouse on isa0060/serio1
Okt 25 13:15:41 Ixpees kernel: PM: suspend of devices complete after 2357.524 msecs
Okt 25 13:15:41 Ixpees kernel: PM: late suspend of devices complete after 22.002 msecs
Okt 25 13:15:41 Ixpees kernel: PM: suspend-to-idle
Okt 25 13:15:41 Ixpees kernel: PM: noirq suspend of devices complete after 39.717 msecs
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 9.019 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 15.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 14.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 2.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: noirq resume of devices complete after 45.780 msecs
Okt 25 13:15:41 Ixpees kernel: PM: resume from suspend-to-idle
Okt 25 13:15:41 Ixpees kernel: PM: early resume of devices complete after 2.958 msecs
Okt 25 13:15:41 Ixpees kernel: ACPI: button: The lid device is not compliant to SW_LID.
Okt 25 13:15:41 Ixpees kernel: usb 1-3: reset full-speed USB device number 2 using xhci_hcd
Okt 25 13:15:41 Ixpees kernel: PM: resume of devices complete after 336.913 msecs
Okt 25 13:15:41 Ixpees kernel: usb 1-3:1.0: rebind failed: -517
Okt 25 13:15:41 Ixpees kernel: usb 1-3:1.1: rebind failed: -517
Okt 25 13:15:41 Ixpees kernel: PM: Finishing wakeup.
Okt 25 13:15:41 Ixpees kernel: OOM killer enabled.
Okt 25 13:15:41 Ixpees kernel: Restarting tasks ... 
Okt 25 13:15:41 Ixpees kernel: audit: type=1400 audit(1508930141.379:127): apparmor="DENIED" operation="create" profile="/usr/sbin/cups-browsed" pid=948 comm="cups-browsed" fam
Okt 25 13:15:41 Ixpees kernel: done.
Okt 25 13:15:41 Ixpees kernel: [drm] RC6 on
Okt 25 13:15:41 Ixpees kernel: PM: suspend exit
[…]
```

Two remarks regarding the commit.

1.  Is it possible to add a message to the log saying that the quirk is applied?
2.  Please do not make the quirk depend on the firmware version. Keep in mind that if s2idle doesn’t work on a system, it can cause data loss when the user didn’t save their work before a suspend that used to work.
Comment 99 Paul Menzel 2017-10-25 11:23:50 UTC
3. Also, it’d be great to be able to configure at run time whether the quirk is applied. That way, people who do not need the quirk can easily override it without having to recompile their Linux kernel.
Comment 100 Paul Menzel 2017-10-25 11:29:23 UTC
(In reply to Rafael J. Wysocki from comment #97)
> Yes, you probably mean /sys/power/pm_debug_messages

Are you sure that is what gets messages like the one below printed?

```
               acpi_handle_debug(adev->handle, "_DSM function mask: 0x%x\n",
                                 bitmask);
```
Comment 101 Paul Menzel 2017-10-25 12:21:04 UTC
(In reply to Paul Menzel from comment #98)
> Here is the log with your patch. Despite `echo 1 | sudo tee
> /sys/power/pm_debug_messages`, there do not seem to be any debug messages
> written. The power button LED doesn’t turn off, but that seems to be
> expected.
> 
> ```
> $ journalctl -k
> […]
> Okt 25 13:14:55 Ixpees kernel: PM: suspend entry (s2idle)
> Okt 25 13:14:55 Ixpees kernel: PM: Syncing filesystems ... done.
> Okt 25 13:14:55 Ixpees kernel: PM: Preparing system for sleep (s2idle)
> Okt 25 13:15:41 Ixpees kernel: Freezing user space processes ... (elapsed
> 0.002 seconds) done.
> Okt 25 13:15:41 Ixpees kernel: OOM killer disabled.
> Okt 25 13:15:41 Ixpees kernel: Freezing remaining freezable tasks ...
> (elapsed 0.001 seconds) done.
> Okt 25 13:15:41 Ixpees kernel: PM: Suspending system (s2idle)
> Okt 25 13:15:41 Ixpees kernel: Suspending console(s) (use no_console_suspend
> to debug)
> Okt 25 13:15:41 Ixpees kernel: psmouse serio1: Failed to disable mouse on
> isa0060/serio1
> Okt 25 13:15:41 Ixpees kernel: PM: suspend of devices complete after
> 2357.524 msecs

Just another note: it took the system quite a while to suspend. Here you can see that almost a minute passed since 13:14:55.

> […]
> ```
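To put a number on "almost a minute", here is a minimal Python sketch (not part of the original report; the function names are made up for illustration). It takes a few lines copied from the log in comment 98, computes the gap between the first and last journal timestamps, and sums the `Timekeeping suspended` intervals. It assumes all stamps fall on the same day, and it does not settle where the time went: the journal stamps may reflect when the lines were recorded rather than when the kernel emitted them.

```python
import re
from datetime import datetime

# Lines copied from the journal excerpt in comment 98.
LOG = """\
Okt 25 13:14:55 Ixpees kernel: PM: suspend entry (s2idle)
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 9.019 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 15.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 14.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 2.999 seconds
Okt 25 13:15:41 Ixpees kernel: PM: suspend exit
"""


def wall_clock_gap(log):
    """Seconds between the first and last HH:MM:SS stamp (same day assumed)."""
    stamps = re.findall(r"\b(\d{2}:\d{2}:\d{2})\b", log)
    first = datetime.strptime(stamps[0], "%H:%M:%S")
    last = datetime.strptime(stamps[-1], "%H:%M:%S")
    return (last - first).total_seconds()


def suspended_total(log):
    """Sum of all 'Timekeeping suspended for N seconds' intervals."""
    values = re.findall(r"Timekeeping suspended for ([\d.]+) seconds", log)
    return sum(float(v) for v in values)


if __name__ == "__main__":
    print("wall-clock gap:", wall_clock_gap(LOG), "s")
    print("reported as suspended:", round(suspended_total(LOG), 3), "s")
```

Comparing the two figures shows how much of the gap the kernel itself reports as time spent suspended.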
Comment 102 Paul Menzel 2017-10-25 12:25:49 UTC
Created attachment 260399 [details]
Picture of Linux messages with Linux 4.14-rc6 built from Ubuntu

Here is a picture of a boot with Linux 4.14-rc6 as built by the Ubuntu Linux kernel team (that means *without* the revert commit), with `no_console_suspend init=/bin/bash` on the Linux command line, followed by the two commands below.

```
# echo 1 > /sys/power/pm_debug_messages
# echo mem > /sys/power/state
```

Note that the power button only woke up the system after I pressed it for several seconds. :(

The next picture shows the rest of the messages.
Comment 103 Paul Menzel 2017-10-25 12:26:52 UTC
Created attachment 260401 [details]
Picture of Linux messages with Linux 4.14-rc6 built from Ubuntu [2/2]
Comment 104 Rafael J. Wysocki 2017-10-25 14:26:22 UTC
(In reply to Paul Menzel from comment #98)
> Here is the log with your patch. Despite `echo 1 | sudo tee
> /sys/power/pm_debug_messages`, there do not seem to be any debug messages
> written. The power button LED doesn’t turn off, but that seems to be
> expected.
> 
> ```
> $ journalctl -k
> […]
> Okt 25 13:14:55 Ixpees kernel: PM: suspend entry (s2idle)
> Okt 25 13:14:55 Ixpees kernel: PM: Syncing filesystems ... done.
> Okt 25 13:14:55 Ixpees kernel: PM: Preparing system for sleep (s2idle)

Well, this is confusing, because this means suspend-to-idle is in use.

Did you trigger this with "echo mem > /sys/power/state"?

If so, can you please check what's there in /sys/power/mem_sleep with the patch applied after a fresh boot?

[I'm assuming that you tested 4.14-rc6 with the patch on top.]

> Okt 25 13:15:41 Ixpees kernel: Freezing user space processes ... (elapsed
> 0.002 seconds) done.
> Okt 25 13:15:41 Ixpees kernel: OOM killer disabled.
> Okt 25 13:15:41 Ixpees kernel: Freezing remaining freezable tasks ...
> (elapsed 0.001 seconds) done.
> Okt 25 13:15:41 Ixpees kernel: PM: Suspending system (s2idle)
> Okt 25 13:15:41 Ixpees kernel: Suspending console(s) (use no_console_suspend
> to debug)
> Okt 25 13:15:41 Ixpees kernel: psmouse serio1: Failed to disable mouse on
> isa0060/serio1
> Okt 25 13:15:41 Ixpees kernel: PM: suspend of devices complete after
> 2357.524 msecs
> Okt 25 13:15:41 Ixpees kernel: PM: late suspend of devices complete after
> 22.002 msecs
> Okt 25 13:15:41 Ixpees kernel: PM: suspend-to-idle
> Okt 25 13:15:41 Ixpees kernel: PM: noirq suspend of devices complete after
> 39.717 msecs
> Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 9.019 seconds
> Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 15.999 seconds
> Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 14.999 seconds
> Okt 25 13:15:41 Ixpees kernel: PM: Timekeeping suspended for 2.999 seconds
> Okt 25 13:15:41 Ixpees kernel: PM: noirq resume of devices complete after
> 45.780 msecs
> Okt 25 13:15:41 Ixpees kernel: PM: resume from suspend-to-idle
> Okt 25 13:15:41 Ixpees kernel: PM: early resume of devices complete after
> 2.958 msecs
> Okt 25 13:15:41 Ixpees kernel: ACPI: button: The lid device is not compliant
> to SW_LID.
> Okt 25 13:15:41 Ixpees kernel: usb 1-3: reset full-speed USB device number 2
> using xhci_hcd
> Okt 25 13:15:41 Ixpees kernel: PM: resume of devices complete after 336.913
> msecs
> Okt 25 13:15:41 Ixpees kernel: usb 1-3:1.0: rebind failed: -517
> Okt 25 13:15:41 Ixpees kernel: usb 1-3:1.1: rebind failed: -517
> Okt 25 13:15:41 Ixpees kernel: PM: Finishing wakeup.
> Okt 25 13:15:41 Ixpees kernel: OOM killer enabled.
> Okt 25 13:15:41 Ixpees kernel: Restarting tasks ... 
> Okt 25 13:15:41 Ixpees kernel: audit: type=1400 audit(1508930141.379:127):
> apparmor="DENIED" operation="create" profile="/usr/sbin/cups-browsed"
> pid=948 comm="cups-browsed" fam
> Okt 25 13:15:41 Ixpees kernel: done.
> Okt 25 13:15:41 Ixpees kernel: [drm] RC6 on
> Okt 25 13:15:41 Ixpees kernel: PM: suspend exit
> […]
> ```

OK, so it didn't crash and the power LED was not off, so the blacklisting seems to be effective, but then S3 should be the default suspend method too.

> Two remarks regarding the commit.
> 
> 1.  Is it possible to add a message to the log, that the quirk is applied?

Yes.

> 2.  Please do not make the quirk depend on the firmware version. Keep in
> mind, that s2idle doesn’t work on a system, it can cause data loss, when the
> user didn’t save the work before suspend, which worked before.

At this time I have no reason to believe that upgrading the firmware will not make the issue go away on your system.  At the same time, I want the quirk to only affect systems that need to be quirked.
Comment 105 Rafael J. Wysocki 2017-10-25 14:27:33 UTC
(In reply to Paul Menzel from comment #99)
> 3. Also, it’d be great to be able to run-time configure, if the quirk should
> be applied or not. That way, people, where the quirk is not needed can
> easily override it without having to recompile their Linux kernel.

OK, but let's first make sure that it actually works.
Comment 106 Rafael J. Wysocki 2017-10-25 14:29:11 UTC
(In reply to Paul Menzel from comment #100)
> (In reply to Rafael J. Wysocki from comment #97)
> > Yes, you probably mean /sys/power/pm_debug_messages
> 
> Are you sure, that is for getting messages like below printed?
> 
> ```
>                acpi_handle_debug(adev->handle, "_DSM function mask: 0x%x\n",
>                                  bitmask);
> ```

No, it isn't.

For that to be printed you need to enable dynamic debug in sleep.c.
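For reference, a minimal Python sketch of what enabling dynamic debug for one source file can look like. It assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted at /sys/kernel/debug; the helper names are hypothetical, and writing to the control file needs root.

```python
from pathlib import Path

# Usual location of the dynamic-debug control file (assumes debugfs is
# mounted at /sys/kernel/debug and CONFIG_DYNAMIC_DEBUG is enabled).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def dyndbg_query(source_file, enable=True):
    """Build a query that turns printing on (+p) or off (-p) for the
    debug call sites in one source file."""
    flag = "+p" if enable else "-p"
    return f"file {source_file} {flag}"


def apply_query(query, control=CONTROL):
    """Write the query to the control file (requires root)."""
    control.write_text(query + "\n")


if __name__ == "__main__":
    # Prints the query only; call apply_query() as root to activate it.
    print(dyndbg_query("sleep.c"))
```

The same query string can also be passed at boot through the `dyndbg=` kernel parameter, which helps when the messages of interest are printed before user space is up.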
Comment 107 Rafael J. Wysocki 2017-10-25 14:33:48 UTC
Created attachment 260403 [details]
ACPI / PM: Prevent some Dells XPS13 9360 from using Low Power S0 Idle

This updated version of the patch will print a message if the quirk is in effect.
Comment 108 Rafael J. Wysocki 2017-10-25 14:43:12 UTC
(In reply to Paul Menzel from comment #102)
> Created attachment 260399 [details]
> Picture of Linux messages with Linux 4.14-rc6 built from Ubuntu
> 
> Here is a picture of a boot with Linux 4.14-rc6 built by the Ubuntu Linux
> kernel team, that means *without* the revert commit,

What do you mean by "revert commit"?  The patch that I attached previously?

> `no_console_suspend init=/bin/bash` on the Linux command line, and then the
> two commands below.
> 
> ```
> # echo 1 > /sys/power/pm_debug_messages
> # echo mem > /sys/power/state
> ```
> 
> Note, that the power button only woke up the system after pressing it
> several seconds. :(

What state was it in after the wakeup?

> The next picture shows the rest of the messages.

Well, this clearly shows a problem with the nvme device whose power state cannot be changed back to D0, apparently during a device resume transition.

Unfortunately, one kernel configuration option is not set in the Ubuntu build and the messages from drivers/base/power/main.c don't show up.
Comment 109 Rafael J. Wysocki 2017-10-25 14:48:54 UTC
Please apply the patch from comment #107 on top of 4.14-rc6, boot it and verify the following:

(1) Whether or not the "Low Power S0 Idle interface disabled" (preceded by the device path) message is present in the dmesg output.
(2) Whether or not you see "deep" in /sys/power/mem_sleep

If (1) the message is present in the dmesg output and (2) you see "deep" in that file, please try to trigger system suspend with "echo mem > /sys/power/state" and let me know the outcome.
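The two checks can be scripted. The following is a minimal Python sketch, not something posted in the thread: the function names are made up, and it assumes the usual format of /sys/power/mem_sleep, where the selected variant is enclosed in square brackets (for example `s2idle [deep]`).

```python
def mem_sleep_modes(contents):
    """Parse the text of /sys/power/mem_sleep.

    Returns (available_modes, selected_mode); the selected mode is the
    one enclosed in square brackets, or None if none is marked.
    """
    available, selected = [], None
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return available, selected


def quirk_message_present(dmesg_text):
    """True if any kernel log line carries the quirk message."""
    needle = "Low Power S0 Idle interface disabled"
    return any(needle in line for line in dmesg_text.splitlines())


if __name__ == "__main__":
    with open("/sys/power/mem_sleep") as f:
        print(mem_sleep_modes(f.read()))
```

To cover check (1), feed `quirk_message_present()` the saved output of `dmesg`.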

Otherwise, I will need to look into the system I have access to, but that will only be possible when I'm back home from the conference I'm attending (that should be on Saturday).

@Mario: Can you please check the patch from comment #107 with the BIOS version line removed on your 9360?
Comment 110 Mario Limonciello 2017-10-25 20:34:47 UTC
@Rafael,

Sure.  Just tested on a 9360 with that patch on top of 4.14-rc6.

1) I confirmed this line is in dmesg output
2) /sys/power/mem_sleep has the value "deep"
3) Running that command does put it into S3.
Comment 111 Mario Limonciello 2017-10-25 21:13:33 UTC
@Keith,

You're still on CC.  Any thoughts about the nvme errors that were cropping up in OP's tests?  It's really suspicious that in s2idle it takes nearly a minute to go down:

Okt 25 13:15:41 Ixpees kernel: PM: suspend of devices complete after 2357.524 msecs

and the system log has this continually:

nvme 000:3c:00.0: Refused to change power state, currently in D3


Is runtime PM busted on this particular drive?
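To gauge how often the message shows up in a saved log, a small Python sketch along these lines may help. It is only an illustration: the helper name is made up, and the pattern is derived from the single line quoted above.

```python
import re

# Pattern based on the message quoted above; the device address is
# captured as-is, whatever form it takes in the log.
REFUSED = re.compile(
    r"nvme (\S+): Refused to change power state, currently in (\S+)"
)


def count_refusals(log_text):
    """Count 'Refused to change power state' lines per (device, state)."""
    counts = {}
    for line in log_text.splitlines():
        match = REFUSED.search(line)
        if match:
            key = (match.group(1), match.group(2))
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    import sys
    for (device, state), n in count_refusals(sys.stdin.read()).items():
        print(f"{device}: refused {n} time(s), stuck in {state}")
```

Piping `journalctl -k` or `dmesg` output into the script gives a per-device tally.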
Comment 112 Rafael J. Wysocki 2017-10-26 08:54:41 UTC
Well, because the problem appears to be not reproducible without the LPS0 _DSM, it looks like during the execution of that _DSM the platform does something to the SSD that confuses its firmware and then it cannot be put back into D0 by the OS in the usual way.

That might have been fixed in the new platform firmware (BIOS) version, however.
Comment 113 Rafael J. Wysocki 2017-10-26 08:57:34 UTC
Created attachment 260405 [details]
ACPI / PM: Prevent Dell XPS13 9360 from using Low Power S0 Idle

Paul, this is the patch tested by Mario.

Can you please test it and confirm that you see the same results?
Comment 114 Paul Menzel 2017-10-30 13:39:23 UTC
(In reply to Rafael J. Wysocki from comment #113)
> Created attachment 260405 [details]
> ACPI / PM: Prevent Dell XPS13 9360 from using Low Power S0 Idle
> 
> Paul, this is the patch tested by Mario.
> 
> Can you please test it and confirm that you see the same results?

Just a small note, that I won’t have access to the device until next Monday.

Also, if somebody provides the change-log for the newer firmware versions, then I’d be happy to update. Otherwise, I’d boycott Dell’s practices and stay with the firmware, and somebody can test.
Comment 115 Donald Buczek 2017-10-30 13:48:28 UTC
Thanks for your mail.

I'm out of office until  Nov 13, 2017

it? try helpdesk@molgen.mpg.de
Comment 116 Mario Limonciello 2017-10-30 13:52:10 UTC
> Also, if somebody provides the change-log for the newer firmware versions,
> then I’d happy to update. Otherwise, I’d boycott Dell’s practices and stay
> with the firmware, and somebody can test.

Sure, it's your own decision whether to update or not.  The changelog is typically posted with "Fixes and Enhancements".  Here's a recent example: 
http://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverId=JGCWT
Comment 117 Paul Menzel 2017-11-03 15:58:29 UTC
Created attachment 260495 [details]
Linux 4.14-rc7+ with patch messages (dmesg) with one suspend (deep)

(In reply to Rafael J. Wysocki from comment #113)
> Created attachment 260405 [details]
> ACPI / PM: Prevent Dell XPS13 9360 from using Low Power S0 Idle
> 
> Paul, this is the patch tested by Mario.
> 
> Can you please test it and confirm that you see the same results?

Yes, I see the same results. I have to re-read the other comments next week.

```
$ dmesg | grep disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000]   8 disabled
[    0.000000]   9 disabled
[    0.000000] ACPI: Early table checksum verification disabled
[    0.038488] AppArmor: AppArmor disabled by boot time parameter
[    0.514822] ACPI: \_SB_.PEPD: Low Power S0 Idle interface disabled
[    1.221256] audit: initializing netlink subsys (disabled)
[  300.962574] OOM killer disabled.
[  305.285188] thunderbolt 0000:03:00.0: 0:3: disabled by eeprom
[  305.285190] thunderbolt 0000:03:00.0: 0:4: disabled by eeprom
[  305.285191] thunderbolt 0000:03:00.0: 0:5: disabled by eeprom
[  305.285290] thunderbolt 0000:03:00.0: 0:8: disabled by eeprom
[  305.285291] thunderbolt 0000:03:00.0: 0:9: disabled by eeprom
[  305.285346] thunderbolt 0000:03:00.0: 0:b: disabled by eeprom
```

Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Comment 118 Rafael J. Wysocki 2017-11-06 11:35:47 UTC
OK, thanks.

I'm going to apply this in the spirit of not breaking existing setups, even though I have some serious doubts on the idea of running outdated BIOSes. :-)
Comment 119 Paul Menzel 2017-11-06 13:04:21 UTC
(In reply to Rafael J. Wysocki from comment #118)
> OK, thanks.
> 
> I'm going to apply this in the spirit of not breaking existing setups, even
> though I have some serious doubts on the idea of running outdated BIOSes. :-)

Thank you to you and Mario for following through with this.

Now I tried to update the firmware, but the fwupd shipped with Ubuntu 16.04.3 LTS, which Dell uses, fails for me [1].

[1] https://github.com/hughsie/fwupd/issues/303
Comment 120 Paul Menzel 2017-11-06 13:57:09 UTC
The problem is still reproducible with firmware version 2.3.1.

> DMI: Dell Inc. XPS 13 9360/0839Y6, BIOS 2.3.1 10/03/2017
Comment 121 Donald Buczek 2017-11-06 13:57:19 UTC
Thanks for your mail.

I'm out of office until  Nov 13, 2017

it? try helpdesk@molgen.mpg.de
Comment 122 Paul Menzel 2017-11-13 15:43:57 UTC
It looks like testing s2idle has now become cumbersome, as even with `echo s2idle | sudo tee /sys/power/mem_sleep` the “real” s2idle is not used and certain features(?) are deactivated. Is there a way to override that?

Also, should this bug be used to track the real issue, or only the regression, with a new bug created for the real issue?
Comment 123 Rafael J. Wysocki 2017-11-20 22:36:54 UTC
The regression as reported here was addressed by commit 71630b7a832f (ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360).

The underlying problem is firmware-related, as it can only be reproduced when the LP_S0 _DSM interface is used, so the commit above prevents your machine from using that _DSM.  However, that also causes s2idle to work in the "legacy" mode in which EC events are not enabled over suspend/resume, so the power button doesn't wake up the machine from it (the keyboard should still wake it up if enabled).

If you want to bypass the blacklist entry, apply this patch

https://patchwork.kernel.org/patch/10058593/

and run the kernel with acpi_sleep=nobl in the command line, but I don't recommend doing that, as the machine will most likely fail to resume from s2idle then.
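To confirm that the option actually reached the running kernel, one can look at /proc/cmdline. Here is a minimal Python sketch with hypothetical helper names; it assumes `acpi_sleep=` takes a comma-separated list of options, and it only checks the command line, not whether the kernel was built with the patch that understands `nobl`.

```python
def acpi_sleep_options(cmdline):
    """Collect the comma-separated values of every acpi_sleep= argument."""
    prefix = "acpi_sleep="
    options = []
    for arg in cmdline.split():
        if arg.startswith(prefix):
            options.extend(o for o in arg[len(prefix):].split(",") if o)
    return options


def blacklist_bypassed(cmdline):
    """True if acpi_sleep=nobl was passed on the kernel command line."""
    return "nobl" in acpi_sleep_options(cmdline)


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        print("nobl requested:", blacklist_bypassed(f.read()))
```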

I don't think it is useful to track it any more, as it is not likely to be addressed in the firmware (as it hasn't been addressed so far) and there is not enough information to address it in the kernel except for blacklisting the affected system(s), which has been done already.
Comment 124 Paul Menzel 2018-01-20 17:30:12 UTC
One late follow-up. I think it works with Microsoft Windows, so is it really a firmware issue?
Comment 125 Keith Busch 2018-01-20 17:41:23 UTC
Created attachment 273765 [details]
attachment-21170-0.html

Thank you for the message. I am travelling/working remotely in China until Jan 22. Responses will be delayed.
Comment 126 Rafael J. Wysocki 2018-01-21 02:00:28 UTC
(In reply to Paul Menzel from comment #124)
> One late follow-up. I think it works with Microsoft Windows, so is it really
> a firmware issue?

It is related to the firmware interaction with the NVMe device and we don't know what the interaction is exactly and how Windows deals with it, so the only way to address this is to avoid the issue altogether.
Comment 127 Paul Menzel 2018-02-13 10:47:44 UTC
Mario pointed me to bug 112121 [1], where the NVMe device also goes missing. In bug 112121, comment 10 [2], it says that patch [3] fixes that issue. Unfortunately, after applying that on Linux 4.16-rc1+ (commit 178e834c (Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)) and building the Linux kernel, the SK Hynix NVMe device still doesn’t come back up.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=112121
    "Some PCIe options cause devices to be removed after suspend"
[2] https://bugzilla.kernel.org/show_bug.cgi?id=112121#c10
[3] https://patchwork.kernel.org/patch/10212201/
Comment 128 Mario Limonciello 2018-11-08 16:22:12 UTC
@Paul Menzel,

In a non-public forum I heard of some other issues with a newer Hynix SSD (in a newer system) that reminded me of this.  The issues are being worked around by disabling runtime PM for the SSD but leaving APST enabled.  Would you be able to try that on your failing system in s2idle to see if it helps?
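One way to try this from user space is the per-device `power/control` attribute in sysfs, where writing `on` keeps the device from being runtime-suspended and `auto` allows it again. The Python sketch below is an illustration under assumptions: the helper names are made up, and the PCI address 0000:3c:00.0 is a guess that has to be replaced with the address of the SSD on the system under test. It leaves the APST settings alone.

```python
from pathlib import Path


def runtime_pm_control(pci_address, sysfs_root="/sys"):
    """Path of the runtime-PM control attribute for one PCI device."""
    return (Path(sysfs_root) / "bus" / "pci" / "devices"
            / pci_address / "power" / "control")


def disable_runtime_pm(pci_address, sysfs_root="/sys"):
    """Write 'on' so the device is never runtime-suspended (needs root)."""
    runtime_pm_control(pci_address, sysfs_root).write_text("on\n")


def enable_runtime_pm(pci_address, sysfs_root="/sys"):
    """Write 'auto' to let the kernel runtime-suspend the device again."""
    runtime_pm_control(pci_address, sysfs_root).write_text("auto\n")


if __name__ == "__main__":
    # Hypothetical address; check lspci for the real one.
    print(runtime_pm_control("0000:3c:00.0"))
```

The setting does not persist across reboots, so it has to be reapplied before each test run.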
Comment 130 Mario Limonciello 2018-11-08 17:09:15 UTC
Thanks for sharing.

@Paul if you confirm this helps your drive too that would be great.  It sounds like those patches might not be applied and if it fixes the issue for you too it's a good reason for them to be applied.

> We generally quirk for grave issues that make devices unusable.
> To me working around a D3 mode that consumes more power than D0
> does not quite fit that bill.
Comment 131 Mario Limonciello 2019-07-11 02:19:12 UTC
@Paul Menzel,

Can you please test Linus' tree with 71630b7a832f reverted?  There is a new patch series that was merged into 5.3 that adjusts the behavior of NVMe disks over suspend-to-idle.  The series merged included commit d916b1be94b6dc8d293abed2451f3062f6af7551.

I expect it should fix your disk problem and would appreciate your confirmation so we can remove this now hopefully unnecessary blacklist.
Comment 132 Keith Busch 2019-07-11 02:26:24 UTC
Created attachment 283605 [details]
attachment-13309-0.html

I am currently taking extended leave and will not have access to this account through the end of July.
Comment 133 Mario Limonciello 2019-09-26 15:30:54 UTC
@Paul Menzel,

Ping, I would like to follow up.  Can you please test 5.3 final with 71630b7a832f reverted?
Comment 134 Yorick van Pelt 2019-12-25 16:58:25 UTC
Hi! My XPS 9360 now uses s2idle, but does not reach s0ix, causing a significant battery drain. Maybe the blacklist removal should be reverted.
Comment 135 Paul Menzel 2019-12-25 17:34:19 UTC
(In reply to Yorick van Pelt from comment #134)
> Hi! My XPS 9360 now uses s2idle, but does not reach s0ix, causing a
> significant battery drain. Maybe the blacklist removal should be reverted.

Yorick, thank you for the report, but please create a new ticket, and upload all the details for your device. This issue is for the specific problem with the NVMe SSD in the device, which was fixed according to my test.
