Bug 80911 - 3.16-rcX crashes on resume from Suspend-To-RAM - Dell M4400
Summary: 3.16-rcX crashes on resume from Suspend-To-RAM - Dell M4400
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake
Hardware: All Linux
Importance: P1 normal
Assignee: Zhang Rui
URL: http://marc.info/?l=linux-kernel&m=14...
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-22 14:14 UTC by Zhang Rui
Modified: 2014-08-19 14:27 UTC

See Also:
Kernel Version: 3.16-rc1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Screenshot of stack trace after crash (3.81 MB, image/png)
2014-08-16 06:27 UTC, Markus Gutschke
Debug data requested by developers (4.00 MB, application/x-tar)
2014-08-16 09:45 UTC, Markus Gutschke
Applied patch to skip _WAK in platform pm_test mode (applied to v3.17-rc1) (77.14 KB, application/x-tar)
2014-08-18 08:14 UTC, Markus Gutschke
Unpatched 3.17-rc1 kernel testing (123.37 KB, text/plain)
2014-08-18 09:02 UTC, Markus Gutschke

Description Zhang Rui 2014-07-22 14:14:39 UTC
From: markus@gutschke.com

> My Dell M4400 has been pretty well-supported by Linux for a couple of 
> years now, but recent 3.16-rcX kernels cause hard crashes when resuming 
> from Suspend-to-RAM.
>
> This is tricky to debug, as device drivers are not yet restored by the 
> time that the crash happens. So, I can't use Page-UP to scroll the 
> screen and see the full crash information. I also cannot use the 
> netconsole; the ethernet device is still suspended. For similar 
> reasons, crash kernels don't seem to work either.
>
> After about a day of false starts and a lengthy bi-secting session, I 
> finally narrowed things down to this change list:
>
> eec15edbb0e14485998635ea7c62e30911b465f0 is the first bad commit 
> commit eec15edbb0e14485998635ea7c62e30911b465f0
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Fri May 30 04:23:01 2014 +0200
>
>     ACPI / PNP: use device ID list for PNPACPI device enumeration
>
>     ACPI can be used to enumerate PNP devices, but the code does not
>     handle this in the right way currently.  Namely, if an ACPI device
>     object
>      1. Has a _CRS method,
>      2. Has an identification of
>         "three capital characters followed by four hex digits",
>      3. Is not in the excluded IDs list,
>     it will be enumerated to PNP bus (that is, a PNP device object will
>     be create for it).  This means that, actually, the PNP bus type is
>     used as the default bus type for enumerating _HID devices in ACPI.
>
>     However, more and more _HID devices need to be enumerated to the
>     platform bus instead (that is, platform device objects need to be
>     created for them).  As a result, the device ID list in acpi_platform.c
>     is used to enforce creating platform device objects rather than PNP
>     device objects for matching devices.  That list has been continuously
>     growing recently, unfortunately, and it is pretty much guaranteed to
>     grow even more in the future.
>
>     To address that problem it is better to enumerate _HID devices
>     as platform devices by default.  To this end, change the way of
>     enumerating PNP devices by adding a PNP ACPI scan handler that
>     will use a device ID list to create PNP devices for the ACPI
>     device objects whose device IDs are present in that list.
>
>     The initial device ID list in the PNP ACPI scan handler contains
>     all of the pnp_device_id strings from all the existing PNP drivers,
>     so this change should be transparent to the PNP core and all of the
>     PNP drivers.  Still, in the future it should be possible to reduce
>     its size by converting PNP drivers that need not be PNP for any
>     technical reasons into platform drivers.
>
>     Signed-off-by: Zhang Rui <rui.zhang@intel.com>
>     [rjw: Rewrote the changelog, modified the PNP ACPI scan handler code]
>     Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>     Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
>
> :040000 040000 b7c07232aa46ae7b6faf9a907fb7274a02e4680f c2e05b31a61dccd087c554adecc89a43a1ed81f7 M drivers
> :040000 040000 4eda970292fffbeebe167f9210502527df4e8ab4 21e9e6fd84c780a34bf3d48b5e7618b551da3b1a M include
>
> I took a photo of the crash. It feels silly to do, but I couldn't 
> think of a better solution. You can find it at 
> https://drive.google.com/file/d/0B8SxqKDe4hyheTlTLXY2YThkMXM
>
> As I mentioned earlier, a bunch of information has already scrolled 
> off the screen, but hopefully what is visible is somewhat helpful.
>
> I will have only limited internet access the next couple of weeks. But 
> I wanted to make sure I at least got the result of the bisection out 
> to LKML. I will make my best effort to collect additional data, if 
> asked to do so; but some of it might be delayed for a little bit, 
> until I can get access to reasonably powerful hardware or reasonably 
> fast internet.
>
Comment 1 Rafael J. Wysocki 2014-07-29 15:18:59 UTC
Are we waiting for the reporter's response?
Comment 2 Tigran Gabrielyan 2014-08-02 20:29:01 UTC
Same issue here, not sure if related.

I have Samsung RV515 laptop, AMD E-350 chip with 6310 gpu. Resume always fails with black screen, no text.
Comment 3 Markus Gutschke 2014-08-16 06:27:38 UTC
Created attachment 146801 [details]
Screenshot of stack trace after crash
Comment 4 Markus Gutschke 2014-08-16 09:45:02 UTC
Created attachment 146811 [details]
Debug data requested by developers

This archive contains debugging information retrieved for the following kernel
versions:

455c6fdbd219161bd09b1165f11699d6d73de11c: Linux 3.14
1860e379875dfe7271c649058aeddffe5afd9d0d: Linux 3.15
aca0a4eb4e325914ddb22a8ed06fcb0222da2a26: Last good commit
eec15edbb0e14485998635ea7c62e30911b465f0: First bad commit
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f: Still bad (merge branch "acpi-enumeration")
b04c58b1ed26317bfb4b33d3a2d16377fc6acd0f.PATCHED: Patched with a debug patch provided by Rui

After eec15edbb0e14485998635ea7c62e30911b465f0, the kernel can no longer resume
from suspend. The crash looks like what is shown in "crash.png".

Even for defective kernels, "echo devices >/sys/power/pm_test" completes successfully,
whereas "echo platform >/sys/power/pm_test" triggers the crash.
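
The pm_test levels mentioned above can be stepped through in order to find the first stage that fails. The following is a minimal sketch, assuming a kernel built with CONFIG_PM_DEBUG; the function name `run_pm_test_levels` and the `POWER_DIR` override are illustrative and not part of any kernel tooling.

```shell
# Step through the suspend test levels, from least to most invasive.
# POWER_DIR defaults to the real sysfs directory; pointing it at a scratch
# directory gives a dry run that suspends nothing (hypothetical knob).
POWER_DIR="${POWER_DIR:-/sys/power}"

run_pm_test_levels() {
    for level in freezer devices platform processors core; do
        echo "testing pm_test level: $level"
        # Select the test level, then start a suspend cycle. In test mode the
        # kernel stops at the chosen stage, waits a few seconds, and resumes.
        echo "$level" > "$POWER_DIR/pm_test" || return 1
        echo mem > "$POWER_DIR/state" || return 1
    done
    # Restore normal behaviour so that the next suspend is a real one.
    echo none > "$POWER_DIR/pm_test"
}
```

Run as root, the last level printed before a hang or crash is the first failing stage; on the machine in this report that would be expected to be "platform".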

CONFIG_PM_TRACE_RTC suggests that the bug is caused by LNXSYBUS:00:, but that might be
incorrect.
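
For reference, a sketch of how the CONFIG_PM_TRACE_RTC result is usually read back. Tracing stores a hash of the last resume step in the RTC, and after the reboot the kernel logs lines containing "hash matches" for every source location or device whose hash fits. Several devices can share a hash, which is one reason the result might be incorrect. The helper name `pm_trace_matches` is hypothetical.

```shell
# Before the failing suspend (as root), tracing would be armed with:
#   echo 1 > /sys/power/pm_trace
# Note that this overwrites the RTC, so the clock is wrong after the reboot.

# After the reboot, filter a kernel log read from stdin for the trace results.
pm_trace_matches() {
    grep -F "hash matches"
}
```

Typical use after rebooting from the crash would be `dmesg | pm_trace_matches`.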

Please note that "dmesg" shows an early stack trace during boot. This might or might
not be related.

Please also note that the output from "acpidump" changes between kernel versions.

I also included the output from
  grep . /sys/bus/pnp/devices/*/firmware_node/*
  grep . /sys/bus/pnp/devices/*/*
  grep . /sys/bus/platform/devices/*/firmware_node/*
  grep . /sys/bus/platform/devices/*/*
for each of the kernels.
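
A sketch of how these dumps could be collected per kernel so that a good and a bad kernel can be compared with diff. The `dump_bus_devices` helper and the `SYSFS_ROOT` override are illustrative names, not existing tools.

```shell
# Dump every readable attribute of the devices on one bus as "path:value".
# SYSFS_ROOT is a hypothetical override so the helper can be tried on a copy.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

dump_bus_devices() {
    bus="$1"   # e.g. "pnp" or "platform"
    # Errors (directories, unreadable attributes) are expected and discarded.
    grep . "$SYSFS_ROOT"/bus/"$bus"/devices/*/firmware_node/* 2>/dev/null
    grep . "$SYSFS_ROOT"/bus/"$bus"/devices/*/* 2>/dev/null
    return 0
}
```

For example, `dump_bus_devices pnp > "pnp-$(uname -r).txt"` on each kernel, followed by a diff of the resulting files, would show which devices moved between the PNP bus and the platform bus.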

And the directory tree underneath /sys/bus/pnp/devices/ -> /sys/devices/pnp0/.
Comment 5 Tigran Gabrielyan 2014-08-16 16:13:23 UTC
I should add that after booting 3.16 and having it fail, when I boot into 3.14 my screen flickers until I turn my computer off and remove the battery for up to 30 minutes.
Comment 6 Tigran Gabrielyan 2014-08-18 01:52:50 UTC
OK, I tried 3.17-rc1 and got the same behavior, but then I tried something different: I removed `quiet` from the boot options so I could see at what point it fails, and now it boots up and suspends properly.

With the `quiet` option, it boots to a black screen.
Without `quiet`, it boots and suspends normally.
Comment 7 Markus Gutschke 2014-08-18 08:14:32 UTC
Created attachment 147001 [details]
Applied patch to skip _WAK in platform pm_test mode (applied to v3.17-rc1)

I applied a debugging patch provided by Rui. This patch skips _WAK in platform pm_test mode.

I had to pull a more recent kernel version, as the patch didn't apply to the older kernel tree that I had found through bi-section. I did all my tests with it applied to v3.17-rc1 7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9.

Without the patch, the kernel behaves a little bit differently from before. It now keeps the screen blank after a resume. So, I don't have a good way to tell whether it is really crashed and whether it has printed a stack trace. Or whether it is stuck somewhere else.

With the patch applied, I can now echo "platform" into "pm_test" and the system successfully runs through the test and then comes back.

Moreover, I can also successfully suspend and resume the system. Resumes sometimes take a while, and I keep seeing errors about USB devices not accepting addresses. I don't recall whether this happened with older kernels as well.

From what Rui said, I don't think this is a "correct" fix; but it should help shed some light on the problem.

I collected the same information as before. Please let me know what else you would like me to do.
Comment 8 Zhang Rui 2014-08-18 08:26:46 UTC
(In reply to Markus Gutschke from comment #7)
> Created attachment 147001 [details]
> Applied patch to skip _WAK in platform pm_test mode (applied to v3.17-rc1)
> 
> I applied a debugging patch provided by Rui. This patch skips _WAK in
> platform pm_test mode.
> 
> I had to pull a more recent kernel version, as the patch didn't apply to the
> older kernel tree that I had found through bi-section. I did all my tests
> with it applied to v3.17-rc1 7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9.
> 
> Without the patch, the kernel behaves a little bit differently from before.
> It now keeps the screen blank after a resume.

You mean resume from "platform" pm_test mode, right?

> So, I don't have a good way to
> tell whether it is really crashed and whether it has printed a stack trace.
> Or whether it is stuck somewhere else.
>
Please add the boot option "no_console_suspend" and check whether we can get more information from the screen.
 
> With the patch applied, I can now echo "platform" into "pm_test" and the
> system successfully runs through the test and then comes back.
> 
> Moreover, I can also successfully suspend and resume the system. Resumes
> sometimes take a while, and I keep seeing errors about USB devices not
> accepting addresses.

I don't see this in the dmesg output you attached.
Can you please try to unload the usb modules before suspend and reload them after resume and see if it helps?
Say, "echo platform > /sys/power/pm_test && modprobe -r uhci_hcd && modprobe -r ehci_hcd && echo mem > /sys/power/state && modprobe ehci_hcd && modprobe uhci_hcd"
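
The one-liner above leaves the USB modules unloaded if the suspend step fails. A slightly more defensive sketch of the same sequence follows; it only applies when uhci_hcd and ehci_hcd are built as modules, and the `DRY_RUN` switch and both function names are illustrative.

```shell
# Unload the USB host controller drivers, run one suspend cycle, reload them.
# DRY_RUN=1 prints the steps instead of executing them (hypothetical knob).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

suspend_without_usb() {
    level="${1:-none}"   # pm_test level, e.g. "platform"; "none" = real suspend
    # The redirections need a shell of their own so that DRY_RUN covers them.
    run sh -c "echo $level > /sys/power/pm_test" || return 1
    if run modprobe -r uhci_hcd && run modprobe -r ehci_hcd; then
        run sh -c 'echo mem > /sys/power/state'
        status=$?
    else
        status=1
    fi
    # Reload in reverse order even if unloading or suspending failed.
    run modprobe ehci_hcd
    run modprobe uhci_hcd
    return "$status"
}
```

`DRY_RUN=1 suspend_without_usb platform` prints the six steps without touching the system; as root and without DRY_RUN it performs them.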
Comment 9 Markus Gutschke 2014-08-18 09:02:19 UTC
Created attachment 147011 [details]
Unpatched 3.17-rc1 kernel testing

I tested with an *unpatched* 3.17-rc1 kernel and after setting the 
"no_console_suspend" command line option.

A "pm_test" test for "platform" completed successfully, but took a little while to wake up due to the USB problem ("not accepting address"). Thanks to the "no_console_suspend" option, I could see it print USB-related error messages to the screen.
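
Since the outcome in this thread depends on boot options such as "no_console_suspend" and "quiet", it can help to confirm what the running kernel was actually booted with. A small sketch; `has_boot_option` and the `CMDLINE_FILE` override are hypothetical names.

```shell
# Report whether a given option is present on the kernel command line.
# CMDLINE_FILE is a hypothetical override for trying this on a saved copy.
CMDLINE_FILE="${CMDLINE_FILE:-/proc/cmdline}"

has_boot_option() {
    # Split the command line into one option per line and match it exactly.
    tr ' ' '\n' < "$CMDLINE_FILE" | grep -qxF -- "$1"
}
```

For example, `has_boot_option no_console_suspend && echo "console stays on during suspend"`.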

I then tested a full suspend (i.e. "none" in "pm_test"). The first time round, it failed to actually suspend (not sure whether that's a new bug). But the second try worked properly.

Again, waking up took a good while because of repeated USB problems.

I then repeated both tests a few more times with essentially the same results. Sometimes, the system comes back quite quickly. At other times it takes a while and reports a couple of USB related errors first.

I could imagine that when I said earlier that the system just went to a black screen (before we gave the "no_console_suspend" option), it was merely hung trying to re-initialize USB, and I gave up waiting too early.

I haven't yet tested with unloading USB modules, as my kernel has USB built-in. I'll have to rebuild the Debian package and that takes a while; I'll try to get you that data in a little while after I get some sleep. Or feel free to leave me a message if you no longer want this data for some reason.

So, the good news is that v3.17-rc1 seems to fix the original problem -- even without needing your patch. There is a little bit of bad news left, as USB sometimes takes a long time to come back; but that's still a huge improvement over 3.15+. Thank you!

If it helps you, I could try to bi-sect when the kernel started working again. But that's a couple of hours of work, as building Debian kernels takes so annoyingly long. So, please let me know, if that information is at all helpful, or whether you don't really need it.
Comment 10 Zhang Rui 2014-08-18 09:26:31 UTC
(In reply to Markus Gutschke from comment #9)
> Created attachment 147011 [details]
> Unpatched 3.17-rc1 kernel testing
> 
> I tested with an *unpatched* 3.17-rc1 kernel and after setting the 
> "no_console_suspend" command line option.
> 
> A "pm_test" test for "platform" completed successfully, but took a little
> while to wake up due to the USB problem ("not accepting address"). Thanks to
> the "no_console_suspend" option, I could see it print USB related error
> messages to the screen.
> 
> I then tested a full suspend (i.e. "none" in "pm_test"). The first time
> round, it failed to actually suspend (not sure whether that's a new bug).
> But the second try worked properly.
> 
> Again, waking up took a good while because of repeated USB problems.
> 
> I then repeated both tests a few more times with essentially the same
> results. Sometimes, the system comes back quite quickly. At other times it
> takes a while and reports a couple of USB related errors first.
> 
> I could imagine that when I said earlier that the system just went to a
> black screen (before we gave the "no_console_suspend" option), it was merely
> hung trying to re-initialize USB, and I gave up waiting too early.
> 
> I haven't yet tested with unloading USB modules, as my kernel has USB
> built-in. I'll have to rebuild the Debian package and that takes a while;
> I'll try to get you that data in a little while after I get some sleep. Or
> feel free to leave me a message, if you no longer want this data for some
> reason.
> 
> So, the good news is that v3.17-rc1 seems to fix the original problem 

Can you please add no_console_suspend to the broken 3.15+ kernel and see what we get?
 
> even without needing your patch. There is a little bit of bad news left, as
> USB sometimes takes a long time to come back; but that's still a huge
> improvement over 3.15+. Thank you!
> 
> If it helps you, I could try to bi-sect when the kernel started working
> again. But that's a couple of hours of work, as building Debian kernels
> takes so annoyingly long. So, please let me know, if that information is at
> all helpful, or whether you don't really need it.

Yes, that would be helpful. As the git bisect result shows that I introduced the broken commit, I'm actually a little nervous about this issue. :p Thus it would be great if you could help me find the root cause of this problem and how it was fixed or worked around in the kernel.
Comment 11 Rafael J. Wysocki 2014-08-18 19:25:19 UTC
@Markus: It is not entirely clear to me whether or not you have tested 3.16 or 3.16.1, so have you done that?
Comment 12 Markus Gutschke 2014-08-18 20:20:40 UTC
@Rafael, I have not yet tested with 3.16, as I was under the impression that there hadn't been any relevant or interesting changes between the bisection point and 3.16.

But since we now know that the crash goes away somewhere between the bisection point and 3.17-rc1, I am in the process of bisecting when the bug disappeared. It'll take a couple of hours though. Even on a really beefy server, it takes a while to run "make deb-pkg" (partially because I can't figure out how to teach it not to build "linux-libc-dev"). If you happen to have an idea how to turn that off, I'd love to know.

Even when we know which commit made the crash go away (either by fixing it or by no longer triggering it), we are still stuck with the problem that USB now takes several timed-out attempts to come back from suspend. Let's discuss that once I have the results of the bisection.
Comment 13 Markus Gutschke 2014-08-19 05:03:59 UTC
After a lengthy bisection, I found the commit that fixes the crash: dee1592638ab7ea35a32179b73f9284dead49c03
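
Bisecting for a fix inverts the usual good/bad meaning, which git supports through custom terms. The sketch below demonstrates the idea on a throwaway repository where a script can judge each commit automatically; on the real kernel each step needs a build and a manual resume test, so the `git bisect broken` and `git bisect fixed` verdicts would be entered by hand. The function name `find_first_fixed` and the "status" file are illustrative only.

```shell
# Self-contained demonstration: find the first commit in which a problem is
# fixed, using custom bisect terms on a temporary repository.
find_first_fixed() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.email "bisect@example.invalid"
        git config user.name "Bisect Demo"
        # Commits 1-5 are broken; the fix lands in commit 6.
        for i in 1 2 3 4 5 6 7 8; do
            if [ "$i" -ge 6 ]; then echo fixed; else echo broken; fi > status
            echo "$i" > counter
            git add status counter
            git commit -q -m "commit $i"
        done
        first=$(git rev-list --max-parents=0 HEAD)
        git bisect start --term-old=broken --term-new=fixed >/dev/null
        git bisect fixed HEAD >/dev/null
        git bisect broken "$first" >/dev/null
        # Exit status 0 marks a commit with the old term ("broken"),
        # exit status 1 marks it with the new term ("fixed").
        git bisect run sh -c '! grep -q fixed status' >/dev/null
        # refs/bisect/fixed now points at the first fixed commit.
        git log -1 --format=%s refs/bisect/fixed
    )
    rm -rf "$repo"
}
```

Calling `find_first_fixed` prints the subject of the first fixed commit in the toy history.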

It looks as if this is a proper bug fix for this crash. So, that's good news.

Now I only need to figure out why 3.17-rc1 has trouble reliably resuming USB devices. It frequently needs multiple tries. And as far as I can tell, earlier kernels didn't have this problem.

But that's clearly a different bug, so I think this particular bug can be closed.

Thank you!
Comment 14 Rafael J. Wysocki 2014-08-19 12:56:15 UTC
Thanks, closing.

Fixed by dee1592638ab "ACPI / hotplug: Check scan handlers in acpi_scan_hot_remove()".
Comment 15 Zhang Rui 2014-08-19 14:27:29 UTC
This is really good news, thanks for your great effort, Markus!!!

BTW, Rafael, about commit dee1592638ab7ea35a32179b73f9284dead49c03, I'm a little confused.
First of all, the patch looks good to me; I agree we should check device->handler first, before checking device->handler->hotplug.demand_offline.

But to me, the problem should not be reproducible with the ACPI enumeration rework patches, say, in any 3.16-rc kernels, because we have a dummy handler even if ACPI_HOTPLUG_MEMORY is cleared, right?

Plus, I'm wondering why the crash happens in the first place, i.e. commit dee1592638ab7ea35a32179b73f9284dead49c03 does not explain why eec15edbb0e14485998635ea7c62e30911b465f0 is the first bad commit.
