Bug 42212

Summary: Asus UL80VT hangs on resume from suspend-to-ram
Product: Power Management
Component: Hibernation/Suspend
Status: CLOSED INVALID
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.6
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: Benjamin Robin (benjarobin+kernel)
Assignee: Aaron Lu (aaron.lu)
CC: aaron.lu, alan, florian, greg, maciej.rutecki, rjw, rui.zhang, stern
Bug Depends on:
Bug Blocks: 40982, 7216
Attachments: lspci -vv
Log after suspending with 3.0.3
Log after suspending with 3.0.4 (missing the suspend part, was not sync)
pm-suspend-3.0.4.log
pm-suspend-3.0.3.log
Log after suspending with 3.0.3
bisect from kernel version 3.0.3 to 3.0.4
Configuration of the running kernel 3.0.4
Test patch that replace the personality modification
Test2 patch that replace the personality modification
kernel 3.1 trace
kernel 3.2 trace
The log, start when booting...
The kernel config (3.6.7)
The modules loaded and blacklisted
The kernel log after a resume (3.7.9)
dmesg: init=/usr/bin/bash -> suspend -> resume
The 5 tests of basic PM debugging

Description Benjamin Robin 2011-09-01 21:49:04 UTC
Created attachment 71212 [details]
lspci -vv

Regression from Kernel Version 3.0.3.

I cannot resume the box after successful suspend with 3.0.4 with the
following configuration:
Comment 1 Benjamin Robin 2011-09-01 21:53:28 UTC
Created attachment 71232 [details]
Log after suspending with 3.0.3
Comment 2 Benjamin Robin 2011-09-01 21:58:05 UTC
Created attachment 71242 [details]
Log after suspending with 3.0.4 (missing the suspend part, was not sync)
Comment 3 Benjamin Robin 2011-09-01 21:59:44 UTC
Created attachment 71252 [details]
pm-suspend-3.0.4.log
Comment 4 Benjamin Robin 2011-09-01 22:00:13 UTC
Created attachment 71262 [details]
pm-suspend-3.0.3.log
Comment 5 Benjamin Robin 2011-09-01 22:10:00 UTC
Created attachment 71272 [details]
Log after suspending with 3.0.3
Comment 6 Benjamin Robin 2011-09-01 22:17:16 UTC
Linux benjarobin-asus 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 20:24:07 UTC 2011 i686 Genuine Intel(R) CPU U7300 @ 1.30GHz GenuineIntel GNU/Linux

The Nvidia card is shut down (I have tried with and without the module that disables the graphics card).
I have tried with and without Xorg started (Intel with KMS).
Comment 7 Greg Kroah-Hartman 2011-09-03 00:11:11 UTC
Any chance you can run 'git bisect' on the kernel to determine exactly which patch caused the problem between 3.0.3 and 3.0.4?
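
For readers unfamiliar with the workflow requested above, a manual bisect between the two stable releases might look like the sketch below. It assumes a clone of the stable kernel tree that carries the v3.0.3 and v3.0.4 tags; the helper name start_bisect and the default repository path are hypothetical, and every step still needs a manual build, boot, and suspend/resume test.

```shell
#!/bin/sh
# Sketch only: start a bisect between a known-good and a known-bad tag.
# start_bisect and the default repository path are hypothetical names.
start_bisect() {
    repo=${1:-.}          # path to the kernel git tree (assumed)
    bad=${2:-v3.0.4}      # first release where resume hangs
    good=${3:-v3.0.3}     # last release where resume works
    git -C "$repo" bisect start "$bad" "$good"
}

# After building and booting the commit that git checks out, report the result:
#   git bisect good    # resume worked
#   git bisect bad     # resume hung
# Repeat until git names the first bad commit, then save the steps:
#   git bisect log > bisect.log
#   git bisect reset
```

Because a failed resume hangs the machine, the good/bad verdict has to be entered by hand after each reboot rather than through an automated `git bisect run` script.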
Comment 8 Benjamin Robin 2011-09-03 07:34:13 UTC
I am not going to compile the kernel on this computer (the processor is too slow and the Internet connection is quite bad where I am).
However, I will be able to compile the patched kernel on my PC, which has a Core i7 and a high-speed Internet connection, but not before 5-6 Sept.
Comment 9 Benjamin Robin 2011-09-03 21:46:11 UTC
I managed to run the compilation through ssh on my main PC.
I could not believe the result of the bisect: the commit that produces the regression does not look related to my problem:

commit 512228f0be3af44bf5cf6cc5750ddd279bbedaf3
Author: Andi Kleen <ak@linux.intel.com>
Date:   Fri Aug 19 16:15:10 2011 -0700

Add a personality to report 2.6.x version numbers

I have attached the steps of the bisect. I compiled twice and checked that I was using the right kernel.
Comment 10 Benjamin Robin 2011-09-03 21:47:05 UTC
Created attachment 71612 [details]
bisect from kernel version 3.0.3 to 3.0.4
Comment 11 Florian Mickler 2011-09-04 15:51:54 UTC
Does reverting that patch help? Do you have that personality activated (i.e. what config are you using)?
Comment 12 Benjamin Robin 2011-09-04 18:27:27 UTC
Reverting that patch fixes the problem.

What I have done:
I use the ArchLinux configuration files (http://projects.archlinux.org/svntogit/packages.git/tree/repos/core-i686?h=packages/linux) and build using the PKGBUILD script. I just added one line to the PKGBUILD file, after the line that applies ftp://ftp.kernel.org/pub/linux/kernel/v3.0/patch-3.0.4.gz:

patch -Rp1 -i "${srcdir}/personality-report-2.6.patch"

My Kernel command line: root=/dev/disk/by-uuid/e77bc280-de5d-464d-8790-05b5f1598505 ro nouveau.modeset=0 quiet

Linux benjarobin-asus 3.0-ARCH #1 SMP PREEMPT Sun Sep 4 20:12:39 CEST 2011 i686 Genuine Intel(R) CPU U7300 @ 1.30GHz GenuineIntel GNU/Linux
Comment 13 Benjamin Robin 2011-09-04 18:28:49 UTC
Created attachment 71692 [details]
Configuration of the running kernel 3.0.4
Comment 14 Benjamin Robin 2011-09-04 19:31:32 UTC
Created attachment 71702 [details]
Test patch that replace the personality modification

Even more interesting: if we do not apply the "Add a personality to report 2.6.x version numbers" patch, but apply the attached patch instead (no modification to personality.h, only a minor modification to sys.c) ==>> resume still fails...

I can see only 2 possibilities: the "name" or "current" variable is NULL or an invalid pointer.
Comment 15 Benjamin Robin 2011-09-04 20:05:13 UTC
Created attachment 71712 [details]
Test2 patch that replace the personality modification

New test. This one is slightly different from the previous one: we are not using current->personality ==>> the test passes.

So maybe current is NULL. How can I check that? Something like:

/* Bail out early if the task pointer itself is NULL. */
if (current == NULL) {
  printk(KERN_CRIT "current is NULL\n");
  return 0;
}
Comment 16 Andi Kleen 2011-09-04 20:39:19 UTC
Hmm, but newuname uses personality already, at least on 64bit kernels
Also you should see an oops if a NULL pointer is followed. And I don't
see how the low level suspend code should be calling uname anyways.

This doesn't make much sense.

Just to be sure when you do the suspend on the unpatched kernel multiple
times in a row does it always fail?
Comment 17 Benjamin Robin 2011-09-04 20:45:19 UTC
Yes, I have tested many times (> 20) and it always fails to resume.

I ran another test and I am disappointed: resume fails with this function:

static int override_release(char __user *release, int len)
{
    int ret = 0;

    /* 'current' is a pointer, so print it with %p rather than %d. */
    printk(KERN_WARNING "*** Test 'current' value %p ***\n", current);
    if (current == NULL) {
        printk(KERN_CRIT "*** Test 'current' is NULL ***\n");
        return 0;
    }

    return ret;
}

It looks like merely reading the value of current is enough to make the resume fail.
Comment 18 Zhang Rui 2012-01-18 05:53:13 UTC
Does the problem still exist in the latest upstream kernel?
Comment 19 Benjamin Robin 2012-01-18 21:13:56 UTC
It looks like the problem does not exist anymore, BUT maybe the bug is just hidden: I did have some problems after suspend with the swapoff command (around kernel 3.1), but since kernel 3.1.6 I have not noticed any problem after a suspend :-)
Comment 20 Benjamin Robin 2012-02-12 11:46:07 UTC
I am adding this comment to request that this ticket be reopened and to explain what I discovered with kernel 3.1 a few months ago:

After upgrading to Linux kernel 3.1, suspend works properly with this kernel version compiled using these ArchLinux scripts: http://projects.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux&id=20e846c85c47e5593afb5d67d5fb8fc6907d727e

I am able to go to sleep and wake up without any problem. After a fresh start (without suspending) I can run swapoff -a without any problem. But if the PC goes to sleep with pm-suspend, then after the wake-up, running swapoff -a makes the kernel log a non-fatal error. (The kernel log will be attached as 'kernel_trace_3.1.log').

***************************************
Today with kernel 3.2 I noticed some random kernel errors that occur only after suspending, but it looks like they are still related to swap (The kernel log will be attached as 'kernel_trace_3.2.log').
Comment 21 Benjamin Robin 2012-02-12 11:47:49 UTC
Created attachment 72358 [details]
kernel 3.1 trace
Comment 22 Benjamin Robin 2012-02-12 11:48:05 UTC
Created attachment 72359 [details]
kernel 3.2 trace
Comment 23 Alan 2012-08-30 13:32:23 UTC
Please try 3.6 when it comes out - this has some relevant fixes
Comment 24 Zhang Rui 2012-11-28 07:21:13 UTC
ping
Comment 25 Benjamin Robin 2012-11-28 21:29:49 UTC
Hi everybody,

I ran a new test from a fresh install of ArchLinux with just the strict minimum of packages (using systemd) on my Asus UL80VT. This computer has 2 graphics cards (Intel + Nvidia).

The running kernel is: Linux ben 3.6.7-1-ARCH #1 SMP PREEMPT Mon Nov 19 09:11:44 CET 2012 i686 GNU/Linux

First I tried without blacklisting any kernel module, then I tried without KMS, i915, nouveau...
=> Same result in both cases and the same error in the log:

After a key press, the PC tries to resume from suspend. The screen is black: without KMS the backlight stays off, while with Intel KMS the backlight is on (but still nothing is displayed)...

And it looks like the kernel is alive, but there are traces / errors in the log...

Log and details will be attached.

Benjamin
Comment 26 Benjamin Robin 2012-11-28 21:32:49 UTC
Created attachment 87631 [details]
The log, start when booting...
Comment 27 Benjamin Robin 2012-11-28 21:33:42 UTC
Created attachment 87641 [details]
The kernel config (3.6.7)
Comment 28 Benjamin Robin 2012-11-28 21:34:20 UTC
Created attachment 87651 [details]
The modules loaded and blacklisted
Comment 29 Aaron Lu 2013-03-13 06:27:19 UTC
Looks like the kernel resumed fine, when restarting user space tasks, systemd crashed the kernel. Does this problem still exist in latest upstream kernel? Thanks.
Comment 30 Aaron Lu 2013-04-02 06:30:13 UTC
Hi benjarobin,

Any update?
Comment 31 Benjamin Robin 2013-04-02 19:01:52 UTC
(In reply to comment #30)
> Hi benjarobin,
> 
> Any update?

Hi,
Well, depending on the kernel version I can "resume" or not. In the best-case scenario, as with kernel 3.7.*, I can use the already started applications a little. If I try to start a new application, it may fail with a nice kernel error (no kernel panic). And if I am lucky I can shut down the PC properly.

But in the worst case, as with 3.8.4-1-ARCH, I just get a black screen (failed to resume?) and nothing in the log.

Each time, the tests were done with and without the KMS, intel and nouveau modules...

I really want to help, but for that I need some advice on how to debug it (I am a software developer working with embedded devices).
I think I can start with a kernel version where the resume "works" (I can get a working tty), but I don't know what to do with the kernel error...

Thanks.
Comment 32 Benjamin Robin 2013-04-02 19:40:36 UTC
Created attachment 97091 [details]
The kernel log after a resume (3.7.9)

As you can see, there is a "NULL pointer dereference" inside the kswapd0 kernel process. But I am sure that the bug is not in kswapd0, since after each kernel upgrade the problem is triggered by something else.
Comment 33 Aaron Lu 2013-04-03 06:01:40 UTC
Please boot with init=/bin/sh, and then suspend/resume, see if this works, thanks.
Comment 34 Benjamin Robin 2013-04-03 18:02:03 UTC
Created attachment 97231 [details]
dmesg: init=/usr/bin/bash -> suspend -> resume

I ran new tests as you asked, booting with init=/usr/bin/bash. I used pm-suspend...

If I do not load the intel (i915) driver, the backlight does not "resume": it is still off after resume. To get the log I ran this command:
 $ sync; pm-suspend ; sleep 2 ; dmesg &> /test.log ; sync

The log associated with this command is attached: same error as in the previous log "97091: The kernel log after a resume (3.7.9)"

If I load the intel (i915) driver, the backlight is on after resume, and I still get the exact same error.
Comment 35 Aaron Lu 2013-04-08 01:41:41 UTC
Hi Benjarobin,

Thanks for the test. I don't have any idea what happened here; perhaps you can try to bisect the problem. When bisecting, I suggest you always use init=/usr/bin/bash, and I think you can start by finding a kernel that can resume.
Comment 36 Aaron Lu 2013-04-09 06:27:06 UTC
BTW, I think you can follow Documentation/power/basic-pm-debugging.txt to identify what the problem is.

Start from freezer like this:
# cd /sys/power
# echo freezer > pm_test
# echo mem > state
And proceed to the next level until some problem occurs, this may give us some hints. Thanks.
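
The progression through the test levels can be scripted roughly as follows. This is a sketch under assumptions: run_pm_tests is a hypothetical helper, the level names are the ones listed in Documentation/power/basic-pm-debugging.txt, and the sysfs directory is a parameter so the loop can be dry-run against a scratch directory. On a real system it must run as root, and a hang at any level stops the loop at that point.

```shell
#!/bin/sh
# Sketch only: walk the pm_test levels from shallowest to deepest.
# run_pm_tests is a hypothetical helper name.
run_pm_tests() {
    power_dir=${1:-/sys/power}   # sysfs power directory (overridable for dry runs)
    for level in freezer devices platform processors core; do
        echo "testing pm_test level: $level"
        echo "$level" > "$power_dir/pm_test" || return 1
        # In test mode the kernel runs the suspend path up to $level, waits a
        # few seconds, then resumes without entering the real sleep state.
        echo mem > "$power_dir/state" || return 1
        # Inspect dmesg here before moving on to the next level.
    done
    # Restore normal suspend behaviour.
    echo none > "$power_dir/pm_test"
}
```

Calling `run_pm_tests` with no argument targets /sys/power; the last level printed before a hang or an error in dmesg is the one to report.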
Comment 37 Benjamin Robin 2013-04-09 17:26:59 UTC
Created attachment 97891 [details]
The 5 tests of basic PM debugging

Hi,

First of all, thanks a lot for your support.

For the bisect, I already did it (in 2011, take a look at the history of this bug report), and only found an irrelevant commit...
Maybe this is a hardware issue, but then why do I see it only with 32-bit Linux and after a suspend? I do not have any problems or errors in the log if I am not using suspend...

I followed the documentation, but every test was successful... The last one, "core", works without any problem.
I attached the log of these 5 tests.

I tried s2ram, or just "echo none > pm_test; echo mem > state" => same result and same error as in the previous log "97091: The kernel log after a resume (3.7.9)"

I tried updating the kernel to the latest release (3.8.6), and I got a kernel panic (numlock flashing) during resume: the backlight is off, so I cannot see the trace or log it to disk.
Maybe I can set up a UDP console, but I have no idea whether this is possible or whether this is too early to obtain anything...
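
A UDP console of the kind mentioned above is usually set up with the netconsole module. The sketch below only assembles the module parameter in the format described in the kernel's netconsole documentation; netconsole_param is a hypothetical helper, and every address, port, and interface name in the usage note is a placeholder. Whether messages emitted this early in resume reach the network on this machine is not established in this report.

```shell
#!/bin/sh
# Sketch only: build the netconsole= parameter string
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    src_port=$1; src_ip=$2; dev=$3
    tgt_port=$4; tgt_ip=$5; tgt_mac=$6
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}
```

For example, `modprobe netconsole "$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"` would load the module on the laptop with placeholder addresses, with a UDP listener running on the target host. Booting with the no_console_suspend kernel parameter keeps the console active across suspend, which may help more messages get out.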
Comment 38 Aaron Lu 2013-04-10 02:52:14 UTC
(In reply to comment #37)
> Created an attachment (id=97891) [details]
> The 5 tests of basic PM debugging
> 
> Hi,
> 
> First of all, thanks a lot for your support.
> 
> For the bisect, I already did it (in 2011, take a look at the history of this
> bug report), and only found a non relevant commit...

Yes, I saw that, but it looks to me like you are now having a different problem from the one you had back then...

> Maybe this is a hardware issue, but why I can only see it with a linux 32 bits
> and after a suspend. I do not have any problem or errors in the log if I am not
> using suspend...

This sounds relevant.

> 
> I did follow the documentation, but each test were successful... The last one
> "core" is working without any problem.
> I attached the log of these 5 tests.
> 
> I did try to use s2ram, or just "echo none > pm_test; echo mem > state" => same
> result and same error that previous log "97091: The kernel log after a resume
> (3.7.9)"
> 
> I did try to update the kernel to the latest release (3.8.6), and I did have a
> kernel panic (numlock flashing) during resume : backlight off so I cannot see
> the trace or log it to the disk.
> Maybe I can setup a UDP console, I have no idea if this is possible or this is
> too early to obtain something...

It feels like some memory is corrupted after a power cycle due to suspend. Can you try removing some memory and re-test?
Comment 39 Benjamin Robin 2013-04-10 19:25:38 UTC
I am deeply sorry... This is a hardware issue. I had been hoping since the beginning that it was a software bug that only hits 32-bit Linux. Windows and Linux x86_64 are working fine.

Why didn't I run these tests before (1.5 years...)?
 - Tried with only 2 GB of RAM (half of it), and suspend works in both cases: with only the first RAM module, and with only the second.

 - By default the Asus UL80VT is overclocked from 1300 MHz to 1733 MHz. If I disable the turbo option from Windows (there is no such option in the BIOS: I would need to contact support for that, I can only increase the frequency by 5%), suspend works fine, with no error in the kernel log...

Sorry again for the time wasted helping me, and thanks again for your support.
But may I ask one more question?
What is your advice: trying another RAM module? Or just losing 30% of performance if I want to use suspend, although for that I need to boot into Windows...
Or maybe you have better advice?

Regards,

Benjamin Robin
Comment 40 Aaron Lu 2013-04-11 01:02:24 UTC
(In reply to comment #39)
> But may I ask one more question.
> What it is your advice : Trying with another ram bar ? Or just losing 30% of
> performance if I want to use suspend, but for that I need to boot under
> Windows...
> Or maybe you have a better advice ?

Why not use 64-bit Linux? It's so common today that running 32-bit feels outdated :-)