Bug 60804

Summary: Baytrail-M & ILK mobile: Resume from S4 causes system reboot sporadically
Product: Power Management
Reporter: Feng, Cancan (cancan.feng)
Component: Hibernation/Suspend
Assignee: Lan Tianyu (tianyu.lan)
Status: CLOSED WILL_FIX_LATER
Severity: high
CC: aaron.lu, guang.a.yang, qingshuai.tian, tianyu.lan, yangweix.shui
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.11.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kernel loading log 1 at resume phase
kernel loading log 2 at resume phase
dmesg: Baytrail-M S4 reliability test reboot

Description Feng, Cancan 2013-08-28 01:42:15 UTC
System Environment:
--------------------------------------------
Kernel: (drm-intel-next-queued)30815646aadf5a45da2d6c664953acfac525e22e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 20 12:56:40 2013 +0100

    drm/i915: Don't destroy the vma placeholder during execbuffer reservation

Bug detail Description:
--------------------------------------------
This issue happens on an ILK mobile machine. The system suspends to disk successfully, but sporadically reboots while resuming from S4, about 1 in 5 times. I tried the 3.9, 3.8 and 3.6 kernels but could not find a good commit.

This issue also exists without i915 loaded. 

Steps to reproduce:
--------------------------------------------
1. Boot the machine
2. echo disk > /sys/power/state --> the system reboots
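
Because the failure is sporadic, the reproduction steps above are easier to judge when run in a loop with a persistent log. The following is only a sketch: the function name `s4_loop`, its arguments and the log format are hypothetical and not part of the original report. The state file can be overridden for a dry run.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): run COUNT hibernate
# cycles and record each one in LOG.
# The third argument defaults to the real sysfs node; point it at a scratch
# file to dry-run the loop without hibernating.
s4_loop() {
    count=$1
    log=$2
    state_file=${3:-/sys/power/state}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i: entering S4" >> "$log"
        sync    # flush the log to disk before the machine powers down
        echo disk > "$state_file" || return 1
        echo "cycle $i: resumed" >> "$log"
        i=$((i + 1))
    done
}

# Example (as root, on the machine under test):
#   s4_loop 100 /var/log/s4-loop.log
```

After an unexpected reboot, an "entering S4" line without a matching "resumed" line marks the failing cycle. Each cycle still needs a wake-up source, such as the power button or an RTC alarm.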
Comment 1 Lan Tianyu 2013-08-28 01:59:02 UTC
(In reply to Feng, Cancan from comment #0)
> System Environment:
> --------------------------------------------
> Kernel: (drm-intel-next-queued)30815646aadf5a45da2d6c664953acfac525e22e
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Aug 20 12:56:40 2013 +0100
> 
>     drm/i915: Don't destroy the vma placeholder during execbuffer reservation
> 
> Bug detail Description:
> --------------------------------------------
> This issue happens on ILK's mobile machine. System can suspend to disk
> successfully, but will reboot while resuming from S4 sporadically. This
> happens about 1 in 5 times. I tried 3.9, 3.8 and 3.6 kernel but can't find a
> good commit..
Please provide the output of acpidump.

Could you also provide some logs? Perhaps use a camera to capture the kernel log when the system reboots during resume.

> 
> This issue also exists without i915 loaded. 
> 
> Reproduce step:
> --------------------------------------------
> 1. booting up machine
> 2. echo disk > /sys/power/state --> system reboot
Comment 2 Feng, Cancan 2013-08-28 02:04:02 UTC
00:00.0 Host bridge [0600]: Intel Corporation Core Processor DRAM Controller [8086:0044] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02)
00:19.0 Ethernet controller [0200]: Intel Corporation 82577LM Gigabit Network Connection [8086:10ea] (rev 05)
00:1a.0 USB Controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b3c] (rev 05)
00:1b.0 Audio device [0403]: Intel Corporation 5 Series/3400 Series Chipset High Definition Audio [8086:3b56] (rev 05)
00:1c.0 PCI bridge [0604]: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 [8086:3b42] (rev 05)
00:1c.1 PCI bridge [0604]: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 [8086:3b44] (rev 05)
00:1c.2 PCI bridge [0604]: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3 [8086:3b46] (rev 05)
00:1c.3 PCI bridge [0604]: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 4 [8086:3b48] (rev 05)
00:1d.0 USB Controller [0c03]: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller [8086:3b34] (rev 05)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 Mobile PCI Bridge [8086:2448] (rev a5)
00:1f.0 ISA bridge [0601]: Intel Corporation Mobile 5 Series Chipset LPC Interface Controller [8086:3b07] (rev 05)
00:1f.2 RAID bus controller [0104]: Intel Corporation Mobile 82801 SATA RAID Controller [8086:282a] (rev 05)
00:1f.3 SMBus [0c05]: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller [8086:3b30] (rev 05)
00:1f.6 Signal processing controller [1180]: Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem [8086:3b32] (rev 05)
02:00.0 Network controller [0280]: Broadcom Corporation BCM4313 802.11b/g/n Wireless LAN Controller [14e4:4727] (rev 01)
03:00.0 CardBus bridge [0607]: Ricoh Co Ltd Device [1180:e476] (rev 02)
03:00.1 SD Host controller [0805]: Ricoh Co Ltd MMC/SD Host Controller [1180:e822] (rev 03)
03:00.4 FireWire (IEEE 1394) [0c00]: Ricoh Co Ltd FireWire Host Controller [1180:e832] (rev 03)
3f:00.0 Host bridge [0600]: Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers [8086:2c62] (rev 02)
3f:00.1 Host bridge [0600]: Intel Corporation Core Processor QuickPath Architecture System Address Decoder [8086:2d01] (rev 02)
3f:02.0 Host bridge [0600]: Intel Corporation Core Processor QPI Link 0 [8086:2d10] (rev 02)
3f:02.1 Host bridge [0600]: Intel Corporation Core Processor QPI Physical 0 [8086:2d11] (rev 02)
3f:02.2 Host bridge [0600]: Intel Corporation Core Processor Reserved [8086:2d12] (rev 02)
3f:02.3 Host bridge [0600]: Intel Corporation Core Processor Reserved [8086:2d13] (rev 02)
Comment 3 Aaron Lu 2013-08-28 02:07:54 UTC
Hello,

What is an ILK mobile machine? Is it a laptop?

Also, please try some basic debugging as described in https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt. Thanks.
Comment 4 Feng, Cancan 2013-08-28 03:14:13 UTC
(In reply to Aaron Lu from comment #3)
> Hello,
> 
> What is ILK mobile machine, is it a laptop?
> 
> Also, please try to do some basic debugging as described in
> https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt?
> Thanks.

Yes, it's a laptop. 

I followed the steps described on that page:

1. # echo reboot > /sys/power/disk
   # echo disk > /sys/power/state
   The system rebooted on the 2nd resume.

2. # echo devices > /sys/power/pm_test
   # echo platform > /sys/power/disk
   # echo disk > /sys/power/state

   In this test mode, I ran each of the freezer, devices, platform, processors and core levels 5 times; none of them failed.

3. # echo shutdown > /sys/power/disk
   # echo disk > /sys/power/state 
   The system rebooted on the 3rd resume.
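
The pm_test pass in step 2 can be scripted so that every level gets the same number of runs. This is a sketch under assumptions: the function name `run_pm_tests` and its output format are hypothetical, and the second argument exists only so the sequence can be dry-run against a scratch directory instead of /sys/power.

```shell
#!/bin/sh
# Hypothetical helper (not from the original comment): run each pm_test
# level RUNS times, using the same three writes as step 2 above.
run_pm_tests() {
    runs=$1
    power_dir=${2:-/sys/power}
    for level in freezer devices platform processors core; do
        n=1
        while [ "$n" -le "$runs" ]; do
            echo "$level" > "$power_dir/pm_test" || return 1
            echo platform > "$power_dir/disk" || return 1
            if ! echo disk > "$power_dir/state"; then
                echo "FAIL $level run $n"
                return 1
            fi
            echo "ok $level run $n"
            n=$((n + 1))
        done
    done
    # Leave test mode so that later hibernation attempts are real ones.
    echo none > "$power_dir/pm_test"
}

# Example (as root): run_pm_tests 5
```

Since all five levels passed here while real hibernation still failed, the sketch is mainly useful for ruling these stages out on other machines.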
Comment 5 Feng, Cancan 2013-08-28 04:59:30 UTC
(In reply to Lan Tianyu from comment #1)
> (In reply to Feng, Cancan from comment #0)
> > System Environment:
> > --------------------------------------------
> > Kernel: (drm-intel-next-queued)30815646aadf5a45da2d6c664953acfac525e22e
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Tue Aug 20 12:56:40 2013 +0100
> > 
> >     drm/i915: Don't destroy the vma placeholder during execbuffer
> reservation
> > 
> > Bug detail Description:
> > --------------------------------------------
> > This issue happens on ILK's mobile machine. System can suspend to disk
> > successfully, but will reboot while resuming from S4 sporadically. This
> > happens about 1 in 5 times. I tried 3.9, 3.8 and 3.6 kernel but can't find
> a
> > good commit..
> Please provide the output of acpidump.
> 
> Could you provide some logs? Maybe use a camera to shot the kernel log when
> system reboots during resuming.

It's hard to take a picture. How about I record a video and email it to you?
Comment 6 Lan Tianyu 2013-08-28 05:40:06 UTC
I have no good idea at the moment, so let's give it a try; maybe we can find some clues.
Comment 7 Feng, Cancan 2013-08-28 06:57:00 UTC
(In reply to Lan Tianyu from comment #6)
> Currently, I have no good idea. So let's have a try and maybe we could find
> some clues.

Hmm, I made a video but it's too big to send, so I captured two photos of the kernel loading phase of resume. A second later, the system reboots.
Comment 8 Feng, Cancan 2013-08-28 07:00:15 UTC
Created attachment 107341 [details]
kernel loading log 1 at resume phase
Comment 9 Feng, Cancan 2013-08-28 07:01:09 UTC
Created attachment 107342 [details]
kernel loading log 2 at resume phase
Comment 10 shui yangwei 2013-08-29 05:51:49 UTC
IVB: an Apple MacBook Pro also has this issue. I ran S4 in a loop on this machine, and it reboots within about 10 rounds.
Comment 11 Lan Tianyu 2013-09-09 02:50:18 UTC
I now have such a machine and am working on this bug.
Comment 12 shui yangwei 2013-09-12 01:19:50 UTC
The issue is reproducible on a Baytrail machine; it reboots after about 3 rounds of S4 in a loop.
Comment 13 Lan Tianyu 2013-09-12 01:27:55 UTC
OK. I got Feng Cancan's machine and installed a fresh Fedora 19. The issue occurred once after 108 S4 cycles. I will add some debug patches to the kernel since it is so hard to reproduce.

(In reply to shui yangwei from comment #12)
> Baytrail machine is reproduceable, it will reboot by loop running S4 about 3
> times.
Baytrail should have a serial port for debugging. Could you capture the log when it reboots?
Comment 14 shui yangwei 2013-09-12 02:58:21 UTC
Created attachment 108151 [details]
dmesg: Baytrail-M S4 reliability test reboot

(In reply to Lan Tianyu from comment #13)
> Ok. I get the Feng Cancan's machine and reinstall a fresh fedora 19. The
> issue occur once after 108 s4. I will prepare some debug patch into kernel
> since it's so hard to reproduce.
> 
> (In reply to shui yangwei from comment #12)
> > Baytrail machine is reproduceable, it will reboot by loop running S4 about
> 3
> > times.
> Byatrail should have a serial port to debug. Could you catch the log when it
> reboot?

OK, I have attached the dmesg here.
Comment 15 Lan Tianyu 2013-09-12 03:19:44 UTC
Please try the following patch.

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 0b78f72..e292def 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -742,10 +742,10 @@ static int software_resume(void)
        if (swsusp_resume_device)
                goto Check_image;
 
-       if (!strlen(resume_file)) {
-               error = -ENOENT;
-               goto Unlock;
-       }
+//     if (!strlen(resume_file)) {
+//             error = -ENOENT;
+//             goto Unlock;
+//     }
 
        pr_debug("PM: Checking hibernation image partition %s\n", resume_file);
Comment 16 shui yangwei 2013-09-13 01:48:21 UTC
(In reply to Lan Tianyu from comment #15)
> Please try the following patch.
> 
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index 0b78f72..e292def 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -742,10 +742,10 @@ static int software_resume(void)
>         if (swsusp_resume_device)
>                 goto Check_image;
>  
> -       if (!strlen(resume_file)) {
> -               error = -ENOENT;
> -               goto Unlock;
> -       }
> +//     if (!strlen(resume_file)) {
> +//             error = -ENOENT;
> +//             goto Unlock;
> +//     }
>  
>         pr_debug("PM: Checking hibernation image partition %s\n",
> resume_file);

I tested this patch on the latest -next-queued kernel from Daniel's tree. The machine does not reboot, but it hangs and becomes unreachable at about the 22nd round of the S4 loop.
Comment 17 Lan Tianyu 2013-09-13 01:50:07 UTC
Could you attach the log ?
Comment 18 shui yangwei 2013-09-13 02:21:39 UTC
(In reply to Lan Tianyu from comment #17)
> Could you attach the log ?

I rebooted the machine and found that it resumed from S4 and continued the reliability test, so I think it hung in the suspend part of S4. I saved the dmesg, but I don't know why it contains so few messages. I paste it below:


[  788.893800] [drm:ironlake_panel_vdd_off_sync], PP_STATUS: 0xabcd000f PP_CONTROL: 0x80000008
[  789.628751] ax88179_178a 1-6.2:1.0 enp0s20u6u2: ax88179 - Link status is: 1
[  789.881664] hpet_rtc_timer_reinit: 7 callbacks suppressed
[  789.885830] hpet1: lost 9599 rtc interrupts
[  790.212860] hpet1: lost 9599 rtc interrupts
[  790.558977] hpet1: lost 9599 rtc interrupts
[  790.885549] hpet1: lost 9600 rtc interrupts
[  791.159335] hpet1: lost 9600 rtc interrupts
[  791.432820] hpet1: lost 9600 rtc interrupts
[  791.709419] hpet1: lost 9600 rtc interrupts
[  792.035632] hpet1: lost 9599 rtc interrupts
[  792.382003] hpet1: lost 9599 rtc interrupts
[  792.708658] hpet1: lost 9600 rtc interrupts
Comment 19 Lan Tianyu 2013-09-13 02:39:43 UTC
From this log, it is an HPET issue and is not related to the PM core's hibernation code.
Furthermore, S4 still works after the reboot.
Comment 20 Aaron Lu 2013-09-13 03:09:51 UTC
(In reply to shui yangwei from comment #18)
> [  788.893800] [drm:ironlake_panel_vdd_off_sync], PP_STATUS: 0xabcd000f
> PP_CONTROL: 0x80000008
> [  789.628751] ax88179_178a 1-6.2:1.0 enp0s20u6u2: ax88179 - Link status is:
> 1
> [  789.881664] hpet_rtc_timer_reinit: 7 callbacks suppressed
> [  789.885830] hpet1: lost 9599 rtc interrupts
> [  790.212860] hpet1: lost 9599 rtc interrupts

Try unsetting CONFIG_HPET_EMULATE_RTC in your kernel config and see what happens.
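
Before rebuilding, it may help to confirm whether the option is set in the running kernel's config. A minimal sketch, assuming the distribution installs the config as /boot/config-$(uname -r) (a common convention, not something stated in this report); the function name `kconfig_enabled` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if OPTION is set to y or m in a kernel
# config file. The default path is an assumption about the distribution.
kconfig_enabled() {
    option=$1
    config_file=${2:-/boot/config-$(uname -r)}
    grep -q "^${option}=[ym]\$" "$config_file"
}

# Example:
#   kconfig_enabled CONFIG_HPET_EMULATE_RTC && echo "still enabled"
```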
Comment 21 Lan Tianyu 2013-09-13 07:24:25 UTC
Hi Yangwei:
       Could you check the machine's swap partition and test with the kernel parameter "resume=<swap partition>" (e.g. resume=/dev/sda3)?
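
One way to carry out this check is to compare the active swap partition with the resume= value on the kernel command line. The sketch below assumes the usual /proc/swaps and /proc/cmdline layouts; both function names are hypothetical, and each takes an optional file argument so it can be tried on sample input.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original comment).

# Print the first swap partition listed in a /proc/swaps-style file.
first_swap_partition() {
    awk 'NR > 1 && $2 == "partition" { print $1; exit }' "${1:-/proc/swaps}"
}

# Print the value of resume= from a /proc/cmdline-style file, if present.
resume_param() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^resume=//p'
}

# Example:
#   [ "$(first_swap_partition)" = "$(resume_param)" ] || echo "resume= missing or mismatched"
```

If resume_param prints nothing, the kernel was booted without the parameter and resume depends on whatever the initrd does.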
Comment 22 shui yangwei 2013-09-16 05:20:15 UTC
(In reply to Lan Tianyu from comment #21)
> Hi Yangwei:
>        Could you check the machine's swap partition and test with kernel
> param "resume=(swap partition e.g /dev/sda3)" ?

Yes, I tested it as you suggested: 120 rounds of S4 in a loop, and all passed. It seems this kernel parameter really works.
Comment 23 Lan Tianyu 2013-09-16 05:41:35 UTC
This proves the hibernation function works, since this parameter makes the kernel itself start the hibernation resume. Originally, the hibernation resume was triggered by the initrd. So I think the initrd was behaving abnormally and did not trigger the hibernation resume. The reboot is also suspicious and does not seem to be an ordinary reboot, because the log shows no reboot messages.
Comment 24 Guang Yang 2013-11-08 08:15:49 UTC
Tianyu, any update or ideas on this bug?
Comment 25 Lan Tianyu 2013-11-08 08:36:16 UTC
Sorry, I currently have no idea what user space did to trigger the abnormal reboot.
Comment 26 Lan Tianyu 2013-12-02 08:20:15 UTC
Since this bug is hard to root-cause and may be triggered by the BIOS (Baytrail-M is still under development and its BIOS is not stable), I am marking this bug as WILL_FIX_LATER.