Bug 16532

Summary: random suspend (S3) problems on Intel DG45ID board
Product: ACPI Reporter: Oleksij Rempel (fishor) (bug-track)
Component: Power-Sleep-WakeAssignee: acpi_power-sleep-wake
Status: CLOSED CODE_FIX    
Severity: high CC: lenb, maciej.rutecki, marogge, rjw, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 16055    
Attachments: dmesg 2.6.35-04792-ga649fd2
Dmesg
Output of lspci -vvv
dmesg with "GPU idle, missed IRQ"

Description Oleksij Rempel (fishor) 2010-08-07 08:55:51 UTC
Created attachment 27369 [details]
dmesg 2.6.35-04792-ga649fd2

I have problems to use suspend on my Intel DG45ID board. It seems to initiate suspend mode but will freeze on some stage, so it will stay with black screen and blinking cursor on left top corner.

I can't access the machine over ssh, and keyboard leds are not blinking. So there is no typical oops sings.

I can't make a bisection because it fail randomly. I didn't found good 100% reproduction case. Some times it will fail at first time some times only after reboot.

To reproduce is use:
echo mem > /sys/power/state

even with latest kernel, first suspend was good and second suspend failed. 

I get this two warning, may be it some how connected:
[    0.000000] ACPI Warning: 32/64 FACS address mismatch in FADT - two FACS tables! (20100428/tbfadt-369)
[    0.000000] ACPI Warning: 32/64X FACS address mismatch in FADT - 0xCBE61F40/0x00000000CBE67E40, using 32 (20100428/tbfadt-486)

Any tips are welcome.
Comment 1 Martin Rogge 2010-08-15 09:58:28 UTC
I find the same symptoms on my wife's Asus P50IJ laptop. Machine will sometimes lock up when entering suspend mode with empty screen and cursor in top left corner. Before 2.6.35 this problem did not exist - to be honest, maybe I have seen it once before, but it was so extremely rare that it could have been related to cosmic radiation.

Unfortunately even with 2.6.35 there is no safe way of triggering the issue which makes bisecting a bit of a gamble. 

Issue also confirmed with 2.6.35.1 and 2.6.35.2.

In the logs it looks like an ordinary suspend event only it is follwed by a boot sequence rather than the usual strange wake-up sequence. Here the snippets relating to a suspend event at 14:13:15.

/var/log/messages:
Aug 14 14:08:44 billy -- MARK --
Aug 14 14:13:15 billy logger: ACPI action sleep is not defined
Aug 14 14:13:15 billy dhcpcd: sending signal 1 to pid 2107
Aug 14 14:13:15 billy dhcpcd: received SIGHUP, releasing lease
Aug 14 14:13:15 billy dhcpcd: wlan0: removing interface
Aug 14 14:13:15 billy dhcpcd: waiting for pid 2107 to exit
Aug 14 14:13:15 billy kernel: cfg80211: Calling CRDA to update world regulatory domain
Aug 14 14:13:15 billy kernel: cfg80211: World regulatory domain updated:
Aug 14 14:13:15 billy kernel:     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
Aug 14 14:13:15 billy kernel:     (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 14:13:15 billy kernel:     (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
Aug 14 14:13:15 billy kernel:     (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
Aug 14 14:13:15 billy kernel:     (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 14:13:15 billy kernel:     (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 14:17:12 billy syslogd 1.4.1: restart.

/var/log/syslog:
Aug 14 12:33:40 billy kernel: ACPI Error: Evaluating _BCM failed (20100428/video-532)
Aug 14 14:13:15 billy dhcpcd: dhcpcd not running
Aug 14 14:17:12 billy kernel: ACPI: RSDP 00000000000f93c0 00024 (v02 ACPIAM)

/var/log/debug:
Aug 14 11:50:01 billy kernel: wlan0: associated
Aug 14 14:13:15 billy kernel: wlan0: deauthenticating from xx:xx:xx:xx:xx:xx by local choice (reason=3)
Aug 14 14:17:12 billy kernel: e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
Comment 2 Martin Rogge 2010-08-15 10:02:14 UTC
Created attachment 27448 [details]
Dmesg
Comment 3 Martin Rogge 2010-08-15 10:02:52 UTC
Created attachment 27449 [details]
Output of lspci -vvv
Comment 4 Martin Rogge 2010-08-15 10:08:08 UTC
btw, when I talk about the "usual strange wake-up sequence" above I mean the fact that certain suspend activities are reported AFTER wake-up. Maybe it just an issue of flushing the printk buffer? Or is it related to the issue?

Check the time stamps below:

Aug 14 15:56:04 billy logger: ACPI action sleep is not defined
Aug 14 15:56:05 billy dhcpcd: sending signal 1 to pid 1419
Aug 14 15:56:05 billy dhcpcd: received SIGHUP, releasing lease
Aug 14 15:56:05 billy dhcpcd: wlan0: removing interface
Aug 14 15:56:05 billy dhcpcd: waiting for pid 1419 to exit
Aug 14 15:56:05 billy kernel: cfg80211: Calling CRDA to update world regulatory domain
Aug 14 15:56:05 billy kernel: cfg80211: World regulatory domain updated:
Aug 14 15:56:05 billy kernel:     (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
Aug 14 15:56:05 billy kernel:     (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 15:56:05 billy kernel:     (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
Aug 14 15:56:05 billy kernel:     (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
Aug 14 15:56:05 billy kernel:     (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 15:56:05 billy kernel:     (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
Aug 14 15:56:05 billy kernel: PM: Syncing filesystems ... done.
Aug 14 16:06:25 billy kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Aug 14 16:06:25 billy kernel: sd 0:0:0:0: [sda] Stopping disk
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.2: PCI INT C disabled
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.1: PCI INT B disabled
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: ath9k 0000:02:00.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.2: PCI INT D disabled
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.1: PCI INT B disabled
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: i915 0000:00:02.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: ATL1E 0000:03:00.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: ehci_hcd 0000:00:1d.7: PCI INT A disabled
Aug 14 16:06:25 billy kernel: ehci_hcd 0000:00:1a.7: PCI INT C disabled
Aug 14 16:06:25 billy kernel: ahci 0000:00:1f.2: PCI INT B disabled
Aug 14 16:06:25 billy kernel: HDA Intel 0000:00:1b.0: PCI INT A disabled
Aug 14 16:06:25 billy kernel: PM: suspend of devices complete after 131.760 msecs
Aug 14 16:06:25 billy kernel: PM: late suspend of devices complete after 21.463 msecs
Aug 14 16:06:25 billy kernel: ACPI: Preparing to enter system sleep state S3
Aug 14 16:06:25 billy kernel: PM: Saving platform NVS memory
Aug 14 16:06:25 billy kernel: CPU 1 is now offline
Aug 14 16:06:25 billy kernel: SMP alternatives: switching to UP code
Aug 14 16:06:25 billy kernel: Extended CMOS year: 2000
Aug 14 16:06:25 billy kernel: PM: Restoring platform NVS memory
Aug 14 16:06:25 billy kernel: Extended CMOS year: 2000
Aug 14 16:06:25 billy kernel: SMP alternatives: switching to SMP code
Aug 14 16:06:25 billy kernel: Booting Node 0 Processor 1 APIC 0x1
Aug 14 16:06:25 billy kernel: ACPI: Waking up from system sleep state S3
Aug 14 16:06:25 billy acpid: client 1548[0:0] has disconnected 
Aug 14 16:06:25 billy kernel: PM: early resume of devices complete after 1.034 msecs
Aug 14 16:06:25 billy kernel: i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.1: PCI INT B -> GSI 21 (level, low) -> IRQ 21
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1a.2: PCI INT D -> GSI 19 (level, low) -> IRQ 19
Aug 14 16:06:25 billy kernel: ehci_hcd 0000:00:1a.7: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Aug 14 16:06:25 billy kernel: HDA Intel 0000:00:1b.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 23 (level, low) -> IRQ 23
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
Aug 14 16:06:25 billy kernel: uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
Aug 14 16:06:25 billy kernel: ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 23 (level, low) -> IRQ 23
Aug 14 16:06:25 billy kernel: ahci 0000:00:1f.2: PCI INT B -> GSI 19 (level, low) -> IRQ 19
Aug 14 16:06:25 billy kernel: ATL1E 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Aug 14 16:06:25 billy kernel: ath9k 0000:02:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
Aug 14 16:06:25 billy kernel: sd 0:0:0:0: [sda] Starting disk
Aug 14 16:06:25 billy kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 14 16:06:25 billy kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 14 16:06:25 billy kernel: ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata1.00: configured for UDMA/133
Aug 14 16:06:25 billy kernel: ata2.00: ACPI cmd ef/10:03:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata2.00: ACPI cmd ef/10:03:00:00:00:a0 (unknown) filtered out
Aug 14 16:06:25 billy kernel: ata2.00: configured for UDMA/100
Aug 14 16:06:25 billy kernel: usb 6-1: reset low speed USB device using uhci_hcd and address 2
Aug 14 16:06:25 billy kernel: usb 6-2: reset low speed USB device using uhci_hcd and address 3
Aug 14 16:06:25 billy kernel: PM: resume of devices complete after 1371.560 msecs
Aug 14 16:06:25 billy kernel: video LNXVIDEO:00: Restoring backlight state
Aug 14 16:06:25 billy logger: ACPI group processor / action LNXCPU:00 is not defined
Aug 14 16:06:25 billy logger: ACPI group processor / action LNXCPU:01 is not defined
Aug 14 16:06:25 billy acpid: client connected from 1548[0:0] 
Aug 14 16:06:25 billy acpid: 1 client rule loaded 
Aug 14 16:06:25 billy logger: ACPI group battery / action PNP0C0A:00 is not defined
Aug 14 16:06:25 billy logger: ACPI group ac_adapter / action ACPI0003:00 is not defined
Aug 14 16:06:32 billy dhcpcd: version 5.2.2 starting
Comment 5 Zhang Rui 2010-08-16 02:25:04 UTC
well, it's hard to debug a random suspend problem.

please boot with init=/bin/bash, and check if the problem can be reproduced in this case.
Comment 6 Martin Rogge 2010-08-18 18:29:07 UTC
The problem is somewhat random, but this is what I observed earlier: I bootet with init=/bin/bash, and I was able to execute the sleep cycle over 20 times (I didn't really count). After reboot into run level 3 (this time using /sbin/init) I could do the same. I switched to run level 4 (xorg), and when pressing the sleep button for the first time key the system froze.

So this might indicate the problem is to be looked for in the area of drm, video drivers etc.?
Comment 7 Len Brown 2010-08-31 02:26:25 UTC
Martin,
Can you see if any recent i915 driver changes caused this issue to appear?
Comment 8 Martin Rogge 2010-09-13 19:51:12 UTC
(In reply to comment #7)

First off, apologies for the delay which is due to my holiday.

> Martin,
> Can you see if any recent i915 driver changes caused this issue to appear?

With difficulty... When I first noticed the issue the laptop was on xf86-video-intel-2.12.0 and xorg-server-1.7.7. Since the i915 driver is currently a moving target, and since I am troubleshooting unrelated i915 issues on another machine, I installed some existing self-compiled packages on the laptop. Thus I am now on the development version of xf86-video-intel from three weeks ago, compiled for and running under xorg-server-1.8.2. However, the build process and the xorg dependencies are so arcane that I would not want to meddle with it too much.

Did you have any specific version in mind for me to try?

PS: just for info, in the meantime I have tried booting with the kernel parameter acpi_sleep=nonvs but it made no difference.
Comment 9 Martin Rogge 2010-09-13 22:45:39 UTC
currently testing 2.6.36-rc4. So far the problem hasn't shown up, but it is too early to tell.
Comment 10 Len Brown 2010-09-14 01:50:33 UTC
Martin's machine worked fine before 2.6.35.
Alexey, you filed this bug (so it is your bug:-) is that true
for your system too?  Did it suspend reliably before 2.6.35
and the regression started at 2.6.35?  Looks like your
failure was using an ubuntu-compiled kernel.  was there
a previous release of ubuntu or the ubnutu kernel that
worked better?
Comment 11 Oleksij Rempel (fishor) 2010-09-14 05:22:11 UTC
Hard to say. Many kernels i believed worked fine i got to crash some how (normally i use vanilla git kernel). I do not have 100% working regression test. Till today i used clean ubuntu 10.10 for quality testing. Now is time to go back to vanilla :D , today late i will do more testing.

I got serial port working on my PC, let see if can get some more info.
Comment 12 Oleksij Rempel (fishor) 2010-09-16 09:50:53 UTC
I use kernel 2.6.36-rc4-00076-g9c03f16 from git master. And 
xserver-xorg-video-intel:
  Installiert: 2:2.12.0-1ubuntu4
  Kandidat:    2:2.12.0-1ubuntu4
  Versionstabelle:
 *** 2:2.12.0-1ubuntu4 0
        500 http://de.archive.ubuntu.com/ubuntu/ maverick/main amd64 Packages
        100 /var/lib/dpkg/status

After some supends i got short error message:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

but it went to suspend and did wake up. It looks like current kernel can work around this hang.
Comment 13 Oleksij Rempel (fishor) 2010-09-16 09:52:53 UTC
Created attachment 30222 [details]
dmesg with "GPU idle, missed IRQ"
Comment 14 Martin Rogge 2010-09-19 20:30:15 UTC
I've had close to 30 real-life suspend & resume events with 2.6.36-rc4, and I think it is now safe to say the problem for me is no longer present with 2.6.36-rc4. 

Question is: what with 2.6.35? Mark the release as burnt or try and fix it?

Martin
Comment 15 Rafael J. Wysocki 2010-09-20 17:01:11 UTC
OK, closing as fixed in mainline.

If you can identify the mainline commit that fixed the issue for you, we can try
to backport it to 2.6.35.y.