Bug 10509 - Intel DG965RY system occasionally suspends back immediately after resume
Summary: Intel DG965RY system occasionally suspends back immediately after resume
Status: REJECTED DOCUMENTED
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: ykzhao
URL:
Keywords:
Depends on:
Blocks: 7216
  Show dependency tree
 
Reported: 2008-04-23 05:56 UTC by walken
Modified: 2008-06-02 01:53 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.25
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
acpidump output (90.22 KB, text/plain)
2008-04-23 05:58 UTC, walken
Details
lspci -vxxx output (28.26 KB, text/plain)
2008-04-23 07:17 UTC, walken
Details
/proc/acpi/wakeup output (694 bytes, text/plain)
2008-04-23 07:18 UTC, walken
Details
kern_log - see comment 7 (98.31 KB, text/plain)
2008-04-24 07:15 UTC, walken
Details
kern_log_acpid1 - see comment 7 (20.92 KB, text/plain)
2008-04-24 07:15 UTC, walken
Details
kern_log_acpid2 - see comment 7 (41.04 KB, text/plain)
2008-04-24 07:16 UTC, walken
Details
kern_log_acpid3 - see comment 7 (61.17 KB, text/plain)
2008-04-24 07:16 UTC, walken
Details
dmesg_boot (22.45 KB, text/plain)
2008-04-25 03:22 UTC, walken
Details
dmesg output - double suspend during normal usage (30.74 KB, text/plain)
2008-04-25 15:49 UTC, walken
Details
kern_log_suspend (224.52 KB, text/plain)
2008-06-01 22:58 UTC, walken
Details
kern_log_button (11.41 KB, text/plain)
2008-06-01 22:58 UTC, walken
Details
event_log_button (576 bytes, text/plain)
2008-06-01 22:58 UTC, walken
Details

Description walken 2008-04-23 05:56:13 UTC
Latest working kernel version: 2.6.20
Earliest failing kernel version: 2.6.24
Distribution: Debian etch
Hardware Environment: Intel DG965RY motherboard, P4 CPU 3.2GHz (family 15 model 6 stepping 5)
Software Environment: debian etch amd64
Problem Description: System occasionally suspends back immediately after resume

Steps to reproduce:

I'm using a 2.6.25 kernel. The system does suspend and resume correctly, but sometimes (roughly 40% of the time I'd estimate) it suspends again just after the resume succeeds.

The obvious workaround I've been using is to keep pressing the power button as many times as needed for the machine to stay on. The nicer workaround is to unload the button module right around a suspend to ram.

I did notice that when the system suspends multiple times, the value in /sys/firmware/acpi/interrupts/ff_pwr_btn increments by that many times.

I do run acpid, and my powerbtn.sh script includes the following:
/etc/init.d/apcupsd stop
/etc/init.d/ntp     stop
rmmod button    # Need this as a workaround...
sync
echo mem > /sys/power/state
modprobe button # Need this as a workaround...
/etc/init.d/ntp     start
sleep 5
/etc/init.d/apcupsd start
/etc/init.d/anacron start

I see there was a similar sounding thread "Machine shutdown after resume from S3" back in february on different Intel hardware, but there was no apparent resolution to it: http://www.mail-archive.com/linux-acpi@vger.kernel.org/msg12328.html

I'll attach the acpidump output on that system if that is any help.

Thanks,
Comment 1 walken 2008-04-23 05:58:44 UTC
Created attachment 15860 [details]
acpidump output
Comment 2 ykzhao 2008-04-23 07:03:19 UTC
Please attach the output of lspci -vxxx and /proc/acpi/wakeup.
Thanks.
Comment 3 walken 2008-04-23 07:17:23 UTC
Created attachment 15861 [details]
lspci -vxxx output
Comment 4 walken 2008-04-23 07:18:53 UTC
Created attachment 15862 [details]
/proc/acpi/wakeup output
Comment 5 walken 2008-04-23 08:24:47 UTC
Hmmm, either I'm becoming crazy, or I just halfway figured out something crazy...

First I should mention that this system is using an nfsroot (mounted using nfs over tcp, and I'm using a small initramfs with klibc-utils based scripts to mount that at boot time). Today I realized that after changing something apparently unrelated (I disabled ethernet flow control because it took a few seconds to renegotiate the link after a resume, resulting in a few nfs server not responding still trying messages) I was not able to reproduce the issue anymore...

I know this sounds really weird but the current state of testing on my side is:

* Now with ethernet flow control disabled on my eth0, which avoids any renegotiation delay after the resume, I can not reproduce the issue anymore even if the button module stays loaded.

* On the other hand, if disconnect the ethernet link while the machine is suspended, get the machine to resume, and wait 10 seconds before reconnecting the link (thus causing a few nfs server not responding messages in the mean time), then the machine suspends again some time after I reconnect the ethernet link (it can take a while, up to 30 seconds or so). This is reproduceable 100% of the time if the button module is loaded. Unloading the button module still works around that issue as explained in the initial report.

I'm not sure what to make of that. Probably the bug involves some interaction with userspace, possibly with acpid ???

Just for kicks I tried the above ethernet-cable-disconnection trick on a different machine which also uses nfsroot, and that machine did not hit the unexpected-suspend issue. So, it does look like there is a machine specific element to this, as well.

Sorry if this sounds incredibly confused. I just tried this again and I still can't believe it myself...
Comment 6 ykzhao 2008-04-24 00:15:25 UTC
Thanks for the info.
Will you please set CONFIG_ACPI_DEBUG in kernel configuration and do the following test? (button driver should be loaded).
a. kill the process who is using /proc/acpi/event( for example: kill acpid).
b. echo 0x80004 > /sys/module/acpi/parameters/debug_layer && echo 0x1f > /sys/module/acpi/parameters/debug_level
c. cat /proc/acpi/event >event_log (please press ctrl+c after step d is finished)
d. suspend& resume several cycles.

After the test, please attach the output of dmesg, event_log.
Thanks.
Comment 7 walken 2008-04-24 07:13:48 UTC
Thanks for indicating how to use the acpi debug stuff.

My crazy-sounding theory in comment 5 (which I swear worked about 10 times in a row yesterday) just won't work anymore today. So, probably it was too crazy to be true.

With debug_layer=0x80004 and debug_level=0x1f I tried as you described, suspending the machine (with echo mem > /sys/power/state) many times and and looking at /proc/acpi/event and dmesg output. But, nothing ever got output into /proc/acpi/event (I suppose you were hoping for PWRF events ?). I will attach the /var/log/kern.log output for 5 suspend cycles as the kern_log file.

Without suspending the machine, I then looked at the proc/acpi/event when pressing the power button. I noticed that sometimes I got 2 or 3 events when pressing the button only once... The frequency of that was not even close to 40% though, but then again I had trouble reproducing the bug today, so...

Finally with acpid loaded and with debug_layer=0x80004, debug_level=0x0800001f I suspended a bunch of times and logged the kern.log output for each. I'll attach the output as kern_log_acpid1 (machine suspended normally), kern_log_acpid2 (machine suspended twice in a row) and kern_log_acpid3 (machine suspended 3 times in a row).

From these logs I can guess what happens. Apparently it sometimes happend that when I press the power button, several interrupts get generated. Possibly this may be that my button bounces, I don't know (it can not feel this by touch). Anyway, a corresponding number of events show up at once in /proc/acpi/event and acpid suspends the corresponding number of times. Such bounces don't seem to happen when resuming from suspend state though or if they do, at least they don't generate anything in my logs.

Should acpid remove duplicate power button events somehow ? even if it can't tell the difference between a bounce and someone actually pressing the button twice, it does not make much sense to suspend the machine multiple times due to (a) button press(es) that happened before the first suspend...
Comment 8 walken 2008-04-24 07:15:19 UTC
Created attachment 15886 [details]
kern_log - see comment 7
Comment 9 walken 2008-04-24 07:15:46 UTC
Created attachment 15887 [details]
kern_log_acpid1 - see comment 7
Comment 10 walken 2008-04-24 07:16:12 UTC
Created attachment 15888 [details]
kern_log_acpid2 - see comment 7
Comment 11 walken 2008-04-24 07:16:35 UTC
Created attachment 15889 [details]
kern_log_acpid3 - see comment 7
Comment 12 ykzhao 2008-04-24 19:19:31 UTC
Thanks for the test.
What you said in comment #7 is correct. I want to confirm whether there exist the PWRF and SLPB event after the system is resumed. IMO that the system suspend again after it is resumed should be related with the PWRF/SLPB/SLPF event. But from the attached log it seems that there is no SLPB/PWRF event after the system is resumed. In such cases you can't reproduce the bug.

The exception that several interrupts are triggered when I press the power button should be related with the button bounce. Maybe it is related with the bios and it is not easy for acpid to recognize the duplicated power button events.

Anyway please boot the system with the option of "acpi.debug_layer=0x080004 acpi.debug_level=0x0800001f"  and attach the output of dmesg when the bug happens again.
Thanks.
Comment 13 walken 2008-04-24 19:41:21 UTC
Will do.

Just for clarity, should anything other than the ev_fixed_event_detect and ev_gpe_detect lines to be logged when an power button event shows up on /proc/acpi/event ? because, I've not seen that in my previous kern_log_acpid* logs taken with debug_layer=0x080004 and debug_level=0x0800001f. I worry if that means I'm missing some logging options...

Thanks,
Comment 14 ykzhao 2008-04-24 19:52:54 UTC
When the power button event shows up in /proc/acpi/event, the info of "ev_fixed_event_detect and ev_gpe_detect" should appear. Of course sometimes there will be some other info that we don't care. (It is my fault. I give the wrong option. )
Thanks.
Comment 15 walken 2008-04-25 03:22:12 UTC
Created attachment 15909 [details]
dmesg_boot

The attached file is a log of the system booting with acpi.debug_layer=0x080004
acpi.debug_level=0x0800001f. It does not include a suspend.

Regarding suspends, the logs kern_log_acpi2 and kern_log_acpid3 might already be what you want - they were the result of having acpid and all the usual stuff loaded, suspending using the power button, and resuming. This matches my normal usage - the only difference I can see is that I would normally go away for some time before resuming, while I did not in the above two cases.

Regardless, I'll go forward with normal usage and attach new logs when I hit more cases of the spurious suspend after resume.

Thanks,
Comment 16 walken 2008-04-25 15:49:57 UTC
Created attachment 15917 [details]
dmesg output - double suspend during normal usage

Here is one that happened during normal usage. For what I can tell, it's quite similar to kern_log_acpid2. There are two ev_fixed_event_detect before the first suspend, and the machine suspended a second time immediately after the first resume.
Comment 17 walken 2008-04-27 05:46:02 UTC
FYI - I won't be able to do additional testing on this from april 29 to may 29 as I'll be out of the country.
Comment 18 ykzhao 2008-04-28 00:00:10 UTC
Hi
   >the machine suspended a second time immediately after the first resume.
   Do you mean that the system enters the suspending state again automatically after resum? Or the system enter the suspending state by using "echo mem > /sys/power/state"?
   thanks.
Comment 19 walken 2008-04-28 01:55:55 UTC
Hi,

Yes, the second suspend was automatic.

kern_log was the only log where I suspended using "echo mem > /sys/power/state" from a shell prompt.

In all the more recent logs (kern_log_acpid* and dmesg*) the suspends were done by pressing the power button (just once) while acpid is loaded. acpid runs the powerbtn.sh script (described in the initial report), which then runs the "echo mem > /sys/power/state".

In kern_log_acpid2, kern_log_acpid3 and the last dmesg case, the machine suspended multiple times in a row with no user interaction between the suspends. I think acpid ran the powerbtn.sh script multiple times because the power button triggering the initial suspend had a bounce - these logs show some evidence of a bounce (with several ev_fixed_event_detect in a row) and an equal number of suspend cycles afterwards.

Thanks,
Comment 20 ykzhao 2008-05-05 02:33:02 UTC
Hi
   Will you please confirm whether the acpid and powerbtn script are used when the system is suspended again after resume? 
   If yes, please kill the acpid and do the following test: 
   a. echo 0x80004 > /sys/module/acpi/parameters/debug_layer && echo 0x0800001f > /sys/module/acpi/parameters/debug_level && echo 1 > /sys/module/printk/parameters/time
b. cat /proc/acpi/event >event_log (please press ctrl+c after step d is
finished)
c. suspend& resume several cycles and confirm whether the problem still exists. ( using the command of "echo mem > /sys/power/state" to enter the sleeping state)
    
    
Comment 21 walken 2008-06-01 22:56:37 UTC
Just came back from vacation - sorry for the delay.

Yes, acpid is loaded when I get the duplicate suspends, and the commands from the powerbtn script (start/stopping services, etc) do get executed.

I ran the commands suggested in comment 20. After 10 consecutive suspends and resumes using echo mem > /sys/power/state, I did not get any output from /proc/acpi/event, nor did I get any duplicate suspends. Actually, I have never observed any duplicate suspend when using the echo method rather than the power button. I will attach the dmesg output for the 10 suspends as kern_log_suspend.

While I was there, I also pressed 10 times the power button. As you'll see in the logs, this generated 16 events - I do easily get duplicate events from the power button. The ten pushes respectively generated 2, 3, 2, 1, 1, 1, 2, 1, 1, 1 and 1 events. I'll attach the logs for that as kern_log_button and event_log_button.

Hope this helps,
Comment 22 walken 2008-06-01 22:58:00 UTC
Created attachment 16359 [details]
kern_log_suspend
Comment 23 walken 2008-06-01 22:58:31 UTC
Created attachment 16360 [details]
kern_log_button
Comment 24 walken 2008-06-01 22:58:58 UTC
Created attachment 16361 [details]
event_log_button
Comment 25 ykzhao 2008-06-02 01:40:15 UTC
Hi, Walken
   Thanks for the test.
   From the test log we can know that the root cause is that sometimes several events will be produced when the power button is pressed once. When the power button script and acpid are used, the following two cases will be affected:
   a. Press power button to enter the suspend state
      After the power button is pressed once, maybe two events will be sent to the user space. When the acpid receives the first power event, the power button script will be called to enter the suspend state and the second power event will still be sent to user space. After the system is resumed, the second power button event will be processed and the power button script will be called to enter the suspend again. 
   b. Press power button to wake the sleeping state
      Power button is pressed to wake the sleeping state and maybe it will produce two events. The first event will wake up the sleeping state and not be sent to the user space(In fact it won't be processed by the button driver).If the second power event occurs after the Fixed_power event is reenabled, it will be sent to the user space. In such case the system will enter the suspended state again.

   So if the duplicated event can be eliminated, the system will work well. IMO this is related with hardware issue. For the case a it will be better to eliminate the duplicated event in user space. For the case b it can't be eliminated. 
   
Comment 26 ykzhao 2008-06-02 01:53:21 UTC
Hi, Walken
    From the test log it seem that the bug is is related with the hardware isssue.Sometimes events are produced when power button is pressed once. 
    If the power button script and acpid are used, please continue use the workaround in power button scripts. IMO it is not worth fixing this bug in kernel.
   >rmmod button    # Need this as a workaround...
   >sync
   >echo mem > /sys/power/state
   >modprobe button # Need this as a workaround...
    

Note You need to log in before you can comment on or make changes to this bug.