Bug 9163

Summary: Resume form memory suspend causes about a half a minute wait
Product: ACPI Reporter: Ciprian Dorin Craciun (ciprian.craciun)
Component: ECAssignee: acpi_ec
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla, alan, alon.barlev, astarikovskiy, htejun, lenb, pj_beers, tglx
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: v2.6.23.1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 9056    
Attachments: lspci output
dmesg output for the suspend/resume operation
config file for v2.6.23.1
config file for v2.6.22.9
lspci for t43p
My kernel config - t43p
First time s2disk - no errors (2.6.24-rc5; NO_HZ=y,HIGH_RES_TIMERS=y)
Second time s2disk - with error, completed uninterruptedly
Third time s2disk - with error, interrupted with keypresses to speed things up
First s2disk, then s2ram; 2.6.24-rc5; nohz=off highres=off; No errors
s2disk with errors (2.6.24-rc6; NO_HZ=y,HIGH_RES_TIMERS=y)
Contents of /proc/timer_list; 2.6.24-rc6, "hpet=disabled" on cmdline
Bootlog with nolapic_timer on the cmdline; 2.6.24-rc6
Bootlog without nolapic_timer on the cmdline; 2.6.24-rc6
2.6.25-rc3 dmesg with sequence of s2disk s2disk s2ram s2disk
test patch
dmesg output - 2.6.25 with patch

Description Ciprian Dorin Craciun 2007-10-14 07:16:21 UTC
Most recent kernel where this bug did not occur: v2.6.22.9

Distribution: Debian Etch

Hardware Environment: IBM ThinkPad T43. The output of lspci is attached.

Problem Description:

    When my laptop resumes it takes a lot of time, about half a minute to
resume.

    I watched the timings in the dmesg output (attached) and it seems it
hangs either at the souncard setup or the SATA controller. (I'm not an
expert, I just guessed that based on dmesg, so I might be wrong...)

    The problem is at the lines:

[    0.275398] PCI: Setting latency timer of device 0000:00:1e.2 to 64
[   18.437412] PM: Writing back config space on device 0000:00:1f.2 at
offset 1 (was 2b00005, writing 2b80005)

   And those PCI's corespond to:

00:1e.2 Multimedia audio controller:
   Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio
Controller (rev 03)
00:1f.2 IDE interface:
   Intel Corporation 82801FBM (ICH6M) SATA Controller (rev 03)

   I must add that I haven't encountered this problem with any of the
previous stable kernels. The last one I have used was v2.6.22.9. And
the difference of the two configs is not essential.

   Thanks!
Comment 1 Ciprian Dorin Craciun 2007-10-14 07:28:24 UTC
Created attachment 13151 [details]
lspci output
Comment 2 Ciprian Dorin Craciun 2007-10-14 07:29:12 UTC
Created attachment 13152 [details]
dmesg output for the suspend/resume operation
Comment 3 Ciprian Dorin Craciun 2007-10-14 07:31:55 UTC
    I just tried another suspend/resume and this time the suspend took about half a minute...

    Also my built-in keyboard and touch pad are dead. It needs a reboot.

    Ciprian.
Comment 4 Rafael J. Wysocki 2007-10-14 12:32:16 UTC
CCed Tejun and Alan, perhaps they can help.
Comment 5 Rafael J. Wysocki 2007-10-14 12:33:03 UTC
Ciprian, can you also attach your .config, please?
Comment 6 Ciprian Dorin Craciun 2007-10-14 12:36:24 UTC
Created attachment 13154 [details]
config file for v2.6.23.1
Comment 7 Rafael J. Wysocki 2007-10-14 12:45:54 UTC
Thanks.

Please also attach your working 2.6.22.9 .config, if possible.
Comment 8 Rafael J. Wysocki 2007-10-14 12:47:55 UTC
Please also try to boot with libata.noacpi=0 and see if that helps.
Comment 9 Ciprian Dorin Craciun 2007-10-14 12:53:00 UTC
Created attachment 13156 [details]
config file for v2.6.22.9
Comment 10 Ciprian Dorin Craciun 2007-10-14 13:15:24 UTC
    I have tried the libata.noacpi=0 but it didn't help.

    I also have found something else... When resuming the following happens:
    1) I press the power button or open the laptop's lid.
    2) The processor's fan starts to spin.
    3) Nothing else happens until... (I have waited for almost 2 minutes...)
    4) I press any key on the keyboard. And after this,
    5) The HDD led starts to blink,
    6) After a few seconds it comes out of suspend.

    It seems thus that it blocks until I press a key -- which I think happened every time but I haven't payed attention. And every time the time difference is between those two lines.

    Also it happened that this morning -- but only once -- the resume worked just fine. (Or maybe I have pressed a key just after opening the lid.)

    Ciprian.
Comment 11 Rafael J. Wysocki 2007-10-14 13:53:27 UTC
Well, let's add some more people to the CC list.
Comment 12 Tejun Heo 2007-10-17 03:18:10 UTC
So, Ciprian, if you hit key early during resume, resume is fast?
Comment 13 Ciprian Dorin Craciun 2007-10-17 03:49:39 UTC
    You see, here it's strange. The laptop starts spinning the processor fan, and sometimes the led from the wireless card blinks. But there is no hard disk activity until I press the any key. And after this it resumes. You can see from the timings:

[    0.275398] PCI: Setting latency timer of device 0000:00:1e.2 to 64
[   18.437412] PM: Writing back config space on device 0000:00:1f.2 at
offset 1 (was 2b00005, writing 2b80005)

    Here I think that only after 18 seconds I pressed a key.

    This also happens sometimes at suspend. For example the sleep-led (there is a special sleep led on my laptop) starts blinking when I initialize the suspend procedure and after a few seconds it stops blinking and the laptop suspends -- there is no fan spinning. But sometimes the fan still works and the led blinks until I press any key.

    Ciprian.

    P.S.: I am conscious that a laptop could also resume after a key press, and that on some laptops it is not sufficient to open the lid. But mine works just fine -- without any key pressing, just opening the lid -- in v2.6.22.9 -- and any kernel starting with v2.6.17 (or 18?)
Comment 14 Rafael J. Wysocki 2007-12-12 16:34:50 UTC
Ciprian, can you try the current mainline kernel with CONFIG_NO_HZ and
CONFIG_HIGH_RES_TIMERS unset, please?
Comment 15 Ciprian Dorin Craciun 2007-12-13 00:52:08 UTC
Unfortunately I do not own the IBM laptop anymore, but on the new laptop -- a Benq S31V -- with kernel 2.6.23.8 and almost the same configuration (including CONFIG_HO_HZ and CONFIG_HIGH_RES_TIMERS set) the problem doesn't exist anymore.

If you want me to carry some tests on this Laptop I will gladly do so.

Ciprian.
Comment 16 Rafael J. Wysocki 2007-12-13 07:04:31 UTC
Well, if the problem is not visible on your new laptop, no further testing is needed, IMO.

I'm closing this bug as unreproducible.  Please reopen if necessary, thanks.
Comment 17 PJBrs 2007-12-13 07:59:16 UTC
I have a similar laptop (t43p) with similar problems; it only resumes correctly if I boot with highres=off,nohz=off. Problems started with 2.6.21, which produced hangs wiht nohz. I didn't really investigate. From 2.6.22 the resume problems started occurring, including on 2.6.24-rc4. After suspending two or three times (I did not notice specific differences between s2disk and s2ram) resume starts taking ages. The process can only be speeded up by pressing some buttons. By the way I had problems with hdaps too (see bug 8757) which I 'solved' by not loading hdaps anymore.

I'd be happy to test, so I'll reopen the bug and reboot tonight (2.6.24-rc5) to see what happens. One note, I'm quite unfamiliar with testing stuff like this, so please tell me what info you need. I did not replicate all tests you asked from Ciprian. I'll add config and lspci output anyway. I'm on Slackware12, with kernel 2.6.23.9.
Comment 18 PJBrs 2007-12-13 08:00:39 UTC
Created attachment 14010 [details]
lspci for t43p
Comment 19 PJBrs 2007-12-13 08:01:21 UTC
Created attachment 14011 [details]
My kernel config  - t43p
Comment 20 PJBrs 2007-12-13 08:03:23 UTC
Sorry, I'm no good with bugzilla and it shows, I'm unable to reopen this bug... How should I continue?
Comment 21 PJBrs 2007-12-13 13:21:05 UTC
Rebooted with 2.6.24-rc5, nohz, highres. I could s2ram without problem for 10 times. I then suspended2disk once, also without problem. After that however, resuming after s2ram was very slow. I did not let it finish but hit a few buttons, without noticeable effect. I then hit fn-f7 a few times, and that did the trick. 

I rebooted and s2disk-ed a few times. First time no problem, second time I let it resume and it took a couple of minutes, first time I pressed a few any, non-specific, buttons and resume proceeded as normal.
Comment 22 Ciprian Dorin Craciun 2007-12-13 13:34:19 UTC
    I have to mention that my config also contained the hdaps module compiled into the kernel (I mean statically and not as a module). And I also confirm that I have experienced the same behavior as PJBrs encountered... So I would propose to transfer the ownership (the reporter) of the bug to PJBrs -- if such a thing is possible -- as he has the hardware and the same problem, and thus he could help.

    Ciprian.
Comment 23 Rafael J. Wysocki 2007-12-13 13:51:40 UTC
I'm not sure if we can change the reporter, but that shouldn't matter (if you don't mind receiving emails from the bugzilla, that is).
Comment 24 PJBrs 2007-12-17 12:11:50 UTC
I'm currently on 2.6.24-rc5, and booted with nohz=off, highres=off, and I've had no problems so far (booted about three days ago, s2disk and s2ram several times. 

As I said, I did have troubles when leaving out nohz=off, highres=off. I'm attaching three episodes of s2disk. The first is ok, the second is not ok but I let resume complete uninterruptedly, the third as not ok and interrupted with keypresses to speed things up.
Comment 25 PJBrs 2007-12-17 12:15:25 UTC
Created attachment 14090 [details]
First time s2disk - no errors (2.6.24-rc5; NO_HZ=y,HIGH_RES_TIMERS=y)
Comment 26 PJBrs 2007-12-17 12:16:17 UTC
Created attachment 14091 [details]
Second time s2disk - with error, completed uninterruptedly
Comment 27 PJBrs 2007-12-17 12:17:03 UTC
Created attachment 14092 [details]
Third time s2disk - with error, interrupted with keypresses to speed things up
Comment 28 PJBrs 2007-12-18 00:48:05 UTC
For good measure I also include dmesg with s2disk and then s2ram without problems, after three days of uptime on kernel 2.6.24-rc5, booted with nohz=off highres=off. I haven't kept count of how many times I suspended before recording this dmesg output.

One last thing, either NO_HZ=y or HIGH_RES_TIMERS=y is enough to produce the delays at resume. Both need to be switched off at boot to be able to suspend and resume normally.

Please let me know if there's anything more I can test and do to help.
Comment 29 PJBrs 2007-12-18 00:49:05 UTC
Created attachment 14102 [details]
First s2disk, then s2ram; 2.6.24-rc5; nohz=off highres=off; No errors
Comment 30 PJBrs 2008-01-01 10:29:54 UTC
I just tried 2.6.24-rc6. All is still fine with nohz=off highres=off on the kernel commandline, but without this the problems still remain. I did notice ata1.01: _GTF evaluation failed (AE 0x1001) and ata2.01: _GTF evaluation failed (AE 0x1001) with dmesg, so I'm attaching a dmesg record of s2disk with errors (NO_HZ=y, HIGH_RES_TIMERS=y).
Comment 31 PJBrs 2008-01-01 10:31:05 UTC
Created attachment 14246 [details]
s2disk with errors (2.6.24-rc6; NO_HZ=y,HIGH_RES_TIMERS=y)
Comment 32 Thomas Gleixner 2008-01-05 04:09:23 UTC
Can you please try "hpet=disable" on the kernel command line (with nohz & highres active) ?
Comment 33 PJBrs 2008-01-07 23:50:13 UTC
I tried with hpet=disabled (2.6.24-rc6), and I saw no big differences, save one; This time, the wait happened at suspend instead of resume. Everything went normal until the "s2disk: snapshotting system..." message appeared. At that point I had the slow-blinking (slow as in takes one minute for one blink) cursor, and I pressed some keys to speed things up. This made the next message appear (something like "s2disk: snapshot ready..."), and suspend proceeded as normal. I haven't rebooted (yet) to test whether this was a one-time behaviour, or that booting with hpet=disabled makes the wait occur on suspend instead of resume.

I also did s2ram a few times, which had problems as well, but on resume. I did not notice anything that I hadn't noticed without hpet=disabled before. By now, however, I am sure that the wait on resume-from-ram can only be shortened by pressing Fn-{some special thinkpad function key}-combo's, not the "any"-key. Furthermore, in some cases (but this is difficult to reproduce), s2ram only occurs after one more keypress, irrespective which key.

In sum: hpet=disabled did not cause major differences, except the wait occurring at suspend instead of resume with s2disk. By the way, I also tried with "hpet=disabled nohz=off", and then everything is fine (couple of s2disk, s2ram, no problems).

I'll try 2.6.24-rc7 next.
Comment 34 Thomas Gleixner 2008-01-08 01:07:38 UTC
> In sum: hpet=disabled did not cause major differences, except the wait
> occurring at suspend instead of resume with s2disk. By the way, I also tried
> with "hpet=disabled nohz=off", and then everything is fine (couple of s2disk,
> s2ram, no problems).

You said earlier that you have to add "nohz=off" _and_ "highres=off"
to make the problem go away.

With "hpet=disable" you only need "nohz=off". Or did you disable
highres in the kernel config ?

Thanks,
	tglx
Comment 35 PJBrs 2008-01-08 01:54:09 UTC
$ cat /proc/cmdline
root=/dev/sda1 ro vga=791 printk.time=Y hpet=disabled nohz=off

$ zgrep -e NO_HZ -e HIGH_RES -e HPET /proc/config.gz
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
# CONFIG_HPET is not set

Apparently I can substitute "highres=off" with "hpet=disabled".
Comment 36 Thomas Gleixner 2008-01-08 04:01:29 UTC
> Apparently I can substitute "highres=off" with "hpet=disabled".

Can you please provide the output of
# cat /proc/timer_list

with "hpet=disable" on the command line ?

Thanks,
	tglx
Comment 37 PJBrs 2008-01-08 04:16:15 UTC
Created attachment 14358 [details]
Contents of /proc/timer_list; 2.6.24-rc6, "hpet=disabled" on cmdline
Comment 38 PJBrs 2008-01-08 04:22:18 UTC
Hmm, I notice that the correct syntax is "hpet=disable", not "hpet=disabled" (with an extra "d" after disable). I assume, however, that they make no difference...?
Comment 39 Thomas Gleixner 2008-01-08 04:40:13 UTC
> Hmm, I notice that the correct syntax is "hpet=disable", not "hpet=disabled"
> (with an extra "d" after disable). I assume, however, that they make no
> difference...?

No difference, the match is on "disable".

Can you please add "nolapic_timer" to the kernel command line instead
of "hpet=disable(d)" ?

Thanks,
	tglx
Comment 40 PJBrs 2008-01-08 05:55:40 UTC
No problems with "nolapic_timer"!

I tested by doing s2disk, s2ram, s2ram, s2ram, s2disk, s2disk, s2disk; no problems so far. Need any further info?
Comment 41 Thomas Gleixner 2008-01-08 06:51:59 UTC
Can you please provide full boot logs with and without "nolapic_timer" on the kernel command line? Please add "apic=verbose" in both cases.

Thanks,
       tglx
Comment 42 Thomas Gleixner 2008-01-08 07:12:43 UTC
> Can you please provide full boot logs with and without "nolapic_timer" on the
> kernel command line? Please add "apic=verbose" in both cases.

I try to collect the info we have so far into one place. Please
correct me if I confused something.

1.) If you only do s2ram, then the problem never happens
2.) It's never the first s2disk cycle, but any consequent s2disk or s2ram cycle
3.) Problem goes away with one of the following command line option combos:
    A) "nohz=off highres=off"
    B) "hpet=disable nohz=off"
    C) "nolapic_timer"

Thanks,
	tglx
Comment 43 PJBrs 2008-01-08 07:55:15 UTC
Created attachment 14360 [details]
Bootlog with nolapic_timer on the cmdline; 2.6.24-rc6
Comment 44 PJBrs 2008-01-08 07:55:48 UTC
Created attachment 14361 [details]
Bootlog without nolapic_timer on the cmdline; 2.6.24-rc6
Comment 45 PJBrs 2008-01-08 08:28:42 UTC
> 1.) If you only do s2ram, then the problem never happens

Yes. I haven't tried this with all different options, but I have noticed anything different either.

> 2.) It's never the first s2disk cycle, but any consequent s2disk or s2ram
> cycle

Sometimes but rarely it does happen the first time, I didn't notice this at first. I'm not sure whether it actually worked twice in a row but I don't think so. I don't know whether it matters if it is a reboot or a normal boot either.  

> 3.) Problem goes away with one of the following command line option combos:
>     A) "nohz=off highres=off"
>     B) "hpet=disable nohz=off"
>     C) "nolapic_timer"

This is entirely correct.
Comment 46 Thomas Gleixner 2008-01-08 08:38:37 UTC
On Tue, 8 Jan 2008, bugme-daemon@bugzilla.kernel.org wrote:
> > 3.) Problem goes away with one of the following command line option combos:
> >     A) "nohz=off highres=off"
> >     B) "hpet=disable nohz=off"
> >     C) "nolapic_timer"
> 
> This is entirely correct.

And in all cases you can continue by pressing keys - if I understand
correctly it needs to be some special function key, right ?

After that the kernel works fine (no strange behaviour, slowness, ...) ?

Thanks,
	tglx
Comment 47 PJBrs 2008-01-08 13:26:07 UTC
(In reply to comment #46)
> And in all cases you can continue by pressing keys - if I understand
> correctly it needs to be some special function key, right ?

No, only in the case of s2ram it needs to be a special key (Fn+F7, or Fn+F3, or Fn+{WhatHaveYou}; a combo of your average pair of special laptop keys). The wait with s2disk can be shortened by pressing a couple of 'any' keys.

> After that the kernel works fine (no strange behaviour, slowness, ...) ?

One thing I have sometimes noticed (also occurred with nolapic_timer, and some other combinations of kernel-parameters I don't quite remember) is that the special thinkpad fancontrol script seems to lose control over the fan, but restarting it makes it work again. But that's probably too particular to be of any use.

I admit that the symptoms are quite varied, and I don't always remember all situations and symptoms. But I try to be as precise as possible :-)
Comment 48 PJBrs 2008-01-08 13:29:37 UTC
(In reply to comment #47)
> (In reply to comment #46)
> > After that the kernel works fine (no strange behaviour, slowness, ...) ?

I meant to say - in general, no, but I don't tax this box very much, so I might have missed it.
 
> One thing I have sometimes noticed (also occurred with nolapic_timer, and
> some
> other combinations of kernel-parameters I don't quite remember) is that the
> special thinkpad fancontrol script seems to lose control over the fan, but
> restarting it makes it work again. But that's probably too particular to be
> of
> any use.

To clarify, this script comes from thinkwiki.org, it's not officially delivered by lenovo with the laptop.
Comment 49 Thomas Gleixner 2008-01-08 13:46:56 UTC
> > > After that the kernel works fine (no strange behaviour, slowness, ...) ?
> 
> I meant to say - in general, no, but I don't tax this box very much, so I
> might
> have missed it.

At least it is nothing obvious.
 
> > One thing I have sometimes noticed (also occurred with nolapic_timer, and
> some
> > other combinations of kernel-parameters I don't quite remember) is that the
> > special thinkpad fancontrol script seems to lose control over the fan, but
> > restarting it makes it work again. But that's probably too particular to be
> of
> > any use.
> 
> To clarify, this script comes from thinkwiki.org, it's not officially
> delivered
> by lenovo with the laptop.

Yeah, that's a separate issue probably, but it might be worthwile to
disable the thinkpad stuff for a test.

Thanks,
	tglx
Comment 50 PJBrs 2008-01-10 10:22:28 UTC
I did some further testing. First, I switched to 2.6.24-rc7, and I loaded thinkpad_acpi without the fan_control option and without the fancontrol script.

Some results: I tested with X loaded but no users logged in, and I could s2disk twice without problem. I then did s2ram -> problem. I then wanted to know whether the problem was with X, so I rebooted to runlevel 3 (multiuser, network, no X), and could s2disk without problem three times, but the fourth time had the wait. Apparently, whether the problem occurs depends on the number of programs running, amount of memory used, or something like that. 

From all tests until now I'm only entirely confident that:
The problem goes away with one of the following command line option combos:
    A) "nohz=off highres=off"
    B) "hpet=disable nohz=off"
    C) "nolapic_timer"

With regard to which key to press; I think, but I'm not sure, that I once needed to press some combo of Fn+{$SpecialThinkpadKey} on s2disk, whereas before the rule seemed to be that:

1) the wait on s2disk could be shortened by pressing 'any' keys
2) the wait on s2ram could be shortened by pressing Fn+{$SpecialThinkpadKey}

It is still the rule, but I can't rule out the possibility that it has exceptions.
Comment 51 PJBrs 2008-01-10 10:48:06 UTC
Another test. I was able to reproduce the problem at the first time s2disk by loading 'many' documents (21 pdfs, 3 docs in open office, 1 doc in ms-office, crossover, kontact konqueror and then some). I don't think system load was very high, but memory load was about 1 GB (this machine has 1GB of RAM and 1,5 GB of swap).
Comment 52 PJBrs 2008-01-10 11:26:44 UTC
Last test for today; you wrote:
> 1.) If you only do s2ram, then the problem never happens

This still holds. Opened 61 pdfs and compiled a programme while doing s2ram two times - no problem. A total of about 10 times s2ram in a row, still no problem.

In sum, to get all info in one place :-)

1.) If you only do s2ram, then the problem never happens
2.) Apparently, the lower the memory use, the higher the chances of s2disk succeeding normally
3.) Problem goes away with one of the following command line option combos:
    A) "nohz=off highres=off"
    B) "hpet=disable nohz=off"
    C) "nolapic_timer"
4.) After the problems have started, normal operation still continues without obvious performance problems.
5.) I'm _almost_ sure that:
a) the wait on s2disk can be shortened by pressing 'any' keys (3 to 4 times)
b) the wait on s2ram can be shortened by pressing Fn+{$SpecialThinkpadKey} only, also several times (I did not count).
Comment 53 PJBrs 2008-01-22 01:19:58 UTC
Switched to 2.6.24-rc8, I only tested s2disk twice. Unfortunately, I still had trouble, which occurred when resuming for the second time. I didn't test in detail, I'll wait for the next release to retry.
Comment 54 PJBrs 2008-01-27 23:22:46 UTC
Now on 2.6.24, still no change. Is there anything else I might do to test? Or should I wait for 2.6.25-rc1 and see what that brings?
Comment 55 Rafael J. Wysocki 2008-01-28 12:47:00 UTC
Please apply patches from:
http://www.sisk.pl/kernel/hibernation_and_suspend/2.6.24/patches/
on top of 2.6.24 and see if that changes anything.
Comment 56 PJBrs 2008-01-29 11:47:18 UTC
Hi Rafael, I tried it, but had troubles after second time s2disk.

Any hopes for trying the -rt patch-series? And could you suggest any debugging options that might yield useful information?

By the way, I'm currently only testing whether I can reproduce the problem, I'm not testing all details as I reported in comment #52.

Anyway, thanks for the suggestion :-) Please keep alerting me when you need further testing.
Comment 57 Rafael J. Wysocki 2008-01-29 12:13:08 UTC
In the configuration as per Comment #55, please do:

# echo core > /sys/power/pm_test
# echo mem > /sys/power/state
(this will test the suspend/resume sequence with busy-waiting for 5 sec. instead of suspending)
# echo mem > /sys/power/state

and see if you have problems after that.
Comment 58 PJBrs 2008-01-30 13:58:15 UTC
No problems after that.

I had to recompile with 

CONFIG_PM_DEBUG=y
CONFIG_PM_VERBOSE=y

and now have huge logs. Do you need any dmesg output?
Comment 59 Rafael J. Wysocki 2008-01-30 14:24:14 UTC
Hm.  CONFIG_PM_VERBOSE shouldn't be necessary.

I'm starting to think that your box's BIOS wants us to save/restore the ACPI NVS area during hibernation, which we're not doing at the moment.  This functionality is going to be added at some point, though.
Comment 60 PJBrs 2008-01-31 07:14:35 UTC
(In reply to comment #59)
> Hm.  CONFIG_PM_VERBOSE shouldn't be necessary.

In fact it wasn't, but I thought I'd be better off safe than sorry :-)
Comment 61 Alon Bar-Lev 2008-02-03 10:34:18 UTC
I also have _GTF evaluation failed, using T42.

Feb  3 19:43:41 alon1 PCI: Setting latency timer of device 0000:00:1f.1 to 64
Feb  3 19:43:41 alon1 ata1.01: _GTF evaluation failed (AE 0x1001)
Feb  3 19:43:41 alon1 PM: Writing back config space on device 0000:00:1f.5 at offset 1 (was 2900007, writing 2900003)
Feb  3 19:43:41 alon1 ACPI: PCI Interrupt 0000:00:1f.5[B] -> Link [LNKB] -> GSI 5 (level, low) -> IRQ 5
Feb  3 19:43:41 alon1 PCI: Setting latency timer of device 0000:00:1f.5 to 64
Feb  3 19:43:41 alon1 ata2.01: _GTF evaluation failed (AE 0x1001)
Feb  3 19:43:41 alon1 ata1.00: ACPI cmd ef/02:00:00:00:00:a0 succeeded
Feb  3 19:43:41 alon1 ata1.00: ACPI cmd f5/00:00:00:00:00:a0 filtered out
Feb  3 19:43:41 alon1 ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
Feb  3 19:43:41 alon1 ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
Comment 62 PJBrs 2008-02-25 12:03:37 UTC
Tested 2.6.25-rc3 today. It appeared harder to reproduce the problem, I did s2disk, s2disk, s2disk, and then s2ram. Problems only occurred on resume from that s2ram. Did not test further, but I did notice the following in dmesg:

[  945.521152] ACPI: EC: missing OBF confirmation, don't expect it any longer.
[  945.605157] ACPI: EC: missing write data confirmation, don't expect it any lo
nger.
...
[  956.809345] ACPI: EC: acpi_ec_wait timeout, status = 40, expect_event = 2
[  956.809349] ACPI: EC: input buffer is not empty, aborting transaction

Incidentally, this was also the first time that I had the delays on resume. I'm attaching full logs, for good measure :-)
Comment 63 PJBrs 2008-02-25 12:05:25 UTC
Created attachment 14983 [details]
2.6.25-rc3 dmesg with sequence of s2disk s2disk s2ram s2disk
Comment 64 PJBrs 2008-04-02 13:47:02 UTC
And another test, 2.6.25-rc8; still the same as -rc3. As with -rc3, at the first time resume is problematic, I find the 
"ACPI: EC: missing OBF confirmation, don't expect it any longer.
ACPI: EC: missing write data confirmation, don't expect it any longer." messages in my logs. 

Oh well, still hoping it helps :-)

One more thing; previously (2.6.24 and earlier) I could reproduce the problem with doing s2disk twice. After I noted that the problem seemed related to memory use, I could reproduce it by doing s2disk the first time, after having loaded a lot of stuff. With 2.6.25-rc*, I wasn't able to do that. I issued s2disk four times in a row and couldn't reproduce the problem. One time I suspended to disk with only 600MB swap left, and s2disk needed ~450MB for its suspend image, still no problems. It seems that now I can only reproduce it by first issuing s2disk (for a couple of times?), then issuing s2ram, and the problem is then reproduced upon resume (from s2_ram_).
Comment 65 Alexey Starikovskiy 2008-04-02 14:04:30 UTC
Created attachment 15581 [details]
test patch

There is a test patch, which might appear in next kernel release. It would be great if you give it a try and check if it helps or not.
Comment 66 PJBrs 2008-04-04 01:04:47 UTC
Hi Alexey, the patch seems quite promising! I did a string of about 8 s2disk and s2ram, with lots of open pdf's, and could not reproduce the problem! It was a bit disappointing that after about the eighth resume I got a complete system hang, but it only occurred after "[drm] Loading R300 Microcode", so I'm pretty sure that's xorg-related :-)

I cannot give you dmesg-output right now, because of that hang, but I did a "grep EC /var/log/messages" , and I found the following. First, without your patch:

[    0.180890] ACPI: EC: EC description table is found, configuring boot EC
[    0.195535] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.221563] ACPI: EC: GPE = 0x1c, I/O: command/status = 0x66, data = 0x62
[    0.221576] ACPI: EC: driver started in poll mode
[    0.236115] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    4.591992] thinkpad_acpi: ThinkPad BIOS 1YET65WW (1.29 ), EC 1YHT29WW-1.06

... and this is happens upon resume:

[ 1532.037513] ACPI: EC: missing OBF confirmation, don't expect it any longer.
[ 1532.053513] ACPI: EC: missing write data confirmation, don't expect it any longer.

This is with your patch:

[    0.164259] ACPI: EC: EC description table is found, configuring boot EC
[    0.170870] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    0.192164] ACPI: EC: GPE = 0x1c, I/O: command/status = 0x66, data = 0x62
[    0.192170] ACPI: EC: driver started in poll mode
[    0.198634] ACPI: EC: non-query interrupt received, switching to interrupt mode
[    4.143958] thinkpad_acpi: ThinkPad BIOS 1YET65WW (1.29 ), EC 1YHT29WW-1.06

And each following line occurs upon successful resume, irrespective of whether it was from ram or disk:

[15122.865041] ACPI: EC: non-query interrupt received, switching to interrupt mode
[15217.449190] ACPI: EC: non-query interrupt received, switching to interrupt mode
[15434.263007] ACPI: EC: non-query interrupt received, switching to interrupt mode
[16918.370743] ACPI: EC: non-query interrupt received, switching to interrupt mode
[16994.838165] ACPI: EC: non-query interrupt received, switching to interrupt mode
[17130.493470] ACPI: EC: non-query interrupt received, switching to interrupt mode
[17220.770876] ACPI: EC: non-query interrupt received, switching to interrupt mode
[17432.865081] ACPI: EC: non-query interrupt received, switching to interrupt mode

I think its pretty safe to assume that this patch fixes the problem. Thanks a lot!! Let me know if there's any more testing you'd like me to do, or information you'd like to have.
Comment 67 PJBrs 2008-04-12 04:01:30 UTC
Hi, I was wondering what the current status of this patch is. Just for your interest, I'm happy to say that I haven't had resume problems since applying it to 2.6.25-rc8! So will it be included in a final 2.6.25? Or do you need any further testing to be done?
Comment 68 Alexey Starikovskiy 2008-04-12 04:04:14 UTC
No, it is considered too big for 2.6.25-rc8... I hope it manages into 2.6.26-rc1.
Comment 69 PJBrs 2008-04-16 13:29:52 UTC
Hi!

Unfortunately, after several days without problems, I finally hit a resume problem again, with 2.6.25-rc8, and your patch applied. It is more or less, but not exactly the same problem. This time, I resumed from ram and hit the non-blinking cursor. I started hitting buttons, and then special thinkpad Fn+F? combo's, but this time without any effect (except the screen getting brighter / darker, in the case of fn+pgup/pgdn ;-) ). I did wake it in the end, by hitting the power button several times, after which it woke up and then immediately shut down. I found the following in /var/log/syslog:

[132249.337459] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b0=1"
[132249.337464] ACPI: EC: read timeout, command = 130
[132252.775383] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
[132252.775387] ACPI: EC: write_cmd timeout, command = 128
[132252.775394] thinkpad_acpi: fan watchdog: error 5 while enabling fan, will try again later...
[132255.592950] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
[132255.592954] ACPI: EC: write_cmd timeout, command = 128
[132258.141983] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
[132258.141987] ACPI: EC: finish-write timeout, command = 131
[132258.141995] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for [EmbeddedControl] [20070126]
[132258.142009] ACPI Error (psparse-0537): Method parse/execution failed
[\_SB_.LID_._LID] (Node f780fb70), AE_TIME
[132260.100499] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
[132260.100503] ACPI: EC: write_cmd timeout, command = 128
[132260.100510] thinkpad_acpi: fan watchdog: error 5 while enabling fan, will try again later...

Apparently, this is much harder to hit. I never had this situation, and the patch corrected all prior "species" of this bug.

Let me know if you need more info or testing :-) (but only after the weekend, am gone for a few days). 
Comment 70 PJBrs 2008-04-16 13:30:54 UTC
P.S., the shutdown was seemingly due only to the power-button having been pressed, and shutdown proceeded as usual.Apr 16 22:02:50 slacktop kernel: [132249.337459] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b0=1"
Apr 16 22:02:50 slacktop kernel: [132249.337464] ACPI: EC: read timeout, command = 130
Apr 16 22:02:50 slacktop kernel: [132252.775383] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
Apr 16 22:02:50 slacktop kernel: [132252.775387] ACPI: EC: write_cmd timeout, command = 128
Apr 16 22:02:50 slacktop kernel: [132252.775394] thinkpad_acpi: fan watchdog: error 5 while enabling fan, will try again
later...
Apr 16 22:02:50 slacktop kernel: [132255.592950] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
Apr 16 22:02:50 slacktop kernel: [132255.592954] ACPI: EC: write_cmd timeout, command = 128
Apr 16 22:02:50 slacktop kernel: [132258.141983] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
Apr 16 22:02:50 slacktop kernel: [132258.141987] ACPI: EC: finish-write timeout, command = 131
Apr 16 22:02:50 slacktop kernel: [132258.141995] ACPI Exception (evregion-0420): AE_TIME, Returned by Handler for
[EmbeddedControl] [20070126]
Apr 16 22:02:50 slacktop kernel: [132258.142009] ACPI Error (psparse-0537): Method parse/execution failed
[\_SB_.LID_._LID] (Node f780fb70), AE_TIME
Apr 16 22:02:50 slacktop kernel: [132260.100499] ACPI: EC: acpi_ec_wait timeout, status = 0x29, event = "b1=0"
Apr 16 22:02:50 slacktop kernel: [132260.100503] ACPI: EC: write_cmd timeout, command = 128
Apr 16 22:02:50 slacktop kernel: [132260.100510] thinkpad_acpi: fan watchdog: error 5 while enabling fan, will try again
later...
Comment 71 PJBrs 2008-05-11 00:58:43 UTC
Managed to capture full dmesg output with resume problem on 2.6.25 with your patch applied. This time, I did not try to speed up resume by repeatedly pressing the power button. It took a few minutes to resume, after which everything worked as it should. I'll attach the dmesg output. Next thing I'll try will probably be 2.6.26-rc2 or rc3. Reproducing the problem takes a long time, uptime was 5 days (including s2disk / s2ram several times) before I hit the problem again.

By the way, should I keep using this bug report to report on this resume problem, or open a new one? The point is that all the symptoms remain the same, only the type of button pushed to speed things up has changed.
Comment 72 PJBrs 2008-05-11 01:02:24 UTC
Created attachment 16097 [details]
dmesg output - 2.6.25 with patch
Comment 73 PJBrs 2008-05-29 03:58:14 UTC
While trying 2.6.26-rc? I noticed that I still had one not-so-handy command in rc.local, i.e.:

echo enable,0xffff > /proc/acpi/ibm/hotkey

I removed this and rebooted 2.6.25 with patch, and have not had any problems since, uptime says 14 days up, with multiple s2ram and/or s2disk each day. So I think the bug should be marked as verified.

Apologies for the mix-up, I should have tested without any tampering with thinkpad-acpi module options :-)
Comment 74 Len Brown 2009-04-02 04:50:43 UTC
commit 223883b7aafa02410ed2e571d6032c876d0b23b8
Author: Alexey Starikovskiy <astarikovskiy@suse.de>
Date:   Fri Mar 21 17:07:21 2008 +0300

    ACPI: EC: Switch off GPE mode during suspend/resume

shipped in 2.6.26-rc1

closed