Bug 13306 - hibernate slow on _second_ run
hibernate slow on _second_ run
Status: CLOSED UNREPRODUCIBLE
Product: ACPI
Classification: Unclassified
Component: Power-Battery
All Linux
: P1 normal
Assigned To: Zhang Rui
:
Depends on:
Blocks: 7216 13070
  Show dependency treegraph
 
Reported: 2009-05-14 09:34 UTC by Johannes Berg
Modified: 2009-09-30 21:00 UTC (History)
6 users (show)

See Also:
Kernel Version: 2.6.30-rc5-wl-23297-gbf2c6a3-dirty
Tree: Mainline
Regression: Yes


Attachments
acpi dump (115.44 KB, text/plain)
2009-08-25 00:02 UTC, Johannes Berg
Details
dmesg containing boot + multiple hibernates (254.54 KB, text/plain)
2009-09-04 04:58 UTC, Johannes Berg
Details
debug patch (782 bytes, patch)
2009-09-18 03:31 UTC, Zhang Rui
Details | Diff
debug patch (718 bytes, patch)
2009-09-21 02:45 UTC, Zhang Rui
Details | Diff
log (71.50 KB, application/x-bzip)
2009-09-24 10:00 UTC, Johannes Berg
Details

Description Johannes Berg 2009-05-14 09:34:32 UTC
On the first run, hibernate is acceptably fast. However, on the second and further attempts it stalls both during suspend and resume, at varying points (I'll try to capture the points but it's not always the same)

(This is on a MacBook unibody 13.3")
Comment 1 Rafael J. Wysocki 2009-05-14 14:33:03 UTC
What was the last kernel where it didn't happen?
Comment 2 Johannes Berg 2009-05-14 14:45:45 UTC
I'm fairly sure it started happening at -rc1 or -rc2 but haven't double-checked now. I just remember seeing it, swearing, and then not seeing it etc. until I realised it had to do with being the second hibernate.
Comment 3 Rafael J. Wysocki 2009-05-14 18:06:59 UTC
This: http://marc.info/?l=linux-kernel&m=124219402109245&w=4 ?
Comment 4 Andrew Morton 2009-05-16 23:49:20 UTC
(In reply to comment #3)
> This: http://marc.info/?l=linux-kernel&m=124219402109245&w=4 ?

It's unclear what you're asking Johannes here?  That email thread looks pretty different from the problem which Johannes is seeing?
Comment 5 Rafael J. Wysocki 2009-05-17 10:09:40 UTC
Sorry.  There's a patch in the email that I'd recommend to try.  Still, it's already in the Linus' tree, so I'd recommend trying the top of the Linus' tree, actually. :-)
Comment 6 Len Brown 2009-05-19 01:56:35 UTC
Johannes, does the latest upstream still have the issue?
Comment 7 Johannes Berg 2009-05-19 10:22:27 UTC
Sorry, it took me a while to find time to reproduce/test -- the issue seems to be fixed in current upstream (I'm currently at 363383277081ce831642b72df40932ee05ce40a2)
Comment 8 Rafael J. Wysocki 2009-05-19 22:02:14 UTC
Good, thanks for the verification.
Comment 9 Johannes Berg 2009-05-21 12:30:10 UTC
Unfortunately, the bug has reappeared with the latest tree -- it seems I need to actually work with the machine between hibernations in order to make it not slow.

Note the timing in these resume runs that I still had in my dmesg buffer now, on 2.6.30-rc6-wl-24018-g8020553-dirty (which is wireless-testing/master merged with Linus's master), there's always a ~20 second delay there:

[46821.976003] ACPI: Waking up from system sleep state S4
[46822.521462] HDA Intel 0000:00:08.0: restoring config space at offset 0x1 (was 0xb00006, writing 0xb00002)
[46822.529406] forcedeth 0000:00:0a.0: restoring config space at offset 0x1 (was 0xb00003, writing 0xb00007)
[46822.537330] ahci 0000:00:0b.0: restoring config space at offset 0x1 (was 0xb00403, writing 0xb00407)
[46843.140620] pci 0000:00:03.2: PME# disabled


[76097.115916] ACPI: Waking up from system sleep state S4
[76097.641471] HDA Intel 0000:00:08.0: restoring config space at offset 0x1 (was 0xb00006, writing 0xb00002)
[76097.649409] forcedeth 0000:00:0a.0: restoring config space at offset 0x1 (was 0xb00003, writing 0xb00007)
[76097.657325] ahci 0000:00:0b.0: restoring config space at offset 0x1 (was 0xb00403, writing 0xb00407)
[76117.200757] pci 0000:00:03.2: PME# disabled


[85368.496002] ACPI: Waking up from system sleep state S4
[85369.061463] HDA Intel 0000:00:08.0: restoring config space at offset 0x1 (was 0xb00006, writing 0xb00002)
[85369.069382] pci 0000:00:09.0: restoring config space at offset 0x7 (was 0x628000f0, writing 0x228000f0)
[85369.077592] forcedeth 0000:00:0a.0: restoring config space at offset 0x1 (was 0xb00003, writing 0xb00007)
[85369.086071] ahci 0000:00:0b.0: restoring config space at offset 0x1 (was 0xb00403, writing 0xb00407)
[85389.570622] pci 0000:00:03.2: PME# disabled
Comment 10 Rafael J. Wysocki 2009-07-07 10:53:11 UTC
On Tuesday 07 July 2009, you wrote:
> On Tue, 2009-07-07 at 02:00 +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.29 and 2.6.30.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.29 and 2.6.30.  Please verify if it still should
> > be listed and let me know (either way).
> > 
> > 
> > Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=13306
> > Subject		: hibernate slow on _second_ run
> > Submitter	: Johannes Berg <johannes@sipsolutions.net>
> > Date		: 2009-05-14 09:34 (54 days old)
> 
> Still exists in 31-rc2.
Comment 11 Rafael J. Wysocki 2009-07-27 21:03:56 UTC
On Monday 27 July 2009, Johannes Berg wrote:
> On Sun, 2009-07-26 at 22:45 +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.29 and 2.6.30.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.29 and 2.6.30.  Please verify if it still should
> > be listed and let me know (either way).
> 
> The problem persists in 31-rc3.
Comment 12 Rafael J. Wysocki 2009-07-27 21:09:19 UTC
On Monday 27 July 2009, Johannes Berg wrote:
> On Sun, 2009-07-26 at 22:45 +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.29 and 2.6.30.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.29 and 2.6.30.  Please verify if it still should
> > be listed and let me know (either way).
> 
> The problem persists in 31-rc3.

Thanks for the update.

Unfortunately, I have no idea about what's wrong here.

Generally, there are three possible areas the problem may be related to:
(1) interrupts,
(2) CPU hotplug,
(3) ordering of the platform callbacks,

but I can't say much more than that.
Comment 13 Johannes Berg 2009-07-31 09:46:15 UTC
Still persists in -rc4, but using "shutdown" mode works around it.
Comment 14 Rafael J. Wysocki 2009-07-31 11:30:59 UTC
That indicates a problem with ACPI AML.

Does suspend to RAM work correctly at the same time?
Comment 15 Johannes Berg 2009-07-31 11:39:35 UTC
I can't STR because I have an nvidia card.
Comment 16 Rafael J. Wysocki 2009-07-31 11:58:51 UTC
Ouch.

OK, can you please try STD with acpi_sleep=old_ordering, then?
Comment 17 Johannes Berg 2009-07-31 13:28:49 UTC
Interestingly, now after reproducing the problem with method=platform, method=shutdown also incurs the delay, where before I hibernated many times with method=shutdown w/o the problem -- it seems that once the problem is there it doesn't go away even with the shutdown method, but using shutdown method stops it from happening.
Comment 18 Johannes Berg 2009-07-31 13:31:06 UTC
(In reply to comment #16)
> OK, can you please try STD with acpi_sleep=old_ordering, then?

I suppose I'll have to reboot for that now?
Comment 19 Johannes Berg 2009-08-06 08:17:03 UTC
(In reply to comment #16)

> OK, can you please try STD with acpi_sleep=old_ordering, then?

Doesn't help either.
Comment 20 Rafael J. Wysocki 2009-08-06 12:13:15 UTC
So, I have no ideas except for the 'shutdown' mode workaround.

Can you please build the current git kernel with CONFIG_PM_VERBOSE and attach dmesg output from that including one hibernation-resume cycle?
Comment 21 Johannes Berg 2009-08-06 12:27:49 UTC
Would -rc5 be current enough?
Comment 22 Rafael J. Wysocki 2009-08-06 12:44:00 UTC
Depends. :-)

For the information I'd like to collect it would, for testing fixes that went in after -rc5, well, is wouldn't. ;-)
Comment 23 Johannes Berg 2009-08-07 12:57:00 UTC
[11639.607546] acpi device:36: legacy restore
[11639.609099] acpi device:37: legacy restore
[11639.610682] battery PNP0C0A:00: legacy restore
[11640.002609] Clocksource tsc unstable (delta = -371189971 ns)
[11660.130542] acpi LNXTHERM:00: legacy restore
[11660.132622] pci 0000:00:00.0: restore
[11660.134767] pci 0000:00:00.1: restore
Comment 24 Rafael J. Wysocki 2009-08-07 22:55:03 UTC
Hmm.  I wonder if this commit changes anything:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=74b5820808215f65b70b05a099d6d3c969b82689
Comment 25 Johannes Berg 2009-08-08 15:36:51 UTC
No such luck, that doesn't change anything (applied it as a patch to my tree)
Comment 26 Johannes Berg 2009-08-16 09:59:17 UTC
Hmm, ok, this is confusing -- I'm now seeing the problem even with the shutdown method, with the same message as in comment #23.

I think the one time I tried whether or not shutdown was a workaround I had done a single suspend with "platform" and then future ones with "shutdown" which seemed to help at the time.

In any case, "shutdown" is also incurring the delay now. Any help appreciated.

Oh and the TSC line in the log isn't always there, I don't think it's related.
Comment 27 Rafael J. Wysocki 2009-08-16 22:42:12 UTC
I wonder if you can reproduce it with:

# echo devices > /sys/power/pm_test

and then

# echo disk > /sys/power/state

repeated several times in a row.

If you can't, then please try with

# echo core > /sys/power/pm_test

instead of the first command.
Comment 28 Johannes Berg 2009-08-17 07:58:34 UTC
I can reproduce the delay (at battery legacy restore) after having already hibernated a bunch of times with "devices" in pm_test.

I think you meant after a clean reboot though, I'll test that later.
Comment 29 Johannes Berg 2009-08-22 07:16:41 UTC
With this patch

diff --git a/drivers/acpi/battery.c b/drivers/acpi/battery.c
index 58b4517..683b80a 100644
--- a/drivers/acpi/battery.c
+++ b/drivers/acpi/battery.c
@@ -499,6 +499,7 @@ static int acpi_battery_update(struct acpi_battery *battery)
 {
 	int result, old_present = acpi_battery_present(battery);
 	result = acpi_battery_get_status(battery);
+	printk(KERN_DEBUG "battery get status %d\n", result);
 	if (result)
 		return result;
 #ifdef CONFIG_ACPI_SYSFS_POWER
@@ -507,19 +508,26 @@ static int acpi_battery_update(struct acpi_battery *battery)
 		battery->update_time = 0;
 		return 0;
 	}
+	printk(KERN_DEBUG "checked battery present\n");
 #endif
 	if (!battery->update_time ||
 	    old_present != acpi_battery_present(battery)) {
+		printk(KERN_DEBUG "battery present check again, getting info\n");
 		result = acpi_battery_get_info(battery);
+		printk(KERN_DEBUG "battery got info %d\n", result);
 		if (result)
 			return result;
 		acpi_battery_init_alarm(battery);
+		printk(KERN_DEBUG "inited battery alarm\n");
 	}
+	printk("update done\n");
 #ifdef CONFIG_ACPI_SYSFS_POWER
 	if (!battery->bat.dev)
 		sysfs_add_battery(battery);
 #endif
-	return acpi_battery_get_state(battery);
+	result = acpi_battery_get_state(battery);
+	printk(KERN_DEBUG "battery got state %d\n", result);
+	return result;
 }
 
 /* --------------------------------------------------------------------------


I get

[19299.509310] battery PNP0C0A:00: legacy restore
[19300.530433] battery get status 0
[19300.532196] checked battery present
[19300.534051] battery present check again, getting info
[19315.970806] battery got info 0
[19315.972903] inited battery alarm
[19315.975057] update done
[19320.150618] battery got state 0

so it seems that acpi_battery_get_info() takes ~15s, and acpi_battery_get_state() ~5s.
Comment 30 Rafael J. Wysocki 2009-08-24 22:17:26 UTC
This seems to be related to the embedded controller.  Reassigning to ACPI.
Comment 31 Johannes Berg 2009-08-24 22:20:28 UTC
(In reply to comment #29)

> [19299.509310] battery PNP0C0A:00: legacy restore
> [19300.530433] battery get status 0
> [19300.532196] checked battery present
> [19300.534051] battery present check again, getting info
> [19315.970806] battery got info 0
> [19315.972903] inited battery alarm
> [19315.975057] update done
> [19320.150618] battery got state 0
> 
> so it seems that acpi_battery_get_info() takes ~15s, and
> acpi_battery_get_state() ~5s.

acpi_battery_get_info() just executes _BIF, and acpi_battery_get_state()
executes _BST, and they both do little else so it seems those invocations cause
the delay.
Comment 32 Alexey Starikovskiy 2009-08-24 23:50:43 UTC
please post outputs from dmesg and acpidump.
do you see slow read from /proc/acpi/battery/*/info too after restore?
does /proc/interrupts change during the read of the above file?
Comment 33 Johannes Berg 2009-08-25 00:02:09 UTC
I don't have /proc/acpi/battery/, but reading /sys/class/power_supply/BAT0/* is not quite that slow, about 4 seconds.

will attach acpidump, but don't have a useful dmesg right now, will attach after my next reboot.

The acpi interrupt doesn't change while reading the sysfs files.
Comment 34 Johannes Berg 2009-08-25 00:02:40 UTC
Created attachment 22836 [details]
acpi dump
Comment 35 Johannes Berg 2009-09-04 04:58:17 UTC
Created attachment 23005 [details]
dmesg containing boot + multiple hibernates

Here's the requested dmesg output from boot including I think two hibernates.
Comment 36 Johannes Berg 2009-09-04 05:27:26 UTC
Can't seem to be able to reopen this bug. Somebody please do that -- I've provided the info now.
Comment 37 Zhang Rui 2009-09-18 03:31:35 UTC
Created attachment 23115 [details]
debug patch

please apply this debug patch,
set CONFIG_ACPI_DEBUG
reboot with boot option "log-buf-len=4M"
re-do the test and attach the dmesg output after the bug is reproduced.
Comment 38 Zhang Rui 2009-09-21 02:45:43 UTC
Created attachment 23122 [details]
debug patch

refreshed patch attached. please try this one
Comment 39 Johannes Berg 2009-09-24 10:00:09 UTC
Created attachment 23170 [details]
log

Unfortunately truncated, but I had no idea it would be _this_ huge. My logbuf was only 2M, but 4M certainly wouldn't have been big enough.

If you tell me how to do the same with tracing instead of acpi, I can do -- but this took like 15 minutes to scroll by on the screen while hibernating so I won't be doing that again.
Comment 40 Zhang Rui 2009-09-28 05:52:01 UTC
please run "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" before you do the second hibernation.
does the long latency still exist?
Comment 41 Zhang Rui 2009-09-28 05:54:38 UTC
Note you need to set CONFIG_DETECT_HUNG_TASK and rebuilt your kernel if "/proc/sys/kernel/hung_task_timeout_secs" doesn't exist.
Comment 42 Johannes Berg 2009-09-28 07:34:53 UTC
Umm, I'm happy to try, but I'm sure that it's unrelated to the problem. The message just comes up because the debugging you asked for makes my system stall for long enough due to display output being slow that it thinks tasks are stuck. Without the debugging it doesn't come up.
Comment 43 Johannes Berg 2009-09-30 15:46:59 UTC
Hmm hmm. Looks like we may never find out -- it appears that my latest OSX upgrade changed the firmware too and now the ACPI battery reading doesn't stall this way any more. At least I haven't seen the problem since I used OSX a couple of days ago to do some stuff and upgraded it while I was using it...
Comment 44 Rafael J. Wysocki 2009-09-30 20:59:29 UTC
If this was a firmware bug, I don't want to know what it was. :-)
Comment 45 Johannes Berg 2009-09-30 21:00:34 UTC
Can't reproduce any more, marking so.

Note You need to log in before you can comment on or make changes to this bug.