Bug 6977 - S3: SMP resume hang - 2.6.17 regression - Dell D420
Summary: S3: SMP resume hang - 2.6.17 regression - Dell D420
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake
Hardware: i386 Linux
Importance: P2 normal
Assignee: acpi_power-sleep-wake
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-08-09 02:11 UTC by Steinar H. Gunderson
Modified: 2006-10-24 19:33 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.17
Subsystem:
Regression: ---
Bisected commit-id:


Description Steinar H. Gunderson 2006-08-09 02:11:32 UTC
Most recent kernel where this bug did not occur: 2.6.16
Distribution: Debian GNU/Linux unstable ("sid")
Hardware Environment: Dell Latitude D420, Core Duo 1.2GHz
Problem Description: When resuming from suspend-to-RAM (s2ram -s -p -f or
equivalent), the disk never spins up, the screen stays black and the system is
hung. Bisecting between 2.6.16 and 2.6.17 identifies commit
78eef01b0fae087c5fadbd85dd4fe2918c3a015f as the culprit; reverting it
against 2.6.17 makes suspend work well (or at least much better than it does
with plain 2.6.17). Setting CONFIG_SMP=n also makes it work.

See the lkml thread at
http://www.ussg.iu.edu/hypermail/linux/kernel/0608.0/1457.html for the rest of
the details.
Comment 1 Pavel Machek 2006-08-09 02:52:22 UTC
FYI, the commit is: [PATCH] on_each_cpu(): disable local interrupts

When on_each_cpu() runs the callback on other CPUs, it runs with local
interrupts disabled.  So we should run the function with local interrupts
disabled on this CPU, too.

And do the same for UP, so the callback is run in the same environment on both
UP and SMP.  (strictly it should do preempt_disable() too, but I think
local_irq_disable is sufficiently equivalent).

Also uninlines on_each_cpu().  softirq.c was the most appropriate file I could
find, but it doesn't seem to justify creating a new file.

Oh, and fix up that comment over (under?) x86's smp_call_function().  It
drives me nuts.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

...Okay, just add a WARN_ON() to on_each_cpu() and note where it hangs... but
you'll need some kind of console for that.

Does it work with minimal drivers?
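The behaviour described in the commit message, plus the WARN_ON() debugging idea, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: `irqs_enabled` stands in for this CPU's interrupt flag, the `model_*` helpers are hypothetical stand-ins for local_irq_disable()/local_irq_enable(), and the cross-CPU smp_call_function() step is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-in for this CPU's interrupt-enable flag (hypothetical). */
static bool irqs_enabled = true;
/* Counts WARN_ON-style hits: callers that arrive with IRQs already off. */
static int warn_count = 0;

/* Hypothetical stand-ins for local_irq_disable()/local_irq_enable(). */
static void model_local_irq_disable(void) { irqs_enabled = false; }
static void model_local_irq_enable(void)  { irqs_enabled = true; }

/* Models only the local half of on_each_cpu() after the commit:
 * the smp_call_function() part for other CPUs is omitted. */
static void model_on_each_cpu(void (*func)(void *), void *info)
{
    /* Debugging aid: complain when entered with IRQs already disabled. */
    if (!irqs_enabled) {
        warn_count++;
        fprintf(stderr, "WARNING: on_each_cpu() called with IRQs disabled\n");
    }
    model_local_irq_disable();
    func(info);
    model_local_irq_enable();   /* unconditional: caller's state is lost */
}

/* Example callback standing in for something like a TLB flush. */
static void count_calls(void *info)
{
    assert(!irqs_enabled);      /* the callback always runs with IRQs off */
    (*(int *)info)++;
}
```

A caller that enters with interrupts already off trips the warning and returns with them switched back on, which is the concern raised later in Comment 4.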

Comment 2 Steinar H. Gunderson 2006-08-09 02:55:20 UTC
The machine has serial if I hook it up to the docking station, but I don't think
I have a proper cable here (that will have to wait a few weeks).

It does not work with minimal drivers.
Comment 3 Pavel Machek 2006-08-09 03:02:49 UTC
Hi!

> ------- Additional Comments From sgunderson@bigfoot.com  2006-08-09 02:55 -------
> The machine has serial if I hook it up to the docking station, but I don't think
> I have a proper cable here (that will have to wait a few weeks).
> 
> It does not work with minimal drivers.

You could also take the kernel, replace half of the on_each_cpu() calls with
the new variant, and see if it breaks... basically binary searching over the
call sites. Unfortunately, I do not see on_each_cpu() used in kernel/power/...

Maybe the one in flush_tlb_all()?

								Pavel
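The binary-search idea can be sketched as a plain bisection over call sites. Assumptions: exactly one call site triggers the hang, and `hang_oracle` is a hypothetical callback standing for "build with sites [lo, hi) switched to the new on_each_cpu(), suspend, resume, and report whether it hung". In practice each probe is a full build-and-test cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical oracle: nonzero if resume hangs when call sites [lo, hi)
 * use the new on_each_cpu() and all other sites use the old one. */
typedef int (*hang_oracle)(size_t lo, size_t hi, void *ctx);

/* Bisect n candidate call sites, assuming exactly one causes the hang.
 * Invariant: the culprit index always lies in [lo, hi). */
static size_t find_culprit(size_t n, hang_oracle hangs, void *ctx)
{
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (hangs(lo, mid, ctx))
            hi = mid;   /* culprit is in the first half */
        else
            lo = mid;   /* otherwise it must be in the second half */
    }
    return lo;
}

/* Simulated oracle for testing: the culprit's index is passed via ctx. */
static int simulated_hang(size_t lo, size_t hi, void *ctx)
{
    size_t culprit = *(size_t *)ctx;
    return culprit >= lo && culprit < hi;
}
```

With n candidate sites this needs roughly log2(n) test builds instead of n.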


Comment 4 Pavel Machek 2006-08-09 03:14:45 UTC
Andrew, we are talking about
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=78eef01b0fae087c5fadbd85dd4fe2918c3a015f;hp=ac2b898ca6fb06196a26869c23b66afe7944e52e
here. 

I think it is wrong: it now does local_irq_enable() unconditionally, even if
interrupts were disabled before. That's probably what hurts suspend here, we
call flush_tlb_all() from pretty low level code.

Should we perhaps restore the previous enabled/disabled interrupt state
instead of enabling interrupts unconditionally?
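A userspace sketch contrasting the two behaviours discussed in this comment. Assumptions: `irqs_on` models the local interrupt flag, and `model_irq_save()`/`model_irq_restore()` are hypothetical stand-ins for the kernel's local_irq_save()/local_irq_restore(). This illustrates the save/restore pattern only; it is not the fix that was eventually merged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the local interrupt-enable flag (hypothetical). */
static bool irqs_on = true;

/* Hypothetical flags type; the kernel uses an unsigned long. */
typedef bool irqflags_t;

/* Like local_irq_save(flags): remember the caller's state, then disable. */
static irqflags_t model_irq_save(void)
{
    irqflags_t old = irqs_on;
    irqs_on = false;
    return old;
}

/* Like local_irq_restore(flags): put back exactly what the caller had. */
static void model_irq_restore(irqflags_t flags) { irqs_on = flags; }

/* Callback that checks it runs with interrupts disabled. */
static void check_irqs_off(void *info) { (void)info; assert(!irqs_on); }

/* Behaviour described in this comment: always re-enable on exit. */
static void on_each_cpu_unconditional(void (*func)(void *), void *info)
{
    irqs_on = false;   /* models local_irq_disable() */
    func(info);
    irqs_on = true;    /* models local_irq_enable(): ignores caller state */
}

/* Proposed alternative: restore whatever state the caller had. */
static void on_each_cpu_restoring(void (*func)(void *), void *info)
{
    irqflags_t flags = model_irq_save();
    func(info);
    model_irq_restore(flags);
}
```

Both variants run the callback with interrupts off; only the restoring one leaves a low-level caller (such as suspend/resume code that already disabled interrupts) in the state it expects.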
Comment 5 Shaohua 2006-08-09 17:48:54 UTC
Did you try Linus's latest tree? We have some fixes for these issues, such as
changing flush_tlb_all to flush_tlb_local.
Comment 6 Steinar H. Gunderson 2006-08-10 02:38:54 UTC
Indeed! 2.6.18-rc4 (commit 9f737633e6ee54fc174282d49b2559bd2208391d) works fine
for me. It should be noted that I got the following during boot, though:

Lukewarm IQ detected in hotplug locking
BUG: warning at kernel/cpu.c:38/lock_cpu_hotplug()
 [<b0132383>] lock_cpu_hotplug+0x40/0x63
 [<b012a8e2>] __create_workqueue+0x50/0x11b
 [<f8af832e>] cpufreq_governor_dbs+0x91/0x29d [cpufreq_ondemand]
 [<b02162de>] __cpufreq_governor+0x3f/0xb9
 [<b0216425>] __cpufreq_set_policy+0xcd/0x100
 [<b0216f18>] store_scaling_governor+0x143/0x187
 [<b0216b86>] handle_update+0x0/0x5
 [<b01b4700>] kobject_cleanup+0x29/0x5e
 [<b0216dd5>] store_scaling_governor+0x0/0x187
 [<b02168af>] store+0x2e/0x3e
 [<b018af45>] sysfs_write_file+0x8c/0xb4
 [<b018aeb9>] sysfs_write_file+0x0/0xb4
 [<b0157e93>] vfs_write+0xa1/0x143
 [<b0158483>] sys_write+0x3c/0x63
 [<b0102c6b>] syscall_call+0x7/0xb

Overall, though, it suspends and resumes just fine. Is there any specific fix I
can point to so that it can be backported to my distribution's 2.6.17 kernel?
Comment 7 Steinar H. Gunderson 2006-08-10 05:42:15 UTC
Applying 55b2355eefc2f160246226d4d69fed431173a4d5 on top of 2.6.17 makes resume
work for me -- not every time (it sometimes hangs during suspend, probably due
to other bugs in the kernel), but much, much better than it was in 2.6.17. Should
this be closed, or could the fix go into 2.6.17.x?
