Bug 6450 - soft lockup on suspend to mem if cpufreq loaded
Summary: soft lockup on suspend to mem if cpufreq loaded
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Shaohua
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-04-27 08:41 UTC by Daniel De Graaf
Modified: 2006-08-09 19:01 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.16.11
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
acpidump output (69.13 KB, text/plain)
2006-06-01 19:46 UTC, Daniel De Graaf
Details
2.6.18-rc2 configuration (48.85 KB, text/plain)
2006-07-27 18:42 UTC, Daniel De Graaf
Details
dmesg output (32.64 KB, text/plain)
2006-07-30 18:13 UTC, Daniel De Graaf
Details

Description Daniel De Graaf 2006-04-27 08:41:40 UTC
Most recent kernel where this bug did not occur: From testing with liveCDs,
around 2.6.12.
Distribution: Debian
Hardware Environment: Dell Latitude D510
Software Environment: reproduced in KDE/Xorg7 and from console
Problem Description: When the speedstep_centrino module is loaded, suspend will
produce a lockup.

Steps to reproduce:
1) modprobe speedstep_centrino
2) echo mem > /sys/power/state
3) press power key to resume and lockup

I've tried this with and without the cpufreq_userspace module loaded. Since
suspend seems to disable the serial console, I've copied all of this manually.

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0
BUG: soft lockup detected on CPU#0!

Pid: 4, comm:             events/0
EIP: 0060:[<c01373c7>] CPU: 0
EIP is at handle_IRQ_event+0x18/0x69
 EFLAGS: 00000246    Not tainted  (2.6.16.11+z4 #1)
EAX: 0000000b EBX: f7bbfcc0 ECX: f7bbfcc0 EDX: c1907ee8
ESI: 00000000 EDI: c1907ee8 EBP: c1907ee8 DS: 007b ES: 007b
CR0: 8005003b CR2: 80050033 CR3: 37030000 CR4: 000006d0
 [<c01df6de>] vgacon_switch+0x77/0xcc
 [<c0102e14>] apic_timer_interrupt+0x1c/0x24
 [<c01374e0>] __do_IRQ+0x58/0xa8
 [<c01048ef>] do_IRQ+0x19/0x2a
 [<c0102df2>] common_interrupt+0x1a/0x20
 [<c033d019>] schedule+0x302/0x5b1
 [<c01265ae>] worker_thread+0x14c/0x14e
 [<c01265ae>] worker_thread+0x14c/0x14e
 [<c0115394>] default_wake_function+0x0/0x15
 [<c01291eb>] kthread+0xd2/0xd6
 [<c0126462>] worker_thread+0x0/0x14e
 [<c0129119>] kthread+0x0/0xd6
 [<c0101005>] kernel_thread_helper+0x5/0xb

Kernel config is at http://daniel.6dns.org/misc/config-2.6.16+z4 if needed
Comment 1 Shaohua 2006-04-27 18:30:20 UTC
1. Intel machine check reporting enabled on CPU#0
2. BUG: soft lockup detected on CPU#0!
Did you see other message between the output 1 & 2 above? Say xx device 
resumed or something.

if the output 1 is just followed by output 2, this is quite strange, 'events/0
' is stoped in current stage.
Comment 2 Daniel De Graaf 2006-04-27 20:13:06 UTC
The time that I recorded, there was no output between them - that is exactly
what was displayed on the screen. Other times that it has locked up, the
backtrace has been longer and it scrolled the EIP lines off, so I couldn't see
if anything was happening.

The output "
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0
" will be printed and it then hangs for ~30 sec before printing the backtrace.
Normally (without the cpufreq) it takes at most 5 seconds before printing a
bunch of device resumed messages and pulling up X.

The first 4 lines of the backtrace are always the same except for the address in
[<>]. If you need the output of other backtraces, I can type them up.
Comment 3 Shaohua 2006-04-27 20:25:23 UTC
Ok, can you enable 'CONFIG_CPU_FREQ_DEBUG' and with boot 
option 'cpufreq.debug=7' and tell me the output in resume? That helps us root 
cause the issue. Thanks!
Comment 4 Daniel De Graaf 2006-04-27 20:54:11 UTC
Rebooted with that command line (my kernel already had the debug option
selected). It goes in the "kernel" line in grub, correct?

There is no output before the backtrace is printed (after the Intel machine
check reporting enabled on CPU#0), and for the 1ms as it's scrolling up, I don't
see any more output. Most of the backtraces have been filling the whole screen,
so I'm not certain of no output. The EIP value has moved a bit, most recent one
was at yenta_interrupt+0x1d/0xc4 [yenta_socket] - but still with pid 4, events/0

Before the power down, I see several lines like this:
ACPI: PCI Interrupt for device 00**** disabled.

Any suggestions for getting this output to the serial console? Booting with
console=ttyS0,9600 works before the suspend, but not after or during. The last
output is "Stopping tasks: ===========================================|"
Comment 5 Shaohua 2006-04-27 20:58:56 UTC
No, the serial driver isn't resumed in current stage, so there is no way to 
get the output. Any change you can test the latest kernel like 2.6.17-rc2?
Comment 6 Daniel De Graaf 2006-04-27 21:18:16 UTC
Will test 2.6.17-rc3 tomorrow. I'm compiling using the debian make-kpkg, if that
makes any difference.
Comment 7 Daniel De Graaf 2006-04-28 07:23:04 UTC
Just tried 2.6.17-rc3. I'm getting a different location of the error, but still
no information from cupfreq.debug. Here's the new backtrace (I omitted the
addresses in <>)

Intel machine check reporting enabled on CPU#0
BUG: soft lockup detected on CPU#0!
 <c011f7e3> update_process_times+0x2b/0x78   <> timer_interrupt+0x3d/0x74
 <> handle_IRQ_event+0x39/0x69   <> __do_IRQ+0x58/0xa8
 <> do_IRQ+0x19/0x2e   <> common_interrupt+0x1a/0x20
 <> handle_IRQ_event+0x18/0x69   <> __do_IRQ+0x58/0xa8
 <> debug_mutex_add_waiter+0x24/0x44   <> do_IRQ+0x19/0x2e
 <> common_interrupt+0x1a/0x20   <> schedule+0x2bc/0x5a8
 <> worker_thread+0x14c/0x14e   <> linkwatch_event+0x0/0x33
 <> worker_thread+0x14c/0x14e   <> default_wake_function+0x0/0x15
 <> kthread+0xd2/0xd6   <> worker_thread+0x0/0x14e
 <> kthread+0x0/0xd6   <> kernel_thread_helper+0x5/0xb
Comment 8 Shaohua 2006-04-28 18:55:15 UTC
Can you add below line in the first line (befor any '#include <xxx>') of 
drivers/base/sys.c?
#define DEBUG

A guess is some sysdev driver wrongly enabled enable interrupt in their resume 
method, I'd like to know which one the cause. cpufreq driver is the best 
candidate, but my check doesn't find any violation, so I'd like to know more. 
Comment 9 Daniel De Graaf 2006-04-28 21:07:11 UTC
Identical backtrace from the kernel I compiled without DEBUG. Any other files
you would suggest adding debugging output to?
Comment 10 Shaohua 2006-04-28 23:22:25 UTC
Maybe your log level is too high. please do
echo 8 > /proc/sys/kernel/printk
before suspend. I guess you will see more info at resume. please try.
Comment 11 Daniel De Graaf 2006-04-29 09:08:30 UTC
Yes, this gave me much more output.

Messages on suspend:
Suspending System Devices
Suspending type 'acpi':
 acpi0
Suspending type 'irqrouter':
 irqrouter0
---continues in reverse order of below---
-----------------------------------------Suspend

Back to C!
Resuming system devices:
Resuming 'cpu0':
 cpu0
cpufreq-core: resuming cpu 0
Resuming type 'timer':
 timer0
Resuming type 'i8259':
 i82590
Resuming type 'i8237':
 i82370
Resuming type 'lapic':
 lapic0
Resuming type 'irqrouter':
 irqrouter0
BUG: soft lockup detected on CPU#0!
---same backtrace as before---
Comment 12 Shaohua 2006-04-29 20:32:24 UTC
please try the patch I just sent out at
http://marc.theaimsgroup.com/?l=linux-kernel&m=114636764413762&w=2
Comment 13 Daniel De Graaf 2006-04-29 21:50:42 UTC
tried the patch; no change
Comment 14 Shaohua 2006-04-29 22:13:28 UTC
Ok, please also apply:
http://marc.theaimsgroup.com/?l=linux-kernel&m=114636752911313&w=2
both patches should be applied.
Comment 15 Daniel De Graaf 2006-04-29 22:48:11 UTC
No change with both patches applied. I did notice something I found interesting
- before turning of the screen, this is displayed:

Suspending type 'cpu':
    cpu0
cpufreq-core: resuming cpu 0

Why is that "resuming" and not "suspending"?
Comment 16 Shaohua 2006-04-29 22:56:46 UTC
There is a typo in the code, it should output 'suspend' instead of 'resume'.
That's strange, the two patches are supposed to fix your problem. Is there any 
difference of the output with the two patches?
Comment 17 Daniel De Graaf 2006-05-30 20:37:27 UTC
Tried with 2.6.17-rc5, including the above patches (the second one was already
applied). Same output, with a slightly different backtrace:

Intel machine check reporting enabled on CPU#0
BUG: soft lockup detected on CPU#0!
 <> update_process_times+0x2b/0x78   <> timer_interrupt+0x3d/0x74
 <> handle_IRQ_event+0x39/0x69   <> __do_IRQ+0x58/0xa8
 <> do_IRQ+0x19/0x2e   <> common_interrupt+0x1a/0x20
 <> yenta_interrupt+0x1d/0xc4 [yenta_socket]  <> handle_IRQ_event+0x39/0x69
 <> __do_IRQ+0x58/0xa8  <> do_IRQ+0x19/0x2e
 <> common_interrupt+0x1a/0x20  <> handle_IRQ_event+0x18/0x69
 <> __do_IRQ+0x58/0xa8  <> debug_mutex_add_waiter+0x24/0x44
 <> do_IRQ+0x19/0x2e  <> common_interrupt+0x1a/0x20
 <> schedule+0x2bc/0x5a8  <> worker_thread+0x14c/0x14e
 <> linkwatch_event+0x0/0x2d  <> worker_thread+0x14c/0x14e
 <> default_wake_function+0x0/0x15  <> kthread+0xd2/0xd6
 <> worker_thread+0x0/0x14e <> kthread+0x0/0xd6
 <> kernel_thread_helper+0x5/0xb

I have the "system password" set in the BIOS, which means that I am prompted for
a password on resume. When I disabled this, the LCD did not turn on after sleep
(even without speedstep_centrino), so I can't test without it.
Comment 18 Shaohua 2006-05-30 20:50:11 UTC
Could you please do the same thing as I mentioned in comment 8 & 10 and let me 
know the output?
Comment 19 Daniel De Graaf 2006-05-31 17:03:29 UTC
It's the same output as I got in #11. Any other files I should #define DEBUG in?

Suspending System Devices
Suspending type 'acpi':
 acpi0
Suspending type 'irqrouter':
 irqrouter0
Suspending type 'lapic':
 lapic0
Suspending type 'i8237':
 i82370
Suspending type 'i8259':
 i82590
Suspending type 'timer':
 timer0
Suspending type 'cpu0':
 cpu0
cpufreq-core: resuming cpu 0

--suspend, press button, enter password--

Back to C!
Resuming system devices:
Resuming 'cpu0':
 cpu0
cpufreq-core: resuming cpu 0
Resuming type 'timer':
 timer0
Resuming type 'i8259':
 i82590
Resuming type 'i8237':
 i82370
Resuming type 'lapic':
 lapic0
Resuming type 'irqrouter':
 irqrouter0
BUG: soft lockup detected on CPU#0!
 <> ...
Comment 20 Shaohua 2006-05-31 22:09:57 UTC
Hmm, still strange. can you apply below one line debug patch, and let me know 
the output.

static void __sysdev_resume(struct sys_device *dev)
{
	struct sysdev_class *cls = dev->cls;
	struct sysdev_driver *drv;
+printk("irq disabled %d\n", irqs_disabled());
	/* First, call the class-specific one */
	if (cls->resume)
		cls->resume(dev);
Comment 21 Daniel De Graaf 2006-06-01 17:13:09 UTC
Suspending System Devices
Suspending type 'acpi':
 acpi0
Suspending type 'irqrouter':
 irqrouter0
Suspending type 'lapic':
 lapic0
Suspending type 'i8237':
 i82370
Suspending type 'i8259':
 i82590
Suspending type 'timer':
 timer0
Suspending type 'cpu0':
 cpu0
cpufreq-core: resuming cpu 0

--suspend, press button, enter password--

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0
Back to C!
Resuming system devices:
Resuming 'cpu0':
 cpu0
irq disabled 1
cpufreq-core: resuming cpu 0
Resuming type 'timer':
 timer0
irq disabled 1
Resuming type 'i8259':
 i82590
irq disabled 1
Resuming type 'i8237':
 i82370
irq disabled 1
Resuming type 'lapic':
 lapic0
irq disabled 1
Resuming type 'irqrouter':
 irqrouter0
irq disabled 1
BUG: soft lockup detected on CPU#0!
 <> ...
Comment 22 Shaohua 2006-06-01 18:33:46 UTC
Can you also add
+printk("irq disabled %d\n", irqs_disabled());
in the last line of '__sysdev_resume'.

From the log, 
Resuming type 'irqrouter':
 irqrouter0
irq disabled 1
BUG: soft lockup detected on CPU#0!
 <> ...
It appears the oops is printed before type 'acpi' resume. It's sysdev, so it 
should be executed with interrupt disabled. But we saw a soft lockup here, so 
looks some drivers wrongly enabled interrupt. irqrouter is more likely the 
source.
Comment 23 Daniel De Graaf 2006-06-01 19:02:22 UTC
Resuming type 'i8237':
 i82370
irq disabled 1
irq disabled 1
Resuming type 'lapic':
 lapic0
irq disabled 1
irq disabled 1
Resuming type 'irqrouter':
 irqrouter0
irq disabled 1
BUG: soft lockup detected on CPU#0!
Comment 24 Shaohua 2006-06-01 19:37:44 UTC
>Resuming type 'irqrouter':
> irqrouter0
>irq disabled 1
>BUG: soft lockup detected on CPU#0!
Ok, so irqrouter wrongly enabled interrupt. Can you post your acpidmp output? 
I wonder if your acpi bios will do some memory read/write, which wrongly 
enables interrupt.
Comment 25 Daniel De Graaf 2006-06-01 19:46:40 UTC
Created attachment 8239 [details]
acpidump output
Comment 26 Shaohua 2006-07-18 18:36:12 UTC
Sorry for the long delay. Can you try a latest base kernle like 2.6.18-rc2 
please? I checked the acpi table, nothing insteresting.
Comment 27 Daniel De Graaf 2006-07-27 17:52:32 UTC
Tried with 2.6.18-rc2, and the system now locks up on resume (no crash dump).

Also, after reboot, the system clock is set to a random time.

Just before it suspends, I can see the text "ACPI Exception AE_NOT_FOUND"
After resume, the display contains the letters "Linu" in yellow. This also
occurs during a successful resume, but they only persist for a few seconds.
Comment 28 Shaohua 2006-07-27 18:11:15 UTC
So 2.6.18-rc2 is worse, can you attach your config used in 2.6.18-rc2?
Comment 29 Daniel De Graaf 2006-07-27 18:42:43 UTC
Created attachment 8627 [details]
2.6.18-rc2 configuration
Comment 30 Shaohua 2006-07-27 19:59:18 UTC
Can you enable NMI watchdog in the system? hopefully watchdog could print some 
more info when the resuem is dead.  I still couldn't reproduce the bug here.
Comment 31 Daniel De Graaf 2006-07-28 17:17:17 UTC
After enabling NMI in the kernel, the lockup no longer occurs. The value of
nmi_watchdog on the command line does not appear to matter (I tried absent, =1, =2)

However, the hardware clock is still being corrupted, and I can tell other
things are being overwritten - the console resolution changes after a resume
(expanded to full-screen instead of in a high-resolution box).

The config has added:
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_PCI_MSI is not set
Comment 32 Shaohua 2006-07-30 17:58:42 UTC
Is there any error message output with NMI watchdog enabled? If there is a 
dead lock, NMI watchdog will print some info. Better you could provide the 
dmesg.
Comment 33 Daniel De Graaf 2006-07-30 18:13:12 UTC
Created attachment 8648 [details]
dmesg output

Output of dmesg after suspend and resume. 

After resuming, hwclock gives the following error: The Hardware Clock registers
contain values that are either invalid (e.g. 50th day of month) or beyond the
range we can handle (e.g. Year 2095).

Resetting the hardware clock with --systohc works, and a reboot appears to
restore the other settings that are changed.
Comment 34 Shaohua 2006-08-09 17:56:55 UTC
>CONFIG_PM_TRACE=y
Ha, you enabled Linus's CMOS trace feature, which will corrupt CMOS. Please 
deselect it, and try again.
Comment 35 Daniel De Graaf 2006-08-09 19:01:14 UTC
Thanks - I selected option that too quickly without reading the description.
Suspend looks to be working correctly with the modules loaded using 2.6.18-rc4.

Note You need to log in before you can comment on or make changes to this bug.