Most recent kernel where this bug did not occur: From testing with liveCDs, around 2.6.12. Distribution: Debian Hardware Environment: Dell Latitude D510 Software Environment: reproduced in KDE/Xorg7 and from console Problem Description: When the speedstep_centrino module is loaded, suspend will produce a lockup. Steps to reproduce: 1) modprobe speedstep_centrino 2) echo mem > /sys/power/state 3) press power key to resume and lockup I've tried this with and without the cpufreq_userspace module loaded. Since suspend seems to disable the serial console, I've copied all of this manually. Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0 BUG: soft lockup detected on CPU#0! Pid: 4, comm: events/0 EIP: 0060:[<c01373c7>] CPU: 0 EIP is at handle_IRQ_event+0x18/0x69 EFLAGS: 00000246 Not tainted (2.6.16.11+z4 #1) EAX: 0000000b EBX: f7bbfcc0 ECX: f7bbfcc0 EDX: c1907ee8 ESI: 00000000 EDI: c1907ee8 EBP: c1907ee8 DS: 007b ES: 007b CR0: 8005003b CR2: 80050033 CR3: 37030000 CR4: 000006d0 [<c01df6de>] vgacon_switch+0x77/0xcc [<c0102e14>] apic_timer_interrupt+0x1c/0x24 [<c01374e0>] __do_IRQ+0x58/0xa8 [<c01048ef>] do_IRQ+0x19/0x2a [<c0102df2>] common_interrupt+0x1a/0x20 [<c033d019>] schedule+0x302/0x5b1 [<c01265ae>] worker_thread+0x14c/0x14e [<c01265ae>] worker_thread+0x14c/0x14e [<c0115394>] default_wake_function+0x0/0x15 [<c01291eb>] kthread+0xd2/0xd6 [<c0126462>] worker_thread+0x0/0x14e [<c0129119>] kthread+0x0/0xd6 [<c0101005>] kernel_thread_helper+0x5/0xb Kernel config is at http://daniel.6dns.org/misc/config-2.6.16+z4 if needed
1. Intel machine check reporting enabled on CPU#0 2. BUG: soft lockup detected on CPU#0! Did you see other message between the output 1 & 2 above? Say xx device resumed or something. if the output 1 is just followed by output 2, this is quite strange, 'events/0 ' is stoped in current stage.
The time that I recorded, there was no output between them - that is exactly what was displayed on the screen. Other times that it has locked up, the backtrace has been longer and it scrolled the EIP lines off, so I couldn't see if anything was happening. The output " Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0 " will be printed and it then hangs for ~30 sec before printing the backtrace. Normally (without the cpufreq) it takes at most 5 seconds before printing a bunch of device resumed messages and pulling up X. The first 4 lines of the backtrace are always the same except for the address in [<>]. If you need the output of other backtraces, I can type them up.
Ok, can you enable 'CONFIG_CPU_FREQ_DEBUG' and with boot option 'cpufreq.debug=7' and tell me the output in resume? That helps us root cause the issue. Thanks!
Rebooted with that command line (my kernel already had the debug option selected). It goes in the "kernel" line in grub, correct? There is no output before the backtrace is printed (after the Intel machine check reporting enabled on CPU#0), and for the 1ms as it's scrolling up, I don't see any more output. Most of the backtraces have been filling the whole screen, so I'm not certain of no output. The EIP value has moved a bit, most recent one was at yenta_interrupt+0x1d/0xc4 [yenta_socket] - but still with pid 4, events/0 Before the power down, I see several lines like this: ACPI: PCI Interrupt for device 00**** disabled. Any suggestions for getting this output to the serial console? Booting with console=ttyS0,9600 works before the suspend, but not after or during. The last output is "Stopping tasks: ===========================================|"
No, the serial driver isn't resumed in current stage, so there is no way to get the output. Any change you can test the latest kernel like 2.6.17-rc2?
Will test 2.6.17-rc3 tomorrow. I'm compiling using the debian make-kpkg, if that makes any difference.
Just tried 2.6.17-rc3. I'm getting a different location of the error, but still no information from cupfreq.debug. Here's the new backtrace (I omitted the addresses in <>) Intel machine check reporting enabled on CPU#0 BUG: soft lockup detected on CPU#0! <c011f7e3> update_process_times+0x2b/0x78 <> timer_interrupt+0x3d/0x74 <> handle_IRQ_event+0x39/0x69 <> __do_IRQ+0x58/0xa8 <> do_IRQ+0x19/0x2e <> common_interrupt+0x1a/0x20 <> handle_IRQ_event+0x18/0x69 <> __do_IRQ+0x58/0xa8 <> debug_mutex_add_waiter+0x24/0x44 <> do_IRQ+0x19/0x2e <> common_interrupt+0x1a/0x20 <> schedule+0x2bc/0x5a8 <> worker_thread+0x14c/0x14e <> linkwatch_event+0x0/0x33 <> worker_thread+0x14c/0x14e <> default_wake_function+0x0/0x15 <> kthread+0xd2/0xd6 <> worker_thread+0x0/0x14e <> kthread+0x0/0xd6 <> kernel_thread_helper+0x5/0xb
Can you add below line in the first line (befor any '#include <xxx>') of drivers/base/sys.c? #define DEBUG A guess is some sysdev driver wrongly enabled enable interrupt in their resume method, I'd like to know which one the cause. cpufreq driver is the best candidate, but my check doesn't find any violation, so I'd like to know more.
Identical backtrace from the kernel I compiled without DEBUG. Any other files you would suggest adding debugging output to?
Maybe your log level is too high. please do echo 8 > /proc/sys/kernel/printk before suspend. I guess you will see more info at resume. please try.
Yes, this gave me much more output. Messages on suspend: Suspending System Devices Suspending type 'acpi': acpi0 Suspending type 'irqrouter': irqrouter0 ---continues in reverse order of below--- -----------------------------------------Suspend Back to C! Resuming system devices: Resuming 'cpu0': cpu0 cpufreq-core: resuming cpu 0 Resuming type 'timer': timer0 Resuming type 'i8259': i82590 Resuming type 'i8237': i82370 Resuming type 'lapic': lapic0 Resuming type 'irqrouter': irqrouter0 BUG: soft lockup detected on CPU#0! ---same backtrace as before---
please try the patch I just sent out at http://marc.theaimsgroup.com/?l=linux-kernel&m=114636764413762&w=2
tried the patch; no change
Ok, please also apply: http://marc.theaimsgroup.com/?l=linux-kernel&m=114636752911313&w=2 both patches should be applied.
No change with both patches applied. I did notice something I found interesting - before turning of the screen, this is displayed: Suspending type 'cpu': cpu0 cpufreq-core: resuming cpu 0 Why is that "resuming" and not "suspending"?
There is a typo in the code, it should output 'suspend' instead of 'resume'. That's strange, the two patches are supposed to fix your problem. Is there any difference of the output with the two patches?
Tried with 2.6.17-rc5, including the above patches (the second one was already applied). Same output, with a slightly different backtrace: Intel machine check reporting enabled on CPU#0 BUG: soft lockup detected on CPU#0! <> update_process_times+0x2b/0x78 <> timer_interrupt+0x3d/0x74 <> handle_IRQ_event+0x39/0x69 <> __do_IRQ+0x58/0xa8 <> do_IRQ+0x19/0x2e <> common_interrupt+0x1a/0x20 <> yenta_interrupt+0x1d/0xc4 [yenta_socket] <> handle_IRQ_event+0x39/0x69 <> __do_IRQ+0x58/0xa8 <> do_IRQ+0x19/0x2e <> common_interrupt+0x1a/0x20 <> handle_IRQ_event+0x18/0x69 <> __do_IRQ+0x58/0xa8 <> debug_mutex_add_waiter+0x24/0x44 <> do_IRQ+0x19/0x2e <> common_interrupt+0x1a/0x20 <> schedule+0x2bc/0x5a8 <> worker_thread+0x14c/0x14e <> linkwatch_event+0x0/0x2d <> worker_thread+0x14c/0x14e <> default_wake_function+0x0/0x15 <> kthread+0xd2/0xd6 <> worker_thread+0x0/0x14e <> kthread+0x0/0xd6 <> kernel_thread_helper+0x5/0xb I have the "system password" set in the BIOS, which means that I am prompted for a password on resume. When I disabled this, the LCD did not turn on after sleep (even without speedstep_centrino), so I can't test without it.
Could you please do the same thing as I mentioned in comment 8 & 10 and let me know the output?
It's the same output as I got in #11. Any other files I should #define DEBUG in? Suspending System Devices Suspending type 'acpi': acpi0 Suspending type 'irqrouter': irqrouter0 Suspending type 'lapic': lapic0 Suspending type 'i8237': i82370 Suspending type 'i8259': i82590 Suspending type 'timer': timer0 Suspending type 'cpu0': cpu0 cpufreq-core: resuming cpu 0 --suspend, press button, enter password-- Back to C! Resuming system devices: Resuming 'cpu0': cpu0 cpufreq-core: resuming cpu 0 Resuming type 'timer': timer0 Resuming type 'i8259': i82590 Resuming type 'i8237': i82370 Resuming type 'lapic': lapic0 Resuming type 'irqrouter': irqrouter0 BUG: soft lockup detected on CPU#0! <> ...
Hmm, still strange. can you apply below one line debug patch, and let me know the output. static void __sysdev_resume(struct sys_device *dev) { struct sysdev_class *cls = dev->cls; struct sysdev_driver *drv; +printk("irq disabled %d\n", irqs_disabled()); /* First, call the class-specific one */ if (cls->resume) cls->resume(dev);
Suspending System Devices Suspending type 'acpi': acpi0 Suspending type 'irqrouter': irqrouter0 Suspending type 'lapic': lapic0 Suspending type 'i8237': i82370 Suspending type 'i8259': i82590 Suspending type 'timer': timer0 Suspending type 'cpu0': cpu0 cpufreq-core: resuming cpu 0 --suspend, press button, enter password-- Intel machine check architecture supported. Intel machine check reporting enabled on CPU#0 Back to C! Resuming system devices: Resuming 'cpu0': cpu0 irq disabled 1 cpufreq-core: resuming cpu 0 Resuming type 'timer': timer0 irq disabled 1 Resuming type 'i8259': i82590 irq disabled 1 Resuming type 'i8237': i82370 irq disabled 1 Resuming type 'lapic': lapic0 irq disabled 1 Resuming type 'irqrouter': irqrouter0 irq disabled 1 BUG: soft lockup detected on CPU#0! <> ...
Can you also add +printk("irq disabled %d\n", irqs_disabled()); in the last line of '__sysdev_resume'. From the log, Resuming type 'irqrouter': irqrouter0 irq disabled 1 BUG: soft lockup detected on CPU#0! <> ... It appears the oops is printed before type 'acpi' resume. It's sysdev, so it should be executed with interrupt disabled. But we saw a soft lockup here, so looks some drivers wrongly enabled interrupt. irqrouter is more likely the source.
Resuming type 'i8237': i82370 irq disabled 1 irq disabled 1 Resuming type 'lapic': lapic0 irq disabled 1 irq disabled 1 Resuming type 'irqrouter': irqrouter0 irq disabled 1 BUG: soft lockup detected on CPU#0!
>Resuming type 'irqrouter': > irqrouter0 >irq disabled 1 >BUG: soft lockup detected on CPU#0! Ok, so irqrouter wrongly enabled interrupt. Can you post your acpidmp output? I wonder if your acpi bios will do some memory read/write, which wrongly enables interrupt.
Created attachment 8239 [details] acpidump output
Sorry for the long delay. Can you try a latest base kernle like 2.6.18-rc2 please? I checked the acpi table, nothing insteresting.
Tried with 2.6.18-rc2, and the system now locks up on resume (no crash dump). Also, after reboot, the system clock is set to a random time. Just before it suspends, I can see the text "ACPI Exception AE_NOT_FOUND" After resume, the display contains the letters "Linu" in yellow. This also occurs during a successful resume, but they only persist for a few seconds.
So 2.6.18-rc2 is worse, can you attach your config used in 2.6.18-rc2?
Created attachment 8627 [details] 2.6.18-rc2 configuration
Can you enable NMI watchdog in the system? hopefully watchdog could print some more info when the resuem is dead. I still couldn't reproduce the bug here.
After enabling NMI in the kernel, the lockup no longer occurs. The value of nmi_watchdog on the command line does not appear to matter (I tried absent, =1, =2) However, the hardware clock is still being corrupted, and I can tell other things are being overwritten - the console resolution changes after a resume (expanded to full-screen instead of in a high-resolution box). The config has added: CONFIG_X86_UP_IOAPIC=y CONFIG_X86_IO_APIC=y # CONFIG_PCI_MSI is not set
Is there any error message output with NMI watchdog enabled? If there is a dead lock, NMI watchdog will print some info. Better you could provide the dmesg.
Created attachment 8648 [details] dmesg output Output of dmesg after suspend and resume. After resuming, hwclock gives the following error: The Hardware Clock registers contain values that are either invalid (e.g. 50th day of month) or beyond the range we can handle (e.g. Year 2095). Resetting the hardware clock with --systohc works, and a reboot appears to restore the other settings that are changed.
>CONFIG_PM_TRACE=y Ha, you enabled Linus's CMOS trace feature, which will corrupt CMOS. Please deselect it, and try again.
Thanks - I selected option that too quickly without reading the description. Suspend looks to be working correctly with the modules loaded using 2.6.18-rc4.