Bug 8463

Summary: NOHZ: BUG: soft lockup detected on CPU#0!
Product: Process Management Reporter: Ralf Hildebrandt (ralf.hildebrandt)
Component: Other Assignee: Thomas Gleixner (tglx)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, bunk, fmarier, gregor+debian, jbreker, mingo, nuada, protasnb, tglx
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.21.1 Subsystem:
Regression: --- Bisected commit-id:

Description Ralf Hildebrandt 2007-05-10 06:16:30 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.20.x
Distribution: Debian/testing
Hardware Environment: Dual Xeon with SMP kernel
Software Environment:
Problem Description: dmesg reports

NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
NOHZ: local_softirq_pending 02
Clocksource tsc unstable (delta = 4686825172 ns)
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
NOHZ: local_softirq_pending 08
BUG: soft lockup detected on CPU#0!
 [<c013c69a>] softlockup_tick+0x90/0xb5
 [<c0123564>] update_process_times+0x28/0x5e
 [<c01328fe>] tick_sched_timer+0x48/0x9a
 [<c012eff1>] hrtimer_interrupt+0x13f/0x1c5
 [<c010e75d>] smp_apic_timer_interrupt+0x55/0x85
 [<c011461c>] __wake_up_common+0x39/0x59
 [<c01048d0>] apic_timer_interrupt+0x28/0x30
 [<c02db6d0>] rt_check_expire+0xf8/0x160
 [<c02db5d8>] rt_check_expire+0x0/0x160
 [<c01227b7>] run_timer_softirq+0x11e/0x17a
 [<c011ece9>] it_real_fn+0x0/0x17
 [<c011ecfb>] it_real_fn+0x12/0x17
 [<c011f8c2>] __do_softirq+0x74/0xd9
 [<c011f794>] ksoftirqd+0x0/0xba
 [<c0106624>] do_softirq+0x5f/0xa8
 [<c011f807>] ksoftirqd+0x73/0xba
 [<c012bd02>] kthread+0xae/0xd3
 [<c012bc54>] kthread+0x0/0xd3
 [<c0104a53>] kernel_thread_helper+0x7/0x14
 =======================
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
NOHZ: local_softirq_pending 22
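
For readers of the log above: the value after local_softirq_pending is a bitmask, printed in hexadecimal, of softirqs still pending when the tick was about to be stopped. The sketch below decodes the three values that appear here (02, 08, 22). The bit-to-name table is an assumption based on the softirq enum order of 2.6.21-era kernels and is not taken from this report; later kernels add further entries.

```python
# Assumed softirq bit order for 2.6.21-era kernels (bit 0 first).
# This table is an assumption for illustration; later kernels extend it.
SOFTIRQ_NAMES = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "TASKLET"]


def decode_softirq_pending(mask_hex):
    """Return the names of the softirq bits set in a hex mask such as '22'."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Fall back to a generic label for bits outside the assumed table.
            if bit < len(SOFTIRQ_NAMES):
                names.append(SOFTIRQ_NAMES[bit])
            else:
                names.append("bit%d" % bit)
        mask >>= 1
        bit += 1
    return names


if __name__ == "__main__":
    for value in ("02", "08", "22"):
        print(value, decode_softirq_pending(value))
```

Under that assumption, 02 corresponds to the timer softirq, 08 to network receive, and 22 to timer plus tasklet.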

Steps to reproduce:
Comment 1 Piotr Radkowski 2007-05-21 05:27:23 UTC
Got something similar on 2.6.21.1
Distribution: Gentoo 2007.0
Hardware: Centrino Duo, Intel T2250 @1.73GHz

dmesg:
...
Clocksource tsc unstable (delta = 3040409348024 ns)
...
BUG: soft lockup detected on CPU#1!
 [<c01461c2>] softlockup_tick+0x90/0xbf
 [<c012b977>] update_process_times+0x28/0x5e
 [<c013a4c0>] tick_periodic+0x22/0x71
 [<c013a526>] tick_handle_periodic+0x17/0x71
 [<c01159be>] smp_apic_timer_interrupt+0x4f/0x7f
 [<c022f1a2>] acpi_hw_register_write+0x11b/0x14b
 [<c01049fc>] apic_timer_interrupt+0x28/0x30
 [<c02417b8>] acpi_processor_idle+0x20f/0x3d3
 [<c0102386>] cpu_idle+0x84/0xdb
 =======================
...
Comment 2 gregor herrmann 2007-05-31 09:04:28 UTC
I got a similar problem with 2.6.21.1 (and 2.6.22-rc3)

Most recent kernel where this bug did *NOT* occur: 2.6.20.x
Distribution: Debian/unstable
Hardware Environment: Thinkpad R60e, CPU: Intel(R) Celeron(R) M CPU 420 @ 1.60GHz

Description: Every other minute the machine (at least keyboard and mouse) freezes.

From kern.log (with 2.6.21.1; 2.6.22-rc3 doesn't log anything along these
lines):

May 26 13:26:42 nerys kernel: Linux version 2.6.21-1-686 (Debian 2.6.21-3) (waldi@debian.org) (gcc version 4.1.3 20070518 (prerelease) (Debian 4.1.2-8)) #1 SMP Fri May 25 13:06:47 UTC 2007
[..]
May 26 13:26:42 nerys kernel: Kernel command line: root=/dev/mapper/crypt-root ro vga=791
[..]
May 26 13:26:42 nerys kernel: Clocksource tsc unstable (delta = -292814672 ns)
[..]
May 26 13:30:38 nerys kernel: BUG: soft lockup detected on CPU#0!
May 26 13:30:38 nerys kernel:  [<c014aad3>] softlockup_tick+0xa6/0xb5
May 26 13:30:38 nerys kernel:  [<c012a05b>] update_process_times+0x3b/0x5e
May 26 13:30:38 nerys kernel:  [<c0138d60>] tick_sched_timer+0x78/0xbb
May 26 13:30:38 nerys kernel:  [<c01358e0>] hrtimer_interrupt+0x131/0x1bd
May 26 13:30:38 nerys kernel:  [<c0138ce8>] tick_sched_timer+0x0/0xbb
May 26 13:30:38 nerys kernel:  [<c0114bbd>] smp_apic_timer_interrupt+0x6c/0x7d
May 26 13:30:38 nerys kernel:  [<c01f7e1a>] acpi_hw_register_write+0x11b/0x14b
May 26 13:30:38 nerys kernel:  [<c010481c>] apic_timer_interrupt+0x28/0x30
May 26 13:30:38 nerys kernel:  [<e0040967>] acpi_processor_idle+0x235/0x40a [pro
May 26 13:30:38 nerys kernel:  [<c01023b5>] cpu_idle+0xb5/0xd6
May 26 13:30:38 nerys kernel:  [<c0345a6b>] start_kernel+0x475/0x47d
May 26 13:30:38 nerys kernel:  [<c03451b8>] unknown_bootoption+0x0/0x202
May 26 13:30:38 nerys kernel:  =======================

Cf. also http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=426738
Comment 3 gregor herrmann 2007-06-04 14:40:06 UTC
Booting with clocksource=acpi_pm seems to circumvent my problem (I got the idea
from Bug 8582).
Comment 4 gregor herrmann 2007-06-10 05:59:30 UTC
Short update:
clocksource=acpi_pm kills the resume after a suspend2ram.
clocksource=pit works (i.e. no soft lockups and working resume).

The situation is still the same with 2.6.22-rc4.
Comment 5 Thomas Gleixner 2007-06-10 06:27:43 UTC
Hmm, seems the TSC is buggy. But acpi_pm should work.

Can you check whether 2.6.22-rc4-mm2 works for you with acpi_pm?
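
To compare the "Clocksource tsc unstable" messages quoted in this report, it can help to convert the deltas from nanoseconds to seconds. The following is a minimal sketch; the function name and regular expression are illustrative and only match the message format as it appears in the logs above.

```python
import re

# Matches the message format seen in the logs quoted in this report.
DELTA_RE = re.compile(r"Clocksource tsc unstable \(delta = (-?\d+) ns\)")


def tsc_delta_seconds(line):
    """Return the reported delta in seconds, or None if the line does not match."""
    match = DELTA_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1e9


if __name__ == "__main__":
    samples = [
        "Clocksource tsc unstable (delta = 4686825172 ns)",
        "Clocksource tsc unstable (delta = 3040409348024 ns)",
        "May 26 13:26:42 nerys kernel: Clocksource tsc unstable (delta = -292814672 ns)",
    ]
    for sample in samples:
        print("%.3f s" % tsc_delta_seconds(sample))
```

The three reports then read as roughly 4.7 s, 3040 s (about 50 minutes) and -0.3 s.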
Comment 6 gregor herrmann 2007-06-10 11:16:26 UTC
Sounds like a nice challenge for a Sunday afternoon (I've never used any kernel
patches before) :-)

Ok, here's what I've done:
* downloaded and unpacked linux-2.6.22-rc4.tar.bz2
* downloaded and uncompressed 2.6.22-rc4-mm2.bz2
* patched the former with the latter
* ran "make oldconfig" against the config of the latest Debian 2.6.22-rc4-686
kernel image
* built the kernel (with Debian's kernel-package; shouldn't change the kernel
but makes un/installing easier)

Results:
* booting without any clocksource= parameter:
  - according to /sys/devices/system/clocksource/clocksource0/current_clocksource, tsc is used
  - no freezes/lockups
  - suspend to ram (with uswsusp) and resume work
  - only odd thing: powertop says "< CPU was 100% busy; no C-states were entered >"

* booting with clocksource=acpi_pm:
  - the same: no lockups, resume works, same powertop output

* booting with clocksource=pit:
  - the same

I'll happily send further information or do other tests if you have any questions!
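
For anyone repeating these tests, the clocksource in use can be read from the sysfs file mentioned above. The following is a small sketch; the helper name is illustrative, and the sibling file available_clocksource is not mentioned in this report but is assumed to sit in the same directory, so the code tolerates its absence.

```python
from pathlib import Path

# Directory taken from the path quoted in this report.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available) clocksources found under sysfs_dir."""
    sysfs_dir = Path(sysfs_dir)
    current = (sysfs_dir / "current_clocksource").read_text().strip()
    # available_clocksource is assumed to exist next to current_clocksource;
    # return an empty list if it does not.
    available_file = sysfs_dir / "available_clocksource"
    available = available_file.read_text().split() if available_file.exists() else []
    return current, available


if __name__ == "__main__":
    if SYSFS_DIR.exists():
        current, available = read_clocksources()
        print("current:", current)
        print("available:", " ".join(available))
    else:
        print("no clocksource sysfs directory on this system")
```

Running it once after each boot (with and without a clocksource= parameter) gives a record of which clocksource each test actually used.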

Comment 7 Andrew Morton 2007-07-27 16:35:37 UTC
Gregor, did we fix all this in 2.6.22?

Thanks.
Comment 8 gregor herrmann 2007-07-27 18:01:13 UTC
Thanks for coming back to this issue.

I'm now running the Debian kernel 2.6.22-1-686 (2.6.22-2) which is based on the 2.6.22.1 release.

* If I boot without any clocksource= parameter I get hpet (according to /sys/devices/system/clocksource/clocksource0/current_clocksource). The system freezes every other minute, nothing is logged to /var/log/kern.log, and resume after suspend to RAM does not work.

* clocksource=tsc: the laptop doesn't get very far in booting, it hangs somewhere between detecting USB hubs and detecting the SATA controller (tried three times).

* clocksource=pit: no lockups, resume after s2ram works, no strange powertop outputs anymore

* clocksource=acpi_pm: no lockups, resume after s2ram doesn't work

* clocksource=jiffies: boots, no lockups on the console but X doesn't come up?! (tried twice). Then weird stuff with suspend to ram/disk happened :-/

If you need any information/logs/output or want me to test something specific just tell me!
Comment 9 Thomas Gleixner 2007-11-14 14:45:48 UTC
Gregor,

any news on this?
Comment 10 gregor herrmann 2007-11-16 06:35:18 UTC
Thanks for reminding me of this issue.

And I have good news:
I just installed a 2.6.23 kernel (the package linux-image-2.6.23-1-686, version 2.6.23-1~experimental.1~snapshot.9723 from the Debian kernel team's repository) and I don't see any problems anymore (booting without a clocksource parameter and getting hpet). Great.

As far as I'm concerned I think this bug can be closed.

Thanks for your perseverance!
Comment 11 Thomas Gleixner 2007-11-16 14:56:13 UTC
Gregor,

Thanks. I'm closing it.

    tglx