Bug 10753

Summary: BUG: soft lockup - CPU#0 stuck for 130s!
Product: Memory Management Reporter: Nicolas Mailhot (Nicolas.Mailhot)
Component: OtherAssignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, bunk
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25.3-18.fc9.x86_64 Subsystem:
Regression: --- Bisected commit-id:
Attachments: kernel log

Description Nicolas Mailhot 2008-05-19 14:55:00 UTC
Latest working kernel version: 2.6.25.2-5.fc10.x86_64 (but the bug may be triggered now by fedora-devel userspace changes)
Earliest failing kernel version: N/A one-time
Distribution: fedora-devel
Hardware Environment: AMD X2 on CK804 
Software Environment: totem-2.23.2-3.fc9.x86_64
Problem Description:

BUG: soft lockup - CPU#0 stuck for 130s! [totem:29381]
CPU 0:
Pid: 29381, comm: totem Tainted: G   M     2.6.25.3-18.fc9.x86_64 #1
RIP: 0010:[<ffffffff8103da1c>]  [<ffffffff8103da1c>] run_timer_softirq+0x17e/0x200
RSP: 0018:ffffffff814c1ea0  EFLAGS: 00000282
RAX: ffff81000e889fd8 RBX: ffffffff814c1f00 RCX: ffffffff8837125f
RDX: ffffffff814c1ec0 RSI: 0000000000007344 RDI: ffff81007e0d6a88
RBP: ffffffff814c1e20 R08: ffff81007e0d6ac0 R09: 0000000000000000
R10: 00002d6fdb26bbc9 R11: ffffffff814c1e20 R12: ffffffff8100cb16
R13: ffffffff814c1e20 R14: ffff81007e0d6a88 R15: ffff81007e0d6a00
FS:  00000000451e0950(0063) GS:ffffffff813f6000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f631a173000 CR3: 000000004175b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
<IRQ>  [<ffffffff81039520>] ? __do_softirq+0x66/0xd3
[<ffffffff8100d06c>] ? call_softirq+0x1c/0x28
[<ffffffff8100e210>] ? do_softirq+0x34/0x72
[<ffffffff81039304>] ? irq_exit+0x3f/0x80
[<ffffffff8101bf7a>] ? smp_apic_timer_interrupt+0x8a/0xa3
[<ffffffff8100cb16>] ? apic_timer_interrupt+0x66/0x70
<EOI>  [<ffffffff81292430>] ? _spin_unlock_irqrestore+0x8/0xa
[<ffffffff81080eae>] ? release_pages+0x16a/0x1de
[<ffffffff81091757>] ? free_pages_and_swap_cache+0x7b/0x9b
[<ffffffff8108785d>] ? zap_page_range+0xd7/0xf0
[<ffffffff81085f7a>] ? sys_madvise+0x3e6/0x4dc
[<ffffffff8106d66b>] ? audit_syscall_entry+0x126/0x15a
[<ffffffff8106d33c>] ? audit_syscall_exit+0x331/0x353
[<ffffffff81013073>] ? syscall_trace_enter+0xb5/0xb9
[<ffffffff8100c052>] ? tracesys+0xd5/0xda
Comment 1 Nicolas Mailhot 2008-05-19 15:04:06 UTC
Created attachment 16205 [details]
kernel log
Comment 2 Andrew Morton 2008-05-19 15:30:23 UTC
Impressive.  Did it _really_ take that long?  Because I have a sneaking suspicion
that the warning is errant - we see so many of them.

otoh, madvise _could_ take a long time if say the whole world was swapped out
or something.

Did the application eventually recover and work OK?
Comment 3 Nicolas Mailhot 2008-05-19 15:52:21 UTC
The logs are not cooked (ok I removed a few irrelevant userspace fedora-devel segfaults)

The application didn't recover. I rebooted the system. Totem was playing a 4G video file at the time, and its recovery powers are rather limited when it looses video sync. I didn't clock the 130s but they seem realistic (before rebooting I switched to the console and after a while I did notice X was getting unstuck)
Comment 4 Thomas Gleixner 2008-09-04 13:24:52 UTC
Nicolas, any update on this one ?
Comment 5 Nicolas Mailhot 2008-09-05 11:28:00 UTC
I seem not to hit it anymore with current fedora-devel kernels. But I've done no heavy video streaming lately
Comment 6 wbrana 2008-09-06 01:56:57 UTC
I have got similar lockup:

BUG: soft lockup - CPU#0 stuck for 177s! [ksoftirqd/0:4]
Modules linked in: nvidia(P)
CPU 0:
Modules linked in: nvidia(P)
Pid: 4, comm: ksoftirqd/0 Tainted: P          2.6.26.3 #1
RIP: 0010:[<ffffffff80242087>]  [<ffffffff80242087>] run_timer_softirq+0x106/0x20b
RSP: 0018:ffffffff8093cf28  EFLAGS: 00000216
RAX: ffffffff8093cf28 RBX: ffffffff809ca110 RCX: ffffffff8093cf28
RDX: ffffffff8091285c RSI: 0000000000000001 RDI: ffffffff809c9f00
RBP: ffffffff8093cea0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff81012fc7bfd8 R11: ffffffff8086cf40 R12: ffffffff8020ceb6
R13: ffffffff80968200 R14: ffffffff8093cea0 R15: ffffffff809c9f00
FS:  0000000000000000(0000) GS:ffffffff808ba000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007ff7ce112000 CR3: 0000000124452000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Call Trace:
<IRQ>  [<ffffffff8024205b>] ? run_timer_softirq+0xda/0x20b
[<ffffffff8023d412>] ? __do_softirq+0x6b/0xeb
[<ffffffff8020d40c>] ? call_softirq+0x1c/0x28
<EOI>  [<ffffffff8023d911>] ? ksoftirqd+0x0/0x9c
[<ffffffff8020fdfc>] ? do_softirq+0x4a/0x9a
[<ffffffff8023d911>] ? ksoftirqd+0x0/0x9c
[<ffffffff8023d952>] ? ksoftirqd+0x41/0x9c
[<ffffffff8024c9b1>] ? kthread+0x7a/0xae
[<ffffffff8020d098>] ? child_rip+0xa/0x12
[<ffffffff8024c937>] ? kthread+0x0/0xae
[<ffffffff8020d08e>] ? child_rip+0x0/0x12
Comment 7 Thomas Gleixner 2008-09-06 04:08:47 UTC
That's not a fedora problem. It's a bug in mainline :(

A fix for this is available against 2.6.27-rc5:
http://bugzilla.kernel.org/attachment.cgi?id=17644&action=view

Full details at:
http://bugzilla.kernel.org/show_bug.cgi?id=11418

It would be great if you could test the patch on your machine.
Comment 8 wbrana 2008-10-13 11:39:46 UTC
Bug is fixed for me since 2.6.26.6 and 2.6.27.