Created attachment 44442 [details] Screenshot We have about 50 webservers. Most of them have kernel 2.6.32.23 (custom). As support for new hardware will be necessary on new servers we have to use more recent kernel on them. Trying 2.6.36.1 gave kernel panic after about a day of load (the same picture as in attachment for 2.6.36.3). I waited for patches and tryed 2.6.36.3 (as I didn't find anything related to this bug in 2.6.36.2 changelist) - result is the same. After couple of days we got kernel panic. I have nothing more except for screenshot made from KVM and kernel config. It seam that problem is inside scheduler or VFS, both were updated for 2.6.36. Picture and config in attachments.
Created attachment 44452 [details] Kernel config
hm, strange. hard to tell if the kernel was running sys_semctl() or sys_kill(). Probably sys_kill(). If you can generate more oops traces, please attach them to this report.
Created attachment 44592 [details] screenshot from 2.6.36.1 (very small resolution)
In 2.6.36.1 trace was the same. By the way, httpd (seen on both screenshots), does not use threads (it is multiprocess), and uses semaphores for interprocess communication.
I seem to have the same bug here in all Kernels from 2.6.35 up to at least 2.6.39.2 (read: the last one that does not crash is 2.6.34.*) I can reproduce them (reloading apache or just sending SIGUSR1 to it works often enough, depending on details in the apache configuration), so if there is any useful information I can provide I'll gladly do so. On 2.6.37.6 the error (unable to handle kernel paging request at 00000000801f0f1d) comes with this: [ 1456.519000] IP: [<ffffffff8102da7f>] task_rq_lock+0x34/0x79 [ 1456.530000] PGD 0 [ 1456.534000] Oops: 0000 [#1] SMP [ 1456.541000] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host0/uevent [ 1456.557000] CPU 1 [ 1456.561000] Pid: 15734, comm: apache2 Not tainted 2.6.37.6 #1 Dell Inc. PowerEdge R410/0N83VF [ 1456.578000] RIP: 0010:[<ffffffff8102da7f>] [<ffffffff8102da7f>] task_rq_lock+0x34/0x79 [ 1456.594000] RSP: 0018:ffff88031392ddd8 EFLAGS: 00010046 [ 1456.605000] RAX: 00000000801f0f05 RBX: 00000000000116c0 RCX: 000000000000000a [ 1456.619000] RDX: 0000000000000246 RSI: ffff88031392de20 RDI: 00007f27114da8f0 [ 1456.633000] RBP: ffff88031392ddf8 R08: ffff88030b6e0060 R09: dead000000200200 [ 1456.648000] R10: dead000000100100 R11: dead000000200200 R12: ffff88031392de20 [ 1456.662000] R13: 00007f27114da8f0 R14: 00000000000116c0 R15: 0000000000000001 [ 1456.676000] FS: 00007f2711f80740(0000) GS:ffff8800cf040000(0000) knlGS:0000000000000000 [ 1456.692000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1456.704000] CR2: 00000000801f0f1d CR3: 000000030b6b6000 CR4: 00000000000006e0 [ 1456.718000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1456.732000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1456.746000] Process apache2 (pid: 15734, threadinfo ffff88031392c000, task ffff88030c350000) [ 1456.763000] Stack: [ 1456.767000] 00007f27114da8f0 ffffffff81a3e260 0000000000000000 000000000000000f [ 1456.782000] ffff88031392de58 ffffffff81032fdf 0000000000000000 0000000081a3e290 [ 1456.797000] ffffffff81a2b000 0000000000000246 ffff88031392de58 ffff88030b6e0010 [ 1456.811000] Call Trace: [ 1456.816000] [<ffffffff81032fdf>] try_to_wake_up+0x2f/0x21f [ 1456.827000] [<ffffffff810331fb>] wake_up_process+0x10/0x12 [ 1456.839000] [<ffffffff8123cb41>] freeary+0x1a9/0x1de [ 1456.849000] [<ffffffff8123ce5f>] T.580+0x80/0xb9 [ 1456.858000] [<ffffffff810d1701>] ? fput+0x1c5/0x1d4 [ 1456.868000] [<ffffffff8123d5ce>] sys_semctl+0x74/0x81 [ 1456.878000] [<ffffffff81002a12>] system_call_fastpath+0x16/0x1b [ 1456.890000] Code: 89 fd 41 54 49 89 f4 53 48 c7 c3 c0 16 01 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 14 24 49 8b 45 08 49 89 de <8b> 40 18 4c 03 34 c5 50 b6 a9 81 4c 89 f7 e8 0c ad 55 00 49 8b [ 1456.928000] RIP [<ffffffff8102da7f>] task_rq_lock+0x34/0x79 [ 1456.940000] RSP <ffff88031392ddd8> [ 1456.947000] CR2: 00000000801f0f1d task_rq_lock+0x34 is mov 0x18(%rax),%eax So if I understand this correctly the kernel tries to copy from 0x801F0F1D to EAX and cannot access 0x801F0F1D. I have no core-file of the kernel (yet).
The call trace looks slightly different when compiled with O1 (O0 seems impossible with all versions of gcc). BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [<ffffffff81033c06>] task_rq_lock+0x44/0x99 PGD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host0/scsi_host/host0/proc_name CPU 5 Pid: 4491, comm: apache2 Not tainted 2.6.37.6_obj_O1 #6 Dell Inc. PowerEdge R410/0N83VF RIP: 0010:[<ffffffff81033c06>] [<ffffffff81033c06>] task_rq_lock+0x44/0x99 RSP: 0018:ffff880318d49d18 EFLAGS: 00010086 RAX: 0000000000000286 RBX: 0000000000000000 RCX: 02000000000040c1 RDX: 0000000000000286 RSI: ffff880318d49d80 RDI: 0000000000000000 RBP: ffff880318d49d58 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000213 R12: 00000000000118c0 R13: ffff880318d49e28 R14: ffffffff81c9f660 R15: ffff880318d49d80 FS: 00007f330bf68740(0000) GS:ffff8800cf140000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000318c1d000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process apache2 (pid: 4491, threadinfo ffff880318d48000, task ffff880323309000) Stack: ffff880318d86b48 ffff8800cf1118c0 ffff880323144660 0000000000000000 0000000000000001 ffff880318d49e28 0000000000000000 000000000000000f ffff880318d49db8 ffffffff8103a8eb ffffffff81c40350 0000000521f7fc60 Call Trace: [<ffffffff8103a8eb>] try_to_wake_up+0x39/0x34c [<ffffffff8103ac30>] wake_up_process+0x13/0x15 [<ffffffff8128e5cb>] wake_up_sem_queue_do+0x24/0x46 [<ffffffff8128fbdd>] freeary+0x14d/0x172 [<ffffffff8128fcdd>] semctl_down+0xdb/0x12c [<ffffffff8104fc65>] ? sys_kill+0x90/0x1b0 [<ffffffff8128fdbf>] sys_semctl+0x91/0x9e [<ffffffff81002c12>] system_call_fastpath+0x16/0x1b Code: 89 7d f8 48 89 fb 49 89 f7 49 c7 c4 c0 18 01 00 49 c7 c6 60 f6 c9 81 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 17 <48> 8b 43 08 8b 40 18 4d 89 e5 4d 03 2c c6 4c 89 6d c8 4c 89 e f RIP [<ffffffff81033c06>] task_rq_lock+0x44/0x99 RSP <ffff880318d49d18> CR2: 0000000000000008 The problem: a NULL pointer as "task_struct p" in try_to_wake_up in line: rq = task_rq(p); in ASM: 48 8b 43 08 mov 0x8(%rbx),%rax 8b 40 18 mov 0x18(%rax),%eax First mov fails because RBX contains 0x0. RBX should contain *p. Following the trace I end up at wake_up_sem_queue_do in ipc/sem.c. At some point in time q->sleeper must be 0 here. Adding if(q->sleeper!=0) {...} else {printk(...)} shows that it is really 0 here sometimes. And as expected with broken data structures: it crashes later, elsewhere. With numcpus=1 (normal: 2 quad-core xeon) the bug cannot be triggered anymore. Any relevant changes in kernel-locking between 2.6.34 and 2.6.35?
Created attachment 66152 [details] patch that fixes the bug
Thanks for the detailed bug report. This is the race: task 1: processes IPC_RMID task 2: sleeps in semtimedop(), gets a signal. 1: acquires sem_lock() 2 semtimedop() returns from schedule() due to the signal 2 the first get_queue_result() returns -EINTR. 2: tries to acquire sem_lock(), sleeps 1: freeary puts all tasks into a list and properly marks them as IN_WAKEUP 1: freeary calls sem_rmid(), which calls ipc_rmid(), which sets ipcp->deleted 1: freeary releases sem_lock(); 2: tries to acquire sem_lock, sem_lock() fails because ipcp->deleted is set. 2: semtimedop() returns immediately, without checking for IN_WAKEUP <<<<<<<<<<<< This is the bug. get_queue_result must be called even if sem_lock() fails. 2: semtimedop() returns, the stack is reused (q->sleeper is stored on the stack) 1: freeary calls wake_up_sem_queue_do(), which accesses q->sleeper --> bug. The fix is simple: even if sem_lock() fails, get_queue_result() must be called. The bad thing is that this is the second bug moving the spinlock :-( Here is the (incomplete) initial fix: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c61284e99191b2284fb74dae6961d4d09e4e59e8 I'll upload my test app as well.
Created attachment 66162 [details] Test case: app that triggers the race Test case for the bug: - without the patch, it crashes the kernel quickly, e.g. with ./timedrm 500. Tested on a quad-core AMD. - with the patch, it works.
A patch referencing this bug report has been merged in Linux v3.1-rc1: commit d694ad62bf539dbb20a0899ac2a954555f9e4a83 Author: Manfred Spraul <manfred@colorfullife.com> Date: Mon Jul 25 17:11:47 2011 -0700 ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMID