Bug 27142 - Kernel panic on highly loaded webserver in task_rq_lock
Kernel panic on highly loaded webserver in task_rq_lock
Status: CLOSED CODE_FIX
Product: Process Management
Classification: Unclassified
Component: Scheduler
All Linux
: P1 blocking
Assigned To: Ingo Molnar
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-01-20 12:26 UTC by Yuriy Yevtukhov
Modified: 2012-06-13 15:01 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.36.3
Tree: Mainline
Regression: Yes


Attachments
Screenshot (214.79 KB, image/jpeg)
2011-01-20 12:26 UTC, Yuriy Yevtukhov
Details
Kernel config (70.09 KB, text/plain)
2011-01-20 12:29 UTC, Yuriy Yevtukhov
Details
screenshot from 2.6.36.1 (very small resolution) (68.61 KB, image/png)
2011-01-21 08:50 UTC, Yuriy Yevtukhov
Details
patch that fixes the bug (1.64 KB, patch)
2011-07-21 06:15 UTC, Manfred Spraul
Details | Diff
Test case: app that triggers the race (3.12 KB, text/x-c++src)
2011-07-21 06:19 UTC, Manfred Spraul
Details

Description Yuriy Yevtukhov 2011-01-20 12:26:40 UTC
Created attachment 44442 [details]
Screenshot

We have about 50 webservers. Most of them have kernel 2.6.32.23 (custom). As support for new hardware will be necessary on new servers we have to use more recent kernel on them. Trying 2.6.36.1 gave kernel panic after about a day of load (the same picture as in attachment for 2.6.36.3). I waited for patches and tryed 2.6.36.3 (as I didn't find anything related to this bug in 2.6.36.2 changelist) - result is the same. After couple of days we got kernel panic. I have nothing more except for screenshot made from KVM and kernel config.
It seam that problem is inside scheduler or VFS, both were updated for 2.6.36.
Picture and config in attachments.
Comment 1 Yuriy Yevtukhov 2011-01-20 12:29:30 UTC
Created attachment 44452 [details]
Kernel config
Comment 2 Andrew Morton 2011-01-20 23:41:57 UTC
hm, strange.   hard to tell if the kernel was running sys_semctl() or sys_kill().  Probably sys_kill().  If you can generate more oops traces, please attach them to this report.
Comment 3 Yuriy Yevtukhov 2011-01-21 08:50:05 UTC
Created attachment 44592 [details]
screenshot from 2.6.36.1 (very small resolution)
Comment 4 Yuriy Yevtukhov 2011-01-21 09:12:37 UTC
In 2.6.36.1 trace was the same.
By the way, httpd (seen on both screenshots), does not use threads (it is multiprocess), and uses semaphores for interprocess communication.
Comment 5 Harald Laabs 2011-07-15 15:37:16 UTC
I seem to have the same bug here in all Kernels from 2.6.35 up to at least 2.6.39.2 (read: the last one that does not crash is 2.6.34.*)

I can reproduce them (reloading apache or just sending SIGUSR1 to it works often enough, depending on details in the apache configuration), so if there is any useful information I can provide I'll gladly do so.

On 2.6.37.6 the error (unable to handle kernel paging request at 00000000801f0f1d) comes with this:

[ 1456.519000] IP: [<ffffffff8102da7f>] task_rq_lock+0x34/0x79
[ 1456.530000] PGD 0 
[ 1456.534000] Oops: 0000 [#1] SMP
[ 1456.541000] last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host0/uevent
[ 1456.557000] CPU 1
[ 1456.561000] Pid: 15734, comm: apache2 Not tainted 2.6.37.6 #1 Dell Inc. PowerEdge R410/0N83VF
[ 1456.578000] RIP: 0010:[<ffffffff8102da7f>]  [<ffffffff8102da7f>] task_rq_lock+0x34/0x79
[ 1456.594000] RSP: 0018:ffff88031392ddd8  EFLAGS: 00010046
[ 1456.605000] RAX: 00000000801f0f05 RBX: 00000000000116c0 RCX: 000000000000000a
[ 1456.619000] RDX: 0000000000000246 RSI: ffff88031392de20 RDI: 00007f27114da8f0
[ 1456.633000] RBP: ffff88031392ddf8 R08: ffff88030b6e0060 R09: dead000000200200
[ 1456.648000] R10: dead000000100100 R11: dead000000200200 R12: ffff88031392de20
[ 1456.662000] R13: 00007f27114da8f0 R14: 00000000000116c0 R15: 0000000000000001
[ 1456.676000] FS:  00007f2711f80740(0000) GS:ffff8800cf040000(0000) knlGS:0000000000000000
[ 1456.692000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 1456.704000] CR2: 00000000801f0f1d CR3: 000000030b6b6000 CR4: 00000000000006e0
[ 1456.718000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1456.732000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1456.746000] Process apache2 (pid: 15734, threadinfo ffff88031392c000, task ffff88030c350000)
[ 1456.763000] Stack:
[ 1456.767000]  00007f27114da8f0 ffffffff81a3e260 0000000000000000 000000000000000f
[ 1456.782000]  ffff88031392de58 ffffffff81032fdf 0000000000000000 0000000081a3e290
[ 1456.797000]  ffffffff81a2b000 0000000000000246 ffff88031392de58 ffff88030b6e0010
[ 1456.811000] Call Trace:
[ 1456.816000]  [<ffffffff81032fdf>] try_to_wake_up+0x2f/0x21f
[ 1456.827000]  [<ffffffff810331fb>] wake_up_process+0x10/0x12
[ 1456.839000]  [<ffffffff8123cb41>] freeary+0x1a9/0x1de  
[ 1456.849000]  [<ffffffff8123ce5f>] T.580+0x80/0xb9
[ 1456.858000]  [<ffffffff810d1701>] ? fput+0x1c5/0x1d4
[ 1456.868000]  [<ffffffff8123d5ce>] sys_semctl+0x74/0x81 
[ 1456.878000]  [<ffffffff81002a12>] system_call_fastpath+0x16/0x1b
[ 1456.890000] Code: 89 fd 41 54 49 89 f4 53 48 c7 c3 c0 16 01 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 49 89 14 24 49 8b 45 08 49 89 de <8b> 40 18 4c 03 34 c5 50 b6 a9 81 4c 89 f7 e8 0c ad 55 00 49 8b
[ 1456.928000] RIP  [<ffffffff8102da7f>] task_rq_lock+0x34/0x79
[ 1456.940000]  RSP <ffff88031392ddd8>
[ 1456.947000] CR2: 00000000801f0f1d

task_rq_lock+0x34 is

mov    0x18(%rax),%eax

So if I understand this correctly the kernel tries to copy from 0x801F0F1D to EAX and cannot access 0x801F0F1D. I have no core-file of the kernel (yet).
Comment 6 Harald Laabs 2011-07-18 21:44:37 UTC
The call trace looks slightly different when compiled with O1 (O0 seems impossible with all versions of gcc).

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff81033c06>] task_rq_lock+0x44/0x99
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/host0/scsi_host/host0/proc_name
CPU 5
Pid: 4491, comm: apache2 Not tainted 2.6.37.6_obj_O1 #6 Dell Inc. PowerEdge R410/0N83VF
RIP: 0010:[<ffffffff81033c06>]  [<ffffffff81033c06>] task_rq_lock+0x44/0x99
RSP: 0018:ffff880318d49d18  EFLAGS: 00010086
RAX: 0000000000000286 RBX: 0000000000000000 RCX: 02000000000040c1
RDX: 0000000000000286 RSI: ffff880318d49d80 RDI: 0000000000000000
RBP: ffff880318d49d58 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000213 R12: 00000000000118c0
R13: ffff880318d49e28 R14: ffffffff81c9f660 R15: ffff880318d49d80
FS:  00007f330bf68740(0000) GS:ffff8800cf140000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000318c1d000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process apache2 (pid: 4491, threadinfo ffff880318d48000, task ffff880323309000)
Stack:
 ffff880318d86b48 ffff8800cf1118c0 ffff880323144660 0000000000000000
 0000000000000001 ffff880318d49e28 0000000000000000 000000000000000f
 ffff880318d49db8 ffffffff8103a8eb ffffffff81c40350 0000000521f7fc60
Call Trace:
 [<ffffffff8103a8eb>] try_to_wake_up+0x39/0x34c
 [<ffffffff8103ac30>] wake_up_process+0x13/0x15
 [<ffffffff8128e5cb>] wake_up_sem_queue_do+0x24/0x46
 [<ffffffff8128fbdd>] freeary+0x14d/0x172
 [<ffffffff8128fcdd>] semctl_down+0xdb/0x12c
 [<ffffffff8104fc65>] ? sys_kill+0x90/0x1b0
 [<ffffffff8128fdbf>] sys_semctl+0x91/0x9e
 [<ffffffff81002c12>] system_call_fastpath+0x16/0x1b
Code: 89 7d f8 48 89 fb 49 89 f7 49 c7 c4 c0 18 01 00 49 c7 c6 60 f6 c9 81 9c 58 0f 1f 44 00 00 48
89 c2 fa 66 0f 1f 44 00 00 49 89 17 <48> 8b 43 08 8b 40 18 4d 89 e5 4d 03 2c c6 4c 89 6d c8 4c 89 e
f
RIP  [<ffffffff81033c06>] task_rq_lock+0x44/0x99
 RSP <ffff880318d49d18>
CR2: 0000000000000008

The problem: a NULL pointer as "task_struct p" in try_to_wake_up in line:
rq = task_rq(p);
in ASM:
48 8b 43 08             mov    0x8(%rbx),%rax
8b 40 18                mov    0x18(%rax),%eax

First mov fails because RBX contains 0x0. RBX should contain *p.

Following the trace I end up at wake_up_sem_queue_do in ipc/sem.c. At some point in time q->sleeper must be 0 here. Adding
if(q->sleeper!=0) {...} else {printk(...)}
shows that it is really 0 here sometimes. And as expected with broken data structures: it crashes later, elsewhere.

With numcpus=1 (normal: 2 quad-core xeon) the bug cannot be triggered anymore. Any relevant changes in kernel-locking between 2.6.34 and 2.6.35?
Comment 7 Manfred Spraul 2011-07-21 06:15:59 UTC
Created attachment 66152 [details]
patch that fixes the bug
Comment 8 Manfred Spraul 2011-07-21 06:17:21 UTC
Thanks for the detailed bug report.

This is the race:
task 1: processes IPC_RMID
task 2: sleeps in semtimedop(), gets a signal.

1: acquires sem_lock()
2 semtimedop() returns from schedule() due to the signal
2 the first get_queue_result() returns -EINTR.
2: tries to acquire sem_lock(), sleeps
1: freeary puts all tasks into a list and properly marks them as IN_WAKEUP
1: freeary calls sem_rmid(), which calls ipc_rmid(), which sets ipcp->deleted
1: freeary releases sem_lock();
2: tries to acquire sem_lock, sem_lock() fails because ipcp->deleted is set.
2: semtimedop() returns immediately, without checking for IN_WAKEUP
<<<<<<<<<<<< This is the bug. get_queue_result must be called even if sem_lock() fails.
2: semtimedop() returns, the stack is reused (q->sleeper is stored on the stack)
1: freeary calls wake_up_sem_queue_do(), which accesses q->sleeper
 --> bug.

The fix is simple: even if sem_lock() fails, get_queue_result() must be called.

The bad thing is that this is the second bug moving the spinlock :-(
Here is the (incomplete) initial fix:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c61284e99191b2284fb74dae6961d4d09e4e59e8

I'll upload my test app as well.
Comment 9 Manfred Spraul 2011-07-21 06:19:43 UTC
Created attachment 66162 [details]
Test case: app that triggers the race

Test case for the bug:
- without the patch, it crashes the kernel quickly, e.g. with ./timedrm 500.
  Tested on a quad-core AMD.

- with the patch, it works.
Comment 10 Florian Mickler 2011-08-08 08:10:18 UTC
A patch referencing this bug report has been merged in Linux v3.1-rc1:

commit d694ad62bf539dbb20a0899ac2a954555f9e4a83
Author: Manfred Spraul <manfred@colorfullife.com>
Date:   Mon Jul 25 17:11:47 2011 -0700

    ipc/sem.c: fix race with concurrent semtimedop() timeouts and IPC_RMID

Note You need to log in before you can comment on or make changes to this bug.