Most recent kernel where this bug did *NOT* occur: 2.6.14 Distribution:Fedora Core 5 Hardware Environment:pogolinux PW1465A Dual opteron server with mirrored 160 Gig SATA drives Software Environment:Fedora Core 5; Linux Kernel 2.16.18-1.2239 for x86_64; Apache 2.2.3; php 5.1 Problem Description: We would like to report an error we received from one of our web servers. We are hesitantly suggesting that this is a software issue: we have an identical machine which has not exhibited this error. A line from the error log seems to provide some insight: Dec 12 10:13:01 clio kernel: <0>BUG: spinlock cpu recursion on CPU#1, suexec/27413 (Not tainted) the complete text of the error log is as follows: Dec 12 10:13:00 clio kernel: Unable to handle kernel NULL pointer dereference at 0000000000000010 RIP: Dec 12 10:13:00 clio kernel: [<ffffffff80225942>] __rb_rotate_left+0x7/0x5b Dec 12 10:13:00 clio kernel: PGD 3a828067 PUD 3d934067 PMD 0 Dec 12 10:13:00 clio kernel: Oops: 0000 [1] SMP Dec 12 10:13:00 clio kernel: last sysfs file: /block/hdb/size Dec 12 10:13:00 clio kernel: CPU 1 Dec 12 10:13:00 clio kernel: Modules linked in: ipv6 nfs lockd fscache nfs_acl rfcomm l2cap bluetooth sunrpc dm_mirror dm_mod video sbs i2c_ec i2c_core button battery asu s_acpi ac lp parport_pc parport sg tg3 ide_cd cdrom shpchp k8_edac edac_mc ohci_hcd serio_raw floppy ehci_hcd pcspkr raid1 ext3 jbd sata_svw libata sd_mod scsi_mod Dec 12 10:13:00 clio kernel: Pid: 27406, comm: suexec Not tainted 2.6.18-1.2239.fc5 #1 Dec 12 10:13:00 clio kernel: RIP: 0010:[<ffffffff80225942>] [<ffffffff80225942>] __rb_rotate_left +0x7/0x5b Dec 12 10:13:00 clio kernel: RSP: 0018:ffff810151397df0 EFLAGS: 00010282 Dec 12 10:13:00 clio kernel: RAX: ffff81005a1ded48 RBX: ffff810102505508 RCX: 0000000000000000 Dec 12 10:13:00 clio kernel: RDX: 0000000000000000 RSI: ffffffff806de5e0 RDI: ffff810203166088 Dec 12 10:13:00 clio kernel: RBP: ffff810203166088 R08: ffff8102031668c8 R09: 0000000000000000 Dec 12 10:13:00 clio kernel: R10: 000000005e4ae5f3 R11: ffff810151397c70 R12: ffff810102505508 Dec 12 10:13:00 clio kernel: R13: ffff81005a1ded48 R14: ffffffff806de5e0 R15: 0000000000000026 Dec 12 10:13:00 clio kernel: FS: 00002aaaaaabb850(0000) GS:ffff810103c3b1c0(0000) knlGS: 00000000f7fee8d0 Dec 12 10:13:00 clio kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b Dec 12 10:13:00 clio kernel: CR2: 0000000000000010 CR3: 00000000da38b000 CR4: 00000000000006e0 Dec 12 10:13:00 clio kernel: Process suexec (pid: 27406, threadinfo ffff810151396000, task ffff8101d7cf5080) Dec 12 10:13:00 clio kernel: Stack: ffffffff80212aff ffff81005a1ded40 ffff810102505518 0000000000000000 Dec 12 10:13:00 clio kernel: ffff81005a1ded40 ffff810151397eb8 ffffffff80312779 0000000046f0a978 Dec 12 10:13:00 clio kernel: 0000000000000000 1f3f0000aa8adfff ffff8101d7cf5080 000003eaffffffff Dec 12 10:13:00 clio kernel: Call Trace: Dec 12 10:13:00 clio kernel: [<ffffffff80212aff>] rb_insert_color+0xb2/0xda Dec 12 10:13:00 clio kernel: [<ffffffff80312779>] key_alloc+0x2b0/0x384 Dec 12 10:13:00 clio kernel: [<ffffffff8031377b>] keyring_alloc+0x29/0x5f Dec 12 10:13:00 clio kernel: [<ffffffff80314ea2>] alloc_uid_keyring+0x3d/0xa6 Dec 12 10:13:00 clio kernel: [<ffffffff80293a5c>] alloc_uid+0xa9/0x16f Dec 12 10:13:00 clio kernel: [<ffffffff802963d6>] set_user+0xf/0x97 Dec 12 10:13:00 clio kernel: [<ffffffff80297b5c>] sys_setuid+0x7d/0x154 Dec 12 10:13:00 clio kernel: [<ffffffff8025c00e>] system_call+0x7e/0x83 Dec 12 10:13:00 clio kernel: DWARF2 unwinder stuck at system_call+0x7e/0x83 Dec 12 10:13:00 clio kernel: Leftover inexact backtrace: Dec 12 10:13:00 clio kernel: Dec 12 10:13:00 clio kernel: Dec 12 10:13:00 clio kernel: Code: 48 8b 51 10 49 83 e0 fc 48 85 d2 48 89 57 08 74 0c 48 8b 02 Dec 12 10:13:00 clio kernel: RIP [<ffffffff80225942>] __rb_rotate_left+0x7/0x5b Dec 12 10:13:00 clio kernel: RSP <ffff810151397df0> Dec 12 10:13:00 clio kernel: CR2: 0000000000000010 Dec 12 10:13:01 clio kernel: <0>BUG: spinlock cpu recursion on CPU#1, suexec/27413 (Not tainted) Dec 12 10:13:01 clio kernel: lock: ffffffff80566200, .magic: dead4ead, .owner: hazel.cgi/ 27408, .owner_cpu: 1 Dec 12 10:13:01 clio kernel: Dec 12 10:13:01 clio kernel: Call Trace: Dec 12 10:13:01 clio kernel: [<ffffffff802691d9>] show_trace+0x34/0x47 Dec 12 10:13:01 clio kernel: [<ffffffff802691fe>] dump_stack+0x12/0x17 Dec 12 10:13:01 clio kernel: [<ffffffff80207749>] _raw_spin_lock+0x5d/0xf3 Dec 12 10:13:01 clio kernel: [<ffffffff803126dd>] key_alloc+0x214/0x384 Dec 12 10:13:01 clio kernel: [<ffffffff8031377b>] keyring_alloc+0x29/0x5f Dec 12 10:13:01 clio kernel: [<ffffffff80314ea2>] alloc_uid_keyring+0x3d/0xa6 Dec 12 10:13:01 clio kernel: [<ffffffff80293a5c>] alloc_uid+0xa9/0x16f Dec 12 10:13:01 clio kernel: [<ffffffff802963d6>] set_user+0xf/0x97 Dec 12 10:13:01 clio kernel: [<ffffffff80297b5c>] sys_setuid+0x7d/0x154 Dec 12 10:13:01 clio kernel: [<ffffffff8025c00e>] system_call+0x7e/0x83 Dec 12 10:13:01 clio kernel: DWARF2 unwinder stuck at system_call+0x7e/0x83 Dec 12 10:13:01 clio kernel: Leftover inexact backtrace: Dec 12 10:13:01 clio kernel: Steps to reproduce:
How often does this happen, and does it manifest itself in other ways? Can we discount the possibility of a simple RAM problem?
I would say that the spinlock recursion bug is almost certainly a dead herring as it occurs after the NULL-pointer deref message occurs. It'll be due to the key management spinlock not having been released because the process that was holding it got killed, and so CPU#1 still holds the lock.
This is where is looks to be: <__rb_rotate_left+0>: mov 0x8(%rdi),%rcx <__rb_rotate_left+4>: mov (%rdi),%r8 <__rb_rotate_left+7>: mov 0x10(%rcx),%rdx <---- <__rb_rotate_left+11>: and $0xfffffffffffffffc,%r8 So, it looks like: static void __rb_rotate_left(struct rb_node *node, struct rb_root *root) { struct rb_node *right = node->rb_right; struct rb_node *parent = rb_parent(node); if ((node->rb_right = right->rb_left)) <--- in the source code. %RDI contains the "node" argument on function entry. The "right" variable lurks in %RCX from the first insn. So "right" is zero at the time of the crash - ie: when the third instruction tries to dereference it to get "right->rb_left".
Can you enter this in the Red Hat bugzilla system?
Can you try running this script to see if this causes the fault to occur? It should exercise the creation and destruction of keys. #!/bin/bash NR_CPUS=2 if [ "$1" = "go" ] then for ((i=1; i<=500; i++)) do for ((i=1; i<=42; i++)) do keyctl add user a$i a @s >/dev/null || exit $? done keyctl clear @s || exit $? done exit 0 fi for ((i=1; i<=$NR_CPUS; i++)) do keyctl session - $0 go & done wait
We will have to schedule some downtime for this machine so we don't disrupt its normal traffic - sometime in the evening - and post further information as we have it. This machine has only seen this occur the one time - and at that time it was under heavy load - which is why we feel confident that this is not a hardware issue. This is being entered into the bugzilla.redhat system as well.
Thanks. I left that script running on my testbox overnight and it crashed too, though I managed to lose the original fault report (it was followed up by many many spinlock retake fault reports).
Okay... Found it: the key serial number collision avoidance code is wrong. This didn't use to be a problem as the key serial numbers were allocated from a simple incremented counter, and you'd have to go through 2 billion keys before encountering a collision. However, now that random numbers are used instead, collisions are much more likely.
Created attachment 10312 [details] Patch to fix the key serial number collision problem
merged a patch from David into -mm.
*** Bug 8067 has been marked as a duplicate of this bug. ***
*** Bug 7915 has been marked as a duplicate of this bug. ***