Bug 7727
Summary: | spinlock CPU recursion | ||
---|---|---|---|
Product: | Other | Reporter: | support |
Component: | Other | Assignee: | David Howells (dhowells) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | klute, s |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.18-1.2239 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | Patch to fix the key serial number collision problem |
Description
support
2006-12-21 10:59:24 UTC
How often does this happen, and does it manifest itself in other ways? Can we discount the possibility of a simple RAM problem?

I would say that the spinlock recursion bug is almost certainly a red herring, as it occurs after the NULL-pointer dereference message. It will be due to the key management spinlock not having been released because the process that was holding it got killed, so CPU#1 still holds the lock.

This is where it looks to be:

    <__rb_rotate_left+0>:  mov 0x8(%rdi),%rcx
    <__rb_rotate_left+4>:  mov (%rdi),%r8
    <__rb_rotate_left+7>:  mov 0x10(%rcx),%rdx   <----
    <__rb_rotate_left+11>: and $0xfffffffffffffffc,%r8

So, it looks like:

    static void __rb_rotate_left(struct rb_node *node, struct rb_root *root)
    {
            struct rb_node *right = node->rb_right;
            struct rb_node *parent = rb_parent(node);

            if ((node->rb_right = right->rb_left))   <---

in the source code. %RDI contains the "node" argument on function entry. The "right" variable lurks in %RCX from the first insn. So "right" is zero (NULL) at the time of the crash, i.e. when the third instruction tries to dereference it to get "right->rb_left".

Can you enter this in the Red Hat bugzilla system?

Can you try running this script to see if this causes the fault to occur? It should exercise the creation and destruction of keys.

    #!/bin/bash
    NR_CPUS=2

    if [ "$1" = "go" ]
    then
        # Note: the inner loop reuses $i, so the outer bound of 500 is never
        # reached; this keeps running until a keyctl command fails or the
        # script is killed.
        for ((i=1; i<=500; i++))
        do
            for ((i=1; i<=42; i++))
            do
                keyctl add user a$i a @s >/dev/null || exit $?
            done
            keyctl clear @s || exit $?
        done
        exit 0
    fi

    # Start one worker per CPU, each in its own new session keyring.
    for ((i=1; i<=$NR_CPUS; i++))
    do
        keyctl session - $0 go &
    done
    wait

We will have to schedule some downtime for this machine, sometime in the evening, so that we don't disrupt its normal traffic, and we will post further information as we have it. This machine has seen this occur only once, and at that time it was under heavy load, which is why we feel confident that this is not a hardware issue.

This is being entered into the bugzilla.redhat system as well. Thanks.
I left that script running on my testbox overnight and it crashed too, though I managed to lose the original fault report (it was followed by a great many spinlock retake fault reports).

Okay... Found it: the key serial number collision avoidance code is wrong. This didn't use to be a problem, as the key serial numbers were allocated from a simple incrementing counter, and you'd have to go through 2 billion keys before encountering a collision. However, now that random numbers are used instead, collisions are much more likely.

Created attachment 10312 [details]
Patch to fix the key serial number collision problem
Merged a patch from David into -mm.

*** Bug 8067 has been marked as a duplicate of this bug. ***

*** Bug 7915 has been marked as a duplicate of this bug. ***