Created attachment 26778 [details] test program to reproduce bug I am able to consistently trigger an oops or hang when issuing a large number of concurrent read operations from different threads to a single volume containing multiple striped disks using the cfq scheduler. I am not sure when this problem first appeared, but I can reproduce it on 2.6.33.5, 2.6.34, and 2.6.35-rc3. It can be triggered on a regular file in a striped btrfs volume, a regular file in ext3 on top of md, or a raw md device. A test program is attached. The following steps show how to reproduce the problem using a raid0 md volume with no file system: <as root> gcc -Wall -O2 concurrent-read.c -o concurrent-read -lpthread mdadm --create -f /dev/md0 --level=raid0 --raid-devices=2 /dev/sdb1 /dev/sdc1 echo 3 > /proc/sys/vm/drop_caches ./concurrent-read /dev/md0 The "concurrent-read" test creates 32 detached threads, each of which independently opens and reads 16K from a unique position in the specified file. Each thread then spawns another detached thread to repeat the process before it exits. The test program is intended to run indefinitely (looping over the first 100 MiB of the file), but it crashes the machine within seconds in my case. Sometimes it triggers a hang, other times it triggers an oops. I believe that the oops has occurred in multiple places inside of cfq-iosched.c on different occasions. I will attach an example oops from my test machine. The problem goes away if I select the deadline scheduler rather than the cfq scheduler. The problem also does not occur on single disks as far as I can tell.
Created attachment 26781 [details] oops from 2.6.35-rc3 This is an example oops collected from an unmodified 2.6.35-rc3 kernel. The machine has two dual-core AMD x86-64 processors, and the disk drives used to reproduce were 80 gig Western Digital SATA drives.
Thanks for an excellent bug report, I'll take a look at this. Not sure what the cause is yet, but the symptoms are that this: if (!RB_EMPTY_NODE(&cfqq->rb_node)) { new_cfqq = 0; /* * same position, nothing more to do */ if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree) return; cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree); cfqq->service_tree = NULL; } in cfq_service_tree_add() has a NULL cfqq->service_tree, so it bombs when trying to deref that NULL rb root.
Jeff, we have a user-after-free in the queue merging. I ran the test case here locally, and it bombed here: /* Avoid a circular list and skip interim queue merges */ while ((__cfqq = new_cfqq->new_cfqq)) { if (__cfqq == cfqq) return; new_cfqq = __cfqq; } in cfq_setup_merge(). The disasm looks like this: 45d7: 48 8b 81 d8 00 00 00 mov 0xd8(%rcx),%rax 45de: 48 85 c0 test %rax,%rax 45e1: 0f 84 a1 02 00 00 je 4888 <cfq_dispatch_requests+0x868 and rax and rcx are both 0x6b6b etc. So that would seem to point to either a bug or a race in the referencing, since ->new_cfqq apparently goes away before it was supposed to.
FWIW, running it on a single drive also triggers the very same oops shortly. So no need to use raid0 for reproducing.
Thanks- I really appreciate you taking a look at the problem!
I'll look into this immediately.
A list operation race shouldn't be possible, since the operations dealing with setting up and disassembling the merge are protected by the queue lock. So, I think this is probably just a bug. The merge chain is kept as a singly linked list. So, let's say you have two cfqq's: cfqq1->new_cfqq = cfqq2; Now, cfqq2 goes away before cfqq1 has a chance to actually complete the merge. cfqq3 comes along and is a merge candidate for cfqq1. cfqq1->new_cfqq is present, so cfqq3 follows that to the now stale cfqq2 pointer. Then, it checks to see if cfqq2->new_cfqq is filled in and blows up as cfqq2 was freed (and hence has 0x6b6b6b6b...). That's my current theory. Feel free to poke holes in it. I'm trying to reproduce the problem here and will report back when I have some patches that pass my testing.
OK, I'll poke a hole in it. When cfqq1 setup a merge with cfqq2, it also took a reference on cfqq2 so it wouldn't go away.
One small poke I did was add a BUG_ON() for new_process_refs being 0, which would mean we do not grab any references to the cfqq. It triggers immediately. For triggering this, don't use PAGEALLOC but make sure SLAB_ALLOC is set. Reproduces trivially here, within 5-10 seconds on a single drive. Use a rotating drive.
Interesting. I wonder if we just shouldn't merge for that case either. process_refs = cfqq_process_refs(cfqq); /* * If the process for the cfqq has gone away, there is no * sense in merging the queues. */ if (process_refs == 0) return; I don't know why I didn't add the same check for the new_process_refs. It seems like it should be there. I'll test that and give it some more thought. Thanks for pointing that out, Jens!
It's not enough, it's still missing a reference, I did test that. If I add an artificial reference to the new_process_refs, it'll work but leak cfqq. That's as far as I got today. So that tells me that we're missing a ref somewhere, or losing one. I would advise you to re-evaluate the referencing and life time, on the merged list. I'll look into it tomorrow, if I have the time.
OK, the problem goes something like this: cfq_setup_merge: ffff88011f98bee8, ffff88011e053ac8: setup_merge: ffff88011f98bee8(1) -> ffff88011e053ac8(1) ==> ffff88011e053ac8 now has 2 process references cfq_setup_merge: ffff88011e053ac8, ffff88011f98bbd0: ffff880120c4e5a0 setup_merge: ffff880120c4e5a0(1) -> ffff88011e053ac8(2) ==> ffff88011e053ac8 (3 process references) cfq_exit_cfqq: ffff88011e053ac8 ==> ffff88011e053ac8 (2 process references) // cfq_put_cooperator: ffff88011e053ac8 does nothing cfq_exit_cfqq: ffff88011f98b8b8 cfq_put_cooperator: ffff88011f98b8b8 cfq_put_cooperator: dropping ref on ffff880120c4e498 cfq_put_queue: queue ffff880120c4e498 exiting cfq_put_queue: queue ffff88011f98b8b8 exiting cfq_exit_cfqq: ffff88011f98bee8 cfq_put_cooperator: ffff88011f98bee8 cfq_put_cooperator: dropping ref on ffff88011e053ac8 ==> valid, ffff88011e053ac8 (1) [...] cfqq new_cfqq new_cfqq->new_cfqq cfq_setup_merge: ffff88011f98b180, ffff880120c4e5a0: ffff88011e053ac8 ==> ffff88011e053ac8 never actually merged with ffff880120c4e5a0 ==> and nothing actually happens here, despite the fact that the queue of interest should have 1 process ref still, implying that ffff88011f98b180 does not have any process references, though I can't verify that from the logs. :( cfq_put_queue: queue ffff88011f98b180 exiting cfq_exit_cfqq: ffff880120c4e5a0 cfq_put_cooperator: ffff880120c4e5a0 cfq_put_cooperator: dropping ref on ffff88011e053ac8 ==> ffff88011f98b180 (0 process refs)!!!!! ==> this means that there is still outstanding I/O for this cfqq, but no process references left. cfq_put_queue: queue ffff88011e053ac8 exiting ==> OK, 3ac8 exited, but it is still referenced by e5a0! cfq_setup_merge: ffff88011e0539c0, ffff880120c4e5a0: ffff88011e053ac8 6b6b6b6b6b6b6b6b BUG: scheduling while atomic: swapper/0/0x10000100 So, the problem is that, when there are no process references left, that means that some of the queues in the ->new_cfqq chain could be completely gone, as we have dropped our reference to them. The fix is obvious: simply don't walk the chain if there are no process references. I'll attach the patch I've been testing shortly. Phil, I'll echo Jens' sentiment here. Thanks a ton for a great bug report. The reproducer really helped work this out much more quickly than would otherwise be possible! Would you mind placing some sort of license in the file? If it's OSS compatible, then I'll be able to include this in our regression test suite.
Created attachment 26821 [details] fix the crash As mentioned, we should not be following the ->new_cfqq links if the first cfqq does not have an active process reference. The first hunk of the patch is all that is required to fix the problem. The second hunk also makes sense to me.
Thanks Jeff, I have applied the patch and marked it for -stable backport. One thought, not sure if it's related but: https://bugzilla.redhat.com/show_bug.cgi?id=577968 is a funky CFQ bug, could perhaps be explained by errant memory like from this bug. May not be, but it would be worth while to double check at least.
Actually, Jeff, on second thought - could you mail me a full signed-off patch with description?
Yep, will do.
Created attachment 26829 [details] test program to reproduce bug, with license
> Phil, I'll echo Jens' sentiment here. Thanks a ton for a great bug report. > The reproducer really helped work this out much more quickly than would > otherwise be possible! Would you mind placing some sort of license in the > file? If it's OSS compatible, then I'll be able to include this in our > regression test suite. Thanks Jeff. I'm thrilled that you guys were able to track the problem down so quickly. Luckily the test case was pretty simple, because the original code that triggered it would have been a big headache to work with. I've uploaded a new copy of the test program with a license statement. I'll try out the patch that you posted to LKML on some extensive performance tests tonight. Thanks!
I confirmed that Jeff's patch fixes the problem for me. I just completed some extensive tests with the original application that triggered the problem. It ran for several hours without triggering any kernel errors. Thanks again for fixing the problem so quickly.