Bug 16217

Summary: cfq oops on concurrent reads to md device
Product: IO/Storage Reporter: Phil Carns (carns)
Component: Block LayerAssignee: Jeff Moyer (jmoyer)
Status: RESOLVED CODE_FIX    
Severity: high CC: alan, axboe, carns, jmoyer
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.35-rc3, 2.6.34, 2.6.33.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: test program to reproduce bug
oops from 2.6.35-rc3
fix the crash
test program to reproduce bug, with license

Description Phil Carns 2010-06-15 15:25:29 UTC
Created attachment 26778 [details]
test program to reproduce bug

I am able to consistently trigger an oops or hang when issuing a large number of concurrent read operations from different threads to a single volume containing multiple striped disks using the cfq scheduler.

I am not sure when this problem first appeared, but I can reproduce it on 2.6.33.5, 2.6.34, and 2.6.35-rc3.  It can be triggered on a regular file in a striped btrfs volume, a regular file in ext3 on top of md, or a raw md device.

A test program is attached.  The following steps show how to reproduce the problem using a raid0 md volume with no file system:

<as root>
gcc -Wall -O2 concurrent-read.c -o concurrent-read -lpthread
mdadm --create -f /dev/md0 --level=raid0 --raid-devices=2 /dev/sdb1 /dev/sdc1
echo 3 > /proc/sys/vm/drop_caches
./concurrent-read /dev/md0

The "concurrent-read" test creates 32 detached threads, each of which independently opens and reads 16K from a unique position in the specified file.  Each thread then spawns another detached thread to repeat the process before it exits.  The test program is intended to run indefinitely (looping over the first 100 MiB of the file), but it crashes the machine within seconds in my case.

Sometimes it triggers a hang, other times it triggers an oops.  I believe that the oops has occurred in multiple places inside of cfq-iosched.c on different occasions.  I will attach an example oops from my test machine.

The problem goes away if I select the deadline scheduler rather than the cfq scheduler.  The problem also does not occur on single disks as far as I can tell.
Comment 1 Phil Carns 2010-06-15 15:28:15 UTC
Created attachment 26781 [details]
oops from 2.6.35-rc3

This is an example oops collected from an unmodified 2.6.35-rc3 kernel.  The machine has two dual-core AMD x86-64 processors, and the disk drives used to reproduce were 80 gig Western Digital SATA drives.
Comment 2 Jens Axboe 2010-06-16 06:29:18 UTC
Thanks for an excellent bug report, I'll take a look at this. Not sure what the cause is yet, but the symptoms are that this:

        if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
                new_cfqq = 0;
                /*
                 * same position, nothing more to do
                 */
                if (rb_key == cfqq->rb_key &&
                    cfqq->service_tree == service_tree)
                        return;

                cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
                cfqq->service_tree = NULL;
        }

in cfq_service_tree_add() has a NULL cfqq->service_tree, so it bombs when trying to deref that NULL rb root.
Comment 3 Jens Axboe 2010-06-16 12:03:00 UTC
Jeff, we have a user-after-free in the queue merging. I ran the test case here locally, and it bombed here:

        /* Avoid a circular list and skip interim queue merges */
        while ((__cfqq = new_cfqq->new_cfqq)) {
                if (__cfqq == cfqq)
                        return;
                new_cfqq = __cfqq;
        }

in cfq_setup_merge(). The disasm looks like this:

    45d7:       48 8b 81 d8 00 00 00    mov    0xd8(%rcx),%rax
    45de:       48 85 c0                test   %rax,%rax
    45e1:       0f 84 a1 02 00 00       je     4888 <cfq_dispatch_requests+0x868

and rax and rcx are both 0x6b6b etc. So that would seem to point to either a bug or a race in the referencing, since ->new_cfqq apparently goes away before it was supposed to.
Comment 4 Jens Axboe 2010-06-16 12:18:03 UTC
FWIW, running it on a single drive also triggers the very same oops shortly. So no need to use raid0 for reproducing.
Comment 5 Phil Carns 2010-06-16 13:00:37 UTC
Thanks- I really appreciate you taking a look at the problem!
Comment 6 Jeff Moyer 2010-06-16 14:11:51 UTC
I'll look into this immediately.
Comment 7 Jeff Moyer 2010-06-16 16:01:12 UTC
A list operation race shouldn't be possible, since the operations dealing with setting up and disassembling the merge are protected by the queue lock.

So, I think this is probably just a bug.  The merge chain is kept as a singly linked list.  So, let's say you have two cfqq's:

cfqq1->new_cfqq = cfqq2;

Now, cfqq2 goes away before cfqq1 has a chance to actually complete the merge.  cfqq3 comes along and is a merge candidate for cfqq1.  cfqq1->new_cfqq is present, so cfqq3 follows that to the now stale cfqq2 pointer.  Then, it checks to see if cfqq2->new_cfqq is filled in and blows up as cfqq2 was freed (and hence has 0x6b6b6b6b...).

That's my current theory.  Feel free to poke holes in it.  I'm trying to reproduce the problem here and will report back when I have some patches that pass my testing.
Comment 8 Jeff Moyer 2010-06-16 16:52:21 UTC
OK, I'll poke a hole in it.  When cfqq1 setup a merge with cfqq2, it also took a reference on cfqq2 so it wouldn't go away.
Comment 9 Jens Axboe 2010-06-16 18:49:58 UTC
One small poke I did was add a BUG_ON() for new_process_refs being 0, which would mean we do not grab any references to the cfqq. It triggers immediately.

For triggering this, don't use PAGEALLOC but make sure SLAB_ALLOC is set. Reproduces trivially here, within 5-10 seconds on a single drive. Use a rotating drive.
Comment 10 Jeff Moyer 2010-06-16 19:09:51 UTC
Interesting.  I wonder if we just shouldn't merge for that case either.

	process_refs = cfqq_process_refs(cfqq);
	/*
	 * If the process for the cfqq has gone away, there is no
	 * sense in merging the queues.
	 */
	if (process_refs == 0)
		return;

I don't know why I didn't add the same check for the new_process_refs.  It seems like it should be there.  I'll test that and give it some more thought.

Thanks for pointing that out, Jens!
Comment 11 Jens Axboe 2010-06-16 19:23:46 UTC
It's not enough, it's still missing a reference, I did test that. If I add an artificial reference to the new_process_refs, it'll work but leak cfqq. That's as far as I got today. So that tells me that we're missing a ref somewhere, or losing one. I would advise you to re-evaluate the referencing and life time, on the merged list. I'll look into it tomorrow, if I have the time.
Comment 12 Jeff Moyer 2010-06-16 23:57:45 UTC
OK, the problem goes something like this:

cfq_setup_merge: ffff88011f98bee8, ffff88011e053ac8:	
setup_merge: ffff88011f98bee8(1) -> ffff88011e053ac8(1)

==> ffff88011e053ac8 now has 2 process references

cfq_setup_merge: ffff88011e053ac8, ffff88011f98bbd0:	ffff880120c4e5a0 
setup_merge: ffff880120c4e5a0(1) -> ffff88011e053ac8(2)

==> ffff88011e053ac8 (3 process references)

cfq_exit_cfqq: ffff88011e053ac8

==> ffff88011e053ac8 (2 process references)

// cfq_put_cooperator: ffff88011e053ac8  does nothing
cfq_exit_cfqq: ffff88011f98b8b8
cfq_put_cooperator: ffff88011f98b8b8
cfq_put_cooperator: dropping ref on ffff880120c4e498
cfq_put_queue: queue ffff880120c4e498 exiting
cfq_put_queue: queue ffff88011f98b8b8 exiting
cfq_exit_cfqq: ffff88011f98bee8

cfq_put_cooperator: ffff88011f98bee8
cfq_put_cooperator: dropping ref on ffff88011e053ac8

==> valid, ffff88011e053ac8 (1)

[...]

                      cfqq             new_cfqq         new_cfqq->new_cfqq
cfq_setup_merge: ffff88011f98b180, ffff880120c4e5a0:	ffff88011e053ac8 
==> ffff88011e053ac8 never actually merged with ffff880120c4e5a0
==> and nothing actually happens here, despite the fact that the queue
    of interest should have 1 process ref still, implying that
    ffff88011f98b180 does not have any process references, though I can't
    verify that from the logs.  :(

cfq_put_queue: queue ffff88011f98b180 exiting
cfq_exit_cfqq: ffff880120c4e5a0
cfq_put_cooperator: ffff880120c4e5a0
cfq_put_cooperator: dropping ref on ffff88011e053ac8
==> ffff88011f98b180 (0 process refs)!!!!!
==> this means that there is still outstanding I/O for this cfqq, but no
    process references left.

cfq_put_queue: queue ffff88011e053ac8 exiting
==> OK, 3ac8 exited, but it is still referenced by e5a0!

cfq_setup_merge: ffff88011e0539c0, ffff880120c4e5a0:	ffff88011e053ac8 6b6b6b6b6b6b6b6b 
BUG: scheduling while atomic: swapper/0/0x10000100

So, the problem is that, when there are no process references left, that means that some of the queues in the ->new_cfqq chain could be completely gone, as we have dropped our reference to them.  The fix is obvious: simply don't walk the chain if there are no process references.  I'll attach the patch I've been testing shortly.

Phil, I'll echo Jens' sentiment here.  Thanks a ton for a great bug report.  The reproducer really helped work this out much more quickly than would otherwise be possible!  Would you mind placing some sort of license in the file?  If it's OSS compatible, then I'll be able to include this in our regression test suite.
Comment 13 Jeff Moyer 2010-06-17 00:10:29 UTC
Created attachment 26821 [details]
fix the crash

As mentioned, we should not be following the ->new_cfqq links if the first cfqq does not have an active process reference.  The first hunk of the patch is all that is required to fix the problem.  The second hunk also makes sense to me.
Comment 14 Jens Axboe 2010-06-17 08:40:09 UTC
Thanks Jeff, I have applied the patch and marked it for -stable backport. One thought, not sure if it's related but:

https://bugzilla.redhat.com/show_bug.cgi?id=577968

is a funky CFQ bug, could perhaps be explained by errant memory like from this bug. May not be, but it would be worth while to double check at least.
Comment 15 Jens Axboe 2010-06-17 08:42:36 UTC
Actually, Jeff, on second thought - could you mail me a full signed-off patch with description?
Comment 16 Jeff Moyer 2010-06-17 13:11:07 UTC
Yep, will do.
Comment 17 Phil Carns 2010-06-17 18:01:51 UTC
Created attachment 26829 [details]
test program to reproduce bug, with license
Comment 18 Phil Carns 2010-06-17 18:09:27 UTC
> Phil, I'll echo Jens' sentiment here.  Thanks a ton for a great bug report. 
> The reproducer really helped work this out much more quickly than would
> otherwise be possible!  Would you mind placing some sort of license in the
> file?  If it's OSS compatible, then I'll be able to include this in our
> regression test suite.

Thanks Jeff.  I'm thrilled that you guys were able to track the problem down so quickly.  Luckily the test case was pretty simple, because the original code that triggered it would have been a big headache to work with.  I've uploaded a new copy of the test program with a license statement.

I'll try out the patch that you posted to LKML on some extensive performance tests tonight.

Thanks!
Comment 19 Phil Carns 2010-06-18 17:47:38 UTC
I confirmed that Jeff's patch fixes the problem for me.  I just completed some extensive tests with the original application that triggered the problem.  It ran for several hours without triggering any kernel errors.  Thanks again for fixing the problem so quickly.