Bug 41862

Summary: Crash with traceback, maybe software raid or luks related
Product: IO/Storage Reporter: Bruno Wolff III (bruno)
Component: MDAssignee: Neil Brown (neilb)
Status: RESOLVED CODE_FIX    
Severity: normal CC: florian, neilb
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.1.0-0.rc3.git5.0.fc17.i686.PAE Subsystem:
Regression: Yes Bisected commit-id:
Attachments: gdb disassembly output
Traceback captured via netconsole in a script session.
md_trim_bio disassembly output
Corrected md_trim_bio
patch to remove use-after-free bug in RAID1

Description Bruno Wolff III 2011-08-28 13:08:53 UTC
When running 3.1 kernels my machines stay up less than a day, but often stay up for a number of hours before crashing. I haven't noticed a particular trigger. The issue may have been different with earlier 3.1 kernels than rc3.git5. This morning I noticed that two machines both had to rebuild a software raid 1 device when I rebooted them. (The second machine was running 3.1.0-0.rc3.git5.1.fc16.i686.PAE, but I didn't get a picture of its traceback.)
Comment 1 Bruno Wolff III 2011-08-30 20:38:23 UTC
I have now seen this with the 3.1.0-0.rc4.git0.0.fc16.x86_64 kernel.
Comment 2 Neil Brown 2011-08-30 23:15:44 UTC
You haven't given an really details on what the problem is.
Without details (preferably lots of details) I cannot even begin to guess what is wrong.
Comment 3 Bruno Wolff III 2011-08-31 02:19:11 UTC
I don't have a lot of details. The system will run for a while (often multiple hours) and then crash with the traceback I took a picture of appearing on the screen instead of the normal desktop GUI. There is no warning with degraded performance or the like.

There may be some correlation with disk activity, but it hasn't happened enough to be sure.

Typically the machines will crash over night if they don't during the day.

I don't have a reliable way to trigger the crash.

I am not seeing the problem at all with 3.0 kernels.

I'll switch back to a 3.1 kernel and collect some system details that might point out some differences between my systems and other peoples. The three systems I have seen this with have significantly different hardware, but all are running luks encrypted devices on top of MD raid 1 devices. The traceback referred to symbols that suggest that MD raid or/and luks/dmcrypt may be involved.
Comment 4 Neil Brown 2011-08-31 02:26:22 UTC
You say you took a picture of a traceback.  Could you attach that picture?
Comment 5 Bruno Wolff III 2011-08-31 03:07:58 UTC
I tried to add the picture at bug creation, but it looks like it was too large and I didn't double check to make sure it was attached.
I just put it up on a web page at:
http://wolff.to/bruno/DSCN1433.JPG
Comment 6 Neil Brown 2011-08-31 06:40:23 UTC
Thanks for that.
It does look like a bug in md/raid1.
However this looks like the second oops message.  An earlier one would
be the important one, but while Linux was trying to clean up after that oops another oops hit and scrolled the first one off the screen.
I don't suppose you can scroll back?  Probably not.

The bug seems to happen at make_request+0x59b/088bc in the raid1 module.

If you still have exactly that raid1.ko, could you please

 gdb raid1.ko
 disassemble make_request

and collect all the output and post it here?

Thanks.
Comment 7 Bruno Wolff III 2011-08-31 13:47:19 UTC
Created attachment 71032 [details]
gdb disassembly output
Comment 8 Neil Brown 2011-09-01 04:06:21 UTC
Thanks.
That location in make_request is where it calls md_trim_bio.
md_trim_bio appears to have failed at 0x2f which is very early.  It must have been de-referencing 'bio'.  Which seems to imply that bio_clone-mddev returned NULL.  But that is not possible when GFP_NOIO is passed.

So I'm stumped.

If you could set up netconsole to send the kernel messages over the network to another machine, then we could get a full trace which might help.

Setting up netconsole is fairly straight forward.

Documentation/networking/netconsole.txt

describes it.  Just an extra kernel parameter, and run 'nc' on some other machine capturing the output.
Comment 9 Bruno Wolff III 2011-09-01 04:27:38 UTC
I was reading about netconsole earlier today and was wondering if it would help in this case.
I am getting pretty tired tonight, so probably won't get to it until tomorrow and Thursdays I end up pretty busy. So it may be Friday before I a report back. If I get can things set up in the morning, then I might get something back to you sooner.
Comment 10 Bruno Wolff III 2011-09-01 15:44:53 UTC
Created attachment 71132 [details]
Traceback captured via netconsole in a script session.

It took a bit more futzing around this morning than I would have liked, but I got it setup before I left, and it crashed by the time I got to work. The traceback looks like it is complete, so hopefully has what you need.
This is with the 3.1.0-0.rc4.git0.0.fc17.i686.PAE kernel running at the time of the crash.
Comment 11 Neil Brown 2011-09-01 21:46:44 UTC
Thanks a lot .
It is definitely a case of bio_clone_mddev returning NULL, and there are no prior messages about a memory allocation failing.
Weird.

I'm going to have to read a lot of code before I find out how this can happen.

I'll probably want to send you a patch to run with which will print out some more debugging information - I hope you can work with that...

Thanks.
Comment 12 Neil Brown 2011-09-01 21:58:55 UTC
Actually - one other request.
Could I get a disassembly listing of md_trim_bio.  This is a small chance that it isn't crashing where I think it is.
It will be in the vmlinux file rather than any module.
Comment 13 Bruno Wolff III 2011-09-02 05:40:31 UTC
I'll be able to test a patch; I'm not real fluent at doing kernel rebuilds, but I can do it.

I started looking at doing the disassembly and it seems the best way to do this is to install the debug packages. I'm downloading them now, but need to get to sleep. I should have it for you tomorrow.
Comment 14 Bruno Wolff III 2011-09-02 16:32:52 UTC
Created attachment 71372 [details]
md_trim_bio disassembly output

This is from 3.1.0-0.rc4.git0.0.fc17.i686.PAE and should match the complete traceback I got using netconsole.
Comment 15 Bruno Wolff III 2011-09-02 21:48:08 UTC
Created attachment 71482 [details]
Corrected md_trim_bio

I copied over the typescript file before exiting script and the previous version wasn't complete.
Comment 16 Neil Brown 2011-09-04 22:52:26 UTC
Thanks a lot of that - it helped more than I expected.

The problem is a use-after-free bug.  By the time it crashed it was doing really weird things because it had been basing decisions on corrupt data.
I will attach a patch.  If you could test it and confirm that it fixed the probem I would really appreciate it.

Thanks,
NeilBrown
Comment 17 Neil Brown 2011-09-04 22:53:19 UTC
Created attachment 71722 [details]
patch to remove use-after-free bug in RAID1
Comment 18 Bruno Wolff III 2011-09-05 13:41:10 UTC
I'll update the bug once I have some machines running with the patch. There isn't a strict reproducer for this (though disk activity seems to help), but after a couple days without a problem we'll have pretty good reason to believe the problem is fixed.
I have some stuff going on today and it takes a while for a kernel build on my hardware, but I'll probably have two machines on the test kernel tonight and hopefully another tomorrow.
Comment 19 Bruno Wolff III 2011-09-06 04:19:13 UTC
I now have two machines running an rc4 kernel with the above patch. I'll add a third system tomorrow. I'll make an update if I get a similar crash. If I don't see any by Friday, I'll add a comment about that as well.

Nothing was obviously wrong with the test kernel in a few minutes of general system use.
Comment 20 Neil Brown 2011-09-06 05:25:53 UTC
Thanks - I look forward to hearing the good news :-)

The patch (with addition to fix RAID10 as well) should appear in -next tomorrow and I'll push it to Linus a little later.
Comment 21 Florian Mickler 2012-01-12 21:27:40 UTC
A patch referencing this bug report has been merged in Linux v3.1-rc10:

commit 079fa166a2874985ae58b2e21e26e1cbc91127d4
Author: NeilBrown <neilb@suse.de>
Date:   Sat Sep 10 17:21:23 2011 +1000

    md/raid1,10: Remove use-after-free bug in make_request.
Comment 22 Bruno Wolff III 2012-01-14 03:50:52 UTC
The patch worked fine. The kernel.org mess happened right around the time the problem was fixed and I couldn't update the ticket.