When running 3.1 kernels my machines stay up less than a day, but often stay up for a number of hours before crashing. I haven't noticed a particular trigger. The issue may have been different with earlier 3.1 kernels than rc3.git5. This morning I noticed that two machines both had to rebuild a software raid 1 device when I rebooted them. (The second machine was running 3.1.0-0.rc3.git5.1.fc16.i686.PAE, but I didn't get a picture of its traceback.)
I have now seen this with the 3.1.0-0.rc4.git0.0.fc16.x86_64 kernel.
You haven't given an really details on what the problem is. Without details (preferably lots of details) I cannot even begin to guess what is wrong.
I don't have a lot of details. The system will run for a while (often multiple hours) and then crash with the traceback I took a picture of appearing on the screen instead of the normal desktop GUI. There is no warning with degraded performance or the like. There may be some correlation with disk activity, but it hasn't happened enough to be sure. Typically the machines will crash over night if they don't during the day. I don't have a reliable way to trigger the crash. I am not seeing the problem at all with 3.0 kernels. I'll switch back to a 3.1 kernel and collect some system details that might point out some differences between my systems and other peoples. The three systems I have seen this with have significantly different hardware, but all are running luks encrypted devices on top of MD raid 1 devices. The traceback referred to symbols that suggest that MD raid or/and luks/dmcrypt may be involved.
You say you took a picture of a traceback. Could you attach that picture?
I tried to add the picture at bug creation, but it looks like it was too large and I didn't double check to make sure it was attached. I just put it up on a web page at: http://wolff.to/bruno/DSCN1433.JPG
Thanks for that. It does look like a bug in md/raid1. However this looks like the second oops message. An earlier one would be the important one, but while Linux was trying to clean up after that oops another oops hit and scrolled the first one off the screen. I don't suppose you can scroll back? Probably not. The bug seems to happen at make_request+0x59b/088bc in the raid1 module. If you still have exactly that raid1.ko, could you please gdb raid1.ko disassemble make_request and collect all the output and post it here? Thanks.
Created attachment 71032 [details] gdb disassembly output
Thanks. That location in make_request is where it calls md_trim_bio. md_trim_bio appears to have failed at 0x2f which is very early. It must have been de-referencing 'bio'. Which seems to imply that bio_clone-mddev returned NULL. But that is not possible when GFP_NOIO is passed. So I'm stumped. If you could set up netconsole to send the kernel messages over the network to another machine, then we could get a full trace which might help. Setting up netconsole is fairly straight forward. Documentation/networking/netconsole.txt describes it. Just an extra kernel parameter, and run 'nc' on some other machine capturing the output.
I was reading about netconsole earlier today and was wondering if it would help in this case. I am getting pretty tired tonight, so probably won't get to it until tomorrow and Thursdays I end up pretty busy. So it may be Friday before I a report back. If I get can things set up in the morning, then I might get something back to you sooner.
Created attachment 71132 [details] Traceback captured via netconsole in a script session. It took a bit more futzing around this morning than I would have liked, but I got it setup before I left, and it crashed by the time I got to work. The traceback looks like it is complete, so hopefully has what you need. This is with the 3.1.0-0.rc4.git0.0.fc17.i686.PAE kernel running at the time of the crash.
Thanks a lot . It is definitely a case of bio_clone_mddev returning NULL, and there are no prior messages about a memory allocation failing. Weird. I'm going to have to read a lot of code before I find out how this can happen. I'll probably want to send you a patch to run with which will print out some more debugging information - I hope you can work with that... Thanks.
Actually - one other request. Could I get a disassembly listing of md_trim_bio. This is a small chance that it isn't crashing where I think it is. It will be in the vmlinux file rather than any module.
I'll be able to test a patch; I'm not real fluent at doing kernel rebuilds, but I can do it. I started looking at doing the disassembly and it seems the best way to do this is to install the debug packages. I'm downloading them now, but need to get to sleep. I should have it for you tomorrow.
Created attachment 71372 [details] md_trim_bio disassembly output This is from 3.1.0-0.rc4.git0.0.fc17.i686.PAE and should match the complete traceback I got using netconsole.
Created attachment 71482 [details] Corrected md_trim_bio I copied over the typescript file before exiting script and the previous version wasn't complete.
Thanks a lot of that - it helped more than I expected. The problem is a use-after-free bug. By the time it crashed it was doing really weird things because it had been basing decisions on corrupt data. I will attach a patch. If you could test it and confirm that it fixed the probem I would really appreciate it. Thanks, NeilBrown
Created attachment 71722 [details] patch to remove use-after-free bug in RAID1
I'll update the bug once I have some machines running with the patch. There isn't a strict reproducer for this (though disk activity seems to help), but after a couple days without a problem we'll have pretty good reason to believe the problem is fixed. I have some stuff going on today and it takes a while for a kernel build on my hardware, but I'll probably have two machines on the test kernel tonight and hopefully another tomorrow.
I now have two machines running an rc4 kernel with the above patch. I'll add a third system tomorrow. I'll make an update if I get a similar crash. If I don't see any by Friday, I'll add a comment about that as well. Nothing was obviously wrong with the test kernel in a few minutes of general system use.
Thanks - I look forward to hearing the good news :-) The patch (with addition to fix RAID10 as well) should appear in -next tomorrow and I'll push it to Linus a little later.
A patch referencing this bug report has been merged in Linux v3.1-rc10: commit 079fa166a2874985ae58b2e21e26e1cbc91127d4 Author: NeilBrown <neilb@suse.de> Date: Sat Sep 10 17:21:23 2011 +1000 md/raid1,10: Remove use-after-free bug in make_request.
The patch worked fine. The kernel.org mess happened right around the time the problem was fixed and I couldn't update the ticket.