Bug 42954

Summary: kernel oops when adding a bitmap to a raid1 md device
Product: IO/Storage
Component: MD
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.2.6
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Flavio Stanchina (flavio)
Assignee: Neil Brown (neilb)
CC: neilb
Attachments: kernel messages
             kernel config
             kernel trace with a different BUG

Description Flavio Stanchina 2012-03-18 11:29:48 UTC
The kernel oopsed on me after I added a bitmap to a RAID1 md device with:

  mdadm --grow /dev/md2 --bitmap=internal

This happened to me during normal operation -- twice, because I hoped the first time it was just bad luck. Then I booted with init=/bin/bash, i.e. nothing but the bare kernel, built a RAID1 on two spare partitions, added a bitmap as described above and it oopsed as soon as I tried to write something to the device. This was with a Debian distribution kernel, linux-image-3.2.0-1-amd64 versions 3.2.4-1 and 3.2.6-1; see Debian bug 661558:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=661558

I've now reproduced it with a vanilla 3.2.6 kernel; I'm attaching the kernel messages captured with netconsole.

The commands used are as follows:

mdadm --zero-superblock /dev/disk/by-id/ata-XXX-part1
mdadm --zero-superblock /dev/disk/by-id/ata-YYY-part3
mdadm --create /dev/md3 --metadata=0.90 --assume-clean -l1 -n2 \
 /dev/disk/by-id/ata-XXX-part1 \
 /dev/disk/by-id/ata-YYY-part3
mdadm --grow /dev/md3 --bitmap=internal
dd if=/dev/zero of=/dev/md3 bs=1M count=1

mdadm is Debian version 3.2.3-2.
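
To confirm that the bitmap was really added before writing to the array, one option is to scan /proc/mdstat for a "bitmap:" line in the array's stanza. The following is only a minimal sketch: md_has_bitmap is a hypothetical helper name, and it assumes the usual /proc/mdstat layout (an "mdN : ..." line followed by indented detail lines).

```shell
#!/bin/sh
# md_has_bitmap ARRAY
# Reads /proc/mdstat-style text on stdin and succeeds only if the stanza
# for ARRAY (e.g. "md3") contains a "bitmap:" line.
# Hypothetical helper; assumes the usual "mdN : ..." stanza layout.
md_has_bitmap() {
    awk -v dev="$1" '
        $2 == ":"  { in_dev = ($1 == dev); next }   # a new stanza starts
        NF == 0    { in_dev = 0 }                   # a blank line ends it
        in_dev && $1 == "bitmap:" { found = 1 }
        END { exit !found }
    '
}

# Usage on a live system:
#   md_has_bitmap md3 < /proc/mdstat && echo "md3 has a bitmap"
```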
Comment 1 Flavio Stanchina 2012-03-18 11:30:28 UTC
Created attachment 72641: kernel messages
Comment 2 Flavio Stanchina 2012-03-18 11:30:58 UTC
Created attachment 72642: kernel config
Comment 3 Neil Brown 2012-03-18 22:20:57 UTC
Thanks for the report.

I believe this is fixed by:

http://neil.brown.name/git?p=md;a=commitdiff;h=37b8fb4a7443ad1d83a977f4b1720b5617447fed

which I have queued to send to Linus as soon as 3.3 is out.  It will then be added to recent stable kernels.

(Maybe I should just submit it now, but I thought 3.3 was imminent.)
Comment 4 Flavio Stanchina 2012-03-20 20:43:53 UTC
I applied the patch to a vanilla 3.2.6 and tried again, but got the same crash. I'm sorry I didn't capture the kernel messages, as I was in a hurry, but I can do that if you think it might be useful.
Comment 5 Neil Brown 2012-03-20 20:57:50 UTC
Yes please - and double check that you are really running the new kernel as I have a high degree of confidence that the patch fixes that problem.
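
One way to double-check is to compare the output of "uname -r" with the release string of the freshly built kernel. This is a minimal sketch, assuming a POSIX shell; kernel_matches is a hypothetical helper and the release string in the example is a placeholder for whatever your build actually produced.

```shell
#!/bin/sh
# kernel_matches EXPECTED [ACTUAL]
# Succeeds only when the running kernel release equals EXPECTED exactly.
# ACTUAL defaults to the output of `uname -r`; passing it explicitly is
# only meant for trying the function out.
kernel_matches() {
    expected=$1
    actual=${2:-$(uname -r)}
    if [ "$actual" = "$expected" ]; then
        echo "OK: running $actual"
    else
        echo "MISMATCH: running $actual, expected $expected"
        return 1
    fi
}

# Example (placeholder release string, substitute your own):
#   kernel_matches "3.2.6-patched"
```

An exact string comparison is deliberate here: two builds of the same base version can differ only in a short suffix.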
Comment 6 Flavio Stanchina 2012-03-22 21:04:45 UTC
Well, it appears that I did indeed boot the wrong kernel. After I applied your patch and rebuilt the kernel, "make deb-pkg" added a + to the version number (I suppose to signal that the sources aren't pristine). To apt-get, a trailing + means "install the package with this name WITHOUT the +", so it installed the unpatched kernel again. :-/
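
A small guard against this pitfall is to check a package name for a trailing + or - before handing it to apt-get, and to install the generated .deb file directly with "dpkg -i" when the check fires. This is only a sketch, assuming a POSIX shell; has_apt_suffix and the package name below are hypothetical.

```shell
#!/bin/sh
# has_apt_suffix NAME
# Succeeds when NAME ends in "+" or "-", which apt-get may treat as an
# install/remove modifier instead of as part of the package name.
has_apt_suffix() {
    case $1 in
        *+|*-) return 0 ;;
        *)     return 1 ;;
    esac
}

pkg="linux-image-3.2.6+"   # hypothetical package name
if has_apt_suffix "$pkg"; then
    echo "warning: '$pkg' ends in + or -; install the .deb file with dpkg -i instead"
fi
```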

I've now rebuilt and installed the right kernel and I can confirm that your patch fixes the crash I've seen. However, you might want to look at the trace I'm attaching, because at 157.8 seconds there's another kernel BUG in drivers/md/md.c, probably unrelated to the bitmap thing but still firmly in your territory. :)

After creating the array, I ran "dd if=/dev/zero of=/dev/md3 bs=1M count=1" as usual and it survived. I then ran dd again without count=1 to stress-test it a bit more, oblivious to the fact that job control is disabled in a shell started with init=/bin/bash. After a minute or so, satisfied that it was working fine, I just hit Ctrl+Alt+Del and got the kernel BUG you'll find in the trace.
Comment 7 Flavio Stanchina 2012-03-22 21:05:37 UTC
Created attachment 72687: kernel trace with a different BUG
Comment 8 Neil Brown 2012-03-22 21:19:51 UTC
Thanks.
That second one is fixed by:

http://neil.brown.name/git?p=md;a=commitdiff;h=c744a65c1e2d59acc54333ce80a5b0702a98010b

already sent to Linus, but he doesn't seem to have pulled it yet.