Bug 100491

Summary: Oops under bitmap_start_sync [md_mod] at boot
Product: IO/Storage Reporter: Sami Liedes (sami.liedes)
Component: MD    Assignee: io_md
Status: NEW ---    
Severity: normal CC: adam, agk, evangelos, fred-kernel, giuliano.procida, gqjiang, jbrassow, jim.hague, ju93, kernel, neilb, net147, rgoldwyn, sami.liedes, stark
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.1.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg at loglevel=7
first 8k of rmeta_1 after sync completed
first 8k of rmeta_2 after sync completed
Patch to start with zeroed page
[PATCH] Enable loading of bitmaps for dm-md-raid1
attachment-5941-0.html
attachment-8561-0.html

Description Sami Liedes 2015-06-25 21:02:45 UTC
When trying to boot 4.1.0, I get this oops:

------------------------------------------------------------
Reading all physical volumes.  This may take a while...
Found volume group "rootvg" using metadata type lvm2
device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
md/raid1:mdX: active with 1 out of 2 mirrors
mdX: invalid bitmap file superblock: bad magic
md-cluster module not found.
mdX: Could not setup cluster service (256)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
IP: [<ffffffff8159e4a9>] _raw_spin_lock_irq+0x29/0x70
PGD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: dm_crypt microcode dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx dm_mirror dm_region_hash dm_log dm_mod btrfs xor raid6_pq raid1 md_mod crct10dif_pclmul crc32_pclmul crc32_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd e1000e ahci libahci ehci_pci ptp xhci_pci ehci_hcd pps_core xhci_hcd thermal
CPU: 0 PID: 223 Comm: lvm Not tainted 4.1.0 #1
Hardware name: ASUS All Series/MAXIMUS VII RANGER, BIOS 2012 10/01/2014
task: ffff8805023fb1c0 ti: ffff880501dc4000 task.ti: ffff880501dc4000
RIP: 0010:[<ffffffff8159e4a9>]  [<ffffffff8159e4a9>] _raw_spin_lock_irq+0x29/0x70
RSP: 0018:ffff880501dc7b18  EFLAGS: 00010082
RAX: 0000000000000100 RBX: 0000000000000100 RCX: 0000000000000000
RDX: ffff880501dc7ba0 RSI: 0000000000000000 RDI: 0000000000000100
RBP: ffff880501dc7b28 R08: 0000000800000000 R09: 00000008ffffffff
R10: ffffffffa050fd67 R11: 0000000000000246 R12: 0000000000000000
R13: ffff880501dc7ba0 R14: 0000000000000000 R15: ffff880501dc7cf0
FS:  00007f267b99b800(0000) GS:ffff88051ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000100 CR3: 0000000501f6b000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffff880501dc7b58 0000000000000100 ffff880501dc7b88 ffffffffa03b58c8
 ffff8800dd4e52d0 00000000a0510080 ffff8800dd4e52e0 ffff880501fd0be8
 ffff880501dc7b88 0000000000000000 ffff880506a05010 0000000000000100
Call Trace:
 [<ffffffffa03b58c8>] bitmap_start_sync+0x38/0xe0 [md_mod]
    in bitmap_start_sync (drivers/md/bitmap.c:1475)
 [<ffffffffa03b7459>] bitmap_load+0x49/0x130 [md_mod]
    in bitmap_load (drivers/md/bitmap.c:1822)
 [<ffffffffa0579e3d>] raid_resume+0x13d/0x230 [dm_raid]
    in raid_resume (drivers/md/dm-raid.c:1690)
 [<ffffffffa0512469>] dm_table_resume_targets+0x99/0xe0 [dm_mod]
    in dm_table_resume_targets (drivers/md/dm-table.c:1637)    
 [<ffffffffa050fa21>] dm_resume+0xc1/0x100 [dm_mod]
    in dm_resume (drivers/md/dm.c:3290)
 [<ffffffffa0514dcb>] dev_suspend+0x12b/0x260 [dm_mod]
    in dev_suspend (drivers/md/dm-ioctl.c:1048)
 [<ffffffffa0514ca0>] ? table_load+0x340/0x340 [dm_mod]
 [<ffffffffa05156ae>] ctl_ioctl+0x22e/0x510 [dm_mod]
    in ctl_ioctl (drivers/md/dm-ioctl.c:1849)
 [<ffffffffa05159a3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
    in dm_ctl_ioctl (drivers/md/dm-ioctl.c:1867)
 [<ffffffff811d22a0>] do_vfs_ioctl+0x2e0/0x4e0
 [<ffffffff81258c46>] ? security_sem_associate+0x16/0x20
 [<ffffffff812450bb>] ? ipcget+0x14b/0x1d0
 [<ffffffff811d2521>] SyS_ioctl+0x81/0xa0
 [<ffffffff8159e9b2>] system_call_fastpath+0x16/0x75
 Code: 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 fa 66 0f 1f 44 00 00 e8 a3 5c b8 ff 65 ff 05 bc c4 a6 7e b8 00 01 00 00 <f0> 66 0f c1 03 0f b6 cc 38 c1 75 0b 48 83 c4 08 5b 5d c3 0f 1f
 RIP  [<ffffffff8159e4a9>] _raw_spin_lock_irq+0x29/0x70
  RSP <ffff880501dc7b18>
 CR2: 0000000000000100
 ---[ end trace 821307f06ab8f24a ]---
------------------------------------------------------------

Under kernel 4.0.5, the system at least boots. It does resync the RAID array at every boot, which is obviously less than optimal; stopping the md devices at shutdown always fails with "Device or resource busy". I'm not sure if that's me or Debian doing something wrong. FWIW, here are the relevant messages from a 4.0.5 boot:

------------------------------------------------------------
device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
md/raid1:mdX: active with 1 out of 2 mirrors
mdX: invalid bitmap file superblock: bad magic
created bitmap (675 pages) for device mdX
mdX: bitmap initialized from disk: read 43 pages, set 44 of 1380520 bits
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:dm-3
 disk 1, wo:0, o:1, dev:dm-5
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:dm-3
 disk 1, wo:0, o:1, dev:dm-5
md: recovery of RAID array mdX
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 1413652480k.
------------------------------------------------------------

My setup is dm-crypt on top of LVM raid1 on physical disk.
Comment 1 Sami Liedes 2015-06-25 21:32:44 UTC
Not being able to load md-cluster doesn't seem to be the only problem here. The system still does not boot after adding md_cluster.ko to the initramfs. Rather, I get output like this every 10 seconds:

------------------------------------------------------------
device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
md/raid1: mdX: active with 1 out of 2 mirrors
mdX: invalid bitmap file superblock: bad magic
dlm: no local IP address has been set
dlm: cannot start dlm lowcomms -107
mdX: Could not setup cluster service (-107)
mdX: failed to create bitmap (-107)
device-mapper: table: 253:6: raid: Fail to run raid array
device-mapper: ioctl: error adding target to table
device-mapper: reload ioctl on  failed: Transport endpoint is not connected
------------------------------------------------------------

Note: my setup is entirely local.
Comment 2 Sami Liedes 2015-06-28 21:00:06 UTC
On Thu, Jun 25, 2015 at 09:02:45PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=100491
> 
>             Bug ID: 100491
>            Summary: Oops under bitmap_start_sync [md_mod] at boot
[...]
> Reading all physical volumes.  This may take a while...
> Found volume group "rootvg" using metadata type lvm2
> device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
> md/raid1:mdX: active with 1 out of 2 mirrors
> mdX: invalid bitmap file superblock: bad magic
> md-cluster module not found.
> mdX: Could not setup cluster service (256)
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
> IP: [<ffffffff8159e4a9>] _raw_spin_lock_irq+0x29/0x70
> PGD 0
> Oops: 0002 [#1] PREEMPT SMP
[...]

I'm marking this as a regression in bugzilla, since this seems to
prevent booting on 4.1.0 at least in certain circumstances (namely
those which I have; I wonder if any raid1 recovery works?) while 4.0.6
boots correctly.

I bisected this down to one of four commits. Well, assuming that the
problem was caused by changes in drivers/md; a fair assumption, I
think. The commits are:

$ git bisect view --oneline
f9209a3 bitmap_create returns bitmap pointer
96ae923 Gather on-going resync information of other nodes
54519c5 Lock bitmap while joining the cluster
b97e9257 Use separate bitmaps for each nodes in the cluster

The crash happens whether or not CONFIG_MD_CLUSTER is enabled.

Here are the versions I tested:

git bisect start '--' 'drivers/md'
# bad: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
# good: [39a8804455fb23f09157341d3ba7db6d7ae6ee76] Linux 4.0
# bad: [9ffc8f7cb9647b13dfe4d1ad0d5e1427bb8b46d6] md/raid5: don't do chunk aligned read on degraded array.
# bad: [6dc69c9c460b0cf05b5b3f323a8b944a2e52e76d] md: recover_bitmaps() can be static
# bad: [4b26a08af92c0d9c0bce07612b56ff326112321a] Perform resync for cluster node failure
# good: [cf921cc19cf7c1e99f730a2faa02d80817d684a2] Add node recovery callbacks
# skip: [96ae923ab659e37dd5fc1e05ecbf654e2f94bcbe] Gather on-going resync information of other nodes
# bad: [f9209a323547f054c7439a3bf67c45e64a054bdd] bitmap_create returns bitmap pointer
# skip: [54519c5f4b398bcfe599f652b4ef4004d5fa63ff] Lock bitmap while joining the cluster

	Sami
Comment 3 Goldwyn Rodrigues 2015-06-29 12:10:36 UTC
The reason you are facing the crash, or performing a resync every time you boot in earlier kernels, is that the bitmap superblock is corrupted. This is why you are getting the message:

mdX: invalid bitmap file superblock: bad magic

The reason the cluster code is being activated is that some padding in the superblock was repurposed to store the number of nodes in the cluster. The corruption in your superblock is most likely present in this area as well, leaving the number of nodes set to a non-zero value.

Could you run mdadm --examine-bitmap /dev/sdXX using Neil's cluster branch (http://git.neil.brown.name/?p=mdadm.git;a=shortlog;h=refs/heads/cluster)? Please provide the output here.

Note: The command will produce more junk, but it will prove the theory.
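(A rough sketch of doing that; the clone URL is inferred from the gitweb link above, so treat it as an assumption rather than an exact instruction:

$ git clone -b cluster git://neil.brown.name/mdadm.git
$ cd mdadm && make
$ sudo ./mdadm --examine-bitmap /dev/sdXX

The --examine-bitmap output should include the magic, version and node count fields under discussion here.)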

Finally, do you not allow the resync to complete? It should write a clean superblock once the resync completes.
Comment 4 Sami Liedes 2015-06-29 13:15:22 UTC
I do usually allow the resync to complete, yet at the next boot the superblock always seems to be corrupt again. Is it invalidated by a failure to detach the DM device at shutdown? For some reason, I always get a bunch of "systemd-shutdown[1]: Could not detach DM /dev/dm-[23456]: Device or resource busy" messages at shutdown (the first relevant error seems to be "Failed to detach loop devices: Invalid argument"; here's a photo of the messages: https://goo.gl/photos/jbURnDrfo3w2SGpJA). If that doesn't cause it, then the resync must be failing to write a clean superblock.

So should I run the mdadm command on the LVM rmeta volumes? I have:

# lvs --all
  LV                 VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  newroot            rootvg rwi-aor---   1.32t                                    100.00
  [newroot_rimage_1] rootvg iwi-aor---   1.32t
  [newroot_rimage_2] rootvg iwi-aor---   1.32t
  [newroot_rmeta_1]  rootvg ewi-aor---   4.00m
  [newroot_rmeta_2]  rootvg ewi-aor---   4.00m

(newroot is the volume that always resyncs at boot; as you can see, it's in sync here.)

Running mdadm --examine-bitmap on either the rimage devices or the rmeta devices says the bitmap magic is invalid, so I assume LVM stores the bitmap somewhere else...

I'm at the limits of my knowledge about LVM, but a comment in /etc/lvm/lvm.conf says the following (I assume the "log" is either the same thing as the bitmap or stored with it?):

# "raid1"  - This implementation leverages MD's RAID1 personality through
#            device-mapper.  It is characterized by a lack of log options.
#            (A log is always allocated for every device and they are placed
#            on the same device as the image - no separate devices are
#            required.)  This mirror implementation does not require I/O
#            to be blocked in the kernel in the event of a failure.
#            This mirror implementation is not cluster-aware and cannot be
#            used in a shared (active/active) fashion in a cluster.
Comment 5 Goldwyn Rodrigues 2015-06-30 02:15:03 UTC
Would you be able to provide me with the exact options used to create the device? I will try to replicate an environment as close to yours as possible.

The trouble is in md, so the output of /proc/mdstat would be useful. However, it is the interaction of the dm device with md that seems to be the problem.
Comment 6 Sami Liedes 2015-06-30 11:59:02 UTC
--------------------
$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
unused devices: <none>
--------------------

The device started as a normal LVM volume, which I lvconverted to a 2-way RAID1 logical volume. The filesystem is from February 2013, and I assume the LVM devices were generated with the kernel/lvm from around that time. The commands must have been something like this, off the top of my head:

[pvcreate on some volumes]
# vgcreate rootvg physical_volumes...
# lvcreate --size 1T --name newroot rootvg
[operations on /dev/rootvg/newroot, like creating a luks encrypted volume and a filesystem on it]
# lvconvert --type raid1 -m1 /dev/rootvg/newroot /path/to/physicalvolume1 /path/to/physicalvolume2

After that, one of the physical volumes failed, and I recovered it somehow after removing the failing disk; I suspect by first converting it back to a normal (linear) logical volume, then running the above lvconvert again. Sorry, the details are a bit fuzzy, especially since I found some of the lvconvert operations needed to replace a failed volume quite confusing (I feel lucky that I didn't drop the working mirror or something).

I guess I could try to replicate it myself in a virtual machine. Still, I guess it might be easier to figure out what's wrong right now and maybe try to deduce from that how it became so. I guess LVM must store the md bitmap somewhere so I can dump it? I only (roughly) understand the interaction between LVM and device mapper. I guess there's no command or /proc file that would give details of the md mapping like 'dmsetup table' does for device mapper? Are these things documented somewhere besides the source code? :)
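(For what it's worth, the dm-raid target does expose some md-level state through "dmsetup status"; a minimal sketch, assuming the volume name from this report:

$ sudo dmsetup status rootvg-newroot

For a raid target this reports the raid personality, one health character per device, and the resync progress, which is about as close to a per-mapping /proc/mdstat as dm-raid provides.)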

I'm also pretty sure the volumes did stay in sync at some point. I could try to grep my logs to see if the perpetual resyncing started when I replaced the failed disk or at some later point, e.g. after a kernel upgrade.
Comment 7 Goldwyn Rodrigues 2015-06-30 18:03:49 UTC
Thanks for the inputs. I could not recreate the situation on my system. 

Yes, I think the "Could not detach" shutdown messages have something to do with the MD bitmaps not being correct/in sync.

From the looks of it, the bitmap superblock is corrupted. Could you boot with the loglevel=7 parameter? This will print the MD bitmap superblock information as soon as the error is reported. Boot with 4.0.5 to be safe, because the error is happening there as well.

I suspect the bug existed before the md-cluster code was checked in; these patches just force the failure in a worse way.
Comment 8 Sami Liedes 2015-06-30 22:00:47 UTC
Created attachment 181491 [details]
dmesg at loglevel=7

dmesg at loglevel=7 from kernel 4.0.7 attached. The array was reportedly in sync before rebooting. Still the first relevant thing I can see there is "device-mapper: raid: Device 0 specified for rebuild: Clearing superblock".
Comment 9 Neil Brown 2015-07-01 03:49:40 UTC
The original crash is, in part, because "request_module" in the kernel returns a positive error code instead of a negative error code as documented.  Sad.  We'll fix that, but having the md-cluster module available will avoid that problem.

dm-raid has separate regions for the data and the metadata.
This is seen in:

  [newroot_rimage_1] rootvg iwi-aor---   1.32t
  [newroot_rimage_2] rootvg iwi-aor---   1.32t
  [newroot_rmeta_1]  rootvg ewi-aor---   4.00m
  [newroot_rmeta_2]  rootvg ewi-aor---   4.00m

rimage_X is the data, rmeta_X is the metadata.
This metadata contains a 4K block describing the array, and then the bitmap.
(mdadm doesn't know how to read this).


If you can 'dd' the first 8K of these 'rmeta' devices and attach that, I could have a look and see if there is anything to explain why the magic number is wrong.
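(A minimal sketch of such a dump, using the mapper names implied by the lvs output above; the exact names are assumptions and may differ per system:

# dd if=/dev/mapper/rootvg-newroot_rmeta_1 of=rmeta_1.bin bs=4096 count=2
# dd if=/dev/mapper/rootvg-newroot_rmeta_2 of=rmeta_2.bin bs=4096 count=2

Each command copies the first two 4K blocks, i.e. the array superblock plus the start of the bitmap.)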
Comment 10 Neil Brown 2015-07-01 04:13:15 UTC
Can I just clarify something:
when you boot with a 4.0 kernel, does it all work as expected, or do you still get the "bad magic" errors?
Comment 11 Sami Liedes 2015-07-01 07:39:10 UTC
With 4.0.x, I get "bad magic" and the array is reconstructed. With 4.1, the system fails to boot as described in Comment 1 (or the initial report if md-cluster is not available).
Comment 12 Sami Liedes 2015-07-01 08:01:46 UTC
Created attachment 181521 [details]
first 8k of rmeta_1 after sync completed
Comment 13 Sami Liedes 2015-07-01 08:10:23 UTC
Created attachment 181531 [details]
first 8k of rmeta_2 after sync completed

First 8k of rmeta_[12] attached. These are at the point when the sync has been completed (on a running system, so I guess they might be slightly inconsistent if one is written after extracting the first but before extracting the second).

I can probably also extract them at the point after shutdown / before boot later today, though that's slightly tricky since recovery begins automatically early at boot. I can probably figure out their physical location on disk and dump them from a non-lvm/raid aware OS.
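(A possible way to locate them, as a sketch: the rmeta LVs are plain linear mappings, so "dmsetup table" gives the backing device and start sector. For example, "rootvg-newroot_rmeta_1: 0 8192 linear 8:22 2049" in the output further below means 8192 sectors mapped onto device 8:22 starting at sector 2049; assuming 8:22 resolves to /dev/sdb6, the same first 8K could be dumped from a rescue system with:

# dd if=/dev/sdb6 of=rmeta_1.raw bs=512 skip=2049 count=16
)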
Comment 14 Goldwyn Rodrigues 2015-07-01 09:10:25 UTC
Created attachment 181561 [details]
Patch to start with zeroed page

Could you please try this patch and report the output?
Comment 15 Neil Brown 2015-07-01 11:11:13 UTC
I just noticed another error in the code - obvious when you see it:

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 135a0907e9de..d2d294f3e4a0 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -671,7 +671,7 @@ out:
        kunmap_atomic(sb);
        /* Assiging chunksize is required for "re_read" */
        bitmap->mddev->bitmap_info.chunksize = chunksize;
-       if (nodes && (bitmap->cluster_slot < 0)) {
+       if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
                err = md_setup_cluster(bitmap->mddev, nodes);
                if (err) {
                        pr_err("%s: Could not setup cluster service (%d)\n",



This still doesn't fix the fact that it resyncs all the time.  That looks like a fairly deep problem with the integration of md/raid into dm.

Thanks for the logs.  They show that the "bitmap superblock" certainly has garbage for 'magic' and version, though the 'events' counter is correct.
I'm quite sure I could recreate this for testing, so feel free to do anything you like to try to repair what you have.  Possibly using lvconvert to remove the mirror, then lvconvert to put it back, would work, but I have little experience with this.
I'll try to come up with something more substantial in the next day or two.
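(A sketch of what that repair attempt might look like, using the volume name from this report; whether it actually regenerates clean metadata is exactly what is in question here:

# lvconvert -m0 rootvg/newroot     # remove the mirror leg and its rmeta
# lvconvert -m1 rootvg/newroot     # re-add it, forcing fresh metadata and a full resync
)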
Comment 16 Neil Brown 2015-07-02 05:57:24 UTC
Sorry - another question just to clarify: could you report the output of
dmsetup table
dmsetup info

shortly after boot?  Thanks.
Comment 17 Sami Liedes 2015-07-02 15:23:11 UTC
Behavior at boot with 4.1.1 + patch from Comment 14 + patch from Comment 15 applied:

  Reading all physical volumes. This may take a while...
  Found volume group "rootvg" using metadata type lvm2
[2.27] device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
[2.27] md/raid1:mdX: active with 1 out of 2 mirrors
[2.27] mdX: invalid bitmap file superblock: bad magic
[2.27] mdX: failed to create bitmap (-22)
[2.27] device-mapper: table: 253:6: raid: Fail to run raid array
[2.27] device-mapper: ioctl: error adding target to table
  device-mapper: reload ioctl on  failed: Invalid argument
  2 logical volume(s) in volume group "rootvg" now active
  Reading all physical volumes. This may take a while...
  Found volume group "rootvg" using metadata type lvm2
[12.3] device-mapper: raid: Device 0 specified for rebuild: Clearing superblock
[12.3] md/raid1:mdX: active with 1 out of 2 mirrors
[12.3] mdX: invalid bitmap file superblock: bad magic
[12.3] mdX: failed to create bitmap (-22)
[... and so on, the same messages repeating every 10 seconds]

Shortly after boot on 4.0.7:

$ sudo dmsetup table
rootvg-newroot_rimage_1: 0 1938857984 linear 8:22 10241
rootvg-newroot_rimage_1: 1938857984 888446976 linear 8:21 2049
rootvg-backup_crypt: 0 1677717504 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:0 4096
rootvg-scratch: 0 1468006400 linear 8:34 2054
rootvg-newroot: 0 2827304960 raid raid1 5 0 rebuild 0 region_size 2048 2 253:2 253:3 253:4 253:5
rootvg-newroot_crypt: 0 2827300864 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:6 4096
rootvg-newroot_rmeta_2: 0 8192 linear 8:2 2048
rootvg-newroot_rmeta_1: 0 8192 linear 8:22 2049
rootvg-backup: 0 974766080 linear 8:36 2052
rootvg-backup: 974766080 702955520 linear 8:35 147654663
rootvg-scratch_crypt: 0 1468002304 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:1 4096
rootvg-newroot_rimage_2: 0 2827304960 linear 8:2 10240

$ sudo dmsetup info
Name:              rootvg-newroot_rimage_1
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 2
UUID: LVM-jVJFdlmF0JFIstUe5PGj05TSVBOdvnhagwXAuZl9Wz28tsfeYkp60chiiVJi3oLo

Name:              rootvg-backup_crypt
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 9
Number of targets: 1
UUID: CRYPT-LUKS1-21107b8520f546168113f2435fa5751b-rootvg-backup_crypt

Name:              rootvg-scratch
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-jVJFdlmF0JFIstUe5PGj05TSVBOdvnhao5QuWUNb4cieO6Vebt2AudgMQ51oasSd

Name:              rootvg-newroot
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      1
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-jVJFdlmF0JFIstUe5PGj05TSVBOdvnhaaoYTyDfZ7vRS4OqTd0sAm6lg3zDd5CR5

Name:              rootvg-newroot_crypt
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: CRYPT-LUKS1-9fd745bf8f8244ff808b674b52f78194-rootvg-newroot_crypt

Name:              rootvg-newroot_rmeta_2
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-jVJFdlmF0JFIstUe5PGj05TSVBOdvnhaUpv34B9iBL0mSiWIQHNt6uC6eFd6JG1v

Differences from the above after the sync has completed:

# diff -u dmsetup-table-after-boot.txt <(dmsetup table)
--- dmsetup-table-after-boot.txt        2015-07-02 09:32:26.392129572 +0300
+++ /dev/fd/63  2015-07-02 18:21:41.059161832 +0300
@@ -2,7 +2,7 @@
 rootvg-newroot_rimage_1: 1938857984 888446976 linear 8:21 2049
 rootvg-backup_crypt: 0 1677717504 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:0 4096
 rootvg-scratch: 0 1468006400 linear 8:34 2054
-rootvg-newroot: 0 2827304960 raid raid1 5 0 rebuild 0 region_size 2048 2 253:2 253:3 253:4 253:5
+rootvg-newroot: 0 2827304960 raid raid1 3 0 region_size 2048 2 253:2 253:3 253:4 253:5
 rootvg-newroot_crypt: 0 2827300864 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 253:6 4096
 rootvg-newroot_rmeta_2: 0 8192 linear 8:2 2048
 rootvg-newroot_rmeta_1: 0 8192 linear 8:22 2049


# diff -u dmsetup-info-after-boot.txt <(dmsetup info)
--- dmsetup-info-after-boot.txt 2015-07-02 09:32:30.791104303 +0300
+++ /dev/fd/63  2015-07-02 18:21:58.539068283 +0300
@@ -33,7 +33,7 @@
 Read Ahead:        256
 Tables present:    LIVE
 Open count:        1
-Event number:      1
+Event number:      2
 Major, minor:      253, 6
 Number of targets: 1
 UUID: LVM-jVJFdlmF0JFIstUe5PGj05TSVBOdvnhaaoYTyDfZ7vRS4OqTd0sAm6lg3zDd5CR5
Comment 18 jbrassow 2015-07-02 15:39:37 UTC
There,

rootvg-newroot: 0 2827304960 raid raid1 5 0 rebuild 0 region_size 2048 2 253:2 253:3 253:4 253:5

I'm wondering why whatever is loading the table is purposefully requesting a rebuild of device #0 every time.
Comment 19 jbrassow 2015-07-02 15:41:49 UTC
The sync issue does not seem to be the fault of the kernel, but of whatever is instantiating the device in the first place.
Comment 20 Sami Liedes 2015-07-02 15:43:43 UTC
Ah, so a boot script bug possibly?
Comment 21 jbrassow 2015-07-02 19:17:42 UTC
Not sure.  I've never seen this type of issue before.  I don't even think it is possible for the 'rebuild' param to be stored in the LVM metadata.  (You could also check that by doing 'vgcfgbackup -f my_ma.txt <vg_name>'.)  That means that something is evaluating the need to enforce a rebuild (or something else is going wildly wrong).
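(A concrete sketch of that check, using the VG name from this report; the output filename is arbitrary:

# vgcfgbackup -f /tmp/rootvg-meta.txt rootvg
# grep -i rebuild /tmp/rootvg-meta.txt

If 'rebuild' really isn't stored in the LVM metadata, the grep should come back empty.)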
Comment 22 Evangelos Foutras 2015-07-06 21:50:48 UTC
This issue has been observed downstream in Arch Linux. [1]

I was able to reproduce it in a virtual machine using a RAID1 LVM logical volume for the root filesystem. Aside from the kernel oops with Linux 4.1.1, I didn't notice any resyncs (or other issues) with Linux 4.0.7 and 3.14.47.

Since this issue is blocking Linux 4.1.x from moving out of our testing repository in Arch, it would be nice to know if you have been able to reproduce it and possibly have an idea of where the problem might be.

I'm not sure if I can provide any other helpful information, but I can test anything pointed out.

[1] https://bugs.archlinux.org/task/45548
Comment 23 Ryan Tucker 2015-07-13 00:10:06 UTC
I am also able to reproduce this with 4.1.1 on a couple of Ubuntu systems with RAID1 LVM for the root filesystem.  My symptoms are more or less identical to the others', except without the bad superblocks, resyncs, etc.  No issues with 4.0.7 or 4.0.8.
Comment 24 Goldwyn Rodrigues 2015-07-13 12:20:41 UTC
Created attachment 182511 [details]
[PATCH] Enable loading of bitmaps for dm-md-raid1

This patch adds Neil's fix from comment #15 to the patch in comment #14. It should work for devices without major corruption in the bitmap superblock.
Comment 25 Goldwyn Rodrigues 2015-07-13 12:22:14 UTC
Evangelos, Ryan: Could you try the patch in comment #24 and report back? If you are not experiencing bad superblock errors, this patch should work.
Comment 26 Evangelos Foutras 2015-07-13 12:46:55 UTC
(In reply to Goldwyn Rodrigues from comment #25)
> Evangelos, Ryan: Could you try the patch in comment #24 and report back? If
> you are not experiencing bad superblock errors, this patch should work.

Works for me, thanks! (Applied on top of Linux 4.1.2.)
Comment 27 Ryan Tucker 2015-07-15 04:07:23 UTC
Patch works for me as well, also on top of 4.1.2.  Excellent!
Comment 28 Johannes U. 2015-07-16 09:41:40 UTC
A similar problem also affects my system. From kernel 4.1.x through 4.2 (and linux-next as of yesterday), RAID5 does not work ("Clearing superblock" and then an oops because md_cluster is not compiled into the kernel). Last time I got a message about an invalid hostname when I tried to boot a kernel with md_cluster compiled in.
Should I try the patch or post dmesg output, or is there nothing I can help with because the root cause has already been found?
Comment 29 Ryan Tucker 2015-07-26 18:00:07 UTC
With 4.2.0-rc3, I can no longer reproduce the oops issue, and everything is happy.
Comment 30 stark 2015-09-01 15:42:11 UTC
FWIW, I'm seeing the same problem with both 4.1.6 and 4.2 (from the Debian 4.2-1~exp1 experimental package).

I've even gone as far as creating an entirely new mirrored volume for my root and dd'ing the filesystem from the old one to the new one, then waiting until it completed syncing. It still reports an invalid bitmap file superblock with a bad magic.

I did have a problem with some volumes rebuilding on every boot on 4.0, but that wasn't the case for the root volume. I've now replaced both of those the same way as above, and the unnecessary rebuilds have stopped. But it's still not able to find the root on 4.1 or 4.2.
Comment 31 Sami Liedes 2015-09-01 21:25:13 UTC
I also saw this:

[Start from the state described in previous comments where every boot starts a resync.]
1. [in 4.0.6] lvconvert -m0 rootvg/newroot && lvconvert -m1 rootvg/newroot
2. Boot 4.2
   -> does not boot (I have a photo of the messages, no oops)
3. Boot 4.0.6
   -> boots; no longer resyncs, so *something* has changed
4. [in 4.0.6] lvconvert -m0 rootvg/newroot
5. Boot 4.2
6. [in 4.2] lvconvert -m1 rootvg/newroot

So far I haven't tried to reboot 4.2 after step 6, but I'll do it if there's general interest in knowing what happens.
Comment 32 Sami Liedes 2015-09-06 11:51:40 UTC
Ok, it still doesn't boot in 4.2 after those steps (at least after a crash) :(
Comment 33 Sami Liedes 2015-09-29 12:25:36 UTC
No idea if this is related, but the same computer also fails to even start to sync a raid array just after creating it. I filed a separate Bug 105241.
Comment 34 Adam McKenna 2016-05-20 03:04:11 UTC
So to save everyone else the hour and a half I spent figuring out how to work around this, here's what you need to do:

# lvconvert --mirrors -1 <lv_device> <pv_device>
# vgchange -ay
# lvconvert -m 1 <lv_device>

Where <lv_device> is the device that's having issues, and <pv_device> is the SECONDARY device of the RAID1 mirror.

This will throw a warning, but will remove the secondary device.

vgchange should then be able to bring up the (now un-mirrored) volumes.

 After vgchange, run lvconvert again to add the mirror back.  This requires a resync.

The above has to be done after every reboot.
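(A concrete sketch of the same sequence with the volume/VG names used earlier in this report; the PV path for the secondary leg is purely an assumption:

# lvconvert --mirrors -1 rootvg/newroot /dev/sdb1
# vgchange -ay rootvg
# lvconvert -m 1 rootvg/newroot

The last step re-adds the mirror and triggers the full resync mentioned above.)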
Comment 35 stark 2016-05-20 03:19:02 UTC
Created attachment 216931 [details]
attachment-5941-0.html

> The above has to be done after every reboot.

I don't think something that has to be done after every reboot is really a
workaround.

How is this still a thing?
Comment 36 Sami Liedes 2016-05-20 17:26:18 UTC
Has somebody managed to figure out exactly in which cases this happens? I have a system where this at least used to happen even for new volumes in the volume group, but I have so far failed to reproduce it in a virtual machine with a new volume group.
Comment 37 stark 2016-05-20 17:33:18 UTC
Created attachment 216981 [details]
attachment-8561-0.html

The impression I get is that the bug causing the corruption was found and
fixed but nobody actually added any workaround for users with existing
corruption.
Comment 38 Adam McKenna 2016-05-24 06:50:38 UTC
I may have been mistaken when I said this needs to be run on every reboot.  When I posted my last update, it was after I had spent several hours working on this, and I thought I had rebooted into 4.5.  But I found out later that I had rebooted back into 3.16, and that may be what caused the corruption again.

Just updated to 4.5 so I will update again after next reboot if this is still an issue.
Comment 39 Giuliano Procida 2016-09-16 17:39:43 UTC
I've run into similar trouble with LVM RAID 5 and 6 volumes.

Older kernels load the bitmap superblock despite the zero magic and write it back again as zero (the padding is also written back with whatever junk it had).

I fixed this by booting with an old kernel (3.14) and re-writing the magic "bitm" at byte offset 4096 in the rmeta mapped devices, then syncing and powering off the machine (or otherwise stopping the bad superblock from being re-written to disk).
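(A sketch of that rewrite; the mapper name is only an example, and offset 4096 matches the bitmap superblock location described in comment 9, so double-check the device before writing to it:

# printf 'bitm' | dd of=/dev/mapper/rootvg-newroot_rmeta_1 bs=1 seek=4096 count=4 conv=notrunc
# sync

dd with bs=1 and seek=4096 writes just the four magic bytes at byte offset 4096 without touching anything else.)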

The problem has occurred twice now, on different volumes. My guess is that something goes wrong when the MD unload / superblock write-back happens; perhaps the reset or poweroff happens too soon.

The first volume was one I didn't regularly start. The latter was one I created very recently, running a 4.something kernel.