Created attachment 115881: log file

The libguestfs test suite runs mdadm in various combinations. Currently the mdadm --stop test causes a soft lockup and eventual crash. See the very long stack trace which I'll attach to this bug.

This has just started happening in the Rawhide kernel, in the last week.

kernel 3.13.0-0.rc1.git0.1.fc21
mdadm-3.3-4.fc21.x86_64

Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=1033971

The first stack trace is below, but see the attached file for the many subsequent errors seen.

mdadm --stop /dev/md123
[ 157.114285] BUG: soft lockup - CPU#0 stuck for 23s! [md123_raid1:146]
[ 157.114285] Modules linked in: raid1 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx kvm_amd kvm snd_pcsp snd_pcm snd_page_alloc snd_timer serio_raw snd soundcore ata_generic pata_acpi virtio_balloon virtio_pci virtio_mmio virtio_net virtio_scsi virtio_blk virtio_console virtio_rng virtio_ring virtio ideapad_laptop sparse_keymap rfkill sym53c8xx scsi_transport_spi crc8 crc_ccitt crc32 crc_itu_t libcrc32c megaraid megaraid_sas megaraid_mbox megaraid_mm
[ 157.114285] irq event stamp: 5730664
[ 157.114285] hardirqs last enabled at (5730663): [<ffffffff8175f926>] _raw_spin_unlock_irqrestore+0x36/0x70
[ 157.114285] hardirqs last disabled at (5730664): [<ffffffff8176a5ad>] apic_timer_interrupt+0x6d/0x80
[ 157.114285] softirqs last enabled at (5730448): [<ffffffff8107b098>] __do_softirq+0x198/0x430
[ 157.114285] softirqs last disabled at (5730443): [<ffffffff8107b71d>] irq_exit+0xcd/0xe0
[ 157.114285] CPU: 0 PID: 146 Comm: md123_raid1 Not tainted 3.13.0-0.rc1.git0.1.fc21.x86_64+debug #1
[ 157.114285] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 157.114285] task: ffff88001844a5f0 ti: ffff8800184da000 task.ti: ffff8800184da000
[ 157.114285] RIP: 0010:[<ffffffff8175f92b>] [<ffffffff8175f92b>] _raw_spin_unlock_irqrestore+0x3b/0x70
[ 157.114285] RSP: 0018:ffff8800184dbcc8 EFLAGS: 00000296
[ 157.114285] RAX: ffff88001844a5f0 RBX: ffff8800184dbc60 RCX: 0000000000000000
[ 157.114285] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000296
[ 157.114285] RBP: ffff8800184dbcd8 R08: 0000000000000000 R09: 0000000000000000
[ 157.114285] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88001844ad90
[ 157.114285] R13: 0000000000000002 R14: ffffffff810b6a38 R15: ffff8800184dbc40
[ 157.114285] FS: 0000000000000000(0000) GS:ffff88001f000000(0000) knlGS:0000000000000000
[ 157.114285] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 157.114285] CR2: 00007f6f943bc000 CR3: 00000000198aa000 CR4: 00000000000006f0
[ 157.114285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 157.114285] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[ 157.114285] Stack:
[ 157.114285] ffff880019835098 0000000000000296 ffff8800184dbdd8 ffffffffa021ffd5
[ 157.114285] ffff8800198350e0 ffff880019835068 7fffffffffffffff ffff8800195c7450
[ 157.114285] ffff8800184dbe40 ffffffff81390d9b ffff880019835068 0000000000000001
[ 157.114285] Call Trace:
[ 157.114285] [<ffffffffa021ffd5>] raid1d+0x6a5/0xe50 [raid1]
[ 157.114285] [<ffffffff81390d9b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 157.114285] [<ffffffff815951e8>] md_thread+0x118/0x130
[ 157.114285] [<ffffffff810c6dc0>] ? abort_exclusive_wait+0xb0/0xb0
[ 157.114285] [<ffffffff815950d0>] ? mddev_unlock+0xe0/0xe0
[ 157.114285] [<ffffffff810a01df>] kthread+0xff/0x120
[ 157.114285] [<ffffffff810a00e0>] ? insert_kthread_work+0x80/0x80
[ 157.114285] [<ffffffff8176987c>] ret_from_fork+0x7c/0xb0
[ 157.114285] [<ffffffff810a00e0>] ? insert_kthread_work+0x80/0x80
[ 157.114285] Code: 55 08 48 8d 7f 18 53 48 89 f3 be 01 00 00 00 e8 1c 4d 97 ff 4c 89 e7 e8 d4 81 97 ff f6 c7 02 74 1f e8 0a 22 97 ff 48 89 df 57 9d <66> 66 90 66 90 5b 41 5c 65 ff 0c 25 60 c9 00 00 5d c3 0f 1f 00
Thanks for the report. Was the array performing a resync or recovery at the time?
Quite likely. Note this is a test program which rapidly creates and stops the array. You can see the test program here: https://github.com/libguestfs/libguestfs/blob/master/tests/md/test-mdadm.sh and you can see the actual commands that it executes by looking at the log file attached to this bug.
So in this case it looks as if the scenario is:

- Add a four disk MD array to a booting guest.
- Immediately run 'mdadm --stop /dev/mdXXX' as soon as the guest has booted.

The mdadm command hangs, whereas before recent changes it did not hang.
I think I've found it. The bug was caused by the introduction of the MD_STILL_CLOSED flag. This should fix it.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b6b7a2866c9e..e60cebf3f519 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7777,7 +7777,7 @@ void md_check_recovery(struct mddev *mddev)
 	if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
 		return;
 	if ( ! (
-		(mddev->flags & ~ (1<<MD_CHANGE_PENDING)) ||
+		(mddev->flags & MD_UPDATE_SB_FLAGS & ~ (1<<MD_CHANGE_PENDING)) ||
 		test_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
 		test_bit(MD_RECOVERY_DONE, &mddev->recovery) ||
 		(mddev->external == 0 && mddev->safemode == 1) ||

I wonder why I couldn't reproduce it under qemu-kvm. If I understand what is happening correctly, the bug should cause the md123_raid1 thread to spin for a short while until the mdadm thread calls md_unregister_thread, at which point the md123_raid1 thread should just exit.

Please confirm that this patch fixes your problem.

Thanks
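For readers following along, here is a minimal user-space sketch of the mask change in the patch above. It is not kernel code: the bit numbers and the definition of MD_UPDATE_SB_FLAGS are assumptions made for illustration (check md.h in your tree), and the helpers flags_need_work_old, flags_need_work_new and check_flag_masks are hypothetical names. It only models the first clause of the condition in md_check_recovery. The point it tries to show is that the old expression treated any set bit other than MD_CHANGE_PENDING as pending work, so a flag that is presumably unrelated to superblock updates, such as MD_STILL_CLOSED, would keep that clause true.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers are ASSUMED for illustration only; consult md.h in your tree. */
enum {
	MD_CHANGE_DEVS    = 0,
	MD_CHANGE_CLEAN   = 1,
	MD_CHANGE_PENDING = 2,
	MD_STILL_CLOSED   = 4,
};

/* ASSUMED definition: the set of flags that request a superblock update. */
#define MD_UPDATE_SB_FLAGS \
	((1UL << MD_CHANGE_DEVS) | (1UL << MD_CHANGE_CLEAN) | \
	 (1UL << MD_CHANGE_PENDING))

/* First clause of the condition before the patch (hypothetical helper). */
static bool flags_need_work_old(unsigned long flags)
{
	return (flags & ~(1UL << MD_CHANGE_PENDING)) != 0;
}

/* First clause of the condition after the patch (hypothetical helper). */
static bool flags_need_work_new(unsigned long flags)
{
	return (flags & MD_UPDATE_SB_FLAGS & ~(1UL << MD_CHANGE_PENDING)) != 0;
}

/* Compare both variants on a few flag words. */
static void check_flag_masks(void)
{
	unsigned long only_closed = 1UL << MD_STILL_CLOSED;
	unsigned long devs = 1UL << MD_CHANGE_DEVS;
	unsigned long pending = 1UL << MD_CHANGE_PENDING;

	/* Only MD_STILL_CLOSED set: old test fires, new test does not. */
	assert(flags_need_work_old(only_closed));
	assert(!flags_need_work_new(only_closed));

	/* A real superblock-update flag is still seen by both. */
	assert(flags_need_work_old(devs));
	assert(flags_need_work_new(devs));

	/* MD_CHANGE_PENDING alone is ignored by both, as before. */
	assert(!flags_need_work_old(pending));
	assert(!flags_need_work_new(pending));
}
```

Calling check_flag_masks() from a small main() exercises all three cases; under the stated assumptions, the only behavioural difference between the two variants is for bits outside MD_UPDATE_SB_FLAGS.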
Yes, this patch fixes the test in the libguestfs test suite on my Fedora Rawhide machine.

$ make -C tests/md check LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 TESTS=test-mdadm.sh
make: Entering directory `/home/rjones/d/libguestfs/tests/md'
make check-TESTS
make[1]: Entering directory `/home/rjones/d/libguestfs/tests/md'
310 seconds: ./test-mdadm.sh
PASS: test-mdadm.sh
=============
1 test passed
=============
make[1]: Leaving directory `/home/rjones/d/libguestfs/tests/md'
make: Leaving directory `/home/rjones/d/libguestfs/tests/md'
Thanks. I'll send the patch upstream. I would close this bug too, but it seems I cannot. I cannot even assign it to myself... ho hum.
Fixed by pull request "[GIT PULL REQUEST]: md fixes for 3.13-rc"