Bug 24012

Summary: Commit g18733b0 hangs my LVM+raid10 system
Product: IO/Storage
Reporter: Giacomo Catenazzi (cate)
Component: MD
Assignee: io_md
Status: CLOSED CODE_FIX
Severity: normal
CC: djwong, florian, maciej.rutecki, neilb, rjw
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.37-rc3-00325-g4492862
Regression: Yes
Bug Blocks: 21782
Attachments: serial console (with sysreq-t)
serial console, with higher speed (thus no ata timeouts)
smartctl -a for the 4 hd
log with the patch to display more info

Description Giacomo Catenazzi 2010-11-29 19:48:09 UTC
Hello,

the commit be20e6c67b6c6024c19094d3447f144d1 blocks the boot process of my machine (Debian sid) at mdadm. Sometimes (by pressing Ctrl-C or SysRq-S+E) I can continue the boot process, but not always, and then without being able to mount /home (and /pub). "sync" hangs hard. I see no oops or other "strange" message in dmesg.

My home system has 4 disks, with a LVM+raid10 configuration.

Note: root partition seems to be mounted correctly (also on the same LVM+raid10 configuration).

ciao
     cate
Comment 1 Darrick J. Wong 2010-11-29 20:27:26 UTC
Hm... all that patch does is enable flush/fua support for md devices.  I'm not really sure why it would cause a hang.  Could you please post dmesg and sysrq-t output so that we can examine what the system is doing?
Comment 2 Giacomo Catenazzi 2010-11-29 21:41:53 UTC
Created attachment 38582 [details]
serial console (with sysreq-t)
Comment 3 Giacomo Catenazzi 2010-11-29 21:44:52 UTC
Note: the ATA timeouts seem to be caused by the delay of the serial console (not 100% sure, but on a normal sysrq-t I did not notice such a pattern in the log)
Comment 4 Darrick J. Wong 2010-11-29 23:27:45 UTC
Just out of curiosity, was this setup working fine with 2.6.36?

The reason I ask is, it looks like various filesystem processes (jbd2, fsck.ext4) are hanging up in md_flush_request, which is where md issues flush requests to the underlying devices.  This sort of looks like the flush is issued and then the drives fail to respond, causing the system to stop.

As I recall, in the 2.6.36 days, a flush request (then known as a barrier) would cause the device's IO queue to be emptied out before the flush would be sent to the drive.  In 2.6.37, that emptying behavior is gone, which means that flushes and writes are intermixed in the command stream, which improves performance.  The patch you tracked down enables md to send these flush requests.  On your system, I see that NCQ is enabled, which permits the kernel to issue multiple commands simultaneously.  However, it isn't unheard of for there to exist drives that cannot handle simultaneous flush and write requests[1].  Perhaps these Samsung drives of yours suffer a similar behavior.  There might be a way to tell:

Does the hang behavior go away if you boot with libata.force=noncq ?

It might also be interesting to see the smartctl -a output of all four drives.

[1] https://ata.wiki.kernel.org/index.php/Known_issues#Seagate_harddrives_which_time_out_FLUSH_CACHE_when_NCQ_is_being_used
Comment 5 Neil Brown 2010-11-29 23:40:03 UTC
Very strange...
This is what I see in the 'Show State' output:

Two processes - jbd2/dm-0-8 and fsck.ext4 - are stuck in md_flush_request.
This means they are trying to perform REQ_FLUSH requests, but md only allows one at a time, and there is one currently in progress.

One process - flush-253:0 - is stuck in md_write_start, waiting for a superblock update to complete.   This process is handling a regular write request, not a REQ_FLUSH request (if it were, the stack would be different).  So it isn't blocking the REQ_FLUSH requests.

One process - md0_raid10 - is writing out the superblock, on behalf of the md_write_start request.  It has submitted the IO (which would be REQ_FLUSH, and would have been even before the offending commit) and is waiting for it to complete.  But for some reason it isn't completing.

The only way this would be different before the identified commit is that the first two processes would not be stuck in md_flush_request.  They would have progressed further and would be stuck in md_write_start instead.  And there is nothing they could do between where md_flush_request is called and where md_write_start is called that could possibly unblock the other threads.
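
(For reference, the one-at-a-time gate mentioned above - a simplified sketch of md_flush_request as it stood at this point, reconstructed from the drivers/md/md.c hunks quoted later in this report; unrelated detail omitted:)

void md_flush_request(mddev_t *mddev, struct bio *bio)
{
	spin_lock_irq(&mddev->write_lock);
	/* Only one flush may be in flight per array: wait here until the
	 * previous flush_bio has been cleared by md_submit_flush_data(). */
	wait_event_lock_irq(mddev->sb_wait,
			    !mddev->flush_bio,
			    mddev->write_lock, /*nothing*/);
	mddev->flush_bio = bio;
	spin_unlock_irq(&mddev->write_lock);

	atomic_set(&mddev->flush_pending, 1);
	INIT_WORK(&mddev->flush_work, md_submit_flush_data);

	/* Send a flush to every component device; when the last one
	 * completes, md_submit_flush_data() runs from a workqueue,
	 * passes the original bio down, clears mddev->flush_bio and
	 * wakes anyone blocked in the wait above. */
	submit_flushes(mddev);

	if (atomic_dec_and_test(&mddev->flush_pending))
		queue_work(md_wq, &mddev->flush_work);
}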

The number of ata errors about 'flush' is certainly worrying.  I would be more comfortable if I could see a stack trace without any ATA error - possibly it would be different, and the differences might be illuminating.

So: could you try to produce another sysrq-T output, and if you can get one with no errors reported, so much the better.
Comment 6 Giacomo Catenazzi 2010-11-30 06:56:49 UTC
Created attachment 38612 [details]
serial console, with higher speed (thus no ata timeouts)

In this attachment I used a higher baudrate, and the ata errors seem to have gone away. (Now I only have a warning about an unstable clock rate, due to the serial console.)

Now I'm using 2.6.36, without problems.
Comment 7 Giacomo Catenazzi 2010-11-30 07:05:20 UTC
Created attachment 38622 [details]
smartctl -a for the 4 hd

the smartctl output was collected under 2.6.36 (but it shouldn't matter, should it?)

I tested also with libata.force=noncq, but the machine still hangs.
Comment 8 Neil Brown 2010-11-30 07:20:25 UTC
Thanks for the second trace.

Unfortunately it isn't complete - immediately after 'SysRq: Show State'
there is corruption so that the first process that is fully displayed is bdi-default.  However maybe that doesn't matter.

There are some strong similarities: both jbd2/dm-0-8 and fsck.ext4 are
again stuck in md_flush_request.  However both flush-253:0 and md0_raid10 are idle.

So we simply have md_flush_request blocking when it shouldn't.

It looks like ->flush_bio is not being cleared when it should be.
The code which clears it is at the end of md_submit_flush_data.

It is vaguely possible that a memory ordering issue could move the assignment to after the wake_up so the waiting thread doesn't get woken properly.  I suspect that is unlikely, particularly on an x86 machine, but just to confirm, are you able to try with a spinlock around that code?
i.e.

in md_submit_flush_data, in drivers/md/md.c

+       spin_lock_irq(&mddev->write_lock);
	if (atomic_dec_and_test(&mddev->flush_pending)) {
		mddev->flush_bio = NULL;
		wake_up(&mddev->sb_wait);
	}
+       spin_unlock_irq(&mddev->write_lock);
Comment 9 Giacomo Catenazzi 2010-11-30 07:45:58 UTC
It still hangs with such patch.

About the corruption of the console log: right. I did not check, and it was not expected: by the time I press SysRq-t (so after waiting some time) I expect the serial buffers to be empty.

Later I'll try again, but unfortunately now I must work.
Comment 10 Neil Brown 2010-11-30 08:00:07 UTC
Thanks for testing - it is very helpful having such things eliminated as possibilities.

I've tried reproducing it myself and cannot, so there must be some important difference between your config and mine.  I'll have another look tomorrow and see if anything new occurs to me.
Comment 11 Neil Brown 2010-12-01 02:08:46 UTC
A fresh day and a fresh look didn't reveal anything...  well, actually it
did show me that there is a chance of a mempool running dry, which I'll have
to try to fix, but I'm sure that isn't affecting you.

Could you try this patch and collect the output?  Hopefully it will
provide some visibility into what is going wrong.

thanks,
NeilBrown



diff --git a/drivers/md/md.c b/drivers/md/md.c
index 84c46a1..8ba9bc9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -363,9 +363,10 @@ static void md_end_flush(struct bio *bio, int err)
 	mddev_t *mddev = rdev->mddev;
 
 	rdev_dec_pending(rdev, mddev);
-
+	printk("End flush %p\n", bio);
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
 		/* The pre-request flush has finished */
+		printk("last flush done\n");
 		queue_work(md_wq, &mddev->flush_work);
 	}
 	bio_put(bio);
@@ -392,6 +393,7 @@ static void submit_flushes(mddev_t *mddev)
 			bi->bi_private = rdev;
 			bi->bi_bdev = rdev->bdev;
 			atomic_inc(&mddev->flush_pending);
+			printk("Submit flush %p\n", bi);
 			submit_bio(WRITE_FLUSH, bi);
 			rcu_read_lock();
 			rdev_dec_pending(rdev, mddev);
@@ -415,6 +417,7 @@ static void md_submit_flush_data(struct work_struct *ws)
 			generic_make_request(bio);
 	}
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		printk("Flush %p complete\n", mddev->flush_bio);
 		mddev->flush_bio = NULL;
 		wake_up(&mddev->sb_wait);
 	}
@@ -422,6 +425,7 @@ static void md_submit_flush_data(struct work_struct *ws)
 
 void md_flush_request(mddev_t *mddev, struct bio *bio)
 {
+	printk("Want flush %p\n", bio);
 	spin_lock_irq(&mddev->write_lock);
 	wait_event_lock_irq(mddev->sb_wait,
 			    !mddev->flush_bio,
@@ -432,6 +436,7 @@ void md_flush_request(mddev_t *mddev, struct bio *bio)
 	atomic_set(&mddev->flush_pending, 1);
 	INIT_WORK(&mddev->flush_work, md_submit_flush_data);
 
+	printk("flushing %p\n", bio);
 	submit_flushes(mddev);
 
 	if (atomic_dec_and_test(&mddev->flush_pending))
Comment 12 Neil Brown 2010-12-01 04:41:11 UTC
One other thought: just for completeness could you report the output of

grep -H . /sys/block/*/device/scsi_disk/*/cache_type

please?  I suspect they will all be "write back", and even if they aren't I'm not sure how it could affect the behaviour, but it might be useful to know.

Thanks.
Comment 13 Giacomo Catenazzi 2010-12-01 06:50:35 UTC
Created attachment 38732 [details]
log with the patch to display more info

Note: I've "updated" to rc4, and the spinlock patch from comment #8 is still applied

The summary of results:

EXT4-fs (dm-0): re-mounted. Opts: errors=remount-ro
Want flush ffff88022db4aa80
flushing ffff88022db4aa80
Submit flush ffff88022db4df08
Submit flush ffff88022db4dd88
Submit flush ffff88022db4da88
Submit flush ffff88022db4d008
Want flush ffff88022db4af00
Want flush ffff88022d6c16c0
SysRq : Show State

Additionally (with 2.6.36):
cateee:/home/cate# grep -H . /sys/block/*/device/scsi_disk/*/cache_type
/sys/block/sda/device/scsi_disk/0:0:0:0/cache_type:write back
/sys/block/sdb/device/scsi_disk/1:0:0:0/cache_type:write back
/sys/block/sdc/device/scsi_disk/4:0:0:0/cache_type:write back
/sys/block/sdd/device/scsi_disk/5:0:0:0/cache_type:write back
Comment 14 Giacomo Catenazzi 2010-12-01 07:42:41 UTC
With the same configuration (rc4 + the two patches) but without the flush commit (git 18733b0), I get no flush log messages at all. Nothing, even when doing a "sync".

But just now, while I'm writing this (thus some minutes after boot), the kernel gave me this message:

EXT4-fs (dm-1): error count: 1
EXT4-fs (dm-1): initial error at 1290174273: ext4_mb_generate_buddy:726
EXT4-fs (dm-1): last error at 1290174273: ext4_mb_generate_buddy:726

[after some minutes more, no further kernel messages]

I hope it helps.
Comment 15 Neil Brown 2010-12-01 08:43:48 UTC
Thanks.  Helps a lot.

I cannot quite see what is happening yet, but I'm close.  I need to examine
dm more closely.

It is related to the fact that dm is sending requests to md, and to the
way generic_make_request serialises recursive calls.  The flush requests
to the lower-level devices are being serialised behind a flush request to md,
and as md only allows one at a time, it deadlocks.
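
(For context, a rough sketch of the serialisation mechanism referred to - not the exact block/blk-core.c code of that kernel, just an illustration of the idea: a bio submitted from inside a ->make_request_fn is not dispatched immediately but parked on a per-task list, and only processed after the current request returns:)

void generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (current->bio_list) {
		/* Already inside generic_make_request (e.g. dm remapping
		 * onto md): park the bio for later instead of recursing. */
		bio_list_add(current->bio_list, bio);
		return;
	}

	/* Top-level call: turn recursion into iteration. */
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		__generic_make_request(bio);	/* may park more bios */
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;
}

So the component flushes that raid10 generates for the first md flush sit parked on that per-task list behind the second flush aimed at md, and can never be dispatched while md_flush_request is blocked waiting for them.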

So I should have it figured out shortly.

Thanks again.
Comment 16 Neil Brown 2010-12-01 08:58:51 UTC
yes... It looks like dm can easily submit multiple flush requests to the one device.  If you could report the output of 
  dmsetup table
we should see dm-0 having multiple targets (probably linear) referencing
the md array.  Each target gets its own flush request.

So dm queues two flush requests for the raid10.
raid10 converts the first to 4 flush requests for the components.
Then it tries to convert the second and blocks waiting for the first set of
4 to complete.  But they are queued behind this second request.

This patch pushes the 4 flush requests into a separate process so they are queued separately and aren't blocked.  So it should fix the problem, and as a bonus, it fixes the concern I had with the mempool potentially exhausting.

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 84c46a1..d551701 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -357,6 +357,8 @@ EXPORT_SYMBOL(mddev_congested);
  * Generic flush handling for md
  */
 
+static void md_submit_flush_data(struct work_struct *ws);
+
 static void md_end_flush(struct bio *bio, int err)
 {
 	mdk_rdev_t *rdev = bio->bi_private;
@@ -366,15 +368,18 @@ static void md_end_flush(struct bio *bio, int err)
 
 	if (atomic_dec_and_test(&mddev->flush_pending)) {
 		/* The pre-request flush has finished */
+		INIT_WORK(&mddev->flush_work, md_submit_flush_data);
 		queue_work(md_wq, &mddev->flush_work);
 	}
 	bio_put(bio);
 }
 
-static void submit_flushes(mddev_t *mddev)
+static void submit_flushes(struct work_struct *ws)
 {
+	mddev_t *mddev = container_of(ws, mddev_t, flush_work);
 	mdk_rdev_t *rdev;
 
+	atomic_set(&mddev->flush_pending, 1);
 	rcu_read_lock();
 	list_for_each_entry_rcu(rdev, &mddev->disks, same_set)
 		if (rdev->raid_disk >= 0 &&
@@ -397,6 +402,11 @@ static void submit_flushes(mddev_t *mddev)
 			rdev_dec_pending(rdev, mddev);
 		}
 	rcu_read_unlock();
+
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		INIT_WORK(&mddev->flush_work, md_submit_flush_data);
+		queue_work(md_wq, &mddev->flush_work);
+	}
 }
 
 static void md_submit_flush_data(struct work_struct *ws)
@@ -404,8 +414,6 @@ static void md_submit_flush_data(struct work_struct *ws)
 	mddev_t *mddev = container_of(ws, mddev_t, flush_work);
 	struct bio *bio = mddev->flush_bio;
 
-	atomic_set(&mddev->flush_pending, 1);
-
 	if (bio->bi_size == 0)
 		/* an empty barrier - all done */
 		bio_endio(bio, 0);
@@ -414,10 +422,10 @@ static void md_submit_flush_data(struct work_struct *ws)
 		if (mddev->pers->make_request(mddev, bio))
 			generic_make_request(bio);
 	}
-	if (atomic_dec_and_test(&mddev->flush_pending)) {
-		mddev->flush_bio = NULL;
-		wake_up(&mddev->sb_wait);
-	}
+
+	mddev->flush_bio = NULL;
+	wake_up(&mddev->sb_wait);
+
 }
 
 void md_flush_request(mddev_t *mddev, struct bio *bio)
@@ -429,13 +437,9 @@ void md_flush_request(mddev_t *mddev, struct bio *bio)
 	mddev->flush_bio = bio;
 	spin_unlock_irq(&mddev->write_lock);
 
-	atomic_set(&mddev->flush_pending, 1);
-	INIT_WORK(&mddev->flush_work, md_submit_flush_data);
-
-	submit_flushes(mddev);
+	INIT_WORK(&mddev->flush_work, submit_flushes);
 
-	if (atomic_dec_and_test(&mddev->flush_pending))
-		queue_work(md_wq, &mddev->flush_work);
+	queue_work(md_wq, &mddev->flush_work);
 }
 EXPORT_SYMBOL(md_flush_request);
Comment 17 Giacomo Catenazzi 2010-12-01 11:50:26 UTC
Thank you!

This patch fixes the problem.

BTW
cateee:/home/cate# dmsetup table
cateee-pub: 0 629145600 linear 9:0 156238208
cateee-root: 0 58589184 linear 9:0 384
cateee-root: 58589184 41943040 linear 9:0 1016070528
cateee-windows: 0 41943040 linear 9:0 1358823808
cateee-home: 0 97648640 linear 9:0 58589568
cateee-home: 97648640 230686720 linear 9:0 785383808
cateee-home: 328335360 300810240 linear 9:0 1058013568
Comment 18 Rafael J. Wysocki 2010-12-02 21:07:29 UTC
Handled-By : Neil Brown <neilb@suse.de>
Patch : https://bugzilla.kernel.org/show_bug.cgi?id=24012#c16
Comment 19 Giacomo Catenazzi 2010-12-13 19:58:05 UTC
Hello Neil,

I don't yet see the patch from comment #16 in Linus' tree. Did you forget about the fix?

ciao
     cate
Comment 20 Neil Brown 2010-12-13 22:12:46 UTC
No, I didn't forget.
It is in my for-linus branch (at git://neil.brown.name/md) but I haven't
asked him to pull yet because there is something else that I think I need to check ... only I don't remember what it is :-(
I'll send a pull request later today (where 'today' is interpreted against a time zone of my choice....)
Comment 21 Rafael J. Wysocki 2010-12-19 11:58:58 UTC
Fixed by commit a035fc3 ("md: fix possible deadlock in handling flush requests").