Bug 118581

Summary: BLKZEROOUT not zeroing MD device
Product: IO/Storage Reporter: Sitsofe Wheeler (sitsofe)
Component: MDAssignee: io_md
Status: RESOLVED CODE_FIX    
Severity: normal CC: hch, martin.petersen, shli, snitzer
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.6 Subsystem:
Regression: No Bisected commit-id:
Attachments: Test program that shows the problem (64bi x86 only)
BLKZEROOUT LVM test script

Description Sitsofe Wheeler 2016-05-20 06:12:23 UTC
Created attachment 216941 [details]
Test program that shows the problem (64bi x86 only)

Description of the problem
Certain disks with an MD device on top return success when the MD device is sent a BLKZEROOUT despite not actually zeroing the data.

Steps to reproduce:
1. On ESXi create a VM with a 1GByte VMDK attached to a PVSCSI controller running Fedora 23 64 bit.
2. Boot the VM.
3. Create an mdadm device on top the VMDK device e.g.
mdadm --create /dev/md0 /dev/sdb --raid-devices=1 --force --level=linear
where sdb is the device backed by the VMDK.
4. Write junk to the mdadm device e.g.
dd if=/dev/urandom of=/dev/md0 conv=fsync bs=512 count=1
5. Run
cat /sys/block/sdb/queue/write_same_max_bytes
6. Send BLKZEROOUT to the md device and check the return code e.g.
blkdiscard --zeroout /dev/md0
echo $?
7. Run
cat /sys/block/sdb/queue/write_same_max_bytes
8. Read the data back and check it is zeros e.g.
dd if=/dev/md0 iflag=direct bs=512 count=1 | hexdump

Expected results:
Step 6. to output 0 (indicating success).
Step 8. to show the disk is all zeros.

Actual results:
Step 6. outputs 0 (so it was successful)
Step 8. shows data other than all zeros.

How reproducible is the problem:
The problem happens every time after boot but not when using a disk attached to a virtualised LSI controller (which always returns write_same_max_bytes of 0).

Version information:
4.6.0-1.vanilla.knurd.1.fc23.x86_64 kernel
Fedora 23
VMware ESXi, 6.0.0, 3029758

Additional information:
This issue has also been reproduced on Ubuntu 16.04 with a the default 4.4.0-22-generic kernel. 

Stopping and then reassembling the array makes the problem go away.

When BLKZEROOUT is issued to an MD device atop a PVSCSI controller 
supplied VMDK from ESXi 6.0 the call returns immediately and with a zero 
return code. Unfortunately, inspecting the data on the MD device shows 
that it has not been zeroed and is untouched even after POSIX_FADV_DONTNEED 
is used on the MD device.

The only clue I've seen is that 
/sys/block/sd?/queue/write_same_max_bytes starts out being 33553920 but 
after a WRITE SAME is issued it becomes 0. If the MD device is created 
after write_same_max_bytes has become 0 on the backing disk then 
BLKZEROOUT seems to work correctly.

Originally filed over on http://thread.gmane.org/gmane.linux.scsi/113726 .
Comment 1 Shaohua Li 2016-05-20 16:17:12 UTC
does the sd? really support write same? why the write_saame_max_bytes becomes 0 after the command issued? MD doesn't do nothing related to write same so far. It just passes the command to underlayer disk. So I think it's better scsi people look at this issue.
Comment 2 Sitsofe Wheeler 2016-05-20 19:06:10 UTC
What might be happening is the device says "I can do write same" but when the kernel issues the command it decides "uh oh I don't think you can" (perhaps because it's blocked due to controller behaviour) and then goes and turns it off. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371591/comments/27 for a VMware take on a bug related to this.

It's worth noting that putting LVM directly on the sd? device and running the test doesn't show this problem hence the reason I think it might be MD related. Further, after the problem occurs this is the result:

# cat /sys/block/sdb/queue/write_same_max_bytes 
0
# cat /sys/block/md0/queue/write_same_max_bytes 
33553920

If MD really has to pass the command down then if the lower layer starts saying it can't do cope with write same then shouldn't MD follow suit too?
Comment 3 Sitsofe Wheeler 2016-05-21 03:35:02 UTC
https://www.redhat.com/archives/dm-devel/2013-September/msg00095.html describes how WRITE SAME can be enabled only to be disabled after the first WRITE SAME issued fails (see 2c).
Comment 4 Sitsofe Wheeler 2016-05-21 04:34:27 UTC
Further information in that thread ("SCSI's heuristics for enabling WRITE SAME still need work [was: dm mpath: disable WRITE SAME if it fails]") says the caller device has to clear write same on themselves if write same (ever) fails and is cleared below - see http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/18933/focus=18963 for where this was intially implemented for dm-multipath devices (the patch mentioned within is this
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f84cb8a46a771f36a04a02c61ea635c968ed5f6a ).

Discussion about why this dynamic write same disabling behaviour has to stay can be found later in the thread - http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/18933/focus=18962 (seems to be down to problematic SCSI<->ATA/SCSI-RAID controllers).

Later, in a different device mapper thread ("Re: dm: disable WRITE SAME if it fails") generic device mapper disabling of write same at failure time was discussed (http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/19839/focus=20320 ) and implemented (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7eee4ae2dbb2be0a15a4406718806e48b18ba831 ) superceding the multipath only patch.
Comment 5 Shaohua Li 2016-05-23 18:49:16 UTC
this doesn't make sense to me. even md doesn't check write same return value and disabling future write same commands, the write same command will return a failure and blkdev_issue_zeroout will fallback to write 0 data to disk, so you should still see the data is zeroed.
Comment 6 Sitsofe Wheeler 2016-05-23 21:39:09 UTC
Somehow failure isn't being returned (I don't know if the issue VMware pointed out was ever fixed). One way to reproduce the problem is with the scsi_debug driver:

1. Run
modprobe scsi_debug write_same_length=131072
2. Use dmesg to work out where where the disk appeared and build an mdadm device on top e.g. if it appears as sdc do:
mdadm --create /dev/md1 /dev/sdc --raid-devices=1 --force --level=linear
3. Do a blkzeroout test and confirm that data is being overwritten with zeros.
4. Disable write same on the scsi_debug block device by doing something like
echo 0 > /sys/block/sdc/device/scsi_disk/*/max_write_same_blocks 
5. Do a blkzeroout test and notice that data is not being overwritten with zeros but no errors are being returned.

Shaohua, do you see the issue with these steps?
Comment 7 Sitsofe Wheeler 2016-05-26 05:37:20 UTC
Adding Mike Snitzer to the CC list for additional input because the steps in comment #6 seem to affect LVM (which is device mapper) too...
Comment 8 Shaohua Li 2016-05-26 14:49:36 UTC
so the problem is blkdev_issue_write_same always return 0 even disk doesn't support write same. I'm still checking the reason. For the md side, we should disable write same if IO error happens, but I thought we should do this in block layer, because we will very similar things like DM. Stay tuned.
Comment 9 Mike Snitzer 2016-05-26 17:55:30 UTC
(In reply to Shaohua Li from comment #8)
> so the problem is blkdev_issue_write_same always return 0 even disk doesn't
> support write same. I'm still checking the reason. For the md side, we
> should disable write same if IO error happens, but I thought we should do
> this in block layer, because we will very similar things like DM. Stay tuned.

DM does disable write same if it fails:
For bio-based DM see drivers/md/dm.c:clone_endio()
For request-based DM see drivers/md/dm.c:dm_done()

I agree that ideally blkdev_issue_write_same() would return -EOPNOTSUPP accordingly (like blkdev_issue_discard does if discards aren't supported).  But I seem to recall Martin Petersen (now cc'd) saying we couldn't achieve that with WRITE_SAME for some reason.. that we had no choice but to issue the WRITE_SAME with the hopse that it'd work.. but if it failed then we'd flag the device as not supporting it, etc.
Comment 10 Mike Snitzer 2016-05-26 17:58:27 UTC
(In reply to Sitsofe Wheeler from comment #7)
> Adding Mike Snitzer to the CC list for additional input because the steps in
> comment #6 seem to affect LVM (which is device mapper) too...

Hmm, comment#6 doesn't have anything to do with LVM commands..

(In reply to Mike Snitzer from comment #9)

> I agree that ideally blkdev_issue_write_same() would return -EOPNOTSUPP
> accordingly (like blkdev_issue_discard does if discards aren't supported). 
> But I seem to recall Martin Petersen (now cc'd) saying we couldn't achieve
> that with WRITE_SAME for some reason.. that we had no choice but to issue
> the WRITE_SAME with the hopse that it'd work.. but if it failed then we'd
> flag the device as not supporting it, etc.

Ah, comment#4 seems to have the references explaining why this is.
Comment 11 Sitsofe Wheeler 2016-05-26 19:33:16 UTC
(In reply to Mike Snitzer from comment #10)
> 
> Hmm, comment#6 doesn't have anything to do with LVM commands..

Sure if you swap LVM commands for the MD ones you get a similar effect:

1. Run
modprobe scsi_debug write_same_length=131072
2. Use dmesg | tail to work out where where the disk appeared and build an LVM device on top e.g. if it appears as sdc do:
pvcreate /dev/sdc
vgcreate vg /dev/sdc
lvcreate -L4MB -n lv vg
3. Fill the /dev/mapper/vg-lv with ones, do a blkzeroout and then confirm that data is being overwritten with zeros (e.g. using dd/tr/blkdiscard --zero/hexdump or the program attached to comment #0).
4. Disable write same on the scsi_debug block device by doing something like
echo 0 > /sys/block/sdc/device/scsi_disk/*/max_write_same_blocks 
5. Do the ones then blkzeroout test from step 3 and notice that data is not being overwritten with zeros but no errors are being returned.
Comment 12 Sitsofe Wheeler 2016-06-01 05:16:14 UTC
It sounds like other users have seen BLKZEROOUT spuriously succeeding in the wild too - see http://thread.gmane.org/gmane.linux.scsi/113726 for a user reporting a usb storage related race with it. 

This issue seems to be fixed by the patch Shaohua Li posted on https://patchwork.kernel.org/patch/9137311/ . I've followed up with comment here http://permalink.gmane.org/gmane.linux.kernel/2229377 (it works but perhaps md should still set its own write same to 0 and perhaps the variables in the patch should be renamed) but the patch progress seems currently stalled. I think this issue is a candidate for stable backport as well because of its current small size.
Comment 13 Sitsofe Wheeler 2016-06-19 21:53:01 UTC
Sadly progress on this seems to have stalled again (see https://groups.google.com/forum/#!topic/linux.kernel/1WGDSlyY0y0 for the first attempt at fixing this a few years back) - http://thread.gmane.org/gmane.linux.kernel/2236800/focus=2242105 shows after a V2 release of the patches was made an objection was raised wrt to the current supression of EOPNOTSUPP. Sadly there are two issues:
1. How to fix this problem in future kernels.
2. How to fix the problem in stable (already released) kernels.

While consensus is built on how to solve 1 it would be good to have a bandaid for 2 because released stable kernels are used by the enterprise distro releases. Without a backport, BLKZEROOUT can't be relied upon for certain user setups...
Comment 14 Sitsofe Wheeler 2016-07-04 04:45:11 UTC
On 2016-06-24 patches by Christoph Hellwig were posted on the linux-block mailing list (http://www.spinics.net/lists/linux-block/msg03991.html - "zeroout fixes") against the 4.8 kernel that address this issue. The post also wonders about the best way to fix things up for 4.7.

Using the attached script shows the problem is resolved for me with those patches so the aforementioned patch could fix the problem in future kernels. There are still a few questions though:
1. How to fix this in stable kernels?
2. When (or should) stacked devices set their own write_same_max_bytes to 0?

With respect to 2. with Christoph's patch a dm device will set its write_same_max_bytes to 0 if the underlying device somehow sets it to 0 at the point a write same is issued but not if a user has manually disabled write same by echoing 0 into the underlying device's max_write_same_blocks. md devices never automatically reset their write_same_max_bytes to 0.
Comment 15 Sitsofe Wheeler 2016-07-04 04:45:36 UTC
Created attachment 221941 [details]
BLKZEROOUT LVM test script

On 2016-06-24 patches by Christoph Hellwig were posted on the linux-block mailing list (http://www.spinics.net/lists/linux-block/msg03991.html - "zeroout fixes") against the 4.8 kernel that address this issue. The post also wonders about the best way to fix things up for 4.7.

Using the attached script shows the problem is resolved (i.e. the blocks are always zeroed) with the aforementioned patches applied. There are still a few questions though:
1. How to fix this in stable kernels?
2. When (or should) stacked devices set their own write_same_max_bytes to 0?

With respect to 2. with Christoph's patches a dm device will set its write_same_max_bytes to 0 if the underlying device somehow sets it to 0 at the point a BLKZEROOUT is issued but not if a user has manually disabled write same by echoing 0 into the underlying device's max_write_same_blocks. md devices never automatically reset their write_same_max_bytes to 0.
Comment 16 Sitsofe Wheeler 2017-08-01 05:19:10 UTC
Looks like this might have been addressed by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=26483819f89c5cf9d27620d70c95afeeeb9bece5 in 4.11 . I'll try and test this out...