Created attachment 216941 [details] Test program that shows the problem (64bi x86 only) Description of the problem Certain disks with an MD device on top return success when the MD device is sent a BLKZEROOUT despite not actually zeroing the data. Steps to reproduce: 1. On ESXi create a VM with a 1GByte VMDK attached to a PVSCSI controller running Fedora 23 64 bit. 2. Boot the VM. 3. Create an mdadm device on top the VMDK device e.g. mdadm --create /dev/md0 /dev/sdb --raid-devices=1 --force --level=linear where sdb is the device backed by the VMDK. 4. Write junk to the mdadm device e.g. dd if=/dev/urandom of=/dev/md0 conv=fsync bs=512 count=1 5. Run cat /sys/block/sdb/queue/write_same_max_bytes 6. Send BLKZEROOUT to the md device and check the return code e.g. blkdiscard --zeroout /dev/md0 echo $? 7. Run cat /sys/block/sdb/queue/write_same_max_bytes 8. Read the data back and check it is zeros e.g. dd if=/dev/md0 iflag=direct bs=512 count=1 | hexdump Expected results: Step 6. to output 0 (indicating success). Step 8. to show the disk is all zeros. Actual results: Step 6. outputs 0 (so it was successful) Step 8. shows data other than all zeros. How reproducible is the problem: The problem happens every time after boot but not when using a disk attached to a virtualised LSI controller (which always returns write_same_max_bytes of 0). Version information: 4.6.0-1.vanilla.knurd.1.fc23.x86_64 kernel Fedora 23 VMware ESXi, 6.0.0, 3029758 Additional information: This issue has also been reproduced on Ubuntu 16.04 with a the default 4.4.0-22-generic kernel. Stopping and then reassembling the array makes the problem go away. When BLKZEROOUT is issued to an MD device atop a PVSCSI controller supplied VMDK from ESXi 6.0 the call returns immediately and with a zero return code. Unfortunately, inspecting the data on the MD device shows that it has not been zeroed and is untouched even after POSIX_FADV_DONTNEED is used on the MD device. The only clue I've seen is that /sys/block/sd?/queue/write_same_max_bytes starts out being 33553920 but after a WRITE SAME is issued it becomes 0. If the MD device is created after write_same_max_bytes has become 0 on the backing disk then BLKZEROOUT seems to work correctly. Originally filed over on http://thread.gmane.org/gmane.linux.scsi/113726 .
does the sd? really support write same? why the write_saame_max_bytes becomes 0 after the command issued? MD doesn't do nothing related to write same so far. It just passes the command to underlayer disk. So I think it's better scsi people look at this issue.
What might be happening is the device says "I can do write same" but when the kernel issues the command it decides "uh oh I don't think you can" (perhaps because it's blocked due to controller behaviour) and then goes and turns it off. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371591/comments/27 for a VMware take on a bug related to this. It's worth noting that putting LVM directly on the sd? device and running the test doesn't show this problem hence the reason I think it might be MD related. Further, after the problem occurs this is the result: # cat /sys/block/sdb/queue/write_same_max_bytes 0 # cat /sys/block/md0/queue/write_same_max_bytes 33553920 If MD really has to pass the command down then if the lower layer starts saying it can't do cope with write same then shouldn't MD follow suit too?
https://www.redhat.com/archives/dm-devel/2013-September/msg00095.html describes how WRITE SAME can be enabled only to be disabled after the first WRITE SAME issued fails (see 2c).
Further information in that thread ("SCSI's heuristics for enabling WRITE SAME still need work [was: dm mpath: disable WRITE SAME if it fails]") says the caller device has to clear write same on themselves if write same (ever) fails and is cleared below - see http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/18933/focus=18963 for where this was intially implemented for dm-multipath devices (the patch mentioned within is this http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f84cb8a46a771f36a04a02c61ea635c968ed5f6a ). Discussion about why this dynamic write same disabling behaviour has to stay can be found later in the thread - http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/18933/focus=18962 (seems to be down to problematic SCSI<->ATA/SCSI-RAID controllers). Later, in a different device mapper thread ("Re: dm: disable WRITE SAME if it fails") generic device mapper disabling of write same at failure time was discussed (http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/19839/focus=20320 ) and implemented (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7eee4ae2dbb2be0a15a4406718806e48b18ba831 ) superceding the multipath only patch.
this doesn't make sense to me. even md doesn't check write same return value and disabling future write same commands, the write same command will return a failure and blkdev_issue_zeroout will fallback to write 0 data to disk, so you should still see the data is zeroed.
Somehow failure isn't being returned (I don't know if the issue VMware pointed out was ever fixed). One way to reproduce the problem is with the scsi_debug driver: 1. Run modprobe scsi_debug write_same_length=131072 2. Use dmesg to work out where where the disk appeared and build an mdadm device on top e.g. if it appears as sdc do: mdadm --create /dev/md1 /dev/sdc --raid-devices=1 --force --level=linear 3. Do a blkzeroout test and confirm that data is being overwritten with zeros. 4. Disable write same on the scsi_debug block device by doing something like echo 0 > /sys/block/sdc/device/scsi_disk/*/max_write_same_blocks 5. Do a blkzeroout test and notice that data is not being overwritten with zeros but no errors are being returned. Shaohua, do you see the issue with these steps?
Adding Mike Snitzer to the CC list for additional input because the steps in comment #6 seem to affect LVM (which is device mapper) too...
so the problem is blkdev_issue_write_same always return 0 even disk doesn't support write same. I'm still checking the reason. For the md side, we should disable write same if IO error happens, but I thought we should do this in block layer, because we will very similar things like DM. Stay tuned.
(In reply to Shaohua Li from comment #8) > so the problem is blkdev_issue_write_same always return 0 even disk doesn't > support write same. I'm still checking the reason. For the md side, we > should disable write same if IO error happens, but I thought we should do > this in block layer, because we will very similar things like DM. Stay tuned. DM does disable write same if it fails: For bio-based DM see drivers/md/dm.c:clone_endio() For request-based DM see drivers/md/dm.c:dm_done() I agree that ideally blkdev_issue_write_same() would return -EOPNOTSUPP accordingly (like blkdev_issue_discard does if discards aren't supported). But I seem to recall Martin Petersen (now cc'd) saying we couldn't achieve that with WRITE_SAME for some reason.. that we had no choice but to issue the WRITE_SAME with the hopse that it'd work.. but if it failed then we'd flag the device as not supporting it, etc.
(In reply to Sitsofe Wheeler from comment #7) > Adding Mike Snitzer to the CC list for additional input because the steps in > comment #6 seem to affect LVM (which is device mapper) too... Hmm, comment#6 doesn't have anything to do with LVM commands.. (In reply to Mike Snitzer from comment #9) > I agree that ideally blkdev_issue_write_same() would return -EOPNOTSUPP > accordingly (like blkdev_issue_discard does if discards aren't supported). > But I seem to recall Martin Petersen (now cc'd) saying we couldn't achieve > that with WRITE_SAME for some reason.. that we had no choice but to issue > the WRITE_SAME with the hopse that it'd work.. but if it failed then we'd > flag the device as not supporting it, etc. Ah, comment#4 seems to have the references explaining why this is.
(In reply to Mike Snitzer from comment #10) > > Hmm, comment#6 doesn't have anything to do with LVM commands.. Sure if you swap LVM commands for the MD ones you get a similar effect: 1. Run modprobe scsi_debug write_same_length=131072 2. Use dmesg | tail to work out where where the disk appeared and build an LVM device on top e.g. if it appears as sdc do: pvcreate /dev/sdc vgcreate vg /dev/sdc lvcreate -L4MB -n lv vg 3. Fill the /dev/mapper/vg-lv with ones, do a blkzeroout and then confirm that data is being overwritten with zeros (e.g. using dd/tr/blkdiscard --zero/hexdump or the program attached to comment #0). 4. Disable write same on the scsi_debug block device by doing something like echo 0 > /sys/block/sdc/device/scsi_disk/*/max_write_same_blocks 5. Do the ones then blkzeroout test from step 3 and notice that data is not being overwritten with zeros but no errors are being returned.
It sounds like other users have seen BLKZEROOUT spuriously succeeding in the wild too - see http://thread.gmane.org/gmane.linux.scsi/113726 for a user reporting a usb storage related race with it. This issue seems to be fixed by the patch Shaohua Li posted on https://patchwork.kernel.org/patch/9137311/ . I've followed up with comment here http://permalink.gmane.org/gmane.linux.kernel/2229377 (it works but perhaps md should still set its own write same to 0 and perhaps the variables in the patch should be renamed) but the patch progress seems currently stalled. I think this issue is a candidate for stable backport as well because of its current small size.
Sadly progress on this seems to have stalled again (see https://groups.google.com/forum/#!topic/linux.kernel/1WGDSlyY0y0 for the first attempt at fixing this a few years back) - http://thread.gmane.org/gmane.linux.kernel/2236800/focus=2242105 shows after a V2 release of the patches was made an objection was raised wrt to the current supression of EOPNOTSUPP. Sadly there are two issues: 1. How to fix this problem in future kernels. 2. How to fix the problem in stable (already released) kernels. While consensus is built on how to solve 1 it would be good to have a bandaid for 2 because released stable kernels are used by the enterprise distro releases. Without a backport, BLKZEROOUT can't be relied upon for certain user setups...
On 2016-06-24 patches by Christoph Hellwig were posted on the linux-block mailing list (http://www.spinics.net/lists/linux-block/msg03991.html - "zeroout fixes") against the 4.8 kernel that address this issue. The post also wonders about the best way to fix things up for 4.7. Using the attached script shows the problem is resolved for me with those patches so the aforementioned patch could fix the problem in future kernels. There are still a few questions though: 1. How to fix this in stable kernels? 2. When (or should) stacked devices set their own write_same_max_bytes to 0? With respect to 2. with Christoph's patch a dm device will set its write_same_max_bytes to 0 if the underlying device somehow sets it to 0 at the point a write same is issued but not if a user has manually disabled write same by echoing 0 into the underlying device's max_write_same_blocks. md devices never automatically reset their write_same_max_bytes to 0.
Created attachment 221941 [details] BLKZEROOUT LVM test script On 2016-06-24 patches by Christoph Hellwig were posted on the linux-block mailing list (http://www.spinics.net/lists/linux-block/msg03991.html - "zeroout fixes") against the 4.8 kernel that address this issue. The post also wonders about the best way to fix things up for 4.7. Using the attached script shows the problem is resolved (i.e. the blocks are always zeroed) with the aforementioned patches applied. There are still a few questions though: 1. How to fix this in stable kernels? 2. When (or should) stacked devices set their own write_same_max_bytes to 0? With respect to 2. with Christoph's patches a dm device will set its write_same_max_bytes to 0 if the underlying device somehow sets it to 0 at the point a BLKZEROOUT is issued but not if a user has manually disabled write same by echoing 0 into the underlying device's max_write_same_blocks. md devices never automatically reset their write_same_max_bytes to 0.
Looks like this might have been addressed by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=26483819f89c5cf9d27620d70c95afeeeb9bece5 in 4.11 . I'll try and test this out...