Created attachment 257051: kernel oops dmesg
I discovered a possible regression in bcache discard bio splitting. I would like to set up a bcache-accelerated, discard-aware environment for qemu-kvm virtual machines. If bcache is present anywhere in the storage stack, a kernel oops occurs when a discard request is issued from the guest VM.
I tested a more complex scenario (one similar to my end goal) and a simple one to rule out everything other than bcache. Both lead to the same oops. (See attachment)
The complex one looks as follows: There is a drbd device, which acts as the backing device for bcache. The cache device is just a plain partition on an SSD. Bucket size is 2M. I created a LUKS device on top of the resulting bcache device, then an LVM PV on top of LUKS, and defined a few thin volumes within it. One thin volume is passed to each qemu-kvm VM with a virtio scsi controller and the discard=unmap qemu option.
The simple one is as follows: The backing device of bcache is an LVM thin volume. The caching device is a plain SSD partition. The bcache0 device itself is passed to the VM through virtio scsi.
With newer kernels (tested a few 4.10 and 4.11 versions so far) both scenarios hang the I/O of the VM and produce the attached oops, when the VM tries to discard blocks. Leaving bcache out makes it work. Disabling discard in the qemu disk also makes the oops go away.
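When comparing the working and failing stacks, it can help to check which layers advertise discard support at all. The following is only a minimal sketch, assuming the standard sysfs queue attributes discard_granularity and discard_max_bytes are present under /sys/block/<device>/queue; the function names and the sysfs_root parameter are hypothetical and not part of any tool mentioned in this report.

```python
from pathlib import Path


def discard_limits(device, sysfs_root="/sys/block"):
    """Read the discard-related queue limits of a block device from sysfs.

    Returns a dict with integer values, or None for any attribute that
    is missing or unreadable.
    """
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ("discard_granularity", "discard_max_bytes"):
        try:
            limits[name] = int((queue / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


def advertises_discard(limits):
    """Treat a non-zero discard_max_bytes as 'this layer accepts discards'."""
    return bool(limits.get("discard_max_bytes"))
```

For example, calling discard_limits("bcache0") on the host and comparing it with the result for the backing device would show whether the limits differ between layers. This only reports what each layer advertises; it does not reproduce the oops.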
I remembered this working before, so I tested it with older kernels. The oldest kernel I could find in my testing environment is 4.8.15-300.fc25. I can confirm that it fully works there with both bcache and discard enabled, for both the complex and the simple storage stack described earlier, so there might be a regression.
I also read that bucket sizes larger than 1M used to cause similar problems around version 4.4, because of the way discard bios are split and the requirement that the bvecs fit in one memory page. So I tested with a lower bucket size: with 512k the result is the same. I also tried setting max_unmap_size to 512k in the qemu scsi disk settings; that made no difference.
I have noticed that the guest VM is also a factor. Testing with a Windows VM always triggers the oops. With a Fedora guest I usually need 2-3 runs of blkdiscard to trigger the issue. On the 4.8.15 kernel both Linux and Windows guests worked just fine. The qemu version I used was qemu-2.7.1-6.fc25 for all the tests; the only difference was the kernel version.
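For reference, the 1M figure can be reproduced with some simple arithmetic. This is only a sketch of my understanding of the older limitation, not a statement about the current code: it assumes a 4096-byte page, a 16-byte struct bio_vec (as on typical 64-bit builds), and that each bvec describes at most one page. All names below are made up for the illustration.

```python
PAGE_SIZE = 4096      # bytes; assumed page size
BIO_VEC_SIZE = 16     # bytes; assumed sizeof(struct bio_vec) on 64-bit


def max_single_page_bio_bytes(page_size=PAGE_SIZE, bvec_size=BIO_VEC_SIZE):
    """Largest payload if the bvec array itself must fit in one page.

    Each bvec is assumed to point at no more than one page of data.
    """
    bvecs_per_page = page_size // bvec_size
    return bvecs_per_page * page_size


def bucket_fits(bucket_bytes):
    """True if a bucket of this size stays within the single-page limit."""
    return bucket_bytes <= max_single_page_bio_bytes()


if __name__ == "__main__":
    for label, size in (("2M", 2 * 1024 * 1024), ("512k", 512 * 1024)):
        print(label, "fits" if bucket_fits(size) else "exceeds the limit")
```

Under these assumptions the limit is 256 bvecs of 4096 bytes each, i.e. 1 MiB, so a 2M bucket exceeds it and a 512k bucket does not. Since the oops also occurs with 512k buckets, that older limitation alone does not seem to explain it.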
It seems that I have the same issue with the bcache discard mechanism in blk_queue_split.
In my case, however, the source of the discard command is mkfs.ext4 instead of VirtIO.
The bug report for my problem is https://bugs.gentoo.org/671122.
I have a patch which I believe fixes your issue: https://www.spinics.net/lists/linux-bcache/msg06997.html
It looks like it will go into the 5.1 kernel.
(In reply to Daniel Axtens from comment #2)
> I have a patch which I believe fixes your issue:
> It looks like it will go in to the 5.1 kernel.
I tested this patch just a minute ago, and I can confirm that it actually resolved my problem.
Now it would be great if it were backported to current kernel releases.
I have tagged it for inclusion in the stable trees. First it will need to hit mainline, which should happen in ~4 weeks, then it should automatically end up in the stable trees some time after that. I don't know what your distro's process is, however; maybe there is a way to shortcut that.
It should also land in the Ubuntu and SuSE trees, spearheaded by myself and Coly Li respectively. I don't know what the Gentoo process is but there are now backports on the list that go back to 3.13, so it should be pretty easy for someone to do.
The patch was accepted into the mainline kernel tree and backported to stable kernels.
I have personally been using this patch for many months now, and it works flawlessly.
I think this bug report can be closed.